I Rewrote A Java Microservice In Rust And Lost My Job
A dark comedy about choosing the “wrong” technology.
On Monday I had a badge. On Tuesday my badge was a coaster. The crime? I rewrote “Billing-Quotes,” a sleepy Java microservice with thirteen upstreams, in Rust. p95 got leaner. CPU chilled. Memory dropped. The infra bill blinked smaller numbers like a hotel minibar. And then the CTO told me to bring a box. This is the autopsy of a decision that was technically right, politically wrong, and culturally radioactive. If you’ve ever stared at a JVM flame graph at 2 a.m. and fantasized about shipping a single, tidy Rust binary, this is your cautionary rom-com with a body count.
The Scene: A Service That Looked Guilty
- The patient: Spring Boot 3.x on Java 21. Two replicas. 2 vCPU, 4 GB RAM each.
- SLO: p95 under 120 ms; four 9s availability (aspirational, like my gym membership).
- Traffic: Lunch spikes, meaning batch refresh plus humans clicking “Get Quote” like a woodpecker on espresso.
- Perf stink: GC burps during bursty JSON parsing and a well-meaning “DTO of DTOs” pattern that inflated allocations.
- Bonus drag: A gateway hop that touched everything “for consistency,” plus sidecars for auth, metrics, and snacks.
We weren’t failing. We were just wearing a winter coat to the gym.
The Itch: Why I Reached For Rust
Three things begged for a systems language:
- High-fanout I/O across internal gRPC and a chatty payments adapter.
- Hot JSON paths where every extra allocation came back to haunt p99.
- Predictable latency mattered more than raw throughput; long tails hurt revenue.
So I spiked a prototype with Axum, Tokio, serde, reqwest (tonic for gRPC), sqlx for Postgres, and tracing + OpenTelemetry. I mirrored every endpoint and error contract, preserved headers like museum pieces, and hid it all behind a feature-flagged strangler so I could route 1% → 10% → 50% → 100% without waking Security.
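The 1% → 10% → 50% → 100% ramp can be sketched as a deterministic bucket check. This is a minimal illustration, not the code that ran in the canary: the key format (`acct-42`), the `Backend` enum, and the choice of FNV-1a as the hash are all assumptions made for the example.

```rust
/// Which implementation should serve a request.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Backend {
    LegacyJava,
    RustCanary,
}

/// FNV-1a, written out by hand so the bucket for a key never changes
/// between builds or hosts (a randomly seeded hasher would reshuffle users).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Map a stable key (hypothetical: an account or quote ID) to a bucket in 0..=99.
pub fn bucket(key: &str) -> u8 {
    (fnv1a(key.as_bytes()) % 100) as u8
}

/// Route to the canary when the key's bucket falls below the rollout percent.
/// 0 sends everything to Java (instant rollback); 100 sends everything to Rust.
pub fn route(key: &str, rollout_percent: u8) -> Backend {
    if bucket(key) < rollout_percent.min(100) {
        Backend::RustCanary
    } else {
        Backend::LegacyJava
    }
}
```

Because the bucket depends only on the key, a caller who lands in the 10% cohort stays on the Rust path at 50% and 100%, and setting the percentage back to 0 is the whole rollback.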
Two-week canary, same traffic mix:
- p95: 118 ms → 94 ms (steady)
- p99: spiky → tamed (shorter, rarer spikes)
- CPU per RPS: ~30% lower at peak
- Memory: ~45% lower steady state
- Infra line item: a tidy single-digit % down — not a movie plot, but a CFO smile
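For readers who want to check numbers like these on their own canary, here is a small sketch of the arithmetic. It assumes the nearest-rank definition of a percentile over raw latency samples; real dashboards often use histogram buckets instead, so treat the function names and the method as illustrative. By this arithmetic, 118 ms → 94 ms is roughly a 20% drop in p95.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. Returns None for empty input
/// or a percentile outside (0, 100].
pub fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Multiply before dividing so whole-number percentiles stay exact.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Relative change from `before` to `after`, in percent (negative means lower).
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

Comparing the same percentile, computed the same way, over the same traffic mix is what makes a before/after pair like this meaningful.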
Startup time dropped to “blink and it’s ready.” The binary was small. The dashboards were boring in the best way. I had graphs. I had a README. I had a grin. I did not have a job for long.
The Meeting Where I Lost The Plot
The review began hopeful. “Numbers look good,” said SRE. “Binary size is cute,” said DevOps. “Can the on-call rotate this?” asked my manager. “Where’s the threat model?” asked Security. “What’s our policy on language creep?” asked the CTO. Language creep. I had optimized milliseconds; they worried about governance — the quiet glue that keeps a company predictable.
Translation of the raised eyebrows:
- On-call literacy. Our playbooks are JVM-shaped: JFR, heap dumps, familiar alerts. Rust needs new muscle memory.
- Hiring & coverage. At 3 a.m., who can touch this safely? Our bench is Java-heavy.
- Security pipeline. SBOMs, SAST, license checks — all tuned for the JVM. Rust’s great; ours wasn’t ready for it.
- Platform consistency. A thousand local wins can be canceled by one organization-wide outlier.
- Time-to-change. We shaved latency but added weeks of cross-team work.
My technical win was a social regression. I improved the tail while blowing up the map.
Four Mistakes That Made A Promotion Look Like A Pink Slip
- Optimizing The Wrong KPI. I chased p95. Leadership cared about time-to-market and team mobility across services. My graph didn’t move their graph.
- Underestimating “Mean Time To Explain.” Retros are powered by common language and common tools. I introduced a new dialect mid-sentence.
- Treating Tooling Debt As “Future Work.” Engineers see toil as a puzzle. Organizations see toil as risk. My puzzle was their pager.
- Confusing “Cheaper, Faster, Safer” With “Predictable.” Rewriting one service in a new language is a strategy claim dressed as a local refactor.
What Rust Actually Changed (And Didn’t)
Changed, for real:
- Heap drama → Ownership clarity. Those hot JSON paths stopped allocating like a soap opera.
- Tail latency. Less GC variance; fewer “why is p99 screaming” moments.
- Start-up & idle footprint. Cold starts and scale-to-zero games got easier.
Didn’t change (sorry):
- Databases. If your bottleneck is Postgres in Java, it’s still Postgres in Rust — now with immaculate lifetimes.
- Cross-team drag. New stack → new tooling → new humans to train.
- Feature speed. If product logic dominates, language speed ≠ shipping speed.
The Funny Part (If You Squint)
The morning Finance emailed a happy note about the lowered bill, Security asked who approved the new SBOM pipeline. PM asked whether this jeopardized the Q4 promo. SRE asked how to debug on-box when eBPF throws a tantrum. The CTO asked how many services would “benefit” from Rust. I answered honestly: “A handful, maybe five.” He nodded. “I love the craft. I don’t love the precedent.” Precedent, it turns out, weighs more than binaries. By Friday, my badge beeped red.
The Better Plan I Should Have Followed
If you’re Rust-curious (and sometimes you should be), here’s the boring, correct sequence:
1) Propose A “Runtime Exception” Lane. One page. One quarter. One service. Entry criteria: measured SLO pain, isolatable hot path, mature libraries, rollback plan, and a sunset clause if it under-delivers.
2) Start With A Sidecar, Not A Rewrite. Peel off one hot path (serialization, crypto, image ops) into a co-resident Rust sidecar. Keep the Java service the boss. Measure tail improvements with identical dashboards.
3) Make Platform Own The Tooling. Ask for a small funded initiative: SBOMs, SAST, artifact signing, tracing conventions, crash capture, dashboards. If Platform blesses it, you’re a citizen, not a rebel.
4) Treat Observability As A Contract. Before code, lock the log format, trace IDs, error taxonomy, and dashboards. “Looks the same; behaves better” is a winning pitch.
5) Strangler With Business Toggles. Start with a single endpoint. Roll forward and back via a flag or Envoy route. Rollback is a minute, not a meeting.
6) Publish The Deletion Plan On Day One. The ability to delete is the soul of an experiment. Write the obituary with the birth certificate.
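To make “observability as a contract” concrete, here is a hedged sketch of what locking the log format and error taxonomy might look like in code. Every name in it (the `ErrorClass` variants, the field names, the key=value layout) is hypothetical and invented for illustration; the point is only that the contract exists as one small, testable thing that both the Java and the Rust side must reproduce byte for byte.

```rust
/// Error taxonomy agreed on before any rewrite (hypothetical categories).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    None,
    UpstreamTimeout,
    InvalidRequest,
    Internal,
}

impl ErrorClass {
    /// The exact string that dashboards and alerts would match on.
    pub fn as_str(self) -> &'static str {
        match self {
            ErrorClass::None => "none",
            ErrorClass::UpstreamTimeout => "upstream_timeout",
            ErrorClass::InvalidRequest => "invalid_request",
            ErrorClass::Internal => "internal",
        }
    }
}

/// One request log record; the field names and their order are the contract.
pub struct RequestLog<'a> {
    pub trace_id: &'a str,
    pub endpoint: &'a str,
    pub status: u16,
    pub latency_ms: u64,
    pub error: ErrorClass,
}

impl RequestLog<'_> {
    /// Render as space-separated key=value pairs.
    /// Assumes values contain no spaces; a real format would need escaping.
    pub fn to_line(&self) -> String {
        format!(
            "trace_id={} endpoint={} status={} latency_ms={} error={}",
            self.trace_id,
            self.endpoint,
            self.status,
            self.latency_ms,
            self.error.as_str()
        )
    }
}
```

A golden-line test against this output, run in both codebases, is one cheap way to keep “looks the same” from drifting.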
Postmortem Without The Buzzwords
What Went Well
- Clean canary, clear instrumentation, reversible rollout.
- Real, repeatable tail improvements.
- Docs that explained shape, not just syntax.
What Went Badly
- Perf over predictability.
- Unfunded tooling and training debt.
- Governance treated as “someone else’s Jira.”
What We’ll Do Next Time
- Sidecar first; rewrite later (maybe never).
- Platform-owned security and SBOMs before code lands.
- Ask for policy up front; don’t smuggle strategy in a PR.
A Tiny Go/No-Go Checklist (Steal This)
- SLO pain is measured and business-visible.
- Hot path is isolatable behind a flag.
- Platform buy-in exists for SBOM, SAST, signing, tracing.
- On-call literacy: at least four people can debug it at 3 a.m.
- Rollback in minutes, not hours.
- Deletion plan approved.
Tape it near your keyboard. If you can’t check the boxes, tune your GC, fix your N+1, and save the hero cape for Halloween.
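If taping things near keyboards isn’t your style, the same checklist fits in a few lines of code. This is a sketch with made-up field names, and the 60-minute cutoff for “minutes, not hours” is my assumption, not a rule from anywhere.

```rust
/// Inputs for the go/no-go decision. Field names are illustrative.
pub struct RewriteProposal {
    pub slo_pain_measured: bool,
    pub hot_path_behind_flag: bool,
    pub platform_buy_in: bool,
    pub on_call_debuggers: u32,
    pub rollback_minutes: u32,
    pub deletion_plan_approved: bool,
}

/// Return the checklist items that are not yet satisfied, in checklist order.
pub fn unmet_criteria(p: &RewriteProposal) -> Vec<&'static str> {
    let checks = [
        (p.slo_pain_measured, "SLO pain is measured and business-visible"),
        (p.hot_path_behind_flag, "hot path is isolatable behind a flag"),
        (p.platform_buy_in, "Platform buy-in for SBOM, SAST, signing, tracing"),
        (p.on_call_debuggers >= 4, "at least four people can debug it at 3 a.m."),
        // Assumed threshold: anything under an hour counts as "minutes".
        (p.rollback_minutes < 60, "rollback in minutes, not hours"),
        (p.deletion_plan_approved, "deletion plan approved"),
    ];
    checks
        .iter()
        .filter(|(ok, _)| !*ok)
        .map(|(_, label)| *label)
        .collect()
}

/// Go only when every box is checked.
pub fn is_go(p: &RewriteProposal) -> bool {
    unmet_criteria(p).is_empty()
}
```

Returning the list of unmet items, rather than a bare yes or no, gives the memo its next section for free.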
The Box, The Bus Stop, The Epilogue
HR was kind. Security was efficient. I took a bus; irony travels light. On the ride, a Rust-first startup messaged: “Saw your talk last year. We’re hiring. We like your numbers.”
Here’s my unheroic conclusion: I still like Java. I still love Rust. I love companies even more, because they pay for both. Tools are easy to champion; systems of people are hard to change. The better engineer isn’t the one who wins the benchmark; it’s the one who ships improvements without surprising the org chart.
If you’re holding a rewrite itch, do the unsexy thing: write the memo, start smaller than your ego, and count every new tool you’re asking the on-call to learn at 3 a.m. If the math still checks out, I’ll cheer for your tail. If not, brag about the graph that matters even more: time-to-change.