
He Migrated 100,000 Lines to Rust — Then Everything Broke

From JOHNWICK

The night the graphs went weird

We flipped 20% of traffic to the shiny Rust service.
CPU fell. Latency… spiked.
Dashboards were green. Users were not. No one got fired. But it was close.

This is the story of what actually broke in a large Rust migration at a bank-scale backend, why it broke, and how to fix it fast — without rewriting the rewrite. I’ll keep the language simple. Short sections. Concrete code. No myths.


What we migrated (so you can map this to your world)

  • Domain: payments + ledger posts + notifications.
  • Old stack: Java 17 + Spring Boot, Reactor, Postgres, Kafka, Redis.
  • New stack: Rust 1.79+, axum 0.7, tokio, sqlx (compile-time checked SQL), rust_decimal for money, tracing + OTEL, rdkafka.
  • Why Rust: predictable latency, memory safety, lean containers, cold-start speed for batch workers.

We didn’t “port files.” We re-modeled the critical path. And that’s where the trouble started.


Architecture (hand-drawn style)

+-----------------+         HTTP/gRPC         +--------------------+
|  Mobile / Web   |  ───────────────────────> |  Edge / API Router |
+-----------------+                           +----------+---------+
                                                         |
                                                         v
                                              +----------+---------+
                                              |    Rust Gateway    |  (axum)
                                              +----------+---------+
                                                         |
                                             async       |       sync
                                         events (Kafka)  |  pg txns (sqlx)
                                                         |
                         +--------------------+----------+--------------------+
                         |                    |                               |
                         v                    v                               v
               +---------+-------+   +--------+--------+            +---------+--------+
               |  Matching/FX    |   |  Ledger Service |            |  Notifications   |
               |  (old Java)     |   |  (Rust)         |            |  (old Java)      |
               +---------+-------+   +--------+--------+            +---------+--------+
                         |                    |                               |
                         v                    |                               |
                    Kafka topics  <-----------+----------- idempotent events -+
                         |
                         v
              +----------+-----------+
              |  Postgres (ACID)     |
              |  tables: tx, ledger  |
              +----------------------+

Idea: Rust owns the hot path for posting payments to the ledger.
Java services still handle matching and downstream fanout.
Kafka stitches the world together. Postgres is truth.


What broke (and why)

1) Async that wasn’t

We used blocking crypto and DNS lookups inside tokio handlers.
The threadpool starved; tail latency exploded. Fix: move blocking work to spawn_blocking; audit every dependency for non-async I/O. Use reqwest with rustls for TLS and sqlx’s fully async pool.
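
A minimal sketch of the fix, assuming tokio’s blocking pool; expensive_verify is a hypothetical stand-in for a blocking crypto call:

use tokio::task;

// Hypothetical stand-in for a blocking signature check.
fn expensive_verify(_payload: &[u8], _sig: &[u8]) -> bool { true }

// Keep CPU-heavy work off the reactor: spawn_blocking moves it to
// tokio's blocking pool so async handlers keep making progress.
async fn verify_signature(payload: Vec<u8>, sig: Vec<u8>) -> bool {
    task::spawn_blocking(move || expensive_verify(&payload, &sig))
        .await
        .unwrap_or(false) // a panicked or cancelled task counts as a failed check
}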

2) Money as f64

A few conversions sneaked in via serde defaults. Rounding drifted; reconciliation failed. Fix: use rust_decimal::Decimal end-to-end. Validate JSON schema so floats can’t slip in.
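
A minimal sketch of the gate, assuming rust_decimal’s serde-with-str feature; the Amount type is illustrative:

use rust_decimal::Decimal;
use serde::Deserialize;

// Amounts travel as JSON strings, never floats: "12.30" parses,
// while the bare number 12.30 is rejected at deserialization time.
#[derive(Deserialize)]
struct Amount {
    #[serde(with = "rust_decimal::serde::str")]
    value: Decimal,
    currency: String,
}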

3) Idempotency drift

Java side treated the idempotency key as (userId, merchantRef).
The Rust service assumed requestId. Duplicates hit the ledger. Fix: one central contract, (tenantId, source, key), as the unique identity. Enforce it at the DB with a unique index and in code with UPSERT + status transitions.
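
For illustration, the contract as one shared type (names are hypothetical; the DB-side enforcement appears in the handler and schema below):

/// The idempotency identity every service must agree on.
/// The triple travels together or not at all.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct IdemKey {
    tenant_id: String,
    source: String, // e.g. "gateway", "batch"
    key: String,    // caller-supplied request id or merchant ref
}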

4) Exactly-once fantasies

Kafka “exactly once” wasn’t. A consumer crashed between write and commit.
Ledger showed a gap; notifications fired twice. Fix: Outbox + transactional writes. Produce from DB-persisted outbox within the same transaction, then relay.
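
A minimal relay sketch, assuming rdkafka’s FutureProducer and the outbox table from the schema below; the anyhow error style is an assumption:

use rdkafka::producer::{FutureProducer, FutureRecord};
use rdkafka::util::Timeout;
use sqlx::PgPool;
use std::time::Duration;

async fn relay_once(db: &PgPool, producer: &FutureProducer) -> anyhow::Result<()> {
    // Claim a batch of unpublished rows. FOR UPDATE SKIP LOCKED lets
    // several relay workers run without stepping on each other.
    let mut tx = db.begin().await?;
    let rows = sqlx::query!(
        r#"SELECT id, topic, payload::text AS "payload!"
           FROM outbox
           WHERE published_at IS NULL
           ORDER BY id
           LIMIT 100
           FOR UPDATE SKIP LOCKED"#
    )
    .fetch_all(&mut *tx)
    .await?;

    for row in rows {
        // Publish, then mark. A crash between the two re-sends the event;
        // that's fine, because consumers dedupe at the unique gate.
        producer
            .send(
                FutureRecord::<(), _>::to(&row.topic).payload(&row.payload),
                Timeout::After(Duration::from_secs(5)),
            )
            .await
            .map_err(|(e, _msg)| anyhow::Error::from(e))?;
        sqlx::query!("UPDATE outbox SET published_at = now() WHERE id = $1", row.id)
            .execute(&mut *tx)
            .await?;
    }
    tx.commit().await?;
    Ok(())
}

At-least-once on the wire, exactly-once in effect: the unique index, not the broker, does the deduplication.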

5) Time zones and business days

The Java world used ZonedDateTime with IST business rules.
Our Rust used UTC epoch seconds. Daily cutoffs missed. Fix: named time zones (chrono-tz, or time plus time-tz), with cutoffs computed from a calendar table. Don’t compute calendars in code.
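
A sketch under those rules, assuming chrono-tz, sqlx with its chrono feature, and a hypothetical business_calendar(date, is_business_day) table:

use chrono::{DateTime, NaiveDate, Utc};
use chrono_tz::Asia::Kolkata;
use sqlx::PgPool;

// Convert the instant to the named zone in code, but let the calendar
// table decide what counts as a business day.
async fn posting_date(db: &PgPool, at: DateTime<Utc>) -> sqlx::Result<NaiveDate> {
    let local_date = at.with_timezone(&Kolkata).date_naive();
    sqlx::query_scalar(
        "SELECT MIN(date) FROM business_calendar WHERE date >= $1 AND is_business_day",
    )
    .bind(local_date)
    .fetch_one(db) // errors if the calendar runs out; load it well ahead
    .await
}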

6) Backpressure mismatch

Old Reactor pipeline had bounded queues. New Rust pipelines happily read from Kafka as fast as possible.
Downstream choked. Fix: tokio_util::task::TaskTracker + bounded channels; commit offsets only after the durable write. Make “pull” match “sink.”
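
A minimal pipeline sketch; poll_kafka, write_to_db, and commit_offset are hypothetical stand-ins for the real consumer, sink, and offset commit:

use tokio::sync::mpsc;
use tokio_util::task::TaskTracker;

struct Event { offset: i64, payload: Vec<u8> }

async fn poll_kafka() -> Option<Event> { None /* stand-in */ }
async fn write_to_db(_ev: &Event) { /* stand-in */ }
async fn commit_offset(_offset: i64) { /* stand-in */ }

async fn pipeline() {
    // The channel capacity is the speed limit: when the writer lags,
    // send() awaits, and the loop below stops pulling from Kafka.
    let (tx, mut rx) = mpsc::channel::<Event>(256);
    let tracker = TaskTracker::new();

    tracker.spawn(async move {
        while let Some(ev) = rx.recv().await {
            write_to_db(&ev).await;         // durable write first...
            commit_offset(ev.offset).await; // ...then, and only then, the offset
        }
    });

    while let Some(ev) = poll_kafka().await {
        if tx.send(ev).await.is_err() { break; } // writer gone; stop pulling
    }
    drop(tx);             // let the writer drain and exit
    tracker.close();
    tracker.wait().await; // wait for in-flight work before shutdown
}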

7) Binary & base-image surprises

Musl image looked neat. TLS handshakes weren’t.
We hit odd DNS and cert issues under load. Fix: glibc distroless base; pinned openssl/rustls. Keep images boring in prod.


The one thing that saved us: a boring handler

We refactored the critical path into one small, testable handler: “capture payment with idempotency”.
This is the shape that calmed the fire.

Rust (axum + sqlx + rust_decimal + tracing):

use axum::{extract::{State, Json}, http::StatusCode};
use rust_decimal::Decimal;
use serde::Deserialize;
use sqlx::{PgPool, Postgres, Transaction};
use tracing::{info, instrument};

#[derive(Deserialize)]
struct Capture { tenant: String, key: String, amount: Decimal, currency: String }

#[instrument(skip(db))]
async fn capture(
    State(db): State<PgPool>,
    Json(c): Json<Capture>,
) -> Result<StatusCode, StatusCode> {
    let mut tx: Transaction<Postgres> = db.begin().await.map_err(|_| StatusCode::BAD_GATEWAY)?;
    // Reserve or read the existing row (idempotent gate). The no-op
    // DO UPDATE makes RETURNING yield the stored row on conflict.
    let rec = sqlx::query!(
        r#"INSERT INTO payments (tenant, idem_key, amount, currency, status)
           VALUES ($1,$2,$3,$4,'PENDING')
           ON CONFLICT (tenant, idem_key) DO UPDATE SET tenant=EXCLUDED.tenant
           RETURNING id, status"#,
        c.tenant, c.key, c.amount, c.currency
    ).fetch_one(&mut *tx).await.map_err(|_| StatusCode::BAD_GATEWAY)?; // sqlx 0.7: executors take &mut *tx

    if rec.status != "PENDING" {
        tx.rollback().await.ok();
        return Ok(StatusCode::OK); // already captured or processing
    }

    // Business rules + ledger post
    sqlx::query!("UPDATE payments SET status='CAPTURED' WHERE id=$1", rec.id)
        .execute(&mut tx).await.map_err(|_| StatusCode::BAD_GATEWAY)?;
    sqlx::query!("INSERT INTO outbox (payment_id, topic, payload) VALUES ($1,'payments.captured', '{}')",
        rec.id).execute(&mut tx).await.map_err(|_| StatusCode::BAD_GATEWAY)?;

    tx.commit().await.map_err(|_| StatusCode::BAD_GATEWAY)?;
    info!(payment_id=%rec.id, "captured");
    Ok(StatusCode::CREATED)
}
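
For completeness, a wiring sketch (the route path and pool construction are assumptions):

use axum::{routing::post, Router};
use sqlx::PgPool;

fn router(pool: PgPool) -> Router {
    // The pool is the only shared state the handler needs.
    Router::new().route("/capture", post(capture)).with_state(pool)
}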

Why it works

  • DB is the gate. First write wins. Retries are harmless.
  • Outbox is inside the same transaction. A relay publishes later; no gap.
  • No floats. Decimal everywhere.
  • Trace spans wrap the whole thing. Fail fast if the DB or publish path blinks.


SQL that enforces peace

CREATE TABLE payments (
  id BIGSERIAL PRIMARY KEY,
  tenant TEXT NOT NULL,
  idem_key TEXT NOT NULL,
  amount NUMERIC(20,6) NOT NULL,
  currency TEXT NOT NULL,
  status TEXT NOT NULL CHECK (status IN ('PENDING','CAPTURED','REFUNDED')),
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX uniq_payment_idem ON payments (tenant, idem_key);

CREATE TABLE outbox (
  id BIGSERIAL PRIMARY KEY,
  payment_id BIGINT NOT NULL REFERENCES payments(id),
  topic TEXT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ
);

This schema reduced angry pages more than any micro-optimization.


How we made Rust feel like the old system (in a good way)

  • Contracts stayed stable. We versioned protobufs/JSON. No stealth renames.
  • Logs spoke the same language. tracing fields matched old MDC keys. Search worked day one.
  • SLOs matched reality. p95 targeted the slowest dependency, not our wish list.
  • We shipped behind a flag. 5% → 20% → 50% → 100%, with shadow reads on every step.

“But isn’t Rust supposed to be faster?”

Yes — when you let it be. Rust gives you control. Control cuts both ways.
When you block the reactor, mix float money, or widen queues with no backpressure, you’re choosing slow, just in a different language. The win came from better shape, not just a new compiler:

  • Small async handlers.
  • ACID first.
  • Outbox, not hope.
  • Decimal money.
  • Bounded everything.


What we measured (and why it mattered)

We didn’t publish vanity charts.
We published operational deltas that matter to a bank:

  • Charge idempotency collisions: from dozens/day to ~0 after index + UPSERT.
  • Replay safety: outbox relay retried safely; duplicates dropped at unique gate.
  • Tail latency: p99 fell after moving blocking DNS + crypto to dedicated pools.
  • On-call pages: fewer “ghost charges,” fewer “double notifications.”

Numbers were boring. Support queues were quiet. That was the point.


The migration playbook that didn’t read like a playbook

I promised no checklists. Here’s a story you can reuse.

Start with one verb. We chose capture. Put it in Rust. Nail it.
Keep every other verb (quote, authorize, refund) where it is.

Let Postgres decide who wins. Not your app. Not Kafka. Postgres.
One unique index is worth a dozen meetings.

Publish after commit. If someone asks “why not exactly once,” smile and point to the outbox.

Give every component a speed limit. Bounded queues. Bounded pools. Bounded retries.

Make logs boring and the same. Your future self at 3 a.m. is the customer.


Common traps you can dodge on day one

Trap: moving CPU-heavy crypto into async tasks.
Dodge: spawn_blocking with a separate sized pool; budget it.

Trap: JSON floats sneaking into amounts.
Dodge: serde(with = "rust_decimal::serde::str") on all money types.

Trap: consumer reads faster than you can write.
Dodge: backpressure + commit after durable write only.

Trap: “we replaced Reactor with Tokio; that’s it.”
Dodge: check every library: DNS, TLS, DB, metrics. All async.


A small adapter that kept Java happy

Our gateway had to talk to old Reactor services. Here’s the Rust side proxying with reqwest + timeouts + cancellation:

use std::time::Duration;
use axum::{http::StatusCode, routing::post, Router};
use reqwest::Client;
use tokio::time::timeout;

fn app(client: Client) -> Router {
    Router::new().route("/notify", post(move |body: String| notify(client.clone(), body)))
}

async fn notify(client: Client, body: String) -> Result<(), (StatusCode, String)> {
    let req = client.post("http://legacy-notify.svc/notify")
        .timeout(Duration::from_secs(2)) // per-request budget
        .header("content-type", "application/json")
        .body(body);

    // Outer deadline as well, so a stalled connect can't hang the handler.
    // Errors map to real error statuses instead of leaking out as 200s.
    timeout(Duration::from_secs(3), req.send())
        .await.map_err(|_| (StatusCode::GATEWAY_TIMEOUT, "notify timeout".to_string()))?
        .map_err(|e| (StatusCode::BAD_GATEWAY, e.to_string()))?
        .error_for_status()
        .map(|_| ())
        .map_err(|e| (StatusCode::BAD_GATEWAY, e.to_string()))
}

We killed dozens of “hung” requests just by setting timeouts and propagating cancellation.
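
One more guardrail worth stating: build the client once with conservative defaults, so every call inherits them (the exact numbers here are illustrative):

use std::time::Duration;
use reqwest::Client;

fn legacy_client() -> reqwest::Result<Client> {
    Client::builder()
        .connect_timeout(Duration::from_millis(500)) // fail fast on dead hosts
        .timeout(Duration::from_secs(2))             // default per-request budget
        .pool_idle_timeout(Duration::from_secs(30))  // recycle stale connections
        .build()
}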


What I’d do if you told me “go again tomorrow”

  • Pick one verb.
  • Draw the idempotent path on a whiteboard.
  • Code the handler + SQL first.
  • Add the outbox.
  • Wrap with tracing + OTEL.
  • Run on distroless glibc.
  • Release behind a traffic flag.

That sequence will save you months.


Two minutes on org reality (the part no one writes)

People don’t resist Rust.
They resist surprises. When data didn’t reconcile, reconciliation analysts took the hit.
When graph lines were pretty but the ledger wasn’t, risk teams got nervous.
Once duplicate charges stopped, no one cared what language it was. Build for the humans who clean up after your code. They decide the fate of your migration.


If you need one diagram to sell this at work

CLIENT ──> RUST HANDLER ──> BEGIN TX
                   |            |
                   |            ├─ UPSERT payments (tenant, idem_key)
                   |            ├─ INSERT outbox (payments.captured)
                   |            └─ COMMIT
                   |                     |
                   |                 OUTBOX RELAY ──> Kafka topic
                   |                                   |
                   └───────────────────────────────────┴──> Java consumers
                                                          (notifications, analytics)

Everyone can understand this in five minutes.
It’s simple. It’s durable. It’s testable.


Closing: why we stayed with Rust anyway

After the fixes, the rewrite felt… boring.
That’s a compliment.

We kept Rust because:

  • Handlers stayed small and sharp.
  • Memory stayed flat.
  • Cold starts vanished.
  • The on-call felt lighter.

And because the shape was right: one gate, one commit, one outbox.

If your team is staring at a Rust migration and your stomach is tense, start here.
Make the smallest path correct.
Then scale out with confidence.

When the ledger balances on the first try, no one asks for the language.


Want the full module layout?

If you want this as a ready-to-clone template (axum routes, sqlx migrations, outbox relay worker, OTEL config, Dockerfiles for distroless), say the word and tell me your Kafka vs Pulsar preference. I’ll sketch the tree and you can ship a pilot in a week.
