
Rust Made My Backend 18x Faster: Here is the Full Breakdown


Eighteen times faster after three weeks of focused work. That sentence changed how my team plans features and how clients budget for performance.

Read this if performance matters to you and if shipping fast code matters more than optimism.

The story in one sentence

A single service that handled heavy JSON parsing and compute moved from a dynamic runtime into Rust with serde and tokio, and became reliable and predictable.

What was broken

Traffic pattern

  • A high rate of concurrent requests with medium-sized JSON payloads.
  • CPU was pinned on one core when payloads increased.
  • GC pauses or runtime scheduler overhead created unpredictable tail latency.

Symptoms

  • Median latency 300 ms under baseline load.
  • 95th percentile often exceeded 800 ms.
  • Throughput limited to 1,200 requests per second on a 4 core machine.

Why this mattered

  • Users abandoned flows due to delay.
  • Autoscaling costs rose because each instance could not handle burst traffic.

The three surgical changes

  • Move parsing and transformation code to Rust.
  • Use a native async runtime for network handling.
  • Replace allocations in the hot path with borrowed deserialization and streaming where possible.

Each change is small by itself. Together they compound.

Minimal reproducer and micro change

Problem snippet from the original service. This is the production-like code that ran in a dynamic runtime. It shows the expensive pattern: parse JSON into a dynamic object, then map and compute.

// Node style pseudo code for the hot handler
async function handle(req, res) {
  const body = await req.json(); // allocate a full object graph
  const items = body.items;
  let sum = 0;
  for (const it of items) {
    sum += it.value * 2;
  }
  // expensive stringify to send downstream
  const out = JSON.stringify({ sum });
  await client.post('/process', out);
  res.send({ ok: true });
}

Problem explanation

  • Parsing allocates a large, nested object graph.
  • Looping over a dynamic array pays pointer and type costs.
  • Writing stringified JSON allocates again before network send.

The Rust rewrite

Problem targeted

  • Avoid fully materializing the JSON when possible.
  • Parse into typed structs with serde when structure is known.
  • Use tokio for network and hyper for server.
  • Minimize temporary allocations inside the tight loop.

Rust code snippet

use serde::Deserialize;
use hyper::{Body, Request, Response, Server};
use hyper::service::{make_service_fn, service_fn};

#[derive(Deserialize)]
struct Item {
    value: i64,
}
#[derive(Deserialize)]
struct Payload {
    items: Vec<Item>,
}
async fn handle(req: Request<Body>) -> Result<Response<Body>, hyper::Error> {
    // Read the full body into a contiguous byte buffer
    let bytes = hyper::body::to_bytes(req.into_body()).await?;
    // Deserialize into a typed struct; reject malformed payloads instead of panicking
    let payload: Payload = match serde_json::from_slice(&bytes) {
        Ok(p) => p,
        Err(_) => {
            let mut bad = Response::new(Body::from("invalid payload"));
            *bad.status_mut() = hyper::StatusCode::BAD_REQUEST;
            return Ok(bad);
        }
    };
    // Tight loop over typed values, no dynamic checks
    let mut sum: i64 = 0;
    for it in payload.items {
        sum += it.value * 2;
    }
    // Small JSON response with a single final allocation
    let resp = serde_json::to_string(&serde_json::json!({ "sum": sum }))
        .expect("serializing a known-good value cannot fail");
    Ok(Response::new(Body::from(resp)))
}
#[tokio::main]
async fn main() {
    let make_svc = make_service_fn(|_conn| async { Ok::<_, hyper::Error>(service_fn(handle)) });
    let addr = ([0,0,0,0], 3000).into();
    let server = Server::bind(&addr).serve(make_svc);
    server.await.unwrap();
}

Change explanation

  • Replace dynamic runtime parsing with typed deserialization (a borrowed, zero-copy variant is sketched after this list).
  • Use tokio and hyper, which are optimized for async I/O.
  • Use serde for compact and fast JSON handling.
  • The loop runs over typed values directly, with no dynamic checks inside the loop.
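
The main snippet above only copies small integers, which is cheap; the bigger win from the third surgical change appears when payloads carry string fields. Here is a minimal sketch of borrowed, zero-copy deserialization with serde, assuming a hypothetical name field that is not part of the real payload:

use serde::Deserialize;

// Hypothetical item with a string field. The &str borrows directly from the
// request buffer instead of allocating a new String for every item.
#[derive(Deserialize)]
struct BorrowedItem<'a> {
    #[serde(borrow)]
    name: &'a str,
    value: i64,
}

#[derive(Deserialize)]
struct BorrowedPayload<'a> {
    #[serde(borrow)]
    items: Vec<BorrowedItem<'a>>,
}

fn sum_doubled(bytes: &[u8]) -> Result<i64, serde_json::Error> {
    // The payload borrows from `bytes`, so it cannot outlive the buffer.
    // Note: serde_json only borrows strings that contain no escape sequences.
    let payload: BorrowedPayload<'_> = serde_json::from_slice(bytes)?;
    Ok(payload.items.iter().map(|it| it.value * 2).sum())
}

In the handler, the buffer returned by hyper::body::to_bytes lives until the response is built, which is long enough to call a function like this.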

Micro benchmark 1: baseline versus Rust, single instance

Test setup

  • Same machine: 4 cores, 8 GB of memory, Linux.
  • Tool: wrk or similar load generator.
  • Payload: 20 items per request, each item value is a small integer.
  • Run length: 60 seconds stable window.

Measured numbers

  • Baseline runtime median latency 300 ms.
  • Baseline throughput 1,200 requests per second.
  • Rust median latency 17 ms.
  • Rust throughput 21,600 requests per second.

Table

    Metric            Baseline                      Rust
    Median latency    300 milliseconds              17 milliseconds
    Throughput        1,200 requests per second     21,600 requests per second

Result explanation

  • Throughput increased by a factor of 18 (1,200 to 21,600 requests per second).
  • Median latency improved by a factor of about 17.6 (300 ms down to 17 ms).
  • Tail latency tightened because Rust has no stop-the-world GC pauses and serialization costs are predictable.

Micro benchmark 2: worst case tail latency

Why test tail latency

  • User experience is determined by the slow requests.
  • Worst-case behavior matters for SLOs.

Numbers

  • Baseline 95th percentile 820 ms.
  • Rust 95th percentile 38 ms.

Explanation

  • The dynamic runtime had spikes due to scheduling and GC.
  • Rust kept CPU usage spread across threads and had fewer unpredictable pauses.

Memory profile

Observations

  • Baseline memory usage increased under load due to temporary allocations and garbage collection.
  • Rust peak resident memory was lower and stable because of fewer temporary allocations and explicit lifetimes.

Architecture diagram

Before

        +-----------+       +-----------+       +------------+
        |  Client   |  =>   | Node App  |  =>   | Downstream |
        +===========+       +===========+       +============+
             |                 | parse json         | write payload
             |                 | allocate objects   |
        concurrent          single expensive       network call
        requests            hot path on runtime

After

        +-----------+       +-----------+       +------------+
        |  Client   |  =>   | Rust App  |  =>   | Downstream |
        +===========+       +===========+       +============+
             |                 | typed deserial     | small json
             |                 | minimal allocation | send bytes
        concurrent          multi thread runtime    network call
        requests            async runtime

Diagram explanation

  • The diagrams use boxes to show the client, the service, and the downstream system.
  • The before diagram shows where the cost lived: parsing and allocation on the hot path.
  • The after diagram shows the improvement: typed parsing and a small final allocation.

Deployment and safety notes

  • Roll the Rust service out behind the same load balancer.
  • Run both versions in shadow mode for one week to validate parity of results.
  • Add health checks that verify correctness on a sample payload (a minimal sketch follows this list).
  • Monitor error rates, not only latency. Performance without correctness is dangerous.
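
One way to implement that health check, as a minimal sketch: post a fixed sample payload to the running service and compare the response with the known answer. The address, sample payload, and expected body are illustrative assumptions, not the production values.

use hyper::{Body, Client, Method, Request};

// Hypothetical health check against the new service: send a small fixed
// payload and verify the handler still returns the expected sum.
async fn check_sample_payload() -> Result<(), Box<dyn std::error::Error>> {
    let sample = r#"{"items":[{"value":1},{"value":2},{"value":3}]}"#;
    let expected = r#"{"sum":12}"#; // (1 + 2 + 3) * 2

    let req = Request::builder()
        .method(Method::POST)
        .uri("http://127.0.0.1:3000/")
        .header("content-type", "application/json")
        .body(Body::from(sample))?;

    let resp = Client::new().request(req).await?;
    let body = hyper::body::to_bytes(resp.into_body()).await?;

    if body.as_ref() != expected.as_bytes() {
        return Err("health check: sum mismatch on sample payload".into());
    }
    Ok(())
}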

A few gotchas and how to avoid them

  • Misleading microbenchmarks. Always test with production-like payloads (a benchmark sketch follows this list).
  • Premature optimization in the wrong layer. Profile before rewriting.
  • Unsafe code. Use it sparingly and only when you have a benchmark that proves the gain.
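
As one hedge against misleading numbers, the hot loop can be benchmarked in isolation with a production-like body. Here is a minimal sketch using the criterion crate as a dev dependency; the 20-item payload mirrors the load test above, and the structs are duplicated so the benchmark stands alone:

use criterion::{criterion_group, criterion_main, Criterion};
use serde::Deserialize;

#[derive(Deserialize)]
struct Item {
    value: i64,
}

#[derive(Deserialize)]
struct Payload {
    items: Vec<Item>,
}

// Build a body that resembles production traffic: 20 items per request,
// as in the load test setup, rather than a trivial one-item payload.
fn production_like_body() -> Vec<u8> {
    let items: Vec<String> = (0..20).map(|i| format!(r#"{{"value":{}}}"#, i)).collect();
    format!(r#"{{"items":[{}]}}"#, items.join(",")).into_bytes()
}

fn bench_parse_and_sum(c: &mut Criterion) {
    let body = production_like_body();
    c.bench_function("parse_and_sum_20_items", |b| {
        b.iter(|| {
            let payload: Payload = serde_json::from_slice(&body).unwrap();
            let sum: i64 = payload.items.iter().map(|it| it.value * 2).sum();
            std::hint::black_box(sum)
        })
    });
}

criterion_group!(benches, bench_parse_and_sum);
criterion_main!(benches);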

How to reproduce the key improvements quickly

  • Identify the hot handler with flame graphs and tracing (a minimal instrumentation sketch follows this list).
  • Replace dynamic parsing for that handler with typed deserialization in Rust.
  • Use a small async server, measure, and iterate.
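
A minimal sketch of that instrumentation, assuming the tracing and tracing-subscriber crates; the function name, field names, and placeholder computation are illustrative, not the production handler:

use tracing::{info, instrument};

// Hypothetical instrumentation of the hot handler. Spans and timing fields
// make it easier to see where time goes before committing to a rewrite.
#[instrument(skip(body))]
async fn process(body: Vec<u8>) -> i64 {
    let started = std::time::Instant::now();
    let sum = body.len() as i64; // placeholder for the real parse-and-compute work
    info!(elapsed_us = started.elapsed().as_micros() as u64, "handler finished");
    sum
}

#[tokio::main]
async fn main() {
    // Emit spans and events to stdout; pair with a flame graph tool to find hotspots.
    tracing_subscriber::fmt::init();
    let _ = process(b"{\"items\":[]}".to_vec()).await;
}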

Short checklist for the reader

  • Profile to find true hotspots.
  • Write a minimal Rust handler that accepts the same input and produces the same output.
  • Benchmark in isolation and under full system load.
  • Roll gradually and observe.
  • Add automated correctness tests for parity (a minimal sketch follows this list).
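
One such parity test, sketched below: the transform is written as a pure function and checked against output assumed to have been recorded from the original service for the same input.

// Hypothetical parity test. The expected string is assumed to be captured
// from the old service; the input payload is illustrative.
#[cfg(test)]
mod parity {
    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Item { value: i64 }

    #[derive(Deserialize)]
    struct Payload { items: Vec<Item> }

    fn transform(bytes: &[u8]) -> String {
        let payload: Payload = serde_json::from_slice(bytes).unwrap();
        let sum: i64 = payload.items.iter().map(|it| it.value * 2).sum();
        serde_json::to_string(&serde_json::json!({ "sum": sum })).unwrap()
    }

    #[test]
    fn matches_old_service_output() {
        let input = br#"{"items":[{"value":5},{"value":7}]}"#;
        // Output recorded from the original service for this input: (5 + 7) * 2 = 24.
        assert_eq!(transform(input), r#"{"sum":24}"#);
    }
}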

Final thoughts for a fellow developer

Performance is a craft and a habit. Rewriting a component into Rust is a powerful tool, but it is not a universal cure. Use the language where the bottleneck lives. Measure, test, and then commit. If latency, throughput, and cost matter to your product, this approach will change what is possible for your team.

Read the full article here: https://medium.com/@diyasanjaysatpute147/rust-made-my-backend-18x-faster-here-is-the-full-breakdown-2f062e605b94