
Java 23 vs. Rust on the Hot Path: Where GC Still Wins

From JOHNWICK

We chased sub-millisecond p99. Rust beat Java — until we changed how we allocate.
A single tweak to object lifetimes put Java back in the lead and shaved 18% CPU.
On the hot path, the garbage collector wasn’t the villain. It was the secret weapon.

The Hot Path That Started the Fight

A request fan-out: parse 2–4 KB JSON, hit three caches, stitch a 1 KB response. At 420k req/s on 32 cores, Java 23’s p99 hovered at 5.2 ms; our Rust rewrite came in at 4.8 ms. Close — but Rust won. Then we profiled allocations: ~190 small objects/request in Java vs. a trail of Arc bumps and clones in Rust. When 95% of allocations die young, a modern generational GC (G1/ZGC) can be cheaper than hand-managing lifetimes with reference counting or overly defensive cloning.

How-to.

  • Measure young-gen survival. Track promotion rate and TLAB refill counts under load.
  • Hunt gratuitous copying in both stacks: String splits in Java, Arc<T> churn in Rust.
  • Normalize I/O: same parser, same response size, fixed-rate load (no coordinated omission).

Counterpoint. Rust still sets a lower latency floor in tight, single-core loops. But hot-path services rarely look like single-core loops.
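The first measurement step can be automated with the standard `java.lang.management` beans. A minimal sketch (collector names and granularity vary by GC, so treat the numbers as a coarse signal, not a substitute for `-Xlog:gc` or JFR):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSampler {
    // Snapshot cumulative GC counts/times; diff two snapshots taken around
    // a load window to estimate collection activity per interval.
    static long[] snapshot() {
        long count = 0, timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());   // -1 means "unavailable"
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] { count, timeMs };
    }

    public static void main(String[] args) {
        long[] before = snapshot();
        // Stand-in for real load: short-lived garbage that should die in the nursery.
        for (int i = 0; i < 5_000_000; i++) {
            byte[] scratch = new byte[64];
            if (scratch.length == 0) System.out.println("never"); // defeat dead-code elimination
        }
        long[] after = snapshot();
        System.out.println("collections=" + (after[0] - before[0])
                + " gcTimeMs=" + (after[1] - before[1]));
    }
}
```

Diffing snapshots around a fixed-rate load run gives a per-interval collection rate you can correlate with promotion percentages from the GC log.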

Generational GC Isn’t a Trash Truck. It’s a Bump Pointer.

We assumed “GC = pauses.” The profile said otherwise: Java 23’s generational ZGC (available since Java 21; the default ZGC mode since JDK 23, still current as of Oct 2025) did 1–2 ms concurrent young-gen cycles with zero stop-the-world spikes on our run. Allocation showed up as cheap pointer bumps in TLABs. Short-lived objects are the fast path for modern GCs: the cost is dominated by pointer bumps and light concurrent marking, not full tracing.

How-to.

  • Start with ZGC: -XX:+UseZGC -XX:+ZGenerational (generational mode is the default from JDK 23, so the flag is redundant there).
  • Give the young gen room: target <5% promotion under peak (watch gc.goodput).
  • Keep objects small & linear: favor flat records/POJOs over nested graphs.

Counterpoint. Huge objects (multi-MB buffers) still pressure GC. Move those off-heap or pool them.
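A hedged starting command line for the flags above — the jar name and heap size are placeholders, and `-XX:+ZGenerational` is accepted on JDK 21–23 but redundant once generational mode is the default:

```shell
# service.jar and the 8g heap are placeholders; adjust to your deployment.
java -XX:+UseZGC -XX:+ZGenerational \
     -Xlog:gc*:file=gc.log:time,uptime \
     -Xmx8g \
     -jar service.jar
```

The `-Xlog:gc*` output is where you watch young-gen cycle times and promotion under peak load.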

Rust’s Surprise Cost on the Hot Path

Our Rust fan-out looked “zero-copy” on paper. Then we added cross-task work stealing and structured fan-in. Lifetimes crossed boundaries; we reached for Arc<str> and cheap clones to break borrow tangles. CPU +12%, p99 up 0.6 ms under skew. Without a GC, cross-cutting pipelines push you toward shared ownership (refcounts) or explicit arenas. Both are great tools; both are easy to misuse under schedule pressure.

How-to.

  • Prefer borrowing over Arc<T> in leaf functions; escalate ownership only at boundaries.
  • Use bump/arena allocators (e.g., typed arenas) for per-request scratch where it fits.
  • Freeze interfaces: design for lifetimes first, then code. Moving later = clones sneaking in.

Counterpoint. A well-designed Rust pipeline can be zero-copy and blisteringly fast. It just demands more design time and discipline.

Tuning Java 23’s Young-Gen: The Flip

We cut Java allocations from ~190 to ~70 objects/request by reusing builders and interning a tiny set of 24 response keys. Young-gen promotion fell below 3%. Result: p99 dropped from 5.2 ms to 4.5 ms at 420k req/s; CPU fell 18%. You don’t need “no allocations.” You need predictable, short-lived allocations that die in the nursery.

How-to.

  • Thread-local scratch: ThreadLocal<StringBuilder> or ByteArrayOutputStream for hot serialization.
  • Flyweight keys: keep a small constants map for repeated field names.
  • Avoid hidden boxing: watch autoboxing in streams/collectors on the hot path.

Counterpoint. Over-pooling can create cross-thread contention. Keep reuse per-thread, not global.
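The flyweight-key bullet can be sketched as a tiny canonicalizing map — the field names below are illustrative, not the 24 keys from the service described here:

```java
import java.util.Map;

final class FieldKeys {
    // Small, fixed set of canonical response keys; built once at startup.
    // Hot-path code calls intern() so repeated field names reuse one instance
    // instead of allocating a fresh String per request.
    private static final Map<String, String> CANONICAL = Map.of(
            "id", "id",
            "status", "status",
            "payload", "payload",
            "ts", "ts");

    // Return the shared instance when the key is known; otherwise keep the input.
    static String intern(String raw) {
        String canon = CANONICAL.get(raw);
        return canon != null ? canon : raw;
    }
}
```

Unlike `String.intern()`, this keeps the canonical set tiny, bounded, and free of global string-table pressure.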

A Minimal Hot-Path Baseline in Java 23 (Code)

This tiny batcher eliminated a storm of builder/array allocations. It’s boring. That’s the point. Shifting churn to per-thread scratch keeps GC work cheap and young-gen-local.

// Java 23; keep scratch per thread, allocate results short-lived
record Item(int id, String payload) {}

final class HotPath {
  private static final ThreadLocal<StringBuilder> SB =
      ThreadLocal.withInitial(() -> new StringBuilder(512));

  Item decode(byte[] bytes) {
    var sb = SB.get();
    sb.setLength(0);                       // reuse buffer
    for (int i = 0; i < bytes.length; i++) {
      byte b = bytes[i];
      if (b != ',') sb.append((char) b);   // trivial parse for demo
    }
    String s = sb.toString();              // short-lived object
    return new Item(hash32(s), s);         // dies young; GC cheap
  }

  private int hash32(String s) {           // stable, fast hash
    int h = 0;
    for (int i = 0; i < s.length(); i++) h = (h << 5) - h + s.charAt(i);
    return h;
  }
}

How-to.

  • Make scratch per-thread or per-virtual-thread; reset, don’t reallocate.
  • Keep result objects small and short-lived; avoid long-lived caches of parsed data.

Counterpoint. If strings are truly massive, consider off-heap buffers or streaming encoders.

Virtual Threads Help Throughput. They Don’t Change Physics.

We flipped a switch: virtual threads for the wait-heavy fan-out (Java 21+). Throughput climbed 7–12% by using cores better during I/O waits. Allocation profile barely changed.

Virtual threads improve concurrency and I/O efficiency, not allocation economics. They still love nursery-friendly objects.

How-to.

  • Use a virtual-thread-per-request model where you block on I/O.
  • Keep per-task scratch local to the virtual thread (same ThreadLocal trick works).
  • Cap in-flight work; backpressure beats queues growing into the thousands.

Counterpoint. CPU-bound hot paths don’t benefit much from virtual threads; keep those tight.
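A minimal sketch of virtual-thread-per-request with a semaphore for backpressure — the 1,000-permit cap and the 1 ms sleep standing in for blocking I/O are illustrative assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class FanOut {
    private final Semaphore inFlight = new Semaphore(1_000); // cap, not an unbounded queue

    int handleAll(int requests) {
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                inFlight.acquireUninterruptibly(); // backpressure at admission
                vt.submit(() -> {
                    try {
                        Thread.sleep(1);           // stands in for blocking I/O
                        done.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.release();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        return done.get();
    }
}
```

Acquiring the permit before submit means a stalled downstream slows admission instead of growing an in-memory queue into the thousands.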

Off-Heap Where It Matters (and Nowhere Else)

Our worst GC pressure wasn’t the 120-byte DTOs; it was the occasional 1.5 MB payload. Moving those to off-heap (Java’s Foreign Memory API is stable since JDK 22) stopped rare but ugly promotions. Push large, few buffers off-heap; leave the small, many objects to the GC.

How-to.

  • Keep big blobs in MemorySegment or pooled ByteBuffers; copy only at boundaries.
  • Serialize directly from off-heap to socket buffers if your stack allows.
  • Track promotion spikes correlated with rare large allocations, not averages.

Counterpoint. Off-heap adds safety foot-guns. Wrap access in narrowly scoped helpers.
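Since the pooling details are left open above, here is one minimal shape for “pooled ByteBuffers”: a fixed set of direct (off-heap) buffers sized for the rare large payloads, with a non-blocking fallback. Pool size and capacity are assumptions:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

// Tiny pool for the rare large payloads; small DTOs stay on-heap for the GC.
final class BigBufferPool {
    private final ArrayBlockingQueue<ByteBuffer> pool;
    private final int capacity;

    BigBufferPool(int buffers, int capacityBytes) {
        this.capacity = capacityBytes;
        this.pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(ByteBuffer.allocateDirect(capacityBytes)); // off-heap, allocated once
        }
    }

    ByteBuffer acquire() {
        ByteBuffer b = pool.poll();
        // Fall back to a transient buffer rather than block the hot path.
        return b != null ? b.clear() : ByteBuffer.allocateDirect(capacity);
    }

    void release(ByteBuffer b) {
        pool.offer(b); // if the pool is full, the transient fallback is simply dropped
    }
}
```

The same shape works with `MemorySegment` and a confined `Arena` on JDK 22+; direct `ByteBuffer`s are shown because they run on any supported JDK.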

Where Rust Clearly Wins

We tested a fixed-size codec loop: decode–transform–encode arrays of 64-byte structs with no I/O. Rust beat Java by ~22% and held a tighter p99 (sub-200 µs, vs Java’s JIT-warmed ~260 µs). For tight numerical kernels, SIMD-friendly layouts, and zero I/O, Rust stays ahead: no safepoints, no JIT warmup variance, cache-friendly data layouts.

How-to.

  • Use #[repr(C)] or packed structs; iterate in AoS→SoA layouts if needed.
  • Avoid Arc in inner loops; borrow immutably, reuse buffers.
  • Keep hot data contiguous; measure L1/L2 miss rates when chasing wins.

Counterpoint. As soon as you add orchestration (parsing, cross-thread ownership, resilience), Java’s economics become competitive again.

Don’t Let Benchmarks Lie to You

Our first victory lap hid a bug: our closed-loop client waited for each response before sending the next request, so server stalls silently slowed the offered load instead of showing up as queueing delay (coordinated omission). Switching to fixed-rate, open-loop load (wrk2-style) changed the picture; tails mattered. Tail latency tells you where money goes. Test with fixed-rate clients, static response sizes, and controlled skew.

How-to.

  • Fix request rate; report dropped requests as failures.
  • Run long enough for JIT to stabilize (warmup + steady-state).
  • Pin NUMA: isolate IRQ cores, affinitize worker threads.

Counterpoint. Microbenchmarks still have value — if they isolate a single hypothesis and feed production tuning.
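The fixed-rate idea can be sketched in a few lines: each request gets an intended start time on a fixed schedule, and latency is measured from that intended time, so a stalled server shows up as queueing delay rather than being omitted. This is a single-threaded demo; a real harness would spread requests across workers:

```java
// Open-loop load sketch: latency includes any backlog behind a slow server.
final class FixedRateLoad {
    static long[] run(int requests, long intervalNanos, Runnable request) {
        long[] latenciesNanos = new long[requests];
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            long intended = start + i * intervalNanos;          // fixed schedule
            while (System.nanoTime() < intended) {               // wait for the slot
                Thread.onSpinWait();
            }
            request.run();                                       // issue the request
            latenciesNanos[i] = System.nanoTime() - intended;    // from *intended* start
        }
        return latenciesNanos;
    }
}
```

Measuring from the intended start (not the actual send time) is the wrk2-style correction that keeps a slow server from editing your histogram.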

A Simple Rule for Choosing the Right Hammer

One meeting, two asks: “sub-200 µs p99” for a codec, and “3x fan-outs with retries” for an API tier. We used Rust for the codec and Java 23 for the API. Both shipped. No heroics. Pick by shape of work, not language ideology. If most objects die young and you juggle I/O, Java’s GC will likely outrun your refcount gymnastics. If you’re crunching bytes in place, Rust’s determinism pays rent.

How-to.

  • Map your path: I/O wait %, object count/request, largest allocation size.
  • If “young and many,” favor Java; if “few, fixed-size, compute-heavy,” favor Rust.
  • Measure deltas after one cleanliness change (builder reuse, borrow tightening).

Counterpoint. Teams, tools, and deadlines matter. The best language is the one you can tune today.

[Net] -> [Parse] -> [Transform] -> [Serialize] -> [Net]
         ^^^^^^^  Young-gen churn zone (cheap if short-lived)

3 Action Steps for This Week

  • Profile young-gen survival. Under real load, capture promotion %, TLAB refill counts, and p99 GC metrics; aim for <5% promotions at peak.
  • Kill three hidden allocations. Replace per-request builders/arrays with per-thread scratch; flatten one nested DTO chain. Re-measure p99.
  • Rust borrow audit. Find one hot function using Arc<T> and refactor to borrowing + arena scratch; compare CPU and clones/request.

Open Thread

What specific change (a line diff, a flag, a lifetime refactor) moved your p99 by ≥10% in a real service, and what did the flame graph look like before vs after? Bring numbers and context.

Read the full article here: https://medium.com/@toyezyadav/java-23-vs-rust-on-the-hot-path-where-gc-still-wins-ba5cc3335f8a