The Day Our Go Goroutines Blew Up Memory and Rust Did Not


Our production server died in under three minutes.

No graceful degradation. No slow crawl. Just a wall of alerts, a frozen dashboard, and 32GB of RAM gone. The autopsy report was brutal: 47,000 goroutines, all alive, all hungry, all waiting on I/O.

That was the night I learned that concurrency is not about how cheaply you can spawn work. It is about how fast you can slam into the limits of the machine if you do not put guardrails in front of it.


The Crime Scene

The service looked innocent on paper. Take webhook events from partners. Transform the data. Forward it downstream. Classic ETL pipeline. We picked Go because goroutines are supposed to be cheap and simple.

Our first implementation looked like this:

func processWebhook(event Event) {
    go func() {
        data := transform(event)
        sendDownstream(data)
    }()
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    var events []Event
    json.NewDecoder(r.Body).Decode(&events)
    for _, e := range events {
        processWebhook(e)  // spawn and forget
    }
}

Clean. Short. Easy to explain in a code review. In production, it was a memory grenade.


The Memory Explosion

Most days, traffic was normal. A few hundred or a few thousand events per request. The code hummed along and nobody thought about it. Then one partner pushed a batch with 50,000 events in a single shot. Our service obediently created 50,000 goroutines. Rough mental math:

  • 50,000 goroutines × ~2KB minimum stack ≈ 100MB
  • 50,000 events × ~1KB event data ≈ 50MB

So why did memory climb from 8GB to 16GB to 32GB before the kernel killed the process?

Because the goroutines were not just holding a tiny stack and a small struct. Each one was doing heavy I/O and allocations:

func transform(e Event) Data {
    // terrible idea: new client, new TCP connection, new TLS handshake on every call
    resp, _ := http.Get("https://api.service.com/enrich/" + e.ID)
    defer resp.Body.Close() // and a nil dereference waiting here if the request failed

    body, _ := ioutil.ReadAll(resp.Body) // full response buffered in memory
    var enriched EnrichedData
    json.Unmarshal(body, &enriched)
    return process(enriched)
}

Now multiply that by tens of thousands:

  • HTTP client internals and TCP connections
  • TLS handshakes and buffers
  • Full response bodies in memory
  • JSON allocations for every enrichment
  • Goroutine stacks that grow beyond the minimum as the call stack deepens
  • A garbage collector forced to scan everything under pressure

There was nothing “free” about those goroutines anymore. The code did exactly what we asked it to do. The problem was that we had given it no ceiling.
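Worth noting: the per-call waste above is fixable on its own, separate from the concurrency cap. Here is a sketch of a leaner transform, assuming the same placeholder endpoint and types as the snippet above: one shared client, a streaming decode, and errors surfaced instead of swallowed.

import (
    "encoding/json"
    "net/http"
    "time"
)

// One client for the whole process: connection pooling and TLS session
// reuse come for free instead of being paid per event.
var enrichClient = &http.Client{Timeout: 10 * time.Second}

func transform(e Event) (Data, error) {
    resp, err := enrichClient.Get("https://api.service.com/enrich/" + e.ID)
    if err != nil {
        return Data{}, err
    }
    defer resp.Body.Close()

    // Decode straight off the socket instead of buffering the whole body.
    var enriched EnrichedData
    if err := json.NewDecoder(resp.Body).Decode(&enriched); err != nil {
        return Data{}, err
    }
    return process(enriched), nil
}

The error return changes the call sites, but that is the point: a failed enrichment becomes something the caller has to route, not a nil pointer hiding behind a deferred Close.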


How We Should Have Fixed It in Go

This is the part that hurt the most: we could have stayed in Go and avoided the whole disaster.

A simple worker pool plus a concurrency limit would have saved us:

const maxWorkers = 100

func startWorkers(events <-chan Event, wg *sync.WaitGroup) {
    for i := 0; i < maxWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for e := range events {
                data := transform(e)
                sendDownstream(data)
            }
        }()
    }
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    var events []Event
    if err := json.NewDecoder(r.Body).Decode(&events); err != nil {
        http.Error(w, "bad payload", http.StatusBadRequest)
        return
    }

    evCh := make(chan Event, len(events))
    var wg sync.WaitGroup
    startWorkers(evCh, &wg)
    for _, e := range events {
        evCh <- e
    }
    close(evCh)
    wg.Wait()
}

Same language. Same runtime. The difference is one idea: never let concurrency grow without a cap. We did not do that. We just kept spawning.
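The cap does not require a hand-rolled pool, either. Here is a sketch of the same idea using the weighted semaphore from golang.org/x/sync/semaphore (not what we ran, just the same rule in different clothes):

import (
    "context"
    "sync"

    "golang.org/x/sync/semaphore"
)

const maxConcurrent = 100

var sem = semaphore.NewWeighted(maxConcurrent)

func processAll(ctx context.Context, events []Event) error {
    var wg sync.WaitGroup
    for _, e := range events {
        // Acquire blocks once all permits are out, so the spawn loop self-throttles.
        if err := sem.Acquire(ctx, 1); err != nil {
            return err // context cancelled while waiting for a permit
        }
        wg.Add(1)
        go func(e Event) {
            defer wg.Done()
            defer sem.Release(1)
            sendDownstream(transform(e))
        }(e)
    }
    wg.Wait()
    return nil
}

Same guarantee as the pool: at most 100 goroutines doing real work, no matter how large the batch.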


Enter Rust

We ended up rewriting the service in Rust, partly to learn and partly because we wanted stricter control over concurrency and memory. With Tokio and a semaphore, the design looked like this:

use tokio::sync::Semaphore;
use std::sync::Arc;

const MAX_CONCURRENT: usize = 100;

async fn process_events(events: Vec<Event>) {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT));
    let mut tasks = Vec::new();
    for event in events {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            let data = transform(event).await;
            send_downstream(data).await;
            drop(permit); // explicit release
        }));
    }
    futures::future::join_all(tasks).await;
}

The rule is visible in the code: nothing runs without grabbing a permit. We could accept 100,000 events, but at most 100 were processed at any moment. The rest simply waited their turn. Rust did not magically make concurrency safe. It just forced us to be explicit about the limits.


The Benchmark

We hit both implementations with 100,000 events in a test environment.

Go (unlimited goroutines):
- Time: 45 seconds
- Peak Memory: 28.7GB
- Success Rate: 72% (OOM killer at ~28.7GB)

Rust (semaphore, max 100 concurrent):
- Time: 52 seconds
- Peak Memory: 487MB
- Success Rate: 100%

Seven seconds slower. Roughly fifty-nine times less memory. And, most importantly, no 3 AM incident.

Those numbers changed how we think about “fast” and “efficient.” Speed that ends in an OOM is not performance. It is a time bomb.


What This Actually Taught Us

Go did not fail us. We failed Go. Here is what we changed in how we design concurrent systems now:

  • We never spawn goroutines directly from untrusted input without a hard cap.
  • We treat concurrency limits as core architecture, not a last-minute optimization.
  • We reuse HTTP clients and avoid reading entire bodies into memory unless there is no other option.
  • We test with ugly inputs: 100k events, not 100 (a sketch of that kind of test follows this list).
  • We monitor memory, open connections, and queue depth, not just CPU and latency.
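That “ugly inputs” habit is cheap to automate. A minimal sketch of the kind of test we mean, assuming the handler and Event shape from earlier (the route and field names are placeholders):

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "net/http/httptest"
    "runtime"
    "testing"
)

func TestHandleRequest_UglyBatch(t *testing.T) {
    // 100k events, not 100: the size that actually hurt us in production.
    events := make([]Event, 100_000)
    for i := range events {
        events[i] = Event{ID: fmt.Sprintf("evt-%d", i)}
    }
    body, err := json.Marshal(events)
    if err != nil {
        t.Fatal(err)
    }

    req := httptest.NewRequest(http.MethodPost, "/webhook", bytes.NewReader(body))
    rec := httptest.NewRecorder()
    handleRequest(rec, req)

    if rec.Code != http.StatusOK {
        t.Fatalf("unexpected status: %d", rec.Code)
    }
    // A capped design keeps this number boring; an uncapped one does not.
    t.Logf("goroutines after batch: %d", runtime.NumGoroutine())
}

Watch RSS and goroutine counts while it runs; the uncapped version announces itself long before the assertions do.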

Both Go and Rust are excellent for high-concurrency backends. Rust forced us to confront ownership and limits up front. Go let us ignore those limits until production taught us the hard way.

In the end, the language matters less than whether you respect the machine underneath your abstractions. Concurrency is not about how many tasks you can start. It is about how many your system can survive.

Read the full article here: https://medium.com/@kp9810113/the-day-our-go-goroutines-blew-up-memory-and-rust-did-not-271d99ff3e67