Why Discord Migrated Read States from Go to Rust
The pattern was unmistakable: every two minutes, like clockwork, Discord’s Read States service would spike to 10–40 milliseconds of latency. Users would experience tiny but noticeable delays when loading channels or seeing new messages. For a platform built on feeling “super snappy,” this was unacceptable.
The Read States service handles one of Discord’s most critical functions: tracking which channels and messages users have read across billions of read states. Read States is accessed every time you connect to Discord, every time a message is sent, and every time a message is read. In short, Read States is in the hot path.
After months of optimization attempts and garbage collector tuning, Discord’s engineering team made a pivotal decision: rewrite the entire service in Rust. The results were transformational — not just eliminating the latency spikes, but achieving performance metrics that seemed almost too good to be true. This is the detailed story of why Go failed at scale, how Rust succeeded beyond expectations, and the hard-won lessons about language choice for performance-critical services.
The Scale Problem: Billions of Read States
To understand why the migration was necessary, we need to understand the massive scale Discord operates at. Data Structure Scale (a rough sketch of the implied data model follows this list):
- Discord has billions of Read States. There is one Read State per User per Channel
- There are millions of Users in each cache. There are tens of millions of Read States in each cache
- There are hundreds of thousands of cache updates per second
- There are tens of thousands of database writes per second
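The article never shows Discord’s actual types, but the numbers above imply a simple data model: a small, fixed-size record keyed by user and channel, held by the tens of millions in each in-memory cache. The sketch below is a hypothetical Rust illustration of that shape; all names and field choices are assumptions, not Discord’s code.
```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;

// Hypothetical key: one Read State per user per channel.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ReadStateKey {
    user_id: u64,
    channel_id: u64,
}

// Hypothetical per-channel record: the last message a user acknowledged,
// plus a counter (such as pending @mentions) that is updated frequently.
struct ReadState {
    last_read_message_id: u64,
    mention_count: AtomicU64,
}

// Each cache node would hold tens of millions of such entries in memory,
// with a persistent database behind it (not shown here).
type ReadStateCache = HashMap<ReadStateKey, ReadState>;

fn main() {
    let mut cache: ReadStateCache = HashMap::new();
    cache.insert(
        ReadStateKey { user_id: 1, channel_id: 42 },
        ReadState { last_read_message_id: 0, mention_count: AtomicU64::new(0) },
    );
    println!("cached read states: {}", cache.len());
}
```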
The Architecture: Each Read States server maintained a Least Recently Used (LRU) cache containing millions of user read states. Each read state tracked multiple atomic counters (like @mention counts) that needed frequent updates and resets. The cache was backed by a Cassandra database cluster for persistence. The performance requirements were stringent: every Discord user interaction depended on this service responding quickly and consistently. Any latency spike would be immediately felt by millions of users.
The Go Problem: Unavoidable 2-Minute Cycles
With the Go implementation, the Read States service was not supporting its product requirements. It was fast most of the time, but every few minutes we saw large latency spikes that were bad for user experience. The engineering team initially suspected typical garbage collection issues and spent significant effort optimizing their Go code. They had written extremely efficient code with minimal allocations, yet the spikes persisted.
The Root Cause Discovery
After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, Go will still force a garbage collection. This was the smoking gun. Even though Discord’s Read States service was generating minimal garbage, Go’s runtime forced garbage collection every two minutes as a safety mechanism. Why Standard GC Tuning Failed:
- GC Percent Tuning Had No Effect: We figured we could tune the garbage collector to happen more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector GC Percent on the fly. Unfortunately, no matter how we configured the GC percent nothing changed.
- The Real Problem Was Cache Scanning: The spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire LRU cache in order to determine if the memory was truly free from references.
- The Cache Size Dilemma: Making the LRU cache smaller reduced GC scan time but increased database load due to lower cache hit rates, creating a trade-off between GC performance and overall latency.
The Performance Profile
The Go service exhibited a characteristic sawtooth pattern:
- P50 Latency: Sub-millisecond most of the time
- P99 Latency: 10–40ms spikes every 2 minutes
- CPU Spikes: Corresponding CPU usage spikes during GC cycles
- User Impact: Noticeable delays in message loading and channel switching
The Rust Solution: Immediate Memory Management
Rust is blazingly fast and memory-efficient: with no runtime or garbage collector, it can power performance-critical services, run on embedded devices, and easily integrate with other languages. The key insight was that Rust’s ownership model could eliminate the fundamental problem: in the Rust version of the Read States service, when a user’s Read State is evicted from the LRU cache it is immediately freed from memory. The read state memory does not sit around waiting for the garbage collector to collect it.
Rust’s Memory Management Advantage
Rust uses a relatively unique memory management approach that incorporates the idea of memory “ownership”. Basically, Rust keeps track of who can read and write to memory. It knows when the program is using memory and immediately frees the memory once it is no longer needed. This eliminated the entire class of problems that plagued the Go version (a minimal eviction sketch follows this list):
- No garbage collection pauses
- No memory scanning overhead
- Immediate deallocation on cache eviction
- Predictable, consistent performance
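To make the eviction point concrete, the following is a deliberately tiny, self-contained sketch rather than Discord’s implementation (it is not even a full LRU, since reads do not refresh recency). It shows that the evicted value is owned, so it is freed the moment it is removed; a `Drop` impl makes the timing visible.
```rust
use std::collections::{HashMap, VecDeque};

struct ReadState {
    last_read_message_id: u64,
    mention_count: u64,
}

// Printing on drop makes the deallocation timing observable.
impl Drop for ReadState {
    fn drop(&mut self) {
        println!("read state freed immediately on eviction");
    }
}

// A minimal cache: when full, the oldest entry is evicted. Because the
// evicted ReadState is owned, its memory is released right away; there is
// no garbage collector that later has to scan the cache to find it.
struct Cache {
    capacity: usize,
    order: VecDeque<u64>,              // insertion order, oldest at the front
    entries: HashMap<u64, ReadState>,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        Cache { capacity, order: VecDeque::new(), entries: HashMap::new() }
    }

    fn insert(&mut self, key: u64, state: ReadState) {
        if self.entries.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                // The removed ReadState is dropped (freed) right here,
                // at the end of this statement, not at a future GC cycle.
                self.entries.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.entries.insert(key, state);
    }
}

fn main() {
    let mut cache = Cache::new(2);
    for key in 0..4 {
        cache.insert(key, ReadState { last_read_message_id: 0, mention_count: 0 });
    }
    // The two oldest entries print the message during the loop; the
    // remaining two print when `cache` is dropped at the end of main.
}
```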
The Async Rust Challenge
The migration faced one significant hurdle: at the time this service was reimplemented, Rust stable did not have a very good story for asynchronous Rust. For a networked service, asynchronous programming is a requirement. Discord made a bold decision to use unstable nightly Rust to access early async features. As an engineering team, we decided it was worth using nightly Rust and we committed to running on nightly until async was fully supported on stable. This bet paid off as stable async Rust became available shortly after.
The Implementation Journey
Phase 1: Direct Translation
The actual rewrite was fairly straightforward. It started as a rough translation, then we slimmed it down where it made sense. The initial Rust version was a relatively straightforward port of the Go logic, but Rust’s superior type system allowed for immediate improvements (a brief sketch of the translated handler shape follows this list):
- Eliminated Go code that existed due to lack of generics
- Removed manual cross-goroutine memory protection (Rust’s memory model handled this automatically)
- Streamlined error handling with Rust’s Result types
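As an illustration of what that slimming down can look like, here is a hypothetical handler in the shape a Result-based async service might take. The types, names, and error variants are invented for this sketch; the article only tells us the service is asynchronous (it assumes the tokio runtime mentioned later) and leans on Rust’s Result type for error handling.
```rust
use std::collections::HashMap;

#[derive(Debug)]
enum ReadStateError {
    NotFound,
}

struct ReadState {
    last_read_message_id: u64,
    mention_count: u64,
}

struct Store {
    cache: HashMap<(u64, u64), ReadState>, // keyed by (user_id, channel_id)
}

impl Store {
    // An async accessor that propagates failures through Result and `?`,
    // rather than Go-style `if err != nil` branches.
    async fn ack_message(
        &mut self,
        user_id: u64,
        channel_id: u64,
        message_id: u64,
    ) -> Result<(), ReadStateError> {
        let state = self
            .cache
            .get_mut(&(user_id, channel_id))
            .ok_or(ReadStateError::NotFound)?;
        state.last_read_message_id = message_id;
        state.mention_count = 0; // reading the channel clears pending @mentions
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<(), ReadStateError> {
    let mut store = Store { cache: HashMap::new() };
    store.cache.insert(
        (1, 42),
        ReadState { last_read_message_id: 0, mention_count: 3 },
    );
    store.ack_message(1, 42, 1_000).await?;
    println!("acknowledged");
    Ok(())
}
```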
Phase 2: Initial Results
When we started load testing, we were instantly pleased with the results. The latency of the Rust version was just as good as Go’s and had no latency spikes! Even more remarkably: even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version. This is a huge testament to how easy it is to write efficient programs with Rust compared to the deep dive we had to do with Go.
Phase 3: Optimization
After a bit of profiling and performance optimizations, we were able to beat Go on every single performance metric. Latency, CPU, and memory were all better in the Rust version. Key Optimizations (the cache data-structure change is sketched after this list):
- Data Structure Choice: Changed from HashMap to BTreeMap in the LRU cache for better memory usage
- Concurrency Libraries: Swapped metrics library for one using modern Rust concurrency
- Memory Copies: Reduced unnecessary memory copying operations
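The first optimization in the list is easy to picture because both containers expose essentially the same map API, so swapping one for the other is mostly a type change. The snippet below is an illustrative sketch, not Discord’s code; the keys and values are hypothetical.
```rust
use std::collections::BTreeMap;

struct ReadState {
    last_read_message_id: u64,
    mention_count: u64,
}

fn main() {
    // BTreeMap stores entries in sorted order in node-sized chunks, which can
    // use memory more predictably than a HashMap that over-allocates bucket
    // space as it grows; the article credits switching the LRU cache from
    // HashMap to BTreeMap with better memory usage.
    let mut cache: BTreeMap<(u64, u64), ReadState> = BTreeMap::new();

    cache.insert((1, 42), ReadState { last_read_message_id: 100, mention_count: 0 });
    cache.insert((1, 7), ReadState { last_read_message_id: 55, mention_count: 2 });

    // Lookups keep the same map-like API, so surrounding cache code barely
    // changes when the container is swapped.
    if let Some(state) = cache.get(&(1, 42)) {
        println!("last read: {}", state.last_read_message_id);
    }
}
```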
The Results: Beyond Expectations
Immediate Performance Gains
The initial Rust deployment eliminated the latency spikes entirely while matching Go’s baseline performance. But the real benefits emerged when Discord could finally increase cache sizes without GC penalties.
Cache Capacity Breakthrough
After the service ran successfully for a few days, we decided it was time to re-raise the LRU cache capacity. In the Go version, as mentioned above, raising the cap of the LRU cache resulted in longer garbage collections. We no longer had to deal with garbage collection, so we figured we could raise the cap of the cache and get even better performance. The Results:
- Cache Size: Increased to 8 million Read States (previously limited by GC overhead)
- Latency: Average response time now measured in microseconds, with worst-case @mention handling measured in milliseconds
- Memory Efficiency: Lower memory usage despite larger cache sizes
- CPU Usage: Consistently lower CPU utilization
Ecosystem Benefits
Recently, tokio (the async runtime we use) released version 0.2. We upgraded and it gave us CPU benefits for free. Rust’s rapidly evolving ecosystem provided additional performance improvements without code changes — a stark contrast to Go where runtime improvements were often offset by GC overhead.
Quantified Business Impact
Performance Metrics:
- Latency Spikes: Eliminated (from 10–40ms spikes every 2 minutes to consistent microsecond response)
- Cache Hit Ratio: Dramatically improved due to 8x larger cache capacity
- Database Load: Reduced due to better cache performance
- CPU Utilization: Consistently lower despite handling more traffic
Operational Benefits:
- Deployment Confidence: No more worrying about GC tuning parameters
- Monitoring Simplicity: Eliminated complex GC monitoring and alerting
- Performance Predictability: Consistent performance regardless of traffic patterns
- Capacity Planning: Linear scaling without GC overhead considerations
User Experience:
- Perceived Performance: Eliminated noticeable delays in message loading
- Reliability: No more intermittent “sluggish” periods during GC spikes
- Scalability: Better performance as user base continued growing
Decision Framework: When to Consider Rust Migration
Migrate from Go to Rust When:
- GC Latency is Unacceptable: P99 latency spikes impact user experience
- Large In-Memory Data Structures: GC scan time scales with heap size
- Consistent Performance Required: Cannot tolerate periodic performance degradation
- Memory Usage is Critical: Need precise control over memory allocation patterns
- Long-Running Services: Services that benefit from eliminating GC overhead over time
Stay with Go When:
- GC Latency is Acceptable: Millisecond-level spikes don’t impact user experience
- Development Velocity is Priority: Team expertise and ecosystem maturity matter more
- Simple Request/Response Patterns: Short-lived requests where GC overhead is minimal
- Rapid Prototyping: Need to iterate quickly on business logic
- Team Size Constraints: Limited bandwidth to learn new language and tooling
Consider Hybrid Approaches When:
- Mixed Performance Requirements: Some services are latency-sensitive, others aren’t
- Gradual Migration Strategy: Want to prove Rust benefits before full commitment
- Team Learning: Building Rust expertise while maintaining Go services
Implementation Lessons Learned
Technical Insights:
- GC Overhead Scales Non-Linearly: Large in-memory caches create disproportionate GC overhead in Go
- Rust Learning Curve: Initial productivity impact was smaller than expected for experienced developers
- Async Ecosystem: Early adoption of emerging Rust features can pay significant dividends
- Performance Tuning: Rust made optimization easier compared to complex GC tuning in Go
Process Insights:
- Load Testing is Critical: Comprehensive testing revealed performance characteristics before production
- Canary Deployments: Gradual rollout allowed discovery and fixing of edge cases
- Monitoring Strategy: Different metrics needed for Rust vs Go services
- Team Buy-in: Success with one service built confidence for broader Rust adoption
The Broader Impact at Discord
At this point, Discord is using Rust in many places across its software stack. We use it for the game SDK, video capturing and encoding for Go Live, Elixir NIFs, several backend services, and more. The Read States migration became a proof point that influenced Discord’s broader technology strategy. The success demonstrated that Rust could deliver both performance and reliability benefits for critical services.
Strategic Considerations:
Team Development: Along with performance, Rust has many advantages for an engineering team. For example, its type safety and borrow checker make it very easy to refactor code as product requirements change or new learnings about the language are discovered.
Ecosystem Momentum: The rapidly improving Rust ecosystem provided ongoing benefits without code changes, contrasting with Go where runtime improvements often came with trade-offs.
The Real Lesson: Performance is a Feature
Discord’s migration from Go to Rust for Read States wasn’t just about eliminating latency spikes — it was about treating performance as a core product feature. In a competitive messaging platform market, the difference between “fast enough” and “consistently excellent” creates measurable user experience advantages. The 2-minute GC problem in Go represented a fundamental constraint that couldn’t be solved through optimization or tuning. It required rethinking the entire approach to memory management. Rust’s ownership model didn’t just solve the problem — it eliminated the entire class of problems.
For services operating at Discord’s scale, where billions of operations happen daily and millions of users expect instant responsiveness, the difference between garbage-collected and ownership-based memory management isn’t academic — it’s the difference between fighting your runtime and having it work with you. The question for other engineering teams isn’t whether Rust is “better” than Go in abstract terms. It’s whether your specific performance requirements, scale constraints, and user experience goals align with the trade-offs each language makes. Discord’s experience provides a concrete example of when those trade-offs favor Rust — and the transformational results that can follow.
Read the full article here: https://medium.com/@chopra.kanta.73/why-discord-migrated-read-states-from-go-to-rust-bdff7fb7c487