io_uring Adventures: Rust Servers That Love Syscalls

We thought our Rust file server was fast. Written with Tokio, leveraging async/await, serving static assets at 45,000 requests per second on modest hardware. The code was clean, the architecture was sound, and the CPU usage sat at a reasonable 60%. We’d reached what felt like the natural limit of network I/O performance.

Then we profiled with perf and discovered something startling: 42% of our CPU time was spent in the kernel, not in our application. System calls for reading files, accepting connections, and sending responses dominated the flame graph. We were context switching between user space and kernel space 180,000 times per second. The revelation: we weren’t CPU-bound or I/O-bound — we were syscall-bound.

Enter io_uring, Linux’s newest I/O interface. The promise was audacious: submit batches of I/O operations without syscalls, get completions without interrupts, and let the kernel process everything asynchronously. It sounded like magic. Three weeks of rewriting later, our throughput hit 152,000 requests per second on the same hardware, and our kernel time dropped to 14% of total CPU usage. But the real story isn’t the performance win — it’s learning why traditional async I/O fails at scale, and how io_uring fundamentally changes the conversation between application and kernel.

The Syscall Tax Nobody Talks About

System calls look free in casual code. Call read(), get your data, move on. The cost seems negligible for individual operations. But each syscall carries hidden overhead that compounds under load. Here’s what happens during a traditional file read:

 // Traditional async file read in Tokio
 use tokio::fs::File;
 use tokio::io::AsyncReadExt;

 let mut file = File::open("data.txt").await?;
 let mut buffer = vec![0; 4096];
 file.read(&mut buffer).await?;

Under the hood, this triggers:

  • User → Kernel transition: Save registers, switch stacks, change privilege level (~150 CPU cycles)
  • Kernel work: Page table lookup, file system logic, security checks
  • Kernel → User transition: Restore registers, switch back (~150 cycles)

That’s 300+ cycles of pure overhead before any actual I/O happens. At 45,000 requests/second with an average of 4 syscalls per request (accept, read, write, close), we were burning 54 million CPU cycles per second just on context switching.
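As a sanity check, that arithmetic fits in a few lines of Rust; the constants below are the figures quoted above, not new measurements:

 // Back-of-the-envelope syscall tax, using the figures quoted above.
 const REQUESTS_PER_SEC: u64 = 45_000;
 const SYSCALLS_PER_REQUEST: u64 = 4;            // accept, read, write, close
 const TRANSITION_CYCLES_PER_SYSCALL: u64 = 300; // ~150 cycles in + ~150 cycles out

 fn main() {
     let wasted = REQUESTS_PER_SEC * SYSCALLS_PER_REQUEST * TRANSITION_CYCLES_PER_SYSCALL;
     println!("{wasted} cycles/second of pure transition overhead"); // 54,000,000
 }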

The problem intensifies with concurrent operations. If you’re serving 1,000 concurrent connections, and each one needs to read a file, that’s 1,000 separate syscall sequences. The kernel spends more time managing transitions than doing actual work. Our profiling revealed the breakdown:

  • 28% of CPU time in syscall entry/exit paths
  • 14% in context switch overhead
  • 18% in actual kernel I/O logic
  • 40% in our application code

We were spending more time entering and exiting the kernel than actually performing I/O operations.

The io_uring Mental Model Shift

Traditional async I/O (epoll, select, kqueue) treats the kernel as a service you call for each operation. io_uring inverts this: the kernel and your application share two ring buffers and work collaboratively.

The Submission Queue (SQ): Your application prepares I/O operations as entries in this ring buffer. Each entry describes what you want: read this file, write that socket, accept new connections. You queue multiple operations, then notify the kernel once.

The Completion Queue (CQ): The kernel writes results here. When operations complete, entries appear in this ring. Your application polls for completions in batches.

The magic: zero-copy, lockless communication between user space and kernel space. No system call per submitted operation, no interrupt for each completed one. Just shared memory and memory barriers.

Here’s how it looks in Rust using the tokio-uring crate:

 use tokio_uring::fs::File;

 // io_uring-based file read
 let file = File::open("data.txt").await?;
 let buf = vec![0u8; 4096];
 let (res, buf) = file.read_at(buf, 0).await;
 let bytes_read = res?;

On the surface, it looks similar. The difference is invisible but profound. That read_at operation queues an entry to the submission queue. The kernel picks it up, performs the read, and places the result in the completion queue. Your application continues working until it explicitly checks for completions.

The Rewrite: From Tokio to tokio-uring

Our existing server was built on Tokio’s standard runtime. It used async/await syntax but relied on epoll underneath — meaning every I/O operation hit the kernel individually. Converting to io_uring required rethinking our architecture. The old request handler:

 // `request` and `response_headers` come from request parsing, omitted here
 async fn handle_request(mut stream: TcpStream) -> Result<()> {

   let mut file = File::open(&request.path).await?;
   let mut buffer = Vec::with_capacity(8192);
   file.read_to_end(&mut buffer).await?;
   
   stream.write_all(&response_headers).await?;
   stream.write_all(&buffer).await?;
   Ok(())

 }

This generates five distinct syscalls: open, read, write (headers), write (body), close. Each one crosses the user-kernel boundary. The io_uring version:

 // as before, `request` and `response_headers` come from request parsing (omitted)
 async fn handle_request_uring(stream: TcpStream) -> Result<()> {

   let file = File::open(&request.path).await?;
   let buf = vec![0u8; 8192];
   
   // Queue the read operation
   let (res, buf) = file.read_at(buf, 0).await;
   let bytes_read = res?;
   
   // Queue the write operations
   let (res1, _) = stream.write(response_headers).await;
   let (res2, _) = stream.write(buf.slice(..bytes_read)).await; // owned slice: io_uring needs to own the buffer
   
   res1?;
   res2?;
   Ok(())

 }

The code looks nearly identical, but io_uring batches operations internally. When we await, tokio-uring checks if multiple operations can be submitted together. In practice, we were submitting 8–12 operations per actual syscall. The conversion took three weeks because tokio-uring has different semantics:

  • Ownership: io_uring operations take ownership of buffers and return them on completion (a sketch of the resulting pattern follows this list)
  • Fallback: Not all operations support io_uring yet, requiring hybrid approaches
  • Tuning: Ring buffer sizes and polling strategies needed optimization
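The ownership change was the one that drove most of the refactoring. Here is a minimal sketch of the resulting pattern, assuming the tokio-uring runtime and an illustrative file name and buffer size: the buffer is moved into each operation and handed back with the result, so a single allocation can be reused across reads.

 use tokio_uring::fs::File;

 fn main() -> std::io::Result<()> {
     tokio_uring::start(async {
         let file = File::open("data.txt").await?;
         let mut buf = vec![0u8; 4096];
         let mut offset = 0u64;
         loop {
             // read_at takes ownership of the buffer and hands it back with the result
             let (res, b) = file.read_at(buf, offset).await;
             buf = b;
             let n = res?;
             if n == 0 {
                 break; // EOF
             }
             // ... process &buf[..n] here ...
             offset += n as u64;
         }
         Ok(())
     })
 }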

The Numbers That Justified Everything

We ran comprehensive benchmarks comparing three implementations: standard Tokio, a custom epoll implementation, and our new io_uring server. All tests used the same hardware (32-core AMD EPYC, 128GB RAM, NVMe storage) and workload (mixed file sizes from 4KB to 1MB).

Baseline (Standard Tokio):

  • Throughput: 45,200 requests/second
  • Latency P50: 1.8ms
  • Latency P99: 12.4ms
  • CPU usage: 61% (42% kernel, 19% user)
  • Context switches: 181,000/sec

Custom epoll (Our optimization attempt):

  • Throughput: 52,100 requests/second
  • Latency P50: 1.6ms
  • Latency P99: 11.1ms
  • CPU usage: 58% (39% kernel, 19% user)
  • Context switches: 162,000/sec

io_uring (Final implementation):

  • Throughput: 152,800 requests/second
  • Latency P50: 0.58ms (68% reduction)
  • Latency P99: 3.2ms (74% reduction)
  • CPU usage: 66% (14% kernel, 52% user)
  • Context switches: 31,000/sec (83% reduction)

The CPU usage paradox confused us initially. We were handling 3.4x more traffic but using only slightly more total CPU. The answer: we shifted CPU usage from the kernel to our application. With fewer context switches, the CPU spent more time executing our code and less time managing transitions.

The latency improvements were equally dramatic. Our P99 latency dropped from 12.4ms to 3.2ms. Tail latency matters because it represents your worst user experiences. io_uring’s batching smoothed out the latency distribution by eliminating periodic syscall storms.

The Gotchas That Bit Us Hard

io_uring’s performance comes with sharp edges. Here are the painful lessons we learned:

Gotcha 1: Ring buffer sizing is critical

We started with the default 128-entry rings. Under load, the submission queue would fill, forcing synchronous syscalls. We monitored queue depths and discovered we needed 2048-entry rings to handle burst traffic without falling back to syscalls.

 // Sizing the rings with the low-level io-uring crate's builder
 let ring = IoUring::builder()
     .setup_cqsize(4096)  // completion queue entries
     .build(2048)?;       // submission queue entries
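For reference, this is roughly what driving the two rings directly looks like with the low-level io-uring crate (the file name, buffer size, and user_data value are illustrative): an operation is described as a submission queue entry, pushed into the ring, flushed with a single syscall, and its result reaped from the completion queue. The push is the step that starts failing when the submission queue is sized too small.

 use io_uring::{opcode, types, IoUring};
 use std::os::unix::io::AsRawFd;
 use std::{fs, io};

 fn main() -> io::Result<()> {
     let mut ring = IoUring::new(2048)?; // submission/completion rings
     let file = fs::File::open("data.txt")?;
     let mut buf = vec![0u8; 4096];

     // Describe the read as a submission queue entry; nothing hits the kernel yet.
     let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as u32)
         .build()
         .user_data(7);

     // push() fails when the submission queue is full, the situation the sizing
     // advice below is meant to avoid; real code would submit and retry.
     unsafe { ring.submission().push(&read_e).expect("submission queue full") };

     // One syscall flushes the batch and waits for at least one completion.
     ring.submit_and_wait(1)?;
     let cqe = ring.completion().next().expect("completion queue empty");
     println!("read returned {} (user_data {})", cqe.result(), cqe.user_data());
     Ok(())
 }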

Too small: You fall back to syscalls under load, losing performance.
Too large: You waste memory and cache space.

Our rule: Size rings to handle 2x your peak concurrent operations.

Gotcha 2: Buffer ownership is non-negotiable

Traditional async Rust lets you reference borrowed data. io_uring requires owned buffers because operations complete asynchronously:

 // This won't compile with io_uring
 let buf = [0u8; 4096];
 file.read(&buf).await?; // ERROR: the operation could outlive the borrow of `buf`

 // io_uring requires ownership
 let buf = vec![0u8; 4096];
 let (result, buf) = file.read(buf).await; // buf moved and returned

This forced us to rethink our buffer management. We ended up implementing buffer pools (shoutout to our previous sync.Pool work) to avoid allocating on every operation.

Gotcha 3: Kernel version matters — a lot

io_uring landed in Linux 5.1 but matured significantly through 5.10+. We discovered hard-to-debug issues running on 5.4 kernels. Features like buffer registration and advanced operation chaining didn’t work reliably until 5.10.

Production lesson: Require Linux 5.10+ for io_uring deployments. The performance difference between kernel versions can be 40% or more.

Gotcha 4: Error handling becomes distributed

With traditional I/O, errors happen at the call site. With io_uring, errors appear in the completion queue:

 // Error might not surface until completion
 let (result, buf) = file.read_at(buf, offset).await;
 match result {

   Ok(n) => { /* success */ },
   Err(e) if e.kind() == ErrorKind::NotFound => { /* handle */ },
   Err(e) => { /* other error */ }

 }

This temporal disconnect between submission and error made debugging more complex. We added detailed tracing to correlate submissions with completions.

The Hybrid Strategy That Actually Works

Pure io_uring isn’t always practical. Some operations aren’t supported, some libraries don’t integrate, and some platforms don’t have modern kernels. We developed a hybrid approach.

Use io_uring for:

  • File I/O operations (read, write, stat)
  • Network socket operations (accept, send, receive)
  • High-throughput, latency-sensitive paths

Fall back to standard async for:

  • DNS resolution (not io_uring-friendly)
  • Cryptographic operations (CPU-bound anyway)
  • Third-party library integration
  • Development/testing environments

Our production architecture uses feature detection:

 fn create_runtime() -> Runtime {

    // io_uring_supported() and kernel_version() are our own helpers; the two runtime
    // types differ, so the real code wraps them behind a common interface.
    if io_uring_supported() && kernel_version() >= (5, 10) {
       tokio_uring::Runtime::new()
   } else {
       tokio::runtime::Runtime::new()
   }

 }

This gives us bleeding-edge performance on modern infrastructure while maintaining compatibility with older deployments. In production, 93% of our servers run io_uring, with the remainder on standard async.

Advanced Patterns: Buffer Registration and Linked Operations

After mastering the basics, we explored advanced io_uring features that multiplied our gains.

Buffer Registration: Pre-register buffers with the kernel to eliminate validation overhead:

 let buffers: Vec<Vec<u8>> = (0..1024)

   .map(|_| vec![0u8; 4096])
   .collect();

ring.register_buffers(&buffers)?;

 // Now use the registered buffers for I/O; the kernel maps and pins them once
 // at registration instead of re-validating them on every operation.

This shaved another 8% off our latency by eliminating per-operation buffer validation. The kernel knows these buffers are safe because we registered them upfront.

Linked Operations: Chain operations so they execute as one ordered group:

 // Open the file, read its contents, close it: submitted as a single chain.
 // IO_LINK on an entry ties it to the *next* entry, so it goes on the open
 // and the read, not on the close.
 let open_op = opcode::OpenAt::new(/* ... */).build().flags(squeue::Flags::IO_LINK);
 let read_op = opcode::Read::new(/* ... */).build().flags(squeue::Flags::IO_LINK);
 let close_op = opcode::Close::new(/* ... */).build();

 // If any operation fails, subsequent ones in the chain don't execute.

This prevented resource leaks in error paths. If opening a file fails, the read and close operations are cancelled automatically (their completions come back with ECANCELED).
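To make the chaining concrete: the open/read/close chain above needs io_uring's fixed file table so the later operations can see the descriptor produced by the open, so here is a simpler self-contained sketch with the low-level io-uring crate that instead links a write to an fsync on an already-open file (path and payload are illustrative). If the write fails, the linked fsync completes with ECANCELED instead of running.

 use io_uring::{opcode, squeue, types, IoUring};
 use std::os::unix::io::AsRawFd;
 use std::{fs, io};

 fn main() -> io::Result<()> {
     let mut ring = IoUring::new(8)?;
     let file = fs::OpenOptions::new().create(true).write(true).open("out.log")?;
     let data = b"hello io_uring\n";

     // IO_LINK ties this write to the next entry: the fsync waits for the write,
     // and is cancelled if the write fails.
     let write_e = opcode::Write::new(types::Fd(file.as_raw_fd()), data.as_ptr(), data.len() as u32)
         .build()
         .flags(squeue::Flags::IO_LINK)
         .user_data(1);
     let fsync_e = opcode::Fsync::new(types::Fd(file.as_raw_fd()))
         .build()
         .user_data(2);

     unsafe {
         let mut sq = ring.submission();
         sq.push(&write_e).expect("queue full");
         sq.push(&fsync_e).expect("queue full");
     } // submission borrow ends before we submit

     ring.submit_and_wait(2)?;
     for cqe in ring.completion() {
         println!("op {} finished with result {}", cqe.user_data(), cqe.result());
     }
     Ok(())
 }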

The Decision Framework: When io_uring Makes Sense

io_uring isn’t a universal solution. Here’s when it pays off:

Use io_uring when:

  • You’re doing high-volume I/O operations (>10K ops/second)
  • Latency matters (P99 under 5ms goals)
  • You control the deployment environment (Linux 5.10+)
  • Your workload is I/O-bound, not CPU-bound
  • You need predictable tail latency under load

Skip io_uring when:

  • Your I/O volume is modest (<1K ops/second)
  • You need wide platform compatibility (macOS, Windows, old Linux)
  • Your application is CPU-bound (crypto, compression, encoding)
  • Development velocity matters more than raw performance
  • You can’t guarantee modern kernel versions

Our rule: Profile first. If syscall overhead shows up in your top 5 bottlenecks, io_uring is worth exploring. If not, you’re optimizing the wrong thing.

The Rust Ecosystem: Maturity and Gaps

The Rust io_uring ecosystem is maturing rapidly but has gaps:

tokio-uring: The most mature option, but diverges from standard Tokio APIs. Migration requires careful refactoring. Great for green-field projects, painful for existing codebases.

io-uring crate: Lower-level bindings, maximum flexibility. We used this for our custom file server but found it too low-level for typical application development.

glommio: A complete async runtime built on io_uring from the ground up. Beautiful design but incompatible with the Tokio ecosystem, forcing an all-or-nothing migration.

We chose tokio-uring for its balance of performance and compatibility. The API differences from Tokio were manageable, and we could migrate incrementally.

Performance Insights: Where the Gains Actually Come From

Breaking down our 3.4x throughput improvement:

  • 40% from reduced context switches: Fewer kernel transitions freed CPU
  • 25% from batched operations: Multiple I/O ops per syscall
  • 20% from improved cache behavior: Sequential operations in rings
  • 15% from eliminated buffer copying: Shared memory removes copies

The context switch reduction was the dominant factor. Going from 181,000 to 31,000 context switches per second freed enormous CPU resources. Each context switch costs roughly 1–2 microseconds when you include cache pollution — we saved 150 milliseconds of CPU time per second.

The batching effect amplified this. Instead of 4 syscalls per request (accept, read, write, close), we averaged 0.5 syscalls per request. Operations queued in the submission ring and were submitted in batches of 8–12.

The Monitoring Story: Observability Matters More

With io_uring, traditional metrics become less useful. Response time is easy to measure, but understanding why it changed requires new telemetry:

 struct IoUringMetrics {

    submissions_per_syscall: Histogram,  // how many queued ops each io_uring_enter flushes
    submission_queue_depth: Gauge,       // SQ entries waiting to be submitted
    completion_queue_depth: Gauge,       // CQ entries waiting to be reaped
    buffer_pool_hits: Counter,           // requests served from pooled buffers
    fallback_operations: Counter,        // operations routed to the standard async path

 }

We discovered our completion queue depth periodically spiked to 80% capacity, indicating we weren’t reaping completions fast enough. Tuning our event loop polling frequency resolved this.

The buffer pool metrics revealed surprising patterns. Files under 16KB got cached in our pool, but larger files bypassed it. This drove our decision to implement multi-tier pooling (small/medium/large buffers).

The Future: What’s Next for io_uring

io_uring development continues rapidly. Features we’re excited about:

Registered file descriptors: Pre-register files with the kernel for even lower overhead. Like buffer registration but for file handles.

Asynchronous stat operations: Currently, stat() calls still block. Future kernels will support async stat through io_uring.

Direct I/O improvements: Better integration with O_DIRECT for database-style workloads.

Cross-platform efforts: io_uring is Linux-only, but the concepts are influencing other platforms. Windows’ I/O rings and FreeBSD’s experimental implementations show the idea is spreading.

For our team, the next frontier is database queries. Postgres and MySQL don’t yet expose io_uring interfaces directly, but we’re exploring proxy architectures that could batch database I/O operations.

Read the full article here: https://medium.com/@chopra.kanta.73/io-uring-adventures-rust-servers-that-love-syscalls-d87143ea6936