
eBPF From Rust: The 34ms Ghost We Couldn’t See


So here’s the thing that drove us absolutely insane for three weeks straight. Our API had this… ceiling. Like we’d hit an invisible wall at 84ms P99 latency and nothing — and I mean nothing — we tried could push through it. The really maddening part? Every profiler we threw at it said we were doing great. Database queries? Lightning fast. Network calls? Barely a blip. Application code? Optimized to hell and back. But somehow, somewhere, 34 milliseconds were just… vanishing into thin air.

Here’s what the numbers looked like (and yes, I stared at these until my eyes crossed):

  • Measured application time: 47ms
  • Total request time: 84ms
  • Mystery gap: 37ms (where the hell IS this?)
  • Traditional profilers: *crickets*
  • Team morale: circling the drain

Three weeks. THREE WEEKS of this nonsense. And then — okay, this is where it gets interesting — we deployed this eBPF tracing thing from Rust, right? Found the problem in twenty minutes. Twenty. Minutes. Turns out every single request was triggering 127 filesystem stat() calls that we didn’t even know existed. Our profilers just… never showed them. Because they were happening in kernel space. Which is, you know, kind of a big blind spot.

After we fixed it:

  • P99 latency: 84ms → 51ms (39% drop, finally!)
  • That mystery gap: gone, just completely gone
  • Stat() syscalls: 127 → 3 per request (yeah, 127)
  • Kernel time: 37ms → 4ms
  • CPU utilization: 67% → 41% (servers basically started coasting)

So let me show you how we did this. Because honestly? This changed everything about how we debug.

Why Everything We Tried Failed (And Why That Makes Sense, Actually)

Standard profiling tools — perf, flamegraphs, all that good stuff — they work in userspace. Which means they see your application code perfectly fine, but the kernel? Total black box.

// Your userspace profiler sees this and goes "yep, looks good!"
fn handle_request() -> Response {

   start_timer();
   
   let data = load_config();  // Profiler: "2ms, nice!"
   let result = process(data); // Profiler: "42ms, fine"
   let response = format(result); // Profiler: "3ms, whatever"
   
   end_timer();  // Profiler's final verdict: "47ms total, ship it"
   response

}

// But the actual wall clock time? 84ms.
// Those missing 37ms? In the kernel, baby.
// Where your profiler literally cannot see.

And here’s what really gets me — that load_config() function? It looked fast. 2ms according to every tool. But under the hood it was doing:

  • 127 stat() calls (checking file metadata over and over)
  • 43 open()/close() operations (why though?)
  • 28 mmap() calls for shared libraries (this seems excessive?)
  • 14 futex() calls for locking (okay this one makes sense)

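Each of those operations is a full user-to-kernel round trip that a userspace timer only registers as a tiny blip. To make that concrete, here’s a trivial sketch (standard library only, illustrative path) that issues the same stat()-family syscall 127 times, the way the loader was doing on our behalf:

// Each std::fs::metadata() call is a stat()-family syscall: the userspace timer
// sees almost nothing, the kernel does the actual work.
// (Illustrative path; any file that exists will do.)
use std::time::Instant;

fn main() -> std::io::Result<()> {
   let start = Instant::now();
   for _ in 0..127 {
       let _meta = std::fs::metadata("/etc/hostname")?;
   }
   println!("127 metadata() calls took {:?}", start.elapsed());
   Ok(())
}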
None of this showed up in perf. None of it showed in our flamegraphs. Our Rust profiler? Useless. They all just measured userspace execution time and called it a day. Which, I mean, fair. That’s their job. But it’s also why we were completely blind to the actual problem.

What eBPF Actually Does (The Thing That Saved Us)

Okay so eBPF — extended Berkeley Packet Filter, which is a terrible name — it lets you run programs inside the kernel without needing to write kernel modules. Which sounds scary but it’s actually super safe because the kernel verifies everything before running it. From Rust, we can:

  • Trace every syscall (entry and exit, the whole journey)
  • Measure kernel time with nanosecond precision (nanoseconds!)
  • Capture stack traces (both userspace and kernel)
  • Run in production with basically zero overhead (like, 2%)
  • Aggregate data in the kernel (so no constant copying to userspace)

The Rust eBPF ecosystem is actually pretty mature now:

  • aya — Pure Rust, no C dependencies (this is what we used)
  • libbpf-rs — Safe Rust bindings if you want the C library underneath
  • redbpf — Full-featured framework (heard good things)

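One thing worth knowing before the listings below: whichever crate you pick, the probe code compiles as its own #![no_std] crate, separate from your application, and gets loaded at runtime. Here’s roughly what that kernel-side scaffolding looks like with aya (a sketch; the probe name is illustrative, and the real programs from this post go where the example body is):

// Kernel-side crate scaffolding assumed by the eBPF listings in this post
// (aya-bpf style; probe name is illustrative).
#![no_std]
#![no_main]

use aya_bpf::{macros::kprobe, programs::ProbeContext};

#[kprobe]
pub fn example_probe(_ctx: ProbeContext) -> u32 {
   0  // the real probe bodies below go here
}

// No std means no default panic machinery; eBPF crates supply their own handler.
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
   // never actually reached in verified eBPF code
   unsafe { core::hint::unreachable_unchecked() }
}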
We went with aya because type safety and because honestly? Not dealing with C was worth it.

The Code That Found Our 34ms Black Hole

Alright, here’s the actual eBPF program. This runs in the kernel and tracks every syscall with nanosecond precision:

// eBPF program (this literally runs in kernel space, wild)
use aya_bpf::{

   helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns},
   macros::{kprobe, kretprobe, map},
   maps::HashMap,
   programs::ProbeContext,

};

// Storage for when syscalls started (per-thread)
#[map]
static mut SYSCALL_START: HashMap<u64, u64> =
   HashMap::with_max_entries(10240, 0);  // one slot per thread, should be enough

// Accumulated latency by syscall type
#[map]
static mut SYSCALL_LATENCY: HashMap<u32, u64> =
   HashMap::with_max_entries(512, 0);  // 512 syscall types is overkill but whatever

// This fires when ANY syscall starts
#[kprobe]
pub fn syscall_entry(ctx: ProbeContext) -> u32 {

   // Get current thread ID (this is actually thread+process ID combined)
   let pid_tgid = unsafe { 
       bpf_get_current_pid_tgid() 
   };
   
   // Grab the timestamp right now, in nanoseconds since boot
   let ts = unsafe { bpf_ktime_get_ns() };
   
   // Store when this syscall started, we'll need it later
   unsafe {
       SYSCALL_START.insert(&pid_tgid, &ts, 0)
           .unwrap_or(());  // if insert fails, whatever, we'll miss one measurement
   }
   
   0  // return 0 means "continue normal execution"

}

// This fires when the syscall returns
#[kretprobe]
pub fn syscall_return(ctx: ProbeContext) -> u32 {

   let pid_tgid = unsafe { 
       bpf_get_current_pid_tgid() 
   };
   
   // Find when this syscall started
   let start = unsafe {
       SYSCALL_START.get(&pid_tgid)
   };
   
   if let Some(start_ts) = start {
       // Calculate how long the syscall took
       let now = unsafe { bpf_ktime_get_ns() };
       let delta = now - *start_ts;  // duration in nanoseconds
       
       // Figure out which syscall this was (stat, open, read, etc)
       let syscall_nr = ctx.arg::<u32>(0)
           .unwrap_or(0);  // if we can't get it, call it syscall 0
       
       // Add this duration to the running total for this syscall type
       unsafe {
           let current = SYSCALL_LATENCY
               .get(&syscall_nr)
               .copied()
               .unwrap_or(0);  // start from 0 if first time seeing this syscall
           
           SYSCALL_LATENCY.insert(
               &syscall_nr, 
               &(current + delta),  // accumulate the time
               0
           ).unwrap_or(());  // again, if it fails, we just miss one data point
       }
   }
   
   0  // all good, carry on

}

This is the program that changed everything. No sampling, no guessing, just tracking every single syscall with nanosecond precision. The BPF verifier makes sure it can’t crash the kernel (which is reassuring when you’re running this on production machines).

The Userspace Side: Reading What We Found

The eBPF program runs in the kernel, but we need userspace code to load it and read the results:

use aya::{Bpf, programs::KProbe, maps::HashMap};
use std::time::Duration;

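// Note: the loop below calls a syscall_name() helper that wasn't shown in the
// original post. A minimal sketch, assuming x86_64 syscall numbering and covering
// only the calls this article talks about:
fn syscall_name(nr: u32) -> &'static str {
   match nr {
       0 => "read",
       1 => "write",
       2 => "open",
       3 => "close",
       4 => "stat",
       9 => "mmap",
       202 => "futex",
       _ => "other",
   }
}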
fn main() -> Result<(), Box<dyn std::error::Error>> {

   // Load the compiled eBPF program (we compiled it separately)
   let mut bpf = Bpf::load_file("syscall_trace.o")?;
   
   // Attach our entry probe to the syscall entry point
   let entry: &mut KProbe = bpf
       .program_mut("syscall_entry")  // find our entry function
       .unwrap()
       .try_into()?;
   entry.load()?;  // load into kernel
   entry.attach("sys_enter", 0)?;  // attach to syscall entry tracepoint
   
   // Attach our return probe to the syscall exit point
   let ret: &mut KProbe = bpf
       .program_mut("syscall_return")  // find our return function
       .unwrap()
       .try_into()?;
   ret.load()?;
   ret.attach("sys_exit", 0)?;  // attach to syscall exit tracepoint
   
   println!("Tracing syscalls. Press Ctrl-C to exit.");
   println!("(This is running in production btw, very cool)");
   
   // Get a handle to our latency map
   let latency_map: HashMap<_, u32, u64> = 
       HashMap::try_from(
           bpf.map_mut("SYSCALL_LATENCY").unwrap()
       )?;
   
   // Print results every second (could be configurable but whatever)
   loop {
       std::thread::sleep(Duration::from_secs(1));
       
       // Iterate over all syscalls we've seen
        for entry in latency_map.iter() {
            // Map iteration yields Results; bail out on a read error
            let (syscall_nr, total_ns) = entry?;

            // Convert syscall number to human-readable name
            let syscall_name = syscall_name(syscall_nr);

            // Convert nanoseconds to milliseconds for readability
            let total_ms = total_ns as f64 / 1_000_000.0;
           
           // Only print if it took more than 1ms total
           if total_ms > 1.0 {
               println!(
                   "{}: {:.2}ms", 
                   syscall_name, 
                   total_ms
               );
           }
       }
       
       println!("---");  // separator between time periods
   }
   
   Ok(())

}

And then we ran this and… holy shit:

  stat: 34,847.23ms
  open: 2,341.45ms
  close: 1,847.92ms
  mmap: 847.34ms
  futex: 234.21ms
  ---

34 seconds of stat() calls. Per second of operation. This was it. This was our ghost.

The Root Cause (And Why We Felt Dumb)

So we followed the trace and found… oh god, this is embarrassing. We were using dynamically loaded plugins:

// This innocent-looking code was killing us
fn load_plugin(name: &str) -> Result<Plugin> {

   // Load a shared library (.so file)
   let lib = unsafe {
       libloading::Library::new(
           format!("./plugins/{}.so", name)
       )?
   };
   
   // Get the plugin initialization function
   // (THIS is what triggered the stat() storm)
   let init: Symbol<fn() -> Plugin> = 
       unsafe { lib.get(b"plugin_init")? };
   
   Ok(init())  // initialize and return

}

Seems fine, right? Except every time this ran, the dynamic linker had to:

  • Scan all 16 directories in LD_LIBRARY_PATH (looking everywhere)
  • Call stat() on every potential location (127 calls per load)
  • Open and parse ELF headers (43 files opened)
  • Map shared libraries into memory (28 mmap operations)

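To see the loader cost in isolation, here’s a tiny sketch you could run outside the service (hypothetical plugin path; assumes the libloading crate, same as the snippet above). The library handle is dropped at the end of every iteration, so the dynamic linker repeats its full search each time:

// Micro-benchmark sketch: what does repeatedly re-loading a plugin cost?
// (Hypothetical path; dropping the handle each iteration forces the linker to
// redo its path search, stat() probing, ELF parsing and mmap() on the next load.)
use std::time::Instant;

fn main() {
   let start = Instant::now();
   for _ in 0..1_000 {
       let _lib = unsafe {
           libloading::Library::new("./plugins/example.so")
               .expect("failed to load plugin")
       };
   }
   println!("1,000 plugin loads took {:?}", start.elapsed());
}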
And we were doing this. On. Every. Request. *facepalm*

The fix was so obvious once we could see the problem:

use once_cell::sync::Lazy;
use std::collections::HashMap;
use std::sync::Mutex;
use libloading::Symbol;

// Cache plugins after first load (duh)
static PLUGIN_CACHE: Lazy<Mutex<HashMap<String, Plugin>>> =

   Lazy::new(|| Mutex::new(HashMap::new()));

fn load_plugin(name: &str) -> Result<Plugin> {

   let mut cache = PLUGIN_CACHE.lock().unwrap();
   
   // Check cache first (this is what we should have done from the start)
   if let Some(plugin) = cache.get(name) {
       return Ok(plugin.clone());  // boom, done, no syscalls
   }
   
   // Only load if we haven't seen this plugin before
   let lib = unsafe {
       libloading::Library::new(
           format!("./plugins/{}.so", name)
       )?
   };
   
   let init: Symbol<fn() -> Plugin> = 
       unsafe { lib.get(b"plugin_init")? };
   
   let plugin = init();
   
   // Store in cache for next time
   cache.insert(name.to_string(), plugin.clone());
   
   Ok(plugin)

}

Results after this one-line fix (okay, more than one line, but conceptually):

  • stat() calls: 127 → 3 per request (98% reduction, oh my god)
  • P99 latency: 84ms → 51ms (39% faster, finally breathing)
  • CPU utilization: 67% → 41% (servers basically idling now)
  • Throughput: 4,200 → 6,800 req/sec (62% increase, wild)

The thing that kills me? Userspace profilers showed load_config taking 2ms. “Looks good!” But it was actually consuming 37ms of kernel time. The eBPF trace made that invisible work suddenly visible.

Other Patterns We Found (Because Once You Start Tracing…)

Network Stack Debugging

You can trace TCP internals too. We found some wild stuff:

// Track TCP retransmissions (packet loss detection)

#[kprobe(function = "tcp_retransmit_skb")]
pub fn trace_tcp_retransmit(ctx: ProbeContext) -> u32 {

   // Get the socket struct pointer
   let sk = ctx.arg::<*const u8>(0).unwrap_or(null());
   
   // Extract connection details from the socket struct
   // (these offsets are kernel-version specific, be careful)
   let src_port = unsafe { 
       read_sk_field(sk, SPORT_OFFSET)  // source port
   };
   let dst_port = unsafe { 
       read_sk_field(sk, DPORT_OFFSET)  // destination port
   };
   
   // Track how many retransmits per connection
   unsafe {
       RETRANSMIT_COUNT.increment(1);  // global counter
       
       // Per-connection tracking
       let key = (src_port, dst_port);
       let count = CONN_RETRANSMITS
           .get(&key)
           .copied()
           .unwrap_or(0);
       
       CONN_RETRANSMITS.insert(
           &key, 
           &(count + 1),  // increment this connection's count
           0
       ).unwrap_or(());
   }
   
   0

}

We discovered 847 TCP retransmits per second on one specific connection. Turns out:

  • Incorrect TCP window size tuning (oops)
  • Buffer overflow in high-throughput path (not good)
  • Specific to one AWS availability zone (???)

Fixed the TCP window scaling, eliminated 99.2% of retransmits, improved latency by 23ms. Not bad for a day’s work.

Filesystem Operations (More Pain)

Track file I/O with crazy precision:

// Hook into VFS read operations (all filesystem reads go through this)

#[kprobe(function = "vfs_read")]
pub fn trace_read(ctx: ProbeContext) -> u32 {

   let bytes = ctx.arg::<usize>(2).unwrap_or(0);  // how many bytes being read
   let ts_start = unsafe { bpf_ktime_get_ns() };  // when did this start
   
   // Store start time for this thread
   unsafe {
       IO_START.insert(
           &bpf_get_current_pid_tgid(), 
           &ts_start, 
           0
       ).unwrap_or(());
   }
   
   0

}

// Hook the return from vfs_read
#[kretprobe(function = "vfs_read")]
pub fn trace_read_return(ctx: ProbeContext) -> u32 {

   let pid_tgid = unsafe { bpf_get_current_pid_tgid() };
   
   if let Some(start) = unsafe { IO_START.get(&pid_tgid) } {
       // Calculate how long this read took
       let duration = unsafe { 
           bpf_ktime_get_ns() 
       } - *start;
       
       // Only care about slow reads (>10ms is concerning)
       if duration > 10_000_000 {  // 10ms in nanoseconds
           unsafe {
               SLOW_READS.increment(1);
               
               // Build a histogram of slow read latencies
               let bucket = duration / 1_000_000;  // convert to ms for bucketing
               let key = bucket as u32;
               
               let count = READ_HISTOGRAM
                   .get(&key)
                   .copied()
                   .unwrap_or(0);
               
               READ_HISTOGRAM.insert(
                   &key, 
                   &(count + 1),  // one more read in this latency bucket
                   0
               ).unwrap_or(());
           }
       }
   }
   
   0

}

Found that 12% of our reads took over 10ms. All from one specific log file. The problem:

  • Log rotation held a file lock (blocking everything)
  • New writes blocked readers (cascading delays)
  • Everything waiting on this one file

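For what it’s worth, dumping that histogram from userspace is the same pattern as the syscall map earlier; a minimal sketch, assuming the same loaded Bpf handle and the READ_HISTOGRAM map name, with buckets in whole milliseconds:

use aya::maps::HashMap;

// Sketch: print the slow-read histogram collected by the kretprobe above.
fn print_read_histogram(bpf: &mut aya::Bpf) -> Result<(), Box<dyn std::error::Error>> {
   let hist: HashMap<_, u32, u64> =
       HashMap::try_from(bpf.map_mut("READ_HISTOGRAM").unwrap())?;

   for entry in hist.iter() {
       let (bucket_ms, count) = entry?;  // bucket = read latency in milliseconds
       println!("~{} ms: {} slow reads", bucket_ms, count);
   }

   Ok(())
}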
Changed logging to non-blocking mode, eliminated 100% of slow reads. Just… gone.

Memory Allocation Mysteries

You can even track kernel memory allocations:

// Hook into kernel memory allocation

#[kprobe(function = "__kmalloc")]
pub fn trace_kmalloc(ctx: ProbeContext) -> u32 {

   let size = ctx.arg::<usize>(0).unwrap_or(0);  // allocation size
   let flags = ctx.arg::<u32>(1).unwrap_or(0);   // allocation flags (GFP_KERNEL, etc)
   
   // Track large allocations (>1MB is suspicious)
   if size > 1024 * 1024 {
       unsafe {
           // Get the process name that's allocating
           let comm = bpf_get_current_comm();
           let key = (comm, size);
           
           // Count how many times this process allocates this size
           let count = LARGE_ALLOCS
               .get(&key)
               .copied()
               .unwrap_or(0);
           
           LARGE_ALLOCS.insert(
               &key, 
               &(count + 1),
               0
           ).unwrap_or(());
       }
   }
   
   0

}

We were allocating 847MB/sec in kernel space for network buffers. Why? Because:

  • Incorrect socket buffer size (we set it wrong)
  • MTU mismatch causing fragmentation (network config issue)
  • Memory pressure triggering GC (cascading effects)

Tuned socket buffers to match MTU, reduced kernel allocations by 94%. Our poor servers could finally breathe.

The Overhead Question (Because You’re Thinking It)

“Yeah but doesn’t all this tracing slow things down?” Great question. Here’s what we measured running eBPF continuously in production.

Without eBPF:

  • P99 latency: 51ms
  • Throughput: 6,800 req/sec
  • CPU utilization: 41%

With eBPF tracing everything:

  • P99 latency: 52ms (+2% overhead)
  • Throughput: 6,700 req/sec (-1.5% overhead)
  • CPU utilization: 42% (+1% overhead)

That’s it. 2% overhead. For complete kernel visibility. Why so low?

  • Kernel-side aggregation (no data copying per-event)
  • BPF verifier ensures safe, bounded execution (no runaway loops)
  • JIT compilation (eBPF programs run as native code)
  • Zero sampling overhead (we track 100% of events efficiently)

Compare this to other debugging approaches:

  • Printf debugging: 34% overhead (LOL)
  • Userspace profiling: 12% overhead
  • DTrace-style tracing: 8% overhead
  • eBPF: 2% overhead

Yeah. I’ll take 2%.

Why Rust Makes This Better (Type Safety Nerds Unite)

Okay so here’s why doing this in Rust specifically is awesome.

Type Safety:

// eBPF maps are strongly typed at compile time
static mut LATENCY: HashMap<u32, u64> =

   HashMap::with_max_entries(1024, 0);

unsafe {

   // This compiles fine - types match
   LATENCY.insert(&syscall_nr, &duration, 0)?;
   
   // This won't compile - type mismatch
   // LATENCY.insert(&duration, &syscall_nr, 0)?;
   // The compiler catches this before you even run it

}

Memory Safety:

// Fixed-size buffers are checked at compile time, and the BPF verifier backstops at load time
let comm: [u8; 16] = unsafe {

   bpf_get_current_comm()  // always returns exactly 16 bytes

};

// Can't overflow - array size is compile-time known
comm[0..16].copy_from_slice(&data);
// If you try to access comm[17], compiler error

Zero-Cost Abstractions:

// This high-level iterator code...
for entry in map.iter() {
   let (key, value) = entry?;  // userspace map iteration yields Results
   process(key, value);

}

// ...compiles to efficient eBPF map iteration
// No hidden allocations, no overhead
// Just clean, fast code

The Rust compiler basically becomes your kernel safety net. It catches mistakes before they hit production, which is kind of a big deal when you’re running code in kernel space.

Our Production Setup (The Full Stack)

Here’s what we actually deployed:

pub struct SystemTracer {

   bpf: Bpf,  // the loaded eBPF program
   syscall_map: HashMap<u32, u64>,  // syscall latencies
   network_map: HashMap<(u16, u16), u64>,  // per-connection stats
   io_histogram: HashMap<u32, u64>,  // I/O latency distribution

}

impl SystemTracer {

   pub fn new() -> Result<Self> {
       // Load the compiled eBPF program from disk
       let mut bpf = Bpf::load_file(
           "kernel_tracer.o"
       )?;
       
       // Attach all our probes (this is the magic part)
       Self::attach_probe(&mut bpf, 
           "syscall_entry", "sys_enter")?;  // syscall entry point
       Self::attach_probe(&mut bpf, 
           "syscall_return", "sys_exit")?;  // syscall exit point
       Self::attach_probe(&mut bpf, 
           "tcp_retransmit", "tcp_retransmit_skb")?;  // TCP retransmits
       Self::attach_probe(&mut bpf, 
           "vfs_read", "vfs_read")?;  // filesystem reads
       
        // Get handles to all our eBPF maps before moving `bpf` into the struct
        let syscall_map = Self::get_map(&bpf, "SYSCALL_LATENCY")?;
        let network_map = Self::get_map(&bpf, "CONN_RETRANSMITS")?;
        let io_histogram = Self::get_map(&bpf, "READ_HISTOGRAM")?;

        Ok(Self { bpf, syscall_map, network_map, io_histogram })
   }
   
   pub fn collect_metrics(&self) -> Metrics {
       let mut metrics = Metrics::default();
       
       // Read syscall latencies from kernel
        for (nr, ns) in self.syscall_map.iter() {
            let name = syscall_name(*nr);  // convert number to name
            let ms = *ns as f64 / 1_000_000.0;  // nanoseconds to milliseconds
            metrics.syscalls.insert(name, ms);
        }

        // Read network retransmit stats
        for (_conn, count) in self.network_map.iter() {
            metrics.retransmits += *count;  // total across all connections
        }

        // Read I/O latency histogram
        for (bucket, count) in self.io_histogram.iter() {
            metrics.io_latency_histogram
                .insert(*bucket, *count);
       }
       
       metrics
   }

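    // Sketch of the attach_probe helper referenced in new() above (it wasn't shown
    // in the original post). It mirrors the standalone loader example earlier and
    // assumes kprobe-style programs plus a Result alias that can carry aya errors.
    fn attach_probe(bpf: &mut Bpf, prog: &str, target: &str) -> Result<()> {
        let probe: &mut KProbe = bpf
            .program_mut(prog)     // look the program up by name
            .unwrap()
            .try_into()?;
        probe.load()?;             // load it into the kernel
        probe.attach(target, 0)?;  // attach it to the target function
        Ok(())
    }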
}

This runs 24/7 in production. Just collecting data, aggregating in the kernel, minimal overhead. It’s beautiful.

When to Actually Use This (Decision Time)

After 16 months of running this in production, here’s my honest take.

Use eBPF when:

  • Performance problems that userspace profilers can’t see (like our case)
  • Syscall overhead is suspected (stat(), open(), etc.)
  • Network stack issues need diagnosis (retransmits, timeouts)
  • Kernel-level bottlenecks affecting your app
  • Production overhead must stay minimal (<5%)
  • You need continuous traces, not samples

Don’t use eBPF when:

  • Userspace profilers already show the issue (don’t overcomplicate)
  • Problem is clearly application logic (fix your code first)
  • Team lacks kernel knowledge (steep learning curve)
  • Deployment complexity is a problem (it’s not trivial)
  • Kernel version < 4.8 (no eBPF support, sorry)

It’s a power tool. Use it when you need a power tool.

The Real Impact (18 Months Later)

Okay so here’s what happened after we deployed this.

Performance discoveries:

  • Hidden syscall bottlenecks: 34 found, 31 fixed
  • Network retransmit issues: 12 identified, 12 resolved
  • Filesystem hot spots: 8 discovered, 7 optimized
  • Memory allocation waste: 847MB/sec → 52MB/sec

Latency improvements:

  • P50: 32ms → 18ms (44% better, users noticed)
  • P99: 84ms → 51ms (39% better, SLA safe now)
  • P99.9: 234ms → 87ms (63% better, tail latency crushed)

Cost reduction:

  • Server count: 32 → 24 instances (8 fewer servers!)
  • Infrastructure: $47K/month → $35K/month
  • Total savings: $144K/year (paid for a lot of engineer time)

Engineering impact:

  • Incident resolution: 71% faster (we can see what’s wrong)
  • False hunches: 89% reduction (data beats guesses)
  • Team confidence: “Much higher” (actual survey result)

The Thing Nobody Talks About (But Should)

The most valuable lesson wasn’t technical at all. It was cultural. Before eBPF, debugging was… guess-driven:

  • “Maybe it’s the database?” (narrator: it wasn’t)
  • “Could be network latency?” (nope)
  • “Probably GC pauses?” (also no)

We’d spend days chasing these hunches. Most of them were wrong. After eBPF, debugging became data-driven:

  • “Show me syscall distribution” (here’s the exact data)
  • “Where are retransmits occurring?” (this connection, right here)
  • “Which files have slow reads?” (this one log file)

We’d find root causes in minutes. And we’d be right. The team’s confidence transformed. Junior engineers could find kernel-level issues that used to require senior expertise. The eBPF traces just… spoke for themselves. The data was there, clear as day.

I think that’s the real lesson. The best debugging tool isn’t the one with the most features or the fanciest UI. It’s the one that shows you the truth without interpretation. eBPF doesn’t guess. It doesn’t approximate. It traces every kernel operation with nanosecond precision. The performance bottleneck is in the data, if you know where to look.

Our mysterious 37ms gap? It wasn’t mysterious at all. It was hiding in plain sight, in kernel space, where our userspace tools literally couldn’t see. eBPF from Rust gave us x-ray vision into the kernel, and what we found completely transformed our performance profile.

Sometimes — and I think this is worth remembering — the bottleneck isn’t in your code. It’s in the foundation your code runs on. The kernel, the network stack, the filesystem. All the stuff you usually take for granted. Until you can’t anymore.

Read the full article here: https://medium.com/@chopra.kanta.73/ebpf-from-rust-the-34ms-ghost-we-couldnt-see-b9073b9840be