
UDP Telemetry Firehose: When Rust on Bare Metal Outperforms Cloud by 10x


847,000 UDP packets per second were pouring in from the 12,000 IoT sensors we had scattered everywhere, and our Kubernetes cluster — this thing we’d lovingly maintained for years — was just… choking. 2.3% packet loss. Which doesn’t sound like much until you realize that’s roughly 19,000 packets vanishing into the void every second.

And the latency? 200ms spikes during peak hours. Our AWS bill was $47,000 a month and climbing. Forty-seven thousand dollars. I remember staring at that invoice thinking “there has to be a better way.” We did everything the books tell you to do. Scale horizontally, they said. Add more pods. Optimize the code. We tried vertical scaling — threw more CPU and RAM at it. Tweaked every kernel parameter we could find in those container configs. Memory tuning became this obsessive thing where I’d wake up at 3am with ideas about buffer sizes. Nothing worked. The packet loss just sat there, mocking us, somewhere between 1.8% and 2.4%.

Then — and I remember the exact moment, we were in a retrospective meeting, everyone exhausted — someone asked: “What if… what if the problem IS the abstraction?”

The Tax You Don’t See

Modern cloud infrastructure is beautiful, right? It’s elegant. Containers, orchestrators, managed services — they abstract away all the messy details. Which is great! Until you need those messy details because the abstraction itself becomes your bottleneck. Think about what happens when a UDP packet hits our system in Kubernetes:

  • Container networking overlay: 15–25μs (microseconds, but they add up)
  • Kubernetes service mesh: 30–50μs
  • Cloud provider’s virtualized NIC: 40–80μs
  • And then — oh god, the garbage collection pauses from our JVM-based system: 50–200ms periodically

Now, look. In isolation? These numbers are nothing. Trivial. But at 850,000 packets per second… I did the math one night and nearly threw my laptop. Even microseconds compound. They multiply. They cascade into this nightmare of packet loss.
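
To make the compounding concrete, here’s the back-of-envelope version of that late-night math. This is a minimal sketch, not a measurement: every constant comes from the mid-range of the bullets above, and it pessimistically treats latency as CPU time a core must spend per packet.

// Back-of-envelope only: mid-range per-packet costs from the bullets above
fn main() {
    let pps = 850_000.0_f64; // Peak packet rate
    let budget_us = 1e6 / pps; // ~1.18 microseconds of budget per packet, per core
    let overlay_us = 20.0; // Container networking overlay (midpoint of 15-25)
    let mesh_us = 40.0; // Service mesh (midpoint of 30-50)
    let vnic_us = 60.0; // Virtualized NIC (midpoint of 40-80)
    let overhead_us = overlay_us + mesh_us + vnic_us; // 120 microseconds, before GC pauses
    println!("budget per packet: {budget_us:.2} us");
    println!("stack overhead:    {overhead_us:.0} us");
    // Pessimistic read: if each packet really cost that much CPU, you'd need:
    println!("cores burned just absorbing overhead: {:.0}", overhead_us / budget_us);
}

Run it and it prints a three-digit core count. The real system pipelines better than that, obviously, but the shape of the problem is right there.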

We were paying what I started calling the “abstraction tax” — except instead of money, we were paying with our actual data. Sensor readings from industrial equipment just… disappearing. Gone. For ultra-high-frequency UDP telemetry, where every lost packet might be a critical temperature reading from a semiconductor fab or pressure data from an oil pipeline — managed infrastructure couldn’t cut it. The realization was honestly kind of terrifying because it meant rethinking everything.

Going Bare Metal (Or: How I Learned to Stop Worrying and Love the Kernel)

We ordered a single bare metal server. One. AMD EPYC 7543, 64 cores, 256GB RAM, dual 100Gbps NICs. No hypervisor sitting between us and the hardware. No container runtime. No orchestrator. Just Linux 6.1, our application, and direct access to everything. I won’t lie — hitting the “provision” button felt reckless. The results though…

Before (Kubernetes on AWS):

  • Throughput: 847K packets/sec at peak
  • Packet loss: 2.3% average (still makes me wince)
  • P99 latency: 187ms
  • CPU utilization: 73% spread across 8 pods
  • Monthly cost: $47,000

After (Rust on Bare Metal):

  • Throughput: 1.89M packets/sec sustained (SUSTAINED!)
  • Packet loss: 0.07% average
  • P99 latency: 4.2ms (I checked this number like 10 times)
  • CPU utilization: 41% on a single process with 32 threads
  • Monthly cost: $3,200

We more than doubled throughput. We reduced packet loss by 97%. We cut costs by 93%. But here’s the thing that really got me — it wasn’t just about the numbers. It was understanding why this worked, what we’d been missing all along.

Why Rust? (And Why We Almost Didn’t Use It)

Okay so — and this is embarrassing — we almost didn’t use Rust. Our team loves Go. We’re a Go shop. We prototyped the whole thing in Go first because, you know, comfort zone.

First benchmark: 1.2M packets/sec with 0.4% loss. Better than Kubernetes! But not… not transcendent. The problem? Garbage collection pauses. Every few seconds, everything would just stop while Go cleaned up memory. At this packet rate, those pauses were catastrophic.

Rust’s zero-cost abstractions though — and its ownership model that means no garbage collector — gave us predictable, sub-microsecond latency. No pauses. No stops. Just constant, relentless processing. Here’s the core UDP receiver (and honestly, this simplicity is what sold me):

use std::net::UdpSocket; // UDP socket functionality
use std::sync::mpsc; // Multi-producer, single-consumer channel

fn main() -> std::io::Result<()> { // Main returns an IO Result for error handling
    let socket = UdpSocket::bind("0.0.0.0:8125")?; // Bind to all interfaces on port 8125
    socket.set_nonblocking(true)?; // Non-blocking mode for continuous polling

    let mut buf = [0u8; 1500]; // Stack-allocated buffer, 1500 bytes (standard MTU size)
    let (tx, _rx) = mpsc::channel(); // Channel to the processing threads
    // (in the real system, _rx is moved into worker threads that drain it)

    loop { // Infinite loop - this is our hot path
        match socket.recv_from(&mut buf) { // Try to receive data into our buffer
            Ok((size, src)) => { // Successfully received a packet
                let data = buf[..size].to_vec(); // Copy only the bytes actually received
                tx.send((data, src)).ok(); // Send to processing channel, ignore send errors
            }
            Err(ref e) if e.kind() == std::io::ErrorKind::WouldBlock => {
                continue; // No data right now - keep spinning, check again immediately
            }
            Err(e) => return Err(e), // Actual error, propagate it up
        }
    }
}

Twenty-odd lines. That’s the core. The receive buffer is stack-allocated and reused for every packet; the only heap work in the hot path is the single copy handed off to the channel. No garbage collection pauses. No memory churn. Just raw, unrelenting throughput.
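
For completeness, here’s the other side of that channel. This is a sketch, not our production code; spawn_worker and parse_telemetry are placeholder names standing in for the real decode path:

use std::net::SocketAddr;
use std::sync::mpsc::Receiver;
use std::thread;

// Placeholder for the real telemetry decoder
fn parse_telemetry(_data: &[u8]) -> Result<(), String> {
    Ok(())
}

// One worker draining the channel the hot loop feeds
fn spawn_worker(rx: Receiver<(Vec<u8>, SocketAddr)>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for (data, src) in rx { // Blocks until the sender side hangs up
            if let Err(e) = parse_telemetry(&data) {
                eprintln!("bad packet from {src}: {e}"); // Log and keep going
            }
        }
    })
}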

The Architecture Tricks That Made This Possible

Bare metal gave us three things we couldn’t get anywhere else — and I’m still kind of amazed these work as well as they do:

1. Direct NIC Control

We used AF_PACKET sockets with PACKET_RX_RING to sidestep the kernel’s per-packet syscall and copy overhead. Like, we went around most of the stack. This dropped per-packet overhead from ~3μs to ~0.8μs.

// Simplified RX ring setup - this is the magic sauce
// (sketch: sockaddr and rx_ring_req, a tpacket_req, are built elsewhere)
use std::os::fd::AsRawFd;
use socket2::{Domain, Protocol, Socket, Type};

const PACKET_RX_RING: libc::c_int = 5; // Option value from <linux/if_packet.h>

let socket = Socket::new( // Create a raw packet socket
    Domain::PACKET, // Operating at the packet level, below IP
    Type::RAW, // Raw socket type for direct packet access
    Some(Protocol::from((libc::ETH_P_ALL as u16).to_be() as i32)), // All protocols, network byte order
)?;
socket.bind(&sockaddr)?; // Bind to the specific network interface
// socket2 exposes no generic setsockopt, so the ring is configured through libc on the raw fd
unsafe {
    libc::setsockopt(
        socket.as_raw_fd(), // Underlying file descriptor
        libc::SOL_PACKET, // Socket level: packet
        PACKET_RX_RING, // Option: receive ring buffer
        &rx_ring_req as *const _ as *const libc::c_void, // Ring config (block size, block count, frame size)
        std::mem::size_of_val(&rx_ring_req) as libc::socklen_t,
    );
}
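
The part the snippet doesn’t show: once PACKET_RX_RING is set, you mmap the socket and walk the ring’s frames directly in userspace. The kernel flips a status flag on each frame as data lands, so draining packets that have already arrived costs zero syscalls.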

2. CPU Pinning and NUMA Awareness

Here’s something that took me way too long to figure out: locality matters more than parallelism. Way more. We pinned our receiver threads to specific CPU cores that were physically adjacent to the NIC’s NUMA node. This kept packet buffers in L3 cache. Cross-NUMA memory access dropped by 89%. Context switches — which were happening 247,000 times per second before — dropped to 18,000/sec. The difference was night and day. Like going from a noisy highway to a quiet country road.
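
The mechanics of pinning are small; choosing the core is the real work. A minimal sketch using the libc crate (pin_to_core is an illustrative name, and which core to pass in depends on the NIC’s NUMA node, readable from /sys/class/net/<nic>/device/numa_node):

// Minimal sketch: pin the calling thread to one CPU core (Linux, via the libc crate)
fn pin_to_core(core: usize) -> std::io::Result<()> {
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed(); // Start with an empty CPU set
        libc::CPU_ZERO(&mut set); // Clear all bits
        libc::CPU_SET(core, &mut set); // Allow exactly one core
        // pid 0 means "the calling thread"
        if libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}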

3. Zero-Copy Processing

Using io_uring (which is relatively new and honestly kind of scary in how low-level it is), we implemented zero-copy paths from the NIC buffer straight to our processing pipeline. Traditional syscalls copy data three times: NIC → kernel → userspace → application. Three! We cut it to one copy. Just one.

use io_uring::IoUring; // From the io-uring crate

let mut ring = IoUring::new(4096)?; // Create io_uring with 4096 queue entries
// (receive operations are pushed onto the submission queue elsewhere)

loop { // Main event loop
    ring.submit_and_wait(1)?; // Submit pending operations, block until at least 1 completes

    while let Some(cqe) = ring.completion().next() { // Drain everything that completed
        process_packet_zerocopy(cqe.user_data()); // Our handler; reads the buffer in place
    }
}
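
What makes this zero-copy rather than merely async: io_uring lets you register packet buffers with the ring up front, so each completion refers to memory the application already owns. cqe.user_data() is just an app-chosen token identifying which buffer to parse in place.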


Zero-copy processing eliminates redundant data movement — in high-frequency systems, that’s often the difference between theoretical and actual network throughput.

The Stuff Nobody Talks About

Okay so bare metal isn’t magic. It’s not some silver bullet. We lost things. Important things.

  • Auto-scaling: Gone. Can’t just spin up more pods. Vertical scaling only, which means planning.
  • Geographic distribution: We’re in one datacenter. Multi-region means manual setup.
  • Deployment simplicity: Instead of kubectl apply, we're writing Ansible playbooks like it's 2015.
  • Recovery automation: We had to build our own health monitoring and failover logic from scratch.

But — and this is the crucial part — we gained predictability. On AWS, a noisy neighbor VM could spike our P99 latency by 300%. Just randomly. No warning. On bare metal? Performance variance is under 5%. For telemetry where we’re monitoring industrial sensors — things that can’t afford to miss readings — this consistency was worth every bit of operational complexity. We need sub-10ms processing for real-time alerting. A sensor monitoring oil pipeline pressure can’t wait. A temperature probe in a semiconductor fab can’t have 200ms latency spikes.

When Should You Actually Do This?

After nine months running this in production (and several 2am incidents that taught us valuable lessons), here’s my decision framework:

Choose Bare Metal Rust When:

  • Your packet rate consistently exceeds 500K/sec
  • Packet loss must stay below 0.1% (not a nice-to-have, a must-have)
  • P99 latency requirements are single-digit milliseconds
  • You’re spending >$30K/month on cloud infrastructure for this workload
  • You can handle stateful deployments and custom failover (this is non-negotiable)
  • Your team has systems programming experience (or is willing to learn fast)

Stay With Managed Infrastructure When:

  • Throughput is bursty or unpredictable (bare metal doesn’t auto-scale well)
  • Geographic distribution is mandatory (multi-region bare metal is painful)
  • Team velocity matters more than raw performance (totally valid choice)
  • Packet loss <2% is acceptable for your use case
  • You need to scale 10x in minutes (bare metal can’t do this)
  • Operational simplicity is a business requirement (also totally valid)

The data forced us to challenge everything we believed about modern infrastructure. Sometimes — not always, but sometimes — the best optimization is stripping away the very layers we thought were helping us.

Where We Are Now

We didn’t abandon Kubernetes entirely. That would be stupid. Our API layer, data processing pipeline, dashboard — all of that still runs on managed infrastructure because it makes sense there. But for the UDP ingestion layer, that absolute performance bottleneck? Bare metal Rust was the only architecture that could deliver what we needed.

The lesson I keep coming back to: choose your abstractions deliberately. With intention. Cloud native isn’t always the answer. Sometimes it is! But sometimes — like in our case — going back to basics (Rust, bare metal, careful systems engineering) unlocks performance that managed services can never, ever provide.

Our sensor network now handles 1.9 million packets per second with sub-millisecond jitter. Consistently. Reliably. We sleep better knowing those industrial sensors — monitoring oil pipeline pressures, semiconductor fab temperatures, factory equipment — are reporting accurately, without data loss.

The abstraction tax is real. You just have to know when to pay it, and when to build closer to the metal. Sometimes the old ways are the best ways. Or maybe they’re just different ways, with different tradeoffs. Either way, we found what works for us.

Read the full article here: https://medium.com/@chopra.kanta.73/udp-telemetry-firehose-when-rust-on-bare-metal-outperforms-cloud-by-10x-08352a0bfde6