Profiling Rust Async Tasks Until They Stopped Misbehaving (Flamegraphs Inside)

Profiling async Rust code requires detective work to uncover hidden performance bottlenecks in concurrent task execution

Our production API was… dying, slowly. Not like a big crash, but a slow suffocation. Response times that used to be 50ms? Now creeping past 2 seconds. And the annoying part? All the dashboards lied to me. CPU looked fine, memory wasn’t ballooning, database queries were snappy. Everything looked normal. Except it wasn’t.

Turned out, the whole mess came down to a single async task that quietly decided to hog the tokio runtime. One task monopolizing the treadmill while thousands of others just stood waiting. It took flamegraphs, tokio-console, and a lot of late-night detective work before I finally cornered it. And once I did… well, it changed how I think about async profiling forever.

The Async Performance Mystery That Stumps Everyone

Async Rust performance problems are tricky — uniquely cruel even. Because they don’t scream at you the way normal bottlenecks do. Your CPU isn’t pegged. Memory’s not out of control. Compiler isn’t complaining. Yet your app crawls like it’s moving through molasses. Why? Because async tasks aren’t threads. And most traditional profilers don’t get that — they miss how tasks are scheduled, yielded, resumed. Which is exactly where the monsters hide.

The Four Horsemen of Async Performance Hell

Through trial by fire, I met four very specific async villains:

  • Greedy Tasks: hog the runtime, never yield.
  • Blocking Bandits: sync calls pretending to be async.
  • Future Factories: combinators that blow up into exponential complexity.
  • Scheduler Stranglers: tasks that mess with the runtime itself.

Each needed different weapons to hunt down.

Building Your Async Profiling Arsenal

Standard profilers like perf? Good for threads, useless for tasks. For async you need sharper tools:

Essential Tool #1: Flamegraph with Async Context

The cargo-flamegraph tool combined with async-aware sampling gives you the big picture:

[dependencies]
tokio = { version = "1.0", features = ["full", "tracing"] }
tracing = "0.1"
tracing-flame = "0.2"
tracing-subscriber = "0.3"
cargo-flamegraph supplies the CPU samples; tracing-flame records span enter/exit events as folded stacks. Together they produce flamegraphs that actually make sense for async code.

//! Async flamegraph setup — simple, readable, does one job.

use std::{fs::File, io::BufWriter};
use tracing_flame::{FlameLayer, FlushGuard};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

#[tokio::main]
async fn main() {
    let _guard = init_profiling("./flamegraph.folded"); // keep this alive until exit so spans get flushed
    run_application().await; // your real app entrypoint
}

fn init_profiling(path: &str) -> FlushGuard<BufWriter<File>> {
    // with_file returns the layer plus a guard that writes the folded stacks on drop
    let (flame_layer, guard) = FlameLayer::with_file(path).expect("couldn't create flame layer");
    tracing_subscriber::registry().with(flame_layer).init(); // one layer, done
    guard
}

async fn run_application() {
    // … your async app goes here
}

Essential Tool #2: Tokio Console for Live Debugging

Tokio Console is like htop for async tasks. It shows real-time task scheduling, yields, and resource usage:

#[tokio::main]
async fn main() {
    console_subscriber::init(); // start the console — live view of tasks, polls, yields

    run_application().await;    // your async app entrypoint
}

async fn run_application() {
    // app logic goes here — keep it boring, keep it clear
}
Add the console-subscriber crate to your dependencies, build with the tokio_unstable cfg (the console needs it to see task data), then run your application and connect with:

# build & run your app with tokio's task instrumentation enabled
RUSTFLAGS="--cfg tokio_unstable" cargo run --release

# install the console tool once (binary goes into ~/.cargo/bin)
cargo install --locked tokio-console

# then run it in a separate terminal — it connects to your app on the default port
tokio-console

The console reveals task states, poll counts, and scheduling delays that are invisible to traditional profilers.

Essential Tool #3: Custom Instrumentation

Sometimes you need surgical precision. Tracing spans let you instrument specific async operations:

use tracing::{info_span, Instrument};

async fn process_request(id: u64) -> Result<Response, Error> {
    // wrap the whole flow in a span so we can actually *see* this request in flamegraphs
    async move {
        let data = fetch_data(id).await?;               // step 1: pull the data
        let result = expensive_computation(data).await?; // step 2: crunch it (the heavy bit)
        store_result(result).await                      // step 3: persist outcome
    }
    // instrument attaches metadata: every poll of this future carries request_id with it
    .instrument(info_span!("process_request", request_id = id))
    .await
}

This creates focused flamegraphs showing exactly where time is spent within complex async workflows.
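
Child spans on the inner steps make that breakdown even finer. A minimal sketch of the same idea one level down (the query_database helper and Data type are illustrative, not from our codebase):

use tracing::{info_span, Instrument};

// Hypothetical inner step with its own child span: when awaited from inside the
// "process_request" span above, it shows up nested under it in the flamegraph.
async fn fetch_data(id: u64) -> Result<Data, Error> {
    query_database(id) // hypothetical async DB call returning Result<Data, Error>
        .instrument(info_span!("fetch_data", request_id = id))
        .await
}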

Case Study #1: The Greedy Task That Starved Everything

Our first production mystery involved API response times gradually degrading over several days. Traditional profiling showed even CPU distribution, but tokio-console revealed something alarming: one task had been running continuously for hours without yielding.

The Investigation

The flamegraph immediately showed the problem:

process_batch_job: 89.3% (4.2 seconds)
├─ deserialize_messages: 76.1% (3.6 seconds)  
│  └─ serde_json::from_str: 75.9% (3.58 seconds)
└─ validate_schema: 13.2% (0.62 seconds)

A batch processing task was deserializing thousands of large JSON messages in a tight loop, never yielding control back to the runtime. Other tasks were starving because this greedy task monopolized the executor thread.

The Fix: Cooperative Yielding


// Before: one greedy loop — hogs the runtime, never yields
async fn process_batch_greedy(messages: Vec<String>) -> Result<(), Error> {
    for message in messages {
        let data: MessageData = serde_json::from_str(&message)?; // parse JSON
        validate_and_store(data).await?;                         // do the work
        // notice: no yield anywhere → this task monopolizes the executor
    }
    Ok(())
}

// After: same work, but cooperative — steps aside every so often
async fn process_batch_cooperative(messages: Vec<String>) -> Result<(), Error> {
    for (i, message) in messages.iter().enumerate() {
        let data: MessageData = serde_json::from_str(&message)?;
        validate_and_store(data).await?;
        
        // nudge: give the scheduler a chance every 100 iterations
        if i % 100 == 0 {
            tokio::task::yield_now().await;
        }
    }
    Ok(())
}

This simple change dropped our P99 response times from 2000ms to 80ms. The lesson: async tasks must be good citizens and yield control voluntarily.
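
An alternative worth knowing, sketched here rather than something we shipped: when the loop is dominated by CPU-bound parsing, hand the whole parse step to the blocking pool so the async workers never see it, and keep only the genuinely async part on the runtime. This reuses MessageData, Error, and validate_and_store from the snippets above, and assumes Error: From<serde_json::Error> as they do.

// Sketch: offload the CPU-bound parsing wholesale instead of sprinkling yields
async fn process_batch_offloaded(messages: Vec<String>) -> Result<(), Error> {
    let parsed = tokio::task::spawn_blocking(move || {
        messages
            .iter()
            .map(|m| serde_json::from_str::<MessageData>(m))
            .collect::<Result<Vec<_>, _>>()
    })
    .await
    .expect("parse task panicked")?; // expect: task join, ?: serde error

    for data in parsed {
        validate_and_store(data).await?; // the async part stays cooperative
    }
    Ok(())
}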

Case Study #2: The Blocking Bandit in Async Clothing

The second mystery was more subtle. Our file processing service would randomly freeze for 5–10 seconds, then resume normal operation. Flamegraphs showed nothing during the freeze periods — like the application had disappeared.

The Hidden Synchronous Call

Tokio Console revealed the smoking gun: tasks were entering “blocking” state and never returning. The culprit was a seemingly innocent async function:
// Pretends to be async... but nope, it blocks the entire executor
async fn process_file_async(path: &Path) -> Result<ProcessedData, Error> {
    let content = std::fs::read_to_string(path)?; // 🚨 blocks thread, starves all other tasks
    let processed = expensive_cpu_work(content).await?; // this part is fine, but never gets a fair shot
    Ok(processed)
}
That std::fs::read_to_string call was blocking the entire executor thread, freezing all other tasks until the file I/O completed.

The Fix: True Async I/O

// Fixed: true async I/O + offload heavy CPU work properly
async fn process_file_truly_async(path: &Path) -> Result<ProcessedData, Error> {
    // ✅ now file reads don’t freeze the whole runtime
    let content = tokio::fs::read_to_string(path).await?;

    // push expensive CPU work onto a dedicated blocking thread
    // (expensive_cpu_work is the plain synchronous version here; CPU-bound code has no
    //  business being async. The first ? handles the join, the second the work's own Result.)
    let processed = tokio::task::spawn_blocking(move || {
        expensive_cpu_work(content)
    }).await??;

    Ok(processed) // everyone else keeps running while we do this
}

The key insight: mixing sync and async is dangerous. Use tokio::fs for I/O and spawn_blocking for CPU-intensive work.
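
A pattern that helps enforce that boundary, as a minimal sketch: wrap any sync-only call in a tiny async facade once, so the rest of the codebase never runs it on an executor thread. (This is essentially how tokio::fs is built internally; the same shape works for any blocking library call.)

use std::path::PathBuf;

// Sketch: one async facade at the boundary — the sync call lives on the blocking
// pool, and callers only ever see an async fn.
async fn read_to_string_async(path: PathBuf) -> std::io::Result<String> {
    tokio::task::spawn_blocking(move || std::fs::read_to_string(path))
        .await
        .expect("blocking read task panicked")
}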

Case Study #3: The Future Factory Explosion

Our third production issue was the most insidious. A service that processed user uploads worked fine in testing but degraded exponentially under load. The flamegraph revealed a shocking pattern:


handle_uploads: 94.7% (12.8 seconds)
├─ FuturesUnordered::poll: 89.2% (12.1 seconds)
│  ├─ task_wakeup: 45.6% (6.2 seconds)
│  ├─ future_ready_check: 31.8% (4.3 seconds)  
│  └─ yield_to_scheduler: 11.8% (1.6 seconds)
└─ actual_work: 5.5% (0.7 seconds)
Most time was spent in FuturesUnordered overhead, not actual work!

The Quadratic Complexity Trap

Under load, FuturesUnordered caused execution time to grow roughly quadratically compared with expressing the same work by spawning tasks directly with Tokio. Our code was creating thousands of futures and collecting them into an unordered stream:


// Before: neat on paper, brutal in practice — O(n²) with FuturesUnordered
use futures::stream::{FuturesUnordered, StreamExt};

async fn process_uploads_slow(files: Vec<UploadFile>) -> Vec<Result<(), Error>> {
    // collects everything into FuturesUnordered…
    // looks elegant, but under load the polling overhead explodes
    let futures: FuturesUnordered<_> = files
        .into_iter()
        .map(|file| process_single_upload(file))
        .collect();
        
    futures.collect().await // 👈 quadratic complexity hiding here
}

With 10,000 uploads this added up to roughly O(n²) polling overhead: under load, the single task driving the FuturesUnordered spent its time re-polling a large set of pending futures instead of making progress on any of them.

The Fix: Direct Task Spawning

// After: straightforward spawning → O(n) instead of O(n²)
async fn process_uploads_fast(files: Vec<UploadFile>) -> Vec<Result<(), Error>> {
    // spin up a task per file — no combinator overhead, just raw concurrency
    let handles: Vec<_> = files
        .into_iter()
        .map(|file| tokio::spawn(process_single_upload(file)))
        .collect();
    
    // wait on all tasks at once
    let results = futures::future::join_all(handles).await;
    
    // unwrap each handle; convert panics into proper errors
    results.into_iter()
        .map(|handle_result| handle_result.unwrap_or_else(|e| Err(e.into())))
        .collect()
}

This change reduced processing time from 13 seconds to 200ms for 10,000 uploads. The lesson: be suspicious of future combinators under high load.
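
One refinement on top of direct spawning, sketched here rather than something we shipped: if launching tens of thousands of tasks at once is a concern, a tokio::sync::Semaphore caps how many run concurrently while keeping the O(n) shape. The limit of 256 is purely illustrative, and the same Error: From<JoinError> assumption applies as above.

use std::sync::Arc;
use tokio::sync::Semaphore;

// Sketch: direct spawning with a concurrency cap
async fn process_uploads_bounded(files: Vec<UploadFile>) -> Vec<Result<(), Error>> {
    let permits = Arc::new(Semaphore::new(256));

    let handles: Vec<_> = files
        .into_iter()
        .map(|file| {
            let permits = Arc::clone(&permits);
            tokio::spawn(async move {
                // each task waits for a permit before touching real work
                let _permit = permits.acquire_owned().await.expect("semaphore never closed");
                process_single_upload(file).await
            })
        })
        .collect();

    // wait on all tasks, then unwrap handles exactly like the version above
    futures::future::join_all(handles)
        .await
        .into_iter()
        .map(|res| res.unwrap_or_else(|e| Err(e.into())))
        .collect()
}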

Case Study #4: The Scheduler Strangler

This one was surreal. Service worked fine for days, then just… froze. No errors. Console showed tasks stuck in “notified” forever.

The culprit? A shutdown function calling shutdown_timeout from inside the runtime. Deadlock. Fix was external coordination via signals. Never let tasks kill their own runtime.

The Runtime Manipulation Gone Wrong

The tokio-console output revealed tasks stuck in “notified” state forever. The issue was runtime manipulation during shutdown:

use std::time::Duration;
use tokio::runtime::Runtime;

// Before: dangerous — the task tries to shut down the executor it’s running on
// (the owned Runtime has to be moved into the task for this call to even compile;
//  Handle exposes no shutdown method, only the owned Runtime does)
async fn buggy_graceful_shutdown(runtime: Runtime) {
    tokio::time::sleep(Duration::from_secs(5)).await; // wait, then pull the plug…

    // ⚠️ deadlock bait: asking the runtime to shut down *from within* one of its own tasks.
    // The executor needs this task to make progress, but the task is killing the executor.
    runtime.shutdown_timeout(Duration::from_secs(10));
}

The task was trying to shut down its own runtime, creating an impossible condition where the executor needed to execute the shutdown task to shut itself down.

The Fix: External Coordination

// Safe: external coordination for shutdown — don’t ask the runtime to end itself
#[tokio::main]
async fn main() {
    let (shutdown_tx, shutdown_rx) = tokio::sync::oneshot::channel();

    // either the app finishes, or we catch a signal and tell everything to wrap up
    tokio::select! {
        _ = run_application() => {},
        _ = wait_for_shutdown_signal() => {
            let _ = shutdown_tx.send(()); // nudge: time to shut down
        }
    }

    // you could pass `shutdown_rx` into your app to coordinate graceful teardown
    let _ = shutdown_rx; // kept here to show intent without over-wiring
}

async fn run_application() {
    // application logic lives here
}

async fn wait_for_shutdown_signal() {
    tokio::signal::ctrl_c().await.unwrap(); // ^C → close the curtains
}

The key principle: never manipulate the runtime from within async tasks.
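
For contrast, a minimal sketch of where shutdown_timeout is safe: in a plain fn main that owns the runtime, called only after block_on has returned (reusing run_application and wait_for_shutdown_signal from above):

use std::time::Duration;
use tokio::runtime::Runtime;

fn main() {
    let runtime = Runtime::new().expect("failed to build runtime");

    // all async work happens inside block_on…
    runtime.block_on(async {
        tokio::select! {
            _ = run_application() => {},
            _ = wait_for_shutdown_signal() => {},
        }
    });

    // …and only back here, outside every task, is it safe to shut the runtime down
    runtime.shutdown_timeout(Duration::from_secs(10));
}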

Advanced Profiling Techniques That Reveal Everything

After solving these four major categories of issues, I developed a systematic approach to async profiling that catches problems before they reach production.

Technique #1: Differential Flamegraphs

Compare flamegraphs under different loads to identify scalability issues:

  1. Low-load baseline — a quick pulse check

cargo flamegraph --bin myapp -- --requests=100

  2. High-load pass — turn the pressure up and see what bends

cargo flamegraph --bin myapp -- --requests=10000

  3. Put the two SVGs side by side and stare at what *grows* (hot paths that swell under load).
  4. The culprits are the stacks that balloon between runs, not just the tallest bars.

Differences reveal which code paths have non-linear complexity, like our FuturesUnordered issue.

Technique #2: Task Scheduling Analysis

Use tokio-console’s histogram view to identify scheduling anomalies:

use tokio::time::{Duration, Instant};

async fn monitor_task_scheduling() {
    loop {
        // mark the moment right before yielding
        let start = Instant::now();

        // step aside → give the scheduler a chance to reschedule us
        tokio::task::yield_now().await;

        // measure how long it actually took to get polled again
        let schedule_time = start.elapsed();

        // flag if the gap was too big (runtime contention, maybe a greedy task nearby)
        if schedule_time > Duration::from_millis(10) {
            eprintln!("⚠️ slow scheduling detected: {:?}", schedule_time);
        }
    }
}
Tasks that take >10ms to be rescheduled indicate executor contention.
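
To put the monitor to work, spawn it next to the real workload so it samples continuously; a minimal sketch (reusing run_application from earlier):

#[tokio::main]
async fn main() {
    // run the scheduling monitor as a background task alongside the real app
    tokio::spawn(monitor_task_scheduling());

    run_application().await;
}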

Technique #3: Memory-Aware Profiling

Async tasks can leak memory in subtle ways. Use #[global_allocator] tracking:

use tikv_jemalloc_ctl::{epoch, stats};
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

#[tokio::main]
async fn main() {
    // sample jemalloc's allocation stats in the background
    // (needs the tikv-jemalloc-ctl crate alongside tikv-jemallocator)
    tokio::spawn(async {
        let epoch_mib = epoch::mib().expect("epoch mib");
        let allocated_mib = stats::allocated::mib().expect("stats.allocated mib");
        loop {
            epoch_mib.advance().expect("refresh stats"); // roll jemalloc's cached stats forward
            let bytes = allocated_mib.read().expect("read allocated");
            eprintln!("heap allocated: {} bytes", bytes); // correlate these samples with your flamegraphs
            tokio::time::sleep(std::time::Duration::from_secs(5)).await;
        }
    });

    // run the app while the sampler records allocation patterns
    run_application().await;
}

Combined with flamegraphs, this reveals memory allocation patterns that correlate with performance issues.

Building a Proactive Profiling Pipeline

Don’t wait for production issues. Build profiling into your development workflow:

Continuous Performance Testing


#[cfg(test)]
mod perf_tests {
    use super::*;
    use std::time::{Duration, Instant};

    #[tokio::test]
    async fn test_request_processing_performance() {
        // stopwatch starts — how fast can we chew through 1k requests?
        let start = Instant::now();

        let requests = generate_test_requests(1000);
        process_requests(requests).await;

        let duration = start.elapsed();

        // guardrail: if we’re over half a second, something regressed
        assert!(
            duration < Duration::from_millis(500),
            "⚠️ too slow: {:?}", duration
        );
    }
}

Automated Flamegraph Generation


#!/bin/bash
# performance-test.sh — quick performance smoke + flamegraph generator

echo "Running performance tests..."

# run the app in test mode under cargo flamegraph; build/run output is logged to perf.log
cargo flamegraph --root --bin myapp -- --test-mode > perf.log 2>&1
if [ $? -eq 0 ]; then
    echo "✅ Flamegraph generated: flamegraph.svg"
    # hook: push to S3 / monitoring / CI artifact store
else  
    echo "❌ Performance test failed!"
    exit 1
fi
Regression Detection

Track key metrics over time:

use std::time::{Duration, Instant};

struct PerformanceMetrics {
    p50_latency: Duration,
    p99_latency: Duration,
    task_yield_rate: f64,
    memory_usage: usize,
}

async fn benchmark_and_record() -> PerformanceMetrics {
    // standardized run: same input set, same harness — apples to apples each time
    let start = Instant::now();

    // TODO: call your real workload here (e.g. process_requests / run_batch / etc.)
    run_benchmark_workload().await;

    let total = start.elapsed();

    // for illustration: fill with fake numbers until wired to actual measurements
    PerformanceMetrics {
        p50_latency: Duration::from_millis(total.as_millis() as u64 / 2), // placeholder
        p99_latency: total,                                               // placeholder
        task_yield_rate: 0.92,                                            // measured via tracing or console
        memory_usage: 128 * 1024 * 1024,                                  // pretend 128 MB
    }

    // then push to your TSDB of choice (Prometheus, Influx, custom logs, etc.)
}
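
The comparison step itself can stay simple. A sketch against a stored baseline (the 20% and 10% thresholds are illustrative, not tuned):

// Sketch: flag regressions against a saved baseline run
fn detect_regressions(baseline: &PerformanceMetrics, current: &PerformanceMetrics) -> Vec<String> {
    let mut findings = Vec::new();

    // p99 latency creeping more than 20% past the baseline
    if current.p99_latency.as_secs_f64() > baseline.p99_latency.as_secs_f64() * 1.2 {
        findings.push(format!(
            "p99 latency regressed: {:?} -> {:?}",
            baseline.p99_latency, current.p99_latency
        ));
    }

    // tasks yielding noticeably less often: the early warning sign of a greedy task
    if current.task_yield_rate < baseline.task_yield_rate * 0.9 {
        findings.push(format!(
            "yield rate dropped: {:.2} -> {:.2}",
            baseline.task_yield_rate, current.task_yield_rate
        ));
    }

    findings
}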

Advanced Techniques

  • Differential flamegraphs: compare low load vs high load, see where complexity blows up.
  • Task scheduling analysis: yield, measure rescheduling time, catch executor contention.
  • Memory-aware profiling: jemalloc + flamegraphs = catch async leaks before they drown you.

The Async Profiling Mindset

It’s not about threads anymore. It’s about tasks. And cooperation. And watching the scheduler like it’s a mischievous kid.

  • Think tasks, not threads.
  • Measure yield rates, not just CPU.
  • Use multiple tools, they see different layers.
  • Profile early — async issues get nastier with scale.

The Victory

Six months of this, and our incidents dropped to almost zero. Services handle 10x the load. But honestly, the real win wasn’t the fixes — it was the way of thinking about async. A kind of detective mindset. And now, every time latency starts creeping, I reach for flamegraphs instead of aspirin.

Read the full article here: https://ritik-chopra28.medium.com/profiling-rust-async-tasks-until-they-stopped-misbehaving-flamegraphs-inside-437101549079