Profiling Rust Made Easy: cargo-flamegraph, perf & Instruments

From JOHNWICK

Think of your Rust program like a busy playground. Some kids (functions) are calmly swinging; others are hogging the slide. Profiling is how you watch the playground to learn where time is really being spent — so you can fix the bottlenecks instead of guessing. Below is a practical guide that borrows the best tips from the Rust community forum, a hands‑on blog tutorial, and a short case study showing how one team cut CPU usage by ~70% after reading a flamegraph the right way.

What “profiling” actually means (ELI5)

  • Sampling profiling: Imagine taking a snapshot of the playground every few milliseconds. You’ll approximate which kids (functions) were on the slide most often (= where CPU time goes). It’s fast and usually the best default. Tools: Linux perf, macOS Instruments, Windows WPA (Windows Performance Analyzer).
  • Counting/instrumentation: You film everything, counting every slide (every call). That’s very detailed but much slower (often ~10× overhead), so use it when you truly need that precision (e.g., with callgrind).
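
The two styles look quite different on the command line. A sketch, assuming Linux, a release binary at ./target/release/mybin, and Valgrind installed for the counting case:

```shell
# Sampling: snapshot call stacks many times a second; low overhead
perf record --call-graph dwarf ./target/release/mybin
perf report

# Counting: simulate the program and count every call; expect ~10x slowdown
valgrind --tool=callgrind ./target/release/mybin
callgrind_annotate callgrind.out.*
```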

TL;DR: Start with sampling unless you have a strong reason not to.

The one tool that makes life easy: cargo flamegraph

If you want a single command that “just works” on Rust projects, start here. cargo-flamegraph wraps the platform profilers (Linux perf, macOS DTrace/Instruments) and spits out an interactive SVG flame graph showing where your program spends time.

Install & run

1) Install cargo-flamegraph (and make sure perf/DTrace is available on your OS): cargo install flamegraph
2) Profile your default target: cargo flamegraph
3) Pass args to your program: cargo flamegraph -- arg1 arg2
4) Pick a specific binary and a custom output file: cargo flamegraph -b mybin -o my-output.svg -- arg1 arg2
5) Adjust the sampling frequency (e.g., 1997 Hz) if you need more/fewer samples: cargo flamegraph -F 1997 -- arg1 arg2

You’ll get flamegraph.svg. Open it (Firefox works nicely) and click to zoom into hot stacks. Wider boxes = more time; taller stacks = deeper call chains.

Make your profiles correct

  • Build with debug info (even in release).
Optimized builds can “erase” symbol names and line info. Add this to Cargo.toml so release builds still have symbols:

    # Cargo.toml
    [profile.release]
    debug = true

Or set it ad‑hoc: CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph.

  • Capture system calls (I/O, networking).
If a lot of time is inside the kernel, run with elevated privileges so the profiler can sample inside syscalls:

    cargo flamegraph --root -- arg1 arg2

Otherwise you may miss where “the time actually went.”

  • Collect enough samples (but not too many).
As a rule of thumb, ~1k samples is often “enough to see the big rocks.” Make the workload run longer or change the sampling rate with -F to tune this. Beware “lockstep” sampling rates (e.g., a neat 100 Hz) that align with your app’s periodic work — try a quirky frequency (1997) to avoid bias.
  • Tiny programs need a loop.
“Hello, world!” profiles are mostly startup/syscalls. Wrap the interesting code in a loop so the profiler measures your code, not the loader.
  • Prefer DWARF call graphs when in doubt.
On Linux, perf can use different call‑graph modes. DWARF is a safe default (works on more hardware/deep stacks); LBR (Last Branch Record) is very low‑overhead but limited to shallow stacks and newer CPUs. Start with DWARF; try LBR later and compare.
  • If names look mangled, it could be your profiler build.
Some perf versions have regressions that garble symbols; downgrading fixed it in one real‑world case. If your stacks lack function names, verify your perf version.
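
The “wrap it in a loop” advice looks like this in practice. A minimal sketch (the function and the iteration counts are invented), using std::hint::black_box so the optimizer cannot delete the repeated work as unused:

```rust
use std::hint::black_box;

// Hypothetical hot function: sum of squares of 0..n.
fn sum_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

fn main() {
    let mut total = 0u64;
    // Repeat the interesting work so samples land in *your* code,
    // not in process startup; black_box keeps the compiler from
    // hoisting or deleting the calls.
    for _ in 0..10_000 {
        total = total.wrapping_add(sum_squares(black_box(1_000)));
    }
    println!("total = {total}");
}
```

Build it with cargo build --release (with debug = true as above) and point cargo flamegraph at it; the samples should now land in sum_squares rather than in the loader.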

Reading a flamegraph without getting lost

  • Width is time share. Look for the widest boxes near the top; they’re your first candidates.
  • Zoom in: click the wide area; follow the stack down to see why that function is hot.
  • Ignore the noise: allocators, runtime, and async executors can be visible; focus on your own crate first.
  • Open the SVG in a browser and explore — don’t skim it like a static picture.
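
Wide boxes often point at innocent-looking iterator chains. A hypothetical before/after (all names invented) of the kind of simplification a flamegraph can motivate: the slow version allocates a String per item plus an intermediate Vec it never needed.

```rust
// Before: lowercases every word (one String allocation each) and
// collects into a Vec, just to sum lengths afterwards.
fn total_len_slow(words: &[String]) -> usize {
    words
        .iter()
        .map(|w| w.to_lowercase())
        .collect::<Vec<_>>()
        .iter()
        .map(|w| w.len())
        .sum()
}

// After: one pass, zero allocations; same answer for ASCII input,
// since ASCII lowercasing preserves length.
fn total_len_fast(words: &[String]) -> usize {
    words.iter().map(|w| w.len()).sum()
}
```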

A tiny real‑world story: the iterator that ate 70% of our CPU

A team running a Rust microservice saw stubbornly high CPU without a clear bottleneck — until they looked at a flamegraph. It highlighted an iterator chain that did far more work than expected. A small change (simplifying the chain) dropped CPU usage by ~70%. Moral: let the profile lead you to the surprising hotspot; it’s often hiding in “perfectly fine‑looking” code.

OS‑specific pointers

  • Linux: Use perf directly. For deep stacks try --call-graph=dwarf; LBR is faster but shallower. Then perf report to inspect the results.
  • macOS: Xcode Instruments (Time Profiler) provides a good sampling view. cargo-flamegraph can also wrap DTrace/ Instruments on macOS.
  • Windows: Windows Performance Analyzer (WPA) is the native sampling profiler; it can reveal CPU hot paths in Rust binaries just fine.
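
On Linux, the manual round trip looks roughly like this. A sketch: the last line assumes Brendan Gregg's FlameGraph scripts are on your PATH, a step cargo flamegraph normally does for you:

```shell
# Sample at a non-round frequency with DWARF call graphs
perf record -F 1997 --call-graph dwarf ./target/release/mybin arg1

# Inspect in the interactive text UI...
perf report

# ...or render an SVG by hand
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
```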

If you prefer a conceptual walkthrough, Vitaly Bragilevsky’s short talk is a helpful primer on what profiling measures and why it matters for time and memory. (YouTube)

Copy‑paste checklist

  • Build with symbols (see [profile.release] debug = true).
  • Run cargo flamegraph on a representative workload; pass --root if you suspect I/O/kernel time.
  • Ensure ≥ ~1k samples (adjust -F or make the workload run longer).
  • Open the SVG and chase the widest stacks first.
  • Fix one hotspot, re‑run the same workload, and compare flamegraphs.
  • If stacks look shallow or truncated on Linux, try DWARF call graphs; compare with LBR if available.
  • Only if you need exact counts, switch to Callgrind — and expect ~10× overhead.

Common pitfalls

  • “The flamegraph shows only system stuff!” → Add a loop or a bigger input to spend time in your code; capture syscalls with --root.
  • “Function names are missing.” → Make sure release builds include debug symbols; check for known perf regressions.
  • “Sampling rate feels off.” → Try a non-round -F value (e.g., 1997) to avoid “lockstep” bias.

Why this works

The Rust forum consensus is clear: hardware‑assisted sampling is a great default because it’s accurate enough to guide your effort while keeping overhead low. When you do need full counts, expect much higher overhead and measure carefully. Combining that with cargo-flamegraph’s convenience gives you fast iterations and fewer blind guesses.