
Inside Rust’s Codegen Units: How Parallel Compilation Actually Happens

From JOHNWICK
Revision as of 00:56, 22 November 2025 by PC (talk | contribs)

If you’ve ever stared at your terminal wondering why your Rust build takes forever, you’re not alone.

At some point, every Rust dev goes through the same five stages of grief:

  • cargo build
  • Wait.
  • Wait more.
  • Question life choices.
  • Google “why is Rust compilation so slow”.

But under all that waiting, something pretty fascinating is happening. The Rust compiler isn’t just compiling your crate — it’s splitting it into tiny self-contained “codegen units” and compiling them in parallel.

This mechanism, called Codegen Units (CGUs), is one of the most underappreciated yet critical optimizations in Rust’s compiler pipeline. It’s the reason why multicore CPUs actually matter when you hit build.

So let’s open up the hood on rustc’s code generation engine and see how parallel compilation actually happens.

What Are Codegen Units?

Think of a crate (Rust's unit of compilation) as a big Lego castle. You could try building it all in one go — one giant, monolithic compilation unit — but that would take ages.

Instead, Rust breaks the crate into smaller Lego sections, each compiled separately. These are your Codegen Units (CGUs).

Each CGU represents a chunk of your program’s intermediate representation (MIR/LLVM IR), containing functions, generics, and data structures grouped together by the compiler.

Once all CGUs are generated, they can be compiled in parallel by multiple threads. Then the results are linked back into the final binary or library.

Here’s a simplified diagram of the flow:

          +-------------------------+
          |        Your Crate       |
          +-------------------------+
                      |
                      v
        +----------------------------+
        |  rustc splits into CGUs    |
        +----------------------------+
          |          |          |
          v          v          v
      CGU_1       CGU_2       CGU_3
          |          |          |
          +----------+----------+
                      |
                      v
             Final Linked Binary

Why Parallel Compilation Exists

Rust’s compilation model is heavier than most languages — thanks to monomorphization (where every generic function is compiled for each type it’s used with), borrow checking, and heavy LLVM optimizations.

Without parallelism, even small projects would compile like a glacier sliding uphill. That’s where codegen-units comes in. 
You can tweak this with:

RUSTFLAGS="-C codegen-units=8" cargo build --release

This tells the compiler:

“Split my crate into 8 parallel chunks and compile them simultaneously.” It’s a balance — more units = faster compilation, but slightly worse runtime performance. 
Why? Because inlining and cross-unit optimizations become harder when the compiler doesn’t see the whole crate at once.
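The same knob also lives in Cargo.toml, which is the more common place to set it per build profile (the value here is illustrative):

```toml
[profile.release]
codegen-units = 8   # split release codegen into 8 parallel units
```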

The Tradeoff: Speed vs. Optimization

Rust’s LLVM backend is great at optimizing code — but only when it sees everything.
When we split the code into many CGUs, LLVM can’t always inline or optimize across boundaries.

Here’s a simple example:

// main.rs
mod utils;

fn main() {
    let x = 10;
    println!("Result = {}", utils::double(x));
}

// utils.rs
pub fn double(x: i32) -> i32 {
    x * 2
}

If main and utils end up in different codegen units, the compiler might not inline double() into main(). 
That means slightly worse performance — but a faster build.

However, if both are in the same CGU (like when you reduce -C codegen-units to 1), LLVM can inline across the entire crate, generating tighter, faster machine code.

This is why performance-sensitive projects usually add this to their Cargo.toml:

[profile.release]
codegen-units = 1

Note that cargo build --release alone doesn't do this (the release profile defaults to 16 codegen units); you opt into a single unit explicitly when you prefer runtime performance over compilation speed.

Under the Hood: The Compiler Architecture

Here’s roughly what happens inside rustc when you compile your crate:

  ┌─────────────────────────────┐
  │        Parsing (AST)        │
  └──────────────┬──────────────┘
                 │
                 ▼
  ┌─────────────────────────────┐
  │     MIR (Mid-Level IR)      │
  └──────────────┬──────────────┘
                 │
                 ▼
  ┌─────────────────────────────┐
  │   Codegen Units Creation    │  ← split into N units
  └──────────────┬──────────────┘
     ┌───────────┼───────────┐
     ▼           ▼           ▼
 CGU 1       CGU 2       CGU 3     ← each compiled by LLVM thread
     └───────────┬───────────┘
                 ▼
       Final Link + LTO (if enabled)

The Codegen Backend uses a thread pool to distribute these CGUs across CPU cores. 
Each thread invokes LLVM on one unit, and once all are done, the linker merges the object files (.o) into the final binary.

Code Flow Example

Let’s look at an example with real compiler flags:

cargo rustc -- -C codegen-units=4 --emit=llvm-ir

This will:

  • Split the crate into 4 CGUs.
  • Emit one .ll (LLVM IR) file per unit, which you can inspect.

You’ll see files like:

target/debug/deps/

 my_crate-1234abc.0.ll
 my_crate-1234abc.1.ll
 my_crate-1234abc.2.ll
 my_crate-1234abc.3.ll

Each one represents a separate codegen unit, compiled in parallel. If you open one, you’ll see IR like this:

; Function Attrs: nounwind
define internal i32 @"utils::double"(i32 %x) unnamed_addr {
entry:
  %mul = mul i32 %x, 2
  ret i32 %mul
}

(Real rustc output mangles the symbol name; this is simplified for readability.)

Multiple such IR files are fed into LLVM, then merged by the linker to produce your final executable.

Advanced: How LTO (Link-Time Optimization) Fits In

If you enable LTO, Rust essentially re-merges all CGUs at the link stage — letting LLVM re-optimize across the entire crate (and dependencies). LTO is a profile setting in Cargo.toml rather than a Cargo feature flag:

[profile.release]
lto = true

That’s why LTO builds are slower — they undo some of the parallel benefit, but produce highly optimized machine code.

There’s also ThinLTO, which keeps some of that parallel structure but allows selective cross-unit optimization — the best of both worlds.
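In Cargo.toml, the difference between the two is one line (illustrative profile):

```toml
[profile.release]
lto = "thin"   # ThinLTO: keeps parallel codegen, adds cross-unit optimization
# lto = true   # "fat" LTO: maximal whole-program optimization, slowest builds
```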

Architecture Diagram

                   ┌────────────────┐
                   │ rustc Frontend │
                   └───────┬────────┘
                           │
              ┌────────────┴────────────┐
              │        MIR Passes       │
              └────────────┬────────────┘
                           │
           ┌───────────────┴───────────────┐
           │     Codegen Unit Splitter     │
           └───────────────┬───────────────┘
                   ┌───────┴───────┐
                   │               │
               LLVM Thread 1   LLVM Thread 2
                   │               │
                   └───────┬───────┘
                           │
                      Final Linker
                           │
                           ▼
                     Optimized Binary

This design gives Rust the flexibility to:

  • Compile large crates faster on multi-core CPUs
  • Tune performance by adjusting CGU counts
  • Integrate with LTO and ThinLTO seamlessly

Real-World Lessons

After years of working in large Rust monorepos, here’s what dev teams learned:

  • For dev builds: Use higher CGU counts (8–16). It keeps iteration times manageable.
  • For release builds: Stick with 1 CGU for full optimization.
  • For CI/CD pipelines: Combine CGUs with incremental compilation — that’s where the real speedup comes.
  • For large dependency trees: Be careful; cross-crate inlining can still be the bottleneck.
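One way to encode those lessons in Cargo.toml — the values are illustrative, not a universal recommendation:

```toml
[profile.dev]
incremental = true      # default for dev; pairs well with many codegen units

[profile.release]
codegen-units = 1       # full cross-unit optimization
lto = "thin"            # recover cross-crate inlining at modest build cost
```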

The Emotional Bit: Why It Matters

What makes Rust special isn’t just the safety or performance — it’s the way it respects your machine. 
While many modern languages hide the compiler’s work behind abstraction, Rust exposes the knobs — it lets you tweak, measure, and truly understand what’s happening under the hood.

Codegen Units are a perfect example of that philosophy: a transparent, tunable performance lever, made for developers who care deeply about their craft.

So the next time your build spins up all your CPU cores, just smile — you’re watching a symphony of compiler threads orchestrating your code into machine perfection.

Key Takeaways

  • Codegen Units split crates into smaller compilation chunks for parallelism.
  • You can control them with -C codegen-units=N.
  • More units = faster compile times, slightly less optimization.
  • LTO merges units for better performance at the cost of build time.
  • Understanding CGUs helps you tune your build system intelligently.

Read the full article here: https://medium.com/@theopinionatedev/inside-rusts-codegen-units-how-parallel-compilation-actually-happens-c3ea5be53191