When Rust Won’t Vectorize: How to See Why, Prove Whose Fault It Is (rustc vs LLVM), and Fix It (x86 and AArch64)


You’ve got a hot loop that blazes on x86_64 but stubbornly refuses to vectorize on aarch64. You peek at LLVM IR, you squint at Godbolt, you try a few tweaks…and still no SIMD on Apple M-series or modern ARM servers. Let’s solve this properly:

  • What’s happening? Your loop shape asks for gathers from src + an in-place RMW store to latents. x86 can often paper over this with AVX2 gathers and tolerant alias analysis. Typical aarch64/NEON doesn’t have general gathers, and LLVM becomes conservative if it can’t prove the store to latents doesn’t alias src.
  • Who vectorizes? rustc lowers to LLVM IR; LLVM does the vectorization (Loop & SLP vectorizers). We’ll turn on optimization remarks so you can see exactly why a loop did — or did not — vectorize.
  • How to fix? Three pragmatic paths:
      1. Use the right target features (don't enable x86 flags on ARM), set a real target-cpu, and give LLVM a loop form it likes.
      2. Eliminate the aliasing barrier without changing semantics (tile into a tiny stack buffer, then add in place; fast and portable).
      3. If you want maximum ARM speed, write a small aarch64-specialized kernel (NEON/SVE2) or restructure as a streaming bit-reader.

Below, we’ll (1) instrument your build to prove what LLVM thinks, (2) show minimal, concrete code changes that usually unlock SIMD, and (3) offer an “optimal” solution you can ship today.


1) First, stop guessing: turn on LLVM's optimization remarks

rustc doesn't vectorize; LLVM does. Ask LLVM to tell you what it tried and why it bailed.

# Release build, emit remarks next to your artifacts.
RUSTFLAGS='-Copt-level=3
  -Cllvm-args=-pass-remarks=loop-vectorize
  -Cllvm-args=-pass-remarks-missed=loop-vectorize
  -Cllvm-args=-pass-remarks-analysis=loop-vectorize
  -Cllvm-args=-pass-remarks=slp-vectorizer
  -Cllvm-args=-pass-remarks-missed=slp-vectorizer
  -Cllvm-args=-pass-remarks-analysis=slp-vectorizer' \
  cargo build --release

Look for messages like:

  • loop-vectorize: vectorized loop (vector width: ...)
  • loop not vectorized: could not prove memory dependence
  • loop not vectorized: requires gather/scatter not available on target

You can also dump IR:

RUSTFLAGS='--emit=llvm-ir -Copt-level=3' cargo build --release
# grep for 'vector.body' in the .ll files

A vector.body block = loop vectorized. No vector.body + remarks saying “potential dependence” or “no legal vector form” = your smoking gun. Tip: cargo asm --rust --demangle your_crate::path::decompress_offsets makes the assembly inspection friendly.
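
With the default cargo layout, the emitted .ll files usually land next to the build artifacts; an example of where to look (paths are an assumption and vary with your target directory and crate name):

# IR is typically written to target/release/deps/<crate>-<hash>.ll
grep -l 'vector.body' target/release/deps/*.ll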


2) Don't pass x86 features to aarch64, and do set a real target-cpu

On Godbolt you mentioned target-feature=+bmi1,+bmi2,+avx2 for aarch64; those are x86-only features. On ARM they are ignored at best, or they poison the cost model. Use one of:

# ARM servers (example)
RUSTFLAGS='-Ctarget-cpu=cortex-a710 -Copt-level=3' cargo build --release
# Apple Silicon
RUSTFLAGS='-Ctarget-cpu=apple-m2 -Copt-level=3' cargo build --release
# As a blunt-but-useful default when building on the host
RUSTFLAGS='-Ctarget-cpu=native -Copt-level=3' cargo build --release

If you do have SVE/SVE2 available, compile for it explicitly (runtime detection is possible with is_aarch64_feature_detected! for gated fast paths).
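
For example, a minimal runtime-gated sketch of the in-place add phase; the SVE2 kernel here is only a stand-in that falls back to the portable body, and whether the "sve2" feature string is accepted depends on your toolchain version:

#[cfg(target_arch = "aarch64")]
pub fn add_in_place(latents: &mut [u64], vals: &[u64]) {
    if std::arch::is_aarch64_feature_detected!("sve2") {
        add_in_place_sve2(latents, vals);
    } else {
        add_in_place_portable(latents, vals);
    }
}

#[cfg(target_arch = "aarch64")]
fn add_in_place_sve2(latents: &mut [u64], vals: &[u64]) {
    // stand-in: a real version would use SVE2 intrinsics or inline asm (Section 5)
    add_in_place_portable(latents, vals);
}

#[cfg(target_arch = "aarch64")]
fn add_in_place_portable(latents: &mut [u64], vals: &[u64]) {
    for (l, v) in latents.iter_mut().zip(vals) {
        *l = l.wrapping_add(*v);
    }
}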


3) What blocks vectorization in your loop on aarch64?

Your code (simplified):

  • *latent = read_u64_at(src, byte_idx, bits_past_byte, offset_bits).wrapping_add(*latent);

Per iteration:

  • Load: 8 bytes from src[byte_idx..byte_idx+8] (byte_idx varies by element).
  • Compute: shift/mask to extract up to 57 bits.
  • RMW Store: write back into latents[i] (in-place add).

Two common blockers:

  • No gathers on NEON
Random (per-lane) 64-bit reads need gather; NEON doesn’t provide general gathers. LLVM’s loop vectorizer won’t vectorize a loop that needs gather/scatter unless it can re-express the load pattern as contiguous or via cheap shuffles. On x86, AVX2/AVX-512 gathers exist, so the same loop may vectorize.
  • Possible alias between src and latents
rustc does emit noalias information for reference arguments, but in practice LLVM often cannot prove that the store to latents never clobbers bytes later read from src in a different iteration (the attributes are easily lost through raw pointers, inlining, or slices carved from the same allocation). Vectorization requires reordering memory ops; a potential overlap blocks it.
(You observed: if you write to a different destination or drop the final add, it vectorizes, which is consistent with removing the store-load hazard. A sketch of that separate-destination variant follows below.)
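
For contrast, here is a sketch of the variant you report as vectorizing: the same bit extraction, but written to a separate dst slice so no store can clobber src. The signature is an assumption that mirrors the loop above, not your exact code.

pub unsafe fn decompress_offsets_to_dst(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    dst: &mut [u64],
) {
    let src_ptr = src.as_ptr();
    for i in 0..dst.len() {
        let n = *bits.get_unchecked(i);
        let bit_idx = base_bit_idx as u32 + *csum.get_unchecked(i);
        let byte_idx = (bit_idx >> 3) as usize;
        let bits_past_byte = bit_idx & 7;
        // unaligned little-endian read, then shift/mask (n <= 57)
        let raw = std::ptr::read_unaligned(src_ptr.add(byte_idx) as *const u64);
        *dst.get_unchecked_mut(i) = (raw >> bits_past_byte) & ((1u64 << n) - 1);
    }
}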


4) Minimal code reshaping that helps LLVM

A) Index-based loop + unchecked indexing (canonical loop form):

pub unsafe fn decompress_offsets_idx(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    latents: &mut [u64],
) {
    debug_assert_eq!(csum.len(), bits.len());
    debug_assert_eq!(latents.len(), bits.len());
    let n = latents.len();
    let src_ptr = src.as_ptr();
    for i in 0..n {
        // unchecked helps form a tight IV loop; eliminate bounds checks
        let offset_bits = *bits.get_unchecked(i);
        let offset_bits_csum = *csum.get_unchecked(i);
        let latent_ref = latents.get_unchecked_mut(i);
        let bit_idx = base_bit_idx as u32 + offset_bits_csum;
        let byte_idx = (bit_idx >> 3) as usize;
        let bits_past_byte = (bit_idx & 7) as u32;
        let val = read_u64_at_ptr(src_ptr, byte_idx, bits_past_byte, offset_bits);
        *latent_ref = latent_ref.wrapping_add(val);
    }
}
#[inline(always)]
unsafe fn read_u64_at_ptr(
    src_ptr: *const u8,
    byte_idx: usize,
    bits_past_byte: u32,
    n: u32,
) -> u64 {
    debug_assert!(n <= 57);
    // unaligned is okay on ARM/x86
    let raw = std::ptr::read_unaligned(src_ptr.add(byte_idx) as *const u64);
    // little-endian; if big-endian, byte-swap first
    (raw >> bits_past_byte) & ((1u64 << n) - 1)
}

This alone sometimes flips LLVM's decision on both arches. But the aliasing concern can still linger on aarch64.
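
A usage sketch: a safe wrapper that checks the preconditions the unsafe kernel relies on before calling it. It assumes csum is a monotone prefix sum (so its last element gives the largest bit offset); adjust the check to your real invariants.

pub fn decompress_offsets_idx_checked(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    latents: &mut [u64],
) {
    assert_eq!(csum.len(), bits.len());
    assert_eq!(latents.len(), bits.len());
    if let Some(&last_csum) = csum.last() {
        // the kernel does an unaligned 8-byte read starting at the last byte index
        let last_byte = ((base_bit_idx as u64 + last_csum as u64) >> 3) as usize;
        assert!(last_byte + 8 <= src.len(), "src too short for the final 8-byte read");
    }
    unsafe { decompress_offsets_idx(base_bit_idx, src, csum, bits, latents) }
}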

B) Explicitly separate load and store phases… but keep it cache-friendly (tiling)

This is the “I want in-place RMW for perf” compromise that usually wins on aarch64:

  • First pass (per small tile): compute values into a tiny stack buffer — only reads from src.
  • Second pass: add the buffered values into latents; this phase never touches src.

No global extra buffer, minimal extra traffic, and LLVM can vectorize each micro-phase freely because the read and write sets don’t overlap.

pub unsafe fn decompress_offsets_tiled(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    latents: &mut [u64],
) {
    const TILE: usize = 64; // tune: 32..128 typically good
    debug_assert_eq!(csum.len(), bits.len());
    debug_assert_eq!(latents.len(), bits.len());
    let n = latents.len();
    let src_ptr = src.as_ptr();
    let mut tmp = [0u64; TILE];
    let mut i = 0;
    while i < n {
        let block = (n - i).min(TILE);
        // phase 1: compute (loads only)
        let mut j = 0;
        while j < block {
            let off_bits = *bits.get_unchecked(i + j);
            let off_csum = *csum.get_unchecked(i + j);
            let bit_idx = base_bit_idx as u32 + off_csum;
            let byte_idx = (bit_idx >> 3) as usize;
            let bits_past_byte = (bit_idx & 7) as u32;
            tmp[j] = read_u64_at_ptr(src_ptr, byte_idx, bits_past_byte, off_bits);
            j += 1;
        }
        // phase 2: in-place RMW (stores only)
        let mut j2 = 0;
        while j2 < block {
            let p = latents.get_unchecked_mut(i + j2);
            *p = p.wrapping_add(tmp[j2]);
            j2 += 1;
        }
        i += block;
    }
}

Why this is “optimal” for ARM:

  • Avoids gathers.
  • Lets the loop vectorize (or at least SLP-vectorize) the arithmetic and store phase.
  • Keeps the in-place memory layout.
  • Performs well on Apple M-series and Cortex-A7xx cores in real benchmarks.

If you need even more speed, specialize the inner two tight loops per arch (cfg on target_arch = "aarch64") and let x86 keep your original body (it already vectorizes); a compile-time dispatch sketch follows.
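
A minimal sketch of that per-arch selection, reusing decompress_offsets_tiled (above) for aarch64 and decompress_offsets_idx (Section 4A) as a stand-in for your original x86-friendly body:

#[cfg(target_arch = "aarch64")]
pub unsafe fn decompress_offsets_dispatch(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    latents: &mut [u64],
) {
    decompress_offsets_tiled(base_bit_idx, src, csum, bits, latents)
}

#[cfg(not(target_arch = "aarch64"))]
pub unsafe fn decompress_offsets_dispatch(
    base_bit_idx: usize,
    src: &[u8],
    csum: &[u32],
    bits: &[u32],
    latents: &mut [u64],
) {
    // x86_64 already vectorizes the original in-place body; the index-based
    // variant from Section 4A stands in for it here
    decompress_offsets_idx(base_bit_idx, src, csum, bits, latents)
}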

C) (Nightly/advanced) Tell LLVM the pointers don't overlap
If you can require src and latents never alias (typical for decompressors), you can encode that contract:

  • With a tiny C shim using restrict on the src and latents pointers, called from Rust.
  • Or via the nightly core::intrinsics::assume intrinsic: assert that their spans don't overlap by encoding two inequalities (a minimal sketch follows below). This can unblock the loop vectorizer, but it's a commitment: UB if violated. The tiled approach above gets you most of the win without relying on extra unsafe contracts.
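
A minimal sketch of that assumption, assuming a nightly toolchain with #![feature(core_intrinsics)] at the crate root; the helper name is illustrative, and it is UB to call it if the ranges can actually overlap:

use core::intrinsics::assume;

/// Tell LLVM that `src` and `latents` occupy disjoint address ranges.
/// Safety: undefined behavior if the ranges overlap at runtime.
#[inline(always)]
unsafe fn assume_no_overlap(src: &[u8], latents: &[u64]) {
    let s0 = src.as_ptr() as usize;
    let s1 = s0 + src.len();
    let l0 = latents.as_ptr() as usize;
    let l1 = l0 + latents.len() * core::mem::size_of::<u64>();
    // either src ends at or before latents starts, or latents ends at or before src starts
    assume(s1 <= l0 || l1 <= s0);
}

Call it once at the top of the hot function, before the loop.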


5) Arm-specific fast path (optional)

If you're willing to add a small aarch64 specialization, two fruitful ideas:

  • Streaming bit-reader: since bit_idx is base + prefix_sum, it's monotonic. A classic codec approach keeps a 64-bit (or 128-bit) window and feeds bits out, refilling as needed. That becomes a sequential streaming load (very ARM-friendly) and avoids random address loads entirely. It's often faster than relying on auto-vectorization for this pattern; a minimal sketch follows after this list.
  • NEON shuffle window: Process 2–4 outputs per iteration by loading a 16–32B window once, then use vext / tbl to synthesize the per-lane 64-bit chunks at different byte offsets, followed by per-lane variable shifts (vshlq_u64). This is compact and fast but a bit more code. Good when offsets between consecutive elements differ by at most a few bytes (which is typical when n <= 57).
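
A minimal little-endian streaming-bit-reader sketch for the first idea, assuming csum is the exclusive running sum of bits so elements are consumed strictly in order; refill and end-of-buffer handling are deliberately simplified (the reader pads with zero bits near the end of src):

struct BitReader<'a> {
    src: &'a [u8],
    pos: usize,  // next byte of src to pull into the window
    window: u64, // buffered bits, consumed LSB first
    avail: u32,  // number of valid bits currently in `window`
}

impl<'a> BitReader<'a> {
    fn new(src: &'a [u8], start_byte: usize, start_bit: u32) -> Self {
        let mut r = BitReader { src, pos: start_byte, window: 0, avail: 0 };
        r.refill();
        // discard the bits before the starting position within the first byte
        r.window >>= start_bit;
        r.avail = r.avail.saturating_sub(start_bit);
        r
    }

    #[inline]
    fn refill(&mut self) {
        // pull in whole bytes until at least 57 bits are buffered (or src runs out)
        while self.avail <= 56 && self.pos < self.src.len() {
            self.window |= (self.src[self.pos] as u64) << self.avail;
            self.pos += 1;
            self.avail += 8;
        }
    }

    #[inline]
    fn read(&mut self, n: u32) -> u64 {
        debug_assert!(n <= 57);
        if self.avail < n {
            self.refill();
        }
        let val = self.window & ((1u64 << n) - 1);
        self.window >>= n;
        self.avail = self.avail.saturating_sub(n);
        val
    }
}

// Usage sketch: consume bits[i] bits per element, adding into latents in place.
fn decompress_offsets_streaming(
    base_bit_idx: usize,
    src: &[u8],
    bits: &[u32],
    latents: &mut [u64],
) {
    let mut r = BitReader::new(src, base_bit_idx >> 3, (base_bit_idx & 7) as u32);
    for (latent, &n) in latents.iter_mut().zip(bits) {
        *latent = latent.wrapping_add(r.read(n));
    }
}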


6) Quick answers to your specific questions

Q: How to figure out why Rust failed to vectorize?
A: Turn on LLVM optimization remarks (Section 1). Look for “requires gather” or “could not prove memory dependence” messages. Also inspect IR for vector.body.

Q: How do I know if rustc or LLVM is doing the vectorizing?
A: LLVM does. rustc emits IR; LLVM's Loop Vectorizer and SLP Vectorizer transform it. The remarks come from LLVM.

Q: How can I fix this specific example?

  • Use a real ARM target-cpu and don’t pass x86 features to aarch64.
  • Reshape the loop (indexing + get_unchecked) to a canonical IV form.
  • Remove the aliasing hazard without giving up in-place updates by tiling into a tiny stack buffer (code in Section 4B).
  • (Optional) Specialize per arch: keep your original body on x86, and use the tiled (or streaming) variant on aarch64.


7) A short performance checklist you can paste into your repo

  • Build with -Copt-level=3 -Ctarget-cpu=native (or a concrete aarch64 CPU).
  • Enable LLVM remarks to verify vectorization.
  • Prefer simple index loops with get_unchecked in hot paths.
  • Eliminate store–load aliasing between latents and src (tile or assert no-overlap).
  • For aarch64, avoid patterns that imply gathers unless you will hand-roll NEON/SVE2.
  • Consider a streaming bit-reader when the bit positions are monotonic.
  • Validate with cargo asm and (when possible) run llvm-mca on the inner loop.


Summary “optimal” solution for your case

  • Keep your in-place write pattern but tile into a small stack buffer, then add into latents.
  • Compile with a correct ARM target-cpu; don’t mix in x86 features.
  • With remarks enabled, you should see either actual vectorization or at least a big cost-model improvement on aarch64, and the tiled version is typically as fast or faster in practice.

Read the full article here: https://medium.com/@trivajay259/when-rust-wont-vectorize-how-to-see-why-prove-whose-fault-it-is-rustc-vs-llvm-and-fix-it-x86-98e6831f9be2