The state of SIMD in Rust in 2025
What's SIMD? Why SIMD?

Hardware that does arithmetic is cheap, so any CPU made this century has plenty of it. But you still only have one instruction decoding block, and it is hard to make it go fast, so the arithmetic hardware is vastly underutilized. To get around the instruction decoding bottleneck, you can feed the CPU a batch of numbers to process all at once with a single arithmetic operation like addition. Hence the name: "single instruction, multiple data," or SIMD for short. Instead of adding two numbers together, you can add two batches or "vectors" of numbers, and it takes about the same amount of time. On recent x86 chips these batches can be up to 512 bits in size, so in theory you can get an 8x speedup for math on u64, or a 64x speedup on u8!

Instruction sets

Historically, SIMD instructions were added after the CPU architecture was already designed, so SIMD is an extension with its own marketing name on each architecture. ARM calls theirs "NEON", and all 64-bit ARM CPUs have it. WebAssembly doesn't have a marketing department, so they just call theirs the "WebAssembly 128-bit packed SIMD extension". 64-bit x86 shipped with one called "SSE2", which has basic instructions for 128-bit vectors, but later they added a whole menagerie of extensions on top of that, with SSE 4.2 adding more operations, AVX and AVX2 adding 256-bit vectors, and AVX-512 adding 512-bit vectors.

The word "later" in the above paragraph creates a problem.

Does this CPU have that instruction?

If you're running a program on an x86 CPU, it's not a given that the CPU has any particular SIMD extension. So by default the compiler isn't allowed to use instructions beyond SSE2, because that wouldn't work on all x86 CPUs. There are two ways around this problem.

If you work for a company that only ever runs its binaries on its own servers or on a public cloud, you can just assert that they're all recent enough to at least have AVX2, which was introduced over 10 years ago, and have the program crash or misbehave if it ever runs on anything without AVX2:

RUSTFLAGS='-C target-cpu=x86-64-v3' cargo build --release

However, if you are distributing binaries for other people to run, that's not really an option. Instead you can do something called function multiversioning: compile the same function multiple times for different SIMD extensions, and when the program actually runs, check what features the CPU supports and select the appropriate version based on that.

Fortunately, this problem only exists on x86. ARM made NEON mandatory in all 64-bit CPUs and then didn't bother expanding the width beyond 128 bits. (Technically SVE exists, but in 2025 it is still mostly on paper, and Rust support for it is still in progress.) WebAssembly makes you compile two different binaries, one with SIMD and one without, and use JavaScript to check whether the browser supports SIMD.

Solution space

There are four approaches to SIMD in Rust, in ascending order of effort:

* Automatic vectorization
* Fancy iterators
* Portable SIMD abstractions
* Raw intrinsics

Automatic vectorization

The easiest approach to SIMD is letting the compiler do it for you. It works surprisingly well, as long as you structure your code in a way that is amenable to vectorization, a topic covered in depth elsewhere. You can check whether it's working with cargo-show-asm or godbolt.org, but your benchmarks are the ultimate judge of the results.

Sadly, there is a limit on the complexity of the code that the compiler will vectorize, and it may change between compiler versions. If something vectorizes today, that doesn't necessarily mean it still will a year from now.

The other drawback of this method is that the optimizer won't even touch anything involving floats (the f32 and f64 types). It's not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it's currently nightly-only.) So right now, if you need to process floats, autovectorization is a no-go unless you can use nightly builds of the Rust compiler. (Floats are cursed even without SIMD. Something as simple as summing an array of them in a usable way turns out to be really hard.)

There is no built-in way to multiversion functions, but the multiversion crate works great with autovectorization.
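To make the combination concrete, here is a minimal sketch, assuming the targets() attribute syntax of recent multiversion releases; the dot-product kernel and its names are illustrative, not from the article:

```rust
use multiversion::multiversion;

// The macro compiles one clone of this function per target below (plus
// a baseline fallback), each with that target's SIMD features enabled,
// and picks the right clone at runtime via CPU feature detection.
#[multiversion(targets("x86_64+avx2", "x86_64+sse4.1", "aarch64+neon"))]
fn dot(a: &[i32], b: &[i32]) -> i32 {
    // A plain loop over equal-length slices doing integer math is the
    // kind of shape the autovectorizer reliably handles.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Inside the AVX2 clone the compiler is allowed to emit AVX2 instructions, so the ordinary scalar body vectorizes to 256-bit operations without any hand-written SIMD.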
Fancy iterators

Just like rayon lets you run your iterators in parallel by swapping .iter() with .par_iter(), there have been attempts to do the same for SIMD. After all, what is SIMD but another kind of parallelism? This is the approach the faster crate takes. That crate has been abandoned for years, and it doesn't look like this approach has panned out.

Portable SIMD abstractions

The idea is to let you write your algorithm by explicitly operating on chunks of data, something like [f32; 8] but wrapped in a custom type, and then provide custom implementations of operations like + that compile down into SIMD instructions.

std::simd is exactly that. It supports all the instruction sets LLVM supports, so its platform support is unparalleled. It pairs well with the multiversion crate. Sadly, it's nightly-only and will remain so for the foreseeable future, which makes it unusable in most situations.

The wide crate is a mature, established option. It supports NEON, WASM, and all the x86 instruction sets. But it doesn't support multiversioning at all, save for very exotic and limited approaches like cargo-multivers.

The pulp crate has built-in multiversioning, and is reasonably mature and complete, if not as much as wide. It powers faer, so its performance is clearly proven. A major limitation is that it only operates on the native SIMD width, so your code needs to handle variable-width chunks, as opposed to expressing everything in terms of something like [f32; 8] and letting the library lower it into the appropriate instructions the way std::simd and wide do. It's also difficult to write code that's generic over the element type, so if you want both f32 and f64 there will be some code duplication. The architecture support is limited too: only NEON, AVX2, and AVX-512. AVX2 was introduced in 2012, but in the Firefox hardware survey only 75% of systems have it.
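To make these styles concrete, here is a small kernel written against std::simd on nightly; the function and its names are illustrative, not taken from any crate's docs:

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

// out[i] = a[i] * k + b[i]; assumes all three slices have equal length.
fn scale_add(out: &mut [f32], a: &[f32], b: &[f32], k: f32) {
    let kv = f32x8::splat(k);
    let full = out.len() / 8 * 8;
    // Full 8-lane chunks go through SIMD registers...
    for i in (0..full).step_by(8) {
        let r = f32x8::from_slice(&a[i..]) * kv + f32x8::from_slice(&b[i..]);
        r.copy_to_slice(&mut out[i..i + 8]);
    }
    // ...and the remainder falls back to scalar code.
    for i in full..out.len() {
        out[i] = a[i] * k + b[i];
    }
}
```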
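The same computation on stable Rust with wide looks nearly identical, which is the whole point of this category; a sketch assuming wide's f32x8 API:

```rust
use wide::f32x8;

// One 8-lane step of the same kernel: out = a * k + b.
fn scale_add_8(out: &mut [f32; 8], a: [f32; 8], b: [f32; 8], k: f32) {
    let r = f32x8::from(a) * f32x8::splat(k) + f32x8::from(b);
    *out = r.to_array();
}
```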
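pulp inverts the model: instead of you picking a width, you hand it a closure, and it runs the closure compiled for the best instruction set the CPU has. A minimal sketch in the style of pulp's documented examples:

```rust
use pulp::Arch;

fn double_all(v: &mut [f64]) {
    // Detect the best available instruction set at runtime...
    let arch = Arch::new();
    // ...then run the closure as a function compiled with those
    // features enabled, at whatever the native SIMD width is.
    arch.dispatch(|| {
        for x in v.iter_mut() {
            *x *= 2.0;
        }
    });
}
```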
The macerator crate is a fork of pulp with better support for generic programming and vastly expanded instruction set support: all the x86 extensions, WASM, NEON, and even the LoongArch SIMD extensions. It's used only by burn-ndarray, and even there it's an optional dependency. It sounds great on paper, but it's oddly obscure and therefore unproven.

The fearless_simd crate is inspired by pulp's design, but also supports fixed-size chunks, just like std::simd and wide. It's far less mature than pulp, but it's under active development. As of this writing it supports NEON, WASM, and SSE4.2, but not the newer x86 extensions. It seems too immature just yet, but it's something to keep an eye on.

simdeez is a rather old crate that supports all the instruction sets except AVX-512 and comes with built-in multiversioning. What gives me pause is that despite existing for many years, it's still barely used. Everyone else who needed SIMD built their own instead of using it. And its README says: "Currently things are well fleshed out for i32, i64, f32, and f64 types." So I guess the other types aren't complete?

TL;DR: use std::simd if you don't mind nightly, wide if you don't need multiversioning, and otherwise pulp or macerator. If it's not 2025 when you're reading this, check out fearless_simd, because std::simd is still nightly-only in your glorious future, isn't it?

Raw intrinsics

If you want to get really close to the metal, there are always the raw intrinsics, just one step removed from the processor instructions.

The problem looming over any use of raw intrinsics is that you have to write them manually for every platform and instruction set you're targeting. Whereas std::simd or wide let you write your logic once and compile it down to assembly automatically, with intrinsics you have to write a separate implementation for every single platform and instruction set (SSE, AVX, NEON…) you care to support. That's a lot of code! It's really not helped by the fact that they are all named something like _mm256_srli_epi32, so your code ends up as a long list of calls to these arcanely named functions. And wrappers that help readability introduce their own problems, such as clashes with multiversioning, unsafe code, or arcane macros.

You also have to build your own multiversioning. Or rather, you have to manually dispatch to the dedicated implementation you have manually written for each instruction set. The std::is_x86_feature_detected! macro takes care of the feature detection, but it is somewhat slow. In some cases it is beneficial to detect the available features exactly once and then cache the result, but you have to implement that manually too.

On the bright side, this year writing intrinsics got markedly less awful. Most of them are no longer unsafe to call in Rust 1.87 and later, and the safe_unaligned_simd crate provides safe wrappers for the rest. So at least this approach is no longer unsafe on top of all its other problems!
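Here is a sketch of what that hand-rolled dispatch can look like, with the detection result cached in a function pointer so the feature check runs only once. The kernel, the names, and the OnceLock caching scheme are illustrative assumptions; the intrinsics themselves are the real ones from std::arch:

```rust
use std::sync::OnceLock;

fn add8_scalar(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    let mut out = [0; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

// Since Rust 1.87 this can be a safe fn; callers that haven't proven
// AVX2 is available still need an unsafe block to call it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
fn add8_avx2(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    use std::arch::x86_64::*;
    // The unaligned loads and stores go through raw pointers and stay
    // unsafe; the arithmetic intrinsic is fine to call here because
    // this function is compiled with AVX2 enabled.
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let mut out = [0i32; 8];
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
        out
    }
}

// Detect CPU features once, cache the winning implementation, reuse it.
fn add8(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    type AddFn = fn(&[i32; 8], &[i32; 8]) -> [i32; 8];
    static IMPL: OnceLock<AddFn> = OnceLock::new();
    let f = IMPL.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // The runtime check above makes this call sound.
                return |a, b| unsafe { add8_avx2(a, b) };
            }
        }
        add8_scalar
    });
    f(a, b)
}
```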
Which one is right for you?

The right tool for the job ultimately depends on the use case. Want zero dependencies and little up-front hassle? Autovectorization. Porting existing C code or targeting very specific hardware? Intrinsics. Anything else? A portable SIMD abstraction. And now that you made it this far, you can understand the table at the top of the article, which will help guide your decision!