5 Rust FFI Moves for Hot Python Paths

[[file:5_Rust_FFI_Moves.jpg|500px]]

Five Rust FFI patterns — PyO3, zero-copy NumPy, GIL-free parallelism, buffer/bytes tricks, and stateful workers — to speed up hot Python code paths.

Python is the front door; Rust is the engine room. When a tight loop or data transform becomes your p99 villain, you don't need a rewrite. You need a carefully placed, memory-savvy Rust function that does one thing fast — and plays nicely with Python. Here are five moves that consistently deliver.

1) The "PyO3 + Maturin" Express Lane

If you want the most Pythonic developer experience, start here. PyO3 exposes Rust functions as native Python callables. Maturin builds wheels that pip install cleanly across platforms.

Rust (Cargo.toml)

    [package]
    name = "fastops"
    version = "0.1.0"
    edition = "2021"

    [lib]
    name = "fastops"
    crate-type = ["cdylib"]

    [dependencies]
    pyo3 = { version = "0.22", features = ["extension-module"] }

Rust (src/lib.rs)

    use pyo3::prelude::*;

    #[pyfunction]
    fn clamp_sum(a: i64, b: i64, min: i64, max: i64) -> PyResult<i64> {
        let s = a.saturating_add(b);
        Ok(s.clamp(min, max))
    }

    #[pymodule]
    fn fastops(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(clamp_sum, m)?)?;
        Ok(())
    }

Build & install (no Dockerfiles required)

    pip install maturin
    maturin develop   # builds a local wheel and installs it into your venv

Why it's fast: zero interpreter overhead at the call site; compiled Rust in the hot path.

Where it shines: short, frequently called functions; simple shapes in/out; tight arithmetic or small transforms.

Let's be real: this already beats pure Python for many workflows. But the fun starts when data sizes grow.

2) Zero-Copy with NumPy: Borrow, Don't Box

The worst performance bug in FFI is accidental copies. Use PyO3's numpy crate (rust-numpy) to borrow NumPy's memory via the buffer protocol.

Cargo.toml (extra deps)

    numpy = "0.22"
    ndarray = { version = "0.15", features = ["rayon"] }
    rayon = "1.10"

Rust (src/lib.rs) — read-only borrow, no copies:

    use numpy::PyReadonlyArray1;
    use pyo3::prelude::*;
    use rayon::prelude::*;

    #[pyfunction]
    fn l2_norm(x: PyReadonlyArray1<'_, f32>) -> PyResult<f32> {
        // Borrow as a Rust slice without copying
        let slice = x.as_slice()?;
        // Parallel + numerically stable-ish accumulation
        let sumsq: f32 = slice.par_iter().map(|v| v * v).sum();
        Ok(sumsq.sqrt())
    }

    #[pyfunction]
    fn scale_inplace(x: PyReadonlyArray1<'_, f32>, factor: f32) -> PyResult<()> {
        // If you must mutate, require a writable view from the Python side.
        // Shown read-only here for safety; prefer explicit writable arrays
        // in your API design (see the writable sketch below).
        let _ = (x, factor); // illustrate the API edge
        Err(pyo3::exceptions::PyTypeError::new_err("Pass a writable array"))
    }

    #[pymodule]
    fn fastops(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(l2_norm, m)?)?;
        m.add_function(wrap_pyfunction!(scale_inplace, m)?)?;
        Ok(())
    }

Python

    import numpy as np, fastops
    x = np.random.rand(2_000_000).astype(np.float32)  # ~8 MB
    print(fastops.l2_norm(x))  # no copy, just compute

Why it's fast: Rust reads the same memory NumPy owns. No marshalling of giant lists.

Guardrails: prefer float32/int32/int64 explicitly. Validate dtype and contiguity at the boundary.
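The scale_inplace stub above deliberately fails; the writable counterpart it alludes to could look like the following minimal sketch, assuming rust-numpy's PyReadwriteArray1 API (scale_inplace_mut is a hypothetical name, not part of the module above):

    use numpy::PyReadwriteArray1;
    use pyo3::prelude::*;

    #[pyfunction]
    fn scale_inplace_mut(mut x: PyReadwriteArray1<'_, f32>, factor: f32) -> PyResult<()> {
        // Requesting a writable borrow up front is the explicit API design
        // the guardrails recommend; as_slice_mut fails fast on
        // non-contiguous input.
        let slice = x.as_slice_mut()?;
        for v in slice.iter_mut() {
            *v *= factor;
        }
        Ok(())
    }

Python-side usage keeps the same shape: fastops.scale_inplace_mut(x, 2.0) mutates x with no copy.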
3) Release the GIL, Then Go Wide

The GIL isn't the enemy — holding it too long is. Run heavy work without the GIL and parallelize with Rayon.

Rust

    use pyo3::prelude::*;
    use rayon::prelude::*;

    #[pyfunction]
    fn topk_sum(py: Python<'_>, mut data: Vec<i64>, k: usize) -> PyResult<i64> {
        // Heavy section without the GIL; other Python threads keep running
        Ok(py.allow_threads(move || {
            data.par_sort_unstable_by(|a, b| b.cmp(a)); // descending
            data.par_iter().take(k).sum()
        }))
    }

Pattern:
* Convert to a Rust type quickly (or borrow via NumPy as above).
* Drop the GIL with allow_threads.
* Use Rayon for CPU parallelism.
* Reacquire the GIL only to create/return Python objects.

Real-world feel: on a 16-core box, this wins big for CPU-bound workloads (sorting, reductions, SIMD-friendly math). For I/O, parallelism helps less; batch instead.

4) Bytes, Memoryview, and the "No-UTF-8" Rule

Text is often the sneaky bottleneck. If you're hashing, compressing, or scanning, accept bytes/memoryview and treat data as raw buffers.

Cargo.toml (extra dep)

    twox-hash = "1.6"

Rust

    use pyo3::prelude::*;
    use pyo3::types::PyBytes;
    use std::hash::Hasher;
    use twox_hash::XxHash64;

    #[pyfunction]
    fn fast_hash(b: &Bound<'_, PyBytes>) -> PyResult<u64> {
        let buf = b.as_bytes(); // zero-copy borrow from Python
        let mut h = XxHash64::with_seed(0);
        h.write(buf);
        Ok(h.finish())
    }

    #[pymodule]
    fn fastops(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(fast_hash, m)?)?;
        Ok(())
    }

Python

    import fastops
    data = open("blob.bin", "rb").read()  # bytes; pass directly, no copy
    print(fastops.fast_hash(data))

Why it's fast: no decoding/encoding churn; you operate on raw bytes.

Rule of thumb: only decode to str at the UI/edge. Everywhere else, keep bytes binary.

5) Stateful Rust Workers: Warm, Batched, Predictable

Many "slow" paths are chatty: lots of tiny calls. Pay the setup cost once and reuse long-lived Rust state (model weights, indexes, lookup tables).

Cargo.toml (extra dep)

    once_cell = "1.19"

Rust (long-lived state in the module)

    use once_cell::sync::OnceCell;
    use pyo3::prelude::*;

    struct Scorer {
        weights: Vec<f32>,
    }

    impl Scorer {
        fn score(&self, xs: &[f32]) -> f32 {
            xs.iter().zip(&self.weights).map(|(x, w)| x * w).sum()
        }
    }

    static SCORER: OnceCell<Scorer> = OnceCell::new();

    #[pyfunction]
    fn init_weights(ws: Vec<f32>) -> PyResult<()> {
        SCORER
            .set(Scorer { weights: ws })
            .map_err(|_| pyo3::exceptions::PyRuntimeError::new_err("already initialized"))
    }

    #[pyfunction]
    fn score_batch(py: Python<'_>, batch: Vec<Vec<f32>>) -> PyResult<Vec<f32>> {
        let s = SCORER
            .get()
            .ok_or_else(|| pyo3::exceptions::PyRuntimeError::new_err("call init_weights first"))?;
        // GIL-free CPU work
        Ok(py.allow_threads(|| batch.iter().map(|row| s.score(row)).collect()))
    }

    #[pymodule]
    fn fastops(m: &Bound<'_, PyModule>) -> PyResult<()> {
        m.add_function(wrap_pyfunction!(init_weights, m)?)?;
        m.add_function(wrap_pyfunction!(score_batch, m)?)?;
        Ok(())
    }

Python

    import fastops, numpy as np
    fastops.init_weights([0.2, 0.5, 0.3])
    rows = np.random.rand(1000, 3).astype(np.float32).tolist()
    scores = fastops.score_batch(rows)  # one call, many rows

Why it's fast: one FFI boundary, many computations; warm state; fewer allocations.

Production tip: hide init_weights behind a single load() that checks idempotency and path configs (a sketch follows below).
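One way to implement that tip, as a minimal sketch reusing the SCORER cell and Scorer type above (the load name and the boolean return value are assumptions, not part of the original API):

    // Add to the same src/lib.rs as SCORER above.
    #[pyfunction]
    fn load(ws: Vec<f32>) -> PyResult<bool> {
        let mut initialized = false;
        // get_or_init runs the closure at most once, even across threads,
        // so repeated calls are cheap, idempotent no-ops.
        SCORER.get_or_init(|| {
            initialized = true;
            Scorer { weights: ws }
        });
        Ok(initialized)
    }

Path configs fit the same pattern: resolve them inside the closure so a second call can never silently swap weights.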
A Quick Mental Model (how these moves fit)

* Boundary: keep function signatures small and explicit; convert/validate once.
* Memory: borrow large arrays via the buffer protocol; don't copy unless absolutely necessary.
* GIL: hold briefly; compute outside; return.
* Parallelism: use Rayon where CPU-bound; batch where I/O-bound.
* State: pay initialization once; reuse forever (or until reload).

Mini Case: 14× Faster Feature Engineering

A team had a Pandas pipeline where a custom apply() computed rolling stats per user on millions of rows. They replaced the hot function with a Rust fastops.roll_stats() that:
* accepted NumPy arrays zero-copy,
* dropped the GIL and used Rayon chunking,
* returned a preallocated result array.

The end-to-end job went from 11m 40s to 49s on the same hardware. The rest of the Python stayed the same. The hot path didn't.

Packaging for Reality (tiny but crucial)

* Set crate-type = ["cdylib"].
* Use maturin build --release to produce manylinux wheels for CI/CD.
* Pin the Python ABI versions you support; test with tox or nox.
* For CPU goodies, compile with RUSTFLAGS="-C target-cpu=native" for your own fleet, or choose portable flags for public wheels and detect features at runtime.

Common Pitfalls to Dodge

* Hidden copies: converting lists of Python objects to Rust per element (ouch). Prefer arrays/bytes.
* Long GIL holds: anything sorting/iterating big data — drop it early.
* Unclear dtypes: validate dtype and contiguous memory; fail fast with a friendly error (see the sketch after this list).
* Over-chatty APIs: batch calls; return vectors, not scalars in loops.
* String thrash: operate on bytes; decode at the boundary.
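For the dtype pitfall, boundary validation can be both strict and friendly. A minimal sketch, assuming rust-numpy's typed extraction (which verifies dtype and dimensionality); mean_f32 is a hypothetical example, not from the article:

    use numpy::PyReadonlyArray1;
    use pyo3::exceptions::PyTypeError;
    use pyo3::prelude::*;

    #[pyfunction]
    fn mean_f32(x: &Bound<'_, PyAny>) -> PyResult<f32> {
        // Extraction checks dtype (float32) and ndim (1); we swap the
        // generic error for an actionable message.
        let view: PyReadonlyArray1<'_, f32> = x
            .extract()
            .map_err(|_| PyTypeError::new_err("expected a 1-D float32 ndarray"))?;
        let slice = view.as_slice()?; // fails fast on non-contiguous memory
        if slice.is_empty() {
            return Err(PyTypeError::new_err("array must not be empty"));
        }
        Ok(slice.iter().sum::<f32>() / slice.len() as f32)
    }

From Python, passing a float64 array raises "TypeError: expected a 1-D float32 ndarray" instead of a confusing low-level failure.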
Conclusion

Rust doesn't replace Python; it amplifies it. Start with one hot path, measure, then apply the next move where it hurts most. You'll ship the same Python APIs — just with the latency profile of a lower-level language.

Read the full article here: https://medium.com/@kaushalsinh73/5-rust-ffi-moves-for-hot-python-paths-557d3f74c0c7