
7 Times Rust Made My Python Code Run 100x Faster


If one hot function costs your company thousands of dollars per month, rewrite that function now.

Short. Direct. High stakes. Read this if latency or cost matters in your product.

Why this article exists

Python is an excellent orchestrator. The ecosystem is vast. Most problems can be solved inside Python with great libraries. However, when a single hot function dominates latency or cost, a surgical migration to Rust can be the highest-leverage change available.

Quick rules before starting

  • Measure first. Do not guess (a minimal timing sketch follows this list).
  • Prefer libraries (NumPy, orjson, vectorized code) before rewriting.
  • Migrate only the hot path. Keep Python as the orchestrator.
  • Test, gate, and rollback. Small, reversible changes win.
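
A minimal sketch of the kind of micro-benchmark the "measure first" rule calls for, using only the standard library; sum_squares here is a stand-in for whatever hot function you suspect.

# bench_sketch.py (illustrative): time the suspect function before touching it
import time

def sum_squares(n):
    return sum(x * x for x in range(n))

def bench(fn, *args, repeat=3):
    # best-of-N wall time; the minimum is the least noisy estimate
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

print(f"best of 3: {bench(sum_squares, 1_000_000):.3f}s")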

Pattern diagrams

Embed a Rust-native library with PyO3

[Python App] <--- CPython FFI ---> [Rust native lib]
    |                                   |
    v                                   v
  Orchestration                      Fast compute

Microservice approach

[Client] -> [Python API] -> [RPC/HTTP] -> [Rust microservice]
                 |                                 |
                 v                                 v
           Auth, routing                    Compute, parsing

Subprocess / CLI approach

[Python] -> subprocess.run([rust_bin, args]) -> parse stdout

These patterns solve different operational constraints. Choose the one that matches your deploy, CI, and safety profile.

1: Tight numeric loop (pure Python loop → Rust + Rayon)

Problem. A tight numeric loop executing 50 million iterations inside request handling. The pure Python loop carries interpreter overhead on every iteration.

Change. Replace the loop with a Rust implementation and use Rayon for parallelism.

Python (naive)

# cpu_python.py
def sum_squares(n):
    s = 0
    for x in range(n):
        s += x * x
    return s

Rust (parallel with Rayon)

// cpu_rust_par.rs
use rayon::prelude::*;

fn sum_squares_par(n: u64) -> u128 {
    (0..n).into_par_iter().map(|x| (x as u128) * (x as u128)).sum()
}

Result (example run on an i7 laptop)

  • CPython naive loop: 68.00 seconds
  • Rust parallel (rayon): 0.52 seconds

Speedup: 68.00 s / 0.52 s ≈ 130.77×.

Takeaway. For pure interpreter-bound loops, Rust can deliver improvements of two orders of magnitude.
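
Per the "prefer libraries" rule above, a vectorized NumPy version is worth measuring before any port. A sketch, with one caveat the Rust version's u128 arithmetic sidesteps: at n = 50 million the exact sum (~4.2 × 10^22) overflows int64, so a fixed-width NumPy reduction either wraps or, with float64, is only approximate.

# numpy_check.py (illustrative): vectorized alternative, approximate for huge n
import numpy as np

def sum_squares_np(n):
    x = np.arange(n, dtype=np.float64)  # allocates n * 8 bytes
    return float(np.dot(x, x))          # fast, but float64 drops low-order digits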

2: Large NDJSON parsing (read-all Python → Rust streaming)

Problem. A 500 MB NDJSON log file loaded or parsed in Python. Memory spikes and throughput suffers.

Change. Move parsing and streaming to Rust using serde_json and an efficient reader.

Python (streaming)

# io_python_stream.py
import json

def total_chars_stream(path):
    total = 0
    with open(path, "r") as f:
        for line in f:
            _ = json.loads(line)
            total += len(line)
    return total

Rust (streaming with serde_json)

use std::fs::File;
use std::io::{BufRead, BufReader};
use serde_json::Value;

fn total_chars(path: &str) -> anyhow::Result<usize> {
    let f = File::open(path)?;
    let reader = BufReader::new(f);
    let mut total = 0usize;
    for line in reader.lines() {
        let l = line?;
        let _v: Value = serde_json::from_str(&l)?;
        total += l.len();
    }
    Ok(total)
}

Result (example)

  • Python streaming: 8.0 seconds, peak memory ~200 MB
  • Rust streaming: 2.3 seconds, peak memory ~120 MB

Speedup: 8.0 s / 2.3 s ≈ 3.48×.

Takeaway. For IO-bound problems, streaming design in Python buys a lot. Rust further reduces wall time and memory, but the gain is smaller than for pure CPU loops.
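
One more library-first option before reaching for Rust here: orjson (named in the quick rules) is a drop-in and typically much faster at decoding. A sketch; orjson works on bytes, so the file is opened in binary mode.

# orjson_check.py (illustrative): same streaming shape, faster JSON decoding
import orjson

def total_chars_stream_orjson(path):
    total = 0
    with open(path, "rb") as f:
        for line in f:
            _ = orjson.loads(line)
            total += len(line)
    return total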

3: CSV heavy transformation (pandas pipeline → Rust csv + parallel mapping)

Problem. A pipeline that reads a large CSV, transforms rows, and materializes a summary. Pandas incurs overhead when per-row Python callbacks are used.

Change. Use the Rust csv crate, apply transformations in Rust threads, and emit aggregated results.

Python (pandas with apply)

import pandas as pd

df = pd.read_csv("big.csv")
df["v"] = df.apply(lambda r: transform(r["col"]), axis=1)

Rust (csv + rayon)

use rayon::prelude::*;
use csv::Reader;

fn process_csv(path: &str) -> anyhow::Result<()> {
    let mut rdr = Reader::from_path(path)?;
    let rows: Vec<_> = rdr.records().collect::<Result<_, _>>()?;
    rows.par_iter().for_each(|rec| {
        // Parse fields and transform in Rust, e.g.:
        // let v: f64 = rec[0].parse().unwrap_or(0.0);
    });
    Ok(())
}

Result (example)

  • Pandas with Python callback: 45.0 seconds
  • Rust csv + parallel transform: 0.40 seconds

Speedup: 45.0 s / 0.40 s = 112.5×.

Takeaway. When per-row logic leans heavily on Python callbacks, moving that work into Rust yields dramatic wins.
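
The same library-first check applies here: if transform can be expressed as column arithmetic, the .apply callback disappears entirely and pandas stays fast. A sketch, assuming (hypothetically) that transform is a simple arithmetic map:

# vectorized_check.py (illustrative): vectorized equivalent of the .apply pipeline
import pandas as pd

df = pd.read_csv("big.csv")
df["v"] = df["col"] * 2 + 1  # hypothetical transform expressed as column math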

4: Regex-heavy log parsing (Python re → Rust regex crate)

Problem. A compressor that extracts dozens of groups per line with complex patterns. Python spends time in the interpreter for each match.

Change. Port the regex logic to Rust and use compiled, anchored patterns.

Python

import re

pat = re.compile(r"(\d+)\s+(\w+)\s+(\S+)")

def parse(line):
    m = pat.match(line)
    if m:
        return m.groups()

Rust

use regex::Regex;

let re = Regex::new(r"(\d+)\s+(\w+)\s+(\S+)").unwrap();
for line in reader.lines() {
    if let Some(caps) = re.captures(&line?) {
        // extract caps[1], caps[2], caps[3]
    }
}

Result (example)

  • Python regex parsing: 12.0 seconds
  • Rust regex parsing: 0.4 seconds

Speedup: 12.0 s / 0.4 s = 30×.

Takeaway. Regex engines differ in optimization and overhead. The Rust regex crate is compiled and often much faster for heavy streaming workloads.

5: Per-pixel image manipulation (PIL loop → Rust image crate)

Problem. A per-pixel filter implemented in Python looping across image pixels. This pattern is interpreter-bound.

Change. Replace the pixel loop with Rust code operating on raw buffers.

Python (PIL per-pixel)

from PIL import Image

img = Image.open("in.png")
px = img.load()
for y in range(img.height):
    for x in range(img.width):
        r, g, b = px[x, y]
        px[x, y] = (transform(r), transform(g), transform(b))

img.save("out.png")
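
Worth noting before the port: Pillow interoperates with NumPy, so per-pixel arithmetic often vectorizes without leaving Python. A sketch, assuming transform is expressible as array math (the gamma-style map here is hypothetical):

# numpy_pixels.py (illustrative): vectorized per-pixel filter via NumPy
import numpy as np
from PIL import Image

arr = np.asarray(Image.open("in.png").convert("RGB"), dtype=np.float32)
out = (255 * (arr / 255) ** 0.8).astype(np.uint8)  # hypothetical transform
Image.fromarray(out).save("out.png")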

Rust (image buffer manipulation)

use image::open;

let mut img = open("in.png")?.to_rgb8();
for pixel in img.pixels_mut() {
    let r = pixel[0];
    pixel[0] = transform(r);
    // same for pixel[1] (g) and pixel[2] (b)
}
img.save("out.png")?;

Result (example)

  • PIL per-pixel loop: 9.60 seconds
  • Rust buffer manipulation: 0.06 seconds

Speedup: 9.60 s / 0.06 s = 160×.

Takeaway. Memory layout and tight loops matter. When per-element transformation is the bottleneck, Rust will often outperform Python by orders of magnitude.

6: Cryptographic inner loop (pure-Python implementation → Rust)

Problem. A custom hashing function implemented in pure Python, used inside a hot loop. The function does many byte operations and modular arithmetic.

Change. Implement the hash in Rust and expose it via PyO3 or call it as a binary.

Python (pure)

def custom_hash(data):
    h = 0
    for b in data:
        # mask to 64 bits so results match the wrapping u64 arithmetic in Rust
        h = ((h * 1315423911) ^ b) & 0xFFFFFFFFFFFFFFFF
    return h

Rust

fn custom_hash(data: &[u8]) -> u64 {
    let mut h: u64 = 0;
    for &b in data {
        // XOR cannot overflow, so plain ^ is correct; only the multiply wraps
        h = h.wrapping_mul(1315423911) ^ (b as u64);
    }
    h
}
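
Because Python ints are unbounded while the Rust port wraps at 64 bits, it pays to spot-check equivalence once the native version is built. A sketch, assuming the Rust function is exposed through PyO3 as a hypothetical fastmod.custom_hash:

# hash_check.py (illustrative): pure-Python vs. native equivalence spot-check
import os
import fastmod  # hypothetical compiled module

def custom_hash_py(data):
    h = 0
    for b in data:
        h = ((h * 1315423911) ^ b) & 0xFFFFFFFFFFFFFFFF  # emulate u64 wrapping
    return h

for _ in range(1000):
    data = os.urandom(64)
    assert custom_hash_py(data) == fastmod.custom_hash(data)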

Result (example)

  • Python loop: 15.0 seconds
  • Rust native: 0.12 seconds

Speedup: 15.0 s / 0.12 s = 125×.

Takeaway. Arbitrary numeric and bitwise work is exactly the sort of workload where compiled code shines.

7: Binary serialization (Python protobuf → Rust prost)

Problem. High-frequency serialization of data structures for IPC. The Python protobuf binding has per-call overhead.

Change. Implement serialization in Rust (prost) and call it via FFI or a microservice.

Python

# using generated Python protobuf code
msg = MyMsg(field1=1, field2="abc")
buf = msg.SerializeToString()

Rust (prost)

// MyMsg is the prost-generated type for the same schema
let msg = MyMsg { field1: 1, field2: "abc".into() };
let mut buf = Vec::new();
msg.encode(&mut buf)?;

Result (example)

  • Python protobuf: 6.00 seconds
  • Rust prost: 0.04 seconds

Speedup: 6.00 s / 0.04 s = 150×.

Takeaway. Per-call serialization overhead matters when requests are frequent. Native serializers win.

Patterns for integration and quick code examples

1. PyO3 (embed native Rust in Python)

Rust function:

use pyo3::prelude::*;

#[pyfunction]
fn fast_sum(n: u64) -> u128 {
    (0..n).map(|x| (x as u128) * (x as u128)).sum()
}

#[pymodule]
fn fastmod(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fast_sum, m)?)?;
    Ok(())
}

Python usage (after building and installing the module, e.g. with maturin):

import fastmod

print(fastmod.fast_sum(50_000_000))

2. Subprocess binary (safe, no FFI)

Rust compiles to a fast_bin executable; Python calls it:

import subprocess, json

out = subprocess.check_output(["./fast_bin", "args"])
data = json.loads(out)

3. Microservice (HTTP/gRPC)

  • Keep Python API.
  • Route one endpoint to Rust microservice for heavy compute.
  • Use async HTTP or gRPC and a small schema.
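
On the Python side, the microservice route reduces to one HTTP call per heavy request. A minimal sketch; the service URL, route, and payload shape are all assumptions:

# client_sketch.py (illustrative): route one heavy call to a Rust service
import requests

def heavy_compute(payload):
    resp = requests.post("http://rust-svc:8080/compute", json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()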

Choose the integration route based on your team's skills, deployment constraints, and rollback strategy.

How to pick candidates: a concise heuristic

  • Measure full production traces. Identify functions responsible for >5 percent of p99 latency.
  • Calculate cost at scale: if function cost times volume is material, it is a candidate (a back-of-envelope sketch follows this list).
  • Write microbenchmarks with representative inputs. If optimized Python libraries close the gap, prefer them. If not, prototype in Rust.
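
A back-of-envelope version of the cost check, with every number purely illustrative:

# cost_sketch.py (illustrative): is the hot function material at scale?
calls_per_month = 500_000_000
seconds_per_call = 0.050            # measured p50 of the hot function
usd_per_cpu_second = 0.0000115      # derive from your own instance pricing
monthly_cost = calls_per_month * seconds_per_call * usd_per_cpu_second
print(f"hot function costs ~${monthly_cost:,.0f}/month")  # ≈ $288 here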

Quick migration playbook

  • Add a benchmark harness and logs for the suspect function.
  • Attempt vectorized or library-based fix (NumPy, orjson, or C extension).
  • If library fails, prototype in Rust and measure.
  • Integrate via PyO3 or a subprocess for a low-risk rollout.
  • Add tests and CI benchmarks.
  • Deploy behind a feature flag and gather telemetry (a gating sketch follows this list).
  • Promote to production if results are stable.
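
A sketch of the feature-flag step, with a fallback so a native failure never takes down the orchestrator; the flag helper and module names are hypothetical:

# gated_rollout.py (illustrative): flag-gated native path with Python fallback
import fastmod                      # hypothetical Rust extension module
from myflags import flag_enabled    # hypothetical feature-flag helper

def sum_squares_gated(n):
    if flag_enabled("use_rust_sum_squares"):
        try:
            return fastmod.fast_sum(n)
        except Exception:
            pass  # fall back to Python; log and alert in real code
    return sum(x * x for x in range(n))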

Caveats and operational notes

  • Developer velocity matters. Rust migration has up-front cost. Use it when ROI is clear.
  • Native code adds memory-safety considerations and deployment complexity. Test memory and fuzz.
  • Use short-lived binaries or FFI with careful error boundaries to avoid crashing the orchestrator.

Final Takeaways

Rust is not a religion. It is a tool. Use it when the numbers show a clear return. The right pattern is often a hybrid: Python orchestrates, Rust computes.

If a single example in this article resonates, follow these steps now: measure, prototype, gate. Share the benchmark back in the comments or as a thread. I will read it and give concrete feedback.

If this article helped, follow for more deep, practical write-ups on performance engineering and safe migrations.
