
Deadlocks We Deserved: How Rust Retries and Postgres Locks Finally Agreed


Rust felt bulletproof until Postgres fought back. Our system could not go a week without a deadlock on the busiest tables, even as our retry logic grew cleverer. Metrics looked fine, then fell off a cliff without warning. The bug reports stacked up. We knew our code was safe, yet Postgres held all the cards — and for a long time, we played the wrong hand. Here is how we finally brokered a truce.

When Safe Rust Still Loses to the Database

We wrote careful code. Each update was wrapped in a transaction; every connection was pooled and checked.

It worked — until we hit concurrency on live workloads. Postgres would raise a deadlock detected error and roll back one of our writes.

The confusion was real: Rust caught panics, but not this. Our application spun, retried, and sometimes spiraled.

use diesel::prelude::*;
// Assuming the usual Diesel schema module for the table used here.
use crate::schema::my_table::dsl::{my_table, id, val};

// Two UPDATEs in one transaction: each takes a row-level lock that is
// held until the transaction commits or rolls back.
fn update_rows(conn: &mut PgConnection, id1: i32, id2: i32) -> QueryResult<()> {
    conn.transaction(|tx| {
        diesel::update(my_table.filter(id.eq(id1)))
            .set(val.eq(42))
            .execute(tx)?;
        diesel::update(my_table.filter(id.eq(id2)))
            .set(val.eq(99))
            .execute(tx)?;
        Ok(())
    })
}

In production, retry-on-deadlock logic masked the pain, but user-facing latency doubled at peak.

Sometimes both updates failed, and our alerts barely caught it. Smart retry logic is not a bandage you can just slap on; it needs a handshake with the database, one we were not making.

How Deadlocks Formed: Visualizing the Enemy

Debugging blind was the norm. Postgres logs hinted at cycle waits but never showed the real bottleneck.

We traced query graphs and mapped how our concurrent transactions grabbed locks in different order. This ASCII map made the pattern visible:

+-----------+        +-----------+
| txn 1: A  | -----> |  waits B  |
+-----------+        +-----------+
      ^                   |
      |                   v
+-----------+        +-----------+
| txn 2: B  | <----- |  waits A  |
+-----------+        +-----------+

Every so often, two transactions would each lock one row and wait on the other — classic deadlock.
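
To make the pattern reproducible outside production, here is a minimal sketch (hypothetical: it reuses the update_rows function from above and assumes two existing rows with ids 1 and 2 in a scratch database) that runs the same update from two threads with the ids swapped:

use diesel::prelude::*;
use std::thread;

// Hypothetical reproduction: two threads update the same pair of rows in
// opposite order. If the transactions interleave, each locks one row and
// then blocks on the other's; Postgres detects the cycle and aborts one
// transaction with SQLSTATE 40P01.
fn reproduce_deadlock(db_url: &str) {
    let (url_a, url_b) = (db_url.to_string(), db_url.to_string());
    let t1 = thread::spawn(move || {
        let mut conn = PgConnection::establish(&url_a).expect("connect");
        update_rows(&mut conn, 1, 2) // locks row 1, then waits on row 2
    });
    let t2 = thread::spawn(move || {
        let mut conn = PgConnection::establish(&url_b).expect("connect");
        update_rows(&mut conn, 2, 1) // locks row 2, then waits on row 1
    });
    // When the cycle forms, one join returns Ok(()) and the other an
    // Err carrying the "deadlock detected" database error.
    println!("{:?} / {:?}", t1.join().unwrap(), t2.join().unwrap());
}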

Metrics showed random latency spikes and aborts when traffic crossed a certain threshold.

Modeling lock order made the next fix possible.

Making Rust Retries and Postgres Work Together

We rewired the retry logic. Instead of blind exponential backoff, we structured retries to break lock cycles.

The key: always lock rows in a consistent order by primary key, no matter which code path was taken.

// Sort ids ascending so every code path acquires row locks in the
// same (primary-key) order.
fn lock_in_order(ids: &mut [i32]) {
    ids.sort_unstable();
}

fn safe_update(conn: &mut PgConnection, mut ids: [i32; 2]) -> QueryResult<()> {
    lock_in_order(&mut ids);
    conn.transaction(|tx| {
        // `row_id` avoids shadowing the `id` column from the Diesel DSL.
        for &row_id in &ids {
            diesel::update(my_table.filter(id.eq(row_id)))
                .set(val.eq(77))
                .execute(tx)?;
        }
        Ok(())
    })
}

After deployment, p99 latency dropped by 30%. Write conflicts went down by two-thirds, and our dashboard stayed green through peak traffic.

The core idea is simple: order your locks, and your database will not force you to choose who loses.
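
The same idea can also be made explicit at the SQL level. As a variation we did not ship in this form (a sketch only, assuming Diesel's for_update locking clause and a plan that returns rows in index order), all row locks can be taken up front with SELECT ... ORDER BY id FOR UPDATE before any update runs:

use diesel::prelude::*;
use crate::schema::my_table::dsl::{my_table, id, val};

// Variation (not from the article): acquire every row lock up front,
// in ascending primary-key order, then apply the updates while all
// locks are already held.
fn locked_update(conn: &mut PgConnection, mut ids: Vec<i32>) -> QueryResult<()> {
    ids.sort_unstable();
    conn.transaction(|tx| {
        // SELECT ... ORDER BY id FOR UPDATE: lock the target rows first.
        let _locked: Vec<i32> = my_table
            .select(id)
            .filter(id.eq_any(&ids))
            .order(id.asc())
            .for_update()
            .load(tx)?;
        // With the locks held, this UPDATE cannot join a lock cycle.
        diesel::update(my_table.filter(id.eq_any(&ids)))
            .set(val.eq(77))
            .execute(tx)?;
        Ok(())
    })
}

The trade-off is one extra round trip per transaction, in exchange for lock acquisition that is visible in the SQL rather than implied by update order.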

Tracing Retries in Rust: When It Still Fails

We still hit rare deadlocks under stress. Rust’s error handling needed to distinguish transient deadlocks from true logic bugs.

We wrapped our operations with a limited retry block that only retried when Postgres returned SQLSTATE 40P01, the deadlock_detected error.

use std::time::Duration;
use diesel::result::Error;

// Retries `op` up to three times, but only when the failure is a
// Postgres deadlock; every other error is returned immediately.
// The sleep grows linearly with each attempt (40, 80, 120 ms).
fn with_deadlock_retry<F, T>(mut op: F) -> Result<T, Error>
where
    F: FnMut() -> Result<T, Error>,
{
    let mut attempts = 0;
    loop {
        match op() {
            Ok(val) => return Ok(val),
            Err(e) if is_deadlock(&e) && attempts < 3 => {
                attempts += 1;
                std::thread::sleep(Duration::from_millis(40 * attempts));
            }
            Err(e) => return Err(e),
        }
    }
}

This structure prevented runaway retries and ensured we only retried for deadlocks — not all errors.
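
The is_deadlock check itself is not shown above. A minimal sketch, assuming Diesel's Error::DatabaseError variant and falling back to the "deadlock detected" message text that Postgres attaches to 40P01 (the SQLSTATE is not exposed directly through this error info), might look like:

use diesel::result::Error;

// Hypothetical helper: classify only genuine Postgres deadlocks as
// retryable. Postgres reports them with SQLSTATE 40P01 and the message
// "deadlock detected"; this sketch matches on the message text.
fn is_deadlock(e: &Error) -> bool {
    match e {
        Error::DatabaseError(_, info) => info.message().contains("deadlock detected"),
        _ => false,
    }
}

Everything else falls straight through to the caller as a real error.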

After this change, error rates became predictable, and failed operations did not snowball into outages.

You do not want to mask every failure. You want clear, actionable errors.

The Mini Table That Made Us Believe

After weeks of patching, we finally compared before/after runs. We used a simple table to keep score:

| Scenario   | Deadlock Rate | p99 Latency |
|------------|--------------|-------------|
| Old Logic  |    0.41%     |   410 ms    |
| New Order  |    0.08%     |   272 ms    |

The numbers did not lie. Most deadlocks disappeared, and the slow tail got much faster for everyone.

Every prod metric improved. The code was not just safer — it was friendlier to our users.

Drawing the Real Boundary: App or Database

For months, we tried to outsmart Postgres from the app layer alone. The real fix was to treat database locks as a first-class part of our system, not an afterthought.

This ASCII diagram captures the handshake we built:

+---------------+        +-------------+
|  Rust Layer   | <----> |  Postgres   |
+---------------+        +-------------+
| Orders locks, |        |  Detects    |
| retries 40P01 |        |  deadlocks  |
+---------------+        +-------------+

Now, every retry and every lock acquisition is designed as a two-way agreement. We test lock order locally, stage it, and only then trust our retry code. The boundary is not between safe code and unsafe SQL. It is the shared logic in between.
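
As a closing sketch (hypothetical glue, reusing safe_update and with_deadlock_retry from the snippets above), the two halves of that agreement compose into a single call site:

// Hypothetical glue: ordered locking inside the transaction, bounded
// deadlock-only retries around it.
fn update_pair(conn: &mut PgConnection, ids: [i32; 2]) -> QueryResult<()> {
    with_deadlock_retry(|| safe_update(&mut *conn, ids))
}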

The Short Path to Real Wins

We earned every deadlock that Postgres threw at us. Our first fixes just moved the pain around; only when we changed how we acquired locks did the system recover its speed and predictability. The code is safer, the database is happier, and our users finally see stable response times.

Every system is a handshake — one side in Rust, one side in Postgres.

Read the full article here: https://medium.com/@maahisoft20/deadlocks-we-deserved-how-rust-retries-and-postgres-locks-finally-agreed-9774e06825be