
Supervision and Fault Tolerance in Actor Systems for Rust


In the first post, we explored how the Actor model eliminates shared state and makes concurrent programming tractable. We built a distributed counter system where actors communicated through messages and maintained isolated state. Everything worked perfectly because we carefully avoided failures.

Real systems don’t have that luxury. Network connections drop. External APIs time out. Memory runs out. Bugs slip through code review. The question isn’t whether your system will encounter failures, but how it responds when it does. Traditional threading models force you to handle errors at every level, wrapping every operation in try-catch blocks and hoping you’ve covered all the edge cases. Miss one, and your entire process crashes.

The Actor model offers something better. Instead of trying to prevent all failures, it embraces failure as inevitable and provides structured ways to recover. This is supervision, and it’s the feature that transforms actors from a concurrency primitive into a fault-tolerance framework.

The Philosophy of Letting Things Fail

Erlang, where the Actor model reached maturity, has a principle: let it crash. This sounds reckless until you understand what it really means. It’s not about ignoring errors. It’s about separating error detection from error recovery.

When you write defensive code that tries to handle every possible failure mode, you end up with complex error-handling logic scattered throughout your application. Your business logic becomes tangled with recovery logic. Worse, you can’t anticipate every failure mode. The database might return data in an unexpected format. A configuration file might be corrupted in a way you never tested. A race condition you never imagined might surface under load.

Supervision inverts this model. Your actors focus on the happy path. When something goes wrong and an actor can’t continue, it crashes. A supervisor actor, whose sole job is managing other actors, detects the crash and decides what to do. Restart the actor with clean state? Stop it permanently? Restart related actors that might be in an inconsistent state? The supervisor makes these decisions based on declarative policies, not scattered error-handling code.

This separation of concerns makes systems dramatically easier to reason about. Your actors contain business logic. Your supervisors contain recovery logic. When you need to change how your system responds to failures, you modify supervisor strategies, not every actor in the system.

Supervision Strategies

Ractor supports multiple supervision strategies, each appropriate for different failure scenarios. The choice of strategy depends on whether the failed actor’s state is independent or whether other actors might be affected.

One-for-one supervision restarts only the actor that failed. This works when actors are independent. If your market data feed actor crashes, you restart just that actor. The order execution actor keeps running. The risk management actor is unaffected. Each actor’s failure is isolated.

One-for-all supervision restarts every child actor when any one fails. This makes sense when actors share state or invariants that might be violated by a single actor’s failure. If one actor in a transaction processing pipeline crashes mid-transaction, you might need to restart all actors in that pipeline to restore consistency.

Rest-for-one supervision sits between these extremes. When an actor crashes, the supervisor restarts that actor and any actors that were started after it. This maintains startup order dependencies. If actor B depends on actor A being initialized, and A crashes, you need to restart B as well to maintain the dependency.

The power comes from composing these strategies into supervision trees. A supervisor is itself an actor, which means it can be supervised by another supervisor. You end up with a hierarchy where leaf actors do work, middle-layer supervisors manage related groups of workers, and top-level supervisors manage subsystems.
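To make the three strategies concrete before we wire anything into Ractor, here is a minimal, self-contained sketch. The RestartStrategy enum and children_to_restart helper are illustrative, not part of Ractor’s API; in Ractor you encode the policy yourself inside handle_supervisor_evt, as we’ll do below. Children are indexed in start order.

#[derive(Debug, Clone, Copy)]
enum RestartStrategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

// Given the index of the failed child, return the indices of the
// children that this strategy would restart.
fn children_to_restart(strategy: RestartStrategy, failed: usize, child_count: usize) -> Vec<usize> {
    match strategy {
        RestartStrategy::OneForOne => vec![failed],
        RestartStrategy::OneForAll => (0..child_count).collect(),
        RestartStrategy::RestForOne => (failed..child_count).collect(),
    }
}

fn main() {
    // Four children, child 1 fails:
    assert_eq!(children_to_restart(RestartStrategy::OneForOne, 1, 4), vec![1]);
    assert_eq!(children_to_restart(RestartStrategy::OneForAll, 1, 4), vec![0, 1, 2, 3]);
    assert_eq!(children_to_restart(RestartStrategy::RestForOne, 1, 4), vec![1, 2, 3]);
}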
Building a Trading System

Let’s make this concrete with a trading system. We need several components that interact but can fail independently: a market data feed that streams price updates from an exchange, an order book that maintains the current state of bids and asks, a risk manager that validates orders before execution, and an order executor that sends orders to the exchange.

Each of these can fail in different ways. The market data feed might lose its connection. The order book might receive corrupted data. The risk manager might encounter an edge case in a complex calculation. The order executor might time out waiting for the exchange. Our supervision strategy needs to handle all of these gracefully.

Here’s the market data feed actor:

use ractor::{Actor, ActorProcessingErr, ActorRef, SupervisionEvent};
use std::time::Duration;
use tokio::time::sleep;

#[derive(Debug, Clone)]
pub struct MarketDataTick {

   pub symbol: String,
   pub price: f64,
   pub volume: u64,
   pub timestamp: u64,

}

pub enum MarketDataMessage {

   Start,
   Stop,
   Subscribe(ActorRef<OrderBookMessage>),

}

pub struct MarketDataFeed {

   symbol: String,

}

pub struct MarketDataState {

   symbol: String,
   subscribers: Vec<ActorRef<OrderBookMessage>>,
   connection_attempts: u32,
   is_running: bool,
   skip_first_failure: bool,

}

#[ractor::async_trait]
impl Actor for MarketDataFeed {

   type Msg = MarketDataMessage;
   type State = MarketDataState;
   type Arguments = (String, bool);
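   // Arguments are (symbol, skip_first_failure): a restarted instance
   // passes true so it skips the simulated connection failure below.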
   async fn pre_start(
       &self,
       _myself: ActorRef<Self::Msg>,
       (symbol, skip_first_failure): Self::Arguments,
   ) -> Result<Self::State, ActorProcessingErr> {
       println!("[MarketData] Starting feed for {}", symbol);
       Ok(MarketDataState {
           symbol,
           subscribers: Vec::new(),
           connection_attempts: 0,
           is_running: false,
           skip_first_failure,
       })
   }
   async fn handle(
       &self,
       _myself: ActorRef<Self::Msg>,
       message: Self::Msg,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match message {
           MarketDataMessage::Start => {
               state.is_running = true;
               state.connection_attempts += 1;
               // Simulate connection failure on first attempt (unless we're a restarted actor)
               if state.connection_attempts == 1 && !state.skip_first_failure {
                   return Err(ActorProcessingErr::from(
                       "Failed to connect to exchange"
                   ));
               }
               println!("[MarketData] Connected to exchange for {}", state.symbol);
               
               // Start streaming data
               let symbol = state.symbol.clone();
               let subscribers = state.subscribers.clone();
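                // The spawned task streams to a snapshot of the subscriber list,
                // so actors that subscribe after Start won't see these ticks.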
               tokio::spawn(async move {
                   for i in 0..10 {
                       let tick = MarketDataTick {
                           symbol: symbol.clone(),
                           price: 100.0 + (i as f64 * 0.5),
                           volume: 1000 + (i * 100),
                           timestamp: i,
                       };
                       
                       for subscriber in &subscribers {
                           let _ = subscriber.cast(OrderBookMessage::UpdatePrice(tick.clone()));
                       }
                       
                       sleep(Duration::from_millis(500)).await;
                   }
               });
           }
           MarketDataMessage::Stop => {
               state.is_running = false;
               println!("[MarketData] Stopping feed for {}", state.symbol);
           }
           MarketDataMessage::Subscribe(subscriber) => {
               println!("[MarketData] New subscriber for {}", state.symbol);
               state.subscribers.push(subscriber);
           }
       }
       Ok(())
   }
   async fn post_stop(
       &self,
       _myself: ActorRef<Self::Msg>,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       println!(
           "[MarketData] Stopped after {} connection attempts",
           state.connection_attempts
       );
       Ok(())
   }

}

The market data feed deliberately fails on its first connection attempt. In a real system, this might be a network timeout or an authentication failure. The actor doesn’t try to handle this itself. It returns an error and crashes, trusting its supervisor to restart it.

Now the order book that maintains market state:

pub enum OrderBookMessage {

   UpdatePrice(MarketDataTick),
   GetBestBid(ractor::RpcReplyPort<Option<f64>>),
   GetBestAsk(ractor::RpcReplyPort<Option<f64>>),
   PlaceOrder(Order),
   UpdateRiskManager(ActorRef<RiskMessage>),

}

#[derive(Debug, Clone)]
pub struct Order {

   pub order_id: String,
   pub symbol: String,
   pub side: OrderSide,
   pub quantity: u64,
   pub price: f64,

}

#[derive(Debug, Clone)]
pub enum OrderSide {

   Buy,
   Sell,

}

pub struct OrderBook {

   symbol: String,

}

pub struct OrderBookState {

   symbol: String,
   best_bid: Option<f64>,
   best_ask: Option<f64>,
   last_price: Option<f64>,
   update_count: u64,
   risk_manager: Option<ActorRef<RiskMessage>>,

}

#[ractor::async_trait]
impl Actor for OrderBook {

   type Msg = OrderBookMessage;
   type State = OrderBookState;
   type Arguments = (String, Option<ActorRef<RiskMessage>>);
   async fn pre_start(
       &self,
       _myself: ActorRef<Self::Msg>,
       (symbol, risk_manager): Self::Arguments,
   ) -> Result<Self::State, ActorProcessingErr> {
       println!("[OrderBook] Initializing for {}", symbol);
       Ok(OrderBookState {
           symbol,
           best_bid: None,
           best_ask: None,
           last_price: None,
           update_count: 0,
           risk_manager,
       })
   }
   async fn handle(
       &self,
       _myself: ActorRef<Self::Msg>,
       message: Self::Msg,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match message {
           OrderBookMessage::UpdatePrice(tick) => {
               state.update_count += 1;
               state.last_price = Some(tick.price);
               
               // Simulate spread
               state.best_bid = Some(tick.price - 0.1);
               state.best_ask = Some(tick.price + 0.1);
               
               if state.update_count % 5 == 0 {
                   println!(
                       "[OrderBook] {} updates: last={:.2}, bid={:.2}, ask={:.2}",
                       state.update_count,
                       tick.price,
                       state.best_bid.unwrap(),
                       state.best_ask.unwrap()
                   );
               }
           }
           OrderBookMessage::GetBestBid(reply) => {
               reply.send(state.best_bid)?;
           }
           OrderBookMessage::GetBestAsk(reply) => {
               reply.send(state.best_ask)?;
           }
           OrderBookMessage::PlaceOrder(order) => {
               if let Some(ref risk_manager) = state.risk_manager {
                   println!("[OrderBook] Forwarding order {} to risk check", order.order_id);
                   risk_manager.cast(RiskMessage::CheckOrder(order))?;
               } else {
                   println!("[OrderBook] No risk manager configured, rejecting order");
               }
           }
           OrderBookMessage::UpdateRiskManager(new_risk_manager) => {
               println!("[OrderBook] Updating risk manager reference");
               state.risk_manager = Some(new_risk_manager);
           }
       }
       Ok(())
   }

}

The risk manager validates orders against position limits and exposure rules:

pub enum RiskMessage {

   CheckOrder(Order),
   GetExposure(ractor::RpcReplyPort<f64>),
   Reset,

}

pub struct RiskManager;

pub struct RiskState {

   max_position_size: u64,
   current_exposure: f64,
   orders_checked: u64,
   orders_rejected: u64,
   executor: Option<ActorRef<ExecutorMessage>>,

}

#[ractor::async_trait]
impl Actor for RiskManager {

   type Msg = RiskMessage;
   type State = RiskState;
   type Arguments = (u64, Option<ActorRef<ExecutorMessage>>);
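   // Arguments are (max_position_size, optional executor that receives approved orders).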
   async fn pre_start(
       &self,
       _myself: ActorRef<Self::Msg>,
       (max_position_size, executor): Self::Arguments,
   ) -> Result<Self::State, ActorProcessingErr> {
       println!("[RiskManager] Starting with max position {}", max_position_size);
       Ok(RiskState {
           max_position_size,
           current_exposure: 0.0,
           orders_checked: 0,
           orders_rejected: 0,
           executor,
       })
   }
   async fn handle(
       &self,
       _myself: ActorRef<Self::Msg>,
       message: Self::Msg,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match message {
           RiskMessage::CheckOrder(order) => {
               state.orders_checked += 1;
               
               // Check position limits
               if order.quantity > state.max_position_size {
                   state.orders_rejected += 1;
                   println!(
                       "[RiskManager] REJECTED order {}: quantity {} exceeds limit {}",
                       order.order_id, order.quantity, state.max_position_size
                   );
                   return Ok(());
               }
               
               // Simulate risk check failure for specific order
               if order.order_id == "ORDER-FAIL" {
                   return Err(ActorProcessingErr::from(
                       "Risk calculation error: invalid market state"
                   ));
               }
               
               let order_value = order.quantity as f64 * order.price;
               state.current_exposure += order_value;
               
               println!(
                   "[RiskManager] APPROVED order {}: exposure now {:.2}",
                   order.order_id, state.current_exposure
               );
               
               if let Some(ref executor) = state.executor {
                   executor.cast(ExecutorMessage::ExecuteOrder(order))?;
               }
           }
           RiskMessage::GetExposure(reply) => {
               reply.send(state.current_exposure)?;
           }
           RiskMessage::Reset => {
               println!("[RiskManager] Resetting state");
               state.current_exposure = 0.0;
               state.orders_checked = 0;
               state.orders_rejected = 0;
           }
       }
       Ok(())
   }

}

The order executor sends validated orders to the exchange:

pub enum ExecutorMessage {

   ExecuteOrder(Order),
   GetExecutedCount(ractor::RpcReplyPort<u64>),

}

pub struct OrderExecutor;

pub struct ExecutorState {

   executed_count: u64,

}

#[ractor::async_trait]
impl Actor for OrderExecutor {

   type Msg = ExecutorMessage;
   type State = ExecutorState;
   type Arguments = ();
   async fn pre_start(
       &self,
       _myself: ActorRef<Self::Msg>,
       _args: Self::Arguments,
   ) -> Result<Self::State, ActorProcessingErr> {
       println!("[Executor] Starting order executor");
       Ok(ExecutorState {
           executed_count: 0,
       })
   }
   async fn handle(
       &self,
       _myself: ActorRef<Self::Msg>,
       message: Self::Msg,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match message {
           ExecutorMessage::ExecuteOrder(order) => {
               // Simulate sending to exchange
               println!(
                   "[Executor] Executing order {}: {} {} @ {:.2}",
                   order.order_id,
                   match order.side {
                       OrderSide::Buy => "BUY",
                       OrderSide::Sell => "SELL",
                   },
                   order.quantity,
                   order.price
               );
               
               sleep(Duration::from_millis(100)).await;
               state.executed_count += 1;
               
               println!("[Executor] Order {} filled", order.order_id);
           }
           ExecutorMessage::GetExecutedCount(reply) => {
               reply.send(state.executed_count)?;
           }
       }
       Ok(())
   }

}

Now we need the supervisor that manages these components:

pub enum TradingSystemMessage {

   Start,
   SubmitOrder(Order),
   GetSystemStats(ractor::RpcReplyPort<SystemStats>),
   Shutdown,

}

#[derive(Debug, Clone)]
pub struct SystemStats {

   pub market_data_restarts: u32,
   pub risk_manager_restarts: u32,
   pub orders_executed: u64,

}

pub struct TradingSystemSupervisor {

   symbol: String,

}

pub struct SupervisorState {

   symbol: String,
   market_data: Option<ActorRef<MarketDataMessage>>,
   order_book: Option<ActorRef<OrderBookMessage>>,
   risk_manager: Option<ActorRef<RiskMessage>>,
   executor: Option<ActorRef<ExecutorMessage>>,
   market_data_restarts: u32,
   risk_manager_restarts: u32,

}

#[ractor::async_trait]
impl Actor for TradingSystemSupervisor {

   type Msg = TradingSystemMessage;
   type State = SupervisorState;
   type Arguments = String;
   async fn pre_start(
       &self,
       myself: ActorRef<Self::Msg>,
       symbol: Self::Arguments,
   ) -> Result<Self::State, ActorProcessingErr> {
       println!("[Supervisor] Initializing trading system for {}", symbol);
       
       // Spawn executor (no supervision needed, stateless)
       let (executor_ref, _) = Actor::spawn(
           Some(format!("{}-executor", symbol)),
           OrderExecutor,
           (),
       ).await?;
       
       // Spawn risk manager with supervision
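        // spawn_linked takes the supervisor's ActorCell (hence the .into()),
        // so this child's failures arrive in handle_supervisor_evt below.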
       let (risk_ref, _) = Actor::spawn_linked(
           Some(format!("{}-risk", symbol)),
           RiskManager,
           (10000, Some(executor_ref.clone())),
           myself.clone().into(),
       ).await?;
       
       // Spawn order book
       let (book_ref, _) = Actor::spawn(
           Some(format!("{}-book", symbol)),
           OrderBook { symbol: symbol.clone() },
           (symbol.clone(), Some(risk_ref.clone())),
       ).await?;
       
       // Spawn market data with supervision (allow first failure)
       let (market_ref, _) = Actor::spawn_linked(
           Some(format!("{}-market", symbol)),
           MarketDataFeed { symbol: symbol.clone() },
           (symbol.clone(), false),
           myself.clone().into(),
       ).await?;
       
       // Subscribe order book to market data
       market_ref.cast(MarketDataMessage::Subscribe(book_ref.clone()))?;
       
       Ok(SupervisorState {
           symbol,
           market_data: Some(market_ref),
           order_book: Some(book_ref),
           risk_manager: Some(risk_ref),
           executor: Some(executor_ref),
           market_data_restarts: 0,
           risk_manager_restarts: 0,
       })
   }
   async fn handle(
       &self,
       _myself: ActorRef<Self::Msg>,
       message: Self::Msg,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match message {
           TradingSystemMessage::Start => {
               if let Some(ref market_data) = state.market_data {
                   println!("[Supervisor] Starting market data feed");
                   market_data.cast(MarketDataMessage::Start)?;
               }
           }
           TradingSystemMessage::SubmitOrder(order) => {
               if let Some(ref order_book) = state.order_book {
                   order_book.cast(OrderBookMessage::PlaceOrder(order))?;
               }
           }
           TradingSystemMessage::GetSystemStats(reply) => {
               let executed = if let Some(ref executor) = state.executor {
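                    // call() wraps the reply in a CallResult; treat anything
                    // other than Success as zero executed orders.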
                   match executor.call(
                       |r| ExecutorMessage::GetExecutedCount(r),
                       Some(Duration::from_secs(1))
                   ).await {
                       Ok(ractor::rpc::CallResult::Success(count)) => count,
                       _ => 0,
                   }
               } else {
                   0
               };
               reply.send(SystemStats {
                   market_data_restarts: state.market_data_restarts,
                   risk_manager_restarts: state.risk_manager_restarts,
                   orders_executed: executed,
               })?;
           }
           TradingSystemMessage::Shutdown => {
               println!("[Supervisor] Shutting down trading system");
           }
       }
       Ok(())
   }
   async fn handle_supervisor_evt(
       &self,
       myself: ActorRef<Self::Msg>,
       event: SupervisionEvent,
       state: &mut Self::State,
   ) -> Result<(), ActorProcessingErr> {
       match event {
           SupervisionEvent::ActorFailed(cell, error) => {
               println!("[Supervisor] Actor failed: {}", error);
               // Check which actor failed and restart it immediately
               if let Some(ref market_data) = state.market_data {
                   if market_data.get_cell() == cell {
                       println!("[Supervisor] Market data feed failed, restarting...");
                       state.market_data_restarts += 1;
                       // On restart, skip the simulated failure
                       let (market_ref, _) = Actor::spawn_linked(
                           Some(format!("{}-market", state.symbol)),
                           MarketDataFeed { symbol: state.symbol.clone() },
                           (state.symbol.clone(), true),
                           myself.clone().into(),
                       ).await?;
                       if let Some(ref order_book) = state.order_book {
                           market_ref.cast(MarketDataMessage::Subscribe(order_book.clone()))?;
                       }
                       // Restart the feed
                       market_ref.cast(MarketDataMessage::Start)?;
                       state.market_data = Some(market_ref);
                       println!("[Supervisor] Market data feed restarted");
                       return Ok(());
                   }
               }
               if let Some(ref risk_manager) = state.risk_manager {
                   if risk_manager.get_cell() == cell {
                       println!("[Supervisor] Risk manager failed, restarting...");
                       state.risk_manager_restarts += 1;
                       let executor = state.executor.clone();
                       let (risk_ref, _) = Actor::spawn_linked(
                           Some(format!("{}-risk", state.symbol)),
                           RiskManager,
                           (10000, executor),
                           myself.clone().into(),
                       ).await?;
                       state.risk_manager = Some(risk_ref.clone());
                       // Update order book with new risk manager reference
                       if let Some(ref order_book) = state.order_book {
                           order_book.cast(OrderBookMessage::UpdateRiskManager(risk_ref.clone()))?;
                           println!("[Supervisor] Risk manager restarted with clean state");
                       }
                       return Ok(());
                   }
               }
           }
           SupervisionEvent::ActorTerminated(cell, _, _) => {
               // Actor terminated normally, no action needed
               println!("[Supervisor] Actor terminated: {:?}", cell);
           }
           _ => {}
       }
       Ok(())
   }

}
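Here’s the complete system architecture in outline, reconstructed from the spawn calls in pre_start above (the linked children are the ones this supervisor restarts):

    TradingSystemSupervisor ("trading-supervisor")
     |- MarketDataFeed  ("{symbol}-market")    spawn_linked: supervised, restarted on failure
     |- RiskManager     ("{symbol}-risk")      spawn_linked: supervised, restarted with clean state
     |- OrderBook       ("{symbol}-book")      spawn: unsupervised
     '- OrderExecutor   ("{symbol}-executor")  spawn: unsupervised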

Notice the supervision relationships. The market data feed and risk manager are supervised because they’re stateful and can fail in recoverable ways. The order book and executor aren’t supervised because the order book’s state can be rebuilt from market data, and the executor is essentially stateless.

Now let’s run this system and trigger some failures:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

   println!("=== Starting Trading System ===\n");
   
   let (supervisor_ref, _supervisor_handle) = Actor::spawn(
       Some("trading-supervisor".to_string()),
       TradingSystemSupervisor { symbol: "AAPL".to_string() },
       "AAPL".to_string(),
   ).await?;
   
   // Start the market data feed (will fail once, then restart)
   supervisor_ref.cast(TradingSystemMessage::Start)?;
   
   // Wait for market data to stabilize
   sleep(Duration::from_secs(2)).await;
   
   println!("\n=== Submitting Orders ===\n");
   
   // Submit some valid orders
   for i in 0..5 {
       let order = Order {
           order_id: format!("ORDER-{}", i),
           symbol: "AAPL".to_string(),
           side: if i % 2 == 0 { OrderSide::Buy } else { OrderSide::Sell },
           quantity: 100,
           price: 100.0 + (i as f64),
       };
       supervisor_ref.cast(TradingSystemMessage::SubmitOrder(order))?;
       sleep(Duration::from_millis(200)).await;
   }
   
   // Submit an order that will cause risk manager to crash
   println!("\n=== Triggering Risk Manager Failure ===\n");
   let failing_order = Order {
       order_id: "ORDER-FAIL".to_string(),
       symbol: "AAPL".to_string(),
       side: OrderSide::Buy,
       quantity: 500,
       price: 105.0,
   };
   supervisor_ref.cast(TradingSystemMessage::SubmitOrder(failing_order))?;
   
   // Wait for restart
   sleep(Duration::from_secs(2)).await;
   
   // Submit more orders after recovery
   println!("\n=== Submitting Orders After Recovery ===\n");
   for i in 5..8 {
       let order = Order {
           order_id: format!("ORDER-{}", i),
           symbol: "AAPL".to_string(),
           side: OrderSide::Buy,
           quantity: 100,
           price: 100.0 + (i as f64),
       };
       supervisor_ref.cast(TradingSystemMessage::SubmitOrder(order))?;
       sleep(Duration::from_millis(200)).await;
   }
   
   sleep(Duration::from_secs(1)).await;
   
   // Get final statistics
    // call() returns a CallResult, so unwrap the Success variant explicitly
    let stats = match supervisor_ref.call(
        |reply| TradingSystemMessage::GetSystemStats(reply),
        Some(Duration::from_secs(1))
    ).await? {
        ractor::rpc::CallResult::Success(stats) => stats,
        _ => return Err("failed to fetch system stats".into()),
    };
   
   println!("\n=== System Statistics ===");
   println!("Market data restarts: {}", stats.market_data_restarts);
   println!("Risk manager restarts: {}", stats.risk_manager_restarts);
   println!("Orders executed: {}", stats.orders_executed);
   
   supervisor_ref.cast(TradingSystemMessage::Shutdown)?;
   sleep(Duration::from_millis(500)).await;
   
   Ok(())

}

When you run this, you’ll see the market data feed fail on its first connection attempt, get restarted by the supervisor, and then connect successfully. Later, when the risk manager crashes while processing a specific order, it is also restarted with clean state. The rest of the system keeps running through both failures.

What Supervision Buys You

The trading system demonstrates several key properties that emerge from supervision.

First, failures are isolated. When the market data feed crashes, it doesn’t take down the order book or risk manager. Each actor fails independently, and the supervisor contains the blast radius.

Second, recovery is automatic and consistent. You don’t scatter error-handling logic throughout your code; the supervisor embodies the recovery policy. Want to change how the system responds to market data failures? Modify the supervisor’s restart policy. The individual actors don’t change.

Third, state management becomes explicit. When the risk manager restarts, it gets clean state. The supervisor knows this happened and can take additional actions, like notifying other actors or resetting related state. Compare this to traditional error handling, where you try to clean up state in finally blocks scattered across your codebase.

Fourth, monitoring and observability fall out naturally. The supervisor tracks restart counts and failure patterns. You can expose these metrics without instrumenting every actor. In a production system, you’d send these events to your monitoring infrastructure and set alerts on restart rates.

Designing Supervision Hierarchies

The trading system uses a flat supervision structure where one supervisor manages all components. Real systems often need deeper hierarchies. You might have a market data supervisor that manages feeds for multiple symbols, each with its own set of parsing and validation actors. A risk supervisor might manage different risk checks as separate actors. An execution supervisor might manage connections to multiple exchanges.

Each level of the hierarchy makes supervision decisions appropriate to its scope. The market data supervisor knows that if a parser for one symbol crashes, it shouldn’t affect parsers for other symbols. That’s one-for-one supervision. The execution supervisor might know that if the connection to an exchange fails, all pending orders to that exchange need to be cancelled. That might warrant one-for-all supervision, or at least notifying all related actors.

The key is matching your supervision structure to your system’s failure domains. Actors that share fate should be grouped under the same supervisor. Actors that can fail independently should be in separate supervision branches. This is the art of building resilient actor systems.

Beyond Simple Restarts

We’ve focused on restart strategies, but supervision encompasses more than just restarting failed actors. Supervisors can escalate failures to their own supervisors when repeated restarts fail. They can implement backoff strategies, waiting longer between each restart attempt. They can maintain circuit breakers, temporarily stopping actors that are failing too frequently.

Ractor provides the primitives for all of these patterns. The handle_supervisor_evt method gives you complete control over how failures are handled. You can track restart counts, implement exponential backoff, or even replace a failing actor with a fallback implementation.
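As a sketch of the backoff idea, the delay schedule can be a pure function of the consecutive-failure count. The restart_delay helper below is hypothetical, not a Ractor API; in the supervisor above you would sleep(restart_delay(state.market_data_restarts)).await before the Actor::spawn_linked call.

use std::time::Duration;

// Exponential backoff with a cap: 500ms, 1s, 2s, 4s, ... topping out at 30s.
// `attempt` counts consecutive failures of the same child.
fn restart_delay(attempt: u32) -> Duration {
    let shift = attempt.min(6); // bound the shift so the delay stays capped
    Duration::from_millis((500u64 << shift).min(30_000))
}

fn main() {
    for attempt in 0..8 {
        println!("attempt {}: wait {:?}", attempt, restart_delay(attempt));
    }
}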
In a production trading system, you might have the supervisor switch to a backup market data provider after three failed connection attempts. Or it might notify operations when the risk manager restarts more than twice in a minute. Or it might gradually reduce position limits if execution failures suggest exchange issues. All of these policies live in supervisors, keeping your business logic clean.

The Path to Distribution

We’ve built a resilient trading system that handles failures gracefully within a single process. The actors communicate through local message passing, and the supervisor manages them all locally. But the trading system needs to scale beyond one machine. You need multiple instances for redundancy. You need to distribute order processing across regions. You need to integrate with external systems that might be in different data centers.

This is where we’re heading in the next post. We’ll take these supervision patterns and extend them across machines. We’ll explore how actor location transparency lets you treat remote actors the same as local ones. We’ll see how supervision trees can span processes and even continents. The patterns we’ve learned here don’t change. They just operate at a different scale.

For now, the critical insight is this: supervision transforms actors from a concurrency primitive into a fault-tolerance framework. By separating error detection from error recovery, by making failure explicit rather than exceptional, and by providing structured ways to manage actor lifecycles, supervision gives you the tools to build systems that don’t just handle failures, but expect them and recover automatically. That’s the foundation for building truly resilient distributed systems.

Code is available here: https://github.com/aovestdipaperino/medium-posts/tree/main/actor-ii

Read the full article here: https://medium.com/rustaceans/when-things-break-supervision-and-fault-tolerance-in-actor-systems-2b61d41660f9