Let’s play a little error handling game. Click the ✅ if you think crashing the process or server is appropriate, and the ❌ if you don’t. Then you’ll see my vote and justification.
Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the “working in degraded mode” state. On the other hand, if failures can be correlated (including by adversarial user behavior), it’s best to design the system to reject the cause of the errors and continue.
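As a rough illustration of that split, here is a minimal Python sketch. The exception names, `handle`, and `serve_one` are all hypothetical, invented for this example: an error caused by the request itself (which could arrive at every machine at once) is rejected, while a machine-local fault is allowed to propagate and crash the process.

```python
class BadRequest(Exception):
    """Hypothetical: an error caused by the request itself, so possibly correlated."""


class LocalInvariantViolation(Exception):
    """Hypothetical: a machine-local fault, e.g. corrupted in-memory state."""


def handle(request: dict) -> str:
    # Toy business logic: a request without a key is the caller's fault.
    if "key" not in request:
        raise BadRequest("missing 'key'")
    return "value for " + request["key"]


def serve_one(request: dict, handler=handle) -> dict:
    try:
        return {"status": 200, "body": handler(request)}
    except BadRequest as err:
        # The same bad input could reach every machine, so crashing here
        # would be a correlated failure. Reject the cause and keep serving.
        return {"status": 400, "body": str(err)}
    # LocalInvariantViolation is deliberately not caught: it is assumed to be
    # uncorrelated between machines, so it propagates and crashes the process.
```

The interesting design work is in deciding which exceptions belong in which class. The sketch only shows where that decision ends up living in the code.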
The bottom line is that error handling in systems isn’t a local property. The right way to handle errors is a global property of the system, and error handling needs to be built into the system from the beginning.
Getting this right is hard, and that’s where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you do the wrong thing you affect less than all your traffic – ideally a small percentage of traffic. Blast radius reduction is humility in the face of complexity.
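To make one of these techniques concrete, here is a small sketch of shuffle sharding in Python. The function names and the hashing scheme are assumptions made for illustration, not a description of any particular production system: each customer is deterministically mapped to a small pseudo-random subset of workers, so a customer who triggers a bad code path can take down only their own subset.

```python
import hashlib
import random
from math import comb
from typing import List


def shuffle_shard(customer_id: str, num_workers: int, shard_size: int) -> List[int]:
    """Pick a deterministic pseudo-random subset of workers for one customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Seed a private generator from the customer ID so the choice is stable
    # across calls and across processes.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(range(num_workers), shard_size))


def possible_shards(num_workers: int, shard_size: int) -> int:
    """Number of distinct shards, i.e. 'num_workers choose shard_size'."""
    return comb(num_workers, shard_size)
```

With 8 workers and shards of size 2 there are 28 possible shards, so two customers picked at random are unlikely to share both of their workers, and the overlap shrinks quickly as the fleet grows.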
Can they be handled at a higher layer? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher layer (e.g. by replacing instances or containers as they fail load balancer health checks using AWS Auto Scaling), but can’t handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, starting with Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and crashing rather than continuing is appropriate in more cases.
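The limit on how fast a higher layer can absorb crashes can be sketched as a restart budget. This is a simplified Python illustration loosely inspired by the idea of restart limits in supervisors. The class name, parameters, and sliding-window policy are assumptions for this example, not the behavior of any specific autoscaler or runtime.

```python
from collections import deque


class RestartBudget:
    """Allow at most max_restarts restarts within any window_seconds window."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            # Crashes are arriving faster than this layer can absorb them,
            # so the failure has to be handled somewhere else.
            return False
        self._times.append(now)
        return True
```

A supervisor using this would restart a crashed worker only while `allow_restart` returns `True`. The clock is passed in as `now` so the policy is easy to test. Crashing is a reasonable choice only while the crash rate stays inside a budget like this.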
Is it possible to meaningfully continue? This is where you need to understand your business logic. In most cases with configuration, and some cases with data, it’s possible to continue with the last-known good version. This adds complexity, by introducing the behavior mode of running with that version, but that complexity may be worth the additional resilience. On the other hand, in a database that handles updates via operations (e.g. x = x + 1) or conditional operations (if x == 1 then y = y + x), continuing after skipping some records could cause arbitrary state corruption. In the latter case, the system must be designed (including its operational practices) to ensure the invariant that replicas only get records they understand. These kinds of invariants make the system less resilient, but are needed to avoid state divergence.
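The operation-log case can be illustrated with a toy replica in Python. The record format and the `add`, `add_if_equal`, and `double` operations are hypothetical, made up for this sketch: the replica understands the first two but not `double`, and the example shows what happens if it skips the record it doesn’t understand.

```python
class UnknownRecord(Exception):
    """Raised when the replica sees an operation it does not understand."""


def apply_record(state: dict, record: dict) -> None:
    op = record["op"]
    if op == "add":
        # x = x + amount
        state[record["key"]] = state.get(record["key"], 0) + record["amount"]
    elif op == "add_if_equal":
        # if cond_key == cond_value then key = key + cond_key
        cond = state.get(record["cond_key"], 0)
        if cond == record["cond_value"]:
            state[record["key"]] = state.get(record["key"], 0) + cond
    else:
        raise UnknownRecord(op)


def replay(log: list, skip_unknown: bool = False) -> dict:
    state = {}
    for record in log:
        try:
            apply_record(state, record)
        except UnknownRecord:
            if not skip_unknown:
                raise  # stop rather than risk diverging from other replicas
    return state


log = [
    {"op": "add", "key": "x", "amount": 1},
    {"op": "double", "key": "x"},  # written by a newer version of the software
    {"op": "add_if_equal", "cond_key": "x", "cond_value": 1, "key": "y"},
]
```

A node that understood `double` would end with `x == 2` and no `y` at all, while `replay(log, skip_unknown=True)` on this replica ends with `x == 1` and `y == 1`. Skipping one record changed the outcome of a later conditional, which is why the default here is to stop.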
Marc Brooker The opinions on this site are my own. They do not necessarily represent those of my employer. marcbrooker@gmail.com
There are three unifying principles behind my answers here.
This work is licensed under a Creative Commons Attribution 4.0 International License.


