What Now? Handling Errors in Large Systems

Published by

xauto21

11/26/2025

Let’s play a little error handling game. Click the ✅ if you think crashing the process or server is appropriate, and the ❌ if you don’t. Then you’ll see my vote and justification.

Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the working in degraded mode state. On the other hand, if failures can be correlated (including by adversarial user behavior), its best to design the system to reject the cause of the errors and continue.

The bottom line is that error handling in systems isn’t a local property. The right way to handle errors is a global property of the system, and error handling needs to be built into the system from the beginning.

Getting this right is hard, and that’s where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you do the wrong thing you affect less than all your traffic – ideally a small percentage of traffic. Blast radius reduction is humility in the face of complexity.

Can they be handled at a higher layer? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher layer (e.g. by replacing instances or containers as they fail load balancer health checks using AWS Autoscaling), but can’t handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, starting with Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and crashing rather the continuing is appropriate in more cases.

Is it possible to meaningfully continue? This is where you need to understand your business logic. In most cases with configuration, and some cases with data, its possible to continue with the last-known good version. This adds complexity, by introducing the behavior mode of running with that version, but that complexity may be worth the additional resilience. On the other hand, in a database that handles updates via operations (e.g. x = x + 1) or conditional operations (if x == 1 then y = y + x) then continuing after skipping some records could cause arbitrary state corruption. In the latter case, the system must be designed (including its operational practices) to ensure the invariant that replicas only get records they understand. These kinds of invariants make the system less resilient, but are needed to avoid state divergence.

Marc Brooker The opinions on this site are my own. They do not necessarily represent those of my employer. marcbrooker@gmail.com

If you don’t want to play, and just see my answers, click here: Show All Answers.

There are three unifying principles behind my answers here.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Hello,

This is the xdefiance Online Web Shop.

A True Shop for You and Your Higher, Enlightnened Self…

Welcome to the xdefiance website, which is my cozy corner of the internet that is dedicated to all things homemade and found delightful to share with many others online and offline.

You can book with Jeffrey, who is the Founder of the xdefiance store, by following this link found here.

Visit the paid digital downloads products page to see what is all available for immediate purchase & download to your computer or cellphone by clicking this link here.

Find out more by reading the FAQ Page for any questions that you may have surrounding the website and online sop and get answers to common questions. Read the Returns & Exchanges Policy if you need to make a return on a recent order. You can check out the updated Privacy Policy for xdefiance.com here,

If you have any unanswered questions, please do not hesitate to contact a staff member during office business hours:

Monday-Friday 9am-5pm, Saturday 10am-5pm, Sun. Closed

You can reach someone from xdefiance.online directly at 1(419)-318-9089 via phone or text.

If you have a question, send an email to contact@xdefiance.com for a reply & response that will be given usually within 72 hours of receiving your message.

Browse the shop selection of products now!

Reaching Outwards

Join the fun!

Stay updated with our latest tutorials and ideas by joining our newsletter.

Visit Newsletter Sign-up Page

XDEFiANCE'e Quality Internet Shop