Designing for failure: Patterns

Reusable building blocks to help design reliable systems in the presence of failures.

API design

These patterns describe the API as seen by clients, rather than internal details.

Idempotency key: Identify identical requests
Reject non-identical retries: Detect changes in request content between retries
Callback: Inform clients about the results of asynchronous operations
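
As a rough illustration of the first two patterns, the sketch below keeps an in-memory map from idempotency key to a fingerprint of the request and the response that was returned. The names `PaymentApi`, `create_charge` and `IdempotencyConflict` are hypothetical and exist only for this example; a real service would persist this state rather than hold it in memory.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when an idempotency key is reused with different request content."""


class PaymentApi:
    """Toy in-memory API (hypothetical); a real service would persist this state."""

    def __init__(self):
        self._seen = {}    # idempotency key -> (request fingerprint, response)
        self.charges = []  # side effects actually performed

    @staticmethod
    def _fingerprint(body):
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def create_charge(self, idempotency_key, body):
        fingerprint = self._fingerprint(body)
        if idempotency_key in self._seen:
            stored_fingerprint, stored_response = self._seen[idempotency_key]
            if stored_fingerprint != fingerprint:
                # Same key, different content: reject the non-identical retry.
                raise IdempotencyConflict(idempotency_key)
            # Identical retry: return the original response, no new side effect.
            return stored_response
        self.charges.append(body)
        response = {"charge_id": len(self.charges), "amount": body["amount"]}
        self._seen[idempotency_key] = (fingerprint, response)
        return response
```

Calling `create_charge("key-1", {"amount": 100})` twice performs one charge and returns the same response both times, while reusing `"key-1"` with a different amount raises `IdempotencyConflict`.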

Writing to a single system

Patterns for writing to a single system. Most patterns assume this system is an ACID database. This is the simplest topology and the easiest to work with. It’s worth designing systems like this where possible, to avoid the complexity of maintaining consistency between multiple systems.

ACID transaction: Perform multiple writes, such that either all of them or none of them succeed
Atomic read-then-write: Concurrently write data based on current state
Idempotency key (external): Send a request to an external system at-least-once with only a single side effect
Change record: Record that a change has been made so it doesn’t happen again
Response record: Return the same response for every retry
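
A minimal sketch of an ACID transaction combined with an atomic read-then-write, using Python's standard `sqlite3` module. The `accounts` table and the `transfer` function are assumptions made up for this example. The balance check and the debit happen in a single `UPDATE`, so no other writer can change the balance between the read and the write, and the surrounding transaction ensures that the debit and the credit either both happen or neither does.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move funds atomically; return False (changing nothing) if src lacks funds.

    Assumes a hypothetical table: accounts(id TEXT PRIMARY KEY, balance INTEGER).
    """
    with conn:  # commits on normal exit, rolls back if an exception escapes
        # Atomic read-then-write: the check and the debit are one statement.
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            return False  # insufficient funds or unknown source; nothing was changed
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            # Raising here rolls back the debit above as well.
            raise LookupError(f"unknown account: {dst}")
    return True
```

If the credit fails, the exception leaves the `with conn:` block and the debit is rolled back, so the database never shows money that has left one account without arriving in the other.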

Writing to multiple systems

When writing to a single ACID database, we get atomicity and consistency built in. Things get more complicated when writing to multiple systems where we don’t have these guarantees: we might not be able to perform all writes atomically, and so can end up in an inconsistent state.

Transactional outbox: Transactionally write a description of work to be performed asynchronously
Saga: Perform a series of transactions with backwards recovery
Distributed transaction: Write to multiple systems transactionally
Resumable operation: Allow operations to continue from where the previous attempt failed
Recovery point: Record current progress to allow recovery with minimal rework
Reliable retries: Reliably keep retrying until success
At-most-once guard: Write to a system at most once
Idempotency key lock: Protect against concurrent retries
Store-then-reference: Prevent dangling references
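
To make the transactional outbox concrete, here is a small sketch using `sqlite3`. The `orders` and `outbox` tables, and the `place_order` and `relay_outbox` functions, are hypothetical names chosen for this example. The business write and the description of the follow-up work are committed together, and a separate relay step delivers the work later. In this sketch delivery is at-least-once, so the receiving system is assumed to deduplicate.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, item):
    """Write the order and its outbox entry in one transaction."""
    with conn:  # both inserts commit together, or neither does
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        payload = json.dumps({"event": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return order_id


def relay_outbox(conn, publish):
    """Deliver unsent outbox entries in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        # If publish raises, the row stays unsent and is picked up by the next run.
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

A crash between `publish` and the `UPDATE` means the same entry is published again on the next run, which is why the downstream system needs an idempotency key or similar protection.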

Background processes

Sometimes inconsistency is unavoidable, whether by design, or simply because of a buggy implementation. Background processes can identify these inconsistencies and handle them in various ways.

Completer: Complete unfinished operations, even if clients give up retrying
Garbage collection: Find and delete unused data
Reconciliation: Detect and resolve inconsistencies
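
As one possible shape for a reconciliation job, the sketch below compares two key-value views of the same data and reports the differences. It assumes one system is treated as the source of truth, which is a simplification; the function names and the dictionary representation are hypothetical. Entries that exist only in the replica are treated as unused data and deleted, which is a simple form of garbage collection.

```python
def reconcile(source, replica):
    """Report how replica differs from source (assumed to be the source of truth)."""
    missing = sorted(k for k in source if k not in replica)
    stale = sorted(k for k in source if k in replica and replica[k] != source[k])
    orphaned = sorted(k for k in replica if k not in source)
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


def repair(source, replica):
    """Resolve the reported inconsistencies in place and return the report."""
    report = reconcile(source, replica)
    for key in report["missing"] + report["stale"]:
        replica[key] = source[key]   # copy the authoritative value
    for key in report["orphaned"]:
        del replica[key]             # delete data that nothing refers to
    return report
```

Running `repair` a second time on the same inputs returns an empty report, so the job can safely be scheduled repeatedly.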

Other

Other patterns for handling failure or edge cases.

Handling out of order messages: Reliably process dependent messages in any order
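
One way to process dependent messages in any order is to park a message until the message it depends on has been applied. The sketch below is only an illustration of that idea, not necessarily the approach the pattern itself describes; `OrderedProcessor` and its message ids are hypothetical, and a real system would persist the parked messages.

```python
class OrderedProcessor:
    """Applies each message once, and only after its dependency has been applied."""

    def __init__(self):
        self.applied = []    # message ids, in the order they were applied
        self._done = set()   # ids already applied, used to ignore duplicates
        self._waiting = {}   # dependency id -> ids parked until it is applied

    def receive(self, msg_id, depends_on=None):
        if msg_id in self._done:
            return  # duplicate delivery, nothing to do
        if depends_on is not None and depends_on not in self._done:
            # Dependency not applied yet: park the message instead of failing.
            self._waiting.setdefault(depends_on, []).append(msg_id)
            return
        self._apply(msg_id)

    def _apply(self, msg_id):
        ready = [msg_id]
        while ready:
            current = ready.pop()
            if current in self._done:
                continue  # was parked more than once
            self._done.add(current)
            self.applied.append(current)
            # Release anything that was waiting on the message just applied.
            ready.extend(self._waiting.pop(current, []))
```

If an "update" message that depends on "create" arrives first, it is parked; when "create" arrives, both are applied in dependency order.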

Antipatterns

Some patterns should be avoided. They may seem to offer benefits, but either fail to deliver them or have other serious drawbacks.

I/O inside transaction: Wrap a transaction around non-database I/O
Reject duplicate requests: Return an error when a duplicate request is detected

Comparisons

When consistency is important, you will generally need to choose (at least) one of the patterns in the table below.

Pattern	Number of systems	Synchronicity	Atomicity	Consistency	Complexity
ACID transaction	One	Sync	Atomic	Strong	Simple
Transactional outbox	Many	Async	Non-atomic¹	Eventual	Moderate
Reliable retries	Many	Async	Non-atomic¹	Eventual	Moderate
Completer	Many	Success: sync; error: async²	Non-atomic¹	Eventual	Moderate
Distributed transaction	Many	Success: sync; error: async²	Atomic	Eventual	Complex
Saga	Many	Async	Atomic	Eventual	Complex

More patterns

¹ Non-atomic because some writes might fail with no guarantee that successful writes will be rolled back.
² Attempts to do all work synchronously, but will continue asynchronously in the case of failure.

Thom's Blog
