Context
When implementing a Resumable operation, it can be desirable to minimise rework, either because the sub-operations are expensive or because repeating them might produce duplicate side effects. In that case, we want to perform each sub-operation at least once while minimising how often any of them happens more than once.
Prerequisites
It is acceptable to perform each sub-operation at least once, and for retries to perform additional side effects after a previous attempt fails.
Example
Purchasing some products on an e-commerce site. The Complete purchase operation might need to save the order details, update stock availability, take payment, and schedule some later work such as getting the products packaged and delivered. The API should do this synchronously and respond with one of: Success, OutOfStock or PaymentFailed.
Problem
How do we allow operations to continue from a known state after a failure?
Solution
Model the operation as a state machine. Write a record to a database after successfully performing each part of the operation. This record should identify the operation (probably using an Idempotency key), record which state it is in, and include any data needed for the next steps.
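As a minimal sketch of such a record, assuming a SQLite table called recovery_points keyed by the idempotency key (the table, column, and function names are illustrative and not part of the pattern itself):

```python
import json
import sqlite3
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecoveryPoint:
    idempotency_key: str  # identifies the operation
    state: str            # e.g. "OrderReceived"
    data: dict = field(default_factory=dict)  # anything the next steps need


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS recovery_points (
            idempotency_key TEXT PRIMARY KEY,
            state           TEXT NOT NULL,
            data            TEXT NOT NULL
        )
        """
    )


def save_recovery_point(conn: sqlite3.Connection, point: RecoveryPoint) -> None:
    # One row per operation, overwritten as the operation advances.
    conn.execute(
        "INSERT OR REPLACE INTO recovery_points (idempotency_key, state, data) "
        "VALUES (?, ?, ?)",
        (point.idempotency_key, point.state, json.dumps(point.data)),
    )


def load_recovery_point(
    conn: sqlite3.Connection, idempotency_key: str
) -> Optional[RecoveryPoint]:
    row = conn.execute(
        "SELECT state, data FROM recovery_points WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is None:
        return None  # first attempt: nothing to resume from
    state, data = row
    return RecoveryPoint(idempotency_key, state, json.loads(data))
```

Storing the data as JSON keeps the sketch short; a real schema might use dedicated columns instead.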
When handling a request, start by fetching the latest recovery point associated with the operation, and continue from that point.
For the example above, we might have three states: OrderReceived, PaymentSuccess and OrderFinished (ignoring error cases, which I realise goes against the narrative of this whole thing), and the steps would be:
- Fetch recovery point for the idempotency key.
  - If it exists, advance to that point in the operation.
- Transaction:
  - Insert recovery point - state: OrderReceived
  - Insert order details
  - Update stock availability
- Take payment
- Update recovery point - state: PaymentSuccess
- Publish NewOrder message
  - In the background, work will be scheduled to email the customer and start the shipping process
- Update recovery point - state: OrderFinished
  - Perhaps a Response record
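The steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions: the Shop class is a hypothetical in-memory stand-in for the database, payment provider, and message broker, and the fail_publish_once flag exists only to simulate a crash.

```python
from dataclasses import dataclass, field


@dataclass
class Shop:
    """In-memory stand-ins for the database, payment provider, and broker."""
    recovery_points: dict = field(default_factory=dict)  # idempotency key -> state
    orders: dict = field(default_factory=dict)
    stock: int = 10
    payments: list = field(default_factory=list)
    messages: list = field(default_factory=list)
    fail_publish_once: bool = False  # simulates a crash while publishing


def complete_purchase(shop: Shop, key: str, quantity: int) -> str:
    # Fetch the recovery point; None means this is the first attempt.
    state = shop.recovery_points.get(key)

    if state is None:
        if shop.stock < quantity:
            return "OutOfStock"
        # In a real database these three writes would share one transaction.
        shop.recovery_points[key] = "OrderReceived"
        shop.orders[key] = quantity
        shop.stock -= quantity
        state = "OrderReceived"

    if state == "OrderReceived":
        shop.payments.append(key)  # take payment
        shop.recovery_points[key] = state = "PaymentSuccess"

    if state == "PaymentSuccess":
        if shop.fail_publish_once:
            shop.fail_publish_once = False
            raise RuntimeError("crashed while publishing NewOrder")
        shop.messages.append(("NewOrder", key))
        shop.recovery_points[key] = state = "OrderFinished"

    return "Success"
```

If the first attempt crashes while publishing, a retry with the same key finds the PaymentSuccess recovery point and skips straight to publishing, so the payment is taken only once. A crash between taking the payment and updating the recovery point would still repeat the payment on retry, which is the at-least-once behaviour the prerequisites accept.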
Steps 3-5 could be consolidated into a single step using a Transactional outbox.
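One way to picture this kind of consolidation, as a sketch only: the recovery point update and a description of the NewOrder message are written in the same database transaction, and a separate relay (not shown) would later publish rows from the outbox. The outbox table, its columns, and the finish_order function are assumptions made for illustration.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE recovery_points ("
        "idempotency_key TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def finish_order(conn: sqlite3.Connection, idempotency_key: str, order_id) -> None:
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the state change and the outbox row are stored together or not at all.
    with conn:
        conn.execute(
            "UPDATE recovery_points SET state = ? WHERE idempotency_key = ?",
            ("OrderFinished", idempotency_key),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("NewOrder", json.dumps({"order_id": order_id})),
        )
```

Because both writes commit atomically, there is no window in which the recovery point says the order is finished but the message was never recorded.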
Relying on the client to retry is a kind of passive recovery. This might leave the system in an inconsistent state if, for example, the process crashes while taking the payment and the client stops retrying. In that case, we might want to consider active recovery using a completer.
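A completer could, for instance, periodically look for recovery points that have not reached their final state for some time and resume them. The sketch below assumes each recovery point carries an updated_at timestamp and that a resume callback re-runs the operation from the stored state; both are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecoveryPoint:
    idempotency_key: str
    state: str
    updated_at: float  # seconds since the epoch


def find_stalled(
    points: List[RecoveryPoint], now: float, max_age_seconds: float
) -> List[RecoveryPoint]:
    # Unfinished operations that have not advanced recently.
    return [
        p for p in points
        if p.state != "OrderFinished" and now - p.updated_at > max_age_seconds
    ]


def run_completer(
    points: List[RecoveryPoint],
    resume: Callable[[RecoveryPoint], None],
    now: float,
    max_age_seconds: float = 300.0,
) -> List[str]:
    # Resume each stalled operation and report which keys were picked up.
    resumed = []
    for point in find_stalled(points, now, max_age_seconds):
        resume(point)
        resumed.append(point.idempotency_key)
    return resumed
```

The age threshold is there so the completer does not race with a request that is still in progress; the right value depends on how long the operation normally takes.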
This pattern has focused on forwards recovery: attempting to successfully complete the operation. An alternative is backwards recovery: attempting to roll back. See saga for more information about backwards recovery.
Also known as
- Checkpoint
- Passive recovery
See also
Related
- At-most-once guard – Write to a system at most once
- Completer – Complete unfinished operations, even if clients give up retrying
- Reconciliation – Detect and resolve inconsistencies
- Response record – Return the same response for every retry
- Resumable operation – Allow operations to continue from where the previous attempt failed
- Saga – Perform a series of transactions with backwards recovery
- Store-then-reference – Prevent dangling references
- Transactional outbox – Transactionally write a description of work to be performed asynchronously