Reliable retries

Context

Operations can fail partway through, leaving them in a partially complete state. Sometimes they might need driving to completion, otherwise the system might end up in an inconsistent state. In many client/server architectures, clients cannot be relied on to reliably retry until completion.

Example

A password reset system might follow these steps when changing the password:

[Sync] Write the new password to the database
[Async] Send an email to the user, informing them that their password has changed

If something fails after step 1 then an email might never be sent unless there is something to guarantee it gets retried until completion.

Problem

How do we ensure that an operation will always be driven to completion, even if it may fail partway through?

Solution

There are various ways to design systems which reliably retry until completion.

A common example would be a messaging system which retries for a given duration or number of attempts, after which it puts the message into a dead letter queue. Messages in the dead letter queue can be manually retried as necessary.

For the password reset system above, if a PasswordUpdated event is published (e.g. using a transactional outbox), then the messaging system could guarantee that the email sending gets retried until success.

Other examples include:

Transactional outbox – a separate process retries a subset of the operation.
Completer – a separate process retries the complete operation.

Completer – Complete unfinished operations, even if clients give up retrying
Recovery point – Record current progress to allow recovery with minimal rework
Resumable operation – Allow operations to continue from where the previous attempt failed
Transactional outbox – Transactionally write a description of work to be performed asynchronously

Context

Example

Problem

Solution

Related