Thom's Blog

How to look smart with async messaging

When reviewing a design for an asynchronous system, there are some simple questions you can ask.

The problem with decorrelated jitter

Decorrelated jitter has a major flaw: clamping. Retry intervals can get repeatedly clamped to the maximum allowed duration. This can significantly reduce the amount of jitter applied.
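
To make the clamping problem concrete, here is a minimal sketch. It assumes the commonly cited formulation of decorrelated jitter, where each interval is drawn uniformly between a base and three times the previous interval, then clamped to a cap. The function name, the parameter values and the fixed seed are illustrative assumptions, not taken from the post.

```python
import random


def decorrelated_jitter(base, cap, attempts, rng):
    """Return retry intervals using decorrelated jitter, clamped to `cap`."""
    intervals = []
    sleep = base
    for _ in range(attempts):
        # Draw from [base, 3 * previous interval], then clamp to the cap.
        sleep = min(cap, rng.uniform(base, sleep * 3))
        intervals.append(sleep)
    return intervals


rng = random.Random(42)  # fixed seed so the run is repeatable
intervals = decorrelated_jitter(base=0.1, cap=10.0, attempts=1000, rng=rng)

# Count how many intervals ended up exactly at the cap (no jitter at all).
clamped = sum(1 for s in intervals if s == 10.0)
print(f"{clamped} of {len(intervals)} intervals were clamped to the cap")
```

Once an interval reaches the cap, the next draw comes from a range that is mostly above the cap, so it is likely to be clamped again. Those retries all get the same interval, which is the loss of jitter described above.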

Writing tips: style and structure

Writing is hard. Reading can be hard. Poor writing is harder to read, and common mistakes can be distracting (for many).

Here are my top tips for avoiding many common stylistic and structural mistakes I see.

How to cause an incident with a read-only user in PostgreSQL

It’s easy to have a false sense of security when accessing a database with a read-only user. I’d like to talk about how locks work in PostgreSQL, and how this can lead to problems when using read-only users.

Names matter: Root cause

“Root cause” is a misleading term for the concept it represents. At least for computer scientists who consider a “root” to be singular. For other people who know what real trees actually look like, it’s a perfectly good term.

What takes one second?

I had an interesting debugging challenge recently, investigating strange latencies in several services. One symptom I observed had outgoing requests taking one second longer than usual. Why one second?

Designing alerts for SLOs

Getting alerts right can be hard. It’s not uncommon to see alerts which are too noisy, paging on-call engineers for small numbers of errors, where either the error rate was very low or the duration of the error-producing event was very short. This can cause alert fatigue, and result in real incidents being ignored. On the other hand, many alerts are not sensitive enough and errors can occur at high rates without detection.

In this post I’ll talk through how I approach writing alerts which find a better balance.

Automatic test retries

One day, some sleazy individual might come up to you and whisper in your ear:

“Psst, I got something real nice for you. Check out this new test runner. It’ll retry your tests for you if they fail, and if they pass after a few attempts it’ll just say they succeeded and no one will have to know. No more flaky tests.”

Don’t buy it. This person is trying to sell you lies and deception.


A common cause of incidents I see is lack of pagination. Or, more precisely, APIs returning an unbounded number of items. Really it’s the lack of a limit which is the problem, which I think is an important distinction. When returning multiple items, pagination is optional but limits are arguably not.
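
As a rough illustration of the distinction, the sketch below enforces a limit on a list operation whether or not the caller asks for one. The names `list_items`, `DEFAULT_LIMIT` and `MAX_LIMIT`, and their values, are hypothetical and chosen only for this example.

```python
DEFAULT_LIMIT = 50   # hypothetical default when the caller gives no limit
MAX_LIMIT = 200      # hypothetical hard upper bound on any response


def list_items(items, limit=None, offset=0):
    """Return at most MAX_LIMIT items, however many the caller requests."""
    if limit is None:
        limit = DEFAULT_LIMIT
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    # Clamp the requested limit so the response size is always bounded.
    limit = min(limit, MAX_LIMIT)
    return items[offset:offset + limit]


all_items = list(range(1000))
default_page = list_items(all_items)               # no limit requested
greedy_page = list_items(all_items, limit=10_000)  # limit above the maximum
```

The `offset` parameter is what makes this paginated. Dropping it would still leave the response bounded, which is the part that matters for avoiding incidents.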

Designing for failure: Introduction

Systems fail. Processes crash unexpectedly, network partitions happen, operations time out. How much failure is acceptable depends on context, but it’s generally important to be aware of what the implications are.

Retries upon retries

Retries are used to increase availability in the presence of errors at the cost of increased latency. The concept seems simple at a high level, but there is a fair amount of complexity hidden inside it. How effective any particular approach is will depend on context, including the pattern of incoming requests and the pattern of failure causing the errors.

Auto-scaling: positive feedback loops

Consider a scenario where we have two services, A and B. A is consuming messages from a queue and sending requests to B. The message queue is backing up. There is a growing number of pending messages which Service A hasn’t received yet.

Git rebase --onto

I generally prefer to keep my git history as a straight line. And my branches (when I have to use them) based on the HEAD of main. I pull main and rebase my branch onto it fairly often to keep up to date with the latest changes.

Recently I’ve been in the unfortunate position where it made sense to use a branch off a branch. This can be a pain to keep up to date with the latest changes on main.

Why can't we have exactly-once message processing?

One of the big problems in distributed systems is reliably sending, processing and acknowledging messages. As such, message delivery systems such as queues often come with some guarantees about delivery. You might see terms such as “at-least-once” or “at-most-once”.
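
The difference between the two guarantees can be sketched with a toy in-memory queue. Everything here (`InMemoryQueue`, the consumer functions, the handler that crashes once on message "b") is a hypothetical illustration, not any real queue's API. The only thing that changes between the two consumers is whether the acknowledgement happens before or after processing.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: a message is redelivered until it is acknowledged."""

    def __init__(self, messages):
        self._pending = deque(messages)

    def receive(self):
        return self._pending[0] if self._pending else None

    def ack(self, message):
        self._pending.remove(message)


def consume_at_most_once(queue, handler):
    while (message := queue.receive()) is not None:
        queue.ack(message)  # ack first: the message is never redelivered
        try:
            handler(message)
        except RuntimeError:
            pass  # crash after the ack: the message is lost


def consume_at_least_once(queue, handler):
    while (message := queue.receive()) is not None:
        try:
            handler(message)
        except RuntimeError:
            continue  # crash before the ack: the message is redelivered
        queue.ack(message)


def make_handler(processed, crash_point):
    """Handler that crashes once on "b", before or after its side effect."""
    crashed = False

    def handler(message):
        nonlocal crashed
        if message == "b" and not crashed:
            crashed = True
            if crash_point == "after":
                processed.append(message)  # side effect happened, ack did not
            raise RuntimeError("simulated crash")
        processed.append(message)

    return handler


lost = []
consume_at_most_once(InMemoryQueue(["a", "b", "c"]), make_handler(lost, "before"))

duplicated = []
consume_at_least_once(InMemoryQueue(["a", "b", "c"]), make_handler(duplicated, "after"))

print(lost)        # ['a', 'c']: "b" was acknowledged but never processed
print(duplicated)  # ['a', 'b', 'b', 'c']: "b" was processed twice
```

In both cases the failure sits between processing and acknowledging, and the consumer has to pick which side of that gap to err on.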

Choosing appropriate data structures

How do we choose which data structures to use in our code? In some instances it’s fairly obvious. When the amount of data we’re working with is the primary constraint, we probably need to choose the most efficient structure for what we’re trying to achieve.

How many chairs do I need?

Decomposition in microservice architectures

This article is an adaptation of some advice I’ve written over the years about designing microservice architectures. There isn’t much novel advice here; it’s mainly just existing ideas rehashed in my own words. It’s largely based on real-world problems I’ve encountered or discussed, but I don’t claim to be an expert. This is an attempt to get some ideas out of my head and written down to more easily share and discuss them. The intended audience was full-stack Node.js engineers with varying amounts of experience.

Validating configuration with io-ts

Something I wrote at Candide about how we ensured our services didn’t get deployed with invalid configuration.

Transaction isolation in PostgreSQL

I often forget how different isolation levels affect queries in PostgreSQL, so I wrote a quick overview to remind myself. It won’t include all the details but is hopefully accurate enough in what it does say!

TCP state transitions are a lie

Everyone seems to be participating in some weird lie about the LISTEN > SYN-RECEIVED state transition in TCP. I feel very left out and I wish someone would tell me what is going on.

Correlation IDs in Node.js

Much has already been written about the need for correlation IDs in microservice architectures. If this is a new concept for you, I encourage you to read Building Microservices by Sam Newman. Or if you want a quick intro, try this blog post.

Multi-Environment Setups in Snap CI

I’ve been a big fan of Travis for a while now. It runs the builds for most of my open source projects. However, recently I’ve been finding it a bit sluggish, and something fishy seems to have happened to my automated NPM deployments. So, I figured it was time to give some other CI services a go.

Handling Events with React-Mainloop

I recently created a React.js component wrapper around this main loop library. You can find it here: react-mainloop. It can be used to control a React component using a game loop. It uses an update() function to generate new props, and takes control of when rendering occurs. It’s especially useful for animating games, or other interactive canvas-based apps.

Running Mocha in __tests__ directories

I don’t know about you, but I quite like the Jest convention of putting tests in __tests__ directories. It keeps the tests local to the modules they’re testing, and visible in the src directory, rather than hidden away in test. I know, it’s the little things.

Beautiful APIs in CoffeeScript

Let’s say we want to make a maths library in CoffeeScript (e.g. a Matrix library). We could easily write an API for addition that looks like:


Welcome to my brand new website, courtesy of GitHub Pages and Poole.