How to look smart with async messaging

07 May 2024

distributed systems
reliability

When reviewing a design for an asynchronous system, there are some simple questions you can ask.

The problem with decorrelated jitter

24 Apr 2024

jitter
algorithms
reliability

Decorrelated jitter has a major flaw: clamping. Retry intervals can get repeatedly clamped to the maximum allowed duration. This can significantly reduce the amount of jitter applied.
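
As a rough illustration of the clamping effect, here is a minimal sketch. It assumes the commonly cited formulation `sleep = min(cap, uniform(base, previous * 3))`; the function names and the `base`/`cap` values are hypothetical and not taken from the post.

```python
import random


def decorrelated_jitter(base, cap, attempts, rng):
    """Return a list of retry delays using the assumed decorrelated jitter rule."""
    sleep = base
    delays = []
    for _ in range(attempts):
        # The next delay is drawn from [base, 3 * previous], then clamped to cap.
        sleep = min(cap, rng.uniform(base, sleep * 3))
        delays.append(sleep)
    return delays


def clamp_probability(base, cap, previous):
    """Probability that the next delay is clamped to cap, given the previous delay."""
    upper = previous * 3
    if upper <= cap:
        return 0.0
    # Fraction of the interval [base, upper] that lies at or above cap.
    return (upper - cap) / (upper - base)


def clamped_fraction(base, cap, attempts, trials, seed=0):
    """Simulate many retry sequences and return the share of delays equal to cap."""
    rng = random.Random(seed)
    clamped = 0
    total = 0
    for _ in range(trials):
        for delay in decorrelated_jitter(base, cap, attempts, rng):
            clamped += delay == cap
            total += 1
    return clamped / total
```

Under this formulation, once a delay has reached `cap`, the next draw comes from `[base, 3 * cap]`, so the chance of being clamped again is `2 * cap / (3 * cap - base)`. That approaches two thirds when `base` is small relative to `cap`, and a clamped delay carries no jitter at all.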

Writing tips: style and structure

03 Feb 2024

writing
communication

Writing is hard. Reading can be hard. Poor writing is harder to read, and common mistakes can be distracting (for many).

Here are my top tips for avoiding many common stylistic and structural mistakes I see.

How to cause an incident with a read-only user in PostgreSQL

22 Sep 2023

reliability
databases
postgresql
incidents

It’s easy to have a false sense of security when accessing a database with a read-only user. I’d like to talk about how locks work in PostgreSQL, and how this can lead to problems when using read-only users.

Names matter: Root cause

29 Aug 2023

naming
incidents
postmortems

“Root cause” is a misleading term for the concept it represents, at least for computer scientists who consider a “root” to be singular. For other people who know what real trees actually look like, it’s a perfectly good term.

What takes one second?

12 Jul 2023

tcp
debugging

I had an interesting debugging challenge recently, investigating strange latencies in several services. One symptom was outgoing requests taking one second longer than usual. Why one second?

Designing alerts for SLOs

20 Jun 2023

alerting
observability
reliability

Getting alerts right can be hard. It’s not uncommon to see alerts which are too noisy, paging on-call engineers for small numbers of errors, where either the error rate was very low or the duration of the error-producing event was very short. This can cause alert fatigue, and result in real incidents being ignored. On the other hand, many alerts are not sensitive enough and errors can occur at high rates without detection.

In this post I’ll talk through how I approach writing alerts which find a better balance.

Automatic test retries

06 May 2023

testing
reliability
antipatterns

One day, some sleazy individual might come up to you and whisper in your ear:

Psst, I got something real nice for you. Check out this new test runner. It’ll retry your tests for you if they fail, and if they pass after a few attempts it’ll just say they succeeded and no one will have to know. No more flaky tests.

Don’t buy it. This person is trying to sell you ~~drugs~~ lies and deception.

Pagination

28 Apr 2023

reliability
databases
incidents

A common cause of incidents I see is lack of pagination. Or, more precisely, APIs returning an unbounded number of items. Really it’s the lack of a limit which is the problem, which I think is an important distinction. When returning multiple items, pagination is optional but limits are arguably not.
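
To make the distinction concrete, here is a minimal sketch of a list operation that always applies a limit, with pagination layered on top as an optional extra. The names `list_items`, `DEFAULT_LIMIT`, `MAX_LIMIT` and the offset-style cursor are assumptions for illustration, not something the post prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

DEFAULT_LIMIT = 50   # hypothetical default when the caller gives no limit
MAX_LIMIT = 200      # hypothetical hard ceiling, applied to every request


@dataclass
class Page:
    items: List[str]
    next_cursor: Optional[int]  # None when there is nothing left to fetch


def list_items(items: Sequence[str], limit: Optional[int] = None, cursor: int = 0) -> Page:
    """Return at most MAX_LIMIT items, starting from an offset-style cursor."""
    if limit is None:
        limit = DEFAULT_LIMIT
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if cursor < 0:
        raise ValueError("cursor must not be negative")
    # The response is bounded no matter what the caller asked for.
    limit = min(limit, MAX_LIMIT)
    window = list(items[cursor:cursor + limit])
    end = cursor + len(window)
    next_cursor = end if end < len(items) else None
    return Page(items=window, next_cursor=next_cursor)
```

A caller who ignores `next_cursor` simply gets a bounded first page, which is the property that matters for avoiding incidents. In a real service the same bound would usually be pushed down into the database query rather than applied in memory.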

Designing for failure: Introduction

06 Apr 2023

reliability
microservices

Systems fail. Processes crash unexpectedly, network partitions happen, operations time out. How much failure is acceptable depends on context, but it’s generally important to be aware of what the implications are.

Retries upon retries

12 Sep 2022

microservices
reliability

Retries are used to increase availability in the presence of errors at the cost of increased latency. The concept seems simple at a high level, but there is a fair amount of complexity hidden inside it. How effective any particular approach is will depend on context, including the pattern of incoming requests and the pattern of failure causing the errors.
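
A small worked example of that trade-off, under the strong assumption that each attempt fails independently with the same probability. The function names are hypothetical, and real failures are often correlated (during an outage, for example), in which case retries help far less than this suggests.

```python
def success_probability(p_fail: float, max_attempts: int) -> float:
    """Chance that at least one of max_attempts independent attempts succeeds."""
    return 1 - p_fail ** max_attempts


def expected_attempts(p_fail: float, max_attempts: int) -> float:
    """Average number of attempts made when retrying until success or exhaustion."""
    # Attempt k + 1 only happens if the first k attempts all failed.
    return sum(p_fail ** k for k in range(max_attempts))
```

With a 10% failure rate per attempt and up to three attempts, the success probability rises from 0.9 to 0.999, while the average number of attempts per request rises from 1 to 1.11. The extra attempts are where the added latency and load come from.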

Auto-scaling: positive feedback loops

25 Jul 2022

microservices
reliability
queues

Consider a scenario where we have two services, A and B. A is consuming messages from a queue and sending requests to B. The message queue is backing up. There is a growing number of pending messages which Service A hasn’t received yet.

Git rebase --onto

25 Jun 2022

tools
tips
git

I generally prefer to keep my git history as a straight line, with my branches (when I have to use them) based on the HEAD of main. I pull main and rebase my branch onto it fairly often to keep up to date with the latest changes.

Recently I’ve been in the unfortunate position where it made sense to use a branch off a branch. This can be a pain to keep up to date with the latest changes on main.

Why can't we have exactly-once message processing?

24 May 2022

reliability
queues

One of the big problems in distributed systems is reliably sending, processing and acknowledging messages. As such, message delivery systems such as queues often come with some guarantees about delivery. You might see terms such as “at-least-once” or “at-most-once”.

Choosing appropriate data structures

10 May 2022

types

How do we choose which data structures to use in our code? In some instances it’s fairly obvious. When the amount of data we’re working with is the primary constraint, we probably need to choose the most efficient structure for what we’re trying to achieve.

How many chairs do I need?

05 May 2022

microservices
reliability

Decomposition in microservice architectures

12 Mar 2022

microservices

This article is an adaptation of some advice I’ve written over the years about designing microservice architectures. There isn’t much novel advice here; it’s mainly just existing ideas rehashed in my own words. It’s largely based on real-world problems I’ve encountered or discussed, but I don’t claim to be an expert. This is an attempt to get some ideas out of my head and written down to more easily share and discuss them. The intended audience was full-stack Node.js engineers with varying amounts of experience.

Validating configuration with io-ts

01 Feb 2022

microservices
reliability
types

Something I wrote at Candide about how we ensured our services didn’t get deployed with invalid configuration.

Transaction isolation in PostgreSQL

11 Jan 2022

databases
postgresql

I often forget how different isolation levels affect queries in PostgreSQL, so I wrote a quick overview to remind myself. It won’t include all the details but is hopefully accurate enough in what it does say!

TCP state transitions are a lie

08 Dec 2020

Everyone seems to be participating in some weird lie about the LISTEN > SYN-RECEIVED state transition in TCP. I feel very left out and I wish someone would tell me what is going on.

Correlation IDs in Node.js

21 Jul 2018

microservices
reliability
observability

Much has already been written about the need for correlation IDs in microservice architectures. If this is a new concept for you, I encourage you to read Building Microservices by Sam Newman. Or if you want a quick intro, try this blog post.

Multi-Environment Setups in Snap CI

17 Jun 2015

I’ve been a big fan of Travis for a while now. It runs the builds for most of my open source projects. However, recently I’ve been finding it a bit sluggish, and something fishy seems to have happened to my automated NPM deployments. So, I figured it was time to give some other CI services a go.

Handling Events with React-Mainloop

05 Jun 2015

I recently created a React.js component wrapper around this main loop library. You can find it here: react-mainloop. It can be used to control a React component using a game loop. It uses an update() function to generate new props, and takes control of when rendering occurs. It’s especially useful for animating games, or other interactive canvas-based apps.

Running Mocha in tests directories

29 May 2015

I don’t know about you, but I quite like the Jest convention of putting tests in __tests__ directories. It keeps the tests local to the modules they’re testing, and visible in the src directory, rather than hidden away in test. I know, it’s the little things.

Beautiful APIs in CoffeeScript

23 Aug 2014

Let’s say we want to make a maths library in CoffeeScript (e.g. a Matrix library). We could easily write an API for addition that looks like:

Welcome

02 Aug 2014

Welcome to my brand new website, courtesy of GitHub Pages and Poole.