Avoiding Downtime in Postgres with lock_timeout

This post comes after reading a very insightful article by a GoCardless engineer describing an unexpected way they caused an (albeit, very short) downtime in their app using Postgres. This story nicely shows how lock_timeout can be useful to avoid downtimes.

Some obvious footguns

There are some well known DDL operations that can cause downtimes. The post above mentions some, and links to a nice Braintree article with some more. My favorite resource is the strong_migrations Ruby gem, it lists workarounds for a lot of dangerous operations.

Some examples:

But there are always some tricky ones…

The author mentions that they had introduced some new tables which were empty, so they assumed they could easily fiddle with these since they weren’t even used in production at all. They renamed the tables, and then wanted to rename the foreign key constraints because their names had become outdated. The latter would also seem like a simple operation for the reasons mentioned earlier.

However, running the migration caused some alerts to trigger, showing that API requests were timing out, for about 15 seconds.

The fault? Adding a foreign key constraint (via ALTER TABLE) acquires an AccessExclusive lock on both the table with the constraint and the one it references. Wow!

But that’s not all it takes to fail. It takes another long-running query to complement it to cause downtime. If there’s a slow SELECT query already running, the ALTER TABLE command queues up since an AccessExclusive lock competes with all other locks (even SELECT’s AccessShare). And the final touch: all other queries that come afterwards will queue up since they all conflict with the AccessExclusive lock now in the queue.

The post indirectly mentions that they had a long-running analytics query, which should have run on a read replica DB instead.

Use lock_timeout to avoid downtime

Apart from the obvious recommendation of avoiding long analytical queries in the OLTP database, another layer of defence is using lock_timeout.

lock_timeout is a config parameter that determines the amount of time a transaction will wait to acquire a lock before aborting.

If they had set a lock_timeout value of, say, 2 seconds, the ALTER TABLE command would have aborted, and the remaining queued commands would have proceeded normally.

The conclusion is: make sure a lock_timeout is in place for DDL modification statements that are likely to acquire strict locks.

Bonus: how it works with statement_timeout

If statement_timeout is set and is non-zero, it’s pointless to set lock_timeout to a value equal to or greater than statement_timeout, since the statement timeout will always have triggered first.