Two PostgreSQL Sequence Misconceptions

With Examples!

Some constructs seem more powerful than the promises they make.

PostgreSQL sequences are like that. Many assume they offer stronger properties than they can deliver.

They trust them to be the grail of SQL ordering, the one-size-fits-all of strict serializability. However, there is a good reason Amazon spent design time on vector clocks in Dynamo; why Google invested significantly into Chubby, then Percolator’s timestamp oracle, then Spanner’s expensive, atomic-clock-based TrueTime; why Twitter built Snowflake; and why so many others built custom timestamp systems.

  1. Strict serializability is hard to achieve, especially in a distributed system, but even in a centralized system with the possibility of failure.
  2. Developers assume the system is strict-serializable, but it usually is not.
  3. When a system provides timestamps, developers will use them as if they were strictly monotonically increasing across the distributed system, but they often are not, which causes subtle bugs.

The problem space

To design your system’s properties right, it is often useful or necessary to determine the order in which events happened. Ideally, you wish for the “wall clock” order (looking at your watch), although simultaneity gets tricky when events occur at a distance: even within the same motherboard, and especially across a datacenter or between cities.

At the very least, you want to reason about causal ordering: when that event happened, did it already see this other event?

A nice property to have, even for a single centralized database, is for each row to receive a monotonically increasing identifier. Most PostgreSQL users rely on the SERIAL type for that – a sequence. Each insertion will call nextval() and store an increasing value.
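
(As a reminder, CREATE TABLE t (id SERIAL) is essentially shorthand for the following; t and t_id_seq are illustrative names, matching what PostgreSQL would generate:)

CREATE SEQUENCE t_id_seq;
CREATE TABLE t (
  id INTEGER NOT NULL DEFAULT nextval('t_id_seq')
);
ALTER SEQUENCE t_id_seq OWNED BY t.id;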

What you implicitly want is to list rows by insertion order. Your mental model is that each insertion happens at a set “wall clock” time: a first insertion happens at T0 and gets identifier 1, the next one happens at T1 and gets number 2, and so on. Therefore, you expect a row with ID N to have causally been inserted after a row with ID M < N.

Operational order is a consistency constraint strongly associated with isolation levels. A PostgreSQL database can handle multiple simultaneous operations.

(Side note: I could be talking about threads and locks, but I will not, because those are just tools to achieve properties. PostgreSQL may switch tools to better meet a given promise (they did so with the serializable level in 2011), but the promise won’t change.)

By default, it promises Read Committed isolation: each statement can witness the effects of all transactions that committed “before” it started (but not those that have not committed yet). Their commits are therefore causally ordered by commit time.

However, nothing else within a transaction has any causal promise with respect to other transactions. The same SELECT can yield different values; simultaneous insertions can land before, after, or interleaved with your own insertion.
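
For instance, here is a sketch of that non-repeatable read at the default level (the table t is the hypothetical one from above):

-- Connection A
BEGIN;
SELECT COUNT(*) FROM t;  -- returns 1

-- Connection B, meanwhile
INSERT INTO t DEFAULT VALUES;  -- autocommits

-- Connection A, same transaction
SELECT COUNT(*) FROM t;  -- returns 2: the same query yields a different value
COMMIT;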

The highest isolation level PostgreSQL offers is Serializable isolation: all transactions are causally ordered, from BEGIN to COMMIT. Of course, transactions still execute in parallel, but the database makes sure that everything a transaction witnesses can be explained by executing all of its statements either after all statements of another transaction, or before all of them. It won’t see a changing state within the execution of the transaction.
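
Rerun the sketch above at that level, and the second SELECT keeps returning the original count; transactions whose reads and writes genuinely conflict are instead aborted with a serialization failure:

-- Connection A
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT COUNT(*) FROM t;  -- returns 1

-- Connection B inserts a row and commits here

-- Connection A, same transaction
SELECT COUNT(*) FROM t;  -- still 1: the witnessed state does not change
COMMIT;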

(By the way, PostgreSQL only achieved serializability in 2011, when they released version 9.1 with support for predicate locks. It is hard.)

Having a causal order does not mean that this order follows real time: one insertion may complete at 9:30am and yet be causally ordered after another that only completes at 10:40am. If you want the additional property that the order is consistent with wall clock time, you want Strict Serializability.

However, PostgreSQL makes no claim of Strict Serializability.

Given all this, sequences probably feel much weaker than you initially thought.

You want them to give a gapless run of numbers, but a sequence can yield values with gaps (1, 2, 4).

You want them to follow causal order (the row with 2 was inserted before the row with 3), but a sequence can yield values out of order (1, 3, 2).

All a sequence promises is to give values that have an order. Not a gapless order, nor a time order.

Let’s demonstrate both.

Gaps

Let’s create a table with a SERIAL identifier. To first show things going right, let’s insert a row.
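
A minimal session (the table name, items, is illustrative):

CREATE TABLE items (id SERIAL);
INSERT INTO items DEFAULT VALUES;
SELECT * FROM items;
 id
----
  1
(1 row)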


Now comes the gap.
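
We insert inside a transaction, then abort it:

BEGIN;
INSERT INTO items DEFAULT VALUES;
ROLLBACK;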


Since we rolled back, nothing happened – or did it?

Let’s now insert another row.
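
Continuing the same session:

INSERT INTO items DEFAULT VALUES;
SELECT * FROM items;
 id
----
  1
  3
(2 rows)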


Oops! Despite the rollback, the sequence increment was not reverted: the aborted insertion consumed value 2. Now, there is a gap.

This is not a PostgreSQL bug per se: given the way sequences are stored, PostgreSQL simply does not keep the information necessary to undo a nextval() without potentially breaking other operations.
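
Indeed, nextval() operates outside of transactional control; you can see it directly:

BEGIN;
SELECT nextval('items_id_seq');  -- say, 4
ROLLBACK;
SELECT nextval('items_id_seq');  -- 5: the rollback did not restore the sequence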

Let’s now break the other assumption.

Order violation

First, a table with a sequence and a timestamp:

CREATE TABLE orders (id SERIAL, created_at TIMESTAMPTZ);

Let’s set up two concurrent connections to the database. Each will have the same instructions. I started the first one yesterday:

-- Connection 1
BEGIN;  -- open a transaction, and leave it running

I launch the second one today:

-- Connection 2
BEGIN;
INSERT INTO orders (created_at) VALUES (NOW());
COMMIT;

Let’s go back to the first one:

-- Connection 1
INSERT INTO orders (created_at) VALUES (NOW());
COMMIT;

Simple enough. But we actually just got the order violation:

SELECT * FROM orders ORDER BY created_at;
 id |          created_at
----+-------------------------------
  2 | 2019-09-04 21:10:38.392352+02
  1 | 2019-09-05 08:19:34.423947+02

The order of the sequence does not follow creation order. The culprit: NOW() returns the transaction start time, so Connection 1’s row is stamped with yesterday evening’s time, even though its id was drawn this morning, after Connection 2 had already taken the previous sequence value.
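
You can verify that NOW() is frozen for the duration of a transaction, unlike clock_timestamp():

BEGIN;
SELECT now(), clock_timestamp();  -- now() reports the transaction start time
SELECT pg_sleep(2);
SELECT now(), clock_timestamp();  -- now() is unchanged; clock_timestamp() moved
COMMIT;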

From then on, developers may write some queries ordering by ID, and some ordering by timestamp, expecting an identical order. That incorrect assumption may break their business logic.

Lest you turn your heart to another false god, that behavior remains the same with serializable transactions.

Are we doomed?


Sure, the systems we use make weak promises. But that is true at every level. The nice thing about the world is that you can combine weak things to make strong things. Pure iron is ductile, and carbon is brittle, but their alloy is steel.

For instance, you can get the best of both worlds, causal order and “wall clock” timestamps, by having a TIMESTAMPTZ field, only inserting rows within serializable transactions, and setting the created_at field to the current time, or just after the latest insertion if that one is more recent:

INSERT INTO orders (created_at)
SELECT GREATEST(NOW(), MAX(created_at) + INTERVAL '1 microsecond') FROM orders;
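
Each insertion then runs inside a serializable transaction; on a serialization failure (SQLSTATE 40001), roll back and retry:

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO orders (created_at)
SELECT GREATEST(NOW(), MAX(created_at) + INTERVAL '1 microsecond') FROM orders;
COMMIT;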

Indeed, PostgreSQL’s TIMESTAMPTZ has microsecond precision. You don’t want conflicts in your created_at (otherwise you could not determine the causal order between the conflicting rows), so when the latest created_at is not older than the current time, you go one microsecond past it.

However, here, concurrent operations are likely to fail, as we acquire a (non-blocking) SIReadLock on the whole table (what the documentation calls a relation lock):

SELECT l.mode, l.relation::regclass, l.page, l.tuple, substring(a.query from 0 for 19)
FROM pg_stat_activity a JOIN pg_locks l ON a.pid = l.pid
WHERE l.relation::regclass::text LIKE 'orders%'
  AND a.datname = current_database()
  AND l.granted
ORDER BY a.query_start;
       mode       | relation | page | tuple |     substring
------------------+----------+------+-------+--------------------
 SIReadLock       | orders   |      |       | INSERT INTO orders
 RowExclusiveLock | orders   |      |       | INSERT INTO orders
 AccessShareLock  | orders   |      |       | INSERT INTO orders

The reason for that is that we perform a slow Seq Scan to compute MAX(created_at) in this trivial example, as the EXPLAIN output shows.

                                  QUERY PLAN
------------------------------------------------------------------------------
 Insert on orders  (cost=38.25..38.28 rows=1 width=8)
   ->  Aggregate  (cost=38.25..38.27 rows=1 width=8)
         ->  Seq Scan on orders orders_1  (cost=0.00..32.60 rows=2260 width=8)

With an index, concurrent operations are much more likely to work:

CREATE INDEX created_at_idx ON orders (created_at);
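
With EXPLAIN (COSTS OFF), the plan should now look roughly like this (a sketch; the exact shape may vary), fetching MAX(created_at) through a backward index-only scan instead of a Seq Scan:

EXPLAIN (COSTS OFF) INSERT INTO orders (created_at)
SELECT GREATEST(NOW(), MAX(created_at) + INTERVAL '1 microsecond') FROM orders;
                          QUERY PLAN
--------------------------------------------------------------
 Insert on orders
   ->  Result
         InitPlan 1 (returns $0)
           ->  Limit
                 ->  Index Only Scan Backward using created_at_idx on orders
                       Index Cond: (created_at IS NOT NULL)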

We then only take an SIReadLock on a single tuple:

       mode       | relation | page | tuple |     substring
------------------+----------+------+-------+--------------------
 SIReadLock       | orders   |    0 |     5 | INSERT INTO orders
 RowExclusiveLock | orders   |      |       | INSERT INTO orders
 AccessShareLock  | orders   |      |       | INSERT INTO orders

However, the tuple in question is the latest row in the table. Any two concurrent insertions will necessarily read the same one: the row with the latest created_at. Therefore, only one of the concurrent insertions will succeed; the others will need to be retried until they do too.

Subset Ordering

In cases where you only need a unique ordering among the subset of rows that share another field, you can create a composite index with that other field:

CREATE TABLE orders (
  id SERIAL,
  account_id UUID DEFAULT gen_random_uuid(),  -- from pgcrypto before PostgreSQL 13
  created_at TIMESTAMPTZ);
CREATE INDEX account_created_at_idx ON orders (account_id, created_at DESC);

Then the query planner goes through the account index:

INSERT INTO orders (account_id, created_at)
SELECT account_id, GREATEST(NOW(), created_at + INTERVAL '1 microsecond')
FROM orders WHERE account_id = '9c99bef6-a05a-48c4-bba3-6080a6ce4f2e'::uuid
ORDER BY created_at DESC LIMIT 1;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Insert on orders  (cost=0.15..3.69 rows=1 width=24)
   ->  Subquery Scan on "*SELECT*"  (cost=0.15..3.69 rows=1 width=24)
         ->  Limit  (cost=0.15..3.68 rows=1 width=32)
               ->  Index Only Scan using account_created_at_idx on orders orders_1  (cost=0.15..28.35 rows=8 width=32)
                     Index Cond: (account_id = '9c99bef6-a05a-48c4-bba3-6080a6ce4f2e'::uuid)

And concurrent insertions on different accounts work:

       mode       | relation | page | tuple |     substring
------------------+----------+------+-------+--------------------
 SIReadLock       | orders   |    0 |     1 | INSERT INTO orders
 RowExclusiveLock | orders   |      |       | INSERT INTO orders
 AccessShareLock  | orders   |      |       | INSERT INTO orders
 SIReadLock       | orders   |    0 |     2 | COMMIT;

(The first three rows are from one unfinished transaction on account 1; the last is from a completed one on account 2.)