This month’s PGSQLPhriday event is about UUIDs, thanks to my post calling for a ~~fight~~ debate on the topic.
I will answer a question a developer friend asked me: “What is best when we need a primary key: UUID, CUID, or TSID?”
TL;DR: please don’t
The answer is: “It depends!” Yes, it’s a running joke amongst consultants: you can always answer all customers’ questions with an “it depends”. The reason for that is simple: if the answer was obvious, you wouldn’t need the consultant!
Definitions
First, let’s make sure we understand the different notions here:
UUID: According to Wikipedia, “A universally unique identifier (UUID) is a 128-bit label used for information in computer systems. The term globally unique identifier (GUID) is also used. When generated according to the standard methods, UUIDs are, for practical purposes, unique. Their uniqueness does not depend on a central registration authority or coordination between the parties generating them, unlike most other numbering schemes. While the probability that a UUID will be duplicated is not zero, it is generally considered close enough to zero to be negligible.”
CUID: According to the cuid2 project (the cuid project is deprecated), “Modern web applications have different requirements than applications written in the early days of GUID (globally unique identifiers) and UUIDs (universally unique identifiers). In particular, Cuid2 aims to provide stronger uniqueness guarantees than any existing GUID or UUID implementation and protects against leaking any information about the data being referenced, or the system that generated the id.”
TSID: According to Vlad Mihalcea, “If you plan to store UUID values in a Primary Key column, then you are better off using a TSID (time-sorted unique identifier).”
My Postgres expert opinion
I’m a Postgres expert, so my opinion is only worth what Postgres has to offer. Postgres does not natively offer CUID or TSID, but it does offer UUID. I’m reluctant to let an application create primary keys, because that’s not how it should work: the database is in charge of primary keys, and it should be the one generating them.
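Since Postgres ships UUID support natively, here is a minimal sketch (assuming Postgres 13 or later, where `gen_random_uuid()` is in core; older versions need the pgcrypto extension; the table name is illustrative):

```sql
-- Postgres has a native uuid type; gen_random_uuid() produces
-- random version-4 values (core since Postgres 13, pgcrypto before).
SELECT gen_random_uuid();

-- If you do want UUID keys, at least let the database generate them:
CREATE TABLE example (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY
);
```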
My opinion on primary keys is very simple: if you care about performance, don’t mess around. Model your data, use a “generated always as identity” column, and add a unique constraint on your natural key. Don’t play with other mechanisms that might blow up in your face if you don’t understand the implications in depth. Uniqueness (including uniqueness of primary keys) in Postgres is enforced with B-tree indexes, and those indexes perform best when the indexed data is sequential. With UUID or CUID, you lose that. You can see in this article from my colleague Tomas how much more WAL is generated for UUIDs than for sequential IDs. Of course, should your workload consist of random updates instead of sequential inserts, you won’t get the WAL advantage of a sequential ID. Still, a sequential ID is shorter and lighter to use.
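The advice above can be sketched as a small schema (table and column names are just illustrative):

```sql
-- Surrogate key generated by the database, natural key enforced
-- with a unique constraint.
CREATE TABLE customer (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email text   NOT NULL UNIQUE  -- the natural key
);

-- Ids are assigned sequentially by the database, so the B-tree
-- behind the primary key grows in an ordered, append-friendly way.
INSERT INTO customer (email) VALUES ('alice@example.org');
```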
UUID use cases
Let’s look at the reasons why you might want to use a UUID, CUID, or TSID in the first place:
- you want to generate the IDs in the application and avoid having a central system. With Postgres, this is a moot point: Postgres is a central system, and you can’t avoid it.
- the value is random and can safely be shown in a UI without letting users guess other IDs and access values they shouldn’t. Well, secure your access! When a user guesses an identifier and accesses data they shouldn’t, the problem is not that the identifier can be guessed; the problem is that the application does not check what the user is allowed to do!
- I couldn’t find another reason; please enlighten me if you have one. I would be really happy to be proven wrong.
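On the “secure your access” point above: one Postgres-native way to enforce it is row-level security, so a guessed identifier buys an attacker nothing. A minimal sketch (table, column, and policy names are hypothetical):

```sql
CREATE TABLE document (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    owner text   NOT NULL
);

ALTER TABLE document ENABLE ROW LEVEL SECURITY;

-- Each user only sees their own rows, whatever id they ask for.
CREATE POLICY owner_only ON document
    USING (owner = current_user);
```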
To conclude
So, in my opinion, there is no valid reason to use UUID, CUID, or TSID with Postgres. As for the “what about when we want to scale horizontally?” question: Postgres does not natively scale horizontally (unless you want to build the whole system yourself with Foreign Data Wrappers and partitioned tables). In my experience, for more than 90% of the use cases I’ve seen, you don’t need to scale Postgres: it’s good enough to perform correctly without scaling, as long as you take proper care of it.
Everyone knows the first rule of optimization is “If you’re not an expert, don’t” (and the second one is “If you’re an expert, don’t yet”). I’d say rule 0 of optimization on an RDBMS is “model your data carefully”.