Primary keys

Introduction

Primary keys are a constant problem.

  1. Should they be natural or artificial?
  2. If artificial, what kind exactly?
  3. Where should primaries be generated (server or client)?

People tend to prefer simple answers like "always use surrogate PKs" or "always use integers for PKs", especially if they were personally bitten by a particular problem. Unfortunately, there is no simple answer.

PK requirements

Let's first see what StackOverflow says on the subject.

  1. Primary keys should be as small as necessary. Prefer a numeric type because numeric types are stored in a much more compact format than character formats. This is because most primary keys will be foreign keys in another table as well as used in multiple indexes. The smaller your key, the smaller the index, the fewer pages in the cache you will use.
  2. Primary keys should never change. Updating a primary key should always be out of the question. This is because it is most likely to be used in multiple indexes and used as a foreign key. Updating a single primary key could cause a ripple effect of changes.
  3. Do NOT use "your problem primary key" as your logic model primary key. For example, passport number, social security number, or employee contract number, as these "primary keys" can change in real-world situations.

On surrogate vs natural key, I refer to the rules above. If the natural key is small and will never change it can be used as a primary key. If the natural key is large or likely to change I use surrogate keys. If there is no primary key I still make a surrogate key because experience shows you will always add tables to your schema and wish you'd put a primary key in place.

Answered by @Logicalmind

Not a bad place to start. Point 3 here is actually an extension of point 2, because it basically describes how things that you imagine to be constant may change.

I wouldn't say that "primary keys never change" because there are always exceptions. Things may get screwed up to the point where you'll have to dive into your DB console or write a script to change primaries manually. This will mean wrestling with foreign key constraints and being very careful not to destroy your data, but there may be no other choice.

The advice

Prefer a numeric type because numeric types are stored in a much more compact format than character formats

is very one-sided. The story behind keys is much more complicated.

From the usability point of view (clients, db analysts) primaries should be:

  1. readable and meaningful
  2. as short as possible

From the data architecture point of view (program) primaries should be:

  1. unique
  2. random (possible to generate anywhere, exposing nothing)

From the maintenance point of view (DB) primaries should be:

  1. short and numeric
  2. sequential

Mutually exclusive requirements detected. Being "unique" contradicts being "as short as possible". Being "sequential" is the opposite of being "random". Depending on whom you ask, you'll get different answers.

People tend to be most concerned with their own part of the job; that's why DevOps folks will assure you that auto-incremented integers are "obviously" the best solution. They will say "Performance should be thought through upfront" and "Big indexes kill performance", simply ignoring the additional code someone else will have to write and support.

Programmers will recommend the "obvious" UUID because it's random, does not expose row counts, and is fashionable. "Disk space is cheap," they say.

Managers may also invade this party with some unexpected wishes.

PK investigations

Let's explore this further. Numeric types have several weaknesses. They leak your business data to the outside world. Your competitors should never know how many orders your e-store has processed in total or handles per day.

But when you expose a link like http://my-cool-estore.com/orders/35 you're going naked.

I've met opinions that you shouldn't use PKs in URLs because it's "insecure". This seems like nonsense to me because the whole point of a PK is to represent your domain model. Everywhere. Including other systems. The complexity of your architecture depends greatly on this.

If you provide access to your model through another field (or fields), you need to ensure that it's unique, does not change over time, and satisfies almost every other criterion you've already applied to the PK.

This spoils the whole picture in practice: you have a non-PK that should behave like one, which requires an additional index, making your tables bigger; your dataflow is split into id-based and non-id-based strategies, and so on. And all this to satisfy some crippled vision of security.

Let's start with the AUTOINC approach. Numeric keys have problems. They are stored as numbers but used as strings (you never add or multiply PKs, right?). So you'll have to coerce between types in some places (URLs, maybe the database, etc.). This is tiresome and error-prone. Numbers also have a natural overflow limit, pushing you to choose bigger ranges upfront. Some platforms have a surprisingly small maximum. For example, ECMAScript's Number.MAX_SAFE_INTEGER is only 9007199254740991, which may be a deal breaker for big data.
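
To make the coercion issue concrete, here is a tiny sketch (the /orders/:id route and the id value are made up for illustration): an id arrives as a string from a URL or JSON body, gets coerced to a Number, and silently loses precision above Number.MAX_SAFE_INTEGER.

// Hypothetical id taken from a route like /orders/:id – ids always travel as strings
let idFromUrl = "9007199254740993"

// Coerce it to a number, as a numeric PK column suggests
let idAsNumber = Number(idFromUrl)

// Above Number.MAX_SAFE_INTEGER (9007199254740991) precision is silently lost
console.log(String(idAsNumber) === idFromUrl) // false, the value no longer round-trips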

When it comes to distributed systems, AUTOINC keys fall short because you can't generate them in parallel. This also matters because you may want to generate ids on the client. With custom implementations it's easy to find yourself relying on dubious community projects because the DB authors simply "forgot" that part of the deal.

So we need to come up with a truly unique string identifier which does not expose your business data. Why not just take something really random, like UUIDs, as a standard solution?

Some new databases like RethinkDB did exactly that and chose standard UUIDs for their default surrogate keys. Others, like MongoDB, came up with their own approximations. UUID seems to be the best choice because basically every language has libraries to generate UUIDs. Being cross-platform and standard matters. Why bother with something else?

Because implementation matters. UUID v4 produces long, completely random values, so consecutive inserts scatter across the index instead of appending to it. This means your tables (and their indexes) will become big and slow.
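
A small illustration of that, using Node's built-in crypto.randomUUID(): values generated one after another land at essentially random positions in the key order, so a B-tree index takes writes all over the place instead of appending at the end.

let {randomUUID} = require("crypto")

// Three ids generated one after another...
let insertionOrder = [randomUUID(), randomUUID(), randomUUID()]

// ...and the order the index will actually keep them in
let indexOrder = [...insertionOrder].sort()

// Almost certainly false: index order has nothing to do with insertion order,
// so every insert may land on a different page, hence the fragmentation
console.log(indexOrder.join() === insertionOrder.join())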

"Never use UUID" is not the answer as well because if you're going to index that field nevertheless, and access by that index most of the time, this is a changing factor and you may win nothing by adding surrogate key. A lot depends of your DB of choice and it's storage strategy.

What about natural vs surrogate primaries? Surrogate keys are obviously better when you generate a lot of models. Primaries have to be generated automatically because you generally can't delegate PK selection to occasional visitors, and you don't want to bother them with technical details.

People are prone to thought patterns. In a CMS, it may be a good idea to index your "pages" or "documents" by their local URLs:

== ID ==   | == CONTENT ==
/about-us  | ...
/contacts  | ...

Most CMSes just add another field named "alias" or "slug" to bind the URL to the data. They argue that you may want to change the URL. In the general case, though, a URL change is complicated: you need to add redirection rules for search spiders at the very least, and those rules are decoupled from your database, making things messy.

Such "slug" field duplicates PK's purpose and require indexing so you may think you save space and performance by adding surrogate PK while in reality you don't.

The number of such "pages" on a typical website won't come close even to 10K. Even if you write an article per day, that's only about 3.7k articles in 10 years...

So why are CMSes designed as if they were something different? Because people always try to simplify things. They've heard the "always-use-surrogate-keys" mantra and follow it.

Slugs are hacks over the data architecture, but in the end it's always a question of resources, usage patterns and audience. If you're able to drop into the DB console for occasional hacky actions, you can remove that permanent hacky code from your repo, and vice versa.

The bottom line is that PK choice and generation strategies are much more complex than they seem at first sight. Moreover, they turn out to be among the primary scaling concerns. People invent things like Twitter Snowflake for a reason.
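
To give a taste of that, here is a minimal sketch of a Snowflake-style id (not Twitter's actual implementation; the bit layout and the epoch are my own assumptions): a time-ordered 64-bit value composed of a timestamp, a worker id and a per-millisecond sequence, which many machines can generate in parallel without coordination.

// Assumed layout: 41 bits of milliseconds since a custom epoch, 10 bits of worker id,
// 12 bits of sequence. Not production code: sequence overflow (more than 4096 ids
// per millisecond per worker) is not handled here.
const EPOCH = 1577836800000n // 2020-01-01, an arbitrary custom epoch

let lastMs = 0n
let sequence = 0n

function nextId(workerId) {
  let now = BigInt(Date.now())
  sequence = now === lastMs ? (sequence + 1n) & 0xFFFn : 0n
  lastMs = now
  return ((now - EPOCH) << 22n) | (BigInt(workerId) << 12n) | sequence
}

console.log(nextId(1).toString()) // ids sort by creation time and stay unique per worker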

Natural vs Artificial keys

Artificial Numeric

+ fastest read / write raw performance
+ fastest sorting
+ fastest machine maintenance operations (replication, deduplication etc.)
+ minimum fragmentation
- lead to business data leaks

Artificial Unique String (Random)

* something between "Artificial Numeric" and "Natural" depending on length
- cause fragmentation

Artificial Unique String (Sequentially Random)

* like "Artificial Unique String (Random)" but with less fragmentation
- harder to implement

Natural

+ minimum number of joins / queries
+ minimum db size in some cases
+ easiest maintenance in some cases

Natural Composite

* like "Natural" but more cumbersome
+ the best choice for M-to-N tables

A word about "Natural Composite". When people define an ID for an M-to-N relational table (which is required by many frameworks and libraries...), they implicitly allow more than one identical M-to-N relation. For example, it becomes possible to associate a tag with an image twice even though that conveys no meaning:

id (PK) | image_id | tag_id
      1 | 1        | 1
      2 | 1        | 2
      3 | 1        | 1 -- ^_^

unless an additional unique constraint is put on the image_id + tag_id pair. But it would be much better to make image_id + tag_id a composite primary key and a) get rid of the additional meaningless column, b) get that constraint for free, c) have any additional columns depend directly on the PK rather than on a proxy ID that carries no meaning at all.

image_id (PK) | tag_id (PK)
       1      | 1
       1      | 2

An M-to-N table gains meaning only through joins. If you will never access some data by a PK – don't add that PK. If your library insists – reconsider your library choice. The mantra "every table must have an ID column" comes solely from crappy object-oriented ORMs which choose to "not support" composite keys for the sake of "simplicity".

Read this and this and let's finish with it.

Frontend vs backend for id generation

Let's put enterprise stuff aside for now. Imagine we have a reasonably small table and want a UUID PK. Where should the ids be generated?

Many people don't even raise this question because they're bound by mainstream paradigms, where AUTOINC keys used to be "the only" choice. The modern web dev landscape is still poisoned with "Active Record" and ORM-centered frameworks which push users to "be predictable" and "keep the defaults".

When you choose Django or something similar, the most interesting choices are made for you. And you take them as "natural" or "obvious" when they're only a slice of the possibilities.

You need to start typing to see more flaws behind this. I mean typing with types, not with a keyboard.

Assume we generate primaries on the backend. What type do your models have? Let's imagine something simple:

let Tc = require("tcomb")

let Robot = Tc.struct({
  id: Tc.String,
  about: Tc.String,
  create_date: Tc.Date,
})

// works fine when the id is already known, e.g. assigned by the backend
let robot1 = Robot({
  id: "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  about: "I'm an awesome robot",
  create_date: new Date(),
})

But if you create your models on the frontend and push them to the API (as you should), this no longer works. You create your models before you've got an id: you're about to send the model to the server, but you can't even construct the object.

One option in this particular case is to send raw untyped data, but that will spoil your architecture. Why should "create" and "edit" actions be so different?

You can also describe both types explicitly:

let Tc = require("tcomb")

let AlmostRobot = Tc.struct({
  about: Tc.String,
  create_date: Tc.Date,
})

let Robot = AlmostRobot.extend({
  id: Tc.String, // or Uid
})

let robotToCreate = AlmostRobot({
  about: "I'm an awesome robot",
  create_date: new Date(),
})

Two types are required for every model you allow to be created on the frontend. That's a lot of code to write and support, seemingly out of thin air...

Also keep in mind that models without ids are "broken" from many points of view. You can't store them in a global dict keyed by PK, for example.

One way or another, the conclusion that frontend generation of primaries is the better choice (from the purist point of view) suggests itself. On the backend, both POST /robots/ and PUT /robot/:id endpoints should be allowed and supported.
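
As a sketch of that purist approach (reusing the hypothetical Robot model and the PUT /robot/:id endpoint from above): the frontend generates a UUID itself, builds a fully typed Robot right away, and the request becomes idempotent, since repeating it can't create a second row.

let {randomUUID} = require("crypto") // in a browser, crypto.randomUUID() is built in
let Tc = require("tcomb")

let Robot = Tc.struct({
  id: Tc.String,
  about: Tc.String,
  create_date: Tc.Date,
})

// The id is known before the first request, so a single Robot type is enough
let robot = Robot({
  id: randomUUID(),
  about: "I'm an awesome robot",
  create_date: new Date(),
})

// PUT to the id-addressed endpoint: retries cannot create duplicates
fetch(`/robot/${robot.id}`, {
  method: "PUT",
  headers: {"Content-Type": "application/json"},
  body: JSON.stringify(robot),
})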

So we can outline the whole picture now. I'll keep my recommendations as a draft because I believe there is still much more to say and evaluate.

Draft recommendations

Very small data that rarely changes

Such tables tend to be needed everywhere. Read performance is much more critical than write performance. Choose natural, meaningful PKs whenever possible to reduce the number of joins or subqueries required.

Small data, predictable requirements

Choose a natural PK whenever possible. Data access and analysis will be easier. You won't duplicate things.

A phone number may be a very good PK for a phones table because it's unique and never changes. It's not too long and not too sparse (digits only).

Keep privacy concerns in mind. Even if you forbid changing emails, an email is not a good PK for a users table because it exposes private data.

Note that the phone and email cases are different. If you're going social, you will expose URLs like /users/john.doe@mail.com, and that is a data leak. Phone URLs will likely be accessible to staff only.

If, however, you are going to publish URLs like /phones/555-555-5555, you'd better choose a surrogate PK instead.

If you're not sure how your data will change and how the requirements will evolve – choose a surrogate PK.

Again, some people will argue that "nothing can be predicted" and advise to "always use an identity PK" because of that. For me personally, it's easier to deal with the rare cases where I was wrong than to puzzle everything out right from the beginning.

Unpredictable requirements

Choose a time-proven surrogate PK. Stick with the PK type your DB is best tuned for. Do some investigation upfront – you may find a lot of surprising information. Generate your PKs on the frontend if possible.

Big data

Choose whatever gives the required performance and scalability. Big data solutions tend to be messy and pragmatic rather than pure and idealistic.

Author: @ivankleshnin