• Home
  • Articles
  • français
  • Mehdi Valette -

    Integer or UUID primary keys in relational databases?

    Integer or UUID?

    Database primary keys are integers by default, but they can leak information and make merging databases more challenging. To address these issues, some people use random UUIDv4 for primary keys. UUIDv4 come with their own drawbacks, but there's a way to get the best of both worlds!

    TL;DR

    Use integers for primary keys as they are more performant. Add a UUIDv4 in an additional column if necessary. In any case, avoid exposing primary keys outside the database.

    Integers

    Most relational databases default to integers for primary keys, whether it's integer, unsigned integer, big integer, etc.

    erDiagram
    
    Student {
      int student_id
      string full_name
    }
    
    Course {
      int course_id
      string name
    }
    
    Student_Course {
      int student_fk
      int course_fk
    }
    
    Student ||--o{ Student_Course : ""
    Student_Course }o--|| Course : ""
    
    // seen by the client
    {
      "student": 
      {
        "student_id": 1,
        "full_name": "Alice",
        "courses": [
          {
            "course_id": 1,
            "name": "Mathematics"
          },
          {
            "course_id": 2,
            "name": "Physics"
          }
        ]
    }
    

    Advantages

    • They come in ascending order. That helps the database to organize its index, which improves insertion and search performance. When we query one record in the main table, we possibly query tens of records in linked tables. Therefore, optimized primary keys are an important factor in the database's performance.

    • They have a small size. That's especially useful for many-to-many relationships, where a bridge table contains mainly primary keys. If each key is half the size, the overall table is half the size, which not only saves space, but also enables caching more records for faster lookup.

    For example, to get a student with their courses, we need to get just one record from Student, but many records from Student_Course, and many records from Course.

    Drawbacks

    • They leak information. Ascending numbers can expose information that we prefer to keep hidden. For example, if a student has id = 1500 when other students in their class have id > 1680, and we know that each year about 100 students join the school, we deduce that the student with id = 1500 is probably repeating the grade. Disclosing more information than intended is a risk for data protection and security (collecting a lot of benign information can facilitate an exploit).

    • Merging databases is more challenging. When merging two databases together, we must change all the primary keys and all the foreign keys from at least one of the databases. Discovering and rectifying a mistake can prove difficult because nothing identifies the records outside of their initial table.

    • Uniquely identifying records. To keep track of an entity's events, for example for auditing, we need to log the database and the table alongside the primary key. It becomes especially challenging when the database's and table's name change over time, or if two databases are merged, resetting all the primary keys.

    UUIDv4

    To avoid the drawbacks of integers, some developers use UUIDv4 as primary keys. These randomly generated, universally unique identifiers resolve most of the drawbacks of integers, but they also lack their advantages.

    erDiagram
    
    Student {
      uuid student_id
      string full_name
    }
    
    Course {
      uuid course_id
      string name
    }
    
    Student_Course {
      uuid student_fk
      uuid course_fk
    }
    
    Student ||--o{ Student_Course : ""
    Student_Course }o--|| Course : ""
    
    // seen by the client
    {
      "student": 
      {
        "student_id": "0943745f-3618-4b79-8dab-9ee516cae5d3",
        "full_name": "Alice",
        "courses": [
          {
            "course_id": "25a8da31-27e8-439a-8f29-b2eb0d7981d2",
            "name": "Mathematics"
          },
          {
            "course_id": "f0bb78eb-426b-4d18-ac64-ce6ee443de2e" ,
            "name": "Physics"
          }
        ]
    }
    

    Advantages

    • No information leakage. The version 4 of UUID is random, so reading the primary key doesn't disclose any information. Note that this is only true for UUIDv4; other versions of UUID contain information, such as the timestamp of its creation.

    • Merging databases is easier. Merging two databases becomes much easier because the primary key of each record in each table of each database is very likely unique. Duplicates are extremely rare, so even in the unlikely event where duplicates happen, we can certainly handle them manually.

    • Identify each record uniquely. Each primary key identifies a record not only in its own table, but in the world (though collision can very exceptionally happen). That comes in handy for logging, audit, or global search.

    Drawbacks

    • No natural order. The lack of order in UUIDv4 means updating the index is harder, thereby slowing down the database. It isn't a problem in small databases, but the cost increases at scale.

    • Larger than integers. UUID are four times as large as 32-bit integers (INT), and twice as large as 64-bit integers (BIGINT). The difference becomes noticeable at scale in bridge tables.

    Each student and course get their own UUID, giving them a unique identifier in the entire world, and no further information can be inferred from the primary key. However, there's a performance penalty, especially for traversing the Student_Course table.

    Best of both worlds

    I consider that primary keys and unique identifiers serve different purposes. The primary key (I name the column id) is used by the database to organize its records. The unique identifier (I name the column code) is used by external applications to identify the records.

    Important: the primary key is used only by the database, it shouldn't be given to any other party. It cannot leak information if it doesn't leave the database. As far as other applications are concerned, the code is the record's primary key.

    erDiagram
    
    Student {
      int student_id
      uuid student_code
      string full_name
    }
    
    Course {
      int course_id
      uuid course_code
      string name
    }
    
    Student_Course {
      int student_fk
      int course_fk
    }
    
    Student ||--o{ Student_Course : ""
    Student_Course }o--|| Course : ""
    
    // seen by the client
    {
      "student": 
      {
        "student_code": "0943745f-3618-4b79-8dab-9ee516cae5d3",
        "full_name": "Alice",
        "courses": [
          {
            "course_code": "25a8da31-27e8-439a-8f29-b2eb0d7981d2",
            "name": "Mathematics"
          },
          {
            "course_code": "f0bb78eb-426b-4d18-ac64-ce6ee443de2e" ,
            "name": "Physics"
          }
        ]
    }
    

    This method solves all the problems exposed so far:

    • leaking information

      • The result from the client's perspective is similar to having UUID primary keys. They can uniquely identify each record, but no further information can be inferred.
    • database performance

      • Despite increasing the size of the tables Student and Course, the impact on performance is minimal. Every primary and foreign keys are integers, so the most frequently used indexes are optimized.
    • merging databases

      • To merge databases, we use the code columns to distinguish each record. Additionally, we can more easily verify the correctness of the merged database.
    • uniquely identifying each record

      • Each record is identified uniquely in the world, facilitating logging, audit, and global search.

    Conclusion

    I advise the use of integer primary keys. If you need to obfuscate the primary key or to uniquely identify each record, you can easily add a column of type UUID. Whenever possible, avoid exposing integer primary keys to users.

    Criteria Integer Primary Key UUIDv4 Primary Key
    Size 4 bytes (INT) or 8 bytes (BIGINT) 16 bytes
    Index performance Excellent (sequential inserts, minimal fragmentation) Slower (random inserts, more page splits in B-tree indexes)
    Insertion speed Very fast (append-friendly) Slower at scale due to random distribution
    Cache efficiency Better (more rows per page) Lower (fewer rows per page)
    Information leakage Can enable unintended inferences Hide meaningful information
    Global uniqueness Unique only within a table and a database Practically globally unique
    Database merging Complex (ID collisions) Easy (collisions extremely unlikely)

    Notes

    You may be wondering:

    Why UUID instead of random strings?

    UUID is a well supported standard. Many databases and applications handle the generation, indexation, and storage of UUID much more efficiently than random text.

    Why UUID instead of random bytes?

    Even if you put enough effort to match the performance of UUID, the size, serialization and representation of UUID is standardized. If you use random bytes, you reinvent the wheel and lose the interoperability with other systems.

    What about other types of UUID or ULID?

    Other types of UUID and ULID facilitate indexation, but reveal some information. For example, several UUID versions and ULID start with a millisecond timestamp. The key is easier to index because the timestamp always increases, but it reveals the date and time of the record's creation. It wouldn't solve the problem of the student id = 1500 described in this article.