Friday, June 25, 2010

The Case for Locally Unique IDs (LUIDs)


It should be no surprise to regular readers of this blog that I am in love with UUIDs. As I have said before, they are an inexhaustible source of identifiers that are meaningless (not a pejorative term!) and whose generation can be distributed/federated without danger of duplication. As a result, they are an extremely powerful means of providing a uniform and federated identity scheme.

As a SOA-indoctrinated IT practitioner, I am loathe to expose domain-specific entity identifiers to the larger world because such leakage leads to tight coupling and brittle integration. Yet identifiers of "resources" must often be exposed. How do we do this? Even "best practice" guidelines fail to adequately admonish designers against exposing domain-specific identifiers. [e.g., Subbu Allamaraju's otherwise excellent RESTful Web Services Cookbook talks about "opaque" URIs in recipe 4.2 but ends up recommending the use of internal entity identifiers in recipe 3.10 :-(.]

My point is simple. If the 'employee' table in my application's database looks like this


+------+--------------+---------------+-------------+
| id | first_name | last_name | dob |
+------+--------------+---------------+-------------+
| 1122 | John | Doe | 12-Jan-1960 |
+------+--------------+---------------+-------------+
| 3476 | Jane | Doe | 08-Sep-1964 |
+------+--------------+---------------+-------------+
| 6529 | Joe | Bloggs | 15-Jun-1970 |
+------+--------------+---------------+-------------+

I do not want to be exposing resources that look like these

/employees/1122
/employees/3476
/employees/6529

I don't want to expose my application's local primary keys to the entire world. They may be "meaningless" but they're still coupled to my domain's internal data structures. I need something else.

My standard solution so far has been the magnificent UUID. I add a candidate key column to my table, like so

+------+--------------+---------------+-------------+--------------------------------------+
| id | first_name | last_name | dob | UUID |
+------+--------------+---------------+-------------+--------------------------------------+
| 1122 | John | Doe | 12-Jan-1960 | 4885c205-8248-4e5b-9c45-4d042e7cc992 |
+------+--------------+---------------+-------------+--------------------------------------+
| 3476 | Jane | Doe | 08-Sep-1964 | cdbf87dd-93cb-4c53-9c5d-718c596b0a00 |
+------+--------------+---------------+-------------+--------------------------------------+
| 6529 | Joe | Bloggs | 15-Jun-1970 | 73feb1bf-e687-4d58-9750-5bf98ca7b9fa |
+------+--------------+---------------+-------------+--------------------------------------+

or I maintain a separate mapping table, like so

+------+--------------------------------------+
| id | UUID |
+------+--------------------------------------+
| 1122 | 4885c205-8248-4e5b-9c45-4d042e7cc992 |
+------+--------------------------------------+
| 3476 | cdbf87dd-93cb-4c53-9c5d-718c596b0a00 |
+------+--------------------------------------+
| 6529 | 73feb1bf-e687-4d58-9750-5bf98ca7b9fa |
+------+--------------------------------------+

and I expose my resources like so

/employees/4885c205-8248-4e5b-9c45-4d042e7cc992
/employees/cdbf87dd-93cb-4c53-9c5d-718c596b0a00
/employees/73feb1bf-e687-4d58-9750-5bf98ca7b9fa

I've still got unique identifiers, they're guaranteed not to conflict with anything else in time and space, and more importantly, my domain-specific identifiers remain hidden. I can now even change my entire domain model, including identifiers, and still preserve my external contracts. That's SOA.

But the sad fact of the matter is that many legacy systems and packaged software do not readily support UUID datatypes or even char(36) columns for various reasons. I have recently heard of a far-sighted software vendor that has provided for a "Public ID" field in their database tables for this precise reason, i.e., to allow an externally visible identifier to be specified for their entities. But alas, the column is defined to be varchar(20), much too small to hold a UUID.

It occurred to me that there is nothing sacrosanct about a 128-bit UUID (expressed as a 36-character string). It's just that the larger a random number gets, the more remote the probability of conflict with another such random number. 128 bits is a nice, safe length. But smaller lengths also have the same property, only with a lower degree of confidence.

The constraints of vendor packages like the one I described above led me to postulate the concept of the LUID (Locally Unique ID). This is just a string that is smaller than 32 hex digits (a UUID has 32 hex digits and 4 hyphens in-between). I call this a Locally Unique ID because the smaller it gets, the lower the confidence with which we can claim it to be universally unique. But we may still be able to rely on its uniqueness within a local system. If I'm only holding details of a few thousand employees (or even a few million customers) in my database, an LUID may still be expected to provide unique identifiers with a reasonable degree of confidence.

That vendor package definitely cannot hold a UUID such as "2607881a-fec1-4e5d-a7fc-f87527c93e2d" in its "Public ID" field, but a 20-character substring such as "4e5da7fcf87527c93e2d" is definitely possible.

Accordingly, I've written a Java class called LUID with static methods

String randomLUID( int _length ) and
String getLUID( String _uuidString, int _length )

The first generates a random hex string of the desired length (without hyphens). The second chops a regular UUID down to a hex string of the required length, again without hyphens.

You can download the class from here. Just running the compiled class with "java LUID" will result in some test output which should illustrate how it works. Feel free to incorporate it into your own projects, but be warned that there is no warranty ;-).

Of course, there is a limit to how small an LUID can become before it loses its utility, but I'm not going to draw an arbitrary line in the sand over this. The class above represents mechanism, not policy. Application designers need to think about what makes sense in their context. An LUID of the appropriate length can enable them to implement SOA Best Practice by decoupling externally visible resource identifiers from internal entity identifiers (another example of the difference between Pat Helland's Data on the Outside and Data on the Inside) when a standard UUID cannot be used.

Bottomline: If you can use a standard UUID, do so. If you can't, consider using an LUID of the kind I've described. But always hide the specifics of your application domain (which include entity identifiers) when exposing interfaces to the outside. That's non-negotiable if you want to be SOA-compliant.

Update 27/06/2010:
I should perhaps make it clear what my proposal is really about, because judging from a reader comment, I think I may have created the impression that all I want is for entity identifiers to be "meaningless" in order to be opaque. That's actually not what I mean.

To be blunt, I believe that any entity/resource that is to be uniquely identified from the outside needs _two_ identifiers (i.e., two candidate keys). One of them is the "natural" primary key of the entity within the domain application. The other is a new and independent identifier that is required to support the exposure of this entity through a service interface (whether SOAP or REST). There should be no way to derive this new key from the old one. The two keys should be independently unique, and the only way to derive one from the other should be through column-to-column mapping, either within the same table or through a separate mapping table as I showed above.

To repeat what I wrote in the comments section in reply to this reader:

There was a content management system that generated (meaningless) IDs for all documents stored in it, and returned the document ID to a client application that requested storage, as part of a URI. At one stage, it became necessary to move some of the documents (grouped by a certain logical category) to another instance of the CMS, and all the document IDs obviously changed when reloaded onto the other instance. The client application unfortunately had references to the old IDs. Even if we had managed to switch the host name through some smart content-based routing (there was enough metadata to know which set of documents was being referred to), the actual document ids were all mixed up.

If we had instead maintained two sets of IDs and _mapped_ the automatically generated internal ID of each document to a special externally-visible ID and returned the latter to the client, we could have simply changed the mapping table when moving documents to the new CMS instance and the clients would have been insulated from the change. As it turned out, the operations team had to sweat a lot to update references on the calling system's databases also.
I hope it's clear now.

1. Entity only seen within the domain => single primary key is sufficient
2. Entity visible outside the domain => two candidate keys are required, as well as a mapping (not an automated translation) between the two.
3. UUID feasible for the new candidate key => use a UUID
4. UUID not possible for some reason => use a Locally Unique ID or LUID of appropriate length (code included)

6 comments:

Integral ):( Reporting said...

Ganesh:

I share the same awe about GUIDs. That being said, the web transforms local UIDs into GUIDs by design.

If employes/1234 is unique to you by definition (and construction) http://www.doodle.com/employes/1234 is a GUID. This is just a different algorithm, but it words just as well.

prasadgc said...

JJ:

I think you miss my point. It's not about translating locally unique IDs into globally unique ones. Your example is perfectly valid, of course.

My point here is about needing to hide natural local identifiers by using special "candidate keys", a very important aspect of decoupling the domain from the service interface, in my opinion. If "1234" in your example is the key column in the application's database, I don't want to expose that in my public service interface (whether it be SOAP or REST). I want a special candidate key created for that purpose.

And when UUIDs cannot be used as the new candidate key for any reason, I propose LUIDs. They're another meaningless key meant to be an externally-visible identifier, but they don't have the same length constraint of 36 chars or 128 bits.

Regards,
Ganesh

Integral ):( Reporting said...

Ganesh,

you are not exposing anything. Both are opaque to an outside organization (and a piece software). Only a human makes a a difference between the two.

JJJ-

prasadgc said...

JJ,

I think I know what you're saying, which is that internal identifiers are also meaningless and therefore opaque, but I have seen burnt fingers with this type of exposure, which is why I'm proposing the "two identifiers per exposed resource" model.

There was a content management system that generated (meaningless) IDs for all documents stored in it, and returned the document ID to a client application that requested storage, as part of a URI. At one stage, it became necessary to move some of the documents (grouped by a certain logical category) to another instance of the CMS, and all the document IDs obviously changed when reloaded onto the other instance. The client application unfortunately had references to the old IDs. Even if we had managed to switch the host name through some smart content-based routing (there was enough metadata to know which set of documents was being referred to), the actual document ids were all mixed up.

If we had instead maintained two sets of IDs and _mapped_ the automatically generated internal ID of each document to a special externally-visible ID and returned the latter to the client, we could have simply changed the mapping table when moving documents to the new CMS instance and the clients would have been insulated from the change. As it turned out, the operations team had to sweat a lot to update references on the calling system's databases also.

Regards,
Ganesh

Dave Schinkel said...

UUIDs are verbose as hell. Why not use an auto increment integer...I see no reason to use verbose UUIDS.

prasadgc said...

@Dave,

Autoincremented serial numbers require a single generating authority - an unnecessary dependency in a distributed system where some local autonomy may be required. That's the classic argument for UUIDs - uniqueness without the need for centralised ID generation.

However, UUIDs are verbose, as you point out. That's the point of this blog post. If the number of entities is relatively small, we don't need a 36 character UUID. A subset of it will probably be sufficient.