Wednesday, July 22, 2009

Modelling Resources from First Principles

I've been providing architectural advice to a group of colleagues who are building a set of services. Without going into too much detail, they need to uniquely identify some entities. Clients of the services use these identifiers as references when they return to make related queries on these entities. They've proposed using UUIDs as the unique identifiers, and while I liked the idea, I thought it was too simplistic. There was more to the requirement than just unique identifiers.

They're actually dealing with two types of entities - widgets (say) and requests for widgets. These are different because a request can pertain to a set of widgets, so it may be necessary to model them distinctly. The service interfaces dealing with requests and widgets may need to be distinct as well.

Mind you, these services are not going to be REST services. But having been exposed to the RESTian way of thinking, I immediately thought of resource representations for the two types of entities. Rather than plain old UUIDs, I thought there should be a degree of structure around them (but not so much detail as to make the scheme brittle and inflexible).

Something like these, in other words:

However, this suggestion proved to be a surprisingly hard sell. The cross-examination was withering.

Why all the extra information?
Why http://?

Why not a simpler scheme like these examples show:



I found I had to retrace my steps and work through my reasoning from first principles. In the process, I learnt a great deal about naming.

My initial response was to point my colleagues to points 7 and 8 of "Common REST Mistakes", where we are admonished not to try and invent our own proprietary object identifiers, and not to try for "protocol independence" (i.e., avoid HTTP URIs). But this wasn't too convincing.

I made a bit of progress by getting agreement on the following:

1. It probably made sense to distinguish between widget identifiers and widget-request identifiers, so some sort of prefix to distinguish between them was necessary. UUIDs alone were probably not enough.
2. It also probably made sense to specify the "domain" within which these resources were being identified, so the "mycompany" string probably belonged somewhere as well.

But then, why not just these:



Frankly, I hated this. My point was that such a format, even though "simple", would have to be explained to anyone looking at it. The structure wasn't immediately obvious. Worse, it was ambiguous and could be extended by later designers in ways that violated the original designers' intent. To this, the counterargument was that knowledge of the format was only required on the server side. To the client, the whole name was just going to be an opaque string: a reference ID.

I wrestled with this objection for a while. Then I proposed a guiding principle: given a choice between two naming conventions, a universally understood one is preferable to one that we make up ourselves, provided it isn't unnecessarily complex.

My research led me to the definition of a URI (Uniform Resource Identifier). What I learnt from this was that in order to name something, we first need to specify a "scheme", which defines how the rest of the name is to be interpreted. The name of the scheme is followed by a colon, and everything after the colon can only be interpreted according to the rules specified by that scheme.

In other words, a standard name (URI) looks like this:

<scheme name>:<some scheme-specific format>

A common example is


It's important to point out that "http" in the string above does not refer to the HTTP protocol! It's the name of a "scheme". What does this mean?

Well, in the URI "file:///home/ganesh", the string "file" is not a protocol, because more than one protocol may be used to get to the file.

Similarly, in a "mailto:" URI, the string "mailto" is not a protocol. SMTP is the actual mail protocol.
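To make the "scheme, colon, scheme-specific part" idea concrete, here is a minimal sketch in Python. The helper name `split_scheme`, the "mailto" address and the "example.org" domain are all made up for illustration; only the "file" example and the widget path come from this post.

```python
def split_scheme(uri: str) -> tuple[str, str]:
    """Split a name at its first colon into (scheme, scheme-specific part)."""
    scheme, sep, rest = uri.partition(":")
    if not sep or not scheme:
        raise ValueError(f"no scheme found in {uri!r}")
    return scheme, rest


examples = [
    "file:///home/ganesh",
    "mailto:someone@example.org",  # hypothetical address
    "http://example.org/widgets/4f138ff2-362f-4e35-8f9e-173290fe86d7",  # hypothetical domain
]

for uri in examples:
    scheme, rest = split_scheme(uri)
    # The scheme only says how to read the rest of the name;
    # it does not say which protocol will be used to fetch anything.
    print(f"scheme={scheme!r} rest={rest!r}")
```

Nothing in this split knows about HTTP, SMTP or file systems, which is the point: the scheme is a naming convention first.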

[For those familiar with XML namespaces, when we declare a namespace with "xmlns", the URI being referred to is not necessarily a web page that one can point a browser at. It just needs to be a unique string.]

So we're not necessarily modelling our resources as web resources. All that the "http" scheme defines is that after the colon (":"), there is a scheme-specific structure that specifies a few things.

There are two slashes, then there's a dot-separated domain name, then a slash, then a "resource path" which is itself slash-separated. So that's what a URI conforming to the "http" scheme looks like:

"http" (the name of the scheme)
":" (the colon separating the name of the scheme from the scheme-specific structure. This is from the basic definition of a URI)
"//" (the "http" scheme just specifies this, OK?)
"" (this is the dot-separated domain name)
"/" (this is the first slash that signifies that the domain name is terminated)
"widgets/4f138ff2-362f-4e35-8f9e-173290fe86d7" (all of this is the "resource path", and internal slashes are possible, as we can see)

So now going back to our guiding principle (using a well-understood format is preferable to rolling our own) as well as the two points on which there was agreement (i.e., that we may need to qualify the resource's UUID with the type of resource as well as the organisational domain), it looks like the "http" scheme of the URI naming standard fits the bill. This is a well-understood way to include both a domain and a resource path to provide some structure around an already unique ID.
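As a rough sketch of what this buys us, the snippet below mints and parses such names using only Python's standard library. The domain "example.org", the resource type name "requests" and the two function names are assumptions made for illustration; they are not the actual names used in my colleagues' services.

```python
import uuid
from urllib.parse import urlsplit

DOMAIN = "example.org"                    # hypothetical organisational domain
RESOURCE_TYPES = {"widgets", "requests"}  # "requests" is an assumed type name


def make_resource_uri(resource_type, resource_id=None):
    """Wrap a UUID in an "http"-scheme name: http://<domain>/<type>/<uuid>."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    if resource_id is None:
        resource_id = uuid.uuid4()  # fresh random UUID
    return f"http://{DOMAIN}/{resource_type}/{resource_id}"


def parse_resource_uri(uri):
    """Recover (resource type, UUID) from a name, rejecting anything else."""
    parts = urlsplit(uri)
    if parts.scheme != "http" or parts.netloc != DOMAIN:
        raise ValueError(f"not a name in our domain: {uri!r}")
    # urlsplit keeps the leading slash on the path, so strip it first.
    segments = parts.path.strip("/").split("/")
    if len(segments) != 2 or segments[0] not in RESOURCE_TYPES:
        raise ValueError(f"unexpected resource path: {parts.path!r}")
    return segments[0], uuid.UUID(segments[1])  # raises ValueError if malformed


widget_id = uuid.UUID("4f138ff2-362f-4e35-8f9e-173290fe86d7")
name = make_resource_uri("widgets", widget_id)
print(name)
print(parse_resource_uri(name))
```

The parsing side needs no home-grown grammar: a stock URI parser already knows where the domain ends and the resource path begins. Clients can still treat the whole string as an opaque reference ID.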

I concede that the "www" prefix of the domain could confuse. All we really need to identify the domain is "".

And so, a unique, standards-based and minimal way to name resources in this business domain would be