Wednesday, June 30, 2010

The Case for Watermarked UUIDs

UUIDs are my latest toy

They fill my little world with joy
Forgive me for waxing lyrical. The more I play with UUIDs, the more wonderful uses I find for them.

Take one-time tokens used for idempotence, for instance.

Joe Gregorio's nugget on how to use UUIDs with hashes to prevent spoofing got me thinking. What if the hash could be hidden in the UUID itself? UUIDs are so roomy and spacious, they can tuck a lot of things inside without losing their core property of universal uniqueness.

But first, a summary of Joe's piece for those too impatient to go over and read it.

Idempotence is (or should be!) a basic concept that all IT folk are familiar with. Performing an idempotent operation more than once has exactly the same effect as doing it once. The way this is commonly implemented is through the use of one-time tokens. [If you do a 'View Source' on your bank's transaction confirmation page (the one that asks "Are you sure?"), you may see a hidden form variable with a large random value. That's the one-time token the bank uses to ensure that you don't end up posting a transaction twice even if you hit the submit button twice. It's a wonderful way to ensure a measure of reliability over an unreliable Internet.]

Joe Gregorio suggests the use of UUIDs as one-time tokens for RESTful idempotent services, but also raises a potential problem with the use of raw UUIDs. Anyone can spoof a token by just generating a UUID. For a server to be able to recognise that a UUID is a genuine one-time token that it itself had handed out earlier, it would normally be expected to store the token somewhere so it can verify it when it's presented later on. But such a naive design would open the server up to a "resource exhaustion attack". A malicious user (or bot) can swamp the server with millions of requests for one-time tokens, and the server's database will rapidly fill up with useless UUIDs that are never going to be used.

To address this, Joe suggests a hash. If the generated UUID is combined with a secret string that only the server knows, and this combination is hashed, then the combination of UUID and hash will let the server validate its genuineness, because no one else can generate a valid hash for a random UUID without knowing the server's secret string. In Joe's model, the one-time token is not the raw UUID, but a combination of the UUID and a hash. With this design, the server doesn't have to store a one-time token when it's handed out, only when it's actually used to perform a transaction. The number of such UUIDs can be controlled, because spoofed UUIDs can be filtered out through a mere computational check.

I think this solution is elegant and ingenious, but I believe it can be made even more "clean".

A UUID looks like this:

e5b0eb4c-aeb9-411c-bf3f-cdd207096d3e

It consists of 5 groups of hex characters separated by hyphens, and the regular expression for it is

[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

What I want to do is replace the last set of 12 hex characters with another one representing a hash (or at least part of a hash). Then, while the UUID will still look like a UUID (and indeed, will be a valid UUID), it is in effect "watermarked" by the server and can be verified by computation alone.

What do we lose?

Mathematically, we lose a great deal of uniqueness. We're now dealing with just 20 hex characters instead of 32 (It's not 24, because the 4 hyphens don't count). 20 characters are still a lot, and we actually get something back through the 12-character hash, because this is calculated not just on the 20-character UUID prefix but on the combination of the prefix and the server's secret string. So it's an independent 12-character hex string, and while it may be a more sparse range than its length may suggest, it's still something. So I don't believe we lose too much from a uniqueness perspective. UUIDs are so huge you can trim them and still not encounter conflicts.

Is there a danger that some random UUID out there may accidentally be computed as a valid watermarked UUID because its last 12 characters miraculously match the hash? Well, the probability of this is 1 in 16 raised to the power 12, which is about 1 in 3 quadrillion. I'd take my chances.

Architecturally, it would seem that we have introduced meaning into what should have been a meaningless identifier, and that would then open the door to implicit dependencies (tight coupling) and consequent brittleness. However, on closer inspection, there is no way an external system can seek to build any dependency on the structure of a watermarked UUID, because without a knowledge of the server's secret string, the hash cannot be externally calculated. The UUID remains necessarily opaque to all external parties. The implicit dependency of one part of the UUID on another would seem to be a limitation too, but this is by design! The "limitation" serves to exclude spoofed UUIDs.

And so, I believe there is no real downside to the use of watermarked UUIDs. On the contrary, they retain the visual elegance of plain UUIDs and furthermore simplify the design of idempotent services by encapsulating the entire token-validation function within the service with no leakage through to the interface.

I've written a couple of classes in Java that should help developers get started (no warranties, of course). The base class is called TokenWatermarker, and it performs the basic hashing and string concatenating logic on any token string. It can watermark LUIDs, for example. It also performs verification of previously watermarked tokens, of course. Then there's the UUIDWatermarker class that extends TokenWatermarker and provides the same capability for UUIDs.

Running the compiled classes using "java TokenWatermarker" or "java UUIDWatermarker" will print out a sample output that will show you how these classes work.

Enjoy!

3 comments:

toubsen said...
This comment has been removed by the author.
toubsen said...

What I'd see as a problem with this approach is, that whenever the secret "salt" token gets leaked in any way, you won't be able anymore to exchange it, as all clients would still reference to / rely on UUIDs hashed using the previous token. I'd rather use a natural key of the entity as a salt for your additional hash. This way it will not be guessable for a yet unknown entity, but be unique and final for a certain entity.

prasadgc said...

ekmek,

You're right in the general case. In the specific case of one-time tokens, where there's no real entity, a hash using a random secret will have to do. The lifetime of a token could be fairly short depending on the circumstances, plus we could reset it from time to time when there's a natural break in the service.

Regards,
Ganesh