Monday, August 13, 2012

After NoSQL, It's Time For NoLDAP


I've been in discussions with the IETF SCIM working group (formerly Simple Cloud Identity Management and now called System for Cross-Domain Identity Management) about the three suggestions for improvement that I wrote about in my InfoQ article.

One of the things I've learnt from this discussion is that, while I sense a bit of an NIH syndrome in some of the pushback I have received, there are also some genuine technical challenges involved in implementing my ideas. In my article, I stated that the fundamental problem with manipulating resources was multi-valued attributes that lacked a unique key per value. Over the past two weeks of discussion, I'm convinced that I got that absolutely right. Multi-valued attributes are the problem. And the biggest culprit in keeping them entrenched with little hope of change is that legacy beast known as an LDAP directory.

You can see for yourself how clumsy it is in LDAP to update multi-valued attributes. When I googled "multi-valued attributes" and "LDAP" together, I found this prescient piece written by Mark Wahl in 2005. I was amazed to see how accurately he had anticipated the problem:

Unfortunately, some of the emerging protocols which also intend to represent and transfer personal identity information have perhaps taken a step backwards by not even considering these issues [problems with multi-valued attributes], perhaps sweeping them under the rug in the guise of simplicity, XMLification, or "fix in the next version", which only postpone finding interoperable solutions to allowing applications to express the identity entries they want to express.

SCIM is one of these "emerging protocols", and sure enough, the arguments that I have been facing have explicitly cited "simplicity" as the reason not to accept my solution proposal, exactly as Mark Wahl had predicted. However, the real reason for resistance is not about "simple" but about "easy".

It is not easy for SCIM to move to a simple model, and that is the problem. It is more expedient to maintain complexity.

The irony is that the SCIM spec is anything but simple, thanks to multi-valued attributes. This is how the spec proposes to deal with multi-valued attributes:

Multi-valued attributes: An attribute value in the PATCH request body is added to the value collection if the value does not exist and merged if a matching value is present. Values are matched by comparing the value Sub-Attribute from the PATCH request body to the value Sub-Attribute of the Resource. Attributes that do not have a value Sub-Attribute; e.g., addresses, or do not have unique value Sub-Attributes cannot be matched and must instead be deleted then added. Specific values can be removed from a Resource by adding an "operation" Sub-Attribute with the value "delete" to the attribute in the PATCH request body. As with adding/updating attribute value collections, the value to delete is determined by comparing the value Sub-Attribute from the PATCH request body to the value Sub-Attribute of the Resource. Attributes that do not have a value Sub-Attribute or that have a non-unique value Sub-Attribute are matched by comparing all Sub-Attribute values from the PATCH request body to the Sub-Attribute values of the Resource. A delete operation is ignored if the attribute's name is in the meta.attributes list. If the requested value to delete does not match a unique value on the Resource the server MAY return a HTTP 400 error.

Sounds like quite the dog's breakfast, doesn't it?

To be fair, that's the implementation, which is usually hidden from sight. Let's see if the visible API looks any better.

Here's how to delete a single member from a Group, as per the current spec:

   PATCH /Groups/acbf3ae7-8463-4692-b4fd-9b4da3f908ce
   Host: example.com
   Accept: application/json
   Authorization: Bearer h480djs93hd8
   ETag: W/"a330bc54f0671c9"

   {
     "schemas": ["urn:scim:schemas:core:1.0"],
     "members": [
       {
         "display": "Babs Jensen",
         "value": "2819c223-7f76-453a-919d-413861904646",
         "operation": "delete"
       }
     ]
   }

Here's how to delete ALL members from a group according to the current spec:

   PATCH /Groups/acbf3ae7-8463-4692-b4fd-9b4da3f908ce
   Host: example.com
   Accept: application/json
   Authorization: Bearer h480djs93hd8
   ETag: W/"a330bc54f0671c9"

   {
     "schemas": ["urn:scim:schemas:core:1.0"],
     "meta": {
       "attributes": [
         "members"
       ]
     }
   }

For two functions that do very similar things, the syntax is wildly different. It's hardly what one would call simple. But these represent the easy way out from an implementation perspective, and so the SCIM Working Group is extremely reluctant to tamper with this implementation.

With my suggestion (based on "best practice" around the use of PATCH), here's how to delete a single member from a group:

   PATCH /Groups/acbf3ae7-8463-4692-b4fd-9b4da3f908ce
   Host: example.com
   Accept: application/json
   Authorization: Bearer h480djs93hd8
   ETag: W/"a330bc54f0671c9"

   {
     "operations" : [
       {
         "RETIRE" : {
           "key" : "members.2819c223-7f76-453a-919d-413861904646"
         }
       }
     ]
   }

Here's how I suggest deleting ALL members from a group:

   PATCH /Groups/acbf3ae7-8463-4692-b4fd-9b4da3f908ce
   Host: example.com
   Accept: application/json
   Authorization: Bearer h480djs93hd8
   ETag: W/"a330bc54f0671c9"

   {
     "operations" : [
       {
         "RETIRE" : {
           "key" : "members"
         }
       }
     ]
   }

That's a darn sight more elegant, if I may say so myself. In both cases, a reader can tell exactly what is being attempted. What's more, the syntax for both "delete" functions is identical.

This is arguably simple as an API. But I'm told it cannot be implemented, because it's not easy. I'm told that cloud service providers will not be willing to make the changes required to make this work, and this will hinder adoption of the spec. And what do incumbent cloud providers use to store identity information? Why, LDAP of course!

That's why I'm coming around to the view that no incremental change is possible. Human society progresses in its views not because people gradually change their minds, but because generations die, and new generations with different ideas take their place. Ideas only die when people die. People never change their minds. It's the people who have to be replaced. A cynical observation, but probably true. The Augean stables can only be cleaned with a flood. 

So we need a new spec, a new set of cloud service providers and a new type of directory, if we are to simplify cross-domain identity management.

It's the last of these that I want to talk about here.

I want a directory that has the advantage of LDAP (fast reads) without the disadvantages (a rigid tree structure and awful treatment of multi-valued attributes). I think one of the NoSQL document databases that can hold JSON documents may fit the bill, with one important constraint. JSON supports two types of data structures, the dictionary and the array. The dictionary forces every value to have a unique key. That's goodness. The array allows multiple values per key. As we now know, that's the root of all evil. Therefore, our document database must only allow dictionaries and not arrays. Clients are still free to upload data to the directory in array form, but the directory will only store these arrays after converting them into dictionaries. It will generate its own random keys for them. (If the order of elements is to be preserved, the generated keys can be in sequence, with sufficient gaps between them to permit insertions of other values later.)
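To make the conversion concrete, here is a minimal Python sketch of what such a directory might do on upload. Everything in it is an assumption for illustration: the function names, the 12-hex-digit random key format, and the gap of 1000 between sequential keys are hypothetical choices, not part of any spec or product.

```python
import secrets


def new_key() -> str:
    """Return a random 12-hex-digit key (hypothetical key format)."""
    return secrets.token_hex(6)


def sequence_key(position: int, gap: int = 1000) -> str:
    """Return a zero-padded, gapped key so that string order matches array order."""
    return f"{(position + 1) * gap:012d}"


def to_dictionary_only(value, ordered: bool = False):
    """Recursively replace every JSON array with a dictionary of generated keys."""
    if isinstance(value, dict):
        return {k: to_dictionary_only(v, ordered) for k, v in value.items()}
    if isinstance(value, list):
        converted = {}
        for position, item in enumerate(value):
            key = sequence_key(position) if ordered else new_key()
            while key in converted:  # regenerate on the rare random-key collision
                key = new_key()
            converted[key] = to_dictionary_only(item, ordered)
        return converted
    return value  # strings, numbers, booleans and null are stored unchanged


record = {"available_services": ["cable", "ADSL"]}
stored = to_dictionary_only(record, ordered=True)
# stored == {"available_services": {"000000001000": "cable",
#                                   "000000002000": "ADSL"}}
```

With ordered=True, a value can later be slotted between the first two entries by choosing any unused key between "000000001000" and "000000002000". With the default random keys, no ordering is implied.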

Every element of such a directory can be addressed in a fully-qualified way through "dot notation". Take this example of part of a Telecom company's customer database. The logical customer record shown has two addresses, and different telecom "carriage" services are available at the two addresses. The first address has carriage services "cable" and "ADSL", while the second has "ADSL2+" and "Wi-fi". As we can see, that's two nested arrays. The outer array holds addresses, and within each address, there's an inner array holding available services.

{
  ...
  "addresses": [
    {
      "type" : "home",
      "street_number" : "35",
      "street_name" : "High Road",
      ...
      "country" : "Australia",
      "available_services" : ["cable", "ADSL"]
    },
    {
      "type" : "office",
      "street_number" : "213",
      "street_name" : "Main Street",
      ...
      "country" : "Australia",
      "available_services": ["ADSL2+", "Wi-fi"]
    }
  ]
}

Let's say we need to update the customer record, because the first address now has "Wi-fi" too, and it no longer has "cable". How do we do this?

We want to delete
"customerX.addresses[0].available_services[0]"

and insert "Wi-fi" into the array
"customerX.addresses[0].available_services".

However, we can't reliably use positional indexes because these can change with every operation. We need stable identifiers. (I will not even attempt to formulate the SCIM syntax required for these operations!)
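A tiny Python illustration of why positional indexes are unreliable, assuming two clients work on the same list of services. The generated keys are the ones used later in this post; the scenario itself is made up.

```python
# The same two services, held once as an array and once as a keyed dictionary.
by_index = ["cable", "ADSL"]
by_key = {"9be6378dc303": "cable", "6aa1429eba34": "ADSL"}

# Another client removes "cable" first.
del by_index[0]
del by_key["9be6378dc303"]

# A client that still believes "ADSL" lives at index 1 is now out of range,
# and index 0 has silently come to mean "ADSL" instead of "cable".
stale_index_still_valid = 1 < len(by_index)
value_now_at_index_0 = by_index[0]

# The generated key still refers to exactly the same value as before.
value_at_stable_key = by_key["6aa1429eba34"]
```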

Here's how I suggest it should be done:

PATCH /Users/2819c223-7f76-453a-919d-413861904646

{
  "operations" : [
    {
      "INCLUDE" : {
        "key" : "addresses.d6ea365462f5.available_services",
        "value" : "Wi-fi"
      },
      "RETIRE" : {
        "key" : "addresses.d6ea365462f5.available_services.9be6378dc303"
      }
    }
  ]
}

Again, if I may say so myself, this is elegance. And that's only possible because my directory has converted these two arrays into dictionaries, like this:

{
  ...
  "addresses": {
    "d6ea365462f5" :
    {
      "type" : "home",
      "street_number" : "35",
      "street_name" : "High Road",
      ...
      "country" : "Australia",
      "available_services" : {
        "9be6378dc303" : "cable", 
        "6aa1429eba34" : "ADSL"
      }
    },
    "3cbaaff8e84e" :
    {
      "type" : "office",
      "street_number" : "213",
      "street_name" : "Main Street",
      ...
      "country" : "Australia",
      "available_services" : {
        "2beca1fdf3e5" : "ADSL2+", 
        "8c3dcc204a33" : "Wi-fi"
      }
    }
  }
}

(But how does the client know these generated keys? They're returned as part of the record creation response, and also provided as part of the resource representation returned with every GET request.)
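For what it's worth, here is a rough Python sketch of how a server might apply such a PATCH body to a dictionary-only document. It is an illustration under assumptions, not a reference implementation: the function names are hypothetical, generated keys are assumed to be 12 random hex digits, and keys are assumed never to contain a dot.

```python
import secrets


def _parent_and_leaf(document: dict, dotted_key: str):
    """Walk a dot-notation key and return (parent dictionary, final segment)."""
    *path, leaf = dotted_key.split(".")
    node = document
    for segment in path:
        node = node[segment]  # raises KeyError if the path does not exist
    return node, leaf


def retire(document: dict, dotted_key: str) -> None:
    """Remove whatever the key addresses: one value or a whole collection."""
    parent, leaf = _parent_and_leaf(document, dotted_key)
    del parent[leaf]


def include(document: dict, dotted_key: str, value) -> str:
    """Add a value to the addressed collection and return its generated key."""
    parent, leaf = _parent_and_leaf(document, dotted_key)
    collection = parent.setdefault(leaf, {})
    generated = secrets.token_hex(6)
    while generated in collection:  # regenerate on the rare collision
        generated = secrets.token_hex(6)
    collection[generated] = value
    return generated


def apply_patch(document: dict, patch: dict) -> list:
    """Apply every operation in the PATCH body; return keys generated by INCLUDE."""
    generated_keys = []
    for operation in patch["operations"]:
        for verb, arguments in operation.items():
            if verb == "INCLUDE":
                generated_keys.append(
                    include(document, arguments["key"], arguments["value"]))
            elif verb == "RETIRE":
                retire(document, arguments["key"])
            else:
                raise ValueError(f"unsupported operation: {verb}")
    return generated_keys
```

The same retire function serves both kinds of delete: given "members.2819c223-7f76-453a-919d-413861904646" it removes one member, and given "members" it removes them all. The list returned by apply_patch is what the server would hand back so that the client learns the newly generated keys.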

This model is easy to support if a service provider implements its identity store as a database of JSON documents rather than in LDAP. One can have both "simple" and "easy", without compromises. But for that, LDAP has to go. "NoLDAP" is what we need in its place.

My definition of a "NoLDAP Directory" is simply this:
A document database that holds dictionary-only JSON documents.

I think the next generation of cloud service providers will emerge based on this architecture, offering a simple API that is also easy for them to implement.

As for LDAP and SCIM, I guess the best TLA is RIP.

2 comments:

Gladston Arulanandam said...

Not just the content but love your writing too. Liked the statement "Human society progresses in its views not because people gradually change their minds, but because generations die". Too good and crisp and so thought you plagiarized this but looks like it's your own!

prasadgc said...

Thanks, Gladston. That one was original, all right. It takes personal frustration to come up with good ones ;-)

Ganesh