Oct 11

Distributed locking made easy

There are various situations in which one would use a simple mutex to ensure mutual exclusion on a shared resource. This is usually easy to accomplish with your favorite language's standard library, but it constrains you to mutual exclusion within a single process. Single-machine mutual exclusion is also rather straightforward: you typically lock a resource (e.g. a file) and wait for the lock to be released, which gives you an IPC mutex.
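For the single-machine case, here is a minimal sketch using an advisory file lock via Ruby's File#flock; the lock path is an arbitrary assumption, and any path all participating processes agree on will do:

```ruby
# Single-machine, inter-process mutual exclusion via an advisory file lock.
# LOCK_PATH is an arbitrary choice; all processes must agree on it.
LOCK_PATH = "/tmp/myapp.lock"

def with_file_lock(path = LOCK_PATH)
  File.open(path, File::RDWR | File::CREAT, 0644) do |f|
    f.flock(File::LOCK_EX) # blocks until no other process holds the lock
    yield
  end                      # lock is released when the file handle closes
end

with_file_lock { puts "only one process at a time runs this" }
```

Note that advisory locks only exclude processes that also use flock on the same file; they do not stop an uncooperative process from touching the resource.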

But what if one needs a distributed mutex to provide mutual exclusion among distributed clients? Such a mutex has to offer various guarantees, as does any shared state. Shared state is hard to reason about and invites bugs; shared distributed state is much harder still, requiring guaranteed distributed consensus while operating over an unreliable network. There are various distributed consensus algorithms, Paxos being one of the most widely used.

Fortunately, you can deploy a distributed locking service without implementing the consensus yourself. Apache ZooKeeper offers distributed synchronization and group services; a mutex/lock is just one of the things you can build with it. ZooKeeper's API is rather low level, so implementing a distributed lock, although conceptually simple, requires some boilerplate.

Recently we needed a distributed lock service to ensure that only one person in our organization performs a particular systems activity at any given point in time. We implemented it on top of a homegrown tool written in Ruby. The example below is in Ruby, though the API calls translate to any language…

Usage of the Lock class looks like this:

      5, ## Timeout in seconds
      lambda { ## Timeout callback
        abort("Couldn't acquire lock.  Timeout.") },
      lambda { ## Do whatever you want here }

The details of the algorithm are outlined in the ZooKeeper recipes documentation.
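Roughly, the recipe is: each contender creates an ephemeral sequential znode under the lock node; whoever holds the lowest sequence number owns the lock, and everyone else watches the znode just below their own and retries when it goes away. A hedged sketch of just that decision logic follows; the `client` object and its `create`/`get_children` methods stand in for a real ZooKeeper connection and are assumptions for illustration, not the post's actual code:

```ruby
# Does my znode carry the lowest sequence number among its siblings?
def lowest_sequence?(my_node, siblings)
  seq = ->(name) { name[/\d+\z/].to_i } # trailing digits are the sequence
  seq.call(File.basename(my_node)) == siblings.map(&seq).min
end

# One attempt at the ZooKeeper lock recipe. `client` is a hypothetical
# stand-in for a real ZooKeeper connection.
def try_lock(client, root = "/myapp/lock")
  # Ephemeral: the znode vanishes if our session dies, so a crashed
  # client can never hold the lock forever.
  my_node = client.create("#{root}/lock-", ephemeral: true, sequential: true)
  siblings = client.get_children(root)
  if lowest_sequence?(my_node, siblings)
    :acquired
  else
    # In the full recipe we would now watch the znode with the
    # next-lowest sequence number and retry when it is deleted,
    # which avoids waking every waiter at once (the "herd effect").
    :waiting
  end
end
```

The ephemeral flag is what makes the scheme crash-safe, and watching only the immediate predecessor (rather than the lock node itself) is what keeps contention from turning into a thundering herd.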

Of course, before using it, you must install ZooKeeper and create the root path /myapp.

Also, please note that I have removed the access control part from the example. Before using this in production, I strongly encourage you to read the ZooKeeper access control documentation.



  1. Do you have any idea why the sequence complication is necessary for a simple lock? This is the way they recommend implementing locks in the ZooKeeper documentation, but there’s no explanation for it. After all, creating a node that already exists does issue an error — if they ensure non-concurrency for sequences, surely they must also ensure it for the more basic task of creating nodes, right? Or…?

  2. There are two bugs in the example above. The first is that it should delete the lock file if it fails to get the lock and times out (copy line 14 to after 16). Without deleting the lock file, when two instances each drop their attempt in, nobody ever gets the lock again. The second is outside of the gist, in the example usage: the second lambda (containing the action to take if you get the lock) should be a proc (or a block) instead of a lambda (and either be moved outside the closing paren or preceded with an &). Passing it as a lambda errors out with “Too many arguments”. I’m sure there are details there that I don’t quite understand, but that’s what I had to do to make it work. Finally, “Zookeeper::WatcherCallback” has changed to “Zookeeper::Callbacks::WatcherCallback”, though that only generates a warning rather than failing.

    I forked the gist and made these changes, as well as adding a small loop that uses the class to demonstrate acquiring the lock (and fighting other instances of the same script for it): https://gist.github.com/4693078.

    Thanks a bunch for the example; it helped me validate my zookeeper cluster and is a nice proof of concept to show off zookeeper.
