Recently I read a blog post by Jonathan Ellis about Bloom filters. Jonathan works on Cassandra and thus had lots of empirical recommendations on its implementation; Cassandra uses Bloom filters extensively for performance optimization. A Bloom filter is a space-efficient probabilistic data structure used to test whether an element belongs to a set. It’s similar to a standard hashtable, except that it’s more space-efficient and doesn’t store the actual value; instead it hashes the keys into a bit array. The reason it’s a probabilistic data structure is that it allows false positives, but never false negatives. This means that to answer the question of whether an element x belongs to a set S (x ∈ S), a Bloom filter returns true or false. A *false* (doesn’t exist) is guaranteed to be accurate, but a *true* (exists) has some probability of being a false positive.

So why would one use such an algorithm? Say you store records on disk. Someone requests a particular record and you proceed to seek it. This is usually an expensive operation in high-throughput systems or systems with limited resources, so before invoking such an expensive operation you can ask the Bloom filter whether the record exists. If the filter says it exists, you can then proceed to retrieve the record through the more intensive operation. Because there is a small probability of a false positive, you might still find (through the resource-intensive operation) that it doesn’t exist. But in an environment where you service many requests for records which might not exist, you can reduce the number of expensive operations and answer such requests in constant time, O(1).
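The guard pattern described above can be sketched like this. The `recordStore`, `fetchFromDisk`, and set-based filter stand-in are all hypothetical, purely for illustration; in practice the stand-in would be a real Bloom filter like the one shown later in this post:

```scala
// Sketch: consult a cheap membership test before an expensive lookup.
object LookupGuard {
  // Stand-in for the Bloom filter (a plain mutable Set here so the
  // sketch is self-contained; a real filter would answer membership
  // probabilistically, with possible false positives).
  val filter = scala.collection.mutable.Set[String]()

  // Hypothetical on-disk record store.
  val recordStore = Map("user:42" -> "Ilya")

  def fetchFromDisk(key: String): Option[String] = {
    // Imagine an expensive disk seek happening here.
    recordStore.get(key)
  }

  def lookup(key: String): Option[String] =
    if (!filter.contains(key)) None  // definite miss: answered in O(1), no seek
    else fetchFromDisk(key)          // possible hit (or a false positive)

  def main(args: Array[String]): Unit = {
    filter += "user:42"
    assert(lookup("user:42") == Some("Ilya"))
    assert(lookup("user:99") == None) // answered without touching "disk"
  }
}
```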

Jonathan’s blog post provides some great information about how to implement a very effective Bloom filter. One of the most important considerations is the hash function. To lower the probability of false positives, one must use a hash function which distributes hashes evenly across the hash space. Jonathan recommends the MurmurHash algorithm, which offers great performance and a low collision rate.

Another thing done to reduce hash collisions, and in turn false positives, is that you don’t just set the bit for a single hash function’s result; you do so several times. (5 hashes is referred to a lot in the literature and seems like a sweet spot.) This means that you take a key, calculate 5 hashes (using 5 different hash algorithms, or the single-hash-algorithm strategy I’ll discuss below) and set the bit for each one of these hashes in the bit array. Answering the question of whether the key exists does the reverse: calculate the 5 hashes and check that they are all set in the bit array. If any of the 5 aren’t set, then you can be assured the key was never added.
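As a rough guide to why a handful of hashes works well: with *m* bits, *n* inserted keys, and *k* hash functions, the standard approximation for the false-positive rate is *p ≈ (1 − e^(−kn/m))^k*. A quick sketch of that formula (the numbers below are illustrative, not from the post):

```scala
object BloomMath {
  // Approximate false-positive probability for a Bloom filter with
  // m bits, n inserted keys, and k hash functions:
  //   p ≈ (1 - e^(-k*n/m))^k
  def falsePositiveRate(m: Int, n: Int, k: Int): Double =
    math.pow(1 - math.exp(-k.toDouble * n / m), k)

  def main(args: Array[String]): Unit = {
    // With 10 bits per key and 5 hashes, p is roughly 1%.
    val p = falsePositiveRate(m = 10000, n = 1000, k = 5)
    println(f"p = $p%.4f")
    assert(p < 0.02)
  }
}
```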

So let’s look at some code. Below is an implementation of a Bloom filter in Scala. It relies on a MurmurHash implementation which I won’t list, but you can view/download it here.

```scala
import scala.collection.mutable.BitSet

class BloomFilter(capacity: Int, hashable: Hashable, hashCount: Int = 5) {
  private val buckets: BitSet = new BitSet(capacity)
  // Partially applied: fix the hash count and bucket count up front.
  private val hashFunc = hashable.hashes(hashCount)(capacity) _

  def addValue(value: String) {
    // Set the bit for each of the k hashes of the key.
    hashFunc(value).foreach(buckets += _)
  }

  def exists_?(value: String): Boolean = {
    // If any of the k bits is unset, the key was definitely never added.
    for (i <- hashFunc(value))
      if (!buckets.contains(i)) return false
    true
  }
}

trait Hashable {
  def hashes(hashCount: Int)(max: Int)(value: String): Array[Int]
}

class MurmurHashable extends Hashable {
  import com.cobrio.algorithms.{MurmurHash => MH}

  def hashes(hashCount: Int)(max: Int)(value: String): Array[Int] = {
    // Two real Murmur invocations; the second is seeded with the first.
    val hash1 = MH.hash(value.getBytes, 0)
    val hash2 = MH.hash(value.getBytes, hash1)
    // Simulate hashCount hash functions: g_i(x) = (h1 + i * h2) mod max.
    (for (i <- 0 until hashCount)
      yield Math.abs((hash1 + i * hash2) % max)).toArray
  }
}
```

The code above should be pretty self-explanatory, but let’s take a look at the hashing strategy. We calculate 5 hashes (the default) on the key being stored, although we only ever invoke the Murmur algorithm twice. Look at the `hashes` method above. Adam Kirsch and Michael Mitzenmacher wrote a paper titled “Less Hashing, Same Performance…”, which shows that a particular hashing technique which simulates additional hash functions beyond two can increase the performance of Bloom filters without any significant loss in the false-positive probability. To summarize the math in the paper, this is the formula: *g_{i}(x) = h_{1}(x) + ih_{2}(x) mod m*, where *m* is the number of buckets in the bloom filter, *h_{1}* and *h_{2}* are the two calculated hashes respectively, and *i* ranges from *0* up to *k − 1*, where *k* is the number of hashes we want to generate.

Here is how you’d use the above bloom filter…

```scala
val bloom = new BloomFilter(2000, new MurmurHashable())
bloom.addValue("Ilya Sterin")
bloom.addValue("Elijah Sterin")
assert(bloom.exists_?("Ilya Sterin"))
assert(bloom.exists_?("Elijah Sterin"))
assert(!bloom.exists_?("Don't Exist"))
```
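To make the Kirsch–Mitzenmacher double-hashing trick concrete, here is a standalone sketch that derives k bucket indices from just two base hashes. The `hashCode`-based h1/h2 are cheap stand-ins for MurmurHash, purely for illustration:

```scala
object DoubleHashing {
  // g_i(x) = (h1(x) + i * h2(x)) mod m — k indices from only two base hashes.
  def indices(value: String, k: Int, m: Int): Seq[Int] = {
    val h1 = value.hashCode          // stand-in for the first Murmur hash
    val h2 = value.reverse.hashCode  // stand-in for the second, seeded hash
    for (i <- 0 until k) yield math.abs((h1 + i * h2) % m)
  }

  def main(args: Array[String]): Unit = {
    val idx = indices("Ilya Sterin", k = 5, m = 2000)
    assert(idx.length == 5)                      // k indices produced
    assert(idx.forall(i => i >= 0 && i < 2000))  // all within the m buckets
    println(idx.mkString(", "))
  }
}
```

Only two hash computations happen per key no matter how large k is, which is the whole point of the paper’s result.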

Tags: algorithms, scala, statistics

This has helped me out – thanks.


Thanks. Good summary of bloom filters.

Thanks for your help

Excellent clean code ! Thank you !

Can you just do MH.hash(value.getBytes, i) for each i?