FastUtil OpenHashMap internals (linear probing hashing)

Linear probing

In the previous post I looked at the implementation of the standard java HashMap that uses separate chaining hashing. It has disadvantage of using linked lists. Another strategy is using linear probing over separate chaining. Linear probing is a simple open-addressing hashing strategy. To insert an element x, compute h(x) and try to place x there. If it’s full, keep moving through the array wrapping around at the end, until a free spot is found:

When using linear probing, one advantage is that it can use less memory than separate chaining because we don’t need to create a linked list to resolve collisions. Another advantage is locality of reference since the accesses performed in linear probing tend to be closer in memory than the accesses performed in chained hashing.

One weakness of linear probing is that, with a bad choice of hash function, primary clustering can cause the performance of the table to degrade significantly. While chained hashing can still suffer from bad hash functions, it’s less sensitive to elements with nearby hash codes, which don’t adversely impact the runtime. If a hash function is bad, we can use another data structure like a binary search tree to improve performance from O(N) to O(LogN).

The other weakness of linear probing is that its performance significantly degrades as the load factor approaches 1. You can address this either by rehashing periodically or by using the Robin Hood hashing technique.

OpenHashMap

Let’s take a look at how OpenHashMap uses linear probing to implement fast hash map. OpenHashMap is a collection from FastUtils that contains type-specific Java collections. So instead of Map<String, Integer> we can use Object2IntOpenHashMap<String> and instead of Map<Integer, String> we can use Int2ObjectOpenHashMap<String>. Since we don’t need to box/unbox int or other java primitive types, the performance is better than in the standard java HashMap. It also uses a linear probing algorithm.

OpenHashMap keeps values and keys in 2 arrays. Since value is a primitive type, we create an array of int. We also track the size of the array by getting the next power of two for the expected size divided by load factor that has a range from 0 to 1 (excluded). Next we calculate how many items we can add before we resize the arrays and keeping the value in maxFill variable.

Get operation

To perform a search, we compute a hash function for the given key and apply mask to the hashed value. We may have any hash value and we are interested in hashes that have values below the array size. Since we use power of two, it’s pretty easy to remove unnecessary bits:

Next there is a while loop until the key in the array is not equal to the given key or we found a null value. We increment position by 1 in each iteration and traverse the array. If we found the value, return it from the function otherwise return default value that you can set by calling:

defaultReturnValue(int v)

Since we use the primitive type int, we can’t return null pointer so we need to provide a hint what to return in case the value is not found.

Put operation

First we call find() function to find a position for insert. We get a hash code for the key, apply mask and if position is not available (not null), increment the position and check the next spot in the array. We do this loop until we find available spot and return a negative index of the array where we can insert the value. If we found that the value is already exist in the array, we return positive index of this value.

Then we check if position is negative and do actual insert or otherwise update the value and return the old object.

If we have a bad hash function, we may get a lot of collisions. As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large. Worse, even if the table is relatively empty, blocks of occupied cells start forming. This effect, known as primary clustering, means that any key that hashes into the cluster will require several attempts to resolve the collision, and then it will add to the cluster.

You can find performance testing of FastUtils vs other popular libraries here:

http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/

FastUtil OpenHashMap internals (linear probing hashing)

Linear probing

OpenHashMap

Get operation

Put operation

Java HashMap implementation