Java HashMap implementation


Hash map

A hash map is a data structure that maps keys to values. If our keys are small integers, we could simply use an array. For example, suppose we want to map 1 -> "Hash", 2 -> "map", 3 -> "is", 4 -> "awesome!". We can create an array of Strings:
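A minimal sketch (index 0 is left unused so that the keys line up with the array indexes):

String[] values = new String[5];   // keys 1..4 map directly to array indexes
values[1] = "Hash";
values[2] = "map";
values[3] = "is";
values[4] = "awesome!";
System.out.println(values[2]);     // prints "map"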

This is the most efficient way to keep these key/value pairs: search, insert, and delete all cost O(1). But what if the keys are in the range from 0 to 1,000,000,000? It is not space efficient to create an array with that many elements just to keep 4 values.

A hash function helps to map arbitrary keys to array indexes from 0 to N - 1, where N is the table size. Ideally, a hash function should be simple to compute and should ensure that any two distinct keys get different indexes in the array. The main problem with this approach is dealing with the situation when two different keys map to the same index. This problem is known as a collision.

Hash function

A hash function transforms an object into a number. Using hashCode mod arraySize, we can find the index in the array for the value.
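As a sketch (the method and parameter names are illustrative, not from the JDK source):

// naive bucket index for an array of arbitrary length
static int bucketIndex(Object key, int arraySize) {
    int h = key.hashCode();
    return (h & 0x7fffffff) % arraySize;   // clear the sign bit so the index is non-negative
}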

In the example above all keys are integers, so returning key % arraySize seems a reasonable choice. Now consider what happens if the size of the array is 10 and a lot of keys end with 0 (100, 110, 120, 130, 140, etc.): they all get the same index 0! It's a good idea to make the size of the array a prime number, and many hash table implementations include a function that finds the next prime number.

Java HashMap uses a power of two for the array length. The reason is performance: the modulo operation is expensive. If the length is a power of two, we can replace the modulo (%) operation with a bitwise AND (&) operation to calculate the chain index.
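A sketch of the two ways to compute the index (the method names are illustrative; the JDK writes the second form inline as (n - 1) & hash):

static int indexWithModulo(int hash, int tableLength) {
    return (hash & 0x7fffffff) % tableLength;   // works for any table length
}

static int indexWithMask(int hash, int tableLength) {
    // valid only when tableLength is a power of two:
    // tableLength - 1 is then a mask of ones, e.g. 16 - 1 = 0b1111
    return (tableLength - 1) & hash;
}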

A hash function must return the same value for equal objects: if x.equals(y), then x.hashCode() == y.hashCode(). Let's consider how Java implements hash codes for different data types:
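A few examples (the static helpers below mirror the instance implementations in the JDK):

int a = Integer.hashCode(42);      // the int value itself: 42
int b = Long.hashCode(42L);        // (int) (value ^ (value >>> 32))
int c = Boolean.hashCode(true);    // a fixed constant: 1231 (false gives 1237)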

String caches its hash. If the cached hash is 0, it is calculated by looping through all characters and multiplying the running value by 31. The multiplier is usually a prime number (for better distribution); its precise value is not important.
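Roughly, the implementation looks like this (a simplified sketch of String.hashCode from JDKs that store a char[] internally):

public int hashCode() {
    int h = hash;                  // cached field; 0 means "not computed yet"
    if (h == 0 && value.length > 0) {
        for (char c : value)       // value is the string's character array
            h = 31 * h + c;
        hash = h;                  // cache the result for later calls
    }
    return h;
}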

The common pattern for creating a hash code for a custom object is to combine the hash codes of all fields, multiplying the running result by a small prime number, for example:
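A sketch for a hypothetical Person class (the field names are illustrative):

@Override
public int hashCode() {
    int result = 17;                               // arbitrary non-zero seed
    result = 31 * result + firstName.hashCode();
    result = 31 * result + lastName.hashCode();
    result = 31 * result + age;                    // primitive int field
    return result;
}

java.util.Objects.hash(firstName, lastName, age) produces the same 31-multiplier pattern in a single call.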

In general a good hash function for a given data type should:

  • be consistent—equal keys must produce the same hash value
  • be efficient to compute
  • uniformly distribute the keys

 

Java hash map implementation

HashMap.java uses the separate chaining strategy: there is an array of buckets, and each bucket contains a linked list of nodes.

Nodes

HashMap.java keeps all key/value pairs in the nodes:
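A simplified sketch of the Node class (the getKey, getValue, setValue, equals, hashCode and toString methods are left out here):

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;      // cached hash of the key
    final K key;
    V value;
    Node<K,V> next;      // the next node in the same bucket (the chain)

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // getKey(), getValue(), setValue(), equals(), hashCode(), toString() omitted
}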

Each node contains a key and a value. There are two implementations of the node:

    1. The basic node, Node, is an implementation of a linked list. It has a pointer to the next Node in the chain; the last node has a null pointer. (See the picture above: John Smith has a pointer to the next node, which is Sandra Dee.)
    2. TreeNode is an implementation of a binary search tree (a red-black tree). This is a performance optimization for buckets with a lot of collisions. To keep the same traversal order it uses a doubly linked list, which is why it extends LinkedHashMap.Entry:
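A simplified sketch of the TreeNode declaration (all the tree-management methods are left out):

static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;    // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;      // needed to unlink next upon deletion
    boolean red;

    TreeNode(int hash, K key, V val, Node<K,V> next) {
        super(hash, key, val, next);
    }
    // getTreeNode(), treeify(), split(), rotateLeft(), rotateRight() and the rest omitted
}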

Get operation

To perform a search, we use the hash function and the AND operation to determine which linked list (or BST) to traverse.

First we try to get the index of the linked list in the array using the & (AND) operation: first = tab[(n - 1) & (hash = hash(key))].

For example, we may have this hash code: 10010011_00111110_01011001_01001110, and tab.length is 00000001_00000000 (256).

We know that the length is a power of two (exactly one bit is set), so it is easy to create a mask that extracts the index: 256 - 1 = 255, or 11111111. Now we can use & to get the exact index: 10010011_00111110_01011001_01001110 & 11111111 = 01001110 = 78 (the last byte of the hash code).

The last step is to check that the hash code and the key are the same. If we have more than one element in the bucket, we traverse the linked list or the tree of TreeNodes (the binary search tree) and compare the keys:
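A simplified sketch, based on HashMap.getNode (the real source caches a few more locals but follows the same steps):

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab = table;
    Node<K,V> first, e;
    int n;
    if (tab != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // 1. check the first node in the bucket
        if (first.hash == hash &&
            (key == first.key || (key != null && key.equals(first.key))))
            return first;
        if ((e = first.next) != null) {
            // 2. the bucket was treeified: delegate to the red-black tree search
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>) first).getTreeNode(hash, key);
            // 3. otherwise walk the linked list and compare hashes and keys
            do {
                if (e.hash == hash &&
                    (key == e.key || (key != null && key.equals(e.key))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}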

Put operation

To perform put, we locate the appropriate bucket and traverse its linked list or BST to find the element with the given key. If the key exists, we update the value and return the old value. If the key doesn't exist, we create a new node and insert it into the bucket.
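A simplified sketch of that flow (not the actual putVal source: null keys, the empty-table case and the tree branch are skipped, the treeify and resize steps are shown below, and hash(key) stands for HashMap's internal bit-spreading helper):

public V put(K key, V value) {
    int hash = hash(key);
    int i = (table.length - 1) & hash;                    // bucket index
    Node<K,V> e = table[i];
    if (e == null) {                                      // empty bucket: start a new chain
        table[i] = new Node<>(hash, key, value, null);
        return null;
    }
    Node<K,V> prev = null;
    while (e != null) {
        if (e.hash == hash &&
            (key == e.key || (key != null && key.equals(e.key)))) {
            V oldValue = e.value;                         // the key already exists:
            e.value = value;                              // update and return the old value
            return oldValue;
        }
        prev = e;
        e = e.next;
    }
    prev.next = new Node<>(hash, key, value, null);       // append a new node to the chain
    return null;
}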

Sometimes the linked list may grow too long if we have a lot of collisions. To improve the search from O(N) to O(log N), HashMap converts the bucket to a binary search tree when a special threshold is reached:

if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
   treeifyBin(tab, hash);

The reason we don't use TreeNodes by default is that they are about twice the size of basic nodes. When the tree becomes too small (after removals or a resize), it is converted back to a linked list. If the hash function is good, TreeNodes are rarely used at all.

The last step is to update the size and, if the size is larger than the threshold, to resize the hash map:

if (++size > threshold)
  resize();

Resize operation

When we add more items to the hash map, the size of the array is increased to the next power of two, starting from 16: 16, 32, 64, 128, 256, ... The decision to resize the array (to avoid too many collisions) is based on the load factor, which has a default value of 0.75. There is no need to change this number unless you find that you get a lot of collisions (decrease the load factor) or you have a hash function with perfect distribution (increase the load factor, but it should still be less than 1.0). This means that if the size of the array is 16 and we have more than 16 * 0.75 = 12 items, we need to resize it. Resizing is a very expensive operation: we need to create a new array, rehash all nodes, and place them into the new array. If you know the final size of your HashMap, it's better to use the constructor that takes an initial capacity parameter.
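For example, assuming we expect to store roughly 10,000 entries (the variable names and numbers are illustrative):

// default constructor: starts with capacity 16 and resizes repeatedly as we insert
Map<String, Integer> grownMap = new HashMap<>();

// sized up front: choose a capacity so that expectedSize <= capacity * loadFactor,
// which keeps the map from resizing while the first expectedSize entries are inserted
int expectedSize = 10_000;
Map<String, Integer> presizedMap = new HashMap<>((int) (expectedSize / 0.75f) + 1);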

 

The resize function calculates a new threshold using the formula array.length * loadFactor, creates a new array newTab, and copies all nodes into it. If we have more than one node in a chain, the relative order of the nodes is preserved.
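A simplified sketch of how a single chain at index j is split during resize (based on the chain-splitting part of HashMap.resize; a treeified bucket uses TreeNode.split instead):

// oldCap is the old array length; newTab.length == oldCap * 2
Node<K,V> loHead = null, loTail = null;   // nodes that stay at index j
Node<K,V> hiHead = null, hiTail = null;   // nodes that move to index j + oldCap
for (Node<K,V> e = oldTab[j]; e != null; e = e.next) {
    if ((e.hash & oldCap) == 0) {         // the newly significant hash bit is 0
        if (loTail == null) loHead = e; else loTail.next = e;
        loTail = e;
    } else {                              // the newly significant hash bit is 1
        if (hiTail == null) hiHead = e; else hiTail.next = e;
        hiTail = e;
    }
}
if (loTail != null) { loTail.next = null; newTab[j] = loHead; }
if (hiTail != null) { hiTail.next = null; newTab[j + oldCap] = hiHead; }

Appending to the tails of the two sub-lists is what preserves the relative order mentioned above.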