Hash Tables

Hash Tables

2020-02-17T23:46:37.782Z

Hash tables are stores of key-value pairs with O(1) constant time lookups. Hash tables are great for creating a mapping between two entities, or looking something up, or checking for duplicates, or caching. As named in various languages: Hashmap in Java, Object in JavaScript, Dict in Python.

Hash tables are by nature disordered.

Operation Runtime
lookup by key O(1)
insertion O(1)
deletion O(1)

Hash functions

To enable fast storage and retrieval, a hash table uses a hash function, i.e. a function where you put in a string and you get back a number—a string-to-integer conversion. In other words, a hash function maps strings to integers (indexes). Since the hash function provides the exact address of the item in memory (index), there is no need to iterate over items when accessing a value by a key.

Hashing allows us to retrieve a value using the index from a key without performing a search.

There are some requirements for a hash function:

  • Its mapping must be consistent: same string, same index.
  • Its mapping must be distinct: different strings, different indexes.
  • The hash function must know how big the array is and only return valid indexes.

So how does a hash function work? Say you have a collection of key-value pairs such as { "apple": 1.99 } and so on, representing products and their prices.

  1. Create an empty array.
  2. Feed the string "apple" (key, product) into the hash function. You get back integer 3 (index).
  3. Store the float 1.99 (value, price) at array index 3.
  4. Keep going, and eventually the whole array will be full of prices. (insertion)
  5. Now, when you need to search for "apple", you feed the key again into the hash function and get back an integer/index, and use that integer/index to find the price at the array index. (lookup)

Combine a hash function and an array and you get a hash table.

Another hash function is a secure hash algorithm (SHA) function. Given a string, SHA gives you a hash for that string. The terminology can be a little confusing here. SHA is a hash function. It generates a hash, which is just a short string. The hash function for hash tables went from string to array index, whereas SHA goes from string to string.

Collisions

A good hash function distributes data across all available indexes—ideally, a hash function should map keys evenly all over the hash. A collision occurs when two keys are mapped to the same index in a hash table, i.e. when two strings are assigned the same slot. If somehow all data ends up in a single index of the hash table, the hash table would be no better than an array: the worst-case performance for a hash table lookup is O(n). A hash table must be designed to reduce the number of collisions, so that it typically performs lookups in O(1) time rather than O(n) time.

While a hash table with one hundred indexes is great for preventing collisions, it might use up one hundred indexes to store just few pieces of data, which is a poor use of memory. This is the tradeoff: a good hash table strikes a balance between avoiding collisions and not consuming much memory.

Generally, there are two ways for handling collisions: open addressing and separate chaining. Open addressing is the process of finding an open location in the hash table in the event of a collision. Open addressing has several variations: linear probing, quadratic probing, and double hashing.

Generally speaking, for every seven data elements stored in a hash table, the hash table should have ten indexes. This ratio of data to indexes is called load factor. The ideal load factor is 0.7 (7 elements / 10 cells).

Implementation

In Java, a HashMap is a class and Map is an interface.

public class Main {
	public static void main(String[] args) {
		Map<Integer, String> map = new HashMap<>();
		map.put(1, "John");
		map.put(2, "Mary");
		var value = map.get(2)
		map.containsKey(2); // O(1)
		map.containsValue("Mary"); // O(n)

		for (var key : map.keySet())
			System.out.println(key);
	}
}

Implementing a hash table in Java:

public class HashTable {
	private class Entry {
		private int key;
		private String value;

		Entry(int key, String value) {
			this.key = key;
			this.value = value;
		}
	}

	private LinkedList<Entry>[] entries = new LinkedList[5];

	public void put(int key, String value) {
		var index = hash(key)
		if (entries[index] == null)
			entries[index] = new LinkedList<>();

		var bucket = entries[index];
		for (var entry : bucket) {
			if (entry.key == key) {
				entry.value = value;
				return;
			}
		}
		bucket.addLast(new Entry(key, value));
	}

	public String get(int key) {
		var index = hash(key);
		var bucket = entries[index];
		if (bucket != null) {
			for (var entry : bucket)
				if (entry.key == key)
					return entry.value;
		}
		return null;
	}

	public void remove(int key) {
		var index = hash(key);
		var bucket = entries[index];
		if (bucket != null)
			throw new IllegalStateException();
		for (entry : bucket) {
			if (entry.key = key) {
				bucket.remove(entry);
				return;
			}
		}
		throw new IllegalStateException();
	}

	private int hash(int key) {
		return key % entries.length;
	}

}

Common operations

Finding the first non-repeated character in a string:

public findFirstNonRepeatedCharacter(String input) {

	Map<Character, Integer> map = new HashMap<>();
	var chars = input.toCharArray()

	for (char ch : chars) {
		int count = map.containsKey ? map.get(ch) : 0;
		map.put(ch, count + 1);
	}
	for (char ch : chars) {
		if (map.get(ch) == 1)
			return ch;
	}
	return Character.MIN_VALUE;
}