An introduction to hash tables with Python

What is it? What is it good for? And how do they work?

Before we get into this article, you should know that I will be having some code samples and to get a good sense of each line of code, you should look at this hash table class that I made. I won’t be reviewing the code line by line or method by method but you’ll get a good sense of what the code and concept would look like if you have the hash table class as a reference.

Distinguishing Dictionaries and Hash Tables

What a Dictionary is

Before we jump into hash tables, it’s a good idea to define a dictionary first. Just like a real-life dictionary where a word maps to a definition, in Computer Science a dictionary is a general concept that maps keys to values.

Isn’t a hash table and dictionary the same thing?

You might have heard people used these two words interchangeably. However, now that we know that a dictionary is a general concept that maps keys to values… isn’t a hash table the same thing? Nope, it’s not!

What a hash table is

A hash table is a specific way to implement a dictionary. There are other ways of implementing a dictionary. Another way includes red-black trees.

Putting Hash Tables and Dictionaries together

In Python, dictionaries are implemented as Hash Tables. Below is an image for a better understanding of what you just read:

How Hash Tables work

We’re going to understand how most hash tables work. In Python, there are things that are implemented differently than most languages which I will go over again later in this article. Knowing this way of how a hash table works, is the most useful since this is how most languages have their hash tables built under the hood.

The key and value in a dictionary can be any data type. Most of the time, the key is a string. However, the key and value could be a string, integer, person, circle, pretty much anything…as long as you have a hash function (We will discuss this soon).

Necessary ingredients that a hash table needs

A hamburger with sauces representing a hashtable

An Array of buckets

We need an array to store items. Each item consists of a bucket. Each bucket is a Singly Linked List.

Hash Function

A hash function takes in a string, converts that string into some sort of integer(this integer is called the hash code), and then remaps the hash code into an index(item in the index is a “bucket”) in the array. The goal of a hash function in a hash table is to return the index of where we want to put the key-value pair entry. There are different types of hash functions but for this example, we are going to say that our hash function is defined as below:

  def _bucket_index(self, key):
  “””Return the bucket index where the given key would be stored.”””
      return hash(key) % len(self.buckets)

The code above simply takes in the key(a string in our case) and then when hashing the key (hash(key)). This becomes our hash code and we call the modulus operator on the number of buckets we currently have. Like I said earlier, this is a very simple hash function. Hash functions that are built-in languages are usually more robust and perform differently from the code above.

Linked Lists

Okay, so we used the hash function and using the integer it outputs, which is an index. The index is where the “bucket” or linked list is…and when we access the array by that index or bucket, our bucket is a singly linked list where we are going to store our key-value pair entry.

Hash Collision

What if you are adding multiple key-value pairs into the hash table and we happen to encounter the same index / linked list of a key-value pair we entered before? This is known as a hash collision. One way to handle this is by using a linked list. By using a singly linked linked list, this method of a hash collision is called “chaining”. What chaining means is “Hey, if there are any collisions, store them in a linked list”. Since we are using a linked list, we are able to store multiple entries. You are not limited to linked lists. You can also handle hash collisions with other data structures such as trees.

Load factor

We’ve added enough entries to the point where some of our linked lists are very long. Therefore, it’s sometimes taking a while to retrieve the value of the key we are looking for. In a hash table, a load factor is a number that measures how full the hash table is. If our load factor is 0.75, when the load factor reaches above 0.75, the hash table will call the resize operation. This way, we are able to keep the hash table efficient.

Formula of load factor:

L= number of entries/number of buckets

Hash Table Resizing

Now we know that the hash table calls the resize operation when the load factor exceeds a certain number. Let’s see how the resizing operation works.

The set method in the hash table

Within the hash table, there is a method called “set” or something related which all it does is add a key-value entry.

There are 3 things you could do in order to resize your hash table:

1 — Get a list of all key-value entries

  temp_entries = []
  for entry in self.entries:
        temp_entries.append(entry)

2 — Re-initialize the array of buckets to the double number it currently is.

For example, if my current number of buckets in the array is 4 you double it so the new size is now 8. Because you reinitialized it, the buckets are empty but since you saved all the key-value entries before this step, it moves us to the next step!

  self.__init__(new_size)

3. Iterate the list of all key-value entries and for each key-value entry, call the set operation to put that key-value entry in the hash table.

  for k,v in temp_entries:
      self.set(k,v)

To sum these three steps up, all we’re doing, in a nutshell, is putting all of our key-value pairs in a temporary list. Then, we re-initialize the class which takes in the size of the hash table so we multiply the current size * 2. Finally, we just iterate over the temporary list that is created before all of this and adds each item into the re-initialized hash table by calling the set method on each item.

Steps to getting a key-value pair

So we’ve now learned a lot about hash tables and if you feel overwhelmed, please don’t! You’re not supposed to learn all of this in one read and to get everything you want to know you’re going to probably want to do a little more researching 🙂There is one more main thing you should know about hash tables, and that’s knowing how you get a specific key-value pair.

Great! We’ve added all the items we needed in the hash table, now let’s learn how you get a particular key-value pair under the hood.

Getting a key-value pair is very similar to adding a key-value pair. We first find the index using the hash function, then we access our bucket / linked list and iterate through that.

. . .

Resources

Here are the resources I used to put this article together. I would recommend going through these if you want to go more in-depth about hash tables & dictionaries!

Hash Table vs Dictionaries, Samuel Klatchko
How are Python’s built in dictionaries implemented, Praveen Gollakota
Data Structures: Hash Tables, Gayle Laakmann McDowell

Intro to Hashtables