Nanobind (link here) is a ... (description). Its authors attribute part of its speed to avoiding std::unordered_map (link here). C++ is generally regarded as a performant language (attribute here, something like OK, this is not always true .. etc etc with a ^1), but the (lack of) performance of std::unordered_map has been discussed at least a few times (e.g. ... link to some examples). To see why, we can start with a much simpler argument.

1. Two sum

The premise of the Two Sum problem is simple: given an array of integers and a target integer, return the indices of two numbers that add up to the target. For the purposes of this post, we will make one modification: instead of returning the indices, we simply return whether any two integers add up to the target. There are a few approaches to this problem:

  1. We can use two nested for loops: an outer one to select each element, and an inner one to check whether any subsequent element sums with it to the target.
  2. We can sort the array first, then scan it with two pointers: one at the beginning of the array and one at the end, moving inward based on whether the current sum is too small or too large (a sketch of this follows the list).
  3. We can build a hash map from values to indices, then loop through the array, checking whether the complement (target minus the current value) exists in the map.
  4. An improvement on approach 3 is to build the hash map while simultaneously checking for complements, folding everything into a single pass and eliminating the second loop.
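To make the two-pointer approach concrete, here is a minimal sketch of the boolean variant (the function name and details are my own, not from any particular source):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true if any two elements of nums sum to target.
// Sorting dominates at O(N log N); the two-pointer scan is O(N).
bool has_two_sum_sorted(std::vector<int> nums, int target) {
    if (nums.size() < 2) return false;
    std::sort(nums.begin(), nums.end());
    std::size_t lo = 0, hi = nums.size() - 1;
    while (lo < hi) {
        // Widen to long long so the addition cannot overflow.
        long long sum = static_cast<long long>(nums[lo]) + nums[hi];
        if (sum == target) return true;
        if (sum < target) ++lo;  // too small: advance the left pointer
        else --hi;               // too large: retreat the right pointer
    }
    return false;
}
```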

Comparing these approaches using big-\(O\) analysis, where \(N\) is the length of the array, the first is clearly \(O(N^2)\) time, since we check every pair of elements. The second is dominated by the \(O(N \log N)\) sort. The third and fourth are thought to be the fastest: \(O(N)\) time (with the third slower by roughly a factor of two, since it walks the array twice), with the tradeoff of \(O(N)\) space to store the hash map.

So, if all we cared about was speed, by asymptotic analysis, the hash table dominates. Unfortunately, in the real world, it is not so simple.

2. Practical implementation

Suppose we were implementing the "algorithmically efficient" single-pass solution in C++, using the standard library's std::unordered_map. A minimal sketch (function and variable names are my own) might look something like this:
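```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Single-pass Two Sum: build the map while checking for complements.
// O(N) time and O(N) space, assuming O(1) hash-table operations.
bool has_two_sum_hashed(const std::vector<int>& nums, int target) {
    std::unordered_map<int, std::size_t> seen;  // value -> index
    for (std::size_t i = 0; i < nums.size(); ++i) {
        if (seen.count(target - nums[i]) > 0) return true;  // complement found
        seen.emplace(nums[i], i);
    }
    return false;
}
```

(setup for our benchmark). Running our solution on this setup gives us the following (avg time + variance).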

2.1 What's in a hashmap?

There are two conventional strategies for implementing a hash table:
  1. The first is called separate chaining (link) (or ... other names). Each slot in an array of buckets points to the elements whose keys hash to that bucket. When an insertion collides with an already-inserted element, the colliding elements are kept in a linked list (or something more sophisticated, e.g. a binary tree) associated with that bucket. (diagram)
  2. The other approach is called open addressing. All elements are stored directly in a single array. If a key's hashed slot is occupied, the algorithm "probes" (searches) for an empty slot; there are various strategies for choosing the interval between probes (e.g. linear, quadratic). Once an empty slot is found, the element is placed there. Lookups follow the same probe sequence to find a key (or to determine it is absent). (diagram) (A toy sketch follows this list.)
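To make the probing concrete, here is a toy open-addressing table with linear probing (all names are mine; it ignores growth, deletion, and duplicate keys, and assumes the table never fills up):

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Toy open-addressing table with linear probing, for illustration only.
struct ProbingTable {
    std::vector<std::optional<int>> slots;  // nullopt == empty slot

    explicit ProbingTable(std::size_t n = 16) : slots(n) {}

    std::size_t home(int key) const {
        return std::hash<int>{}(key) % slots.size();
    }

    void insert(int key) {
        std::size_t i = home(key);
        while (slots[i].has_value())     // slot taken: probe onward,
            i = (i + 1) % slots.size();  // stepping by 1 (linear probing)
        slots[i] = key;
    }

    bool contains(int key) const {
        std::size_t i = home(key);
        while (slots[i].has_value()) {   // follow the same probe path
            if (*slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;                    // hit an empty slot: key is absent
    }
};
```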

2.2 What's in std::unordered_map?

std::unordered_map is implemented with the separate chaining scheme in every major standard library, using linked lists for collisions (cite); the interface the C++ standard requires (stable element addresses across rehashing, a bucket API) effectively rules out open addressing. Each element is allocated as a node containing the key-value pair and the pointers for the chain.

The buckets are an array of pointers to these nodes. On insertion, we compute the key's hash and reduce it to a bucket index (typically modulo the bucket count). If that bucket is already occupied, we allocate a new node and link it into the bucket's chain. (diagram)
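Schematically, the layout is something like the following simplified sketch (not any particular library's actual code; real implementations add allocator plumbing and rehashing logic):

```cpp
#include <cstddef>
#include <utility>

// Simplified view of a node-based, separately chained map layout.
// The shape is the key point: one heap allocation per element, and
// buckets that are just pointers into those nodes.
template <typename K, typename V>
struct ChainedMap {
    struct Node {
        std::pair<K, V> kv;  // the element itself
        Node* next;          // chain to the next colliding element
        std::size_t hash;    // often cached to speed up rehashing
    };
    Node** buckets;          // array of bucket_count pointers
    std::size_t bucket_count;
    // insert(key): bucket = hash(key) % bucket_count;
    //              allocate a Node, link it into buckets[bucket]'s chain.
};
```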

Unfortunately, this design is not very efficient. Every insertion requires a heap allocation (to create the node) plus pointer updates. Every lookup may chase several pointers through memory. And because the nodes are individually allocated, the elements end up scattered across the heap with poor spatial locality.

2.3 The C++ ABI

3. Modern processor architecture

3.1 Sorting is fast

3.2 Detour: Bubble sort is fast

3.3 Separate chaining is slow

4. Robin Hood hashing

Go over Robin Hood hashing slowly and clearly with diagrams

4.1 Is open addressing fast?

Yes, it's cache friendly (explain how), but there are some pathological cases we need to be careful about

4.2 Lookup

Go over how lookup works
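A minimal lookup sketch, assuming a flat array of slots that each store an int key plus its probe distance (how far it sits from its ideal bucket); the names and layout are mine, not any particular library's:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One slot of a Robin Hood table: a key plus its probe distance.
// dist == -1 means the slot is empty.
struct Slot {
    int key = 0;
    int dist = -1;
};

// Lookup sketch. The Robin Hood invariant lets us stop early: if we
// reach an empty slot, or a resident closer to its home than we are
// to ours, the key cannot appear later in the probe sequence.
bool rh_contains(const std::vector<Slot>& slots, int key) {
    std::size_t i = std::hash<int>{}(key) % slots.size();
    for (int d = 0;; ++d) {
        if (slots[i].dist < d) return false;  // empty, or "richer" resident
        if (slots[i].key == key) return true;
        i = (i + 1) % slots.size();
    }
}
```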

4.3 Insertion

Go over how we insert items
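A minimal insertion sketch using the same slot layout as the lookup sketch (repeated here for self-containment; no growth or duplicate-key handling, and it assumes a free slot exists):

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Slot {
    int key = 0;
    int dist = -1;  // -1 means empty
};

// Insertion sketch. Whenever the element we are carrying is "poorer"
// (further from its home) than the resident, we take from the rich:
// swap, and continue onward carrying the displaced resident.
void rh_insert(std::vector<Slot>& slots, int key) {
    std::size_t i = std::hash<int>{}(key) % slots.size();
    Slot carrying{key, 0};
    while (true) {
        if (slots[i].dist < 0) {  // empty slot: drop the element here
            slots[i] = carrying;
            return;
        }
        if (slots[i].dist < carrying.dist)
            std::swap(slots[i], carrying);  // displace the richer resident
        ++carrying.dist;                    // one step further from home
        i = (i + 1) % slots.size();
    }
}
```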

4.4 Deletion

Go over the different ways of deleting items
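Two common strategies are tombstones (marking the slot as deleted) and backward-shift deletion (covered in Goossaert's post in the references). Here is a minimal backward-shift sketch using the same slot layout, where the caller passes the index of the slot to erase (found via a lookup):

```cpp
#include <cstddef>
#include <vector>

struct Slot {
    int key = 0;
    int dist = -1;  // -1 means empty
};

// Backward-shift deletion sketch: instead of leaving a tombstone, pull
// each following displaced element back one slot, until we reach an
// empty slot or an element already sitting in its ideal bucket.
void rh_erase(std::vector<Slot>& slots, std::size_t i) {
    std::size_t next = (i + 1) % slots.size();
    while (slots[next].dist > 0) {  // displaced: shift it back by one
        slots[i] = slots[next];
        --slots[i].dist;            // it is now one step closer to home
        i = next;
        next = (next + 1) % slots.size();
    }
    slots[i].dist = -1;             // mark the final slot empty
}
```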

5. The numbers

Get graphs and results here, methodology, etc.

Disclaimer

There is far more analysis required to conclude that std::unordered_map is slow, or that we should be using Robin Hood hashing instead. I enjoyed reading (1, 2, 3, add references) on this topic, which treat it in a more scientific fashion. Rather, the purpose here is to demonstrate that (make a conclusion from the data structures talk...). I was largely inspired by (talk), and figured this particular topic would be fun to dig into further.

"All models are wrong, but some are useful." - George E.P. Box

Acknowledgement

None of the ideas presented here are new; there are many other blog posts discussing these topics, linked below (link to the reference section), which I encourage you to check out. This post was largely written for my own learning and understanding.

References

    Discussion links:
    https://news.ycombinator.com/item?id=22957884 (discussion touching on C++ ABI concerns)
    https://stackoverflow.com/questions/42588264/why-is-stdunordered-map-slow-and-can-i-use-it-more-effectively-to-alleviate-t
    https://stackoverflow.com/questions/3300525/super-high-performance-c-c-hash-map-table-dictionary
    https://www.reddit.com/r/compsci/comments/b7he68/c_unordered_map_implementation/
    https://devblogs.microsoft.com/oldnewthing/20230808-00/?p=108572
    https://jbseg.medium.com/c-unordered-map-under-the-hood-9540cec4553a
    https://stackoverflow.com/questions/31098123/c-unordered-map-collision-handling-resize-and-rehash
    https://stackoverflow.com/questions/31112852/how-stdunordered-map-is-implemented
    https://www.youtube.com/watch?v=Q4dDoJ4JZ4I
  1. P. Celis, P.-A. Larson, and J. I. Munro, "Robin Hood hashing", Annual Symposium on Foundations of Computer Science (SFCS 1985), 1985, pp. 281-288, doi: 10.1109/SFCS.1985.48.
  2. David Gries, "Robin Hood Hashing", 2021.
  3. Sebastian Sylvan, "Robin Hood Hashing should be your default Hash Table implementation", May 8, 2013 (Hacker News post).
  4. Sebastian Sylvan, "More on Robin Hood Hashing", Aug 5, 2013.
  5. Emmanuel Goossaert, "Robin Hood hashing", Nov 11, 2013.
  6. Emmanuel Goossaert, "Robin Hood hashing: backward shift deletion", Nov 17, 2013.
  7. Alexis Beingessner, "Building a HashMap in Rust - Part 1: What's a HashMap?".
  8. Tommy M. McGuire, tmmcguire/rust-toys/pony/anagrams/robinhood.pony.
  9. Chandler Carruth, "Efficiency with Algorithms, Performance with Data Structures", CppCon 2014.
  10. Matt Kulukundis, "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step", CppCon 2017.
  11. Scott Meyers, "CPU Caches and Why You Care", code::dive 2014.
  12. Thibaut Goetghebuer-Planchon, "Benchmark of major hash maps implementations", Oct 5, 2017.
  13. Chris Wellons, "Solving 'Two Sum' in C with a tiny hash table", June 26, 2023 (Hacker News post).