logo 'Hybrid' linear/triangular hash probing

Last updated 2022 Jan 16

In the course of my work on PDCursesMod, a (very) cross-platform implementation of the Curses library, I recently had cause to implement an open-addressing hash table for color pairs. In this specific case, the idea was that the find_color( ) function needs to check to see if a color pair with a desired foreground/background index already exists, and if so, to return the index for that color pair.

Most of what I did for this was quite standard and unremarkable, or specific to the problem at hand, or both. But the way in which the hash table is probed in case of collisions may be of some interest for open-address hashing in general.

The most basic technique used if you look for a hash table entry and find it already occupied is 'linear probing' : from the current position, search forward for an empty entry and stop when you find one. Unless the table is completely full, this is guaranteed to eventually find an open entry. If you keep the table load (percentage of used hash table entries) low enough, it will (usually) not take all that long. And, because you're looking at consecutive entries, it'll be cache-friendly.

The main problem with it is that you can get clustering. If a decent percentage of hashes all lead to the same neighborhood in the table, it'll fill up locally, even though much of the table is free. As a result, quadratic probing is sometimes used; starting from your current location in the table, you look at the next one, then four ahead, 9, 16, etc. If your starting location is loc(0) and table size is tblsize,

loc(1) = (loc(0) + 1) % tblsize
loc(2) = (loc(0) + 4) % tblsize
loc(3) = (loc(0) + 9) % tblsize
....
loc(n) = (loc(0) + n^2) % tblsize = (loc(n - 1) + 2*n - 1) % tblsize

If the table is less than half full and tblsize is prime, this is guaranteed to find an unused entry. There is a modified "alternating signs" quadratic probing scheme that guarantees finding an entry if the table isn't completely full for certain table sizes, but it's a bit more complex.

Alternatively, one can use 'triangular number probing'. Triangular numbers are 1, 3, 6, 10, 15, ... n(n+1)/2; each is the sum of the numbers 1 to n. For this,

loc(1) = (loc(0) + 1) % tblsize
loc(2) = (loc(1) + 2) % tblsize
loc(3) = (loc(2) + 3) % tblsize
....
loc(n) = (loc(n - 1) + n) % tblsize = (loc(0) + n(n + 1)/2) % tblsize

If there is at least one free entry and tblsize is an even power of two, this is guaranteed to find an unused entry.

Note that the tblsize size restrictions in all of these cases are easy enough to work around. If your table size is not the required even power of two (or prime number), no problem; find the next larger power of two (or prime number) and use that as your modulus in the above series. When a resulting loc(i) goes beyond the limits of your actual hash table, skip it. This will happen less than half the time, and the location series is computationally undemanding for all of these series.

Triangular probing appealed to me because my table sizes would always be a power of two (no need even to apply the logic in the preceding paragraph), and because the table load could be arbitrary (subject to performance degredation as you approach 100%); you don't have to worry about not finding a free cell. I could see no advantage to quadratic probing. However, the cache-friendly aspect of linear probing still called to me.

My first thought was to do, say, five linear probes, then proceed with triangular probing from that point. This would not be a bad way to go. However, it then occurred to me that a better "linear/triangular hybrid" scheme could work as follows.

Think of your hash table in groups of (say) four elements. There are tblsize / 4 such groups, which is still a power of two. Probe groups by triangular numbers; then probe linearly within the four elements in each group.

This is simpler than it might sound. For example, say tblsize=64, so there are sixteen groups of four. And let's say that our initial location is entry 19. The sequence would look like this :

   19    20    21    22       (group 1)
   23    24    25    26       (group 2)
   31    32    33    34       (group 3)
   43    44    45    46       ...
   59    60    61    62
   15    16    17    18
   39    40    41    42
    3     4     5     6
   35    36    37    38
    7     8     9    10
   47    48    49    50
   27    28    29    30
   11    12    13    14
   63     0     1     2
   55    56    57    58
   51    52    53    54       (group 16)

We're probing four consecutive entries each time, then advancing to a new group, by gradually increasing jumps. Each entry is visited once, so that if at least one is available, we'll find it. The recurrence relationship becomes

if( n modulo groupsize != 0)
   idx(n) = (idx(n - 1) + 1) modulo tblsize;            /* linearly probe */
else
   idx(n) = (idx(n - 1) + n + 1 - groupsize) modulo tblsize;

As before, if tblsize isn't a power of two, one can bump up to the next power of two and toss out cases where idx(n) >= tblsize.

The optimal groupsize is, of course, not necessarily four (though it, like n, has to be a power of two.) I would expect the optimal value to be larger if the hash table entries are small (so that more of them fit in the cache at once and linear probing is favored) and also larger if the cache is large. If your hash function isn't very good, a smaller groupsize would be favored... although really, if the hash function is insufficiently "pseudo-random", you should get a new hash function. (I notice that discussions of hashing frequently assume an insufficiently random hash function. I can't imagine why one would use such, given that fast hash functions with distributions more than random enough for this purpose are easily available.)

Contact info

I can be reached at p‮ôç.ötulpťcéjôřp@otúl‬m. If you're a human instead of a spambot, you can probably figure out how to remove the diacritical marks...