7  Rolling hashes on sequences

Note

This chapter is partially adapted from (Groot Koerkamp & Martayan, 2025).

7.1 Rolling hashes and their properties

The concept of rolling hash (or recursive hash) was first introduced by Karp & Rabin (1987) for pattern matching. Its purpose is to efficiently compute a hash over a sliding window of integers, without having to rehash the content of the entire window as it shifts.

Karp & Rabin (1987) defined their rolling hash as follows: given a large prime number p and r \in [2, p), H(x_0 \dots x_{k-1}) = \sum_{i=0}^{k-1} r^{k-1-i} x_i \mod p

7.1.1 Constant-time updates

The most important property of rolling hashes is that removing the first value of the window and adding a new value at the end only requires a constant-time update to the hash.

To remove x_0, we can simply subtract the corresponding term: H(x_1 \dots x_{k-1}) = H(x_0 \dots x_{k-1}) - r^{k-1} x_0 \mod p Note that, assuming the window size k is known in advance, r^{k-1} can be precomputed once. Similarly, to append a new value x_k, we can shift the previous values by one position to the left by multiplying by r before adding the new term: H(x_1 \dots x_k) = H(x_1 \dots x_{k-1}) \cdot r + x_k \mod p

By combining these two operations, we can compute the hashes of sliding windows in a text T in \mathcal{O}(|T|) regardless of the window size. In the context of bioinformatics, this is especially useful for hashing k-mers, including large ones that would not fit in a native integer.
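To make the update rule concrete, here is a minimal Rust sketch of Karp-Rabin hashing over all windows of size k. The constants p = 2^61 - 1 and r = 256 are illustrative choices (not prescribed by the chapter), and the function names are mine:

```rust
const P: u64 = (1 << 61) - 1; // illustrative large prime (Mersenne)
const R: u64 = 256; // illustrative radix

fn mulmod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Naive rehash of a full window, for comparison: O(k) per window.
fn hash_direct(w: &[u8]) -> u64 {
    w.iter().fold(0, |h, &x| (mulmod(h, R) + x as u64) % P)
}

/// Hashes of every window text[i..i+k], in O(|text|) total.
fn rolling_hashes(text: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && text.len() >= k);
    // r^(k-1) mod p, precomputed once since k is known in advance.
    let rk1 = (0..k - 1).fold(1, |a, _| mulmod(a, R));
    // Hash of the first window.
    let mut h = hash_direct(&text[..k]);
    let mut out = vec![h];
    for i in k..text.len() {
        // Remove the outgoing value text[i-k]...
        h = (h + P - mulmod(rk1, text[i - k] as u64)) % P;
        // ...then shift left by one position and append text[i].
        h = (mulmod(h, R) + text[i] as u64) % P;
        out.push(h);
    }
    out
}
```

Each window after the first costs O(1), independent of k, which is exactly what makes the scheme attractive for large k-mers.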

7.1.2 Merging and splitting hashes

Another useful property is that the hashes of two sequences u and v can be merged in constant time: H(uv) = H(u) \cdot r^{|v|} + H(v) And the same relation can be used to remove any prefix of a given hash: H(v) = H(uv) - H(u) \cdot r^{|v|} In particular, the merging property can be used to parallelize the hash computation of a large sequence by dividing it into chunks, hashing them in parallel and merging the resulting hashes.
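A sketch of the merge and split operations, under the same illustrative parameters (p = 2^61 - 1, r = 256) and with hypothetical function names:

```rust
const P: u64 = (1 << 61) - 1; // illustrative large prime
const R: u64 = 256; // illustrative radix

fn mulmod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

fn hash(s: &[u8]) -> u64 {
    s.iter().fold(0, |h, &x| (mulmod(h, R) + x as u64) % P)
}

fn pow_r(n: usize) -> u64 {
    (0..n).fold(1, |a, _| mulmod(a, R))
}

/// Merge: H(uv) = H(u) * r^{|v|} + H(v) mod p.
fn merge(hu: u64, hv: u64, v_len: usize) -> u64 {
    (mulmod(hu, pow_r(v_len)) + hv) % P
}

/// Split: H(v) = H(uv) - H(u) * r^{|v|} mod p.
fn strip_prefix(huv: u64, hu: u64, v_len: usize) -> u64 {
    (huv + P - mulmod(hu, pow_r(v_len))) % P
}
```

This is exactly the property one would use to hash a long sequence in parallel: hash each chunk independently, then fold the chunk hashes together with `merge`.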

7.1.3 Hashing arbitrary substrings using prefixes

While rolling hashes are traditionally used to hash fixed-sized windows, we can also use them to compute the hashes of arbitrary substrings.

One way to achieve this is to compute the hashes of every prefix of the text T and then use the splitting property to subtract the corresponding prefixes: H(T[i,j)) = H(T[0,j)) - H(T[0,i)) \cdot r^{j-i} This technique allows any substring of T to be hashed in constant time, at the cost of preindexing the prefix hashes in \mathcal{O}(|T|) time and space.
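A possible implementation of this prefix-hash index, again with illustrative constants and hypothetical names:

```rust
const P: u64 = (1 << 61) - 1; // illustrative large prime
const R: u64 = 256; // illustrative radix

fn mulmod(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % P as u128) as u64
}

/// Prefix hashes of a text, allowing O(1) substring hashes after O(|T|) preprocessing.
struct PrefixHashes {
    pre: Vec<u64>,  // pre[i] = H(T[0..i))
    pows: Vec<u64>, // pows[i] = r^i mod p
}

impl PrefixHashes {
    fn new(text: &[u8]) -> Self {
        let mut pre = vec![0u64; text.len() + 1];
        let mut pows = vec![1u64; text.len() + 1];
        for i in 0..text.len() {
            pre[i + 1] = (mulmod(pre[i], R) + text[i] as u64) % P;
            pows[i + 1] = mulmod(pows[i], R);
        }
        Self { pre, pows }
    }

    /// H(T[i..j)) = H(T[0..j)) - H(T[0..i)) * r^{j-i} mod p, in O(1).
    fn substring(&self, i: usize, j: usize) -> u64 {
        (self.pre[j] + P - mulmod(self.pre[i], self.pows[j - i])) % P
    }
}
```

A typical application is testing whether two substrings of T are (probably) equal without comparing them character by character.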

7.2 NtHash and its variants

7.2.1 Cyclic polynomial hashing

One of the drawbacks of Rabin-Karp hashing is that it requires multiplying and computing a modulo for every update, which is quite expensive compared to other operations1. Cohen (1997) proposed an alternative rolling hash, based on cyclic polynomials, that replaces the multiplication by a bitwise left rotation (denoted \operatorname{rol}) and the addition by an exclusive or (denoted \oplus): H(x_0 \dots x_{k-1}) = \bigoplus_{i=0}^{k-1} \operatorname{rol}^{k-1-i}(h(x_i)) Note that each x_i is replaced by a seed value h(x_i), typically computed using a lookup table, to have a better distribution of the bits.

Using this definition, we have H(x_1 \dots x_{k-1}) = H(x_0 \dots x_{k-1}) \oplus \operatorname{rol}^{k-1}(h(x_0)) and H(x_1 \dots x_k) = \operatorname{rol}(H(x_1 \dots x_{k-1})) \oplus h(x_k) Again, assuming the window size k is known in advance, \operatorname{rol}^{k-1}(h(\cdot)) can be precomputed and stored in a separate lookup table.
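The cyclic-polynomial update can be sketched as follows in Rust; the `seed` function here is a simple illustrative multiplier, whereas real implementations use a lookup table:

```rust
/// Illustrative seed function h(x): spreads the bits of a small value.
fn seed(x: u8) -> u32 {
    (x as u32).wrapping_mul(0x9E37_79B9)
}

/// Direct O(k) hash: XOR of rotated seed values.
fn cyclic_hash(w: &[u8]) -> u32 {
    let k = w.len();
    w.iter()
        .enumerate()
        .fold(0u32, |h, (i, &x)| h ^ seed(x).rotate_left((k - 1 - i) as u32))
}

/// Constant-time update: remove x0 from the front, append xk at the end.
fn roll(h: u32, k: usize, x0: u8, xk: u8) -> u32 {
    let h = h ^ seed(x0).rotate_left((k - 1) as u32); // remove the outgoing value
    h.rotate_left(1) ^ seed(xk) // shift by one position, add the incoming value
}
```

As in the Karp-Rabin case, `seed(x0).rotate_left(k - 1)` can be precomputed per symbol and stored in a second lookup table when k is fixed.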

7.2.2 NtHash and MulHash

NtHash (Mohamadi et al., 2016) is a direct adaptation of cyclic polynomial hashing, specialized for the DNA alphabet A/C/T/G. The main benefit is that since the lookup table only needs to store 4 hashes, it can fit in a 128-bit (resp. 256-bit) register for 32-bit (resp. 64-bit) hashes, making it convenient for vectorization. A scalar implementation of ntHash is detailed in Algorithm 7.1.

Another variation, introduced as MulHash in (Groot Koerkamp & Martayan, 2025), is to replace the lookup table by a simple multiplication: h(x_i) = C \cdot x_i where C is a fixed random constant. This approach is slightly slower than a lookup table for small alphabets, but faster for larger alphabets such as ASCII inputs.

Algorithm 7.1: Scalar pseudocode for ntHash.
\begin{algorithmic}
\Function{ntHash}{$k, \texttt{in\_out}$} \Comment{\texttt{in\_out} iterates over (last bp, first bp) of each k-mer}
\State{$T_{\text{in}} \gets [h(\texttt{A}), h(\texttt{C}), h(\texttt{T}), h(\texttt{G})]$} \Comment{static lookup tables}
\State{$T_{\text{out}} \gets [\operatorname{rol}^{k-1}(h(\texttt{A})), \operatorname{rol}^{k-1}(h(\texttt{C})), \operatorname{rol}^{k-1}(h(\texttt{T})), \operatorname{rol}^{k-1}(h(\texttt{G}))]$}
\State{$H \gets 0$}
\For{$(\texttt{in\_bp}, \texttt{\_})$ in $\texttt{in\_out}[0:k-1]$} \Comment{no \texttt{out\_bp} in the first $k-1$ iterations}
\State{$H \gets \operatorname{rol}(H)$} \Comment{implemented as $H\gets (H \ll 1 \,|\, H \gg 31)$}
\State{$H \gets H \oplus \texttt{table\_lookup}(T_{\text{in}}, \texttt{in\_bp})$}
\EndFor
\For{$(\texttt{in\_bp}, \texttt{out\_bp})$ in $\texttt{in\_out}[k-1:]$} \Comment{2-bit bp coming in and out of each k-mer}
\State{$H \gets \operatorname{rol}(H)$}
\State{$H \gets H \oplus \texttt{table\_lookup}(T_{\text{in}}, \texttt{in\_bp})$}
\State{\textbf{yield} $H$}
\State{$H \gets H \oplus \texttt{table\_lookup}(T_{\text{out}}, \texttt{out\_bp})$}
\EndFor
\EndFunction
\end{algorithmic}

7.2.3 Reverse-complement and canonical hashing

This definition of rolling hashes is also convenient to hash the reverse-complement of a sequence. Since we process the values in a reverse order, we can simply change the direction of rotation: \overline{H}(x_0 \dots x_{k-1}) = \bigoplus_{i=0}^{k-1} \operatorname{rol}^i(h(\overline{x_i})) In that case, we have \overline{H}(x_1 \dots x_{k-1}) = \operatorname{ror}\left(\overline{H}(x_0 \dots x_{k-1}) \oplus h(\overline{x_0})\right) and \overline{H}(x_1 \dots x_k) = \overline{H}(x_1 \dots x_{k-1}) \oplus \operatorname{rol}^{k-1}(h(\overline{x_k})) where \overline{x_i} denotes the complement of x_i and \operatorname{ror} denotes a bitwise right rotation.

Since we can easily maintain the hash of both a sequence and its reverse-complement, we can add both hash values to obtain a canonical hash: \widehat{H} = H + \overline{H}. Note that any symmetric operation could work here, but something like a \min would introduce a bias in the high bits.
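A sketch of forward, reverse-complement and canonical hashing over ASCII bases; the seed mixer is again an illustrative stand-in for a lookup table:

```rust
/// Illustrative seed function h(x).
fn seed(x: u8) -> u32 {
    (x as u32).wrapping_mul(0x9E37_79B9)
}

/// DNA complement of an ASCII base.
fn comp(x: u8) -> u8 {
    match x {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        _ => x,
    }
}

/// H: forward cyclic-polynomial hash.
fn forward_hash(w: &[u8]) -> u32 {
    let k = w.len();
    w.iter()
        .enumerate()
        .fold(0u32, |h, (i, &x)| h ^ seed(x).rotate_left((k - 1 - i) as u32))
}

/// H-bar: same values complemented, with the rotation direction reversed,
/// so that rc_hash(w) == forward_hash(reverse_complement(w)).
fn rc_hash(w: &[u8]) -> u32 {
    w.iter()
        .enumerate()
        .fold(0u32, |h, (i, &x)| h ^ seed(comp(x)).rotate_left(i as u32))
}

/// Canonical hash: H + H-bar, identical for a k-mer and its reverse complement.
fn canonical_hash(w: &[u8]) -> u32 {
    forward_hash(w).wrapping_add(rc_hash(w))
}
```

In a streaming setting both H and H-bar are maintained with their own constant-time updates, and the sum is recomputed per window.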

7.3 Vectorized implementation of ntHash

Because it only relies on bitwise operations and has no branches, Algorithm 7.1 is a good candidate for vectorization. In order to make full use of the 256-bit registers available on most x86 CPUs, our goal in this section will be to compute eight 32-bit hashes at the same time. The main limitation here is that rolling hashes are inherently sequential: each hash value is updated from the hash of the previous window. Therefore, instead of computing 8 consecutive hashes, we focus on computing 8 independent hashes together.

7.3.1 Parallel iteration on a packed input

As input to our algorithm, we use the 2-bit packed sequence representation as described in Section 6.2.1. Since we want to compute 8 independent hashes, we first split the sequence into 8 chunks and assign each chunk to a dedicated lane of a SIMD register. Note that the chunks overlap by k - 1 bases so that each window of size k is processed once. In order to have the content of each chunk available in separate lanes, we load one 256-bit register per chunk and perform a transpose operation2. Once each chunk is loaded in a separate lane, we can iterate over the bases of each chunk simultaneously by shifting and masking the bottom 2 bits of each lane. The entire iteration pipeline is illustrated in Figure 7.1.

Figure 7.1: Overview of parallel iteration on a packed sequence; each color corresponds to the SIMD lane associated with a given chunk. Note that while the figure shows L=4 lanes, we use L=8 lanes in practice for 256-bit registers.

This routine for parallel iteration over packed sequences is available in a Rust library named packed-seq.
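The chunking arithmetic can be illustrated with a scalar sketch (the actual packed-seq layout and register transpose are more involved; this hypothetical helper only shows the overlap-by-(k-1) bookkeeping):

```rust
/// Split a sequence of n bases into l chunks overlapping by k-1 bases,
/// so that every one of the n-k+1 k-mers belongs to exactly one chunk.
/// Returns (start, end) base ranges, one per lane.
fn chunk_ranges(n: usize, k: usize, l: usize) -> Vec<(usize, usize)> {
    assert!(n >= k && l >= 1);
    let n_kmers = n - k + 1;
    let per_lane = (n_kmers + l - 1) / l; // k-mers per lane, rounded up
    (0..l)
        .map(|i| {
            // k-mer range [start, end) handled by lane i
            // (lanes past the end get an empty k-mer range).
            let start = (i * per_lane).min(n_kmers);
            let end = ((i + 1) * per_lane).min(n_kmers);
            // Covering those k-mers requires bases [start, end + k - 1).
            (start, (end + k - 1).min(n))
        })
        .collect()
}
```

Consecutive lanes share exactly k - 1 bases, so no k-mer is hashed twice and none is skipped.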

7.3.2 Vectorized operations and table lookups

Using this approach for parallel iteration, we produce two iterators for the bases coming in and out of each k-mer. We can then reuse Algorithm 7.1 and replace each scalar operation with the corresponding vectorized operation on 8 lanes. The main difference is for the table_lookup that computes h(x_i) or \operatorname{rol}^{k-1}(h(x_i)) for each lane, which we implemented using _mm256_permutevar_ps in AVX2 and vqtbl1q_u8 in NEON. Our vectorized implementation of ntHash and MulHash, built on top of packed-seq, is available in the seq-hash Rust library.

7.3.3 Performance

To evaluate the performance gain brought by vectorization, we measured the throughput of different hashing methods on x86 and ARM. We benchmarked 5 hashing methods in Rust (script available at https://github.com/imartayan/nthash-bias/blob/main/src/bin/throughput.rs):

  • iterating over ASCII-encoded k-mers (slices of 31 bytes) and hashing them with rapidhash3 v4.4.1 (one of the fastest hashes for byte slices)
  • iterating over bitpacked k-mers (stored in a u64) with packed-seq and hashing them with fxhash4 v2.1.2 (one of the fastest hashes for native integers)
  • ntHash implementation from the nthash5 crate v0.5.1
  • our implementation of ntHash (seq-hash v0.1.2)
  • our implementation of mulHash (seq-hash v0.1.2)

The results are reported in Table 7.1. Note that enabling the most aggressive compiler optimizations (lto="fat" and codegen-units=1) has a big impact on the throughput, especially for the nthash crate.

Table 7.1: Throughput of different hashing methods on x86 and ARM using a single thread, best in bold.
| Implementation | Throughput on Xeon Gold 6430 (x86) | Throughput on Apple M1 (ARM) |
|---|---|---|
| ASCII seq + rapidhash | 0.47 Gbp/s | 0.65 Gbp/s |
| packed-seq + fxhash | 0.71 Gbp/s | 1.25 Gbp/s |
| nthash crate | 1.03 Gbp/s | 1.00 Gbp/s |
| seq-hash ntHash | **2.65 Gbp/s** | 2.28 Gbp/s |
| seq-hash MulHash | 2.39 Gbp/s | **2.30 Gbp/s** |

Overall, our vectorized implementation is 2.3–2.6× faster than the existing ntHash implementation, and is the fastest of the evaluated methods. We can also notice that a packed input benefits non-rolling hashes as well, since the input is 4× smaller.

7.4 Breaking ntHash

While ntHash is clearly not meant to be a cryptographic hash, it still has some flaws that can also happen in non-adversarial use cases.

7.4.1 Forging and propagating collisions

The first observation we can make is that it is very easy to create collisions when the window size k is larger than the number of bits in the hash (which is not so surprising, since the function cannot be injective in this case). Assuming we are still using 32-bit hashes, any pair of bases that are 32 (or any multiple of 32) positions apart will end up with the same rotation (since rotating by 32 is equivalent to not rotating at all). Therefore, for any \alpha, \beta \in \Sigma and u \in \Sigma^{31}, H(\alpha \cdot u \cdot \beta) = H(\beta \cdot u \cdot \alpha) and similarly H(\alpha \cdot u \cdot \alpha) = H(\beta \cdot u \cdot \beta) = \operatorname{rol}(H(u)) This issue is discussed in ntHash2 (Kazemi et al., 2022), where the authors suggest replacing the \operatorname{rol} operation with \operatorname{srol}, which divides the 64-bit hash into a 31-bit and a 33-bit part and rotates them independently. While this operation is more expensive to compute, it has a period of 1023 instead of 32, thus limiting these specific collisions to k \ge 1024.
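These forged collisions are easy to verify experimentally with a plain 32-bit cyclic-polynomial hash, since for k = 33 the first and last positions receive the same rotation (the seed function below is illustrative):

```rust
/// Illustrative seed function h(x).
fn seed(x: u8) -> u32 {
    (x as u32).wrapping_mul(0x9E37_79B9)
}

/// 32-bit cyclic-polynomial hash with plain 1-bit rotations.
fn cyclic_hash(w: &[u8]) -> u32 {
    let k = w.len();
    w.iter()
        .enumerate()
        .fold(0u32, |h, (i, &x)| h ^ seed(x).rotate_left((k - 1 - i) as u32))
}

/// Forge a colliding pair: for any 31-base core u, swapping the symbols
/// alpha and beta that sit 32 positions apart leaves the hash unchanged,
/// because rotating by 32 is the identity on 32 bits.
fn forged_pair(u: &[u8; 31], alpha: u8, beta: u8) -> (Vec<u8>, Vec<u8>) {
    let mut w1 = vec![alpha];
    w1.extend_from_slice(u);
    w1.push(beta);
    let mut w2 = vec![beta];
    w2.extend_from_slice(u);
    w2.push(alpha);
    (w1, w2)
}
```

The same argument shows H(alpha . u . alpha) = rol(H(u)): the two boundary terms cancel under XOR.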

Something more problematic is that, because of the merging property (described in Section 7.1.2), a single collision between words u \ne v \in \Sigma^n induces at least |\Sigma|^{k-n} collisions for every word size k \ge n: \forall u, v \in \Sigma^n, \forall x, y \in \Sigma^*, H(u) = H(v) \implies H(x \cdot u \cdot y) = H(x \cdot v \cdot y)

Therefore, it is especially important to avoid collisions for small words. Ragnar Groot Koerkamp described on his website6 a method to select the seed values h(x_i) to avoid collisions in 64-bit hashes for k \le 32. His solution is essentially to define h(\texttt{T}) as h(\texttt{A}) \oplus h(\texttt{C}) \oplus h(\texttt{G}) and to adjust the other three values to obtain an invertible system.

7.4.2 Bias on the leading zeros

A very common use case of rolling hashes in bioinformatics is to hash every k-mer and select those with “small” hashes as representatives. For instance, this is extensively used in MinHash-based sketching techniques (presented in Chapter 3) or in random minimizers (presented in Chapter 4), which we will discuss in more detail in the next chapter.

These techniques generally assume that the hash values are independent and uniformly distributed. Under these assumptions, every k-mer has a probability 2^{-n} that its hash starts with at least n zeros. So if we define “small” as starting with n zeros, the distance between “small” k-mers roughly follows a geometric distribution (this is not exactly true because of repeated k-mers, though this is negligible for a large k).

Unfortunately, I noticed that the empirical distribution in my experiments7 was not quite geometric, and that the spacing of “small” k-mers was skewed. In particular, I observed that the number of leading zeros of a given k-mer hash was strongly correlated with the number of leading zeros of the neighboring k-mer hashes, invalidating both the independence and uniformity assumptions.

To document this behavior, I computed the number of leading zeros for every k-mer hash in a large random sequence (4 billion bases), and measured the probability that two consecutive k-mers transition from i leading zeros to j leading zeros. These experiment scripts are available at https://github.com/imartayan/nthash-bias.

Figure 7.2 contains two visualizations of this transition probability. The first one (Figure 7.2 (a)) shows the raw probability of each transition: each line corresponds to the current number of leading zeros, while each column corresponds to the next value. The second one (Figure 7.2 (b)) shows the ratio between this probability and the expected probability for a geometric distribution: blue when less than expected, white when matching the expectation and red when more than expected.

(a) Leading-zero transition for R = 1.
(b) Bias relative to a geometric distribution for R = 1.
Figure 7.2: Leading-zero bias with R = 1 on a random sequence using k = 31.

The first thing that we can notice in Figure 7.2 (a) is that many of the cells are actually empty, meaning that the corresponding transition is simply not observed. While this is not very surprising for the bottom-right half of the matrix simply because of sample size (remember that a transition from i to j has a probability proportional to 2^{-(i+j)} in theory), it is much more surprising to see empty rows or columns interleaved with non-empty ones (see columns 3, 5, 6 and rows 4, 6, 7). In particular, this implies that after observing a hash with exactly 4 leading zeros, the hash next to it will always have at most 3 leading zeros. Conversely, a hash with exactly 3 leading zeros can only be observed after at most 4 leading zeros. Figure 7.2 (b) allows us to put these probabilities into perspective with the expected ones. While the first rows and columns mostly match the expectations, the missing transitions in blue are compensated by overrepresented ones in red. Notably, after observing a hash with at least 9 leading zeros, the probability of having exactly 7 leading zeros next to it is 16× higher. This creates an interesting dynamic where small hashes with at least 9 leading zeros are likely followed by 7 leading zeros, which are then followed by at most 6 leading zeros.

Even though this phenomenon is difficult to explain, the offset of one between the missing rows and columns made me conjecture that it is related to the 1-bit rotations performed at each step. To verify this hypothesis, I introduced variations of ntHash where the 1-bit rotations are replaced by R-bit rotations. This does not cause any problem as long as R is odd (since R and 32 need to be coprime) and does not affect performance, but simply shuffles the rotation offsets at each step. Figure 7.3 shows the results obtained for R = 7 while Figure 7.4 shows the results for R = 13.

(a) Leading-zero transition for R = 7.
(b) Bias relative to a geometric distribution for R = 7.
Figure 7.3: Leading-zero bias with R = 7 on a random sequence using k = 31.
(a) Leading-zero transition for R = 13.
(b) Bias relative to a geometric distribution for R = 13.
Figure 7.4: Leading-zero bias with R = 13 on a random sequence using k = 31.

Interestingly, R does have an impact on the offset between biased rows and columns: in Figure 7.3 (b) we observe a pattern starting at (i, j) = (8, 1) while in Figure 7.4 (b) we observe two patterns starting at (i, j) = (14, 1) and (i, j) = (0, 19) respectively (note that 19 = -13 \mod 32), which confirms the i - j = R \mod 32 conjecture. One practical benefit of shifting the biased rows and columns is that their probability drastically drops, making them less concerning in practice and leading to a more regular transition matrix, as illustrated in Figure 7.3 (a) and Figure 7.4 (a). As a result, I recommend using R = 13 or 15 by default to reduce the bias (and we implemented a similar fix in seq-hash).
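The R-bit rotation variant only changes the rotation amounts; the constant-time update carries over unchanged, since odd R is coprime with 32 and thus still cycles through all 32 rotation offsets. A sketch with R = 13 and an illustrative seed function:

```rust
const ROT: u32 = 13; // R-bit rotation; must be odd so that gcd(R, 32) = 1

/// Illustrative seed function h(x).
fn seed(x: u8) -> u32 {
    (x as u32).wrapping_mul(0x9E37_79B9)
}

/// Direct hash with R-bit rotations: XOR of rol^{R(k-1-i)}(h(x_i)).
fn hash_r(w: &[u8]) -> u32 {
    let k = w.len() as u32;
    w.iter()
        .enumerate()
        .fold(0u32, |h, (i, &x)| h ^ seed(x).rotate_left(ROT * (k - 1 - i as u32)))
}

/// Constant-time update, identical in shape to the 1-bit version.
fn roll_r(h: u32, k: usize, x0: u8, xk: u8) -> u32 {
    let h = h ^ seed(x0).rotate_left(ROT * (k as u32 - 1)); // remove outgoing value
    h.rotate_left(ROT) ^ seed(xk) // shift by R bits, add incoming value
}
```

Rust's `rotate_left` reduces its argument modulo the bit width, so the precomputed amounts need no explicit `% 32`.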


  1. Computing a modulo requires doing a division, which is typically an order of magnitude slower than bitwise operations.

  2. https://stackoverflow.com/questions/25622745

  3. https://docs.rs/rapidhash/latest/rapidhash/

  4. https://docs.rs/rustc-hash/latest/rustc_hash/

  5. https://docs.rs/nthash/latest/nthash/, not to be confused with nthash-rs which is much slower

  6. https://curiouscoding.nl/posts/nthash/

  7. documented at https://github.com/imartayan/pfp-distribution