8  Vectorized computation of minimizers

Note

This chapter is adapted from (Groot Koerkamp & Martayan, 2025).

In this chapter, we focus on computing random minimizers as efficiently as possible: for each window of w consecutive k-mers, we select the smallest k-mer given a pseudo-random order. This order is usually given by comparing hash values, selecting the leftmost smallest hash in case of ties (see Chapter 4 for a broader introduction). More precisely, we are interested in finding the starting positions of random minimizers in a sequence, from which we can extract the minimizers themselves. For simplicity, we first focus on computing forward minimizers (without taking the reverse complement into account), and later describe a solution for canonical minimizers in Section 8.3. Our implementation of the vectorized algorithm discussed in Section 8.4 is available in the simd-minimizers Rust library.

The computation of random minimizers is divided into three main steps, illustrated in Figure 8.1. First, each k-mer of the input string (corresponding to horizontal lines in the figure) is hashed to a 32-bit integer (the value above each line) using a rolling hash as presented in Chapter 7. Then, for each window of w consecutive k-mers, we find the absolute position of its smallest hash: for instance, the fourth window in the figure, highlighted in yellow, has a smallest hash equal to 12, starting at position 5 (zero-based). Finally, the resulting minimizer positions are deduplicated.

Figure 8.1: Overview of the main steps to compute random minimizers, for k=5 and w=4.

8.1 Existing algorithms for sliding window minimum

Assuming that we have computed the hash values, our problem becomes a sliding window minimum. Given a sliding window of w values (in our case k-mer hashes), we want to output the position of its smallest value (breaking ties towards the leftmost one). Many approaches can be used to solve this problem; here we focus on three common ones: the “naive” algorithm, the “rescan” algorithm, and the monotone queue algorithm. A summarized comparison of these three algorithms is given in Table 8.1.

8.1.1 Naive approach

The simplest approach is to loop over the w values of each window independently, as described in Algorithm 8.1. This takes \mathcal{O}(nw) time, where n denotes the total number of values, but it has the benefit of being branchless and can still be quite efficient for small w when using vectorized instructions.

Algorithm 8.1: Naive algorithm for sliding window minimum.
\begin{algorithmic} \Function{Naive}{$w, \texttt{vals}$} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - w\}$} \State{\textbf{yield} $\arg \min \{\texttt{vals}[i], \ldots, \texttt{vals}[i+w-1]\}$} \EndFor \EndFunction \end{algorithmic}
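As a concrete reference, Algorithm 8.1 translates to the following scalar Rust sketch (the function name is ours, and we assume the input holds at least w values):

```rust
/// Naive sliding-window minimum: for each window of `w` values, yield
/// the absolute position of its leftmost minimum. O(n*w) time overall.
fn naive_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    (0..=vals.len() - w)
        .map(|i| {
            // Strict `<` keeps the earlier position on ties (leftmost).
            let mut min_pos = i;
            for j in i + 1..i + w {
                if vals[j] < vals[min_pos] {
                    min_pos = j;
                }
            }
            min_pos
        })
        .collect()
}
```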

8.1.2 Monotone queue

A solution with better complexity, detailed in Algorithm 8.2, is to use a monotone queue, which stores a non-decreasing subsequence of the w hashes alongside their positions. For instance, this is the method used in Bifrost (Holley & Melsted, 2020). Every time the window slides one position to the right and we are about to push a new k-mer hash onto the back of the queue, we first remove any values larger than it, as they are “shadowed” by the new hash and can never be minimal anymore. The minimum of the window is then always the leftmost queue element. This data structure guarantees an amortized constant-time update, but has many unpredictable branches due to removing between 0 and w values, which makes it costly in practice.

Algorithm 8.2: Monotone queue algorithm for sliding window minimum.
\begin{algorithmic} \Function{MonotoneQueue}{$w, \texttt{vals}$} \State{$\texttt{queue} \gets$ empty \textsc{DoubleEndedQueue}} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \If{$\texttt{queue}$ is not empty \textbf{and} $\texttt{queue.front().pos} + w \le i$} \State{\texttt{queue.pop\_front()}} \EndIf \While{$\texttt{queue}$ is not empty \textbf{and} $\texttt{queue.back().val} > \texttt{vals}[i]$} \State{\texttt{queue.pop\_back()}} \EndWhile \State{$\texttt{queue.push\_back}(\texttt{val} \gets \texttt{vals}[i], \texttt{pos} \gets i)$} \If{$i \ge w-1$} \State{\textbf{yield} $\texttt{queue.front().pos}$} \EndIf \EndFor \EndFunction \end{algorithmic}
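Algorithm 8.2 can be sketched in scalar Rust as follows, storing positions only and looking values up on demand (function name ours):

```rust
use std::collections::VecDeque;

/// Monotone-queue sliding-window minimum: yields the leftmost position
/// of the minimum of each window of `w` values, in O(n) total time.
fn monotone_queue_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    // Positions whose values form a non-decreasing subsequence.
    let mut queue: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    for (i, &v) in vals.iter().enumerate() {
        // Drop the front if it slid out of the window.
        if queue.front().is_some_and(|&p| p + w <= i) {
            queue.pop_front();
        }
        // Pop values shadowed by the new one; `>` (not `>=`) keeps ties leftmost.
        while queue.back().is_some_and(|&p| vals[p] > v) {
            queue.pop_back();
        }
        queue.push_back(i);
        if i >= w - 1 {
            out.push(*queue.front().unwrap());
        }
    }
    out
}
```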

8.1.3 Rescan

Another approach used in bioinformatics is to only keep track of the minimum value and rescan the entire window of w values when the current minimum goes out of scope, as described in Algorithm 8.3. For instance, this is the method used in minimap2 (Li, 2018). While this algorithm does not guarantee a worst-case constant time update, it only branches when the minimum goes out of scope and hence is more predictable. This makes it more efficient in practice, especially since minimizers typically have a density of \mathcal{O}(1/w) so that the \mathcal{O}(w) rescan step takes amortized constant time per element.

Algorithm 8.3: Rescan algorithm for sliding window minimum.
\begin{algorithmic} \Function{Rescan}{$w, \texttt{vals}$} \State{$\texttt{min\_val} \gets +\infty$} \State{$\texttt{min\_pos} \gets 0$} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \If{$\texttt{vals}[i] < \texttt{min\_val}$} \State{$\texttt{min\_val} \gets \texttt{vals}[i]$} \State{$\texttt{min\_pos} \gets i$} \EndIf \If{$\texttt{min\_pos} + w \le i$} \State{$\texttt{min\_pos} \gets \arg \min \{\texttt{vals}[i-w+1], \ldots, \texttt{vals}[i]\}$} \State{$\texttt{min\_val} \gets \texttt{vals}[\texttt{min\_pos}]$} \EndIf \If{$i \ge w-1$} \State{\textbf{yield} $\texttt{min\_pos}$} \EndIf \EndFor \EndFunction \end{algorithmic}
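Algorithm 8.3 admits the following scalar Rust sketch (function name ours). Note that the only data-dependent branch is the rescan itself:

```rust
/// Rescan sliding-window minimum: track the current minimum and rescan
/// the whole window of `w` values only when the minimum falls out of scope.
fn rescan_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    let mut min_pos = 0;
    let mut out = Vec::new();
    for i in 0..vals.len() {
        if vals[i] < vals[min_pos] {
            min_pos = i;
        }
        if min_pos + w <= i {
            // The minimum left the window: rescan positions i-w+1..=i.
            min_pos = i + 1 - w;
            for j in i + 2 - w..=i {
                if vals[j] < vals[min_pos] {
                    min_pos = j;
                }
            }
        }
        if i >= w - 1 {
            out.push(min_pos);
        }
    }
    out
}
```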
Table 8.1: Comparison of the naive, rescan and monotone queue sliding window minimum algorithms.
Algorithm Time complexity Branches
Naive \mathcal{O}(nw) worst-case branchless
Rescan \mathcal{O}(nw) worst-case, \mathcal{O}(n) on average branch for each rescan
Monotone queue \mathcal{O}(n) worst-case branch for each update

8.2 A branchless linear algorithm using two stacks

8.2.1 The two-stacks algorithm

Ideally, we would like to design an algorithm that has both an \mathcal{O}(n) worst-case complexity (like the monotone queue) and no unpredictable branches (like the naive approach). The two-stacks method (Hirzel et al., 2017; Theodorakis et al., 2018), well-known in the competitive programming community[1], allows us to compute online sliding minima where elements may be added and removed at varying rates. Here, we only discuss a version where the number of elements remains constant at w.

Conceptually, we first split the sequence of input k-mer hashes into blocks of size w, as shown in Figure 8.2. Then, we can compute both prefix minima and suffix minima of each block in \mathcal{O}(w) per block, or \mathcal{O}(1) amortized per input hash. Now, any window of size w can be split into a suffix of the previous block and a prefix of the current block (as highlighted in yellow in Figure 8.2), and we can return the minimum of the two corresponding suffix/prefix minima. When the window exactly coincides with a block (as shown on the right), the suffix and prefix minimum are equal.

Figure 8.2: An example showing how sliding-window minima can be computed for windows of size w=4. The first window highlighted in yellow overlaps two blocks and has 5 as suffix minimum and 1 as prefix minimum. The second highlighted window coincides with a block and has 3 as both suffix and prefix minimum.

In the implementation, detailed in Algorithm 8.4, the prefix minima are computed incrementally, while the suffix minima are computed in batches once every block of w hashes has been filled. This way, a single buffer of size w suffices and the two cases above are unified. The only branch in the algorithm triggers exactly once every w iterations, and is thus completely data-independent and predictable.

Algorithm 8.4: Two-stacks algorithm for sliding window minimum.
\begin{algorithmic} \Function{TwoStacks}{$w, \texttt{vals}$} \State{$\texttt{prefix\_min} \gets \texttt{MAXINT}$} \State{$\texttt{suffix\_min} \gets$ buffer of size $w$ filled with $\texttt{MAXINT}$} \State{$j \gets 0$} \Comment{Current index in the buffer.} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \State{$\texttt{prefix\_min} \gets \min(\texttt{prefix\_min}, \texttt{vals}[i])$} \State{$\texttt{suffix\_min}[j] \gets \texttt{vals}[i]$} \Comment{Write the current value into the buffer.} \State{$j \gets j + 1$} \If{$j = w$} \Comment{When the buffer is full, recompute suffix minima.} \State{$j \gets 0$} \State{$\texttt{prefix\_min} \gets \texttt{MAXINT}$} \For{$k$ in $\{w-2, w-3, \ldots, 0\}$} \State{$\texttt{suffix\_min}[k] \gets \min(\texttt{suffix\_min}[k], \texttt{suffix\_min}[k+1])$} \EndFor \EndIf \If{$i \ge w-1$} \Comment{Skip the first $w-1$ incomplete windows.} \State{\textbf{yield} $\min(\texttt{prefix\_min}, \texttt{suffix\_min}[j])$} \EndIf \EndFor \EndFunction \end{algorithmic}
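Algorithm 8.4 translates to the scalar Rust sketch below, yielding the minimum value of each window (positions are handled in Section 8.2.2; the function name is ours):

```rust
/// Two-stacks sliding-window minimum: O(n) total time, with the only
/// branch firing exactly once every `w` iterations (data-independent).
fn two_stacks_sliding_min(w: usize, vals: &[u32]) -> Vec<u32> {
    let mut prefix_min = u32::MAX;
    let mut suffix_min = vec![u32::MAX; w];
    let mut j = 0; // current index in the buffer
    let mut out = Vec::new();
    for (i, &v) in vals.iter().enumerate() {
        prefix_min = prefix_min.min(v);
        suffix_min[j] = v; // write the current value into the buffer
        j += 1;
        if j == w {
            // Buffer full: recompute suffix minima, reset the prefix.
            j = 0;
            prefix_min = u32::MAX;
            for k in (0..w - 1).rev() {
                suffix_min[k] = suffix_min[k].min(suffix_min[k + 1]);
            }
        }
        if i >= w - 1 {
            // Skip the first w-1 incomplete windows.
            out.push(prefix_min.min(suffix_min[j]));
        }
    }
    out
}
```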

8.2.2 Combining hashes and positions

Since we want to return the leftmost position of the minimum and not the minimum itself, we need to keep track of the positions alongside the values. While we could simply store value-position tuples in the buffer, in practice we only use the upper 16 bits of each hash value and store the position in the lower 16 bits. This way, taking the minimum still prioritizes the smallest values and breaks ties in favor of the leftmost occurrence as expected. Since the position is stored using 16 bits, strings longer than $2^{16}$ characters are processed in chunks of $2^{16}$ characters. We note here that while a 16-bit hash is usually not sufficient for k-mer indexing purposes, it is sufficiently good for selecting the minimum of a window when w \ll 2^{16}, which is typically the case in bioinformatics applications. We intentionally do not expose this hash to the user, and recommend instead computing a second, larger hash for indexing purposes if needed.
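The packing step can be sketched as follows (hypothetical helper names; the real library keeps this packing internal):

```rust
/// Combine the high 16 bits of a 32-bit hash with a 16-bit position.
/// Taking the numeric minimum of packed values then selects the smallest
/// hash, breaking ties in favor of the leftmost (smallest) position.
fn pack(hash: u32, pos: u16) -> u32 {
    (hash & 0xffff_0000) | pos as u32
}

/// Recover the position from a packed value.
fn unpack_pos(packed: u32) -> u16 {
    (packed & 0xffff) as u16
}
```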

8.3 Computing canonical minimizers

The algorithm we described so far does not take reverse-complement into account. However, in many applications, we want a sequence and its reverse-complement to output the same result. To satisfy this constraint, canonical minimizers should return the same set of k-mers regardless of the orientation of the input. Specifically, if the canonical minimizer of a window W is at position p, then the canonical minimizer of the reverse-complement of W should be at position |W| - k - p = w-1-p. In practice, canonical minimizers are often overlooked as an implementation detail and most existing methods simply compare canonical k-mers, computed as \widehat{x} = \min(x, \operatorname{rc}(x)), which gives a weaker guarantee (Marçais et al., 2024).

The first step is to adapt the hash function so that a k-mer and its reverse-complement are always hashed to the same value. For rolling hashes, this can easily be achieved using the method described in Section 7.2.3.

Following (Pan & Reinert, 2024), we define the canonical strand for each window of length \ell = w + k - 1 as the strand with the highest count of GT bases[2] (encoded as 11 and 10 respectively), as shown in Algorithm 8.5. This count lies in [0, \ell], and when \ell is odd there can be no tie between the two strands. In code, we instead compute the more symmetric \#\texttt{GT} - \#\texttt{AC} = 2\#\texttt{GT} - \ell, which lies in [-\ell, \ell], so that \texttt{count} > 0 defines the canonical strand. Then, we select the leftmost forward minimizer when the window is canonical, and the rightmost reverse-complement minimizer otherwise.

Algorithm 8.5: Counting \#\texttt{GT} - \#\texttt{AC}. It can never be 0 when \ell is odd, and a window is canonical if this is positive.
\begin{algorithmic} \Function{canonical}{$w, k, \texttt{in\_out}$} \Comment{\texttt{in\_out} iterates pairs (last bp, first bp) of each window.} \State{$\ell \gets w + k - 1$} \Comment{Window length, it must be odd to guarantee canonicity.} \State{$c \gets -\ell$} \Comment{Count $\#\texttt{GT} - \#\texttt{AC}$, starting at $-\ell$ and adding/subtracting 2 for each \texttt{G} or \texttt{T}.} \For{$(\texttt{in\_bp}, \texttt{\_})$ in $\texttt{in\_out}[0:\ell-1]$} \Comment{The first $\ell-1$ iterations, there is no \texttt{out\_bp}.} \State{$c \gets c + (\texttt{in\_bp \& 2})$} \EndFor \For{$(\texttt{in\_bp}, \texttt{out\_bp})$ in $\texttt{in\_out}[\ell-1:]$} \Comment{2-bit bp coming in and out of each window.} \State{$c \gets c + (\texttt{in\_bp \& 2})$} \State{\textbf{yield} $c > 0$} \Comment{A window is canonical if $\#\texttt{GT} > \#\texttt{AC}$.} \State{$c \gets c - (\texttt{out\_bp \& 2})$} \EndFor \EndFunction \end{algorithmic}
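A scalar sketch of this count, under the 2-bit encoding A=0, C=1, T=2, G=3 so that bit 1 marks G/T bases (the function name is ours):

```rust
/// Decide the canonical strand of each window of odd length `l` over a
/// 2-bit encoded sequence (A=0, C=1, T=2, G=3): a window is canonical
/// iff #GT - #AC > 0, maintained incrementally as the window slides.
fn canonical_windows(l: usize, seq: &[u8]) -> Vec<bool> {
    assert!(l % 2 == 1, "an odd window length rules out ties");
    let mut c: i32 = -(l as i32); // equals #GT - #AC once the window is full
    let mut out = Vec::new();
    for i in 0..seq.len() {
        c += (seq[i] & 2) as i32; // +2 for G/T (bit 1 set), +0 for A/C
        if i + 1 >= l {
            out.push(c > 0); // canonical iff #GT > #AC
            c -= (seq[i + 1 - l] & 2) as i32; // remove the leaving base
        }
    }
    out
}
```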

The benefit of this method over, say, determining the strand via the middle character (assuming again that \ell is odd), is that the GT count is more stable across consecutive windows, since shifting the window by one position changes it by at most one. This way, the strandedness, and thus the chosen minimizer, is less likely to flip between adjacent windows.

One small drawback of this scheme is that it is not forward, i.e., the selected positions are not guaranteed to be non-decreasing. Suppose for instance that a long window has many occurrences of the smallest minimizer, and that shifting the window by one position changes its canonical strand. In that case, the position of the sampled minimizer could jump backwards: from sampling the rightmost minimizer to sampling the leftmost one. In practice, this does not seem to be a major limitation, both because it is rare and because downstream methods usually work fine on non-forward schemes anyway. Another (slower) scheme for canonical minimizers that preserves the forward property is described in Appendix B.

8.4 Vectorized implementation

8.4.1 Parallel iteration and sliding window minimum

Since computing a sliding window minimum is inherently sequential, we reuse the approach presented in Section 7.3 for parallel iteration. This way, we iterate over the hashes of L (the number of lanes) independent chunks, and we can compute their sliding window minimum simultaneously. The only difference here is that the chunks must overlap by w + k - 2 bases (instead of k - 1 for hashes alone) since we want to cover all windows of w k-mers.

The scalar version of the two-stacks method presented in Algorithm 8.4 can easily be adapted to use vectorized instructions. Algorithm 8.6 shows a vectorized version specialized for 8 lanes of 32-bit hashes, which keeps track of the position in the lower 16 bits.

Algorithm 8.6: SIMD two-stacks algorithm for sliding window minimum, with leftmost tie-breaks.
\begin{algorithmic} \Function{SimdSlidingMin}{$w, \texttt{hash\_it}$} \Comment{\texttt{hash\_it} iterates over 32-bit hashes of each k-mer.} \State{$\texttt{val\_mask} \gets \texttt{0xffff0000}$} \Comment{Use high 16 bits of each value.} \State{$\texttt{pos\_mask} \gets \texttt{0x0000ffff}$} \Comment{Low 16 bits store positions.} \State{$\texttt{prefix\_min} \gets \texttt{0xffffffff}$} \State{$\texttt{suffix\_min} \gets$ buffer of size $w$ filled with $\texttt{0xffffffff}$} \State{$n \gets |\texttt{hash\_it}| - (w - 1)$} \Comment{The first $w-1$ hashes do not complete a window.} \State{$i \gets \texttt{LANE\_IDX} \times n$} \Comment{Current position of the lane.} \State{$j \gets 0$} \Comment{Current index in the buffer.} \For{$h$ in $\texttt{hash\_it}$} \State{$\texttt{val} \gets (h \texttt{ \& val\_mask}) \texttt{ | } i$} \State{$\texttt{prefix\_min} \gets \min(\texttt{prefix\_min}, \texttt{val})$} \State{$\texttt{suffix\_min}[j] \gets \texttt{val}$} \Comment{Write the current value into the buffer.} \State{$j \gets j + 1$} \If{$j = w$} \Comment{When the buffer is full, recompute suffix minima.} \State{$j \gets 0$} \State{$\texttt{prefix\_min} \gets \texttt{0xffffffff}$} \For{$k$ in $\{w-2, w-3, \ldots, 0\}$} \State{$\texttt{suffix\_min}[k] \gets \min(\texttt{suffix\_min}[k], \texttt{suffix\_min}[k+1])$} \EndFor \EndIf \If{$i \ge \texttt{LANE\_IDX} \times n + w-1$} \Comment{Skip the first $w-1$ incomplete windows.} \State{\textbf{yield} $\min(\texttt{prefix\_min}, \texttt{suffix\_min}[j]) \texttt{ \& pos\_mask}$} \EndIf \State{$i \gets i + 1$} \EndFor \EndFunction \end{algorithmic}

8.4.2 Deduplication

The last step of the minimizer algorithm is to deduplicate the results, since adjacent windows often share the same minimizer position. In practice, deduplication works best when the data is laid out linearly in memory. However, the output of the vectorized sliding-window minimum is a u32x8 containing the position of one minimizer per chunk. Thus, every L iterations we reuse the matrix transpose (discussed in Section 7.3.1) to obtain a u32x8 for each chunk, containing L consecutive minimizer positions.

We deduplicate each lane using the technique of Lemire (2017). This compares each element to the previous one, with the first element compared to the last minimizer of the preceding batch. The distinct elements are then shuffled to the front of the SIMD vector using a lookup table and appended to a buffer for each lane, as illustrated in Figure 8.3. We finish by concatenating all the per-lane buffers into a single vector of minimizer positions, making sure to avoid duplicates between the end and start of adjacent lanes.

Figure 8.3: Overview of the reordering and deduplication of minimizer positions. Each row on the left corresponds to the minimizer positions outputted by the SIMD sliding window minimum algorithm for L chunks at a time. We first transpose this matrix, so that the resulting SIMD vectors each correspond to a single chunk, with values corresponding to the first chunk highlighted in yellow. These are then compared with their preceding element (including the previous minimizer position, as shown in grey), and distinct elements are shuffled to the front. These positions are accumulated in a separate buffer for each chunk, and these buffers are finally concatenated into a single flat vector. Note that while the figure shows L=4 lanes, we use L=8 lanes in practice for 256-bit registers.

8.5 Variations of minimizers

While the method we presented so far is designed to compute minimizer positions, we can easily adapt it for different variations of the problem.

8.5.1 Computing super-k-mers

One of the most common use cases of minimizers is to partition a sequence into super-k-mers (presented in Chapter 4). In that case, in addition to the minimizer positions, we also want to keep track of which windows are associated with which minimizer. A simple way to encode this information is to record the starting position of the first window with a given minimizer position.

The starting positions of super-k-mers can be extracted by amending the deduplication step. After comparing adjacent minimizer positions, we obtain a mask that determines the shuffle instruction to apply. Normally we shuffle the 32-bit minimizer positions directly. Instead, we can mask out the upper 16 bits (previously used to store hashes) and store the index of the window there. If we then shuffle these values, we obtain for each minimizer its position in the input text, along with the position of the first window where this k-mer became a minimizer. This information is sufficient to recover all super-k-mers, and sequences longer than $2^{16}$ bp can simply be split up.

8.5.2 Computing syncmers

Another frequent use case is to compute open/closed syncmers (also presented in Chapter 4). For closed syncmers, we want to output windows starting/ending with their minimizer, while for open syncmers, we want to output those with a minimizer in the middle position (assuming w is odd).

Computing syncmers can be implemented as a simple filter on top of the non-deduplicated minimizer positions. For each window at position i, we test whether the minimizer position p satisfies the syncmer condition: p = i or p = i + w - 1 for closed syncmers, or p = i + \lfloor w/2 \rfloor for open syncmers (assuming w is odd so that the middle element is unique). This test can be performed efficiently using vectorized equality comparisons: non-syncmer positions are erased via a blend instruction, and the same shuffle-append mechanism used in Section 8.4.2 efficiently collects the remaining positions. Note that the deduplication step is not needed here since syncmer positions are always distinct.
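The closed-syncmer filter on top of the non-deduplicated minimizer positions can be sketched in scalar Rust as follows (hypothetical helper; the real implementation applies the same test with vectorized comparisons and a blend):

```rust
/// Filter windows into closed syncmers: keep window `i` iff its
/// minimizer sits at the window's first or last position.
/// `min_pos[i]` is the absolute minimizer position of the window
/// starting at position `i`; returns the kept window start positions.
fn closed_syncmers(w: usize, min_pos: &[usize]) -> Vec<usize> {
    min_pos
        .iter()
        .enumerate()
        .filter(|&(i, &p)| p == i || p == i + w - 1)
        .map(|(i, _)| i)
        .collect()
}
```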

8.5.3 Other minimizer schemes

Finally, we can change the implementation of the minimizer scheme itself. First, if we want to change the order on k-mers, we can replace the rolling hash by the function of our choice. For instance, we can simply use the identity for a lexicographic order, or flip the first bits for an anti-lexicographic order (Groot Koerkamp, 2025).

Additionally, we can tweak the sliding window scan to implement other low-density schemes (further discussed in Chapter 15). The best example is probably mod-minimizers (Groot Koerkamp & Pibiri, 2024), which only require applying a modulo w to the position of the minimum as an extra step. Note however that computing a modulo on vector registers is quite impractical, so in practice we instead subtract and blend the results.
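As an illustration of the subtract-and-blend trick, here is a scalar Rust sketch that reduces a position modulo w with a compare and a masked subtraction, assuming the position is below 2w (the function name is ours, and larger ranges would need repeated conditional subtractions):

```rust
/// Reduce `p` modulo `w` without a division, assuming `p < 2 * w`:
/// the compare produces an all-ones/all-zeros mask, mimicking a SIMD
/// compare + blend that conditionally subtracts `w`.
fn mod_w_blend(p: u32, w: u32) -> u32 {
    let mask = 0u32.wrapping_sub((p >= w) as u32); // 0xffffffff iff p >= w
    p - (w & mask)
}
```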

8.6 Experimental evaluation

Our vectorized implementation, built on top of packed-seq and seq-hash, is available at https://github.com/rust-seq/simd-minimizers. This implementation supports canonical minimizers, super-k-mers and syncmers computation, and works with both AVX2 and NEON. The code to reproduce the experiments is available in the bench folder of the same repository.

The experiments were run on an Intel Core i7-10750H with AVX2 running at 2.6 GHz and an Apple M1 with NEON running at 3.2 GHz, using Rust nightly 1.88.0. As input, we use a fixed random string of $10^8$ bases, encoded either as ASCII or in packed representation depending on the method. Reported timings are the median of five runs and are shown in nanoseconds per base.

In our experiments, we use the (w, k) parameter values of Kraken2 (Wood et al., 2019) (5, 31), SSHash (Pibiri, 2022) (11, 21), and minimap2 (Li, 2018) (19, 19).

8.6.1 Incremental time usage

Table 8.2 shows the time usage for various incremental subsets of our method.

Table 8.2: Total time per base taken after each incremental step, for (w,k)=(11,21).
Step ns/bp
AVX2 NEON
Iterate the bases 0.15 0.10
+ collect to vector 0.30 0.18
+ iterate the delayed bases 0.30 0.19
+ ntHash 0.32 0.33
+ sliding window min 0.90 0.82
  + collect 1.48 0.95
  + dedup 1.61 1.38
+ canonical ntHash 1.04 0.85
+ canonical strand 1.53 1.16
  + collect 2.03 1.37
  + dedup 2.20 1.82

To start with, iterating the 8 chunks of the input and summing all bases takes 0.15 ns/bp (0.10 ns/bp on NEON). Appending all u32x8 SIMD vectors containing the bases to a vector takes 0.30 ns/bp (0.18 ns/bp on NEON), indicating that writing to memory induces some overhead. Collecting the second, $(k-1)$-delayed stream of characters that leave the k-mer (by adding them to the non-delayed stream) has no additional overhead. Computing ntHash only takes 0.02 ns/bp extra on AVX2, but adds 0.15 ns/bp on NEON. The sliding window computation nearly triples the total time. Collecting the minimizer positions into a linear vector (i.e., transposing matrices and writing output for each of the 8 chunks) again incurs around 50% overhead, and deduplicating them adds some more time.

Going back a step, using canonical ntHash instead of forward ntHash takes 0.14 ns/bp extra on AVX2 but only 0.03 ns/bp on NEON, and determining the canonical strand (via a third, $\ell$-delayed stream and counting GT bases) takes another 0.49 ns/bp on AVX2 and 0.31 ns/bp on NEON. As before, collecting and deduplicating are slow and add around 0.70 ns/bp.

In conclusion, we see that iterating the chunks of the input and determining minimizers is quite fast, but that a lot of time must then be spent to “deinterleave” the output into a linear stream. As can be expected, canonical minimizers are slower to compute than forward minimizers, but the overhead is less than 50%, which seems quite low given that the ntHash and sliding window minimum computation are duplicated and a canonical-strand computation is added.

8.6.2 Full comparison

We compare against the minimizer-iter crate v1.2.1[3], which implements a queue-based sliding window minimum using wyhash[4] and also supports canonical minimizers. For an additional comparison, we optimized an implementation of the rescan method with ntHash based on a code snippet by Daniel Liu[5]. Results are shown in Table 8.3 and Figure 8.4.

Table 8.3: Comparison of simd-minimizers against minimizer-iter and a rescan implementation. Times in ns/bp for forward and canonical minimizers, with various (w,k) tuples.
(a) Comparison on AVX2.
Method (w=5, k=31) (w=11, k=21) (w=19, k=19)
fwd. canon. fwd. canon. fwd. canon.
minimizer-iter 25.30 32.84 26.96 33.93 26.81 34.04
Rescan ntHash 11.65 7.41 5.61
simd-minimizers ntHash
  ‑ packed input 1.69 2.28 1.61 2.20 1.64 2.16
  ‑ on-the-fly packing 1.92 2.50 1.84 2.42 1.91 2.42
Rescan mulHash 11.37 6.79 5.76
simd-minimizers mulHash
  ‑ packed input 1.85 2.49 1.74 2.40 1.78 2.42
  ‑ on-the-fly packing 2.12 2.70 2.05 2.62 2.05 2.65
  ‑ ASCII input 2.11 2.71 2.06 2.63 2.01 2.66
(b) Comparison on NEON.
Method (w=5, k=31) (w=11, k=21) (w=19, k=19)
fwd. canon. fwd. canon. fwd. canon.
minimizer-iter 12.89 16.60 14.26 18.08 14.20 18.04
Rescan ntHash 5.95 3.92 2.82
simd-minimizers ntHash
  ‑ packed input 1.42 1.85 1.38 1.82 1.38 1.83
  ‑ on-the-fly packing 1.66 2.07 1.58 1.99 1.59 2.02
Rescan mulHash 6.94 3.93 3.42
simd-minimizers mulHash
  ‑ packed input 1.97 4.34 1.91 4.27 1.93 4.26
  ‑ on-the-fly packing 2.11 4.37 2.04 4.41 2.03 4.30
  ‑ ASCII input 2.12 4.39 2.06 4.44 2.04 4.29
(a) Running time on AVX2.
(b) Running time on NEON.
Figure 8.4: Running time (logarithmic) of minimizer-iter, rescan, and simd-minimizers for different values of w and k.

minimizer-iter takes around 26 ns/bp for forward and 33 ns/bp for canonical minimizers, and its runtime does not depend much on w and k, because popping from the queue is unpredictable regardless of w. Rescan starts out at 11.7 ns/bp for w=5 and gets significantly faster as w increases, converging to around 5 ns/bp for w \gg 100. This is explained by the fact that rescan suffers a branch miss every time the current minimizer falls out of the window, which happens for roughly half the minimizers, at a rate of 1/(w+1). Thus, as w increases, the method becomes more predictable and branch misses go down. Our method, simd-minimizers, runs at around 1.61 ns/bp for forward and 2.20 ns/bp for canonical minimizers when given packed input, and is therefore 3.4× to 6.8× faster than the rescan method.

In Figure 8.4, we see that the performance of simd-minimizers is mostly independent of k and w, since the algorithm is mostly data-independent. Only for small w \le 5 is it slightly slower, due to the larger number of minimizers and hence the larger size of the output.

As we use SIMD with 8 lanes, we could in theory expect up to an 8× speedup. In practice this is hard to reach, both because of constant overheads and because of the extra work needed to organize the data for SIMD in the first place. In particular, for large w, rescan benefits from very predictable and simple code and only outputs unique minimizer positions, making it very efficient. Our SIMD method, on the other hand, uses a data-independent algorithm and outputs the minimizer position for every single window, which then has to be deduplicated. Thus, it is nice to see that even for large w, our method is over 3× faster, despite this overhead.

8.6.2.1 ASCII input and mulHash

Apart from taking bit-packed input, simd-minimizers also works directly on ASCII-encoded DNA sequences of A/C/T/G characters, which are packed into values \{0,1,2,3\} for ntHash (in that order) during iteration. This is around 0.25 ns/bp slower, mostly because of the larger size of the unpacked input.

The mulHash variant is around 0.20 ns/bp slower again, but works for any ASCII input. Performance on 100 MB of the Pizza&Chili corpus English[6] and Sources[7] datasets is nearly identical to the performance on random DNA shown in Table 8.3.

8.6.2.2 Human genome

We also run simd-minimizers on the chromosomes of the T2T-CHM13v2.0 human genome[8] (Nurk et al., 2022), of total size 3.2 Gbp. Computing forward minimizers takes 5.19 seconds and canonical minimizers 6.71 seconds for (w, k) = (11, 21), corresponding to 1.67 and 2.15 ns/bp respectively, within a few percent of the timings in Table 8.3.

8.6.2.3 Density

For (w,k)=(11,21), we get a density of 0.173 for forward minimizers, and 0.167 for canonical minimizers, which are both close to the expected density of 2/(w+1) = 1/6\approx 0.167. Changing to (w,k)=(19,19), we get density 0.098 for forward minimizers and 0.100 for canonical minimizers, both close to the expected density of 2/(19+1)=1/10=0.1. Thus, we see that in practice, ntHash is a sufficiently random hash function to approach the expected density of random minimizers with a perfectly uniform random hash function.

8.6.2.4 Multithreading

We test throughput in a multithreaded setting by using 6 threads to process the chromosomes in parallel. This way, processing takes 0.97 s (forward) and 1.27 s (canonical) when the input data is already loaded into memory, a speedup of slightly above 5×. This falls short of 6× because the chromosomes do not partition the data into 6 equal parts, so some threads finish before others. For applications, we recommend simply calling our library from multiple threads as needed.


  1. https://codeforces.com/blog/entry/71687

  2. Any pair of non-complementary bases would work here.

  3. https://github.com/rust-seq/minimizer-iter

  4. https://github.com/eldruin/wyhash-rs

  5. https://gist.github.com/Daniel-Liu-c0deb0t/7078ebca04569068f15507aa856be6e8

  6. https://pizzachili.dcc.uchile.cl/texts/nlang/

  7. https://pizzachili.dcc.uchile.cl/texts/code/

  8. Available at https://github.com/marbl/CHM13.