8  Vectorized computation of minimizers

Note

This chapter is adapted from (Groot Koerkamp & Martayan, 2025).

In this chapter, we focus on computing random minimizers as efficiently as possible: for each window of w consecutive k-mers, we select the smallest k-mer given a pseudo-random order. This order is usually given by comparing hash values, selecting the leftmost smallest hash in case of ties (see Chapter 4 for a broader introduction). More precisely, we are interested in finding the starting positions of random minimizers in a sequence, from which we can extract the minimizers themselves. For simplicity, we first focus on computing forward minimizers (without taking the reverse complement into account), and later describe a solution for canonical minimizers in Section 8.3. Our implementation of the vectorized algorithm discussed in Section 8.4 is available in the simd-minimizers Rust library.

The computation of random minimizers is divided into three main steps, illustrated in Figure 8.1. First, each k-mer of the input string (corresponding to horizontal lines in the figure) is hashed to a 32-bit integer (the value above each line) using a rolling hash as presented in Chapter 7. Then, for each window of w consecutive k-mers, we find the absolute position of its smallest hash: for instance, the fourth window in the figure, highlighted in yellow, has a smallest hash equal to 12, starting at position 5 (zero-based). Finally, the resulting minimizer positions are deduplicated.

Figure 8.1: Overview of the main steps to compute random minimizers, for k=5 and w=4.

8.1 Existing algorithms for sliding window minimum

Assuming that we have computed the hash values, our problem becomes a sliding window minimum. Given a sliding window of w values (in our case k-mer hashes), we want to output the position of its smallest value (breaking ties towards the leftmost one). Many approaches can be used to solve this problem; here we focus on three common ones: the “naive” algorithm, the “rescan” algorithm, and the monotone queue algorithm. A summarized comparison of these three algorithms is given in Table 8.1.

8.1.1 Naive approach

The simplest approach is to loop over the w values of each window independently, as described in Algorithm 8.1. This takes \mathcal{O}(nw) time, where n denotes the total number of values, but it has the benefit of being branchless and can still be quite efficient for small w when using vectorized instructions.

Algorithm 8.1: Naive algorithm for sliding window minimum.
\begin{algorithmic} \Function{Naive}{$w, \texttt{vals}$} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - w\}$} \State{\textbf{yield} $\arg \min \{\texttt{vals}[i], \ldots, \texttt{vals}[i+w-1]\}$} \EndFor \EndFunction \end{algorithmic}
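As a concrete reference, Algorithm 8.1 translates to the following scalar Rust sketch (the function name is ours, and we assume the input holds at least w values):

```rust
/// Naive sliding-window minimum: for each window of `w` values, yield
/// the absolute position of its leftmost minimum. O(n*w) time overall.
fn naive_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    (0..=vals.len() - w)
        .map(|i| {
            // Strict `<` keeps the earlier position on ties (leftmost).
            let mut min_pos = i;
            for j in i + 1..i + w {
                if vals[j] < vals[min_pos] {
                    min_pos = j;
                }
            }
            min_pos
        })
        .collect()
}
```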

8.1.2 Monotone queue

A solution with better complexity, detailed in Algorithm 8.2, is to use a monotone queue, which stores a non-decreasing subsequence of the w hashes alongside their positions. For instance, this is the method used in Bifrost (Holley & Melsted, 2020). Every time the window slides one position to the right and we are about to push a new k-mer hash onto the back of the queue, we first remove any values larger than it, as they are “shadowed” by the new hash and can never be minimal anymore. The minimum of the window is then always the leftmost queue element. This data structure guarantees an amortized constant-time update, but has many unpredictable branches due to removing between 0 and w values, which makes it costly in practice.

Algorithm 8.2: Monotone queue algorithm for sliding window minimum.
\begin{algorithmic} \Function{MonotoneQueue}{$w, \texttt{vals}$} \State{$\texttt{queue} \gets$ empty \textsc{DoubleEndedQueue}} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \If{$\texttt{queue}$ is not empty \textbf{and} $\texttt{queue.front().pos} + w \le i$} \State{\texttt{queue.pop\_front()}} \EndIf \While{$\texttt{queue}$ is not empty \textbf{and} $\texttt{queue.back().val} > \texttt{vals}[i]$} \State{\texttt{queue.pop\_back()}} \EndWhile \State{$\texttt{queue.push\_back}(\texttt{val} \gets \texttt{vals}[i], \texttt{pos} \gets i)$} \If{$i \ge w-1$} \State{\textbf{yield} $\texttt{queue.front().pos}$} \EndIf \EndFor \EndFunction \end{algorithmic}
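Algorithm 8.2 can be sketched in scalar Rust as follows, storing positions only and looking values up on demand (function name ours):

```rust
use std::collections::VecDeque;

/// Monotone-queue sliding-window minimum: yields the leftmost position
/// of the minimum of each window of `w` values, in O(n) total time.
fn monotone_queue_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    // Positions whose values form a non-decreasing subsequence.
    let mut queue: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    for (i, &v) in vals.iter().enumerate() {
        // Drop the front if it slid out of the window.
        if queue.front().is_some_and(|&p| p + w <= i) {
            queue.pop_front();
        }
        // Pop values shadowed by the new one; `>` (not `>=`) keeps ties leftmost.
        while queue.back().is_some_and(|&p| vals[p] > v) {
            queue.pop_back();
        }
        queue.push_back(i);
        if i >= w - 1 {
            out.push(*queue.front().unwrap());
        }
    }
    out
}
```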

8.1.3 Rescan

Another approach used in bioinformatics is to only keep track of the minimum value and rescan the entire window of w values when the current minimum goes out of scope, as described in Algorithm 8.3. For instance, this is the method used in minimap2 (Li, 2018). While this algorithm does not guarantee a worst-case constant time update, it only branches when the minimum goes out of scope and hence is more predictable. This makes it more efficient in practice, especially since minimizers typically have a density of \mathcal{O}(1/w) so that the \mathcal{O}(w) rescan step takes amortized constant time per element.

Algorithm 8.3: Rescan algorithm for sliding window minimum.
\begin{algorithmic} \Function{Rescan}{$w, \texttt{vals}$} \State{$\texttt{min\_val} \gets +\infty$} \State{$\texttt{min\_pos} \gets 0$} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \If{$\texttt{vals}[i] < \texttt{min\_val}$} \State{$\texttt{min\_val} \gets \texttt{vals}[i]$} \State{$\texttt{min\_pos} \gets i$} \EndIf \If{$\texttt{min\_pos} + w \le i$} \State{$\texttt{min\_pos} \gets \arg \min \{\texttt{vals}[i-w+1], \ldots, \texttt{vals}[i]\}$} \State{$\texttt{min\_val} \gets \texttt{vals}[\texttt{min\_pos}]$} \EndIf \If{$i \ge w-1$} \State{\textbf{yield} $\texttt{min\_pos}$} \EndIf \EndFor \EndFunction \end{algorithmic}
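Algorithm 8.3 admits the following scalar Rust sketch (function name ours). Note that the only data-dependent branch is the rescan itself:

```rust
/// Rescan sliding-window minimum: track the current minimum and rescan
/// the whole window of `w` values only when the minimum falls out of scope.
fn rescan_sliding_min(w: usize, vals: &[u32]) -> Vec<usize> {
    let mut min_pos = 0;
    let mut out = Vec::new();
    for i in 0..vals.len() {
        if vals[i] < vals[min_pos] {
            min_pos = i;
        }
        if min_pos + w <= i {
            // The minimum left the window: rescan positions i-w+1..=i.
            min_pos = i + 1 - w;
            for j in i + 2 - w..=i {
                if vals[j] < vals[min_pos] {
                    min_pos = j;
                }
            }
        }
        if i >= w - 1 {
            out.push(min_pos);
        }
    }
    out
}
```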
Table 8.1: Comparison of the naive, rescan and monotone queue sliding window minimum algorithms.
Algorithm Time complexity Branches
Naive \mathcal{O}(nw) worst-case branchless
Rescan \mathcal{O}(nw) worst-case, \mathcal{O}(n) on average branch for each rescan
Monotone queue \mathcal{O}(n) worst-case branch for each update

8.2 A branchless linear algorithm using two stacks

8.2.1 The two-stacks algorithm

Ideally, we would like to design an algorithm that has both an \mathcal{O}(n) worst-case complexity (like the monotone queue) and no unpredictable branches (like the naive approach). The two-stacks method (Hirzel et al., 2017; Theodorakis et al., 2018), well-known in the competitive programming community[1], allows us to compute online sliding minima where elements may be added and removed at varying rates. Here, we only discuss a version where the number of elements remains constant at w.

Conceptually, we first split the sequence of input k-mer hashes into blocks of size w, as shown in Figure 8.2. Then, we can compute both prefix minima and suffix minima of each block in \mathcal{O}(w) per block, or \mathcal{O}(1) amortized per input hash. Now, any window of size w can be split into a suffix of the previous block and a prefix of the current block (as highlighted in yellow in Figure 8.2), and we can return the minimum of the two corresponding suffix/prefix minima. When the window exactly coincides with a block (as shown on the right), the suffix and prefix minimum are equal.

Figure 8.2: An example showing how sliding-window minima can be computed for windows of size w=4. The first window highlighted in yellow overlaps two blocks and has 5 as suffix minimum and 1 as prefix minimum. The second highlighted window coincides with a block and has 3 as both suffix and prefix minimum.

In the implementation, detailed in Algorithm 8.4, the prefix minima are computed incrementally, while the suffix minima are computed in batches once every block of w hashes has been filled. This way, a single buffer of size w suffices and the two cases above are unified. The only branch in the algorithm triggers exactly once every w iterations, and is thus completely data-independent and predictable.

Algorithm 8.4: Two-stacks algorithm for sliding window minimum.
\begin{algorithmic} \Function{TwoStacks}{$w, \texttt{vals}$} \State{$\texttt{prefix\_min} \gets \texttt{MAXINT}$} \State{$\texttt{suffix\_min} \gets$ buffer of size $w$ filled with $\texttt{MAXINT}$} \State{$j \gets 0$} \Comment{Current index in the buffer.} \For{$i$ in $\{0, \ldots, |\texttt{vals}| - 1\}$} \State{$\texttt{prefix\_min} \gets \min(\texttt{prefix\_min}, \texttt{vals}[i])$} \State{$\texttt{suffix\_min}[j] \gets \texttt{vals}[i]$} \Comment{Write the current value into the buffer.} \State{$j \gets j + 1$} \If{$j = w$} \Comment{When the buffer is full, recompute suffix minima.} \State{$j \gets 0$} \State{$\texttt{prefix\_min} \gets \texttt{MAXINT}$} \For{$k$ in $\{w-2, w-3, \ldots, 0\}$} \State{$\texttt{suffix\_min}[k] \gets \min(\texttt{suffix\_min}[k], \texttt{suffix\_min}[k+1])$} \EndFor \EndIf \If{$i \ge w-1$} \Comment{Skip the first $w-1$ incomplete windows.} \State{\textbf{yield} $\min(\texttt{prefix\_min}, \texttt{suffix\_min}[j])$} \EndIf \EndFor \EndFunction \end{algorithmic}
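Algorithm 8.4 translates to the scalar Rust sketch below, yielding the minimum value of each window (positions are handled in Section 8.2.2; the function name is ours):

```rust
/// Two-stacks sliding-window minimum: O(n) total time, with the only
/// branch firing exactly once every `w` iterations (data-independent).
fn two_stacks_sliding_min(w: usize, vals: &[u32]) -> Vec<u32> {
    let mut prefix_min = u32::MAX;
    let mut suffix_min = vec![u32::MAX; w];
    let mut j = 0; // current index in the buffer
    let mut out = Vec::new();
    for (i, &v) in vals.iter().enumerate() {
        prefix_min = prefix_min.min(v);
        suffix_min[j] = v; // write the current value into the buffer
        j += 1;
        if j == w {
            // Buffer full: recompute suffix minima, reset the prefix.
            j = 0;
            prefix_min = u32::MAX;
            for k in (0..w - 1).rev() {
                suffix_min[k] = suffix_min[k].min(suffix_min[k + 1]);
            }
        }
        if i >= w - 1 {
            // Skip the first w-1 incomplete windows.
            out.push(prefix_min.min(suffix_min[j]));
        }
    }
    out
}
```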

8.2.2 Combining hashes and positions

Since we want to return the leftmost position of the minimum and not the minimum itself, we need to keep track of the positions alongside the values. While we could simply store value-position tuples in the buffer, in practice we only use the upper 16 bits of each hash value and store the position in the lower 16 bits. This way, taking the minimum still prioritizes the smallest values and breaks ties in favor of the leftmost occurrence as expected. Since the position is stored using 16 bits, strings longer than $2^{16}$ characters are processed in chunks of $2^{16}$ characters. We note here that while a 16-bit hash is usually not sufficient for k-mer indexing purposes, it is sufficiently good for selecting the minimum of a window when w \ll 2^{16}, which is typically the case in bioinformatics applications. We intentionally do not expose this hash to the user, and recommend instead computing a second, larger hash for indexing purposes if needed.
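The packing step can be sketched as follows (hypothetical helper names; the real library keeps this packing internal):

```rust
/// Combine the high 16 bits of a 32-bit hash with a 16-bit position.
/// Taking the numeric minimum of packed values then selects the smallest
/// hash, breaking ties in favor of the leftmost (smallest) position.
fn pack(hash: u32, pos: u16) -> u32 {
    (hash & 0xffff_0000) | pos as u32
}

/// Recover the position from a packed value.
fn unpack_pos(packed: u32) -> u16 {
    (packed & 0xffff) as u16
}
```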

8.3 Computing canonical minimizers

The algorithm we described so far does not take reverse-complement into account. However, in many applications, we want a sequence and its reverse-complement to output the same result. To satisfy this constraint, canonical minimizers should return the same set of k-mers regardless of the orientation of the input. Specifically, if the canonical minimizer of a window W is at position p, then the canonical minimizer of the reverse-complement of W should be at position |W| - k - p = w-1-p. In practice, canonical minimizers are often overlooked as an implementation detail and most existing methods simply compare canonical k-mers, computed as \widehat{x} = \min(x, \operatorname{rc}(x)), which gives a weaker guarantee (Marçais et al., 2024).

The first step is to adapt the hash function so that a k-mer and its reverse-complement are always hashed to the same value. For rolling hashes, this can easily be achieved using the method described in Section 7.2.3.

Following (Pan & Reinert, 2024), we define the canonical strand for each window of length \ell = w + k - 1 as the strand with the highest count of GT bases[2] (encoded as 11 and 10 respectively), as shown in Algorithm 8.5. This count lies in [0, \ell], and when \ell is odd there can be no tie between the two strands. In code, we instead compute the more symmetric \#\texttt{GT} - \#\texttt{AC} = 2\#\texttt{GT} - \ell, which lies in [-\ell, \ell], so that \texttt{count} > 0 defines the canonical strand. Then, we select the leftmost forward minimizer when the window is canonical, and the rightmost reverse-complement minimizer otherwise.

Algorithm 8.5: Counting \#\texttt{GT} - \#\texttt{AC}. It can never be 0 when \ell is odd, and a window is canonical if this is positive.
\begin{algorithmic} \Function{canonical}{$w, k, \texttt{in\_out}$} \Comment{\texttt{in\_out} iterates pairs (last bp, first bp) of each window.} \State{$\ell \gets w + k - 1$} \Comment{Window length, it must be odd to guarantee canonicity.} \State{$c \gets -\ell$} \Comment{Count $\#\texttt{GT} - \#\texttt{AC}$, starting at $-\ell$ and adding/subtracting 2 for each \texttt{G} or \texttt{T}.} \For{$(\texttt{in\_bp}, \texttt{\_})$ in $\texttt{in\_out}[0:\ell-1]$} \Comment{The first $\ell-1$ iterations, there is no \texttt{out\_bp}.} \State{$c \gets c + (\texttt{in\_bp \& 2})$} \EndFor \For{$(\texttt{in\_bp}, \texttt{out\_bp})$ in $\texttt{in\_out}[\ell-1:]$} \Comment{2-bit bp coming in and out of each window.} \State{$c \gets c + (\texttt{in\_bp \& 2})$} \State{\textbf{yield} $c > 0$} \Comment{A window is canonical if $\#\texttt{GT} > \#\texttt{AC}$.} \State{$c \gets c - (\texttt{out\_bp \& 2})$} \EndFor \EndFunction \end{algorithmic}
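A scalar sketch of this count, under the 2-bit encoding A=0, C=1, T=2, G=3 so that bit 1 marks G/T bases (the function name is ours):

```rust
/// Decide the canonical strand of each window of odd length `l` over a
/// 2-bit encoded sequence (A=0, C=1, T=2, G=3): a window is canonical
/// iff #GT - #AC > 0, maintained incrementally as the window slides.
fn canonical_windows(l: usize, seq: &[u8]) -> Vec<bool> {
    assert!(l % 2 == 1, "an odd window length rules out ties");
    let mut c: i32 = -(l as i32); // equals #GT - #AC once the window is full
    let mut out = Vec::new();
    for i in 0..seq.len() {
        c += (seq[i] & 2) as i32; // +2 for G/T (bit 1 set), +0 for A/C
        if i + 1 >= l {
            out.push(c > 0); // canonical iff #GT > #AC
            c -= (seq[i + 1 - l] & 2) as i32; // remove the leaving base
        }
    }
    out
}
```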

The benefit of this method over, say, determining the strand via the middle character (assuming again that \ell is odd), is that the GT count is more stable across consecutive windows, since shifting the window by one position changes it by at most one. This way, the strandedness, and thus the chosen minimizer, is less likely to flip between adjacent windows.

One small drawback of this scheme is that it is not forward, i.e., the selected positions are not guaranteed to be non-decreasing. Suppose for instance that a long window has many occurrences of the smallest minimizer, and that shifting the window by one position changes its canonical strand. In that case, the position of the sampled minimizer could jump backwards: from sampling the rightmost minimizer to sampling the leftmost one. In practice, this does not seem to be a major limitation, both because it is rare and because downstream methods usually work fine on non-forward schemes anyway. Another (slower) scheme for canonical minimizers that preserves the forward property is described in Appendix B.

8.4 Vectorized implementation

8.4.1 Parallel iteration and sliding window minimum

Since computing a sliding window minimum is inherently sequential, we reuse the approach presented in Section 7.3 for parallel iteration. This way, we iterate over the hashes of L (the number of lanes) independent chunks, and we can compute their sliding window minimum simultaneously. The only difference here is that the chunks must overlap by w + k - 2 bases (instead of k - 1 for hashes alone) since we want to cover all windows of w k-mers.

The scalar version of the two-stacks method presented in Algorithm 8.4 can easily be adapted to use vectorized instructions. Algorithm 8.6 shows a vectorized version specialized for 8 lanes of 32-bit hashes, which keeps track of the position in the lower 16 bits.

Algorithm 8.6: SIMD two-stacks algorithm for sliding window minimum, with leftmost tie-breaks.
\begin{algorithmic} \Function{SimdSlidingMin}{$w, \texttt{hash\_it}$} \Comment{\texttt{hash\_it} iterates over 32-bit hashes of each k-mer.} \State{$\texttt{val\_mask} \gets \texttt{0xffff0000}$} \Comment{Use high 16 bits of each value.} \State{$\texttt{pos\_mask} \gets \texttt{0x0000ffff}$} \Comment{Low 16 bits store positions.} \State{$\texttt{prefix\_min} \gets \texttt{0xffffffff}$} \State{$\texttt{suffix\_min} \gets$ buffer of size $w$ filled with $\texttt{0xffffffff}$} \State{$n \gets |\texttt{hash\_it}| - (w - 1)$} \Comment{The first $w-1$ hashes do not complete a window.} \State{$i \gets \texttt{LANE\_IDX} \times n$} \Comment{Current position of the lane.} \State{$j \gets 0$} \Comment{Current index in the buffer.} \For{$h$ in $\texttt{hash\_it}$} \State{$\texttt{val} \gets (h \texttt{ \& val\_mask}) \texttt{ | } i$} \State{$\texttt{prefix\_min} \gets \min(\texttt{prefix\_min}, \texttt{val})$} \State{$\texttt{suffix\_min}[j] \gets \texttt{val}$} \Comment{Write the current value into the buffer.} \State{$j \gets j + 1$} \If{$j = w$} \Comment{When the buffer is full, recompute suffix minima.} \State{$j \gets 0$} \State{$\texttt{prefix\_min} \gets \texttt{0xffffffff}$} \For{$k$ in $\{w-2, w-3, \ldots, 0\}$} \State{$\texttt{suffix\_min}[k] \gets \min(\texttt{suffix\_min}[k], \texttt{suffix\_min}[k+1])$} \EndFor \EndIf \If{$i \ge \texttt{LANE\_IDX} \times n + w-1$} \Comment{Skip the first $w-1$ incomplete windows.} \State{\textbf{yield} $\min(\texttt{prefix\_min}, \texttt{suffix\_min}[j]) \texttt{ \& pos\_mask}$} \EndIf \State{$i \gets i + 1$} \EndFor \EndFunction \end{algorithmic}

8.4.2 Deduplication

The last step of the minimizer algorithm is to deduplicate the results, since adjacent windows often share the same minimizer position. In practice, deduplication works best when the data is laid out linearly in memory. However, the output of the vectorized sliding-window minimum is a u32x8 containing the position of one minimizer per chunk. Thus, every L iterations we reuse the matrix transpose (discussed in Section 7.3.1) to obtain a u32x8 for each chunk, containing L consecutive minimizer positions.

We deduplicate each lane using the technique of Lemire (2017). This compares each element to the previous one, with the first element compared to the last minimizer of the preceding batch. The distinct elements are then shuffled to the front of the SIMD vector using a lookup table and appended to a buffer for each lane, as illustrated in Figure 8.3. We finish by concatenating all the per-lane buffers into a single vector of minimizer positions, making sure to avoid duplicates between the end and start of adjacent lanes.

Figure 8.3: Overview of the reordering and deduplication of minimizer positions. Each row on the left corresponds to the minimizer positions outputted by the SIMD sliding window minimum algorithm for L chunks at a time. We first transpose this matrix, so that the resulting SIMD vectors each correspond to a single chunk, with values corresponding to the first chunk highlighted in yellow. These are then compared with their preceding element (including the previous minimizer position, as shown in grey), and distinct elements are shuffled to the front. These positions are accumulated in a separate buffer for each chunk, and these buffers are finally concatenated into a single flat vector. Note that while the figure shows L=4 lanes, we use L=8 lanes in practice for 256-bit registers.

8.5 Variations of minimizers

While the method we presented so far is designed to compute minimizer positions, we can easily adapt it for different variations of the problem.

8.5.1 Computing super-k-mers

One of the most common use cases of minimizers is to partition a sequence into super-k-mers (presented in Chapter 4). In that case, in addition to the minimizer positions, we also want to keep track of which windows are associated with which minimizer. A simple way to encode this information is to record the starting position of the first window with a given minimizer position.

The starting positions of super-k-mers can be extracted by amending the deduplication step. After comparing adjacent minimizer positions, we obtain a mask that determines the shuffle instruction to apply. Normally we shuffle the 32-bit minimizer positions directly. Instead, we can mask out the upper 16 bits (previously used to store hashes) and store the index of the window there. If we then shuffle these values, we obtain for each minimizer its position in the input text, along with the position of the first window where this k-mer became a minimizer. This information is sufficient to recover all super-k-mers, and sequences longer than $2^{16}$ bp can simply be split up.

8.5.2 Computing syncmers

Another frequent use case is to compute open/closed syncmers (also presented in Chapter 4). For closed syncmers, we want to output windows starting/ending with their minimizer, while for open syncmers, we want to output those with a minimizer in the middle position (assuming w is odd).

Computing syncmers can be implemented as a simple filter on top of the non-deduplicated minimizer positions. For each window at position i, we test whether the minimizer position p satisfies the syncmer condition: p = i or p = i + w - 1 for closed syncmers, or p = i + \lfloor w/2 \rfloor for open syncmers (assuming w is odd so that the middle element is unique). This test can be performed efficiently using vectorized equality comparisons: non-syncmer positions are erased via a blend instruction, and the same shuffle-append mechanism used in Section 8.4.2 efficiently collects the remaining positions. Note that the deduplication step is not needed here since syncmer positions are always distinct.
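The closed-syncmer filter on top of the non-deduplicated minimizer positions can be sketched in scalar Rust as follows (hypothetical helper; the real implementation applies the same test with vectorized comparisons and a blend):

```rust
/// Filter windows into closed syncmers: keep window `i` iff its
/// minimizer sits at the window's first or last position.
/// `min_pos[i]` is the absolute minimizer position of the window
/// starting at position `i`; returns the kept window start positions.
fn closed_syncmers(w: usize, min_pos: &[usize]) -> Vec<usize> {
    min_pos
        .iter()
        .enumerate()
        .filter(|&(i, &p)| p == i || p == i + w - 1)
        .map(|(i, _)| i)
        .collect()
}
```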

8.5.3 Other minimizer schemes

Finally, we can change the implementation of the minimizer scheme itself. First, if we want to change the order on k-mers, we can replace the rolling hash by the function of our choice. For instance, we can simply use the identity for a lexicographic order, or flip the first bits for an anti-lexicographic order (Groot Koerkamp, 2025).

Additionally, we can tweak the sliding window scan to implement other low-density schemes (further discussed in Chapter 15). The best example is probably mod-minimizers (Groot Koerkamp & Pibiri, 2024), which only require applying a modulo w to the position of the minimum as an extra step. Note however that computing a modulo on vector registers is quite impractical, so in practice we instead subtract and blend the results.
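As an illustration of the subtract-and-blend trick, here is a scalar Rust sketch that reduces a position modulo w with a compare and a masked subtraction, assuming the position is below 2w (the function name is ours, and larger ranges would need repeated conditional subtractions):

```rust
/// Reduce `p` modulo `w` without a division, assuming `p < 2 * w`:
/// the compare produces an all-ones/all-zeros mask, mimicking a SIMD
/// compare + blend that conditionally subtracts `w`.
fn mod_w_blend(p: u32, w: u32) -> u32 {
    let mask = 0u32.wrapping_sub((p >= w) as u32); // 0xffffffff iff p >= w
    p - (w & mask)
}
```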

8.6 Experimental evaluation

Our vectorized implementation, built on top of packed-seq and seq-hash, is available at https://github.com/rust-seq/simd-minimizers. This implementation supports canonical minimizers, super-k-mers and syncmers computation, and works with both AVX2 and NEON. The code to reproduce the experiments is available in the bench folder of the same repository.

The experiments were run on an Intel Core i7-10750H with AVX2 running at 2.6 GHz and an Apple M1 with NEON running at 3.2 GHz, using Rust nightly 1.88.0. As input, we use a fixed random string of $10^8$ bases, encoded either as ASCII or in packed representation depending on the method. Reported timings are the median of five runs and are shown in nanoseconds per base.

In our experiments, we use the (w, k) parameter values of Kraken2 (Wood et al., 2019) (5, 31), SSHash (Pibiri, 2022) (11, 21), and minimap2 (Li, 2018) (19, 19).

8.6.1 Incremental time usage

Table 8.2 shows the time usage for various incremental subsets of our method.

Table 8.2: Total time per base taken after each incremental step, for (w,k)=(11,21).
Step ns/bp
AVX2 NEON
Iterate the bases 0.15 0.10
+ collect to vector 0.30 0.18
+ iterate the delayed bases 0.30 0.19
+ ntHash 0.32 0.33
+ sliding window min 0.90 0.82
  + collect 1.48 0.95
  + dedup 1.61 1.38
+ canonical ntHash 1.04 0.85
+ canonical strand 1.53 1.16
  + collect 2.03 1.37
  + dedup 2.20 1.82

To start with, iterating the 8 chunks of the input and summing all bases takes 0.15 ns/bp (0.10 ns/bp on NEON). Appending all u32x8 SIMD vectors containing the bases to a vector takes 0.30 ns/bp (0.18 ns/bp on NEON), indicating that writing to memory induces some overhead. Collecting the second, $(k-1)$-delayed stream of characters that leave the k-mer (by adding them to the non-delayed stream) has no additional overhead. Computing ntHash only takes 0.02 ns/bp extra on AVX2, but adds 0.15 ns/bp on NEON. The sliding window computation nearly triples the total time. Collecting the minimizer positions into a linear vector (i.e., transposing matrices and writing output for each of the 8 chunks) again incurs around 50% overhead, and deduplicating them adds some more time.

Going back a step, using canonical ntHash instead of forward ntHash takes 0.14 ns/bp extra on AVX2 but only 0.03 ns/bp on NEON, and determining the canonical strand (via a third, $\ell$-delayed stream and counting GT bases) takes another 0.49 ns/bp on AVX2 and 0.31 ns/bp on NEON. As before, collecting and deduplicating are slow and add around 0.70 ns/bp.

In conclusion, we see that iterating the chunks of the input and determining minimizers is quite fast, but that a lot of time must then be spent to “deinterleave” the output into a linear stream. As can be expected, canonical minimizers are slower to compute than forward minimizers, but the overhead is less than 50%, which seems quite low given that the ntHash and sliding window minimum computation are duplicated and a canonical-strand computation is added.

8.6.2 Full comparison

We compare against the minimizer-iter crate v1.2.1[3], which implements a queue-based sliding window minimum using wyhash[4] and also supports canonical minimizers. For an additional comparison, we optimized an implementation of the rescan method with ntHash based on a code snippet by Daniel Liu[5]. Results are shown in Table 8.3 and Figure 8.4.

Table 8.3: Comparison of simd-minimizers against minimizer-iter and a rescan implementation. Times in ns/bp for forward and canonical minimizers, with various (w,k) tuples.
(a) Comparison on AVX2.
Method (w=5, k=31) (w=11, k=21) (w=19, k=19)
fwd. canon. fwd. canon. fwd. canon.
minimizer-iter 25.30 32.84 26.96 33.93 26.81 34.04
Rescan ntHash 11.65 7.41 5.61
simd-minimizers ntHash
  ‑ packed input 1.69 2.28 1.61 2.20 1.64 2.16
  ‑ on-the-fly packing 1.92 2.50 1.84 2.42 1.91 2.42
Rescan mulHash 11.37 6.79 5.76
simd-minimizers mulHash
  ‑ packed input 1.85 2.49 1.74 2.40 1.78 2.42
  ‑ on-the-fly packing 2.12 2.70 2.05 2.62 2.05 2.65
  ‑ ASCII input 2.11 2.71 2.06 2.63 2.01 2.66
(b) Comparison on NEON.
Method (w=5, k=31) (w=11, k=21) (w=19, k=19)
fwd. canon. fwd. canon. fwd. canon.
minimizer-iter 12.89 16.60 14.26 18.08 14.20 18.04
Rescan ntHash 5.95 3.92 2.82
simd-minimizers ntHash
  ‑ packed input 1.42 1.85 1.38 1.82 1.38 1.83
  ‑ on-the-fly packing 1.66 2.07 1.58 1.99 1.59 2.02
Rescan mulHash 6.94 3.93 3.42
simd-minimizers mulHash
  ‑ packed input 1.97 4.34 1.91 4.27 1.93 4.26
  ‑ on-the-fly packing 2.11 4.37 2.04 4.41 2.03 4.30
  ‑ ASCII input 2.12 4.39 2.06 4.44 2.04 4.29
(a) Running time on AVX2.
(b) Running time on NEON.
Figure 8.4: Running time (logarithmic) of minimizer-iter, rescan, and simd-minimizers for different values of w and k.

minimizer-iter takes around 26 ns/bp for forward and 33 ns/bp for canonical minimizers, and its runtime does not depend much on w and k, because popping from the queue is unpredictable regardless of w. Rescan starts out at 11.7 ns/bp for w=5 and gets significantly faster as w increases, converging to around 5 ns/bp for w \gg 100. This is explained by the fact that rescan suffers a branch miss every time the current minimizer falls out of the window, which happens for roughly half the minimizers, at a rate of 1/(w+1). Thus, as w increases, the method becomes more predictable and branch misses go down. Our method, simd-minimizers, runs at around 1.61 ns/bp for forward and 2.20 ns/bp for canonical minimizers when given packed input, and is therefore 3.4× to 6.8× faster than the rescan method.

In Figure 8.4, we see that the performance of simd-minimizers is mostly independent of k and w, since the algorithm is mostly data-independent. Only for small w \le 5 is it slightly slower, due to the larger number of minimizers and hence the larger size of the output.

As we use SIMD with 8 lanes, we could in theory expect up to an 8× speedup. In practice this is hard to reach, both because of constant overheads and because of the extra work needed to organize the data for SIMD in the first place. In particular, for large w, rescan benefits from very predictable and simple code and only outputs unique minimizer positions, making it very efficient. Our SIMD method, on the other hand, uses a data-independent algorithm and outputs the minimizer position for every single window, which then has to be deduplicated. Thus, it is nice to see that even for large w, our method is over 3× faster, despite this overhead.

8.6.2.1 ASCII input and mulHash

Apart from taking bit-packed input, simd-minimizers also works directly on ASCII-encoded DNA sequences of A/C/T/G characters, which are packed into values \{0,1,2,3\} for ntHash (in that order) during iteration. This is around 0.25 ns/bp slower, mostly because of the larger size of the unpacked input.

The mulHash variant is around 0.20 ns/bp slower again, but works for any ASCII input. Performance on 100 MB of the Pizza&Chili corpus English[6] and Sources[7] datasets is nearly identical to the performance on random DNA shown in Table 8.3.

8.6.2.2 Human genome

We also run simd-minimizers on the chromosomes of the T2T-CHM13v2.0 human genome[8] (Nurk et al., 2022), of total size 3.2 Gbp. Computing forward minimizers takes 5.19 seconds and canonical minimizers 6.71 seconds for (w, k) = (11, 21), corresponding to 1.67 and 2.15 ns/bp respectively, within a few percent of the timings in Table 8.3.

8.6.2.3 Density

For (w,k)=(11,21), we get a density of 0.173 for forward minimizers, and 0.167 for canonical minimizers, which are both close to the expected density of 2/(w+1) = 1/6\approx 0.167. Changing to (w,k)=(19,19), we get density 0.098 for forward minimizers and 0.100 for canonical minimizers, both close to the expected density of 2/(19+1)=1/10=0.1. Thus, we see that in practice, ntHash is a sufficiently random hash function to approach the expected density of random minimizers with a perfectly uniform random hash function.

8.6.2.4 Multithreading

We test throughput in a multithreaded setting by using 6 threads to process the chromosomes in parallel. This way, processing takes 0.97 s (forward) and 1.27 s (canonical) when the input data is already loaded into memory, a speedup of slightly above 5×. This falls short of 6× because the chromosomes do not partition the data into 6 equal parts, so some threads finish before others. For applications, we recommend simply calling our library from multiple threads as needed.


  1. https://codeforces.com/blog/entry/71687

  2. Any pair of non-complementary bases would work here.

  3. https://github.com/rust-seq/minimizer-iter

  4. https://github.com/eldruin/wyhash-rs

  5. https://gist.github.com/Daniel-Liu-c0deb0t/7078ebca04569068f15507aa856be6e8

  6. https://pizzachili.dcc.uchile.cl/texts/nlang/

  7. https://pizzachili.dcc.uchile.cl/texts/code/

  8. Available at https://github.com/marbl/CHM13.