15 Background on low-density minimizers

As discussed in Chapter 4, the density of a minimizer scheme drives its space efficiency: lowering the density leads to longer runs of the same minimizer and longer super‑k‑mers, which results in a direct improvement in applications such as k‑mer counting (Kokot et al., 2017; Martayan et al., 2025), DBG compaction (Chikhi et al., 2016; Khan et al., 2022; Cracco & Tomescu, 2023) or large-scale indexes (Marchet et al., 2021; Pibiri, 2022). However, we’ve also seen that the widely used random minimizers achieve a density of 2 / (w+1), well above the 1 / w lower bound implied by the window guarantee. This observation raises two main questions: how can we design new schemes with a lower density? And what is the lowest density we can hope to reach? In this chapter, we briefly cover some of the recent advances on these two questions, and a more detailed survey can be found in (Groot Koerkamp, 2025).

15.1 Universal hitting sets and decycling sets

15.1.1 Universal hitting sets

One of the first directions explored to lower the density emerged from universal hitting sets (Orenstein et al., 2016, 2017). A universal hitting set, or UHS, is simply a set U of k‑mers such that every window of w k‑mers contains at least one element from U. From a DBG perspective (see Chapter 2 for some reminders), designing a UHS is equivalent to selecting a subset of nodes such that every path of length w is guaranteed to contain at least one of them.

Unlike minimizers, UHS are context-free in the sense that a selected k‑mer does not depend on its surrounding context. We can however derive a minimizer scheme compatible with a UHS U by using an order that ranks all k‑mers in U before all others. Since every window contains an element from U and the order favors these, we know that every selected minimizer would belong to U. The density of such a scheme is upper-bounded by |U| / \sigma^k (Marçais et al., 2018), thus designing a smaller UHS directly translates to a lower density. Unfortunately, Orenstein et al. (2016) proved that finding a minimum UHS is NP-hard by reducing it from the Hitting Set problem (Karp, 2009), so we have to rely on heuristics to find small UHS instead.

15.1.2 Decycling sets

A popular heuristic to approximate a good UHS is to start from a minimum decycling set (MDS). The idea behind a decycling set is that, instead of covering every path of length w, we cover every cycle of the DBG. Luckily, building a minimum decycling set is much easier than a minimum UHS: Mykkeltveit (1972) constructed it by embedding each k‑mer x into the complex plane as z = \sum_{i=0}^{k-1} x_i \cdot \omega_k^i where \omega_k is a primitive k-th root of unity. We then select the k‑mers whose embedding z corresponds to the first clockwise rotation with positive imaginary part, i.e. \pi - 2\pi/k \le \arg(z) < \pi.

DOCKS (Orenstein et al., 2016, 2017) first suggested to greedily extend this MDS into a UHS for a specific window size w by repeatedly adding the k‑mer contained in the largest number of uncovered windows, until all windows are hit. However, the exponential search space limited this approach to k \le 14 in practice.

Pellow et al. (2023) proposed a simpler and more scalable approach by building a minimizer scheme directly from this MDS: within any window, prefer k‑mers in the MDS and break ties randomly. Checking MDS membership can be done by simply computing the corresponding embedding or with a precomputed lookup table, and the scheme extends to arbitrary k. Pellow et al. (2023) also introduced double decycling sets as a variant that prioritizes a symmetric MDS by selecting k‑mers with -2\pi/k \le \arg(z) < 0, and falls back to Mykkeltveit’s MDS otherwise. This approach further reduced density, especially for k < w where it significantly outperforms DOCKS and similar UHS-based constructions. A plot of the density of double decycling sets, along with other schemes, is available in Figure 15.1.

15.2 (Open-closed) mod-minimizers

15.2.1 Mod-minimizers

Groot Koerkamp & Pibiri (2024) took a different approach to improve the density of minimizer schemes by using a technique called modulo sampling. Their idea is to first select the position of a minimizer of length t = k \bmod w using a random order, and then map it back to a k‑mer position by applying a modulo w. The resulting scheme, named mod-minimizers is detailed in Algorithm 15.1. Note that when k < w, t = k and this scheme is perfectly equivalent to random minimizers with no density benefit. In practice, t is often forced to be \ge r by setting t = r + ((k - r) \bmod w) for some small lower bound r¹, which avoids pathological cases such as selecting t‑mers of length 1.

Algorithm 15.1: Pseudocode for mod-minimizers.

\begin{algorithmic} \Function{ModMinimizers}{$w, k$} \State{$t \gets k \bmod w$} \State{$i \gets 0$} \For{$j$ in \Call{SlidingMinPos}{$w+k-t$, $t$}} \Comment{window of $w+k-t$ t-mers} \State{\textbf{yield} $i + ((j - i) \bmod w)$} \Comment{smallest integer $i^\prime \ge i$ such that $i^\prime \equiv j \mod w$} \State{$i \gets i + 1$} \EndFor \EndFunction \end{algorithmic}

Groot Koerkamp & Pibiri (2024) proved that the density of the mod-minimizers is given by d = \frac{2 + \frac{k - t}{w}}{w + k - t + 1} + o(1 / (w + k - 1)) which converges to 1/w as k \to \infty with w fixed, leading to an asymptotically optimal scheme. A plot of the resulting density is available in Figure 15.1.

The scheme has the benefit of being very simple and efficient: it only requires computing a sliding minimum (which we discussed in Chapter 8) over short t-mers and a single modulo afterwards, making it a near drop-in replacement for random minimizers. Moreover, the modulo operation, which is usually quite expensive, can be accelerated using a “fast mod” (Lemire et al., 2019) since w is fixed. It should be noted, however, that this optimization does not apply as well on SIMD lanes of fixed size which cannot be widened. Practical experiments with SSHash (Pibiri, 2022) showed that replacing random minimizers by mod-minimizers resulted in a 15% space gain for w = 11 and k = 21, without degrading the speed (Groot Koerkamp & Pibiri, 2024).

15.2.2 Open-closed mod-minimizers

One limitation of mod-minimizers is that they do not offer any improvement over random minimizers when k < w. To address this regime, Groot Koerkamp et al. (2025) introduced the open-closed minimizers, building on the notion of open and closed syncmers (presented in Chapter 4). Their idea is to first look for the smallest open syncmer in the window; if none exists, fall back to the smallest closed syncmer, and default to the globally smallest k‑mer if neither is present. While the density of this scheme has no known closed form, it is similar to double decycling sets for k < w with a small shift due to the inner s parameter of syncmers, and improves over them for w \le k \le 2w.

Groot Koerkamp et al. (2025) then proposed to combine this new scheme with the modulo sampling technique of mod-minimizers, thus resulting in open-closed mod-minimizers: t‑mers are sampled using the open-closed minimizer as the inner scheme, and their position is mapped back to a k‑mer modulo w. This merges the density gains of both approaches and achieves low density across the full range of k, as illustrated in Figure 15.1.

15.3 Lower bounds for forward local schemes

While the density of any scheme with a window guarantee can be trivially lower bounded by 1/w, all of the improved schemes discussed previously remained quite far from it for small values of k, suggesting that a density of 1/w could simply be unreachable in this region.

Marçais et al. (2018) established a first non-trivial lower bound for forward local schemes, that is, schemes that select a k‑mer only based on the content of the current window (local) and have an increasing sequence of selected positions (forward). This bound was then tightened by Groot Koerkamp & Pibiri (2024), and later refined as a near-tight lower bound by Kille et al. (2024): any forward local scheme must satisfy d \ge \frac{\left\lceil\frac{w + k}{w}\right\rceil}{w + k} The key insight behind this result is to consider cyclic strings of length exactly w + k: on such a cycle, the window guarantee forces at least \lceil(w+k)/w\rceil positions to be sampled, so the density per cycle is at least \lceil(w+k)/w\rceil/(w+k). Since the density of any forward scheme equals its average density over all such cycles, the bound follows.

Moreover, Kille et al. (2024) noted that this bound peaks at values k \equiv 1 \mod w and can be smoothed into a monotone version since the optimal density decreases with k²: d \ge \max\left( \frac{ \left\lceil\frac{w + k}{w}\right\rceil }{w+k}, \frac{ \left\lceil\frac{w + k^\prime}{w}\right\rceil }{w+k^\prime} \right) where k^\prime = k + ((1 - k) \bmod w) is the smallest integer \ge k such that k^\prime \equiv 1 \mod w. Figure 15.1 illustrates this lower bound along with the density of the schemes presented above.

Figure 15.1: Plot of the density of multiple minimizer schemes, along with the lower bound for forward local schemes from (Kille *et al.*, 2024).

15.4 Other variants

Note

As these approaches were published recently, they were not included in the following chapters.

15.4.1 Finimizers

Finimizers (Alanko et al., 2025) are variable-length minimizers that ensure that the frequency of each selected k‑mer stays below a given threshold. By avoiding very frequent k‑mers, they reduce biases in downstream analyses, at the cost of a non-uniform k‑mer length.

15.4.2 GreedyMini

GreedyMini (Golan et al., 2025) constructs a minimizer scheme through a greedy brute-force search. It iteratively selects the k‑mer whose designation as the overall minimum would least often place it first or last in a context, thereby minimizing the number of charged contexts and hence the density. The resulting schemes achieve density very close to the lower bound and are optimal for some small parameter values when k \equiv 1 \mod{w}. However, this approach is not scalable for k > 15 as the construction time increases exponentially with k.

15.4.3 10-minimizers

10-minimizers (Shur et al., 2026) are defined by an order that systematically prioritizes k‑mers starting with the binary pattern 10 (named 10-k‑mers), the idea being that they tend to be spaced further apart because the pattern cannot overlap itself. This translates into an expected density approximately 2/(w+2), slightly lower than the 2/(w+1) of random minimizers. Shur et al. (2026) then refine this construction into spacers by ranking 10-k‑mers to maximize the gap to the next selected position, achieving a competitive density.

15.4.4 Anti-lexicographic SUS-anchors

SUS-anchors (Groot Koerkamp, 2026) are a scheme designed for k = 1, or selection scheme, that samples the start position of the smallest unique substring (SUS) in the window, i.e. the shortest substring that occurs only once within the window and is smallest according to a given order. This scheme can be combined with an anti-lexicographic order, which compares the first character normally but reverses the order of all subsequent ones, to avoid over-sampling positions in long runs of the same character and push anchors further apart. Empirically, anti-lexicographic sus-anchors achieve density within 1% of the lower bound for selection schemes on the DNA alphabet.

15.5 Toward multi-hash schemes

All schemes surveyed in this chapter share a common design: they improve density by choosing which ordering to impose on k‑mers, so that the minimum of each window falls as infrequently as possible. In the next chapter we explore a different strategy: we ask whether using multiple independent hash functions simultaneously can push density below what a single scheme can achieve.

Using r = 4 seems to be a good empirical choice.↩︎
A scheme for k can always be turned into a scheme for k' > k by ignoring the last k' - k characters of each window.↩︎