16 Multiminimizers – Algorithm design and implementation for the scale of sequencing data

16.1 A link between density and the expected distance between selected positions

We start by formalizing the link between density and distances between consecutive selected positions, under minimal assumptions on these distances. Specializing to random minimizers, we argue that the standard modeling assumption is unsatisfactory and propose a more realistic one that coincides with the classical regime for large minimizers, verifying empirically that it satisfies our minimal condition.

Different notations

We reuse the notations from Section 4.3: the minimizer size is denoted m while k denotes the number of bases in a window, i.e. k = w + m - 1.

We follow the notations of Chapter 4 and denote \mathcal{O}_m for a total order on m‑mers and \texttt{rank}_{\mathcal{O}_m} : \Sigma^m \to \llbracket 1, \sigma^m \rrbracket the corresponding rank function. In practice, \mathcal{O}_m is induced by a random hash function with U = 2^{64}, which we treat as collision-free (Pibiri et al., 2023). We denote by \mathcal{S}_f(S) the set of selected indices of a string S under a scheme f, and by d(S) = |\mathcal{S}_f(S)| / (|S| - k + 1) its particular density. The expected density d is the limit of \mathbb{E}[d(S)] over random strings of growing length. The definitions below are stated in a non-canonical context, though they extend easily. The practical experiments are based on canonical k‑mers, but our implementations cover both contexts.

Consider the expected number of selected positions in a window of size w, equal to d \cdot w by definition. Intuitively, this quantity is also equal to w divided by the expected distance between two selected consecutive positions, which we denote here by \mu. Indeed, if we select a position every \mu bases, we expect to choose w/\mu positions on average in a window. Hence, with dw \approx w/\mu, we expect d \approx \frac{1}{\mu}. This intuitive link has already been noticed in the literature (Schleimer et al., 2003; Zheng et al., 2021; Martayan et al., 2025), but never, to the best of our knowledge, explicitly formalized as such, nor proved for any scheme. The purpose of this section is to formally establish this observation as a theorem, namely Theorem 16.1, under minimal assumptions.

The sequence S is represented by a vector of integers \mathbf{R} = R_1, R_2, \ldots, where R_i = \texttt{rank}_{\mathcal{O}_m}(S[i:i+m]) \in \llbracket 1, \sigma^m \rrbracket is the rank of the i-th m‑mer of S, according to the order \mathcal{O}_m. A second integer vector, \mathbf{P} = P_1, P_2, \ldots, describes the positions of the selected minimizers when sliding a window of size w, i.e. P_i = \min \{ i \leq j \leq i+w-1 : R_j = \min(R_i, \ldots, R_{i+w-1}) \}.

Note that \mathbf{P} can still be defined no matter how minimizers are chosen, or more generally, a position, in a window. Since consecutive windows may choose the same position as a minimizer, consecutive values of P_i might be equal. It leads us to define the subsequence \mathbf{P}^\ast of distinct values, i.e. P_1^\ast = P_1, P_2^\ast is the first P_i value distinct from P_1^\ast, P_3^\ast is the first P_i value distinct from P_1^\ast and P_2^\ast, and so on. In other words, \mathbf{P}^\ast is the sequence corresponding to the set \mathcal{S}_f(S) – i.e. the set of selected indices. Finally, we define \boldsymbol{\Delta} = \Delta_1, \Delta_2, \ldots as the sequence of distances between selected positions, i.e. \Delta_1 = P_1^\ast, and \Delta_i = P_i^\ast - P_{i-1}^\ast for i \geq 2. An example is provided in Figure 16.1.

Figure 16.1: An example of sequence S, represented as a sequence \mathbf{R} of integers (here, using lexicographical order), and the sequence \mathbf{P} of minimizers positions, with m=3 and w=4. We have \mathbf{P}^\ast = 3, 7, \ldots and \boldsymbol{\Delta} = 3, 4, \ldots.

We put ourselves in the context where S is a random sequence, so that \mathbf{R}, \mathbf{P}, \mathbf{P}^\ast and \boldsymbol{\Delta} can be considered as vectors of random variables. Our goal is to establish a link between d and the expected distance between selected positions, i.e., between d and the \mathbb{E}[\Delta_i]’s. We obtain the following result.

Theorem 16.1 (Density-distance equivalence) Let \mu > 0. Consider \varepsilon_i = \mathbb{E}[\Delta_i] - \mu as some realization of an underlying random variable \varepsilon with \mathbb{E}[\varepsilon] = 0. Then, we have d \cdot \mu = 1.

Proof. The proof is deferred to Section D.1.

Theorem 16.1 states that if we assume that the distances between consecutive selected positions of minimizers are somehow equally distributed, then their expected value is exactly 1/d. This assumption is the minimum viable hypothesis for establishing the result (since \mu must be well defined), and it does not depend on the process by which the R_i’s values are obtained, nor how the positions of the minimizers P_1, P_2, \ldots are selected in successive windows.

At this point, we have not made any assumptions on the random vectors \mathbf{R}, \mathbf{P} and \mathbf{P}^\ast. It is not trivial to formally verify whether the assumption regarding the \mathbb{E}[\Delta_i]’s is verified or not, as the variables are most likely not to be independent nor identically distributed. However, in practice, we can verify that the assumption holds. In the remainder of this section, we investigate how the assumption from Theorem 16.1 concerning the \Delta_i’s holds for random minimizers. The standard hypothesis that has been made in the literature regarding their choice is the following:

Hypothesis 1 Every m‑mer in a window of length w has an equal probability of 1/w of being the smallest m‑mer.

As discussed in (Marçais et al., 2017), while “not strictly true in practice”, this hypothesis is reasonable and reflects reality accurately “when using a randomized ordering”. However, the formulation is unfortunate because it does not take into account the dependency between windows. If we consider a window (R_i, \ldots, R_{i+w-1}) outside of its context, then the hypothesis can be reformulated as P_i \sim \mathcal{U}(\llbracket 1, w \rrbracket). Unfortunately, since the same minimizer will likely be selected by several successive windows, P_i cannot be considered independent, nor even identically distributed (since P_i = P_{i-1} with a non-zero probability). Here, we propose starting from a more elementary hypothesis concerning the R_i’s. We then study, empirically and theoretically, the behavior of the derived variables \mathbf{P}, \mathbf{P}^\ast and \boldsymbol{\Delta}, to assess the gap between Hypothesis 1 and reality, and to test the expectation hypothesis on the \Delta_i’s. Our working assumption is the following.

Hypothesis 2 The random variables R_1, R_2, \ldots are i.i.d., uniformly distributed over \llbracket 1, \sigma^m \rrbracket.

Remark 16.1. Assuming a perfect hash function when computing the random order, our assumption states that the probability of collision between R_i’s is 1/\sigma^m, which is consistent with (Zheng et al., 2021, Lemma S7), that establishes that two distinct m‑mers are identical in a random sequence with probability 1/\sigma^m.

It is not too difficult to establish the law of P_1 = P_1^\ast = \Delta_1 from Hypothesis 2:

Proposition 16.1 (Uniform distribution of P_1) Under Hypothesis 2, \forall 1 \leq i \leq w, \mathbb{P}(P_1 = i) = \frac{1}{w} + \left(\frac{w-i}{w-1} - \frac{1}{2}\right) \cdot \frac{1}{\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right). It follows that P_1 \xrightarrow{a.s.} \mathcal{U}(\llbracket 1, w \rrbracket) when m \to \infty.

Proof. The proof is deferred to Section D.2.

In Section D.3, we observe that the limit distribution is reached very quickly, for fairly small values of m (m=8 in our case). This implies that for all values of m likely to be used in practice, Hypothesis 1 holds for P_1, and \Delta_1 also follows a uniform distribution, meaning that \mathbb{E}[\Delta_1] = \frac{w+1}{2}. If, indeed, all subsequent \Delta_i’s have the same expectation, then, using Theorem 16.1, we retrieve the well-known density of random minimizers, d = \frac{2}{w+1} (Schleimer et al., 2003; Roberts et al., 2004).

Unfortunately, formally establishing the law of further variables becomes rapidly intractable due to the dependencies between them. To complete the study of subsequent \mathbb{E}[\Delta_i]’s, we resort to Monte-Carlo simulations of our probabilistic model and obtain the results depicted in Section D.4. From our results, the assumption that \mathbb{E}[\varepsilon] = 0 is empirically verified (numerically, we get \mathbb{E}[\varepsilon] \approx 0.0060), and the \mathbb{E}[\Delta_i]’s are indeed distributed around \mu = \frac{w+1}{2} = 5.5 (using w=10); see supplementary Figure D.2 (a). We obtain numerically a density of 0.1808, i.e. 0.55% of error compared to the theoretical value of 0.\overline{18}. This empirical result highlights that Hypothesis 2 is indeed a proper assumption to work with, and also that the assumption of Theorem 16.1 is reasonable, since it is verified in practice.

16.2 Multiminimizers: trading time for space

We now introduce a practical family of meta schemes called multiminimizers, that assign a bounded set of candidate minimizers and choose among them to reduce the frequency of selected positions.

Considering Theorem 16.1, obtaining a low density scheme on a sequence S requires having a scheme that selects positions as far as possible from one another in S, while still covering all k‑mers from S. In this regard, the best possible minimizer scheme would select m‑mers at a distance of k - m + 1 bases from each other. However, without a way to retrieve which m‑mer was chosen for a specific k‑mer, querying a k‑mer would require checking if any of its k - m + 1 m‑mers was selected as its minimizer. The same problem arises when inserting a k‑mer into a database: one must perform k - m + 1 checks to verify whether it is already present or not.

To reduce this prohibitive cost, we propose a solution based on super‑k‑mers. As defined in Section 4.3, a super‑k‑mer is a maximal sequence of consecutive k‑mers sharing the same minimizer. Our technique starts by generating N distinct hash functions (e.g. using N distinct random seeds). Then, we use these N hash functions to generate N distinct random minimizer schemes. Each of these minimizer schemes yields a set of super‑k‑mers on the sequence S, each covering all k‑mers of S. When iterating over the minimizers of a sequence, among the super‑k‑mers that cover the first k‑mer, we select the one that ends the farthest in the sequence. Then, we repeat this selection at the end of each selected super‑k‑mer: among all the super‑k‑mers covering the first uncovered k‑mer, we select the one that ends the farthest in the sequence (see Figure 16.2). We detail our method in Algorithm 16.1, running in \mathcal{O}(N \cdot |S|) time and using \mathcal{O}(N \cdot w) memory.

Figure 16.2: Choice of a super‑k‑mer for k=6 using a multiminimizer scheme. The first k‑mer uncovered by the previous super‑k‑mer is highlighted in red. Among the N=4 candidate super‑k‑mers covering this k‑mer, the multiminimizer scheme selects the one ending the farthest in the sequence (i.e. farthest right of the blue line).

Algorithm 16.1: MultiMinimizers algorithm.

\begin{algorithmic} \Function{MultiMinimizers}{$T, w, m, \mathcal{S}_0, \ldots, \mathcal{S}_{N-1}$} \Comment{$T$: text; $\mathcal{S}_0,\ldots,\mathcal{S}_{N-1}$: $N$ minimizer schemes.} \State{Initialize $\texttt{buf}$ of size $N \times w$ and $\texttt{run}$ of size $N$} \State{$k \gets w + m - 1$} \State{$i^* \gets -w$} \Comment{Starting position of the current super-k-mer.} \State{$j^* \gets 0$} \Comment{$i^* \bmod w$.} \State{$h^* \gets 0$} \Comment{Selected minimizer scheme.} \State{$\texttt{run}[h^*] \gets 0$} \For{$0 \leq i < |T| - k + 1$} \If{$\texttt{run}[h^*] = 0$} \Comment{The minimizer of the selected scheme has changed.} \State{$j \gets i \bmod w$} \State{$r \gets i - i^*$} \Comment{Number of outdated values in the buffer.} \For{$0 \leq h < N$} \For{$i^* + w \leq p < \min(i^* + w + r,\; |T| - k + 1)$} \State{$\texttt{buf}[h][p \bmod w] \gets \mathcal{S}_h(T[p : p+k])$} \EndFor \State{$\texttt{run}[h] \gets 1$} \While{$\texttt{run}[h] < w$ \textbf{and} $\texttt{buf}[h][(j + \texttt{run}[h]) \bmod w] = \texttt{buf}[h][j]$} \State{$\texttt{run}[h] \gets \texttt{run}[h] + 1$} \EndWhile \EndFor \State{$i^*, j^* \gets i, j$} \State{$h^* \gets \arg\max\; \texttt{run}$} \EndIf \State{$\texttt{run}[h^*] \gets \texttt{run}[h^*] - 1$} \State{\textbf{yield} $\texttt{buf}[h^*][j^*]$} \EndFor \EndFunction \end{algorithmic}

As mentioned in the introduction, a local scheme selects a m‑mer in a k‑mer solely based on the k‑mer itself. This ensures covering all k‑mers, or equivalently, all windows of w = k - m + 1 m‑mers (Marçais et al., 2018). Formally, a local scheme is a function f : \Sigma^k \to \llbracket 1, w \rrbracket. Importantly, the local scheme is both used at construction (e.g. when partitioning k‑mers based on their minimizer) and query (e.g. to determine in which bucket a k‑mer might be present). For a given k‑mer x \in \Sigma^k, the position f(x) is uniquely determined.

While minimizers are local schemes, multiminimizers are not, as they differ drastically in their approach. At construction, we need to know when the previous super‑k‑mer ends, and where the candidate super‑k‑mers end, information that a k‑mer alone cannot provide. In this regard, the multiminimizer scheme has both to “remember the past” to determine when to select a new minimizer, and “delve into the future” to choose which minimizer candidate to keep. At query, for a given k‑mer, we have to compute and look up the N minimizer candidates to determine whether it has already been seen, since the broader context(s) in which it might have been seen are not known.

Ultimately, the multiminimizer scheme does not behave like a local scheme, either at construction or at query time. It is advantageous as it implies that its density is not subject to the lower bound of local schemes by Kille et al. (2024), although the 1/w bound still applies. As showed in upcoming Section 16.4.1, the multiminimizer scheme does indeed beat this lower bound, and approach a density of 1/w as N increases. This gain in density comes at the price of (bounded) increased query time, as discussed above.

Finally, note that the multiminimizer construction is not limited to random minimizer schemes, but could be used with any combination of N distinct local schemes, whether by varying parameters or using concurrent schemes; in that sense, multiminimizers are a sort of meta scheme. To illustrate this point, in upcoming Section 16.4, we investigate two implementations of the multiminimizer construction, one with random minimizers, and the other with open-closed mod-minimizers (Groot Koerkamp et al., 2025).

16.3 On deduplicated density

It appears the standard density possesses a “twin” concept, that we call deduplicated density (which we denote by d^\ast), and that we define as the fraction of distinct minimizers needed to cover all k‑mers of a set of sequences (or, equivalently, a set of k‑mers), instead of the fraction of positions selected along a sequence. The deduplicated density does not appear to have been studied before, even if it is implicitly linked to filtering tasks, such as the ones tackled by Needle (Darvish et al., 2022).

Definition 16.1 (Minimizer Cover) Let \mathcal{X} = \{x_1, x_2, \ldots\} be a set of (distinct) k‑mers. The minimizer cover of \mathcal{X}, according to some minimizer scheme f, that we denote by \operatorname{MinCover}_f(\mathcal{X}), is defined as \operatorname{MinCover}_f(\mathcal{X}) = \{ \min_f(x) : x \in \mathcal{X} \}, where \min_f(x) designates the minimizer of x according to f.

Analogously to standard density, we can define a particular deduplicated density d^\ast(\mathcal{X}) and obtain simple bounds, mimicking the ones we already have on standard particular density: \frac{1}{|\mathcal{X}|} \leq d^\ast(\mathcal{X}) = \frac{|\operatorname{MinCover}_f(\mathcal{X})|}{|\mathcal{X}|} \leq 1.

Standard density is computed as the limit of the expected particular density \mathbb{E}[d(S)] on a random sequence S when |S| \to \infty, for fixed values of k and m. However, one cannot have an arbitrarily large set \mathcal{X} since |\mathcal{X}| \leq \sigma^k; therefore, we propose to define the deduplicated density simply as the expected particular density of a random set of k‑mers \mathcal{X}, without any notion of limit.

Definition 16.2 (Deduplicated density) The (expected) deduplicated density of a minimizer scheme f is formally defined as d^\ast = \mathbb{E}_{\mathcal{X}}[d^\ast(\mathcal{X})] = \frac{1}{2^{\sigma^k} - 1} \sum_{\mathcal{X} \in \mathcal{P}(\Sigma^k) \setminus \emptyset} d^\ast(\mathcal{X}) where \mathcal{P}(\Sigma^k) designates the powerset of \Sigma^k.

Noticing that 1/|\mathcal{X}| \geq 1/\sigma^k, we obtain the following trivial bounds for d^\ast.

Lemma 16.1 (Bounds on deduplicated density) For any minimizer scheme, we have 1/\sigma^k \leq d^\ast \leq 1.

Then, we have the following result.

Proposition 16.2 (Deduplicated density of random minimizers) Under Hypothesis 2, the deduplicated density of the random minimizer scheme is given by the following, where w = k - m + 1: d^\ast = \frac{1}{2^{\sigma^k} - 1} \sum_{n=1}^{\sigma^k} \binom{\sigma^k}{n} \frac{\sigma^m}{n} \left[1 - \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m} \left[1 - \left(\frac{r}{\sigma^m}\right)^w + \left(\frac{r-1}{\sigma^m}\right)^w\right]^n\right].

Proof. The proof is deferred to Section D.5.

The dependence on \sigma^k makes it intractable to compute d^\ast when k is large. We nonetheless computed the first values, shown in Figure 16.3 (left); one can see that d^\ast seems to be driven by the value of w rather than k, hence we conjecture that d^\ast might only depend on w. Besides, it seems that d^\ast < d holds. To access higher values, we resorted to Monte-Carlo simulations, following the same probabilistic model as before, using Hypothesis 2. We observed how the standard density d and the deduplicated density d^\ast evolved over the course of parsing a random sequence, as shown in Figure 16.3 (right). While both densities seem to be equal for the first few thousand windows, they rapidly diverge. We looked into the data and observed that the number of distinct k‑mers and distinct windows observed are equal (it is not surprising since the probability of collision of k‑mers is negligible), so the difference between the densities is entirely explained by the discrepancy between the number of selected positions and the number of distinct minimizers. While those values are equal at first, soon we obtain repeated minimizers¹, hence the drop in d^\ast.

(a) Theoretical values of d^\ast for 2 \leq k \leq 6 as a function of 1 \leq w \leq k, obtained with the formula of Proposition 16.2.

There are obviously many more aspects of deduplicated density to be studied, but they are beyond the scope of this work. To conclude this section, we propose to consider applying multiminimizers to deduplicated density.

Definition 16.3 (Multiminimizer Cover) Let \mathcal{O}_m^1, \ldots, \mathcal{O}_m^N be N orders on m‑mers, and let \mathcal{X} = \{x_1, x_2, \ldots\} be a set of (distinct) k‑mers. A set \mathcal{Y} = \{y_1, y_2, \ldots\} of m‑mers is said to be a multiminimizer cover of \mathcal{X} if and only if, for any x \in \mathcal{X}, there exists y \in \mathcal{Y} and 1 \leq i \leq N, so that \min_i(x) = y, where \min_i(x) designates the smallest m‑mer of x according to the order \mathcal{O}_m^i. In other words, for each k‑mer x \in \mathcal{X}, at least one of its N minimizer candidates belongs to \mathcal{Y}.

The natural extension of the deduplicated density in this context would be d^\ast = \frac{|\mathcal{Y}|}{|\mathcal{X}|}, with the same trivial bounds as before. Contrary to \operatorname{MinCover}_f(\mathcal{X}), we have here a choice of which candidate to choose as a minimizer, and therefore a chance at minimizing d^\ast by making the appropriate choices. Therefore, this leads to considering the following problem.

Problem 1 (MultiMinCover) Provided a set \mathcal{X} of k‑mers and N orders on m‑mers \mathcal{O}_m^1, \ldots, \mathcal{O}_m^N, find a set \mathcal{Y} of m‑mers that is a multiminimizer cover of \mathcal{X} and of minimum cardinality |\mathcal{Y}|.

This problem is closely related to the one of MinimumHittingSet: provided an integer k, and a set \mathcal{X} of k‑mers, find a set \mathcal{Y} of m‑mers (called a hitting set) such that each k‑mer of \mathcal{X} contains at least one m‑mer from \mathcal{Y}, and of minimum cardinality. This problem is known to be NP-hard (Orenstein et al., 2016, 2017). Here the problem is slightly different, as we do not ask the elements of \mathcal{Y} to simply appear in the k‑mers of \mathcal{X}, but also to be their minimizers for at least one of the orders. This problem also turns out to be difficult, as the following result shows.

Theorem 16.2 (NP-completeness of MultiMinCover) MultiMinCover is NP-complete (for \sigma \geq 3 and N \geq 2).

Proof. The proof is deferred to Section D.6.

Since minimizing |\mathcal{Y}| is difficult, we must resort to heuristics. Furthermore, when parsing long sequences, it would be prohibitively expensive to keep track of all the k‑mers encountered and their candidate minimizers before starting to build the multiminimizer cover. For this reason, we have developed a heuristic that makes decisions throughout the parsing of a sequence, based solely on the local context, that is presented in Section 16.4.3.

16.4 Results

In this section, we evaluate the performance of two implementations of the multiminimizer technique: an iterator over random-minimizers-based multiminimizers, and a proof of concept of multiminimizers using open-closed mod-minimizers (Groot Koerkamp et al., 2025). Both are written in the Rust programming language, and freely available at github.com/lrobidou/multiminimizers. We propose results for canonical k‑mers, but our implementation also supports non-canonical k‑mers.

Density results are provided in Section 16.4.1, whereas the space usage is investigated in Section 16.4.2 and the conservation of sampled multiminimizers is studied in Section D.8. Regarding running time, note that the multiminimizer scheme must compute all candidate minimizers, so iteration over multiminimizers is linear in the number of hash functions. We measured the time required to iterate over a random sequence and confirmed that our implementation scales linearly with the number of hash functions, while remaining very fast as shown in Section D.9. Finally, we provide some results in Section 16.4.3 on multiminimizers applied to filtering approaches.

16.4.1 Density

As discussed in Section 16.2, the multiminimizer construction aims at reducing the density of any local scheme. We expect to reduce the density when increasing the number of hash functions. The intuition is that increasing the number of super‑k‑mer candidates increases the chance of finding a super‑k‑mer that ends further than the others, thus increasing the distance between consecutive selected positions.

To show this density reduction, we generated a random sequence (5M bases) and computed the density of our scheme on it, for 1 to 8 hash functions, using w=15 and varying m (thus k). We observed that the higher the number of hash functions, the lower the density, which allows us to reach densities lower than any other scheme (see Figure 16.4 (a)). We are not aware of any prior iterator over minimizers with a density lower than the lower bound introduced by (Kille et al., 2024). Figures for additional m and w values are available in Section D.7.

Although multiminimizers are not a local scheme, we have observed that the empirical density \hat{d} and the empirical average distance between selected positions \hat{\mu} were indeed related as \hat{d} \approx 1/\hat{\mu}, hence showing that Theorem 16.1 might apply beyond the scope of local schemes.

To show that the multiminimizer construction can be applied to other schemes than random minimizers, we applied it to the open-closed mod-minimizer (Groot Koerkamp et al., 2025) to obtain a proof-of-concept of a multi open-closed mod-minimizer scheme (MOCMM for short). Figure 16.4 (b) shows that this novel scheme achieves a lower density than the multiminimizers on a random sequence. At the time of writing, the library we used to compute the minimizer scheme does not support the open-closed mod-minimizer scheme. Therefore, we used a non-SIMD implementation, making it much slower. For this reason, we could only reasonably run it on a random sequence of 5 \times 10^5 bases.

In both Figure 16.4 (a) and Figure 16.4 (b), one can observe that the density approaches the theoretical lower bound 1/w. This convergence seems intuitive since, when the number of hash functions becomes large enough, all m‑mers of a k‑mer are likely to be chosen as the minimizer candidate for at least one of the hash functions, and therefore the multiminimizer scheme tends to become optimal.

16.4.2 Super‑k‑mers and hyper‑k‑mers space usage

Since the space usage of a super‑k‑mer representation of a sequence is driven by the density of the underlying scheme (Martayan et al., 2025, Thm. 1), we can translate a density reduction to a linear super‑k‑mer representation. Figure 16.5 (a) shows the space usage of a super‑k‑mer representation of multiminimizers, whereas Figure 16.5 (b) provides the same information for MOCMMs, both for multiple numbers of hash functions and k values.

Ideally, to represent a sequence over an alphabet of size 4 (e.g., a DNA sequence S), one could use only 2 bits / k‑mer in S. However, super‑k‑mers use as low as 6 bits / k‑mer in S. hyper‑k‑mers (Martayan et al., 2025) were recently introduced and tend toward 4 bits / k‑mer in S when using a random-minimizer-based hyper‑k‑mer representation. This is still higher than the lower bound of 2 bits / k‑mer, due to random minimizers having a density of \frac{2}{w+1}.

Since our multiminimizer scheme approaches a density of \frac{1}{w}, using a multiminimizer scheme could allow hyper‑k‑mers to tend toward 2 bits / k‑mer in the input sequence. To test this hypothesis, we replaced the iterator over minimizers of KFC (a hyper‑k‑mer-based k‑mer counter introduced in (Martayan et al., 2025)) by our iterator over multiminimizers and report the result in Figure 16.5 (c). Note that KFC is optimized for large k values (e.g. > 60), so we tested our hypothesis in this range. Our results show that KFC converges to 2 bits / k‑mer in S, which we believe to be the first streaming k‑mer representation to do so.

(a) Super‑k‑mer representation + multiminimizers.

16.4.3 Pin: a multiminimizer approach to Needle-like applications

We developed a simple proof-of-concept minimizer index to demonstrate the utility of the multiminimizers approach for Needle-like filtering stages, in a prototype dubbed Pin (github.com/Malfoy/Pin). Such an index stores a set of minimizers that collectively cover all k‑mers of interest, enabling efficient filtering as an indexed minimizer is a necessary condition for an indexed k‑mer, and with sufficiently large minimizers, passing the filter already implies substantial sequence similarity. Our prototype relies on exact hash tables and handles a single uncolored set; the same idea can be readily adapted to colored or approximate index structures. During construction, we test the presence of k‑mers and then greedily select minimizers to cover any uncovered sequences. We evaluate the approach on the Human HiFi accession SRR11292123 (~24 Gb) in Figure 16.6, and observe a smaller index at a moderate cost in build and query time. Merely switching from one to two hash functions yields an index about 20% smaller, with roughly a 20% increase in construction time and about 85% in query time.

Figure 16.6: Relative performances of the minimizer dictionary Pin using multiminimizers, comparing N hash functions vs N=1. The x-axis depicts the relative index size (in terms of the number of minimizers), and the y-axis provides relative construction (left) and query (right) times. As an indicator, for N=1, there are \approx 850 \times 10^6 minimizers, construction time is 234s, and query time is about 0.04s.

16.5 Conclusion

Multiminimizers trade a bounded amount of query time for lower density. The construction is agnostic to the underlying scheme: we instantiated it with random minimizers and with open-closed mod-minimizers, and in both cases the density drops below the forward-scheme lower bound of (Kille et al., 2024) as the number of hash functions grows, approaching 1/w. The gain carries over to downstream representations. Plugging our iterator into KFC brings hyper‑k‑mer storage close to 2 bits / k‑mer, the first streaming k‑mer representation to reach this regime as far as we know. The Pin prototype shows the same pattern for filtering indexes, where even N=2 already cuts index size by about 20% at modest cost.

The chapter also developed two pieces of theory that stand on their own. The density-distance equivalence of Theorem 16.1 gives a way to reason about density under a single, minimal assumption on the expected spacing of selected positions, and it appears to hold empirically even outside the local-scheme setting where we proved it. The deduplicated density d^\ast emerged as a natural twin to standard density when the object of interest is a k‑mer set rather than a sequence. We gave an analytical expression in the random-minimizer case, observed that it diverges from d on long sequences, and showed that minimizing it under the multiminimizer model is NP-complete.

Several questions remain open. The theoretical density of the multiminimizer scheme is still to be worked out, ideally as a function of N and of the densities of the base schemes; conservation properties also lack a formal treatment. The complexity of minimizing standard density globally is still unknown, despite the NP-completeness result for its deduplicated counterpart. Extending deduplicated density to other scheme families, and simplifying its expression, would help clarify how the two notions relate. Better local or global heuristics for MultiMinCover seem worth pursuing.

On the practical side, the MOCMM proof of concept already shows that pairing multiminimizers with other low-density schemes is straightforward; a SIMD-friendly implementation is the natural next step. Replacing existing iterators with our Rust library is also a low-effort way to test whether the gains we saw on hyper‑k‑mer-based counting extend to other minimizer-based tools.