Appendix C — Multiminimizers

C.1 Proof of Theorem 16.1

We start by rewriting the density formula as \begin{equation} d = \lim_{M\to\infty} \frac{1}{\sigma^M} \sum_{S \in \Sigma^M} \frac{|\mathcal{S}_f(S)|}{|S|-k+1} = \lim_{M\to\infty} \frac{\mathbb{E}[|\mathcal{S}_f(S)|]}{M-k+1} \end{equation} where \mathbb{E}[|\mathcal{S}_f(S)|] denotes the expected number of selected positions in a (uniform) random sequence S of length M. Suppose we extend the sequence S beyond its first M characters and continue selecting minimizers. We define \tau_M = \inf\{n \in \mathbb{N} : P_n^\ast > M\}, the random variable counting how many minimizers are selected within S; hence \tau_M = |\mathcal{S}_f(S)|. Therefore, we can rewrite the above as d = \lim_{M\to\infty} \frac{\mathbb{E}[\tau_M]}{M-k+1}.

Let S_n = \sum_{i=1}^n \Delta_i. Notice that, by definition of the \Delta_i’s, S_n = P_n^\ast; then \tau_M = \inf\{n \in \mathbb{N} : S_n > M\} is a stopping time with respect to the \sigma-algebra generated by the \Delta_i’s. Using Wald’s equation (Wald, 1944), we get

\mathbb{E}[S_{\tau_M}] = \mathbb{E}\!\left[\sum_{i=1}^{\tau_M} \Delta_i\right] = \mathbb{E}\!\left[\sum_{i=1}^{\tau_M} (\mu + \varepsilon_i)\right] = \mathbb{E}[\tau_M] \cdot \mu + \mathbb{E}\!\left[\sum_{i=1}^{\tau_M} \varepsilon_i\right].

Applying Wald's equation again to the right-hand term, we obtain

\mathbb{E}[S_{\tau_M}] = \mathbb{E}[\tau_M] \cdot \mu + \underbrace{\mathbb{E}[\tau_M] \cdot \mathbb{E}[\varepsilon]}_{= 0}.

By definition of S_n, \tau_M and the \Delta_i’s, notice that M \leq S_{\tau_M} \leq M + w - 1. Therefore \frac{M}{M-k+1} \leq \frac{\mathbb{E}[\tau_M]}{M-k+1} \cdot \mu \leq \frac{M+w-1}{M-k+1}, and, taking the limit as M \to \infty, both bounds tend to 1, so d \cdot \mu = 1, i.e. d = 1/\mu, which establishes our result. \square
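The limiting density can be checked numerically. The following sketch (illustrative names and parameters, not the chapter's implementation) approximates the i.i.d. uniform m‑mer ranks of Hypothesis 16.2 by i.i.d. Uniform(0,1) values, under which the \sigma^{-m} correction vanishes, and verifies that the empirical density approaches 1/\mu = 2/(w+1), up to O(1/M) edge effects.

```python
import random

def minimizer_density(M, w, rng):
    """Empirical density: distinct minimizer positions over number of windows.

    The m-mer ranks are modeled as i.i.d. Uniform(0, 1) values, which
    approximates a uniformly random order as sigma^m grows to infinity.
    """
    R = [rng.random() for _ in range(M)]
    selected = set()
    for i in range(M - w + 1):
        # Leftmost position of the minimum of the window R[i .. i+w-1].
        selected.add(min(range(i, i + w), key=lambda t: R[t]))
    return len(selected) / (M - w + 1)

rng = random.Random(0)
estimate = sum(minimizer_density(5000, 10, rng) for _ in range(10)) / 10
# The theorem predicts d = 1/mu = 2/(w+1), i.e. about 0.1818 for w = 10.
```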

C.2 Proof of Proposition 16.1

We begin with the following lemma.

Lemma C.1 (Trapezoid rule) For any twice continuously differentiable function f: [0,1] \to \mathbb{R}, we have \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m - 1} f\!\left(\frac{r}{\sigma^m}\right) = \int_0^1 f(x)\, dx - \frac{f(1) + f(0)}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right).

Proof. This is a direct application of the trapezoid rule for numerical integration. Defining E = \int_0^1 f(x)\, dx - \frac{1}{\sigma^m}\!\left(\frac{f(1)+f(0)}{2} + \sum_{r=1}^{\sigma^m-1} f\!\left(\frac{r}{\sigma^m}\right)\right), we have E = -\frac{1}{12\sigma^{2m}} f''(\eta) for some \eta \in [0,1] (Atkinson, 2008, eq. (5.1.17)). \square
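As a quick sanity check of the error term, take f(x) = x^3 on [0,1]: the composite trapezoid error is exactly -(h^2/12)\,[f'(1) - f'(0)] = -1/(4n^2), since the higher-order Euler-Maclaurin terms vanish (f''' is constant). A minimal sketch, with illustrative names:

```python
def trapezoid(f, n):
    # Composite trapezoid rule on [0, 1] with n equal subintervals.
    h = 1.0 / n
    return h * (0.5 * f(0.0) + 0.5 * f(1.0) + sum(f(r * h) for r in range(1, n)))

# For f(x) = x^3, int_0^1 f = 1/4 and the error is exactly -1/(4 n^2):
# the O(h^2) Euler-Maclaurin term is -(h^2/12)(f'(1) - f'(0)) = -h^2/4,
# and all higher-order terms vanish because f''' is constant.
errors = {n: 0.25 - trapezoid(lambda x: x ** 3, n) for n in (2, 4, 8)}
# errors[2] = -1/16, errors[4] = -1/64: halving h divides the error by 4.
```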

Let 1 \leq i \leq w. For R_i to be the minimizer of the window R_1, \ldots, R_w, we must have: \overbrace{R_1, \ldots, R_{i-1}}^{> R_i},\ \boxed{R_i},\ \overbrace{R_{i+1}, \ldots, R_w}^{\geq R_i}

The events (R_j > R_i) and (R_j \geq R_i) are clearly not independent of R_i; but resorting to the law of total probability, and using that the R_j’s are i.i.d., we obtain

\mathbb{P}(P_1 = i) = \frac{1}{\sigma^m} \sum_{r=0}^{\sigma^m - 1} \left(\frac{r}{\sigma^m}\right)^{i-1} \cdot \left(\frac{r}{\sigma^m} + \frac{1}{\sigma^m}\right)^{w-i}.

Using Lemma C.1 with f : x \mapsto x^{i-1}(x + \sigma^{-m})^{w-i}, we obtain

\mathbb{P}(P_1 = i) = \frac{\mathbf{1}_{i=1}}{\sigma^{mw}} + \int_0^1 x^{i-1}(x + \sigma^{-m})^{w-i}\, dx - \frac{(1 + \sigma^{-m})^{w-i} + \sigma^{-mw}\mathbf{1}_{i=1}}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right).

Using the fact that (x + \sigma^{-m})^n = x^n + \frac{nx^{n-1}}{\sigma^m} + O(\sigma^{-2m}), we get

\mathbb{P}(P_1 = i) = \frac{1}{w} + \left(\frac{w-i}{w-1} - \frac{1}{2}\right) \cdot \frac{1}{\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right). \quad \square

C.3 Convergence of P_1 to the uniform distribution

In Figure C.1, we computed the distribution of P_1 for various values of m, using the exact formula \mathbb{P}(P_1 = i) = \frac{1}{\sigma^m} \sum_{r=0}^{\sigma^m - 1} \left(\frac{r}{\sigma^m}\right)^{i-1} \cdot \left(\frac{r+1}{\sigma^m}\right)^{w-i} rather than the approximation of Proposition 16.1. As one can see, P_1 converges quickly to the limiting uniform distribution (as expected, since the dependency on m is of order \sigma^{-m}).

Figure C.1: With w=10, evolution of the distribution of P_1 as m increases. The actual value of P_1 is computed using the exact formula instead of the approximation of Proposition 16.1.
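The exact formula is straightforward to evaluate. The sketch below (illustrative names, assuming \sigma = 4) computes the law of P_1 for m = 2 and m = 8 and confirms both that it is a probability distribution and that the deviation from uniformity shrinks like \sigma^{-m}.

```python
def p1_distribution(sigma_m, w):
    """Exact law of P_1 under Hypothesis 16.2 (i.i.d. uniform m-mer ranks)."""
    dist = []
    for i in range(1, w + 1):
        p = sum((r / sigma_m) ** (i - 1) * ((r + 1) / sigma_m) ** (w - i)
                for r in range(sigma_m)) / sigma_m
        dist.append(p)
    return dist

w = 10
coarse = p1_distribution(4 ** 2, w)  # m = 2, sigma = 4
fine = p1_distribution(4 ** 8, w)    # m = 8, sigma = 4
# `fine` is already indistinguishable from the uniform law 1/w.
```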

C.4 Monte-Carlo simulations of random minimizers

Simulating our probabilistic model of random minimizers, with Hypothesis 16.2, we obtain the results depicted in Figure C.2.

(a) Empirical values of \mathbb{E}[\Delta_i] for 1 \leq i \leq 100. The red line is equal to \mu = \frac{w+1}{2}.
(b) Empirical distribution of \varepsilon. The red line denotes the empirical mean of \varepsilon.
Figure C.2: Monte-Carlo simulations of our model of random minimizers under Hypothesis 16.2. We used m=8, w=10 and generated N_\text{simu} = 10^6 random sequences \mathbf{R}, all long enough so that the sequence \boldsymbol{\Delta} contains at least 100 values.

One thing that may seem surprising at first glance is the peak observed for \mathbb{E}[\Delta_2]. When sliding the window by one position, either the previous minimizer X is still in the window, and the minimizer changes only if the new value X' satisfies X' < X; or X is no longer in the window, and we select as the new minimizer the minimum value of the window (an operation called a rescan). Here, P_1 is always obtained by a rescan operation, meaning that P_2^\ast is always obtained as “the next minimizer following a rescan”. We do not observe any other peak in the distribution of the \mathbb{E}[\Delta_i]’s because the other rescans occur at random positions in the sequences, so their impact is smoothed out across simulations. We have the following result; note that both the formula and the empirical simulations yield a value of \mathbb{E}[\Delta_2] \approx 5.87 for w=10 and m=8.
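The value \mathbb{E}[\Delta_2] \approx 5.87 can be reproduced with a short Monte-Carlo sketch of the model, again approximating the m‑mer ranks by i.i.d. uniform floats (so the \sigma^{-m} correction vanishes); names and sample sizes are illustrative.

```python
import random

def delta2(w, rng):
    """One sample of Delta_2 = P_2* - P_1* in the i.i.d. uniform model."""
    R = [rng.random() for _ in range(w)]
    p = min(range(w), key=lambda t: R[t])  # P_1* (0-based), found by rescan
    start = 0
    while True:
        R.append(rng.random())
        start += 1
        # Minimizer of the shifted window R[start .. start+w-1].
        q = min(range(start, start + w), key=lambda t: R[t])
        if q != p:
            return q - p

rng = random.Random(1)
mean_d2 = sum(delta2(10, rng) for _ in range(40000)) / 40000
# Both Proposition C.2 and the simulations give E[Delta_2] ~= 5.87 for w = 10.
```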

Proposition C.1 (Expected value of \Delta_2 (simplified)) Assuming w^2 = o(\sigma^m), \mathbb{E}[\Delta_2] = \frac{3\ln 2 - 1}{2} \cdot w + \frac{\ln 2}{2} + \frac{1}{8} - \frac{w-1}{32w^2} + O\!\left(\frac{1}{\sigma^m}\right).

We actually prove the following, more general, result:

Proposition C.2 (Expected value of \Delta_2 (general)) \mathbb{E}[\Delta_2] = (H_{2w} - H_w) \frac{3w+1}{2} - \frac{w-1}{2} + O\!\left(\frac{1}{\sigma^m}\right) where H_n denotes the n-th harmonic number 1 + \frac{1}{2} + \cdots + \frac{1}{n}.

Proof. To prove this, we distinguish two cases: either P_2^* “overwrites” P_1^* because R_{P_2^*} < R_{P_1^*}, or P_1^* goes out of scope and P_2^* is selected by a rescan. Therefore,

\mathbb{E}[\Delta_2] = \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^\ast \text{ overwrites } P_1^\ast}] + \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}].

In the following paragraphs, we show that:

  • \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] = w(H_{2w} - H_w) - \frac{w-1}{2} + O(1/\sigma^m)
  • \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}] = \frac{w+1}{2}(H_{2w} - H_w) + O(1/\sigma^m)

Combining these, \mathbb{E}[\Delta_2] = (H_{2w} - H_w)\frac{3w+1}{2} - \frac{w-1}{2} + O\!\left(\frac{1}{\sigma^m}\right). \quad \square

By approximating H_n = \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + O(1/n^4), we get Proposition C.1.
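Both propositions are easy to evaluate numerically. The sketch below (illustrative names) drops the O(1/\sigma^m) terms and confirms that, for w = 10, the exact and approximate formulas agree with each other and with the empirical value 5.87.

```python
from math import log

def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

def e_delta2_exact(w):
    # Proposition C.2, dropping the O(1/sigma^m) term.
    return (harmonic(2 * w) - harmonic(w)) * (3 * w + 1) / 2 - (w - 1) / 2

def e_delta2_approx(w):
    # Proposition C.1, obtained from the asymptotic expansion of H_n.
    return ((3 * log(2) - 1) / 2) * w + log(2) / 2 + 1 / 8 - (w - 1) / (32 * w ** 2)

exact, approx = e_delta2_exact(10), e_delta2_approx(10)
# exact ~= 5.866, matching the simulations and the approximation of Prop. C.1.
```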

Overwrite case.

Lemma C.2 For any 1 \leq j \leq i \leq w, the probability that P_2^* = w+j overwrites P_1^* = i is \mathbb{P}((P_1^*=i) \cap (P_2^*=w+j) \cap (R_{w+j} < R_i)) = \frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right).

Proof. Let O_{i,w+j} denote this event. By conditioning on the values of R_i and R_{w+j}, using the i.i.d. structure and applying Lemma C.1, we obtain:

\mathbb{P}(O_{i,w+j}) = \int_0^1 x^{w+j-2}(1-x)\, dx + O\!\left(\frac{1}{\sigma^m}\right) = \frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right). \quad \square

Assuming w^2 = o(\sigma^m), \begin{align*} \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] &= \sum_{i=1}^{w} \sum_{j=1}^{i} (w-i+j) \left[\frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right)\right]. \end{align*} Expanding and simplifying using harmonic numbers yields w(H_{2w} - H_w) - \frac{w-1}{2} + O(1/\sigma^m).

Rescan case.

Let 1 \leq r \leq \sigma^m and \alpha(r) = \frac{\sigma^m - r + 1}{\sigma^m}. We use the truncated trapezoid approximation: \frac{1}{\sigma^m}\sum_{s=1}^{\sigma^m - r} f\!\left(\frac{s}{\sigma^m}\right) = \int_0^{\alpha(r)} f(x)\, dx + O\!\left(\frac{1}{\sigma^m}\right) for any twice continuously differentiable f.

Lemma C.3 For any 1 \leq i, j \leq w, the probability that P_2^\ast = i+j is selected by a rescan when P_1^\ast = i is \mathbb{P}((P_1^\ast = i) \cap (P_2^\ast = i+j) \cap (\text{rescan})) = \frac{1}{w(w+i)} + O\!\left(\frac{1}{\sigma^m}\right).

Proof. By conditioning on the values of R_i and R_{i+j} and applying the i.i.d. structure and the trapezoid rule twice, we obtain: \mathbb{P}((P_1^\ast=i) \cap (P_2^\ast=i+j) \cap \text{rescan}) = \frac{1}{w} \int_0^1 x^{i+w-1}\, dx + O\!\left(\frac{1}{\sigma^m}\right) = \frac{1}{w(w+i)} + O\!\left(\frac{1}{\sigma^m}\right). \quad \square

Finally, still assuming w^2 = o(\sigma^m): \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}] = \sum_{i=1}^w \sum_{j=1}^w \frac{j}{w(w+i)} + O\!\left(\frac{1}{\sigma^m}\right) = \frac{w+1}{2}(H_{2w} - H_w) + O\!\left(\frac{1}{\sigma^m}\right).
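The two harmonic-number simplifications can be verified numerically. The following sketch (illustrative names) checks both double sums against their closed forms for w up to 20.

```python
def harmonic(n):
    return sum(1.0 / i for i in range(1, n + 1))

def sum_identities_gap(w):
    """Max deviation between the two double sums and their closed forms."""
    hh = harmonic(2 * w) - harmonic(w)
    # Overwrite case: sum over 1 <= j <= i <= w.
    overwrite = sum((w - i + j) * (1 / (w + j - 1) - 1 / (w + j))
                    for i in range(1, w + 1) for j in range(1, i + 1))
    # Rescan case: sum over 1 <= i, j <= w.
    rescan = sum(j / (w * (w + i))
                 for i in range(1, w + 1) for j in range(1, w + 1))
    return max(abs(overwrite - (w * hh - (w - 1) / 2)),
               abs(rescan - (w + 1) / 2 * hh))

worst = max(sum_identities_gap(w) for w in range(1, 21))
```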

C.5 Proof of Proposition 16.2

We have, from Definition 16.6: d^\ast = \frac{1}{2^{\sigma^k}-1} \sum_{n=1}^{\sigma^k} \binom{\sigma^k}{n} \mathbb{E}_{|\mathcal{X}|=n}[d^\ast(\mathcal{X})].

To compute the required expectation, let \mathcal{X} = \{X_1, \ldots, X_n\} be a set of k‑mers chosen uniformly at random. We denote by M_i \in \llbracket 1, \sigma^m \rrbracket the random variable denoting the minimizer of k‑mer X_i. Under Hypothesis 16.2, one establishes that \mathbb{P}(M_i = r) = \left(\frac{\sigma^m - r + 1}{\sigma^m}\right)^w - \left(\frac{\sigma^m - r}{\sigma^m}\right)^w.

We denote by P_r the number of k‑mers in \mathcal{X} that admit r as their minimizer; then: \begin{align*} \mathbb{E}[|\operatorname{MinCover}(\mathcal{X})|] &= \sum_{r=1}^{\sigma^m} (1 - \mathbb{P}(P_r = 0)) = \sigma^m - \sum_{r=1}^{\sigma^m}(1 - \mathbb{P}(M_1 = r))^n \\ &= \sigma^m \left[1 - \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m} \left[1 - \left(\frac{r}{\sigma^m}\right)^w + \left(\frac{r-1}{\sigma^m}\right)^w\right]^n\right]. \end{align*}

This is exactly the standard coupon collector’s problem, where the coupons are distinct minimizers and |\mathcal{X}| is the number of coupons bought. Substituting and simplifying yields Proposition 16.2. \square
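The formula for \mathbb{E}[|\operatorname{MinCover}(\mathcal{X})|] can be cross-checked against a direct simulation of the model, drawing each k‑mer's minimizer as the minimum of w i.i.d. uniform m‑mer ranks; the parameters below are illustrative.

```python
import random

def expected_cover_formula(sigma_m, w, n):
    """E[|MinCover(X)|] from the closed form, with |X| = n."""
    total = 0.0
    for r in range(1, sigma_m + 1):
        p = ((sigma_m - r + 1) / sigma_m) ** w - ((sigma_m - r) / sigma_m) ** w
        total += 1.0 - (1.0 - p) ** n
    return total

def expected_cover_simulated(sigma_m, w, n, trials, rng):
    acc = 0
    for _ in range(trials):
        # Each k-mer's minimizer is the min of w i.i.d. uniform m-mer ranks;
        # we count the distinct minimizers among the n k-mers.
        acc += len({min(rng.randint(1, sigma_m) for _ in range(w))
                    for _ in range(n)})
    return acc / trials

rng = random.Random(2)
formula = expected_cover_formula(16, 4, 10)
simulated = expected_cover_simulated(16, 4, 10, 10000, rng)
```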

C.6 Proof of Theorem 16.2

We prove the NP-completeness of MultiMinCover by (i) showing that the problem is in NP and (ii) reducing from SetCover. An alternative reduction, from VertexCover, is presented at the end of this section.

MultiMinCover is in NP. For the decision version of MultiMinCover, given a proposed set \mathcal{Y} of m‑mers of cardinality \leq k, one can verify whether \mathcal{Y} is a multiminimizer cover of \mathcal{X} by iterating over the k‑mers of \mathcal{X}, computing their N minimizers, and checking whether at least one of them belongs to \mathcal{Y}. This takes O(|\mathcal{X}| \cdot N) minimizer computations and membership tests, hence the result.
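The verification step can be sketched as follows; `mins_of`, which maps each k‑mer to the list of its N minimizers, is a hypothetical helper standing in for the N minimizer computations.

```python
def is_multimin_cover(Y, kmers, mins_of):
    """Check that Y hits at least one of each k-mer's N minimizers.

    `mins_of` maps each k-mer to the list of its N minimizers; it is a
    hypothetical helper standing in for the N minimizer computations.
    """
    Y = set(Y)
    return all(any(mm in Y for mm in mins_of[x]) for x in kmers)

# Toy instance with N = 2 minimizers per k-mer (purely illustrative strings).
mins_of = {"ACGTACGT": ["AC", "CG"], "TTTTACGT": ["TT", "AC"]}
covered = is_multimin_cover({"AC"}, mins_of, mins_of)    # True
uncovered = is_multimin_cover({"GG"}, mins_of, mins_of)  # False
```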

Reduction from SetCover. Let \mathcal{U} = \{1, \ldots, n\} and let \mathcal{S} \subseteq 2^{\mathcal{U}} be a collection of subsets of \mathcal{U} whose union is \mathcal{U}. The frequency of an element u \in \mathcal{U} is the number of subsets in \mathcal{S} that contain u. We define N = \max_{u \in \mathcal{U}} \text{freq}(u) as the maximum frequency.

We use the 3-letter alphabet \Sigma = \{0, 1, a\}. Let m = \lceil \log_2 |\mathcal{S}| \rceil and take any indexing S_1, \ldots, S_{|\mathcal{S}|} of the subsets; we associate to each subset S_i the m‑mer m_i obtained by writing i-1 in binary on m bits. Let k = N(m+1) - 1. To each element u \in \mathcal{U}, if i_1, \ldots, i_f are the indices of subsets containing u (f = \text{freq}(u)), we associate the k‑mer X_u = m_{i_1} a m_{i_2} a \cdots a m_{i_f} a^{(N-f)(m+1)}.

Example C.1 With \mathcal{U} = \{1,\ldots,5\} and \mathcal{S} = \{\{1,2,3\}, \{4,5\}, \{2,4\}, \{3,4\}\}, we have m = 2, k = 8, and: X_1 = \underline{00}aaaaaa,\quad X_2 = \underline{00}a\underline{10}aaa,\quad X_3 = \underline{00}a\underline{11}aaa,\quad X_4 = \underline{01}a\underline{10}a\underline{11},\quad X_5 = \underline{01}aaaaaa. The optimal set cover \{\{1,2,3\}, \{4,5\}\} has cardinality 2.
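The construction of the k‑mers X_u can be sketched as below; as in the example, subset S_i is encoded by writing i-1 in binary on m bits. Names are illustrative.

```python
from math import ceil, log2

def setcover_to_kmers(universe, subsets):
    """Build the k-mers X_u of the reduction (alphabet {0, 1, a})."""
    m = ceil(log2(len(subsets)))
    N = max(sum(u in s for s in subsets) for u in universe)  # max frequency
    k = N * (m + 1) - 1
    # m-mer of subset S_{i+1}: index i written in binary on m bits.
    mmer = [format(i, "b").zfill(m) for i in range(len(subsets))]
    kmers = {}
    for u in universe:
        idx = [i for i, s in enumerate(subsets) if u in s]
        body = "a".join(mmer[i] for i in idx)
        kmers[u] = body + "a" * ((N - len(idx)) * (m + 1))
    return m, k, kmers

m, k, X = setcover_to_kmers(range(1, 6),
                            [{1, 2, 3}, {4, 5}, {2, 4}, {3, 4}])
# Reproduces Example C.1: m = 2, k = 8, X[1] = "00aaaaaa", X[4] = "01a10a11".
```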

The goal is to exhibit N orders \mathcal{O}_m^1, \ldots, \mathcal{O}_m^N on m‑mers such that for each k‑mer x, its N minimizers correspond exactly to its m‑mers coming from the m_i’s (i.e. its minimizers contain only letters 0 and 1). Any solution to MultiMinCover in this context is then equivalent to a solution of the original SetCover instance. Since there are n k‑mers of size O(|\mathcal{S}| \log_2 |\mathcal{S}|) and exactly |\mathcal{S}| minimizers to choose from, the reduction is polynomial, once we prove that said orders can be constructed in polynomial time. Algorithm DetermineOrders (Algorithm C.1) constructs those orders.

Algorithm C.1: DetermineOrders algorithm.
\begin{algorithmic}
\Function{DetermineOrders}{$m_1,\ldots,m_{|\mathcal{S}|}$; $X_1,\ldots,X_n$; $N$}
  \State{Initialize $\texttt{ranks}$, a $|\mathcal{S}| \times N$ array}
  \State{Initialize $\texttt{minimizers}$, a $n \times N$ array}
  \For{$1 \leq i \leq n$}
    \State{Let $\texttt{cand}(X_i) \subseteq \{m_1,\ldots,m_{|\mathcal{S}|}\}$ be the set of candidate minimizers of $X_i$}
  \EndFor
  \While{$\exists\, i$ such that $\texttt{cand}(X_i) \neq \emptyset$}
    \State{Let $m_j \in \texttt{cand}(X_i)$}
    \If{$\texttt{cand}(X_i) \setminus \{m_j\} = \emptyset$} \Comment{The relative rank of $m_j$ does not matter.}
      \For{all $l$ such that $\texttt{minimizers}(X_i, l)$ is not defined}
        \State{$\texttt{minimizers}(X_i, l) \gets m_j$}
      \EndFor
    \Else
      \State{Let $l$ be an index such that $\texttt{minimizers}(X_i, l)$ is not defined}
      \State{Set $\texttt{ranks}(m_j, l)$ to the next unused integer in $\{1,\ldots,|\mathcal{S}|\}$ in column $l$}
      \For{all $x'$ such that $m_j \in \texttt{cand}(x')$}
        \If{$\texttt{minimizers}(x', l)$ is not defined}
          \State{$\texttt{minimizers}(x', l) \gets m_j$}
        \EndIf
      \EndFor
    \EndIf
    \State{Remove $m_j$ from $\texttt{cand}(X_i)$}
  \EndWhile \Comment{Remaining ranks do not matter; complete them arbitrarily.}
  \For{$1 \leq l \leq N$}
    \State{Assign remaining empty values in $\texttt{ranks}(\cdot, l)$ arbitrarily}
  \EndFor
\EndFunction
\end{algorithmic}

Proposition C.3 Algorithm DetermineOrders is correct and runs in polynomial time.

Proof. Each iteration eliminates one candidate minimizer, of which there are at most |\mathcal{U}| \cdot N. The relevant operations per iteration take at most |\mathcal{S}| + |\mathcal{U}| steps. Since N \leq |\mathcal{S}|, we obtain worst-case complexity O(|\mathcal{U}| \cdot |\mathcal{S}| \cdot (|\mathcal{S}| + |\mathcal{U}|)), polynomial as claimed.

For correctness, we must verify that we never reach a state where \texttt{minimizers}(X_i, l) is defined for all l but some m_j \in \texttt{cand}(X_i) remains. Were this the case, by the pigeonhole principle some other m_{j'} \neq m_j fills two minimizer slots for X_i. But minimizers assigned via the else branch are immediately removed from the candidate list, so they cannot be added twice via that path. And the if branch (Lines 7–9) only fires when m_{j'} is the only remaining candidate — contradicting the presence of m_j. \square

Reduction from VertexCover. Let G = (V, E) be a graph. Associate to each vertex i a distinct m‑mer m_i = \texttt{binary}(i) with m = \lceil \log_2 |V| \rceil bits, using alphabet \Sigma = \{0, 1, a\}. For each edge (i,j) \in E (w.l.o.g. m_i < m_j), construct the k‑mer X_{(i,j)} = m_i a m_j. Define orders \mathcal{O}^1 and \mathcal{O}^2 such that \min_1(X_{(i,j)}) = m_i and \min_2(X_{(i,j)}) = m_j for all edges (any m‑mer containing a ranks higher than binary m‑mers in both orders). Let \mathcal{X} = \{X_{(i,j)} : \{i,j\} \in E,\, m_i < m_j\}.

Lemma C.4 VertexCover admits a yes-instance (V, E, k) iff the MultiMinCover instance (\mathcal{X}, \mathcal{O}^1, \mathcal{O}^2, k) is a yes-instance.

Proof. (\Rightarrow) If \mathcal{C} is a vertex cover of size k, then \mathcal{Y} = \{m_i : i \in \mathcal{C}\} is a multiminimizer cover: for any k‑mer X_{(i,j)}, either i \in \mathcal{C} and \min_1(X_{(i,j)}) = m_i \in \mathcal{Y}, or j \in \mathcal{C} and \min_2(X_{(i,j)}) = m_j \in \mathcal{Y}.

(\Leftarrow) Given a multiminimizer cover \mathcal{Y} of size k, the set \mathcal{C} = \{i : m_i \in \mathcal{Y}\} is a vertex cover of size k: for any edge (i,j), the corresponding k‑mer X_{(i,j)} has \min_1 = m_i or \min_2 = m_j, so at least one endpoint is in \mathcal{C}. \square
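The gadget and the equivalence can be sketched as follows, reading \min_1 as the first m‑mer of each k‑mer and \min_2 as the second, as guaranteed by the constructed orders; all names are illustrative.

```python
from math import ceil, log2

def vertexcover_gadget(n_vertices, edges):
    """Build the k-mers X_{(i,j)} = m_i a m_j, with m_i the binary code of i."""
    m = ceil(log2(n_vertices))
    code = [format(i, "b").zfill(m) for i in range(n_vertices)]
    kmers = {}
    for i, j in edges:
        lo, hi = sorted((code[i], code[j]))  # w.l.o.g. m_i < m_j
        kmers[(i, j)] = lo + "a" + hi
    return code, kmers

def covers(cover, edges, code, kmers):
    """Check the multiminimizer cover induced by a vertex set: for each edge,
    min_1 (first m-mer) or min_2 (second m-mer) must be a selected code."""
    m = len(code[0])
    Y = {code[v] for v in cover}
    return all(kmers[e][:m] in Y or kmers[e][m + 1:] in Y for e in edges)

# Triangle graph on vertices {0, 1, 2}: any two vertices form a vertex cover.
edges = [(0, 1), (0, 2), (1, 2)]
code, kmers = vertexcover_gadget(3, edges)
```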

C.7 Density plots for multiple w values

In this section, we ran the same evaluation as in Section 16.3.1, but applied our multiminimizer iterator to several values of w and a wider range of m values to check robustness. Figure C.3 and Figure C.4 show that, for all values of w, our iterator can yield densities lower than the lower bound.

Figure C.3: Density of multiple schemes, including multiminimizers with up to 32 hash functions. Same plot as Figure 16.4 (a), for w \in \{3, 7, 15, 31, 63\}.
Figure C.4: Density of multiple schemes, including multiminimizers based on open-closed mod-minimizers (Groot Koerkamp et al., 2025) with up to 32 hash functions. Same plot as Figure 16.4 (b), for w \in \{3, 7, 15, 31, 63\}.

C.8 Conservation with respect to error rate

In this section, we study the evolution of the conservation of the sampled multiminimizers with respect to the error rate and the number of hash functions. Given a sequence S and a mutated version S' for a given error rate, we measure conservation using the Jaccard similarity of A = \operatorname{MultiMinimizers}(S) and B = \operatorname{MultiMinimizers}(S'), i.e. \frac{|A \cap B|}{|A \cup B|}. The results are shown in Figure C.5. In particular, we can observe that the conservation decreases slightly faster with respect to the error rate as the number of hash functions increases.

Figure C.5: Conservation of sampled multiminimizers, measured by their Jaccard similarity, with 1 \leq N \leq 4 for a random string of size 10^5 compared to a mutated version of the string with up to 3.5\% of errors.
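The measurement itself can be sketched as below; keyed blake2b hashing stands in for the N hash functions, and all names and parameters are illustrative rather than the chapter's actual implementation.

```python
import hashlib
import random

def rank(mmer, seed):
    # Deterministic pseudo-random rank of an m-mer under hash function `seed`.
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(mmer.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")

def multiminimizers(s, w, m, n_hashes):
    """Union of the minimizer sets given by n_hashes independent orders."""
    sel = set()
    for seed in range(n_hashes):
        for i in range(len(s) - (w + m - 1) + 1):
            # Minimizer of the window of w m-mers starting at position i.
            sel.add(min((rank(s[j:j + m], seed), j, seed)
                        for j in range(i, i + w)))
    return sel

def mutate(s, rate, rng):
    # Substitute each character with probability `rate`.
    alpha = "ACGT"
    return "".join(rng.choice(alpha.replace(c, "")) if rng.random() < rate else c
                   for c in s)

rng = random.Random(3)
s = "".join(rng.choice("ACGT") for _ in range(2000))
a = multiminimizers(s, 10, 8, 2)
b = multiminimizers(mutate(s, 0.02, rng), 10, 8, 2)
jaccard = len(a & b) / len(a | b)  # conservation at a 2% error rate
```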

C.9 Linearity of indexation time

In this section, we show that iterating over multiminimizers takes time linear in the number of hash functions. The results are shown in Figure C.6.

Figure C.6: Time taken to index a random string of size 10^5. The linear regression shows that the time taken to iterate over the multiminimizers is approximately the time needed to iterate over one random minimizer scheme times the number of hash functions.