Appendix D — Proofs and experiments on multiminimizers
D.1 Proof of Theorem 16.1
Proof. We start by rewriting the density formula as \begin{equation} d = \lim_{M\to\infty} \frac{1}{\sigma^M}\sum_{S\in\Sigma^M} \frac{|\mathcal{S}_f(S)|}{|S|-k+1} = \lim_{M\to\infty} \frac{1}{M-k+1} \sum_{S\in \Sigma^M} \frac{1}{\sigma^M} \cdot |\mathcal{S}_f(S)| = \lim_{M\to\infty} \frac{\mathbb{E}[|\mathcal{S}_f(S)|]}{M-k+1} \end{equation} where \mathbb{E}[|\mathcal{S}_f(S)|] designates the expected particular density of a (uniform) random sequence S of length M, seen as a random variable. Suppose we can extend the original sequence S, of size M, and continue selecting minimizers past the first M characters. We define \tau_M = \inf\{n \in \mathbb{N} : P_n^\ast > M\} as the random variable that determines how many minimizers have been selected in S; hence \tau_M = |\mathcal{S}_f(S)|. Therefore, we can rewrite the above as d = \lim_{M\to\infty} \frac{\mathbb{E}[\tau_M]}{M-k+1}.
Let us call S_n = \sum_{i=1}^n \Delta_i. Notice that, by definition of the \Delta_i’s, S_n = P_n^\ast; then \tau_M = \inf\{n \in \mathbb{N} : S_n > M\} is a stopping time with respect to the \sigma-algebra generated by the \Delta_i’s. Using Wald’s equation (Wald, 1944), we get
\mathbb{E}[S_{\tau_M}] = \mathbb{E}\left[\sum_{i=1}^{\tau_M} \mathbb{E}[\Delta_i]\right] = \mathbb{E}\left[ \tau_M\cdot\mu + \sum_{i=1}^{\tau_M}\varepsilon_i\right] = \mathbb{E}[\tau_M] \cdot \mu + \mathbb{E}\left[\sum_{i=1}^{\tau_M}\varepsilon_i\right].
We apply again Wald’s equation to the right-hand term and obtain
\mathbb{E}[S_{\tau_M}] = \mathbb{E}[\tau_M]\cdot\mu + \mathbb{E}\left[\sum_{i=1}^{\tau_M}\mathbb{E}[\varepsilon_i]\right] =\mathbb{E}[\tau_M]\cdot\mu + \underbrace{\mathbb{E}[\tau_M] \cdot \mathbb{E}[\varepsilon]}_{= 0}.
By definition of S_n, \tau_M and the \Delta_i’s, notice that M \leq S_{\tau_M} \leq M + w - 1. Therefore \frac{M}{M-k+1} \leq \frac{\mathbb{E}[\tau_M]}{M-k+1} \cdot \mu \leq \frac{M+w-1}{M-k+1}, and, taking the limit as M \to \infty, we establish our result.
D.2 Proof of Proposition 16.1
Proof. We begin with the following lemma.
Lemma D.1 (Trapezoid rule) For any function f: [0,1] \to \mathbb{R} at least two times continuously derivable, we have \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m - 1} f\!\left(\frac{r}{\sigma^m}\right) = \int_0^1 f(x)\, dx - \frac{f(1) + f(0)}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right).
Proof. This is a direct application of the trapezoid rule for numerical integration. Defining E = \int_0^1 f(x)\, dx - \frac{1}{\sigma^m}\!\left(\frac{f(1)+f(0)}{2} + \sum_{r=1}^{\sigma^m-1} f\!\left(\frac{r}{\sigma^m}\right)\right), we have E = -\frac{1}{12\sigma^{2m}} f''(\eta) for some \eta \in [0,1] (Atkinson, 2008, eq. (5.1.17)).
Let 1 \leq i \leq w. For R_i to be the minimizer of the window R_1, \ldots, R_w, then we must have: \overbrace{R_1, \ldots, R_{i-1}}^{> R_i},\ \boxed{R_i},\ \overbrace{R_{i+1}, \ldots, R_w}^{\geq R_i}
The events (R_j > R_i) and (R_j \geq R_i) are clearly not independent of R_i; but resorting to the law of total probability, and using that the R_j’s are i.i.d., we obtain
\begin{align*} \mathbb{P}(P_1=i) &= \sum_{r=1}^{\sigma^m} \mathbb{P}(R_1 > r)^{i-1} \cdot \mathbb{P}(R_1\geq r)^{w-i}\cdot \mathbb{P}(R_i=r) \\ &=\frac{1}{\sigma^{m}} \sum_{r=1}^{\sigma^m} \left(\frac{\sigma^m-r}{\sigma^m}\right)^{i-1}\cdot \left(\frac{\sigma^m -r}{\sigma^m}+\frac{1}{\sigma^m}\right)^{w-i}\\ &= \frac{1}{\sigma^{m}}\sum_{r=0}^{\sigma^m -1} \left(\frac{r}{\sigma^m}\right)^{i-1}\cdot \left(\frac{r}{\sigma^m}+\frac{1}{\sigma^m}\right)^{w-i} \quad \text{by change of variable.} \end{align*}
Using Lemma D.1 with f : x \mapsto x^{i-1}(x + \sigma^{-m})^{w-i}, we obtain
\mathbb{P}(P_1 = i) = \frac{\mathbf{1}_{i=1}}{\sigma^{mw}} + \int_0^1 x^{i-1}(x + \sigma^{-m})^{w-i}\, dx - \frac{(1 + \sigma^{-m})^{w-i} + \sigma^{-mw}\mathbf{1}_{i=1}}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right).
Using the fact that (x + \sigma^{-m})^n = x^n + \frac{nx^{n-1}}{\sigma^m} + O(\sigma^{-2m}), we get
\begin{align*} \mathbb{P}(P_1=i) &= \int_0^1 x^{w-1}\,dx + \frac{w-i}{\sigma^m}\int_0^1 x^{w-2}\,dx - \frac{1}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right)\\ &= \frac{1}{w} + \left(\frac{w-i}{w-1} - \frac{1}{2}\right) \cdot \frac{1}{\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right). \end{align*}
D.3 Convergence of P_1 to the uniform distribution
In Figure D.1, we computed the distribution of P_1 for various values of m, using the exact formula \mathbb{P}(P_1 = i) = \frac{1}{\sigma^m} \sum_{r=0}^{\sigma^m - 1} \left(\frac{r}{\sigma^m}\right)^{i-1} \cdot \left(\frac{r+1}{\sigma^m}\right)^{w-i} rather than the approximation of Proposition 16.1. As one can see, P_1 converges quickly to the limit uniform distribution (as one can expect since the dependency on m is of the order of \sigma^{-m}).
D.4 Monte-Carlo simulations of random minimizers
Simulating our probabilistic model of random minimizers, with Hypothesis 2, we obtain the results depicted in Figure D.2, which have been commented on in the core of the paper.
One thing that seems surprising at first glance is the peak observed for \mathbb{E}[\Delta_2]. When sliding the window by one position, either the previous minimizer X is still in the window and we change the minimizer only if the new value X' is < X, or X is not in the window anymore and we select as a new minimizer the minimum value in the window (which is called a rescan). Here, P_1 is always obtained by a rescan operation, meaning that P_2^\ast is always obtained as “the next minimizer following a rescan”. We do not observe any other peak in the distribution of \mathbb{E}[\Delta_i]’s as the other rescans are randomly found in the sequences, and therefore their impact is smoothed among all simulations. We have the following result; note that both the formula and the empirical simulations yield a value of \mathbb{E}[\Delta_2] \approx 5.87 for w=10 and m=8.
Proposition D.1 (Expected value of \Delta_2 (simplified)) Assuming w^2 = o(\sigma^m), \mathbb{E}[\Delta_2] = \frac{3\ln 2 - 1}{2} \cdot w + \frac{\ln 2}{2} + \frac{1}{8} - \frac{w-1}{32w^2} + O\!\left(\frac{1}{\sigma^m}\right).
We actually prove the following, more general, result:
Proposition D.2 (Expected value of \Delta_2 (general)) \mathbb{E}[\Delta_2] = (H_{2w} - H_w) \frac{3w+1}{2} - \frac{w-1}{2} + O\!\left(\frac{1}{\sigma^m}\right) where H_n denotes the n-th harmonic number 1 + \frac{1}{2} + \cdots + \frac{1}{n}.
Proof. To prove this, we distinguish two cases: either P_2^* “overwrites” P_1^* because R_{P_2^*} < R_{P_1^*}, or P_1^* goes out of scope and P_2^* is selected by a rescan. Therefore,
\mathbb{E}[\Delta_2] = \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^\ast \text{ overwrites } P_1^\ast}] + \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}].
In the following paragraphs, we show that:
- \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] = w(H_{2w} - H_w) - \frac{w-1}{2} + O(1/\sigma^m)
- \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}] = \frac{w+1}{2}(H_{2w} - H_w) + O(1/\sigma^m)
Combining these, \mathbb{E}[\Delta_2] = (H_{2w} - H_w)\frac{3w+1}{2} - \frac{w-1}{2} + O\!\left(\frac{1}{\sigma^m}\right).
By approximating H_n as \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + O\!\left(\frac{1}{n^4}\right), we get the desired result: \mathbb{E}[\Delta_2] = \frac{3 \ln 2 - 1}{2} \cdot w + \frac{\ln 2}{2} + \frac{1}{8} - \frac{w-1}{32 w^2} + O\!\left(\frac{1}{\sigma^m}\right).
Overwrite case.
Lemma D.2 For any 1 \leq j \leq i \leq w, the probability that P_2^* = w+j overwrites P_1^* = i is \mathbb{P}((P_1^*=i) \cap (P_2^*=w+j) \cap (R_{w+j} < R_i)) = \frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right).
Proof. Let O_{i,w+j} denote the event (P_1^*=i) \cap (P_2^*=w+j) \cap (R_{w+j} < R_i).
\begin{align*} \mathbb{P}(O_{i,w+j}) &= \sum_{r=1}^{\sigma^m} \mathbb{P}(R_1 > r)^{i-1} \cdot \mathbb{P}(R_i=r) \cdot \mathbb{P}(R_1 \ge r)^{w-i+j-1} \cdot \mathbb{P}(R_{w+j} < r) \\ &= \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m} \mathbb{P}(R_1 > r)^{i-1} \cdot \mathbb{P}(R_1 \ge r)^{w-i+j-1} \cdot (1 - \mathbb{P}(R_{w+j} \geq r))\\ &= \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m}\left(\frac{\sigma^m-r}{\sigma^m}\right)^{i-1}\left(\frac{\sigma^m-r+1}{\sigma^m}\right)^{w-i+j-1}\left(1-\frac{\sigma^m-r+1}{\sigma^m}\right)\\ &= \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m} \left(\frac{r-1}{\sigma^m}\right)^{i-1}\left(\frac{r}{\sigma^m}\right)^{w-i+j-1}\left(1-\frac{r}{\sigma^m}\right) \quad \text{by change of variable.} \end{align*}
Applying Lemma D.1 with f: x \mapsto (x-\sigma^{-m})^{i-1}x^{w-i+j-1}(1-x), we obtain \mathbb{P}(O_{i,w+j}) = \int_0^1 (x-\sigma^{-m})^{i-1}x^{w-i+j-1}(1-x)\,dx - \frac{\mathbf{1}_{w+j=i+1}(-\sigma^{-m})^{i-1}}{2\sigma^m} + O\!\left(\frac{1}{\sigma^{2m}}\right). Using the expansion (x-\sigma^{-m})^{i-1} = x^{i-1}+ O(\sigma^{-m}), we get \begin{align*} \mathbb{P}(O_{i,w+j}) &= \int_0^1 x^{w+j-2}(1-x)\,dx + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right), \end{align*} since \int_0^1 x^n(1-x)\,dx = \frac{1}{n+1}-\frac{1}{n+2}.
Assuming w^2 = o(\sigma^m), \begin{align*} \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] &= \sum_{i=1}^{w} \sum_{j=1}^{i} (w-i+j) \cdot \mathbb{P}(O_{i,w+j}) \\ &= \sum_{i=1}^{w} \sum_{j=1}^{i} (w-i+j) \left[\frac{1}{w+j-1} - \frac{1}{w+j} + O\!\left(\frac{1}{\sigma^m}\right)\right] \\ &= \sum_{i=1}^w \left[ \sum_{j=1}^i \frac{w-i+j}{w+j-1} - \sum_{j=1}^i \frac{w-i+j}{w+j}\right] + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \sum_{i=1}^w \left[ \sum_{j=1}^i \frac{w-i+j-1+1}{w+j-1} - \left(i - i\sum_{j=1}^i \frac{1}{w+j}\right) \right] + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \sum_{i=1}^w \left[ i - (i-1)\sum_{j=1}^i \frac{1}{w+j-1} -i + i\sum_{j=1}^i \frac{1}{w+j}\right] + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \sum_{i=1}^w \left[\sum_{j=1}^i \frac{1}{w+j-1} + i\left(\frac{1}{w+i}-\frac{1}{w}\right)\right] + O\!\left(\frac{1}{\sigma^m}\right). \end{align*} Expanding into three different sums, \begin{align*} \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] &= \underbrace{\sum_{i=1}^w \sum_{j=1}^i \frac{1}{w+j-1}}_{(A)} + \underbrace{\sum_{i=1}^w \frac{i}{w+i}}_{(B)} - \underbrace{\frac{1}{w}\sum_{i=1}^w i}_{(C)} + O\!\left(\frac{1}{\sigma^m}\right). \end{align*} (C) is obviously \frac{w+1}{2}. For (A), note that the term \frac{1}{w} appears w times, \frac{1}{w+1} appears w-1 times, and so on up to \frac{1}{2w-1} which appears only once. Hence, (A) = \sum_{i=0}^{w-1}\frac{w-i}{w+i} = 1 + \sum_{i=1}^w\frac{w-i}{w+i} = 1 + w(H_{2w}-H_w) - \underbrace{\sum_{i=1}^w\frac{i}{w+i}}_{(B)}. We do not need to compute (B) as terms cancel, and \begin{align*} \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{P_2^* \text{ overwrites } P_1^*}] &= 1 + w(H_{2w}-H_w) - (B) + (B) - \frac{w+1}{2} + O\!\left(\frac{1}{\sigma^m}\right)\\ &= w(H_{2w}-H_w) - \frac{w-1}{2} + O\!\left(\frac{1}{\sigma^m}\right). \end{align*}
Rescan case.
Let 1 \leq r \leq \sigma^m and \alpha(r) = \frac{\sigma^m - r + 1}{\sigma^m}. We use the truncated trapezoid approximation: \frac{1}{\sigma^m}\sum_{s=1}^{\sigma^m - r} f\!\left(\frac{s}{\sigma^m}\right) = \int_0^{\alpha(r)} f(x)\, dx + O\!\left(\frac{1}{\sigma^m}\right) for any twice continuously derivable f.
Lemma D.3 For any 1 \leq i, j \leq w, the probability that P_2^* = i+j as a rescan when P_1^* = i is \mathbb{P}((P_1^\ast = i) \cap (P_2^\ast = i+j) \cap (\text{rescan})) = \frac{1}{w(w+i)} + O\!\left(\frac{1}{\sigma^m}\right).
Proof. Let us denote by R_{i,j} the event “(P_1^\ast = i)\cap (P_2^\ast = i+j) \cap (\text{rescan})” for any 1\leq i,j\leq w. When this event is realized, then all variables
- R_1,\dots,R_{i-1} are >R_i;
- R_{i+1},\dots, R_{w+i} are \geq R_i (including R_{i+j});
- R_{i+1},\dots,R_{i+j-1} are >R_j;
- R_{i+j+1},\dots,R_{w+i} are \geq R_j.
Therefore, \begin{align*} \mathbb{P}(R_{i,j}) &= \sum_{r=1}^{\sigma^m}\sum_{s=1}^{\sigma^m}\mathbb{P}(R_1>r)^{i-1} \cdot\mathbb{P}(R_i=r)\cdot \mathbb{P}((R_1\geq r)\cap (R_1>s))^{j-1}\\&\quad\quad\quad\quad\quad\cdot\mathbb{P}((R_{i+j}=s)\cap (R_{i+j}\geq r))\cdot\mathbb{P}((R_1\geq r)\cap (R_1\geq s))^{w-j}\\ &= \frac{1}{\sigma^{2m}}\sum_{r=1}^{\sigma^m}\mathbb{P}(R_1>r)^{i-1} \sum_{s=r}^{\sigma^m}\mathbb{P}(R_1>s)^{j-1}\cdot \mathbb{P}(R_1\geq s)^{w-j} \\ &= \frac{1}{\sigma^{2m}}\sum_{r=1}^{\sigma^m} \left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1}\sum_{s=r}^{\sigma^m} \left(\frac{\sigma^m -s}{\sigma^m}\right)^{j-1} \left(\frac{\sigma^m-s+1}{\sigma^m}\right)^{w-j}\\ &=\frac{1}{\sigma^{2m}}\sum_{r=1}^{\sigma^m}\left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1} \sum_{s=0}^{\sigma^m-r} \left(\frac{s}{\sigma^m}\right)^{j-1}\left(\frac{s+1}{\sigma^m}\right)^{w-j}. \end{align*} Notice that the term s=0 in the inner sum equals \mathbf{1}_{j=1} \sigma^{-m(w-1)} which is O(\sigma^{-m}) as we suppose that w>1. Applying the truncated trapezoid approximation to the function f: x\mapsto x^{j-1}(x+\sigma^{-m})^{w-j}, we get \begin{align*} \mathbb{P}(R_{i,j}) &= \frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m} \left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1} \left[\int_0^{\alpha(r)}f(x)\,dx +O\!\left(\frac{1}{\sigma^m}\right)\right]\\ &= \frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m} \left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1} \int_0^{\alpha(r)} f(x)\,dx + O\!\left(\frac{1}{\sigma^m}\right). \end{align*} Using the expansion f(x) = x^{w-1} + O(\sigma^{-m}), we get \begin{align*} \mathbb{P}(R_{i,j}) &= \frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m} \left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1} \int_0^{\alpha(r)} x^{w-1}\,dx + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \frac{1}{w}\cdot\frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m}\left(\frac{\sigma^m -r}{\sigma^m}\right)^{i-1}\cdot\left(\frac{\sigma^m -r+1}{\sigma^m}\right)^{w} + O\!\left(\frac{1}{\sigma^m}\right)\\ &= \frac{1}{w}\cdot\frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m}\left(\frac{r-1}{\sigma^m}\right)^{i-1}\cdot\left(\frac{r}{\sigma^m}\right)^{w} + O\!\left(\frac{1}{\sigma^m}\right). \end{align*} Applying Lemma D.1 with the function f: x \mapsto (x-\sigma^{-m})^{i-1}x^w, we get \mathbb{P}(R_{i,j}) = \frac{1}{w} \int_0^1 (x-\sigma^{-m})^{i-1}x^w\, dx +O\!\left(\frac{1}{\sigma^m}\right) = \frac{1}{w(w+i)}+O\!\left(\frac{1}{\sigma^m}\right).
Finally, still assuming w^2 = o(\sigma^m): \mathbb{E}[\Delta_2 \cdot \mathbf{1}_{\text{rescan}}] = \sum_{i=1}^w \sum_{j=1}^w \frac{j}{w(w+i)} + O\!\left(\frac{1}{\sigma^m}\right) = \frac{w+1}{2}(H_{2w} - H_w) + O\!\left(\frac{1}{\sigma^m}\right).
D.5 Proof of Proposition 16.2
Proof. We have, from Definition 16.2: d^\ast = \frac{1}{2^{\sigma^k}-1} \sum_{n=1}^{\sigma^k} \binom{\sigma^k}{n} \mathbb{E}_{|\mathcal{X}|=n}[d^\ast(\mathcal{X})].
To compute the required expectation, let \mathcal{X} = \{X_1, \ldots, X_n\} be a set of k‑mers chosen uniformly at random. We denote by M_i \in \llbracket 1, \sigma^m \rrbracket the random variable denoting the minimizer of k‑mer X_i (assimilating minimizers and their rank according to the chosen order \mathcal{O}_m). Under Hypothesis 2, it is not difficult to establish that \mathbb{P}(M_i>r) = \mathbb{P}(X_1>r)^w where X_i denotes the i-th m‑mer of X_i and w=k-m+1. Therefore, we obtain \mathbb{P}(M_i = r) = \left(\frac{\sigma^m - r + 1}{\sigma^m}\right)^w - \left(\frac{\sigma^m - r}{\sigma^m}\right)^w.
We denote by P_r the number of k‑mers in \mathcal{X} that admit r as their minimizer; then: \begin{align*} \mathbb{E}[|\operatorname{MinCover}(\mathcal{X})|] &= \sum_{r=1}^{\sigma^m}\mathbb{E}[\mathbf{1}_{P_r>0}] = \sum_{r=1}^{\sigma^m} (1-\mathbb{P}(P_r=0))= \sigma^m - \sum_{r=1}^{\sigma^m}(1- \mathbb{P}(M_1=r))^n\\ &= \sigma^m \left[1 - \frac{1}{\sigma^m}\sum_{r=1}^{\sigma^m} \left[1-\left(\frac{\sigma^m-r+1}{\sigma^m}\right)^w + \left(\frac{\sigma^m-r}{\sigma^m}\right)^w\right]^n\right]\\ &= \sigma^m \left[1 - \frac{1}{\sigma^m} \sum_{r=1}^{\sigma^m} \left[1 - \left(\frac{r}{\sigma^m}\right)^w + \left(\frac{r-1}{\sigma^m}\right)^w\right]^n\right]. \end{align*}
This is exactly the standard coupon collector’s problem, where the coupons are distinct minimizers and |\mathcal{X}| is the number of coupons bought. Substituting and simplifying yields Proposition 16.2.
D.6 Proof of Theorem 16.2
We prove the NP-completeness of MultiMinCover by (i) showing first that the problem is in NP and then (ii) by reduction from SetCover. An alternative reduction from VertexCover is presented at the end of this section.
D.6.1 MultiMinCover is in NP
For the decision version of MultiMinCover, given a proposed set \mathcal{Y} of m‑mers of cardinality \leq k, one can verify whether \mathcal{Y} is a multiminimizer cover of \mathcal{X} by iterating over k‑mers in \mathcal{X}, computing their N minimizers, and checking whether at least one belongs to \mathcal{Y}. This can be done in O(|\mathcal{X}| \cdot N), hence the result.
D.6.2 Reduction from SetCover
Let \mathcal{U} = \{1, \ldots, n\} and let \mathcal{S} \subseteq 2^{\mathcal{U}} be a collection of subsets of \mathcal{U}, whose union is \mathcal{U}. We recall that the SetCover problem amounts to finding the smallest subcollection of \mathcal{S} whose union is \mathcal{U}. The frequency of an element u \in \mathcal{U} is defined as the number of subsets in \mathcal{S} that contain u. We define N = \max_{u \in \mathcal{U}} \text{freq}(u) as the maximum frequency of any element of \mathcal{U}. Note that N \leq |\mathcal{S}|.
To illustrate the proof, we propose to follow an example. Let \mathcal{U}=\{1,\ldots,5\} and \mathcal{S}=\{\{1,2,3\},\{4,5\},\{2,4\},\{3,4\}\}. The optimal solution for this instance of set cover is \{1,2,3\},\{4,5\} of cardinality 2. The maximal frequency is N=3 (since 4 appears in 3 subsets).
We use the following 3-letter alphabet \Sigma = \{0, 1, a\}. Let m = \lceil \log_2 |\mathcal{S}| \rceil. Take any indexing S_1, \ldots, S_{|\mathcal{S}|} of the subsets forming \mathcal{S}; we associate to each subset S_i the m‑mer P_i corresponding to i in binary coded with m bits. Then, let k = Nm + N-1 = N(m+1) - 1. To each element u \in \mathcal{U}, let i_1, \ldots, i_f be the indices of the S_i’s containing u, with f = \text{freq}(u). We associate to u the k‑mer X_u = m_{i_1} a m_{i_2} a \cdots a m_{i_f} a^{(N-f)(m+1)}.
Example D.1 We have m = \lceil \log_2 4\rceil = 2. We associate m_1=00 to \{1,2,3\}, m_2=01 to \{4,5\}, m_3=10 to \{2,4\} and m_4=11 to \{3,4\}. We have k = 8 and the following k‑mers: X_1 = \underline{00}aaaaaa,\quad X_2 = \underline{00}a\underline{10}aaa,\quad X_3 = \underline{00}a\underline{11}aaa,\quad X_4 = \underline{01}a\underline{10}a\underline{11},\quad X_5 = \underline{01}aaaaaa.
The goal now is to exhibit N orders \mathcal{O}_m^1, \ldots, \mathcal{O}_m^N on m‑mers, so that the following is true: for each k‑mer x, its N minimizers correspond exactly to its m‑mers coming from the m_i’s; i.e. its minimizers contain only the letters 0 and 1. It should be clear that any solution to MultiMinCover in this context is equivalent to a solution of the original SetCover instance. Since there are |\mathcal{U}|=n k‑mers of size N(m+1)-1 = O(|\mathcal{S}| \log_2 |\mathcal{S}|) and exactly |\mathcal{S}| minimizers to choose from, the reduction is indeed polynomial, hence the result — once we prove said orders on m‑mers can indeed be constructed in polynomial time.
By construction, we want the m‑mers m_1,\ldots,m_{|\mathcal{S}|} to receive a rank (for all orders) between 1 and |\mathcal{S}|; all subsequent possible m‑mers (either binary, or containing at least one a) should receive a rank >|\mathcal{S}|+1, and their relative order is of no importance since they will not be selected as minimizers whatsoever. By doing so, we ensure that the N minimizers of each k‑mer X_u can only be the m_i’s it is made of. If X_u contains several such m_i’s, the only thing left to do is to ensure that each of them is indeed a minimizer for at least one order – so that the number of distinct minimizers of X_u is equal to the frequency of u in the original SetCover instance — thus ensuring the 1-to-1 correspondence between solutions of the two problems.
We exhibit Algorithm DetermineOrders (Algorithm D.1) to construct those orders.
At this point, we have determined that any m‑mer containing at least one a has rank at least 5 in any of the 3 orderings to build, and \{00,01,10,11\} must receive ranks 1 to 4. We initialize two arrays, ranks and minimizers, following Algorithm DetermineOrders. Applying Lines 7–9 to X_1 and X_5, we get:
rank |
\mathcal{O}_m^1 | \mathcal{O}_m^2 | \mathcal{O}_m^3 |
|---|---|---|---|
| 00 | |||
| 01 | |||
| 10 | |||
| 11 |
minimizers |
\min_1(x) | \min_2(x) | \min_3(x) | cand(x) |
|---|---|---|---|---|
| X_1 | 00 | 00 | 00 | \emptyset |
| X_2 | 00, 10 | |||
| X_3 | 00, 11 | |||
| X_4 | 01, 10, 11 | |||
| X_5 | 01 | 01 | 01 | \emptyset |
Now, consider X_2 and the m‑mer 00. Applying Lines 11–15, choosing l=1, we get:
rank |
\mathcal{O}_m^1 | \mathcal{O}_m^2 | \mathcal{O}_m^3 |
|---|---|---|---|
| 00 | 1 | ||
| 01 | |||
| 10 | |||
| 11 |
minimizers |
\min_1(x) | \min_2(x) | \min_3(x) | cand(x) |
|---|---|---|---|---|
| X_1 | 00 | 00 | 00 | \emptyset |
| X_2 | 00 | 10 | ||
| X_3 | 00 | 11 | ||
| X_4 | 01, 10, 11 | |||
| X_5 | 01 | 01 | 01 | \emptyset |
For X_2 and X_3, their respective remaining candidate fills the empty spots in minimizers by applying again Lines 7–9. Considering X_4, a possible outcome could be the following:
rank |
\mathcal{O}_m^1 | \mathcal{O}_m^2 | \mathcal{O}_m^3 |
|---|---|---|---|
| 00 | 1 | ||
| 01 | 2 | ||
| 10 | 1 | ||
| 11 |
minimizers |
\min_1(x) | \min_2(x) | \min_3(x) | cand(x) |
|---|---|---|---|---|
| X_1 | 00 | 00 | 00 | \emptyset |
| X_2 | 00 | 10 | 10 | \emptyset |
| X_3 | 00 | 11 | 11 | \emptyset |
| X_4 | 01 | 10 | 11 | \emptyset |
| X_5 | 01 | 01 | 01 | \emptyset |
After which, any completion of ranks would work.
Proposition D.3 Algorithm DetermineOrders is correct and runs in polynomial time.
Proof. Each iteration in the while loop eliminates one candidate minimizer, of which there are at most |\mathcal{U}|\cdot N. For a given candidate minimizer, if Lines 7–9 are executed, the for loop takes at most N steps; and if Lines 11–15 are executed, parsing a column in ranks takes |\mathcal{S}| steps, and the for loop at most |\mathcal{U}| steps. Since N\leq |\mathcal{S}|, we end up with a worst-case complexity of O\big(|\mathcal{U}|\cdot |\mathcal{S}| \cdot(|\mathcal{S}|+|\mathcal{U}|)\big), polynomial as claimed.
For the algorithm to be correct, we must never encounter the case where there exists X_i and m_j\in \texttt{cand}(X_i) such that \texttt{minimizers}(X_i,l) is already defined for all 1\leq l\leq N. In such a case, m_j could not be a minimizer for X_i. Suppose we are in such a case. Let \texttt{cand}_I(X_i) be the set of candidates of X_i at the initialization of the algorithm. Since m_j\in \texttt{cand}_I(X_i) and since \texttt{minimizers}(X_i,l) is defined for all l, then surely there exists m_{j'}\in \texttt{cand}_I(X_i), j'\neq j, that appears at least twice in the list of minimizers of X_i, by pigeonhole principle (|\texttt{cand}_I(X_i)\setminus \{m_j\}| < N). When a minimizer is affected as a consequence of Lines 11–15, it is removed from the list of candidates (and therefore cannot be added again as a minimizer for another order), so the only way m_{j'} could have been affected to two orders is as a consequence of Lines 7–9 — but those are only triggered if m_{j'} is the only remaining element of \texttt{cand}(X_i), thus a contradiction since it still contains at least m_j.
D.6.3 Reduction from VertexCover
This alternative reduction was proposed by an anonymous reviewer. Let G=(V,E) be a graph. Using the same formalism as above, associate to each vertex i a distinct m‑mer m_i = \texttt{binary}(i) encoded with m = \lceil |V|\rceil bits. As above, we use the alphabet \Sigma = \{0,1,a\}.
For each edge (i,j)\in E, assume w.l.o.g. m_i<m_j, we construct the k‑mer X_{(i,j)}=m_i a m_j. We define two orders \mathcal{O}^1 and \mathcal{O}^2 as (m_i,m_j)\in\mathcal{O}^1 \iff m_i<m_j and (m_i,m_j)\in\mathcal{O}^2 \iff m_i>m_j. Finally, we complete the orders by saying that x<y if a appears in y and not in x, for both orders \mathcal{O}^1 and \mathcal{O}^2. If a belongs to both x and y, any arbitrary order can do. In the end, for all k‑mers X_{(i,j)}, \min_1(X_{(i,j)})= m_i and \min_2(X_{(i,j)})= m_j. Let \mathcal{X}=\{X_{(i,j)} : \{i,j\} \in E, m_i < m_j \}.
Lemma D.4 VertexCover admits a yes-instance (V, E, k) iff the MultiMinCover instance (\mathcal{X}, \mathcal{O}^1, \mathcal{O}^2, k) is a yes-instance.
Proof. (\Rightarrow) If \mathcal{C} is a vertex cover of size k, then \mathcal{Y} = \{m_i : i \in \mathcal{C}\} is a multiminimizer cover: for any k‑mer X_{(i,j)}, either i \in \mathcal{C} and \min_1(X_{(i,j)}) = m_i \in \mathcal{Y}, or j \in \mathcal{C} and \min_2(X_{(i,j)}) = m_j \in \mathcal{Y}.
(\Leftarrow) Let (V,E,k) be a no-instance. Assume that our constructed instance (\mathcal{X},\mathcal{O}^1,\mathcal{O}^2,k) is a yes-instance with multiminimizer cover \mathcal{Y}. Then every X_{(i,j)} \in \mathcal{X} has a minimizer \min_1(X_{(i,j)}) = m_i or \min_2(X_{(i,j)}) = m_j per definition of \mathcal{O}^1 and \mathcal{O}^2; and the set \mathcal{C} = \{i : m_i \in \mathcal{Y}\} is a vertex cover of size k for G=(V,E) with E=\{(i,j) : X_{(i,j)} \in \mathcal{X}\}.
D.7 Density plots for multiple w values
In this section, we ran the same evaluation as in Section 16.4.1, but applied our multiminimizer iterator to multiple w and more m values to check that our results are robust with regard to w and m. Figure D.3 and Figure D.4 show that our results are indeed robust with regard to w: for all w values, our iterator can yield densities lower than the lower bound.
D.8 Conservation with respect to error rate
In this section, we study the evolution of the conservation of the sampled multiminimizers with respect to the error rate and the number of hash functions. Given a sequence S and a mutated version of the sequence S' for a given error rate, we measure the conservation using the Jaccard similarity of A = \operatorname{MultiMinimizers}(S) and B = \operatorname{MultiMinimizers}(S'), i.e. \frac{|A \cap B|}{|A \cup B|}. The results are shown in Figure D.5. In particular, we can observe that the conservation decreases slightly faster with respect to the error rate as the number of hash functions increases.
D.9 Linearity of indexation time
In this section, we show that the iteration over multiminimizers is linear with regard to the number of hash functions. The results are shown in Figure D.6.