17 Sketching super‑k‑mers

Note

This chapter is adapted from (Rouzé et al., 2023) and (Rouzé et al., 2025), accepted to WABI 2023 and published in LIPIcs and Algorithms for Molecular Biology.

Antoine Limasset and Camille Marchet came up with the original idea. Timothé Rouzé wrote most of the code and ran all the experiments. I proposed the notion of fractional hitting sets, derived the bounds, wrote all the proofs and implemented the decycling sets optimization.

As discussed in Chapter 3, FracMinHash (Irber et al., 2022) is a practical method for large-scale containment and Jaccard estimation of k‑mer sets: its sketches scale linearly with input, avoiding the fixed-budget limitations of tools like Mash (Ondov et al., 2016), and can handle documents of widely differing sizes. The storage strategy as implemented in sourmash (Pierce et al., 2019) treats each selected k‑mer as an opaque 32-bit hash, without exploiting the overlapping structure that consecutive k‑mers naturally share. For collections at the scale of NCBI RefSeq, which contains over 1.2 million bacterial genomes totalling more than 5 Tbp, this per-element overhead becomes a bottleneck.

The key observation we build on is that consecutive k‑mers share overlaps and naturally group into super‑k‑mers (see Section 4.3), making it possible to represent a set of selected k‑mers more compactly than by storing individual hashes. Universal Hitting Sets (discussed in Chapter 15) aim to cover all k‑mers with a low-density scheme; however, our application does not require complete coverage of the k‑mer space. We therefore introduce Fractional Hitting Sets (FHS) that select a near-uniform fraction of the k‑mer space, study the achievable density as a function of that fraction, and show that the resulting super‑k‑mers are particularly amenable to compact encoding. We implement this scheme in a tool called SuperSampler, and evaluate it against sourmash (Pierce et al., 2019).

17.1 Preliminaries

Throughout this chapter we fix a DNA alphabet \Sigma = \{A,C,G,T\} with \sigma = |\Sigma| = 4, and consider two input sets of sequences S_A and S_B (in practice, read sets or assembled genomes), with A and B denoting their corresponding k‑mer sets.

Different notations

We reuse the notations from Section 4.3: the minimizer size is denoted m while k denotes the number of bases in a window, i.e. k = w + m - 1.

Definition 17.1 (maximal super‑k‑mer) Let s (|s|\geq k) be a string and v a super‑k‑mer of s. Let i_m be the first position of the minimizer on s. v is a maximal super‑k‑mer iff v starts at position i_m + m - k in s and v ends at position i_m + k - 1 in s. It follows that v has a length of 2k-m and contains w=k-m+1 k‑mers.

Examples of regular and maximal super‑k‑mers are shown in Figure 17.1. As noted in (Pibiri et al., 2023, theorem 3), the proportion of maximal super‑k‑mers approaches \frac{1}{4} for large k.

Figure 17.1: Super‑k‑mers extracted from a sequence for k=6, m=3. Minimizers are shown in pink, here we use the lexicographic order instead of hashing minimizers for the sake of simplicity. Super‑k‑mers 1 and 4 are maximal (they contain respectively k‑mers {`CTGAAA`, `TGAAAT`, `GAAATG`, `AAATGC`} and {`TGCACA`, `GCACAT`, `CACATT`, `ACATTT`}), while 2 and 3 are not (and contain respectively k‑mers {`AATGCA`} and {`ATGCAC`}).

As recalled in Section 4.2, random minimizers achieve a density of 2/(w+1), and we define the density factor as the density multiplied by w+1 (so random minimizers have a factor of 2). The question we explore here is whether relaxing full coverage to a fraction f of k‑mers allows us to push the density factor below 1, and what the implications are for sketch compactness.

17.2 Fractional Hitting Sets

A fractional hitting set relaxes the UHS requirement by only requiring that a fraction f of windows are hit, rather than all of them.

Definition 17.2 (fractional hitting set or FHS) Given f \le 1, a set \mathcal{F} \subseteq \Sigma^m is a Fractional Hitting Set (FHS) if a fraction at least f of the sequences of w consecutive m‑mers have an element contained in \mathcal{F}.

To avoid selection bias, we use a random hash function that distributes m‑mers uniformly over \llbracket 1, \sigma^m \rrbracket and select all m‑mers whose hash falls below a fixed threshold. We call these small minimizers. Note that any method selecting a fraction of the minimizers hashes would be suitable here.

Definition 17.3 (small m‑mer) Given a fixed threshold t \in \llbracket 1, \sigma^m \rrbracket, we say that a m‑mer is small if its hash is below t. We denote by \mathcal{S} the set of small m‑mers, and p = \frac{t}{\sigma^m} the probability that a m‑mer is small.

From p we can derive the proportion of covered k‑mers.

Proposition 17.1 Given t \in \llbracket 1, \sigma^m \rrbracket and p = \frac{t}{\sigma^m}, the expected fraction of k‑mers with distinct m‑mers containing a small m‑mer is f = 1 - (1 - p)^w where w = k - m + 1 and p is the probability that a m‑mer is small.

Proof. Given a k‑mer with w distinct m‑mers x_1, \dots, x_{w}, \Pr(\forall i \in \llbracket 1, w \rrbracket, h(x_i) > t) = \prod_{i=1}^{w} \Pr(h(x_i) > t) = (1 - p)^w because the hashes of distinct m‑mers are independent, so f = \Pr\left(\min_{i \in \llbracket 1, w \rrbracket} h(x_i) \le t\right) = 1 - (1 - p)^w

Note that this property is valid for k‑mers with distinct m‑mers; k‑mers with duplicated m‑mers (i.e. k‑mers containing repetitions) have a lower coverage since they have fewer candidates for small minimizers. Fortunately, as shown in (Zheng et al., 2020) (see lemma 9), the proportion of k‑mers with duplicated m‑mers is negligible for a sufficiently large m (> (3 + \varepsilon) \log_\sigma w). Thus, our selection method is uniform for k‑mers with distinct m‑mers, and almost uniform in the entire k‑mer space.

Conversely, if we want a given fraction f of the k‑mers to be covered, the threshold should be chosen as t = \left[1 - (1 - f)^{1/w}\right] \cdot \sigma^m \tag{17.1}

Related to this, let us define the subsampling rate as s = \frac{1}{f}. For instance, a desired subsampling rate of 1000 will give f=\frac{1}{1000}.

17.2.1 Density of small minimizers

We showed that selecting k‑mers with a small minimizer induces an FHS. By considering the usual definition of density (from Definition 4.1) for this scheme (which covers a fraction f of all k‑mers), we obtain the following bound (proven in Section E.2):

Theorem 17.1 Given f \le 1 and t = \left[1 - (1 - f)^{1/w}\right] \cdot \sigma^m, assuming m > (3 + \varepsilon) \log_\sigma w, the expected density of small minimizers in a random sequence is upper bounded by \frac{2 f}{w + 1} + o(1/w)

At first glance, the results may be surprising, as the density is smaller than the lower bound of 1/w for f < 1/2 and can approach zero. This is because some k‑mers may not contain any small minimizers and are therefore not covered, and the proportion of such k‑mers increases as f approaches 0. However, it is worth noting that this bound does match the 2/(w+1) density when f=1 (i.e., when every k‑mer is covered).

To obtain a more meaningful metric, we can compute the density on the fraction of the sequence that is covered, instead of the entire sequence. With this approach, we obtain the following theorem, which has been proven in the Section E.3:

Theorem 17.2 (restricted density) If we restrict to sequences in which every k‑mer contains a small minimizer, then given f \le 1 and t = \left[1 - (1 - f)^{1/w}\right] \cdot \sigma^m, assuming m > (3 + \varepsilon) \log_\sigma w, the expected density of small minimizers is upper bounded by 2 \cdot \frac{f + (1 - f) \ln (1 - f)}{f^2 (w+1)} + o(1/w)

Although less intuitive than the previous one, this result provides valuable insights into the density within the covered portion of the sequence. As shown in Figure 17.2 (a), the associated density factor ranges from 2 when f=1 (consistent with existing results) to 1 when f=0. Therefore, as f approaches 0, we can approach the optimal density.

17.2.2 Proportion of maximal super‑k‑mers

Although measuring the density provides an overview of the average length of super‑k‑mers, it does not indicate how many of them are maximal (i.e., of length 2k-m). The following result (proven in the Section E.4) answers this question:

Theorem 17.3 Given f \le 1 and t = \left[1 - (1 - f)^{1/w}\right] \cdot \sigma^m, the average proportion of maximal super‑k‑mers (with respect to all super‑k‑mers) built from small minimizers in a random sequence is given by \left[\left(1 - \frac{1}{w}\right) \frac{f}{1 + f}\right]^2 + \frac{1 - f(1 - 2/w)}{1 + f}

Note that this result generalizes theorem 2 from (Pibiri et al., 2023), which corresponds to f = 1. As shown in Figure 17.2 (b), the proportion increases towards 100% as f approaches 0.

(a) Theoretical upper bound on the density factor from Theorem 17.2, lower is better.

17.2.3 Improving the density of fractional hitting sets using UHS

This effect is more pronounced for smaller values of f. This observation raises a natural question: what is the lowest achievable density for a given f? Since universal hitting sets (UHS) with a density lower than 2 have been proposed for f=1, it is possible that they may also improve the density for smaller f values by considering only the m‑mers selected by the UHS as potential minimizers.

Theorem 17.4 Given a UHS \mathcal{U} with density d_{\mathcal{U}}, f \le 1 and t = \left[1 - (1 - f)^{1/w}\right] \cdot \sigma^m, assuming m > (3 + \varepsilon) \log_\sigma w, the expected density of small minimizers selected from \mathcal{U} (that is, \mathcal{S} \cap \mathcal{U}) in a random sequence is upper bounded by f \cdot d_{\mathcal{U}} + o(1/w)

The proof is given in Section E.5. Note that this result generalizes Theorem 17.1 since the UHS of minimizers selected using a random ordering has a density of 2/(w+1) (Schleimer et al., 2003).

17.3 Sketching technique in SuperSampler

17.3.1 SuperSampler’s sketch construction

Definition 17.4 (SuperSampler’s sketch) Given a sequence S, each super‑k‑mer whose minimizer’s hash is lower or equal to a threshold t is selected, and all its surrounding k‑mers are kept in the sketch as a super‑k‑mer. A SuperSampler sketch can therefore be represented as a super‑k‑mer set.

Sourmash stores each selected k‑mer as a 32-bit hash, which introduces a small false-positive rate but keeps the encoding simple and sequence-agnostic. As illustrated in Figure 17.3, SuperSampler instead stores k‑mers as super‑k‑mers: fingerprints are exact (no false matches, shared k‑mers can be retrieved), and overlapping k‑mers within a super‑k‑mer share bases, reducing space.

When using random minimizers, super‑k‑mers contain (k-m+1)/2 k‑mers on average (Pibiri et al., 2023), giving a mean length of (3k-m-1)/2 bases. Each super‑k‑mer also needs a \log_2(k-m+1)-bit length header to delimit it, yielding a lower bound in bits / k‑mer of \frac{2\left(3k - m - 1 + \log_2(k - m + 1)\right)}{k - m + 1} Using Equation 17.1 to set the fraction f lowers the density further, producing longer super‑k‑mers. At low f, almost all super‑k‑mers are maximal (Theorem 17.3), which removes the need for length headers (all have the same length 2k-m) and gives the tighter bits / k‑mer bound \frac{2(2k-m)}{k-m+1}

Figure 17.3: SuperSampler’s sketching strategy. In order to build sketches, SuperSampler computes super‑k‑mers over the input sequence. Fingerprints are associated with each super‑k‑mer by hashing their minimizers to an integer, hence an integer per super‑k‑mer. Super‑k‑mers associated to sufficiently low integers are kept in the sketch. Super‑k‑mers are put into partitions according to their minimizer.

17.3.2 Partitioned sketches

Since each super‑k‑mer is centered around a unique minimizer occurrence, super‑k‑mers naturally partition into groups sharing the same minimizer. SuperSampler stores each non-empty partition (a minimizer and its associated super‑k‑mers) independently. Within a partition, the minimizer sequence is stored once and omitted from all maximal super‑k‑mers, since its position within each maximal super‑k‑mer is known beforehand. This gives an even lower bits / k‑mer ratio of \frac{4(k-m)}{k-m+1} Figure 17.4 shows the space cost of the three encodings (plain super‑k‑mer, maximal super‑k‑mer, partitioned super‑k‑mer) as a function of minimizer size, alongside measured performance on random sequences.

Figure 17.4: Theoretical space cost of different encodings in bits / k‑mer according to the k‑mer size along with practical space usage of SuperSampler sketches on random sequences.

Partitioning also accelerates comparisons: matching k‑mers between two sketches must share a minimizer, so only one partition needs to be in memory at a time.

17.3.3 Set comparisons

Figure 17.5: How SuperSampler and sourmash perform their respective sketch comparison. Colored rectangles represent k‑mers. Those sharing the same color are sharing a common minimizer. In SuperSampler sketches, k‑mers sharing their minimizers are stored in the same partition. In this example, we discuss the comparison of one document against a collection, although other use cases can be inferred. SuperSampler is capable of skipping certain partitions that are not relevant to the query. By focusing on smaller sub-parts of the collection one at a time, SuperSampler effectively improves practical performance and reduces memory usage.

SuperSampler’s sketch comparison produces an estimator for both Jaccard index J(A,B) = |A \cap B| / |A \cup B| and containment index C(A,B) = |A \cap B| / |A|.

Unlike sourmash, which compares all fingerprints globally (Algorithm 17.1), SuperSampler iterates over minimizer partitions (Algorithm 17.2, Figure 17.5): partitions absent from the query are skipped entirely, and the remaining ones are processed independently. This is particularly beneficial for large collections, where most partitions are query-specific and can be skipped.

Algorithm 17.1: Sketch comparison in sourmash.

\begin{algorithmic} \State{Let $Q$ be a list of query files $Q_0..Q_i..Q_{s-1}$} \State{Let $T$ be a list of target files $T_0..T_i..T_{p-1}$} \State{$HQ$ is an array of $s$ sets of fingerprints, one for each query file} \State{$HT$ is built similarly for each target} \State{$M$ is a matrix of size $s \times p$ filled with $0$} \ForAll{$i \in HQ$} \ForAll{$j \in HT$} \State{$M[i][j] = |HQ[i] \cap HT[j]|$} \Comment{time linear in the size of the smallest set} \EndFor \EndFor \end{algorithmic}

Algorithm 17.2: Sketch comparison in SuperSampler.

\begin{algorithmic} \State{Let $Q$ be a list of Query files $Q_0..Q_i..Q_{s-1}$} \State{$HQ$ is a hashtable of hashtables} \State{$HQ$ maps any minimizer in a file $Q_i$ from $Q$ to a hashtable} \State{The second hashtable maps fingerprints from a given minimizer to a list of indexes of source files in $0..s-1$} \State{Similarly, a hashtable of hashtables $HT$ is built for each target dataset $T_0..T_{p-1}$} \State{$M$ is a matrix of size $s \times p$ filled with $0$s} \ForAll{minimizer $m=\mathrm{key}$ in $HQ$} \If{$m \in HT$} \ForAll{fingerprint $f=\mathrm{key}$ in $HQ[m]$} \If{$f \in HT[m]$} \Comment{Partition $HT[m]$ is loaded only if necessary} \ForAll{index $dq \in HQ[m][f]$} \ForAll{index $dt \in HT[m][f]$} \State{$M[dq][dt] \mathrel{+}= 1$} \EndFor \EndFor \EndIf \EndFor \EndIf \EndFor \end{algorithmic}

17.4 Results

All experiments were run on a cluster node with an Intel Xeon Gold 6130 CPU (2.10 GHz) and 128 GB of DDR4 RAM.

17.4.1 Space efficiency of SuperSampler.

For m = 15, the first bound from Section 17.3.1 gives 9.2 bits / k‑mer with k=31 and 7.1 with k=63, but in practice SuperSampler achieves approximately 6.5 and 5 bits / k‑mer respectively as depicted in Figure E.2, thanks to the lower density of the FHS scheme. As shown in Figure 17.6, the density factor drops quickly with the subsampling rate: already below 1.04 at rate 100, and approaching 1 at rate 1000 (the sourmash default). Combining the FHS scheme with double decycling sets (Pellow et al., 2023) gives an additional improvement, particularly noticeable for small k (Figure 17.7).

Figure 17.6: Measured density factor compared to the model.

Figure 17.7: Space cost in bits / k‑mer according to the subsampling rate with and without using decycling sets (yellow and red lines).

17.4.2 Performance Comparison

We compared SuperSampler against sourmash on two datasets: 1024 Salmonella genomes (high mutual similarity) and 1024 RefSeq bacterial genomes (Jaccard similarity near 0). All-vs-all comparisons were timed and profiled with Snakemake, and exact Jaccard and containment values from Simka (Benoit et al., 2016) serve as ground truth for precision estimates.

Figure 17.8: Resource consumption results for 1024 bacterial genomes from RefSeq with *k = 31*.

17.4.2.1 RAM, time, and sketch size

Figure 17.8 and Figure E.3 demonstrate that, for k=31, SuperSampler generally consumes 5 times less RAM and requires 16 times less space than sourmash. Additionally, SuperSampler performs computations 50 times faster than sourmash when comparing highly dissimilar genomes. However, when genomes are very similar, such as with Salmonella, comparison times are comparable since SuperSampler’s time optimization does not apply to very similar documents. We also note that minimizer size has little to no impact on these metrics.

Figure E.4 and Figure E.5 reveal that the improvement in sketch disk size is even more significant with larger values of k. SuperSampler uses in general 50 times less disk space than sourmash with k=63, while maintaining similar differences in RAM and computation time.

17.4.2.2 Error

SuperSampler’s accuracy is slightly below sourmash’s (Figure 17.9), due to a mild clustering effect: k‑mers sharing a small minimizer tend to be co-selected. In terms of precision per unit of compressed sketch size, however, SuperSampler compares favorably (Figure 17.10). SuperSampler also stores k‑mers exactly, so shared k‑mers can be retrieved directly, something sourmash cannot do without switching to larger, invertible hashes.

Figure 17.9: Error Jaccard similarity approximation for sourmash and SuperSampler against the compressed sketch sizes. This plot shows the results for 20 files of salmonellas simulated reads with k = 31, m = 13. Reads are 150bp long for a 100X coverage. Dashed lines represent the sizes of the sketches indexing all k‑mers.

Figure 17.10: Error against Simka on Containment similarity approximation for sourmash (red line) and SuperSampler with different minimizer sizes, on 1024 Salmonella genomes with *k = 63*.

17.4.2.3 Massive collection indexing

Figure 17.11 shows performance on 100 to 128,000 RefSeq bacterial genomes. SuperSampler handles the full range in under 25 CPU hours; sourmash was stopped at 32,000 genomes. The performance gap narrows at large scale because the n \times n output matrix dominates memory pressure for both tools. Sketch creation is cheap in both cases, with IO as the bottleneck.

Figure 17.11: Computational time for comparisons on different amounts of bacterial genomes from RefSeq. From 100 to 128,000 genomes with k = 63, s = 1000 and m = 15. sourmash was run up to 32,000 genomes as it was taking too much time for the 2 last experiments.

17.5 Conclusion

This chapter introduced fractional hitting sets, a relaxation of universal hitting sets that covers a fraction f of the k‑mer space. The resulting small-minimizer scheme achieves a density factor below 1.5 for f \le 0.8 and approaching 1 as f \to 0, producing longer and more compact super‑k‑mers than full-coverage schemes, with an increasing proportion of maximal super‑k‑mers as f decreases. SuperSampler exploits this by storing selected super‑k‑mers in minimizer-partitioned sketches, enabling both compact encoding and efficient partition-level comparisons. Experiments confirm the theoretical density bounds and show substantially lower disk and RAM usage than sourmash at comparable accuracy.

We plan to investigate alternative methods for sketch comparison, like sorted fingerprints, which could potentially reduce the complexity of the comparison process. From a theoretical perspective, delving deeper into the properties of fractional hitting sets and gaining a better understanding of density and restricted density bounds for various values of f may lead to even more efficient and robust sketching techniques. Finally, studying theoretical guarantees on the bias of this approach would be an interesting direction.