3 Sketching sequences

Sketching is the idea of summarizing a (large) dataset into a (small) set of fingerprints, called a sketch. A sketch is designed to be much cheaper to store and compare, while still providing good approximations for properties of interest. In this chapter, we treat sequences as sets of k‑mers (without accounting for their number of occurrences or positions) and we focus on sketches approximating the similarity of two sequences.

3.1 Why do we want to sketch sequences?

The primary motivation to do sketching in the first place is the colossal amount of sequencing data we have to deal with. What is especially limiting is not just the length of the sequences themselves but more importantly how many of them we have to compare, for instance when comparing an unknown sample against millions of reference genomes to find the closest matches. Thus, sketching is generally used as an efficient filtering step, rapidly narrowing the set of sequences before applying more expensive, exact operations (Rowe, 2019).

This framing naturally leads to two design goals: sketches must be lightweight since we may need to store millions of them, and similarity estimation must be fast since the number of pairwise comparisons may scale quadratically. Ultimately, both goals push against accuracy: a smaller sketch is cheaper to store and compare, but produces less precise estimates.

3.2 Approximating similarity

The similarity of two genomic sequences is generally defined by their average nucleotide identity (ANI), which measures the expected probability that a nucleotide position in a shared genomic region is identical between two genomes (Konstantinidis & Tiedje, 2005). However, computing ANI exactly is quite expensive, so we typically rely on the similarity of the corresponding k‑mer sets instead and use it to derive an approximation of ANI.

To measure the similarity of two sets A and B, we generally use the Jaccard index (sometimes called Tanimoto coefficient) defined as J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Nevertheless, we still want to avoid having to intersect large k‑mer sets. Thus, instead of working with the original sets, we sample them into smaller sketches S(A) and S(B) for faster comparison. Figure 3.1 illustrates this idea for “bottom MinHash” sketching that we present in the next section.

Figure 3.1: Overview of Jaccard similarity and how bottom MinHash can approximate it, adapted from Ondov *et al.* (2016). Dots filled in blue represent the sketch of A, while those filled in red represent the sketch of B and crosses indicate the sketch of A \cup B.

3.3 MinHash and its variations

Now that we know we want to sketch k‑mer sets to quickly estimate their Jaccard similarity, the main question becomes: how do we design sketching methods that are as accurate as possible with a tight memory budget?

3.3.1 MinHash

MinHash (Broder, 1997) is one of the simplest solutions to this problem. Given a regular hash function h, it computes the sketch of a set A as h(A) = \min_{x \in A} h(x) The nice property of this definition is that the probability for the minimum hashes of two sets to be equal exactly matches their Jaccard index, in other words it is an unbiased estimator: \mathbb{P}(h(A) = h(B)) = J(A, B)

While comparing a single hash gives the desired result on expectation, this estimate has a high variance. Thus, in order to reduce the variance, we want to average multiple independent samples. One way to do this is to generate s independent hash functions h_1, \dots, h_s by using multiple seeds: S(A) = (\min_{x \in A} h_1(x), \dots, \min_{x \in A} h_s(x)) \\ \widehat{J}(A, B) = \frac{1}{s} \sum_{i=1}^{s} \mathbf{1}_{h_i(A) = h_i(B)} This estimator is still unbiased and its standard deviation is proportional to 1 / \sqrt{s}.

3.3.2 Bottom MinHash

One drawback of using s hash functions is that building the sketch takes \mathcal{O}(Ns) time for a sequence of size N. What we can do instead is to select the s smallest hashes using a single hash function, which we refer to as bottom-s MinHash.

A common way to maintain the s smallest values is to use a max-heap, only updating it if the incoming hash is smaller than the current max of the sketch. While the construction time can be upper-bounded by \mathcal{O}(N \log s), Broder (1997) refined the bound down to \mathcal{O}(N + s \log N \log s) by noting that the heap is only updated \mathcal{O}(s \log N) times on average.

Broder (1997) proved that this sketch also provides an unbiased estimator of Jaccard similarity¹ \widehat{J}(A, B) = \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|} where S(A) denotes the set of the s smallest hashes in A. This approach has been used in Mash (Ondov et al., 2016) to define a distance approximating ANI for genomes and metagenomes, and is illustrated in Figure 3.1.

3.3.3 FracMinHash

One limitation of fixed-sized sketches like (bottom) MinHash is that they perform poorly on sets of very dissimilar sizes (Rowe, 2019). To address this limitation, we can create a sketch whose size is proportional to the k‑mer set. Given a threshold t \in [0,1], and assuming the hash values are in [0,1], we select any hash smaller than t: S_t(A) = \lbrace h(x) : x \in A, h(x) < t \rbrace This method is now known in the literature as FracMinHash (Irber, Brooks, et al., 2022), but has appeared under different names such as scaled MinHash (Pierce et al., 2019), universe minimizers (Ekim et al., 2021) or mincode submer (Edgar, 2021) in the past.

Given a sequence of size N, and the corresponding k‑mer set of size n, a FracMinHash sketch contains nt elements in expectation and can be built in \mathcal{O}(N) time with a linear scan on the hash values. This approach has the benefit of being simple and efficient, and supports straightforward set operations such as union or intersection (for instance S_t(A \cup B) = S_t(A) \cup S_t(B)).

Rahman Hera et al. (2023) showed that the naive estimator J(S_t(A), S_t(B)) is biased, and that it can be corrected as follows: \widehat{J}(A, B) = \frac{J(S_t(A), S_t(B))}{1 - (1 - t)^{|A \cup B|}} However, one may notice that this formula depends on |A \cup B|, which is precisely what we were trying to avoid with J(A,B) in the first place. As a workaround, Rahman Hera et al. (2023) suggested to approximate |A \cup B| by |S_t(A \cup B)| / t, leading to the following approximation: \widehat{J}(A, B) = \frac{J(S_t(A), S_t(B))}{1 - (1 - t)^{|S_t(A) \cup S_t(B)| / t}}

Among practical use cases, FracMinHash is implemented as part of the popular sourmash toolchain (Pierce et al., 2019; Irber, Brooks, et al., 2022) and has been used to index all public metagenomes from the Sequence Read Archive (Irber, Pierce-Ward, et al., 2022).

3.4 Partitioned sketches and smaller fingerprints

3.4.1 Partitioned sketches

An intuitive idea to increase the number of samples without using multiple hash functions is to partition the universe of hash values into s parts, and compute a sketch for each of them. This technique is sometimes referred to as one permutation hashing in the literature (Li et al., 2012).

A simple way to implement this is to set s = 2^p and use the first p bits of a hash to select its partition. This has three main benefits: 1/ this is efficient to compute with bitwise operations, 2/ this allows us to save p bits for each element of the sketch since the first p bits are implicitly encoded by the partition and 3/ this enables faster (and parallelizable) comparisons at partition-level.

One limitation of using many partitions is that some of them may end up being empty if the input is not large enough. Some articles addressed this issue by designing densification strategies (Shrivastava & Li, 2014): each empty slot is filled with other values from the sketch. Note that densification is not mandatory and that some sketching schemes simply skip empty slots during comparison, at the cost of a slightly worse precision (Yu & Weber, 2020).

3.4.2 b-bit MinHash

One thing we may notice when computing min hashes is that the low bits carry much more information than the high ones (which are more likely to be zeros). Based on this observation, we can restrict the selected hashes to their lowest b bits to save space (Li & König, 2010). However, this introduces a collision probability which must be taken into account to correct the estimator: two distinct hashes might share the same lowest b bits, which would lead to an overestimated similarity. This technique is used in tools such as BinDash (Zhao, 2019) to estimate genome distance.

3.4.3 HyperMinHash

Yu & Weber (2020) extended this idea with an additional trick: instead of discarding the high bits, use them in a HyperLogLog counter (Flajolet et al., 2007). Each hash is thus represented by both its number of leading zeros and the b bits that follow this run of zeros (skipping the first one). Compared to b-bit hashing, HyperMinHash has the benefit of being updatable after construction and provides a precise cardinality estimator with HyperLogLog. This technique inspired the Dashing (Baker & Langmead, 2019; Baker & Langmead, 2023) toolkit.

3.4.4 MaxGeomHash

MaxGeomHash (Hera et al., 2025) is a recent method that draws inspiration from many of the previous techniques to design a sketch that scales sublinearly with the input size. Its main idea is to partition the hash values by their number of leading zeros instead of their first p bits, and select the largest² s hash values of each partition. For a fixed s, this leads to a sketch of \mathcal{O}(\log n) elements.

Moreover, Hera et al. (2025) also introduced a variant called \alpha-MaxGeomHash, which changes the number of elements selected in each partition: given \alpha \in (0,1), the partition with i leading zeros will select 2^{i \frac{\alpha}{1 - \alpha}} elements to correct the imbalance of partition sizes. With this change, the size of the sketch becomes \mathcal{O}(n^\alpha), for any desired \alpha \in (0, 1).

These two methods offer a middleground between fixed-sized sketches such as (bottom) MinHash and linear ones such as FracMinHash, while providing similar unbiased estimators for Jaccard similarity: \widehat{J}(A, B) = \frac{\sum_{i=1}^{\log n} |S_i(A \cup B) \cap S_i(A) \cap S_i(B)|}{\sum_{i=1}^{\log n} |S_i(A \cup B)|}

3.5 Other types of sketches

The sketching methods that we presented so far all have in common that they sample k‑mer hashes to estimate Jaccard similarity, but many other families of sketches have been designed with different goals in mind (Chen et al., 2026). In this section, I will cover some that I find especially promising.

3.5.1 DotHash for intersection size

DotHash (Nunes et al., 2023) takes a different approach than MinHash-like sketches: rather than sampling a subset of k‑mers and storing their hash, it computes a random projection for every k‑mer and aggregates all of them in a vector of fixed dimension d. Note that d is independent of the size n of the k‑mer set but the precision required for each aggregated component increases with n.

A key property of DotHash is that it can efficiently estimate the intersection size of two sets A and B by computing the dot product of the corresponding sketch vectors. In particular, by computing the dot product of the vector of A with itself, we obtain an estimator of the cardinality of A. Therefore, Jaccard similarity can be estimated as \widehat{J}(A, B) = \frac{S(A) \cdot S(B)}{S(A) \cdot S(A) + S(B) \cdot S(B) - S(A) \cdot S(B)}

However, DotHash has a major drawback: it only works on deduplicated k‑mer sets and cannot easily detect duplicates since the vectors are aggregated. In particular, if a given k‑mer has c_A occurrences in A and c_B in B, it will be counted c_A \times c_B times in their dot product, leading to an overestimated intersection size. Thus, DotHash is suitable on repeat-free sequences such as unitigs, but cannot be directly used on arbitrary sequences.

HyperGen (Xu et al., 2024) addresses this limitation by combining DotHash with FracMinHash sampling to sparsify and deduplicate its input. This approach results in a better space-accuracy tradeoff compared to existing MinHash-based tools, and is suitable for hardware acceleration.

Ongoing work by Faure et al. (2025) aims to adapt DotHash for large-scale all-vs-all comparisons of metagenomes.

3.5.2 SubseqSketch for edit distance

Recent results by Chen et al. (2025) focus on estimating the edit distance between two sequences without having to use an expensive alignment algorithm. Their approach, named SubseqSketch, relies on (non-contiguous) subsequences rather than k‑mers, and keeps track of their longest prefix match with respect to multiple random references. The match lengths are then used to compute a cosine similarity that strongly correlates with edit similarity. However, while this approach results in an improved sensitivity, it remains more expensive to compute than simple scans such as FracMinHash.

3.6 Toward locality-aware sampling

The sketching methods presented in this chapter share the property that each k‑mer is selected or not independently of its neighbors in the sequence. This context-free property makes them well suited for global similarity estimation, where all that matters is the overall composition of the k‑mer set. For tasks that instead demand a guarantee that two sequences sharing a long local match will also share a selected k‑mer, such as read mapping, the next chapter turns to a different family of sampling strategies.

Note that J(S(A), S(B)) is not a correct estimator.↩︎
Using the smallest hashes would work just as well.↩︎