4 Sampling with minimizers
The sketching methods presented in Chapter 3 compress a sequence into a compact summary, but they treat the sequence as a simple “bag” of k-mers: two sequences with a high Jaccard index are considered similar regardless of where the shared k-mers appear.
Many bioinformatics tasks, such as read alignment or index construction, require something stronger: a guarantee that two sequences sharing a long local exact match will also share at least one selected k-mer at corresponding positions. This locality property is what distinguishes minimizers from general sketching, and is the subject of this chapter.
4.1 Definition and basic properties
4.1.1 Historical context
The concept of minimizers was introduced independently by two groups in 2003–2004. Schleimer et al. (2003) introduced winnowing in the context of document fingerprinting and plagiarism detection: a sliding window scans the k-mers of a document and the smallest k-mer (according to some ordering) is selected as a fingerprint. Roberts et al. (2004) proposed the same idea for biological sequences, introducing the term minimizer, with the explicit goal of providing a locality-preserving sketch: any two sequences sharing a long exact match are guaranteed to share at least one minimizer.
A minimizer scheme is parametrized by three values: the k-mer length k, the window size w, and a total ordering \prec on k-mers. A window of size w is a set of w consecutive k-mers, spanning a substring of length \ell = w + k - 1. The minimizer of a window is the k-mer that is smallest under \prec, with ties broken by selecting the leftmost occurrence¹. Sliding the window one position at a time along the sequence yields the minimizer sketch of that sequence: the (deduplicated) sequence of minimizer positions.
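The definition above translates directly into code. The sketch below is illustrative, not taken from any particular library; it assumes a lexicographic ordering \prec on k-mer strings, and the function name minimizer_sketch is hypothetical.

```python
def minimizer_sketch(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return the deduplicated list of (position, k-mer) minimizers of seq.

    Each window of w consecutive k-mers (a substring of length w + k - 1)
    contributes its smallest k-mer under lexicographic order, with ties
    broken by the leftmost occurrence.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Keying min() on the pair (k-mer, offset) breaks ties by the
        # leftmost occurrence, matching the convention in the text.
        pos = start + min(range(w), key=lambda j: (window[j], j))
        # Minimizer positions are non-decreasing across consecutive
        # windows, so comparing against the last entry deduplicates.
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch

# For example, with k = 3 and w = 2:
# minimizer_sketch("ACGTACGT", 3, 2)
# → [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Note how the last two windows both select the k-mer ACG at position 4, which is recorded only once: adjacent windows often share a minimizer, and this overlap is exactly what makes the sketch sparser than one selected k-mer per window.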
4.1.2 Fundamental properties
4.1.3 Density
4.2 Orderings
4.2.1 Lexicographic ordering
4.2.2 Random ordering
4.3 Super-k-mers and sequence partitioning
4.4 Canonical minimizers
4.5 Improving conservation
4.5.1 Syncmers
4.5.2 Strobemers
4.6 Practical applications
4.6.1 Read alignment
4.6.2 Sequence classification and metagenomics
4.6.3 Partitioning indexes
4.6.4 Genome assembly
TBA
- historical context (Schleimer et al., 2003; Roberts et al., 2004)
- random vs lexicographic
- practical use cases (Ndiaye et al., 2024)
- improving conservation (Edgar, 2021; Sahlin, 2021, 2022; Shaw & Yu, 2021)
- low density schemes detailed in Chapter 15
¹ Choosing the leftmost occurrence is a convention; some schemes use the rightmost one or select all tied k-mers.