4 Sampling with minimizers
The sketching methods presented in Chapter 3 compress a sequence into a compact summary, but they treat the sequence as a simple “bag” of k-mers: two sequences with a high Jaccard index are considered similar regardless of where the shared k-mers appear.
Many bioinformatics tasks, such as read alignment or index construction, require something stronger: a guarantee that two sequences sharing a long local exact match will also share at least one selected k-mer at corresponding positions. This locality property is what distinguishes minimizers from general sketching, and is the subject of this chapter.
4.1 Definition and basic properties
4.1.1 Historical context
The concept of minimizers was introduced independently by two groups in 2003–2004. Schleimer et al. (2003) introduced winnowing in the context of document fingerprinting and plagiarism detection: a sliding window scans the k-mers of a document and the smallest k-mer (according to some ordering) is selected as a fingerprint. Roberts et al. (2004) proposed the same idea for biological sequences, introducing the term minimizer, with the explicit goal of providing a locality-preserving sketch: any two sequences sharing a long exact match are guaranteed to share at least one minimizer.
A minimizer scheme is parametrized by three values: the k-mer length k, the window size w, and a total ordering \prec on k-mers. A window of size w is a set of w consecutive k-mers, spanning a substring of length \ell = w + k - 1. The minimizer of a window is the k-mer that is smallest under \prec, with ties broken by selecting the leftmost occurrence¹. Sliding the window one position at a time along the sequence yields the minimizer sketch of that sequence: the (deduplicated) sequence of minimizer positions.
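The definition above translates directly into code. The sketch below is illustrative, not taken from any particular library; it assumes a lexicographic ordering \prec on k-mer strings, and the function name minimizer_sketch is hypothetical.

```python
def minimizer_sketch(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return the deduplicated list of (position, k-mer) minimizers of seq.

    Each window of w consecutive k-mers (a substring of length w + k - 1)
    contributes its smallest k-mer under lexicographic order, with ties
    broken by the leftmost occurrence.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Keying min() on the pair (k-mer, offset) breaks ties by the
        # leftmost occurrence, matching the convention in the text.
        pos = start + min(range(w), key=lambda j: (window[j], j))
        # Minimizer positions are non-decreasing across consecutive
        # windows, so comparing against the last entry deduplicates.
        if not sketch or sketch[-1][0] != pos:
            sketch.append((pos, kmers[pos]))
    return sketch

# For example, with k = 3 and w = 2:
# minimizer_sketch("ACGTACGT", 3, 2)
# → [(0, 'ACG'), (1, 'CGT'), (2, 'GTA'), (4, 'ACG')]
```

Note how the last two windows both select the k-mer ACG at position 4, which is recorded only once: adjacent windows often share a minimizer, and this overlap is exactly what makes the sketch sparser than one selected k-mer per window.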
4.1.2 Fundamental properties
4.1.3 Density
4.2 Orderings
4.2.1 Lexicographic ordering
4.2.2 Random ordering
4.3 Super-k-mers and sequence partitioning
4.4 Canonical minimizers
4.5 Improving conservation
4.5.1 Syncmers
4.5.2 Strobemers
4.6 Practical applications
4.6.1 Read alignment
4.6.2 Sequence classification and metagenomics
4.6.3 Partitioning indexes
4.6.4 Genome assembly
TBA
- historical context (Schleimer et al., 2003; Roberts et al., 2004)
- random vs lexicographic
- practical use cases (Ndiaye et al., 2024)
- improving conservation (Edgar, 2021; Sahlin, 2021, 2022; Shaw & Yu, 2021)
- low density schemes detailed in Chapter 15
¹ Choosing the leftmost occurrence is a convention; some schemes use the rightmost one or select all tied k-mers.