4  Sampling with minimizers

The sketching methods presented in Chapter 3 compress a sequence into a compact summary, but they treat the sequence as a simple “bag” of k-mers: two sequences with a high Jaccard index are considered similar regardless of where the shared k-mers appear.

Many bioinformatics tasks, such as read alignment or index construction, require something stronger: a guarantee that two sequences sharing a sufficiently long exact local match will also share at least one selected k-mer, at corresponding positions within that match. This locality property is what distinguishes minimizers from general sketching, and it is the subject of this chapter.
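To make the locality property concrete before the formal definition in Section 4.1, the following is a minimal sketch rather than a reference implementation. It assumes the classic window scheme: in every window of w consecutive k-mers, the smallest k-mer under lexicographic order is selected, with ties broken by the leftmost position. The function name `minimizers`, the parameters `k` and `w`, and the toy sequences are illustrative assumptions.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) pairs selected as window minimizers.

    Assumptions for this sketch: lexicographic order on k-mers,
    leftmost k-mer wins ties, no canonical (reverse-complement) handling.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first smallest item, which gives leftmost tie-breaking
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return selected


k, w = 4, 5
shared = "ACGTTGCATGCAAGT"          # exact match of length 15 >= w + k - 1 = 8
a = "TTTTT" + shared + "GGGGG"      # shared region starts at offset 5
b = "CCCCCCC" + shared + "AAAA"     # shared region starts at offset 7

# Express minimizer positions relative to the start of the shared region
rel_a = {(pos - 5, kmer) for pos, kmer in minimizers(a, k, w)}
rel_b = {(pos - 7, kmer) for pos, kmer in minimizers(b, k, w)}

# Selected k-mers that both sequences pick at the same relative position
common = rel_a & rel_b
print(sorted(common))
```

Any exact match of length at least w + k - 1 contains one full window of w consecutive k-mers. Because the selection depends only on the content of the window, both sequences select the same k-mer at the same relative position, so `common` is never empty under these assumptions.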

4.1 Definition and basic properties

4.1.1 Historical context

4.1.2 Fundamental properties

4.1.3 Density

4.2 Orderings

4.2.1 Lexicographic ordering

4.2.2 Random ordering

4.3 Super-k-mers and sequence partitioning

4.4 Canonical minimizers

4.5 Improving conservation

4.5.1 Syncmers

4.5.2 Strobemers

4.6 Practical applications

4.6.1 Read alignment

4.6.2 Sequence classification and metagenomics

4.6.3 Partitioning indexes

4.6.4 Genome assembly


TBA