2 Comparing using k‑mers
In contrast to full-text methods such as the Burrows-Wheeler Transform (BWT) (Burrows & Wheeler, 1994; Ferragina & Manzini, 2000), which aim to losslessly represent the original sequence while enabling more operations, we can often afford to lose some information on a sequence if this allows us to reduce its memory footprint or speed up comparisons. In this thesis, we will not cover full-text representations, and will focus on representing substrings of fixed-size k, or k‑mers. In particular, we will often see sequences as sets of k‑mers, without accounting for their positions or number of occurrences.
2.1 Coarse-grained comparisons with k‑mers
Intuitively, indexing k‑mers allows comparisons at a granularity of at least k bases, rather than at the level of individual nucleotides. This immediately speeds up comparisons by short-circuiting the first k positions, but it also means that matches shorter than k cannot be identified.
Working with fixed-size substrings naturally leads to fairly simple data structures. For small values of k (up to around 15), k‑mer sets are typically dense enough in \Sigma^k to use a plain lookup table. For larger k, however, the k‑mer set of a sequence becomes much sparser in \Sigma^k, and hash tables are usually preferred instead.
In the applications considered in this thesis, k typically ranges from 20 to 70, thus yielding sparse sets. Additionally, a k‑mer and its reverse-complement are often treated as equivalent, since sequencing methods do not tell us which DNA strand a read originates from. We therefore designate one of the two1 as a canonical representative. For this reason, k is usually restricted to odd values to avoid the degenerate case where a k‑mer is its own reverse-complement2.
2.2 De Bruijn graph representation
One immediate observation about the k‑mer set of a sequence is that two k‑mers at successive positions always overlap by k-1 bases. From this property stems the notion of de Bruijn graph (DBG), where each node corresponds to a k‑mer and a directed edge from x to y exists whenever x_2 \dots x_k = y_1 \dots y_{k-1}. This definition is referred to as node-centric, but we can define an edge-centric version that’s equivalent by representing k‑mers as edges and k - 1 overlaps as nodes. Figure 2.1 shows the node-centric DBG induced by the sequence CTAAGAAGGT, while Figure 2.2 shows its edge-centric counterpart. In this setting, any sequence of length \ge k corresponds to a walk in the DBG. This representation can however lead to ambiguities in the presence of repeats: for instance, CTAAGAAGGT, CTAAGAAGAAGGT and CTAAGAAGAAGAAGGT all correspond to the same graph for k = 3.
CTAAGAAGGT.
CTAAGAAGGT.
Note that in practice, when a k‑mer and its reverse-complement are treated as equivalent, the corresponding nodes are usually merged and DBGs become bidirected graphs, where each node carries two orientations and edges connect compatible ends.
2.2.1 Identifying small variations
An important property of DBGs is that they make small edits between similar sequences easy to spot (Iqbal et al., 2012). For instance, in Figure 2.3, substituting the last A with a C creates a bubble around the k k‑mers affected by the substitution. When combined with the abundance of each k‑mer, this allows erroneous k‑mers in raw reads to be corrected, or different haplotypes to be distinguished within a genome. DBGs therefore prove particularly useful for genome assembly (Bankevich et al., 2012; Kolmogorov et al., 2019; Ekim et al., 2021). Figure 2.4 shows a more complex DBG for k = 31 built from mRNA transcripts of SSNA1 gene, highlighting different exons identified in the graph.
Historically, this is also where de Bruijn graphs first entered bioinformatics: reconstructing a sequence from its reads raised scalability issues that motivated collapsing read depth into distinct k‑mers and exploiting the compressibility of non-branching paths. Both ideas later found applications well beyond assembly.
TCTAAGGTT and TCTACGGTT, each highlighted in a distinct color.
2.2.2 Compact representations
The set of k‑mers associated to a sequence is often referred to as its k‑mer spectrum3 and denoted \operatorname{Sp}_k(T) for a string T. This notion can naturally be extended to sets of strings by merging their spectra. Since having access to the k‑mer spectrum is sufficient to construct the corresponding DBG, one may want to find a compact way to represent it. Rahman & Medevedev (2021) and Břinda et al. (2021) formulated this problem as finding a minimum spectrum-preserving string set (SPSS) / simplitigs: given an input set of sequences \mathcal{I}, find a new set of sequences \mathcal{S} such that \operatorname{Sp}_k(\mathcal{I}) = \operatorname{Sp}_k(\mathcal{S}), each k‑mer appears exactly once in \mathcal{S} and \sum_{T \in \mathcal{S}} |T| is minimal.
A simple way to build a (non-minimal) SPSS is to decompose the edge-centric DBG into maximal paths with non-branching inner nodes, known as unitigs (Chikhi et al., 2015). For instance, the DBG in Figure 2.2 is decomposed as \{\textsf{CTAA}, \textsf{AAG}, \textsf{AGAA}, \textsf{AGGT}\}. The resulting graph is called a compacted de Bruijn graph (cDBG), and can be computed efficiently in (external) memory (Chikhi et al., 2016; Khan et al., 2022; Cracco & Tomescu, 2023).
While building unitigs does not result in a minimal SPSS (unless there are no branches), Schmidt & Alanko (2023) introduced eulertigs as an optimal solution based on Eulerian circuits that can be computed in linear time. Moreover, we can relax the unicity constraint and allow repeated k‑mers to further reduce output size, generalizing SPSS into repetitive SPSS (rSPSS). Schmidt et al. (2023) proposed an algorithm based on perfect matching and Eulerian circuits to find minimal rSPSS, referred to as matchtigs, and presented a faster greedy heuristic to approximate it. All these solutions have since been implemented as part of GGCAT (Cracco & Tomescu, 2023).
2.3 Using k‑mers for petabase-scale data structures
The compact representations above address how to store a k‑mer set efficiently. A complementary question is how to search across very large collections of such sets. Driven by projects like Tara Oceans (Karsenti et al., 2011) and Pebblescout (Shiryev & Agarwala, 2024), the unprecedented growth of sequencing data (Stephens et al., 2015) has motivated a parallel effort to make these massive collections efficiently searchable. The central problem here is sequence retrieval: given a query sequence, find all samples in the collection that contain it. In this context, k‑mers provide a natural intermediate representation, as each sample can be summarized by its k‑mer set and collections of such sets indexed jointly. We survey recent methods for indexing individual k‑mer sets in Chapter 10, and focus on collection-level methods in the rest of this section.
A first generation of approaches relied on approximate membership data structures. Sequence Bloom Trees (Solomon & Kingsford, 2016) organized per-sample Bloom filters in a hierarchy to prune the search space. BIGSI (Bradley et al., 2019) stored these filters in a bit-sliced layout for fast intersection across hundreds of thousands of samples. REINDEER (Marchet et al., 2020; Hernandez-Courbevoie et al., 2025) took a different angle and indexed k‑mer counts rather than their presence or absence.
Exact, lossless representations emerged with colored compacted de Bruijn graphs (ccDBG), which annotate each k‑mer in a shared DBG with the set of samples (colors) containing it. The index is usually decomposed into a k‑mer dictionary and a sparse annotation matrix over samples. Compressing this annotation matrix is the main algorithmic challenge: Mantis (Pandey et al., 2018) groups k‑mers with identical color sets into equivalence classes, MetaGraph (Karasikov et al., 2025) uses topology-aware compression, and Fulgor (Campanelli et al., 2024; Fan et al., 2024) exploits color similarity between adjacent unitigs and uses repetition-aware compression to achieve state-of-the-art index sizes.
The Logan project (Chikhi et al., 2024) recently proposed a complementary approach: rather than indexing raw reads directly, it first compacted the majority of the Sequence Read Archive into unitigs, drastically reducing the volume of data before indexing. Logan-Search4 then builds a k‑mer index over the resulting unitigs to enable large-scale search over all entries. Ongoing work by Rouzé et al. (2025) aims to further reduce the space usage of Logan’s unitigs by using a compressed color-to-k‑mer mapping.