Second discussion

This part started with a simple question: can the representation of a k‑mer set itself be a source of performance gains? The answer that emerged is yes, and a fairly broad one. Choosing a good unit of representation (a k‑mer, a super‑k‑mer, a hyper‑k‑mer, or a necklace) shapes nearly everything downstream, from how compactly the set can be stored to how well streaming queries behave on a real CPU, whether dynamic updates are possible at all, and how easily two indexes can be combined. This is in some ways a continuation of the first part, where I argued that algorithm and implementation should be designed together. Here the same idea moves up one level. The data structure that holds the k‑mer set is part of the algorithm, and choosing it well makes the rest of the pipeline simpler.

CBL, presented in Chapter 11, came out of this perspective at the k‑mer level. By representing each k‑mer through its necklace, consecutive k‑mers naturally share long prefixes, and a quotiented layout turns this redundancy into both compression and fast streaming access. This layout also made it easy to support exact set operations in Chapter 12, something that very few k‑mer indexes offer at all. Brisk, in Chapter 13, changes the indexing unit to whole super‑k‑mers, with the interleaved transform making binary search inside a bucket viable. KFC, in Chapter 14, goes further by sharing the k - 1 overlap between consecutive super‑k‑mers, reaching 4 bits / k‑mer while keeping streaming construction. Looking at these three together, my impression is that current k‑mer indexes are already well established when it comes to space usage and single-query speed, as the comparison in Chapter 10 makes clear, and that the next frontier lies in widening the range of operations they support natively.

Dynamic updates and set operations are the two that I find the most compelling. Pangenomes and large public databases keep growing, and most workflows would benefit from being able to incrementally add new genomes to an existing index, or to combine indexes through union, intersection and difference without rebuilding from scratch. These operations also act as compact filters. Intersection isolates the shared core of a collection, while difference removes contaminants or extracts what is specific to a sample. CBL takes a small step in this direction, but I am convinced there is more to do, especially on the super‑k‑mer side where deduplication during streaming insertion is still an open problem, as discussed at the end of Chapter 13. A clean solution there would let super‑k‑mer dictionaries keep their streaming insertion without losing the inherent deduplication that k‑mer-level structures get for free.

A direction I find promising is a closer convergence with full-text indexes. I chose not to cover them in this thesis, but they face closely related concerns and have made significant practical progress recently, with structures such as Movi (Zakeri et al., 2024, 2026) pushing the performance of matching queries. The SBWT stands out in that landscape by bridging full-text approaches and k‑mer methods, and recent work by Alanko et al. (2026) even implements set operations directly on top of it. There seems to be a lot to learn between the two communities, and I expect more exchanges in the coming years.

Closer to the application side, I see a real opportunity in importing ideas from database query engines, as I outlined at the end of Chapter 12. Sort-merge joins, hash joins, partition pruning, exponential search on skewed inputs, SIMD-accelerated intersection of sorted lists, all of these have direct counterparts in k‑mer set operations and have been refined for decades in the database literature. Vizitig (Degardins et al., 2025) is one concrete example of what becomes possible once a k‑mer index is treated as a queryable database rather than a fixed pipeline component, and I expect this line of work to grow.

Multi-k approaches also strike me as an underexplored direction in the current literature. Even though varying k along the analysis is already standard practice in assembly, where short k‑mers help resolve low-coverage regions and longer ones disambiguate repeats, k‑mer set structures are almost always built for a single fixed value of k. I think there is room for indexes that natively support several values of k, or that can answer queries at a finer granularity than the indexed one without rebuilding. The SBWT already supports querying any m‑mer with m \le k, which is a nice starting point.

Finally, the range of relevant k‑mer sizes is itself evolving. Long-read technologies now reach error rates low enough that k‑mers in the hundreds become tractable, as already exploited by KFC, and pangenomic comparisons benefit from this range. This shift makes the asymptotic behavior of representations matter more, and explains why moving from super‑k‑mers to hyper‑k‑mers pays off so visibly.