Locality-preserving representations of k‑mer sets

Keep your friends close, and your k‑mers closer.

Part I established that careful implementation can push sequence processing close to the hardware limit. We now ask a different question: can the representation of a k‑mer set itself be a source of performance gains? The central observation is that consecutive k‑mers extracted from a genomic sequence are not independent: each one overlaps its successor in k − 1 bases. A representation that ignores this locality pays for it in both memory and query time. Chapter 10 surveys existing approaches and establishes the metrics used throughout. Chapters 11 and 12 introduce CBL, a dynamic structure built on minimizer-partitioned sorted buckets that supports efficient set operations. Chapter 13 presents Brisk, which stores k‑mers implicitly inside super‑k‑mers to lower the per-k‑mer footprint. Chapter 14 pushes this idea further with hyper‑k‑mers, which asymptotically reduce the space overhead to 4 bits per k‑mer.
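To make the locality argument concrete, here is a minimal Python sketch. It is an illustration only, not the implementation of CBL or Brisk. It assumes a simplified minimizer (the lexicographically smallest m-mer of a k‑mer, with no canonical form and no hashing), and the function names kmers, minimizer and super_kmers are hypothetical. The sketch groups maximal runs of consecutive k‑mers that share a minimizer into super‑k‑mers.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """Simplified minimizer: lexicographically smallest m-mer (assumption)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(seq, k, m):
    """Maximal runs of consecutive k-mers sharing the same minimizer.

    Returns (minimizer, substring) pairs; each substring spells out
    all the k-mers of its run.
    """
    runs = []
    n = len(seq) - k + 1          # number of k-mers in seq
    start, prev = 0, None
    for i in range(n):
        cur = minimizer(seq[i:i + k], m)
        if prev is not None and cur != prev:
            # The run covers k-mers start..i-1, i.e. bases start..i-1+k-1.
            runs.append((prev, seq[start:i - 1 + k]))
            start = i
        prev = cur
    if prev is not None:
        runs.append((prev, seq[start:n - 1 + k]))
    return runs


if __name__ == "__main__":
    seq, k, m = "ACGTACGTTGCA", 5, 3
    for mz, s in super_kmers(seq, k, m):
        print(mz, s, len(kmers(s, k)))
```

A run of n consecutive k‑mers occupies n + k − 1 bases when stored as one super‑k‑mer, instead of n · k bases when each k‑mer is stored separately. This is the kind of saving that a locality-aware representation can exploit.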