10  Background on k‑mer sets

Spectrum-preserving string sets, introduced in Chapter 2, offer one compact way to represent a k‑mer spectrum. By itself, however, such a representation gives no direct way to test whether an arbitrary k‑mer belongs to the spectrum or not. Building an efficient index is therefore a necessary step in any k‑mer-based pipeline, from read mapping and taxonomic classification to genome assembly. Depending on the application, this index may be built from an SPSS (repetitive or not), directly from raw reads, or from a reference genome, all of which may contain repeated k‑mers.

At the most basic level, a k‑mer set supports a single operation: membership, testing whether a given k‑mer is present or not. Many applications require more than this and associate each k‑mer with satellite information such as an abundance count or a color. A k‑mer dictionary extends a k‑mer set with a lookup operation: given a k‑mer, it returns a unique integer identifier that serves as a key into any external array of satellite values, or -1 if the k‑mer is absent. Both sets and dictionaries are commonly used in a streaming regime, where queries are issued for every k‑mer extracted consecutively from a sequence. This modality is prevalent enough that several structures are explicitly designed for it, favoring cache-efficient access over runs of consecutive k‑mers rather than minimizing the cost of individual queries.

This chapter focuses on three approaches representative of the state of the art for exact k‑mer set indexing, and a broader survey of the landscape can be found in (Marchet, 2024). SSHash (Pibiri, 2022; Pibiri & Patro, 2026) uses compact hashing over the super‑k‑mer structure of the input to achieve very fast queries at moderate space. The SBWT (Alanko, Puglisi, et al., 2023) exploits the colexicographic ordering of k‑mers in the de Bruijn graph to compress the index close to the navigational lower bound, at the cost of slower \mathcal{O}(k)-time queries. FMSI (Sladký et al., 2023, 2024, 2025) indexes a masked superstring with an FM-index, offering a different point on the space-time tradeoff.

10.1 Path locality and hashing

CautionDifferent notations

We reuse the notations from Section 4.3: the minimizer size is denoted m while k denotes the number of bases in a window, i.e. k = w + m - 1.

A first family of k‑mer indexes exploits the path locality of consecutive k‑mers in the input. As established in Section 4.3, k‑mers at successive positions tend to share the same minimizer and can therefore be grouped into super‑k‑mers. In the context of indexing, this grouping has two important consequences. First, there are far fewer super‑k‑mers than k‑mers: each super‑k‑mer spans on average (w+1)/2 k‑mers, the inverse of the minimizer density, reducing the number of objects to be indexed by the same factor. Second, consecutive k‑mers within a super‑k‑mer are stored contiguously, so a streaming query traversing a sequence left to right will access the same super‑k‑mer repeatedly before moving to the next, yielding favorable cache behavior. More broadly, the super‑k‑mer structure induces a partial order on k‑mers that an index can preserve: k‑mers within the same super‑k‑mer can be assigned consecutive identifiers, enabling fast satellite data access for streaming queries, a property formalized and exploited by LPHash (Pibiri et al., 2023).

10.1.1 Sparse and skew hashing: SSHash

A minimal perfect hash function (MPHF) for a set \mathcal{X} of n keys is a bijection f: \mathcal{X} \to [0, n) mapping each key to a distinct integer with no gaps (Lehmann et al., 2026). MPHFs can be constructed using close to \log_2 e \approx 1.44 bits per key in expectation (Limasset et al., 2017; Pibiri & Trani, 2021; Groot Koerkamp, 2025), far below the cost of storing the keys themselves. Since f assigns an arbitrary value in [0, n) to any input outside \mathcal{X}, a structure relying on an MPHF must store the keys or compact fingerprints separately to verify membership.

SSHash (Pibiri, 2022; Pibiri & Patro, 2026) takes an SPSS as input, identifies super‑k‑mer boundaries on the fly to record their offsets within the strings, and builds an MPHF over the set of minimizers to index those offsets, while retaining the SPSS strings as the underlying storage. Rather than hashing k‑mers directly, the MPHF operates on the much smaller set of minimizers. This is the sparse component of the design: since each super‑k‑mer spans on average (w+1)/2 k‑mers, the MPHF is built over a key set far smaller than the k‑mer set. The input strings are stored at 2 bits per base, alongside a prefix-sum array encoding partition sizes and an offsets array pointing to super‑k‑mer positions within the strings. Importantly, the SPSS strings remain stored as a single contiguous block: minimizer partitions are virtual, materialized through the auxiliary arrays rather than by physically grouping super‑k‑mers in memory. Figure 10.1 illustrates the resulting data structure layout.

Figure 10.1: Overview of the SSHash dictionary data structure, from Pibiri (2022). The input strings are stored contiguously; minimizers are mapped to partition identifiers via an MPHF; the Sizes and Offsets arrays locate each super‑k‑mer within the strings.

The skew component addresses the non-uniformity of partition sizes: even with a random hash function, most minimizers appear only once while a small fraction of minimizers appears in a disproportionately large number of super‑k‑mers, yielding highly skewed minimizer partition sizes. In particular, for k = 31 and m = 20, over 97% of minimizers have a single occurrence (Pibiri, 2022, table 1), while the most frequent minimizer appears more than 36000 times in the SPSS of a human genome. This observation was later confirmed by combinatorial analyses (Ingels et al., 2024, 2026).

SSHash encodes the prefix-sum of partition sizes using Elias-Fano (Fano, 1971; Elias, 1974), which compresses increasing integer sequences close to their information-theoretic minimum. A secondary skew index groups frequent minimizers by partition size and builds a separate MPHF per group, limiting the number of super‑k‑mers to inspect at query time.

To answer a lookup query for k‑mer x, SSHash computes the minimizer of x, retrieves the corresponding partition via the MPHF, and scans its super‑k‑mers until x is found, returning its identifier or -1 if it is absent. Streaming queries benefit from caching the last minimizer: consecutive k‑mers sharing the same minimizer extend the previous match without recomputing the partition, reducing the average number of memory accesses per query. In practice, SSHash achieves around 7–10 bits / k‑mer and answers individual queries in hundreds of nanoseconds (Pibiri, 2022, table 3).

10.1.2 Improvements on SSHash

A subsequent refinement (Pibiri & Patro, 2026) improves query efficiency on two fronts. First, rather than storing the start of each super‑k‑mer and scanning it, the refined index stores a single position per minimizer occurrence; combined with the offset of the minimizer within the query k‑mer x, this is sufficient to recover the exact location of x in the SPSS in constant time. Second, a tag array classifies each minimizer as singleton, light, or heavy based on how many super‑k‑mers share it; for singleton minimizers, the unique position is encoded directly in the tag itself, so that a lookup requires only two cache misses: one to evaluate the MPHF and one to read the tag. Since over 97% of minimizers are singletons for m \ge 20, this shortcut applies to the vast majority of queries. Figure 10.2 illustrates the two layouts side by side.

Figure 10.2: Comparison of the original and refined SSHash data structures, from Pibiri & Patro (2026). The tag array classifies minimizers as singleton, light, or heavy; singleton minimizers have their position encoded directly in the tag, avoiding any further memory access.

LPHash (Pibiri et al., 2023) takes a slightly different approach: rather than building a complete dictionary with the SPSS strings as underlying storage, it constructs a locality-preserving MPHF alone, without retaining the strings. This omission means that a k‑mer absent from the indexed set will also receive some identifier in [0, n) rather than -1, so membership cannot be verified without external storage. In exchange, the space footprint is substantially smaller, achieving 0.6–0.9 bits / k‑mer, and lookup queries are an order of magnitude faster than in SSHash.

10.2 Colexicographic clustering and spectral BWT

10.2.1 Colexicographic order and Wheeler graphs

A second family of k‑mer indexes stems from sorting k‑mers in colexicographic order, comparing them from right to left. Under this ordering, k‑mers sharing the same length-(k-1) suffix are adjacent, since they only differ in their leftmost character. Recall that in a node-centric de Bruijn graph, an edge from k‑mer x to k‑mer y exists iff the last k-1 characters of x match the first k-1 characters of y; we label such an edge by the last character of y, i.e. the character appended when extending x into y. The colexicographic ordering of k‑mers makes the de Bruijn graph a Wheeler graph (Gagie et al., 2017): nodes can be totally ordered such that edges with the same label preserve this ordering. A key consequence is that for any character c, c-labeled edges map a contiguous range of sources to a contiguous range of destinations, and vice versa. This property enables backward search: membership of a query k‑mer x can be tested by starting from the interval of all k‑mers ending in the last character of x, then iteratively narrowing it by prepending characters of x from right to left, each step intersecting with a contiguous range. Crucially, the colexicographic rank of a k‑mer implicitly encodes its identity, so the structure does not require explicit storage of the 2k-bit label of each k‑mer, only small per-node annotations. This is the root of the space efficiency of BWT-based approaches. The BOSS structure (Bowe et al., 2012) realizes these ideas for the edge-centric de Bruijn graph; the SBWT (Alanko, Puglisi, et al., 2023), described below, extends them to the node-centric view.

10.2.2 The spectral BWT and subset rank

The spectral Burrows-Wheeler transform, or SBWT (Alanko, Puglisi, et al., 2023), builds directly on the colexicographic structure. Given a k‑mer set S, the k‑mers are sorted colexicographically and grouped by their length-(k-1) suffix. For each group, only the first k‑mer in colexicographic order carries a non-empty set of characters: those that label outgoing edges from that group in the node-centric de Bruijn graph. Source k‑mers (those with no predecessor in S) are augmented with dollar-padded prefix k‑mers to give them a virtual predecessor from a dummy root. When a k‑mer has multiple predecessors, only the colexicographically smallest is retained, so that every k‑mer has exactly one incoming edge in the resulting structure. The resulting sequence of character subsets, one entry per k‑mer, is the SBWT. Figure 10.3 illustrates the construction on a small example.

Figure 10.3: Example of SBWT construction on a small k‑mer set, courtesy of Camille Marchet. The k‑mers are sorted colexicographically, as listed on the right. Only the first k‑mer in each length-(k-1) suffix group (highlighted in red) carries the outgoing edge subset, while the other edges (dashed in blue) are pruned. After construction, the SBWT only retains the subsets of outgoing edge labels, listed on the right. Note that this figure omits the dollar-padded prefix k‑mers for clarity.

Membership queries rely on a single primitive: subset rank. Given a character c and a position i, \text{SubsetRank}_c(i) counts how many of the first i subsets contain c. To query k‑mer x = x_1 \dots x_k, backward search starts from the full interval [1, n] and iterates over the characters of x from left to right, narrowing the interval at each step using: \ell' = 1 + C[c] + \text{SubsetRank}_c(\ell - 1) + 1, \quad r' = 1 + C[c] + \text{SubsetRank}_c(r) where C[c] counts k‑mers whose last character is smaller than c. After k steps, a non-empty interval indicates that x is present, and the resulting \ell is the colexicographic rank of x, serving directly as its integer identifier. The total cost is \mathcal{O}(k) subset rank operations. Note that, using the same process, the SBWT can actually query any m‑mer for m \le k.

A further benefit is that de Bruijn graph navigation comes for free: successors and predecessors of any k‑mer can be read off from the index with a single step of the same algorithm, without auxiliary structures.

Several data structures can be used to support subset rank queries, offering different time-space tradeoffs. The most commonly used in practice is the bit-matrix SBWT, which stores one bitvector of length n per character c \in \Sigma, where bit i is set if c \in X_i. \text{SubsetRank}_c(i) then reduces to a standard rank query on the c-th bitvector, and occupies around 4–5 bits / k‑mer1 (Alanko, Puglisi, et al., 2023). More space-efficient variants exist, the most compact being the subset wavelet tree SBWT, which compresses the subset sequence to its zeroth-order entropy, achieving 2.6–2.8 bits / k‑mer in practice at the cost of slower queries.

The SBWT can be further augmented with a longest common suffix array to accelerate streaming queries (Alanko, Biagi, et al., 2023), by allowing consecutive k‑mer lookups to reuse intervals computed at the previous position. This same array also enables the SBWT to compute matching statistics between a query sequence and the indexed k‑mer set (Mäklin et al., 2025), extending its use to pseudo-alignment and sequence similarity tasks.

10.3 Masked superstrings

Note

As this approach was published recently, it was not included in our comparisons in the following chapters.

Masked superstrings (Sladký et al., 2023) generalize spectrum-preserving string sets by allowing the representation to include k‑mers absent from the input, saving space at the cost of introducing spurious k‑mers. A bitmask is added to resolve the resulting ambiguity: a masked superstring is a pair (S, M) where S is a single string containing all k‑mers of the indexed set K as substrings, and M is a binary mask indicating which k‑mer occurrences in S are active, i.e. belong to K. Conceptually, this replaces the de Bruijn graph with an overlap graph admitting edges of any length, providing better compression power when the k‑mer set does not follow the spectrum-like property, as in subsampled or high-diversity metagenomic data.

Several mask variants serve different purposes. A max-one mask, maximizing active k‑mers, optimizes for membership queries. A min-one mask, activating each k‑mer exactly once, enables dictionary queries by giving each k‑mer a unique identifier. A min-runs mask minimizes the number of runs of 1’s, making the mask itself more compressible.

While computing the shortest masked superstring is NP-hard, approximately shortest masked superstrings can be computed from an input k‑mer set or a precomputed SPSS using the greedy tool KmerCamel (Sladký et al., 2023).

FMSI (Sladký et al., 2025) indexes the masked superstring via the Masked Burrows-Wheeler Transform (MBWT), an extension of the classical BWT that incorporates the mask. The MBWT is computed by pairing each superstring symbol with its mask bit, sorting pairs lexicographically on the superstring symbol alone, then unzipping the result into a transformed string S' and transformed mask M'. Membership of a query k‑mer x is answered by backward search in S' to find a position range [i, j], then checking M' for any active k‑mer in that range; the total cost is \mathcal{O}(k) rank operations. In practice, FMSI achieves around 3–4 bits / k‑mer (Sladký et al., 2025), at the cost of slower queries compared to SSHash or SBWT. Unlike SPSS-based approaches, it does not rely on the spectrum-like property, so its space cost is less sensitive to subsampled or high-diversity k‑mer sets.

FMSI can be extended to support k‑mer set operations and dynamic updates via f-masked superstrings (Sladký et al., 2024). The binary mask is replaced by an integer vector so that k‑mer occurrences from different input sets can be distinguished, and a demasking function f specifies how to combine the integer values seen at all occurrences of a k‑mer to decide its membership in the final set. Union, intersection, difference, and symmetric difference of two indexed k‑mer sets are then implemented by concatenating their masked superstrings, reapplying the appropriate demasking function to compute the resulting mask, and optionally compacting the structure to reduce its size. Individual k‑mer insertions and deletions follow as special cases of these set operations with singleton sets, providing dynamic updates without rebuilding the index.

10.4 Comparison

The three approaches sit on a clear space-time tradeoff curve. SSHash is the fastest by a wide margin (74 ns/k‑mer for streaming), at a moderate space cost of 10.01 bits / k‑mer; it benefits from the SPSS strings being kept as contiguous raw storage, enabling cache-friendly lookups but inflating the bit budget. The SBWT natively uses 4–5 bits / k‑mer, doubled to 10.50 bits / k‑mer in this benchmark setting where both strands of each k‑mer are indexed for canonical queries; with 266 ns/k‑mer for streaming, it occupies a middle ground in time and offers de Bruijn graph navigation and matching statistics out of the box. FMSI is by far the most compact at 3.31 bits / k‑mer, but its query time (1176 ns/k‑mer) is significantly slower than SSHash, since each query requires backward search over the FM-index without the cache benefits of super‑k‑mer partitioning.

The space gap between FMSI and the others widens further for k‑mer sets that do not follow the spectrum-like property (subsampled sketches, high-diversity metagenomes), where SPSS-based representations become fragmented. Conversely, SSHash’s advantage is most pronounced for streaming queries on long sequences, where super‑k‑mer caching amortizes lookups across consecutive k‑mers. Among the three, only FMSI extends natively to dynamic updates and set operations through f-masked superstrings; SSHash and SBWT remain static and rely on external mechanisms for these tasks, the subject of dedicated chapters later in this thesis.

Table 10.1 summarizes these properties, using values from the benchmark of Pibiri & Patro (2026) on the human genome at k = 31, publicly available at https://github.com/jermp/kmer_sets_benchmark.

Table 10.1: Summary of the three k‑mer indexing approaches discussed in this chapter, with space and streaming lookup values for the human genome at k = 31 from the benchmark of Pibiri & Patro (2026).
Approach Input Static / dynamic Space
(bits/k‑mer)
Streaming lookup
(ns/k‑mer)
SSHash (Pibiri & Patro, 2026) SPSS static 10.01 74
SBWT (Alanko, Puglisi, et al., 2023) SPSS static 10.502 266
FMSI (Sladký et al., 2025) Masked superstring static or dynamic3 3.31 1176

  1. The bit-matrix itself uses exactly 4 bits / k‑mer; the overhead above this comes from rank support and the dollar-padded prefix k‑mers added for source nodes.↩︎

  2. The 4–5 bits / k‑mer reported earlier in the chapter doubles here because both strands of each k‑mer are indexed to support canonical queries.↩︎

  3. Dynamic updates via f-masked superstrings.↩︎