11 Necklaces and minimizers

Note

This chapter is partially adapted from (Martayan et al., 2024), accepted to ISMB 2024 and published in Bioinformatics.

While we have seen in Chapter 4 that minimizers can be used as a way to partition k‑mers while preserving locality, we present in this chapter a new approach based on smallest cyclic rotations, or necklaces. We will discuss some of their properties, highlight their connection to lexicographic minimizers and explain how they can represent k‑mer sets efficiently. In particular, one property that will prove particularly useful is that necklaces of consecutive k‑mers often share long prefixes.

11.1 Necklaces and their properties

Definition 11.1 (Necklace) We define the necklace \langle x \rangle of a string x as its smallest cyclic rotation: \langle x \rangle = \min_i \operatorname{rol}^i(x)

It’s pretty clear that the necklace transformation is not injective, for instance CAT, ATC and TCA all have ATC as a necklace. Intuitively, each necklace of size k acts as a representative of its k cyclic rotations, as depicted in Figure 11.1. The following property confirms this intuition and shows that necklaces of size k roughly account for a fraction \frac{1}{k} of all k‑mers:

Proposition 11.1 (Number of necklaces (Lothaire, 1997)) Given an alphabet of size \sigma, the number of necklaces of size k is given by N(k) = \frac{1}{k} \sum_{d | k} \varphi\left(\frac{k}{d}\right) \sigma^d \sim \frac{\sigma^k}{k} where \varphi denotes Euler’s totient function.

Figure 11.1: Example of all cyclic rotations of `GATAC`, with the necklace `ACGAT` highlighted in color.

11.1.1 Ranking necklaces

Because they only represent a fraction \frac{1}{k} of the universe, it should be possible to have an encoding of necklaces that saves about \lg k bits. One way to obtain such encoding is to rank necklaces: given a necklace \langle x \rangle, find i such that \langle x \rangle is the i-th smallest necklace of size k. Figure 11.2 illustrates the effect of ranking on necklaces of size 4.

Figure 11.2: Repartition of necklaces in \Sigma^4, applying rank maps all the necklaces to [0, N(k)).

Sawada & Williams (2017) proposed an algorithm to compute the rank of a necklace of size k in \mathcal{O}(k^2), which is still the best known as of today. Unfortunately, this running time is quite prohibitive for practical applications on large collections of necklaces where we typically aim for near-constant queries. Finding a faster way to compute, even approximately¹, the rank of a necklace is still an open question.

11.1.2 Bijective encoding of k‑mers

We’ve seen that the necklace transformation is lossy in the sense that we cannot tell which of its cyclic rotations it represents. Therefore, in addition to the necklace, we want to keep track of the rotation offset in order to make the transformation reversible.

Definition 11.2 (Rotation offset) We define the rotation offset \rho(x) of a k‑mer x as the minimum number of left rotations required to obtain its necklace: \rho(x) = \arg\min_i \operatorname{rol}^i(x)\\ \operatorname{rol}^{\rho(x)}(x) = \langle x \rangle

Since there are k such rotation offsets, it requires \lceil \lg k \rceil bits to encode \rho(x). Thus, when combined with the rank encoding of the necklace, this yields a fully reversible transformation whose total size matches the original k‑mer up to an extra rounding bit.

11.1.3 Encoding canonical k‑mers

The encoding discussed above does not take the canonical aspect into account, but we can adapt the technique of Wittler (2023) to save one additional bit when k is odd. The idea is that, when using the encoding presented in Section 6.2, the complement operation flips exactly one bit for each base (e.g. A = 00 \leftrightarrow T = 10). Therefore, when k is odd, the parity of the number of 1s (efficiently computed with a popcount) changes between a k‑mer and its reverse-complement. We can thus define a k‑mer as canonical if its number of 1s is odd, and safely erase its last bit: if the remaining 2k - 1 bits have an odd number of 1s we know that the last bit must be 0, otherwise it must be 1. From this (2k - 1)-bit encoding, we then apply the reversible necklace transformation, this time using a binary alphabet. An example of this canonical encoding is shown in Figure 11.3.

Figure 11.3: Example of canonical necklace encoding for `ATA` and `TTG`. `TTG` has an even number of 1s so it is first transformed into its reverse-complement `CAA` before applying the necklace transformation. Both k‑mers end up having the same binary necklace, but the rotation offsets distinguish them.

11.2 Necklaces of consecutive k‑mers

We’ve seen how to encode individual k‑mers as necklaces, but in most applications we actually want to index consecutive k‑mers extracted from a sequence. We now discuss the benefits of using necklaces in this context, and how this can be used to design a new data structure.

11.2.1 Runs of common prefixes

A key property of necklaces is that, when considering two consecutive k‑mers x and y, their necklaces tend to share a long prefix. The intuition behind this is quite simple: if x_1 = y_k their necklaces are completely identical, and if x_1 \ne y_k they are likely to differ by a single letter at position k - \rho(x), with \rho(y) = \rho(x) - 1. The only situation where the necklaces diverge is when the incoming letter y_k creates a rotation that is lexicographically smaller than the current necklace. This is more likely to happen when \rho(x) is small, since the substitution then falls closer to the start of the necklace. This phenomenon is illustrated in Figure 11.4, where we can see runs of increasingly long prefix matches of length k - \rho(x) as we advance. We can also notice that roughly 1/4 of the matches have length k, corresponding to the event where x_1 = y_k.

Figure 11.4: Length of the common prefix between necklaces of consecutive k‑mers with respect to their position in the sequence, simulated on a random sequence with *k = 31*.

11.2.2 Computing necklaces with minimizers

While computing a necklace individually requires \mathcal{O}(k) operations since we have to consider every rotation, we can accelerate this process down to \mathcal{O}(\log k) operations in the context of consecutive k‑mers. The main property that we rely on is that the prefix of size m of a given necklace is always the smallest m‑mer in the circular word. In particular, this m‑mer is either fully included in the original k‑mer or overlaps the k‑mer’s start and end, in that case we refer to it as a boundary substring. Boundary substrings change for each new k‑mer, so we have to recompute them every time. Non-boundary substrings, however, are preserved for consecutive k‑mers. Because of this, we can use classical algorithms for minimizers (such as those discussed in Chapter 8) to compute the smallest non-boundary m‑mer in \mathcal{O}(1) amortized time. Moreover, we know that if m is large enough (\Omega(\log k)) the smallest m‑mer is unique with high probability (Zheng et al., 2020, lemma 9). Therefore, by choosing m = \Theta(\log k), we only have one non-boundary m‑mer to consider w.h.p. and m-1 boundary m‑mers to compute, leading to \mathcal{O}(\log k) time overall. In practice, we found that using substrings of m = 9 bits gave the best results for k ranging from 31 to 63. Our implementation of this algorithm can compute 100M necklaces per second on a laptop with a M1 processor.

11.2.3 Super-necklaces

The connection between necklaces and lexicographic minimizers does not end here: just like minimizers can group consecutive k‑mers into super‑k‑mers, consecutive necklaces can be grouped into super-necklaces. The idea is simple: as long as \rho(y) = \rho(x) - 1, i.e. as long as we are in the same run, only one letter is replaced at a time and we know its position. Therefore, assuming a run of r k‑mers x^{(1)}, \dots, x^{(r)}, we can compactly encode all their necklaces \left\langle x^{(1)} \right\rangle, \dots, \left\langle x^{(r)} \right\rangle as a tuple of strings containing r + k - 1 bases: \left( \left\langle x^{(1)} \right\rangle, \left\langle x^{(r)} \right\rangle[k-r+2, k] \right) Using this representation, we can recover all individual necklaces by combining a prefix of the first string with a suffix of the second: \left\langle x^{(i)} \right\rangle = \operatorname{pref}_{k-i+1}\!\left(\left\langle x^{(1)} \right\rangle\right) \cdot \operatorname{suff}_{i-1}\!\left(\left\langle x^{(r)} \right\rangle\right)

11.3 CBL: a dynamic data structure for k‑mer sets

Building on the necklace representation described above, we now present CBL (Conway-Bromage-Lyndon), a compressed and dynamic data structure for exact k‑mer set representation, available at https://github.com/imartayan/CBL. As highlighted in the previous section, necklaces of consecutive k‑mers tend to share long common prefixes, making them amenable to prefix-based compression. Moreover, since k‑mer sets extracted from biological sequences are sparse in [0, 2^{2k-1}-1], necklaces concentrate towards small values and many share common prefixes. We exploit this skewness with a quotienting strategy, storing prefixes and suffixes separately. Figure 11.5 gives a schematic overview of the resulting structure.

11.3.1 Quotienting sets of necklaces

To achieve compression, we select a prefix size p and store prefixes and suffixes of necklaces separately, the suffix being 2k-1-p bits long. The rationale is that many necklaces share the same prefix, particularly since their distribution is skewed towards smaller values. Quotienting thus concentrates the bulk of the set into a small number of prefix buckets, each holding a short list of suffixes.

11.3.2 Prefix data structure

For the storage of prefixes, we use a bitvector of size 2^p. To facilitate rapid access and insertion by rank, a structure supporting efficient dynamic rank/select operations is required. Several such structures have been discussed in the literature (Vigna, 2008; Gog & Petri, 2013; Zhou et al., 2013), but dynamic variants are comparatively rare. We adopt an implementation based on (Marchini & Vigna, 2020; Pibiri & Kanda, 2021; Donges et al., 2022) that supports fast dynamic rank/select.

The key ingredient is a Fenwick tree, an efficient data structure for prefix sums of a dynamic array. Here the array elements are bits, so the binary rank at any position equals the prefix sum up to that position. This arrangement allows rank to be computed in \mathcal{O}(\log 2^p) = \mathcal{O}(p) time by traversing the tree, and updates are equally efficient.

Figure 11.5: Schematic view of CBL’s data structure. For clarity, the example uses lexicographic rather than binary necklaces. The full necklace vector is shown on the left, with two overlapping k‑mers sharing closely related necklaces. Prefixes are stored in a bitvector and associated to suffix buckets via their rank.

11.3.3 Associating prefixes to suffixes

For associating each prefix with its corresponding list of suffixes we use a tiered vector (Bille et al., 2017). This structure maps each prefix rank to a bucket of suffixes by storing a pointer at the index corresponding to that rank. Insertions and deletions are supported at any position in \mathcal{O}(n^{\varepsilon}) time (where n is the total number of elements and \varepsilon < 1), while access remains \mathcal{O}(1).

Tiered vectors maintain a dynamic sorted array in the leaves of a shallow tree (internal depth 4 in our implementation). Fast insertion is achieved by filling available slots in the leaves; since this may disturb leaf order, internal nodes store the rotations needed to restore the correct ordering.

11.3.4 Suffix storage

As noted in previous studies, the distribution of small-sized genomic words is highly skewed. We therefore expect most buckets to hold only one or a few suffixes, with a minority containing many. To handle both cases efficiently, we use an adaptive strategy. Small buckets store suffixes in a contiguous vector that is scanned linearly. Large buckets (more than 1024 elements by default) switch to a trie, exploiting the high suffix similarity typically found there, e.g. from repetitive or low-complexity regions. The trie uses 1 byte per level; for vector-based storage, suffixes are likewise divided into bytes to minimize footprint. The rotation offset (see Definition 11.2) is encoded in the trailing bits of each suffix to make the transformation reversible.

11.3.5 Operations

The main operations (membership, insertion, deletion) follow the same pipeline:

compute the necklace and rotation offset of the k‑mer (as described above), and append the rotation offset at the end of the necklace
split the result into a prefix q of p bits and a suffix r
look up the prefix in the bitvector
compute the rank of q using the Fenwick tree
retrieve the bucket associated to that rank via the tiered vector
perform membership/insertion/deletion of r in the bucket:
- for small buckets, scan the vector linearly
- for large buckets, navigate the trie byte by byte

When the necklaces of consecutive k‑mers share the same prefix q, steps 1–5 only need to be performed once and the remaining operations are batched on the same bucket. This optimization is particularly effective when streaming all k‑mers of a sequence.

11.3.6 Informal comparison to Elias-Fano

CBL’s quotienting layout is analogous to Elias-Fano encoding for sorted integers: both split elements into high (prefix) and low (suffix) parts stored separately. The key difference is in how suffixes are handled. Elias-Fano omits an explicit prefix-to-suffix mapping and stores all suffixes contiguously, achieving higher compression but making insertion expensive by shifting many elements. Dynamic adaptations of Elias-Fano have been proposed by Pibiri & Venturini (2017) but have no practical implementation yet. CBL trades some compression for full dynamicity: every layer of the structure supports insertions and deletions efficiently.

11.3.7 Implementation details

The necklace and rotation offset are packed into a single primitive integer of up to 128 bits. The combined size 2k-1 + \lceil \log_2(2k-1) \rceil must therefore stay below 128 bits, limiting support to k \le 59.

The prefix size p is a compile-time parameter, supported up to 32 bits. We use p = 24 bits as a default, balancing the size of the prefix bitvector (2^p bits) against the load on suffix buckets. Increasing p for very large sets reduces bucket sizes and speeds up bucket operations.

11.4 Comparison against other indexes

We conducted extensive benchmarks on the CBL library, focusing on its ability to represent and manipulate k‑mers in various biological datasets. All experiments were performed on a single cluster node with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz, 128 GB of RAM, and Ubuntu 22.04. Experiment scripts, competing tools, and plot generation code are available at https://github.com/imartayan/CBL_experiments.

We compare CBL against two state-of-the-art static structures: SSHash (Pibiri, 2022), a minimal-perfect-hash-based k‑mer index, and SBWT (Alanko et al., 2023), a succinct full-text k‑mer index. Among dynamic structures we include Bifrost (Holley & Melsted, 2020), a colored de Bruijn graph supporting insertions, and BufBOSS (Alanko et al., 2021), a BWT-based de Bruijn graph with insertion/deletion buffers. We also include Rust’s HashSet (based on Google’s Swiss Tables) as a generic dynamic baseline.

11.4.1 Index construction

For our initial experiment, we built indexes on an expanding collection of bacterial genomes. We randomly selected subsets of genomes from which we computed a compacted de Bruijn graph (unitig set), doubling the subset size each time. The largest file comprised 1024 genomes and 1,574,701,184 distinct k‑mers. Time and memory usage are reported in Figure 11.6.

SSHash is the best overall performer in terms of construction speed and memory, but it is static and requires duplicate-free input. HashSet is similarly fast but uses more memory. CBL is slightly slower than both while matching SSHash’s memory footprint. SBWT is slower than these three despite being static, with memory comparable to HashSet. Bifrost is significantly slower but equally memory-efficient to SSHash and CBL. BufBOSS is the slowest and most memory-intensive tool.

The fact that CBL matches the throughput of a high-performance hash table while being equally memory-efficient is particularly noteworthy. Importantly, Bifrost, HashSet, and CBL are the only tools that accept any FASTA input directly. The other tools require a k‑mer counting preprocessing step (e.g. KMC3 for SBWT and BufBOSS, unitig assembly for SSHash), which externalizes part of their construction cost and may impose heavy disk usage.

11.4.2 Queries

To evaluate query performance we benchmark positive queries (k‑mers present in the index) and negative queries separately. Positive queries use the input unitigs’ k‑mers; negative queries use randomly generated k‑mers with a negligible probability of being present. Results are shown in Figure 11.7, reporting query time and RAM usage per k‑mer, averaged over a series of experiments.

As expected, static structures and BufBOSS use very little RAM during queries, while other dynamic structures are more memory-intensive. SSHash and SBWT are the fastest for queries; BufBOSS is the slowest. Among dynamic structures, CBL and Bifrost have the smallest memory footprint, while HashSet is the most demanding.

The relative ranking changes between positive and negative queries: Bifrost is fastest for positive queries (with CBL and HashSet close behind) but slowest for negative ones. HashSet excels at negative queries, slightly outperforming even SSHash. CBL consistently maintains high throughput and low memory usage across both query types.

11.4.3 Insertion and deletion

To assess update performance, we measure the cost of inserting and deleting k‑mers; results are shown in Figure 11.8. Among the four tools supporting insertion, CBL and HashSet are the fastest, processing each k‑mer in under a microsecond. Bifrost is approximately an order of magnitude slower, and BufBOSS another order of magnitude behind that. Memory usage is comparable across tools; BufBOSS is the most efficient, followed by Bifrost, HashSet, and CBL.

For deletion (not supported by Bifrost), the remaining tools show a similar pattern: CBL and HashSet lead in speed, BufBOSS lags by an order of magnitude, and memory usage is lowest for BufBOSS, then CBL, then HashSet.

11.5 Perspectives

The use of necklaces as a k‑mer representation is relatively recent, and many directions remain open. One open question concerns the rank of a necklace: the best known algorithm runs in \mathcal{O}(k^2) time (Sawada & Williams, 2017), which makes it impractical for large-scale use. A promising avenue would be to compute an approximate rank, mapping necklaces to a space slightly larger than N(k) but still \mathcal{O}(k) times smaller than the original k‑mer space. Piecewise linear approximation, as used in the PLA-index (Abrar & Medvedev, 2024), could provide such an approximation with controllable error, and would be a natural fit given the distribution of necklaces.

A second open question is the integration of super-necklaces into an actual index. We introduced the super-necklace representation and showed that it compactly encodes runs of consecutive necklaces sharing the same prefix, but this structure has not yet been connected to a data structure supporting efficient retrieval. Adapting the trie component of CBL to operate directly on super-necklaces is one concrete direction: within a run, successive suffixes differ by a single letter at a known position, which could allow the trie to share internal nodes more aggressively than it currently does.

Regarding CBL itself, several implementation-level improvements are within reach. The current implementation is single-threaded, though suffix buckets are independent given a fixed prefix and could straightforwardly be distributed across threads. The core necklace computation and bucket scan operations are also good candidates for SIMD vectorization, and the techniques developed in Chapter 8 could serve as inspiration. The trie structure used for large buckets can also consume significant memory, which could be addressed by using a more compact alternative such as an adaptive radix tree (Leis et al., 2013). CBL being a fully dynamic structure also opens up possibilities beyond simple membership queries, which we explore in the next chapter.

For instance, we could afford mapping the values to an output space that’s 1% bigger than N(k).↩︎