12 Set operations on k-mer collections

Note

This chapter is partially adapted from (Martayan et al., 2024), accepted to ISMB 2024 and published in Bioinformatics.

The original paper reported benchmarks on set operations but did not describe the underlying algorithms. We present them here with refinements made since publication, together with new discussions on exponential search and connections to database query engines.

The previous chapter introduced CBL as a compressed, dynamic structure for exact k‑mer set representation and established its competitive performance for membership queries, insertions, and deletions. We now turn to a class of operations that most k‑mer structures do not support at all: set operations. Union, intersection, and difference are fundamental to pangenomic workflows: merging sequencing runs, identifying k‑mers shared across all samples in a cohort, removing contaminant sequences. Yet they are typically performed either approximately, using lossy Bloom-filter-based methods (Lemane et al., 2022; Shibuya et al., 2022), or by reloading sets into a general-purpose hash table. At the time of writing, CBL was, to our knowledge, the only exact k‑mer structure to offer in-place set operations. We explain what enabled this, what algorithmic machinery underlies it, and when the approach wins (or loses) against hash-based alternatives.

12.1 The sorted-iterator interface

The key enabler of efficient set operations in CBL is that its internal layout naturally yields a sorted stream of k‑mers. Traversing the prefix bitvector in rank order and, for each occupied prefix, scanning its suffix bucket in turn produces all stored k‑mers in lexicographic order. Each bucket stores suffixes as either a flat vector or, once it exceeds a size threshold, a trie. Tries are inherently sorted; only flat-vector buckets require an explicit sort, performed lazily when a set operation is requested. Since flat-vector buckets are small by definition, this per-bucket sort is cheap.

This sorted-iteration property transforms set operations into classical merge problems. Given two sorted streams, union and intersection can be computed in a single linear pass, consuming each stream once from left to right. By contrast, the hash-based alternative checks each element of one set against the other, requiring one random memory access per element. We examine the performance consequences of this difference in Section 12.4.

12.2 Multi-way set operations

CBL’s set operations are implemented on top of iter-set-ops (available at https://github.com/imartayan/iter-set-ops), a Rust crate that performs intersection, union, and difference over an arbitrary number of sorted deduplicated iterators. Generalizing to s > 2 sets costs \mathcal{O}(n \log s) rather than the \mathcal{O}(n s) of repeated hash lookups, with no change to the algorithm structure. All operations are lazy: they return iterators that advance their inputs on demand, so the result can be streamed without materializing it in memory.

12.2.1 Union

The union of s sorted iterators is computed with a binary heap-based s-way merge. Each iterator contributes its current front element to a min-heap; at each step the minimum is extracted and its source iterator is advanced; if the new front matches the extracted value, that iterator is also advanced to suppress duplicates before inserting back into the heap. Total cost is \mathcal{O}(n \log s) for s iterators of combined size n.

12.2.2 Intersection

Intersection uses a max-index tracking strategy rather than a heap. A “front” array holds the current element of each iterator. The algorithm records the index of the iterator holding the current maximum; all other iterators are advanced until they either match or overshoot that maximum. When all fronts are equal, the element is yielded and all iterators advance together.

This approach benefits from early termination: as soon as any iterator is exhausted, the computation stops. When the intersection is sparse, with few k‑mers shared across all s sets, this happens well before the inputs are consumed and the actual cost falls well below the \mathcal{O}(n \log s) worst case.

12.2.3 (Symmetric) difference

Set difference advances the left iterator and, for each left element, advances all right iterators until their fronts are \geq the left element; if any right front equals the left element it is suppressed. This runs in \mathcal{O}(n + m) for a left set of size n and right sets of combined size m.

12.3 Prefix-level pruning

CBL exploits its quotiented layout to perform set operations at the prefix level before descending to suffixes. When scanning two CBL structures in tandem, both prefix bitvectors are advanced simultaneously. If one structure has no bucket for a given prefix rank, the entire prefix range is skipped using a single rank/select query, with no suffix work performed. Only when both structures share an occupied prefix do their suffix buckets enter the merge or intersection routine.

Because necklace values are skewed towards small values (as discussed in Chapter 11), many prefix slots are empty, and this pruning eliminates a large fraction of the work in practice. When k‑mer sets are biologically related, their occupied prefix ranges overlap heavily, so the relative saving shifts to the suffix level. Identical necklaces are then handled by a zero-cost path that advances both iterators simultaneously. In-place mode further reduces peak memory by modifying one input structure directly, avoiding the allocation of a third.

12.4 Benchmarks

To evaluate the practical impact of sorted iteration, we compare CBL against HashSet, a hash table based on Google’s Swiss Tables, on intersection and union of k‑mer sets built from an expanding collection of bacterial genomes, ranging from 10⁷ to 10⁹ k‑mers (k = 31, p = 28 bits). Time and memory results for intersection are shown in Figure 12.1, and for union in Figure 12.2.

Figure 12.1: Time and memory trade-off when performing set intersections, for k = 31 and p = 28 bits.

Figure 12.2: Time and memory trade-off when performing set union, for k = 31 and p = 28 bits.

CBL achieves 200 ns per k‑mer on intersection and 500 ns per k‑mer on union; HashSet requires 400 ns and 900 ns respectively. The memory advantage is larger: CBL’s in-place operations do not allocate a result structure, while HashSet must construct a third hash table. Comparable trends hold for set difference and symmetric difference. These results reflect a broader pattern: at the scale typical in genomics, with large sets streamed across many samples, sorted iteration consistently outperforms hash-based alternatives.

12.5 Exponential search for skewed set sizes

The linear scan of merge-based algorithms is efficient when the two sets are of comparable size, but wasteful when one set is much smaller. Consider intersecting a small query set of m k‑mers against a large index of n k‑mers with m \ll n: iterating through all n elements to find m matches is suboptimal.

Exponential search (Bentley & Yao, 1976) addresses this: rather than advancing one step at a time in the large iterator, double the stride until overshooting the target, then binary-search within the overshot interval. The cost per match found is \mathcal{O}(\log d) where d is the gap to the next match. Demaine et al. (2000) extended this to multi-way intersection. For two sets of sizes m \ll n, the cost specializes to \mathcal{O}(m \log(n/m)), a significant improvement over the \mathcal{O}(n + m) linear scan. An alternative achieving the same average complexity is the double binary search of Baeza-Yates & Salinger (2010): binary-search the median of the smaller set in the larger one, then recurse on both halves. In practice both approaches outperform linear merge when the size ratio is large.

The iter-set-ops library uses max-index tracking for intersection, advancing all iterators towards the current maximum, but without the exponential stride; it therefore remains \mathcal{O}(n + m) in the two-set case. Adding exponential search would reduce this to \mathcal{O}(m \log(n/m)) for skewed sizes. The gain is largest whenever set sizes are strongly skewed: a small query intersected against a large reference, a small contaminant set removed from a large assembly, or a multi-way intersection across very unequally-sized sets. In the last case, the largest iterator should always be the one advanced with exponential jumps. Applying this to CBL would however require random access into suffix buckets to binary-search within the overshot interval, which is incompatible with the trie representation currently used for large buckets. Supporting exponential search would therefore require redesigning large buckets with a structure that supports both sorted iteration and random access.

12.6 Perspectives from database query engines

The algorithmic landscape of k‑mer set operations has strong connections to classical database join strategies (Graefe, 1993), suggesting that techniques developed for query engines could benefit k‑mer set processing. CBL’s sorted-iterator approach resembles sort-merge join, while HashSet resembles hash join. When one set is very small, the pattern corresponds to index-nested loop join: iterate over the small set and perform membership queries on the large index, using exponential search to skip large gaps. CBL’s prefix-level pruning is analogous to partition pruning in query optimizers.

A related opportunity comes from the information-retrieval literature: Lemire et al. (2015) showed that SIMD-accelerated intersection nearly doubles throughput on inverted-index posting lists, a setting structurally similar to k‑mer set intersection on sorted lists. More broadly, adaptive strategy selection and partition-parallel processing have precedent in the database literature but remain unexplored for k‑mer sets. The structural similarities between k‑mer set operations and relational query processing suggest that this literature is a productive source of inspiration for future work.