Appendix C — Experiments on hyper-k-mers

C.1 Datasets

Table C.1 lists the datasets used in the experiments and their characteristics.

Table C.1: E. coli datasets used for benchmarking KFC. Min., Avg. and Max. refer to the minimum, average and maximum read length respectively. Total length and read lengths are given in bases.
Name         Type         Coverage  # reads    Total length    Min.  Avg.      Max.
SRR11434954  HiFi         5000×     1,789,131  23,122,913,014  46    12,924.1  26,294
SRR28370642  ONT Duplex   50×       114,703    236,908,842     349   2,065.4   126,029
SRR28370651  ONT Duplex   50×       109,061    226,502,819     330   2,076.8   125,012
SRR28370668  ONT Simplex  2000×     6,819,683  9,101,103,830   1     1,334.5   396,011
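
The columns of Table C.1 can be cross-checked against each other: the average read length is the total length divided by the number of reads, and the coverage is roughly the total length divided by the genome size. The following Python sketch illustrates this arithmetic. It assumes an E. coli genome size of about 4.6 Mbp, an approximation that is not stated in the table, and the names `read_stats` and `ECOLI_GENOME_SIZE` are illustrative only.

```python
def read_stats(lengths, genome_size):
    """Summarize read lengths and estimate coverage as total bases / genome size."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_length": total,
        "min": min(lengths),
        "avg": total / len(lengths),
        "max": max(lengths),
        "coverage": total / genome_size,
    }


# Assumption: approximate E. coli genome size in bases (not taken from Table C.1).
ECOLI_GENOME_SIZE = 4_600_000

# Cross-check of the SRR28370642 row using only its published totals.
total_length, n_reads = 236_908_842, 114_703
avg_length = total_length / n_reads
coverage = total_length / ECOLI_GENOME_SIZE
```

For SRR28370642 this gives an average read length that rounds to 2,065.4 and a coverage slightly above 50×, consistent with the table.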

C.2 Multi-threading efficiency of KFC

This section assesses the multi-threading efficiency of KFC (Figure C.1). KFC makes effective use of multi-core machines: its performance keeps improving up to several dozen cores, beyond which the returns diminish.

Figure C.1: Multi-core usage efficiency of KFC on the 100× HiFi E. coli dataset.
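
Multi-threading efficiency is commonly summarized by the speedup relative to a baseline run and by the parallel efficiency, that is, the speedup divided by the increase in thread count. The Python sketch below shows one way to derive both from wall-clock timings. The function name `scaling_table` is hypothetical, and the timings in the example are placeholders chosen for illustration, not measurements of KFC.

```python
def scaling_table(timings):
    """Compute speedup and parallel efficiency from wall-clock timings.

    timings maps a thread count to a running time in seconds.
    The run with the fewest threads serves as the baseline.
    Returns a list of (threads, speedup, efficiency) tuples sorted by threads.
    """
    base_threads = min(timings)
    base_time = timings[base_threads]
    rows = []
    for threads in sorted(timings):
        speedup = base_time / timings[threads]
        # An efficiency of 1.0 means perfect scaling relative to the baseline.
        efficiency = speedup * base_threads / threads
        rows.append((threads, speedup, efficiency))
    return rows


# Hypothetical placeholder timings (seconds), NOT measurements of KFC.
example = scaling_table({1: 100.0, 2: 50.0, 4: 40.0})
```

An efficiency that drops well below 1.0 as the thread count grows corresponds to the diminishing returns mentioned above.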

C.3 Metagenomic benchmarks

Figure C.2 and Figure C.3 confirm the trends of Section 14.4.1 on two larger HiFi metagenomic datasets.

Figure C.2: k‑mer counting benchmark on the HiFi Zymo community dataset (SRR13128014) downsampled to 5 gigabases, with unique k‑mer filtering. (a) Memory usage. (b) Running time.

Figure C.3: k‑mer counting benchmark on the HiFi human gut datasets (SRR15275210, SRR15275211, SRR15275212, SRR15275213) downsampled to 15 gigabases, with unique k‑mer filtering. (a) Memory usage. (b) Running time.

C.4 Effect of coverage

Figure C.4 shows benchmarks similar to Figure 14.4 but at various coverage levels, without significant change in the relative behaviors of the tools.

Figure C.4: k‑mer counting benchmark on the ONT Simplex E. coli dataset (SRR28370668) at various coverage levels.

Figure C.5: k‑mer counting benchmark on the complete HiFi human gut datasets (SRR15275210, SRR15275211, SRR15275212, SRR15275213), filtering k‑mers appearing once or twice (abundance threshold t = 3).
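
Lower-coverage inputs can be derived from a dataset by sampling reads at random until a target number of bases is reached. The Python sketch below shows this idea under stated assumptions: it is not necessarily the procedure used to produce the datasets in this appendix, and `downsample`, `target_bases` and `seed` are hypothetical names.

```python
import random


def downsample(reads, target_bases, seed=0):
    """Pick reads in random order until at least target_bases bases are selected.

    reads is a list of sequences (strings). If the whole input contains fewer
    than target_bases bases, every read is returned.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        picked.append(reads[idx])
        total += len(reads[idx])
    return picked
```

A target coverage c on a genome of size G translates into target_bases = c × G.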

C.5 Effect of not filtering unique k‑mers

This section reports the performance of KFC when unique k‑mers are not filtered out. Figure C.6 shows the corresponding results discussed in Section 14.4.1: the setup matches that of Figure 14.4, except that unique k‑mers are kept.

Figure C.6: Comparison of k‑mer benchmarks on different E. coli datasets without any filtering. Each subfigure shows the memory usage and timing plots for different sequencing technologies.
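
The filtering variants used across these benchmarks differ only in the abundance threshold t: k‑mers seen fewer than t times are discarded, so t = 1 keeps everything, t = 2 removes unique k‑mers, and t = 3 also removes k‑mers seen twice. The Python sketch below illustrates these semantics with a naive in-memory counter. It is not how KFC or the other benchmarked tools work internally, it ignores canonical (reverse-complement) k‑mers for brevity, and `count_kmers` is a hypothetical name.

```python
from collections import Counter


def count_kmers(reads, k, t=1):
    """Count k-mers over all reads and keep those with abundance >= t.

    Naive illustration only: all k-mers are held in memory and
    reverse complements are counted separately.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: c for kmer, c in counts.items() if c >= t}


reads = ["ACGTAC", "CGTA"]
unfiltered = count_kmers(reads, k=3, t=1)  # no filtering
no_unique = count_kmers(reads, k=3, t=2)   # unique k-mers removed
```

On real sequencing data, a large share of distinct k‑mers is typically unique because of sequencing errors, which is why disabling the filter increases the cost of counting.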

C.6 Pangenome benchmark

We evaluate the tools on one thousand S. enterica complete genomes from NCBI to assess the cost of counting large k‑mers across a pangenome (Figure C.7). KFC remains the most memory-efficient tool, ahead of FastK and KMC. For very large k‑mer sizes (k > 500) it also becomes the fastest, but only at lower thread counts: on small inputs, its multi-threading efficiency (analyzed in Section C.2) lags behind that of KMC.

Figure C.7: k‑mer counting benchmark on one thousand S. enterica complete genomes. (a) Memory usage. (b) Running time.