Appendix C — Experiments on hyper-k-mers – Algorithm design and implementation for the scale of sequencing data

C.1 Datasets

Table C.1 lists the datasets used in the experiments and their characteristics.

Table C.1: E. coli datasets used for benchmarking KFC. Min., Avg. and Max. refer to the minimum, average and maximum read length (in bases), respectively.

Name	Type	Coverage	# reads	Total length	Min.	Avg.	Max.
`SRR11434954`	HiFi	5000×	1,789,131	23,122,913,014	46	12,924.1	26,294
`SRR28370642`	ONT Duplex	50×	114,703	236,908,842	349	2,065.4	126,029
`SRR28370651`	ONT Duplex	50×	109,061	226,502,819	330	2,076.8	125,012
`SRR28370668`	ONT Simplex	2000×	6,819,683	9,101,103,830	1	1,334.5	396,011

C.2 Multi-threading efficiency of KFC

This section assesses the multi-threading efficiency of KFC (Figure C.1). KFC makes effective use of multi-core architectures: performance keeps improving up to several dozen cores, after which returns diminish.

Figure C.1: Multi-core usage efficiency of KFC on the 100× HiFi E. coli dataset.
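One common way to quantify multi-threading efficiency is through speedup and parallel efficiency, sketched below in Python. This may differ from the exact metric plotted in Figure C.1. The function names are hypothetical and the timings are synthetic placeholders, not measurements of KFC.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Ratio of single-thread wall-clock time to multi-thread wall-clock time."""
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, n_threads: int) -> float:
    """Speedup divided by the number of threads (1.0 means perfect scaling)."""
    return speedup(t_single, t_parallel) / n_threads


if __name__ == "__main__":
    # SYNTHETIC wall-clock times in seconds, keyed by thread count.
    # These are illustrative placeholders only, not results from the experiments.
    timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
    t1 = timings[1]
    for threads, t in sorted(timings.items()):
        print(f"{threads} threads: speedup={speedup(t1, t):.2f}, "
              f"efficiency={parallel_efficiency(t1, t, threads):.2f}")
```

Diminishing returns show up as an efficiency that drops well below 1.0 as the thread count grows.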

C.3 Metagenomic benchmarks

Figure C.2 and Figure C.3 confirm the trends of Section 14.4.1 on two larger HiFi metagenomic datasets.

C.4 Effect of coverage

Figure C.4 shows benchmarks similar to those of Figure 14.4 at various coverage levels; the relative behavior of the tools does not change significantly.

Figure C.4: k‑mer counting benchmark on the ONT Simplex E. coli dataset (SRR28370668) at various coverage levels.

Figure C.5: k‑mer counting benchmark on the complete HiFi human gut datasets (SRR15275210, SRR15275211, SRR15275212, SRR15275213), filtering k‑mers appearing once or twice (abundance threshold t = 3).
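To make the abundance threshold concrete, the following naive Python sketch counts k‑mers with a dictionary and keeps only those seen at least t times, so that t = 3 discards k‑mers appearing once or twice. It only illustrates the filtering semantics and is not how KFC or the other benchmarked tools work internally. As a simplification, it does not merge a k‑mer with its reverse complement. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k substring across all reads (no canonicalization)."""
    counts: Counter = Counter()
    for read in reads:
        # Reads shorter than k yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_abundance(counts: Dict[str, int], t: int) -> Dict[str, int]:
    """Keep only k-mers whose count is at least the abundance threshold t."""
    return {kmer: c for kmer, c in counts.items() if c >= t}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTAC", "TTACGT"]  # toy input, not real data
    kept = filter_by_abundance(count_kmers(toy_reads, k=4), t=3)
    print(kept)
```

Setting t = 1 keeps every k‑mer, which corresponds to the unfiltered setting.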

C.5 Effect of not filtering unique k‑mers

This section presents the performance of KFC when unique k‑mers are not filtered out. Figure C.6 shows the results discussed in Section 14.4.1: the same benchmarks as Figure 14.4, but without filtering unique k‑mers.

Figure C.6: Comparison of k‑mer benchmarks on different E. coli datasets without any filtering. Each subfigure shows the memory usage and timing plots for different sequencing technologies.

C.6 Pangenome benchmark

We evaluate the tools on one thousand complete S. enterica genomes from NCBI to assess the cost of counting large k‑mers across a pangenome (Figure C.7). KFC remains the most memory-efficient tool, ahead of FastK and KMC. For very large k‑mer sizes (k > 500) it also becomes the fastest, but only when running on fewer threads: on small inputs, its multi-threading efficiency (analyzed in Section C.2) lags behind that of KMC.