References

Abrar, Md.H. & Medvedev, P. (2024) PLA-index: A k-mer index exploiting rank curve linearity. LIPIcs, Volume 312, WABI 2024, vol. 312. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 13:1–13:18.

Alanko, J., Alipanahi, B., Settle, J., Boucher, C., & Gagie, T. (2021) Buffering updates enables efficient dynamic de Bruijn graphs. Computational and Structural Biotechnology Journal, 19, 4067–4078.

Alanko, J.N., Biagi, E., & Puglisi, S.J. (2023) Longest common prefix arrays for succinct k-spectra. String processing and information retrieval. Springer Nature Switzerland, pp. 1–13.

Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Finimizers: Variable-length bounded-frequency minimizers for k-mer sets. IEEE Transactions on Computational Biology and Bioinformatics, 22, 899–910.

Alanko, J.N., Depuydt, L., Marchet, C., & Puglisi, S.J. (2026) Fast set operations for compact k-mer sets. bioRxiv.

Alanko, J.N., Puglisi, S.J., & Vuohtoniemi, J. (2023) Small Searchable κ-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform. SIAM conference on applied and computational discrete algorithms (ACDA23). Society for Industrial and Applied Mathematics, pp. 225–236.

Arm Limited (2026) Arm Architecture Reference Manual for A-profile architecture. Arm Limited. URL https://developer.arm.com/documentation/ddi0487/latest/.

Atkinson, K.E. (2008) An introduction to numerical analysis, 2nd ed. New York, NY: John Wiley & Sons.

Ayad, L.A.K., Fici, G., Groot Koerkamp, R., Loukides, G., Patro, R., Pibiri, G.E., & Pissis, S.P. (2025) U-index: A universal indexing framework for matching long patterns. LIPIcs, Volume 338, SEA 2025, vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 4:1–4:18.

Azar, Y., Broder, A.Z., Karlin, A.R., & Upfal, E. (1994) Balanced allocations (extended abstract). Proceedings of the twenty-sixth annual ACM symposium on theory of computing - STOC ’94, STOC ’94. ACM Press, pp. 593–602.

Baeza-Yates, R. & Salinger, A. (2010) Fast intersection algorithms for sorted sequences. Algorithms and applications. Springer Berlin Heidelberg, pp. 45–61.

Baire, A., Marijon, P., Andreace, F., & Peterlongo, P. (2024) Back to sequences: Find the origin of k-mers. Journal of Open Source Software, 9, 7066.

Baker, D.N. & Langmead, B. (2019) Dashing: Fast and accurate genomic distances with HyperLogLog. Genome Biology, 20.

Baker, D.N. & Langmead, B. (2023) Genomic sketching with multiplicities and locality-sensitive hashing using dashing 2. Genome Research. Cold Spring Harbor Laboratory, p. gr.277655.123.

Balouek, D., Carpen Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Pérez, C., Quesnel, F., Rohr, C., & Sarzyniec, L. (2013) Adding virtualization capabilities to the Grid’5000 testbed. Cloud computing and services science, vol. 367, Communications in computer and information science (Ivanov, I.I., Sinderen, M. van, Leymann, F., & Shan, T. eds). Springer International Publishing, pp. 3–20.

Bankevich, A., Bzikadze, A.V., Kolmogorov, M., Antipov, D., & Pevzner, P.A. (2022) Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature Biotechnology, 40, 1075–1081.

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., & Pevzner, P.A. (2012) SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19, 455–477.

Benoit, G., Peterlongo, P., Mariadassou, M., Drezen, E., Schbath, S., Lavenier, D., & Lemaitre, C. (2016) Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science, 2, e94.

Benoit, G., Raguideau, S., James, R., Phillippy, A.M., Chikhi, R., & Quince, C. (2024) High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology, 42, 1378–1383.

Bentley, J.L. & Yao, A.C.-C. (1976) An almost optimal algorithm for unbounded searching. Information Processing Letters, 5, 82–87.

Bille, P., Christiansen, A.R., Ettienne, M.B., & Gørtz, I.L. (2017) Fast dynamic arrays. LIPIcs, Volume 87, ESA 2017, vol. 87. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 16:1–16:13.

Bowe, A., Onodera, T., Sadakane, K., & Shibuya, T. (2012) Succinct de Bruijn graphs. Algorithms in bioinformatics. Springer Berlin Heidelberg, pp. 225–235.

Bradley, P., Bakker, H.C. den, Rocha, E.P.C., McVean, G., & Iqbal, Z. (2019) Ultrafast search of all deposited bacterial and viral genomic data. Nature Biotechnology, 37, 152–159.

Břinda, K., Baym, M., & Kucherov, G. (2021) Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biology, 22.

Broder, A.Z. (1997) On the resemblance and containment of documents. Proceedings. Compression and complexity of SEQUENCES 1997 (cat. no.97TB100171), SEQUEN-97. IEEE Comput. Soc, pp. 21–29.

Burrows, M. & Wheeler, D.J. (1994) A block-sorting lossless data compression algorithm (Technical Report No. 124). Digital Equipment Corporation.

Campanelli, A., Pibiri, G.E., Fan, J., & Patro, R. (2024) Where the patterns are: Repetition-aware compression for colored de Bruijn graphs. Journal of Computational Biology, 31, 1022–1044.

Chen, K., Li, X., Shi, Q., Shao, M., & Medvedev, P. (2026) Hash functions in nucleotide sequence analysis. Genome Research.

Chen, K., Pattar, V., & Shao, M. (2025) Sequence similarity estimation by random subsequence sketching. LIPIcs, Volume 344, WABI 2025, vol. 344. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 7:1–7:17.

Chikhi, R., Lemane, T., Loll-Krippleber, R., Montoliu-Nerin, M., Raffestin, B., Camargo, A.P., Miller, C.J., Fiamenghi, M.B., Agustinho, D.P., Majidian, S., Autric, G., Hugues, M., Lee, J., Faure, R., Curry, K.D., Moura de Sousa, J.A., Rocha, E.P.C., Koslicki, D., Medvedev, P., Gupta, P., Shen, J., Morales-Tapia, A., Sihuta, K., Roy, P.J., Brown, G.W., Edgar, R.C., Korobeynikov, A., Steinegger, M., Lareau, C.A., Peterlongo, P., & Babaian, A. (2024) Logan: Planetary-scale genome assembly surveys life’s diversity. bioRxiv.

Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., & Medvedev, P. (2015) On the representation of de Bruijn graphs. Journal of Computational Biology, 22, 336–352.

Chikhi, R., Limasset, A., & Medvedev, P. (2016) Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32, i201–i208.

Cohen, J.D. (1997) Recursive hashing functions for n-grams. ACM Transactions on Information Systems, 15, 291–320.

Constantinides, B., Lees, J., & Crook, D.W. (2025) Deacon: Fast sequence filtering and contaminant depletion. bioRxiv.

Cracco, A. & Tomescu, A.I. (2023) Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Research.

Crochemore, M., Czumaj, A., Gąsieniec, L., Lecroq, T., Plandowski, W., & Rytter, W. (1999) Fast practical multi-pattern matching. Information Processing Letters, 71, 107–113.

Crosbie, N.D. (2025) Grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions. Journal of Open Source Software, 10, 8048.

Darvish, M., Seiler, E., Mehringer, S., Rahn, R., & Reinert, K. (2022) Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics, 38, 4100–4108.

Degardins, B., Paperman, C., & Marchet, C. (2025) Vizitig: A pangenome and pantranscriptome explorer. bioRxiv.

Demaine, E.D., López-Ortiz, A., & Munro, J.I. (2000) Adaptive set intersections, unions, and differences. Proceedings of the eleventh annual ACM-SIAM symposium on discrete algorithms, SODA ’00. USA: Society for Industrial and Applied Mathematics, pp. 743–752.

Deorowicz, S., Debudaj-Grabysz, A., & Grabowski, S. (2013) Disk-based k-mer counting on a PC. BMC Bioinformatics, 14.

Deorowicz, S., Kokot, M., Grabowski, S., & Debudaj-Grabysz, A. (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics, 31, 1569–1576.

Díaz-Domínguez, D., Leinonen, M., & Salmela, L. (2024) Space-efficient computation of k-mer dictionaries for large values of k. Algorithms for Molecular Biology, 19, 14.

Donges, S., Puglisi, S.J., & Raman, R. (2022) On dynamic bitvector implementations. 2022 data compression conference (DCC). IEEE, pp. 252–261.

Dufresne, Y., Guillemot, V., & Dreo, J. (2024) Optimization of reversible hash functions for k-mer data structures. SeqBIM 2024 workshop. Rennes, France.

Edgar, R. (2021) Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 9, e10805.

Ekim, B., Berger, B., & Chikhi, R. (2021) Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems, 12, 958–968.e6.

Ekim, B., Sahlin, K., Medvedev, P., Berger, B., & Chikhi, R. (2023) Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Research.

Elias, P. (1974) Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21, 246–260.

Erbert, M., Rechner, S., & Müller-Hannemann, M. (2017) Gerbil: A fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12.

Fan, J., Khan, J., Singh, N.P., Pibiri, G.E., & Patro, R. (2024) Fulgor: A fast and compact k-mer index for large-scale matching and color queries. Algorithms for Molecular Biology, 19, 3.

Fano, R.M. (1971) On the number of bits required to implement an associative memory (Memorandum No. 61). Computer Structures Group, MIT, Cambridge, MA. URL http://csg.csail.mit.edu/pubs/memos/Memo-61/Memo-61.pdf.

Faro, S. & Lecroq, T. (2013) The exact online string matching problem: A review of the most recent results. ACM Computing Surveys, 45, 1–42.

Faure, R., Abrar, H., Wu, H., Chikhi, R., Koslicki, D., & Medvedev, P. (2025) Comparing and indexing metagenomes at a large scale using random projections. SeqBIM 2025 workshop. Nantes, France.

Feng, X., Cheng, H., Portik, D., & Li, H. (2022) Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods, 19, 671–674.

Ferragina, P. & Manzini, G. (2000) Opportunistic data structures with applications. Proceedings 41st annual symposium on foundations of computer science, SFCS-00. IEEE Comput. Soc, pp. 390–398.

Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007) HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, DMTCS Proceedings vol. AH,...

Gagie, T., Manzini, G., & Sirén, J. (2017) Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698, 67–78.

Gallant, A. (2024) Ripgrep. URL https://github.com/BurntSushi/ripgrep.

Gienieczko, M., Murlak, F., & Paperman, C. (2023) Supporting descendants in SIMD-accelerated JSONPath. Proceedings of the 28th ACM international conference on architectural support for programming languages and operating systems, volume 4, ASPLOS ’23. ACM, pp. 338–361.

Gog, S. & Petri, M. (2013) Optimized succinct data structures for massive data. Software: Practice and Experience, 44, 1287–1314.

Golan, S., Tziony, I., Kraus, M., Orenstein, Y., & Shur, A. (2025) GreedyMini: Generating low-density DNA minimizers. Bioinformatics, 41, i275–i284.

Graefe, G. (1993) Query evaluation techniques for large databases. ACM Computing Surveys, 25, 73–169.

Greenberg, G., Ravi, A.N., & Shomorony, I. (2023) LexicHash: Sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics, 39.

Groot Koerkamp, R. (2024) A*PA2: Up to 19× faster exact global alignment. LIPIcs, Volume 312, WABI 2024, vol. 312. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 17:1–17:25.

Groot Koerkamp, R. (2025a) PtrHash: Minimal perfect hashing at RAM throughput. LIPIcs, Volume 338, SEA 2025, vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 21:1–21:21.

Groot Koerkamp, R. (2025b) Optimal throughput bioinformatics (PhD thesis). URL https://www.research-collection.ethz.ch/handle/20.500.11850/783091.

Groot Koerkamp, R. (2026) The anti-lexicographic SUS-anchor: A near-optimal k=1 sampling scheme. arXiv.

Groot Koerkamp, R. & Ivanov, P. (2024) Exact global alignment using A* with chaining seed heuristic and match pruning. Bioinformatics, 40.

Groot Koerkamp, R., Liu, D., & Pibiri, G.E. (2025) The open-closed mod-minimizer algorithm. Algorithms for Molecular Biology, 20.

Groot Koerkamp, R. & Martayan, I. (2025) SimdMinimizers: Computing Random Minimizers, fast. 23rd international symposium on experimental algorithms (SEA 2025), vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.

Groot Koerkamp, R. & Pibiri, G.E. (2024) The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. 24th international workshop on algorithms in bioinformatics (WABI 2024), vol. 312, Leibniz international proceedings in informatics (LIPIcs) (Pissis, S.P. & Sung, W.-K. eds). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 11:1–11:23.

Hennessy, J.L., Patterson, D.A., & Kozyrakis, C. (2026) Computer architecture: A quantitative approach, 7th ed. Cambridge, MA: Morgan Kaufmann Publishers. URL https://shop.elsevier.com/books/computer-architecture/hennessy/978-0-443-15406-5.

Hera, M.R., Koslicki, D., & Martínez, C. (2025) MaxGeomHash: An algorithm for variable-size random sampling of distinct elements. bioRxiv.

Hernandez-Courbevoie, Y., Salson, M., Bessière, C., Xue, H., Gautheret, D., Marchet, C., & Limasset, A. (2025) REINDEER2: Practical abundance index at scale. String processing and information retrieval. Springer Nature Switzerland, pp. 156–171.

Hirzel, M., Schneider, S., & Tangwongsan, K. (2017) Sliding-window aggregation algorithms: Tutorial. Proceedings of the 11th ACM international conference on distributed and event-based systems, DEBS 2017, Barcelona, Spain, June 19-23, 2017. ACM, pp. 11–14.

Holley, G. & Melsted, P. (2020) Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biology, 21.

Homer, N., Stadick, S., Lambert, S., Stone, M., & Fennell, T. (2025) Fqgrep. URL https://doi.org/10.5281/zenodo.15034074.

Ingels, F., Limasset, A., Marchet, C., & Salson, M. (2026) Vigemers: On the number of k-mers sharing the same XOR-based minimizer. arXiv.

Ingels, F., Marchet, C., & Salson, M. (2024) On the number of k-mers admitting a given lexicographical minimizer. arXiv.

Ingels, F., Robidou, L., Martayan, I., Marchet, C., & Limasset, A. (2025) Minimizer density revisited: Models and multiminimizers. bioRxiv.

Intel Corporation (2026) Intel® 64 and IA-32 Architectures Software Developer’s Manual. Intel Corporation. URL https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html.

Iqbal, Z., Turner, I., & McVean, G. (2012) High-throughput microbial population genomics using the cortex variation assembler. Bioinformatics, 29, 275–276.

Irber, L., Brooks, P.T., Reiter, T., Pierce-Ward, N.T., Hera, M.R., Koslicki, D., & Brown, C.T. (2022) Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers. bioRxiv.

Irber, L., Pierce-Ward, N.T., & Brown, C.T. (2022) Sourmash branchwater enables lightweight petabyte-scale sequence search. bioRxiv.

Jain, C., Dilthey, A., Koren, S., Aluru, S., & Phillippy, A.M. (2017) A fast approximate algorithm for mapping long reads to large reference databases. Research in computational molecular biology. Springer International Publishing, pp. 66–81.

Karasikov, M., Mustafa, H., Danciu, D., Kulkov, O., Zimmermann, M., Barber, C., Rätsch, G., & Kahles, A. (2025) Efficient and accurate search in petabase-scale sequence repositories. Nature, 647, 1036–1044.

Karasikov, M., Mustafa, H., Rätsch, G., & Kahles, A. (2022) Lossless indexing with counting de Bruijn graphs. Genome Research, 32, 1754–1764.

Karp, R.M. (2009) Reducibility among combinatorial problems. 50 years of integer programming 1958-2008. Springer Berlin Heidelberg, pp. 219–241.

Karp, R.M. & Rabin, M.O. (1987) Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31, 249–260.

Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De Vargas, C., Raes, J., Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.-M., Follows, M., Gorsky, G., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not, F., Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., Velayoudon, D., Weissenbach, J., & Wincker, P. (2011) A holistic approach to marine eco-systems biology. PLoS Biology, 9, e1001177.

Kazemi, P., Wong, J., Nikolić, V., Mohamadi, H., Warren, R.L., & Birol, I. (2022) ntHash2: Recursive spaced seed hashing for nucleotide sequences. Bioinformatics, 38, 4812–4813.

Khan, J., Kokot, M., Deorowicz, S., & Patro, R. (2022) Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biology, 23.

Khan, J., Patro, R., & Pandey, P. (2026) Kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations. bioRxiv.

Kille, B., Groot Koerkamp, R., McAdams, D., Liu, A., & Treangen, T.J. (2024) A near-tight lower bound on the density of forward sampling schemes. Bioinformatics, 41.

Kokot, M., Długosz, M., & Deorowicz, S. (2017) KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33, 2759–2761.

Kolmogorov, M., Yuan, J., Lin, Y., & Pevzner, P.A. (2019) Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology, 37, 540–546.

Konstantinidis, K.T. & Tiedje, J.M. (2005) Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences, 102, 2567–2572.

Langdale, G. & Lemire, D. (2019) Parsing gigabytes of JSON per second. The VLDB Journal, 28, 941–960.

Lehmann, H.-P., Mueller, T., Pagh, R., Pibiri, G.E., Sanders, P., Vigna, S., & Walzer, S. (2026) Modern minimal perfect hashing: A survey. ACM Computing Surveys, 58, 1–36.

Leis, V., Kemper, A., & Neumann, T. (2013) The adaptive radix tree: ARTful indexing for main-memory databases. 2013 IEEE 29th international conference on data engineering (ICDE). IEEE, pp. 38–49.

Lemane, T., Medvedev, P., Chikhi, R., & Peterlongo, P. (2022) Kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinformatics Advances, 2.

Lemire, D. (2017) Removing duplicates from lists quickly. URL https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/.

Lemire, D., Boytsov, L., & Kurz, N. (2015) SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 46, 723–749.

Lemire, D., Kaser, O., & Kurz, N. (2019) Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience, 49, 953–970.

Levallois, V., Shibuya, Y., Le Gal, B., Dufresne, Y., Patro, R., Peterlongo, P., & Pibiri, G.E. (2026) Kaminari: A frugal colored index for approximate k-mer queries. Bioinformatics Advances.

Li, H. (2009) Kseq. URL https://github.com/attractivechaos/klib.

Li, H. (2016) Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110.

Li, H. (2018) Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.

Li, H. (2020) Biofast. URL https://github.com/lh3/biofast.

Li, P. & König, C. (2010) B-bit minwise hashing. Proceedings of the 19th international conference on world wide web, WWW ’10. ACM, pp. 671–680.

Li, P., Owen, A., & Zhang, C. (2012) One permutation hashing. Advances in neural information processing systems, vol. 25 (Pereira, F., Burges, C.J., Bottou, L., & Weinberger, K. eds). Curran Associates, Inc.

Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., & Suri, S. (2013) Memory efficient minimum substring partitioning. Proceedings of the VLDB Endowment, 6, 169–180.

Li, Y. & Yan, X. (2015) MSPKmerCounter: A fast and memory efficient approach for k-mer counting. arXiv.

Limasset, A., Rizk, G., Chikhi, R., & Peterlongo, P. (2017) Fast and scalable minimal perfect hashing for massive key sets. LIPIcs, Volume 75, SEA 2017, vol. 75. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 25:1–25:16.

Lothaire, M. (1997) Combinatorics on words, vol. 17. Cambridge university press.

Mäklin, T., Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Sequence alignment with k-bounded matching statistics. bioRxiv.

Marçais, G., DeBlasio, D., & Kingsford, C. (2018) Asymptotically optimal minimizers schemes. Bioinformatics, 34, i13–i22.

Marçais, G., Elder, C.S., & Kingsford, C. (2024) K-nonical space: Sketching with reverse complements. Bioinformatics, 40, btae629.

Marçais, G. & Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770.

Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., & Kingsford, C. (2017) Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33, i110–i117.

Marchet, C. (2024) Advances in practical k-mer sets: Essentials for the curious. arXiv.

Marchet, C., Iqbal, Z., Gautheret, D., Salson, M., & Chikhi, R. (2020) REINDEER: Efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36, i177–i185.

Marchet, C., Kerbiriou, M., & Limasset, A. (2021) BLight: Efficient exact associative structure for k-mers. Bioinformatics, 37, 2858–2865.

Marchini, S. & Vigna, S. (2020) Compact Fenwick trees for dynamic ranking and selection. Software: Practice and Experience, 50, 1184–1202.

Marco-Sola, S., Eizenga, J.M., Guarracino, A., Paten, B., Garrison, E., & Moreto, M. (2023) Optimal gap-affine alignment in O(s) space. Bioinformatics, 39.

Marco-Sola, S., Moure, J.C., Moreto, M., & Espinosa, A. (2020) Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics, 37, 456–463.

Martayan, I., Cazaux, B., Limasset, A., & Marchet, C. (2024) Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Bioinformatics.

Martayan, I., Lobet, L., Marchet, C., & Paperman, C. (2026) Helicase: Vectorized parsing and bitpacking of genomic sequences. bioRxiv.

Martayan, I., Robidou, L., Shibuya, Y., & Limasset, A. (2025) Hyper-k-mers: Efficient streaming k-mers representation. Research in computational molecular biology (RECOMB 2025). Springer Nature Switzerland.

Martayan, I., Vandamme, L., Constantinides, B., Cazaux, B., Paperman, C., & Limasset, A. (2026) Accelerating k-mer-based sequence filtering. Peer Community Journal, 6.

McNaughton, R. & Papert, S.A. (1971) Counter-free automata. The MIT Press. URL https://dl.acm.org/doi/abs/10.5555/1097043.

Mitzenmacher, M. (2001) The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems, 12, 1094–1104.

Mohamadi, H., Chu, J., Vandervalk, B.P., & Birol, I. (2016) ntHash: Recursive nucleotide hashing. Bioinformatics, 32, 3492–3494.

Myers, G. (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46, 395–415.

Myers, G. (2023) FASTK: A fast K-mer counter for high-fidelity shotgun datasets. URL https://github.com/thegenemyers/FASTK.

Mykkeltveit, J. (1972) A proof of Golomb’s conjecture for the de Bruijn graph. Journal of Combinatorial Theory, Series B, 13, 40–45.

Ndiaye, M., Prieto-Baños, S., Fitzgerald, L.M., Yazdizadeh Kharrazi, A., Oreshkov, S., Dessimoz, C., Sedlazeck, F.J., Glover, N., & Majidian, S. (2024) When less is more: Sketching with minimizers in genomics. Genome Biology, 25.

Needleman, S.B. & Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Nunes, I., Heddes, M., Vergés, P., Abraham, D., Veidenbaum, A., Nicolau, A., & Givargis, T. (2023) DotHash: Estimating set similarity metrics for link prediction and document deduplication. Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, KDD ’23. ACM, pp. 1758–1769.

Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Gershman, A., Aganezov, S., Hoyt, S.J., Diekhans, M., Logsdon, G.A., Alonge, M., Antonarakis, S.E., Borchers, M., Bouffard, G.G., Brooks, S.Y., Caldas, G.V., Chen, N.-C., Cheng, H., Chin, C.-S., Chow, W., Lima, L.G. de, Dishuck, P.C., Durbin, R., Dvorkina, T., Fiddes, I.T., Formenti, G., Fulton, R.S., Fungtammasan, A., Garrison, E., Grady, P.G.S., Graves-Lindsay, T.A., Hall, I.M., Hansen, N.F., Hartley, G.A., Haukness, M., Howe, K., Hunkapiller, M.W., Jain, C., Jain, M., Jarvis, E.D., Kerpedjiev, P., Kirsche, M., Kolmogorov, M., Korlach, J., Kremitzki, M., Li, H., Maduro, V.V., Marschall, T., McCartney, A.M., McDaniel, J., Miller, D.E., Mullikin, J.C., Myers, E.W., Olson, N.D., Paten, B., Peluso, P., Pevzner, P.A., Porubsky, D., Potapova, T., Rogaev, E.I., Rosenfeld, J.A., Salzberg, S.L., Schneider, V.A., Sedlazeck, F.J., Shafin, K., Shew, C.J., Shumate, A., Sims, Y., Smit, A.F.A., Soto, D.C., Sović, I., Storer, J.M., Streets, A., Sullivan, B.A., Thibaud-Nissen, F., Torrance, J., Wagner, J., Walenz, B.P., Wenger, A., Wood, J.M.D., Xiao, C., Yan, S.M., Young, A.C., Zarate, S., Surti, U., McCoy, R.C., Dennis, M.Y., Alexandrov, I.A., Gerton, J.L., O’Neill, R.J., Timp, W., Zook, J.M., Schatz, M.C., Eichler, E.E., Miga, K.H., & Phillippy, A.M. (2022) The complete sequence of a human genome. Science, 376, 44–53.

Ondov, B.D., Treangen, T.J., Melsted, P., Mallonee, A.B., Bergman, N.H., Koren, S., & Phillippy, A.M. (2016) Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology, 17.

One Codex (2019) Needletail. URL https://github.com/onecodex/needletail.

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., & Kingsford, C. (2016) Compact universal k-mer hitting sets. Algorithms in bioinformatics. Springer International Publishing, pp. 257–268.

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., & Kingsford, C. (2017) Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLOS Computational Biology, 13, e1005777.

Pan, C. & Reinert, K. (2024) A simple refined DNA minimizer operator enables 2-fold faster computation. Bioinformatics, 40, btae045.

Pandey, P., Almodaresi, F., Bender, M.A., Ferdman, M., Johnson, R., & Patro, R. (2018) Mantis: A fast, small, and exact large-scale sequence-search index. Cell Systems, 7, 201–207.e4.

Paperman, C., Salvati, S., & Soyez-Martin, C. (2023) An algebraic approach to vectorial programs. LIPIcs, Volume 254, STACS 2023, vol. 254. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 51:1–51:23.

Patro, R., Bharti, S., Singhania, P., Dhakal, R., Dahlstrom, T.J., & Groot Koerkamp, R. (2025) Mim: A lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing. bioRxiv.

Peleg, A., Wilkie, S., & Weiser, U. (1997) Intel MMX for multimedia PCs. Communications of the ACM, 40, 24–38.

Pellow, D., Pu, L., Ekim, B., Kotlar, L., Berger, B., Shamir, R., & Orenstein, Y. (2023) Efficient minimizer orders for large values of k using minimum decycling sets. Genome Research.

Pennisi, E. (2017) Biologists propose to sequence the DNA of all life on earth. Science.

Pibiri, G.E. (2022) Sparse and skew hashing of k-mers. Bioinformatics, 38, i185–i194.

Pibiri, G.E. & Kanda, S. (2021) Rank/select queries over mutable bitmaps. Information Systems, 99, 101756.

Pibiri, G.E. & Patro, R. (2026) Optimizing sparse and skew hashing: Faster k-mer dictionaries. bioRxiv.

Pibiri, G.E., Shibuya, Y., & Limasset, A. (2023) Locality-preserving minimal perfect hashing of k-mers. Bioinformatics, 39, i534–i543.

Pibiri, G.E. & Trani, R. (2021) PTHash: Revisiting FCH minimal perfect hashing. Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’21. ACM, pp. 1339–1348.

Pibiri, G.E. & Venturini, R. (2017) Dynamic Elias-Fano representation. LIPIcs, Volume 78, CPM 2017, vol. 78. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 30:1–30:14.

Pierce, N.T., Irber, L., Reiter, T., Brooks, P., & Brown, C.T. (2019) Large-scale sequence comparisons with sourmash. F1000Research, 8, 1006.

Rahman, A. & Medvedev, P. (2021) Representation of k-mer sets using spectrum-preserving string sets. Journal of Computational Biology, 28, 381–394.

Rahman Hera, M., Pierce-Ward, N.T., & Koslicki, D. (2023) Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Research.

Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., & Yorke, J.A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369.

Rouzé, T., Chikhi, R., & Limasset, A. (2025) Inverted colored de Bruijn graph for practical kmer sets storage. bioRxiv.

Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2023) Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching. 23rd international workshop on algorithms in bioinformatics (WABI 2023), vol. 273. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.

Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2025) Fractional hitting sets for efficient multiset sketching. Algorithms for Molecular Biology, 20, 1.

Rowe, W.P.M. (2019) When the levee breaks: A practical guide to sketching algorithms for processing the flood of genomic data. Genome Biology, 20.

Russell, R.M. (1978) The Cray-1 computer system. Communications of the ACM, 21, 63–72.

Sahlin, K. (2021) Effective sequence similarity detection with strobemers. Genome Research, 31, 2080–2094.

Sahlin, K. (2022) Strobealign: Flexible seed size enables ultra-fast and accurate read alignment. Genome Biology, 23.

Sahlin, K., Baudeau, T., Cazaux, B., & Marchet, C. (2023) A survey of mapping algorithms in the long-reads era. Genome Biology, 24.

Sanger, F., Nicklen, S., & Coulson, A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74, 5463–5467.

Sawada, J. & Williams, A. (2017) Practical algorithms to rank necklaces, Lyndon words, and de Bruijn sequences. Journal of Discrete Algorithms, 43, 95–110.

Schartl, M., Woltering, J.M., Irisarri, I., Du, K., Kneitz, S., Pippel, M., Brown, T., Franchini, P., Li, J., Li, M., Adolfi, M., Winkler, S., Freitas Sousa, J. de, Chen, Z., Jacinto, S., Kvon, E.Z., Correa de Oliveira, L.R., Monteiro, E., Baia Amaral, D., Burmester, T., Chalopin, D., Suh, A., Myers, E., Simakov, O., Schneider, I., & Meyer, A. (2024) The genomes of all lungfish inform on genome expansion and tetrapod evolution. Nature, 634, 96–103.

Schleimer, S., Wilkerson, D.S., & Aiken, A. (2003) Winnowing: Local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD ’03. New York, NY, USA: Association for Computing Machinery, pp. 76–85.

Schmidt, S. & Alanko, J.N. (2023) Eulertigs: Minimum plain text representation of k-mer sets without repetitions in linear time. Algorithms for Molecular Biology, 18.

Schmidt, S., Khan, S., Alanko, J.N., Pibiri, G.E., & Tomescu, A.I. (2023) Matchtigs: Minimum plain text representation of k-mer sets. Genome Biology, 24.

Sereika, M., Kirkegaard, R.H., Karst, S.M., Michaelsen, T.Y., Sørensen, E.A., Wollenberg, R.D., & Albertsen, M. (2022) Oxford nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nature Methods, 19, 823–826.

Serre, O. (2004) Vectorial languages and linear temporal logic. Theoretical Computer Science, 310, 79–116.

Shaw, J. & Yu, Y.W. (2021) Theory of local k-mer selection with applications to long-read alignment. Bioinformatics, 38, 4659–4669.

Shen, W., Le, S., Li, Y., & Hu, F. (2016) SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE, 11, e0163962.

Shen, W., Lees, J.A., & Iqbal, Z. (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology.

Shen, W., Sipos, B., & Zhao, L. (2024) SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta, 3.

Shibuya, Y., Belazzougui, D., & Kucherov, G. (2022) Efficient Reconciliation of Genomic Datasets of High Similarity. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 14:1–14:14.

Shiryev, S.A. & Agarwala, R. (2024) Indexing and searching petabase-scale nucleotide resources. Nature Methods, 21, 994–1002.

Shrivastava, A. & Li, P. (2014) Densifying one permutation hashing via rotation for fast near neighbor search. Proceedings of the 31st international conference on machine learning, vol. 32, Proceedings of machine learning research (Xing, E.P. & Jebara, T. eds). Beijing, China: PMLR, pp. 557–565.

Shur, A., Tziony, I., & Orenstein, Y. (2026) 10-minimizers: A promising class of constant-space minimizers. bioRxiv.

Sladký, O., Veselý, P., & Břinda, K. (2023) Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv.

Sladký, O., Veselý, P., & Břinda, K. (2024) Towards efficient k-mer set operations via function-assigned masked superstrings. bioRxiv.

Sladký, O., Veselý, P., & Břinda, K. (2025) FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). Bioinformatics Advances, 6.

Smith, C., Martayan, I., Limasset, A., & Dufresne, Y. (2024) Brisk: Exact resource-efficient dictionary for k-mers. bioRxiv.

Smith, T.F. & Waterman, M.S. (1981) Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.

Solomon, B. & Kingsford, C. (2016) Fast search of thousands of short-read sequencing experiments. Nature Biotechnology, 34, 300–302.

Soyez-Martin, C. (2023) From semigroup theory to vectorization: Recognizing regular languages. (PhD thesis). URL http://dx.doi.org/10.70675/09db0f47z19fbz4c84zbeedza5bfb8f585c8.

Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron, M.J., Iyer, R., Schatz, M.C., Sinha, S., & Robinson, G.E. (2015) Big data: Astronomical or genomical? PLOS Biology, 13, e1002195.

Teyssier, N. (2025) Paraseq. URL https://github.com/noamteyssier/paraseq.

Teyssier, N. & Dobin, A. (2025) BINSEQ: A family of high-performance binary formats for nucleotide sequences. bioRxiv.

Theodorakis, G., Koliousis, A., Pietzuch, P.R., & Pirk, H. (2018) Hammer slide: Work- and CPU-efficient streaming window aggregation. International workshop on accelerating analytics and data management systems using modern processor and storage architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018 (Bordawekar, R. & Lahiri, T. eds). pp. 34–41.

Valve Corporation (2026) Steam Hardware & Software Survey: March 2026. URL https://store.steampowered.com/hwsurvey/.

Vandamme, L., Cazaux, B., & Limasset, A. (2025) K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets. Bioinformatics Advances, vbaf111.

Vigna, S. (2008) Broadword implementation of rank/select queries. Experimental algorithms. Springer Berlin Heidelberg, pp. 154–168.

Wald, A. (1944) On cumulative sums of random variables. The Annals of Mathematical Statistics, 15, 283–296.

Wang, X., Hong, Y., Chang, H., Park, K., Langdale, G., Hu, J., & Zhu, H. (2019) Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs. 16th USENIX symposium on networked systems design and implementation (NSDI 19). pp. 631–648.

Wittler, R. (2023) General encoding of canonical k-mers. Peer Community Journal, 3.

Wood, D.E., Lu, J., & Langmead, B. (2019) Improved metagenomic analysis with Kraken 2. Genome Biology, 20, 1–13.

Wood, D.E. & Salzberg, S.L. (2014) Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15.

Xu, W., Hsu, P.-K., Moshiri, N., Yu, S., & Rosing, T. (2024) HyperGen: Compact and efficient genome sketching using hyperdimensional vectors. Bioinformatics, 40.

Yu, Y.W. & Weber, G.M. (2020) HyperMinHash: MinHash in LogLog space. IEEE Transactions on Knowledge and Data Engineering, 1–1.

Zakeri, M., Brown, N.K., Ahmed, O.Y., Gagie, T., & Langmead, B. (2024) Movi: A fast and cache-efficient full-text pangenome index. iScience, 27, 111464.

Zakeri, M., Brown, N.K., Gagie, T., & Langmead, B. (2026) Movi 2: Fast and space-efficient queries on pangenomes. Bioinformatics.

Zentgraf, J., Schmitz, J.E., & Rahmann, S. (2025) Cleanifier: Contamination removal from microbial sequences using spaced seeds of a human pangenome index. Bioinformatics, 42.

Zhang, H., Song, H., Xu, X., Chang, Q., Wang, M., Wei, Y., Yin, Z., Schmidt, B., & Liu, W. (2023) RabbitFX: Efficient framework for FASTA/Q file parsing on modern multi-core platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20, 2341–2348.

Zhao, X. (2019) BinDash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 35, 671–673.

Zheng, H., Kingsford, C., & Marçais, G. (2020) Improved design and analysis of practical minimizers. Bioinformatics, 36, i119–i127.

Zheng, H., Kingsford, C., & Marçais, G. (2021) Sequence-specific minimizers via polar sets. Bioinformatics, 37, i187–i195.

Zheng, H., Marçais, G., & Kingsford, C. (2023) Creating and using minimizer sketches in computational genomics. Journal of Computational Biology, 30, 1251–1276.

Zhou, D., Andersen, D.G., & Kaminsky, M. (2013) Space-efficient, high-performance rank and select structures on uncompressed bit sequences. Experimental algorithms. Springer Berlin Heidelberg, pp. 151–163.