References

Agret, C., Cazaux, B., & Limasset, A. (2021) Toward optimal fingerprint indexing for large scale genomics. bioRxiv.
Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Finimizers: Variable-length bounded-frequency minimizers for k-mer sets. IEEE Transactions on Computational Biology and Bioinformatics, 22, 899–910.
Alanko, J.N., Puglisi, S.J., & Vuohtoniemi, J. (2023) Small searchable κ-spectra via subset rank queries on the spectral Burrows-Wheeler transform. SIAM conference on applied and computational discrete algorithms (ACDA23). Society for Industrial and Applied Mathematics, pp. 225–236.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
Arm Limited (2026) Arm Architecture Reference Manual for A-profile architecture. Arm Limited. URL https://developer.arm.com/documentation/ddi0487/latest/.
Baire, A., Marijon, P., Andreace, F., & Peterlongo, P. (2024) Back to sequences: Find the origin of k-mers. Journal of Open Source Software, 9, 7066.
Balouek, D., Carpen Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Pérez, C., Quesnel, F., Rohr, C., & Sarzyniec, L. (2013) Adding virtualization capabilities to the Grid’5000 testbed. Cloud computing and services science, vol. 367, Communications in computer and information science (Ivanov, I.I., Sinderen, M. van, Leymann, F., & Shan, T. eds). Springer International Publishing, pp. 3–20.
Chikhi, R., Lemane, T., Loll-Krippleber, R., Montoliu-Nerin, M., Raffestin, B., Camargo, A.P., Miller, C.J., Fiamenghi, M.B., Agustinho, D.P., Majidian, S., Autric, G., Hugues, M., Lee, J., Faure, R., Curry, K.D., Moura de Sousa, J.A., Rocha, E.P.C., Koslicki, D., Medvedev, P., Gupta, P., Shen, J., Morales-Tapia, A., Sihuta, K., Roy, P.J., Brown, G.W., Edgar, R.C., Korobeynikov, A., Steinegger, M., Lareau, C.A., Peterlongo, P., & Babaian, A. (2024) Logan: Planetary-scale genome assembly surveys life’s diversity. bioRxiv.
Cock, P.J.A., Fields, C.J., Goto, N., Heuer, M.L., & Rice, P.M. (2009) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38, 1767–1771.
Cohen, J.D. (1997) Recursive hashing functions for n-grams. ACM Transactions on Information Systems, 15, 291–320.
Constantinides, B., Lees, J., & Crook, D.W. (2025) Deacon: Fast sequence filtering and contaminant depletion. bioRxiv.
Crochemore, M., Czumaj, A., Ga̧sieniec, L., Lecroq, T., Plandowski, W., & Rytter, W. (1999) Fast practical multi-pattern matching. Information Processing Letters, 71, 107–113.
Crosbie, N.D. (2025) Grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions. Journal of Open Source Software, 10, 8048.
Darvish, M., Seiler, E., Mehringer, S., Rahn, R., & Reinert, K. (2022) Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics, 38, 4100–4108.
David, Y., Alisha, A., Awais, A., Rajkumar, D., Dipayan, G., Muhammad, H., Maira, I., Eugene, I., Vishnukumar, K., Amnon, K., Manish, K., Ankur, L., Isuru, L., Lili, M., Colman, O., Joana, P., Ruben, P., Stephane, P., Nadim, R., Jeena, R., Iva, T., Marianna, V., Senthilnathan, V., Zahra, W., Peter, W., Tony, B., Guy, C., & Ugis, S. (2025) The European Nucleotide Archive in 2025. Nucleic Acids Research, 54, D120–D127.
Edgar, R. (2021) Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ, 9, e10805.
Edgar, R.C., Taylor, B., Lin, V., Altman, T., Barbera, P., Meleshko, D., Lohr, D., Novakovsky, G., Buchfink, B., Al-Shayeb, B., Banfield, J.F., Peña, M. de la, Korobeynikov, A., Chikhi, R., & Babaian, A. (2022) Petabase-scale sequence alignment catalyses viral discovery. Nature, 602, 142–147.
Ekim, B., Berger, B., & Chikhi, R. (2021) Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems, 12, 958–968.e6.
Faro, S. & Lecroq, T. (2013) The exact online string matching problem: A review of the most recent results. ACM Computing Surveys, 45, 1–42.
Gallant, A. (2024) Ripgrep. URL https://github.com/BurntSushi/ripgrep.
Gienieczko, M., Murlak, F., & Paperman, C. (2023) Supporting descendants in SIMD-accelerated JSONPath. Proceedings of the 28th ACM international conference on architectural support for programming languages and operating systems, volume 4, ASPLOS ’23. ACM, pp. 338–361.
Golan, S., Tziony, I., Kraus, M., Orenstein, Y., & Shur, A. (2025) GreedyMini: Generating low-density DNA minimizers. Bioinformatics, 41, i275–i284.
Groot Koerkamp, R. (2025) Optimal throughput bioinformatics (PhD thesis). URL https://www.research-collection.ethz.ch/handle/20.500.11850/783091.
Groot Koerkamp, R., Liu, D., & Pibiri, G.E. (2025) The open-closed mod-minimizer algorithm. Algorithms for Molecular Biology, 20.
Groot Koerkamp, R. & Martayan, I. (2025) SimdMinimizers: Computing Random Minimizers, fast. 23rd international symposium on experimental algorithms (SEA 2025), vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
Groot Koerkamp, R. & Pibiri, G.E. (2024) The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. 24th international workshop on algorithms in bioinformatics (WABI 2024), vol. 312, Leibniz international proceedings in informatics (LIPIcs) (Pissis, S.P. & Sung, W.-K. eds). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 11:1–11:23.
Hennessy, J.L., Patterson, D.A., & Kozyrakis, C. (2026) Computer architecture: A quantitative approach, Seventh edition. Cambridge, MA: Morgan Kaufmann Publishers. URL https://shop.elsevier.com/books/computer-architecture/hennessy/978-0-443-15406-5.
Hirzel, M., Schneider, S., & Tangwongsan, K. (2017) Sliding-window aggregation algorithms: Tutorial. Proceedings of the 11th ACM international conference on distributed and event-based systems, DEBS 2017, Barcelona, Spain, June 19-23, 2017. ACM, pp. 11–14.
Holley, G. & Melsted, P. (2020) Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biology, 21.
Homer, N., Stadick, S., Lambert, S., Stone, M., & Fennell, T. (2025) Fqgrep. URL https://doi.org/10.5281/zenodo.15034074.
Ingels, F., Martayan, I., Salson, M., & Marchet, C. (2024) Constrained enumeration of k-mers from a collection of references with metadata. bioRxiv.
Ingels, F., Robidou, L., Martayan, I., Marchet, C., & Limasset, A. (2025) Minimizer density revisited: Models and multiminimizers. bioRxiv.
Intel Corporation (2026) Intel® 64 and IA-32 Architectures Software Developer’s Manual. Intel Corporation. URL https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html.
Karasikov, M., Mustafa, H., Rätsch, G., & Kahles, A. (2022) Lossless indexing with counting de Bruijn graphs. Genome Research, 32, 1754–1764.
Karp, R.M. & Rabin, M.O. (1987) Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31, 249–260.
Kazemi, P., Wong, J., Nikolić, V., Mohamadi, H., Warren, R.L., & Birol, I. (2022) ntHash2: Recursive spaced seed hashing for nucleotide sequences. Bioinformatics, 38, 4812–4813.
Khan, J., Patro, R., & Pandey, P. (2026) Kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations. bioRxiv.
Langdale, G. & Lemire, D. (2019) Parsing gigabytes of JSON per second. The VLDB Journal, 28, 941–960.
Lemane, T., Lezzoche, N., Lecubin, J., Pelletier, E., Lescot, M., Chikhi, R., & Peterlongo, P. (2024) Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nature Computational Science, 4, 104–109.
Lemire, D. (2017) Removing duplicates from lists quickly. URL https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/.
Li, H. (2009) Kseq. URL https://github.com/attractivechaos/klib.
Li, H. (2018) Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.
Li, H. (2020) Biofast. URL https://github.com/lh3/biofast.
Lipman, D.J. & Pearson, W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227, 1435–1441.
Ma, B., Lu, C., Wang, Y., Yu, J., Zhao, K., Xue, R., Ren, H., Lv, X., Pan, R., Zhang, J., Zhu, Y., & Xu, J. (2023) A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nature Communications, 14.
Mäklin, T., Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Sequence alignment with k-bounded matching statistics. bioRxiv.
Marçais, G., Elder, C.S., & Kingsford, C. (2024) K-nonical space: Sketching with reverse complements. Bioinformatics, 40, btae629.
Marchet, C., Boucher, C., Puglisi, S.J., Medvedev, P., Salson, M., & Chikhi, R. (2020) Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31, 1–12.
Marchet, C. & Limasset, A. (2023) Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics, 39, i252–i259.
Martayan, I., Cazaux, B., Limasset, A., & Marchet, C. (2024) Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Bioinformatics.
Martayan, I., Lobet, L., Marchet, C., & Paperman, C. (2026) Helicase: Vectorized parsing and bitpacking of genomic sequences. bioRxiv.
Martayan, I., Robidou, L., Shibuya, Y., & Limasset, A. (2025) Hyper-k-mers: Efficient streaming k-mers representation. Research in computational molecular biology (RECOMB 2025). Springer Nature Switzerland.
Martayan, I., Vandamme, L., Constantinides, B., Cazaux, B., Paperman, C., & Limasset, A. (2025) Accelerating k-mer-based sequence filtering. bioRxiv.
McNaughton, R. & Papert, S.A. (1971) Counter-free automata. The MIT Press. URL https://dl.acm.org/doi/abs/10.5555/1097043.
Mohamadi, H., Chu, J., Vandervalk, B.P., & Birol, I. (2016) ntHash: Recursive nucleotide hashing. Bioinformatics, 32, 3492–3494.
Myers, G. (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46, 395–415.
Nayfach, S., Shi, Z.J., Seshadri, R., Pollard, K.S., & Kyrpides, N.C. (2019) New insights from uncultivated genomes of the global human gut microbiome. Nature, 568, 505–510.
Ndiaye, M., Prieto-Baños, S., Fitzgerald, L.M., Yazdizadeh Kharrazi, A., Oreshkov, S., Dessimoz, C., Sedlazeck, F.J., Glover, N., & Majidian, S. (2024) When less is more: Sketching with minimizers in genomics. Genome Biology, 25.
Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Gershman, A., Aganezov, S., Hoyt, S.J., Diekhans, M., Logsdon, G.A., Alonge, M., Antonarakis, S.E., Borchers, M., Bouffard, G.G., Brooks, S.Y., Caldas, G.V., Chen, N.-C., Cheng, H., Chin, C.-S., Chow, W., Lima, L.G. de, Dishuck, P.C., Durbin, R., Dvorkina, T., Fiddes, I.T., Formenti, G., Fulton, R.S., Fungtammasan, A., Garrison, E., Grady, P.G.S., Graves-Lindsay, T.A., Hall, I.M., Hansen, N.F., Hartley, G.A., Haukness, M., Howe, K., Hunkapiller, M.W., Jain, C., Jain, M., Jarvis, E.D., Kerpedjiev, P., Kirsche, M., Kolmogorov, M., Korlach, J., Kremitzki, M., Li, H., Maduro, V.V., Marschall, T., McCartney, A.M., McDaniel, J., Miller, D.E., Mullikin, J.C., Myers, E.W., Olson, N.D., Paten, B., Peluso, P., Pevzner, P.A., Porubsky, D., Potapova, T., Rogaev, E.I., Rosenfeld, J.A., Salzberg, S.L., Schneider, V.A., Sedlazeck, F.J., Shafin, K., Shew, C.J., Shumate, A., Sims, Y., Smit, A.F.A., Soto, D.C., Sović, I., Storer, J.M., Streets, A., Sullivan, B.A., Thibaud-Nissen, F., Torrance, J., Wagner, J., Walenz, B.P., Wenger, A., Wood, J.M.D., Xiao, C., Yan, S.M., Young, A.C., Zarate, S., Surti, U., McCoy, R.C., Dennis, M.Y., Alexandrov, I.A., Gerton, J.L., O’Neill, R.J., Timp, W., Zook, J.M., Schatz, M.C., Eichler, E.E., Miga, K.H., & Phillippy, A.M. (2022) The complete sequence of a human genome. Science, 376, 44–53.
One Codex (2019) Needletail. URL https://github.com/onecodex/needletail.
Pan, C. & Reinert, K. (2024) A simple refined DNA minimizer operator enables 2-fold faster computation. Bioinformatics, 40, btae045.
Paperman, C., Salvati, S., & Soyez-Martin, C. (2023) An algebraic approach to vectorial programs. LIPIcs, Volume 254, STACS 2023, vol. 254. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, pp. 51:1–51:23.
Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N., Hugenholtz, P., & Tyson, G.W. (2017) Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology, 2, 1533–1542.
Patro, R., Bharti, S., Singhania, P., Dhakal, R., Dahlstrom, T.J., & Groot Koerkamp, R. (2025) Mim: A lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing. bioRxiv.
Pearson, W.R. & Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85, 2444–2448.
Peleg, A., Wilkie, S., & Weiser, U. (1997) Intel MMX for multimedia PCs. Communications of the ACM, 40, 24–38.
Pellow, D., Pu, L., Ekim, B., Kotlar, L., Berger, B., Shamir, R., & Orenstein, Y. (2023) Efficient minimizer orders for large values of k using minimum decycling sets. Genome Research.
Pennisi, E. (2017) Biologists propose to sequence the DNA of all life on earth. Science.
Pibiri, G.E. (2022) Sparse and skew hashing of k-mers. Bioinformatics, 38, i185–i194.
Rahman Hera, M. & Koslicki, D. (2025) Estimating similarity and distance using FracMinHash. Algorithms for Molecular Biology, 20.
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., & Yorke, J.A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2023) Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching. 23rd international workshop on algorithms in bioinformatics (WABI 2023), vol. 273. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2025) Fractional hitting sets for efficient multiset sketching. Algorithms for Molecular Biology, 20, 1.
Russell, R.M. (1978) The CRAY-1 computer system. Communications of the ACM, 21, 63–72.
Sahlin, K. (2021) Effective sequence similarity detection with strobemers. Genome Research, 31, 2080–2094.
Sahlin, K. (2022) Strobealign: Flexible seed size enables ultra-fast and accurate read alignment. Genome Biology, 23.
Sayers, E.W., Cavanaugh, M., Clark, K., Pruitt, K.D., Sherry, S.T., Yankie, L., & Karsch-Mizrachi, I. (2023) GenBank 2024 update. Nucleic Acids Research, 52, D134–D137.
Schartl, M., Woltering, J.M., Irisarri, I., Du, K., Kneitz, S., Pippel, M., Brown, T., Franchini, P., Li, J., Li, M., Adolfi, M., Winkler, S., Freitas Sousa, J. de, Chen, Z., Jacinto, S., Kvon, E.Z., Correa de Oliveira, L.R., Monteiro, E., Baia Amaral, D., Burmester, T., Chalopin, D., Suh, A., Myers, E., Simakov, O., Schneider, I., & Meyer, A. (2024) The genomes of all lungfish inform on genome expansion and tetrapod evolution. Nature, 634, 96–103.
Schleimer, S., Wilkerson, D.S., & Aiken, A. (2003) Winnowing: Local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD ’03. New York, NY, USA: Association for Computing Machinery, pp. 76–85.
Serre, O. (2004) Vectorial languages and linear temporal logic. Theoretical Computer Science, 310, 79–116.
Shaw, J. & Yu, Y.W. (2021) Theory of local k-mer selection with applications to long-read alignment. Bioinformatics, 38, 4659–4669.
Shen, W., Le, S., Li, Y., & Hu, F. (2016) SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PLOS ONE, 11, e0163962.
Shen, W., Sipos, B., & Zhao, L. (2024) SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta, 3.
Shur, A., Tziony, I., & Orenstein, Y. (2026) 10-minimizers: A promising class of constant-space minimizers. bioRxiv.
Smith, C., Martayan, I., Limasset, A., & Dufresne, Y. (2024) Brisk: Exact resource-efficient dictionary for k-mers. bioRxiv.
Soyez-Martin, C. (2023) From semigroup theory to vectorization: Recognizing regular languages (PhD thesis). URL https://hal.archives-ouvertes.fr/tel-04417087.
Teyssier, N. (2025) Paraseq. URL https://github.com/noamteyssier/paraseq.
Teyssier, N. & Dobin, A. (2025) BINSEQ: A family of high-performance binary formats for nucleotide sequences. bioRxiv.
Theodorakis, G., Koliousis, A., Pietzuch, P.R., & Pirk, H. (2018) Hammer slide: Work- and CPU-efficient streaming window aggregation. International workshop on accelerating analytics and data management systems using modern processor and storage architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018 (Bordawekar, R. & Lahiri, T. eds). pp. 34–41.
Valve Corporation (2026) Steam Hardware & Software Survey: March 2026. URL https://store.steampowered.com/hwsurvey/.
Vandamme, L., Cazaux, B., & Limasset, A. (2025) K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets. Bioinformatics Advances, vbaf111.
Wang, X., Hong, Y., Chang, H., Park, K., Langdale, G., Hu, J., & Zhu, H. (2019) Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs. 16th USENIX symposium on networked systems design and implementation (NSDI 19). pp. 631–648.
Wood, D.E., Lu, J., & Langmead, B. (2019) Improved metagenomic analysis with Kraken 2. Genome Biology, 20, 1–13.
Zentgraf, J., Schmitz, J.E., & Rahmann, S. (2025) Cleanifier: Contamination removal from microbial sequences using spaced seeds of a human pangenome index. Bioinformatics, 42.
Zhang, H., Song, H., Xu, X., Chang, Q., Wang, M., Wei, Y., Yin, Z., Schmidt, B., & Liu, W. (2023) RabbitFX: Efficient framework for FASTA/q file parsing on modern multi-core platforms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20, 2341–2348.
Zielezinski, A., Vinga, S., Almeida, J., & Karlowski, W.M. (2017) Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biology, 18.