References
Agret, C., Cazaux, B., & Limasset, A. (2021) Toward optimal
fingerprint indexing for large scale genomics. bioRxiv.
Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Finimizers:
Variable-length bounded-frequency minimizers for k-mer sets.
IEEE Transactions on Computational Biology and Bioinformatics,
22, 899–910.
Alanko, J.N., Puglisi, S.J., & Vuohtoniemi, J. (2023) Small searchable
κ-spectra via subset rank queries on the spectral burrows-wheeler
transform. SIAM conference on applied and computational discrete
algorithms (ACDA23). Society for Industrial; Applied Mathematics,
pp. 225–236.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J. (1990) Basic local
alignment search tool. Journal of Molecular Biology,
215, 403–410.
Arm Limited (2026) Arm Architecture Reference Manual for A-profile
architecture. Arm Limited. URL https://developer.arm.com/documentation/ddi0487/latest/.
Baire, A., Marijon, P., Andreace, F., & Peterlongo, P. (2024) Back to sequences: Find the
origin of k-mers. Journal of Open Source Software,
9, 7066.
Balouek, D., Carpen Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Pérez, C., Quesnel, F., Rohr, C., & Sarzyniec, L. (2013) Adding
virtualization capabilities to the Grid’5000 testbed.
Cloud computing and services science, vol. 367,
Communications in computer and information science (Ivanov,
I.I., Sinderen, M. van, Leymann, F., & Shan, T. eds). Springer
International Publishing, pp. 3–20.
Chikhi, R., Lemane, T., Loll-Krippleber, R., Montoliu-Nerin, M., Raffestin, B., Camargo, A.P., Miller, C.J., Fiamenghi, M.B., Agustinho, D.P., Majidian, S., Autric, G., Hugues, M., Lee,
J., Faure, R., Curry, K.D., Moura de
Sousa, J.A., Rocha, E.P.C., Koslicki, D., Medvedev, P., Gupta, P., Shen,
J., Morales-Tapia, A., Sihuta, K., Roy,
P.J., Brown, G.W., Edgar, R.C., Korobeynikov, A., Steinegger, M., Lareau, C.A., Peterlongo, P., & Babaian, A. (2024) Logan: Planetary-scale
genome assembly surveys life’s diversity. bioRxiv.
Cock, P.J.A., Fields, C.J., Goto, N., Heuer,
M.L., & Rice, P.M. (2009) The sanger FASTQ file format
for sequences with quality scores, and the solexa/illumina FASTQ
variants. Nucleic Acids Research, 38,
1767–1771.
Cohen, J.D. (1997) Recursive hashing functions
for n-grams. ACM Transactions on Information Systems,
15, 291–320.
Constantinides, B., Lees, J., & Crook, D.W. (2025) Deacon: Fast sequence
filtering and contaminant depletion. bioRxiv.
Crochemore, M., Czumaj, A., Ga̧sieniec, L., Lecroq, T., Plandowski, W., & Rytter, W. (1999) Fast practical
multi-pattern matching. Information Processing Letters,
71, 107–113.
Crosbie, N.D. (2025) Grepq: A rust application
that quickly filters FASTQ files by matching sequences to a set of
regular expressions. Journal of Open Source Software,
10, 8048.
Darvish, M., Seiler, E., Mehringer, S., Rahn, R., & Reinert, K. (2022) Needle: A fast and
space-efficient prefilter for estimating the quantification of very
large collections of expression experiments.
Bioinformatics, 38, 4100–4108.
David, Y., Alisha, A., Awais, A., Rajkumar, D., Dipayan, G., Muhammad, H., Maira, I., Eugene, I., Vishnukumar, K., Amnon, K., Manish, K., Ankur, L., Isuru, L., Lili,
M., Colman, O., Joana, P., Ruben, P., Stephane, P., Nadim, R., Jeena, R., Iva,
T., Marianna, V., Senthilnathan, V., Zahra, W., Peter, W., Tony,
B., Guy, C., & Ugis, S. (2025) The european nucleotide
archive in 2025. Nucleic Acids Research,
54, D120–D127.
Edgar, R. (2021) Syncmers are more sensitive
than minimizers for selecting conserved k-mers in biological
sequences. PeerJ, 9, e10805.
Edgar, R.C., Taylor, B., Lin,
V., Altman, T., Barbera, P., Meleshko, D., Lohr, D., Novakovsky, G., Buchfink, B., Al-Shayeb, B., Banfield, J.F., Peña, M. de la, Korobeynikov, A., Chikhi, R., & Babaian, A. (2022) Petabase-scale
sequence alignment catalyses viral discovery. Nature,
602, 142–147.
Ekim, B., Berger, B., & Chikhi, R. (2021) Minimizer-space de
bruijn graphs: Whole-genome assembly of long reads in minutes on a
personal computer. Cell Systems, 12,
958–968.e6.
Faro, S. & Lecroq, T. (2013) The exact online string
matching problem: A review of the most recent results. ACM
Computing Surveys, 45, 1–42.
Gallant, A. (2024) Ripgrep. URL https://github.com/BurntSushi/ripgrep.
Gienieczko, M., Murlak, F., & Paperman, C. (2023) Supporting descendants in
SIMD-accelerated JSONPath. Proceedings of the 28th ACM
international conference on architectural support for programming
languages and operating systems, volume 4, ASPLOS ’23.
ACM, pp. 338–361.
Golan, S., Tziony, I., Kraus, M., Orenstein, Y., & Shur, A. (2025) GreedyMini:
Generating low-density DNA minimizers. Bioinformatics,
41, i275–i284.
Groot Koerkamp, R. (2025) Optimal
throughput bioinformatics (PhD thesis). URL https://www.research-collection.ethz.ch/handle/20.500.11850/783091.
Groot Koerkamp, R., Liu, D., & Pibiri, G.E. (2025) The open-closed
mod-minimizer algorithm. Algorithms for Molecular Biology,
20.
Groot Koerkamp, R. & Martayan, I. (2025) SimdMinimizers: Computing Random Minimizers,
fast. 23rd international symposium on experimental
algorithms (SEA 2025), vol. 338. Schloss Dagstuhl – Leibniz-Zentrum
für Informatik.
Groot Koerkamp, R. & Pibiri, G.E. (2024) The mod-minimizer: A Simple and Efficient Sampling
Algorithm for Long k-mers. 24th international workshop on
algorithms in bioinformatics (WABI 2024), vol. 312, Leibniz
international proceedings in informatics (LIPIcs) (Pissis, S.P.
& Sung, W.-K. eds). Dagstuhl, Germany: Schloss Dagstuhl –
Leibniz-Zentrum für Informatik, pp. 11:1–11:23.
Hennessy, J.L., Patterson, D.A., & Kozyrakis, C. (2026) Computer architecture:
A quantitative approach, Seventh edition eds. Cambridge, MA: Morgan
Kaufmann Publishers. URL https://shop.elsevier.com/books/computer-architecture/hennessy/978-0-443-15406-5.
Hirzel, M., Schneider, S., & Tangwongsan, K. (2017) Sliding-window
aggregation algorithms: tutorial. Proceedings of the 11th
ACM international conference on distributed and event-based
systems, DEBS 2017, barcelona, spain, june 19-23,
2017. ACM, pp. 11–14.
Holley, G. & Melsted, P. (2020) Bifrost: Highly
parallel construction and indexing of colored and compacted de bruijn
graphs. Genome Biology, 21.
Homer, N., Stadick, S., Lambert, S., Stone, M., & Fennell, T. (2025) Fqgrep. URL https://doi.org/10.5281/zenodo.15034074.
Ingels, F., Martayan, I., Salson, M., & Marchet, C. (2024) Constrained enumeration
of k-mers from a collection of references with metadata.
bioRxiv.
Ingels, F., Robidou, L., Martayan, I., Marchet, C., & Limasset, A. (2025) Minimizer density
revisited: Models and multiminimizers. bioRxiv.
Intel Corporation (2026) Intel® 64 and IA-32 Architectures Software
Developer’s Manual. Intel Corporation. URL https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html.
Karasikov, M., Mustafa, H., Rätsch, G., & Kahles, A. (2022) Lossless indexing with
counting de bruijn graphs. Genome Research,
32, 1754–1764.
Karp, R.M. & Rabin, M.O. (1987) Efficient randomized
pattern-matching algorithms. IBM Journal of Research and
Development, 31, 249–260.
Kazemi, P., Wong, J., Nikolić, V., Mohamadi, H., Warren, R.L., & Birol, I. (2022) ntHash2: Recursive
spaced seed hashing for nucleotide sequences.
Bioinformatics, 38, 4812–4813.
Khan, J., Patro, R., & Pandey, P. (2026) Kache-hash: A dynamic,
concurrent, and cache-efficient hash table for streaming k-mer
operations. bioRxiv.
Langdale, G. & Lemire, D. (2019) Parsing gigabytes of
JSON per second. The VLDB Journal, 28,
941–960.
Lemane, T., Lezzoche, N., Lecubin, J., Pelletier, E., Lescot, M., Chikhi, R., & Peterlongo, P. (2024) Indexing and real-time
user-friendly queries in terabyte-sized complex genomic datasets with
kmindex and ORA. Nature Computational Science,
4, 104–109.
Lemire, D. (2017) Removing duplicates from lists quickly. URL https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/.
Li, H. (2009) Kseq. URL https://github.com/attractivechaos/klib.
Li, H. (2018) Minimap2: Pairwise
alignment for nucleotide sequences. Bioinformatics,
34, 3094–3100.
Li, H. (2020) Biofast. URL https://github.com/lh3/biofast.
Lipman, D.J. & Pearson, W.R. (1985) Rapid and sensitive
protein similarity searches. Science, 227,
1435–1441.
Ma, B., Lu, C., Wang,
Y., Yu, J., Zhao, K., Xue,
R., Ren, H., Lv, X., Pan, R.,
Zhang, J., Zhu, Y., & Xu, J. (2023) A genomic catalogue of
soil microbiomes boosts mining of biodiversity and genetic
resources. Nature Communications, 14.
Mäklin, T., Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Sequence alignment with
k-bounded matching statistics. bioRxiv.
Marçais, G., Elder, C.S., & Kingsford, C. (2024) K-nonical space:
Sketching with reverse complements. Bioinformatics,
40, btae629.
Marchet, C., Boucher, C., Puglisi, S.J., Medvedev, P., Salson, M., & Chikhi, R. (2020) Data structures based on
k-mers for querying large collections of sequencing data sets.
Genome Research, 31, 1–12.
Marchet, C. & Limasset, A. (2023) Scalable sequence
database search using partitioned aggregated bloom comb trees.
Bioinformatics, 39, i252–i259.
Martayan, I., Cazaux, B., Limasset, A., & Marchet, C. (2024) Conway-Bromage-Lyndon (CBL): an exact, dynamic
representation of k-mer sets. Bioinformatics.
Martayan, I., Lobet, L., Marchet, C., & Paperman, C. (2026) Helicase: Vectorized
parsing and bitpacking of genomic sequences. bioRxiv.
Martayan, I., Robidou, L., Shibuya, Y., & Limasset, A. (2025) Hyper-k-mers:
Efficient streaming k-mers representation. Research in
computational molecular biology (RECOMB 2025). Springer Nature
Switzerland.
Martayan, I., Vandamme, L., Constantinides, B., Cazaux, B., Paperman, C., & Limasset, A. (2025) Accelerating
k-mer-based sequence filtering. bioRxiv.
McNaughton, R. & Papert, S.A. (1971) Counter-free automata. The
MIT Press. URL https://dl.acm.org/doi/abs/10.5555/1097043.
Mohamadi, H., Chu, J., Vandervalk, B.P., & Birol, I. (2016) ntHash: Recursive
nucleotide hashing. Bioinformatics, 32,
3492–3494.
Myers, G. (1999) A fast bit-vector algorithm
for approximate string matching based on dynamic programming.
Journal of the ACM, 46, 395–415.
Nayfach, S., Shi, Z.J., Seshadri, R., Pollard, K.S., & Kyrpides, N.C. (2019) New insights from
uncultivated genomes of the global human gut microbiome.
Nature, 568, 505–510.
Ndiaye, M., Prieto-Baños, S., Fitzgerald, L.M., Yazdizadeh Kharrazi, A., Oreshkov, S., Dessimoz, C., Sedlazeck, F.J., Glover, N., & Majidian, S. (2024) When less is more:
Sketching with minimizers in genomics. Genome Biology,
25.
Nurk, S., Koren, S., Rhie,
A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Gershman, A., Aganezov, S., Hoyt, S.J., Diekhans, M., Logsdon, G.A., Alonge, M., Antonarakis, S.E., Borchers, M., Bouffard, G.G., Brooks, S.Y., Caldas, G.V., Chen, N.-C., Cheng, H., Chin,
C.-S., Chow, W., Lima, L.G. de, Dishuck, P.C., Durbin, R., Dvorkina, T., Fiddes, I.T., Formenti, G., Fulton, R.S., Fungtammasan, A., Garrison, E., Grady, P.G.S., Graves-Lindsay, T.A., Hall, I.M., Hansen, N.F., Hartley, G.A., Haukness, M., Howe, K., Hunkapiller, M.W., Jain, C., Jain,
M., Jarvis, E.D., Kerpedjiev, P., Kirsche, M., Kolmogorov, M., Korlach, J., Kremitzki, M., Li, H., Maduro,
V.V., Marschall, T., McCartney, A.M., McDaniel, J., Miller, D.E., Mullikin, J.C., Myers, E.W., Olson, N.D., Paten, B., Peluso, P., Pevzner, P.A., Porubsky, D., Potapova, T., Rogaev, E.I., Rosenfeld, J.A., Salzberg, S.L., Schneider, V.A., Sedlazeck, F.J., Shafin, K., Shew, C.J., Shumate, A., Sims, Y., Smit,
A.F.A., Soto, D.C., Sović, I., Storer, J.M., Streets, A., Sullivan, B.A., Thibaud-Nissen, F., Torrance, J., Wagner, J., Walenz, B.P., Wenger, A., Wood, J.M.D., Xiao, C., Yan,
S.M., Young, A.C., Zarate, S., Surti, U., McCoy, R.C., Dennis, M.Y., Alexandrov, I.A., Gerton, J.L., O’Neill, R.J., Timp, W., Zook,
J.M., Schatz, M.C., Eichler, E.E., Miga, K.H., & Phillippy, A.M. (2022) The complete sequence of
a human genome. Science, 376, 44–53.
One Codex (2019) Needletail. URL https://github.com/onecodex/needletail.
Pan, C. & Reinert, K. (2024) A simple refined
DNA minimizer operator enables 2-fold faster computation.
Bioinformatics, 40, btae045.
Paperman, C., Salvati, S., & Soyez-Martin, C. (2023) An algebraic
approach to vectorial programs. LIPIcs, Volume 254, STACS
2023, vol. 254. Schloss Dagstuhl - Leibniz-Zentrum für Informatik,
pp. 51:1–51:23.
Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N., Hugenholtz, P., & Tyson, G.W. (2017) Recovery of nearly 8,
000 metagenome-assembled genomes substantially expands the tree of
life. Nature Microbiology, 2, 1533–1542.
Patro, R., Bharti, S., Singhania, P., Dhakal, R., Dahlstrom, T.J., & Groot Koerkamp, R. (2025) Mim: A lightweight
auxiliary index to enable fast, parallel, gzipped FASTQ parsing.
bioRxiv.
Pearson, W.R. & Lipman, D.J. (1988) Improved tools for
biological sequence comparison. Proceedings of the National
Academy of Sciences, 85, 2444–2448.
Peleg, A., Wilkie, S., & Weiser, U. (1997) Intel MMX for multimedia
PCs. Communications of the ACM, 40, 24–38.
Pellow, D., Pu, L., Ekim,
B., Kotlar, L., Berger, B., Shamir, R., & Orenstein, Y. (2023) Efficient minimizer orders
for large values of k using minimum decycling sets. Genome
Research.
Pennisi, E. (2017) Biologists propose to
sequence the DNA of all life on earth. Science.
Pibiri, G.E. (2022) Sparse and skew
hashing of k-mers. Bioinformatics, 38,
i185–i194.
Rahman Hera, M. & Koslicki, D. (2025) Estimating similarity
and distance using FracMinHash. Algorithms for Molecular
Biology, 20.
Roberts, M., Hayes, W., Hunt,
B.R., Mount, S.M., & Yorke, J.A. (2004) Reducing storage
requirements for biological sequence comparison.
Bioinformatics, 20, 3363–3369.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2023) Fractional Hitting Sets for Efficient and Lightweight
Genomic Data Sketching. 23rd international workshop on
algorithms in bioinformatics (WABI 2023), vol. 273. Schloss
Dagstuhl – Leibniz-Zentrum für Informatik.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2025) Fractional hitting
sets for efficient multiset sketching. Algorithms for Molecular
Biology, 20, 1.
Sahlin, K. (2021) Effective sequence
similarity detection with strobemers. Genome Research,
31, 2080–2094.
Sahlin, K. (2022) Strobealign: Flexible
seed size enables ultra-fast and accurate read alignment. Genome
Biology, 23.
Sayers, E.W., Cavanaugh, M., Clark, K., Pruitt, K.D., Sherry, S.T., Yankie, L., & Karsch-Mizrachi, I. (2023) GenBank 2024 update.
Nucleic Acids Research, 52, D134–D137.
Schartl, M., Woltering, J.M., Irisarri, I., Du, K., Kneitz,
S., Pippel, M., Brown, T., Franchini, P., Li, J., Li, M.,
Adolfi, M., Winkler, S., Freitas
Sousa, J. de, Chen, Z., Jacinto, S., Kvon, E.Z., Correa de
Oliveira, L.R., Monteiro, E.,
Baia Amaral, D., Burmester, T., Chalopin, D., Suh, A., Myers,
E., Simakov, O., Schneider, I., & Meyer, A. (2024) The genomes of all
lungfish inform on genome expansion and tetrapod evolution.
Nature, 634, 96–103.
Schleimer, S., Wilkerson, D.S., & Aiken, A. (2003) Winnowing: Local algorithms
for document fingerprinting. Proceedings of the 2003
ACM SIGMOD international conference on
Management of data, SIGMOD ’03.
New York, NY, USA: Association for Computing Machinery, pp. 76–85.
Serre, O. (2004) Vectorial languages
and linear temporal logic. Theor. Comput. Sci.,
310, 79–116.
Shaw, J. & Yu, Y.W. (2021) Theory of local
k-mer selection with applications to long-read alignment.
Bioinformatics, 38, 4659–4669.
Shen, W., Le, S., Li, Y.,
& Hu, F. (2016) SeqKit: A
cross-platform and ultrafast toolkit for FASTA/q file manipulation.
PLOS ONE, 11, e0163962.
Shen, W., Sipos, B., & Zhao, L. (2024) SeqKit2: A swiss army knife for
sequence and alignment processing. iMeta,
3.
Shur, A., Tziony, I., & Orenstein, Y. (2026) 10-minimizers: A
promising class of constant-space minimizers. bioRxiv.
Smith, C., Martayan, I., Limasset, A., & Dufresne, Y. (2024) Brisk: Exact
resource-efficient dictionary for k-mers. bioRxiv.
Soyez-Martin, C. (2023) From semigroup
theory to vectorization: Recognizing regular languages. (PhD thesis).
URL https://hal.archives-ouvertes.fr/tel-04417087.
Teyssier, N. (2025) Paraseq. URL https://github.com/noamteyssier/paraseq.
Teyssier, N. & Dobin, A. (2025) BINSEQ: A family of
high-performance binary formats for nucleotide sequences.
bioRxiv.
Theodorakis, G., Koliousis, A., Pietzuch, P.R., & Pirk, H. (2018) Hammer
slide: Work- and CPU-efficient streaming window aggregation.
International workshop on accelerating analytics and data management
systems using modern processor and storage architectures, ADMS@VLDB
2018, rio de janeiro, brazil, august 27, 2018 (Bordawekar, R. &
Lahiri, T. eds). pp. 34–41.
Valve Corporation (2026) Steam
Hardware & Software Survey: March 2026. URL https://store.steampowered.com/hwsurvey/.
Vandamme, L., Cazaux, B., & Limasset, A. (2025) K2R:
Tinted de Bruijn Graphs implementation for efficient read extraction
from sequencing datasets. Bioinformatics Advances,
vbaf111.
Wang, X., Hong, Y., Chang,
H., Park, K., Langdale, G., Hu, J., & Zhu, H. (2019) Hyperscan: A Fast Multi-pattern Regex Matcher for Modern
CPUs. 16th USENIX symposium on networked systems design
and implementation (NSDI 19). pp. 631–648.
Wood, D.E., Lu, J., & Langmead, B. (2019) Improved metagenomic
analysis with Kraken 2. Genome biology,
20, 1–13.
Zentgraf, J., Schmitz, J.E., & Rahmann, S. (2025) Cleanifier:
Contamination removal from microbial sequences using spaced seeds of a
human pangenome index. Bioinformatics, 42.
Zhang, H., Song, H., Xu,
X., Chang, Q., Wang, M., Wei,
Y., Yin, Z., Schmidt, B., & Liu, W. (2023) RabbitFX: Efficient
framework for FASTA/q file parsing on modern multi-core platforms.
IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 20, 2341–2348.
Zielezinski, A., Vinga, S., Almeida, J., & Karlowski, W.M. (2017) Alignment-free sequence
comparison: Benefits, applications, and tools. Genome
Biology, 18.