References
Abrar, Md.H. & Medvedev, P. (2024) PLA-index: A k-mer
index exploiting rank curve linearity. LIPIcs, Volume 312, WABI
2024, vol. 312. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,
pp. 13:1–13:18.
Alanko, J., Alipanahi, B., Settle, J., Boucher, C., & Gagie, T. (2021) Buffering updates
enables efficient dynamic de Bruijn graphs. Computational and
Structural Biotechnology Journal, 19, 4067–4078.
Alanko, J.N., Biagi, E., & Puglisi, S.J. (2023) Longest common prefix
arrays for succinct k-spectra. String processing and information
retrieval. Springer Nature Switzerland, pp. 1–13.
Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Finimizers:
Variable-length bounded-frequency minimizers for k-mer sets.
IEEE Transactions on Computational Biology and Bioinformatics,
22, 899–910.
Alanko, J.N., Depuydt, L., Marchet, C., & Puglisi, S.J. (2026) Fast set operations
for compact k-mer sets. bioRxiv.
Alanko, J.N., Puglisi, S.J., & Vuohtoniemi, J. (2023) Small Searchable κ-Spectra via Subset Rank Queries on the
Spectral Burrows-Wheeler Transform. SIAM conference on
applied and computational discrete algorithms (ACDA23). Society for
Industrial and Applied Mathematics, pp. 225–236.
Arm Limited (2026) Arm Architecture Reference Manual for A-profile
architecture. Arm Limited. URL https://developer.arm.com/documentation/ddi0487/latest/.
Atkinson, K.E. (2008) An introduction
to numerical analysis, 2nd ed. New York, NY: John Wiley & Sons.
Ayad, L.A.K., Fici, G., Groot
Koerkamp, R., Loukides, G., Patro, R., Pibiri, G.E., & Pissis, S.P. (2025) U-index: A universal
indexing framework for matching long patterns. LIPIcs, Volume
338, SEA 2025, vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für
Informatik, pp. 4:1–4:18.
Azar, Y., Broder, A.Z., Karlin, A.R., & Upfal, E. (1994) Balanced allocations
(extended abstract). Proceedings of the twenty-sixth annual ACM
symposium on theory of computing - STOC ’94, STOC ’94. ACM
Press, pp. 593–602.
Baeza-Yates, R. & Salinger, A. (2010) Fast intersection
algorithms for sorted sequences. Algorithms and
applications. Springer Berlin Heidelberg, pp. 45–61.
Baire, A., Marijon, P., Andreace, F., & Peterlongo, P. (2024) Back to sequences: Find the
origin of k-mers. Journal of Open Source Software,
9, 7066.
Baker, D.N. & Langmead, B. (2019) Dashing: Fast and
accurate genomic distances with HyperLogLog. Genome
Biology, 20.
Baker, D.N. & Langmead, B. (2023) Genomic sketching with
multiplicities and locality-sensitive hashing using Dashing 2.
Genome Research. Cold Spring Harbor Laboratory, p.
gr.277655.123.
Balouek, D., Carpen Amarie, A., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Pérez, C., Quesnel, F., Rohr, C., & Sarzyniec, L. (2013) Adding
virtualization capabilities to the Grid’5000 testbed.
Cloud computing and services science, vol. 367,
Communications in computer and information science (Ivanov,
I.I., Sinderen, M. van, Leymann, F., & Shan, T. eds). Springer
International Publishing, pp. 3–20.
Bankevich, A., Bzikadze, A.V., Kolmogorov, M., Antipov, D., & Pevzner, P.A. (2022) Multiplex de Bruijn
graphs enable genome assembly from long, high-fidelity reads.
Nature Biotechnology, 40, 1075–1081.
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., & Pevzner, P.A. (2012) SPAdes: A new genome
assembly algorithm and its applications to single-cell sequencing.
Journal of Computational Biology, 19, 455–477.
Benoit, G., Peterlongo, P., Mariadassou, M., Drezen, E., Schbath, S., Lavenier, D., & Lemaitre, C. (2016) Multiple comparative
metagenomics using multiset k-mer counting. PeerJ Computer
Science, 2, e94.
Benoit, G., Raguideau, S., James, R., Phillippy, A.M., Chikhi, R., & Quince, C. (2024) High-quality
metagenome assembly from long accurate reads with metaMDBG.
Nature Biotechnology, 42, 1378–1383.
Bentley, J.L. & Yao, A.C.-C. (1976) An almost optimal
algorithm for unbounded searching. Information Processing
Letters, 5, 82–87.
Bille, P., Christiansen, A.R., Ettienne, M.B., & Gørtz, I.L. (2017) Fast dynamic
arrays. LIPIcs, Volume 87, ESA 2017, vol. 87. Schloss
Dagstuhl – Leibniz-Zentrum für Informatik, pp. 16:1–16:13.
Bowe, A., Onodera, T., Sadakane, K., & Shibuya, T. (2012) Succinct de Bruijn
graphs. Algorithms in bioinformatics. Springer Berlin
Heidelberg, pp. 225–235.
Bradley, P., Bakker, H.C. den, Rocha, E.P.C., McVean, G., & Iqbal, Z. (2019) Ultrafast search of all
deposited bacterial and viral genomic data. Nature
Biotechnology, 37, 152–159.
Břinda, K., Baym, M., & Kucherov, G. (2021) Simplitigs as an
efficient and scalable representation of de Bruijn graphs.
Genome Biology, 22.
Broder, A.Z. (1997) On the resemblance and
containment of documents. Proceedings. Compression and
complexity of SEQUENCES 1997 (cat. no.97TB100171),
SEQUEN-97. IEEE Comput. Soc, pp. 21–29.
Burrows, M. & Wheeler, D.J. (1994) A block-sorting lossless
data compression algorithm (Technical Report No. 124). Digital Equipment
Corporation.
Campanelli, A., Pibiri, G.E., Fan, J., & Patro, R. (2024) Where the patterns are:
Repetition-aware compression for colored de Bruijn graphs.
Journal of Computational Biology, 31,
1022–1044.
Chen, K., Li, X., Shi, Q.,
Shao, M., & Medvedev, P. (2026) Hash functions in
nucleotide sequence analysis. Genome Research.
Chen, K., Pattar, V., & Shao, M. (2025) Sequence similarity
estimation by random subsequence sketching. LIPIcs, Volume 344,
WABI 2025, vol. 344. Schloss Dagstuhl – Leibniz-Zentrum für
Informatik, pp. 7:1–7:17.
Chikhi, R., Lemane, T., Loll-Krippleber, R., Montoliu-Nerin, M., Raffestin, B., Camargo, A.P., Miller, C.J., Fiamenghi, M.B., Agustinho, D.P., Majidian, S., Autric, G., Hugues, M., Lee,
J., Faure, R., Curry, K.D., Moura de
Sousa, J.A., Rocha, E.P.C., Koslicki, D., Medvedev, P., Gupta, P., Shen,
J., Morales-Tapia, A., Sihuta, K., Roy,
P.J., Brown, G.W., Edgar, R.C., Korobeynikov, A., Steinegger, M., Lareau, C.A., Peterlongo, P., & Babaian, A. (2024) Logan: Planetary-scale
genome assembly surveys life’s diversity. bioRxiv.
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., & Medvedev, P. (2015) On the representation of de
Bruijn graphs. Journal of Computational Biology,
22, 336–352.
Chikhi, R., Limasset, A., & Medvedev, P. (2016) Compacting de
Bruijn graphs from sequencing data quickly and in low memory.
Bioinformatics, 32, i201–i208.
Cohen, J.D. (1997) Recursive hashing functions
for n-grams. ACM Transactions on Information Systems,
15, 291–320.
Constantinides, B., Lees, J., & Crook, D.W. (2025) Deacon: Fast sequence
filtering and contaminant depletion. bioRxiv.
Cracco, A. & Tomescu, A.I. (2023) Extremely fast construction
and querying of compacted and colored de Bruijn graphs with GGCAT.
Genome Research.
Crochemore, M., Czumaj, A., Ga̧sieniec, L., Lecroq, T., Plandowski, W., & Rytter, W. (1999) Fast practical
multi-pattern matching. Information Processing Letters,
71, 107–113.
Crosbie, N.D. (2025) Grepq: A Rust application
that quickly filters FASTQ files by matching sequences to a set of
regular expressions. Journal of Open Source Software,
10, 8048.
Darvish, M., Seiler, E., Mehringer, S., Rahn, R., & Reinert, K. (2022) Needle: A fast and
space-efficient prefilter for estimating the quantification of very
large collections of expression experiments.
Bioinformatics, 38, 4100–4108.
Degardins, B., Paperman, C., & Marchet, C. (2025) Vizitig: A pangenome
and pantranscriptome explorer. bioRxiv.
Demaine, E.D., López-Ortiz, A., & Munro, J.I. (2000) Adaptive set
intersections, unions, and differences. Proceedings of the
eleventh annual ACM-SIAM symposium on discrete algorithms, SODA
’00. USA: Society for Industrial and Applied Mathematics, pp. 743–752.
Deorowicz, S., Debudaj-Grabysz, A., & Grabowski, S. (2013) Disk-based k-mer
counting on a PC. BMC Bioinformatics, 14.
Deorowicz, S., Kokot, M., Grabowski, S., & Debudaj-Grabysz, A. (2015) KMC 2: fast and resource-frugal k-mer
counting. Bioinformatics, 31,
1569–1576.
Dı́az-Domı́nguez, D., Leinonen, M., & Salmela, L. (2024) Space-efficient
computation of k-mer dictionaries for large values of k.
Algorithms for Molecular Biology, 19, 14.
Donges, S., Puglisi, S.J., & Raman, R. (2022) On dynamic bitvector
implementations. 2022 data compression conference (DCC).
IEEE, pp. 252–261.
Dufresne, Y., Guillemot, V., & Dreo, J. (2024) Optimization
of reversible hash functions for k-mer data structures. SeqBIM
2024 workshop. Rennes, France.
Edgar, R. (2021) Syncmers are more sensitive
than minimizers for selecting conserved k-mers in biological
sequences. PeerJ, 9, e10805.
Ekim, B., Berger, B., & Chikhi, R. (2021) Minimizer-space de
Bruijn graphs: Whole-genome assembly of long reads in minutes on a
personal computer. Cell Systems, 12,
958–968.e6.
Ekim, B., Sahlin, K., Medvedev, P., Berger, B., & Chikhi, R. (2023) Efficient mapping of
accurate long reads in minimizer space with mapquik. Genome
Research.
Elias, P. (1974) Efficient storage and
retrieval by content and address of static files. Journal of the
ACM, 21, 246–260.
Erbert, M., Rechner, S., & Müller-Hannemann, M. (2017) Gerbil: A fast and
memory-efficient k-mer counter with GPU-support. Algorithms for
Molecular Biology, 12.
Fan, J., Khan, J., Singh,
N.P., Pibiri, G.E., & Patro, R. (2024) Fulgor: A fast and
compact k-mer index for large-scale matching and color queries.
Algorithms for Molecular Biology, 19, 3.
Fano, R.M. (1971) On the number of bits
required to implement an associative memory (Memorandum No. 61).
Computer Structures Group, MIT, Cambridge, MA. URL http://csg.csail.mit.edu/pubs/memos/Memo-61/Memo-61.pdf.
Faro, S. & Lecroq, T. (2013) The exact online string
matching problem: A review of the most recent results. ACM
Computing Surveys, 45, 1–42.
Faure, R., Abrar, H., Wu,
H., Chikhi, R., Koslicki, D., & Medvedev, P. (2025) Comparing and indexing
metagenomes at a large scale using random projections. SeqBIM
2025 workshop. Nantes, France.
Feng, X., Cheng, H., Portik, D., & Li, H. (2022) Metagenome assembly of
high-fidelity long reads with hifiasm-meta. Nature Methods,
19, 671–674.
Ferragina, P. & Manzini, G. (2000) Opportunistic data
structures with applications. Proceedings 41st annual symposium
on foundations of computer science, SFCS-00. IEEE Comput.
Soc, pp. 390–398.
Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007) HyperLogLog: The analysis of
a near-optimal cardinality estimation algorithm. Discrete
Mathematics & Theoretical Computer Science, DMTCS
Proceedings vol. AH,...
Gagie, T., Manzini, G., & Sirén, J. (2017) Wheeler graphs: A
framework for BWT-based data structures. Theoretical Computer
Science, 698, 67–78.
Gallant, A. (2024) Ripgrep. URL https://github.com/BurntSushi/ripgrep.
Gienieczko, M., Murlak, F., & Paperman, C. (2023) Supporting descendants in
SIMD-accelerated JSONPath. Proceedings of the 28th ACM
international conference on architectural support for programming
languages and operating systems, volume 4, ASPLOS ’23.
ACM, pp. 338–361.
Gog, S. & Petri, M. (2013) Optimized succinct data
structures for massive data. Software: Practice and
Experience, 44, 1287–1314.
Golan, S., Tziony, I., Kraus, M., Orenstein, Y., & Shur, A. (2025) GreedyMini:
Generating low-density DNA minimizers. Bioinformatics,
41, i275–i284.
Graefe, G. (1993) Query evaluation techniques
for large databases. ACM Computing Surveys,
25, 73–169.
Greenberg, G., Ravi, A.N., & Shomorony, I. (2023) LexicHash:
Sequence similarity estimation via lexicographic comparison of
hashes. Bioinformatics, 39.
Groot Koerkamp, R. (2024) A*PA2: Up to 19×
faster exact global alignment. LIPIcs, Volume 312, WABI
2024, vol. 312. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,
pp. 17:1–17:25.
Groot Koerkamp, R. (2025a) PtrHash: Minimal
perfect hashing at RAM throughput. LIPIcs, Volume 338, SEA
2025, vol. 338. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,
pp. 21:1–21:21.
Groot Koerkamp, R. (2025b) Optimal
throughput bioinformatics (PhD thesis). URL https://www.research-collection.ethz.ch/handle/20.500.11850/783091.
Groot Koerkamp, R. (2026) The anti-lexicographic
SUS-anchor: A near-optimal k=1 sampling scheme. arXiv.
Groot Koerkamp, R. & Ivanov, P. (2024) Exact global
alignment using A* with chaining seed heuristic and match pruning.
Bioinformatics, 40.
Groot Koerkamp, R., Liu, D., & Pibiri, G.E. (2025) The open-closed
mod-minimizer algorithm. Algorithms for Molecular Biology,
20.
Groot Koerkamp, R. & Martayan, I. (2025) SimdMinimizers: Computing Random Minimizers,
fast. 23rd international symposium on experimental
algorithms (SEA 2025), vol. 338. Schloss Dagstuhl – Leibniz-Zentrum
für Informatik.
Groot Koerkamp, R. & Pibiri, G.E. (2024) The mod-minimizer: A Simple and Efficient Sampling
Algorithm for Long k-mers. 24th international workshop on
algorithms in bioinformatics (WABI 2024), vol. 312, Leibniz
international proceedings in informatics (LIPIcs) (Pissis, S.P.
& Sung, W.-K. eds). Dagstuhl, Germany: Schloss Dagstuhl –
Leibniz-Zentrum für Informatik, pp. 11:1–11:23.
Hennessy, J.L., Patterson, D.A., & Kozyrakis, C. (2026) Computer architecture:
A quantitative approach, 7th ed. Cambridge, MA: Morgan
Kaufmann Publishers. URL https://shop.elsevier.com/books/computer-architecture/hennessy/978-0-443-15406-5.
Hera, M.R., Koslicki, D., & Martínez, C. (2025) MaxGeomHash: An
algorithm for variable-size random sampling of distinct elements.
bioRxiv.
Hernandez-Courbevoie, Y., Salson, M., Bessière, C., Xue, H., Gautheret, D., Marchet, C., & Limasset, A. (2025) REINDEER2: Practical
abundance index at scale. String processing and information
retrieval. Springer Nature Switzerland, pp. 156–171.
Hirzel, M., Schneider, S., & Tangwongsan, K. (2017) Sliding-window
aggregation algorithms: tutorial. Proceedings of the 11th
ACM international conference on distributed and event-based
systems, DEBS 2017, Barcelona, Spain, June 19-23,
2017. ACM, pp. 11–14.
Holley, G. & Melsted, P. (2020) Bifrost: Highly
parallel construction and indexing of colored and compacted de Bruijn
graphs. Genome Biology, 21.
Homer, N., Stadick, S., Lambert, S., Stone, M., & Fennell, T. (2025) Fqgrep. URL https://doi.org/10.5281/zenodo.15034074.
Ingels, F., Limasset, A., Marchet, C., & Salson, M. (2026) Vigemers: On the number
of k-mers sharing the same
XOR-based minimizer. arXiv.
Ingels, F., Marchet, C., & Salson, M. (2024) On the number of k-mers admitting a given
lexicographical minimizer. arXiv.
Ingels, F., Robidou, L., Martayan, I., Marchet, C., & Limasset, A. (2025) Minimizer density
revisited: Models and multiminimizers. bioRxiv.
Intel Corporation (2026) Intel® 64 and IA-32 Architectures Software
Developer’s Manual. Intel Corporation. URL https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html.
Iqbal, Z., Turner, I., & McVean, G. (2012) High-throughput
microbial population genomics using the Cortex variation assembler.
Bioinformatics, 29, 275–276.
Irber, L., Brooks, P.T., Reiter, T., Pierce-Ward, N.T., Hera, M.R., Koslicki, D., & Brown, C.T. (2022) Lightweight
compositional analysis of metagenomes with FracMinHash and minimum
metagenome covers. bioRxiv.
Irber, L., Pierce-Ward, N.T., & Brown, C.T. (2022) Sourmash branchwater
enables lightweight petabyte-scale sequence search.
bioRxiv.
Jain, C., Dilthey, A., Koren, S., Aluru, S., & Phillippy, A.M. (2017) A fast approximate
algorithm for mapping long reads to large reference databases.
Research in computational molecular biology. Springer
International Publishing, pp. 66–81.
Karasikov, M., Mustafa, H., Danciu, D., Kulkov, O., Zimmermann, M., Barber, C., Rätsch, G., & Kahles, A. (2025) Efficient and accurate
search in petabase-scale sequence repositories. Nature,
647, 1036–1044.
Karasikov, M., Mustafa, H., Rätsch, G., & Kahles, A. (2022) Lossless indexing with
counting de Bruijn graphs. Genome Research,
32, 1754–1764.
Karp, R.M. (2009) Reducibility among
combinatorial problems. 50 years of integer programming
1958-2008. Springer Berlin Heidelberg, pp. 219–241.
Karp, R.M. & Rabin, M.O. (1987) Efficient randomized
pattern-matching algorithms. IBM Journal of Research and
Development, 31, 249–260.
Karsenti, E., Acinas, S.G., Bork, P., Bowler, C., De
Vargas, C., Raes, J., Sullivan, M., Arendt, D., Benzoni, F., Claverie, J.-M., Follows, M., Gorsky, G., Hingamp, P., Iudicone, D., Jaillon, O., Kandels-Lewis, S., Krzic, U., Not,
F., Ogata, H., Pesant, S., Reynaud, E.G., Sardet, C., Sieracki, M.E., Speich, S., Velayoudon, D., Weissenbach, J., & Wincker, P. (2011) A holistic approach
to marine eco-systems biology. PLoS Biology,
9, e1001177.
Kazemi, P., Wong, J., Nikolić, V., Mohamadi, H., Warren, R.L., & Birol, I. (2022) ntHash2: Recursive
spaced seed hashing for nucleotide sequences.
Bioinformatics, 38, 4812–4813.
Khan, J., Kokot, M., Deorowicz, S., & Patro, R. (2022) Scalable, ultra-fast,
and low-memory construction of compacted de Bruijn graphs with
Cuttlefish 2. Genome Biology, 23.
Khan, J., Patro, R., & Pandey, P. (2026) Kache-hash: A dynamic,
concurrent, and cache-efficient hash table for streaming k-mer
operations. bioRxiv.
Kille, B., Groot
Koerkamp, R., McAdams, D., Liu, A., & Treangen, T.J. (2024) A near-tight lower
bound on the density of forward sampling schemes.
Bioinformatics, 41.
Kokot, M., Długosz, M., & Deorowicz, S. (2017) KMC 3: counting and manipulating k-mer
statistics. Bioinformatics, 33,
2759–2761.
Kolmogorov, M., Yuan, J., Lin,
Y., & Pevzner, P.A. (2019) Assembly of long,
error-prone reads using repeat graphs. Nature
Biotechnology, 37, 540–546.
Konstantinidis, K.T. & Tiedje, J.M. (2005) Genomic insights that
advance the species definition for prokaryotes. Proceedings of
the National Academy of Sciences, 102, 2567–2572.
Langdale, G. & Lemire, D. (2019) Parsing gigabytes of
JSON per second. The VLDB Journal, 28,
941–960.
Lehmann, H.-P., Mueller, T., Pagh, R., Pibiri, G.E., Sanders, P., Vigna, S., & Walzer, S. (2026) Modern minimal perfect hashing: A
survey. ACM Computing Surveys, 58, 1–36.
Leis, V., Kemper, A., & Neumann, T. (2013) The adaptive radix
tree: ARTful indexing for main-memory databases. 2013 IEEE 29th
international conference on data engineering (ICDE). IEEE, pp.
38–49.
Lemane, T., Medvedev, P., Chikhi, R., & Peterlongo, P. (2022) Kmtricks: Efficient and
flexible construction of Bloom filters for large sequencing data
collections. Bioinformatics Advances, 2.
Lemire, D. (2017) Removing duplicates from lists quickly. URL https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/.
Lemire, D., Boytsov, L., & Kurz, N. (2015) SIMD compression and the
intersection of sorted integers. Software: Practice and
Experience, 46, 723–749.
Lemire, D., Kaser, O., & Kurz, N. (2019) Faster remainder by direct
computation: Applications to compilers and software libraries.
Software: Practice and Experience, 49,
953–970.
Levallois, V., Shibuya, Y., Le
Gal, B., Dufresne, Y., Patro, R., Peterlongo, P., & Pibiri, G.E. (2026) Kaminari: A frugal colored
index for approximate k-mer queries. Bioinformatics
Advances.
Li, H. (2009) Kseq. URL https://github.com/attractivechaos/klib.
Li, H. (2016) Minimap and
miniasm: Fast mapping and de novo assembly for noisy long sequences.
Bioinformatics, 32, 2103–2110.
Li, H. (2018) Minimap2: Pairwise
alignment for nucleotide sequences. Bioinformatics,
34, 3094–3100.
Li, H. (2020) Biofast. URL https://github.com/lh3/biofast.
Li, P. & König, C. (2010) B-bit minwise
hashing. Proceedings of the 19th international conference on
world wide web, WWW ’10. ACM, pp. 671–680.
Li, P., Owen, A., & Zhang, C. (2012) One
permutation hashing. Advances in neural information processing
systems, vol. 25 (Pereira, F., Burges, C.J., Bottou, L., &
Weinberger, K. eds). Curran Associates, Inc.
Li, Y., Kamousi, P., Han, F., Yang,
S., Yan, X., & Suri, S. (2013) Memory efficient minimum
substring partitioning. Proceedings of the VLDB Endowment,
6, 169–180.
Li, Y. & Yan, X. (2015) MSPKmerCounter: A fast
and memory efficient approach for k-mer counting. arXiv.
Limasset, A., Rizk, G., Chikhi, R., & Peterlongo, P. (2017) Fast and scalable
minimal perfect hashing for massive key sets. LIPIcs, Volume 75,
SEA 2017, vol. 75. Schloss Dagstuhl – Leibniz-Zentrum für
Informatik, pp. 25:1–25:16.
Lothaire, M. (1997) Combinatorics on
words, vol. 17. Cambridge university press.
Mäklin, T., Alanko, J.N., Biagi, E., & Puglisi, S.J. (2025) Sequence alignment with
k-bounded matching statistics. bioRxiv.
Marçais, G., DeBlasio, D., & Kingsford, C. (2018) Asymptotically
optimal minimizers schemes. Bioinformatics,
34, i13–i22.
Marçais, G., Elder, C.S., & Kingsford, C. (2024) K-nonical space:
Sketching with reverse complements. Bioinformatics,
40, btae629.
Marçais, G. & Kingsford, C. (2011) A fast, lock-free approach for efficient parallel
counting of occurrences of k-mers. Bioinformatics,
27, 764–770.
Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., & Kingsford, C. (2017) Improving the
performance of minimizers and winnowing schemes.
Bioinformatics, 33, i110–i117.
Marchet, C. (2024) Advances in practical
k-mer sets: Essentials for the curious. arXiv.
Marchet, C., Iqbal, Z., Gautheret, D., Salson, M., & Chikhi, R. (2020) REINDEER:
Efficient indexing of k-mer presence and abundance in sequencing
datasets. Bioinformatics, 36, i177–i185.
Marchet, C., Kerbiriou, M., & Limasset, A. (2021) BLight: Efficient
exact associative structure for k-mers. Bioinformatics,
37, 2858–2865.
Marchini, S. & Vigna, S. (2020) Compact Fenwick trees for
dynamic ranking and selection. Software: Practice and
Experience, 50, 1184–1202.
Marco-Sola, S., Eizenga, J.M., Guarracino, A., Paten, B., Garrison, E., & Moreto, M. (2023) Optimal gap-affine
alignment in O(s) space. Bioinformatics,
39.
Marco-Sola, S., Moure, J.C., Moreto, M., & Espinosa, A. (2020) Fast gap-affine
pairwise alignment using the wavefront algorithm.
Bioinformatics, 37, 456–463.
Martayan, I., Cazaux, B., Limasset, A., & Marchet, C. (2024) Conway-Bromage-Lyndon (CBL): an exact, dynamic
representation of k-mer sets. Bioinformatics.
Martayan, I., Lobet, L., Marchet, C., & Paperman, C. (2026) Helicase: Vectorized
parsing and bitpacking of genomic sequences. bioRxiv.
Martayan, I., Robidou, L., Shibuya, Y., & Limasset, A. (2025) Hyper-k-mers:
Efficient streaming k-mers representation. Research in
computational molecular biology (RECOMB 2025). Springer Nature
Switzerland.
Martayan, I., Vandamme, L., Constantinides, B., Cazaux, B., Paperman, C., & Limasset, A. (2025) Accelerating
k-mer-based sequence filtering. bioRxiv.
McNaughton, R. & Papert, S.A. (1971) Counter-free automata. The
MIT Press. URL https://dl.acm.org/doi/abs/10.5555/1097043.
Mitzenmacher, M. (2001) The power of two choices in
randomized load balancing. IEEE Transactions on Parallel and
Distributed Systems, 12, 1094–1104.
Mohamadi, H., Chu, J., Vandervalk, B.P., & Birol, I. (2016) ntHash: Recursive
nucleotide hashing. Bioinformatics, 32,
3492–3494.
Myers, G. (1999) A fast bit-vector algorithm
for approximate string matching based on dynamic programming.
Journal of the ACM, 46, 395–415.
Myers, G. (2023) FASTK: A fast K-mer counter for high-fidelity shotgun
datasets. URL https://github.com/thegenemyers/FASTK.
Mykkeltveit, J. (1972) A proof of Golomb’s
conjecture for the de Bruijn graph. Journal of Combinatorial
Theory, Series B, 13, 40–45.
Ndiaye, M., Prieto-Baños, S., Fitzgerald, L.M., Yazdizadeh Kharrazi, A., Oreshkov, S., Dessimoz, C., Sedlazeck, F.J., Glover, N., & Majidian, S. (2024) When less is more:
Sketching with minimizers in genomics. Genome Biology,
25.
Needleman, S.B. & Wunsch, C.D. (1970) A general method
applicable to the search for similarities in the amino acid sequence of
two proteins. Journal of Molecular Biology,
48, 443–453.
Nunes, I., Heddes, M., Vergés, P., Abraham, D., Veidenbaum, A., Nicolau, A., & Givargis, T. (2023) DotHash: Estimating set
similarity metrics for link prediction and document deduplication.
Proceedings of the 29th ACM SIGKDD conference on knowledge discovery
and data mining, KDD ’23. ACM, pp. 1758–1769.
Nurk, S., Koren, S., Rhie,
A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., Gershman, A., Aganezov, S., Hoyt, S.J., Diekhans, M., Logsdon, G.A., Alonge, M., Antonarakis, S.E., Borchers, M., Bouffard, G.G., Brooks, S.Y., Caldas, G.V., Chen, N.-C., Cheng, H., Chin,
C.-S., Chow, W., Lima, L.G. de, Dishuck, P.C., Durbin, R., Dvorkina, T., Fiddes, I.T., Formenti, G., Fulton, R.S., Fungtammasan, A., Garrison, E., Grady, P.G.S., Graves-Lindsay, T.A., Hall, I.M., Hansen, N.F., Hartley, G.A., Haukness, M., Howe, K., Hunkapiller, M.W., Jain, C., Jain,
M., Jarvis, E.D., Kerpedjiev, P., Kirsche, M., Kolmogorov, M., Korlach, J., Kremitzki, M., Li, H., Maduro,
V.V., Marschall, T., McCartney, A.M., McDaniel, J., Miller, D.E., Mullikin, J.C., Myers, E.W., Olson, N.D., Paten, B., Peluso, P., Pevzner, P.A., Porubsky, D., Potapova, T., Rogaev, E.I., Rosenfeld, J.A., Salzberg, S.L., Schneider, V.A., Sedlazeck, F.J., Shafin, K., Shew, C.J., Shumate, A., Sims, Y., Smit,
A.F.A., Soto, D.C., Sović, I., Storer, J.M., Streets, A., Sullivan, B.A., Thibaud-Nissen, F., Torrance, J., Wagner, J., Walenz, B.P., Wenger, A., Wood, J.M.D., Xiao, C., Yan,
S.M., Young, A.C., Zarate, S., Surti, U., McCoy, R.C., Dennis, M.Y., Alexandrov, I.A., Gerton, J.L., O’Neill, R.J., Timp, W., Zook,
J.M., Schatz, M.C., Eichler, E.E., Miga, K.H., & Phillippy, A.M. (2022) The complete sequence of
a human genome. Science, 376, 44–53.
Ondov, B.D., Treangen, T.J., Melsted, P., Mallonee, A.B., Bergman, N.H., Koren, S., & Phillippy, A.M. (2016) Mash: Fast genome and
metagenome distance estimation using MinHash. Genome
Biology, 17.
One Codex (2019) Needletail. URL https://github.com/onecodex/needletail.
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., & Kingsford, C. (2016) Compact universal
k-mer hitting sets. Algorithms in bioinformatics. Springer
International Publishing, pp. 257–268.
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., & Kingsford, C. (2017) Designing small
universal k-mer hitting sets for improved analysis of high-throughput
sequencing. PLOS Computational Biology,
13, e1005777.
Pan, C. & Reinert, K. (2024) A simple refined
DNA minimizer operator enables 2-fold faster computation.
Bioinformatics, 40, btae045.
Pandey, P., Almodaresi, F., Bender, M.A., Ferdman, M., Johnson, R., & Patro, R. (2018) Mantis: A fast, small,
and exact large-scale sequence-search index. Cell Systems,
7, 201–207.e4.
Paperman, C., Salvati, S., & Soyez-Martin, C. (2023) An algebraic
approach to vectorial programs. LIPIcs, Volume 254, STACS
2023, vol. 254. Schloss Dagstuhl – Leibniz-Zentrum für Informatik,
pp. 51:1–51:23.
Patro, R., Bharti, S., Singhania, P., Dhakal, R., Dahlstrom, T.J., & Groot Koerkamp, R. (2025) Mim: A lightweight
auxiliary index to enable fast, parallel, gzipped FASTQ parsing.
bioRxiv.
Peleg, A., Wilkie, S., & Weiser, U. (1997) Intel MMX for multimedia
PCs. Communications of the ACM, 40, 24–38.
Pellow, D., Pu, L., Ekim,
B., Kotlar, L., Berger, B., Shamir, R., & Orenstein, Y. (2023) Efficient minimizer orders
for large values of k using minimum decycling sets. Genome
Research.
Pennisi, E. (2017) Biologists propose to
sequence the DNA of all life on earth. Science.
Pibiri, G.E. (2022) Sparse and skew
hashing of k-mers. Bioinformatics, 38,
i185–i194.
Pibiri, G.E. & Kanda, S. (2021) Rank/select queries over
mutable bitmaps. Information Systems, 99,
101756.
Pibiri, G.E. & Patro, R. (2026) Optimizing sparse and
skew hashing: Faster k-mer dictionaries. bioRxiv.
Pibiri, G.E., Shibuya, Y., & Limasset, A. (2023) Locality-preserving
minimal perfect hashing of k-mers. Bioinformatics,
39, i534–i543.
Pibiri, G.E. & Trani, R. (2021) PTHash: Revisiting FCH
minimal perfect hashing. Proceedings of the 44th international
ACM SIGIR conference on research and development in information
retrieval, SIGIR ’21. ACM, pp. 1339–1348.
Pibiri, G.E. & Venturini, R. (2017) Dynamic Elias-Fano
representation. LIPIcs, Volume 78, CPM 2017, vol. 78.
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, pp. 30:1–30:14.
Pierce, N.T., Irber, L., Reiter, T., Brooks, P., & Brown, C.T. (2019) Large-scale
sequence comparisons with sourmash. F1000Research,
8, 1006.
Rahman, A. & Medvedev, P. (2021) Representation of k-mer
sets using spectrum-preserving string sets. Journal of
Computational Biology, 28, 381–394.
Rahman Hera, M., Pierce-Ward, N.T., & Koslicki, D. (2023) Deriving confidence
intervals for mutation rates across a wide range of evolutionary
distances using FracMinHash. Genome Research.
Roberts, M., Hayes, W., Hunt,
B.R., Mount, S.M., & Yorke, J.A. (2004) Reducing storage
requirements for biological sequence comparison.
Bioinformatics, 20, 3363–3369.
Rouzé, T., Chikhi, R., & Limasset, A. (2025) Inverted colored de
Bruijn graph for practical kmer sets storage. bioRxiv.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2023) Fractional Hitting Sets for Efficient and Lightweight
Genomic Data Sketching. 23rd international workshop on
algorithms in bioinformatics (WABI 2023), vol. 273. Schloss
Dagstuhl – Leibniz-Zentrum für Informatik.
Rouzé, T., Martayan, I., Marchet, C., & Limasset, A. (2025) Fractional hitting
sets for efficient multiset sketching. Algorithms for Molecular
Biology, 20, 1.
Rowe, W.P.M. (2019) When the levee breaks:
A practical guide to sketching algorithms for processing the flood of
genomic data. Genome Biology, 20.
Sahlin, K. (2021) Effective sequence
similarity detection with strobemers. Genome Research,
31, 2080–2094.
Sahlin, K. (2022) Strobealign: Flexible
seed size enables ultra-fast and accurate read alignment. Genome
Biology, 23.
Sahlin, K., Baudeau, T., Cazaux, B., & Marchet, C. (2023) A survey of mapping
algorithms in the long-reads era. Genome Biology,
24.
Sanger, F., Nicklen, S., & Coulson, A.R. (1977) DNA sequencing with
chain-terminating inhibitors. Proceedings of the National
Academy of Sciences, 74, 5463–5467.
Sawada, J. & Williams, A. (2017) Practical algorithms to
rank necklaces, Lyndon words, and de Bruijn sequences. Journal
of Discrete Algorithms, 43, 95–110.
Schartl, M., Woltering, J.M., Irisarri, I., Du, K., Kneitz,
S., Pippel, M., Brown, T., Franchini, P., Li, J., Li, M.,
Adolfi, M., Winkler, S., Freitas
Sousa, J. de, Chen, Z., Jacinto, S., Kvon, E.Z., Correa de
Oliveira, L.R., Monteiro, E.,
Baia Amaral, D., Burmester, T., Chalopin, D., Suh, A., Myers,
E., Simakov, O., Schneider, I., & Meyer, A. (2024) The genomes of all
lungfish inform on genome expansion and tetrapod evolution.
Nature, 634, 96–103.
Schleimer, S., Wilkerson, D.S., & Aiken, A. (2003) Winnowing: Local algorithms
for document fingerprinting. Proceedings of the 2003
ACM SIGMOD international conference on
Management of data, SIGMOD ’03.
New York, NY, USA: Association for Computing Machinery, pp. 76–85.
Schmidt, S. & Alanko, J.N. (2023) Eulertigs: Minimum
plain text representation of k-mer sets without repetitions in linear
time. Algorithms for Molecular Biology,
18.
Schmidt, S., Khan, S., Alanko, J.N., Pibiri, G.E., & Tomescu, A.I. (2023) Matchtigs: Minimum
plain text representation of k-mer sets. Genome Biology,
24.
Sereika, M., Kirkegaard, R.H., Karst, S.M., Michaelsen, T.Y., Sørensen, E.A., Wollenberg, R.D., & Albertsen, M. (2022) Oxford nanopore R10.4
long-read sequencing enables the generation of near-finished bacterial
genomes from pure cultures and metagenomes without short-read or
reference polishing. Nature Methods, 19,
823–826.
Serre, O. (2004) Vectorial languages
and linear temporal logic. Theoretical Computer Science,
310, 79–116.
Shaw, J. & Yu, Y.W. (2021) Theory of local
k-mer selection with applications to long-read alignment.
Bioinformatics, 38, 4659–4669.
Shen, W., Le, S., Li, Y.,
& Hu, F. (2016) SeqKit: A
cross-platform and ultrafast toolkit for FASTA/q file manipulation.
PLOS ONE, 11, e0163962.
Shen, W., Lees, J.A., & Iqbal, Z. (2025) Efficient sequence
alignment against millions of prokaryotic genomes with LexicMap.
Nature Biotechnology.
Shen, W., Sipos, B., & Zhao, L. (2024) SeqKit2: A swiss army knife for
sequence and alignment processing. iMeta,
3.
Shibuya, Y., Belazzougui, D., & Kucherov, G. (2022) Efficient
Reconciliation of Genomic
Datasets of High Similarity.
22nd International Workshop on
Algorithms in Bioinformatics
(WABI 2022). Schloss Dagstuhl – Leibniz-Zentrum für
Informatik, pp. 14:1–14:14.
Shiryev, S.A. & Agarwala, R. (2024) Indexing and searching
petabase-scale nucleotide resources. Nature Methods,
21, 994–1002.
Shrivastava, A. & Li, P. (2014) Densifying
one permutation hashing via rotation for fast near neighbor search.
Proceedings of the 31st international conference on machine
learning, vol. 32, Proceedings of machine learning
research (Xing, E.P. & Jebara, T. eds). Beijing, China: PMLR,
pp. 557–565.
Shur, A., Tziony, I., & Orenstein, Y. (2026) 10-minimizers: A
promising class of constant-space minimizers. bioRxiv.
Sladký, O., Veselý, P., & Břinda, K. (2023) Masked superstrings as
a unified framework for textual k-mer set representations.
bioRxiv.
Sladký, O., Veselý, P., & Břinda, K. (2024) Towards efficient k-mer
set operations via function-assigned masked superstrings.
bioRxiv.
Sladký, O., Veselý, P., & Břinda, K. (2025) FroM
Superstring to Indexing: a space-efficient index for unconstrained k-mer
sets using the Masked Burrows-Wheeler Transform (MBWT).
Bioinformatics Advances, 6.
Smith, C., Martayan, I., Limasset, A., & Dufresne, Y. (2024) Brisk: Exact
resource-efficient dictionary for k-mers. bioRxiv.
Smith, T.F. & Waterman, M.S. (1981) Identification of
common molecular subsequences. Journal of Molecular
Biology, 147, 195–197.
Solomon, B. & Kingsford, C. (2016) Fast search of thousands of
short-read sequencing experiments. Nature Biotechnology,
34, 300–302.
Soyez-Martin, C. (2023) From semigroup
theory to vectorization: Recognizing regular languages. (PhD thesis).
URL http://dx.doi.org/10.70675/09db0f47z19fbz4c84zbeedza5bfb8f585c8.
Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron,
M.J., Iyer, R., Schatz, M.C., Sinha, S., & Robinson, G.E. (2015) Big data:
Astronomical or genomical? PLOS Biology,
13, e1002195.
Teyssier, N. (2025) Paraseq. URL https://github.com/noamteyssier/paraseq.
Teyssier, N. & Dobin, A. (2025) BINSEQ: A family of
high-performance binary formats for nucleotide sequences.
bioRxiv.
Theodorakis, G., Koliousis, A., Pietzuch, P.R., & Pirk, H. (2018) Hammer
slide: Work- and CPU-efficient streaming window aggregation.
International workshop on accelerating analytics and data management
systems using modern processor and storage architectures, ADMS@VLDB
2018, Rio de Janeiro, Brazil, August 27, 2018 (Bordawekar, R. &
Lahiri, T. eds). pp. 34–41.
Valve Corporation (2026) Steam
Hardware & Software Survey: March 2026. URL https://store.steampowered.com/hwsurvey/.
Vandamme, L., Cazaux, B., & Limasset, A. (2025) K2R:
Tinted de Bruijn Graphs implementation for efficient read extraction
from sequencing datasets. Bioinformatics Advances,
vbaf111.
Vigna, S. (2008) Broadword
implementation of rank/select queries. Experimental
algorithms. Springer Berlin Heidelberg, pp. 154–168.
Wald, A. (1944) On cumulative sums of
random variables. The Annals of Mathematical Statistics,
15, 283–296.
Wang, X., Hong, Y., Chang,
H., Park, K., Langdale, G., Hu, J., & Zhu, H. (2019) Hyperscan: A Fast Multi-pattern Regex Matcher for Modern
CPUs. 16th USENIX symposium on networked systems design
and implementation (NSDI 19). pp. 631–648.
Wittler, R. (2023) General encoding of
canonical k-mers. Peer Community Journal,
3.
Wood, D.E., Lu, J., & Langmead, B. (2019) Improved metagenomic
analysis with Kraken 2. Genome Biology,
20, 1–13.
Wood, D.E. & Salzberg, S.L. (2014) Kraken: Ultrafast
metagenomic sequence classification using exact alignments.
Genome Biology, 15.
Xu, W., Hsu, P.-K., Moshiri, N., Yu,
S., & Rosing, T. (2024) HyperGen: Compact
and efficient genome sketching using hyperdimensional vectors.
Bioinformatics, 40.
Yu, Y.W. & Weber, G.M. (2020) HyperMinHash: MinHash
in LogLog space. IEEE Transactions on Knowledge and Data
Engineering, 1–1.
Zakeri, M., Brown, N.K., Ahmed, O.Y., Gagie, T., & Langmead, B. (2024) Movi: A fast and
cache-efficient full-text pangenome index. iScience,
27, 111464.
Zakeri, M., Brown, N.K., Gagie, T., & Langmead, B. (2025) Movi 2: Fast and
space-efficient queries on pangenomes. bioRxiv.
Zentgraf, J., Schmitz, J.E., & Rahmann, S. (2025) Cleanifier:
Contamination removal from microbial sequences using spaced seeds of a
human pangenome index. Bioinformatics, 42.
Zhang, H., Song, H., Xu,
X., Chang, Q., Wang, M., Wei,
Y., Yin, Z., Schmidt, B., & Liu, W. (2023) RabbitFX: Efficient
framework for FASTA/q file parsing on modern multi-core platforms.
IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 20, 2341–2348.
Zhao, X. (2019) BinDash, software
for fast genome distance estimation on a typical personal laptop.
Bioinformatics, 35, 671–673.
Zheng, H., Kingsford, C., & Marçais, G. (2020) Improved design
and analysis of practical minimizers. Bioinformatics,
36, i119–i127.
Zheng, H., Kingsford, C., & Marçais, G. (2021) Sequence-specific
minimizers via polar sets. Bioinformatics,
37, i187–i195.
Zheng, H., Marçais, G., & Kingsford, C. (2023) Creating and using
minimizer sketches in computational genomics. Journal of
Computational Biology, 30, 1251–1276.
Zhou, D., Andersen, D.G., & Kaminsky, M. (2013) Space-efficient,
high-performance rank and select structures on uncompressed bit
sequences. Experimental algorithms. Springer Berlin
Heidelberg, pp. 151–163.