1 Delving into genomic sequences
DNA, or deoxyribonucleic acid, is the molecule that carries the genetic information of living organisms. As you probably know, it contains four nucleotides: adenine (A) pairing with thymine (T), and cytosine (C) pairing with guanine (G). A genome is the complete set of DNA of an organism, stored across one or multiple chromosomes. The human genome, for instance, spans roughly 3.2 billion base pairs distributed across 23 chromosome pairs. It’s often difficult to grasp how much information this represents, so let’s try to imagine what this would amount to when printed as books. A typical novel contains roughly 80,000 words, or 500,000 characters, and is approximately 3cm wide. Thus, a single human genome would require 6400 books, enough to fill a 200m long bookshelf!
Given this amount of information, we soon realize that any kind of manual analysis would take years of effort1, so we definitely want some form of automation. This is where sequencing technologies come into play, by turning DNA samples into a signal that we can feed to computers. However, current sequencers are not able to read entire chromosomes in one go, so the DNA is fragmented and sequenced into millions of reads with partial information. Moreover, sequencers regularly make errors by inserting, deleting or modifying nucleotides, resulting in inaccurate reads.
Modern sequencing technologies are divided in two main categories. Short-read sequencers, most notably Illumina, produce reads of 100–300 base pairs with a low cost per base and error rates below 0.5%. Long-read sequencers from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) generate reads of thousands to hundreds of thousands of base pairs, at the cost of higher error rates (0.5–5%) and lower throughput.
After sequencing, we’re left with a large collection of noisy reads which are completely unorganized, so we’d like to correct their errors and recover their original arrangement. Going back to the book analogy, we would have many excerpts spanning a couple sentences (for short reads) or a couple pages (for long reads) that contain lots of typos. Some typos are easy to catch because they create fake words, like “sky” becoming “skt”, but others are more insidious and could turn a friend into a fiend.
The process of reconstructing a sequence from a set of reads is called assembly. Two main strategies exist: when a closely related reference sequence is available, it can be used as a template to anchor and rearrange the reads, a process known as reference-guided assembly. When no such reference exists, the reads must be assembled from scratch, which is called de novo assembly. In practice, depending on the application, one may work either with a fully assembled genome or directly with raw reads. Both are typically stored in text-based formats: FASTA, which records a sequence identifier followed by the nucleotide sequence itself, and FASTQ, which additionally encodes a per-base quality score reflecting the sequencer’s confidence at each position.
A large share of these datasets, both raw reads and assembled genomes, are saved in public archives. The most prominent ones include the Sequence Read Archive (SRA) and GenBank operated by NCBI in the US, and their European counterpart European Nucleotide Archive (ENA) operated by EMBL-EBI in the UK. Their size is growing at a staggering pace: SRA alone has surpassed 100 petabytes of data. In our book metric, this would correspond to more than 6km of books, enough to cover the distance from Earth to the Moon… 16 times! This exponential growth lies at the heart of the challenge we try to address: sequencing data is accumulating faster than our storage and computational capacity can keep up with. New approaches are therefore needed, ones that can scale to this ocean of data while remaining tractable on modest resources.
Throughout this thesis, we will discuss different ways of comparing sequences to one another. Historically, two broad families of methods are distinguished: alignment-based and alignment-free. Alignment-based methods usually rely on computing an edit distance between two sequences, that is the minimum number of insertions, deletions, and substitutions required to transform one into the other2. Their main drawback is computational cost: classical algorithms run in quadratic time (Needleman & Wunsch, 1970; Smith & Waterman, 1981), and even the more recent ones such as WFA (Marco-Sola et al., 2020, 2023) or A*PA (Groot Koerkamp, 2024; Groot Koerkamp & Ivanov, 2024) run in slightly super-linear time3. Alignment-free methods, by contrast, bypass the position-by-position correspondence to run in at most linear time and space. Instead, each sequence is summarized by a compact signature, most commonly derived from its k‑mer content, i.e. all its substrings of fixed length k, and sequences are compared by the similarity of their signatures. We discuss the benefits of k‑mer-based approaches in the next chapter, and subsequent chapters develop the data structures and algorithms needed to compare sequences at scale.
Sanger’s early sequencing work required reading gel fragments by hand, even for genomes as small as bacteriophages (Sanger et al., 1977).↩︎
In practice, repeated operations are often assigned lower costs, leading to affine-cost edit distances.↩︎
For a detailed history of pairwise alignment, I recommend reading Ragnar Groot Koerkamp’s thesis (Groot Koerkamp, 2025).↩︎