High-performance sequence processing

All we have to decide is what to do with the cycles that are given us.

In the introduction we established that some genomic pipelines routinely process hundreds of terabases of sequence data. At that scale, even a constant-factor improvement in throughput translates directly into a corresponding reduction in wall-clock time. This part asks how close to the hardware limit a genomic pipeline can actually run, and what it takes to get there. Chapter 5 introduces the vectorization model and the SIMD instruction sets available on common hardware. Chapters 6, 7, and 8 develop vectorized algorithms for three core primitives of genomic pipelines: parsing FASTA/FASTQ input, computing rolling hashes over k‑mers, and extracting minimizers from a sequence. Chapter 9 applies these primitives to sequence filtering.