Appendix A — Vectorized sequence parsing

A.1 Features of the benchmarked CPUs

Table A.1: Instruction-set extensions supported by the benchmarked CPUs.
CPU                     Year  SSE  AVX2  BMI2  NEON
Intel
  Xeon X5670            2010   +
  Xeon E5-2620          2012   +
  Xeon E7-4850 V3       2015   +    +     +
  Xeon E5-2620 V4       2016   +    +     +
  Xeon Gold 6130        2017   +    +     +
  Xeon Gold 5218        2019   +    +     +
  Xeon Gold 5318Y       2021   +    +     +
  Xeon Silver 4314      2021   +    +     +
  Xeon Gold 6442Y       2023   +    +     +
  Core Ultra 7 165H *¹  2023   +    +     +
AMD
  Epyc 7301             2017   +    +     ~²
  Epyc 7452             2019   +    +     ~
  Epyc 7642             2019   +    +     ~
  Epyc 7513             2021   +    +     +
  Epyc 9254             2022   +    +     +
  Ryzen 5 8500G *       2024   +    +     +
ARM
  Apple M1 *            2020                    +
  Apple M3 Pro *        2023                    +
  Neoverse-V2           2023                    +
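
As a rough illustration of how the columns of Table A.1 could be queried at run time, the following Rust sketch uses the standard library's feature-detection macros. The names `IsaSupport` and `detect_isa` are hypothetical and do not come from any of the benchmarked parsers. The table does not state which SSE level it refers to, so the sketch assumes SSE2.

```rust
/// Instruction-set extensions from Table A.1, as reported by the running CPU.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct IsaSupport {
    pub sse: bool,
    pub avx2: bool,
    pub bmi2: bool,
    pub neon: bool,
}

/// x86 and x86-64: query SSE2 (assumed level), AVX2 and BMI2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn detect_isa() -> IsaSupport {
    IsaSupport {
        sse: std::arch::is_x86_feature_detected!("sse2"),
        avx2: std::arch::is_x86_feature_detected!("avx2"),
        bmi2: std::arch::is_x86_feature_detected!("bmi2"),
        neon: false,
    }
}

/// AArch64: only NEON is relevant to the table.
#[cfg(target_arch = "aarch64")]
pub fn detect_isa() -> IsaSupport {
    IsaSupport {
        neon: std::arch::is_aarch64_feature_detected!("neon"),
        ..IsaSupport::default()
    }
}

/// Any other architecture: report no extension.
#[cfg(not(any(
    target_arch = "x86",
    target_arch = "x86_64",
    target_arch = "aarch64"
)))]
pub fn detect_isa() -> IsaSupport {
    IsaSupport::default()
}
```

Calling `detect_isa()` once at start-up would be enough to select a code path. Note that the BMI2 feature bit is also set on the CPUs marked "~" in the table, so a check of this kind cannot by itself tell fast PDEP/PEXT from the microcoded variant.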

A.2 Additional experiments on data read from disk

Figure A.1: Throughput of each parser for long reads on multiple CPUs, sorted by manufacturer and year.

A.3 Additional experiments on data loaded in RAM

A.3.1 Throughput

(a) On short reads (FASTQ format).
(b) On long reads (FASTQ format).
Figure A.2: Throughput of each parser for data loaded in RAM on multiple CPUs, sorted by manufacturer and year. Needletail and Paraseq must both go through a reader wrapped around the in-memory slice, which degrades their performance.
Figure A.3: Throughput of Helicase string collection compared to counting DNA bases.

A.3.2 Instructions and cycles

(a) Instructions per byte on short reads.
(b) Cycles per byte on short reads.
Figure A.4: Instructions and cycles per byte on multiple CPUs, sorted by manufacturer and year.

A.3.3 Branches and branch misses

(a) On a human genome (FASTA format).
(b) On short reads (FASTQ format).
Figure A.5: Branches per byte on multiple CPUs, sorted by manufacturer and year.
(a) On a human genome (FASTA format).
(b) On short reads (FASTQ format).
Figure A.6: Branch misses per MB on multiple CPUs, sorted by manufacturer and year.
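
The per-byte and per-MB metrics of Figures A.4 to A.6 are simple normalizations of hardware-counter totals by the input size. A minimal Rust sketch of that arithmetic follows. The `Counters` and `Normalized` types and the `normalize` function are hypothetical, and the sketch assumes that 1 MB means 10^6 bytes, which the figures do not specify.

```rust
/// Raw hardware-counter totals for one parser run (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Counters {
    pub instructions: u64,
    pub cycles: u64,
    pub branches: u64,
    pub branch_misses: u64,
}

/// Counter totals normalized by the input size (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Normalized {
    pub instructions_per_byte: f64,
    pub cycles_per_byte: f64,
    pub branches_per_byte: f64,
    pub branch_misses_per_mb: f64,
}

/// Divides each counter by the number of input bytes.
/// Branch misses are reported per MB, assumed here to be 10^6 bytes.
/// Returns `None` for an empty input to avoid dividing by zero.
pub fn normalize(c: Counters, input_bytes: u64) -> Option<Normalized> {
    if input_bytes == 0 {
        return None;
    }
    let bytes = input_bytes as f64;
    Some(Normalized {
        instructions_per_byte: c.instructions as f64 / bytes,
        cycles_per_byte: c.cycles as f64 / bytes,
        branches_per_byte: c.branches as f64 / bytes,
        branch_misses_per_mb: c.branch_misses as f64 / (bytes / 1e6),
    })
}
```

Branch misses are scaled per MB rather than per byte because they are far rarer than the other events, so a per-byte value would be inconveniently small.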

  1. (*) Personal computers, not benchmarked in a reproducible environment.

  2. (~) Microcoded PDEP/PEXT instructions (very slow).