5 A primer on vectorization
CPU performance scaled remarkably through the late 20th century, driven by rising transistor counts (roughly following the famous Moore's law) and increasing clock frequencies. However, as clock frequencies started hitting physical limits, CPU manufacturers invested in new solutions to keep improving computation throughput. The main trend over the last decades has been to handle more operations in parallel, either by adding more CPU cores (task-level parallelism) or by increasing the number of operations performed by each core (instruction-level and data-level parallelism). In particular, applications such as scientific computing, signal processing or regex engines benefit especially from data-level parallelism, where the same operation is applied independently to multiple data elements (Hennessy et al., 2026). In this chapter, we present how vectorized instructions can speed up these use cases.
5.1 A brief history of vectorized extensions
The concept of applying a single instruction to multiple data elements simultaneously (formally known as SIMD) predates the personal computer era. The Cray-1 (Russell, 1978), designed by Seymour Cray and introduced in 1976, was one of the first successful vector processors. It exposed 8 vector registers of 64 double-precision floats each, with dedicated pipelined units that could operate on entire registers in a single instruction. This made it particularly well-suited to the dense numerical workloads of scientific computing, but its vector model stayed confined to high-performance computing for the following decades. The landscape shifted in 1997 when Intel introduced the MMX extension to consumer CPUs, motivated by the growing demands of multimedia workloads (Peleg et al., 1997). MMX exposed 64-bit registers holding multiple packed integers, establishing the template of the extensions that followed. The main SIMD instruction set extensions available on modern x86 and ARM CPUs are summarized in Table 5.1 and Table 5.2 respectively.
Table 5.1: Main SIMD instruction set extensions on x86 CPUs.

| Extension | Year | Width | Key operations | Adoption[1] |
|---|---|---|---|---|
| SSE | 1999 | 128 bits | Packed single-precision float | ~100% |
| SSE2 | 2001 | 128 bits | Packed double-precision float + integers | 98.09% |
| SSE3 | 2004 | 128 bits | Horizontal add, duplicate loads | 98.09% |
| SSSE3 | 2006 | 128 bits | Byte shuffle, abs, sign | 98.01% |
| SSE4.1 | 2007 | 128 bits | Blend, dot product | 97.97% |
| SSE4.2 | 2008 | 128 bits | String compare, popcount | 97.93% |
| AVX | 2011 | 256 bits | 256-bit float operations | 96.93% |
| AVX2 | 2013 | 256 bits | 256-bit integers, permute, gather | 94.84% |
| AVX-512[2] | 2016[3] | 512 bits | 512-bit float + int, mask registers, scatter | 21.89%[4] |

Table 5.2: Main SIMD instruction set extensions on ARM CPUs.

| Extension | Year | Width | Key operations | Adoption |
|---|---|---|---|---|
| NEON | 2007 | 128 bits | Packed float + integers | All |
| SVE | 2016 | 128–2048 bits[5] | Vector-length agnostic operations, per-element masking, gather/scatter | HPC/server |
| SVE2 | 2019 | 128–2048 bits | Complex arithmetic, bit deposit/extract | HPC/server |
5.2 SIMD vectorization in practice
All the SIMD extensions described above (apart from SVE/SVE2) follow the same design: each wide register is divided into equally-sized lanes of 8–64 bits (e.g. 32 lanes of 8 bits packed in a 256-bit register), and each instruction operates independently on every lane in parallel. Supported operations include arithmetic (excluding integer division and modulo), bitwise operations and comparisons. Listing 5.1 and Listing 5.2 show an example computing the sum of 32-bit signed integers in AVX2 and NEON respectively.
This paradigm places two main constraints on algorithm design. First, there is no branching on individual lane values: the usual workaround is to compute a boolean mask for each lane and use it to blend the results of the two paths[6]. Second, instructions that combine values across lanes (known as cross-lane operations) are scarce: the main ones are horizontal reductions and permutations.
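To make the first constraint concrete, here is a minimal sketch of the mask-and-blend workaround with AVX2 intrinsics: it zeroes out the negative lanes of a vector without ever branching on a lane value. The `relu8` helper name is illustrative, and the scalar fallback (mirroring the same select logic) covers CPUs without AVX2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Branchless per-lane select: keep positive lanes, zero out the rest.
fn relu8(xs: [i32; 8]) -> [i32; 8] {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { relu8_avx2(xs) };
    }
    xs.map(|x| if x > 0 { x } else { 0 }) // scalar fallback, same logic
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn relu8_avx2(xs: [i32; 8]) -> [i32; 8] {
    let v = _mm256_loadu_si256(xs.as_ptr() as *const __m256i);
    let zero = _mm256_setzero_si256();
    // 1. compare: lanes where x > 0 become all-ones, the others all-zeros
    let mask = _mm256_cmpgt_epi32(v, zero);
    // 2. blend: take the lane from `v` where the mask is set, else from `zero`
    let res = _mm256_blendv_epi8(zero, v, mask);
    let mut out = [0i32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, res);
    out
}

fn main() {
    let out = relu8([-3, 7, 0, -1, 42, -8, 5, -2]);
    assert_eq!(out, [0, 7, 0, 0, 42, 0, 5, 0]);
}
```

Both branches of the conditional are effectively computed for every lane, and the mask decides which result survives; this trades some redundant work for the absence of per-lane control flow.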
Listing 5.1: Sum of i32 in AVX2 using 256-bit registers (type conversions and unsafe keywords omitted for clarity).

```rust
let mut sum = _mm256_setzero_si256();             // initialize with 0s
for chunk in slice.chunks_exact(8) {              // for each chunk of 8 32-bit ints
    let vec = _mm256_loadu_si256(chunk.as_ptr()); // load in a 256-bit register
    sum = _mm256_add_epi32(sum, vec);             // add each lane to the existing sum
}
// (handling of the <8 remaining elements omitted)
let mut res = [0i32; 8];
_mm256_storeu_si256(res.as_mut_ptr(), sum);
return res.iter().sum::<i32>();                   // manual horizontal sum of the lanes
```

Listing 5.2: Sum of i32 in NEON using 128-bit registers (type conversions and unsafe keywords omitted for clarity).
```rust
let mut sum = vdupq_n_s32(0);            // initialize with 0s
for chunk in slice.chunks_exact(4) {     // for each chunk of 4 32-bit ints
    let vec = vld1q_s32(chunk.as_ptr()); // load in a 128-bit register
    sum = vaddq_s32(sum, vec);           // add each lane to the existing sum
}
// (handling of the <4 remaining elements omitted)
return vaddvq_s32(sum);                  // horizontal sum of the lanes
```

One of the main drawbacks of hand-written SIMD code as shown in Listing 5.1 and Listing 5.2 is that it requires a specific implementation for each SIMD extension we want to support, which hurts maintainability and portability. However, trusting the compiler to auto-vectorize our code is not a reliable option either, since the optimization can easily break with a minor code change, a new compiler version or a different target architecture. A popular alternative is to use portable SIMD libraries, such as Rust's portable_simd or wide[7], which provide a unified API that compiles down to the appropriate SIMD extension. An example of a portable SIMD implementation (supporting both x86 and ARM CPUs) is given in Listing 5.3.
Listing 5.3: Sum of i32 using portable_simd.

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

let mut sum = Simd::<i32, 8>::splat(0);          // initialize with 0s
for chunk in slice.chunks_exact(8) {             // for each chunk of 8 32-bit ints
    let vec = Simd::<i32, 8>::from_slice(chunk); // load in SIMD register
    sum += vec;                                  // add each lane to the existing sum
}
return sum.reduce_sum();                         // horizontal sum of the lanes
```

5.3 The example of memchr
Another instructive example is memchr, which finds the first occurrence of a given byte in a buffer. A naive scalar implementation, detailed in Listing 5.4, reads one byte per iteration, giving a throughput of one byte per cycle at best.
Listing 5.4: Scalar implementation of memchr.

```rust
fn memchr_scalar(buf: &[u8], needle: u8) -> Option<usize> {
    for (i, &byte) in buf.iter().enumerate() {
        if byte == needle {
            return Some(i);
        }
    }
    return None;
}
```

On the other hand, an AVX2 implementation, detailed in Listing 5.5, processes 32 bytes per iteration, following a three-step pattern that recurs in many applications. First, the needle byte is broadcast across all 32 lanes of a 256-bit register. Second, 32 bytes are loaded from the buffer and compared against the broadcast register in a single instruction, producing a boolean mask for equality. Third, the mask is collapsed to a 32-bit integer (i.e. one bit per lane) using movemask, and the position of the first match is found by counting trailing zeros in a single instruction.
Listing 5.5: AVX2 implementation of memchr (type conversions and unsafe keywords omitted for clarity).

```rust
fn memchr_avx2(buf: &[u8], needle: u8) -> Option<usize> {
    let needles = _mm256_set1_epi8(needle); // broadcast to all 32 lanes
    let mut i = 0;
    while i + 32 <= buf.len() {
        let chunk = _mm256_loadu_si256(buf.as_ptr().add(i));
        let cmp = _mm256_cmpeq_epi8(chunk, needles); // compare all 32 lanes
        let mask = _mm256_movemask_epi8(cmp);        // 1 bit per matching lane
        if mask != 0 {
            return Some(i + mask.trailing_zeros()); // index of first match
        }
        i += 32;
    }
    // handle the <32 remaining bytes manually
    return buf[i..].iter().position(|&b| b == needle).map(|j| i + j);
}
```

The performance difference is substantial, approaching a 32× speedup on long strings, which is precisely why vectorized primitives should be preferred. The broadcast-compare-scan pattern illustrated here is not specific to memchr: it appears whenever we need to locate a particular value in a stream of data, and we will use it extensively for parsing in the next chapter.
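Listing 5.5 omits the pointer casts and unsafe keywords; as a rough self-contained sketch of what the complete routine looks like on stable Rust, here is the same logic with the conversions restored, plus runtime feature detection and a scalar fallback (the `memchr_simd` wrapper name is illustrative):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

fn memchr_simd(buf: &[u8], needle: u8) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { memchr_avx2(buf, needle) };
    }
    buf.iter().position(|&b| b == needle) // scalar fallback
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn memchr_avx2(buf: &[u8], needle: u8) -> Option<usize> {
    let needles = _mm256_set1_epi8(needle as i8); // broadcast to all 32 lanes
    let mut i = 0;
    while i + 32 <= buf.len() {
        let chunk = _mm256_loadu_si256(buf.as_ptr().add(i) as *const __m256i);
        let cmp = _mm256_cmpeq_epi8(chunk, needles); // compare all 32 lanes
        let mask = _mm256_movemask_epi8(cmp);        // 1 bit per matching lane
        if mask != 0 {
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // handle the <32 remaining bytes manually
    buf[i..].iter().position(|&b| b == needle).map(|j| i + j)
}

fn main() {
    let mut buf = vec![b'a'; 100];
    buf[70] = b'x';
    assert_eq!(memchr_simd(&buf, b'x'), Some(70)); // hit inside a 32-byte chunk
    assert_eq!(memchr_simd(&buf, b'z'), None);
}
```

The `#[target_feature]` attribute lets the compiler emit AVX2 instructions for that one function only, so the binary still runs on older CPUs as long as the detection check guards the call.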
[1] I was not expecting to cite Steam in my thesis, but here I am.

[2] AVX-512 is a family of sub-extensions. All modern implementations ship F (foundation), BW (byte/word operations), DQ (doubleword/quadword) and VL (128/256-bit variants of AVX-512 instructions) together. The Steam survey confirms identical adoption across all four.

[3] First implemented in 2016 for servers and 2017 for desktop, then completely removed from Intel desktop processors in 2022 (Alder Lake and its successors). AMD implemented it in 2022 (Zen 4).

[4] Apart from recent AMD CPUs, AVX-512 is only available on servers, which explains the low adoption among consumer CPUs.

[5] While the SVE specification allows up to 2048 bits in theory, implementations to date go no wider than 512 bits.

[6] I recommend reading https://en.algorithmica.org/hpc/simd/masking/ for a deeper introduction.

[7] At the moment, wide is less complete than portable_simd but has the benefit of being stable.