5 A primer on vectorization
CPU performance scaled remarkably through the late 20th century, driven by rising transistor counts (roughly following the famous Moore's law) and increasing clock frequencies. However, as clock frequencies started hitting physical limits, CPU manufacturers invested in new solutions to keep improving computation throughput. The main trend over the last decades has been to handle more operations in parallel, either by adding more CPU cores (task-level parallelism) or by increasing the number of operations performed by each core (instruction-level and data-level parallelism). In particular, applications such as scientific computing, signal processing or regex engines benefit especially from data-level parallelism, where the same operation is applied independently to multiple data elements (Hennessy et al., 2026). In this chapter, we present how vectorized instructions can speed up these use cases.
5.1 A brief history of vectorized extensions
The concept of applying a single instruction to multiple data elements simultaneously (formally known as SIMD) predates the personal computer era. The Cray-1 (Russell, 1978), designed by Seymour Cray and introduced in 1976, was one of the first successful vector processors. It exposed 8 vector registers of 64 double-precision floats each, with dedicated pipelined units that could operate on entire registers in a single instruction. This made it particularly well-suited to the dense numerical workloads of scientific computing, but its vector model stayed confined to high-performance computing for the following decades. The landscape shifted in 1997 when Intel introduced the MMX extension to consumer CPUs, motivated by the growing demands of multimedia workloads (Peleg et al., 1997). MMX exposed 64-bit registers holding multiple packed integers, establishing the template of the extensions that followed. The main SIMD instruction set extensions available on modern x86 and ARM CPUs are summarized in Table 5.1 and Table 5.2 respectively.
Table 5.1: Main SIMD instruction set extensions on x86 CPUs.

| Extension | Year | Width | Key operations | Adoption[1] |
|---|---|---|---|---|
| SSE | 1999 | 128 bits | Packed single-precision float | ~100% |
| SSE2 | 2001 | 128 bits | Packed double-precision float + integers | 98.09% |
| SSE3 | 2004 | 128 bits | Horizontal add, duplicate loads | 98.09% |
| SSSE3 | 2006 | 128 bits | Byte shuffle, abs, sign | 98.01% |
| SSE4.1 | 2007 | 128 bits | Blend, dot product | 97.97% |
| SSE4.2 | 2008 | 128 bits | String compare, popcount | 97.93% |
| AVX | 2011 | 256 bits | 256-bit float operations | 96.93% |
| AVX2 | 2013 | 256 bits | 256-bit integers, permute, gather | 94.84% |
| AVX-512[2] | 2016[3] | 512 bits | 512-bit float + int, mask registers, scatter | 21.89%[4] |

Table 5.2: Main SIMD instruction set extensions on ARM CPUs.

| Extension | Year | Width | Key operations | Adoption |
|---|---|---|---|---|
| NEON | 2007 | 128 bits | Packed float + integers | All |
| SVE | 2016 | 128–2048 bits[5] | Vector-length agnostic operations, per-element masking, gather/scatter | HPC/server |
| SVE2 | 2019 | 128–2048 bits | Complex arithmetic, bit deposit/extract | HPC/server |
5.2 SIMD vectorization in practice
All the SIMD extensions described above (apart from SVE/SVE2) follow the same design: each wide register is divided into equally-sized lanes of 8–64 bits (e.g. 32 lanes of 8 bits packed in a 256-bit register), and each instruction operates independently on every lane in parallel. Supported operations include arithmetic (excluding integer division and modulo), bitwise operations and comparisons. Listing 5.1 and Listing 5.2 show an example computing the sum of 32-bit signed integers in AVX2 and NEON respectively.
This paradigm places two main constraints on algorithm design. First, there is no branching on individual lane values: the usual workaround is to compute a boolean mask for each lane and use it to blend the results of the two paths[6]. Second, instructions that combine values across lanes (known as cross-lane operations) are scarce: the main ones are horizontal reductions and permutations.
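To make the first constraint concrete, here is a minimal sketch of the mask-and-blend workaround with AVX2 intrinsics: it zeroes out the negative lanes of a vector without ever branching on a lane value. The `relu8` helper name is illustrative, and the scalar fallback (mirroring the same select logic) covers CPUs without AVX2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Branchless per-lane select: keep positive lanes, zero out the rest.
fn relu8(xs: [i32; 8]) -> [i32; 8] {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { relu8_avx2(xs) };
    }
    xs.map(|x| if x > 0 { x } else { 0 }) // scalar fallback, same logic
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn relu8_avx2(xs: [i32; 8]) -> [i32; 8] {
    let v = _mm256_loadu_si256(xs.as_ptr() as *const __m256i);
    let zero = _mm256_setzero_si256();
    // 1. compare: lanes where x > 0 become all-ones, the others all-zeros
    let mask = _mm256_cmpgt_epi32(v, zero);
    // 2. blend: take the lane from `v` where the mask is set, else from `zero`
    let res = _mm256_blendv_epi8(zero, v, mask);
    let mut out = [0i32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, res);
    out
}

fn main() {
    let out = relu8([-3, 7, 0, -1, 42, -8, 5, -2]);
    assert_eq!(out, [0, 7, 0, 0, 42, 0, 5, 0]);
}
```

Both branches of the conditional are effectively computed for every lane, and the mask decides which result survives; this trades some redundant work for the absence of per-lane control flow.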
Listing 5.1: Sum of i32 in AVX2 using 256-bit registers (type conversions and unsafe keywords omitted for clarity).

```rust
let mut sum = _mm256_setzero_si256();             // initialize with 0s
for chunk in slice.chunks_exact(8) {              // for each chunk of 8 32-bit ints
    let vec = _mm256_loadu_si256(chunk.as_ptr()); // load in a 256-bit register
    sum = _mm256_add_epi32(sum, vec);             // add each lane to the existing sum
}
// (handling of the <8 remaining elements omitted)
let mut res = [0i32; 8];
_mm256_storeu_si256(res.as_mut_ptr(), sum);
return res.iter().sum::<i32>();                   // manual horizontal sum of the lanes
```

Listing 5.2: Sum of i32 in NEON using 128-bit registers (type conversions and unsafe keywords omitted for clarity).
```rust
let mut sum = vdupq_n_s32(0);            // initialize with 0s
for chunk in slice.chunks_exact(4) {     // for each chunk of 4 32-bit ints
    let vec = vld1q_s32(chunk.as_ptr()); // load in a 128-bit register
    sum = vaddq_s32(sum, vec);           // add each lane to the existing sum
}
// (handling of the <4 remaining elements omitted)
return vaddvq_s32(sum);                  // horizontal sum of the lanes
```

One of the main drawbacks of hand-written SIMD code as shown in Listing 5.1 and Listing 5.2 is that it requires a specific implementation for each SIMD extension we want to support, which hurts maintainability and portability. However, trusting the compiler to auto-vectorize our code is not a reliable option either, since the optimization can easily break with a minor code change, a new compiler version or a different target architecture. A popular alternative is to use portable SIMD libraries, such as Rust's portable_simd or wide[7], which provide a unified API that compiles down to the appropriate SIMD extension. An example of a portable SIMD implementation (supporting both x86 and ARM CPUs) is given in Listing 5.3.
Listing 5.3: Sum of i32 using portable_simd.

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

let mut sum = Simd::<i32, 8>::splat(0);          // initialize with 0s
for chunk in slice.chunks_exact(8) {             // for each chunk of 8 32-bit ints
    let vec = Simd::<i32, 8>::from_slice(chunk); // load in SIMD register
    sum += vec;                                  // add each lane to the existing sum
}
return sum.reduce_sum();                         // horizontal sum of the lanes
```

5.3 The example of memchr
Another instructive example is memchr, which finds the first occurrence of a given byte in a buffer. A naive scalar implementation, detailed in Listing 5.4, reads one byte per iteration, giving a throughput of one byte per cycle at best.
Listing 5.4: Scalar implementation of memchr.

```rust
fn memchr_scalar(buf: &[u8], needle: u8) -> Option<usize> {
    for (i, &byte) in buf.iter().enumerate() {
        if byte == needle {
            return Some(i);
        }
    }
    return None;
}
```

On the other hand, an AVX2 implementation, detailed in Listing 5.5, processes 32 bytes per iteration, following a three-step pattern that recurs in many applications. First, the needle byte is broadcast across all 32 lanes of a 256-bit register. Second, 32 bytes are loaded from the buffer and compared against the broadcast register in a single instruction, producing a boolean mask for equality. Third, the mask is collapsed to a 32-bit integer (i.e. one bit per lane) using movemask, and the position of the first match is found by counting trailing zeros in a single instruction.
Listing 5.5: AVX2 implementation of memchr (type conversions and unsafe keywords omitted for clarity).

```rust
fn memchr_avx2(buf: &[u8], needle: u8) -> Option<usize> {
    let needles = _mm256_set1_epi8(needle); // broadcast to all 32 lanes
    let mut i = 0;
    while i + 32 <= buf.len() {
        let chunk = _mm256_loadu_si256(buf.as_ptr().add(i));
        let cmp = _mm256_cmpeq_epi8(chunk, needles); // compare all 32 lanes
        let mask = _mm256_movemask_epi8(cmp);        // 1 bit per matching lane
        if mask != 0 {
            return Some(i + mask.trailing_zeros()); // index of first match
        }
        i += 32;
    }
    // handle the <32 remaining bytes manually
    return buf[i..].iter().position(|&b| b == needle).map(|j| i + j);
}
```

The performance difference is substantial, approaching a 32× speedup on long strings, which is precisely why vectorized primitives should be preferred. The broadcast-compare-scan pattern illustrated here is not specific to memchr: it appears whenever we need to locate a particular value in a stream of data, and we will use it extensively for parsing in the next chapter.
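Listing 5.5 omits the pointer casts and unsafe keywords; as a rough self-contained sketch of what the complete routine looks like on stable Rust, here is the same logic with the conversions restored, plus runtime feature detection and a scalar fallback (the `memchr_simd` wrapper name is illustrative):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

fn memchr_simd(buf: &[u8], needle: u8) -> Option<usize> {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        return unsafe { memchr_avx2(buf, needle) };
    }
    buf.iter().position(|&b| b == needle) // scalar fallback
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn memchr_avx2(buf: &[u8], needle: u8) -> Option<usize> {
    let needles = _mm256_set1_epi8(needle as i8); // broadcast to all 32 lanes
    let mut i = 0;
    while i + 32 <= buf.len() {
        let chunk = _mm256_loadu_si256(buf.as_ptr().add(i) as *const __m256i);
        let cmp = _mm256_cmpeq_epi8(chunk, needles); // compare all 32 lanes
        let mask = _mm256_movemask_epi8(cmp);        // 1 bit per matching lane
        if mask != 0 {
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // handle the <32 remaining bytes manually
    buf[i..].iter().position(|&b| b == needle).map(|j| i + j)
}

fn main() {
    let mut buf = vec![b'a'; 100];
    buf[70] = b'x';
    assert_eq!(memchr_simd(&buf, b'x'), Some(70)); // hit inside a 32-byte chunk
    assert_eq!(memchr_simd(&buf, b'z'), None);
}
```

The `#[target_feature]` attribute lets the compiler emit AVX2 instructions for that one function only, so the binary still runs on older CPUs as long as the detection check guards the call.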
[1] I was not expecting to cite Steam in my thesis, but here I am.

[2] AVX-512 is a family of sub-extensions. All modern implementations ship F (foundation), BW (byte/word operations), DQ (doubleword/quadword) and VL (128/256-bit variants of AVX-512 instructions) together. The Steam survey confirms identical adoption across all four.

[3] First implemented in 2016 for servers and 2017 for desktop, then completely removed from Intel desktop processors in 2022 (Alder Lake and its successors). AMD implemented it in 2022 (Zen 4).

[4] Apart from recent AMD CPUs, AVX-512 is only available on servers, which explains the low adoption among consumer CPUs.

[5] While the SVE specification allows up to 2048 bits in theory, implementations to date go no wider than 512 bits.

[6] I recommend reading https://en.algorithmica.org/hpc/simd/masking/ for a deeper introduction.

[7] At the moment, wide is less complete than portable_simd but has the benefit of being stable.