5  A primer on vectorization

CPU performance scaled remarkably through the late 20th century, driven by increasing transistor counts (roughly following the famous Moore’s law) and rising clock frequencies. However, as clock frequencies began to hit physical limits, CPU manufacturers invested in new solutions to keep improving computation throughput. The main trend over the last decades has been to handle more operations in parallel, either by adding more CPU cores (task-level parallelism), or by increasing the number of operations performed by each core (instruction-level and data-level parallelism). In particular, many applications such as scientific computing, signal processing or regex engines benefit especially from data-level parallelism, where the same operation is applied independently to multiple data elements (Hennessy et al., 2026). In this chapter, we will present how vectorized instructions can speed up these use cases.

5.1 A brief history of vectorized extensions

The concept of applying a single instruction to multiple data elements simultaneously (formally known as SIMD) predates the personal computer era. The Cray-1 (Russell, 1978), designed by Seymour Cray and introduced in 1976, was one of the first successful vector processors. It exposed 8 vector registers of 64 double-precision floats each, with dedicated pipelined units that could operate on entire registers in a single instruction. This made it particularly well-suited to the dense numerical workloads of scientific computing, but its vector model stayed confined to high-performance computing for the following decades. The landscape shifted in 1997 when Intel introduced the MMX extension to consumer CPUs, motivated by the growing demands of multimedia workloads (Peleg et al., 1997). MMX exposed 64-bit registers holding multiple packed integers, establishing the template for the extensions that followed. The main SIMD instruction set extensions available on modern x86 and ARM CPUs are summarized in Table 5.1 and Table 5.2 respectively.

Table 5.1: Main x86 SIMD instruction set extensions (Intel Corporation, 2026), with desktop adoption from the Steam Hardware Survey (Valve Corporation, 2026) [1].
Extension Year Width Key operations Adoption
SSE 1999 128 bits Packed single-precision float ~100%
SSE2 2001 128 bits Packed double-precision float + integers 98.09%
SSE3 2004 128 bits Horizontal add, duplicate loads 98.09%
SSSE3 2006 128 bits Byte shuffle, abs, sign 98.01%
SSE4.1 2007 128 bits Blend, dot product 97.97%
SSE4.2 2008 128 bits String compare, popcount 97.93%
AVX 2011 256 bits 256-bit float operations 96.93%
AVX2 2013 256 bits 256-bit integers, permute, gather 94.84%
AVX-512 [2] 2016 [3] 512 bits 512-bit float + int, mask registers, scatter 21.89% [4]
Table 5.2: Main ARM SIMD instruction set extensions (Arm Limited, 2026).
Extension Year Width Key operations Adoption
NEON 2007 128 bits Packed float + integers All
SVE 2016 128–2048 bits [5] Vector-length agnostic operations, per-element masking, gather/scatter HPC/server
SVE2 2019 128–2048 bits Complex arithmetic, bit deposit/extract HPC/server

5.2 SIMD vectorization in practice

All the SIMD extensions described above (apart from SVE/SVE2, which are vector-length agnostic) follow the same fixed-width design: each wide register is divided into equally-sized lanes of 8–64 bits (e.g. 32 lanes of 8 bits packed in a 256-bit register), and each instruction operates independently on every lane in parallel. Supported operations include arithmetic (excluding integer division and modulo), bitwise operations and comparisons. Listing 5.1 and Listing 5.2 show an example computing the sum of 32-bit signed integers in AVX2 and NEON respectively.

This paradigm enforces two main constraints on algorithm design. First, there is no branching on individual lane values: the usual workaround is to compute a boolean mask for each lane and use it to blend the results of two paths [6]. Second, instructions that combine values across lanes (known as cross-lane operations) are rare: the main ones are horizontal reductions and permutations.
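To make the first constraint concrete, here is a minimal scalar sketch of the mask-and-blend workaround (a hypothetical `blend_max` helper, emulating on plain arrays what per-lane compare and blend instructions such as `_mm256_cmpgt_epi32` and `_mm256_blendv_epi8` do):

```rust
// Scalar emulation of the SIMD mask-and-blend pattern: per-lane max of two
// 4-lane vectors without branching on lane values. `blend_max` is an
// illustrative helper, not part of any SIMD API.
fn blend_max(xs: &[i32; 4], ys: &[i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for lane in 0..4 {
        // Step 1: compare. SIMD compare instructions set a lane to all ones
        // (-1) when the condition holds, and to all zeros otherwise.
        let mask: i32 = if xs[lane] > ys[lane] { -1 } else { 0 };
        // Step 2: blend. Select from xs where the mask is set, ys elsewhere.
        out[lane] = (mask & xs[lane]) | (!mask & ys[lane]);
    }
    out
}
```

In real SIMD code both steps apply to all lanes at once, so the per-lane `if` above disappears entirely; only the data-dependent bitwise selection remains.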

Listing 5.1: Sum of i32 in AVX2 using 256-bit registers (type conversions and unsafe keywords omitted for clarity).
let mut sum = _mm256_setzero_si256(); // initialize with 0s
for chunk in slice.chunks_exact(8) { // for each full chunk of 8 32-bit ints (remainder handled separately)
    let vec = _mm256_loadu_si256(chunk.as_ptr()); // load in a 256-bit register
    sum = _mm256_add_epi32(sum, vec); // add each lane to the existing sum
}
let mut res = [0i32; 8];
_mm256_storeu_si256(res.as_mut_ptr(), sum);
return res.iter().sum::<i32>(); // manual horizontal sum of the lanes
Listing 5.2: Sum of i32 in NEON using 128-bit registers (unsafe keywords omitted for clarity).
let mut sum = vdupq_n_s32(0); // initialize with 0s
for chunk in slice.chunks_exact(4) { // for each full chunk of 4 32-bit ints (remainder handled separately)
    let vec = vld1q_s32(chunk.as_ptr()); // load in a 128-bit register
    sum = vaddq_s32(sum, vec); // add each lane to the existing sum
}
return vaddvq_s32(sum); // horizontal sum of the lanes

One of the main drawbacks of hand-written SIMD code as shown in Listing 5.1 and Listing 5.2 is that it requires a specific implementation for each SIMD extension we want to support, which hurts both maintainability and portability. However, trusting the compiler to auto-vectorize our code is not a reliable option either, since auto-vectorization can easily break with a minor code change, a new compiler version or a different target architecture. A popular alternative is to use a portable SIMD library, such as Rust’s portable_simd or wide [7], which provides a unified API that is compiled down to the appropriate SIMD extension. An example of a portable SIMD implementation (supporting both x86 and ARM CPUs) is given in Listing 5.3.

Listing 5.3: Sum of i32 using portable_simd.
#![feature(portable_simd)]
use std::simd::Simd;

let mut sum = Simd::<i32, 8>::splat(0); // initialize with 0s
for chunk in slice.chunks_exact(8) { // for each full chunk of 8 32-bit ints (remainder handled separately)
    let vec = Simd::<i32, 8>::from_slice(chunk); // load in a SIMD register
    sum += vec; // add each lane to the existing sum
}
return sum.reduce_sum(); // horizontal sum of the lanes

5.3 The example of memchr

Another instructive example is memchr, which finds the first occurrence of a given byte in a buffer. A naive scalar implementation, detailed in Listing 5.4, reads one byte per iteration, giving a throughput of one byte per cycle at best.

Listing 5.4: Scalar implementation of memchr.
fn memchr_scalar(buf: &[u8], needle: u8) -> Option<usize> {
    for (i, &byte) in buf.iter().enumerate() {
        if byte == needle { return Some(i); }
    }
    return None;
}

On the other hand, an AVX2 implementation, detailed in Listing 5.5, processes 32 bytes per iteration, following a three-step pattern that recurs in many applications. First, the needle byte is broadcast across all 32 lanes of a 256-bit register. Second, 32 bytes are loaded from the buffer and compared against the broadcast register in a single instruction, producing a boolean mask for equality. Third, the mask is collapsed to a 32-bit integer (i.e. one bit per lane) using movemask and the position of the first match is found by counting trailing zeros in a single instruction.

Listing 5.5: AVX2 implementation of memchr (type conversions and unsafe keywords omitted for clarity).
fn memchr_avx2(buf: &[u8], needle: u8) -> Option<usize> {
    let needles = _mm256_set1_epi8(needle); // broadcast to all 32 lanes
    let mut i = 0;
    while i + 32 <= buf.len() {
        let chunk = _mm256_loadu_si256(buf.as_ptr().add(i));
        let cmp   = _mm256_cmpeq_epi8(chunk, needles); // compare all 32 lanes
        let mask  = _mm256_movemask_epi8(cmp);  // 1 bit per matching lane
        if mask != 0 {
            return Some(i + mask.trailing_zeros()); // index of first match
        }
        i += 32;
    }
    // handle the <32 remaining bytes manually
    return buf[i..].iter().position(|&b| b == needle).map(|j| i + j);
}

The performance difference is substantial, approaching a 32× speedup on long strings, which is precisely why vectorized primitives should be preferred. The broadcast-compare-scan pattern illustrated here is not specific to memchr. It appears whenever we need to locate a particular value in a stream of data, which we will extensively use for parsing in the next chapter.
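The broadcast-compare-scan idea is not even tied to dedicated SIMD registers: it also works within an ordinary 64-bit integer, a technique known as SWAR (SIMD within a register) that several libc implementations use as a portable fallback. The sketch below (an illustrative `memchr_swar`, not an optimized implementation) follows the same three steps in stable Rust:

```rust
// Word-at-a-time (SWAR) variant of memchr: the broadcast-compare-scan
// pattern packed into an ordinary u64 instead of a SIMD register.
fn memchr_swar(buf: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101; // 0x01 in every byte
    const HI: u64 = 0x8080_8080_8080_8080; // 0x80 in every byte
    let broadcast = u64::from_le_bytes([needle; 8]); // step 1: broadcast
    let mut i = 0;
    while i + 8 <= buf.len() {
        let word = u64::from_le_bytes(buf[i..i + 8].try_into().unwrap());
        let x = word ^ broadcast; // step 2: matching bytes become 0x00
        // Zero-byte detection: sets the high bit of every zero byte of x.
        // Bits above the first match may be false positives, but the lowest
        // set bit always marks a true match, so trailing_zeros is safe.
        let found = x.wrapping_sub(LO) & !x & HI;
        if found != 0 {
            return Some(i + (found.trailing_zeros() / 8) as usize); // step 3
        }
        i += 8;
    }
    // handle the <8 remaining bytes one at a time
    buf[i..].iter().position(|&b| b == needle).map(|j| i + j)
}
```

With only 8 lanes per iteration the speedup is far more modest than AVX2’s 32 lanes, but the structure is identical, which is what makes the pattern easy to port across register widths.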


  1. I was not expecting to cite Steam in my thesis, but here I am.

  2. AVX-512 is a family of sub-extensions. All modern implementations ship F (foundation), BW (byte/word operations), DQ (doubleword/quadword), and VL (128/256-bit variants of AVX-512 instructions) together. The Steam survey confirms identical adoption across all four.

  3. First implemented in 2016 for servers and 2017 for desktop, then completely removed from Intel desktop processors in 2022 (Alder Lake and its successors). AMD implemented it in 2022 (Zen 4).

  4. Apart from recent AMD CPUs, AVX-512 is only available on servers, which explains the low adoption among consumer CPUs.

  5. While the SVE specification allows up to 2048 bits in theory, no implementation to date exceeds 512 bits.

  6. I recommend reading https://en.algorithmica.org/hpc/simd/masking/ for a deeper introduction.

  7. At the moment, wide is less complete than portable_simd but has the benefit of being stable.