A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline

Hi folks! We just posted a new preprint on arXiv: A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline. This post is a quick, friendly tour of what we found.

Code and artifact

The tool we built to run all of this is llvm-passview — an open-source driver that automates the prefix-by-prefix compilation and measurement pipeline.

If you want to dig into the raw data, the full artifact lives at PAPER-ARTIFACT. For every benchmark, at every pipeline step, you get: the LLVM IR, the compiled binary, execution logs, and profiling CSVs for every metric (runtime, compile time, binary size, IPC, cache misses, energy, ...). That's a lot of data to play with. The exact 113-step pipeline decomposition is available here.

The question

You pass -O3 to clang and your code gets faster. Great. But which of the 113 optimization passes that -O3 runs actually did something? And did they all help, or did some of them secretly make things worse before another pass fixed it?

That's what we set out to answer. We ran every cumulative prefix of the -O3 pipeline — pass 0 only, then passes 0–1, then 0–2, ... up to all 113 — on 30 PolyBench/C kernels and measured execution time, compile time, binary size, hardware counters (IPC, cache misses, ...) and energy. That's 84,750 measurements in total, with careful noise mitigation.
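To make "every cumulative prefix" concrete, here is a minimal sketch of the enumeration. It is not the llvm-passview code: the function name `cumulative_prefixes` and the four-pass excerpt are assumptions for illustration, and turning each prefix into a real compiler invocation is deliberately left out.

```python
from typing import Iterator, Sequence


def cumulative_prefixes(passes: Sequence[str]) -> Iterator[list[str]]:
    """Yield passes[0:1], passes[0:2], ..., passes[0:n] in order."""
    for end in range(1, len(passes) + 1):
        yield list(passes[:end])


# Hypothetical four-step excerpt; the real -O3 decomposition has 113 steps
# and a different order.
example = ["sroa", "early-cse", "instcombine", "licm"]
for step, prefix in enumerate(cumulative_prefixes(example)):
    print(step, ",".join(prefix))
```

Each yielded list is one measurement point: compile with exactly those passes, run, and record the metrics.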

All multi-benchmark figures below use the following legend — one unique color, marker, and line style per kernel:

Benchmark legend — Legend for all multi-benchmark figures (unique color, marker, and line style per kernel).

Finding 1 — marginal impact of individual passes

The first surprise: the -O3 pipeline does not improve runtime monotonically. About 7–10% of pass-to-pass transitions actually slow things down before a later pass recovers the loss. Here's what the runtime trajectory looks like across all 30 benchmarks:

Runtime evolution with 95% confidence intervals across the -O3 pipeline — Runtime evolution (normalized to -O0) with 95% confidence intervals. Each line is a benchmark. Note the non-monotone wiggles — those are real regressions, not noise.
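One simple way to count such regressions on a single runtime trajectory is sketched below. The function name, the `tolerance` parameter, and the sample numbers are assumptions; the paper separates real regressions from noise with confidence intervals, which this sketch does not reproduce.

```python
def regression_fraction(times: list[float], tolerance: float = 0.0) -> float:
    """Fraction of consecutive transitions where runtime grows by more than
    `tolerance` (relative to the runtime before the pass)."""
    transitions = list(zip(times, times[1:]))
    if not transitions:
        return 0.0
    regressions = sum(
        1 for before, after in transitions if after > before * (1.0 + tolerance)
    )
    return regressions / len(transitions)


# Hypothetical normalized runtimes after each pipeline prefix.
trajectory = [1.00, 0.90, 0.95, 0.70, 0.70, 0.60]
print(regression_fraction(trajectory))  # one regression in five transitions
```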

Most of the speedup also comes late in the pipeline. The median benchmark needs to complete 84.8% of the pipeline before locking in 80% of its final speedup — driven by loop-vectorize (pass 87) and loop-unroll (pass 95). Three benchmarks (correlation, covariance, seidel-2d) end up slower than -O0 altogether.

Cumulative speedup and normalized gain across the pipeline — Top: cumulative speedup S_i per benchmark. Bottom: normalized fraction of final speedup reached at each pipeline prefix. The S-curve shape and the late crossing of the 0.8 threshold are clearly visible.
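The "how much of the pipeline until 80% of the final speedup is locked in" number can be computed along these lines. This is a sketch under assumptions: it measures gain as S_i − 1 relative to the final S_n − 1, and it reads "locking in" as the first prefix after which the gain never drops below the threshold again. The paper's exact normalization may differ.

```python
from typing import Optional


def lock_in_fraction(times: list[float], threshold: float = 0.8) -> Optional[float]:
    """times[0] is the baseline runtime, times[i] the runtime after prefix i.

    Returns the fraction of the pipeline after which the normalized gain
    stays at or above `threshold`, or None if there is no final speedup.
    """
    speedups = [times[0] / t for t in times]
    final_gain = speedups[-1] - 1.0
    if final_gain <= 0:
        return None
    gains = [(s - 1.0) / final_gain for s in speedups]
    # Last prefix that is still below the threshold; lock-in is the next one.
    last_below = max((i for i, g in enumerate(gains) if g < threshold), default=-1)
    return (last_below + 1) / (len(times) - 1)


# Hypothetical runtimes in seconds: baseline followed by five prefixes.
print(lock_in_fraction([10.0, 8.0, 5.0, 2.2, 2.1, 2.0]))
```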

Decomposing the trajectory into per-pass deltas reveals a clear 80/20 structure: a handful of passes dominate, and the long tail is essentially noise. early-cse, licm, instcombine, loop-vectorize, and loop-unroll carry most of the weight, while the bottom half of the pipeline is statistically indistinguishable from doing nothing.

Marginal-utility waterfall plot — Marginal utility of each pass (ΔT_i = time before minus time after). Most lines hug zero; the real action concentrates in two bursts around passes 8–18 and 41–48.

Aggregating those deltas across all benchmarks gives a clean cross-benchmark ranking of who actually did the work:

Top-20 pass impact ranking — Top-20 passes ranked by mean absolute marginal impact. The top decile carries the majority of the total speedup. `early-cse` takes the crown (27/30 benchmarks).
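For readers who want to redo this ranking from the profiling CSVs, here is a minimal sketch. The delta definition (time before minus time after) and the "mean absolute marginal impact" score come from the captions above; the function names, data layout, and sample values are assumptions.

```python
from statistics import mean


def marginal_deltas(times: list[float]) -> list[float]:
    """Delta for step i = time before minus time after; positive means the pass helped."""
    return [before - after for before, after in zip(times, times[1:])]


def rank_passes(
    trajectories: dict[str, list[float]], pass_names: list[str]
) -> list[tuple[str, float]]:
    """Rank passes by mean absolute delta across benchmarks.

    Each trajectory holds len(pass_names) + 1 runtimes: the baseline first,
    then one runtime per cumulative prefix.
    """
    per_benchmark = [marginal_deltas(t) for t in trajectories.values()]
    impact = {
        name: mean(abs(deltas[i]) for deltas in per_benchmark)
        for i, name in enumerate(pass_names)
    }
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two kernels, three pipeline steps.
runs = {"kernel_a": [10.0, 9.0, 9.5, 6.0], "kernel_b": [8.0, 8.0, 7.0, 4.0]}
for name, score in rank_passes(runs, ["p0", "p1", "p2"]):
    print(name, score)
```

Using the absolute value means a pass that causes a large regression also ranks high, which is the point: the ranking captures impact, not benefit.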

Finding 2 — speed, compile time, and binary size trade-offs

Runtime improvements don't come for free. Compile time grows near-monotonically, while binary size is non-monotone: early passes shrink it via dead-code elimination, late inlining and unrolling inflate it. The result is that the final -O3 endpoint is Pareto-dominated in 29/30 benchmarks — there's an earlier pipeline checkpoint that is simultaneously smaller and faster. Also, static IR instruction count turns out to be a poor proxy for runtime: 27/30 benchmarks show a negative correlation between IR size and execution time, since passes like unrolling and vectorization expand the IR while reducing runtime.
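The IR-size claim is easy to check per benchmark with a correlation coefficient. The sketch below uses Pearson correlation as an assumption (the post does not say which coefficient the paper uses), and the sample numbers are made up.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical prefix-by-prefix data for one kernel: IR grows, runtime falls.
ir_instruction_counts = [100.0, 90.0, 120.0, 200.0]
runtimes = [10.0, 9.5, 6.0, 3.0]
print(pearson(ir_instruction_counts, runtimes))  # negative for this data
```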

Compile time and binary size trajectories along the -O3 pipeline — Compile time (top) and binary size (bottom) trajectories along the pipeline. Compile time grows near-monotonically; binary size first shrinks (dead-code elimination) then inflates with late inlining and unrolling.

Putting speedup and binary size together on a Pareto plot makes the dominated endpoint obvious:

Pareto fronts for compile time vs speedup and binary size vs speedup — Pareto views of the trade-off space. Each dot is a pipeline prefix aggregated across benchmarks. In 29/30 cases, the full -O3 endpoint (rightmost dot) is dominated on the binary-size/speedup frontier.
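Checking whether the endpoint is dominated takes only a few lines. This sketch uses the standard Pareto definition (no worse on both objectives, strictly better on at least one), which is slightly weaker than "simultaneously smaller and faster"; the `Point` type and sample values are assumptions.

```python
from typing import NamedTuple


class Point(NamedTuple):
    prefix: int          # pipeline prefix index
    binary_size: float   # lower is better
    speedup: float       # higher is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.binary_size <= b.binary_size and a.speedup >= b.speedup
    strictly_better = a.binary_size < b.binary_size or a.speedup > b.speedup
    return no_worse and strictly_better


def endpoint_is_dominated(points: list[Point]) -> bool:
    """True if some earlier prefix dominates the last point (the full pipeline)."""
    endpoint = points[-1]
    return any(dominates(p, endpoint) for p in points[:-1])


# Hypothetical trajectory: prefix 2 is both smaller and faster than the endpoint.
trajectory = [
    Point(0, 100.0, 1.0),
    Point(1, 80.0, 2.0),
    Point(2, 90.0, 2.6),
    Point(3, 120.0, 2.5),
]
print(endpoint_is_dominated(trajectory))
```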

Finding 3 — vectorization saves energy (not just time)

We also measured hardware counters and RAPL energy. A striking result: IPC (instructions per cycle) actually falls by 17.4% end-to-end, even though execution is much faster. That sounds contradictory, but it makes sense once you realize vectorization (SIMD) replaces many cheap scalar instructions with fewer, wider, longer-latency ones. The program retires far fewer instructions in total, which wins, even if each instruction takes more cycles on average. The heatmap below shows exactly which passes move which counters, across all 30 benchmarks:

Hardware counter signatures at the top-10 most impactful pass positions — Per-benchmark hardware-counter signatures at the top-10 most impactful pass positions. Blue = improvement, red = regression relative to -O0. Each cell is one (benchmark, pass) pair across 9 counters (instructions, cycles, IPC, D1/LLC/L1-I misses, branch misses, energy).
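The IPC paradox is just arithmetic: since IPC = instructions / cycles, the instruction count scales by the IPC ratio times the cycle ratio. The sketch below plugs in the 17.4% IPC drop from the text; the halved cycle count is a hypothetical number, not a measured one.

```python
def instruction_ratio(ipc_ratio: float, cycle_ratio: float) -> float:
    """instructions_after / instructions_before, from IPC = instructions / cycles."""
    return ipc_ratio * cycle_ratio


ipc_ratio = 1.0 - 0.174   # IPC falls by 17.4% (figure from the text)
cycle_ratio = 0.5         # hypothetical: the optimized binary needs half the cycles
print(round(instruction_ratio(ipc_ratio, cycle_ratio), 3))  # 0.413
```

Under that assumption the optimized binary retires roughly 41% of the original instructions, so a lower IPC and a faster run are perfectly compatible.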

On the energy side: passes that improve runtime are de facto energy passes too. The suite-wide energy savings track runtime savings closely (30–60%), with the biggest single step at loop-vectorize (−35% in one pass). To our knowledge, this is the first per-pass energy profile of the LLVM pipeline.

Finding 4 — how much speedup is lost to phase interference?

Given all those non-monotone transitions, one natural question is: how much potential speedup is the pipeline leaving on the table? We define a simple idealized upper bound: take the sum of all positive pass contributions, ignoring the regressions. The loss L measures the gap between that ceiling and what you actually get.

Phase interference loss L per benchmark — Phase-interference loss L per benchmark. Mean 46.35%, median 39.3%. On `correlation` (L=100%), -O3 ends up slower than -O0.

The mean loss is 46.35% (median 39.3%). The ceiling is unreachable — passes don't compose additively — but the number gives a concrete sense of how much intra-pipeline interference costs. On correlation (L=100%), licm at pass 41 drops performance by 26%, and the very next pass (loop-rotate) recovers most of it — the pipeline self-corrects in situ, but not fully. As far as we know, this is the first search-free, per-pass quantification of phase-interference loss in a production compiler's default pipeline.
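Here is one way to read the definition of L as code. Treat it as a sketch under assumptions: contributions are measured as runtime deltas, the loss is normalized by the ceiling, and the result is clamped to [0, 1] so that a benchmark ending slower than its baseline reports 100%. The paper's exact formula may differ.

```python
def phase_interference_loss(times: list[float]) -> float:
    """times[0] is the baseline runtime, times[i] the runtime after prefix i."""
    deltas = [before - after for before, after in zip(times, times[1:])]
    ceiling = sum(d for d in deltas if d > 0)   # idealized bound: gains only
    if ceiling == 0:
        return 0.0                              # no pass ever helped
    actual = times[0] - times[-1]               # what the full pipeline delivers
    loss = (ceiling - actual) / ceiling
    return min(max(loss, 0.0), 1.0)             # clamp (assumption)


# Hypothetical runtimes: gains of 2, 3, 2 and two regressions of 1 each.
print(phase_interference_loss([10.0, 8.0, 9.0, 6.0, 7.0, 5.0]))
# Hypothetical kernel that ends slower than its baseline.
print(phase_interference_loss([10.0, 8.0, 11.0]))
```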

So what?

A few practical takeaways:

Pass pruning: if you're building a constrained compiler (embedded, tight compile-time budget), the bottom half of the pipeline is a natural pruning target. A follow-up study on how to drop those passes systematically is in order.
Cost models: IR instruction count is a poor proxy for runtime. 27/30 benchmarks show a negative correlation between IR size and execution time — more IR often means faster code (unrolling, vectorization).
Autotuning: the per-pass data gives a dense supervision signal for learned phase ordering, without needing to search over orderings.
Energy: if you're optimizing for energy consumption (mobile, HPC), just optimize for runtime. At least for compute-bound workloads, the two objectives align tightly.

The full paper is on arXiv: arxiv.org/abs/2606.31238. If you have thoughts or questions, feel free to reach out!

How to cite

@misc{Bruzzone26f-preprint,
    title={A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline},
    author={Federico Bruzzone and Walter Cazzola},
    year={2026},
    eprint={2606.31238},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2606.31238},
}