Measuring software performance reliably is remarkably difficult. It’s a specialized version of a more general problem: trying to find a signal in a world full of noise. A benchmark that reports a 5% improvement might just be measuring thermal throttling, noisy neighbors, or the phase of the moon.
In this talk, we walk through the full stack of reliable performance measurement — from controlling your benchmarking environment (bare metal instances, CPU affinity, disabling SMT and dynamic frequency scaling) to designing benchmarks that are both representative and repeatable. We cover the statistical methods needed to interpret results correctly (hypothesis testing, change point detection) and show how to integrate continuous benchmarking into development workflows so regressions are caught before they reach production.
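To give a flavor of the statistics involved, here is a minimal sketch of comparing two sets of timings with a non-parametric hypothesis test (the talk goes deeper); the sample numbers, variable names, and significance threshold below are made up for illustration.

```python
# Illustrative only: decide whether two samples of benchmark timings differ
# using a Mann-Whitney U test (non-parametric; timing distributions are
# rarely normal). Data and threshold are hypothetical.
from scipy.stats import mannwhitneyu

baseline_ns = [1520, 1498, 1510, 1543, 1505, 1521, 1499, 1534]   # old build
candidate_ns = [1489, 1476, 1492, 1481, 1500, 1473, 1488, 1479]  # new build

stat, p_value = mannwhitneyu(baseline_ns, candidate_ns, alternative="two-sided")
if p_value < 0.01:
    print(f"Difference is unlikely to be noise (p = {p_value:.4f})")
else:
    print(f"No significant difference detected (p = {p_value:.4f})")
```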
Along the way, we share experiments demonstrating how environment control alone can reduce measurement variance by 100x, and practical tips for anyone who writes benchmarks — whether you’re optimizing a hot loop or validating a system-wide change.
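For a concrete idea of what "environment control" means in practice, a rough sketch for a Linux host is below; it assumes the intel_pstate driver and root access, and the sysfs paths and core choice are illustrative rather than a complete recipe.

```python
# Rough sketch of quiescing a Linux machine before benchmarking (requires root).
# Assumes the intel_pstate driver; paths and the chosen core are illustrative.
import os
from pathlib import Path

def quiesce(core: int = 2) -> None:
    # Disable SMT so a sibling hyper-thread cannot share the core's resources.
    Path("/sys/devices/system/cpu/smt/control").write_text("off")
    # Disable turbo boost so dynamic frequency scaling does not skew timings.
    Path("/sys/devices/system/cpu/intel_pstate/no_turbo").write_text("1")
    # Pin the current process (and any benchmark it spawns) to a single core.
    os.sched_setaffinity(0, {core})
```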
Co-presented with Augusto de Oliveira.
Recording
Slides
Demo/Code
- fosdem-2026-software-performance — includes Jupyter notebooks in experiments/ with benchmark design and results interpretation visualizations
Events
- FOSDEM 2026 — Software Performance Devroom
- Recording (coming soon)
Related
- Measuring Software Performance: Why Your Benchmarks Are Probably Lying — full technical blog post expanding on this talk
