Andrii's Blog

A regression in code I didn't touch

A deep dive into L1 instruction cache set conflicts, associativity, and code alignment in Go.

In my previous post I briefly touched on the topic of how merely shifting the code by a couple of bytes may significantly affect hot path performance.

CPUs are weird. They don't just take instructions and run them in order. There are caches, branch predictors, prefetchers, yada yada, and all of it is sensitive to where exactly your code sits in memory. The same hot loop at one address can be a few percent slower at another, just because it crossed some invisible boundary somewhere.

Every cache you can find around cpu is a potential subject of unexpected performance regressions (or gains) inflicted by code alignment changes. The hero of this post is L1 icache - the fastest cpu cache that stores cpu instructions. On my machine (Intel i5-12500) it's 32KB, 8-way set associative: 64 sets × 8 ways × 64-byte cachelines. Those numbers matter for the story.

In this post I want to tell you an interesting anecdote about the case where I spent a couple hours investigating why a change in one piece of code caused a performance regression in a completely unrelated part of the codebase and the root cause was, surprisingly, L1i conflict misses from limited cache associativity.

The Phantom Regression

I was working on improving compression speed of quality level 2 in my Brotli Go port go-brrr.

go version go1.26.2-X:nodwarf5 linux/amd64
goos: linux
goarch: amd64
pkg: github.com/molecule-man/go-brrr
cpu: 12th Gen Intel(R) Core(TM) i5-12500

                │ /tmp/before.txt │           /tmp/after.txt            │
                │       B/s       │     B/s       vs base               │
830kb.so.css         297.8Mi ± 0%   304.0Mi ± 0%  +2.08% (p=0.000 n=21)
005kb.webp.js        126.8Mi ± 1%   122.7Mi ± 0%  -3.24% (p=0.000 n=21)
011kb.quer.json      348.5Mi ± 0%   344.8Mi ± 0%  -1.08% (p=0.000 n=21)

The speed of large files compression is improved (expected). However, performance on small files regressed by 3% - completely unexpected as my change touched hash2.go (doesn't matter what it is) but small files under 64kb are always compressed by hash2u16.go.

At this point I was pretty accustomed to such regressions and would otherwise happily skip the investigation but this time I decided to dig deeper as a 3% regression was larger than the usual 2% alignment-shift-induced regression typical for this machine.

416 Bytes of Trouble

As it was pretty clear that regression is caused by alignment change the first thing I did was to calculate how much things have shifted in the assembly. The only function I changed was createBackwardReferences in hash2.go therefore I dumped assembly for this function before and after:

go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.before
go tool objdump -s '\(\*h2\)\.createBackwardReferences' /tmp/bench.after

Checking the instruction addresses in assembly showed that my change shrank the function by 402 bytes. The Go compiler aligns all functions to 32 bytes, meaning the first instruction of any function always starts at an address that is divisible by 32. So the actual downstream shift must be a multiple of 32 - and objdump showed it was 416B. The h2u16 code is AFTER the changed function in the assembly so this is exactly what shifted the machine code of the regressing path.

The fact that hotpath of h2u16 as a whole has shifted by something that is 32B divisible already hinted that the root cause was 64B aligned instruction cache (and not e.g. 32B aligned intel's DSB cache - but I didn't realize it back then and continued investigation).

Perf-action

Note to myself: thank god there is perf, use it more often.

The next thing I did, I took perf from the shelf and started interrogating it until the picture became crystal clear:

Expand to see the perf command
EVENTS='cycles,instructions,branches,branch-misses,
baclears.any,dsb2mite_switches.penalty_cycles,
frontend_retired.dsb_miss,frontend_retired.any_dsb_miss,
frontend_retired.l1i_miss,frontend_retired.itlb_miss,
frontend_retired.unknown_branch,
idq.dsb_uops,idq.mite_uops,idq.ms_uops,
icache_data.stalls,icache_tag.stalls,
br_misp_retired.cond,br_misp_retired.indirect,br_misp_retired.near_taken'

for bin in before after; do
  BENCH_CORPUS_FILE=../testcorpus/005kb.webp.js \
    perf stat -e $EVENTS -- \
    /tmp/bench.$bin -test.run '^$' \
      -test.bench 'CompressCorpusFile/q=2' \
      -test.benchtime 1000000x -test.cpu 1 -test.count 1
done
eventbeforeafter
time (1M iters)22.65 s23.10 s
...
🚩 frontend_retired.l1i_miss9.96 M28.14 M
🚩 icache_data.stalls (cycles)54.7 M135.7 M

L1 instruction cache misses nearly tripled (2.8×). The next step was to localize the source of icache misses. Of course perf can do it:

Expand to see the perf command
for bin in before after; do
  BENCH_CORPUS_FILE=... perf record -F 999 -e cpu_core/frontend_retired.l1i_miss/upp \
    -o /tmp/perf.$bin.data -- /tmp/bench.$bin ...
done

perf report -i /tmp/perf.after.data --stdio --no-children
symbolbeforeafter
(*h2u16).createBackwardReferencesNoWrap21.05%20.14%
huffmanBlock.writeData7.37%12.23%
(*encodeState).reset5.26%11.15%
(*encoderArena).writeMetaBlockFast4.21%9.71%
(*bitWriter).buildAndWriteHuffmanTreeFast2.11%7.19%
(*encoderArena).chooseHasher10.53%5.40%
(*encoderCore).resetHasher3.16%5.40%
(*bitWriter).encodeHuffmanTree10.53%4.68%

Oh boy. h2u16 indeed appears as the top L1i-miss source but its relative share was ~20% and stayed ~20%. Other symbols shuffle around (some doubled their share, some halved) but no single symbol absorbs the new misses on its own. The total number of L1i-misses grew from 10M to 28M but relative shares don't point at a single place as the source of new L1i-misses. Suddenly things are not crystal clear again.

Ok, what can we rule out? We can rule out the theory of a single hot loop being evicted from icache more often than it used to due to the alignment shift. That can't be the case as perf would show us that all the relative miss shares would decrease except a single one which would balloon. So it means it's not a single hot loop. It's an interplay of multiple hot loops with the L1i cache.

What else could manifest itself as a total L1i-miss increase with no relative change between symbols? If one hot loop would take more instructions than it used to before it would manifest exactly like this. The total number of misses would increase as new hot loop instructions would fight for place in icache with peers. But we can rule out this hypothesis as well, as none of the hot code has changed. There are no new instructions. Remember? The code change only shifted instruction positions but didn't add new instructions in the execution path.

We are left with only one possible reason: cache associativity.

8 ways, 64 sets, one confused programmer

If you've ever seen the classic demo where someone loops over an array with growing strides and benchmarks fall off a cliff at power-of-two boundaries (e.g. Igor Ostrovsky's Gallery of Processor Cache Effects) - you've already seen cache associativity in action. The L1d version of this story is famous (at least it's my feeling). There are YouTube videos, textbook examples, classroom labs. The L1i version is the same hardware mechanic applied to instructions instead of data, and somehow it's almost never talked about. The mechanism is the same, but applied to the code itself, mostly invisible until it bites you. (Oh yeah, about "biting". The next post most likely will be about DSB vs MITE so there is going to be "bitten by MITE" in the title. Sorry in advance).

32KB of L1i doesn't mean "32KB of instructions fit". The cache is partitioned into 64 sets, each with only 8 slots (ways) for 64-byte lines. Two hot instruction streams can compete for the same set even when total footprint is well under 32KB.

The previously described detective work confirms that source of regression is the fact that some hot instructions started to land in the index where the colder peer instructions land more often than they used to.

I wrote a small awk+python script to check exactly that: set placement changes.

Record cycles (not l1i_miss, we want every hot cacheline, not just the missing ones), bucket by 64-byte cacheline, then by set index:

Expand to see the perf and awk commands
for bin in before after; do
  BENCH_CORPUS_FILE=... perf record -F 4999 -e cycles:u -o /tmp/cyc.$bin.data \
    -- /tmp/bench.$bin -test.bench 'CompressCorpusFile/q=2' \
       -test.benchtime 1000000x -test.cpu 1 -test.count 1
done

for bin in before after; do
  perf script -i /tmp/cyc.$bin.data \
    | awk '/cpu_core\/cycles.*tmp\/bench/ && $6 ~ /^[0-9a-f]+$/ {
        cl = int(strtonum("0x"$6) / 64) * 64
        sym = $7; sub(/\+0x.*/, "", sym)
        printf "%x %s\n", cl, sym
      }' \
    | sort | uniq -c | sort -rn > /tmp/cl.$bin.txt
done

Then build a per-set heatmap that separates shifted (h2u16) from fixed-address peer (huffmanBlock, sortHuffmanTree, bitWriter, etc.):

Expand to see the python script
def parse(path):
    by_set = {}
    for line in open(path):
        parts = line.split(maxsplit=2)
        cnt = int(parts[0]); addr = parts[1]; sym = parts[2].strip()
        s = (int(addr,16) >> 6) & 63
        tag = "h2u16" if "h2u16" in sym else "peer"
        by_set.setdefault(s, []).append((cnt, tag, addr, sym))
    return by_set

Result (each cell shows total cycles:u samples; "NEW COLLISION" means the set had ≤1 hot occupant before, two after):

set | before h2u16+peer | after h2u16+peer  | note
----+-------------------+-------------------+--------------------------
 25 |       28 +    24  |   4470 +  2671    | NEW COLLISION
 26 |       17 +   447  |   5646 +  5144    | NEW COLLISION
 27 |       63 +   536  |   8145 +  1912    | NEW COLLISION
 28 |      171 +  1859  |   4711 +  2362    | NEW COLLISION
 29 |       21 +   367  |  19364 +  2596    | NEW HOTTEST COLLISION
 30 |       14 +  2417  |   2536 +  4779    | NEW COLLISION
 31 |      740 +  4247  |   4959 +  4607    | heavier collision in 'after'
 33 |     5445 +  3887  |   3464 +  2418    | calmed down (was hot before)
 36 |    11119 +  1777  |   3378 +  1106    | WAS hottest, now cool
 58 |        0 +   665  |  12743 +   119    | h2u16.reset moved here

The table shows only the sets where something interesting has happened. Because of the index redistribution good icache usage became less favorable. huffmanBlock.writeData calls the very hot (*h2u16).createBackwardReferencesNoWrap and their instructions land on the same cache set evicting each other.

8 ways, 9 fighters

Aaaand I lied to you. So far I've been talking as if two streams fighting for a set is enough to cause evictions - that's not actually how an 8-way set works. You need at least 9 hot cachelines mapping to the same set before the cache starts evicting one to make room for another. Which makes this even more interesting.

As soon as I realized that, I took perf from the shelf again and re-recorded L1i misses with PEBS instead of cycles.

Expand to see the perf command
for bin in before after; do
  BENCH_CORPUS_FILE=... perf record \
    -e cpu_core/frontend_retired.l1i_miss/upp -c 2000 \
    -o /tmp/miss.$bin.data \
    -- /tmp/bench.$bin -test.bench 'CompressCorpusFile/q=2' \
       -test.benchtime=1000000x -test.cpu=1 -test.count=1
done

Bucketing miss samples by L1i set (set = (addr >> 6) & 63):

set | before misses | after misses |    Δ   | hot cachelines (>=30 misses each)
----+---------------+--------------+--------+----------------------------------
 36 |     142       |     2224     | +2082  |   1 → 5
 35 |      28       |     2033     | +2005  |   0 → 7
 33 |     212       |     1636     | +1424  |   4 → 5
 32 |     132       |     1700     | +1568  |   2 → 3
 37 |     350       |      884     |  +534  |   2 → 4
 29 |     135       |      762     |  +627  |   1 → 2

74% of the +8200 extra misses concentrate in six sets, 32–37. Set 29, the one I'd called out as "the new hottest collision" using cycles attribution actually contributes only 8% of the new misses. The cycles-based story was overall right (the shift crowded sets that h2u16 hadn't crowded before), but the epicentre of the actual miss activity sits at sets 32–37, not at sets 25–29.

Set 35 went from zero hot miss cachelines to seven:

[h2u16]  692  0x55b8c0  createBackwardReferencesNoWrap
[peer ]  435  0x5b28c0  benchCompress.func1
[peer ]  310  0x5a48c0  encoderArena.trailingBits
[peer ]  180  0x5948c0  huffmanBlock.writeData
[peer ]  168  0x5ae8c0  (*Writer).Write
[peer ]  108  0x54d8c0  encoderArena.writeMetaBlockFast
[peer ]   96  0x5538c0  computeDistanceCode

Set 36 (the largest miss jump, 1 occupant -> 5):

[h2u16] 1291  0x55b900  createBackwardReferencesNoWrap
[peer ]  430  0x54c900  encoderArena.chooseHasher
[peer ]  293  0x594900  huffmanBlock.writeData
[peer ]  142  0x553900  computeDistanceCode
[peer ]   59  0x54d900  encoderArena.writeMetaBlockFast

Set 32: no h2u16 in the hot-miss list!:

[peer ]  715  0x535800  bitWriter.buildAndWriteHuffmanTreeFast
[peer ]  462  0x54a800  encodeState.reset
[peer ]  451  0x594800  huffmanBlock.writeData

Why are there no h2u16 occupants in the last set? Remember, this table shows not all participants but only the ones that landed with a cache miss. So h2u16 lines are most likely there but they were cold enough to be cut off from the stats.

A Pattern Emerges

Then I examined the functions that landed in the same sets as the hot parts of h2u16.createBackwardReferencesNoWrap and there was a very interesting pattern. The majority of those functions were of similar size and accepted a similar number of arguments and, most importantly, their hot loops were starting at roughly a 512-byte offset. My codebase is prone to these alignment issues: there is a multitude of warm functions that always land in the same set cluster and as soon as the hottest h2u16 function's hot path overlaps with this cluster the icache thrashing starts happening.

I immediately felt an urge to go and refactor some of those functions so that not all of them have the same size. There was only one suitable candidate writeMetaBlockFast which I wanted to refactor anyways. I extracted cold parts of this function into non-inlined helper function.

This has fixed the regression. However, of course, it couldn't solve the problem once and for all. One function doesn't really make a difference. And the regression was fixed because of the alignment shift and not because I reduced icache pressure. The very next change confirmed it. I tried to land an improvement in hash2.go again and it caused exactly the same 3% regression in hash2u16 🤡.

Solution

As usual with alignment shift problems, there is no real solution. You'll waste your time on padding your hot functions to shift their alignment but the next commit will unintentionally shift the alignment and you have your problem again. PGO may accidentally help the end user by producing a different layout, but as a library author you have no way to systematically prevent this.

However, for this particular problem where you have a bunch of warm functions and one hot function you don't change either and you just add a 32B-divisible shift between those, then you have a solution to at least make the benchmarks trustworthy. You can run your benchmarks with the functions aligned to 64B in both the "before" and "after" binary. The possibility to do it is available in Go toolchain since go version 1.25. To do that just prebuild your benchmarking binaries with -ldflags="-funcalign=64". Nowadays I run my benchmarks with different funcalign values - 16, 32, 64 - to see whether the performance change is consistent across alignments or just an artifact of where the code happened to land.

Conclusion

Alignment issues on the hot path are inevitable and there's rarely much you can do about any specific one. But understanding the root cause pays off in the long run. Each of these deep dives reveals something about the shape of your codebase that no flame graph would have shown on its own. In my case it was a hidden cluster of warm functions, all roughly the same size, with hot loops starting around the same 512B offset. I can't fix that, but I know it's there now and will take it into account when some refactoring opportunity arises in the future. Also, next time a "harmless" change to hash2.go causes an unrelated 3% regression, I won't waste two hours being surprised.

Also, I really liked playing around with funcalign while benchmarking. Here are benchmarks from a completely different feature I was working on an hour ago:

funcalign=32/tmp/before.txt (B/s)/tmp/after.txt (B/s)diff
070kb.pdd2.json787.9Mi ± 1%742.6Mi ± 1%-5.75%
107kb.pdd1.json796.8Mi ± 0%731.6Mi ± 0%-8.19%
funcalign=64/tmp/before.txt (B/s)/tmp/after.txt (B/s)diff
070kb.pdd2.json774.3Mi ± 1%893.2Mi ± 1%+15.36%
107kb.pdd1.json770.1Mi ± 1%838.5Mi ± 2%+8.88%
funcalign=16/tmp/before.txt (B/s)/tmp/after.txt (B/s)diff
070kb.pdd2.json759.8Mi ± 1%761.1Mi ± 2%~
107kb.pdd1.json749.2Mi ± 0%751.4Mi ± 1%+0.30%

Quite a dramatic difference caused only by alignment change. I don't know the reason yet (sensitive to 32B. Is it you again, L1i?). Maybe a new blog post is incoming.