Optimization catalog. When float division beats integer division
A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster
I did quite a lot of hot path optimizations while working on my last project and some of them were genuinely interesting ones. So why don't I write about some of them. I'm starting the series with an interesting case where speedup was achieved by unintuitive replacement of int operation with float operation.
Avoiding IDIVQ with float division
Let's take a look at this optimization (ignore the 3x claim, I was wrong in that doc comment):
diff --git a/huffman.go b/huffman.go
index 33f8263..180dffc 100644
func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) {
if i != length {
sum += int(counts[i])
if stride >= 4 {
- limit = (256*sum + stride/2) / stride
+ // float64 division is ~3x faster than IDIVQ for variable
+ // divisors, and exact for our input range (numerator < 2^25,
+ // denominator < 2^9, quotient < 2^17 — all exactly
+ // representable in float64's 52-bit mantissa).
+ limit = int(float64(256*sum+stride/2) / float64(stride))
}
if stride == 4 {
limit += 120
I'm replacing integer division (256*sum + stride/2) / stride with float
division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the
float operations? Well yes, but actually no. Not in this case.
Before we do anything let's run benchmarks to confirm that the optimization is real (benchmark code):
goos: linux
goarch: amd64
pkg: idivq
cpu: 12th Gen Intel(R) Core(TM) i5-12500
│ idivq │ float │
│ sec/op │ sec/op vs base │
DivInline 3.358n ± 0% 2.414n ± 0% -28.10% (p=0.002 n=6)
Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD CPU:
goos: linux
goarch: amd64
pkg: idivq
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
│ idivq │ float │
│ sec/op │ sec/op vs base │
DivInline 2.178n ± 0% 1.959n ± 0% -10.08% (p=0.002 n=6)
~10% win.
So why is it faster? Let's compare the assembly:
| Integer (IDIVQ) | Float (DIVSD) |
|---|---|
Compute 256*sum + stride/2
SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (BX)(CX*1), DX SARQ $1, DX ADDQ DX, AX |
Compute 256*sum + stride/2
SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (CX)(BX*1), DX SARQ $1, DX ADDQ AX, DX |
TESTQ CX, CX JEQ panicdivide CMPQ CX, $-1 JNE do_div NEGQ AX XORL DX, DX JMP ret |
The division by zero check is missing in the float version. As well as handling of overflow special
case stride == -1.
|
Uses very expensive IDIVQ instruction to do the actual division
CQO IDIVQ CX |
Uses relatively cheap DIVSD instruction but adds a bunch of code to
do int<-->float conversions
XORPS X0, X0 CVTSQ2SD DX, X0 XORPS X1, X1 CVTSQ2SD CX, X1 DIVSD X1, X0 CVTTSD2SQ X0, AX |
The key point is that IDIVQ is replaced with DIVSD. According to uops.info
the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see
link). And DIVSD has
latency 13 and throughput of 4 cycles
(link). So I thought
that the number I should look for is throughput.
Why throughput and not latency? Because in this microbenchmark the divisions are independent. One result doesn't depend on the previous one. So the divider unit stays "saturated" as it receives the work at the fastest possible rate. And the rate is bottlenecked by the throughput and not latency.
So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote
3x in the doc comment above (should have written 2.5x to be more precise as
10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only
~1.4x (the 28% Intel win)?
To answer this question I ran perf stat and collected the
arith.divider_active metric. The arith.divider_active counts cycles during
which the cpu divide unit was busy.
| idivq | float | |
|---|---|---|
| arith.divider_active / op | 10.00 cyc | 6.35 cyc |
| total cycles / op | 10.03 cyc | 7.29 cyc |
The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has
predicted (using throughput rather than latency was a right choice after all).
While we are here note that total cycles in this case is only 0.03 cycles extra
which means that all that machinery compiler added to check the divide-by-zero
and overflow is negligible (that was another question I wanted to ask - if the
speedup is actually caused by that machinery elimination and perf showed it was
not the case. The speedup is truly because of using more performant DIVSD
instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division,
not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra
work making it 7.29 cycles per op in total. Is uops.info wrong then?
I think it's not. I'm fairly confident that DIVSD timing is operand-dependent
- the
6.35vs4gap is itself a sign of it, since uops.info measures throughput with one fixed operand set. I don't have a decisive explanation for the exact rule, though.
Caveat
Integer division and float-division-then-cast-to-int are not always equivalent.
It's due to how the floating point numbers are represented in the hardware and
the rounding limitation. Every integer up to 2^53 is exactly representable as an
fp number (valid integers can be found after 2^53 too, but above that gaps
appear - only even numbers, then multiples of 4, and so on). As we'll see below,
the operands of DIVSD must stay <= 2^53 - and the representable values above
2^53 don't save you.
Now the rounding limitation: the result of division (let's call it q for
quotient) is q = floor(a/b). But the true value of a/b can be somewhere in the
range [q, q+1). The problem arises when a/b is so close to q+1 that
hardware can't determine that it is actually smaller than q+1. Then instead of
correct q you get off-by-one q+1. The a/b becomes indistinguishable from
q+1 when ulp(q) >= 2/b where the ulp is the unit in the last
place and
that happens when bits(q) + bits(b) > 54, which forces a ≈ q*b > 2^53.
So basically when you divide a by b you have these two limitations:
a <= 2^53b <= 2^53
If these conditions hold then you are golden and you can replace IDIVQ with DIVSD.
Wrapping up
So this is one of the few cases where the "integer ops are obviously faster"
intuition is wrong: IDIVQ is one of the most expensive integer instructions,
and DIVSD plus the int<-->float conversions turns out to be faster. Of course
benchmark before applying.
Oh yeah, forgot to mention. This optimization is only applicable when you divide by a runtime variable. If it's a compile-time constant, then the compiler already replaces it with a better optimization.