Optimization catalog. When float division beats integer division
A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster
I did quite a lot of hot path optimizations while working on my last project and some of them were genuinely interesting ones. So why don't I write about some of them. I'm starting the series with an interesting case where speedup was achieved by unintuitive replacement of int operation with float operation.
Avoiding IDIVQ with float division
Let's take a look at this optimization (ignore the 3x claim, I was wrong in that doc comment):
diff --git a/huffman.go b/huffman.go
index 33f8263..180dffc 100644
func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) {
if i != length {
sum += int(counts[i])
if stride >= 4 {
- limit = (256*sum + stride/2) / stride
+ // float64 division is ~3x faster than IDIVQ for variable
+ // divisors, and exact for our input range (numerator < 2^25,
+ // denominator < 2^9, quotient < 2^17 — all exactly
+ // representable in float64's 52-bit mantissa).
+ limit = int(float64(256*sum+stride/2) / float64(stride))
}
if stride == 4 {
limit += 120
I'm replacing integer division (256*sum + stride/2) / stride with float
division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the
float operations? Well yes, but actually no. Not in this case.
Before we do anything let's run benchmarks to confirm that the optimization is real (benchmark code):
goos: linux
goarch: amd64
pkg: idivq
cpu: 12th Gen Intel(R) Core(TM) i5-12500
│ idivq │ float │
│ sec/op │ sec/op vs base │
DivInline 3.358n ± 0% 2.414n ± 0% -28.10% (p=0.002 n=6)
Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD CPU:
goos: linux
goarch: amd64
pkg: idivq
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
│ idivq │ float │
│ sec/op │ sec/op vs base │
DivInline 2.178n ± 0% 1.959n ± 0% -10.08% (p=0.002 n=6)
~10% win.
So why is it faster? Let's compare the assembly:
| Integer (IDIVQ) | Float (DIVSD) |
|---|---|
Compute 256*sum + stride/2
SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (BX)(CX*1), DX SARQ $1, DX ADDQ DX, AX |
Compute 256*sum + stride/2
SHLQ $8, AX MOVQ BX, CX SHRQ $63, BX LEAQ (CX)(BX*1), DX SARQ $1, DX ADDQ AX, DX |
TESTQ CX, CX JEQ panicdivide CMPQ CX, $-1 JNE do_div NEGQ AX XORL DX, DX JMP ret |
The division by zero check is missing in the float version. As well as handling of overflow special
case stride == -1.
|
Uses very expensive IDIVQ instruction to do the actual division
CQO IDIVQ CX |
Uses relatively cheap DIVSD instruction but adds a bunch of code to
do int<-->float conversions
XORPS X0, X0 CVTSQ2SD DX, X0 XORPS X1, X1 CVTSQ2SD CX, X1 DIVSD X1, X0 CVTTSD2SQ X0, AX |
The key point is that IDIVQ is replaced with DIVSD. According to uops.info
the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see
link). And DIVSD has
latency 13 and throughput of 4 cycles
(link). So I thought
that the number I should look for is throughput.
Why throughput and not latency? Because in this microbenchmark the divisions are independent. One result doesn't depend on the previous one. So the divider unit stays "saturated" as it receives the work at the fastest possible rate. And the rate is bottlenecked by the throughput and not latency.
So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote
3x in the doc comment above (should have written 2.5x to be more precise as
10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only
~1.4x (the 28% Intel win)?
To answer this question I ran perf stat and collected the
arith.divider_active metric. The arith.divider_active counts cycles during
which the cpu divide unit was busy.
| idivq | float | |
|---|---|---|
| arith.divider_active / op | 10.00 cyc | 6.35 cyc |
| total cycles / op | 10.03 cyc | 7.29 cyc |
The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has
predicted (using throughput rather than latency was a right choice after all).
While we are here note that total cycles in this case is only 0.03 cycles extra
which means that all that machinery compiler added to check the divide-by-zero
and overflow is negligible (that was another question I wanted to ask - if the
speedup is actually caused by that machinery elimination and perf showed it was
not the case. The speedup is truly because of using more performant DIVSD
instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division,
not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra
work making it 7.29 cycles per op in total. Is uops.info wrong then?
I think it's not. I'm fairly confident that DIVSD timing is operand-dependent
- the
6.35vs4gap is itself a sign of it, since uops.info measures throughput with one fixed operand set. I don't have a decisive explanation for the exact rule, though.
Caveat
Integer division and float-division-then-cast-to-int are not always equivalent.
It's due to how the floating point numbers are represented in the hardware and
the rounding limitation. Every integer up to 2^53 is exactly representable as an
fp number (valid integers can be found after 2^53 too, but above that gaps
appear - only even numbers, then multiples of 4, and so on). As we'll see below,
the operands of DIVSD must stay <= 2^53 - and the representable values above
2^53 don't save you.
Now the rounding limitation: the result of division (let's call it q for
quotient) is q = floor(a/b). But the true value of a/b can be somewhere in the
range [q, q+1). The problem arises when a/b is so close to q+1 that
hardware can't determine that it is actually smaller than q+1. Then instead of
correct q you get off-by-one q+1. The a/b becomes indistinguishable from
q+1 when ulp(q) >= 2/b where the ulp is the unit in the last
place and
that happens when bits(q) + bits(b) > 54, which forces a ≈ q*b > 2^53.
So basically when you divide a by b you have these two limitations:
a <= 2^53b <= 2^53
If these conditions hold then you are golden and you can replace IDIVQ with DIVSD.
Wrapping up
So this is one of the few cases where the "integer ops are obviously faster"
intuition is wrong: IDIVQ is one of the most expensive integer instructions,
and DIVSD plus the int<-->float conversions turns out to be faster. Of course
benchmark before applying.
Oh yeah, forgot to mention. This optimization is only applicable when you divide by a runtime variable. If it's a compile-time constant, then the compiler already replaces it with a better optimization.
Update
This post got a good discussion on Reddit (thread). A few comments gave ideas for different optimizations.
u/Masztufa wrote:
I wonder if the superscalar nature of cpus also comes up or not in these tests
You're always doing integer math (pointer arithmetic), so it would seem like that choosing integer math would load the int math part of the cpu, while if you used floats for the actual data you could use more of the silicon to get the job done
This comment gave me an aha moment and I realized that I can interleave int and float divisions so both units work at once and benefit from the superscalar execution (commit). The best combination was interleaving 5 divisions: int, int, float, float, float.
u/Dwedit wrote:
One thing with integer math is that it becomes much faster to precalculate a reciprocal and use that instead. The compiler automatically does that for you for constant values, but not for variable values.
uint32_t reciprocal = (uint32_t)(0x100000000ULL / divisor + 1); //divisor must be > 1 answer = (number * (uint64_t)reciprocal) >> 32;
When the divisor is reused across many divisions, you can replace the division itself with a single multiply. I microbenchmarked it here and it gave the same performance gain as the float (divsd) optimization.
u/vip17 pointed to the C libdivide library. I found a go equivalent at
github.com/bmkessler/fastdiv.
Note that I benchmarked the Uint32 version. To stay correct across the whole
64-bit range, the Uint64 version has to use 128-bit math, so it does two
multiplications instead of one. The Uint32 version does a single
multiplication, which is closer to the other benchmarked optimizations (at the
cost of limiting the range to numbers below 2^32).
Finally I added 5x unroll version for every optimization so that it's an apples-to-apples comparison with the interleave trick I mentioned above. commit.
The numbers. Only the 5x unrolled versions are included:
| approach | ns / 5x ops | vs IDIVQ |
|---|---|---|
IDIVQ | 16.81 | — |
| int+float split | 8.316 | −51% |
fastdiv | 7.796 | −54% |
float (DIVSD) | 7.391 | −56% |
| reciprocal multiply | 6.474 | −61% |
So the reciprocal multiply suggested by u/Dwedit produced the best results for the 5x unrolled loop. The int+float split is a fun trick, but simply unrolling the other versions (e.g. 5x DIVSD) turned out to be even faster.
What about SIMD? Suggested by u/Masztufa and u/vip17. I agree that SIMD
would beat my divsd version. But the hot loop in the function I optimized
(optimizeHuffmanCountsForRLE) looks like a state machine and is hard to
vectorize. Loop unrolling won't work either, so I'll have to pick the best of
the non-unrolled versions.