Andrii's Blog

Optimization catalog. When float division beats integer division

A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster

I did quite a lot of hot path optimizations while working on my last project and some of them were genuinely interesting ones. So why don't I write about some of them. I'm starting the series with an interesting case where speedup was achieved by unintuitive replacement of int operation with float operation.

Avoiding IDIVQ with float division

Let's take a look at this optimization (ignore the 3x claim, I was wrong in that doc comment):

diff --git a/huffman.go b/huffman.go
index 33f8263..180dffc 100644
--- a/huffman.go
+++ b/huffman.go
@@ -378,7 +378,11 @@ func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) {
 		if i != length {
 			sum += int(counts[i])
 			if stride >= 4 {
-				limit = (256*sum + stride/2) / stride
+				// float64 division is ~3x faster than IDIVQ for variable
+				// divisors, and exact for our input range (numerator < 2^25,
+				// denominator < 2^9, quotient < 2^17 — all exactly
+				// representable in float64's 52-bit mantissa).
+				limit = int(float64(256*sum+stride/2) / float64(stride))
 			}
 			if stride == 4 {
 				limit += 120

I'm replacing integer division (256*sum + stride/2) / stride with float division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the float operations? Well yes, but actually no. Not in this case.

Before we do anything let's run benchmarks to confirm that the optimization is real (benchmark code):

goos: linux
goarch: amd64
pkg: idivq
cpu: 12th Gen Intel(R) Core(TM) i5-12500
          │    idivq    │               float                │
          │   sec/op    │   sec/op     vs base               │
DivInline   3.358n ± 0%   2.414n ± 0%  -28.10% (p=0.002 n=6)

Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD CPU:

goos: linux
goarch: amd64
pkg: idivq
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
          │    idivq    │               float                │
          │   sec/op    │   sec/op     vs base               │
DivInline   2.178n ± 0%   1.959n ± 0%  -10.08% (p=0.002 n=6)

~10% win.

So why is it faster? Let's compare the assembly:

Integer (IDIVQ) Float (DIVSD)
Compute 256*sum + stride/2
SHLQ $8, AX
MOVQ BX, CX
SHRQ $63, BX
LEAQ (BX)(CX*1), DX
SARQ $1, DX
ADDQ DX, AX
Compute 256*sum + stride/2
SHLQ $8, AX
MOVQ BX, CX
SHRQ $63, BX
LEAQ (CX)(BX*1), DX
SARQ $1, DX
ADDQ AX, DX
TESTQ CX, CX
JEQ panicdivide
CMPQ CX, $-1
JNE do_div
NEGQ AX
XORL DX, DX
JMP ret
The division by zero check is missing in the float version. As well as handling of overflow special case stride == -1.
Uses very expensive IDIVQ instruction to do the actual division
CQO
IDIVQ CX
Uses relatively cheap DIVSD instruction but adds a bunch of code to do int<-->float conversions
XORPS X0, X0
CVTSQ2SD DX, X0
XORPS X1, X1
CVTSQ2SD CX, X1
DIVSD X1, X0
CVTTSD2SQ X0, AX

The key point is that IDIVQ is replaced with DIVSD. According to uops.info the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see link). And DIVSD has latency 13 and throughput of 4 cycles (link). So I thought that the number I should look for is throughput.

Why throughput and not latency? Because in this microbenchmark the divisions are independent. One result doesn't depend on the previous one. So the divider unit stays "saturated" as it receives the work at the fastest possible rate. And the rate is bottlenecked by the throughput and not latency.

So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote 3x in the doc comment above (should have written 2.5x to be more precise as 10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only ~1.4x (the 28% Intel win)?

To answer this question I ran perf stat and collected the arith.divider_active metric. The arith.divider_active counts cycles during which the cpu divide unit was busy.

idivqfloat
arith.divider_active / op10.00 cyc6.35 cyc
total cycles / op10.03 cyc7.29 cyc

The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has predicted (using throughput rather than latency was a right choice after all). While we are here note that total cycles in this case is only 0.03 cycles extra which means that all that machinery compiler added to check the divide-by-zero and overflow is negligible (that was another question I wanted to ask - if the speedup is actually caused by that machinery elimination and perf showed it was not the case. The speedup is truly because of using more performant DIVSD instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division, not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra work making it 7.29 cycles per op in total. Is uops.info wrong then?

I think it's not. I'm fairly confident that DIVSD timing is operand-dependent

  • the 6.35 vs 4 gap is itself a sign of it, since uops.info measures throughput with one fixed operand set. I don't have a decisive explanation for the exact rule, though.

Caveat

Integer division and float-division-then-cast-to-int are not always equivalent. It's due to how the floating point numbers are represented in the hardware and the rounding limitation. Every integer up to 2^53 is exactly representable as an fp number (valid integers can be found after 2^53 too, but above that gaps appear - only even numbers, then multiples of 4, and so on). As we'll see below, the operands of DIVSD must stay <= 2^53 - and the representable values above 2^53 don't save you.

Now the rounding limitation: the result of division (let's call it q for quotient) is q = floor(a/b). But the true value of a/b can be somewhere in the range [q, q+1). The problem arises when a/b is so close to q+1 that hardware can't determine that it is actually smaller than q+1. Then instead of correct q you get off-by-one q+1. The a/b becomes indistinguishable from q+1 when ulp(q) >= 2/b where the ulp is the unit in the last place and that happens when bits(q) + bits(b) > 54, which forces a ≈ q*b > 2^53.

So basically when you divide a by b you have these two limitations:

  • a <= 2^53
  • b <= 2^53

If these conditions hold then you are golden and you can replace IDIVQ with DIVSD.

Wrapping up

So this is one of the few cases where the "integer ops are obviously faster" intuition is wrong: IDIVQ is one of the most expensive integer instructions, and DIVSD plus the int<-->float conversions turns out to be faster. Of course benchmark before applying.

Oh yeah, forgot to mention. This optimization is only applicable when you divide by a runtime variable. If it's a compile-time constant, then the compiler already replaces it with a better optimization.

Update

This post got a good discussion on Reddit (thread). A few comments gave ideas for different optimizations.

u/Masztufa wrote:

I wonder if the superscalar nature of cpus also comes up or not in these tests

You're always doing integer math (pointer arithmetic), so it would seem like that choosing integer math would load the int math part of the cpu, while if you used floats for the actual data you could use more of the silicon to get the job done

This comment gave me an aha moment and I realized that I can interleave int and float divisions so both units work at once and benefit from the superscalar execution (commit). The best combination was interleaving 5 divisions: int, int, float, float, float.


u/Dwedit wrote:

One thing with integer math is that it becomes much faster to precalculate a reciprocal and use that instead. The compiler automatically does that for you for constant values, but not for variable values.

uint32_t reciprocal = (uint32_t)(0x100000000ULL / divisor + 1);  //divisor must be > 1

answer = (number * (uint64_t)reciprocal) >> 32;

When the divisor is reused across many divisions, you can replace the division itself with a single multiply. I microbenchmarked it here and it gave the same performance gain as the float (divsd) optimization.

u/vip17 pointed to the C libdivide library. I found a go equivalent at github.com/bmkessler/fastdiv. Note that I benchmarked the Uint32 version. To stay correct across the whole 64-bit range, the Uint64 version has to use 128-bit math, so it does two multiplications instead of one. The Uint32 version does a single multiplication, which is closer to the other benchmarked optimizations (at the cost of limiting the range to numbers below 2^32).

Finally I added 5x unroll version for every optimization so that it's an apples-to-apples comparison with the interleave trick I mentioned above. commit.

The numbers. Only the 5x unrolled versions are included:

approachns / 5x opsvs IDIVQ
IDIVQ16.81
int+float split8.316−51%
fastdiv7.796−54%
float (DIVSD)7.391−56%
reciprocal multiply6.474−61%

So the reciprocal multiply suggested by u/Dwedit produced the best results for the 5x unrolled loop. The int+float split is a fun trick, but simply unrolling the other versions (e.g. 5x DIVSD) turned out to be even faster.

What about SIMD? Suggested by u/Masztufa and u/vip17. I agree that SIMD would beat my divsd version. But the hot loop in the function I optimized (optimizeHuffmanCountsForRLE) looks like a state machine and is hard to vectorize. Loop unrolling won't work either, so I'll have to pick the best of the non-unrolled versions.