Andrii's Blog

Notes from Optimizing CPU-Bound Go Hot Paths

May 03, 2026

Go does a lot of things right. And I love go because of that. But while porting Brotli to pure Go for go-brrr, I kept hitting the same pattern: idiomatic abstractions made hot paths slower, and the fastest version was often hand-duplicated and specialized.

Lack of zero-cost abstractions

In the hot loops I was optimizing, generics, polymorphic dispatch (via interface), and closures often prevented the compiler from producing the same code as the concrete version. The reason is that go doesn't inline these calls in the shapes I was using (we will see problems with inlining very often in this post because inlining is quite important). Yes, the compiler can sometimes inline a direct closure call or devirtualize an interface call, but in the patterns I actually ran into it didn't, and I ate the call overhead. It's clear why interface calls are not inlined. They enable swapping of implementation at runtime rather than compile time. But generics allow to swapping implementation at compile time. If you are coming from languages like c++ or rust you'd expect the generic functions to be monomorphized (all variants pre-generated as concrete functions at compile time) but in go it doesn't happen,at least not in that form. Go uses approach they called GC Shape Stenciling where some parts are pre-generated at compile time but method calls on type parameters end up going through interface-style dispatch (technically the itab is reached via a generics dictionary rather than an ordinary interface argument, but the effect on the hot path is the same). The impossibility of inlining is addressed in the proposal:

The one exception is that method calls won't be fully resolvable at compile time... inlining won't happen in situations where it could happen with a fully stenciled implementation.

So what do we do? Actually, no problem. We just don't use abstractions like generics and duplicate concrete functions. We just take a concrete function duplicate it completely with parts that we wanted to parametrize changed. Needless to say, this will cause lots of duplication. In the Brotli port there were 16 almost identical functions and the only difference between them was that they were calling different versions of the hash function. The 16 variants couldn't be collapsed into one via usage of abstractions because the function is used in hot path.

So performance problem is solved by duplication but this introduces a potentially big problem of maintenance. This can be somewhat mitigated by code generation of course but it's very likely that you will have big number of occurrences where you have only 2-3 duplicating variants which will not justify introduction of codegen.

The next section is a deep dive into benchmarking of the concrete-vs-generic-vs-interface approaches with some exploration of underlying assembly which you can happily skip.

Deep dive

Let's illustrate all described above by example. Here is the function I used in a real codebase reduced from unimportant stuff.

func StoreConcrete(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

Now imagine that we need to use several versions of hash functions. Which hash function to use is known at compile time. There are several options how we can parametrize.

One option is to use generics:

type Hasher interface {
	Hash(v uint32) uint32
}

type H5Hasher struct{}
func (H5Hasher) Hash(v uint32) uint32 {
	return (v * HashMul32) >> (32 - BucketBits)
}

func StoreGeneric[H Hasher](t *Table, data []byte) {
	// ...
		key := h.Hash(v)
	// ...
}

Another option is to use polymorphic dispatch:

func StoreInterface(t *Table, data []byte, h Hasher) {
	// ...
		key := h.Hash(v)
	// ...
}

Another option is to pass a closure to the function:

func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
	// ...
		key := hash(v)
	// ...
}

Expand to see full code

import "encoding/binary"

const HashMul32 = 0x1E35A7BD

const (
	BucketBits = 14
	BlockBits  = 4
	BucketSize = 1 << BucketBits
	BlockSize  = 1 << BlockBits
)

type Table struct {
	Num     [BucketSize]uint16
	Buckets [BucketSize * BlockSize]uint32
}

func StoreConcrete(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

type Hasher interface {
	Hash(v uint32) uint32
}

type H5Hasher struct{}

func (H5Hasher) Hash(v uint32) uint32 {
	return (v * HashMul32) >> (32 - BucketBits)
}

func StoreGeneric[H Hasher](t *Table, data []byte) {
	var h H
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := h.Hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

func StoreInterface(t *Table, data []byte, h Hasher) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := h.Hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

As it's known at compile time what hash function is used the compiler can produce optimal code, right? Wrong!

Let's benchmark it first.

Environment:

go version go1.26.2-X:nodwarf5 linux/amd64
goos: linux
goarch: amd64
pkg: hashdemo
cpu: 12th Gen Intel(R) Core(TM) i5-12500

Run with:

go test -bench=. -benchmem -count 6 -cpu 1 | tee bench.txt
benchstat -filter '.unit:B/s' -col .name bench.txt

The throughput numbers in the table below are the benchstat-reported values across 6 runs.

Expand to see the full benchmarking code

import "testing"

const benchSize = 1 << 16

func BenchmarkConcrete(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreConcrete(t, data)
	}
}

func BenchmarkGeneric(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreGeneric[H5Hasher](t, data)
	}
}

func BenchmarkInterface(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	var h Hasher = H5Hasher{}
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreInterface(t, data, h)
	}
}

func BenchmarkClosure(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	hash := func(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) }
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreClosure(t, data, hash)
	}
}

func makeData(n int) []byte {
	b := make([]byte, n)
	x := uint32(0xDEADBEEF)
	for i := range b {
		x = x*1664525 + 1013904223
		b[i] = byte(x >> 24)
	}
	return b
}

Variant	Throughput	Δ vs Concrete
Concrete	378.0 MiB/s
Generic	320.6 MiB/s	-15.18%
Closure	322.0 MiB/s	-14.82%
Interface	274.3 MiB/s	-27.44%

Whoa! That's pretty dramatic difference.

Assembly related to the Concrete function

// func StoreConcrete(t *Table, data []byte) {
PUSHQ BP		// 0x52e4a0
MOVQ SP, BP
MOVQ BX, 0x18(SP)
// 	for i := uint32(0); i+4 <= end; i++ {
XORL DX, DX
JMP 0x52e4ed
// 		minor := uint32(t.Num[key]) & (BlockSize - 1)
TESTB AL, 0(AX)		// 0x52e4ad
// 	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVL 0(BX)(DX*1), R9
// 		key := (v * HashMul32) >> (32 - BucketBits)
IMULL $0x1e35a7bd, R9, R9
SHRL $0x12, R9
// 		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVZX 0(AX)(R9*2), R10
ANDL $0xf, R10
// 		t.Buckets[minor+key<<BlockBits] = i
MOVL R9, R11
SHLL $0x4, R9
ADDL R10, R9
MOVL R8, 0x8000(AX)(R9*4)
// 		t.Num[key]++
MOVZX 0(AX)(R11*2), R9
INCL R9
MOVW R9, 0(AX)(R11*2)
// 	for i := uint32(0); i+4 <= end; i++ {
LEAL 0x1(R8), DX
MOVQ SI, CX
LEAL 0x4(DX), SI	// 0x52e4ed
CMPL CX, SI
JB 0x52e514
// 		v := binary.LittleEndian.Uint32(data[i:])
CMPQ CX, DX
JB 0x52e51b
MOVQ CX, SI
SUBQ DX, CX
MOVL DX, R8
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R8, DX
// 	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CMPQ CX, $0x3
JA 0x52e4ad
JMP 0x52e516
// }
POPQ BP			// 0x52e514
RET
// 	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CALL runtime.panicBounds(SB)	// 0x52e516
// 		v := binary.LittleEndian.Uint32(data[i:])
NOPL 0(AX)(AX*1)		// 0x52e51b
CALL runtime.panicBounds(SB)
NOPL				// 0x52e525

Assembly related to the Generic function

//TEXT hashdemo.H5Hasher.Hash(SB) /data/devel/my/blog-demo/hashdemo/hash.go
//	return (v * HashMul32) >> (32 - BucketBits)
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

//TEXT hashdemo.StoreGeneric[go.shape.struct {}](SB) /data/devel/my/blog-demo/hashdemo/hash.go
//func StoreGeneric[H Hasher](t *Table, data []byte) {
CMPQ SP, 0x10(R14)
JBE 0x52f0f0
PUSHQ BP
MOVQ SP, BP
SUBQ $0x10, SP
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ AX, 0x20(SP)
MOVQ BX, 0x28(SP)
MOVQ CX, 0x30(SP)
MOVQ DI, 0x38(SP)
MOVQ SI, 0x40(SP)
XORL DX, DX
JMP 0x52f072
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVZX 0(CX)(BX*2), R8	// 0x52f02f
ANDL $0xf, R8
//		t.Buckets[minor+key<<BlockBits] = i
SHLL $0x4, AX
ADDL AX, R8
MOVL 0xc(SP), DX
MOVL DX, 0x8000(CX)(R8*4)
//		t.Num[key]++
MOVZX 0(CX)(BX*2), R8
INCL R8
MOVW R8, 0(CX)(BX*2)
//	for i := uint32(0); i+4 <= end; i++ {
INCL DX
//		key := h.Hash(v)
MOVQ 0x20(SP), AX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVQ 0x30(SP), CX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ 0x28(SP), BX
//		v := binary.LittleEndian.Uint32(data[i:])
MOVQ 0x40(SP), SI
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ 0x38(SP), DI
LEAL 0x4(DX), R8	// 0x52f072
CMPL DI, R8
JB 0x52f0cf
NOPL 0(AX)(AX*1)
//		v := binary.LittleEndian.Uint32(data[i:])
CMPQ DI, DX
JB 0x52f0ea
SUBQ DX, DI
MOVL DX, R9
SUBQ SI, DX
SARQ $0x3f, DX
ANDQ R9, DX
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CMPQ DI, $0x3
JBE 0x52f0e5
//	for i := uint32(0); i+4 <= end; i++ {
MOVL R9, 0xc(SP)
//		key := h.Hash(v)
MOVQ 0(AX), BX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVL 0(CX)(DX*1), CX
//		key := h.Hash(v)
MOVQ AX, DX
MOVL CX, AX
CALL BX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ 0x28(SP), CX
TESTB AL, 0(CX)
MOVL AX, BX
NOPW 0(AX)(AX*1)
NOPL
CMPQ BX, $0x4000
JB 0x52f02f
JMP 0x52f0d5
//}
ADDQ $0x10, SP		// 0x52f0cf
POPQ BP
RET
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ $0x4000, AX		// 0x52f0d5
NOPL 0(AX)
CALL runtime.panicBounds(SB)
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CALL runtime.panicBounds(SB)	// 0x52f0e5
//		v := binary.LittleEndian.Uint32(data[i:])
CALL runtime.panicBounds(SB)	// 0x52f0ea
NOPL
//func StoreGeneric[H Hasher](t *Table, data []byte) {
MOVQ AX, 0x8(SP)					// 0x52f0f0
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
JMP hashdemo.StoreGeneric[go.shape.struct {}](SB)

//TEXT hashdemo.(*H5Hasher).Hash(SB) <autogenerated>

PUSHQ BP
MOVQ SP, BP
TESTQ AX, AX
JE 0x52f154
//	return (v * HashMul32) >> (32 - BucketBits)
IMULL $0x1e35a7bd, BX, AX
SHRL $0x12, AX

POPQ BP
RET
CALL runtime.panicwrap(SB)	// 0x52f154
NOPL

Assembly related to the Closure function

//TEXT hashdemo.StoreClosure(SB) /data/devel/my/blog-demo/hashdemo/hash.go
//func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
CMPQ SP, 0x10(R14)
JBE 0x52e779
PUSHQ BP
MOVQ SP, BP
SUBQ $0x10, SP
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ AX, 0x20(SP)
MOVQ SI, 0x40(SP)
MOVQ CX, 0x30(SP)
MOVQ BX, 0x28(SP)
MOVQ DI, 0x38(SP)
XORL DX, DX
JMP 0x52e710
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVZX 0(CX)(BX*2), R8	// 0x52e6cf
ANDL $0xf, R8
//		t.Buckets[minor+key<<BlockBits] = i
SHLL $0x4, AX
ADDL AX, R8
MOVL 0xc(SP), DX
MOVL DX, 0x8000(CX)(R8*4)
//		t.Num[key]++
MOVZX 0(CX)(BX*2), R8
INCL R8
MOVW R8, 0(CX)(BX*2)
//	for i := uint32(0); i+4 <= end; i++ {
INCL DX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ CX, AX
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ 0x30(SP), CX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVQ 0x28(SP), BX
//		key := hash(v)
MOVQ 0x40(SP), SI
//		v := binary.LittleEndian.Uint32(data[i:])
MOVQ 0x38(SP), DI
//	for i := uint32(0); i+4 <= end; i++ {
LEAL 0x4(DX), R8	// 0x52e710
CMPL CX, R8
JB 0x52e75c
//		v := binary.LittleEndian.Uint32(data[i:])
CMPQ CX, DX
JB 0x52e773
SUBQ DX, CX
MOVL DX, R9
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R9, DX
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CMPQ CX, $0x3
JBE 0x52e76e
//	for i := uint32(0); i+4 <= end; i++ {
MOVL R9, 0xc(SP)
//		key := hash(v)
MOVQ 0(SI), CX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVL 0(BX)(DX*1), AX
//		key := hash(v)
MOVQ SI, DX
CALL CX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ 0x20(SP), CX
TESTB AL, 0(CX)
MOVL AX, BX
CMPQ BX, $0x4000
JB 0x52e6cf
JMP 0x52e762
//}
ADDQ $0x10, SP		// 0x52e75c
POPQ BP
RET
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ $0x4000, AX		// 0x52e762
CALL runtime.panicBounds(SB)
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CALL runtime.panicBounds(SB)	// 0x52e76e
//		v := binary.LittleEndian.Uint32(data[i:])
CALL runtime.panicBounds(SB)
NOPL
//func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
MOVQ AX, 0x8(SP)			// 0x52e779
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
JMP hashdemo.StoreClosure(SB)

//TEXT hashdemo.BenchmarkClosure.func1(SB) /data/devel/my/blog-demo/hashdemo/hash_test.go
//	hash := func(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) }
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

Assembly related to the Interface function

//TEXT hashdemo.H5Hasher.Hash(SB) /data/devel/my/blog-demo/hashdemo/hash.go
//	return (v * HashMul32) >> (32 - BucketBits)
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

//TEXT hashdemo.StoreInterface(SB) /data/devel/my/blog-demo/hashdemo/hash.go
//func StoreInterface(t *Table, data []byte, h Hasher) {
CMPQ SP, 0x10(R14)
JBE 0x52e650
PUSHQ BP
MOVQ SP, BP
SUBQ $0x18, SP
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ AX, 0x28(SP)
MOVQ CX, 0x38(SP)
MOVQ BX, 0x30(SP)
MOVQ DI, 0x40(SP)
MOVQ R8, 0x50(SP)
MOVQ SI, 0x48(SP)
XORL DX, DX
JMP 0x52e5dd
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVZX 0(CX)(DX*2), R9	// 0x52e594
ANDL $0xf, R9
//		t.Buckets[minor+key<<BlockBits] = i
SHLL $0x4, AX
ADDL AX, R9
MOVL 0x14(SP), R10
MOVL R10, 0x8000(CX)(R9*4)
//		t.Num[key]++
MOVZX 0(CX)(DX*2), R9
INCL R9
MOVW R9, 0(CX)(DX*2)
//	for i := uint32(0); i+4 <= end; i++ {
LEAL 0x1(R10), DX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ CX, AX
//	for i := uint32(0); i+4 <= end; i++ {
MOVQ 0x38(SP), CX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVQ 0x30(SP), BX
//		key := h.Hash(v)
MOVQ 0x48(SP), SI
//		v := binary.LittleEndian.Uint32(data[i:])
MOVQ 0x40(SP), DI
//		key := h.Hash(v)
MOVQ 0x50(SP), R8
//	for i := uint32(0); i+4 <= end; i++ {
LEAL 0x4(DX), R9	// 0x52e5dd
CMPL CX, R9
JB 0x52e62f
//		v := binary.LittleEndian.Uint32(data[i:])
CMPQ CX, DX
JB 0x52e64a
SUBQ DX, CX
MOVL DX, R10
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R10, DX
NOPL 0(AX)(AX*1)
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CMPQ CX, $0x3
JBE 0x52e645
//	for i := uint32(0); i+4 <= end; i++ {
MOVL R10, 0x14(SP)
//		key := h.Hash(v)
MOVQ 0x18(SI), CX
//	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
MOVL 0(BX)(DX*1), BX
//		key := h.Hash(v)
MOVQ R8, AX
CALL CX
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ 0x28(SP), CX
TESTB AL, 0(CX)
MOVL AX, DX
CMPQ DX, $0x4000
JB 0x52e594
JMP 0x52e635
//}
ADDQ $0x18, SP		// 0x52e62f
POPQ BP
RET
//		minor := uint32(t.Num[key]) & (BlockSize - 1)
MOVQ $0x4000, AX		// 0x52e635
NOPL 0(AX)
CALL runtime.panicBounds(SB)
//	_ = b[3] // bounds check hint to compiler; see golang.org/issue/14808
CALL runtime.panicBounds(SB)	// 0x52e645
//		v := binary.LittleEndian.Uint32(data[i:])
CALL runtime.panicBounds(SB)
NOPL
//func StoreInterface(t *Table, data []byte, h Hasher) {
MOVQ AX, 0x8(SP)			// 0x52e650
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
MOVQ R8, 0x30(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
MOVQ 0x30(SP), R8
JMP hashdemo.StoreInterface(SB)

what we notice immediately is that all the variants contain almost double amount of the instructions present in the original concrete function. In this case the extra call, the args being reloaded from the stack every iteration, the nil check, and the extra bounds check are enough to show up clearly in throughput. But let's compare side by side what happens inside the hot loop - the most important and performance sensitive part of the code.

Concrete	Generic
LEAL 0x4(DX), SI CMPL CX, SI JB 0x52e514 ...	LEAL 0x4(DX), R8 CMPL DI, R8 JB 0x52f0cf ...	Loop condition
	MOVL R9, 0xc(SP) MOVQ 0(AX), BX MOVL 0(CX)(DX*1), CX MOVQ AX, DX MOVL CX, AX CALL BX	Making call to the non-inlined hash function
MOVL 0(BX)(DX*1), R9 IMULL $0x1e35a7bd, R9, R9 SHRL $0x12, R9	IMULL $0x1e35a7bd, AX, AX SHRL $0x12, AX RET	hash function is simply inlined in the concrete version and is non-inlined in the generic version
	MOVQ 0x28(SP), CX TESTB AL, 0(CX) MOVL AX, BX NOPW 0(AX)(AX*1) NOPL CMPQ BX, $0x4000 JB 0x52f02f JMP 0x52f0d5	Extra bounds check and nil check that the concrete version doesn't need.
MOVZX 0(AX)(R92), R10 ... MOVW R9, 0(AX)(R112)	MOVZX 0(CX)(BX2), R8 ... MOVW R8, 0(CX)(BX2)	Real work inside the loop
	MOVQ 0x20(SP), AX MOVQ 0x30(SP), CX MOVQ 0x28(SP), BX MOVQ 0x40(SP), SI MOVQ 0x38(SP), DI	Reloading the function arguments from the stack every iteration because the call trashes the registers.

Well, the hot loop assembly shows clearly that cpu handles more instructions in the generic version due to machinery that is required to execute a call to non-inlined hash function. No need to also include the interface and closure versions in the table above. Their hot loops are nearly identical to the generic version.

Most of the problems below come back to the same root cause we just saw: the compiler isn't inlining where you need it to, and there's no way to tell it to. So the rest of the post is mostly variations on this.

Lack of intrinsics

The previous problem could have easily side-stepped by code duplication. This one, however, truly hurts the performance. The underlying mechanism, though, is again the inability to inline.

Lots of cpus support instructions to load memory into L1, L2, L3 cache. It's super useful as not having needed data in the cpu cache causes cpu to stall while loading the data for about 100 cycles. If you know in advance that you will definitely need some piece of data handful of statements later you can prefetch that memory and do some useful work while that memory is being loaded in the background.

In other languages prefetch is exposed through intrinsics, pseudo-functions that the compiler recognizes and replaces with a single machine instruction emitted right at the call site. C and C++ have __builtin_prefetch in GCC/Clang and _mm_prefetch from the Intel intrinsics headers; Rust has core::intrinsics::prefetch_read_data and friends. They look like a function call in source but compile to one instruction with zero call overhead (yes, inlined).

Go doesn't expose a prefetch intrinsic to user code. The only way to get a PREFETCHT0 (or its friends) into your binary is to switch to assembly. But go assembly functions can't be inlined. Every call to your prefetch helper compiles to a real CALL with the full calling-convention machinery around it.

As the prefetch code can't be inlined it again is slowed down by all that call machinery and very often defeats the purpose of adding prefetch in the first place.

The funny thing is that the prefetch intrinsic is right there in the internals of the stdlib. Just expose it to us, please. There is the github issue, asking to expose it but it is still sitting there as an open proposal.

SIMD is the same story, same mechanism. But this time, great news!, things are moving. Go 1.26 ships an experimental SIMD package for AMD64 behind GOEXPERIMENT=simd. See this github issue. It's not yet the stable, portable thing you'd want for production code across architectures, but it's progress.

SIMD (Single Instruction, Multiple Data), by the way, is a mechanism widely supported on modern CPUs where a single instruction operates on several data elements at once, packed into a wide vector register. More info is widely available on the internet.

Lack of //go:inline

There is //go:noinline compiler hint. It forbids compiler to inline the function followed by the hint. But there is no //go:inline hint which would do the opposite, instruct compiler to inline the function that follows the hint. This asymmetry kills me. I don't know the reason for this asymmetry, most probably, again there is some sort of trade-off that go team decided to handle in the way that forbids having //go:inline.

How do we deal with this problem? Go compiler calculates a "cost" of every function (based on complexity of the function) and if the cost is below heuristically chosen limit of 80 then the function is inlined (unless there are some other conditions that forbid inlining - see the generics, closures and interfaces cases above). PGO can push the compiler to be more aggressive for hot calls, so 80 isn't the whole story, but in regular non-PGO builds it's still the budget you run into. So if the function that we need to be inlined in the hot path is above the inlining cost we try to reshape the function so that it's "squeezed" into inlining limit. Of course if you can't squeeze it then you just manually inline it which causes the problem of duplication again.

One more important technique: extracting the cold part of the hot function into non-inlinable function (you actually even want to make sure that the cold function is not inlined by accident by hinting with //go:noinline). This way you might reduce the "cost" of hot function. In fact this technique is important in the scope outside of the case when you try to make the hot function inlinable. I'll probably want to write a separate post about it but the technique is about making things intentionally un-inlined to reduce icache-misses.

Lack of //go:nobounds (and other opt-in hints)

Every slice or array access in go gets a bounds check. The compiler can skip it when it can prove the index is in range. This is called bounds check elimination (BCE). In tight loops the elided version is meaningfully faster. The check itself costs something, and the panic branch also stops the optimizer from doing more aggressive things with the surrounding code.

Sometimes the compiler can't see the proof but you, the programmer, can. The usual trick is to insert a "hint load" early, like the _ = b[3] line you can spot in the assembly listings above in this post. That single check tells the compiler that all of b[0] .. b[3] are in range, and the per-byte checks below it disappear.

Another interesting anecdote related to the compiler inserting additional instructions to guarantee safety: having x << n in the code will cause the compiler to insert 4 instructions (SHLQ + CMPQ + SBBQ + ANDQ) instead of a single SHLQ instruction if the compiler can't prove that n < 64. The workaround is to write x << (n & 63). The mask is a no-op for any value n could actually take, but it convinces the compiler the shift is in range. Of course, this is a valid workaround only if you truly know that n < 64 in all cases.

These tricks only work when you can phrase your invariant as something the compiler already understands - another bounds check, a mask. Which is not always the case.

When that doesn't work, you're stuck. There is no //go:nobounds directive that says "trust me, this access is in range, skip the check". C and C++ have __builtin_assume, Rust has get_unchecked / unreachable_unchecked. Go gives you nothing.

There is one more option: do unsafe pointer arithmetic on the underlying memory, which sidesteps the bounds checks entirely. It often works, but it's a topic for another post.

This is the same shape of problem as //go:inline: Go gives you the opt-out (//go:noinline) but not the opt-in. And it's not just inlining and BCE. There is also no //go:unroll to force loop unrolling, no way to mark a branch as unlikely, no way to assert a value's range. If the compiler's heuristics happen to land in the right place, great. If they don't, you reshape your source code until they do, or you give up and write assembly.

Conclusion

In my opinion go shines in IO-bound world. It also has, in my opinion, made very good trade-off decisions which made go really great language. E.g. batteries-included stdlib, good package manager, easy to use async. However some trade-off decisions have made life a bit harder for people who try to do optimization of the CPU-bound workloads.

The first problem I described might not even be considered an issue for some people. Codegen exists, after all. And duplication isn't always pure cost: in go-brrr, skipping codegen let each copy specialize to the exact workload it handled. The variants ended up diverging far enough that a single template was not an option, but the specialization paid off.

Due to the above described issues (of course there are more, I didn't even mention the obvious fat runtime and garbage collection), you can't squeeze as much performance into the code handling cpu-bound workloads as other more performant languages would allow, and your code won't be looking very idiomatic, as your code will very likely have:

giant functions that would normally be split up,
duplicated loops where a shared helper would force a slow path,
hand-specialized code for hot shapes,
APIs structured around escape analysis and inlining rather than aesthetics.

My conclusion is not "don't write CPU-bound code in Go." I did, and the result is fast. But the path to fast Go often looks less like elegant abstraction and more like specialization, duplication, BCE tricks, and occasionally assembly.