A love letter to profiling, in the style of a back-of-house confession
Look, I need to talk about something and it's going to make some people uncomfortable. That's fine. Comfort was never the point.
During the pandemic, something happened to our industry. Everyone went home, discovered linters, and came back as a code sommelier. Suddenly every pull request became a tasting menu. "Hmm, I'm getting notes of inconsistent naming convention in this diff." "This could benefit from an extracted interface." "Have we considered a more composable pipeline here?"
Meanwhile the server is on fire. But the code looks amazing.
I've sat through code reviews — and I mean real ones, on real systems that process real money — where a senior engineer spent forty-five minutes discussing whether a variable should be called transactionProcessor or txnHandler, while the function it lived in was allocating a gigabyte of garbage per call. Nobody opened a profiler. Nobody asked what the CPU thought about any of this. They just sat there, sniffing the code like a glass of wine, and completely ignored the kitchen.
Here's the thing about kitchens. The best meal you ever ate was probably made by someone with burns on their forearms, in a room that would horrify you, using techniques that would never make it into a glossy cookbook. The plate was beautiful because the result was beautiful. Not because someone Marie Kondo'd the mise en place.
Code works the same way. And I'm tired of pretending it doesn't.
The Experiment
We're going to write the same program twice:
The Pretty Version — the one that gets you through code review, gets the approving nod from the staff engineer, gets screenshot'd into a "clean code" Twitter thread.
The Fast Version — the one that looks like someone wrote it at 3 AM with a grudge against heap allocations. The one that a pandemic code reviewer would mass-comment with seventeen "nit:" prefixes.
Then we're going to profile both. And the profile is going to say things that no code reviewer ever will.
The workload: one million financial transactions. Filter by amount, date, currency. Group by category. Stats per category — sum, average, median, standard deviation. A summary report. The most boring, essential, everywhere kind of data processing there is. The kind of thing that's running in production right now, at your company, racking up a cloud bill that nobody questions because the code passed review.
The Pretty Version or: The Tasting Menu
Here's the struct. Inhale deeply.
```go
type Transaction struct {
	ID        string
	Timestamp time.Time
	Category  Category // type Category string
	Merchant  string
	Amount    float64
	Currency  string
	Tags      []string
	Notes     string
	Metadata  map[string]string
}
```
Chef's kiss. Self-documenting. Every field labeled with care. Tags for flexibility. Metadata for extensibility. time.Time because the standard library said so. Category is a named string type — type-safe but readable. Your reviewer sees this and whispers, "Clean."
Your CPU sees this and starts hyperventilating. Every single one of these Transaction structs lives on the heap — individually allocated, scattered across memory like confetti. time.Time carries a hidden *Location pointer inside it that you never asked for. Every string field is a pointer-and-length pair. The Tags slice, the Metadata map — more pointers. Your million-element []*Transaction is not an array of data. It's an array of directions to data, each one a scavenger hunt to a different part of the heap.
But you'd never know that from reading the code. It looks perfect. That's the problem.
Here's the pipeline. Composable. Interface-driven. The kind of thing you put in a conference talk.
```go
type Pipeline interface {
	Filter(fn func(*Transaction) bool) Pipeline
	Sum() float64
	Average() float64
	Median() float64
	StdDev() float64
	Count() int
	Transactions() []*Transaction
}
```
And here's how you use it:
```go
pipeline := NewPipeline(txns).
	Filter(func(t *Transaction) bool {
		return t.Amount >= minAmount
	}).
	Filter(func(t *Transaction) bool {
		return t.Timestamp.After(after)
	}).
	Filter(func(t *Transaction) bool {
		return t.Currency == "USD"
	})

groups := GroupByCategory(pipeline.Transactions())
```
Three filter calls. Composable. Elegant. Each one iterates the entire dataset and allocates a brand-new []*Transaction slice for the survivors. Three passes. Three slices. Then GroupByCategory does a fourth pass, building a map[Category][]*Transaction — because maps of string-keyed pointer slices are what elegance looks like, apparently.
Then per category we call Sum(), Average() (which calls Sum() again internally — surprise!), Median() (which copies all the amounts into a fresh []float64 and sorts it), and StdDev() (which calls Average() which calls Sum() — yes, again). That's five passes over each category's data to compute four numbers. With redundant sub-passes. Because each method is beautifully, cleanly, independently correct.
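You can make the hidden re-traversals visible with a toy counter. This is a hypothetical miniature of the pattern, not the article's actual pipeline — `passes` ticks up once per full scan of the data (StdDev here returns population variance; the final Sqrt is omitted for brevity):

```go
package main

import "fmt"

// passes counts full traversals of the data, to expose the cost of
// "each method is independently correct".
var passes int

type stats struct{ amounts []float64 }

func (s stats) Sum() float64 {
	passes++ // one full scan
	var t float64
	for _, a := range s.amounts {
		t += a
	}
	return t
}

// Average re-runs Sum internally -- a second scan.
func (s stats) Average() float64 { return s.Sum() / float64(len(s.amounts)) }

// StdDev calls Average (which calls Sum), then scans again itself.
func (s stats) StdDev() float64 {
	mean := s.Average()
	passes++ // the variance loop below is another full scan
	var sq float64
	for _, a := range s.amounts {
		d := a - mean
		sq += d * d
	}
	return sq / float64(len(s.amounts))
}

func main() {
	s := stats{amounts: []float64{10, 20, 30, 40}}
	s.Sum()
	s.Average()
	s.StdDev()
	fmt.Println("traversals:", passes) // 4 scans for three "independent" calls
}
```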
A reviewer would approve this in minutes. "Well-structured. Good separation of concerns. Ship it."
Let's ask the profiler.
What the Profiler Actually Sees or: The Kitchen After Close
```
$ go tool pprof -top pretty_cpu.prof
Duration: 1.30s, Total samples = 3.15s (241.79%)
      flat  flat%   sum%        cum   cum%
     0.57s 18.10% 18.10%      1.99s 63.17%  runtime.gcDrain
     0.56s 17.78% 35.87%      1.24s 39.37%  runtime.scanobject
     0.17s  5.40% 41.27%      0.17s  5.40%  runtime.memclrNoHeapPointers
     0.17s  5.40% 46.67%      0.26s  8.25%  slices.partitionOrdered
     0.16s  5.08% 51.75%      0.25s  7.94%  main.GroupByCategory
     0.14s  4.44% 56.19%      0.30s  9.52%  runtime.findObject
```
Read that and sit with it for a second.
63% of the CPU time is in garbage collection.
Not in your filters. Not in your stats. Not in any of the code you wrote and reviewed and approved. In gcDrain and scanobject — the Go runtime walking the heap on its hands and knees, finding the live objects among the wreckage, cleaning up the dead ones you created and immediately abandoned.
The garbage collector is doing more work than your program. Your beautiful composable pipeline is a garbage factory and the GC is the night shift cleaning up after it.
Your business logic — the filters, the grouping, the actual math — it's at 5% each. The work you wrote the code to do is a rounding error. The CPU spent most of its time not on your algorithm, but on the consequences of your architecture.
The heap profile is the autopsy report:
```
      flat  flat%   sum%        cum   cum%
    0.76GB 48.93% 48.93%     0.76GB 48.93%  main.(*transactionPipeline).Filter
    0.20GB 12.67% 93.28%     0.20GB 12.67%  main.GroupByCategory
    0.04GB  2.47% 95.75%     0.04GB  2.47%  main.(*transactionPipeline).Median
```
Filter allocated 760 megabytes. For three composable calls that a code reviewer praised for being "clean." GroupByCategory tossed in another 200 MB with its map. Median added 40 MB copying floats around for sorting. One call. One gigabyte of garbage.
No pandemic code reviewer caught this. No linter flagged it. No "nit:" in the PR. Because none of those things look at what the code does. They look at what the code looks like.
The Fast Version or: The Line Cook's Mise en Place
Now the code that would get seventeen comments in review. Struct-of-arrays:
```go
type TxnTable struct {
	Count      int
	Amounts    []int64  // cents, not float64
	Timestamps []int64  // unix seconds, not time.Time
	Categories []uint8  // enum, not string
	Merchants  []uint16 // table index, not string
	idxBuf     []int32  // scratch buffer, reused
}
```
No Transaction struct. No interfaces. No pointer-laden fields that secretly cost you a cache miss per access. Each field is its own contiguous array — a flat slab of memory the CPU can chew through like a conveyor belt. When you scan Amounts, one cache line loads 8 values and the prefetcher has the next line ready before you ask. That's not a metaphor. That's what the hardware actually does when you stop scattering your data across the heap.
Amounts are cents as int64. Not float64, which carries IEEE 754 rounding errors like luggage. Timestamps are unix seconds as int64. Not time.Time, which hides a *Location pointer inside it — a pointer the GC has to trace for every single record, a million times, for a timezone you already know. Categories are uint8. Not Category which is secretly string which is secretly a pointer. Comparing uint8 is one instruction. Comparing string("food") to string("food") walks bytes.
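The float64 luggage is easy to demonstrate. A quick sketch (hypothetical helper names) that adds a dime a hundred times, once in float64 dollars and once in integer cents:

```go
package main

import "fmt"

// sumFloat adds one dime a hundred times as float64 dollars.
// Binary floating point cannot represent 0.10 exactly, and the
// rounding error accumulates with every addition.
func sumFloat() float64 {
	var total float64
	for i := 0; i < 100; i++ {
		total += 0.10
	}
	return total
}

// sumCents does the same bookkeeping in integer cents: exact.
func sumCents() int64 {
	var total int64
	for i := 0; i < 100; i++ {
		total += 10
	}
	return total
}

func main() {
	fmt.Println(sumFloat() == 10.0) // false: the drift is already visible
	fmt.Println(sumCents() == 1000) // true
}
```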
"But the naming conventions—"
The CPU doesn't read your variable names. I promise.
Here's the hot loop:
```go
// amounts, timestamps, and categories alias t's column slices;
// t.idxBuf has been reset to t.idxBuf[:0] before the loop.
for i := 0; i < n; i++ {
	amt := amounts[i]
	if amt < minCents {
		continue
	}
	if timestamps[i] <= afterUnix {
		continue
	}
	cat := categories[i]
	cats[cat].count++
	cats[cat].sum += amt
	fa := float64(amt)
	cats[cat].sumSq += fa * fa
	t.idxBuf = append(t.idxBuf, int32(i))
}
```
One pass. Filter and accumulate simultaneously. The pretty version does three filter passes, then a groupby pass, then five stat passes per category. This does one. count, sum, and sumSq give us mean and variance without ever re-scanning the data: variance = E[X²] - (E[X])². One formula, zero extra passes, the kind of math that a pandemic code reviewer would ask you to "extract into a well-named helper method" while having no idea it just saved you five traversals.
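Here's that single-pass math isolated into a sketch (hypothetical function name; note that E[X²] − (E[X])² can lose precision through cancellation for very large magnitudes — Welford's online algorithm is the numerically sturdier cousin if that ever bites):

```go
package main

import "fmt"

// statsOnePass accumulates count, sum, and sum of squares in a single
// scan, then recovers mean and population variance via E[X^2] - (E[X])^2.
func statsOnePass(cents []int64) (mean, variance float64) {
	var sum int64
	var sumSq float64
	for _, c := range cents {
		sum += c
		f := float64(c)
		sumSq += f * f
	}
	n := float64(len(cents))
	mean = float64(sum) / n
	variance = sumSq/n - mean*mean
	return mean, variance
}

func main() {
	// Classic textbook data set: mean 5, population variance 4.
	fmt.Println(statsOnePass([]int64{2, 4, 4, 4, 5, 5, 7, 9})) // 5 4
}
```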
The median still needs sorting — there's no getting around that. But we sort with a hand-written quicksort over []int64 instead of going through sort.Interface, which would add an interface dispatch per comparison. On 80K elements, that's 80K function calls the CPU can't inline through. Our quicksort is just integers and comparisons. The branch predictor is bored. The pipeline is full. Everything is fine.
"But you should use the standard library—"
I should use what's fast. This is what's fast.
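(In fairness: modern Go's generic `slices.Sort` monomorphizes the comparison and gets you most of the same win from the standard library.) For the curious, a hand-rolled quicksort over `[]int64` is genuinely small — this is a minimal sketch in the spirit of the benchmark's `qsortInt64`, not its actual source:

```go
package main

import "fmt"

// qsortInt64 sorts a slice of int64 in place with a Hoare-style
// partition. Every comparison is a plain integer compare the compiler
// can see through -- no sort.Interface dispatch per element.
func qsortInt64(a []int64) {
	if len(a) < 2 {
		return
	}
	pivot := a[len(a)/2]
	left, right := 0, len(a)-1
	for left <= right {
		for a[left] < pivot {
			left++
		}
		for a[right] > pivot {
			right--
		}
		if left <= right {
			a[left], a[right] = a[right], a[left]
			left++
			right--
		}
	}
	qsortInt64(a[:right+1])
	qsortInt64(a[left:])
}

func main() {
	amounts := []int64{501, 12, 9944, 3, 777, 12}
	qsortInt64(amounts)
	fmt.Println(amounts) // [3 12 12 501 777 9944]
}
```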
The Profiler's Verdict or: Silence in the Kitchen
```
$ go tool pprof -top fast_cpu.prof
Duration: 401ms, Total samples = 280ms (69.82%)
      flat  flat%   sum%        cum   cum%
     210ms 75.00% 75.00%      210ms 75.00%  main.qsortInt64
      70ms 25.00%   100%      280ms   100%  main.ProcessFast
```
Two lines. The entire profile is two lines.
75% sorting — we need that for medians, no way around it. 25% filtering and accumulating — the actual work. Zero runtime.gcDrain. Zero runtime.scanobject. Zero runtime.findObject. The garbage collector didn't wake up. It had nothing to do. Because we didn't create any garbage.
Memory:
```
   30.85MB  4.99%  main.ProcessFast
```
30 megabytes. Not 1 gigabyte. Thirty megabytes.
The pretty version's profile was a horror movie — six runtime functions fighting for screen time, your business logic barely visible in the background. The fast version's profile is a two-item receipt. Sorting. Processing. Done. Go home.
The Numbers or: The Bill
Five rounds, Threadripper PRO, -benchmem:
| | Pretty | Fast | Factor |
|---|---|---|---|
| Time | 107 ms | 28 ms | 3.8x |
| Memory | 101.5 MB | 3.6 MB | 28x |
| Allocations | 282 | 7 | 40x |
| GC CPU share | 63% | 0% | --- |
Same data. Same operations. Same results. The difference is that one version was written to look good in a code review and the other was written to run.
What the Pandemic Code Reviewers Missed
1. Abstractions aren't free
This is the big one, and it's the one nobody wants to hear. Every Pipeline interface call, every func(*Transaction) bool predicate, every .Filter() returning a new Pipeline — the compiler can't see through any of it. It can't merge your three filters into one pass. It can't eliminate the intermediate slices. It can't inline through the interface dispatch. Your separation of concerns became a separation of cache lines.
"But abstraction is a best practice—"
Abstraction is a tool. Best practices are what people say when they've stopped thinking. A line cook doesn't julienne carrots because it's a "best practice." He does it because the dish needs it. If the dish needs rough-chopped, he rough-chops. The pandemic taught a generation of engineers to julienne everything regardless of what's on the plate.
2. Pointer chasing is the thing nobody talks about
[]*Transaction looks like a data structure. It's actually a treasure map where every step sends you to a random part of the heap. Each pointer dereference is a potential L1 cache miss — roughly 4 nanoseconds of the CPU standing around doing nothing, waiting for memory. Multiply by a million elements, multiply by three passes, and you've got a CPU that spends more time waiting than working. But there's no linter rule for cache locality, so nobody mentions it.
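The two access patterns are easy to put side by side. A sketch (hypothetical function names — wrap these in a `testing.B` benchmark and run `go test -bench=.` to see the gap yourself; note that this toy allocates the pointed-to values sequentially, so a real heap with interleaved allocations scatters them far worse):

```go
package main

import "fmt"

// sumFlat walks a contiguous slab: sequential loads, prefetcher-friendly.
func sumFlat(amounts []int64) int64 {
	var total int64
	for _, a := range amounts {
		total += a
	}
	return total
}

// sumPtrs walks a slice of pointers: every element costs an extra
// dereference to wherever the heap happened to put that value.
func sumPtrs(amounts []*int64) int64 {
	var total int64
	for _, p := range amounts {
		total += *p
	}
	return total
}

func main() {
	const n = 1_000_000
	flat := make([]int64, n)
	ptrs := make([]*int64, n)
	for i := range flat {
		flat[i] = int64(i)
		v := int64(i) // each value is its own heap allocation
		ptrs[i] = &v
	}
	// Same answer, very different memory traffic.
	fmt.Println(sumFlat(flat) == sumPtrs(ptrs)) // true
}
```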
3. The GC is not your janitor — it's your coworker
63% of the CPU time was garbage collection. Not because Go's GC is bad — it's genuinely excellent. Because we gave it a gigabyte of trash to deal with. Every Filter call that created a new slice? The old one became garbage. Every Median call that copied amounts? Garbage. We treated the heap like a buffet and then wondered why cleanup took so long.
Your code reviewer looked at Filter returning a new Pipeline and thought "immutable, nice." The GC looked at it and thought "oh great, another 40 MB of pointers I have to trace."
4. Strings are pointers wearing a trench coat
Category("food") looks like a value. It's not. string in Go is a pointer plus a length. Comparing strings means walking bytes. Storing strings means allocating. Every string field in your struct is another pointer the GC has to trace, another indirection the CPU has to follow, another entry in the "things that look free but aren't" ledger.
The fast version uses uint8. One byte. One instruction to compare. Zero allocations. Invisible to the GC. The pandemic code reviewer would say "use meaningful types." The CPU would say "use small ones."
5. Multiple passes are a luxury you're paying for
StdDev() calls Average() calls Sum(). Three methods, each beautiful in isolation, each walking the entire dataset. In code review, this is "good decomposition." In production, this is three loops where one would do.
The fast version accumulates sum and sumSq in the filter loop. variance = E[X²] - (E[X])². Same math. One pass. The pandemic code reviewer would say "extract this into a well-named method." The profiler would say "please don't."
Run It Yourself
```shell
# Build
go build -o profile-party .

# Quick comparison (wall time + memory)
./profile-party

# Generate pprof profiles
PROFILE=pretty ./profile-party
PROFILE=fast ./profile-party

# View CPU profiles (text)
go tool pprof -top pretty_cpu.prof
go tool pprof -top fast_cpu.prof

# View CPU profiles (interactive flamegraph)
go tool pprof -http=:8080 pretty_cpu.prof
go tool pprof -http=:8081 fast_cpu.prof

# Benchmarks with memory stats
go test -bench=. -benchmem -count=5

# Or run the whole thing
chmod +x profile.sh && ./profile.sh
```
Open the flamegraph for the pretty version. Look at how much of it is green — runtime, GC, memory management. Now open the fast version. Look at how there's nothing there. That visual difference is worth more than a hundred code reviews.
The Point
I'm not saying write ugly code. I'm not saying ignore your linter. I'm not saying fire your code reviewers — although some of them could use a shift in the kitchen.
I'm saying that code review without profiling is food criticism without tasting. You're judging the presentation, the plating, the font on the menu — and never putting the fork in your mouth. The pandemic created a generation of engineers who can spot a naming convention violation at fifty paces but have never once opened pprof.
The linter sees syntax. The profiler sees physics.
Cache lines. Memory hierarchy. GC pauses. Branch prediction. The actual, physical, measurable behavior of your program on actual, physical hardware. Two completely different views of the same code, and we've been staring at only one of them.
Write clean code. Ship clean code. But when it matters — and the profiler will tell you when — understand that the machine doesn't see what you see. The machine sees memory layouts and access patterns and allocation rates. The profiler is the window into that world. And what you see through it might make you question everything the code review told you was fine.
That's the beauty of profiling. Not that it makes code faster. That it makes you see. And once you've seen a CPU profile that's 63% garbage collection, you can never look at a composable filter pipeline the same way again.
The pretty code isn't wrong. The fast code isn't pretty. But only one of them was written by someone who checked what the machine actually thought.
Go profile something. You might lose your appetite.