Prithvish Baidya (d4mr)

systems engineer

10 min read

Anthropic's Kernel Optimization Challenge, Part 1: Understanding VLIW Machines

A visual guide to VLIW SIMD architectures, instruction scheduling, and why packing operations into cycles matters.

Nerds Took Over My Timeline

If you’re anything like me, your X feed got completely hijacked this week by nerds optimizing kernels. Anthropic released what I can only describe as a memetic virus for performance engineers:

The premise: here’s the performance engineering take-home we used to hire systems engineers. Claude Opus 4.5 now beats most humans at it. Can you beat Claude?

The baseline runs in 147,734 cycles. Claude’s best score is 1,363 cycles. That’s a 108x improvement. Beat that, email them your code, maybe get an interview.

And just like that, every systems programmer on Twitter lost their weekend.

The Leaderboard Wars

Paradigm set up a public leaderboard and things got competitive fast. As Paradigm's @FrankieIsLost put it: "this is a really great test for AI tool proficiency, since it requires working with long-running tasks, multi-step reasoning, tool use and iterative debugging. here's a leaderboard for it if you want to enter the arena."

I threw my hat in the ring too. Got 2,375 cycles (@earth_ish), which felt pretty good until I refreshed the page and found myself on page 2. The leaderboard moves fast when you’ve got hundreds of optimization-brained engineers competing for bragging rights.

Then tinygrad Entered the Chat

George Hotz saw the challenge and immediately saw an opportunity:

We're at 1,340 cycles on the leaderboard with a fairly generic tinygrad backend. Beat our score with a beautiful PR worthy of the tinygrad repo for an interview.


His take: you can use tinygrad to do most of the work. Just extend the linearizer to output VLIW and render to their toy processor. He shipped an implementation that gets 1,340 cycles with a “fairly generic backend”.

The pitch: beat tinygrad’s score with a beautiful PR, get a job at tiny corp. Two companies now hiring off the same challenge.

Two Paths, One Challenge

| Path | Target | How |
| --- | --- | --- |
| Anthropic | < 1,487 cycles | Hand-optimize the kernel |
| tinygrad | Beat 1,340 cycles | Improve the compiler backend |
| Both | < 1,363 cycles | Either approach works |

This series will take you from zero to understanding exactly what these numbers mean, how to achieve them, and what the hell a “linearizer” is. Whether you want to hand-roll assembly or contribute to a compiler, you need to understand the machine first.

Let’s start with the fundamentals.


WTF Is Going On?

Before we dive in, let’s make sure we’re speaking the same language. Two concepts you need: kernels and cycles.

What’s a Kernel?

Not the Linux kind. In the context of GPUs and parallel computing:

A kernel is a small program that runs many times in parallel.

When you write a + b in NumPy and both arrays have a million elements, you’re not running a million separate Python operations. Deep in the stack, a kernel executes—a tight loop of machine instructions that processes elements in bulk.

Python
# This innocent line...
result = a + b  # where a and b are arrays of 1,000,000 floats
# ...triggers a kernel that conceptually does:
for i in range(1_000_000):
    result[i] = a[i] + b[i]

The kernel is the inner loop. It’s what actually runs on the hardware. And making it fast is the whole game.

What’s a Cycle?

Your CPU has a clock. Every tick of that clock is a cycle. Each cycle, the processor can do… something. Maybe one operation. Maybe more. Maybe it has to wait.

(Interactive demo: adding 7 + 3 and 12 + 8 one step per cycle, with values read from memory into registers, added, and written back over 8 cycles.)

When Anthropic says “147,734 cycles,” they mean: the naive implementation takes 147,734 clock ticks to finish. When Claude gets 1,363 cycles, it’s doing the same work in 1,363 ticks. Same result, over 100x fewer cycles.

Cycles are the currency of performance. Every optimization technique—pipelining, parallelism, caching, branch prediction—exists to do more useful work per cycle.

If you want to go deeper on how CPUs actually execute instructions, cpu.land is an incredible visual explainer. But for this challenge, you just need to know: fewer cycles = faster.

The Simulated Machine

Here’s the twist: this challenge doesn’t run on your CPU or GPU. Anthropic built a simulated processor with specific, unusual constraints. It’s a toy machine, but the optimization principles are real.

Understanding those constraints is the whole game. Let’s look at the machine.


The Machine

The simulated processor is a VLIW SIMD architecture. Let’s break that down.

SIMD: One Instruction, Many Data

SIMD stands for Single Instruction, Multiple Data. Instead of processing one element at a time, you process a whole vector in a single operation.

(Interactive demo: adding two 8-element vectors, A = [3, 1, 4, 1, 5, 9, 2, 6] and B = [2, 7, 1, 8, 2, 8, 1, 8], one element per cycle in scalar mode versus all 8 at once in SIMD mode.)

This machine has a vector width of 8 (VLEN=8). Run both modes and watch the difference. Scalar takes 8 cycles to add 8 pairs. SIMD does it in 1.

If you have 256 elements to process, that’s 256 scalar operations or 32 vector operations. Same result, 8x fewer cycles.
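Here's that difference as a rough Python sketch. The vadd helper below is a stand-in for a single vector operation, not the simulated machine's actual instruction:

Python
VLEN = 8  # the machine's vector width

def vadd(a8, b8):
    # Stand-in for one VALU operation: adds 8 elements in a single op.
    return [x + y for x, y in zip(a8, b8)]

a = list(range(256))
b = list(range(256, 512))

# Scalar: 256 separate add operations
scalar_result = [a[i] + b[i] for i in range(256)]

# SIMD: 32 vector operations, 8 elements each
simd_result = []
for i in range(0, 256, VLEN):
    simd_result.extend(vadd(a[i:i+VLEN], b[i:i+VLEN]))

assert scalar_result == simd_result  # same answer, 8x fewer operations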

VLIW: One Cycle, Many Operations

VLIW stands for Very Long Instruction Word. Instead of executing one operation per clock cycle, you pack multiple independent operations into a single “instruction bundle” that all execute in parallel.

The machine has five “engines,” each capable of running multiple “slots” per cycle:

Slot Limits Per Cycle

Maximum operations the machine can execute in a single cycle:

| Engine | Slots per cycle |
| --- | --- |
| ALU | 12 |
| VALU | 6 |
| LOAD | 2 |
| STORE | 2 |
| FLOW | 1 |

Total capacity per cycle: 23 operations.

With VALU processing 8 elements each, that's up to 12 + 6×8 = 60 scalar-equivalent operations per cycle.

In a single cycle, you can execute up to 12 scalar ALU operations, 6 vector ALU operations, 2 loads, 2 stores, and 1 flow control operation. All at once. Every cycle.

Here’s what this looks like in practice:

VLIW Packing Demo

(Interactive demo: compute 7 + 3 = 10, then 10 × 2 = 20, with LOAD and ALU operations issued one per cycle versus packed into shared cycles.)

Toggle between “Naive” and “Packed” to see the difference. Same computation, fewer cycles. The naive version uses one operation per cycle. The packed version fills multiple slots per cycle.

This is the core insight: the machine can do a lot per cycle, but only if you tell it to.

A naive compiler emits one operation per instruction. An optimizing compiler packs as many independent operations as possible into each cycle. The difference is 100x+ performance.
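To make that concrete, here is the packing demo's computation (7 + 3 = 10, then 10 × 2 = 20) written out as cycle bundles. The notation is made up for illustration; it is not the machine's real instruction format:

Python
# Naive schedule: one operation per cycle -> 4 cycles
naive = [
    ["r0 = load mem[0]"],                      # cycle 1
    ["r1 = load mem[1]"],                      # cycle 2
    ["r2 = r0 + r1"],                          # cycle 3
    ["r3 = r2 * 2"],                           # cycle 4
]

# Packed schedule: independent ops share a cycle -> 3 cycles.
# The two loads are independent and the machine has 2 LOAD slots per cycle,
# so they share a bundle. The add needs both loaded values and the multiply
# needs the add's result, so they stay in later cycles.
packed = [
    ["r0 = load mem[0]", "r1 = load mem[1]"],  # cycle 1
    ["r2 = r0 + r1"],                          # cycle 2
    ["r3 = r2 * 2"],                           # cycle 3
]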

Scratch Space (Registers)

The machine has 1,536 words of “scratch space.” Think of it as registers. You load values from memory into scratch, compute on them, and store results back.

Unlike a real CPU with 16-32 registers, you have 1,536. This sounds generous until you realize the challenge tracks 256 walkers, each with an index and a value, through 16 rounds of computation. Register allocation becomes a real constraint.


The Algorithm

Now that we understand the machine, what are we computing?

Tree Traversal with Hashing

The problem simulates 256 “walkers” traversing a binary tree. Each walker has:

  • An index (current position in the tree)
  • A value (accumulator that gets hashed)

All walkers start at the root (index 0) with different initial values (random seeds). Even though they start at the same position, their different values cause the hash to produce different results, making them diverge down different paths.

Each round, every walker:

  1. Looks up the tree node value at its current index
  2. XORs its value with the node value
  3. Hashes the result (6 stages of bit manipulation)
  4. Moves left or right based on whether the hash is even or odd
  5. Wraps to the root if it falls off the tree
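In plain Python, one round for a single walker looks roughly like the sketch below. It assumes the usual flat-array tree layout (children of node i at 2i + 1 and 2i + 2) and masks the hash to 32 bits; the real harness may differ in those details. myhash is the function shown in the next section.

Python
N_NODES = 2047  # height=10 tree stored as a flat array

def step_walker(idx, val, tree):
    node = tree[idx]                           # 1. look up the current node
    val = myhash(val ^ node) & 0xFFFFFFFF      # 2-3. XOR, then hash (assuming 32-bit words)
    child = 2 * idx + 1 if val % 2 == 0 else 2 * idx + 2  # 4. left if even, right if odd
    idx = child if child < N_NODES else 0      # 5. wrap to the root off the bottom
    return idx, val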

Here’s a visualization:

Tree Traversal Visualization

(Interactive demo: a simplified traversal with height=4 and a single walker; the real challenge uses height=10, 256 walkers, and 16 rounds. Each step XORs the walker's value with the node, hashes it, goes left if the result is even and right if odd, and wraps to the root at the bottom of the tree.)

Watch the walker traverse the tree. Each step involves a lookup, hash, and branch decision. The hash function is deterministic, so the path depends entirely on the initial value.

The Hash Function

The hash is the expensive part. Six stages of bit manipulation:

Python
def myhash(a):
    a = (a + 0x7ED55D16) + (a << 12)
    a = (a ^ 0xC761C23C) ^ (a >> 19)
    a = (a + 0x165667B1) + (a << 5)
    a = (a + 0xD3A2646C) ^ (a << 9)
    a = (a + 0xFD7046C5) + (a << 3)
    a = (a ^ 0xB55A4F09) ^ (a >> 16)
    return a

Each stage is 3 operations: add/xor with constant, shift, combine. That’s 18 ALU operations per hash, per walker, per round.

With 256 walkers and 16 rounds, that’s:

256 × 16 × 18 = 73,728 hash operations alone

Plus lookups, index calculations, and stores. The naive baseline takes 147,734 cycles because it does one operation per cycle.

The Numbers

| What | Count |
| --- | --- |
| Walkers | 256 |
| Rounds | 16 |
| Tree nodes | 2,047 (height=10) |
| Hash ops per walker per round | 18 |
| Total hash ops | 256 × 16 × 18 = 73,728 |
| Baseline cycles | 147,734 |
| Claude’s best | 1,363 |

The baseline is slow because it doesn’t use SIMD (processing 1 walker at a time instead of 8) and doesn’t pack operations (1 op per cycle instead of many).


Why Packing Is Hard

You can’t just throw all operations into one cycle. There are dependencies.

Data Dependencies

(Interactive demo: why some operations cannot execute in the same cycle. For example, a: r0 = r1 + r2, b: r3 = r4 * r5, and c: r6 = r7 ^ r8 share no registers, so all three can be packed into one cycle; if one of them instead wrote a register that another reads, they would conflict.)

Click through the examples. When operation B reads a register that operation A writes, they can’t be in the same cycle. B would read the old value of the register, not A’s result.

This is the fundamental constraint of instruction scheduling:

  1. RAW (Read-After-Write): B reads what A writes → B must come after A
  2. WAW (Write-After-Write): both write the same register → if they share a cycle, it's undefined whose result survives
  3. WAR (Write-After-Read): B writes what A reads → if B runs too early, A reads the new value instead of the old one

The art of optimization is finding operations that don’t conflict and packing them together.
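Here's a minimal sketch of that in Python: a greedy scheduler that places each operation in the earliest cycle where its inputs are ready, no register hazards exist, and the engine still has a free slot. The (engine, dest, sources) op format is invented for illustration and load addresses are omitted; the real challenge's instruction encoding is different.

Python
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}

def schedule(ops):
    """ops: list of (engine, dest_reg, [source_regs]) in program order."""
    cycles = []        # each entry is one bundle (list of ops)
    last_write = {}    # reg -> cycle index that last wrote it
    last_read = {}     # reg -> latest cycle index that read it

    for engine, dst, srcs in ops:
        # Earliest legal cycle: strictly after every producer (RAW),
        # after the previous writer of dst (WAW), and, conservatively,
        # after the last reader of dst (WAR).
        earliest = 0
        for r in srcs:
            if r in last_write:
                earliest = max(earliest, last_write[r] + 1)
        if dst in last_write:
            earliest = max(earliest, last_write[dst] + 1)
        if dst in last_read:
            earliest = max(earliest, last_read[dst] + 1)

        # Find the first cycle at or after `earliest` with a free slot.
        c = earliest
        while c < len(cycles) and \
                sum(1 for o in cycles[c] if o[0] == engine) >= SLOT_LIMITS[engine]:
            c += 1
        while len(cycles) <= c:
            cycles.append([])

        cycles[c].append((engine, dst, srcs))
        last_write[dst] = c
        for r in srcs:
            last_read[r] = max(last_read.get(r, -1), c)

    return cycles

# The packing demo from earlier: two loads, an add, a multiply-by-2.
ops = [
    ("load", "r0", []),
    ("load", "r1", []),
    ("alu",  "r2", ["r0", "r1"]),
    ("alu",  "r3", ["r2"]),
]
# schedule(ops) packs this into 3 cycles: [load, load], [add], [mul]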

Example: Hash Stage

Each hash stage looks like:

Python
tmp1 = val + CONST1 # writes tmp1, reads val
tmp2 = val << SHIFT # writes tmp2, reads val
val = tmp1 ^ tmp2 # writes val, reads tmp1 and tmp2

The first two operations can be packed—they both read val but write different registers. The third operation cannot be packed with them—it reads tmp1 and tmp2, which are being written.

So each hash stage is at minimum 2 cycles, not 1. And that’s just one walker. To parallelize across walkers, you need different registers for each, which eats into your 1,536 register budget.
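This is also where cross-walker packing pays off: while one walker's combine step waits on its two temporaries, another walker's independent add and shift can fill the same cycle. A rough picture of one hash stage for two walkers, in illustrative pseudocode rather than real machine syntax:

Python
# One hash stage for two walkers (A and B), packed into 2 cycles.
# With 12 ALU slots per cycle, several more walkers would still fit.
#
# cycle 1:  a_tmp1 = a_val + CONST    a_tmp2 = a_val << SHIFT
#           b_tmp1 = b_val + CONST    b_tmp2 = b_val << SHIFT
# cycle 2:  a_val  = a_tmp1 ^ a_tmp2
#           b_val  = b_tmp1 ^ b_tmp2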


The Optimization Space

Armed with this understanding, what can we optimize?

Level 1: Vectorization (SIMD)

Instead of processing 1 walker at a time, process 8:

Python
# Before: 256 iterations, 1 walker at a time
for i in range(256):
    val[i] = hash(val[i] ^ tree[idx[i]])

# After: 32 iterations, 8 walkers at a time
for i in range(0, 256, 8):
    node_vals = [tree[idx[i+j]] for j in range(8)]  # 8 scattered loads
    val[i:i+8] = vhash(val[i:i+8] ^ node_vals)

The hash computation can use SIMD (vhash operates on 8 values at once). But there’s a catch: this machine has no gather instruction.

vload only loads 8 contiguous memory locations. When your 8 walkers are at scattered tree positions like [5, 12, 3, 7, 0, 9, 2, 11], you can’t load all their node values in one instruction. You need 8 separate scalar loads, and the machine can only do 2 loads per cycle. That’s 4 cycles just to fetch the data before you can even start hashing.

The Gather Bottleneck

(Interactive demo: 8 walkers at scattered tree positions [5, 12, 3, 7, 0, 9, 2, 11], no vgather instruction, and only 2 loads per cycle, so fetching all 8 node values takes 4 cycles.)

Level 2: Instruction Packing (VLIW)

Within each iteration, pack independent operations:

Python
# Before: 18 cycles for one walker's hash (1 op per cycle)
# After: ~12 cycles for one walker (2 per stage), and far fewer per walker
# once independent walkers' operations are interleaved into the same cycles

This requires analyzing dependencies and finding operations that can run in parallel.

Level 3: Register Allocation

With 256 walkers × 2 values each × 16 rounds of intermediate values… you need careful register reuse. The naive approach allocates fresh registers and runs out of space.
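A minimal sketch of the kind of allocator you end up writing, assuming you treat the 1,536 scratch words as a pool and recycle a slot as soon as its value is dead:

Python
class ScratchPool:
    """Hands out scratch-space slots and recycles them when values die."""

    def __init__(self, size=1536):
        self.free = list(range(size))  # every scratch word starts free
        self.live = set()

    def alloc(self):
        if not self.free:
            raise RuntimeError("out of scratch space; free some registers first")
        reg = self.free.pop()
        self.live.add(reg)
        return reg

    def release(self, reg):
        # Call as soon as a value is dead so the slot can be reused
        # by another walker or a later hash stage.
        self.live.discard(reg)
        self.free.append(reg)

In the hash stage above, for example, tmp1 and tmp2 are dead the moment the combine writes val; releasing them right after the combine, rather than holding them for the whole round, is the kind of reuse that keeps you inside the 1,536-word budget.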

Level 4: Memory Access Patterns

The tree lookup is the bottleneck. At level 0, all 256 walkers read the same node (the root). At level 10, they’re scattered across 1024 possible nodes. Different levels need different strategies.

Level 5: Algorithmic Tricks

For small tree levels, you can avoid scattered loads entirely. At level 1, walkers can only be at index 1 or 2. Instead of 256 scattered loads, preload both values once:

Python
# Preload the only 2 possible values (2 loads, done once)
val_at_1 = tree[1]
val_at_2 = tree[2]
# For each walker, select based on index (no memory access!)
node_val = select(idx == 1, val_at_1, val_at_2)

At level 2, there are only 4 possible nodes. At level 3, only 8. The deeper you go, the more scattered the walkers become, and eventually you have to eat the load cost. But for early levels, this trick eliminates the gather bottleneck entirely.
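The same trick extends a level deeper. At level 2 the four candidate nodes are indices 3 through 6 (children of nodes 1 and 2), so two layers of selects pick the right value with no memory traffic. A sketch, assuming the same select(cond, if_true, if_false) primitive as above:

Python
# Preload the 4 possible node values at level 2 (4 loads, done once)
v3, v4, v5, v6 = tree[3], tree[4], tree[5], tree[6]

# Two select layers per walker instead of a scattered load:
# first narrow to the correct pair by which half the index falls in,
# then pick the odd- or even-indexed member of that pair.
odd_candidate = select(idx <= 4, v3, v5)    # nodes 3 and 5 have odd indices
even_candidate = select(idx <= 4, v4, v6)   # nodes 4 and 6 have even indices
node_val = select(idx % 2 == 1, odd_candidate, even_candidate)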


What’s Next

This post gave you the mental model. You understand:

  • What the machine can do (VLIW SIMD with specific slot limits)
  • What we’re computing (tree traversal with hashing)
  • Why it’s hard (data dependencies limit packing)
  • What to optimize (vectorization, packing, allocation, memory, algorithms)

In Part 2, we’ll look at the actual code: the naive baseline, how to run it, and how to trace execution to see where cycles are spent.

In Part 3, we’ll dive into tinygrad’s approach: how it represents computation as UOps, what the “linearizer” does, and how to extend the renderer for VLIW output.

The goal by the end of this series: you understand enough to either hand-optimize the kernel for Anthropic or contribute a meaningful PR to tinygrad.

Both companies are hiring. The challenge is the same. The approach is up to you.


Next up: Part 2 - Running the baseline, tracing execution, and understanding where cycles are spent.
