KV-LAB · Inference Systems Curriculum

Before you start

How this works

The workspace

One repo, kv-lab, with a python/ scratch area for the conceptual toys and a core/ directory in Rust (recommended) or Go for the systems components. Most build tasks treat KV blocks as opaque buffers, so Go works for everything except the optional tensor work — and candle gives you tensors in Rust if you want them.

The model

Use something tiny so iteration is fast: Qwen/Qwen2.5-0.5B-Instruct or Llama-3.2-1B-Instruct, run via Hugging Face transformers in the Python toys.

Rhythm

No schedule and no estimates. Work in whatever bursts suit you — a steady trickle or a full holiday sprint. The sequence is the only thing that matters; the pace is yours.

One rule

Write each essay before you read anyone else's explanation. Forcing yourself to explain it in your own words is where the understanding actually forms.

The through-line. Every build task composes into a single codebase — a minimal, edge-flavored hierarchical prefix-caching KV store, essentially a small SGLang-HiCache. The three tasks flagged CAPSTONE are its load-bearing bricks. Finish Layer 02 and you have a portfolio capstone, not just notes.

Layer 00

Foundations — Transformer & Inference Mechanics

Own attention and the prefill/decode split so deeply that everything later is just optimization of things you already understand.

0/0

Module 0.1

0/0

Attention from scratch

Readings

Attention Is All You Need — Vaswani et al., 2017READ

The Illustrated Transformer — Jay AlammarREAD

The Annotated Transformer — Harvard NLPREAD

Objectives

Understand QKV projections, scaled dot-product attention, multi-head attention, causal masking — and why attention compute scales O(n²) with sequence length.

Assignments

Assignment A — Essay, ~500 wordsWRITE

Explain self-attention in your own words to a reader who knows linear algebra but not ML. Must answer: what Q, K, V represent; why multiple heads; and concretely why cost is quadratic in sequence length and what that implies for long inputs.

Assignment B — Hand-rolled attentionPYTHON

Implement scaled dot-product attention, then multi-head attention, from scratch in NumPy or raw PyTorch ops — no nn.MultiheadAttention. Include a causal mask. Verify against PyTorch's built-in to ~1e-5.

Stretch — nanoGPTSTRETCH

Skim nanoGPT (and Karpathy's "Let's build GPT from scratch" video). Train the char-level model for a few minutes so the full stack is concrete.

Checkpoint

Your hand-rolled MHA matches PyTorch, and you can whiteboard the quadratic-cost argument cold.CHECKPOINT

Module 0.2

0/0

Autoregressive decoding & the KV cache

Readings

Transformer Inference Arithmetic — kipplyREAD

Large Transformer Model Inference Optimization — Lilian WengREAD

Prompt Cache §2.2 — the KV-cache FLOP derivation onlyREAD

Objectives

Internalize the single most important fact in the field: prefill is compute-bound, decode is memory-bandwidth-bound. Understand arithmetic intensity and the roofline intuition.

Assignments

Assignment A — Essay, ~500 wordsWRITE

Derive and explain why decode is bottlenecked by memory bandwidth rather than FLOPs, while prefill is the opposite. Tie it to arithmetic intensity. State the implication: where does optimization effort pay off in each phase?

Assignment B — Naive vs. cached generationPYTHON

Implement greedy generation twice: recomputing attention over the full sequence each step (naive) vs. appending to a KV cache. Plot latency-per-token vs. sequence length for both. The naive curve climbs; the cached one stays roughly flat per step.

Checkpoint

Your plot demonstrates the cache's asymptotic win, and you can explain why the cached version still slows slightly as context grows (the attention read over a growing cache).CHECKPOINT

Module 0.3

0/0

Quantify the cache — a gentle Rust entry

Objectives

Build intuition for how big the KV cache gets and when it dominates GPU memory over the weights themselves — the economic pressure behind all of Layers 01–02. Use config numbers straight from a model card.

Assignment

KV-cache sizing CLIRUST/GO

Write a CLI taking a model config (n_layers, n_kv_heads, head_dim, dtype) plus batch_size and seq_len, printing KV-cache bytes, weight bytes, and the crossover point where cache exceeds weights. Run it for an 8B-class config. This is your first Rust/Go in the repo and primes you for paged allocation.

Checkpoint

You can state, for a real config, roughly how many tokens of context it takes before KV-cache memory becomes the binding constraint.CHECKPOINT

Layer 01

Memory Management — Managing the Cache

Manage the cache like a systems engineer: allocation, fragmentation, and the architectural tricks that shrink it.

0/0

Module 1.1

0/0

PagedAttention — the foundation of your capstone

Readings

Efficient Memory Management for LLM Serving with PagedAttention — Kwon et al., 2023READ

vLLM source — skim the block manager (vllm/core/block_manager*)READ

Objectives

Understand the OS-virtual-memory analogy: why contiguous per-sequence KV allocation causes fragmentation and over-reservation, and how fixed-size blocks + a per-sequence block table fix it. Understand copy-on-write block sharing — it foreshadows Layer 02.

Assignments

Assignment A — Essay, ~500–750 wordsWRITE

Explain PagedAttention by explicit analogy to OS paging. Define block, block table, and the fragmentation it eliminates. Then explain how copy-on-write lets two sequences share blocks — and predict how you'll exploit that for shared prompt prefixes.

Assignment B — Paged KV-cache allocatorRUST/GO

Implement a fixed pool of N blocks (each = capacity for K tokens), per-sequence block tables, and ops: allocate_sequence, append_token (grab a new block when the current fills), free_sequence, and fork(seq) with copy-on-write. Model KV data as opaque byte buffers — no tensors. Add a stats method for pool utilization and wasted bytes.

CAPSTONE · brick 1 of 3

Checkpoint

Simulate 50 sequences of varying lengths: show near-total waste under naive contiguous allocation vs. <5% under paging. fork correctly shares blocks until a write diverges.CHECKPOINT

Module 1.2

0/0

Shrinking the cache — MQA / GQA / MLA

Readings

Fast Transformer Decoding: One Write-Head Is All You Need (MQA) — Shazeer, 2019READ

GQA: Training Generalized Multi-Query Transformer Models — Ainslie et al., 2023READ

DeepSeek-V2 — read the Multi-head Latent Attention (MLA) sectionREAD

Objectives

Understand the KV-cache-vs-quality tradeoff curve. GQA generalizes MQA with an intermediate number of KV heads, with MHA and MQA as its endpoints. MLA takes a different route: low-rank compression of the KV rather than head-sharing.

Assignments

Assignment A — Essay, ~500 wordsWRITE

Place MHA, MQA, GQA, and MLA on a 2-axis chart (KV-cache size vs. quality) and justify each placement. Explain mechanically how GQA differs from MLA — sharing heads vs. compressing into a latent.

Assignment B — Configurable KV headsPYTHON

Extend your Module 0.1 attention to take a configurable n_kv_heads, yielding MHA (= n_heads), MQA (= 1), and GQA (between) via repeat_interleave of the KV heads. Tabulate KV-cache bytes per token as you sweep n_kv_heads.

Stretch — MLA paragraphSTRETCH

Write one paragraph on why MLA's low-rank latent is clever for long-context serving specifically, and what it costs at compute time.

Checkpoint

Your single attention implementation produces all of MHA/MQA/GQA, and your table shows the cache shrinking linearly with n_kv_heads.CHECKPOINT

Module 1.3

0/0

Eviction & compression — orientation

Readings

Efficient Streaming Language Models with Attention Sinks (StreamingLLM)READ

H2O: Heavy-Hitter Oracle for Efficient Generative InferenceREAD

KV Cache Optimization Strategies — survey, skim for the taxonomyREAD

Objectives

Build a mental map. The survey's clean split: eviction, compression, hybrid-memory, novel attention, and combination strategies — with the lesson that no single technique dominates; the right choice depends on context length, hardware, and workload.

Assignments

Assignment A — Taxonomy memo, ~500–750 wordsWRITE

Eviction vs. quantization vs. low-rank vs. hybrid-memory. Give one representative method per category and the workload where you'd reach for each. This doubles as your 2-minute whiteboard answer to "how would you reduce KV-cache pressure?"

Assignment B — Sink + window cachePYTHON

Implement a StreamingLLM-style cache on a small model: keep the first k "attention sink" tokens plus a rolling window of the last w, evicting the middle. Show that keeping the sinks preserves coherent generation while a naive "keep only the last w" degrades. Report a quality proxy (perplexity or eyeball coherence).

Checkpoint

You can deliver the taxonomy verbally, and your sink-vs-no-sink demo shows why those first few tokens matter.CHECKPOINT

Layer 02

Context Caching — the Core & the Capstone

Build the radix-tree, multi-tier cache that is your portfolio centerpiece. Everything from Layers 00–01 plugs in here.

0/0

Module 2.1

0/0

RadixAttention / SGLang

Readings

SGLang: Efficient Execution of Structured LM Programs — focus on the RadixAttention sectionREAD

SGLang source — read python/sglang/srt/mem_cache/radix_cache.pyREAD

Objectives

Understand the trie/radix-tree that maps token sequences to KV tensors, stored in a paged layout, with a least-recently-used policy that evicts leaves first — so common ancestors stay cached and reusable until they themselves become leaves. Plus reference counting to protect in-use nodes.

Assignments

Assignment A — Essay, ~500–750 wordsWRITE

Explain (1) how the radix tree enables automatic prefix sharing across requests, (2) why evicting leaves first preserves shared prefixes, and (3) what reference counting protects against. Connect it back to the copy-on-write you built in Module 1.1.

Assignment B — Radix prefix cache (toy)PYTHON

No model attached: insert(token_ids), match_prefix(token_ids) returning the longest cached prefix and matched node, LRU-leaf eviction under a capacity bound, and refcount locking so in-use nodes can't be evicted. Unit-test prefix matching and the leaf-eviction ordering.

Assignment C — Radix cache (core)RUST/GO

Port the radix cache into core/ and wire it to the Module 1.1 allocator: each tree node points at the KV blocks for its token span, and two requests sharing a prefix share those blocks (copy-on-write) until they diverge. API: given a request's token IDs, return (reused_blocks, suffix_to_compute).

CAPSTONE · brick 2 of 3

Checkpoint

Two requests sharing a long system-prompt prefix: the second allocates/computes only its novel suffix and reuses the prefix's blocks. Print the hit length and blocks saved.CHECKPOINT

Module 2.2

0/0

Hierarchical caching — host & NVMe tiers

Readings

Mooncake: A KVCache-centric Disaggregated ArchitectureREAD

Mooncake × SGLang HiCache design — the HiRadixTreeREAD

Strata: Hierarchical Context Caching for Long-Context ServingREAD

SGLANG-LSM — skim, for the storage-backend angleREAD

Objectives

This is your project, named in the literature: extend the radix tree into a three-tier hierarchy — GPU (L1), host DRAM (L2), persistent storage (L3) — where each node records which tier(s) hold its KV span, with precise local addresses but deliberately looser metadata for L3 to keep overhead down.

Assignments

Assignment A — Design doc, ~750 wordsWRITE

Design the L1/L2/L3 extension for local/edge hardware (your differentiator: replace "distributed cluster storage" with "local NVMe"). Specify per-node metadata per tier; the promotion/demotion policy; what you deliberately won't track for L3 and why; and how async prefetch hides load latency. Note where your single-machine design diverges from Mooncake/HiCache. This is literally your capstone's design doc.

Assignment B — Tiered store with NVMe spillRUST/GO

Add L2 (in-memory host pool) and L3 (on-disk — files or an embedded store like sled/RocksDB) to your prefix cache. On L1 eviction, demote blocks to L2 then L3; on a prefix match whose blocks live in L3, load them back up; and add an async prefetch that begins pulling needed blocks before the scheduler asks. Instrument everything with timing.

CAPSTONE · brick 3 of 3

Checkpoint

The miss-then-restore cycle: process a long prefix, evict it all the way to NVMe, issue a new request sharing that prefix, and show it loads from disk instead of recomputing. Report wall-clock prompt-processing time saved vs. cold recompute — that number is your headline result.CHECKPOINT

Module 2.3

0/0

Approximate reuse & the RAG frontier

Readings

Prompt Cache: Modular Attention ReuseREAD

CacheGen: KV Cache Compression and Streaming (code)READ

CacheBlend: Fast LLM Serving for RAG with Cached Knowledge FusionREAD

Objectives

Understand where exact prefix caching stops working and why people accept approximation: CacheBlend selectively recomputes only the important tokens when strict order can't be guaranteed; CacheGen compresses the KV into a compact bitstream to cut size and fetch delay with little quality loss.

Assignments

Assignment A — Essay, ~500–750 wordsWRITE

Contrast exact prefix (radix), modular reuse (Prompt Cache), and selective-recompute (CacheBlend), plus CacheGen's compress-and-stream. For each, name the accuracy risk it accepts and explain why breaking strict prefix order hurts attention (hint: attention sinks and inter-chunk dependencies).

Assignment B — Compress the L3 tierOPTIONAL

Add CacheGen-style quantize-and-serialize for KV blocks in your L3 tier: compress before writing to NVMe, decompress on load. Measure size reduction vs. the recompute you avoid.

Final deliverable

Capstone writeupCAPSTONE

A blog-style post: the architecture (L1/L2/L3 radix cache for local parallel agents), the design decisions and where they diverge from Mooncake/HiCache, and your benchmark numbers (cache-hit speedups, NVMe restore vs. recompute, fragmentation under paging). This writeup plus the repo is the artifact you put in front of interviewers — and the launchpad for an OSS contribution.