KV-LAB
0%
Self-directed curriculum · build as you learn

Inference Systems: KV Caching & Context Management

Three layers, from attention mechanics to a hierarchical prefix-caching KV store you build yourself. Read the paper, then build the thing the paper describes.

LAYER 00 · Foundations LAYER 01 · Memory Management LAYER 02 · Context Caching
0 / 0 resident
0%
cold (todo) resident (done)
Before you start

How this works

The workspace

One repo, kv-lab, with a python/ scratch area for the conceptual toys and a core/ directory in Rust (recommended) or Go for the systems components. Most build tasks treat KV blocks as opaque buffers, so Go works for everything except the optional tensor work — and candle gives you tensors in Rust if you want them.

The model

Use something tiny so iteration is fast: Qwen/Qwen2.5-0.5B-Instruct or Llama-3.2-1B-Instruct, run via Hugging Face transformers in the Python toys.

Rhythm

No schedule and no estimates. Work in whatever bursts suit you — a steady trickle or a full holiday sprint. The sequence is the only thing that matters; the pace is yours.

One rule

Write each essay before you read anyone else's explanation. Forcing yourself to explain it in your own words is where the understanding actually forms.

The through-line. Every build task composes into a single codebase — a minimal, edge-flavored hierarchical prefix-caching KV store, essentially a small SGLang-HiCache. The three tasks flagged CAPSTONE are its load-bearing bricks. Finish Layer 02 and you have a portfolio capstone, not just notes.

00
Layer 00

Foundations — Transformer & Inference Mechanics

Own attention and the prefill/decode split so deeply that everything later is just optimization of things you already understand.

0/0
Module 0.1
0/0

Attention from scratch

Readings
Attention Is All You Need — Vaswani et al., 2017READ
The Illustrated Transformer — Jay AlammarREAD
The Annotated Transformer — Harvard NLPREAD
Objectives

Understand QKV projections, scaled dot-product attention, multi-head attention, causal masking — and why attention compute scales O(n²) with sequence length.

Assignments
Assignment A — Essay, ~500 wordsWRITE
Explain self-attention in your own words to a reader who knows linear algebra but not ML. Must answer: what Q, K, V represent; why multiple heads; and concretely why cost is quadratic in sequence length and what that implies for long inputs.
Assignment B — Hand-rolled attentionPYTHON
Implement scaled dot-product attention, then multi-head attention, from scratch in NumPy or raw PyTorch ops — no nn.MultiheadAttention. Include a causal mask. Verify against PyTorch's built-in to ~1e-5.
Stretch — nanoGPTSTRETCH
Skim nanoGPT (and Karpathy's "Let's build GPT from scratch" video). Train the char-level model for a few minutes so the full stack is concrete.
Checkpoint
Your hand-rolled MHA matches PyTorch, and you can whiteboard the quadratic-cost argument cold.CHECKPOINT
Module 0.2
0/0

Autoregressive decoding & the KV cache

Readings
Prompt Cache §2.2 — the KV-cache FLOP derivation onlyREAD
Objectives

Internalize the single most important fact in the field: prefill is compute-bound, decode is memory-bandwidth-bound. Understand arithmetic intensity and the roofline intuition.

Assignments
Assignment A — Essay, ~500 wordsWRITE
Derive and explain why decode is bottlenecked by memory bandwidth rather than FLOPs, while prefill is the opposite. Tie it to arithmetic intensity. State the implication: where does optimization effort pay off in each phase?
Assignment B — Naive vs. cached generationPYTHON
Implement greedy generation twice: recomputing attention over the full sequence each step (naive) vs. appending to a KV cache. Plot latency-per-token vs. sequence length for both. The naive curve climbs; the cached one stays roughly flat per step.
Checkpoint
Your plot demonstrates the cache's asymptotic win, and you can explain why the cached version still slows slightly as context grows (the attention read over a growing cache).CHECKPOINT
Module 0.3
0/0

Quantify the cache — a gentle Rust entry

Objectives

Build intuition for how big the KV cache gets and when it dominates GPU memory over the weights themselves — the economic pressure behind all of Layers 01–02. Use config numbers straight from a model card.

Assignment
KV-cache sizing CLIRUST/GO
Write a CLI taking a model config (n_layers, n_kv_heads, head_dim, dtype) plus batch_size and seq_len, printing KV-cache bytes, weight bytes, and the crossover point where cache exceeds weights. Run it for an 8B-class config. This is your first Rust/Go in the repo and primes you for paged allocation.
Checkpoint
You can state, for a real config, roughly how many tokens of context it takes before KV-cache memory becomes the binding constraint.CHECKPOINT
01
Layer 01

Memory Management — Managing the Cache

Manage the cache like a systems engineer: allocation, fragmentation, and the architectural tricks that shrink it.

0/0
Module 1.1
0/0

PagedAttention — the foundation of your capstone

Readings
vLLM source — skim the block manager (vllm/core/block_manager*)READ
Objectives

Understand the OS-virtual-memory analogy: why contiguous per-sequence KV allocation causes fragmentation and over-reservation, and how fixed-size blocks + a per-sequence block table fix it. Understand copy-on-write block sharing — it foreshadows Layer 02.

Assignments
Assignment A — Essay, ~500–750 wordsWRITE
Explain PagedAttention by explicit analogy to OS paging. Define block, block table, and the fragmentation it eliminates. Then explain how copy-on-write lets two sequences share blocks — and predict how you'll exploit that for shared prompt prefixes.
Assignment B — Paged KV-cache allocatorRUST/GO
Implement a fixed pool of N blocks (each = capacity for K tokens), per-sequence block tables, and ops: allocate_sequence, append_token (grab a new block when the current fills), free_sequence, and fork(seq) with copy-on-write. Model KV data as opaque byte buffers — no tensors. Add a stats method for pool utilization and wasted bytes.
CAPSTONE · brick 1 of 3
Checkpoint
Simulate 50 sequences of varying lengths: show near-total waste under naive contiguous allocation vs. <5% under paging. fork correctly shares blocks until a write diverges.CHECKPOINT
Module 1.2
0/0

Shrinking the cache — MQA / GQA / MLA

Readings
DeepSeek-V2 — read the Multi-head Latent Attention (MLA) sectionREAD
Objectives

Understand the KV-cache-vs-quality tradeoff curve. GQA generalizes MQA with an intermediate number of KV heads, with MHA and MQA as its endpoints. MLA takes a different route: low-rank compression of the KV rather than head-sharing.

Assignments
Assignment A — Essay, ~500 wordsWRITE
Place MHA, MQA, GQA, and MLA on a 2-axis chart (KV-cache size vs. quality) and justify each placement. Explain mechanically how GQA differs from MLA — sharing heads vs. compressing into a latent.
Assignment B — Configurable KV headsPYTHON
Extend your Module 0.1 attention to take a configurable n_kv_heads, yielding MHA (= n_heads), MQA (= 1), and GQA (between) via repeat_interleave of the KV heads. Tabulate KV-cache bytes per token as you sweep n_kv_heads.
Stretch — MLA paragraphSTRETCH
Write one paragraph on why MLA's low-rank latent is clever for long-context serving specifically, and what it costs at compute time.
Checkpoint
Your single attention implementation produces all of MHA/MQA/GQA, and your table shows the cache shrinking linearly with n_kv_heads.CHECKPOINT
Module 1.3
0/0

Eviction & compression — orientation

Readings
KV Cache Optimization Strategies — survey, skim for the taxonomyREAD
Objectives

Build a mental map. The survey's clean split: eviction, compression, hybrid-memory, novel attention, and combination strategies — with the lesson that no single technique dominates; the right choice depends on context length, hardware, and workload.

Assignments
Assignment A — Taxonomy memo, ~500–750 wordsWRITE
Eviction vs. quantization vs. low-rank vs. hybrid-memory. Give one representative method per category and the workload where you'd reach for each. This doubles as your 2-minute whiteboard answer to "how would you reduce KV-cache pressure?"
Assignment B — Sink + window cachePYTHON
Implement a StreamingLLM-style cache on a small model: keep the first k "attention sink" tokens plus a rolling window of the last w, evicting the middle. Show that keeping the sinks preserves coherent generation while a naive "keep only the last w" degrades. Report a quality proxy (perplexity or eyeball coherence).
Checkpoint
You can deliver the taxonomy verbally, and your sink-vs-no-sink demo shows why those first few tokens matter.CHECKPOINT
02
Layer 02

Context Caching — the Core & the Capstone

Build the radix-tree, multi-tier cache that is your portfolio centerpiece. Everything from Layers 00–01 plugs in here.

0/0
Module 2.1
0/0

RadixAttention / SGLang

Readings
SGLang: Efficient Execution of Structured LM Programs — focus on the RadixAttention sectionREAD
SGLang source — read python/sglang/srt/mem_cache/radix_cache.pyREAD
Objectives

Understand the trie/radix-tree that maps token sequences to KV tensors, stored in a paged layout, with a least-recently-used policy that evicts leaves first — so common ancestors stay cached and reusable until they themselves become leaves. Plus reference counting to protect in-use nodes.

Assignments
Assignment A — Essay, ~500–750 wordsWRITE
Explain (1) how the radix tree enables automatic prefix sharing across requests, (2) why evicting leaves first preserves shared prefixes, and (3) what reference counting protects against. Connect it back to the copy-on-write you built in Module 1.1.
Assignment B — Radix prefix cache (toy)PYTHON
No model attached: insert(token_ids), match_prefix(token_ids) returning the longest cached prefix and matched node, LRU-leaf eviction under a capacity bound, and refcount locking so in-use nodes can't be evicted. Unit-test prefix matching and the leaf-eviction ordering.
Assignment C — Radix cache (core)RUST/GO
Port the radix cache into core/ and wire it to the Module 1.1 allocator: each tree node points at the KV blocks for its token span, and two requests sharing a prefix share those blocks (copy-on-write) until they diverge. API: given a request's token IDs, return (reused_blocks, suffix_to_compute).
CAPSTONE · brick 2 of 3
Checkpoint
Two requests sharing a long system-prompt prefix: the second allocates/computes only its novel suffix and reuses the prefix's blocks. Print the hit length and blocks saved.CHECKPOINT
Module 2.2
0/0

Hierarchical caching — host & NVMe tiers

Readings
Mooncake × SGLang HiCache design — the HiRadixTreeREAD
SGLANG-LSM — skim, for the storage-backend angleREAD
Objectives

This is your project, named in the literature: extend the radix tree into a three-tier hierarchy — GPU (L1), host DRAM (L2), persistent storage (L3) — where each node records which tier(s) hold its KV span, with precise local addresses but deliberately looser metadata for L3 to keep overhead down.

Assignments
Assignment A — Design doc, ~750 wordsWRITE
Design the L1/L2/L3 extension for local/edge hardware (your differentiator: replace "distributed cluster storage" with "local NVMe"). Specify per-node metadata per tier; the promotion/demotion policy; what you deliberately won't track for L3 and why; and how async prefetch hides load latency. Note where your single-machine design diverges from Mooncake/HiCache. This is literally your capstone's design doc.
Assignment B — Tiered store with NVMe spillRUST/GO
Add L2 (in-memory host pool) and L3 (on-disk — files or an embedded store like sled/RocksDB) to your prefix cache. On L1 eviction, demote blocks to L2 then L3; on a prefix match whose blocks live in L3, load them back up; and add an async prefetch that begins pulling needed blocks before the scheduler asks. Instrument everything with timing.
CAPSTONE · brick 3 of 3
Checkpoint
The miss-then-restore cycle: process a long prefix, evict it all the way to NVMe, issue a new request sharing that prefix, and show it loads from disk instead of recomputing. Report wall-clock prompt-processing time saved vs. cold recompute — that number is your headline result.CHECKPOINT
Module 2.3
0/0

Approximate reuse & the RAG frontier

Readings
Objectives

Understand where exact prefix caching stops working and why people accept approximation: CacheBlend selectively recomputes only the important tokens when strict order can't be guaranteed; CacheGen compresses the KV into a compact bitstream to cut size and fetch delay with little quality loss.

Assignments
Assignment A — Essay, ~500–750 wordsWRITE
Contrast exact prefix (radix), modular reuse (Prompt Cache), and selective-recompute (CacheBlend), plus CacheGen's compress-and-stream. For each, name the accuracy risk it accepts and explain why breaking strict prefix order hurts attention (hint: attention sinks and inter-chunk dependencies).
Assignment B — Compress the L3 tierOPTIONAL
Add CacheGen-style quantize-and-serialize for KV blocks in your L3 tier: compress before writing to NVMe, decompress on load. Measure size reduction vs. the recompute you avoid.
Final deliverable
Capstone writeupCAPSTONE
A blog-style post: the architecture (L1/L2/L3 radix cache for local parallel agents), the design decisions and where they diverge from Mooncake/HiCache, and your benchmark numbers (cache-hit speedups, NVMe restore vs. recompute, fragmentation under paging). This writeup plus the repo is the artifact you put in front of interviewers — and the launchpad for an OSS contribution.