How this works
The workspace
One repo, kv-lab, with a python/ scratch area for the conceptual toys and a core/ directory in Rust (recommended) or Go for the systems components. Most build tasks treat KV blocks as opaque buffers, so Go works for everything except the optional tensor work — and candle gives you tensors in Rust if you want them.
The model
Use something tiny so iteration is fast: Qwen/Qwen2.5-0.5B-Instruct or Llama-3.2-1B-Instruct, run via Hugging Face transformers in the Python toys.
Rhythm
No schedule and no estimates. Work in whatever bursts suit you — a steady trickle or a full holiday sprint. The sequence is the only thing that matters; the pace is yours.
One rule
Write each essay before you read anyone else's explanation. Forcing yourself to explain it in your own words is where the understanding actually forms.
The through-line. Every build task composes into a single codebase — a minimal, edge-flavored hierarchical prefix-caching KV store, essentially a small SGLang-HiCache. The three tasks flagged CAPSTONE are its load-bearing bricks. Finish Layer 02 and you have a portfolio capstone, not just notes.
Attention from scratch
Understand QKV projections, scaled dot-product attention, multi-head attention, causal masking — and why attention compute scales O(n²) with sequence length.
nn.MultiheadAttention. Include a causal mask. Verify against PyTorch's built-in to ~1e-5.Autoregressive decoding & the KV cache
Internalize the single most important fact in the field: prefill is compute-bound, decode is memory-bandwidth-bound. Understand arithmetic intensity and the roofline intuition.
Quantify the cache — a gentle Rust entry
Build intuition for how big the KV cache gets and when it dominates GPU memory over the weights themselves — the economic pressure behind all of Layers 01–02. Use config numbers straight from a model card.
n_layers, n_kv_heads, head_dim, dtype) plus batch_size and seq_len, printing KV-cache bytes, weight bytes, and the crossover point where cache exceeds weights. Run it for an 8B-class config. This is your first Rust/Go in the repo and primes you for paged allocation.PagedAttention — the foundation of your capstone
Understand the OS-virtual-memory analogy: why contiguous per-sequence KV allocation causes fragmentation and over-reservation, and how fixed-size blocks + a per-sequence block table fix it. Understand copy-on-write block sharing — it foreshadows Layer 02.
allocate_sequence, append_token (grab a new block when the current fills), free_sequence, and fork(seq) with copy-on-write. Model KV data as opaque byte buffers — no tensors. Add a stats method for pool utilization and wasted bytes.fork correctly shares blocks until a write diverges.CHECKPOINTShrinking the cache — MQA / GQA / MLA
Understand the KV-cache-vs-quality tradeoff curve. GQA generalizes MQA with an intermediate number of KV heads, with MHA and MQA as its endpoints. MLA takes a different route: low-rank compression of the KV rather than head-sharing.
n_kv_heads, yielding MHA (= n_heads), MQA (= 1), and GQA (between) via repeat_interleave of the KV heads. Tabulate KV-cache bytes per token as you sweep n_kv_heads.n_kv_heads.CHECKPOINTEviction & compression — orientation
Build a mental map. The survey's clean split: eviction, compression, hybrid-memory, novel attention, and combination strategies — with the lesson that no single technique dominates; the right choice depends on context length, hardware, and workload.
RadixAttention / SGLang
Understand the trie/radix-tree that maps token sequences to KV tensors, stored in a paged layout, with a least-recently-used policy that evicts leaves first — so common ancestors stay cached and reusable until they themselves become leaves. Plus reference counting to protect in-use nodes.
insert(token_ids), match_prefix(token_ids) returning the longest cached prefix and matched node, LRU-leaf eviction under a capacity bound, and refcount locking so in-use nodes can't be evicted. Unit-test prefix matching and the leaf-eviction ordering.core/ and wire it to the Module 1.1 allocator: each tree node points at the KV blocks for its token span, and two requests sharing a prefix share those blocks (copy-on-write) until they diverge. API: given a request's token IDs, return (reused_blocks, suffix_to_compute).Hierarchical caching — host & NVMe tiers
This is your project, named in the literature: extend the radix tree into a three-tier hierarchy — GPU (L1), host DRAM (L2), persistent storage (L3) — where each node records which tier(s) hold its KV span, with precise local addresses but deliberately looser metadata for L3 to keep overhead down.
sled/RocksDB) to your prefix cache. On L1 eviction, demote blocks to L2 then L3; on a prefix match whose blocks live in L3, load them back up; and add an async prefetch that begins pulling needed blocks before the scheduler asks. Instrument everything with timing.Approximate reuse & the RAG frontier
Understand where exact prefix caching stops working and why people accept approximation: CacheBlend selectively recomputes only the important tokens when strict order can't be guaranteed; CacheGen compresses the KV into a compact bitstream to cut size and fetch delay with little quality loss.