A usable million tokens in Qwen3.6 on my Strix host is not realistic. Prefill takes too long, and the hacks necessary to make it fast make it useless. What I DID get, however, is an actually-usable 256K-token context, which is itself a nice win: ingested in about ten and a half minutes, answering at 36 tokens/s, retrieval intact, KV cache under a gigabyte. This is the difference between the pleasant local chatbot I’d already tuned this box into and something that can sit under a real agent.

The trick is giving the hybrid model’s attention layers a bounded sliding window, so prefill goes flat (~855 tokens/s from 32K to 1M) and the KV cache stops growing. Then, because a windowed layer can only retrieve what falls inside its window, you keep 4 of the 10 attention layers un-windowed. I was wrong four times on the way here, and the week I lost to a one-character cross-backend bug is its own post.

A tuned box that is not yet a useful agent

This didn’t start as a sparse-attention project. It started as dissatisfaction with a box I’d already tuned. If you’ve been following along, you’ve watched me get Qwen running well on my Strix Halo mini-PC and then squeeze it from every direction I could find: quant ladders, speculative decoding, a thermal daemon so it could sustain load. Decode sits in the high 30s to mid 40s of tokens per second and short prompts turn around instantly. A nice local chatbot.

I don’t want a chatbot, though. I want a local agent, and agents eat context. So of course context is where the box fell over. A dense 256K-token prefill (the model’s full trained window) took about 20 minutes, per query, before the first token of the answer. A million tokens took 4 hours and 37 minutes. That’s not an estimate: I ran it once early on and watched it crawl. “Supports 256K context” was technically true and practically a lie. So the question stopped being “can I make it faster” and became: can I make a much larger context window usable on this box? I aimed at a frankly silly number, a million tokens. The intro already spoiled how that ends.

The box is HOMER, the same Strix Halo mini-PC this whole series is built on (introduced here, benchmarked against real hardware here). What matters for this post: a unified-memory APU has frontier-class memory and laptop-class compute. 128 GB of UMA will hold a multi-million-token KV cache without flinching. It just can’t feed the math fast enough (decode on these models is memory-bandwidth-bound, not compute-bound - that’s the roofline at work).

The model is Qwen3.6-35B-A3B at Q4_K_M, and it stays the model. I’m not swapping down to something sub-quadratic I’d like less. It’s a hybrid: of its 40 layers, 10 are full-attention GQA and the other 30 are Gated-DeltaNet (linear attention). That detail matters later, because only those 10 full-attention layers grow a KV cache, and only those 10 are somewhere attention sparsity can buy anything. It runs coherently on Vulkan/RADV llama.cpp (b9204), which I confirmed before trusting any throughput numbers.

The earlier tuning work also told me what kind of solution could work at all. I’d measured the box’s real memory bandwidth with a little HIP microbench: the read path hits about 242.7 GB/s out of a theoretical 256. The bus is maxed. No clever kernel is going to stream faster, and the memory overclock door is welded shut on this SKU. So the bandwidth-regime thesis from the rest of this series collapses to one rule here:

Every remaining win has to move fewer bytes.

Anything whose only benefit is more FLOPS is DOA since I’m not compute-starved at decode and I can’t go faster on the bus. What survives are tricks that skip reading some of the data. At long context, the biggest pile of skippable bytes is the KV cache during prefill, and the tool for skipping it is sparse attention.

The free win hiding in llama.cpp

llama.cpp’s Vulkan FlashAttention has a mask-aware tile-skip optimization (PR #19281, later hardened by a bounds fix in #20296): before it loads a tile of K and V, it checks whether that tile’s slice of the attention mask is entirely -inf. If it is, the query can’t attend there anyway, so the shader continues without loading the tile. No K read, no V read, no compute.

So I didn’t need a sparse-attention kernel, just a sparse-attention mask. Mark the blocks I don’t care about as -inf and the existing FlashAttention skips the memory traffic for me, on hardware it already runs on. All that’s left to build is the decision of what to drop.
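To make the tile-skip concrete, here is a minimal Python model of the check described above. It is a sketch of the idea, not the Vulkan shader: the tile size, the `tile_is_masked` and `count_tile_loads` helpers, and the list-of-lists mask layout are all assumptions made for illustration.

```python
import math

NEG_INF = -math.inf

def tile_is_masked(mask, q0, q1, k0, k1):
    """True if every mask entry in query rows [q0, q1) x key columns [k0, k1) is -inf."""
    return all(mask[q][k] == NEG_INF for q in range(q0, q1) for k in range(k0, k1))

def count_tile_loads(mask, tile):
    """Walk the mask tile by tile and count how many K/V tiles would be loaded."""
    n_q, n_k = len(mask), len(mask[0])
    loaded = skipped = 0
    for q0 in range(0, n_q, tile):
        for k0 in range(0, n_k, tile):
            if tile_is_masked(mask, q0, min(q0 + tile, n_q), k0, min(k0 + tile, n_k)):
                skipped += 1   # no K read, no V read, no compute
            else:
                loaded += 1
    return loaded, skipped

# Plain causal mask over 8 tokens: a query may attend to itself and earlier keys.
n = 8
causal = [[0.0 if k <= q else NEG_INF for k in range(n)] for q in range(n)]
loaded, skipped = count_tile_loads(causal, tile=4)
print(loaded, skipped)   # 3 tiles loaded, 1 skipped (the upper-right tile)
```

Any extra -inf written into the mask can only move tiles from the loaded count to the skipped count, which is why a mask is enough to get sparse attention out of the stock kernel.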

I wasn’t the first one here either. DeepSeek’s own sparse attention (their DSA, from the V3.2 work) landed upstream as PR #23346, doing this same thing: fill a mask with -inf, unmask the selected positions, add it to the causal mask, hand it to the stock FlashAttention. The maintainers’ guidance on sparse attention was that it should reuse FA and the mask-skip rather than add a new op, so the route was sanctioned. (One AMD landmine worth flagging once: the argsort kernel hits a shared-memory assert on AMD GPUs, filed against top-k sampling in #24177, and it’s the same kernel an in-graph top-k block-selection would call. I pick blocks with a plain threshold instead. I wanted a threshold anyway.)

Attempt 1: the dumb fast version

Before trying to get too fancy, I built the dumbest possible mask to prove the skip even fires: keep a few “sink” tokens at the very start, keep a recent window of the last N tokens, drop everything in between. This is the classic StreamingLLM shape. Purely position-based, no knowledge of the content, a handful of lines in set_input_kq_mask_impl.
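The sink-plus-window shape is small enough to write out. The sketch below builds the additive mask in plain Python. It is not the code from set_input_kq_mask_impl, and `sink_window_mask` and its parameters are hypothetical names.

```python
import math

def sink_window_mask(n_tokens, n_sink, window):
    """Additive causal mask: 0.0 where attention is allowed, -inf where it is dropped."""
    mask = []
    for q in range(n_tokens):
        row = []
        for k in range(n_tokens):
            causal = k <= q              # never attend to the future
            in_sink = k < n_sink         # the first few tokens are always kept
            in_window = q - k < window   # the most recent `window` tokens are kept
            row.append(0.0 if causal and (in_sink or in_window) else -math.inf)
        mask.append(row)
    return mask

mask = sink_window_mask(n_tokens=12, n_sink=2, window=3)
last_row_kept = [k for k, v in enumerate(mask[-1]) if v == 0.0]
print(last_row_kept)   # [0, 1, 9, 10, 11]
```

The decision depends only on positions `q` and `k`, never on what the tokens contain, which is exactly the property that makes it fast and lossy.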

On the hero model at a 131K-token prefill, baseline throughput is 473 t/s. With a 256-token sink and a 4096-token window it’s 945 t/s, so 2.0x. At 262K it’s 223 -> 890, which is 4.0x. The speedup grows with depth, since the O(n²) attention term it’s killing is a bigger slice of prefill the deeper you go. In wall-clock terms a cold 256K prefill drops from about 19.6 minutes toward 5.

Two side measurements worth keeping. The window is a gentle knob: across windows from 1K to 32K at 131K context, throughput only slides from ~1008 to ~702 t/s, so a bigger, safer window is cheap. And there’s an Amdahl ceiling: at 131K the speedup caps around 2.1x even with a tiny window, because the MoE FFNs and those 30 DeltaNet layers are about half of prefill and sparse attention doesn’t touch them. At 256K attention is a larger fraction of the work, so the ceiling rises toward 4x.
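The ceiling is ordinary Amdahl's law, and it can be worked backwards from the numbers above. In this sketch the function names are mine, and the attention fractions are inferred from the quoted ceilings rather than measured directly.

```python
def amdahl_speedup(attn_fraction, attn_speedup):
    """Overall prefill speedup when only the attention share gets faster."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

def amdahl_ceiling(attn_fraction):
    """Limit of amdahl_speedup as the attention speedup goes to infinity."""
    return 1.0 / (1.0 - attn_fraction)

def implied_fraction(ceiling):
    """Attention share of prefill implied by an observed speedup ceiling."""
    return 1.0 - 1.0 / ceiling

print(amdahl_ceiling(0.5))      # attention at half of prefill caps the win at 2x
print(implied_fraction(2.1))    # share implied by the ~2.1x cap at 131K
print(implied_fraction(4.0))    # share implied by a 4x ceiling at 256K
```

Read this way, the ~2.1x cap at 131K implies attention is a little over half of prefill there, and a 4x ceiling at 256K implies it has grown to about three quarters.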

The catch is that this is lossy. A fixed window drops the middle of your context. I ran a needle-in-a-haystack test (a planted string, TURQUOISE-7731, at a known depth) and got a depressingly clean result: you retrieve the needle if and only if it falls inside the window. Put it 20K tokens deep with a 16K window and it’s gone. The text stays fluent either way (sparsity doesn’t make the model babble), and a window big enough to never miss is a window big enough to give back the speedup. I also tried a dilated variant, keeping every Nth block on top of the window. It catches a deep needle if and only if the needle lands on the stride grid. Better than a pure window, still a dice roll. Position-based masks are a dead end for retrieval.
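Both failure modes reduce to one predicate on positions. The sketch below is an assumption about how the dilated variant picks blocks (every `stride`-th block of 128 tokens). The post does not spell out the exact rule, so treat `needle_visible` as illustrative only.

```python
def needle_visible(query_pos, needle_pos, window, n_sink=0, stride=0, block=128):
    """Can a query at query_pos attend to needle_pos under a position-only mask?"""
    if needle_pos > query_pos:
        return False    # causal: the future is never visible
    if needle_pos < n_sink:
        return True     # sink tokens are always kept
    if query_pos - needle_pos < window:
        return True     # inside the recent window
    if stride > 0 and (needle_pos // block) % stride == 0:
        return True     # dilated variant: the needle's block is on the stride grid
    return False

q = 131072 - 1
print(needle_visible(q, q - 20000, window=16384, n_sink=256))            # False: 20K deep, 16K window
print(needle_visible(q, q - 20000, window=32768, n_sink=256))            # True: window now covers it
print(needle_visible(q, 864 * 128, window=16384, n_sink=256, stride=8))  # True: block 864 is on the grid
print(needle_visible(q, 867 * 128, window=16384, n_sink=256, stride=8))  # False: block 867 is not
```

Nothing in the predicate looks at the needle's content, so whether it survives is decided entirely by where it happened to be planted.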

Why the dumb version doesn’t work

The fix has to be content-aware: keep the blocks that matter, wherever they sit, not the blocks that happen to be recent. The cheap version of that would be a static calibrated pattern, i.e. figure out once, offline, which columns are “heavy hitters” for this model and always keep those. Before building anything dynamic I measured whether static could work, with a little capture tool built on llama.cpp’s eval-callback, watching the 10 full-attention layers.

The obvious cheap proxy, key magnitude, is dead on this model. The Qwen3 line QK-normalizes its keys (an RMSNorm on Q and K before the dot product, a trick that goes back to ViT-22B), and per-position key norm came back flat: max-over-median 1.0x. Actual attention mass (run the softmax with FA off and watch where the weight goes) is another matter, max-over-median 47x. The weight piles onto the sink, onto a recent local window, and onto about 6.6% of columns scattered through the middle that are content-salient. The planted needle showed up as one of those heavy columns, so “keep the high-attention columns” preserves retrieval by construction.
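The max-over-median statistic is a one-liner. The sketch below uses made-up toy values, chosen only to reproduce the two shapes (a flat profile and a 47x spike). They are not the captured data.

```python
from statistics import median

def max_over_median(values):
    """How far the heaviest column stands above a typical one."""
    return max(values) / median(values)

# Toy data, not measurements: QK-normalized key norms are all about equal...
flat_key_norms = [1.0] * 16
# ...while attention mass concentrates on a few columns.
peaked_attention_mass = [0.01] * 15 + [0.47]

print(max_over_median(flat_key_norms))
print(max_over_median(peaked_attention_mass))
```

A ratio near 1.0 means the signal cannot rank columns at all, which is why key magnitude is useless as a proxy here while attention mass is not.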

Then I moved the needle (token ~350 to ~541) and re-measured, and the heavy mid-context cluster moved with it. The salient columns follow the content. A static calibration learned on one prompt can’t generalize to the next, because the important columns are wherever this particular document put them. So selection has to be dynamic: per-prompt, in the graph, from the actual keys.

Attempt 2: it retrieves, and it’s slower

Attempt 2 does its scoring dynamically, inside the compute graph right before the FlashAttention call. Pool the queries and keys down to one representative vector per block, score block against block, softmax over the key-blocks, threshold to a keep/drop decision, and OR that with “always keep the sink and the recent window.” Turn the drops into an additive -inf block mask and hand it to the stock FlashAttention. Same tile-skip as attempt 1, but the kept set follows the content.
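The pipeline in that paragraph can be sketched outside the compute graph. This is a plain-Python illustration under my own assumptions (dot-product block scores, one softmax per query block, hypothetical function names), not the ggml graph from the fork. It stops at the keep sets and leaves out the conversion to a -inf block mask.

```python
import math

def mean_pool(vectors, block):
    """Average each run of `block` vectors into one representative vector."""
    reps = []
    for start in range(0, len(vectors), block):
        chunk = vectors[start:start + block]
        dim = len(chunk[0])
        reps.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return reps

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(q_vecs, k_vecs, block, tau, sink_blocks, window_blocks):
    """Return, for each query block, the set of key blocks to keep."""
    q_reps, k_reps = mean_pool(q_vecs, block), mean_pool(k_vecs, block)
    keep = []
    for qi, q in enumerate(q_reps):
        visible = range(qi + 1)   # causal at block granularity
        scores = [sum(a * b for a, b in zip(q, k_reps[ki])) for ki in visible]
        probs = softmax(scores)
        kept = {ki for ki in visible if probs[ki] > tau}            # content threshold
        kept |= {ki for ki in visible if ki < sink_blocks}          # always keep the sink
        kept |= {ki for ki in visible if qi - ki < window_blocks}   # always keep the recent window
        keep.append(kept)
    return keep

# Toy input: 8 tokens in 4 blocks of 2; key block 1 is the only salient one.
k_vecs = [[0.0, 0.0]] * 2 + [[5.0, 0.0]] * 2 + [[0.0, 0.0]] * 4
q_vecs = [[1.0, 0.0]] * 8
keep = select_blocks(q_vecs, k_vecs, block=2, tau=0.5, sink_blocks=1, window_blocks=1)
print(keep[3])   # {0, 1, 3}: sink, the salient block, and the local window
```

In this toy case the threshold does add a block the window would have missed. Whether it does so on a real model is an empirical question about how peaked the softmax is.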

Retrieval-wise it delivered. The first cut scored blocks off a single strided token and retrieved erratically (2 of 5 needle depths), but mean-pooled block representatives unioned with the sink+window fixed that: 5 out of 5 needle depths, at two different thresholds. The needle’s block gets kept wherever it lands.

......... aaaaaaaaand it was slower than doing nothing. 0.79x. At 131K, baseline 473 t/s became 372. What the heck?

The mechanism is strictly subtractive: I’m dropping blocks, FlashAttention should read less. Chasing down how dropping work produced a slowdown ate the better part of a week and became a post of its own. The short version: a one-character cross-backend bug in ggml. The step function I built my keep/drop mask on returns a different value at exactly zero on the Vulkan backend than on the CPU (step(0) is 1 there, 0 on the CPU), so my “drop this block” decision quietly evaluated to “keep” for every block, and FlashAttention never skipped a thing. The full debugging war story, and the upstream fix, are here.
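A toy reproduction shows how a disagreement at exactly zero can flip every decision. The mask expression here is hypothetical: I assume dropped blocks are clamped to exactly 0 before `step` is applied, which is one construction that produces the reported symptom and not necessarily the one in the fork.

```python
def step_cpu(x):
    """step as the post reports it for the CPU backend: step(0) == 0."""
    return 1.0 if x > 0 else 0.0

def step_vulkan_before_fix(x):
    """step as the post reports it for Vulkan before the fix: step(0) == 1."""
    return 1.0 if x >= 0 else 0.0

def keep_flags(probs, tau, step):
    # Hypothetical construction: clamp (prob - tau) at zero, then binarize with step.
    return [step(max(p - tau, 0.0)) for p in probs]

probs = [0.9, 0.05, 0.01, 0.04]
cpu = keep_flags(probs, 0.5, step_cpu)
vulkan = keep_flags(probs, 0.5, step_vulkan_before_fix)
print(cpu)      # [1.0, 0.0, 0.0, 0.0]: three blocks dropped
print(vulkan)   # [1.0, 1.0, 1.0, 1.0]: nothing dropped, nothing skipped
```

Under this construction the selector's math is identical on both backends and only the final binarization differs, so the output stays coherent and the only visible symptom is the missing speedup.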

What it does, and what it turned out to be

With the mask finally dropping the blocks I told it to (huzzah!), the code delivered the number I’d been chasing: at a 131K prefill, baseline 473 t/s became 880, a clean 1.86x, with the needle still retrieved at 5 of 5 depths. I started writing the blog post and getting ready to accept the recognition of my genius.

Then I measured it properly, and it just became two more things I’d been wrong about. Blarg.

First, the 1.86x was graded against the wrong quality bar. The 880 t/s came at a 4096-token window, and the only perplexity I’d checked was at a 32K context, where 4096 is still ~12% of everything. At the 131K context where the speed number lives, a 4096 window is ~3% of the prompt. So I did the boring thing I should have done first: same-context perplexity at 131K, swept across windows, on two corpora (Moby Dick for prose, the llama.cpp source tree for code).

window (tokens)   speedup @131K   prose PPL vs dense   code PPL vs dense
4096              1.86x           +74%                 +26%
8192              1.75x           +5.8%                +12%
16384             1.59x           +2.2%                +7.5%
24576             1.46x           +1.2%                +5.9%
32768             1.36x           +0.8%                +4.8%
49152             1.21x           +0.2%                +3.5%
65536             1.10x           -0.1%                +2.5%

The 1.86x headline is unusable at real depth. A +74% prose perplexity is not a model you’d want to talk to. Prose has a cliff-then-recovery shape, since it leans on local context and a moderate absolute window restores it. The 16K window at +2.2% is the knee, and it’s what I’d ship for prose. Code never recovers: still +2.5% at a half-context window that’s barely faster than baseline, because dropping any blocks hurts long-range structure (a matched delimiter, a scope opened a thousand lines up). For code, don’t sparsify.

Notice that needle-in-a-haystack passed 5/5 on every row of that table, including the +74% disaster at the top. This is one of those things I would REALLY watch out for, and is why I do so much validation/testing. One planted token survives heavy sparsity while the model’s general grip on the context craters. This is the complaint RULER and NoLiMa make about single-needle retrieval, and if I’d trusted NIAH I’d have shipped the +74% config as “perfect.” Perplexity was the honest measure here. Though perplexity has a blind spot of its own, which bites me later on.

The second thing I was wrong about retired the word “content-aware.” Quality and speed barely moved no matter how I set the content threshold, so I ablated it directly: tau from 0.01 to 0.99 at a fixed window gave identical speed and perplexity at every value. tau was a no-op. I had built a content selector that selected no content, at any setting, the entire time. .... Yeah. Egg on my face.

The mechanism is simple once you look. The block score is a softmax over all ~1000 key-blocks, and that softmax is sharply peaked on the recent blocks. A far-back salient block, normalized against the dominant recent ones, comes out with a probability near zero, below any threshold worth using. The keep-cliff sits between tau=0 and tau=0.0005, nowhere near the 1/1024 you’d see if the distribution were remotely flat. I wondered whether mean-pooling 128 tokens into one representative was washing out a single sharp needle key, so I swapped in max-pooling. It behaved identically, because the problem is the softmax normalization, not the pooling. The only blocks my selector wanted were the recent ones already inside the window. Net new blocks kept: zero.
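The drowning effect is easy to reproduce with invented numbers. In the sketch below the logits are made up: 32 dominant recent blocks, one far-back block that scores well above the background, and 1024 blocks in total to match the ~1000 key-blocks mentioned above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

n_blocks = 1024
logits = [0.0] * n_blocks                  # background blocks
for i in range(n_blocks - 32, n_blocks):
    logits[i] = 12.0                       # recent blocks dominate (made-up logit)
logits[100] = 6.0                          # far-back salient block, far above background

probs = softmax(logits)
flat = 1.0 / n_blocks

print(probs[100] / probs[0])   # the salient block is ~400x a background block...
print(probs[100] < flat)       # ...yet still below the flat-distribution share
print(probs[100] < 0.0005)     # ...and below even a tiny threshold
```

The salient block clearly stands out from its neighbours, but the threshold is applied after normalizing against the recent blocks, so no usable tau separates it from the background.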

So attempt 2 had silently collapsed back into attempt 1, a plain sink-plus-sliding-window. Every row in that table is a sliding-window result. The content-aware machinery I spent a week debugging was, once it finally ran, computing an elaborate score and then keeping the same blocks a three-line window mask would have kept.

Why did it retrieve, then? If the selector keeps nothing far-back, the needle’s block gets dropped on the windowed attention layers, and yet NIAH was 5/5. At the time I had an explanation I liked: only 10 of the 40 layers are full-attention and those are the only ones I masked, so the 30 unmasked Gated-DeltaNet layers, which see every token, must be carrying the needle. A tidy story that fit every data point I had. It’s also totally wrong, which makes it mistake number four, and I wouldn’t find out until I measured recall properly, two sections down.

The moonshot: a million tokens, prefilled but not usable

Even with the content-aware story dead there’s a real result here, the one I’d been chasing since I set the silly target, because the cold prefill was always the wall on this box. Dense prefill goes super-quadratic past ~131K: the KV cache for the full-attention layers outgrows the 32 MB Infinity Cache, and attention starts re-reading it from LPDDR5 at the 256 GB/s ceiling. Windowing the attention layers should bound each query’s KV slice and flatten the curve. My first cut at that was the mask, and it half-worked. Dense versus my 16K-window mask:

context   dense prefill   16K-window mask   speedup
32K       39s             38s               1.0x
65K       97s             81s               1.2x
131K      277s            174s              1.6x
262K      ~20min          ~6min             3.1x

The speedup grows with depth. At a million tokens the dense side is not a guess: that’s the 4h37m run from the intro, and fitting a curve to this ladder lands right on it (4.6-5.2h depending on the fit). The mask extrapolates to ~40 minutes there, a raw ~6.6x, but I never ran the mask at 1M, and it has two structural leftovers that made me not want to. It still allocates the full KV cache, so the memory win is zero. And llama.cpp’s FlashAttention still loops over every query-tile by key-tile pair, paying a per-tile check even on the tiles it skips because they are entirely -inf, so the work is O(N) but the bookkeeping stays O(N²).
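A rough count shows how the two terms scale apart. The sketch assumes a 128-token tile (borrowed from the pooling block size, not a confirmed kernel tile size) and takes the description above at face value, meaning every query-tile by key-tile pair is checked.

```python
import math

def tile_counts(n_tokens, window, n_sink, tile=128):
    """Tile pairs the loop checks vs. tile pairs that actually get loaded."""
    tiles = n_tokens // tile
    checked = tiles * tiles   # every pair pays the mask check
    # Per query tile: the window's tiles, one tile for boundary overlap, plus the sink.
    per_row = window // tile + 1 + math.ceil(n_sink / tile)
    loaded = sum(min(row + 1, per_row) for row in range(tiles))
    return checked, loaded

c1, l1 = tile_counts(131072, window=16384, n_sink=256)
c2, l2 = tile_counts(262144, window=16384, n_sink=256)
print(c2 / c1)   # checks quadruple when the context doubles
print(l2 / l1)   # loads roughly double
```

The loads are what the mask saves, and the checks are the residue that keeps the masked path from ever going fully flat.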

The fix is to stop faking the window and ask for a real one. llama.cpp already ships sliding-window attention as a first-class path (Mistral, Gemma, and the hybrid lfm2 models all use it), so I marked the 10 attention layers as sliding-window through that path. They get a rolling KV cache bounded to the window: it never grows, and the kernel never touches tiles outside it. Prefill becomes genuinely linear:

context   dense               native sliding-window
32K       837 t/s             893 t/s
131K      473 t/s             860 t/s
262K      222 t/s             855 t/s
512K      never run dense     853 t/s (~10 min)
1M        4h37m, measured*    851 t/s (20.5 min, measured)

~855 tokens a second from 32K to 1M, a 4.8% drift across a 32x range, while dense falls off a cliff. Both million-token prefills are real wall-clock numbers, and the ratio is ~13x. (*The dense 1M is the earlier run from the intro: Q8_0 weights and q8_0 KV via llama-bench, which is why it’s starred rather than sitting in the same Q4_K_M ladder as the rest of the column. It also ran without a RoPE override, so it’s a throughput measurement, not a coherence-valid million.) One thing that run settled, to be fair to dense: memory was never the killer on this box. The dense 1M fit in the 128 GB and even decoded at 10.6 tokens/s once the 4.6 hours were paid. The windowed cache staying bounded at a few hundred MB, where dense’s grows linearly with context, is still the difference between a box with headroom and a box running one job, and it matters a lot more on hardware with less RAM than this.
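The headline figures follow from the table by arithmetic. The sketch below assumes "1M" means 1,048,576 tokens, which is what makes 851 t/s come out at 20.5 minutes. It also measures drift against the midpoint of the two endpoints, one of several reasonable conventions.

```python
def prefill_minutes(n_tokens, tokens_per_s):
    """Wall-clock prefill time implied by a sustained throughput."""
    return n_tokens / tokens_per_s / 60.0

ONE_M = 1024 * 1024
windowed_1m_min = prefill_minutes(ONE_M, 851)
dense_1m_min = 4 * 60 + 37                      # the 4h37m dense run
ratio = dense_1m_min / windowed_1m_min
drift = (893 - 851) / ((893 + 851) / 2)         # 32K vs 1M, relative to their midpoint

print(round(windowed_1m_min, 1))   # minutes for the windowed million
print(round(ratio, 1))             # dense / windowed
print(round(drift * 100, 1))       # percent drift across the range
```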

Here’s the half of the moonshot I didn’t get. The quality is the same windowed quality from the last section, because it’s the same window done properly (perplexity at 131K is 1.3718 versus the mask’s 1.3512 - the 1.5% is the attention sink the mask kept and the stock path leaves out). At a million tokens that window table stops being a caveat. A 16K window is 1.6% of a 1M context, the same fraction as a 2K window at 131K, and that configuration measured +57% code and +199% prose perplexity. Unusable. Retrieval is somehow even worse: the next section measures it properly, and a windowed layer’s recall reach is its window, so nothing more than 16K tokens back is findable. On top of which, Qwen3.6’s trained window is 256K. You can stretch a model past its training with RoPE tricks (on the big MI300X box earlier in this series I validated a freq-base override out to needle retrieval at ~634K), but nothing in these runs does, so past 256K there was never a usability claim on the table anyway. The million-token run is a prefill-and-memory demo. It proves the machinery stays flat and bounded as deep as you push it, and the window is a dial rather than a wall (at 256K, a 32K window holds 699 t/s at +5.1% code perplexity, a 64K window 522 t/s at +2.5%). It is not a usable context.
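The fraction comparison is exact because every size involved is a power of two. A small sketch, with `window_fraction` as a hypothetical helper name:

```python
def window_fraction(window, context):
    """Share of the context that a single query can still attend to."""
    return window / context

K = 1024
at_1m = window_fraction(16 * K, 1024 * K)
at_131k = window_fraction(2 * K, 128 * K)
print(at_1m, at_131k)   # both 0.015625, i.e. about 1.6%
```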

Prefill wall down, memory wall down, usability not achieved. The useful move from there was to point the same machinery at the deepest context where usability is winnable, and find out where that is.

Recall, measured properly: the four-layer backstop

First I had to fix my own measurement. The 5/5 retrieval results earlier in this post came from a harness I’d stopped questioning: sampled decoding, no stop token, long rambling generations scored by substring match, and a rambling generation can blunder into a substring match it didn’t earn. The corrected harness is greedy decoding, stop at end-of-generation, exact match, plus one behavioral check before trusting any grid: an out-of-window needle has to actually miss on the binary under test. That last check is not paranoia. I burned a full run measuring an unpatched binary that silently ignored the window setting and ran dense. If your feature is env-gated, “it ran” is not evidence it ran.
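The difference between the two harnesses fits in a few lines. This is a sketch with hypothetical names: `run_model` stands in for greedy decoding that stops at end-of-generation, and the lambdas are fakes standing in for real binaries.

```python
def lenient_hit(output, needle):
    """Old scoring: a substring match anywhere in a long generation counts."""
    return needle in output

def strict_hit(output, needle):
    """Corrected scoring: the whole (stripped) answer must equal the needle."""
    return output.strip() == needle

def gate_check(run_model, out_of_window_prompt, needle):
    """Behavioral check: with the window active, an out-of-window needle must miss."""
    return not strict_hit(run_model(out_of_window_prompt), needle)

needle = "TURQUOISE-7731"
rambling = "It might be TURQUOISE-7731, or possibly TURQUOISE-7732, hard to say..."
print(lenient_hit(rambling, needle), strict_hit(rambling, needle))   # True False

# Fake binaries: one honours the window, one silently runs dense.
windowed_binary = lambda prompt: "I could not find that code."
silently_dense_binary = lambda prompt: "TURQUOISE-7731"
print(gate_check(windowed_binary, "prompt", needle))         # True: feature is active
print(gate_check(silently_dense_binary, "prompt", needle))   # False: do not trust this run
```

The gate check inverts the usual expectation by requiring a miss, which is the only way to tell an active window apart from a flag that was silently ignored.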

Measured that way, the sliding-window path gives a blunt answer: single-needle retrieval reach is the window. In a ~98K context with needles planted ~88K, ~49K, and ~10K tokens back, dense finds all three, a 64K window finds the two inside it and misses the 88K one, a 16K window finds only the last. The needles the window misses stay missed, which kills my DeltaNet explanation. The 30 linear-attention layers see every token, and they’re why the model stays fluent about the far context, why perplexity degrades gently under windowing instead of collapsing the way a windowed pure-attention model does. They do not do precise retrieval. On this model, precise retrieval lives in the full-attention layers, all ten of which I had just windowed.

The fix: don’t window all ten. I swept “keep the last K attention layers full” at a 16K window, same three needles. K=2 buys nothing, still blind past the window. K=4 brings all three needles back. K=6 matches K=4 and is just slower. There’s a critical mass somewhere between 2 and 4 layers, and 4 of the 10 is the cheapest configuration that clears it. The retrieval-heads literature says why this shape appears: Wu et al. showed NIAH-style retrieval is done by a small set of attention heads scattered across layers, and RazorAttention / DuoAttention keep those heads full while windowing the rest. Keeping whole layers is the blunt version of the same idea. You keep enough layers full to contain the heads that matter.
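Choosing the "last K attention layers" is a small indexing exercise. The sketch assumes the full-attention layers sit at every fourth index (3, 7, ..., 39). That layout is inferred from 10 attention layers out of 40 and from the backstop list used later in this post, not read from the model file.

```python
def attention_layers(n_layers=40, period=4):
    """Assumed layout: the last layer of every group of `period` is full attention."""
    return [i for i in range(n_layers) if i % period == period - 1]

def backstop(k, n_layers=40, period=4):
    """The last k full-attention layers, i.e. the ones to leave un-windowed."""
    layers = attention_layers(n_layers, period)
    return layers[-k:] if k > 0 else []   # guard: layers[-0:] would return everything

print(len(attention_layers()))   # 10
for k in (2, 4, 6):
    print(k, backstop(k))
```

`backstop(4)` returns `[27, 31, 35, 39]`, so a K sweep is just a loop over that list length.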

That is the recipe, and on my fork it graduated from env hack to a real CLI flag. Here’s the entire mechanism, a loader-time helper shared by the hybrid architectures (each one’s graph then routes through llama.cpp’s existing iSWA hybrid cache when swa_type is set):

// models.h (my fork) - shared by qwen35moe, qwen3next, granite-hybrid.
// Window every full-attention layer; un-window a chosen few as the recall backstop.
static inline void apply_dynsparse_swa(llama_hparams & hparams, const llama_model_params & mp) {
    // ... resolve W / backstop list from --dynsparse-swa* flags (env vars as fallback) ...
    hparams.swa_type = LLAMA_SWA_TYPE_STANDARD;
    hparams.n_swa    = swa_w;                                   // the window W
    for (uint32_t i = 0; i < n_main; ++i) {
        hparams.is_swa_impl[i] = hparams.is_recr(i) ? 0u : 1u;  // attention layers -> windowed
    }
    // ... then un-window the backstop set (--dynsparse-swa-full 27,31,35,39)
    //     and print the resolved windowed/kept-full split at load ...
}

Which makes the whole recipe a command line: window to 16K, keep four layers full, q4_1 KV cache.

build-vk/bin/llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
    --dynsparse-swa 16384 --dynsparse-swa-full 27,31,35,39 \
    -ctk q4_1 -ctv q4_1 -fa on -c 262144

None of this is upstream currently, so don’t expect the flag in a stock build. The whole fork is one patch on top of upstream llama.cpp commit 6f4f53f2b: grab the patch, then

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git checkout 6f4f53f2b
git apply 395aimax-fork-full.patch
cmake -B build-vk -DGGML_VULKAN=ON && cmake --build build-vk --config Release

That patch also carries the measurement tooling behind this post’s numbers (the attention-capture tool and the KV-eviction harness). A smaller extract with just the SWA feature is here if you’d rather read than build. Run end-to-end at real depths, with the needle planted outside the window so it can only be reached through the backstop:

config         prefill               decode     attn-KV   far needle
dense @131K    5.0 min (391 t/s)     38.6 t/s   819 MB    hit
recipe @131K   3.8 min (521 t/s)     44.2 t/s   434 MB    hit
recipe @192K   6.7 min (447 t/s)     40.5 t/s   596 MB    hit
recipe @256K   10.5 min (389 t/s)    36.2 t/s   767 MB    hit

At 131K the recipe beats dense on all three axes at once (1.89x smaller KV, 1.33x faster prefill, 1.14x faster decode - fewer bytes helps decode too). And the 256K row is the model’s entire trained context. Where the million was the stunt, this row is the everyday claim: the whole native window, ingested in ten and a half minutes at the throughput dense manages at half that depth, answering at 36 tokens/s with retrieval intact, on the “cheap” box. FINALLY! We did it, Reddit!
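The three ratios can be recomputed from the rounded table entries. Expect small last-digit differences from the figures quoted above, since those presumably came from unrounded measurements.

```python
dense_131k = {"prefill_tps": 391.0, "decode_tps": 38.6, "kv_mb": 819.0}
recipe_131k = {"prefill_tps": 521.0, "decode_tps": 44.2, "kv_mb": 434.0}

kv_shrink = dense_131k["kv_mb"] / recipe_131k["kv_mb"]
prefill_gain = recipe_131k["prefill_tps"] / dense_131k["prefill_tps"]
decode_gain = recipe_131k["decode_tps"] / dense_131k["decode_tps"]

print(f"KV {kv_shrink:.3f}x smaller, prefill {prefill_gain:.3f}x, decode {decode_gain:.3f}x")
```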

A single planted needle is the easiest retrieval task there is (Paulsen’s MECW work measured effective context across task types and found single-needle by far the most flattering probe), so I graded harder ones too. Multi-fact retrieval (five codes scattered through the context, “list all five”) is where the recipe hits its first real boundary: 5/5 at 128K but 1/5 at 256K, where dense still scores 5/5, so it’s my recipe’s limit rather than the model’s. Doubling the window to 32K restores 5/5 at 256K for +8% KV and +8% prefill. Oddly, four of the five needles sit far outside even the doubled window (the deepest is 216K tokens back), so the kept-full layers were never the bottleneck. The narrow window was degrading the model’s ability to bind multiple facts together at depth, and more local context fixes it.

I also checked that the model uses a long context rather than just fishing strings out of it: plant counterfactuals that contradict strong priors (water boils at 157C, the capital of France is Oslo) and ask it to answer from the document. The recipe matches dense, 5/5 at both 128K and 256K, zero reversions to what the model knows is really true. The one task family that still needs dense is aggregation, where the answer is spread diffusely over the whole context. Windowing breaks that gather and no backstop saves it, so know which kind of question you’re serving.

One model is cool but I wanted to know how portable this was, so I ported the recipe to two more hybrids. On Qwen3-Next-80B (another DeltaNet hybrid) the whole pattern transfers at 128K: 2.07x smaller KV, 1.37x faster prefill, 1.16x faster decode, far needle retrieved, multi-fact 5/5. On Granite-4.0-H-Small, a Mamba2-heavy hybrid with only 4 attention layers out of 40, it also works, with the critical-mass lesson in miniature: one kept-full layer holds recall at 64K but breaks at 128K, and two of its four hold 128K at a 1.46x KV shrink. The constant that moves between models is which layers you keep full, so measure that per model rather than trusting mine. The scope boundary: this is a hybrid-model technique. A pure-attention model has no recurrent layers to keep the far context fluent, windowing it at prefill collapses quality (SWAA documents this), and the recipe degrades into plain StreamingLLM with recall capped at the window. The hybrid’s linear layers make the windowing survivable, and the kept-full attention layers make it retrieve.

Where this sits, and what’s actually mine

Credit where it’s due, because almost none of the algorithm here is new, and the part that was supposed to be mine didn’t pan out. Training-free, dynamic, content-aware sparse prefill is a well-trodden line, and attempt 2 was my try at joining it:

  • MInference (Microsoft, NeurIPS 2024) defined the problem: classify each head’s sparse pattern, build the indices per prompt, run an optimized sparse prefill. It’s the anchor everyone benchmarks against.
  • FlexPrefill (ICLR 2025) does the part I leaned on hardest: per-prompt, training-free, selecting blocks by a cumulative-attention threshold rather than a fixed top-k. That’s my threshold, arrived at separately for the same reasons.
  • SeerAttention is the closest thing to my scorer: pool Q and K down to block representatives, multiply for block scores, run block-sparse attention. They learn a gate for it, where I threw the learning away and thresholded the raw pooled score.
  • XAttention and SpargeAttention are training-free block-sparse cousins with different cheap importance proxies (antidiagonal sums, block self-similarity).
  • DeepSeek’s NSA is the trained-from-scratch version of the idea, and their DSA is the inference-time one whose llama.cpp mask idiom I copied outright.
  • My fixed sink+window is straight out of StreamingLLM, and the attention sink it leans on is a studied phenomenon, not folklore.
  • The keep-some-attention-full backstop has its own literature: retrieval heads (a small, sparse set of heads does the retrieving), plus RazorAttention / DuoAttention, which keep those heads full and window the rest. My kept-full layers are the coarse version of their kept-full heads. SWAA studies windowing full-attention models at prefill and finds it collapses on pure-attention models, the boundary my hybrid results sit just inside.

So what’s mine? Less than I thought when I started writing. I set out to join that club, training-free content-aware block selection, and on this model and these corpora I couldn’t, because there was no cheap salient subset for a selector to find. The quality need is broad: code wants ~25% of the context back before it’s within 5% of dense, prose wants a moderate absolute window. With no handful of heavy blocks to keep, my selector reduced to a sliding window, and the window is StreamingLLM. What’s left:

  1. It runs on a “cheap” consumer iGPU, through Vulkan, with no new kernel. The papers above ship custom CUDA. Here the win was noticing the box already had the machinery, first a mask riding the stock FlashAttention tile-skip, then marking the attention layers sliding-window through llama.cpp’s existing hybrid-SWA path. Nobody had ported the RazorAttention/DuoAttention idea into this stack either, and llama.cpp’s knob is per-layer, so the layer-level version is what’s implementable today. I implemented it.
  2. The measured recipe on hybrids: window the attention layers, keep a critical mass full (4 of 10 here - 2 buys nothing, 6 buys nothing more), with operating points from grading it against real tasks. 16K windows for single-fact, 32K for multi-fact, dense for aggregation. Validated on three hybrid families, scoped to hybrids on purpose.
  3. The bug, which is its own post and the reason any of this took as long as it did.

One anti-result, because these rarely get written up: pooled-QK threshold selection with a global softmax does not surface far-back salient blocks on this model. The normalization drowns them, and max-pooling doesn’t save it. If you’re building content-aware sparse attention, ablate your selector against a plain window early. Mine looked like it was working for a week and wasn’t.

Takeaways

The most expensive lesson was the one-character bug, and it has its own post. Short version: when a compute graph validates on the CPU but misbehaves on a GPU backend, suspect a per-op semantic mismatch before anything clever about scheduling or memory. The bug only hid the work, though. The work itself taught me two things.

Ablate your own knob. If a feature has a setting that turns it off, run that setting, because if “off” looks identical to “on” then your feature was never on. One tau sweep would have caught the inert selector in an afternoon instead of an article. I believed in it for a week because I never pointed it at its own off switch.
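That check is cheap to automate. A minimal sketch, with a hypothetical `knob_is_inert` helper and invented sweep results standing in for real runs:

```python
def knob_is_inert(results, rel_tol=1e-3):
    """results maps knob value -> tuple of metrics. True if no metric moves."""
    runs = list(results.values())
    base = runs[0]
    return all(
        abs(metric - ref) <= rel_tol * abs(ref)
        for run in runs[1:]
        for metric, ref in zip(run, base)
    )

# Invented numbers: (prefill t/s, perplexity) at each tau.
dead_knob = {0.01: (880.0, 1.35), 0.50: (880.0, 1.35), 0.99: (880.0, 1.35)}
live_knob = {0.01: (700.0, 1.30), 0.50: (820.0, 1.34), 0.99: (880.0, 1.52)}

print(knob_is_inert(dead_knob))   # True: the feature was never on
print(knob_is_inert(live_knob))   # False: the knob does something
```

Running this across the full range of a setting, before any tuning, turns "does my feature do anything" into a single boolean.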

And don’t trust a needle. It passed 5/5 at every window I tried, including the one at +74% perplexity, and then it turned out even those passes were partly my harness being generous. Perplexity lies in the other direction. It barely notices retrieval, which is how the recall boundary hid inside decent-looking curves. Measure both, strictly, on the content you actually serve, and pit whatever you built against the dumb baseline it’s supposed to beat. Mine turned out to be the dumb baseline, which I’d have known on day two if I’d run the comparison.

What’s left is smaller than I set out to build, and I like it more for being true. The cold prefill wall comes down the way the bandwidth thesis said it had to: stop reading bytes you don’t need, and the bytes turned out to be most of the KV cache on most of the attention layers. The linear-attention layers keep the far context fluent for free, four kept-full attention layers keep it retrievable, the million-token moonshot lands as a 20-minute prefill instead of a 4.6-hour one but nothing more, and the model’s whole native 256K context becomes something a local agent can live in, which is what I wanted from this box all along. I just had to be wrong four times to get there, and teach the GPU that zero is not a positive number along the way.