One character cost me a week of long context
I found a one-character bug in llama.cpp: on the Vulkan backend, step(0) returns 1 instead of 0, so the “drop this block” mask I built out of step() quietly kept every block and my sparse attention read the whole KV cache anyway. I spent the better part of a week blaming the scheduler, the allocator, an O(n^2)…
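To make the failure mode concrete, here is a minimal Python sketch. It is not the llama.cpp or Vulkan shader code. It assumes, purely for illustration, that the one character is a `>=` where a `>` belongs; the function names and the per-block scores are hypothetical.

```python
def step_strict(x: float) -> float:
    """Step as the mask expects it: 1 only for strictly positive input."""
    return 1.0 if x > 0.0 else 0.0


def step_inclusive(x: float) -> float:
    """Assumed buggy variant: one extra character turns > into >=."""
    return 1.0 if x >= 0.0 else 0.0


def blocks_to_read(scores, step):
    """Return the indices of KV blocks whose mask value is 1."""
    return [i for i, s in enumerate(scores) if step(s) == 1.0]


# Hypothetical per-block scores; 0.0 marks a block that should be dropped.
scores = [0.0, 0.7, 0.0, 0.0, 0.2, 0.0]

kept_expected = blocks_to_read(scores, step_strict)
kept_buggy = blocks_to_read(scores, step_inclusive)

print(kept_expected)  # [1, 4]
print(kept_buggy)     # [0, 1, 2, 3, 4, 5]
```

With the inclusive comparison every zero score maps to 1, so the mask never drops a block. The output stays numerically correct, which is why a bug like this shows up only as lost speed and memory rather than as a wrong answer.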
Read the full post →