{
  "experiment": "qwen3.6-mtp-swa-fix",
  "date": "2026-06-29",
  "box": "GMKtec EVO-X2 (Ryzen AI Max+ 395, gfx1151, 128GB LPDDR5X-8000)",
  "backend": "Vulkan/RADV, llama.cpp fork rebased onto upstream 6f4f53f2b (current master, 2026-06-29)",
  "model": "Qwen3.6-35B-A3B (qwen35moe hybrid: 10 GQA full-attn + 30 Gated-DeltaNet of 40 layers), MTP GGUF Qwen3.6-35B-A3B-UD-Q4_K_M",
  "what": "Patched llama.cpp to make MTP (multi-token-prediction / draft speculative decode) and our SWA recipe coexist, then characterized the stack vs each lever alone at short and long context.",
  "the_bug": {
    "symptom": "Enabling --spec-type draft-mtp together with the SWA recipe (--dynsparse-swa) aborted at load (graph_reserve): GGML_ASSERT(hparams.swa_type == LLAMA_SWA_TYPE_NONE) at llama-graph.cpp:2704, from llama_model_qwen35moe::build_arch_graph.",
    "root_cause": "graph_mtp built its attention input via build_attn_inp_kv() (the NON-iswa builder, which asserts swa_type==NONE). When the SWA recipe sets swa_type=STANDARD, the model runs on the iSWA hybrid cache, so the main graph uses build_inp_mem_hybrid_iswa()/get_attn() but the MTP sub-graph still used the non-iswa input -> assert.",
    "fix": "In graph_mtp, when swa_type != NONE, build the attention input via build_attn_inp_kv_iswa() and route build_attn() through it (identical arg list). The MTP layer index (n_layer()) is not marked SWA, so it routes to the full sub-cache (attends full KV). Upstream's iswa crash fixes (#24294 guard iswa kq_mask, #23131 null-buffer) hold; this is the remaining hybrid-MTP-specific gap.",
    "patch_file": "src/models/qwen35moe.cpp (graph_mtp ctor): mtp_use_iswa = hparams.swa_type != LLAMA_SWA_TYPE_NONE; conditional build_attn_inp_kv_iswa + ternary build_attn",
    "upstream_relevant_issue": "#23322 (Qwen3.6 + MTP + SWA combo); fix is upstreamable"
  },
  "short_context_20k": {
    "note": "ctx=20480, repetitive maintenance-log continuation, greedy, n_predict=200 ignore_eos, q4_1 KV, -ub 512",
    "swa_only":   {"decode_tps": 51.56, "acceptance": null},
    "mtp_only":   {"decode_tps": 73.77, "acceptance": 0.889, "accepted": 144, "drafted": 162},
    "mtp_plus_swa": {"decode_tps": 70.29, "acceptance": 0.878, "accepted": 144, "drafted": 164, "crashed": false, "coherent": true}
  },
  "depth_2x2_128k": {
    "note": "n_prompt=123228 (~123k), same repetitive continuation/greedy/200tok/q4_1 KV/-ub 512; SWA recipe = --dynsparse-swa 16384 --dynsparse-swa-full 27,31,35,39; peak 73C (watchdog 94C, safe)",
    "dense_only": {"prefill_tps": 388.0, "decode_tps": 37.09, "acceptance": null},
    "swa_only":   {"prefill_tps": 550.3, "decode_tps": 42.15, "acceptance": null},
    "dense_mtp":  {"prefill_tps": 361.4, "decode_tps": 41.04, "acceptance": 0.805, "accepted": 140, "drafted": 174},
    "mtp_plus_swa": {"prefill_tps": 469.1, "decode_tps": 54.14, "acceptance": 0.961, "accepted": 147, "drafted": 153, "crashed": false}
  },
  "validation_hard_gen_79k": {
    "note": "Validation that the SWA-lifts-acceptance effect is not just an artifact of the predictable repetitive-log gen. DIVERSE ~79k context (combinatorial prose, not the repeating log) + two harder generation tasks; greedy/256tok/q4_1; gen-diversity measured (distinct_word_ratio) to catch greedy degeneration. T2 is the clean, diversity-matched comparison (both configs ~0.72 distinct ratio).",
    "T2_analytical_BEST_apples_to_apples": {
      "dense_only":   {"decode_tps": 41.13, "acceptance": null, "distinct_word_ratio": null},
      "dense_mtp":    {"decode_tps": 36.77, "acceptance": 0.510, "distinct_word_ratio": 0.73, "mtp_delta_vs_dense_only": "-10.6% NET-NEGATIVE"},
      "swa_only":     {"decode_tps": 44.66, "acceptance": null, "distinct_word_ratio": null},
      "swa_mtp":      {"decode_tps": 43.51, "acceptance": 0.549, "distinct_word_ratio": 0.72, "mtp_delta_vs_swa_only": "-2.6% still slightly negative"}
    },
    "T1_narrative_continuation_CONFOUNDED": {
      "_comment": "Down-weight: the SWA run continued the narrative predictably (distinct ratio 0.29), the dense run veered into meta-reasoning (0.68) -- different gens, so acceptance not strictly comparable. Still: when the gen IS predictable (SWA case), MTP+SWA wins big.",
      "dense_mtp": {"decode_tps": 46.03, "acceptance": 0.735, "distinct_word_ratio": 0.68},
      "swa_only":  {"decode_tps": 44.43, "acceptance": null, "distinct_word_ratio": null},
      "swa_mtp":   {"decode_tps": 58.51, "acceptance": 0.864, "distinct_word_ratio": 0.29, "mtp_delta_vs_swa_only": "+31.7%"}
    }
  },
  "findings": [
    "MTP+SWA now RUNS at both short (20k) and long (79k-123k) context after the graph_mtp iswa-routing patch; previously a hard crash at load on b9204 and on bare current upstream.",
    "SWA lifts MTP draft acceptance at long context (79k-123k), but the magnitude tracks how predictable the generation already is: predictable/structured gen 0.805->0.961 (big), genuinely diverse analytical gen 0.510->0.549 (small). At short context (20k) there is no lift (0.889->0.878). The mechanism (bounding effective context lowers target next-token entropy -> draft matches better) is real; the size is not uniform.",
    "BEST CASE (predictable gen, 123k log): SWA+MTP +46% over dense baseline, super-additive vs SWA-only(+14%)/dense-MTP(+11%); acc 0.96. WORST CASE (open-ended analytical gen, 79k, diversity-matched): MTP is NET-NEGATIVE even with SWA -- dense+MTP -11% vs dense-only, SWA+MTP -2.6% vs SWA-only. Acceptance ~0.51-0.55 < the ~0.8 break-even.",
    "So SWA does NOT make MTP universally worth it. It LOWERS the predictability threshold at which MTP breaks even (turns the dense -11%/-20% open-ended penalty into ~neutral), and makes the win big where MTP already wins. The unconditional, generation-agnostic decode win at depth is SWA's OWN bounded-KV (+9-18%), not MTP.",
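    "Prefill at 123k: SWA-only is fastest (550.3 tps, bounded KV); MTP adds a ~7-15% prefill tax (dense 388.0->361.4, SWA 550.3->469.1). At short context (20k) the acceptance lift is absent (MTP-only acceptance is already 0.889 vs 0.878 with SWA), and decode is roughly level: MTP+SWA 70.29 vs MTP-only 73.77.",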
    "Usable rule: always take the SWA bounded-KV decode win; ADD MTP only for predictable/structured generation (code, formatted continuation, extraction); leave MTP OFF for open-ended/creative/analytical generation even with SWA."
  ],
  "caveat": "Validated: the SWA-lifts-acceptance direction holds on a diverse context + hard gen (T2, diversity-matched: 0.510->0.549), NOT just the repetitive log -- so it is a real effect, not a predictable-gen artifact. But the hard-gen lift is small and does NOT clear the MTP break-even, so the honest verdict is acceptance-gated, not 'always stack'. The big 0.80->0.96 / +46% numbers are the BEST (predictable-gen) case."
}
