Qwen3.6 and the KV Cache Constraint: Local-First Long-Context
Hybrid attention + TurboQuant, April 2026

1. Overview

The binding constraint on local-first long-context inference is not weights. It is the KV cache. Weights quantize cleanly to 4 bits and stop scaling with sequence length. KV grows linearly with context length and with the number of KV-bearing layers and heads, and at 256K tokens it dominates the memory budget on any consumer machine. Qwen3.6 (April 2026) and the TurboQuant work in llama.cpp (November 2025 – April 2026) attack this constraint from orthogonal directions: architectural and numerical.

2. The contract: what local-first promises and what KV breaks

Local-first inference assumes (i) weights fit in unified memory or VRAM at some quant, (ii) context fits alongside weights, (iii) decode runs at read-out speed bounded by memory bandwidth. Invariant (i) is satisfied at Q4 for any model up to ~35B on 24–32GB hardware. Invariant (iii) is a property of the chip. Invariant (ii) is where local-first has been failing.

A dense 27B with an fp16 KV cache consumes roughly 64 KB per token; at 200K tokens that is ~13GB of KV cache alone, on top of weights. The constraint binds before the user finishes their first agentic task.
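For concreteness, a minimal sizing sketch in Python. The 16 KV-bearing layers come from the hybrid layout in section 3; the 8 KV heads and 128-dim heads are an assumed GQA configuration chosen to reproduce the ~64 KB/token figure, not numbers from the model card.

    # KV bytes per token = 2 (K and V) x KV-bearing layers x KV heads x head dim x bytes/element
    def kv_bytes_per_token(kv_layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
        return 2 * kv_layers * kv_heads * head_dim * bytes_per_elem

    # Assumed config: 16 KV-bearing layers (hybrid 3:1), 8 KV heads, head_dim 128, fp16 cache
    per_token = kv_bytes_per_token(kv_layers=16, kv_heads=8, head_dim=128)
    print(f"{per_token / 1024:.0f} KB/token, {per_token * 200_000 / 1e9:.1f} GB at 200K tokens")
    # -> 64 KB/token, 13.1 GB at 200K tokens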

[Figure: diagram-kv-scaling.png]

3. Refinement 1: hybrid attention reduces KV-bearing layers

Qwen3.6's architecture (both 27B dense and 35B-A3B MoE) interleaves Gated DeltaNet (linear attention, O(n) state) with Gated Attention (quadratic, KV-bearing) at a 3:1 ratio. The 64-layer network is 16 repeated blocks of (3 * DeltaNet -> 1 * GatedAttn). Only 16 of 64 layers carry a KV cache.

[Figure: diagram-hybrid-block.png]

DeltaNet uses a delta-rule recurrence with rank-1 outer-product state updates, which gives associative-memory writes without a KV cache. Community measurement on Qwen3.5-35B-A3B (same hybrid stack) reported ~800 MB additional VRAM going from ctx=4096 to ctx=65536. The 256K native window on a 24GB card stops being a brag and becomes routine.
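To make the "no KV cache" point concrete, a toy delta-rule update in NumPy: the recurrent state is a fixed-size d_k x d_v matrix rewritten by a rank-1 outer product each step, so memory stays constant in sequence length. This is a generic delta-rule sketch, not the Qwen3.6 kernel; the gating and normalization that make it Gated DeltaNet are omitted, and the dimensions are arbitrary.

    import numpy as np

    def delta_step(S, k, v, beta):
        # Delta rule: read what the memory currently returns for key k, then
        # write the corrected value back as a rank-1 update. S never grows.
        v_old = S.T @ k
        return S + beta * np.outer(k, v - v_old)

    d_k, d_v = 64, 64
    S = np.zeros((d_k, d_v))
    rng = np.random.default_rng(0)
    for _ in range(10_000):                    # 10K tokens, still a single (64, 64) state
        k = rng.standard_normal(d_k)
        k /= np.linalg.norm(k)
        v = rng.standard_normal(d_v)
        S = delta_step(S, k, v, beta=0.9)
    print(S.shape)                             # (64, 64), regardless of sequence length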

Provenance: model card and community reverse-engineering of the config. The Qwen team has not yet published a full Qwen3.6 technical report; the 35B-A3B and 27B model cards are the canonical source.

4. Refinement 2: TurboQuant compresses the KV cache that remains

Orthogonal to architecture, llama.cpp PR #20969 (TurboQuant) lands 3-bit KV cache compression via Randomized Hadamard Transform plus Lloyd-Max scalar quantization. Block layout: 32 values -> 14 bytes (2B fp16 scale + 12B packed 3-bit indices) = 3.5 bpw. KV per token on hybrid Qwen3.5/3.6 drops from ~64 KB to ~14 KB, a 4.6x compression.
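A toy end-to-end version of that pipeline in NumPy/SciPy, to show where 3.5 bpw comes from and how rotate-then-quantize fits together. The codebook approximates an 8-level Lloyd-Max quantizer for a unit Gaussian and the per-block scaling is a guess at the layout described above; the actual PR's tables, packing, and asymmetric variants will differ.

    import numpy as np
    from scipy.linalg import hadamard

    BLOCK = 32                                  # 32 values -> 2B scale + 12B of 3-bit indices = 3.5 bpw
    # Approximate 8-level (3-bit) Lloyd-Max output levels for a unit Gaussian (illustrative).
    CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

    def rht(x, signs):
        # Randomized Hadamard transform: random sign flips, then the orthonormal Hadamard.
        H = hadamard(len(x)) / np.sqrt(len(x))
        return H @ (signs * x)

    def quantize_block(x):
        # Per-block fp16 scale so the block fits the codebook range (illustrative choice).
        scale = np.float16(np.max(np.abs(x)) / np.max(np.abs(CODEBOOK)))
        idx = np.abs(x[:, None] / scale - CODEBOOK[None, :]).argmin(axis=1)
        return scale, idx.astype(np.uint8)      # idx would be packed 3 bits per value on disk

    def dequantize_block(scale, idx):
        return CODEBOOK[idx] * np.float32(scale)

    rng = np.random.default_rng(0)
    signs = rng.choice([-1.0, 1.0], size=BLOCK)
    k = rng.standard_normal(BLOCK)              # one 32-value slice of a key vector
    scale, idx = quantize_block(rht(k, signs))
    # The orthonormal Hadamard is its own inverse, so undo the rotation the same way.
    k_hat = signs * rht(dequantize_block(scale, idx), np.ones(BLOCK))
    print("relative error:", np.linalg.norm(k - k_hat) / np.linalg.norm(k))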

Empirical refutation of the obvious "this destroys quality" prior: output identical to the fp16 baseline on the 35B model at temperature 0, and 6/6 exact retrieval on NIAH. Quality degrades on 0.6B models, as predicted. The composition of the hybrid 3:1 architecture and 3-bit KV quantization is multiplicative, not additive: 64 -> 16 KV layers, then ~4.6x compression on what remains, gives ~18x total reduction over an all-quadratic-attention fp16 baseline.
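The composition arithmetic, spelled out (numbers from this section; the baseline assumes all 64 layers carry fp16 KV):

    kv_layer_ratio = 16 / 64      # hybrid 3:1: only 1 layer in 4 carries KV
    bpw_ratio = 3.5 / 16          # TurboQuant 3.5 bpw vs fp16 (16 bpw)
    print(f"{1 / (kv_layer_ratio * bpw_ratio):.1f}x")   # ~18.3x smaller than the dense fp16 baseline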

5. Backend contracts

Three runtimes, three different invariants:

Backend   | Format | Strength                        | Constraint
llama.cpp | GGUF   | Long-context after TurboQuant   | Slower tok/s on small models vs MLX
vLLM      | NVFP4  | Throughput, continuous batching | max_model_len is hard; OOMs on graph capture
MLX       | MLX    | 20–87% faster <14B on M4        | Advantage collapses at 27B+
  • llama.cpp owns the long-context envelope after TurboQuant. A reported configuration served a 43K-token request on 2x 16GB consumer GPUs (Qwen3.6-27B, NVFP4) that vLLM rejected before inference began.
  • vLLM wins on throughput at small context. max_model_len is a hard contract, not a hint: the engine OOMs during graph capture at startup, before it ever sees a request that exceeds it (see the sketch below this list).
  • MLX owns Apple Silicon and unified memory, running 20–87% faster than llama.cpp for models under 14B. The advantage collapses at 27B+, where memory bandwidth saturates and both backends hit the same ceiling. Full prefill before the first token is the relevant limitation for interactive use.
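To illustrate the hard-contract point from the vLLM bullet, a minimal sketch using vLLM's offline Python API. The model path is a placeholder and the max_model_len value echoes the 43K report above; the point is that the engine sizes (and OOMs) against this number at startup and rejects longer prompts before inference.

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/models/qwen-placeholder",   # hypothetical local checkpoint path
        max_model_len=43_008,               # hard ceiling: prompt + generated tokens must fit
        gpu_memory_utilization=0.90,
        tensor_parallel_size=2,             # e.g. the 2x 16GB setup mentioned above
    )
    # A prompt longer than max_model_len is rejected here; it never reaches decode.
    outputs = llm.generate(["<long agentic transcript>"], SamplingParams(max_tokens=512))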

Concrete numbers: an M4 Pro (64GB) running Qwen3-Coder-30B reports LM Studio (MLX) at 102 tok/s vs Ollama (llama.cpp) at 70 tok/s, with llama.cpp winning on TTFT (0.18s vs 0.29s) and on prefill speed above 32K context.

6. The 27B-vs-35B-A3B choice

Same family, two refinements of the same training run, opposite tradeoffs:

Dimension          | 27B dense       | 35B-A3B MoE
Size at Q4         | ~17GB           | ~22GB
Active params      | 27B every token | 3B via router
Decode throughput  | 1x              | ~3x
SWE-bench Verified | 77.2            | 73.4
Quantization       | Predictable     | Routing-sensitive
Target hardware    | 24GB VRAM       | 32–48GB unified (M4)

The 27B beats the 397B-A17B Qwen3.5 flagship (76.2) on SWE-bench Verified. This is a narrow benchmark claim, not a general quality claim — see falsification conditions below.

For an M4 Mac Mini in the 32–48GB range the practical pick is 35B-A3B at Q4 or Q5 with thinking preservation enabled. For a 24GB-VRAM coding box, the 27B dense.
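A back-of-envelope fit check for the Mac Mini pick, using the approximate numbers from this note (~22 GB of weights at Q4, ~14 KB/token of TurboQuant'd KV) plus a rough 4 GB allowance for activations, runtime, and OS, which is an assumption:

    weights_gb = 22.0                        # 35B-A3B at Q4, per the table above
    kv_gb = 14 * 1024 * 262_144 / 1e9        # ~3.8 GB of KV at the full 256K window
    print(weights_gb + kv_gb + 4.0)          # ~29.8 GB -> fits the 32GB M4, with headroom at 48GB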

7. Falsification conditions

What would refute the thesis that hybrid attention plus TurboQuant solve the local-first long-context constraint:

  1. Sampling quality. KV-cache compression that is "identical at temp=0" but degrades measurably under actual sampling (temperature > 0 with top-p/top-k). Untested in the public benchmarks so far.
  2. Agentic retrieval. Tool-use and agentic-coding tasks that require precise retrieval from positions >100K tokens deep. NIAH passes; agentic flows under TurboQuant are not yet characterized.
  3. MoE sensitivity. TurboQuant on the 35B-A3B requires symmetric turbo4 per community testing, whereas the dense 27B accepts asymmetric turbo3. The compression scheme is sensitive to the model's attention architecture: "TurboQuant works" is not a general claim but a per-architecture conjecture.
  4. Benchmark narrowness. The 27B-beats-397B claim is SWE-bench Verified only. Expect it to be refuted on multilingual, long-horizon-planning, and rare-language coding benchmarks.

8. Provenance

9. Related

  • Claude Code with Ollama — integration patterns for running Qwen models as Claude Code backends via OpenAI-compatible API.
  • Local-First Software — the broader local-first movement; this note applies the same invariants to LLM inference.
  • Claude Code Features 2026 Q2 — 1M context window on Opus 4.6 is the cloud counterpart to the local long-context problem here.
  • CLI Coding Agents Comparison — OpenCode (75+ providers) and Codex CLI (--oss flag) are the agents most likely to consume local Qwen inference.

Author: Jason Walsh

j@wal.sh

Last Updated: 2026-04-27 22:23:54

build: 2026-04-28 22:49 | sha: c759f7e