Running a 122B MoE model on a laptop APU
We spent a month pushing llama.cpp to its limits on AMD's Strix Halo platform. This site documents every build, every regression, and the upstream contributions that came from it.
What we learned
- Vulkan wins on memory mapping. vkAllocateMemory with HOST_VISIBLE_BIT correctly maps the full 120 GiB GTT. ROCm's hipMalloc is stuck in BIOS VRAM unless you explicitly enable unified memory.
- ROCm wins on batched compute. PR #21344 unlocked +35% prefill speed on gfx1151.
- Spec decode stacks orthogonally. ROCm's single-token decode is slower, but its batched verify is so fast that speculative decoding flips the result. ROCm + spec became our fastest overall config.
- Decode doesn't degrade with context. At 512 tokens or 131,072 tokens, decode speed is basically identical. Only prefill gets expensive.
- MUL_MAT_ID is the hidden villain. For large MoE models, the routed expert path eats 42-66% of prefill time on Vulkan. We filed issue #21948 with full profiling data.
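The MUL_MAT_ID share can be measured directly with the same per-op perf logger we used for issue #21948. A sketch of the run (the model path is a placeholder; the exact log format varies by build):

```shell
# Enable llama.cpp's Vulkan per-op perf logger and run a prefill-only pass.
# The logger prints per-operation timings, so the MUL_MAT_ID entries can be
# summed and compared against total prefill time.
export GGML_VK_PERF_LOGGER=1
llama-bench \
  -m Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 \
  -p 512 -n 0
```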
Four configurations compared
| Config | Backend | Patches | pp512 t/s | tg128 t/s | Best for |
|---|---|---|---|---|---|
| Vulkan stock | Vulkan | none | 303 | 24.7 | Minimum latency |
| Vulkan + spec | Vulkan | PR #20075 + 4 fixes | 302 | 35.3 | Fast decode, easy deploy |
| ROCm + MMQ | ROCm 7.2.1 | PR #21344 | 406 | 18.2 | Long prefill, RAG |
| ROCm full stack | ROCm 7.2.1 | #21344 + #20075 | 404 | 40.1 | Overall chat throughput |
Upstream contributions
- PR #21344 — Posted gfx1151 validation numbers confirming +19-35% prefill improvement.
- Issue #21948 — Filed Vulkan MUL_MAT_ID bottleneck with GGML_VK_PERF_LOGGER=1 profiling.
- PR #20075 — Documented four bug fixes needed for spec decode on hybrid SSM/MoE models outside of Metal.
Every build we tested
Not just the winners. The crashes, regressions, and abandoned attempts are here too.
Complete build history
| Build | Backend | pp512 | pp8K+tg | pp32K+tg | pp128K+tg | tg128 | Notes |
|---|---|---|---|---|---|---|---|
| Vulkan b8779 stock | Vulkan | 268.89 | 249.65 | 234.94 | 145.03 | 23.52 | Baseline |
| Vulkan spec build | Vulkan | 302.10 | — | — | — | 24.56 | PR #20075 + fixes |
| ROCm stock (toolbox) | ROCm | 262.18 | 228.66 | — | 144.44 | 20.83 | Graph splits from VRAM limit |
| ROCm + PR #21344 | ROCm | 354.57 | 302.44 | 291.03 | 171.66 | 21.00 | Best pure prefill |
| ROCm stacked | ROCm | 229.70 | — | — | — | 17.96 | Regressed vs patched-only |
| ROCm full stack | ROCm | 405.99 | — | — | — | 18.29 | PR #21344 + #20075 + fixes |
| REAP-20 Q4_K_M baseline | Vulkan | 443 | — | — | — | 29.1 | Overnight baseline (smaller model) |
| REAP-40 Q4_K_M | Vulkan | 352 | 349 | 291 | 145 | 29.5 | Recommended smaller model |
The "ROCm stacked" build combined PR #21344 with -ffast-math and hipBLASLt. We expected additive gains. Instead, prefill dropped from 354 to 230 t/s. Not every optimization stacks cleanly.
What broke along the way
| Attempt | What went wrong | Lesson |
|---|---|---|
| Q8_0 full benchmarks | 105 GB model exceeded 120 GiB GTT and was OOM-killed repeatedly | Stick to Q6_K (76 GB) or Q4_K_M (57 GB) |
| ROCm Q6_K without UMA env var | Pathologically slow (>1 hr) due to 94 graph splits from BIOS VRAM limit | GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is mandatory on Strix Halo |
| ROCm with lhl's FATTN patches | Patch conflicts with current master; wouldn't apply cleanly | Needs rebase on matching llama.cpp version |
| Stacked build with -ffast-math + hipBLASLt | pp512 dropped from 354 to 230 t/s vs patched-only | Test optimizations in isolation |
| Q6_K on stock Vulkan before kernel fix | Infinite loading hang (>20 min) for any model >64 GB | amd_iommu=off is required |
Real workload timing
Server-mode tests with actual prompts. 256-token generation, no_think mode, temp=0.3.
| Workload | Vulkan stock | Vulkan + spec | ROCm + MMQ | ROCm full stack | Best |
|---|---|---|---|---|---|
| Chat (30 in, 1000 out) | 41.5s | 31.9s +30% | 56.9s -27% | 28.3s +47% | Full stack |
| Code gen (2K in, 2K out) | 118.3s | 99.1s +19% | 138.0s -14% | 80.9s +46% | Full stack |
| Summarize (8K in, 256 out) | 155.9s | 153.5s +2% | 114.5s +36% | 107.2s +46% | Full stack |
Speculative decoding
Target: Qwen3.5-122B-A10B-REAP-20-Q6_K (76 GB). Draft: Qwen3.5-0.8B-Q4_K_M (508 MB). This should not have worked out of the box. It didn't.
Results after the fixes
| Task | Mode | Baseline t/s | Spec decode t/s | Speedup | Acceptance |
|---|---|---|---|---|---|
| Photosynthesis explanation | think | 24.29 | 30.74 | +26.5% | 76.0% |
| Neural nets explanation | no_think | 24.36 | 34.05 | +39.8% | 90.3% |
| BST code generation | think | 24.43 | 29.23 | +21.8% | 80.6% |
| Short answer (2+2) | think | 24.43 | 29.70 | +23.8% | 88.9% |
no_think mode is where spec decode shines: 40% speedup with 90% acceptance. Thinking mode drops acceptance to ~76% because reasoning tokens are harder to predict.
The four bugs we had to fix
Beyond PR #20075, these fixes were required for Qwen3.5 on non-Metal backends:
- Hybrid seq_rm skipped attention cleanup. When recurrent rollback failed, the function returned early and never called mem_attn->seq_rm(). Stale positions accumulated, breaking M-RoPE invariants.
- Soft rollback corrupted positions. The partial removal path moved cell positions backward (cells[i].pos = p0 - 1) instead of erasing them. seq_pos_max then returned stale higher positions.
- Compat check was chicken-and-egg. common_speculative_is_compat() tested seq_rm on a fresh context before any checkpoints existed. It always failed for hybrid models.
- Recurrent reserve size was too small. With -np 1, recurrent_rs_size = max(1, n_seq_max) = 1. Checkpointing needs up to 8 cells per sequence. We bumped it to max(16, n_seq_max * 16).
Context scaling
Prefill slows down as context grows. Decode stays weirdly constant. Here's the full sweep from 64 to 32K tokens.
Prefill speed by context
Decode speed by context
Full context scaling table
| Context | Vk pp | Vk tg | Vk+spec tg | ROCm pp | ROCm tg | ROCm+spec tg | Best decode |
|---|---|---|---|---|---|---|---|
| 64 | 98 | 24.3 | 26.6 | 130 | 17.8 | 28.5 | ROCm+spec |
| 512 | 303 | 24.4 | 28.1 | 376 | 17.7 | 23.3 | Vk+spec |
| 2K | 353 | 24.2 | 25.2 | 423 | 17.7 | 26.7 | ROCm+spec |
| 4K | 362 | 24.2 | 25.5 | 413 | 17.7 | 25.3 | ROCm+spec |
| 8K | 371 | 23.9 | 25.1 | 407 | 17.4 | 27.4 | ROCm+spec |
| 16K | 353 | 23.4 | 25.0 | 371 | 17.0 | 23.6 | Vk+spec |
| 32K | 314 | 22.6 | 18.6 | 315 | 16.2 | 22.9 | ROCm+spec |
ROCm's MMQ prefill advantage fades after 16K. At that point, flash attention dominates over batched matmul. Decode stays flat across all lengths.
Raw llama-bench: combined throughput
| Context | Vulkan stock | ROCm + MMQ | Winner |
|---|---|---|---|
| 512 | 93.0 | 77.8 | Vulkan (+20%) |
| 2K | 181.8 | 179.5 | Vulkan (+1%) |
| 8K | 258.0 | 287.3 | ROCm (+11%) |
| 16K | 266.0 | 304.4 | ROCm (+14%) |
| 32K | 255.4 | 285.8 | ROCm (+12%) |
| 65K | 217.2 | 236.8 | ROCm (+9%) |
| 131K | 155.7 | 169.2 | ROCm (+9%) |
Research trail
Profiling data, parallel subagent investigations, and the upstream issues and PRs that came out of them.
KV Cache Compression Frontier
Mission 01 tested 11 cache_type_k × cache_type_v combinations across 4 context lengths (512–32K) on Qwen3.5-122B-A10B-REAP-20-Q6_K.
Pareto-optimal sweet spots
| Combo | Context | pp t/s | tg t/s | KV mem | Reduction |
|---|---|---|---|---|---|
| f16 / f16 | 32K | 264.4 | 24.53 | 8.19 GB | baseline |
| q8_0 / q8_0 | 32K | 244.7 | 24.32 | 4.09 GB | −50% |
| q4_0 / q4_0 | 32K | 246.2 | 24.35 | 2.05 GB | −75% |
All asymmetric combos (q8_0/f16, f16/q8_0, q4_0/f16, f16/q4_0, q4_0/q8_0, q8_0/q4_0) either timed out at 32K or showed severe prefill degradation. Avoid mixed K/V quantization on this Vulkan backend.
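The symmetric sweep can be reproduced with llama-bench's cache-type flags (-ctk / --cache-type-k and -ctv / --cache-type-v). A sketch, with the model path as a placeholder; note that quantized KV caches require flash attention to be enabled:

```shell
# Re-run the three Pareto-optimal symmetric K/V combos at 32K context.
# -ctk/-ctv set the K and V cache quantization; -fa 1 is required for
# quantized caches.
for kv in f16 q8_0 q4_0; do
  llama-bench \
    -m Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
    -ngl 99 -fa 1 \
    -ctk "$kv" -ctv "$kv" \
    -p 32768 -n 128
done
```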
Upstream contributions
| Date | Contribution | Details |
|---|---|---|
| Apr 15 | Issue #21948 | Filed Vulkan MUL_MAT_ID bottleneck with full GGML_VK_PERF_LOGGER=1 profiling data. |
| Apr 15 | PR #21344 | Validated MMQ VGPR tuning on gfx1151: +19-35% prefill improvement. |
| Apr 15 | PR #20075 | Documented four bug fixes needed for hybrid SSM/MoE spec decode on non-Metal backends. |
| Apr 14 | Kernel fix | Applied amd_iommu=off and ttm.pages_limit=335544321 to fix llama.cpp #14854. |
What four parallel subagents found
- Memory: Vulkan's vkAllocateMemory with HOST_VISIBLE_BIT maps the full GTT correctly. ROCm's hipMalloc is boxed into BIOS VRAM unless GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set.
- Compiler flags: -ffast-math can give ~4% on some AMD GPUs, but it regressed in our stacked build. ROCm 7.2.1 made the old --amdgpu-unroll-threshold-local workaround unnecessary.
- Flash attention: lhl's rocWMMA FATTN tuning can improve decode at depth by 33-104%, but the patch doesn't apply cleanly to current master.
- MUL_MAT_ID: The routed expert path takes 700+ ms per batch. The most promising fix pattern is Metal's map → batched matmul → unmap approach from PR #13388.
Resources
The PRs, issues, articles, and repos that made this project possible — or that we wish we'd found sooner.
Our contributions
Validation: PR #21344 — MMQ VGPR tuning
Custom ROCm container, patch applied, gfx1151 numbers posted: +19-35% prefill on large MoE.
Bug report: Issue #21948 — Vulkan MUL_MAT_ID
Full perf-logger profiling showing MUL_MAT_ID consumes 42-66% of prefill time on gfx1151.
Fixes: PR #20075 — Spec decode fixes
Four additional fixes for hybrid SSM/MoE speculative decoding on non-Metal backends.
Repo: GitHub — 0xSero/framework-max
Benchmarking scripts, overnight pipeline, and visualization code.
Prior art and reading
Merged: PR #13388 — Metal MoE optimization
Map → batched matmul → unmap for MoE models. 1.8-4.1x speedup on Metal. The template we'd like to see on Vulkan.
Merged: PR #15524 — Vulkan subgroup opt
Subgroup optimizations for Vulkan MUL_MAT_ID. Already active in our build, but still leaves a large gap.
Reference: Issue #14854 — >64GB loading hang
The original report for the Vulkan/Strix Halo issue where models over 64GB load infinitely slowly without amd_iommu=off.
Research: Chips and Cheese — RDNA 3 Infinity Cache
Deep-dive on why Infinity Cache mostly doesn't help LLM decode (streaming workloads exceed the 32 MB cache).
Models
| Model | Size | Use case | Link |
|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q6_K | 76 GB | Main target (best quality) | HF |
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M | 57 GB | Smaller variant | HF |
| Qwen3.5-122B-A10B-REAP-40-Q4_K_M | 44 GB | Recommended tradeoff | HF |
| Qwen3.5-0.8B-Q4_K_M | 508 MB | Draft model for spec decode | Community GGUF |
Reproduce
Hardware setup, kernel parameters, build commands, and the exact flags we used.
Hardware and OS setup
- AMD Ryzen AI MAX+ 395 (16C/32T Zen 5)
- Radeon 8060S (gfx1151, RDNA 3.5, 40 CU)
- 128 GB LPDDR5X-8000
- Fedora 43, kernel 6.17.1
Required kernel parameters
sudo grubby --update-kernel=ALL --args='amd_iommu=off ttm.pages_limit=335544321 amdgpu.gttsize=122880'
Reboot, then verify GTT:
cat /sys/class/drm/card*/device/mem_info_gtt_total
You should see ~128849018880 (120 GiB). Without amd_iommu=off, any model over 64GB will hang during Vulkan loading.
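The verification step can be wrapped in a small guard so a misconfigured boot fails loudly. A sketch (the check_gtt helper is ours, not part of llama.cpp; 128849018880 bytes = 120 GiB, the value reported with amdgpu.gttsize=122880):

```shell
# Returns success only if the reported GTT size (in bytes) covers the
# full 120 GiB pool set by amdgpu.gttsize=122880.
check_gtt() {
  [ "$1" -ge 128849018880 ]
}

# On the target machine:
# check_gtt "$(cat /sys/class/drm/card0/device/mem_info_gtt_total)" \
#   || echo "GTT too small: recheck amd_iommu=off and ttm.pages_limit" >&2
```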
Vulkan stock baseline
export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH
$HOME/.local/opt/llama.cpp/current/llama-bench \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128
ROCm with PR #21344
We used a custom podman container because Fedora's HIP headers conflict with the ROCm 7.2.1 runtime.
# Inside the custom container
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128
Do not skip the UMA env var. Without it, ROCm sees only BIOS VRAM and performance collapses.
Speculative decoding
export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH
$HOME/.local/opt/llama.cpp/current/llama-server \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -md $HOME/.local/share/models/gguf/Qwen3.5-0.8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -ngld 99 -fa 1 -c 4096 \
  --draft-max 8 --parallel 1
For Qwen3.5 and other hybrid SSM/MoE models, you'll need PR #20075 plus the four additional fixes we documented.
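Once the server is up, a quick smoke test can go through llama-server's OpenAI-compatible endpoint. A sketch (port matches the command above; prompt text and sampling values are just examples; /no_think behavior depends on the model's chat template):

```shell
# Query the chat completions endpoint of the spec-decode server.
# temperature matches the real-workload tests above.
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "messages": [{"role": "user", "content": "Explain photosynthesis briefly. /no_think"}],
        "max_tokens": 256,
        "temperature": 0.3
      }'
```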