Running a 122B MoE model on a laptop APU

We spent a month pushing llama.cpp to its limits on AMD's Strix Halo platform. This site documents every build, every regression, and the upstream contributions that came from it.

  • 406 — peak prefill t/s
  • 40.1 — peak decode t/s
  • +64% — best chat speedup
  • 131K — max context
The optimization stack
Hardware → Kernel → Backend → Patches → Data

What we learned

  • Vulkan wins on memory mapping. vkAllocateMemory with HOST_VISIBLE_BIT correctly maps the full 120 GiB GTT. ROCm's hipMalloc is stuck in BIOS VRAM unless you explicitly enable unified memory.
  • ROCm wins on batched compute. PR #21344 unlocked +35% prefill speed on gfx1151.
  • Spec decode stacks orthogonally. ROCm's single-token decode is slower, but its batched verify is so fast that speculative decoding flips the result. ROCm + spec became our fastest overall config.
  • Decode doesn't degrade with context. At 512 tokens or 131,072 tokens, decode speed is basically identical. Only prefill gets expensive.
  • MUL_MAT_ID is the hidden villain. For large MoE models, the routed expert path eats 42-66% of prefill time on Vulkan. We filed issue #21948 with full profiling data.

Four configurations compared

| Config | Backend | Patches | pp512 | tg128 | Best for |
|---|---|---|---|---|---|
| Vulkan stock | Vulkan | none | 303 | 24.7 | Minimum latency |
| Vulkan + spec | Vulkan | PR #20075 + 4 fixes | 302 | 35.3 | Fast decode, easy deploy |
| ROCm + MMQ | ROCm 7.2.1 | PR #21344 | 406 | 18.2 | Long prefill, RAG |
| ROCm full stack | ROCm 7.2.1 | #21344 + #20075 | 404 | 40.1 | Overall chat throughput |

Upstream contributions

  • PR #21344 — Posted gfx1151 validation numbers confirming +19-35% prefill improvement.
  • Issue #21948 — Filed Vulkan MUL_MAT_ID bottleneck with GGML_VK_PERF_LOGGER=1 profiling.
  • PR #20075 — Documented four bug fixes needed for spec decode on hybrid SSM/MoE models outside of Metal.

Every build we tested

Not just the winners. The crashes, regressions, and abandoned attempts are here too.

Complete build history

| Build | Backend | pp512 | pp8K+tg | pp32K+tg | pp128K+tg | tg128 | Notes |
|---|---|---|---|---|---|---|---|
| Vulkan b8779 stock | Vulkan | 268.89 | 249.65 | 234.94 | 145.03 | 23.52 | Baseline |
| Vulkan spec build | Vulkan | 302.10 | — | — | — | 24.56 | PR #20075 + fixes |
| ROCm stock (toolbox) | ROCm | 262.18 | 228.66 | — | 144.44 | 20.83 | Graph splits from VRAM limit |
| ROCm + PR #21344 | ROCm | 354.57 | 302.44 | 291.03 | 171.66 | 21.00 | Best pure prefill |
| ROCm stacked | ROCm | 229.70 | — | — | — | 17.96 | Regressed vs patched-only |
| ROCm full stack | ROCm | 405.99 | — | — | — | 18.29 | PR #21344 + #20075 + fixes |
| REAP-20 Q4_K_M baseline | Vulkan | 443 | — | — | — | 29.1 | Overnight baseline (smaller model) |
| REAP-40 Q4_K_M | Vulkan | 352 | 349 | 291 | 145 | 29.5 | Recommended smaller model |

The "ROCm stacked" build combined PR #21344 with -ffast-math and hipBLASLt. We expected additive gains. Instead, prefill dropped from 354 to 230 t/s. Not every optimization stacks cleanly.


What broke along the way

| Attempt | What went wrong | Lesson |
|---|---|---|
| Q8_0 full benchmarks | 105 GB model exceeded the 120 GiB GTT and was OOM-killed repeatedly | Stick to Q6_K (76 GB) or Q4_K_M (57 GB) |
| ROCm Q6_K without UMA env var | Pathologically slow (>1 hr) due to 94 graph splits from the BIOS VRAM limit | GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is mandatory on Strix Halo |
| ROCm with lhl's FATTN patches | Patch conflicts with current master; wouldn't apply cleanly | Needs a rebase onto a matching llama.cpp version |
| Stacked build with -ffast-math + hipBLASLt | pp512 dropped from 354 to 230 t/s vs patched-only | Test optimizations in isolation |
| Q6_K on stock Vulkan before kernel fix | Infinite loading hang (>20 min) for any model >64 GB | amd_iommu=off is required |

Real workload timing

Server-mode tests with actual prompts. 256-token generation, no_think mode, temp=0.3.

| Workload | Vulkan stock | Vulkan + spec | ROCm + MMQ | ROCm full stack | Best |
|---|---|---|---|---|---|
| Chat (30 in, 1000 out) | 41.5s | 31.9s (+30%) | 56.9s (−27%) | 28.3s (+47%) | Full stack |
| Code gen (2K in, 2K out) | 118.3s | 99.1s (+19%) | 138.0s (−14%) | 80.9s (+46%) | Full stack |
| Summarize (8K in, 256 out) | 155.9s | 153.5s (+2%) | 114.5s (+36%) | 107.2s (+46%) | Full stack |
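The percentage columns are plain wall-time ratios against the Vulkan stock baseline. A minimal awk sketch (example values taken from the chat workload row) reproduces them:

```shell
# Speedup relative to baseline: (base_time / new_time - 1) * 100.
# Positive means faster than baseline; negative means a regression.
speedup() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%+.0f%%\n", (base / t - 1) * 100 }'
}

speedup 41.5 28.3   # full stack chat vs stock -> +47%
speedup 41.5 56.9   # ROCm + MMQ chat vs stock -> -27%
```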

Speculative decoding

Target: Qwen3.5-122B-A10B-REAP-20-Q6_K (76 GB). Draft: Qwen3.5-0.8B-Q4_K_M (508 MB). This should not have worked out of the box. It didn't.

How spec decode changes the math
  1. The 0.8B draft model proposes 8 tokens.
  2. The target verifies all 8 in a single batch; a batched matmul costs roughly the same as a one-token pass.
  3. In our runs, ~76-90% of drafted tokens were accepted.
  4. effective_tps = accepted_per_step / (draft_time + verify_time)
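Plugging illustrative numbers into the effective_tps formula makes the tradeoff concrete. The per-step times below are hypothetical, chosen only to show the shape of the math:

```shell
# effective_tps = accepted tokens per step / total step time in seconds.
# acc = accepted tokens per step; draft/verify times are in milliseconds.
effective_tps() {
  awk -v acc="$1" -v t_draft_ms="$2" -v t_verify_ms="$3" \
    'BEGIN { printf "%.1f\n", acc / ((t_draft_ms + t_verify_ms) / 1000) }'
}

effective_tps 6.4 50 180   # 6.4 accepted tokens in a 230 ms step -> 27.8 t/s
```

A slow single-token decode barely matters here: the verify step amortizes one batched pass over every accepted token, which is why ROCm wins once spec decode is on.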

Results after the fixes

| Task | Mode | Baseline t/s | Spec decode t/s | Speedup | Acceptance |
|---|---|---|---|---|---|
| Photosynthesis explanation | think | 24.29 | 30.74 | +26.5% | 76.0% |
| Neural nets explanation | no_think | 24.36 | 34.05 | +39.8% | 90.3% |
| BST code generation | think | 24.43 | 29.23 | +21.8% | 80.6% |
| Short answer (2+2) | think | 24.43 | 29.70 | +23.8% | 88.9% |

no_think mode is where spec decode shines: 40% speedup with 90% acceptance. Thinking mode drops acceptance to ~76% because reasoning tokens are harder to predict.


The four bugs we had to fix

Beyond PR #20075, these fixes were required for Qwen3.5 on non-Metal backends:

  • Hybrid seq_rm skipped attention cleanup. When recurrent rollback failed, the function returned early and never called mem_attn->seq_rm(). Stale positions accumulated, breaking M-RoPE invariants.
  • Soft rollback corrupted positions. The partial removal path moved cell positions backward (cells[i].pos = p0 - 1) instead of erasing them. seq_pos_max then returned stale higher positions.
  • Compat check was chicken-and-egg. common_speculative_is_compat() tested seq_rm on a fresh context before any checkpoints existed. It always failed for hybrid models.
  • Recurrent reserve size was too small. With -np 1, recurrent_rs_size = max(1, n_seq_max) = 1. Checkpointing needs up to 8 cells per sequence. We bumped it to max(16, n_seq_max * 16).

Context scaling

Prefill slows down as context grows. Decode stays weirdly constant. Here's the full sweep from 64 to 32K tokens.

Prefill speed by context

(Chart series: Vulkan stock, ROCm + MMQ.)

Decode speed by context

(Chart series: Vulkan stock, Vulkan + spec, ROCm + MMQ + spec.)

Full context scaling table

| Context | Vk pp | Vk tg | Vk+spec tg | ROCm pp | ROCm tg | ROCm+spec tg | Best decode |
|---|---|---|---|---|---|---|---|
| 64 | 98 | 24.3 | 26.6 | 130 | 17.8 | 28.5 | ROCm+spec |
| 512 | 303 | 24.4 | 28.1 | 376 | 17.7 | 23.3 | Vk+spec |
| 2K | 353 | 24.2 | 25.2 | 423 | 17.7 | 26.7 | ROCm+spec |
| 4K | 362 | 24.2 | 25.5 | 413 | 17.7 | 25.3 | ROCm+spec |
| 8K | 371 | 23.9 | 25.1 | 407 | 17.4 | 27.4 | ROCm+spec |
| 16K | 353 | 23.4 | 25.0 | 371 | 17.0 | 23.6 | Vk+spec |
| 32K | 314 | 22.6 | 18.6 | 315 | 16.2 | 22.9 | ROCm+spec |

ROCm's MMQ prefill advantage fades after 16K. At that point, flash attention dominates over batched matmul. Decode stays flat across all lengths.


Raw llama-bench: combined throughput

| Context | Vulkan stock | ROCm + MMQ | Winner |
|---|---|---|---|
| 512 | 93.0 | 77.8 | Vulkan (+20%) |
| 2K | 181.8 | 179.5 | Vulkan (+1%) |
| 8K | 258.0 | 287.3 | ROCm (+11%) |
| 16K | 266.0 | 304.4 | ROCm (+14%) |
| 32K | 255.4 | 285.8 | ROCm (+12%) |
| 65K | 217.2 | 236.8 | ROCm (+9%) |
| 131K | 155.7 | 169.2 | ROCm (+9%) |

Research trail

Profiling data, parallel subagent investigations, and the upstream issues and PRs that came out of them.

MUL_MAT_ID profiling breakdown
| Context | MUL_MAT_ID share of prefill | Total prefill (ms) |
|---|---|---|
| 512 | 66.2% | 1700 |
| 8K | 57.6% | 1839 |
| 32K | 41.9% | 2556 |
| 128K | 19.7% | 5351 |

(The chart also tracked FLASH_ATTN_EXT, whose share grows with context.)
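The profiling data came from ggml's Vulkan perf logger. A hedged sketch of the invocation, with a placeholder model path (the source confirms only that GGML_VK_PERF_LOGGER=1 was set; the exact flags are assumed to match our other benchmark runs):

```shell
# Per-op Vulkan timings are printed when the perf logger env var is set.
GGML_VK_PERF_LOGGER=1 llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 \
  -p 512,8192,32768,131072 -n 128
```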

KV Cache Compression Frontier

Mission 01 tested 11 cache_type_k × cache_type_v combinations across 4 context lengths (512–32K) on Qwen3.5-122B-A10B-REAP-20-Q6_K.

Pareto-optimal sweet spots

| Combo | Context | pp t/s | tg t/s | KV mem | Reduction |
|---|---|---|---|---|---|
| f16 / f16 | 32K | 264.4 | 24.53 | 8.19 GB | baseline |
| q8_0 / q8_0 | 32K | 244.7 | 24.32 | 4.09 GB | −50% |
| q4_0 / q4_0 | 32K | 246.2 | 24.35 | 2.05 GB | −75% |

(Chart: prefill speed vs memory reduction at 32K context — f16/f16 at 0% saved, q8_0/q8_0 at 50%, q4_0/q4_0 at 75%; asymmetric* combos hit TIMEOUT or a >3× slowdown.)

*All asymmetric combos (q8_0/f16, f16/q8_0, q4_0/f16, f16/q4_0, q4_0/q8_0, q8_0/q4_0) either timed out at 32K or showed severe prefill degradation. Avoid mixed K/V quantization on this Vulkan backend.
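Reproducing a symmetric combo is one pair of flags. A sketch using llama-bench's cache-type options (-ctk/-ctv), with the model path as a placeholder:

```shell
# q8_0/q8_0 KV cache at 32K context: ~50% less KV memory for ~8% slower prefill.
llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 \
  -p 32768 -n 128
```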

Upstream contributions

| Date | Contribution | Details |
|---|---|---|
| Apr 15 | Issue #21948 | Filed Vulkan MUL_MAT_ID bottleneck with full GGML_VK_PERF_LOGGER=1 profiling data. |
| Apr 15 | PR #21344 | Validated MMQ VGPR tuning on gfx1151: +19-35% prefill improvement. |
| Apr 15 | PR #20075 | Documented four bug fixes needed for hybrid SSM/MoE spec decode on non-Metal backends. |
| Apr 14 | Kernel fix | Applied amd_iommu=off and ttm.pages_limit=335544321 to fix llama.cpp #14854. |

What four parallel subagents found

  • Memory: Vulkan's vkAllocateMemory with HOST_VISIBLE_BIT maps the full GTT correctly. ROCm's hipMalloc is boxed into BIOS VRAM unless GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set.
  • Compiler flags: -ffast-math can give ~4% on some AMD GPUs, but it regressed in our stacked build. ROCm 7.2.1 made the old --amdgpu-unroll-threshold-local workaround unnecessary.
  • Flash attention: lhl's rocWMMA FATTN tuning can improve decode at depth by 33-104%, but the patch doesn't apply cleanly to current master.
  • MUL_MAT_ID: The routed expert path takes 700+ ms per batch. The most promising fix pattern is Metal's map → batched matmul → unmap approach from PR #13388.

Resources

The PRs, issues, articles, and repos that made this project possible — or that we wish we'd found sooner.

Our contributions

ValidationPR #21344 — MMQ VGPR tuning

Custom ROCm container, patch applied, gfx1151 numbers posted: +19-35% prefill on large MoE.

Bug reportIssue #21948 — Vulkan MUL_MAT_ID

Full perf-logger profiling showing MUL_MAT_ID consumes 42-66% of prefill time on gfx1151.

FixesPR #20075 — Spec decode fixes

Four additional fixes for hybrid SSM/MoE speculative decoding on non-Metal backends.

RepoGitHub: 0xSero/framework-max

Benchmarking scripts, overnight pipeline, and visualization code.


Prior art and reading

MergedPR #13388 — Metal MoE optimization

Map → batched matmul → unmap for MoE models. 1.8-4.1x speedup on Metal. The template we'd like to see on Vulkan.

MergedPR #15524 — Vulkan subgroup opt

Subgroup optimizations for Vulkan MUL_MAT_ID. Already active in our build, but still leaves a large gap.

ReferenceIssue #14854 — >64GB loading hang

The original report for the Vulkan/Strix Halo issue where models over 64GB load infinitely slowly without amd_iommu=off.

ResearchChips and Cheese — RDNA 3 Infinity Cache

Deep-dive on why Infinity Cache mostly doesn't help LLM decode (streaming workloads exceed the 32 MB cache).
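The arithmetic behind that claim, as a back-of-envelope sketch: the 10B active-parameter count is read off the model name's "A10B", and ~6.56 bits/weight for Q6_K is our assumption, not a figure from the article.

```shell
# GB of weights streamed per decode token = active_params * bits_per_weight / 8.
bytes_per_token_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

bytes_per_token_gb 10e9 6.56   # ~8.2 GB read per token, dwarfing a 32 MB cache
```

Each decoded token streams several GB of expert weights through the GPU, so a 32 MB cache sees essentially no reuse.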


Models

| Model | Size | Use case | Link |
|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q6_K | 76 GB | Main target (best quality) | HF |
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M | 57 GB | Smaller variant | HF |
| Qwen3.5-122B-A10B-REAP-40-Q4_K_M | 44 GB | Recommended tradeoff | HF |
| Qwen3.5-0.8B-Q4_K_M | 508 MB | Draft model for spec decode | Community GGUF |

Reproduce

Hardware setup, kernel parameters, build commands, and the exact flags we used.

Hardware and OS setup

  • AMD Ryzen AI MAX+ 395 (16C/32T Zen 5)
  • Radeon 8060S (gfx1151, RDNA 3.5, 40 CU)
  • 128 GB LPDDR5X-8000
  • Fedora 43, kernel 6.17.1

Required kernel parameters

sudo grubby --update-kernel=ALL --args='amd_iommu=off ttm.pages_limit=335544321 amdgpu.gttsize=122880'

Reboot, then verify GTT:

cat /sys/class/drm/card*/device/mem_info_gtt_total

You should see ~128849018880 (120 GiB). Without amd_iommu=off, any model over 64GB will hang during Vulkan loading.
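A small converter makes the sanity check easier to read (the expected byte count comes from the paragraph above):

```shell
# Convert the sysfs byte count to GiB (1 GiB = 1073741824 bytes).
gtt_gib() { awk -v b="$1" 'BEGIN { printf "%.0f\n", b / 1073741824 }'; }

gtt_gib 128849018880   # the expected mem_info_gtt_total -> 120
```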


Vulkan stock baseline

export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH

$HOME/.local/opt/llama.cpp/current/llama-bench \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128

ROCm with PR #21344

We used a custom podman container because Fedora's HIP headers conflict with the ROCm 7.2.1 runtime.

# Inside the custom container
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128

Do not skip the UMA env var. Without it, ROCm sees only BIOS VRAM and performance collapses.


Speculative decoding

export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH

$HOME/.local/opt/llama.cpp/current/llama-server \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -md $HOME/.local/share/models/gguf/Qwen3.5-0.8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -ngld 99 -fa 1 -c 4096 \
  --draft-max 8 --parallel 1

For Qwen3.5 and other hybrid SSM/MoE models, you'll need PR #20075 plus the four additional fixes we documented.