Running a 122B MoE model on a laptop APU

We spent a month pushing llama.cpp to its limits on AMD's Strix Halo platform. This site documents every build, every regression, and the upstream contributions that came from it.

  • 406 — peak prefill t/s
  • 40.1 — peak decode t/s
  • +64% — best chat speedup
  • 131K — max context
The optimization stack
Hardware → Kernel → Backend → Patches → Data

What we learned

  • Vulkan wins on memory mapping. vkAllocateMemory with HOST_VISIBLE_BIT correctly maps the full 120 GiB GTT. ROCm's hipMalloc is stuck in BIOS VRAM unless you explicitly enable unified memory.
  • ROCm wins on batched compute. PR #21344 unlocked +35% prefill speed on gfx1151.
  • Spec decode stacks orthogonally. ROCm's single-token decode is slower, but its batched verify is so fast that speculative decoding flips the result. ROCm + spec became our fastest overall config.
  • Decode doesn't degrade with context. At 512 tokens or 131,072 tokens, decode speed is basically identical. Only prefill gets expensive.
  • MUL_MAT_ID is the hidden villain. For large MoE models, the routed expert path eats 42-66% of prefill time on Vulkan. We filed issue #21948 with full profiling data.

Four configurations compared

| Config | Backend | Patches | pp512 | tg128 | Best for |
|---|---|---|---|---|---|
| Vulkan stock | Vulkan | none | 303 | 24.7 | Minimum latency |
| Vulkan + spec | Vulkan | PR #20075 + 4 fixes | 302 | 35.3 | Fast decode, easy deploy |
| ROCm + MMQ | ROCm 7.2.1 | PR #21344 | 406 | 18.2 | Long prefill, RAG |
| ROCm full stack | ROCm 7.2.1 | #21344 + #20075 | 404 | 40.1 | Overall chat throughput |

Upstream contributions

  • PR #21344 — Posted gfx1151 validation numbers confirming +19-35% prefill improvement.
  • Issue #21948 — Filed Vulkan MUL_MAT_ID bottleneck with GGML_VK_PERF_LOGGER=1 profiling.
  • PR #20075 — Documented four bug fixes needed for spec decode on hybrid SSM/MoE models outside of Metal.

Every build we tested

Not just the winners. The crashes, regressions, and abandoned attempts are here too.

Complete build history

| Build | Backend | pp512 | pp8K+tg | pp32K+tg | pp128K+tg | tg128 | Notes |
|---|---|---|---|---|---|---|---|
| Vulkan b8779 stock | Vulkan | 268.89 | 249.65 | 234.94 | 145.03 | 23.52 | Baseline |
| Vulkan spec build | Vulkan | 302.10 | — | — | — | 24.56 | PR #20075 + fixes |
| ROCm stock (toolbox) | ROCm | 262.18 | 228.66 | — | 144.44 | 20.83 | Graph splits from VRAM limit |
| ROCm + PR #21344 | ROCm | 354.57 | 302.44 | 291.03 | 171.66 | 21.00 | Best pure prefill |
| ROCm stacked | ROCm | 229.70 | — | — | — | 17.96 | Regressed vs patched-only |
| ROCm full stack | ROCm | 405.99 | — | — | — | 18.29 | PR #21344 + #20075 + fixes |
| REAP-20 Q4_K_M baseline | Vulkan | 443 | — | — | — | 29.1 | Overnight baseline (smaller model) |
| REAP-40 Q4_K_M | Vulkan | 352 | 349 | 291 | 145 | 29.5 | Recommended smaller model |

The "ROCm stacked" build combined PR #21344 with -ffast-math and hipBLASLt. We expected additive gains. Instead, prefill dropped from 354 to 230 t/s. Not every optimization stacks cleanly.


What broke along the way

| Attempt | What went wrong | Lesson |
|---|---|---|
| Q8_0 full benchmarks | 105 GB model exceeded the 120 GiB GTT and was OOM-killed repeatedly | Stick to Q6_K (76 GB) or Q4_K_M (57 GB) |
| ROCm Q6_K without UMA env var | Pathologically slow (>1 hr) due to 94 graph splits from the BIOS VRAM limit | GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is mandatory on Strix Halo |
| ROCm with lhl's FATTN patches | Patch conflicts with current master; wouldn't apply cleanly | Needs a rebase onto a matching llama.cpp version |
| Stacked build with -ffast-math + hipBLASLt | pp512 dropped from 354 to 230 t/s vs patched-only | Test optimizations in isolation |
| Q6_K on stock Vulkan before kernel fix | Infinite loading hang (>20 min) for any model >64 GB | amd_iommu=off is required |

Real workload timing

Server-mode tests with actual prompts. 256-token generation, no_think mode, temp=0.3.

| Workload | Vulkan stock | Vulkan + spec | ROCm + MMQ | ROCm full stack | Best |
|---|---|---|---|---|---|
| Chat (30 in, 1000 out) | 41.5s | 31.9s (+30%) | 56.9s (−27%) | 28.3s (+47%) | Full stack |
| Code gen (2K in, 2K out) | 118.3s | 99.1s (+19%) | 138.0s (−14%) | 80.9s (+46%) | Full stack |
| Summarize (8K in, 256 out) | 155.9s | 153.5s (+2%) | 114.5s (+36%) | 107.2s (+46%) | Full stack |
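The percentage columns are plain wall-time ratios against the Vulkan stock baseline. A minimal awk sketch (example values taken from the chat workload row) reproduces them:

```shell
# Speedup relative to baseline: (base_time / new_time - 1) * 100.
# Positive means faster than baseline; negative means a regression.
speedup() {
  awk -v base="$1" -v t="$2" 'BEGIN { printf "%+.0f%%\n", (base / t - 1) * 100 }'
}

speedup 41.5 28.3   # full stack chat vs stock -> +47%
speedup 41.5 56.9   # ROCm + MMQ chat vs stock -> -27%
```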

Speculative decoding

Target: Qwen3.5-122B-A10B-REAP-20-Q6_K (76 GB). Draft: Qwen3.5-0.8B-Q4_K_M (508 MB). This should not have worked out of the box. It didn't.

How spec decode changes the math
  1. The 0.8B draft model proposes 8 tokens.
  2. The target verifies all 8 in a single batch; a batched matmul costs roughly the same as a one-token pass.
  3. In our runs, ~76-90% of drafted tokens were accepted.
  4. effective_tps = accepted_per_step / (draft_time + verify_time)
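Plugging illustrative numbers into the effective_tps formula makes the tradeoff concrete. The per-step times below are hypothetical, chosen only to show the shape of the math:

```shell
# effective_tps = accepted tokens per step / total step time in seconds.
# acc = accepted tokens per step; draft/verify times are in milliseconds.
effective_tps() {
  awk -v acc="$1" -v t_draft_ms="$2" -v t_verify_ms="$3" \
    'BEGIN { printf "%.1f\n", acc / ((t_draft_ms + t_verify_ms) / 1000) }'
}

effective_tps 6.4 50 180   # 6.4 accepted tokens in a 230 ms step -> 27.8 t/s
```

A slow single-token decode barely matters here: the verify step amortizes one batched pass over every accepted token, which is why ROCm wins once spec decode is on.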

Results after the fixes

| Task | Mode | Baseline t/s | Spec decode t/s | Speedup | Acceptance |
|---|---|---|---|---|---|
| Photosynthesis explanation | think | 24.29 | 30.74 | +26.5% | 76.0% |
| Neural nets explanation | no_think | 24.36 | 34.05 | +39.8% | 90.3% |
| BST code generation | think | 24.43 | 29.23 | +21.8% | 80.6% |
| Short answer (2+2) | think | 24.43 | 29.70 | +23.8% | 88.9% |

no_think mode is where spec decode shines: 40% speedup with 90% acceptance. Thinking mode drops acceptance to ~76% because reasoning tokens are harder to predict.


The four bugs we had to fix

Beyond PR #20075, these fixes were required for Qwen3.5 on non-Metal backends:

  • Hybrid seq_rm skipped attention cleanup. When recurrent rollback failed, the function returned early and never called mem_attn->seq_rm(). Stale positions accumulated, breaking M-RoPE invariants.
  • Soft rollback corrupted positions. The partial removal path moved cell positions backward (cells[i].pos = p0 - 1) instead of erasing them. seq_pos_max then returned stale higher positions.
  • Compat check was chicken-and-egg. common_speculative_is_compat() tested seq_rm on a fresh context before any checkpoints existed. It always failed for hybrid models.
  • Recurrent reserve size was too small. With -np 1, recurrent_rs_size = max(1, n_seq_max) = 1. Checkpointing needs up to 8 cells per sequence. We bumped it to max(16, n_seq_max * 16).

Context scaling

Prefill slows down as context grows. Decode stays weirdly constant. Here's the full sweep from 64 to 32K tokens.

Prefill speed by context

(Chart series: Vulkan stock, ROCm + MMQ.)

Decode speed by context

(Chart series: Vulkan stock, Vulkan + spec, ROCm + MMQ + spec.)

Full context scaling table

| Context | Vk pp | Vk tg | Vk+spec tg | ROCm pp | ROCm tg | ROCm+spec tg | Best decode |
|---|---|---|---|---|---|---|---|
| 64 | 98 | 24.3 | 26.6 | 130 | 17.8 | 28.5 | ROCm+spec |
| 512 | 303 | 24.4 | 28.1 | 376 | 17.7 | 23.3 | Vk+spec |
| 2K | 353 | 24.2 | 25.2 | 423 | 17.7 | 26.7 | ROCm+spec |
| 4K | 362 | 24.2 | 25.5 | 413 | 17.7 | 25.3 | ROCm+spec |
| 8K | 371 | 23.9 | 25.1 | 407 | 17.4 | 27.4 | ROCm+spec |
| 16K | 353 | 23.4 | 25.0 | 371 | 17.0 | 23.6 | Vk+spec |
| 32K | 314 | 22.6 | 18.6 | 315 | 16.2 | 22.9 | ROCm+spec |

ROCm's MMQ prefill advantage fades after 16K. At that point, flash attention dominates over batched matmul. Decode stays flat across all lengths.


Raw llama-bench: combined throughput

| Context | Vulkan stock | ROCm + MMQ | Winner |
|---|---|---|---|
| 512 | 93.0 | 77.8 | Vulkan (+20%) |
| 2K | 181.8 | 179.5 | Vulkan (+1%) |
| 8K | 258.0 | 287.3 | ROCm (+11%) |
| 16K | 266.0 | 304.4 | ROCm (+14%) |
| 32K | 255.4 | 285.8 | ROCm (+12%) |
| 65K | 217.2 | 236.8 | ROCm (+9%) |
| 131K | 155.7 | 169.2 | ROCm (+9%) |

Research trail

Profiling data, parallel subagent investigations, and the upstream issues and PRs that came out of them.

MUL_MAT_ID profiling breakdown
| Context | MUL_MAT_ID share of prefill | Total prefill (ms) |
|---|---|---|
| 512 | 66.2% | 1700 |
| 8K | 57.6% | 1839 |
| 32K | 41.9% | 2556 |
| 128K | 19.7% | 5351 |

(The chart also tracked FLASH_ATTN_EXT, whose share grows with context.)
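The profiling data came from ggml's Vulkan perf logger. A hedged sketch of the invocation, with a placeholder model path (the source confirms only that GGML_VK_PERF_LOGGER=1 was set; the exact flags are assumed to match our other benchmark runs):

```shell
# Per-op Vulkan timings are printed when the perf logger env var is set.
GGML_VK_PERF_LOGGER=1 llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 \
  -p 512,8192,32768,131072 -n 128
```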

KV Cache Compression Frontier

Mission 01 tested 11 cache_type_k × cache_type_v combinations across 4 context lengths (512–32K) on Qwen3.5-122B-A10B-REAP-20-Q6_K.

Pareto-optimal sweet spots

| Combo | Context | pp t/s | tg t/s | KV mem | Reduction |
|---|---|---|---|---|---|
| f16 / f16 | 32K | 264.4 | 24.53 | 8.19 GB | baseline |
| q8_0 / q8_0 | 32K | 244.7 | 24.32 | 4.09 GB | −50% |
| q4_0 / q4_0 | 32K | 246.2 | 24.35 | 2.05 GB | −75% |

(Chart: prefill speed vs memory reduction at 32K context — f16/f16 at 0% saved, q8_0/q8_0 at 50%, q4_0/q4_0 at 75%; asymmetric* combos hit TIMEOUT or a >3× slowdown.)

*All asymmetric combos (q8_0/f16, f16/q8_0, q4_0/f16, f16/q4_0, q4_0/q8_0, q8_0/q4_0) either timed out at 32K or showed severe prefill degradation. Avoid mixed K/V quantization on this Vulkan backend.
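Reproducing a symmetric combo is one pair of flags. A sketch using llama-bench's cache-type options (-ctk/-ctv), with the model path as a placeholder:

```shell
# q8_0/q8_0 KV cache at 32K context: ~50% less KV memory for ~8% slower prefill.
llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 \
  -p 32768 -n 128
```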

Upstream contributions

| Date | Contribution | Details |
|---|---|---|
| Apr 15 | Issue #21948 | Filed Vulkan MUL_MAT_ID bottleneck with full GGML_VK_PERF_LOGGER=1 profiling data. |
| Apr 15 | PR #21344 | Validated MMQ VGPR tuning on gfx1151: +19-35% prefill improvement. |
| Apr 15 | PR #20075 | Documented four bug fixes needed for hybrid SSM/MoE spec decode on non-Metal backends. |
| Apr 14 | Kernel fix | Applied amd_iommu=off and ttm.pages_limit=335544321 to fix llama.cpp #14854. |

What four parallel subagents found

  • Memory: Vulkan's vkAllocateMemory with HOST_VISIBLE_BIT maps the full GTT correctly. ROCm's hipMalloc is boxed into BIOS VRAM unless GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set.
  • Compiler flags: -ffast-math can give ~4% on some AMD GPUs, but it regressed in our stacked build. ROCm 7.2.1 made the old --amdgpu-unroll-threshold-local workaround unnecessary.
  • Flash attention: lhl's rocWMMA FATTN tuning can improve decode at depth by 33-104%, but the patch doesn't apply cleanly to current master.
  • MUL_MAT_ID: The routed expert path takes 700+ ms per batch. The most promising fix pattern is Metal's map → batched matmul → unmap approach from PR #13388.

Resources

The PRs, issues, articles, and repos that made this project possible — or that we wish we'd found sooner.

Our contributions

ValidationPR #21344 — MMQ VGPR tuning

Custom ROCm container, patch applied, gfx1151 numbers posted: +19-35% prefill on large MoE.

Bug reportIssue #21948 — Vulkan MUL_MAT_ID

Full perf-logger profiling showing MUL_MAT_ID consumes 42-66% of prefill time on gfx1151.

FixesPR #20075 — Spec decode fixes

Four additional fixes for hybrid SSM/MoE speculative decoding on non-Metal backends.

RepoGitHub: 0xSero/framework-max

Benchmarking scripts, overnight pipeline, and visualization code.


Prior art and reading

MergedPR #13388 — Metal MoE optimization

Map → batched matmul → unmap for MoE models. 1.8-4.1x speedup on Metal. The template we'd like to see on Vulkan.

MergedPR #15524 — Vulkan subgroup opt

Subgroup optimizations for Vulkan MUL_MAT_ID. Already active in our build, but still leaves a large gap.

ReferenceIssue #14854 — >64GB loading hang

The original report for the Vulkan/Strix Halo issue where models over 64GB load infinitely slowly without amd_iommu=off.

ResearchChips and Cheese — RDNA 3 Infinity Cache

Deep-dive on why Infinity Cache mostly doesn't help LLM decode (streaming workloads exceed the 32 MB cache).
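The arithmetic behind that claim, as a back-of-envelope sketch: the 10B active-parameter count is read off the model name's "A10B", and ~6.56 bits/weight for Q6_K is our assumption, not a figure from the article.

```shell
# GB of weights streamed per decode token = active_params * bits_per_weight / 8.
bytes_per_token_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

bytes_per_token_gb 10e9 6.56   # ~8.2 GB read per token, dwarfing a 32 MB cache
```

Each decoded token streams several GB of expert weights through the GPU, so a 32 MB cache sees essentially no reuse.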


Models

| Model | Size | Use case | Link |
|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q6_K | 76 GB | Main target (best quality) | HF |
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M | 57 GB | Smaller variant | HF |
| Qwen3.5-122B-A10B-REAP-40-Q4_K_M | 44 GB | Recommended tradeoff | HF |
| Qwen3.5-0.8B-Q4_K_M | 508 MB | Draft model for spec decode | Community GGUF |

Reproduce

Hardware setup, kernel parameters, build commands, and the exact flags we used.

Hardware and OS setup

  • AMD Ryzen AI MAX+ 395 (16C/32T Zen 5)
  • Radeon 8060S (gfx1151, RDNA 3.5, 40 CU)
  • 128 GB LPDDR5X-8000
  • Fedora 43, kernel 6.17.1

Required kernel parameters

sudo grubby --update-kernel=ALL --args='amd_iommu=off ttm.pages_limit=335544321 amdgpu.gttsize=122880'

Reboot, then verify GTT:

cat /sys/class/drm/card*/device/mem_info_gtt_total

You should see ~128849018880 (120 GiB). Without amd_iommu=off, any model over 64GB will hang during Vulkan loading.
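A small converter makes the sanity check easier to read (the expected byte count comes from the paragraph above):

```shell
# Convert the sysfs byte count to GiB (1 GiB = 1073741824 bytes).
gtt_gib() { awk -v b="$1" 'BEGIN { printf "%.0f\n", b / 1073741824 }'; }

gtt_gib 128849018880   # the expected mem_info_gtt_total -> 120
```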


Vulkan stock baseline

export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH

$HOME/.local/opt/llama.cpp/current/llama-bench \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128

ROCm with PR #21344

We used a custom podman container because Fedora's HIP headers conflict with the ROCm 7.2.1 runtime.

# Inside the custom container
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

llama-bench \
  -m /models/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -ngl 99 -fa 1 -c 131072 \
  -p 512,2048,8192,16384,32768,65536,131072 \
  -n 128

Do not skip the UMA env var. Without it, ROCm sees only BIOS VRAM and performance collapses.


Speculative decoding

export LD_LIBRARY_PATH=$HOME/.local/opt/llama.cpp/current:$LD_LIBRARY_PATH

$HOME/.local/opt/llama.cpp/current/llama-server \
  -m $HOME/.local/share/models/gguf/Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf \
  -md $HOME/.local/share/models/gguf/Qwen3.5-0.8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -ngld 99 -fa 1 -c 4096 \
  --draft-max 8 --parallel 1

For Qwen3.5 and other hybrid SSM/MoE models, you'll need PR #20075 plus the four additional fixes we documented.