TL;DR: I got Qwen3-Coder-Next (80B MoE) running at 46 tok/s on an under-$3K mini PC. It took a full OS reinstall, a firmware downgrade, kernel parameter archaeology, a thermal crisis, and throwing out about half the tuning advice I found online. Here's everything I learned the hard way.

Why This Hardware

My existing GPU setups didn't have enough VRAM to run some of the larger models I was interested in testing. Discrete GPUs with 48+ GB of VRAM are absurdly expensive, and splitting a model across multiple consumer cards comes with its own headaches and PCIe bottleneck tax. So I started looking into UMA (Unified Memory Architecture) systems — where the CPU and GPU share the same memory pool — as a significantly more affordable way to get a ton of usable memory for inference.

That led me to the Ryzen AI MAX+ 395. It's a weird chip — a laptop/mini-PC APU with 16 Zen 5 cores (32 threads), a 40-CU RDNA 3.5 iGPU, and support for up to 128 GB of LPDDR5 unified memory. Since the CPU and GPU share the same pool, the GPU can address all 128 GB without PCIe bottlenecks. For LLM inference, where model weights need to stream through the compute units every single token, that's a huge deal.

The theoretical memory bandwidth is 256 GB/s (LPDDR5X-8000 on a 256-bit bus). In practice I measured around 212-215 GB/s — about 82% efficiency. That's slower than an M4 Max (~546 GB/s) but faster than trying to cram a 70B model across two consumer GPUs and eating the PCIe tax.

The GMKtec NucBox EVO-X2 packages this chip into a mini PC chassis for under $3K with 128 GB RAM — though with the way LPDDR5 prices have been going lately, check current pricing before you get too excited. There are a few other options with this chip: Framework makes a Desktop, ASUS has the ROG Flow Z13 tablet, and Minisforum has the EliteMini AI Max. The GMKtec was the best price-to-performance option I found at the time, but it's worth shopping around.

The OS: Rocky Linux 9.7

I'm running Rocky Linux 9.7 — enterprise stability, good package ecosystem, SELinux actually works properly. Any RHEL 9 derivative should work similarly.

The Three Things That Must Be Right

After the base OS was clean, I hit a wall. A really frustrating wall. Getting this hardware working properly requires three specific things to be correct — the right kernel, the right firmware, and thermal power limits that won't let the system cook itself to death. I'm going to cover all three here because skipping any one of them will ruin your day.

1. Kernel 6.18.4 or newer

The KFD (Kernel Fusion Driver) in older kernels has a page table bug specific to gfx1151. Any GPU tensor allocation triggers "Memory access fault: Page not present" errors. This was fixed upstream in kernel 6.18.4. Rocky 9's stock kernel is 6.12, which is too old.

I tried AMD's amdgpu-dkms package first (which backports the amdgpu driver to older kernels), but the DKMS version is pre-6.18 and doesn't include the KFD fix. No combination of kernel parameters — HSA_ENABLE_SDMA=0, amd_iommu=off, amdgpu.noretry=0, amdgpu.cwsr_enable=0 — works around it. Trust me, I tried them all. You need the actual kernel fix.

The solution: ELRepo's kernel-ml package, which provides mainline kernels packaged for RHEL/Rocky. I installed 6.19.6 and it just worked.

sudo dnf install -y elrepo-release
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf --enablerepo=elrepo-kernel install -y kernel-ml
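
Before debugging anything GPU-side, it's worth confirming you actually rebooted into the new kernel. Here's a small sketch; kernel_at_least is a helper of my own naming, not a standard tool, and it relies on GNU sort -V:

```shell
# kernel_at_least: succeed if version $1 >= version $2 (uses GNU sort -V)
kernel_at_least() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# 6.18.4 is the first release with the gfx1151 KFD page-table fix
if kernel_at_least "$(uname -r | cut -d- -f1)" "6.18.4"; then
  echo "kernel is new enough"
else
  echo "kernel predates the gfx1151 KFD fix"
fi
```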

2. MES firmware version 0x80

Even with kernel 6.19.6, I was still getting page faults. Cool. The second half of the puzzle is the MES (Micro Engine Scheduler) firmware. Rocky's linux-firmware-20260130 package ships MES version 0x83, which is known to cause ROCm page faults on Strix Halo. The upstream linux-firmware repository explicitly reverted it with the commit message: "MES FW 0x83 is reported to cause ROCm page faults."

Rocky hadn't picked up the revert yet, and AMD's own amdgpu-dkms-firmware package also ships 0x83. So the fix is manual:

# Download good firmware (version 0x80) from upstream revert commit
curl -sL -o /tmp/gc_11_5_1_mes1.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes1.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"
curl -sL -o /tmp/gc_11_5_1_mes_2.bin \
  "https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/plain/amdgpu/gc_11_5_1_mes_2.bin?id=c092c7487eb7c3d58697f490ff605bc38f4cc947"

# Install to updates dir (takes priority over base firmware)
sudo cp /tmp/gc_11_5_1_mes1.bin /lib/firmware/updates/amdgpu/
sudo cp /tmp/gc_11_5_1_mes_2.bin /lib/firmware/updates/amdgpu/

# Rebuild initramfs and reboot
sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
sudo reboot

Verify after reboot:

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
# Good: firmware version: 0x00000080
# Bad:  firmware version: 0x00000083
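
Since I ended up re-checking this after every kernel and firmware change, I wrapped it in a tiny helper. A sketch only: mes_ok is my own name, and the grep patterns assume the line format my firmware_info prints, so adjust them if your output differs:

```shell
# mes_ok: read amdgpu_firmware_info on stdin, succeed iff the first MES
# entry reports firmware version 0x00000080 (line format is an assumption)
mes_ok() {
  grep -m1 'MES' | grep -q '0x00000080'
}

# usage (debugfs needs root):
#   sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | mes_ok \
#     && echo "MES 0x80: good" || echo "not 0x80: expect page faults"
```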

Once both pieces were in place, PyTorch passed all validation checks: tensor operations, all data types (fp32, fp16, bf16, int8), 4 GiB memory allocation, and ~1.05 TFLOPS on a 4096x4096 FP32 matmul. Finally.

Lesson learned the hard way: Pin your firmware. I added exclude=linux-firmware* amdgpu-dkms-firmware* to /etc/dnf/dnf.conf to prevent package updates from sneaking MES 0x83 back in. Ask me how I know.

3. Thermal Power Limits

This one might be the most important of the three, so don't skip it.

While setting up a PyTorch benchmarking suite, the system started dying on me. At first I figured "oh weird, the host crashed" — but when I went to check on it, it wasn't just locked up. It was fully powered off. That's... not normal. Then it happened again. And again. Full hard power-off events with no warning, no logs, nothing.

I set up thermal monitoring logging every 5 seconds and caught the cause:

19:00:07  Tctl=71°C   pwr=92W    ← normal inference
19:00:12  Tctl=91°C   pwr=165W   ← torch.compile spike
19:00:22  Tctl=93°C   pwr=164W   ← approaching TjMax (100°C)
19:00:27  Tctl=61°C   pwr=30W    ← thermal shutdown

torch.compile triggers Triton/Inductor kernel compilation that simultaneously hammers all 32 CPU threads and the GPU. On a UMA APU where everything shares one thermal envelope in a mini PC chassis, that produces a 165W power spike — way past the 120W PPT Fast limit and far more than the little cooler can handle. The firmware thermal protection kicks in and just kills power. No graceful shutdown, just off.

Normal LLM inference is totally fine — 73-75W, 76-80°C, perfectly stable all day long. But the moment you hit a mixed CPU+GPU burst workload, you're rolling the dice. And it's not just torch.compile — anything that pegs the CPU and GPU simultaneously in this chassis can trigger it. I lost count of how many times the system just cut out on me before I got this sorted.

The fix is ryzenadj, a tool that lets you adjust AMD mobile power limits from Linux:

sudo ryzenadj --fast-limit=100000 --tctl-temp=88

This caps burst power to 100W and sets the thermal target to 88°C, giving 12°C of headroom before TjMax. Do this immediately after your first boot, before you run anything heavy. I created a systemd service to persist these limits across reboots so they're always active. The GMKtec ships with BIOS 1.12 / EC 1.10 (the latest available), so there's no firmware fix coming — you've gotta manage this in software.
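
For reference, a minimal sketch of that systemd unit. The unit name and the /usr/local/bin path are my choices, not anything standard; adjust to wherever your ryzenadj binary lives:

```ini
# /etc/systemd/system/ryzenadj-limits.service (name and path are placeholders)
[Unit]
Description=Apply conservative power/thermal limits via ryzenadj
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ryzenadj --fast-limit=100000 --tctl-temp=88
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl daemon-reload && systemctl enable --now ryzenadj-limits. Some setups also re-apply the limits on a timer, since firmware on some machines can quietly reset them; I haven't needed that, but keep it in mind if temperatures creep back up.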

Other thermal improvements people recommend but I haven't tried yet: replacing the stock thermal paste with PTM7950 phase-change material, and the ec_su_axb35 kernel module for Linux fan control. Maybe I'll get to those at some point.

Understanding Unified Memory (It's Unintuitive)

The BIOS has a "UMA Frame Buffer Size" setting that defaults to 64 GB. Your instinct says "big number = more GPU memory = good." Yeah, your instinct is wrong here.

On a traditional discrete GPU, VRAM is physically separate from system RAM. On Strix Halo, there's only one pool of LPDDR5. The BIOS carveout reserves a chunk of that pool as dedicated VRAM — the OS can't see it, can't use it for anything else, and the GPU doesn't even need it because it can access system RAM at the same speed through GTT (Graphics Translation Table).

The optimal configuration is:

  • BIOS VRAM: 2 GB (the minimum on the GMKtec's current BIOS 1.12 — you'll see guides online saying to set this to 512 MB, but that was only possible on earlier BIOS versions. 2 GB is as low as it goes now.)
  • GTT: 124 GB (dynamically mapped, shared between CPU and GPU)

This gives you ~124 GB usable for both CPU and GPU workloads, instead of 64 GB locked to GPU + 64 GB for CPU.

The kernel parameters to make this work:

amdgpu.gttsize=126976          # 124 GiB GTT
ttm.pages_limit=29360128       # Allow TTM to manage 112 GiB of pages
ttm.page_pool_size=29360128    # Matching pool size
amdgpu.no_system_mem_limit=1   # Disable SVM resident memory cap
amd_iommu=off                  # Fully disable IOMMU (~4% bandwidth gain)

The ttm.pages_limit parameter is particularly sneaky. Without it, you can set GTT to 124 GB and the kernel will report 124 GB, but HIP/ROCm applications will only see ~62 GiB. The TTM subsystem has its own page limit that must match. And it has to be set at boot — runtime changes don't take effect. That one took a while to figure out.
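
The units are the other sneaky part: amdgpu.gttsize is specified in MiB, while the ttm.* limits count 4 KiB pages. A quick sanity check that the magic numbers above match the sizes they claim to represent:

```shell
# amdgpu.gttsize is in MiB; ttm.pages_limit / ttm.page_pool_size are
# in 4 KiB pages
echo $(( 124 * 1024 ))                        # 124 GiB as MiB -> 126976
echo $(( 112 * 1024 * 1024 * 1024 / 4096 ))   # 112 GiB as 4 KiB pages -> 29360128
```

If you change the GTT size for a different RAM configuration, recompute both ttm.* values the same way.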

On Rocky 9, updating kernel parameters has its own gotcha: editing /etc/default/grub and running grub2-mkconfig doesn't work. Rocky 9 uses BLS (Boot Loader Specification) entries, which have their own options line. Use grubby instead:

grubby --update-kernel=DEFAULT --args="amd_iommu=off amdgpu.gttsize=126976 ttm.pages_limit=29360128 ttm.page_pool_size=29360128 amdgpu.no_system_mem_limit=1"

Building and Running llama.cpp

Ok, with the hardware finally cooperating, I built llama.cpp. I started with ROCm/HIP since that's what everyone recommends for AMD GPUs:

cmake -B build \
  -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
  -DGGML_NATIVE=OFF -DCMAKE_C_FLAGS='-march=znver4' -DCMAKE_CXX_FLAGS='-march=znver4' \
  -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_VNNI=ON -DGGML_AVX512_BF16=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_LTO=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

A few build notes:

  • -DGGML_NATIVE=OFF with explicit -march=znver4 is required because GCC 11 on Rocky 9 emits VNNI instructions that the system's binutils can't assemble. Specifying znver4 explicitly avoids the problematic auto-detection.
  • The AVX512 flags enable SIMD for CPU-side tensor ops. Zen 5 has full AVX-512 support.
  • GGML_HIP_ROCWMMA_FATTN enables wave matrix multiply for flash attention.

Critical for APUs: You must set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 before running. Without it, llama.cpp tries to allocate in the 2 GB dedicated VRAM carveout and fails for any model larger than 2 GB. With it, allocations go through the full GTT pool. Don't skip this or you'll be very confused.
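
For concreteness, here's how I set it, both for interactive runs and under systemd:

```shell
# HIP/ROCm builds only; the Vulkan build doesn't need this.
# For interactive runs, export before launching llama.cpp:
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

# Under systemd, the equivalent is a line in the unit's [Service] section:
#   Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
```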

The Model

I'm running Qwen3-Coder-Next Q4_K_M — an 80B parameter Mixture-of-Experts model with 3B active parameters, purpose-built for coding agents. At Q4_K_M quantization it's about 46 GiB across 4 GGUF shards, fitting comfortably in 128 GB with room for a 65K token context window.

The Mixture-of-Experts architecture is what makes this hardware viable. An 80B MoE model only needs to stream the active expert weights each token — roughly 3B parameters — not the full 80B. Dense 70B models? They crawl at 5-7 tok/s on this hardware. This 80B MoE? 46 tok/s. Same memory, same bandwidth — the model architecture makes all the difference.
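
The dense-vs-MoE gap falls straight out of the bandwidth math. A back-of-envelope sketch in shell + awk, assuming generation is purely bandwidth-bound, using my measured ~212 GB/s and a bytes-per-parameter figure derived from this model's 46 GiB file size:

```shell
# Generation ceiling ~= memory bandwidth / bytes streamed per token
awk 'BEGIN {
  bw  = 212e9                         # measured bandwidth, bytes/s
  bpp = 46 * 1073741824 / 80e9        # bytes per param at Q4_K_M (46 GiB / 80B)
  printf "dense 70B ceiling: %.1f tok/s\n", bw / (70e9 * bpp)
  printf "MoE 3B-active ceiling: %.1f tok/s\n", bw / (3e9 * bpp)
}'
```

On these inputs it prints a ceiling of roughly 4.9 tok/s for a dense 70B (right in line with the 5-7 tok/s crawl) and about 114.5 tok/s for 3B active parameters. The measured 46 tok/s is around 40% of that MoE ceiling, which is plausible once attention, KV-cache reads, and scheduling overhead are accounted for.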

This model scored #1 on SWE-rebench Pass@5 at 64.6%, beating Claude Opus 4.6 (58.3%). Running it locally at interactive speeds on a sub-$3K box (give or take, depending on what RAM prices are doing this week) is... pretty nuts.

Runtime Configuration

I run llama-server as a systemd service with these flags:

-fa on              # Flash attention (smaller KV cache, faster attention)
--parallel 1        # Single slot — all memory for one user
-t 32 -tb 32        # All 32 hardware threads
-ub 2048            # Large ubatch for GPU utilization during prompt processing
-ctk q8_0 -ctv q8_0  # Quantized KV cache (~2x smaller than f16, minimal quality loss)
--mlock             # Pin model in RAM
-c 65536            # 65K context window
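
Putting those flags together, here's a sketch of the service file. The binary path, model path, and port are placeholders for wherever your files live, and the model filename is illustrative (with a sharded GGUF, you point -m at the first shard):

```ini
# /etc/systemd/system/llama-server.service (paths/names are placeholders)
[Unit]
Description=llama-server (Qwen3-Coder-Next)
After=network.target

[Service]
ExecStart=/usr/local/bin/llama-server \
  -m /opt/models/qwen3-coder-next-q4_k_m-00001-of-00004.gguf \
  -fa on --parallel 1 -t 32 -tb 32 -ub 2048 \
  -ctk q8_0 -ctv q8_0 --mlock -c 65536 \
  --host 127.0.0.1 --port 8080
LimitMEMLOCK=infinity
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

LimitMEMLOCK=infinity matters here: without it, --mlock can fail under systemd's default resource limits even though it works from an interactive shell.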

A lesson about GPU power modes: profile_peak sounds good but actually causes thermal throttling on an integrated GPU sharing the SoC thermal envelope. Generation dropped from 37.9 to 26.9 tok/s. Ouch. Use high instead — it clocks up aggressively but lets the thermal controller do its job.

Tuning: What the Internet Got Wrong

With the system stable, I went through every tuning recommendation I could find — a comprehensive "definitive guide" document and the strixhalo.wiki llama.cpp performance page. I benchmarked each claim individually. A lot of them were wrong, at least for this hardware.

Things that didn't matter

--no-mmap vs --mlock: Identical performance. pp=219.5/tg=37.7 vs pp=218.7/tg=38.0. On a UMA APU where GPU memory is system memory, both approaches effectively do the same thing. Pick whichever you prefer.

-b 256 batch size: Slightly worse than my -ub 2048 setting. The claimed jump from 70 to 591 tok/s was for Qwen3-30B-A3B, a much smaller model with different memory access patterns. Don't copy batch size settings across models.

ROCBLAS_USE_HIPBLASLT=1: No measurable effect on gfx1151 with this model. The "mandatory" claim may apply to other GPU architectures.

Things that helped a little

amd_iommu=off: Real. Generation speed went from 38.0 to 39.4 tok/s — a 3.7% improvement. Not the claimed 6%, but free performance. I also bumped GTT from 112 GiB to 124 GiB in the same change.

The big discovery: Vulkan beats ROCm

Then I built llama.cpp with Vulkan instead of HIP, just to see what would happen:

cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release

The results were... not subtle:

Context        Vulkan pp (tok/s)   Vulkan tg (tok/s)   HIP pp (tok/s)   HIP tg (tok/s)
Default (512)        548                45.9               336              40.8
32K                  394                36.8                91              29.7
65K                  305                32.2                54              23.5
100K                 213                28.2                36              18.7

Vulkan with RADV (Mesa's open-source Vulkan driver) was 63% faster at prompt processing and 12% faster at generation at default context. The gap widens with context length — at 100K tokens, Vulkan is nearly 6x faster at prompt processing and 51% faster at generation.

This directly contradicts the common advice that "ROCm is better for long-context work." That may be true on datacenter GPUs (MI300X) or older desktop GPUs (gfx1100), but on gfx1151, the HIP compute kernels are known to run 2-6x slower than expected. Vulkan's cooperative matrix support through RADV doesn't have the same problem.

The guides also recommended AMDVLK (AMD's proprietary Vulkan driver) over RADV for 10-15% better performance. I investigated and found that AMD discontinued AMDVLK in September 2025, going all-in on RADV. The strixhalo.wiki's own benchmarks actually show RADV beating AMDVLK even before they killed it. Just use RADV.

One nice bonus: the Vulkan build doesn't need the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable. That's a HIP/ROCm-specific workaround.

The Boring But Important Stuff

A handful of other things that aren't exciting but tripped me up:

DNF firmware pinning: Added exclude=linux-firmware* amdgpu-dkms-firmware* to /etc/dnf/dnf.conf. Without this, a routine dnf update can reintroduce MES 0x83 and break GPU compute.

EPEL rocminfo conflict: EPEL ships rocminfo 5.4.4 which conflicts with the ROCm 7.2 version from AMD's repo. Fixed with dnf config-manager --save --setopt=epel.excludepkgs=rocminfo.

SELinux and systemd: The llama-server binary must live in /usr/local/bin (not ~/) for SELinux to allow systemd to execute it. Run restorecon -v after copying.

WiFi: The MediaTek MT7925 (Wi-Fi 7) works with WPA2 networks but fails on WPA2/WPA3 mixed-mode SSIDs. Suspected mt7925e driver bug. If your router broadcasts both, you may need a WPA2-only SSID.

GPU performance mode: Set via udev rule to persist across reboots:

echo 'ACTION=="add", SUBSYSTEM=="drm", KERNEL=="card0", ATTR{device/power_dpm_force_performance_level}="high"' \
  | sudo tee /etc/udev/rules.d/99-gpu-perf.rules

What I'd Do Differently

If I were setting this up again from scratch:

  1. Start with Vulkan, not ROCm/HIP. I spent way too much time optimizing the HIP build before discovering Vulkan was faster at everything. Just build llama.cpp with -DGGML_VULKAN=ON from the start.

  2. Install ELRepo kernel immediately. Don't waste time trying to make the stock 6.12 kernel work with DKMS. It can't. I tried.

  3. Check MES firmware before debugging anything else. If rocminfo hangs or GPU compute produces page faults, check MES version first. It's the most common cause and the least obvious one.

  4. Set BIOS VRAM to minimum and maximize GTT from day one. The default 64 GB carveout wastes half your memory for no reason.

  5. Install ryzenadj before you do literally anything else. Seriously. The thermal shutdowns caught me completely off guard and happened repeatedly. The stock power limits on this chassis are not safe for sustained workloads. Cap power first, then start playing with models.

The End Result

My final configuration:

Component      Setting
OS             Rocky Linux 9.7, kernel 6.19.6 (ELRepo)
GPU driver     Mesa RADV 25.0.7 (Vulkan)
MES firmware   0x80 (manually installed)
BIOS VRAM      2 GB (minimum)
GTT            124 GiB
IOMMU          Fully disabled
Power limits   100W burst / 88°C target (ryzenadj)
llama.cpp      Vulkan build, flash attention, q8_0 KV cache
Model          Qwen3-Coder-Next Q4_K_M (80B MoE, 46 GiB)
Context        65K tokens

Performance:

Metric                              Speed
Token generation (short context)    45.9 tok/s
Token generation (32K context)      36.8 tok/s
Token generation (65K context)      32.2 tok/s
Token generation (100K context)     28.2 tok/s
Prompt processing (short context)   548 tok/s

For a mini PC that cost me under $3K — though good luck getting that price if LPDDR5 keeps doing what it's been doing — running a frontier-class 80B coding model entirely locally, with 65K context and no API costs? I'm pretty happy with that.


Tested on: GMKtec NucBox EVO-X2, AMD Ryzen AI MAX+ 395, 128 GB LPDDR5, Rocky Linux 9.7, kernel 6.19.6, llama.cpp build f90bd1dd8, Mesa RADV 25.0.7. March 2026.