No more melting my Strix

TL;DR: My 395 AI system has hard-crashed from thermals 3+ times while I was working on posts for this blog. The fix was always the same: cap power and pin the fans before a heavy run. So was the failure: the fix was manual, and it wasn’t always obvious which runs would push the box hard enough to need it, so I’d skip it or forget and melt the box. So I finally wrote a ~100-line daemon that watches the die temperature and ramps the fans itself. Bonus: it almost never needs to max the fans, so the box is way quieter now too!

The daemon is the boring part: a poll loop and four numbers. The rest of this is how I ended up writing it, and why the off-the-shelf tools were no help.

Contents

A tradition of cooking this box
Driving the fans
Damen’s daemon
The code
The real bug was man all along
The usual caveats
General neat learnings

A tradition of cooking this box

I have now lost a workload to thermals on this miniPC three separate times (four crashes, counting the sitting where I did it twice), across two (soon to be three) posts.

The first time, I learned the hard way that a UMA APU in a miniPC chassis will cook itself to a hard shutdown the instant CPU and GPU spike together - a 165W burst the little cooler can’t clear, and the firmware just cuts power. The fix that time was a ryzenadj power cap.
The second time, I literally opened the relevant section with the words “don’t re-make the same thermal mistakes as last time,” and then re-made them - forgot to pin the fans before a multi-hour sweep and crashed the box twice in one sitting. After that I wrote a wrapper that refused to start a benchmark unless the fans were confirmed spinning at full tilt. Surely that was the end of it.
It was not the end of it, because the wrapper only helped if I remembered to route a run through it. So I kicked off a 262K-token benchmark the quick way, walked away, and it went dark at 102°C partway through the run. That post isn’t live yet, hence the lack of a link, but it’s coming.

The root problem here was clear: any system built on me remembering was doomed to failure.

Driving the fans

None of this part is my discovery. I got into Strix Halo thermals the way most people do: by finding the strixhalo.wiki fan-and-power-control guide after the box started misbehaving (it first cooked itself in my very first post on this machine). That guide is where both the ec_su_axb35 tooling and the “the standard Linux fan stack won’t bind here” diagnosis come from. The part that’s actually mine is the specific curve and the daemon that runs it, further down.

The short version: lm-sensors + fancontrol won’t bind here, because there’s no standard hwmon PWM (pulse-width modulation, the variable signal that normally sets fan speed) channel for it to grab. The fans sit behind a vendor EC (embedded controller) exposed by a custom out-of-tree driver, ec_su_axb35, under its own sysfs class:

/sys/class/ec_su_axb35/
├── apu/power_mode          # balanced | performance
├── fan1/{mode,level,rpm,rampup_curve,rampdown_curve}
├── fan2/...
├── fan3/...
└── temp1/{temp,min,max}

You set a fan by writing fixed to mode and then a level of 1-5 to level, or by writing auto to mode to hand control back to the EC. No PWM percentage, no hwmon binding. (For the record, at level=5 the three fans read 4227 / 4321 / 1876 RPM - two big fans near 4300, one smaller one near 1900. Worth knowing, because “max” isn’t one number.)
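
To make that interface concrete, here is a minimal sketch of pinning and releasing all three fans by hand. The helper names (pin_fans, release_fans, fan_rpms) and the EC_ROOT override are hypothetical conveniences for illustration, not part of the driver; the paths are the ones from the tree above, and writing to the real sysfs tree needs root.

```shell
#!/bin/bash
# Sketch only: manual fan control through the ec_su_axb35 sysfs tree.
# EC_ROOT is overridable (an assumption, so the helpers can be dry-run
# against a scratch directory instead of the real EC).
EC_ROOT="${EC_ROOT:-/sys/class/ec_su_axb35}"

# Pin all three fans to a fixed level. $1 = level, 1-5.
pin_fans() {
  local level=$1 f
  case "$level" in
    [1-5]) ;;
    *) echo "level must be 1-5, got '$level'" >&2; return 1 ;;
  esac
  for f in fan1 fan2 fan3; do
    echo fixed    > "$EC_ROOT/$f/mode"  || return 1
    echo "$level" > "$EC_ROOT/$f/level" || return 1
  done
}

# Hand all three fans back to the EC's own control.
release_fans() {
  local f
  for f in fan1 fan2 fan3; do
    echo auto > "$EC_ROOT/$f/mode" || return 1
  done
}

# Print the current RPM reading of each fan, one per line.
fan_rpms() {
  local f
  for f in fan1 fan2 fan3; do
    printf '%s %s\n' "$f" "$(cat "$EC_ROOT/$f/rpm")"
  done
}
```

Sourced into a root shell, pin_fans 5 followed by fan_rpms is the manual version of "crank the fans before a run", and release_fans undoes it.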

The daemon hinges on one detail here: the EC’s own temp1 is a conservative sensor that tops out around 75°C, but the temperature that actually kills the box - the one that read 102 at the crash - is Tctl, from k10temp. Tctl is the control temperature AMD’s chips report for thermal management: effectively the CPU die/junction temperature, and the number the hardware itself watches when it decides to cut power. So the daemon watches Tctl, not the EC’s sensor (and finds k10temp by name rather than by hwmon index, because those indices renumber across reboots - exactly the kind of thing that turns “works on my machine” into “works until the next reboot”).
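
A quick way to see the index problem for yourself is to print each hwmon index next to its driver name and compare across reboots. This is a diagnostic sketch, not part of the daemon: list_hwmon, tctl_celsius, and the HWMON_ROOT override are names made up here. The only facts it leans on are that each hwmon directory has a name file and that hwmon temperature inputs are in millidegrees Celsius.

```shell
#!/bin/bash
# Sketch: show which hwmon index each driver landed on this boot.
# HWMON_ROOT is overridable (an assumption, for dry runs against a scratch dir).
HWMON_ROOT="${HWMON_ROOT:-/sys/class/hwmon}"

# Print "hwmonN drivername" for every hwmon device that exposes a name.
list_hwmon() {
  local h
  for h in "$HWMON_ROOT"/hwmon*; do
    [ -r "$h/name" ] || continue
    printf '%s %s\n' "${h##*/}" "$(cat "$h/name")"
  done
}

# Print a temperature in whole degrees C from a temp*_input path ($1).
# hwmon reports millidegrees, so divide by 1000 (integer division truncates).
tctl_celsius() {
  local milli
  milli=$(cat "$1") || return 1
  echo $(( milli / 1000 ))
}
```

Running list_hwmon before and after a reboot shows whether k10temp moved, which is the reason to match on the name rather than hard-coding an index.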

Damen’s daemon

It’s a graded fan curve on Tctl. Below 75°C the fans go back to auto and the box stays quiet; from 75°C up it steps through fan levels, hitting full blast by 85°C - comfortably under the 100°C TjMax, the junction-temperature limit where the chip starts cutting power to save itself:

CURVE=(75:2 79:3 82:4 85:5)   # Tctl °C : fan level; below 75 -> auto
HYST=3                        # downward deadband so it doesn't flap at a boundary
INTERVAL=3                    # seconds between polls

That’s the entire configuration that affects the fans; the fourth tunable in the script, HEARTBEAT, only controls how often it logs. The rest is plumbing: poll Tctl, compute the target level, and if it changed, write the level to all three fans (or hand them back to auto when it cools off). A 3°C downward deadband stops it oscillating when the temperature parks on a band edge: warming ramps immediately, while cooling waits until Tctl has dropped more than 3°C below the current band’s lower edge before stepping down. Simulating a sweep makes the hysteresis obvious:

WARMING:  auto ... 75->lvl2  79->lvl3  82->lvl4  85->lvl5
COOLING:  lvl5 held to 82 ... 81->lvl3  75->lvl2  71->auto
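
For anyone who wants to poke at the curve without heating anything up, here is a minimal dry-run sketch of that sweep. It reuses the curve lookup and the deadband rule from the daemon, but the next_level and simulate helpers are names introduced here for illustration, and nothing in it touches sysfs or needs root.

```shell
#!/bin/bash
# Dry-run simulation of the curve plus hysteresis. No sysfs access, no root.
CURVE=(75:2 79:3 82:4 85:5)   # Tctl C : fan level; below 75 -> auto (0)
HYST=3                        # downward deadband in C

# Level the bare curve asks for at temperature $1 (0 = auto).
desired_level() {
  local t=$1 lvl=0 entry
  for entry in "${CURVE[@]}"; do
    if [ "$t" -ge "${entry%:*}" ]; then lvl=${entry#*:}; fi
  done
  echo "$lvl"
}

# Next level given temperature $1 and current level $2, applying the same
# downward-deadband rule as the daemon's poll loop.
next_level() {
  local t=$1 cur=$2 want
  want=$(desired_level "$t")
  if [ "$want" -lt "$cur" ] &&
     [ "$(desired_level "$(( t + HYST ))")" -ge "$cur" ]; then
    want=$cur   # still within HYST of the current band: hold
  fi
  echo "$want"
}

# Walk temperatures in order, printing only the transitions as "temp->level".
# $1 = starting level, remaining args = temperatures.
simulate() {
  local cur=$1 t new
  shift
  for t in "$@"; do
    new=$(next_level "$t" "$cur")
    if [ "$new" -ne "$cur" ]; then echo "$t->$new"; cur=$new; fi
  done
}

echo "WARMING: $(simulate 0 $(seq 60 90) | tr '\n' ' ')"
echo "COOLING: $(simulate 5 $(seq 90 -1 60) | tr '\n' ' ')"
```

With the curve above it prints 75->2 79->3 82->4 85->5 on the way up and 81->3 75->2 71->0 on the way down (0 meaning auto). Note that cooling from level 5 goes straight to level 3: by the time the deadband releases at 81°C, the bare curve already wants 3, so level 4 is skipped. Editing CURVE or HYST here is a cheap way to try a different curve before installing it.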

It runs as a systemd service with Restart=always, and it deliberately does not restore auto when it exits - if the daemon dies, the fans hold where they were rather than dropping mid-load, and systemd just brings it back. Fail toward cold, not quiet.

Here it is doing exactly that. I installed the service, then drove a sustained 35B benchmark and watched its journal:

strix-fan-daemon started; watching hwmon3/temp1_input; curve=75:2 79:3 82:4 85:5 interval=3s hyst=3C
Tctl 50C -> fans AUTO
Tctl 75C -> fans level 2
Tctl 79C -> fans level 3
Tctl 63C -> fans AUTO

Idle, the fans sit on auto. The bench drives the die temp up, and at 75C the daemon takes over: level 2, then level 3 at 79C. Level 3 turned out to be enough to hold the box in the high 70s for the rest of the run, so it never had to reach for 4 or 5. When the load stopped and the die fell back through the floor, it handed the fans back to auto. And the whole time, power_mode stayed exactly where I left it (performance) - the daemon only ever touches the fans, never the power profile.

The code

Here’s everything: the three files plus an installer. They assume the ec_su_axb35 driver is already installed and loaded - if it isn’t, set that up from the strixhalo.wiki guide first. If you have the same box you can lift these as-is; on a different EC the shape carries over even though the sysfs paths won’t.

The daemon itself, strix-fan-daemon.sh:

#!/bin/bash
# strix-fan-daemon - graded Tctl-driven fan curve for the Strix Halo.
#
# Automates, as a service, what fan-inference.sh does by hand: it watches the CPU
# die temperature and ramps the fans so the box stops cooking itself when someone
# forgets to crank them before a benchmark. It touches ONLY the fans
# (fanN/{mode,level}); it never changes apu/power_mode, ryzenadj, or any power limit.
#
# The spec:
#   - graded curve (not binary)
#   - driven by CPU die temp, k10temp / Tctl
#   - full fans (level 5) at >=85C, release to auto below 75C
#   - fans only, no ryzenadj
#
# Runs as root via systemd (strix-fan-daemon.service).

set -u

EC="/sys/class/ec_su_axb35"

# ---- tunables (the whole config is these four things) ----------------------
INTERVAL=3           # seconds between Tctl polls
HYST=3               # downward deadband, in C, to stop the fans flapping at a boundary
HEARTBEAT=20         # log a heartbeat every N polls even when nothing changes

# Graded curve. Each entry "MIN_TEMP:LEVEL": at Tctl >= MIN_TEMP (and below the
# next entry) the fans are pinned to LEVEL (1-5) in fixed mode. Below the lowest
# MIN_TEMP they drop back to auto (quiet idle). power_mode is never touched. The 75
# floor and the 85->5 top are the anchors; 79/82 are the ramp in between.
CURVE=(75:2 79:3 82:4 85:5)
# ----------------------------------------------------------------------------

log() { echo "$(date '+%F %T') $*"; }

if [ "${EUID:-$(id -u)}" -ne 0 ]; then
  echo "strix-fan-daemon must run as root (it writes $EC)." >&2
  exit 1
fi

# Resolve the k10temp Tctl input by hwmon *name* (indices move across reboots).
find_tctl() {
  local h lab
  for h in /sys/class/hwmon/hwmon*; do
    [ -r "$h/name" ] || continue
    [ "$(cat "$h/name")" = "k10temp" ] || continue
    for lab in "$h"/temp*_label; do
      [ -r "$lab" ] || continue
      if [ "$(cat "$lab")" = "Tctl" ]; then echo "${lab%_label}_input"; return 0; fi
    done
    [ -r "$h/temp1_input" ] && { echo "$h/temp1_input"; return 0; }
  done
  return 1
}

# Desired fan level for a given temp (0 = auto/idle).
desired_level() {
  local t=$1 lvl=0 entry min level
  for entry in "${CURVE[@]}"; do
    min=${entry%:*}; level=${entry#*:}
    [ "$t" -ge "$min" ] && lvl=$level
  done
  echo "$lvl"
}

apply_level() {            # $1 = level (0 = auto). Fans only - never touches apu/power_mode.
  local f
  if [ "$1" -eq 0 ]; then
    for f in fan1 fan2 fan3; do echo auto > "$EC/$f/mode"; done
  else
    for f in fan1 fan2 fan3; do
      echo fixed > "$EC/$f/mode"; echo "$1" > "$EC/$f/level"
    done
  fi
}

TCTL=$(find_tctl) || { log "FATAL: could not find k10temp/Tctl in /sys/class/hwmon"; exit 1; }
log "strix-fan-daemon started; watching $TCTL; curve=${CURVE[*]} interval=${INTERVAL}s hyst=${HYST}C"

cur=-1            # last level actually applied (-1 = unknown/forces first apply)
ticks=0
while :; do
  if ! milli=$(cat "$TCTL" 2>/dev/null) || [ -z "$milli" ]; then
    log "WARN: Tctl read failed; leaving fans unchanged"; sleep "$INTERVAL"; continue
  fi
  t=$(( milli / 1000 ))

  want=$(desired_level "$t")
  if [ "$want" -lt "$cur" ]; then
    # Cooling: only step down if we'd still want a lower level HYST degrees hotter
    # (otherwise hold, to avoid flapping at a band edge).
    want_warm=$(desired_level "$(( t + HYST ))")
    [ "$want_warm" -ge "$cur" ] && want=$cur
  fi

  if [ "$want" -ne "$cur" ]; then
    apply_level "$want"
    if [ "$want" -eq 0 ]; then log "Tctl ${t}C -> fans AUTO"; else log "Tctl ${t}C -> fans level $want"; fi
    cur=$want
    ticks=0
  else
    ticks=$(( ticks + 1 ))
    if [ "$ticks" -ge "$HEARTBEAT" ]; then
      log "Tctl ${t}C (holding $( [ "$cur" -eq 0 ] && echo AUTO || echo "level $cur" ))"; ticks=0
    fi
  fi
  sleep "$INTERVAL"
done

The systemd unit, strix-fan-daemon.service. The comment is the one deliberate choice in here: if the daemon dies, do not drop the fans back to auto.

[Unit]
Description=Strix Halo graded Tctl fan-curve daemon
Documentation=/blog-posts
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/local/sbin/strix-fan-daemon.sh
Restart=always
RestartSec=2
# Fail safe: if the daemon dies we do NOT restore auto (that would drop the fans
# mid-load); systemd just restarts it, and fans hold wherever they were.
Nice=-5

[Install]
WantedBy=multi-user.target

And the idempotent installer, install.sh:

#!/bin/bash
# Idempotent installer for the Strix fan-curve daemon. Run as root on the system:
#   sudo ./install.sh
# Safe to re-run; it just refreshes the files and reloads the service.
set -euo pipefail

SRC="$(cd "$(dirname "$0")" && pwd)"

install -m 0755 "$SRC/strix-fan-daemon.sh"      /usr/local/sbin/strix-fan-daemon.sh
install -m 0644 "$SRC/strix-fan-daemon.service" /etc/systemd/system/strix-fan-daemon.service

systemctl daemon-reload
systemctl enable strix-fan-daemon.service
systemctl restart strix-fan-daemon.service   # restart so a re-install actually picks up script changes
sleep 1
systemctl status --no-pager strix-fan-daemon.service || true
echo
echo "Installed. Follow it live with:  journalctl -fu strix-fan-daemon"

Drop those three in a directory, then:

sudo ./install.sh
journalctl -fu strix-fan-daemon      # watch it work

The real bug was man all along

Here’s the part I’d want past-Damen to read. Every one of these three deaths already had a fix written down - ryzenadj after the first, fan-pinning and then a whole bench wrapper after the second. What kept failing was process: a fix you have to remember to apply is a fix that eventually doesn’t get applied.

So now, the protection runs as a service the box starts itself - watching temperature continuously rather than waiting on me!

The usual caveats

Fans only - this does not cover the burst-spike death. The very first thermal crash wasn’t a slow cook; it was a 165W CPU+GPU transient that tripped a hard shutdown faster than any fan can respond. Fans can’t save you from that - only a power cap can. This daemon is deliberately fans-only, so for burst protection you still want the ryzenadj cap from the first post. What it fixes is the sustained-load death (the one that ate the 262K bench), which is the one that actually keeps happening to me.
It’s reactive. The curve engages as Tctl climbs past 75, so there’s a brief warm-up transient at the start of a heavy load before the fans spin up - unlike pinning them by hand before you start. Fine for sustained benchmarks; if you wanted to close that gap you’d add a load-aware floor on top of the curve.
One box, one EC. All of this is specific to the ec_su_axb35 driver on this particular miniPC. The shape of the solution generalizes; the sysfs paths and fan levels may not.
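
On the "it’s reactive" caveat: a load-aware floor could look something like the sketch below. None of this is in the daemon. The FLOOR thresholds are illustrative guesses that have not been tuned on real hardware, the helper names are made up, and using amdgpu’s gpu_busy_percent as the load signal is an assumption (CPU utilization or package power would be equally reasonable inputs).

```shell
#!/bin/bash
# Sketch: a load-aware floor layered on top of the temperature curve.
# FLOOR entries are "MIN_BUSY_PERCENT:LEVEL". Values are untuned placeholders.
FLOOR=(50:2 90:3)

# Minimum fan level for a given busy percentage $1 (0 = no floor).
floor_level() {
  local busy=$1 lvl=0 entry
  for entry in "${FLOOR[@]}"; do
    if [ "$busy" -ge "${entry%:*}" ]; then lvl=${entry#*:}; fi
  done
  echo "$lvl"
}

# Final level = the higher of the curve's level ($1) and the load floor ($2).
combine_levels() {
  if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Assumed load signal: amdgpu's gpu_busy_percent. Prints 0 if none is readable,
# so a missing file degrades to "no floor" rather than an error.
read_gpu_busy() {
  local f v
  for f in /sys/class/drm/card*/device/gpu_busy_percent; do
    if [ -r "$f" ] && v=$(cat "$f" 2>/dev/null) && [ -n "$v" ]; then
      echo "$v"
      return 0
    fi
  done
  echo 0
}
```

One place to apply it would be in the poll loop right after the curve lookup, along the lines of want=$(combine_levels "$want" "$(floor_level "$(read_gpu_busy)")"), so a heavy load lifts the fans before Tctl has climbed to 75. It still does nothing for the burst-spike case, which needs the power cap.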

General neat learnings

If your fans have no hwmon PWM, the standard tooling won’t help - find the vendor EC’s sysfs and drive it yourself. It’s less work than it sounds: this whole daemon is one poll loop and four config values.
Watch the temperature that kills the box, not the convenient one. Tctl, not the EC’s polite low-reading sensor.

The least reliable component in this miniPC was always the human holding the SSH session. That job now belongs to a hundred lines of bash, and bash doesn’t forget to turn the fans on.