The game that rebooted the hypervisor

A friend's gaming VM hard-reset its Proxmox host the instant a game launched, with nothing in the logs. The cause was below every layer we kept editing: a power transient at game launch that tripped a PCIe fault and reset the box.

A guest isn't supposed to be able to reboot its hypervisor. That's the whole premise: the VM lives in a box, and the worst it can do is crash itself. So when a friend told me his Arch gaming VM rebooted the whole Proxmox host every time he launched a game, my first read was that he'd misdiagnosed it. The host was probably crashing for its own reasons and the game was a coincidence.

It wasn't. Launch the game, and a fraction of a second later the entire physical machine power-cycled. Not the VM. The box. SSH sessions to the host dropped, fans spun down, it came back up cold a minute later like someone had yanked the cord. The VM was reaching through the hypervisor and pulling the plug on the hardware underneath it. That shouldn't be possible.

The setup that made it weird

The card is a Tesla P100. That detail matters more than it first looks. It's a datacenter compute card, Pascal generation, no display outputs whatsoever. You can't plug a monitor into it. So this wasn't a normal gaming-GPU passthrough where the VM drives a screen directly. The P100 was passed through via VFIO as a pure compute and render device, the VM ran a QXL virtual display for the desktop, and games were streamed out over the network. A GPU with no picture, rendering frames that only ever existed as a video stream.

That's a weird enough rig that there's no forum post that matches it. The shape of the problem was mine to find.

Capture evidence before you can lose it

The first real obstacle was that the host died silently. No kernel panic, no oops, no stack trace. The journal stopped mid-line and the next entry was the next boot. Whatever happened was fast and low enough that the kernel never got to write anything before power was gone.

You can't debug a machine that erases its own last words. Before touching anything else, I set up netconsole to ship kernel messages off-box over UDP to another machine in real time. Anything the kernel printed in its final moments would already be sitting in a log somewhere else before the host died.

netconsole-setup.sh
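# Ship kernel messages off-box over UDP before the host can die with them.
# Format: SRC_PORT@SRC_IP/DEV,DST_PORT@DST_IP/DST_MAC
# HOST_IP, COLLECTOR_IP and COLLECTOR_MAC are placeholders; eth0 is the host NIC.
modprobe netconsole \
  netconsole="6666@HOST_IP/eth0,6666@COLLECTOR_IP/COLLECTOR_MAC"

# netconsole only forwards what the console loglevel lets through, so raise it.
dmesg -n 8

# On the collector, just listen (some netcat variants want `-l -p 6666`):
#   nc -u -l 6666 | tee host-kernel.log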

Unglamorous, but everything else depended on it. With netconsole running, the next launch gave me something to read: a PCIe fault on the bus the P100 lived on, logged right before the lights went out. Not a software error. A hardware-level fault on the link, escalating to something the platform decided to handle by resetting.
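
Once the collector has a capture, the PCIe lines are the ones to read first. A minimal sketch, assuming the capture sits in host-kernel.log as in the listener command; the helper name and the match patterns are my own guesses at what is worth filtering for, not the exact lines this host printed:

```shell
# Hypothetical helper: print the last PCIe/AER-related kernel lines from a
# netconsole capture. The patterns are assumptions; widen them as needed.
pcie_tail() {
  log="$1"
  count="${2:-20}"
  grep -E 'AER|pcieport|PCIe Bus Error' "$log" | tail -n "$count"
}

# Usage on the collector:
#   pcie_tail host-kernel.log 20
```

Reading from the end of the file matters here: the interesting lines are the last ones the host managed to send before it lost power.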

Look below the layer you can edit

When a guest can reboot its host, stop looking at the guest. The boundary between VM and hypervisor is enforced in hardware, so an event that crosses it is happening below the software you keep editing. The bug is in the PCIe layer, the firmware, or the device, not in the game's launch options.

The ladder of false fixes

A PCIe fault on a passed-through card has a well-worn list of suspects. I worked down it. Disabled AER with pci=noaer. Turned off ASPM with pcie_aspm=off. Disabled Downstream Port Containment. Disabled SERR signalling on the bridge. Set a power cap on the card. Ran memtest overnight in case the host RAM was marginal.

kernel cmdline, the graveyard of guesses
pci=noaer pcie_aspm=off ... # each one "helped" for a while

Some of them made the crash less frequent. None of them made it stop. And every time the interval got longer, there was a pull to declare it fixed and walk away. That pull is the dangerous part. A bug that now takes ten launches to reproduce instead of one isn't fixed. It's hidden.

At some point I had to say it plainly: we might have suppressed the symptom without touching the cause. Disabling AER doesn't make faults stop. It makes the kernel stop reporting them. If the underlying fault was still firing and I'd just gagged the messenger, I'd made the system quieter and equally broken, with the added problem that I wouldn't see it coming. Saying that out loud is what kept the investigation alive instead of shipping a placebo and waiting for the next mystery.
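
One way to check whether a fault is still firing after the kernel has been told to stop reporting it is to read the device's status registers directly. A sketch, assuming `lspci` from pciutils run as root; the address 0000:01:00.0 is a placeholder rather than this P100's real slot, and the helper name is mine. Whether the bits keep latching with AER disabled depends on the platform and on who owns error handling, so a clean result is weak evidence at best.

```shell
# Hypothetical helper: keep only the error-status lines from `lspci -vvv`.
# DevSta is the PCIe device status; UESta/CESta are the AER uncorrectable
# and correctable status registers. A "+" after a name means that bit is set.
pcie_error_status() {
  grep -E '(DevSta|UESta|CESta):'
}

# Usage (as root; the address is a placeholder, find yours with lspci):
#   lspci -vvv -s 0000:01:00.0 | pcie_error_status
```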

What was actually happening

The clue was in the timing. The fault didn't fire at idle and didn't fire under steady load. It fired at the transition: the exact instant a game started and slammed the GPU from nearly idle to full tilt.
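
One way to see that transition is to sample the card's clock and power draw across a launch. A sketch, assuming nvidia-smi inside the guest and its `clocks.gr` and `power.draw` query fields; the 300 MHz threshold and the helper name are arbitrary choices of mine. Sampling every 100 ms only shows the ramp, not the much faster electrical transient behind it.

```shell
# Hypothetical helper: read "clock_mhz, power_w" CSV rows on stdin and print
# every sample where the graphics clock rose by more than a threshold (MHz)
# since the previous sample.
clock_steps() {
  threshold="${1:-300}"
  awk -F', *' -v t="$threshold" '
    NR > 1 && ($1 - prev) > t {
      printf "step %d -> %d MHz at sample %d (%s W)\n", prev, $1, NR, $2
    }
    { prev = $1 }
  '
}

# Usage: sample every 100 ms while launching the game, then Ctrl-C.
#   nvidia-smi --query-gpu=clocks.gr,power.draw \
#     --format=csv,noheader,nounits -lms 100 | tee samples.csv
#   clock_steps 300 < samples.csv
```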

The P100 boosts its clocks aggressively on load. On a sudden load step, the card ramps clocks and power draw almost instantly, and that transient spike was enough to trip a PCIe fault on this particular platform. The card was asking for more current more quickly than the slot or the link could cleanly deliver, and the resulting fault was severe enough that the platform's answer was a reset.

Every false fix was downstream of this. AER, DPC, and SERR control how the system reports and reacts to faults, and ASPM only manages link power states when the link is idle. None of them touched the thing causing the fault, which was the clock-and-power transient at load onset. That's why each one helped a little and none helped enough. I'd been tuning the alarm system while the actual fault went untouched.

The fix was to stop the transient. Pin the GPU clocks flat so there is no sudden ramp to overshoot, and apply the lock at boot with nvidia-smi from a systemd service so it survives reboots.

/etc/systemd/system/gpu-clock-lock.service
[Unit]
Description=Pin Tesla P100 clocks to stop load-transition power faults
After=multi-user.target
 
[Service]
Type=oneshot
# Persistence mode on, then lock the GPU graphics clock flat.
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 1189,1189
RemainAfterExit=yes
 
[Install]
WantedBy=multi-user.target

Lock the clocks flat and the card no longer ramps on load onset, the transient disappears, and the fault has nothing to trigger it. The crash stopped completely. Not less often. It stopped.
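
A quick way to confirm the lock is holding is to sample the clock across a launch and check that it never moves. A sketch, assuming the unit name from the service file above and nvidia-smi's `clocks.gr` query field; the helper name and the 100-sample window are mine:

```shell
# Hypothetical check: succeed only if every clock sample on stdin is the
# same value, i.e. the clock never moved during the capture.
clocks_flat() {
  distinct="$(sort -u | grep -c .)"
  [ "$distinct" -eq 1 ]
}

# Usage: enable the unit, then sample while launching a game.
#   systemctl enable --now gpu-clock-lock.service
#   nvidia-smi --query-gpu=clocks.gr --format=csv,noheader,nounits -lms 100 \
#     | head -n 100 | clocks_flat && echo "clock stayed flat"
```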

The secondary yaks

The host crash was the headline, but getting the rig fully working meant clearing a few other things the crashes had either caused or hidden.

The streaming host was Sunshine, and NVENC encoding refused to work, throwing cudaErrorNoKernelImageForDevice from the color-convert kernel. That error is specific and accurate: the binary has no compiled code image for the GPU's compute capability. CUDA 13 dropped Pascal support, and the P100 is sm_60, so a stock build had no sm_60 image to run. The fix was to build Sunshine from source against CUDA 12.9, the last toolkit that still emits Pascal code.

sunshine-build.sh
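# CUDA 13 dropped Pascal (the P100 is sm_60). Build against 12.9, which still emits it.
# CMAKE_CUDA_COMPILER is what CMake's native CUDA support reads; the nvcc path
# is assumed from the toolkit root. CUDA_TOOLKIT_ROOT_DIR covers legacy FindCUDA.
cmake -DCMAKE_CUDA_ARCHITECTURES=60 \
  -DCMAKE_CUDA_COMPILER=/opt/cuda-12.9/bin/nvcc \
  -DCUDA_TOOLKIT_ROOT_DIR=/opt/cuda-12.9 ..
make -j"$(nproc)"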

Then I pinned the CUDA toolkit with IgnorePkg in pacman.conf. Without the pin, a routine update can silently pull CUDA 13 back in, and NVENC breaks again weeks later with the same cryptic error and no obvious connection to anything that changed.
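
A pin like that is easy to lose in a config cleanup, so it is worth being able to check for it. A sketch, assuming the toolkit package is named `cuda` (adjust to whatever yours is called); the helper name is hypothetical:

```shell
# Hypothetical check: succeed if the given package is named on an
# IgnorePkg line of a pacman.conf-style file. Commented-out lines are skipped.
is_pinned() {
  pkg="$1"
  conf="${2:-/etc/pacman.conf}"
  grep -E '^[[:space:]]*IgnorePkg[[:space:]]*=' "$conf" \
    | sed 's/^[^=]*=//' | tr -s ' \t' '\n' | grep -qxF "$pkg"
}

# Usage:
#   is_pinned cuda || echo "cuda is not in IgnorePkg; an update can replace it"
```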

Two more things fell out after that. The earlier host resets had corrupted a Proton prefix, and the resulting breakage looked like a game launch-options problem. That's exactly what you get when an unrelated crash damages state and the damage surfaces somewhere completely different. Wiping and rebuilding the prefix cleared it. And the Fallout 4 next-gen launcher refused to start because it called a Windows API, RtlGetDeviceFamilyInfoEnum, that only Proton Experimental had implemented at the time. The fix was to point that title at Proton Experimental.

What it adds up to

When a guest can reboot its host, the bug is below the software you keep poking. No amount of editing launch options or toggling PCIe error-handling flags will reach it. What worked was the sequence: capture evidence before the machine can destroy it, refuse to call a suppressed symptom a fix, and follow the timing of the fault down to the layer that actually owns it.

Finding the real root cause didn't just stop one crash. It explained the corrupted Proton prefix too, and turned a machine my friend had half written off back into one he trusts.

The clock lock is an nvidia-smi one-liner. Getting to the point where that one-liner was the obvious right move took everything else.