The furnace controller that deadlocked when the WiFi dropped
An RP2040 furnace controller would freeze at random, always when the WiFi dropped. The culprit was a connectivity probe that opened a TCP connection to decide whether it could open one. A timeout would have hidden it; removing the probe fixed it.
A furnace controller that locks up isn't a bug you ship. It's a device whose entire job is to keep running: holding a temperature, watching a thermocouple, ready to throw an emergency cutoff if things go hot. When it stops, it stops doing all of that at once, silently, and you find out the way you least want to find out.
The symptom showed up every so often: a deadlock that lined up with the WiFi dropping. I didn't love it. Intermittent is bad. Correlated with the network is worse, because it means the failure lives in the part of the firmware that has no business touching the control path in the first place.
What the box is
The controller is a Pico W: RP2040 with the CYW43 WiFi part on board. It reads temperature from a MAX6675 thermocouple over SPI, drives two relays (one for the heating element, one for an emergency overheat cutoff), and runs timeout logic to bank embers instead of letting the fire die. There's a small web UI so you can check the temperature and nudge setpoints from a phone. WiFi exists for the UI. It's a convenience. Emphatically not part of the safety loop, or it wasn't supposed to be.
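The thermocouple read boils down to decoding one 16-bit frame. A minimal sketch, going by the MAX6675 datasheet's frame layout (a 12-bit count in bits 14..3 at 0.25 °C per step, bit 2 set when the thermocouple input is open). The names tc_reading_t and max6675_decode are made up for illustration and are not the controller's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of decoding one 16-bit MAX6675 frame (hypothetical type). */
typedef struct {
    bool  valid;    /* false if the thermocouple input reads as open */
    float celsius;  /* meaningful only when valid is true */
} tc_reading_t;

/* Decode a raw frame as clocked out over SPI, MSB first.
 * Bit 2 flags an open thermocouple input; bits 14..3 hold a 12-bit
 * count in units of 0.25 degrees C. */
tc_reading_t max6675_decode(uint16_t raw) {
    tc_reading_t r = { .valid = false, .celsius = 0.0f };
    if (raw & 0x0004u) {
        return r;  /* open circuit: report a fault, not a temperature */
    }
    r.valid = true;
    r.celsius = (float)((raw >> 3) & 0x0FFFu) * 0.25f;
    return r;
}
```

An open-thermocouple frame has to be handled as a fault rather than as 0 °C. A controller that reads "cold" from a disconnected probe will keep the element on.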
Two cores. core0 ran the control logic and the relays. core1 ran the network stack and the web server. Splitting them meant the network could do whatever it wanted without ever stalling the part that decides whether the element is on or off.
That was the theory. The deadlock said otherwise.
Finding the stall
The "I think it's the WiFi" hunch was the right thread to pull. I added a heartbeat: core0 toggled a flag every control cycle, and a watchdog-adjacent check on the other side noted when the flag stopped moving. Then I pulled the network cable on the access point and waited.
It froze. Not just the web server, everything. The control loop stopped cycling. On a furnace.
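The heartbeat check is small enough to show. This is a sketch of the idea, not the exact code: it uses a counter in place of the toggled flag, and heartbeat_tick and heartbeat_stalled are illustrative names. The counter has a single writer, so a plain atomic load and store is enough and no atomic read-modify-write is needed:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Advanced by the control loop once per completed cycle. */
static _Atomic uint32_t g_heartbeat;

/* Call at the end of every control cycle. Only the control core writes
 * this counter, so load-then-store is safe here. */
void heartbeat_tick(void) {
    uint32_t next = atomic_load_explicit(&g_heartbeat, memory_order_relaxed) + 1u;
    atomic_store_explicit(&g_heartbeat, next, memory_order_release);
}

/* Call periodically from the monitoring side, at an interval longer than
 * one control cycle. Returns true if the counter has not moved since the
 * previous call, meaning the control loop has stopped cycling. */
bool heartbeat_stalled(void) {
    static uint32_t last_seen;
    uint32_t now = atomic_load_explicit(&g_heartbeat, memory_order_acquire);
    bool stalled = (now == last_seen);
    last_seen = now;
    return stalled;
}
```

The monitor only compares against its own previous observation, so it needs no clock and cannot block.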
I walked the network code looking for what runs when the link goes away, and found this:
    // Called periodically to decide whether we still have a usable link.
    static bool wifi_check_connectivity(void) {
        struct tcp_pcb *pcb = tcp_new();
        ip_addr_t probe;
        ipaddr_aton(PROBE_HOST, &probe);
        // tcp_connect with no timeout. If the network is gone, the SYN
        // goes nowhere, nothing ever calls the connected callback, and
        // this path waits. Forever.
        tcp_connect(pcb, &probe, PROBE_PORT, connected_cb);
        return wait_for_connected(pcb);
    }
To answer the question "is my network up," the firmware was opening a TCP connection to a probe host. When the network was reachable, that returned quickly and nobody noticed. When it was gone, the SYN went into the void, connected_cb never fired, and wait_for_connected blocked with nothing to wake it.
The reason this killed core0 and not just the network core is that lwIP isn't thread-safe: its core expects to be driven from one context at a time. Shared state between the stack and the control side meant this routine sat on a path the control loop could end up waiting behind. A blocking call with no upper bound, on a device that must never stop, reachable from the safety loop. Three sins in one function.
The thing about blocking calls in safety code
A blocking call with no timeout is a deadlock you haven't triggered yet. On a desktop it's a spinner. On a furnace controller it's the element stuck in whatever state it was in when the call hung, with nothing left running to change its mind. The hazard isn't the WiFi dropping. The hazard is firmware that treats a missing network as a reason to wait.
The fix that put the deadlock back
The obvious move is to bound the call. Add a timeout: if the connect doesn't complete in, say, two seconds, give up and move on. The deadlock becomes a two-second hiccup. On this device, survivable.
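The shape of that bound, independent of any lwIP API, is a deadline checked inside the wait loop. A minimal sketch with hypothetical names (deadline_t, deadline_start, deadline_expired). The caller supplies a monotonic millisecond clock, which on the Pico SDK could be to_ms_since_boot(get_absolute_time()):

```c
#include <stdbool.h>
#include <stdint.h>

/* A start time plus an allowed duration, both in milliseconds. */
typedef struct {
    uint32_t start_ms;
    uint32_t timeout_ms;
} deadline_t;

/* Capture the current time and the budget for the operation. */
deadline_t deadline_start(uint32_t now_ms, uint32_t timeout_ms) {
    deadline_t d = { now_ms, timeout_ms };
    return d;
}

/* True once at least timeout_ms have elapsed since the deadline started.
 * Unsigned subtraction keeps the comparison correct when the millisecond
 * counter wraps around. */
bool deadline_expired(deadline_t d, uint32_t now_ms) {
    return (uint32_t)(now_ms - d.start_ms) >= d.timeout_ms;
}

/* Intended use, with now_ms() and poll() standing in for real hooks:
 *
 *     deadline_t d = deadline_start(now_ms(), 2000u);
 *     while (!connected) {
 *         if (deadline_expired(d, now_ms())) return false;
 *         poll();
 *     }
 */
```

Comparing elapsed time rather than absolute timestamps is what keeps the bound honest across a counter rollover.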
So I went to add it. The lwIP version I was building against didn't expose the connect timeout the way I expected. The option I reached for wasn't there in the form I wanted, and the code I wrote to use it didn't compile. I tried the next obvious spelling. Also no. The longer I poked at the version-specific surface of the TCP API, the clearer it got that I was arguing with the library instead of fixing anything.
Then it got worse than not compiling. The intermediate version I got to build, while I was wiring up my own timeout bookkeeping around the connect, still had a path where callback ordering left the PCB in a state that my wait loop never exited cleanly from. Adding machinery to a blocking call got me a controller that could still wedge, now with more code. Back to square one, except square one had fewer lines.
That was the useful failure. I was so committed to making the blocking call behave that I never asked whether it should exist.
Do not probe. Check.
A furnace controller doesn't need to dial out to know if its link is up. That was the assumption hiding under everything. Opening a TCP connection to a probe host conflates two different questions: "is my radio associated and do I have an IP" and "can I reach the public internet." The controller only ever needed the first one, and the first one doesn't require sending a single packet. The chip already knows.
The CYW43 driver reports link status directly, and lwIP's netif carries the interface state. Read both. No socket, no SYN, no callback, no waiting on a remote host that may or may not be there.
    #include "pico/cyw43_arch.h"
    #include "lwip/netif.h"

    // No TCP. No probe host. Ask the parts that already know.
    static bool wifi_link_up(void) {
        // The WiFi-level status reports CYW43_LINK_JOIN once associated;
        // CYW43_LINK_UP is only returned by cyw43_tcpip_link_status().
        int link = cyw43_wifi_link_status(&cyw43_state, CYW43_ITF_STA);
        if (link != CYW43_LINK_JOIN)
            return false;
        // Associated is not the same as configured. Confirm the netif
        // is up and actually holds an address before calling it usable.
        struct netif *nif = &cyw43_state.netif[CYW43_ITF_STA];
        return netif_is_up(nif) && !ip4_addr_isany(netif_ip4_addr(nif));
    }
A couple of register and struct reads. It can't block, because there's nothing to wait for: the answer already exists in memory, maintained by the driver as association state changes. When the network drops, wifi_link_up returns false immediately, the web server backs off, and the control loop never knows anything happened.
The right fix didn't bound the failure. It removed the entire class of failure. No timeout to tune because there's no call that can hang. Check state, don't probe.
The other hazard the freeze was hiding
Two cores sharing state is its own way to get hurt, and the connectivity bug had distracted me from it. If core1 is writing the latest temperature or a setpoint while core0 reads it, a torn read is a real possibility, and on this device a torn setpoint is a relay decision made on garbage.
Where the shared state was a simple flag, I used atomic_bool rather than reaching for a mutex. An atomic flag has no lock to hold and no lock to forget to release, and forgetting to release a lock is the same deadlock I'd just spent an afternoon removing, wearing a different hat. Where a mutex was genuinely needed, for the multi-field state that has to be consistent as a unit, I kept the critical section as short as possible: copy in, copy out, get off the lock. The longer you hold a mutex on a control loop, the more it starts to look like a blocking call waiting to happen.
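For the multi-field case, the copy-in, copy-out pattern can be sketched as below. The struct fields, values, and function names are hypothetical. To keep the sketch self-contained, a C11 atomic_flag spinlock stands in for the lock; on the Pico I'd expect the SDK's mutex_t from pico/mutex.h (mutex_enter_blocking, mutex_exit) in its place:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* State that must be read and written as a unit (illustrative fields). */
typedef struct {
    float setpoint_c;
    float hysteresis_c;
    bool  bank_mode;
} control_params_t;

static control_params_t g_params = { 650.0f, 5.0f, false };
static atomic_flag g_params_lock = ATOMIC_FLAG_INIT;  /* stand-in for mutex_t */

static void params_lock(void) {
    while (atomic_flag_test_and_set_explicit(&g_params_lock, memory_order_acquire)) {
        /* spin: the holder only copies a small struct, so this is brief */
    }
}

static void params_unlock(void) {
    atomic_flag_clear_explicit(&g_params_lock, memory_order_release);
}

/* Writer side: copy the whole struct in, then get off the lock. */
void params_publish(const control_params_t *in) {
    params_lock();
    g_params = *in;
    params_unlock();
}

/* Reader side: copy the whole struct out, then work on the private
 * snapshot with the lock already released. */
control_params_t params_snapshot(void) {
    params_lock();
    control_params_t copy = g_params;
    params_unlock();
    return copy;
}
```

Nothing inside the critical section can fail, wait, or return early, so there is no path that exits with the lock still held.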
    #include <stdatomic.h>
    #include <stdbool.h>

    // Cross-core flags that don't need a lock at all.
    static atomic_bool g_overheat_latched = false;
    static atomic_bool g_element_enabled = false;

    // Read from either core, no lock, no torn read, nothing to hold.
    bool element_is_enabled(void) {
        return atomic_load_explicit(&g_element_enabled, memory_order_acquire);
    }
And then the part I won't ship a furnace controller without. A hardware watchdog. The RP2040 has one, and the deal is simple: the firmware has to pet it on a regular cadence, and if it ever stops, the chip resets itself. I wired the pet into the control loop's heartbeat, the same flag I'd used to catch the original freeze, so the watchdog is fed only when core0 is actually cycling. If core0 wedges for any reason I haven't foreseen, the watchdog fires, the chip reboots, the relays come up in their safe default, and the controller starts over.
    #include "pico/stdlib.h"
    #include "hardware/watchdog.h"

    void core0_control_loop(void) {
        // Second argument pauses the watchdog while a debugger has the core halted.
        watchdog_enable(WATCHDOG_TIMEOUT_MS, true);
        for (;;) {
            run_control_cycle();   // read thermocouple, decide relays
            watchdog_update();     // only reached if the cycle completed
            sleep_ms(CONTROL_PERIOD_MS);
        }
    }
The watchdog isn't an excuse for sloppy code. It's an admission that I can't prove the absence of every wedge. The firmware you trust is the firmware you haven't caught hanging yet.
What carried over
This controller has existed in three forms: the C version on the Pico SDK I've been quoting, a MicroPython rewrite, and a TinyGo one. I keep porting it partly to see what each toolchain makes easy and partly because I enjoy it. The deadlock lesson didn't care which language it was written in. A connectivity probe that opens a connection to decide whether it can open a connection is a logic error, and a logic error survives translation. The C version is where I got bitten. I went and checked the other two anyway, and the MicroPython port had the same shape of probe sitting there waiting.
Two rules came out of it. First, on a safety-critical device the fix is not to make a blocking call time out. It is to not make the blocking call. If you can read the state directly, read it; do not go probe the world to rediscover something the hardware already told you. Second, have a watchdog, and feed it from the work, not from a timer that keeps ticking whether or not the work is getting done. A watchdog petted by a free-running timer will happily keep a wedged furnace alive.