The $20 stratum 1 that chrony kept benching
I built a GPS-disciplined NTP server on a $20 ESP32 with single-nanosecond internal timing. Then chrony refused to use it. The fix was not in the clock. It was in everything I had assumed about serving the time.
Here's a question that sounds like it has an obvious answer: how good a clock can you build for twenty dollars?
The obvious answer is "not very." A real stratum 1 reference clock is a rubidium standard, or an OCXO holdover box, or a sawtooth-corrected PTP grandmaster sitting in a rack with a roof antenna. Those cost real money. An ESP32, a u-blox GPS module, and a WIZnet W5500 Ethernet chip cost about twenty dollars together. The conventional wisdom is that a microcontroller is fine for a hobby clock that's "good enough," and that you shouldn't expect it to stand next to the serious hardware.
I didn't like that answer, so I went to measure it. What I found was stranger than "the cheap clock is worse." The cheap clock was, internally, one of the best clocks on my network. And it was useless anyway, because I'd built an excellent clock and a mediocre NTP server, and those aren't the same thing.
What "stratum 1" actually has to mean
A stratum 1 server is one that gets its time directly from a reference source rather than from another NTP server. In practice that means a GPS receiver. The GPS module spits out two things: an NMEA sentence over UART that says, in human terms, "it is currently 14:02:53 UTC," and a pulse-per-second (PPS) line that goes high at the exact instant each second begins. The NMEA tells you which second it is. The PPS tells you precisely when that second starts. You need both.
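A minimal sketch of how the two signals can be combined, assuming the receiver sends each NMEA sentence after the PPS edge it labels (worth verifying for a given module). The `pps_fix_t` struct and `utc_ns_at` function are hypothetical names for illustration, not taken from the project's code.

```c
#include <stdint.h>

/* Hypothetical pairing of one PPS edge with the UTC second it began. */
typedef struct {
    uint64_t pps_ticks;  /* free-running counter value latched at the edge */
    uint64_t utc_sec;    /* Unix-epoch second named by the following NMEA sentence */
} pps_fix_t;

/*
 * Current UTC in nanoseconds: the NMEA second supplies the label,
 * the ticks elapsed since the PPS edge supply the sub-second part.
 * tick_hz is the counter frequency (80 MHz in the article).
 */
static uint64_t utc_ns_at(const pps_fix_t *fix, uint64_t ticks_now, uint64_t tick_hz)
{
    uint64_t elapsed    = ticks_now - fix->pps_ticks;
    uint64_t whole_sec  = elapsed / tick_hz;
    uint64_t frac_ticks = elapsed % tick_hz;

    return (fix->utc_sec + whole_sec) * 1000000000ULL
         + frac_ticks * 1000000000ULL / tick_hz;
}
```

Neither input is enough alone: without `utc_sec` the result has no absolute meaning, and without `pps_ticks` it is only as precise as the UART.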
The naive way to read PPS is to wire it to a GPIO pin and take an interrupt on the rising edge. The problem is that a GPIO interrupt on a general-purpose CPU isn't punctual. The interrupt has to be dispatched, the handler has to be scheduled, other interrupts can be in flight ahead of it. On an ESP32 that costs you somewhere between one and ten µs of jitter, and it's jitter, not a fixed offset, so you can't calibrate it out.
The ESP32 has a way around this that almost nobody uses for clocks. The MCPWM peripheral has a hardware capture unit. You point it at a GPIO, and when the edge arrives the hardware latches the value of a free-running timer into a register, in silicon, with zero software in the path. The CPU finds out later and reads the latched value. The edge is timestamped before the interrupt even exists.
// MCPWM capture: the hardware latches the timer on the PPS edge.
// The ISR runs later and just reads what silicon already recorded.
static bool IRAM_ATTR pps_capture_cb(mcpwm_cap_channel_handle_t ch,
                                     const mcpwm_capture_event_data_t *ed,
                                     void *user) {
    BaseType_t hp_task_woken = pdFALSE;
    // ed->cap_value was latched in hardware at the instant of the edge.
    // No dispatch jitter, no scheduling jitter. This is the whole trick.
    g_last_pps_ticks = ed->cap_value;
    g_pps_event = true;
    return hp_task_woken == pdTRUE;
}
The APB clock runs at 80 MHz, so each tick is 12.5 ns. That's the resolution of the capture. The interesting part is what the servo did with it once I had clean edges to discipline against.
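The tick arithmetic can be sketched in a few lines. This assumes the capture value is a 32-bit counter, which at 80 MHz wraps roughly every 53.7 seconds, so consecutive PPS edges are always less than one wrap apart. The function names are illustrative only.

```c
#include <stdint.h>

#define CAP_TICK_HZ 80000000u  /* 80 MHz capture clock, 12.5 ns per tick */

/* Ticks between two 32-bit capture values. Unsigned subtraction
 * gives the right answer across a single counter wrap. */
static uint32_t cap_ticks_between(uint32_t earlier, uint32_t later)
{
    return later - earlier;
}

/* Convert a tick count to whole nanoseconds (rounds down). */
static uint64_t cap_ticks_to_ns(uint32_t ticks)
{
    return (uint64_t)ticks * 1000000000ULL / CAP_TICK_HZ;
}
```

One tick converts to 12 ns and two ticks to 25 ns, which is the 12.5 ns quantization showing up in integer form.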
11 ns of RMS jitter on a 12.5 ns tick means the servo is averaging down below the quantization of its own clock, which is what a well-behaved PLL should do. Zero rejected pulses over 94,650 of them means the PPS line was clean and the outlier rejection never had to fire. By every internal measurement I had, the clock was solid.
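The servo and its outlier rejection aren't shown here, so the following is only a sketch of what a PPS gate of that kind might look like. The window size and every name in it are assumptions: an edge is accepted only if it arrives about one second of ticks after the previous accepted edge.

```c
#include <stdbool.h>
#include <stdint.h>

#define NOMINAL_TICKS 80000000  /* ticks per second at 80 MHz */
#define GATE_TICKS    8000      /* hypothetical window: +/- 100 ppm */

typedef struct {
    uint32_t last_ticks;  /* capture value of the reference edge */
    bool     have_last;
    uint32_t accepted;
    uint32_t rejected;
} pps_gate_t;

/* Returns true if this edge should be fed to the servo. */
static bool pps_gate_accept(pps_gate_t *g, uint32_t cap_ticks)
{
    if (!g->have_last) {
        g->last_ticks = cap_ticks;
        g->have_last  = true;
        g->accepted++;
        return true;
    }

    uint32_t interval = cap_ticks - g->last_ticks;  /* wrap-safe */
    int64_t  err      = (int64_t)interval - NOMINAL_TICKS;

    if (err < -GATE_TICKS) {
        /* Early edge: treat as a glitch and keep the old reference. */
        g->rejected++;
        return false;
    }

    g->last_ticks = cap_ticks;
    if (err > GATE_TICKS) {
        /* Late edge: a pulse was missed, so re-anchor but don't use it. */
        g->rejected++;
        return false;
    }

    g->accepted++;
    return true;
}
```

With a gate like this, a `rejected` count of zero over a long run is the same kind of evidence as the zero rejected pulses reported above.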
Then I asked chrony what it thought.
The benching
I had three GPS-disciplined boxes on the bench, each with its own receiver, deliberately not synced to each other. The point wasn't to slave them together. The point was to let an independent chrony instance watch all three and tell me, honestly, which one it trusted. One was an Intel i210 NIC doing hardware PTP timestamping off a u-blox NEO-M9N. One was a BeagleBone with a GPS PPS feed. The third was my $20 ESP32.
chrony marks each source with a symbol. ^* is the selected source, the one it's actually steering the clock to. ^+ is an acceptable source that chrony combines with the selected one. ^- means benched: a source chrony has measured and considers acceptable, but has excluded from the combining algorithm, so it contributes nothing to the clock.
My beautiful clock came up ^-. Benched. The i210 grandmaster, the expensive one, was ^*.
This is where the conventional wisdom says I should know better. A twenty dollar microcontroller isn't a grandmaster. Except the numbers didn't support that story. The ESP32 was, by wire topology, the closest clock to the monitoring host. It was on the same switch. It should have had the shortest, most stable rtt of the three. Instead chrony was measuring its rtt at 1.88 ms against 0.19 ms for the others. Roughly ten times the round trip of peers that were further away.
The tell
A source that is topologically closer should have a shorter round trip, not a longer one. When the closest clock measures the worst delay, that is not noise you average away. That is a defect, and it is in the part of the system you control.
The GPS was never the problem. The clock was never the problem. The problem was the NTP server, and specifically the timestamps it was putting into packets.
How NTP actually computes who is right
Every NTP exchange has four timestamps. The client stamps when it sends the request (t1). The server stamps when it receives that request (t2) and when it sends its reply (t3). The client stamps when the reply arrives (t4). From those four numbers, the client computes two things:
offset = ((t2 - t1) + (t3 - t4)) / 2
delay = (t4 - t1) - (t3 - t2)
The whole edifice rests on t2 and t3 being honest about when the packet actually crossed the wire. If your server stamps t2 a few hundred µs after the packet truly arrived, and stamps t3 well before it truly leaves, then the math sees phantom delay that was never on the network. chrony can't tell the difference between real network latency and a server lying about its own timestamps. It just sees a source with two milliseconds of slop and benches it.
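To make the phantom delay concrete, here is a small sketch of the two formulas applied to the same exchange, once with honest server timestamps and once with late t2 and early t3. The timings in the usage note are invented for illustration. Only the 636 µs figure echoes a number from later in this article.

```c
#include <stdint.h>

typedef struct {
    int64_t offset_ns;
    int64_t delay_ns;
} ntp_sample_t;

/* The standard NTP on-wire calculation, all timestamps in nanoseconds. */
static ntp_sample_t ntp_compute(int64_t t1, int64_t t2, int64_t t3, int64_t t4)
{
    ntp_sample_t s;
    s.offset_ns = ((t2 - t1) + (t3 - t4)) / 2;
    s.delay_ns  = (t4 - t1) - (t3 - t2);
    return s;
}
```

Take perfectly aligned clocks, 95 µs of wire each way, and a server that holds the request for 1 ms. The honest stamps are t1 = 0, t2 = 95,000, t3 = 1,095,000, t4 = 1,190,000, which gives an offset of 0 and a delay of 190,000 ns. Stamp t2 300 µs late (395,000) and t3 636 µs early (459,000) and the same exchange reports a delay of 1,126,000 ns and an offset of -168,000 ns. The 936 µs of extra delay is exactly the sum of the two stamping errors, and none of it was ever on the network.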
So I stopped trusting my own timestamps and went to measure where they came from.
Receive: stamp it in hardware, again
The same lesson as PPS applied to packet arrival. The original code stamped t2 in the UDP receive path, in software, after the network stack had already handed the packet up. The W5500 has an INTn line that asserts the moment a packet lands in its buffer. I wired that to a GPIO, took the interrupt, and stamped t2 there, as early in the arrival as I could physically get.
// INTn from the W5500 asserts when a packet hits the socket buffer.
// Stamp the receive instant here, not up in the UDP read path.
static void IRAM_ATTR w5500_int_isr(void *arg) {
    uint64_t now = now_ns_disciplined();
    g_rx_hw_ts = now;   // becomes t2
    g_rx_irq_count++;   // 1:1 with served requests, for auditing
    xSemaphoreGiveFromISR(g_rx_sem, NULL);
}
I added a counter, ntp_rx_irq_total, that incremented on every one of those interrupts, so I could prove the interrupt fired exactly once per served request rather than assuming it. Receive jitter dropped from a sigma of 26 µs to 2.6 µs. That made the ESP32's receive timestamp the tightest of all three clocks, including the grandmaster.
The two things blocking between t2 and t3
Receive was now honest. The rtt was still bad, which meant time was being lost between receiving the request and sending the reply. So I instrumented the gap, and found two separate stalls.
The first was an ARP prime. Before sending a reply the code was calling a helper that, on a cold ARP cache, busy-waited for one to two hundred milliseconds resolving the client's MAC. The second was subtler: the main loop called gps->loop() to service the UART at the top of every iteration, and that call could block for up to 100 ms draining the GPS, immediately before the NTP poll. A request could sit and wait on the GPS before the server ever looked at it.
Neither of those is a clock problem. Both of them are a "you wrote a blocking server and then measured it serving the time it was blocking on" problem.
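One common way to keep a UART drain from stalling a poll loop is to give it a byte budget per iteration. The sketch below is not the project's code. The ring buffer, the names, and the budget are all assumptions, and the producer would be the UART interrupt or driver.

```c
#include <stddef.h>
#include <stdint.h>

#define GPS_RB_SIZE 256u  /* power of two, so masking wraps the index */

typedef struct {
    uint8_t  buf[GPS_RB_SIZE];
    uint32_t head;  /* advanced by the producer (UART side) */
    uint32_t tail;  /* advanced by the consumer (main loop) */
} gps_rb_t;

/* Producer side: returns 1 on success, 0 if the buffer is full. */
static int gps_rb_push(gps_rb_t *rb, uint8_t byte)
{
    if (rb->head - rb->tail == GPS_RB_SIZE) {
        return 0;
    }
    rb->buf[rb->head & (GPS_RB_SIZE - 1u)] = byte;
    rb->head++;
    return 1;
}

/* Consumer side: copy at most `budget` bytes into `out` and return
 * how many were copied. Never waits for more input to arrive. */
static size_t gps_drain_bounded(gps_rb_t *rb, uint8_t *out, size_t budget)
{
    size_t n = 0;
    while (n < budget && rb->tail != rb->head) {
        out[n++] = rb->buf[rb->tail & (GPS_RB_SIZE - 1u)];
        rb->tail++;
    }
    return n;
}
```

In a loop built this way the NTP poll would run first and the GPS drain second, with the budget chosen so one iteration's worth of parsing stays far below the latency you're willing to add to a request.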
Transmit: you cannot stamp t3 after you send
The last stall was the most interesting because it's genuinely hard to fix correctly. You want t3 to be the instant the reply hits the wire. But you write t3 into the packet, which means you have to know the transmit time before you transmit. You're stamping a timestamp for an event that hasn't happened yet.
I measured how long the actual send took. w5k_sendto was blocking for ~636 µs pushing the packet to the W5500 over SPI. I'd predicted ~560 µs from the SPI clock and packet size; measuring 636 confirmed I understood the path. So t3 was being written 636 µs before the packet left, every single time. By the two formulas above, that error lands in every client's delay in full and in its offset at half weight.
The fix is to predict the send time and pre-correct t3 by it, using an EWMA of the measured send duration so it adapts to conditions instead of being a magic constant.
// t3 is written INTO the packet, so it must be stamped before egress.
// Pre-correct by the predicted on-wire moment using an EWMA of measured
// SPI send time, then split the write so the loop never blocks on it.
uint64_t predicted_tx = now_ns_disciplined() + g_send_ewma_ns;
pkt->transmit_ts = ns_to_ntp64(predicted_tx);
uint64_t t_before = now_ns_disciplined();
w5500_sendto_nonblocking(sock, pkt, sizeof(*pkt), client);
uint64_t measured = now_ns_disciplined() - t_before;
// Adapt the prediction toward what actually happened. 1/8 gain.
g_send_ewma_ns += ((int64_t)measured - (int64_t)g_send_ewma_ns) >> 3;
Switching to a non-blocking split write shaved roughly 1.4 milliseconds off each request on its own. The EWMA correction took the residual transmit error out of the offset.
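Two pieces of that snippet are worth seeing in isolation. The bodies of ns_to_ntp64 and the EWMA state aren't shown above, so what follows is a sketch under assumptions: `ns_to_ntp64_sketch` assumes the disciplined clock counts Unix-epoch nanoseconds, and `ewma_update` uses division instead of a shift so the rounding of negative differences doesn't depend on the compiler.

```c
#include <stdint.h>

#define NTP_UNIX_EPOCH_DELTA 2208988800ULL  /* seconds from 1900 to 1970 */

/* Unix-epoch nanoseconds to a 64-bit NTP timestamp in host byte order:
 * upper 32 bits are seconds since 1900, lower 32 bits are a binary
 * fraction of a second. It still needs converting to big-endian
 * before it goes into the packet. */
static uint64_t ns_to_ntp64_sketch(uint64_t unix_ns)
{
    uint64_t sec  = unix_ns / 1000000000ULL + NTP_UNIX_EPOCH_DELTA;
    uint64_t frac = ((unix_ns % 1000000000ULL) << 32) / 1000000000ULL;
    return (sec << 32) | frac;
}

/* One EWMA step with 1/8 gain: move the estimate an eighth of the
 * way toward the newest measurement. */
static int64_t ewma_update(int64_t ewma_ns, int64_t measured_ns)
{
    return ewma_ns + (measured_ns - ewma_ns) / 8;
}
```

Seeded with the predicted 560,000 ns and fed a measured 636,000 ns, the first step moves the estimate to 569,500 ns and the second to 577,812 ns. The seed value is an illustration, not something the article specifies.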
The result
After receive was stamped in hardware, the two blocking stalls were removed, and transmit was pre-corrected and made non-blocking, I went back and asked chrony again. It didn't take long.
The twenty dollar ESP32 went from benched to selected. It became the source chrony trusted most, ahead of the i210 PTP grandmaster, on a network where the grandmaster cost many times more.
I want to be precise about the claim, because precision is the entire point. This isn't the best clock in the world. The honest tier above it is real: GPSDO and OCXO holdover boxes that keep time through GPS outages, and sawtooth-corrected grandmasters with hardware that an ESP32 doesn't have. What this is, is the best clock in its bracket. Among microcontroller-based stratum 1 servers, the combination of MCPWM capture for PPS, hardware receive timestamping off the W5500 INTn, and a self-calibrating transmit correction is, as far as I can measure, best in class. And the thing that makes it worth its salt, the hardware timestamping, costs nothing extra. It was sitting in the silicon the whole time.
The part I will not compromise on
One more thing, because it's the difference between a clock and a liar. A stratum 1 server has to be honest about when it doesn't know the time. If the GPS loses lock, the worst possible behavior is to keep confidently serving stratum 1 with stale time, because every client downstream will believe it. So the server checks lock on every single request. Both conditions have to hold: PPS pulses are arriving, and the NMEA fix is less than 1.5 seconds old. Otherwise it doesn't claim to be synced. The moment it isn't certain, it advertises stratum 16 with the leap indicator set to alarm, which is NTP for "do not use me." A clock that lies once is worse than no clock at all.
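A sketch of that per-request check, assuming ages are measured on a local monotonic nanosecond clock. The 1.5 second NMEA limit is from the text above. The PPS age limit and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NTP_VERSION      4u
#define NTP_MODE_SERVER  4u
#define NTP_LI_NONE      0u
#define NTP_LI_ALARM     3u             /* clock not synchronized */
#define NMEA_MAX_AGE_NS  1500000000ULL  /* 1.5 s, from the article */
#define PPS_MAX_AGE_NS   1100000000ULL  /* hypothetical: one pulse plus slack */

typedef struct {
    uint8_t li_vn_mode;  /* first byte of the NTP header */
    uint8_t stratum;     /* second byte of the NTP header */
} ntp_status_t;

/* Decide what to advertise for this request. The caller must make
 * sure a never-seen PPS or NMEA timestamp reads as stale. */
static ntp_status_t ntp_status_for(uint64_t now_ns,
                                   uint64_t last_pps_ns,
                                   uint64_t last_nmea_ns)
{
    bool pps_ok  = (now_ns - last_pps_ns)  < PPS_MAX_AGE_NS;
    bool nmea_ok = (now_ns - last_nmea_ns) < NMEA_MAX_AGE_NS;
    bool locked  = pps_ok && nmea_ok;

    uint8_t li = locked ? NTP_LI_NONE : NTP_LI_ALARM;

    ntp_status_t s;
    s.li_vn_mode = (uint8_t)((li << 6) | (NTP_VERSION << 3) | NTP_MODE_SERVER);
    s.stratum    = locked ? 1u : 16u;
    return s;
}
```

A locked server answers with a first byte of 0x24 and stratum 1. Anything else answers 0xE4 and stratum 16, and a well-behaved client discards the sample.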
That's the whole discipline in one rule. The expensive part was never the hardware. It was refusing to believe my own clock until the independent monitor, watching from across the network, agreed with it.
The code is on GitHub.