All writing

~/writing/ts2phc-go-frozen-clock

Precision timing
10 min read

The frozen clock that nobody noticed

A GPS-disciplined NTP server stopped being disciplined, and every monitor stayed green. The giveaway was a freq_ppb value that repeated byte-for-byte across three seconds, which a live servo never does.

A Grafana dashboard had been stale since 2pm. Not down, not erroring: stale. Lines flat and plausible, panels green, NTP pool monitors quiet. By every check that was supposed to catch a problem, there was no problem. The clock was fine.

The clock was not fine. It had stopped knowing the time hours earlier and had been confidently lying ever since, and the only reason I caught it is that I happened to be staring at a number that should never repeat.

This is about ts2phc-go, a Go reimplementation I wrote of linuxptp's ts2phc. Its job is narrow: discipline a NIC's PHC to the PPS signal off a GPS receiver. On a server in the public NTP pool, that PHC is the foundation. chrony reads it as a local reference and serves stratum 1 to whoever asks. If the PHC drifts, everyone downstream drifts. If the PHC freezes, everyone downstream freezes, and nobody finds out, because a frozen clock that still ticks is indistinguishable from a working one to anything that only asks it the time.

The number that should not repeat

The servo in ts2phc-go logs its state every second. One of the fields is freq_ppb, the frequency correction it's currently applying to the PHC oscillator in parts per billion. A disciplining servo is always nudging this. The oscillator drifts with temperature, the loop chases it, the correction wobbles in the last few digits constantly. It's the heartbeat. A live servo doesn't report the exact same float twice in a row, because the physical world doesn't hand it the exact same offset twice in a row.

I was scrolling the logs and saw this:

servo log, 3 seconds apart
14:02:51  offset       4 ns  freq_ppb 17139.674132458233  ...
14:02:52  offset       4 ns  freq_ppb 17139.674132458233  ...
14:02:54  offset       4 ns  freq_ppb 17139.674132458233  ...

17139.674132458233, byte for byte identical, across a three-second gap. That's not a clock holding steady. A clock holding steady still jitters in the low digits. That's a clock that's stopped computing. The servo wasn't steering anything. It was printing the last value it ever calculated, over and over, like a stopped watch someone propped up to look like it was running.

Stable is not frozen

A repeated measurement isn't a stable measurement. A live servo's freq_ppb drifts in its least significant digits on every sample, because the physical offset it chases never repeats exactly. A byte-identical float across two samples doesn't mean the clock is calm. It means the clock is dead and nobody pulled the plug.

Why every monitor stayed green

The failure was upstream of the servo. gpsd's NTP feed had died. The main loop in ts2phc-go reads an offset from the GPS source each iteration, and when the source has nothing to give, the loop hits a continue. Every iteration. So the servo never sampled a new offset, never updated, and the PHC just stopped being disciplined. It coasted on its own oscillator from 2pm onward, drifting freely, while the servo sat there reciting its last freq_ppb.

Here's the part that should make you uncomfortable. chrony, downstream, kept reading that PHC as its reference and kept advertising stratum 1 to the pool, perfectly happily. From chrony's point of view nothing changed. The PHC still returned a time. The time was wrong and getting wronger, but chrony didn't have any way to know that, because it reads the PHC as a local reference with no independent cross-check. A frozen-but-plausible PHC is indistinguishable from a live one to anything that only asks it what time it is. You have to ask it something it can't fake, which is whether it's still being told the time by something outside itself.

The external pool monitors couldn't see it either. They measure offset against the consensus, and a clock that froze while it was already correct stays correct enough, for a while, to pass. By the time the drift would have tripped a threshold the damage to downstream trust was already done.

The shape of this failure

The dangerous failures in timekeeping aren't the loud ones. A clock that crashes gets noticed. A clock that keeps serving, keeps answering, and stops updating is the one that takes out everything that trusts it, because every check designed to catch a broken clock is actually a check for a clock that stopped responding, not a clock that stopped thinking.

Recovery was straightforward once I knew where to look. Restart gpsd, wait for a 3D fix (it came back on 23 to 25 satellites), the source started handing the loop fresh offsets again, the servo woke up, and downstream chrony relocked to roughly +9 ns at about +/-19 ns. The whole fix was a process restart. Finding it was noticing a float.

How I got here: reading the structs against the headers

I should back up, because the reason I trusted that freq_ppb tell at all is that I'd spent a lot of time inside this code, checking what it reported against what the hardware was actually doing. That habit is what made the frozen value jump out.

When I first wrote ts2phc-go I did the thing you're supposed to do with kernel ABI: I read my Go structs line by line against the UAPI headers in the kernel tree, rather than trusting that they looked right. They didn't look right.

The ClockCaps struct was 76 bytes: eight int32 fields plus an [11]int32 reserved area. But PTP_CLOCK_GETCAPS encodes an 80-byte copy in its ioctl number. The kernel doesn't care what size your struct is. It copies the number of bytes the ioctl says, which means on every single startup the kernel was writing four bytes past the end of my struct. It hadn't crashed yet. That's the worst kind of bug, the one that's corrupting memory without a word.

ptp_caps.go
// PTP_CLOCK_GETCAPS encodes an 80-byte copy. This struct was 76.
// The kernel writes the size the ioctl says, not the size you allocated.
type ClockCaps struct {
	MaxAdj    int32
	NChannels int32
	NAlarm    int32
	NExtTs    int32
	NPerOut   int32
	PPS       int32 // was swapped with NPins below
	NPins     int32
	CrossTs   int32
	AdjPhase  int32
	MaxPhaseAdj int32 // was missing entirely
	Reserved  [11]int32
}

While I was in there I found two more. The pps and n_pins fields were swapped relative to the UAPI definition, so I was reading each as the other. And max_phase_adj was missing from my struct entirely, which is part of why the size was off.

Then a separate one, in a different place. PTP_EXTTS_REQUEST2, the v2 external-timestamp request, was encoded with ioctl number 10. Number 10 is PTP_CLOCK_GETCAPS2. The v2 request is number 11. So every time the code asked for a v2 external-timestamp request it was actually issuing a getcaps, the kernel returned ENOTTY, and the code silently fell back to v1. Silently. It worked. It just quietly lost PTP_STRICT_FLAGS every time and nobody noticed, because falling back to v1 is a perfectly functional thing to do.

The bug I left in on purpose

The satisfying move is to fix every bug you find. The correct move sometimes isn't, and this is one of those cases.

That v2 ioctl-number bug, the one that made the strict request silently fall back to v1, I left unfixed on the production NIC. On purpose.

The reasoning: making the v2 request actually fire had no upside for this deployment. The v2 path enables a strict single-edge mode. This design doesn't want strict single-edge mode. It wants both edges plus a filter (more on that next), which is a deliberately different strategy. So fixing the bug would have, at best, changed nothing useful, and at worst changed the behavior of the exact NIC sitting in production serving the pool, for no benefit. The fall-back to v1 wasn't a failure, it was the behavior I wanted, arrived at by accident. Knowing which bugs not to fix is a skill, and it's a quieter one than knowing how to fix them.

The both-edge filter, which is the actual point

The reason this design wants both edges is the NIC. The Intel i210 delivers both the rising and the falling edge of the PPS to the timestamp FIFO. A naive reader that only expects one edge per second gets two, interleaves them, and computes garbage. An earlier audit of the raw pulse stream had flagged "pulses" of 458 ms, which isn't a PPS pulse, that's the gap between a rising edge of one second and a falling edge of the previous one being misread as a single interval.

So the headline feature of ts2phc-go is a dynamic edge filter that handles both edges, locks onto the real second boundary, and rejects the noise. When it locks it logs exactly what it sees:

edge filter lock
edge filter locked, pulse width 9.999994ms

A 9.999994 ms pulse width is the real PPS pulse off this receiver. Once the filter's locked there are none of the 458 ms garbage intervals the naive audit had flagged, because the filter knows which edge is which and only disciplines against the boundary that matters.

The 37-second clue

One more measurement from early on, because it's a clean example of a wrong number telling you something true. The very first time the servo ran end to end, it reported an offset of 37000000084 ns. That's 37 seconds and change. A servo that's 37 seconds off isn't a servo with a tuning problem, it's a servo measuring two different time scales.

The PHC is disciplined to TAI by ptp4l. NMEA, from the GPS, reports UTC. TAI has been exactly 37 seconds ahead of UTC since 2017. The C ts2phc handles this by reading leap-seconds.list. My fix added the 37-second offset directly in the NMEA source, and the offset collapsed from 37 seconds to low tens of nanoseconds. The remaining 84 ns in that first reading was the real offset, sitting in plain sight underneath a 37-second labeling error.

There were other points of hygiene I matched against the C version once I knew to look for them: draining the EXTTS FIFO on startup so the servo doesn't begin by chewing on stale kernel-buffered timestamps, configuring the SDP pin correctly so the i210 routes the PPS where the code expects it, and getting the edge handling right rather than assuming one edge per second.

Shipping it without shipping more than I meant to

The fixes (struct sizing and field order, the NMEA leap offset, the FIFO drain and SDP and edge handling) needed to go onto the production box. The deploy was deliberately boring. Built natively on the machine with go1.25 and CGO off, backed up the running binary first, swapped it, restarted. It re-locked in about one second and recovered to baseline, around -5 ns at about +/-17 ns.

One thing surfaced before I shipped, not after, which is the point of looking. git HEAD carried commits beyond the three fixes I intended to deploy: there was a first-step-threshold change in there too. A naive rebuild of HEAD would have shipped that quietly along with the three I'd actually reviewed and wanted. I flagged the scope and pulled it back to the intended set rather than letting the extra commit ride along unexamined. Shipping more than you decided to ship is its own kind of frozen clock: it looks like the thing you tested, right up until it isn't.

Frozen freq_ppb
17139.674132458233
byte-identical across a 3s gap
Recovery offset
+9 ns
downstream chrony relock, +/-19 ns
Re-lock time
~1 s
after the production swap

When a measurement becomes a memory

A frozen servo looks exactly like a working one unless you have an independent check.

The PHC didn't tell anyone. It couldn't. It had no way to distinguish "I'm disciplined and correct" from "I'm coasting on a value I computed at 2pm," because nothing was asking it the only question that matters, which isn't what time is it but how recently were you told. chrony asked the first question and got a confident answer. Nobody asked the second one until I noticed a float that refused to change.

So now the loop's failure to sample is loud instead of silent, and the thing I trust most isn't the clock and isn't the monitor but the byte-identical number, the sign that a measurement has stopped being a measurement and become a memory.

The code is on GitHub.