
The spare cores in a BeagleBone keep better time than its kernel

The standard Linux PPS driver timestamps the GPS pulse in an interrupt handler and pays ~20 µs of jitter for it. The BeagleBone has two 200 MHz real-time cores that Linux never touches. I moved the timestamp into one of those, and the clock offset dropped into the low nanoseconds.

A GPS-disciplined clock lives or dies on one moment: the instant you record that the pulse-per-second edge arrived. Everything downstream (the servo, the frequency correction, the offset chrony reports) is built on that timestamp. If it jitters, the whole clock jitters.

On Linux the usual path is the pps-gpio driver. Wire PPS to a GPIO pin, and the kernel timestamps the rising edge in an IRQ handler. It works, and for plenty of uses it's fine. But the timestamp lands after the interrupt's been dispatched and the handler scheduled, and that path is at the scheduler's mercy. Under normal load you see up to ~20 µs of dispersion, with outliers well past that on top. That's the noise floor of timestamping an edge through the Linux interrupt subsystem. You can't average your way out of jitter that large.

The BeagleBone Black has a way out, sitting unused on the same chip.

Two CPUs with no operating system

The AM335x on a BeagleBone has a PRU-ICSS: two Programmable Real-Time Units, 200 MHz cores that run independently of the ARM and the Linux kernel. No OS, no IRQs in the normal sense, one instruction per 5 ns cycle, deterministically. You load firmware from userspace via the remoteproc framework and they just run.

That's exactly what edge timestamping needs. A PRU polling a pin in a tight loop sees the edge with nothing scheduled between the transition and the timestamp. The AM335x also gives it a clock to stamp with: the IEP (Industrial Ethernet Peripheral) has a free-running counter at 200 MHz, one tick every 5 ns.

So the plan: stop asking Linux to time the edge. PRU0 watches the PPS pin, latches the IEP counter the instant it goes high, and drops the value in shared memory for a userspace daemon to collect.

firmware/main.c (the capture loop)
uint32_t prev = __R31 & PPS_BIT;     // P8_16 = pr1_pru0_pru_r31_14
for (;;) {
    uint32_t cur = __R31 & PPS_BIT;
    if (cur && !prev) {
        // Rising edge. Latch the 200 MHz IEP counter right here,
        // in the PRU, with nothing scheduled between edge and read.
        pps_data.iep_lo = IEP_COUNT_LO;
        pps_data.seq++;
    }
    prev = cur;
    // ... service rpmsg, etc.
}

The PRU writes a tiny struct, { seq, iep_lo }, into its data RAM. seq increments on every pulse so the reader can tell a fresh sample from a stale one. iep_lo is the raw IEP tick count at the edge. Note what it's not: it's not wall-clock time. The PRU has no idea what time it is. It only knows its own free-running counter. Bridging that gap is where the actual work lives.
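
On the reading side, the seq counter could be used roughly as follows. This is a minimal sketch, not the daemon's actual code: the struct mirrors the { seq, iep_lo } pair described above (field order and widths are assumed), and pps_poll with its re-read check is a hypothetical helper, added on the theory that the PRU may update the struct while the ARM is in the middle of reading it.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the { seq, iep_lo } pair the firmware writes. The exact field
 * order and widths in the real firmware are assumed here. */
struct pps_sample {
    volatile uint32_t seq;     /* incremented once per PPS edge */
    volatile uint32_t iep_lo;  /* IEP counter latched at the edge */
};

/* Hypothetical helper: returns 1 and fills *iep_out when a pulse newer than
 * *last_seq is available, 0 when the sample is stale. */
static int pps_poll(const struct pps_sample *shm, uint32_t *last_seq,
                    uint32_t *iep_out)
{
    assert(shm && last_seq && iep_out);
    for (;;) {
        uint32_t before = shm->seq;
        if (before == *last_seq)
            return 0;                 /* nothing new since the last poll */
        uint32_t iep = shm->iep_lo;
        uint32_t after = shm->seq;
        if (before == after) {        /* seq unchanged across the read */
            *last_seq = after;
            *iep_out = iep;
            return 1;
        }
        /* seq moved while we were reading; try again */
    }
}
```

With pulses a full second apart the retry branch should almost never run; it's a cheap guard rather than a proof of consistency.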

The hard part is the clock-domain crossing

The PRU produces timestamps in IEP ticks. Chrony wants CLOCK_REALTIME nanoseconds. Two different clocks, slightly different and drifting rates, and the translation can't smear the nanosecond precision you just went to all this trouble to get.

A userspace daemon, pru_pps_shm, does the crossing. It runs SCHED_FIFO so the scheduler can't sit on it, reads the PRU's struct out of the PRU's data RAM via an mmap of /dev/mem, and then has to answer one question: at the moment the PPS edge happened (a known IEP tick), what was CLOCK_REALTIME?

To correlate the two clocks it brackets a reading of the IEP counter between two wall-clock reads, ten times over, keeping the tightest bracket:

daemon/pru_pps_shm.c (IEP to wall calibration)
long long best_spread = 999999999LL;
for (int i = 0; i < 10; i++) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_REALTIME, &t1);
    uint32_t c = read_iep_counter();          // sample IEP between two wall reads
    clock_gettime(CLOCK_REALTIME, &t2);
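    long long ns1 = t1.tv_sec * 1000000000LL + t1.tv_nsec;   // wall read before the IEP sample
    long long ns2 = t2.tv_sec * 1000000000LL + t2.tv_nsec;   // wall read after the IEP sample
    long long spread = ns2 - ns1;             // how tight is this bracket?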
    if (spread < best_spread) {
        best_spread = spread;
        best_cal_iep  = c;
        best_cal_wall = ns1 + spread / 2;     // midpoint of the tightest bracket
    }
}

The bracket pins one IEP value to one wall-clock value, and spread is the uncertainty of that pin: the wall clock could've been anywhere in that window when the IEP was sampled, so the midpoint is the best estimate and the window width is the error. Best-of-ten keeps the bracket where an interrupt or cache miss didn't stretch the window. On the RT kernel that spread stays under 2 µs, typically around 1.3.

The second piece is the tick rate. Nominally the IEP runs at 200 MHz, so 5.0 ns per tick. It doesn't, quite, and it drifts with temperature. So the daemon measures the real period and smooths it with an IIR filter, which tracks thermal drift without chasing noise:

// Filtered IEP tick period in ns. Nominal 5.0; reality is a hair under,
// and it moves with temperature.
ns_per_tick = ns_per_tick * 0.9 + measured * 0.1;
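
Where measured comes from isn't shown in the snippet. One plausible source, assumed here rather than taken from the daemon, is to treat consecutive PPS edges as exactly one second apart and divide 1e9 ns by the IEP tick delta between them. In this sketch the function name and the ±1000 ppm sanity window are hypothetical; only the 0.9/0.1 filter weights come from the line above.

```c
#include <assert.h>
#include <stdint.h>

#define NOMINAL_TICKS_PER_SEC 200000000u  /* 200 MHz IEP */

/* Hypothetical helper: fold one pulse-to-pulse tick delta into the filtered
 * period. Assumes the GPS edges are exactly 1 s apart. */
static double update_ns_per_tick(double ns_per_tick, uint32_t delta_ticks)
{
    assert(ns_per_tick > 0.0);

    /* Assumed sanity window of +/-1000 ppm around nominal: a missed or
     * doubled pulse lands far outside it and leaves the filter untouched. */
    uint32_t tol = NOMINAL_TICKS_PER_SEC / 1000u;
    if (delta_ticks < NOMINAL_TICKS_PER_SEC - tol ||
        delta_ticks > NOMINAL_TICKS_PER_SEC + tol)
        return ns_per_tick;

    double measured = 1e9 / (double)delta_ticks;   /* ns per tick over this second */
    return ns_per_tick * 0.9 + measured * 0.1;     /* same IIR weights as above */
}
```

Starting from the nominal 5.0 and feeding it a steady delta a little above 200 million, the filtered value slides down toward 1e9 / delta over a few dozen pulses instead of jumping there in one step.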

With a calibrated (iep, wall) pin and a filtered ns_per_tick, projecting the PPS edge into wall time is arithmetic: take the calibration point and walk back by the number of IEP ticks between the edge and the calibration sample, times the ns per tick. That projected CLOCK_REALTIME instant goes into chrony's NTP shared-memory refclock, unit 2, which disciplines the system clock.
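
That walk-back can be written down in a few lines. This is a sketch under stated assumptions, not the daemon's code: edge_to_wall_ns is a hypothetical name, the IEP counter is taken to be 32 bits wide (as the uint32_t reads above suggest), and the calibration sample is assumed to be taken after the edge.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: project a PPS edge (in IEP ticks) to CLOCK_REALTIME ns.
 * cal_wall_ns and cal_iep are the calibration pin, sampled after the edge. */
static int64_t edge_to_wall_ns(int64_t cal_wall_ns, uint32_t cal_iep,
                               uint32_t edge_iep, double ns_per_tick)
{
    assert(ns_per_tick > 0.0);

    /* Unsigned subtraction stays correct if the 32-bit counter wrapped
     * between the edge and the calibration sample. */
    uint32_t gap_ticks = cal_iep - edge_iep;

    /* Round to the nearest ns; the product is non-negative, so adding 0.5
     * and truncating is enough. */
    int64_t gap_ns = (int64_t)((double)gap_ticks * ns_per_tick + 0.5);

    return cal_wall_ns - gap_ns;   /* walk back from the pin to the edge */
}
```

For a gap of 38925 ticks at 4.999707 ns per tick, the walk-back is about 194.6 µs. Using the nominal 5.0 instead would shift the result by roughly 11 ns, which is why the filtered period matters even over a gap this short.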

Capture cheap, correlate carefully

The PRU capture is the easy, glamorous half: hardware timestamping at 5 ns resolution. It's worthless without the boring half. Cross from IEP ticks to wall time with a sloppy single clock_gettime, or assume the counter runs at exactly 200 MHz, and you've reintroduced microseconds of error. The silicon-accurate edge was for nothing. Precision is set by the weakest link, and the weakest link is the clock-domain crossing, not the capture.

What the daemon tells you

Every pulse, the daemon logs a line that's the whole health of the chain at a glance:

seq=202 delta=200011758 offset=+634 ns gap=38925 (194.6 us) spread=1291 ns ns/tick=4.999707 [good=201]

delta is the IEP ticks between consecutive pulses; it should sit near 200 million (one second at 200 MHz), and 200011758 says the IEP's running a touch fast, which is exactly why ns/tick settles at 4.999707 rather than 5.0. offset is the sub-second residual of the edge against the UTC second. gap is how long after the edge the calibration ran, kept under 250 µs. spread is the calibration bracket width, here 1291 ns. good counts accepted pulses; the matching bad counter should stay at zero.
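
Those fields suggest what a per-pulse acceptance check might look like. The daemon's real criteria for good and bad aren't shown, so this is a guess: the 250 µs gap limit comes from the text, treating the observed 2 µs spread ceiling as a hard limit is an assumption, and the ±1000 ppm window on delta, the struct, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pulse_stats {
    uint32_t delta_ticks;  /* IEP ticks since the previous edge */
    uint32_t gap_ticks;    /* IEP ticks from the edge to the calibration sample */
    int64_t  spread_ns;    /* width of the best calibration bracket */
};

/* Hypothetical gate: true if the pulse looks trustworthy enough to hand on. */
static bool pulse_is_good(const struct pulse_stats *p, double ns_per_tick)
{
    assert(p && ns_per_tick > 0.0);
    const uint32_t nominal = 200000000u;   /* one second at 200 MHz */
    const uint32_t tol = nominal / 1000u;  /* assumed +/-1000 ppm window */

    if (p->delta_ticks < nominal - tol || p->delta_ticks > nominal + tol)
        return false;                       /* missed or spurious edge */
    if ((double)p->gap_ticks * ns_per_tick >= 250000.0)
        return false;                       /* calibration ran too late */
    if (p->spread_ns < 0 || p->spread_ns >= 2000)
        return false;                       /* bracket too loose to trust */
    return true;
}
```

The sample line above (delta=200011758, gap=38925, spread=1291) passes all three checks.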

The result

The payoff shows up in every metric.

Typical offset: 5-20 µs (pps-gpio) → 100-800 ns. PRU capture vs GPIO IRQ.
Edge resolution: ~1 µs → 5 ns. IEP tick vs GPIO IRQ jitter.
Chrony offset sd: 1.0e-9 s. Estimated, after the servo converges.

Once chrony's servo converges, chronyc tracking puts the system time offset in the low nanoseconds, often inside ±10 ns. The sourcestats view shows the estimated sd settling at 1.0e-09, one nanosecond, against the 5 to 20 µs that pps-gpio routinely shows. A tracking log excerpt after it's settled:

   Date (UTC) Time     IP Address   St   Freq ppm   Offset       Offset sd
2026-03-02 15:59:48 PPS              1     58.491   1.172e-14    1.716e-17
2026-03-02 15:59:50 PPS              1     58.491   2.141e-15    8.552e-18
2026-03-02 15:59:52 PPS              1     58.491   4.945e-16    8.621e-18

One footnote worth knowing: the IEP can also be exposed as a PTP hardware clock through the pru_iep kernel driver, showing up as /dev/ptpN, which would let linuxptp use it directly. This project doesn't take that path. It reads the IEP counter straight out of /dev/mem and does the correlation itself, which keeps the whole thing legible: a tight loop on a core with no OS, a careful bracket to cross into wall time, and a kernel kept clear of the one step that has to happen on time.

The code, firmware and daemon and device-tree overlays, is on GitHub.