All writing

~/writing/ceph-read-amplification

Systems debugging
10 min read

What is putting those bytes on the wire?

The storage interface on a production cluster was pulling 4 to 12x more than the public interface where client requests arrive. Getting to a defensible number meant distrusting everything, including my own profiler, which was off by 30x.

A graph that does not make sense at a glance. On the front-end nodes of a production cluster I help run, the private storage interface, the one carrying CephFS traffic, was receiving 4 to 12 times more bytes than the public interface where client requests arrive. These machines exist to accept inbound requests. They should be writing to storage if anything. Instead the storage network was pulling down a torrent of reads.

So: what is putting those bytes on the wire? I did not want a plausible story. I wanted a number I could defend when someone who knows the kernel better than I do comes to poke at it.

This is how I got that number, including the part where my own instrument lied to me by a factor of thirty and I trusted the counter instead.

Pick the one number that cannot lie

Pick what you trust before you start collecting data, because the moment you have data you'll be tempted to trust whatever was easiest to collect.

On a CephFS mount, the kernel client keeps a metrics block in debugfs:

the only number I trusted
cat /sys/kernel/debug/ceph/*/metrics
# read total_sz: bytes actually pulled from the OSDs
# write total_sz: bytes actually pushed to the OSDs

read total_sz is bytes the client physically fetched from the object storage daemons (OSDs). Not bytes an application asked for. Not bytes that came back from cache. Bytes that crossed the storage network because they had to. Immune to cache hits, async timing, burstiness, because it's incremented at the point where bytes leave the OSD and hit the client. If another number disagrees with this counter, the other number is wrong, until I can prove otherwise.

Everything else, process names, file paths, which PID was reading, I treated as attribution. Attribution is a hypothesis about who caused the bytes. It's not the measurement of the bytes. Keeping those two ideas in separate boxes is the whole methodology.

The cache trap

The first wrong answer showed up fast.

The obvious place to hook is ceph_read_iter, the VFS entry point for a read on a Ceph file. Probe it, sum the bytes, attribute to the process. I ran a sanity check by md5sum-ing a shared map file I knew was already cached on the node, and watched the probe report 203 MiB of reads.

That nearly went into the report as load. It has exactly the shape of a finding: big file, real process, fat byte count. The problem is that ceph_read_iter counts cache hits. Those 203 MiB never touched the OSDs. The debugfs counter barely moved during that md5sum. I'd measured VFS demand and called it wire traffic, which is exactly the mistake the discipline above exists to prevent.

Demand is not wire

A read served from page cache is real I/O to the application and zero bytes on the storage network. If you measure at the VFS layer you are measuring what the application wanted, not what the cluster had to deliver. Those two numbers can differ by orders of magnitude on a warm cache.

The fix was to stop measuring demand and start measuring fetch. I separated two things the first probe had smeared together: ceph_read_iter for what the application requested, and the readahead path for what the client actually prefetched off the OSDs. With both, I could compute a per-file cache-hit percentage, and the cached shared map dropped out of the "load" column where it never belonged. It reads, at the application layer, plenty. It puts almost nothing on the wire.

The amplifier

Once I was measuring fetch instead of demand, the central number came out of one file:

the answer was sitting in sysfs
cat /sys/class/bdi/ceph*/read_ahead_kb
# 8192

8 MB readahead on every CephFS mount. When a process reads a few bytes through a file descriptor, the kernel doesn't fetch a few bytes. It fetches up to 8 MB ahead, on the theory that you'll want the next bytes soon. For a workload streaming large files sequentially, that's a gift. For a workload poking at tiny files, it's a catastrophe: every byte-sized read drags up to 8 MB off the OSDs.

That's the amplification. Measured against actual read operations it came out to roughly 147 KiB pulled per read, about 13 times the bytes the application asked for. And it explains the shape of the whole graph:

Read amplification
~13x
~147 KiB pulled per read op
Lifetime read:write
31:1
906 GB read vs 29 GB written
Private RX:TX
~19:1
storage iface receives, barely sends

Reads dominate writes structurally, not because of a spike. These nodes' storage job is per-entity config lookups, not bulk writes. Lifetime ratio near 31 to 1: 906 GB read against 29 GB written. The private interface receives about 19 times what it transmits. The storage network was never doing storage in the way the name suggests. It was answering a flood of tiny config lookups, each one inflated 13x by readahead.

Why a front-end node reads shared storage at all

Fair question: why do these nodes touch CephFS on the read path at all? A node that only forwarded requests wouldn't. But these don't just forward, they make a per-request decision, and that decision needs configuration. Every inbound request has to read a handful of small per-entity config files off shared storage before it can act. That's roughly 0.65 MiB pulled per request, and none of it is bulk payload. These nodes never read large objects off CephFS. They read tiny per-entity config files, and the 8 MB readahead inflates each byte-sized config read into a large wire transfer.

Broken down by what the per-request path actually touches:

reads per inbound request, by source
identity / auth lookups           ~19-22%
limits / policy data              ~22%
per-entity rules                  ~19%
periodically refreshed shared map ~7%   (mostly cache hits)

The shared map is the file that nearly fooled me in the cache trap. Read constantly, served almost entirely from cache, so it shows up large in demand and small on the wire. Having the cache-hit percentage in hand meant it sat in the right column this time.

When the heavy reads are not the workload at all

There was a second pattern in the data that didn't fit "per-request config lookups," and it would have looked like noise if I'd averaged across all the nodes.

Some nodes were reading 6 to 11 MiB/s, about 4.7 MiB per request, an order of magnitude over the config-lookup baseline. Others were light. The split didn't track request volume. It tracked migration state. A background replication job bulk-reads full data objects on the nodes where it runs, and the heavy nodes were exactly the ones still being migrated. The node that had finished cutting over was light.

That gave a falsifiable prediction, which is the kind I like because it's the kind that can embarrass you later if you got it wrong: once the remaining clusters finish cutting over, their read rate drops to the light baseline. Not "should improve." Drops to a specific baseline I'd already measured on the node that was done.

The 30x lie in my own tool

Now the part that matters most, because it's the part I'm least comfortable telling and therefore the most worth telling.

To do attribution properly I'd written a custom profiler, bpftrace feeding a small Go aggregator, that hooked ceph_readahead and read the readahead window size out of the readahead_control structure passed to it:

the probe that was wrong
kprobe:ceph_readahead
  $rac = (struct readahead_control *)arg0;
  $pages = $rac->_nr_pages;     // window size, in pages
  // attribute $pages * 4096 to the current process

Reasonable on its face. The problem is that this kernel uses the netfs layer, and netfs expands the readahead window after ceph_readahead is entered, in ceph_netfs_expand_readahead. So _nr_pages at entry to ceph_readahead is the pre-expansion size. My probe was capturing the window before the kernel grew it, undercounting by about 30x the bytes that actually hit the OSDs.

I knew it was wrong because it disagreed with the debugfs counter. The counter said one thing, my profiler said about 1/30th of that thing, and the rule I set at the start is that the counter wins. My instrument was wrong until I could prove it right, and I couldn't, so it was wrong.

The fix was to anchor the probe deeper, on the core mm function page_cache_ra_unbounded, which runs after the window is finalized. Two reasons it works where the old hook didn't. First, it's present in this kernel's BTF, so bpftrace can attach to it and read its arguments by type. Second, its readahead_control argument carries the file pointer, which lets me reach the inode and superblock and filter to CephFS by the superblock magic:

the probe that reconciled
kprobe:page_cache_ra_unbounded
  $rac = (struct readahead_control *)arg0;
  $sb = $rac->file->f_inode->i_sb;
  if ($sb->s_magic == 0x00c36400) {        // CEPH_SUPER_MAGIC
      $bytes = arg2 * 4096;                 // post-expansion nr_to_read
      // now attribute, knowing the window is final
  }

After moving the anchor, the profiler's total reconciled to about 170 MB over 6 minutes, roughly 471 KB/s, and it matched the debugfs counter delta exactly. Not approximately. The delta. Then a three-way comparison: the debugfs counter, /proc/net/dev on the storage interface, and the time-series database we already had, all agreed within 6 percent. At that point the attribution was trustworthy, because the total it summed to was the number I already trusted.

Your instrument is a suspect

When your custom tool disagrees with a counter you have independent reasons to trust, the default assumption is that your tool is wrong, not the counter. A 30x discrepancy isn't something to reconcile by averaging or adding a fudge factor. It's a defect in the part of the system you built, and it stays a defect until the tool reproduces the trusted number on its own.

The reason traces lie about who

One more trap worth naming, because it's the reason attribution is hard even after the byte count is correct.

I'd considered attributing reads by tracing openat, matching opened paths to the processes that opened them. That's wrong, and wrong in a way that produces confident, plausible, incorrect answers. Tracing openat only catches files opened during the trace window. Real processes read through descriptors opened earlier, or inherited from a parent. The service forks worker processes that inherit open fds onto shared storage. So the file an openat trace never saw is exactly the file being hammered.

Worse, the process that shows up in ceph_readahead is the one whose read triggered the prefetch, which isn't necessarily the one that opened the file. A tiny read through an inherited descriptor pulls 8 MB, and the byte count lands on whoever happened to issue that read. Attribution by openat alone would have produced a clean, wrong story about who was responsible. The byte count anchored on the trusted counter is correct regardless; the who is the part that requires care.

What the investigation was

No patch at the end of this one. The 8 MB readahead is arguably correct for a filesystem designed to stream large objects; the right fix is a per-mount tuning conversation, not a kernel change, and that conversation is now a quantified one instead of a guess. The deliverable was the explanation: the storage network's receive load is per-entity config lookups amplified roughly 13x by readahead, plus a migration job that disappears on cutover, and here's the prediction for what happens when it does.

The explanation wasn't the hard part. The hard part was refusing to ship the cache-trap reading, refusing to ship the openat attribution, and refusing to trust my own profiler when it disagreed with a counter I had reasons to believe. Anchor on a number that can't lie. Treat process and path attribution as a hypothesis, not the measurement. Separate what the application demanded from what the cluster delivered. When your instrument disagrees with the anchor, the instrument is wrong until it reproduces the anchor on its own.

The 30x error in my profiler is not the embarrassing part of this story. It is the point of it.