A Ceph cluster I throw away on every reboot

I wanted a real Ceph endpoint (S3, CephFS, RBD) to test app code against, without a k8s cluster or a single byte hitting my SSD. One container, ~30 seconds, all in RAM. The version sweep is where it got interesting.

Most of the time I don't actually want a Ceph cluster. I want the endpoint. An S3 URL with creds, or a CephFS mount, or an RBD device, so I can point app code at real RADOS and watch it do what it'll do in prod. The cluster is overhead I put up with to get there.

The normal way to get this locally is Rook: spin up a k8s cluster with minikube or kind, install the operator chart, write a CephCluster CR, carve out PVCs for the OSDs, wait a few minutes for it to converge. That's the right tool when you actually want a Ceph-cluster-shaped thing, with replication and failure injection and the whole operator lifecycle. It's a lot of machinery when all you wanted was an S3 endpoint to point a test suite at.

So I built the small version. One Docker container, one CLI verb, up in about thirty seconds, every byte it writes living in RAM. The whole loop is launch, run your tests, destroy. Reboot the box and there's no trace it existed, because nothing was ever on disk to leave one.

Everything in tmpfs, on purpose

The constraint that shaped the whole thing: no persistent state. Not "cleaned up on exit," I mean never written to disk in the first place. My /tmp is tmpfs, 62.8 GiB of it, so that's where the cluster lives.

The OSD is the fun part. BlueStore wants a block device and I don't have a spare one lying around, so the playground makes a sparse file in tmpfs, wraps it in a loop device, and hands BlueStore that.

the OSD backing store, in RAM
# Sparse file in tmpfs becomes the OSD block device via a loopback.
truncate -s 8G /tmp/cephplayground/<name>/osd0.img
losetup --find --show /tmp/cephplayground/<name>/osd0.img
# Every S3 object, CephFS file, and RBD block ends up in RAM
# through that loop device into that file.
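
One way to enforce the never-on-disk rule is to check the filesystem type before creating the image. A minimal sketch, assuming GNU coreutils stat; the helper names fs_type and require_tmpfs are made up for illustration and are not taken from the playground's code:

```shell
# Hypothetical helpers, not part of the playground's actual code.
# fs_type PATH: print the filesystem type backing PATH (GNU stat).
fs_type() { stat -f -c %T "$1"; }

# require_tmpfs DIR: succeed only if DIR is backed by tmpfs.
require_tmpfs() (
  fstype=$(fs_type "$1") || exit 2
  if [ "$fstype" != "tmpfs" ]; then
    echo "refusing: $1 is on $fstype, not tmpfs" >&2
    exit 1
  fi
)
```

Chained in front of the truncate call (require_tmpfs /tmp && truncate ...), this would stop the launch on a box where /tmp is an ordinary on-disk directory.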

Inside the container the daemons' own runtime dirs are tmpfs too: /var/lib/ceph (mon db, mgr db, MDS journal, OSD bookkeeping), plus /etc/ceph, /run/ceph, /var/log/ceph, /tmp. The only thing that touches my SSD is the read-only quay.io/ceph/ceph image layer Docker already cached. Reboot, it's all gone, and the disk never took a write for any of it.
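
With Docker, that list of directories maps onto repeated --tmpfs flags. A small sketch of building them; how the playground actually passes these to docker run is an assumption, and tmpfs_flags is a hypothetical helper:

```shell
# Directories the post lists as tmpfs inside the container.
tmpfs_dirs="/var/lib/ceph /etc/ceph /run/ceph /var/log/ceph /tmp"

# tmpfs_flags: print one "--tmpfs DIR" pair per line, ready to be
# word-split into a `docker run` command line.
tmpfs_flags() {
  for dir in $tmpfs_dirs; do
    printf '%s %s\n' --tmpfs "$dir"
  done
}
```

Left unquoted, $(tmpfs_flags) word-splits into the separate arguments docker run expects.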

State versus data

I kept catching myself calling the mon and MDS bookkeeping "state" and the objects "data," like only one of them needed to be throwaway. Same requirement. If the mon db survives a reboot but the OSD doesn't, you've got a cluster that remembers a map of storage that isn't there anymore. Both live in tmpfs or neither does.

No cephadm, no nested containers

The other call was to skip cephadm completely. It's how you're supposed to bootstrap modern Ceph, but it works by orchestrating a container per daemon. Do that inside a container and now you've got a container runtime inside a container, which is a layer of pain I didn't want to sign up for.

So the entrypoint just starts the daemons directly: a mon, a mgr, one OSD, then optionally an MDS for CephFS and a RADOS gateway for S3, all plain processes in the one container. A --services flag (default rgw,cephfs,rbd) picks which optional ones come up, and they coexist fine. When CephFS or RBD is on, the container flips to host networking so a client on the host can reach the mon, MDS and OSD directly; S3-only keeps the simpler port-forward.
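
The networking decision can be sketched as a small function over the --services value. This is an illustration, not the playground's real code: network_mode is a hypothetical name, and "bridge" stands in for the port-forwarded case on the assumption that it uses Docker's default network.

```shell
# network_mode SERVICES: print "host" when a requested service needs
# direct mon/MDS/OSD access from the host, "bridge" otherwise.
network_mode() {
  case ",$1," in
    *,cephfs,*|*,rbd,*) echo host ;;
    *)                  echo bridge ;;
  esac
}
```

Wrapping the list in commas keeps the match exact, so a service name that merely contains "rbd" would not flip the mode.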

That left the one job cephadm normally does for you: creating the OSD. Which is exactly where the version sweep fell over.

The Quincy launch that hung

I wanted this working on more than one Ceph release, so I ran a sweep: launch each major, probe it, destroy it. The by-hand way to bring up an OSD is ceph-volume raw prepare pointed at the loop device. Fine on the newer releases. On v17, Quincy, the launch just hung.

The container log wasn't much help at first. It kept scrolling an error about "no LV," which sent me off into LVM. Dead end: the playground doesn't use LVM at all, it hands BlueStore a raw loop device.

Once I stopped chasing the LVM thing, the real cause turned up. ceph-volume raw prepare is buggy on Quincy when the target is a loop device. The "no LV" line is rollback noise it prints while unwinding a prepare that never should've started. Not an LVM problem, not a config problem. The tool, on that release, against that kind of device.

the misdirection
[cephplayground] preparing OSD on /dev/cephplay-osd0
[cephplayground] raw prepare with explicit OSD id failed; retrying fresh-cluster prepare
# "no LV found" scrolls here. There is no LV. There was never going
# to be an LV. The message is describing the rollback, not the cause.

The fix was to stop using ceph-volume for this at all. The OSD bootstrap doesn't need it. You can register a new OSD and lay down a BlueStore fs by hand, and that path behaves the same from v16 through v20.

manual BlueStore, identical across five majors
# Register the OSD and mkfs BlueStore directly onto the loop device.
osd_uuid=$(uuidgen)
osd_id=$(ceph osd new "$osd_uuid")
ceph-osd -i "$osd_id" --mkkey
ceph-osd -i "$osd_id" --mkfs --osd-uuid "$osd_uuid"
# No ceph-volume, no LVM, no per-release surprises.

That one change, dropping ceph-volume raw prepare for a manual BlueStore mkfs, is what got the whole sweep passing.
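
A sweep like this is easier to live with when every launch runs under a hard timeout, so a hang gets reported instead of stalling the run. A minimal sketch, assuming GNU coreutils timeout; run_one is a hypothetical helper, and the cephplayground launch spelling in the usage comment is an assumption about the CLI:

```shell
# run_one IMAGE SECONDS CMD...: run CMD with "--image IMAGE" appended,
# under a hard time limit, and print PASS, HANG, or FAIL.
run_one() (
  image=$1
  limit=$2
  shift 2
  if timeout "$limit" "$@" --image "$image"; then
    echo "PASS $image"
  else
    rc=$?
    # GNU timeout exits 124 when it had to kill the command.
    if [ "$rc" -eq 124 ]; then
      echo "HANG $image"
    else
      echo "FAIL $image (exit $rc)"
    fi
  fi
)

# Assumed usage, one line per major:
# for v in 16 17 18 19 20; do
#   run_one "quay.io/ceph/ceph:v$v" 120 cephplayground launch
# done
```

With a wrapper along these lines, a launch that never returns would show up as a HANG line for that image rather than as a sweep that never finishes.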

Ceph majors working: v16 to v20 (Pacific through Tentacle)
Time to ready: ~30 s (one container, one verb)
SSD writes for state: 0 (tmpfs all the way down)

One smaller wrinkle I guarded instead of fixing: the RGW realm bootstrap. Older releases want you to create the realm, zonegroup and zone yourself; v19 and up auto-create the defaults, and the explicit call turns into an error there. The manual setup is therefore gated behind a radosgw-admin zonegroup get probe: it runs where it's needed and no-ops where it isn't. The version knob stays plain --image quay.io/ceph/ceph:v18 instead of a dedicated flag, because a --ceph-version flag would just be sugar over the same string and I'd be chasing upstream tag renames forever.
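
The probe-then-create gate can be sketched roughly as follows. The radosgw-admin subcommands are real, but the realm, zonegroup and zone names ("default") and the exact sequence are assumptions; rgw_admin and ensure_rgw_defaults are hypothetical helpers.

```shell
# Thin wrapper so the gate logic can be exercised without a cluster.
rgw_admin() { radosgw-admin "$@"; }

# ensure_rgw_defaults: create realm, zonegroup and zone only when the
# probe reports that no zonegroup exists yet.
ensure_rgw_defaults() {
  if rgw_admin zonegroup get >/dev/null 2>&1; then
    return 0  # defaults already exist, nothing to do
  fi
  rgw_admin realm create --rgw-realm=default --default &&
    rgw_admin zonegroup create --rgw-zonegroup=default --master --default &&
    rgw_admin zone create --rgw-zonegroup=default --rgw-zone=default \
      --master --default &&
    rgw_admin period update --commit
}
```

Keying the decision on the probe rather than on a version number means the same entrypoint does not need to know which release it is running on.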

The client side was the fiddly part

The cluster side was easy. RGW is just HTTP, so handing someone an S3 endpoint is printing AWS_ENDPOINT_URL and the access/secret keys for a pre-made user. CephFS and RBD aren't that polite. A client needs a keyring, a ceph.conf, and then an actual mount or map, and kernel mounts want root and matching kmods.

So the env verb prints everything a client needs: the conf and keyring paths, plus a ready-to-paste mount line, and you do the mount yourself with the friendlier userspace tools. It writes per-service scoped keyrings instead of handing out client.admin: client.cephplay-fs can touch the filesystem, client.cephplay-rbd can touch the pool, and neither one can administer the cluster.
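
Scoped keyrings of that shape can be minted with stock Ceph commands. A sketch under assumptions: the filesystem name playfs is hypothetical, the pool name rbd is inferred from the map example, and the playground may well generate its keys differently. ceph_cmd and make_scoped_keys are illustrative helpers.

```shell
# Thin wrapper so the logic can be exercised without a cluster.
ceph_cmd() { ceph "$@"; }

# make_scoped_keys DIR: write one keyring per client into DIR.
make_scoped_keys() (
  dir=$1
  # CephFS client: read/write on the filesystem root, nothing else.
  ceph_cmd fs authorize playfs client.cephplay-fs / rw \
    > "$dir/ceph.client.cephplay-fs.keyring" || exit 1
  # RBD client: the stock rbd profiles, limited to a single pool.
  ceph_cmd auth get-or-create client.cephplay-rbd \
    mon 'profile rbd' osd 'profile rbd pool=rbd' \
    > "$dir/ceph.client.cephplay-rbd.keyring"
)
```

Neither capability set includes mon 'allow *' or any mgr access, which is what keeps these clients away from cluster administration.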

what the user actually does
# CephFS over FUSE, no kernel module required:
ceph-fuse --id cephplay-fs --conf "$CEPHPLAY_CONF" /mnt/play

# RBD over NBD, same idea:
sudo rbd-nbd map rbd/play --id cephplay-rbd --conf "$CEPHPLAY_CONF"

The end-to-end test is the part I wouldn't skip, because "the daemons came up" is not the same as "a client can use it." RGW answered 200 and a bucket round-tripped. CephFS mounted over ceph-fuse with the scoped keyring, and a file written through it read back. RBD was the strict one: map over rbd-nbd, mkfs.ext4, mount, write a file, unmap, remap, check the file's still there. That unmap-and-remap is what proves the block actually landed in the backing store and wasn't just sitting in a client cache.
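
The S3 leg of that check can be sketched as a byte-for-byte round trip. This assumes the AWS CLI is installed, is recent enough to honor AWS_ENDPOINT_URL, and has the credentials exported in the environment; s3_roundtrip and the bucket name are hypothetical.

```shell
# s3_roundtrip BUCKET: create a bucket, upload a probe object, download
# it again, and succeed only if the downloaded bytes match the upload.
s3_roundtrip() (
  bucket=$1
  work=$(mktemp -d) || exit 1
  printf 'cephplay probe %s\n' "$$" > "$work/in"
  aws s3 mb "s3://$bucket" >/dev/null &&
    aws s3 cp "$work/in" "s3://$bucket/probe" >/dev/null &&
    aws s3 cp "s3://$bucket/probe" "$work/out" >/dev/null &&
    cmp -s "$work/in" "$work/out"
  rc=$?
  rm -rf "$work"
  exit "$rc"
)
```

Comparing the downloaded bytes with cmp, rather than trusting the exit status of the upload, is the S3 analogue of the unmap-and-remap step for RBD.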

The devices that look like leftovers but are not

After an RBD test, /dev/nbd0 through /dev/nbd15 stick around, all size zero, like the playground failed to clean up. It didn't. Those nodes are what the kernel nbd module makes when it loads (nbds_max defaults to 16), and rbd-nbd map auto-loads it. The slots sit idle until the next reboot or a modprobe -r nbd. Same deal as the pre-allocated /dev/loop* nodes. Knowing which leftovers are yours and which the kernel always makes is half of not chasing ghosts.
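
To tell an idle slot from a mapped device, the size reported in sysfs is enough. A minimal sketch, assuming the usual /sys/block/nbdN/size layout (size in 512-byte sectors, 0 when nothing is attached); active_nbd is a hypothetical helper and takes the sysfs root as an optional argument:

```shell
# active_nbd [SYSFS_ROOT]: print the nbd devices that currently have a
# non-zero size, i.e. the ones with something actually mapped.
active_nbd() (
  root=${1:-/sys/block}
  for d in "$root"/nbd*; do
    [ -r "$d/size" ] || continue
    size=$(cat "$d/size")
    if [ "$size" -gt 0 ]; then
      echo "${d##*/}"
    fi
  done
  exit 0
)
```

If this prints nothing after a destroy, the sixteen nodes still visible under /dev are just the kernel's pre-allocated slots.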

What it is for

This isn't a production deployment and it doesn't pretend to be. One OSD, no replication, no failure domains, the whole cluster a single point of failure by design, all of it in volatile memory. That's the point. It's a real RADOS endpoint with the real RGW, MDS and RBD surfaces, cheap enough to spin up and tear down inside a test run, that leaves nothing behind.

Rook's still the right answer when you want a cluster you can break and watch heal. When you want an S3 URL in thirty seconds and your SSD untouched, this is the smaller, sharper tool.

The code is on GitHub.