The Backup Was From November

The worst moment of any recovery is the one where you read the date on the backup you’re about to restore. Mine said November. It was May.

This is the story of a Proxmox node that lost a disk, a six-month-old Home Assistant restore that should have been a disaster and mostly wasn’t, and the rebuild that followed: a quorum device on my NAS so I can run on one node, and a backup strategy that no longer depends on me having remembered to click something in the fall.

What broke

My homelab runs two Proxmox nodes. The main one does the real work: a Talos VM running my single-node Kubernetes cluster, a Home Assistant VM, a handful of other things. The second is a smaller box I keep as a standby and normally leave powered off; it’s a space heater I don’t need running 24/7.

One disk in the main node died: the one holding the Proxmox VMs. No mirror behind it, so the VMs went with the drive. They weren’t coming back until I’d physically swapped the disk. Parts and downtime, not a five-minute fix.

So I went looking for my backups. The Talos VM was fine: I’d rebuilt that node from scratch about a week earlier, in May, so its vzdump was days old. The Home Assistant VM was the uncomfortable one. Its most recent backup was from November. I’d backed it up once, watched it run, and never built the part where backups keep running and tell me when they don’t. Six months of “it’s fine, I’ll get to it” had quietly become the recovery point objective for the one VM I’d most regret losing. About 180 days of RPO, a number you only measure after the fact.

The restore

The mechanical part went better than the date on it deserved. I powered on the standby node, pointed it at the vzdump files (days old for Talos, November for Home Assistant), and restored both VMs. They booted. The standby carried production while I waited on a drive for the main box.

Talos was the easy one. It came back essentially where I’d left it. The immutable, declarative shape helped: the machine config is the source of truth and the workloads reconcile from manifests, so even the small gap was a non-event. (The iSCSI storage wiring I wrote about earlier reattached cleanly, because the LUNs live on the NAS, not the node.)

Home Assistant was the one I dreaded. Since November I’d added a Z-Wave switch, two Hue bulbs, and a Zigbee bulb, and wired them into a nursery scene: the switch drives Hue presets, including red-only at night. A restore from before all of that should mean an evening with a paperclip, re-pairing devices and rebuilding everything by hand.

It mostly didn’t. The reason is worth more than the rest of this post.

The thing that survived

The Zigbee and Z-Wave networks didn’t live in the backup. They live on the radios.

When you pair a Zigbee device, the network it joins (network key, PAN ID, routing table, the list of who’s a member) is held on the coordinator radio itself. Z-Wave is the same shape: the controller stick’s NVM holds the node list and inclusion state, and the security keys live in the Z-Wave JS config. Home Assistant, in front of all that, is mostly a view of the network. It is not where the network is.

So when I restored that November Home Assistant config and passed the same Zigbee coordinator and Z-Wave stick through to the VM, the radios still held the full, current network. The Z-Wave switch, the Zigbee bulb, the two Hue bulbs: every device I’d added since November was still a member of its network, still keyed, still reachable. Nothing needed re-pairing. The bulbs answered the moment HA came up.

What didn’t survive was the glue. Two pieces of pure config, both added after the November snapshot: an automation and a HACS integration.

Rebuilding an automation and reinstalling a HACS integration is an evening’s tidy-up. Re-pairing every device in the house is a different evening, one with a paperclip and a toddler who’d like the lights to work now. The restore cost me the former and spared me the latter: the radios held the membership, the config held the logic, only one was stale.

The lesson generalizes, and it’s the kind that only shows up on a bad day: where your state actually lives determines what a stale backup costs you. Wi-Fi or cloud integrations are the opposite shape: the config database is the source of truth, and a six-month-old restore would lose every device added since, full stop, no radio to fall back on. Same outage, same backup, catastrophic for cloud devices, a non-event for the mesh ones. I didn’t choose Zigbee and Z-Wave for disaster recovery, but the property that made me like them (local, no cloud, no account) is the same property that put my network membership on hardware the backup couldn’t make stale.
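That question can be asked as a small inventory before the bad day. The sketch below is illustrative only: the StateItem type, the "radio" and "config" labels, and the Wi-Fi plug entry are hypothetical, and the item list is my reading of this post, not an export from Home Assistant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateItem:
    name: str
    lives_in: str               # "radio" (untouched by a VM restore) or "config" (rolled back by it)
    changed_since_backup: bool  # was this created or edited after the backup was taken?


def lost_on_restore(items):
    """Names of items a stale restore would roll back or drop."""
    return [
        item.name
        for item in items
        if item.lives_in == "config" and item.changed_since_backup
    ]


inventory = [
    StateItem("Zigbee network membership", "radio", True),
    StateItem("Z-Wave node list", "radio", True),
    StateItem("automation", "config", True),
    StateItem("HACS integration", "config", True),
    # Hypothetical contrast case: a Wi-Fi device registered only in the config.
    StateItem("Wi-Fi plug registration", "config", True),
]

print(lost_on_restore(inventory))
```

Everything on the radios changed since the backup and still drops out of the list, because the restore never touches where it lives. Only the config-resident items come back as losses.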

Powering the standby back off: the qdevice

The replacement drive arrived. I swapped it in on the main node, restored the VMs onto it, and wanted to do the obvious thing: power the standby off again and stop heating the closet.

A two-node Proxmox cluster makes that annoying. Quorum is a voting game: a cluster needs a strict majority of its expected votes, and with two nodes you have two votes, so the majority is both of them and powering one node off costs the survivor its quorum. The usual Corosync workaround is two_node: 1 in corosync.conf, which turns on wait_for_all by default and lets one node stay quorate. But it’s a footgun: it assumes fencing you don’t have in a homelab, it risks split brain when the link between nodes flaps, and wait_for_all means a cold-booted node won’t gain quorum until it has seen its partner at least once. Which is precisely the situation I’m engineering: a node that boots and runs while its partner stays dark.

The right fix is a third vote that isn’t a third Proxmox node. Corosync supports exactly this: a quorum device (corosync-qnetd) running somewhere always-on, with a corosync-qdevice daemon on each cluster node connecting to it. Now the math is three votes (one per node, one for the qdevice) and quorum is two. The main node plus the qdevice is two votes: quorate, healthy, running. The standby can stay off. And if the two nodes ever partition, the qdevice hands its vote to exactly one side, so there’s an actual arbiter instead of two nodes each convinced they’re in charge.
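A minimal sketch of that vote math, assuming a plain strict-majority rule. It ignores votequorum options such as two_node and wait_for_all, and the Voter type and function names are made up for illustration, not anything from Corosync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    name: str
    votes: int = 1
    online: bool = True


def quorum_threshold(expected_votes: int) -> int:
    """Strict majority: more than half of all expected votes."""
    return expected_votes // 2 + 1


def is_quorate(voters: list[Voter]) -> bool:
    """True if the online voters together reach the majority threshold."""
    expected = sum(v.votes for v in voters)
    present = sum(v.votes for v in voters if v.online)
    return present >= quorum_threshold(expected)


# Two nodes, no qdevice: with the standby off, 1 of 2 votes is not a majority.
two_node = [Voter("pve-main"), Voter("pve-standby", online=False)]

# Two nodes plus a qdevice vote: main + qdevice is 2 of 3.
with_qdevice = [
    Voter("pve-main"),
    Voter("pve-standby", online=False),
    Voter("qdevice"),
]

print(is_quorate(two_node))      # False
print(is_quorate(with_qdevice))  # True
```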

The always-on box in my house is the NAS, so that’s where qnetd goes. My NAS is TrueNAS Core, which means a FreeBSD jail. The catch: the TrueNAS package set ships corosync but not the qnetd piece. Someone on the TrueNAS forums had already walked this path, running corosync-qnetd in a jail, and it saved me a lot of flailing.

On the NAS (the qnetd side), inside a FreeBSD jail with a static IP:

On each Proxmox node (the qdevice client side):

After that, pvecm status shows three expected votes and the qdevice listed as a voting member:

Votequorum information
----------------------
Expected votes:   3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW  pve-main
         2          1    A,V,NMW  pve-standby
         0          1            Qdevice

The standby goes back to being a cold spare I can wake when I need it. The main node, with one node vote plus the qdevice vote (two out of three), runs quorate and alone in between.
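If you want to check that state from a script instead of by eye, something like the following could work. It is a sketch that assumes the text layout shown in the output above; it parses a string and does not invoke pvecm itself, and the function names are hypothetical.

```python
def parse_votequorum(status_text: str) -> dict:
    """Pull the vote counts and flags out of `pvecm status`-style text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            continue  # not a "Key: value" line, or nothing after the colon
        if key in ("Expected votes", "Total votes", "Quorum"):
            # Take the leading number; any trailing words are ignored.
            fields[key] = int(value.split()[0])
        elif key == "Flags":
            fields[key] = value.split()
    return fields


def healthy(fields: dict) -> bool:
    """Quorate flag present and enough votes counted to meet the threshold."""
    flags = fields.get("Flags", [])
    return (
        "Quorate" in flags
        and fields.get("Total votes", 0) >= fields.get("Quorum", 1)
    )


SAMPLE = """\
Votequorum information
----------------------
Expected votes:   3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice
"""

print(healthy(parse_votequorum(SAMPLE)))  # True
```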

The backup strategy that doesn’t say “November”

The qdevice fixed the cluster. It didn’t fix what actually hurt: the six-month-old backup. So I rebuilt that from the RPO backward, around one rule: don’t let the disaster-recovery copy depend on the hardware that’s also production.

Three layers, all daily, all automatic:

  1. Home Assistant config → NAS, over SMB. A scheduled HA backup writes a full snapshot to an SMB share on the NAS every day. HA’s own state (automations, dashboards, the entity tidy-up I’d otherwise lose) lands off the Proxmox host on a cadence, not a whim.
  2. Proxmox VM backups → the same SMB share. The NAS share is registered in Proxmox as CIFS storage, and a daily vzdump job writes the VM images straight to it. The thing that was a once-in-November event is now a job with a schedule and a retention count.
  3. The SMB share → Backblaze B2, daily. A TrueNAS Cloud Sync task pushes the whole backup share to Backblaze B2 every night. That’s the copy that survives the house: the one that’s still there if the NAS itself dies, or worse.

A 3-2-1 backup arriving the long way around: more than two copies, on two kinds of media (local NAS and object storage), one off-site. RPO went from “whenever I last remembered” to one day, and recovery no longer assumes the NAS and both Proxmox nodes survive the same event.

The last piece was the one that actually bit me: knowing whether the jobs run. Proxmox’s notification system fires a webhook on every backup job, success or failure, and I pointed that webhook at a Zap that turns it into an SMS. (Yes, I reached for the product I work on. It was right there and it took ten minutes.) Now each night the backup either texts me that it finished or texts me that it broke, and the night a text doesn’t arrive is itself the signal that something didn’t fire. That’s the whole point: a backup you don’t monitor isn’t a backup, it’s a coincidence with a date on it, and the date is the part you find out about too late.
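A second check that doesn’t depend on the webhook at all is to measure the age of the newest backup file directly. This is a sketch, not what I run: the 36-hour threshold and the function names are assumptions, and the default glob pattern is a guess that matches vzdump’s usual archive names. It only proves a recent file exists, not that it restores.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 36 * 3600  # assumed threshold: a daily job plus some slack


def newest_backup_age(
    backup_dir: Path,
    pattern: str = "vzdump-*.vma*",  # assumed pattern for vzdump VM archives
    now: Optional[float] = None,
) -> Optional[float]:
    """Age in seconds of the newest matching file, or None if there are none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in backup_dir.glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def is_stale(age: Optional[float], max_age: float = MAX_AGE_SECONDS) -> bool:
    """No backup at all counts as stale."""
    return age is None or age > max_age
```

Run on a schedule against the backup share (the path is whatever your NAS uses), is_stale(newest_backup_age(Path(...))) turns “how old is the newest backup” into a yes-or-no answer that can drive the same kind of alert.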

What I’d tell myself in November

The mechanics weren’t hard. Restoring a vzdump, compiling a daemon in a jail, registering a CIFS target: none of it is exotic. The expensive part was entirely in what I’d decided, by default, months earlier and never revisited.

Two things stuck.

RPO is a choice you make in advance and discover after. Nobody decides “six months is fine.” You decide “I’ll set up backups,” watch them run once, and let the absence of an alert stand in for success. The number that matters isn’t whether backups exist; it’s how old the newest good one is at the exact moment you need it.

Know where your state really lives. The best thing about that night was unplanned: the devices survived because their network membership sat on the radios, not in the backup I’d neglected. That wasn’t foresight, but it could have been. “If my newest backup is months stale, what does each part of this actually lose?” is a question I can ask on a calm afternoon instead of a bad night. The radios would have answered “nothing.” The Home Assistant config (automations, the HACS integration, every tweak since the fall) would have answered “November.” I’d rather find out which is which before the next disk goes.