staff site reliability engineer · calgary, ab
Building the quiet parts of reliable systems.
I'm David. I've spent 10+ years on infrastructure, platform engineering, and reliability — most recently at Zapier, where I lead cross-team work on observability, authz at the edge, and SRE practice. This site is where I write things down.
what I work on
Reliability
SLOs, failure modes, and the calculus of what to page on.
Observability
OpenTelemetry, correlated telemetry, and signals that survive contact with production.
Platform
Kubernetes, Envoy, Terraform — internal tools that reduce cognitive load.
Cost + performance
Karpenter, Graviton, query-level optimization. Unit economics matter.
a bit more about me
I'm drawn to the boundary between software engineering and operations: the code that keeps other code running. My favorite work is the kind that disappears: an authz service that adds sub-millisecond p95 latency but unlocks account-level incident visibility, a cluster upgrade process that used to take weeks and now takes a week.
Away from keyboards, I'm usually somewhere in the mountains west of Calgary.
recent writing
The Backup Was From November
A Proxmox node lost a disk, my Home Assistant backup was six months old, and the thing that saved me wasn't a backup at all. It was where Zigbee and Z-Wave keep their state. Plus the qdevice and 3-2-1 setup that means next time won't be November.
iSCSI on Talos: Why the Obvious Path Doesn't Work
Getting iSCSI volumes working on Talos Linux — and why the in-tree volume plugin leads you somewhere painful before the CSI driver shows you the exit.
The Five-Second Ghosts
A diagnostic narrative — paired 500s during our Envoy Gateway rollout, and why AWS NLB client IP preservation breaks pod-to-pod traffic on the same EKS node.