staff site reliability engineer · calgary, ab
Building the quiet parts of reliable systems.
I'm David. I've spent 10+ years on infrastructure, platform engineering, and reliability — most recently at Zapier, where I lead cross-team work on observability, authz at the edge, and SRE practice. This site is where I write things down.
what I work on
Reliability
SLOs, failure modes, and the calculus of what to page on.
Observability
OpenTelemetry, correlated telemetry, and signals that survive contact with production.
Platform
Kubernetes, Envoy, Terraform — internal tools that reduce cognitive load.
Cost + performance
Karpenter, Graviton, query-level optimization. Unit economics matter.
a bit more about me
I'm drawn to the boundary between software engineering and operations, the code that keeps other code running. My favorite work is the kind that disappears: an authz service that adds less than a millisecond of p95 latency but unlocks account-level incident visibility, a cluster upgrade process that used to take weeks and now takes a week.
Away from keyboards, I'm usually somewhere in the mountains west of Calgary.
recent writing
The Five-Second Ghosts
A diagnostic narrative — paired 500s during our Envoy Gateway rollout, and why AWS NLB client IP preservation breaks pod-to-pod traffic on the same EKS node.
Why I'm Starting This Blog
A short note on what to expect here — SRE, infrastructure, and platform engineering, with an emphasis on practical tradeoffs.