Why I'm Starting This Blog
I spend most of my days thinking about reliability, observability, and the unglamorous plumbing that keeps distributed systems running. The best lessons I’ve learned almost never came from a blog post — they came from an outage, a post-mortem, or a conversation with someone who’d already made the mistake I was about to make.
This space is an attempt to pay some of that back. I want to write up a few of the patterns, war stories, and opinions that have stuck with me, in the hope that someone who’s about to make a similar decision has one more reference point.
What to expect
A few threads I’m interested in writing about:
- Envoy, JWTs, and authz at the edge — what actually worked, what burned an afternoon, and when a 1 ms p95 is a lie
- OpenTelemetry in practice — the gap between the spec and what your vendor actually ingests
- Kubernetes upgrade strategy — how to go from “a multi-week ordeal with three engineers” to “one person, one week”
- Karpenter, spot, and Graviton — cost wins that don’t come with an operational tax
I’m going to favor concrete over comprehensive. If a post is useful to exactly one person who has the same problem I had, that’s a win.
What not to expect
No hot takes on whatever service broke this week. No “10 things every SRE should know.” I’d rather publish less and mean it.
More soon.