
The Five-Second Ghosts

The Zapier engineering blog has the official version of how we cut over from Ingress NGINX to Envoy Gateway — 500 ingress resources, 200M requests per hour, six months. It tells you what we found and how we fixed it. This is the version with the wrong turns.

The first ghost

A few weeks into the rollout, our error budgets started bleeding very slowly. Not enough to page anyone — just a steady, low-grade trickle of 500s that hadn’t been there the day before. The shape was odd: once in a while you’d see one 500 in dashboards, and then another five or six seconds later. Pairs. Always pairs.

For a couple of days I thought it was noise. We’d dismissed weirder. But the gap was too clean — the same fingerprint, request after request — and at some point you stop being able to look away.

Wrong turns

The randomness was the first thing that grabbed me. The 500s weren’t concentrated on one service — they were spread across multiple services, none of which had anything in common at the application layer. And every individual failure landed just past the five-second mark after the request had started.
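
Pulling that fingerprint out of logs is cheaper than squinting at dashboards, so it's worth doing early. Here's a minimal sketch; it assumes you've already parsed your access logs into (timestamp, status) tuples, and the five-second gap and the tolerance are tuned to what we were seeing, not constants from anywhere:

```python
from datetime import datetime, timedelta

# Toy records: (timestamp, status). In reality you'd parse these out of
# your access logs; the shape is all that matters here.
events = [
    (datetime(2024, 1, 1, 12, 0, 0), 500),
    (datetime(2024, 1, 1, 12, 0, 5, 200000), 500),
    (datetime(2024, 1, 1, 12, 3, 17), 500),
]

def find_pairs(events, gap=timedelta(seconds=5), tolerance=timedelta(seconds=1.5)):
    """Find 500s that arrive in pairs roughly `gap` apart."""
    errors = sorted(t for t, status in events if status >= 500)
    pairs = []
    for first, second in zip(errors, errors[1:]):
        if abs((second - first) - gap) <= tolerance:
            pairs.append((first, second))
    return pairs

for first, second in find_pairs(events):
    print(f"pair: {first.time()} -> {second.time()} "
          f"({(second - first).total_seconds():.1f}s apart)")
```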

Five seconds in front of Envoy is a suspicious number, in a specific way: it smells like connection reuse. We’d done recent work on upstream pooling, and a misbehaving keep-alive can absolutely manifest as “request goes out, connection turns out to be half-dead, fail five seconds later when the keep-alive interval expires.” That theory had legs for a couple of days. Connection-reuse bugs in Envoy do present like this. I read a lot of upstream HTTP filter code in pursuit of it.

But the metrics never quite lined up. The connections that failed didn’t look stale — they looked like fresh ones, where the request had never made it to the backend at all. That’s a different problem from “we held a connection too long.”

The clue I should have followed sooner was where the five seconds actually came from. It wasn’t an Envoy timeout. It was the NLB’s. When a request lands at an NLB, the NLB opens a new TCP connection to the upstream target — and that new connection has its own SYN-ACK timeout. About five seconds, in our case.

Which means the failure wasn't happening in Envoy, and it wasn't happening in the application. It was happening between the NLB and the upstream pod, at the SYN. The NLB sent a SYN to the target. No SYN-ACK ever came back. After about five seconds, the NLB gave up and surfaced a connection failure to the caller. Add a retry on top, and you get the pair.
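
To make the arithmetic of the pair concrete, here's a toy model of the timeline. This is not NLB internals, just the ~5 second timeout we observed plus one retry:

```python
# Back-of-envelope model of the failure timeline, not NLB internals.
# One hairpinned request with a single retry: each attempt's SYN is
# dropped, so each attempt burns the full connect timeout.
CONNECT_TIMEOUT_S = 5.0  # the ~5s SYN-ACK timeout we observed

t = 0.0
for attempt in (1, 2):  # original request + one retry
    t += CONNECT_TIMEOUT_S  # SYN out, nothing back, timeout fires
    print(f"t={t:>4.1f}s  attempt {attempt} fails -> 500 logged")
# Two 500s, five seconds apart. The pair.
```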

That reframes the problem. It isn’t “intermittent backend errors.” It’s “the NLB can’t reach the backend at all, sometimes.” Different layer, different toolbox.

The hairpin

Once you frame it as “the connection never gets there,” the failure mode you reach for is layer 4. SYNs that don’t see SYN-ACKs. Connections that never reach ESTABLISHED. And in cloud networks, the most common reason for that is something between the client and the server eating packets — a security group, an NACL, a NAT, a policy.

We had Envoy fronted by an AWS Network Load Balancer with preserve_client_ip enabled. That sounded fine. NLB target groups had been preserving client IPs for years before we got there.

But here’s the thing: Envoy was deployed as pods on EKS, with the AWS VPC CNI. So Envoy pods got real VPC IPs. And Envoy’s traffic went back out through the NLB to upstream services that were also pods on EKS, with real VPC IPs, sometimes on the same EC2 worker node.

That’s the trap. AWS documents it directly in the NLB troubleshooting guide:

NAT loopback, also known as hairpinning, is not supported when client IP preservation is enabled. […] If the request is routed to the same instance it was sent from, the connection times out because the source and destination IP addresses are the same. Note that this applies to Amazon EKS pods running in the same EC2 worker node instance, even though they have different IP addresses.

That last sentence is the one that gets you. Two pods on the same node have different VPC IPs. Conceptually they look like they should be fine — they're different hosts, as far as Kubernetes is concerned. But to the underlying instance, both of those addresses live on the same network stack, and conntrack does not care about your CNI's mental model. The NLB sends a SYN to the target pod with the original client IP preserved. The host's kernel looks at that SYN, sees a source and destination that both appear to belong to itself, decides the packet is invalid, and drops it on the floor before it ever reaches the destination pod.

The SYN-ACK never comes back. The NLB’s connection-establishment timeout fires. 5xx. Retry. SYN drops again. Another five seconds. 5xx. Pair.
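
If it helps to see the shape of that drop, here's an illustrative model. The kernel's real decision happens in conntrack and reverse-path filtering, not in anything this tidy, and every address below is made up; the point is that with the VPC CNI, pod IPs are secondary addresses on the node itself:

```python
# Illustrative model only -- the kernel's real logic lives in conntrack
# and reverse-path filtering. With the VPC CNI, pod IPs are secondary
# addresses on the node, so a hairpinned SYN is "from me, to me" as far
# as the host is concerned. All addresses below are made up.
node_local_ips = {
    "10.0.1.10",   # node primary IP
    "10.0.1.23",   # Envoy pod (secondary IP via VPC CNI)
    "10.0.1.47",   # upstream pod (secondary IP via VPC CNI)
}

def hairpinned(src_ip: str, dst_ip: str, local_ips: set[str]) -> bool:
    """A SYN whose source and destination both look local to this node."""
    return src_ip in local_ips and dst_ip in local_ips

# SYN from the NLB, client IP preserved: source is the Envoy pod's IP.
syn = ("10.0.1.23", "10.0.1.47")
if hairpinned(*syn, node_local_ips):
    print("kernel sees a self-addressed packet arriving from outside "
          "-> dropped, no SYN-ACK")
```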

Why pods, not just instances

The thing that makes this especially nasty on EKS is that the trap isn’t visible from the Kubernetes layer. If you’d told me a year before this incident “watch out for NLB hairpinning when an instance talks to itself,” I’d have nodded and not thought about it again, because we don’t run Kubernetes workloads where one instance talks to itself — we run pods, and pods have their own IPs.

But the VPC CNI mode we use puts pods directly on the VPC, sharing the host’s network plumbing. From the NLB’s perspective, “back to the same instance” is exactly what’s happening, even when “back to a different pod” is what we meant. The abstraction does not survive the round-trip.

The bit AWS quietly added to that troubleshooting page — “this applies to Amazon EKS pods running in the same EC2 worker node instance, even though they have different IP addresses” — reads like a footnote. It is not a footnote. It is the entire warning.

The fix landscape

Three real options.

Disable client IP preservation, switch to Proxy Protocol v2. This is what we did. PPv2 puts the original client IP in the protocol header instead of relying on the NLB to preserve it via the TCP layer, which means the NLB no longer has to do the trick that breaks hairpinning. You pay for it: every upstream that needs the client IP has to parse PPv2, including health checks. Envoy handles this fine; readiness probes do not by default, and that's a separate fight you'll have to plan around. (There's a sketch of the mechanical flip after this list.)

Spread workloads so client and server pods can never colocate. Anti-affinity, separate node groups. Functionally works, operationally a tax forever, and you’ll forget why you set it up in two years.

Don’t go through the NLB at all. Use cluster-internal DNS, talk pod-to-pod directly. Right answer for service-to-service traffic, but not always available — sometimes you genuinely need the NLB in the middle (TLS termination, external clients with IP allowlists, third-party integrations).

We did mostly the first, with some of the third. The second is a smell.
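
The mechanical part of the first option is two target group attributes. Here's a sketch using boto3 with a made-up ARN; the attribute keys are the documented ones, though on EKS you'd more likely set them through the AWS Load Balancer Controller's target group attributes annotation than through a raw API call:

```python
import boto3

# Hypothetical ARN -- substitute your own target group.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/envoy-upstream/abc123"
)

elbv2 = boto3.client("elbv2")

# Stop preserving the client IP at the TCP layer (removes the hairpin
# trap) and carry it in a Proxy Protocol v2 header instead.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TARGET_GROUP_ARN,
    Attributes=[
        {"Key": "preserve_client_ip.enabled", "Value": "false"},
        {"Key": "proxy_protocol_v2.enabled", "Value": "true"},
    ],
)
```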

What I’d put in a runbook now

If I were writing the pre-flight check for the next time someone reaches for preserve_client_ip = true on an NLB pointed at EKS, it would be a single question: can any pod that sends traffic through this NLB ever land on the same node as a pod that receives it? If the answer is yes, or even "not sure," treat the knob as off-limits until it's no.

A 30-second question to ask before turning a knob, instead of two days of detective work after.
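
And if you'd rather ask it as a script than a sentence, here's a sketch with the official kubernetes Python client. The label selectors are placeholders; substitute whatever identifies the pods on each side of your NLB:

```python
from kubernetes import client, config

# Placeholder selectors -- substitute whatever labels identify the pods
# that send through the NLB and the pods that sit behind it.
CLIENT_SELECTOR = "app=envoy-gateway"
SERVER_SELECTOR = "app=my-upstream"

config.load_kube_config()
v1 = client.CoreV1Api()

def nodes_for(selector: str) -> set[str]:
    """Nodes currently hosting pods that match the selector."""
    pods = v1.list_pod_for_all_namespaces(label_selector=selector)
    return {p.spec.node_name for p in pods.items if p.spec.node_name}

shared = nodes_for(CLIENT_SELECTOR) & nodes_for(SERVER_SELECTOR)
if shared:
    print("client and server pods colocate on:", ", ".join(sorted(shared)))
    print("preserve_client_ip=true through an NLB will hairpin here.")
else:
    print("no colocation today -- but nothing stops the scheduler tomorrow.")
```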

A footnote on dashboards

The thing I keep coming back to from this one isn’t the kernel-level mechanics — it’s how long the pattern stared at me before I named it. The five-second gap was visible from the first day, and it took the better part of a week to register as the clue it was. The lesson there is older than this incident: when something in your dashboards has the same fingerprint every time, that fingerprint is the diagnostic. Trust the timing. The packets are trying to tell you what they are.