When hybrid connectivity is your lifeline, “good enough” failover isn’t good enough. In the last few years helping customers build cloud backbones, I’ve seen the same pattern: overlay tunnels riding on AWS Direct Connect (DX) behave beautifully on blue‑sky days, then take far too long to recover when something goes wrong. The fix isn’t magic; it’s design discipline. Tuning BGP timers on the underlay, enabling BFD where it matters, and – this is the big one – choosing the right tunnel strategy (pinned vs unpinned) can take you from 90‑second brownouts to sub‑second detours.
In this post I’ll share the mental model, concrete configurations, and a test playbook I use with teams so you can repeat the results with confidence.
The Two Clocks That Decide Your Bad Day
Every failover story here has two clocks:
1. Underlay (DX VIF) convergence: How fast your BGP session over the Virtual Interface decides “my neighbor is gone”. DX VIFs default to 90s hold and 30s keepalive, but you can push them down to 3s hold and 1s keepalive on your router (AWS will negotiate to match); a one‑line sketch follows just below. Don’t go reckless: aggressive timers raise the chance of flaps under load.
2. Overlay tunnel behavior: What your GRE/IPsec/BGP on top of DX does when the path beneath it wobbles. Here, Transit Gateway Connect (the GRE+BGP option) uses a different set of defaults – 10s keepalive and 30s hold – which become painfully relevant if your tunnels are pinned to one DX path.
Now add BFD to the picture. On DX, asynchronous BFD is enabled on the AWS side by default, and you only need to enable it on your router. The minimum supported interval is 300 ms with a multiplier of 3, so your detection time is roughly 900 ms (a massive improvement over even the most aggressive BGP timers alone).
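If you do decide to tighten the underlay BGP timers themselves (point 1 above), it’s a single line per neighbor on a Cisco‑style edge router. The neighbor address and ASN below are placeholders that match the examples later in this post, and I’d treat 1s/3s as the floor rather than the default:

router bgp 65000
 ! 1s keepalive / 3s hold; the session negotiates down to the lower of the two sides
 neighbor 169.254.100.2 timers 1 3

In practice I usually leave the BGP timers at their defaults and let BFD carry the fast‑detection load, for exactly the flap‑under‑load reason above.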
Pinned vs Unpinned Tunnels (and Why I’m Biased)
Think of pinned tunnels as bolted to a specific DX VIF interface IP. When that underlay hiccups, the tunnel stays “up” until its own BGP/DPD timers expire, so your failover time is underlay detection + overlay timeout. That’s why even with snappy underlay detection, you still wait for the overlay’s 30‑second hold to catch up.
By contrast, unpinned tunnels source from a loopback on your router. Both the AWS endpoint and your loopback are reachable via all DX VIFs. When one VIF fails, routing simply steers packets for the loopback across the surviving VIF, so the overlay doesn’t need to time out at all. Your failover becomes “whatever the underlay decides,” which with BFD is on the order of 900 ms (interval × multiplier). That architectural choice alone often saves you tens of seconds in the real world.
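In configuration terms the whole distinction comes down to one line in the tunnel definition. The two alternative tunnel source statements below are illustrative and use the same interface names as the fuller example later in this post:

! Pinned: sourced from a specific DX VIF sub-interface, so the tunnel's fate is tied to that VIF
 tunnel source TenGigabitEthernet1/0/0.100
! Unpinned: sourced from a loopback that remains reachable via any surviving VIF
 tunnel source Loopback1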
My rule of thumb: If you can stomach a bit more routing hygiene and internal iBGP, build unpinned and enable BFD on the VIFs. That’s the highest return on complexity I’ve found.
What “Good” Looks Like (Targets to Aim For)
- Underlay (DX VIF):
  - Keepalive/Hold: tune down from 30s/90s only if your hardware and ops can sustain it; otherwise let BFD do the heavy lifting.
  - BFD: interval 300 ms, multiplier 3 → detection ≈ 900 ms. Validate vendor support and CPU headroom.
- Overlay (TGW Connect or VPN):
  - Prefer loopback‑sourced (unpinned) tunnels so the overlay doesn’t need to flap on underlay loss.
  - Remember: TGW Connect’s BGP defaults are 10s/30s; with pinned tunnels, those become your penalty box after an underlay fault.
Minimal, Reproducible Configs (Cisco‑flavored)
Enable BFD on your DX VIF (AWS side already supports it; you only configure your router):
interface TenGigabitEthernet1/0/0.100
description DX-VIF-1
encapsulation dot1q 100
ip address 169.254.100.1 255.255.255.252
bfd interval 300 min_rx 300 multiplier 3
no shutdown
router bgp 65000
neighbor 169.254.100.2 remote-as 64512
neighbor 169.254.100.2 fall-over bfd
That’s the essence: 300 ms minimum tx/rx and a multiplier of 3. On most platforms you can confirm the session with show bfd neighbors detail.
Unpinned tunnel pattern (loopback‑sourced, advertised everywhere):
! Tunnel source is a loopback, not the DX interface IP
interface Loopback1
description Tunnel-Source
ip address 192.168.1.1 255.255.255.255
no shutdown
interface Tunnel10
description TGW-Connect
ip address 172.16.1.1 255.255.255.252
tunnel source Loopback1
tunnel destination 10.0.1.1
no shutdown
! DX VIF (with BFD as above)
interface TenGigabitEthernet1/0/0.100
description DX-VIF-1
encapsulation dot1q 100
ip address 169.254.100.1 255.255.255.252
no shutdown
router bgp 65000
bgp log-neighbor-changes
address-family ipv4
! Advertise the loopback so AWS can reach your tunnel source via all VIFs
network 192.168.1.1 mask 255.255.255.255
neighbor 169.254.100.2 activate
! Use local-pref or communities on Private/Transit VIFs; AS-Path on Public VIFs
neighbor 169.254.100.2 route-map ADVERTISE-LOOPBACK out
exit-address-family
This pattern lets AWS reach your loopback via any surviving VIF. Combine it with iBGP/IGP between your edge routers so both can forward to the same loopback if a chassis fails.
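One piece the snippet above references but doesn’t define is the ADVERTISE-LOOPBACK route‑map. A minimal sketch looks like this (the prefix‑list name is a placeholder; the intent is simply to advertise the tunnel‑source loopback and nothing you didn’t mean to):

! Placeholder prefix-list: match only the tunnel-source loopback
ip prefix-list TUNNEL-LOOPBACK seq 5 permit 192.168.1.1/32
route-map ADVERTISE-LOOPBACK permit 10
 match ip address prefix-list TUNNEL-LOOPBACK
! Everything else falls to the route-map's implicit deny, which is usually what you want
! when the VIF is purely tunnel transport; relax it if the VIF also carries other prefixes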
Engineering the Routing (Without Overengineering)
- Symmetry matters: When advertising your loopbacks to AWS, use BGP Local Preference communities on Private/Transit VIFs and AS‑PATH prepending on Public VIFs to bias the preferred underlay while keeping the alternate fully viable (see the sketch after this list).
- Don’t source tunnels from the point‑to‑point peering subnets: Keep those link‑local /30s (or /29s) for eBGP only; use loopbacks or LAN addresses for tunnel endpoints. This prevents unnecessary pinning and simplifies multi‑VIF reachability.
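To make the symmetry point concrete, here’s roughly how the bias might look on the preferred and backup edge routers. The 7224:x values are the local‑preference communities AWS documents for Direct Connect (low/medium/high); the route‑map and prefix‑list names carry over from the sketch above and are placeholders:

! Accept/display communities in aa:nn format
ip bgp-community new-format
! Preferred router/VIF: tag the loopback with the high local-preference community
route-map ADVERTISE-LOOPBACK permit 10
 match ip address prefix-list TUNNEL-LOOPBACK
 set community 7224:7300
! Backup router/VIF: low local-preference community, or AS-path prepend on a Public VIF
route-map ADVERTISE-LOOPBACK permit 10
 match ip address prefix-list TUNNEL-LOOPBACK
 set community 7224:7100
 ! set as-path prepend 65000 65000
!
router bgp 65000
 address-family ipv4
  ! Communities only influence AWS if you actually send them
  neighbor 169.254.100.2 send-community

The split mirrors the rule above: communities where AWS honors local preference (Private/Transit VIFs), prepending on Public VIFs.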
How I Validate Before Anyone Sleeps on It
A crisp test plan turns theory into trust:
- Baseline: Record current failover with default timers (DX BGP 30/90, no BFD). Note packet loss and total recovery time.
- Turn on BFD (underlay only): Repeat the same test. You should see detection ≈ 900 ms and end‑to‑end recovery well under a second if your overlay is unpinned.
- Overlay sensitivity check: If you kept a pinned tunnel for comparison, notice the extra 30s tied to the overlay BGP hold. It’s a great demo for stakeholders.
- Document & monitor: Capture router BFD/BGP state transitions and correlate with CloudWatch metrics so operations can detect regressions later.
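On the router side, a handful of standard Cisco show commands gives before/after evidence for every step above (exact keywords vary slightly by platform; addresses and interface names match the earlier examples):

show bfd neighbors detail              (BFD session state, negotiated intervals, flap counts)
show ip bgp summary                    (underlay and overlay BGP session state and uptime)
show ip bgp neighbors 169.254.100.2    (negotiated keepalive/hold on the DX VIF session)
show ip route 10.0.1.1                 (which VIF the GRE outer destination is currently using)
show interfaces Tunnel10               (the overlay should stay up through an underlay failover)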
Common Gotchas I Still See (and How to Avoid Them)
- “We enabled BFD, but failover is still 30s”: Nine times out of ten the tunnels are pinned. BFD only accelerates the underlay; the overlay is still waiting on its own timers. Move to loopback‑sourced tunnels (unpinned) so only the underlay clock matters.
- “We set BGP to 1s/3s everywhere; now it flaps”: Be conservative with BGP timers; let BFD do fast detection. Under CPU spikes or transient congestion, ultra‑aggressive BGP timers will reset sessions and cause bigger outages than they prevent.
- “Does AWS support BFD on the overlay?”: Not on the overlay sessions discussed here. On DX VIFs (the underlay) the answer is yes: asynchronous BFD is enabled on the AWS side by default, and you only configure your router. Overlay heads such as TGW Connect rely on their own BGP timers, so BFD isn’t something you tune on the overlay session.
My Default Blueprint
- Use two or more DX VIFs across distinct devices/locations per your resiliency target.
- Enable BFD on every VIF: interval 300, multiplier 3.
- Build loopback‑sourced (unpinned) tunnels to TGW Connect or VPN.
- Engineer symmetric routing with communities and/or AS‑PATH; keep the alternate hot.
- Test. Measure. Then test again after every change window.
With that, I routinely see failover shrink from ~90 seconds to ~900 ms, and customer incident reviews get a lot shorter.
Final Thoughts
The beauty of this approach is that it’s not vendor‑religious or feature‑fragile. You’re aligning the fast‑detection tool (BFD) with the part of the system that can actually act on it (the underlay), and you’re designing the overlay so it doesn’t need to fail. That combination – plus a bias for unpinned tunnels – has been the most durable way I’ve found to make hybrid connectivity behave like a modern SLO‑driven platform.