Blog ENG - AWS - Post 1 2025
If you’ve spent years nurturing a well-behaved MPLS backbone, stepping into the cloud can feel both exciting and a little unnerving. The promise of agility meets the reality of governance, segmentation, and decades of hard‑won routing hygiene. I’ve helped many enterprises connect mature MPLS estates to AWS, and the pattern repeats: clear segmentation, clear route control, and clear operational guardrails. This article is a practical walk-through of integrating MPLS with AWS, shaped by field experience. It focuses on the mental models, trade-offs, and the “gotchas” you’ll want to anticipate before the first BGP session comes up.
What “MPLS to AWS” Really Means
Think of AWS as another significant site on your network, not just “the internet with a friendly face”. Your MPLS provider gives you predictable private transport with VRFs, QoS, and route governance. AWS gives you programmable infrastructure with VPCs, route tables, and service endpoints.
Bridging the two means you’ll:
- Decide where your “cloud core” lives (most teams choose AWS Transit Gateway for scale and segmentation).
- Map segmentation between worlds (MPLS VRFs ↔ TGW route tables ↔ VPC CIDR domains).
- Control route propagation and traffic preference across multiple paths (Direct Connect and VPN often coexist).
- Instrument the path (visibility, failover testing, and runbooks).
Connectivity Options: How to Get Packets Moving
1) AWS Direct Connect (DX) via Your MPLS Provider
- What it is: A dedicated, private connection from your network to AWS. In many enterprises, the MPLS partner acts as the bridge, handing off your VRF to a DX location and establishing a BGP session to AWS on a private virtual interface (VIF), or on a transit VIF when the target is a Transit Gateway.
- Why it works well:
  - Predictable latency and throughput for steady hybrid traffic, such as replication and bulk data movement between your sites and AWS.
  - Clean segmentation alignment: your provider can connect specific VRFs to specific AWS environments.
  - Operational familiarity: BGP, route filtering, and standard change controls fit neatly into existing practices.
- Be mindful of:
  - Encryption: DX traffic is not encrypted by default. If encryption is required, add MACsec where supported or run an IPsec overlay.
  - Redundancy: Prefer two DX locations with diverse paths and separate virtual circuits, and test failover.
  - MTU handling: AWS supports jumbo frames (9001 bytes inside a VPC), while MPLS MTUs vary by provider and circuit; agree on a value per interface and test it end to end.
  - Route scale and governance: AWS enforces per-session prefix quotas on DX virtual interfaces, and exceeding them can take the BGP session down. Favor summarization, prefix lists, and communities to keep advertisements small and stable.
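Summarizing before you advertise is easy to check offline. The sketch below uses Python's standard `ipaddress` module to collapse contiguous prefixes and compares the result against a per-session cap. `PREFIX_LIMIT` is a placeholder assumption, not an authoritative figure; set it from AWS's current Direct Connect quotas for your VIF type. The site prefixes are hypothetical.

```python
import ipaddress

# Placeholder cap; confirm against AWS's current Direct Connect quotas.
PREFIX_LIMIT = 100

def summarize(prefixes):
    """Collapse contiguous or overlapping prefixes into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

def check_advertisement(prefixes, limit=PREFIX_LIMIT):
    """Return the summarized list and whether it fits under the per-session cap."""
    summary = summarize(prefixes)
    return summary, len(summary) <= limit

if __name__ == "__main__":
    # Hypothetical site routes: sixteen contiguous /24s plus two adjacent /25s.
    site_routes = [f"10.20.{i}.0/24" for i in range(16)] + ["10.30.0.0/25", "10.30.0.128/25"]
    summary, fits = check_advertisement(site_routes)
    print(summary, fits)
```

`collapse_addresses` only merges prefixes that are exactly contiguous. Advertising a broader aggregate that also covers unused space is a policy decision you make deliberately on the edge router, not something this check does for you.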
2) Site‑to‑Site IPsec VPN Over MPLS
- What it is: A quick, secure tunnel from your data center (or provider edge) to AWS, terminating on a Virtual Private Gateway (VGW) or a Transit Gateway (TGW).
- Why teams use it:
  - Fast start: A great way to light up initial connectivity while DX is being provisioned.
  - Encryption by default: Security compliance is simpler from day one.
  - Flexibility: Can ride your MPLS underlay or any available internet path.
- Watch out for:
  - Throughput ceilings and packet overhead: each AWS VPN tunnel has a bandwidth cap (check the current quota), and IPsec plus NAT‑T headers shrink the usable MTU.
  - TCP MSS clamping and PMTUD: clamp MSS on the tunnel rather than relying on PMTUD alone, because filtered ICMP turns oversized packets into silent drops.
  - Long‑term scaling: As more VPCs and routes appear, you’ll eventually want DX in the mix.
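To see where an MSS value comes from, here is a rough calculator for IPv4 ESP tunnel mode. It assumes AES-CBC (16-byte IV and cipher block), a 16-byte integrity check value, 8 bytes of UDP encapsulation when NAT‑T is active, and no IP or TCP options. Those transform choices are assumptions; pass different values to match what your tunnel actually negotiates.

```python
"""Estimate the TCP MSS to clamp on an IPsec tunnel (IPv4, ESP tunnel mode)."""

IPV4_HEADER = 20
UDP_NAT_T = 8      # UDP encapsulation added when NAT traversal is in use
ESP_HEADER = 8     # SPI (4 bytes) + sequence number (4 bytes)
TCP_HEADER = 20    # no TCP options assumed

def esp_packet_size(inner_len, nat_t=True, iv=16, block=16, icv=16):
    """Outer packet size for an inner IP packet of inner_len bytes.

    Defaults assume AES-CBC with a 16-byte ICV; adjust to your negotiated transforms.
    """
    encrypted = inner_len + 2              # ESP trailer: pad length + next header
    encrypted += (-encrypted) % block      # pad up to the cipher block size
    outer = IPV4_HEADER + (UDP_NAT_T if nat_t else 0)
    return outer + ESP_HEADER + iv + encrypted + icv

def max_inner_mtu(underlay_mtu, **esp_params):
    """Largest inner packet that still fits in a single underlay packet."""
    inner = underlay_mtu
    while inner > 0 and esp_packet_size(inner, **esp_params) > underlay_mtu:
        inner -= 1
    return inner

def clamped_mss(underlay_mtu, **esp_params):
    """MSS = usable inner MTU minus the inner IPv4 and TCP headers."""
    return max_inner_mtu(underlay_mtu, **esp_params) - IPV4_HEADER - TCP_HEADER

if __name__ == "__main__":
    for nat_t in (True, False):
        print(f"NAT-T={nat_t}: inner MTU {max_inner_mtu(1500, nat_t=nat_t)}, "
              f"MSS {clamped_mss(1500, nat_t=nat_t)}")
```

With a 1500-byte underlay and these assumptions, the sketch yields an MSS of 1382 with NAT‑T and 1398 without. Vendor and AWS sample configurations often clamp somewhat lower to leave headroom for other encapsulations along the path; for production, follow those values and use a calculation like this to understand why.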
3) Hybrid Path: Start with VPN, Grow into DX
Many enterprises start with VPNs (to unblock projects) and layer in Direct Connect for performance and scale. Keep VPNs as resilient backup paths, and steer traffic preference with BGP (no heroic cutovers required).
Build Your “Cloud Core” with AWS Transit Gateway
Transit Gateway (TGW) often becomes the routing nucleus in AWS: VPCs attach to TGW, TGW has route tables, and you decide which VPCs can talk to each other (and which can’t).
Design tips:
- Segment deliberately: Use multiple TGW route tables to mirror VRFs.
- Control propagation: Avoid “propagate everything to everyone”. Associate each attachment with the route table for its segment, then propagate its routes only into the tables that should reach it.
- Central services: Shared DNS, inspection, and egress usually live in a dedicated shared-services VPC connected through TGW.
- DX + TGW pairing: Connect DX to TGW through a transit VIF and a Direct Connect gateway associated with the TGW; the association’s allowed prefixes control which AWS CIDRs are advertised back to your MPLS edge.
The biggest wins come from route table discipline: it’s where cloud agility meets enterprise guardrails.
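Route table discipline rests on two separate knobs: association (which single table an attachment's traffic is looked up in) and propagation (which tables learn that attachment's routes). The toy model below captures only those two rules to show how a prod/dev split with shared services plays out. It is an illustration, not an AWS API. Attachment and table names such as `dx-gw` and `rt-prod` are hypothetical, and the model ignores static routes, blackholes, and prefix lengths.

```python
from dataclasses import dataclass, field

@dataclass
class TgwModel:
    """Toy model of TGW association/propagation rules (illustrative, not an AWS API)."""
    associations: dict = field(default_factory=dict)  # attachment -> its single route table
    propagations: dict = field(default_factory=dict)  # route table -> attachments propagating into it

    def associate(self, attachment, table):
        # An attachment is associated with exactly one route table; that table
        # is used to look up traffic arriving from the attachment.
        self.associations[attachment] = table

    def propagate(self, attachment, table):
        # Propagation installs the attachment's routes into a table; an
        # attachment may propagate into several tables.
        self.propagations.setdefault(table, set()).add(attachment)

    def can_reach(self, src, dst):
        table = self.associations.get(src)
        return table is not None and dst in self.propagations.get(table, set())

    def can_talk(self, a, b):
        # A working flow needs routes in both directions.
        return self.can_reach(a, b) and self.can_reach(b, a)


def example_segmentation():
    """Hypothetical prod/dev split with shared services and a DX gateway attachment."""
    tgw = TgwModel()
    for attachment, table in [("prod-vpc", "rt-prod"), ("dev-vpc", "rt-dev"),
                              ("shared-vpc", "rt-shared"), ("dx-gw", "rt-onprem")]:
        tgw.associate(attachment, table)
    # Shared services are reachable from every segment, including on-prem.
    for table in ("rt-prod", "rt-dev", "rt-onprem"):
        tgw.propagate("shared-vpc", table)
    # Both VPCs are visible to shared services; only prod is visible on-prem.
    tgw.propagate("prod-vpc", "rt-shared")
    tgw.propagate("dev-vpc", "rt-shared")
    tgw.propagate("prod-vpc", "rt-onprem")
    # On-prem (MPLS via DX) routes are installed for prod and shared services only.
    tgw.propagate("dx-gw", "rt-prod")
    tgw.propagate("dx-gw", "rt-shared")
    return tgw


if __name__ == "__main__":
    tgw = example_segmentation()
    names = ["prod-vpc", "dev-vpc", "shared-vpc", "dx-gw"]
    for a in names:
        print(a, [b for b in names if b != a and tgw.can_talk(a, b)])
```

Prod and dev never see each other, both reach shared services, and only prod reaches on-prem: the same intent you would express with VRFs, expressed as associations and propagations.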
BGP and Route Control: Keep It Boring (That’s Good)
When BGP is boring, availability is high.
Essentials:
- Summarize wherever possible and filter aggressively: Don’t flood TGW with granular or unstable prefixes.
- Avoid overlapping RFC1918 assignments: A clean IP plan saves more time than any clever knob-twiddling.
- Plan ASNs carefully: distinct domains simplify policy and troubleshooting.
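- Preference steering: Use AS‑path prepending, local preference, and communities to prefer DX over VPN (or the reverse during tests). Remember that a longer prefix beats all of these, and that TGW already prefers Direct Connect gateway routes over VPN routes for identical prefixes.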
- Timers and dampening: Keep convergence sane; protect the control plane from flapping peers.
- Static routes sparingly: If you must, document why, where, and for how long.
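The interplay between prefix length, local preference, and AS‑path prepending is worth seeing in miniature. The sketch below combines a longest-prefix lookup with a heavily trimmed slice of the BGP decision process (local preference, then AS‑path length). Real BGP has many more tiebreakers, and TGW applies its own route evaluation order, so treat this as a mental model only. The ASN 65010 and the prefixes are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    path: str               # label for the path, e.g. "dx" or "vpn"
    as_path: tuple          # ASNs, most recent first; prepending repeats an ASN
    local_pref: int = 100   # common default; higher is preferred

def prepend(route, asn, times):
    """Return a copy of route with asn prepended `times` extra times."""
    return Route(route.prefix, route.path, (asn,) * times + route.as_path, route.local_pref)

def best_route(routes, destination):
    """Simplified selection: longest prefix, then highest local pref, then shortest AS path."""
    dest = ipaddress.ip_address(destination)
    matching = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not matching:
        return None
    longest = max(ipaddress.ip_network(r.prefix).prefixlen for r in matching)
    candidates = [r for r in matching if ipaddress.ip_network(r.prefix).prefixlen == longest]
    # Remaining ties fall back to list order here; real BGP keeps comparing attributes.
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))

if __name__ == "__main__":
    ONPREM_ASN = 65010  # hypothetical private ASN for the MPLS edge
    dx = Route("10.50.0.0/16", "dx", (ONPREM_ASN,))
    vpn = prepend(Route("10.50.0.0/16", "vpn", (ONPREM_ASN,)), ONPREM_ASN, 2)
    print(best_route([vpn, dx], "10.50.1.10").path)          # DX wins on AS-path length
    leaked = Route("10.50.8.0/24", "vpn", (ONPREM_ASN,))
    print(best_route([vpn, dx, leaked], "10.50.8.20").path)  # the more specific VPN route wins
```

The second lookup is the classic trap: a single more-specific prefix leaking onto the backup path silently overrides all the prepending you did, which is one more reason to summarize consistently on every path.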
Think routes first, bandwidth second. Poor route hygiene will break things long before bandwidth does.
High Availability: Design for Failure, Test for Reality
Redundancy patterns:
- Two DX locations, diverse providers, independent power/cooling.
- Dual routers at each edge, with independent BGP sessions and separate virtual circuits.
- End‑to‑end failover tests that go beyond “BGP is up”: simulate real application flows and DNS resolution.
Operational instrumentation:
- Health hooks: Monitor BGP session state and metrics; alert on route count anomalies and path changes.
- Playbooks: Document how to force traffic down alternate paths (policy switch, prefix tweak, maintenance toggle).
- Change windows: Validate rollback steps and capture pre/post metrics. Treat network changes as product releases.
Failover plans that aren’t rehearsed are just PowerPoint.
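Route-count alerting can start very simply: keep a baseline of prefixes received per BGP session and flag large deviations, missing sessions, and sessions you didn't expect. The sketch below assumes you already collect per-session counts from somewhere (router telemetry, SNMP, or a CLI scrape); the session names and the 20% tolerance are hypothetical.

```python
def route_count_alerts(baseline, current, tolerance=0.2):
    """Compare per-session prefix counts against a baseline.

    baseline/current map a BGP session name to the number of prefixes received.
    A session missing from `current` is treated as down or unmonitored.
    """
    alerts = []
    for session, expected in sorted(baseline.items()):
        observed = current.get(session)
        if observed is None:
            alerts.append(f"{session}: no data (session down or telemetry gap)")
        elif abs(observed - expected) > tolerance * max(expected, 1):
            alerts.append(f"{session}: {observed} prefixes vs baseline {expected}")
    for session in sorted(set(current) - set(baseline)):
        alerts.append(f"{session}: unexpected session with {current[session]} prefixes")
    return alerts

if __name__ == "__main__":
    baseline = {"dx-loc1-r1": 42, "dx-loc2-r2": 42, "vpn-tunnel1": 42}
    current = {"dx-loc1-r1": 43, "dx-loc2-r2": 9, "vpn-tunnel1": 42}
    for alert in route_count_alerts(baseline, current):
        print(alert)
```

A relative threshold keeps small sessions from paging on every one-prefix change, while a sudden drop on one DX session (here `dx-loc2-r2`) surfaces immediately.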
QoS and Traffic Classes: Be Practical
MPLS QoS designs can get elaborate. In AWS, keep it pragmatic:
- DSCP: markings generally pass through, but don’t expect AWS to prioritize traffic based on them; your MPLS QoS policy effectively ends at the provider handoff.
- Shape at the edges: Apply policing/shaping at the customer edge and provider handoffs; don’t rely on magic in the middle.
- Measure and adapt: Watch latency/queueing for critical flows (backup, database replication, interactive apps) and adjust.
Aim for predictability over perfection. You’ll get more value from steady performance than from exotic class hierarchies.
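Edge policing is usually built on a token bucket: tokens refill at the contracted rate, a burst allowance caps how many can accumulate, and a packet conforms only if enough tokens are available. The sketch below is a single-rate policer with deliberately tiny, hypothetical numbers (8 kbit/s, 1500-byte burst) so the behavior is easy to follow. A shaper would queue non-conforming packets instead of dropping or re-marking them, and on a real CE router this logic lives in the platform's QoS configuration.

```python
class TokenBucketPolicer:
    """Single-rate, single-bucket policer: conform while enough byte tokens remain."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # token refill rate in bytes per second
        self.capacity = float(burst_bytes)  # committed burst size
        self.tokens = float(burst_bytes)    # start with a full bucket
        self.last = 0.0                     # timestamp of the previous packet (seconds)

    def conforms(self, now, size_bytes):
        """True if a packet of size_bytes arriving at `now` conforms (forward it);
        False if it exceeds (drop or re-mark it)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False

if __name__ == "__main__":
    # Hypothetical 8 kbit/s contract (1000 bytes/s) with a 1500-byte burst.
    policer = TokenBucketPolicer(rate_bps=8_000, burst_bytes=1_500)
    arrivals = [(0.0, 1500), (0.0, 100), (0.5, 500), (1.0, 600), (2.0, 1500)]
    print([policer.conforms(t, size) for t, size in arrivals])
```

The first packet drains the burst, the second exceeds, half a second of refill admits 500 bytes, and a full second of idle time restores the bucket: steady, predictable behavior rather than anything exotic.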
Security by Design: Segmentation, Encryption, Inspection
- Segmentation first: Distinct VPCs and TGW route tables minimize blast radius and simplify compliance.
- Encrypt when required: DX + MACsec (where supported) or IPsec overlays for sensitive flows.
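- Inspection patterns: Insert firewalls via Gateway Load Balancer in a dedicated inspection VPC and steer traffic there with TGW routing; enable appliance mode on that VPC’s TGW attachment so both directions of a flow reach the same appliance.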
- Visibility: Use flow logs, traffic mirroring, and BGP telemetry to detect anomalies quickly.
Treat security as a routing problem and a visibility problem, not just a box problem.
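Flow logs become useful quickly once you aggregate them. The sketch below parses records in the default VPC Flow Logs format (version 2, fourteen space-separated fields) and ranks sources by rejected packets, skipping NODATA and SKIPDATA records. The sample records are fabricated for illustration, and custom log formats need a different field list.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

def parse_record(line):
    """Return a dict for a complete, OK-status record; None otherwise."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, values))
    return record if record["log_status"] == "OK" else None

def top_rejected_sources(lines, n=5):
    """Sum rejected packets per source address and return the top n."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record and record["action"] == "REJECT":
            counts[record["srcaddr"]] += int(record["packets"])
    return counts.most_common(n)

# Fabricated sample records for illustration only.
SAMPLE_RECORDS = [
    "2 123456789012 eni-0a1b2c3d4e5f60718 10.20.1.15 10.50.2.10 49152 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d4e5f60718 10.99.0.7 10.50.2.10 51515 22 6 4 240 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d4e5f60718 10.99.0.7 10.50.2.10 51516 3389 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d4e5f60718 - - - - - - - 1700000000 1700000060 - NODATA",
]

if __name__ == "__main__":
    print(top_rejected_sources(SAMPLE_RECORDS))
```

Here an on-prem address probing SSH and RDP stands out immediately; in practice you would run the same aggregation in your log platform rather than in a script.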
Operational “Gotchas” I See Often
- MTU mismatches: VPC instances at 9001, MPLS at 1500, IPsec overhead chewing bytes; solve with MSS clamping and end‑to‑end tests.
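- DNS surprises: Hybrid name resolution needs deliberate split‑horizon planning and conditional forwarding (in AWS, typically Route 53 Resolver inbound and outbound endpoints); otherwise queries quietly fall back to public answers or time out.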
- NAT side effects: Egress patterns and NAT gateways can obscure source IPs; inspection and logging must account for this.
- Asymmetric routing: Multi‑path designs (DX + VPN) can create subtle asymmetry; keep path preferences consistent in both directions, or make sure stateful devices and policies tolerate it.
- Prefix sprawl: Avoid death by CIDR. Summarize, retire, and standardize before you migrate.
- IPv6 timing: Introduce IPv6 deliberately. It brings cleaner addressing and future‑proofing, but watch dual‑stack complexity on day one.
None of these are hard, but they become hard when discovered during an outage.
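Many DNS surprises come down to which forwarding rule wins. The sketch below picks the most specific matching domain suffix, matching on whole labels, which is how conditional forwarders, including Route 53 Resolver rules, typically choose between overlapping rules. The domains and target labels are hypothetical and represent the on-prem resolver's view.

```python
def pick_forwarder(qname, rules, default="recursive"):
    """Return the target for the most specific rule matching qname.

    rules maps a domain suffix (lowercase, no trailing dot) to a target label.
    The name is walked from the full name toward the root, so the longest
    (most specific) matching suffix wins.
    """
    labels = qname.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in rules:
            return rules[suffix]
    return default

if __name__ == "__main__":
    # Hypothetical on-prem view: the corporate zone stays on-prem, while a
    # cloud subdomain is forwarded to AWS (for example, a Resolver inbound endpoint).
    rules = {
        "corp.example.com": "onprem-dns",
        "cloud.corp.example.com": "aws-inbound-endpoint",
    }
    for name in ["app1.cloud.corp.example.com", "Files.Corp.Example.com.", "www.example.org"]:
        print(name, "->", pick_forwarder(name, rules))
```

Remove the `cloud.corp.example.com` rule and cloud names fall through to the on-prem zone, which typically answers NXDOMAIN: exactly the kind of surprise that is easy to find in a test and painful to find in an outage.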
A Pragmatic Migration Path
- Phase 1: Discover & Plan
  - Inventory routes, VRFs, address space, critical apps, and interdependencies.
  - Identify “shared services” and decide their home (often a dedicated VPC).
- Phase 2: Connect & Stabilize
  - Stand up VPN for early connectivity, then bring Direct Connect online.
  - Establish TGW segmentation and controlled propagation; validate failover.
- Phase 3: Refactor & Scale
  - Move shared services behind consistent inspection.
  - Summarize prefixes, trim legacy routes, and standardize automation.
  - Expand with confidence: new VPCs slot into known TGW patterns.
The goal is not speed, but repeatability.
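The Phase 1 address inventory is where overlapping RFC1918 assignments should surface, long before a VPC goes live. The sketch below checks every pair of assignments in a simple owner-to-CIDR map using the standard `ipaddress` module. The inventory entries are hypothetical, and the check is meant for address assignments, not summary routes, which are expected to cover other prefixes.

```python
import ipaddress
from itertools import combinations

def find_overlaps(inventory):
    """inventory maps an owner label to a CIDR string; returns overlapping owner pairs."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in inventory.items()}
    return sorted(
        (a, b)
        for (a, net_a), (b, net_b) in combinations(sorted(nets.items()), 2)
        if net_a.version == net_b.version and net_a.overlaps(net_b)
    )

if __name__ == "__main__":
    # Hypothetical inventory mixing MPLS VRF site ranges and planned VPC CIDRs.
    inventory = {
        "mpls-vrf-prod/site-a": "10.10.0.0/16",
        "mpls-vrf-dev/site-b": "172.16.0.0/16",
        "aws-prod-vpc": "10.10.128.0/20",
        "aws-dev-vpc": "10.60.0.0/16",
    }
    for pair in find_overlaps(inventory):
        print("overlap:", pair)
```

A pairwise check is fine for inventories of a few thousand entries; beyond that, sorting networks by address and scanning neighbors scales better, but the repeatable habit matters more than the algorithm.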
Final thoughts
Integrating MPLS with AWS is less about inventing new tricks and more about applying proven network discipline to programmable infrastructure. Keep your segmentation clean, your routes summarized, your paths redundant, and your operations rehearsed. Start small, stabilize, then scale with confidence.
In my experience, teams that succeed treat the cloud like a first‑class citizen of the enterprise network. When you do that, DX and TGW become powerful tools, VPNs become reliable safety nets, and your MPLS backbone continues to do what it was built to do: deliver predictable, governed connectivity for the workloads that matter.