High availability is a major concern with inter-region VPC peering connections. While the IRC was still in development, we did a lot of customer outreach and research to make sure we were on the right track and had the right mix of features. High availability came up every time as a major requirement. Customers were reluctant to shift their VPC communication from the relative comfort of AWS’s network infrastructure to a completely new player, and wanted to know what we were doing to keep their connections available. But before you go off wondering what I am talking about, a little context:
Inter Region Connect
Datapath.io’s engineering team recently came up with a super easy solution to create secure, reliable and optimized VPC peering connections with reserved bandwidth. We call it the IRC (Inter-region Connect). The IRC integrates multiple networks in a marketplace of sorts, where AWS users can pick the networks they want their VPCs to communicate over, choose the bandwidth they require and deploy security on top of it, all in a matter of minutes.
By ensuring that the traffic exchange between AWS VPCs is restricted to a single transit provider, Datapath.io dramatically reduces the probability of it hitting a congested internet highway. The ability to reserve bandwidth for VPC to VPC communication also avoids issues around over-provisioning and packet loss.
However, we see the IRC’s core value in its being a marketplace. Visibility into the pricing and performance of several networks and providers makes the whole process more transparent and passes the benefits of a marketplace on to customers.
Anyways, back to the discussion about availability:
The IRC works on top of an AWS Direct Connect connection. Once we had Direct Connect in place, we went on an ISP shopping spree and scooped up as many as we could. The ISPs/transit providers gave us access to the public internet and other AWS regions.
The IRC re-routes inter-region traffic from the default AWS gateway to a VGW (Virtual Private Gateway), which in turn hands the traffic over to the transit provider chosen by the customer.
How we ensure high availability for IRC
To ensure a resilient and highly available VPC peering connection we set to work baking in multiple redundancy levels. This is what we came up with:
Every component of the IRC is redundant. Instances, AZs, internet gateways and networks can all be replaced automatically.
To start with, the CloudFormation stack that sets up the IRC creates two instances on each side of the IRC, that is, two instances in each region. Each instance is placed in a different AZ. This is the first level of redundancy. Whenever an AZ goes down, we automatically fail over to the instance in the other AZ without affecting the IRC. The same happens whenever an instance fails. Auto Scaling groups let us re-create an instance whenever it dies and have it take back the traffic once it is fully operational.
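The instance-level failover described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Datapath.io's actual implementation: the `Instance` type, the `healthy` flag and the `pick_active` function are hypothetical names, and the sketch assumes the first instance in the list is preferred, so that a re-created instance takes the traffic back once it reports healthy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    instance_id: str  # hypothetical identifier
    az: str           # availability zone the instance runs in
    healthy: bool     # outcome of a hypothetical health check


def pick_active(instances: List[Instance]) -> Optional[Instance]:
    """Return the first healthy instance in priority order.

    Assumption: the list is ordered by preference, so traffic moves to
    the instance in the other AZ on failure and returns to the preferred
    instance once it is healthy again.
    """
    for inst in instances:
        if inst.healthy:
            return inst
    return None  # both instances are down; nothing can carry traffic


# Two instances in one region, each in a different AZ.
pair = [Instance("i-a", "az-1", True), Instance("i-b", "az-2", True)]
active = pick_active(pair)      # the preferred instance in az-1

pair[0].healthy = False         # az-1 (or the instance itself) fails
failover = pick_active(pair)    # the instance in az-2 takes over

pair[0].healthy = True          # the instance was re-created and is healthy
restored = pick_active(pair)    # traffic returns to the instance in az-1
```

Keeping the selection a pure function of health state makes the failover and the later take-back the same code path, which is one simple way to model the behaviour described here.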
This is what the IRC usually looks like. Active instances are communicating over Datapath.io’s VGW.
Any problem at the network/transit provider level leads to an automated failover to the default AWS IGW.
Another level of redundancy is implemented at the internet gateway level and responds to any problems at the network level. Whenever we detect problems with the network/transit provider that the IRC is set up on, we automatically fail over to the default AWS IGW. The same happens whenever the VGW has problems: we seamlessly fail over to the default AWS IGW.
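The gateway-level decision described above boils down to a small rule: use the VGW while both it and the chosen transit provider are healthy, otherwise fall back to the default IGW. The sketch below is a minimal model of that rule. The names `Gateway` and `select_gateway` and the two boolean health signals are assumptions made for illustration; how the problems are actually detected is not covered here.

```python
from enum import Enum


class Gateway(Enum):
    VGW = "vgw"  # Virtual Private Gateway in front of the transit provider
    IGW = "igw"  # default AWS internet gateway, used as the fallback path


def select_gateway(transit_ok: bool, vgw_ok: bool) -> Gateway:
    """Pick the gateway that inter-region traffic should be routed to.

    Assumption: both inputs come from hypothetical health checks. Traffic
    stays on the VGW only while the VGW and the transit provider are both
    healthy; any problem on either sends it to the default IGW.
    """
    if transit_ok and vgw_ok:
        return Gateway.VGW
    return Gateway.IGW


normal = select_gateway(transit_ok=True, vgw_ok=True)         # Gateway.VGW
transit_down = select_gateway(transit_ok=False, vgw_ok=True)  # Gateway.IGW
vgw_down = select_gateway(transit_ok=True, vgw_ok=False)      # Gateway.IGW
```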
All these redundant components allow us to provide an SLA with 99.9% availability for the IRC.
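To put the 99.9% figure in perspective, the short calculation below converts an availability percentage into a downtime budget. The measurement periods (a 30-day month and a 365-day year) are assumptions chosen for illustration, and `downtime_budget_minutes` is a hypothetical helper name.

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Maximum downtime, in minutes, allowed by an availability target."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_hours * 60


# Assumed periods: a 30-day month (720 hours) and a 365-day year (8760 hours).
per_month = downtime_budget_minutes(99.9, 30 * 24)   # about 43.2 minutes
per_year = downtime_budget_minutes(99.9, 365 * 24)   # about 525.6 minutes

print(f"99.9% over 30 days:  {per_month:.1f} minutes")
print(f"99.9% over 365 days: {per_year / 60:.2f} hours")
```

Under those assumptions, 99.9% availability leaves room for roughly 43 minutes of downtime in a month, or just under nine hours in a year.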
If you want to learn more about Inter Region Connect, download the IRC Whitepaper.