Today, most financial services (FS) enterprises are transforming rapidly by adopting next-generation technologies, and cloud adoption is no exception. Most medium to large FS enterprises operate on hybrid infrastructure models (a combination of AWS and on-premises) to expand their footprint and meet customer demand globally.
Most enterprises start by adding resources to a single AWS account with one or more VPCs. The account represents a management boundary that includes users, roles, permissions, and the configured AWS services, such as EC2 instances, S3 buckets, RDS databases, and everything else needed to run the business.
As an organization scales and customer demands around the world keep changing, it must continue to serve its customers with the same experience as before. These requirements quickly leave organizations in need of a scalable and reliable networking infrastructure.
Often, organizations rush to add whatever is required to fulfil their customers’ requirements without considering security and scalability. Each iteration adds another layer of networking changes that compromise scalability and security, and over time the network becomes hard to scale for future demands.
This whitepaper is written keeping these scenarios in mind and highlights the various networking options and their implementation best practices to build a scalable and reliable AWS network.
It covers AWS connectivity types, their use-cases, best practices, and design considerations.
In AWS, networking is sometimes underestimated and treated as little more than managing connectivity between on-premises networks and an AWS VPC, or configuring firewalls. However, networking starts with deciding the CIDRs (Classless Inter-Domain Routing blocks) for VPCs and configuring subnets, Network Access Control Lists (NACLs), and security groups within VPCs, and it extends to managing multi-account, multi-VPC, and hybrid connectivity.
Virtual Private Cloud (VPC) is the foundational AWS networking service, and anyone working with AWS networking needs to be familiar with it first because it is the logical container for the other resources provisioned in the AWS cloud. It operates like a traditional on-premises data center, except that resources are dynamically provisioned, disposed of, and scaled. In practice, these resources need to be connected to resources in other VPCs or in on-premises networks to share data and communicate so that enterprise applications can run. That is how an enterprise’s AWS network grows with its business and application needs.
Because VPCs are the networking containers for enterprises, it is very important to make them scalable and reliable for future needs. AWS recommends a set of best practices for this.
The following best practices are common considerations for building a VPC, based on lessons learned across different use-cases. They do not constitute a complete networking solution.
VPC with its Purpose
Everything is designed and created with a purpose, and a VPC is no exception. Enterprises need to understand the purpose a VPC is being designed for. The following examples pair a purpose with the most suitable VPC design.
Use-case | Suitable VPC configuration |
---|---|
Internet-facing application to be deployed. | VPC with a public subnet. |
Internet-facing application with processing power and database storage. | VPC with public subnet for internet requests and application processing and storage capabilities in private subnet. |
Cloud computing resources need data (stored in on-premises data centers) for processing. Processing is not time critical and can tolerate network inconsistencies. | VPC with private subnet and Site-to-Site VPN (Virtual Private Network) tunnel connecting VPC and on-premises networks. |
Cloud computing resources need data (stored in on-premises data centers) for processing. Processing is time critical and cannot tolerate network inconsistencies. | VPC with private subnet and AWS Direct Connect connectivity between VPC and on-premises networks. |
VPC with Multiple Data Centers for High Availability
Failures are bound to occur, and the only way to limit their impact is to prepare for them when designing your networks. A deployment across multiple Availability Zones (multi-AZ) keeps the network and site available to customers if one zone fails. It reduces failures caused by network bottlenecks by distributing load between zones, which increases overall reliability. It can also help absorb DDoS attacks to an extent by scaling the network across multiple zones.
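A multi-AZ layout starts with carving the VPC address range into one subnet per zone. The following is a minimal sketch using only the Python standard library; the function name `plan_subnets`, the example CIDR, and the Availability Zone names are assumptions chosen for illustration, not values from this paper.

```python
import ipaddress

def plan_subnets(vpc_cidr, azs, new_prefix):
    """Assign one equally sized subnet to each Availability Zone (hypothetical helper)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if new_prefix < vpc.prefixlen:
        raise ValueError("subnet prefix must not be shorter than the VPC prefix")
    # Number of subnets of the requested size that fit in the VPC range.
    available = 2 ** (new_prefix - vpc.prefixlen)
    if len(azs) > available:
        raise ValueError("VPC range is too small for one subnet per AZ")
    # subnets() yields consecutive blocks; zip pairs each AZ with the next block.
    return {az: str(net) for az, net in zip(azs, vpc.subnets(new_prefix=new_prefix))}

# Illustrative values only.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"], 20)
for az, cidr in plan.items():
    print(az, cidr)
```

Keeping the subnets the same size in every zone makes it simpler to fail over a workload from one zone to another without running out of addresses.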
Leverage Security Group References
When access between resources is granted by referencing security groups rather than individual IP addresses, the rules scale with the workload. For instance, if one EC2 instance is given access to another EC2 instance through a security group reference, every new instance added to the same security group gets that permission automatically.
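The effect of a security group reference can be illustrated with a small model. This is a simplified simulation, not the AWS API: the classes `Instance` and `IngressRule`, the function `is_allowed`, and the group IDs are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    security_groups: set  # IDs of the groups the instance belongs to

@dataclass
class IngressRule:
    port: int
    source_group: str  # a referenced security group ID, not an IP range

def is_allowed(source, rules, port):
    """True if any rule admits traffic on `port` from a group the source belongs to."""
    return any(
        rule.port == port and rule.source_group in source.security_groups
        for rule in rules
    )

# The database tier admits PostgreSQL traffic from members of "sg-app".
db_rules = [IngressRule(port=5432, source_group="sg-app")]

app_1 = Instance("app-1", {"sg-app"})
app_2 = Instance("app-2", {"sg-app"})      # added later; no rule change needed
batch_1 = Instance("batch-1", {"sg-batch"})

print(is_allowed(app_2, db_rules, 5432))    # member of the referenced group
print(is_allowed(batch_1, db_rules, 5432))  # not a member
```

The rule list never changes as the application tier grows, which is the scalability benefit described above.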
Choose the Right CIDR Size
Classless Inter-Domain Routing (CIDR) size is one of the most important aspects of VPC design for the following reasons:
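To make the sizing question concrete, the following minimal sketch estimates how many addresses a subnet actually offers and which prefix length fits a given host count. It relies on two AWS behaviours: five addresses are reserved in every subnet, and IPv4 subnet sizes range from /28 to /16. The function names are hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # addresses AWS reserves in every subnet

def usable_hosts(subnet_cidr):
    """Addresses left for workloads after the AWS reservation."""
    net = ipaddress.ip_network(subnet_cidr)
    return max(net.num_addresses - AWS_RESERVED_PER_SUBNET, 0)

def smallest_prefix_for(hosts):
    """Longest prefix (smallest subnet) between /28 and /16 that fits `hosts`."""
    for prefix in range(28, 15, -1):
        if 2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET >= hosts:
            return prefix
    raise ValueError("host count does not fit in a /16")

print(usable_hosts("10.0.1.0/24"))
print(smallest_prefix_for(300))
```

Sizing for expected growth rather than for the current host count avoids having to re-address subnets later.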
Use VPC Flow Logs to Monitor IP Traffic
Reliability is not just about preventing a network from failing. It also reflects how soon the network can be restored to its normal operating state. With proper logging in place, enterprises and teams can diagnose issues much more easily. VPC Flow Logs help analyze the IP traffic going to and from the network interfaces in a VPC.
In practice, a VPC needs to be connected to hundreds of other resources inside and outside the VPC for enterprise applications to run. This connectivity is broadly categorized into two types in AWS: inter-VPC connectivity and hybrid connectivity.
Enterprises across the globe run both types of connectivity to meet business performance and regulatory compliance needs.
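As a rough illustration of the kind of diagnosis flow logs allow, the sketch below parses records in the default flow log format and counts rejected connections per source address. The sample records are made up for this example, and the helper names `parse_record` and `rejected_by_source` are hypothetical.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

def parse_record(line):
    """Split one space-separated record into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of fields: %d" % len(values))
    return dict(zip(FIELDS, values))

def rejected_by_source(lines):
    """Count REJECT records per source address."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record["action"] == "REJECT":
            counts[record["srcaddr"]] += 1
    return counts

# Illustrative records only; not real traffic.
sample = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.25 49152 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.2.14 10.0.1.25 44321 5432 6 20 4249 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.25 49153 3389 6 1 60 1700000060 1700000120 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 198.51.100.7 10.0.1.25 50000 23 6 1 60 1700000060 1700000120 REJECT OK",
]

rejects = rejected_by_source(sample)
print(rejects.most_common(1))
```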
Inter-VPC Connectivity
AWS provides several options to connect VPCs and serve heterogeneous business needs. For scalable and reliable network connectivity, select the most appropriate option at the outset. The following are some of the most important connectivity options and their relevant use-cases.
VPC Peering
VPC peering is a point-to-point private connection between VPCs that communicate over private IP addresses, which makes it one of the most secure options for connectivity within AWS networking. It has the following features:
Because it is a point-to-point connection, it does not scale easily. Every new VPC needs a change at both ends of each peering connection. For example, if an organization already has 100 VPCs in a full mesh, adding one more means setting up and routing 100 new peering connections. VPC peering also has certain limits:
The following diagram illustrates the VPC peering mesh between networking resources:
AWS Transit Gateway (TGW)
A TGW is a networking service used in both inter-VPC and hybrid connectivity. It is based on the hub-spoke model, which enables multi-directional routing and centralizes configuration. With a TGW at the center, attaching a new networking component is straightforward because the component only needs to connect to the transit gateway. After that, the existing routing configuration carries traffic to the other components. It also supports Equal Cost Multi-Path (ECMP) routing over multiple VPN tunnels, so throughput scales with the number of connected VPN tunnels. Although a TGW is a regional construct, it is still an option when connectivity is required across regions because it supports cross-region peering. A TGW is therefore the recommended scalable and reliable solution for a multi-VPC, multi-region architecture that connects to on-premises networks.
Characteristic | VPC Peering | AWS Transit Gateway |
---|---|---|
Connectivity Type Supported | Only AWS Networking | Hybrid |
Architecture | Full Mesh | Hub-Spoke |
Transitive Connectivity | Not Supported | Supported |
Cross-Region | Supported | Supported (through cross-region transit gateway peering) |
Infrastructure Scaling | Up to 125 active peers per VPC | 5,000 attachments per transit gateway |
Bandwidth | No Limit | Up to 50 Gbps per attachment |
Latency | Minimum (point-to-point connectivity) | Higher than VPC peering because of transit gateway hop |
Visibility | Limited visibility (VPC Flow Logs) | Higher visibility (flow logs, Transit Gateway Network Manager, CloudWatch metrics) |
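The difference between the full mesh and hub-spoke architectures in the table can be made concrete with simple arithmetic: a full mesh of n VPCs needs n(n-1)/2 peering connections, while a hub needs one attachment per VPC. The following is a minimal sketch; the function names are hypothetical.

```python
def peering_connections(vpc_count):
    """Peering connections needed for a full mesh of `vpc_count` VPCs."""
    return vpc_count * (vpc_count - 1) // 2

def new_links_for_next_vpc(vpc_count, topology):
    """Links to create when one more VPC joins `vpc_count` existing VPCs."""
    if topology == "mesh":
        return vpc_count  # one peering connection to every existing VPC
    if topology == "hub":
        return 1          # a single attachment to the transit gateway
    raise ValueError("topology must be 'mesh' or 'hub'")

for n in (5, 20, 100):
    print(n, peering_connections(n), new_links_for_next_vpc(n, "mesh"),
          new_links_for_next_vpc(n, "hub"))
```

The quadratic growth of the mesh is what makes the hub-spoke model the more manageable choice as the number of VPCs increases.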
AWS PrivateLink
VPC peering and transit gateway are bidirectional networking components. In real-world scenarios, not all connections are bidirectional. Sometimes only a unidirectional connection is required, where the consumer is allowed to initiate a connection to the producer and the producer is blocked from accessing the consumer. AWS PrivateLink is the best-suited option for such scenarios. It provides private connectivity between VPCs, AWS services, and on-premises networks without exposing traffic to the public internet. AWS PrivateLink is a regional construct; in a multi-region network it provides private connectivity within each region.
Both AWS PrivateLink and VPC peering provide private connectivity, but there are a few differences, listed here for quick reference.
VPC Peering | AWS PrivateLink |
---|---|
Bidirectional | Unidirectional |
VPC-to-VPC | VPC-to-VPC, AWS services, On-prem |
Only non-overlapping CIDRs are allowed | Overlapping CIDRs are allowed |
Use when layer-3 IP connectivity between VPCs is required | Use when a client/server setup with consumer-initiated service access is required |
The following diagram illustrates a PrivateLink setup in the AWS cloud:
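Because VPC peering requires non-overlapping CIDRs, it is worth checking address ranges before choosing between the two options. The sketch below uses the Python standard library to list overlapping pairs; the VPC names and CIDR values are illustrative assumptions, and `overlapping_pairs` is a hypothetical helper.

```python
import ipaddress
from itertools import combinations

def overlapping_pairs(cidrs):
    """Return the pairs of names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]

# Illustrative address plan only.
vpcs = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.1.0.0/16",
    "vpc-c": "10.0.128.0/17",  # falls inside vpc-a's range
}

conflicts = overlapping_pairs(vpcs)
print(conflicts)
```

Any pair reported here cannot be peered as is; per the table above, PrivateLink remains an option for such pairs when consumer-initiated access is sufficient.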
As highlighted in the setup, the producer of services needs a Network Load Balancer (NLB) to make services available to consumers. Consumers use a VPC endpoint (PrivateLink) to consume the services. A multi-AZ setup is required for higher availability and more reliability.
NAT (Network Address Translation) Gateway
AWS networking services such as VPC peering, transit gateway, and PrivateLink (interface endpoints) are useful in private networking setups where connectivity is needed within AWS or between AWS and on-premises networks. None of these services connects to the internet.
Both NAT gateways and internet gateways are networking resources used within a VPC to connect to the internet. The difference is that a NAT gateway provides unidirectional connectivity that originates from private subnets, for example to fetch software updates and new releases. NAT connections are secure in the sense that instances in private subnets behind a NAT gateway cannot receive requests that originate from the internet. NAT is also used in the following use-cases:
A NAT gateway is created in a public or a private subnet, depending on the use-case. A NAT gateway in a public subnet is used for internet connectivity, whereas a NAT gateway in a private subnet is used for connectivity to on-premises networks. If traffic from a private NAT gateway is routed to an internet gateway by mistake, the internet gateway drops it.
Hybrid Connectivity
Enterprises dealing with extremely sensitive customer financial information most likely store that data in an on-premises network to protect it from being compromised. There can be thousands of similar scenarios in which data is stored on-premises and used by both on-premises and cloud applications for processing. Applications in the AWS cloud use a hybrid connectivity setup to read, process, and write data to and from the on-premises network. Connectivity between AWS and on-premises networks works best when their CIDRs do not overlap.
Enterprises start small in the cloud and then scale to a multi-VPC, multi-region, and multi-account network. Later in this whitepaper, connectivity architectures are described that manage such scaling needs appropriately and reliably.
The following are the most common use-cases for hybrid connectivity networks.
The following AWS networking components and architectures address these common use-cases and others.
AWS Site-to-Site VPN Connection
Site-to-Site VPN is the easiest way to enable hybrid connectivity between AWS and on-premises networks because no physical setup is needed. Traffic travels over the internet and is protected end to end in transit by IPsec encryption with Diffie-Hellman (DH) key exchange. It is called site-to-site because one connection is needed per site (on-premises to AWS or vice versa). Each connection consists of two tunnels, and each tunnel terminates in a different AZ for high availability.
Because it is the simplest and quickest way to securely connect AWS and an on-premises network, enterprises adopt it early in their cloud journey for small workloads and build their networking around it. However, this networking option has the following limitations:
Scalability: Scaling becomes hard and requires more maintenance as the number of sites increases.
Throughput: Each tunnel in a VPN connection has a maximum throughput of 1.25 Gbps and cannot scale beyond that.
Consistency: Because Site-to-Site VPN traffic travels over the internet, it is prone to interruptions.
Given how quickly it can be adopted and the limitations above, this hybrid connectivity option is recommended when scaling needs are minimal and the required data transfer bandwidth does not exceed 1.25 Gbps.
AWS Transit Gateway (TGW)
As described earlier, this hub-spoke model is used in both inter-VPC and hybrid connectivity to solve scaling problems in networking.
The following figure shows how it solves the scaling problems that surface as the number of VPCs grows in a Site-to-Site VPN networking design.
This design is recommended for workloads that need a data transfer bandwidth of no more than 1.25 Gbps. However, given the nature of the processing and workloads that cloud applications handle today, designing for higher throughput capacity from the start is the more convenient option. TGW supports an Equal Cost Multi-Path (ECMP) strategy that aggregates the bandwidth of multiple VPN connections from customer gateways. This overcomes the throughput limit of a single tunnel and allows bursts of up to 50 Gbps per attachment.
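The ECMP aggregation described above lends itself to a quick capacity estimate. The following sketch assumes the 1.25 Gbps per-tunnel limit stated earlier, two tunnels per VPN connection, and a per-attachment ceiling taken here as 50 Gbps. It also assumes both tunnels of each connection carry traffic and that load spreads evenly across tunnels, which real traffic may not do because ECMP balances per flow. The function names are hypothetical.

```python
import math

TUNNEL_GBPS = 1.25           # per-tunnel limit stated in this paper
TUNNELS_PER_CONNECTION = 2   # each Site-to-Site VPN connection has two tunnels
ATTACHMENT_CAP_GBPS = 50     # assumed per-attachment ceiling

def tunnels_needed(target_gbps):
    """Minimum number of tunnels whose combined limit reaches the target."""
    if target_gbps <= 0:
        raise ValueError("target must be positive")
    if target_gbps > ATTACHMENT_CAP_GBPS:
        raise ValueError("target exceeds the assumed per-attachment ceiling")
    return math.ceil(target_gbps / TUNNEL_GBPS)

def vpn_connections_needed(target_gbps):
    """Minimum number of VPN connections, assuming both tunnels are active."""
    return math.ceil(tunnels_needed(target_gbps) / TUNNELS_PER_CONNECTION)

for target in (1, 5, 10):
    print(target, tunnels_needed(target), vpn_connections_needed(target))
```

A single flow is still bound by one tunnel's limit, so this estimate applies to aggregate throughput across many flows rather than to any one transfer.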
DX Dedicated Connection (Direct Connect)
Almost every industry runs scheduled and in-memory processes that perform bulk operations, often outside operational hours. These processes are mission critical, need consistent throughput, and cannot tolerate interruption. AWS Direct Connect is recommended when designing networking for such mission-critical processes. It is a dedicated private connection, set up by AWS, from an on-premises network to an in-cloud network in a region. Because it must be set up physically, it takes more time to establish than other networking designs, but once in place it reduces network costs, increases bandwidth throughput, and provides a consistent network experience.
Direct Connect is a networking facility set up, monitored, and maintained by AWS. The customer’s on-premises data center is physically connected to a requested port at the Direct Connect location. From there, AWS uses its own network to transmit data between Direct Connect and the customer’s VPCs. This connectivity is configured through a virtual interface (VIF), a logical interface built on top of the physical connection. Direct Connect supports three types of VIFs: private, public, and transit.
Direct Connect Gateway (DXGW)
With increasing business demand and customer presence across the globe, enterprises scale from single-region to multi-region data centers, and their networks grow from a single region to multiple regions in the same way. A Direct Connect gateway is a globally available AWS networking resource used to connect on-premises networks to a cloud network spanning multiple regions. It can connect VPCs across multiple accounts if the Direct Connect gateway and VPCs are owned by the same AWS payer account.
An AWS Direct Connect gateway is connected to one of the following gateways: a transit gateway or a virtual private gateway.
The following diagram, from left to right, illustrates how a Direct Connect gateway simplifies the network.
Connectivity Design Selection
This paper has described some of the most vital networking components and their use-cases based on scalability needs. However, because business requirements are heterogeneous, a single networking component is rarely enough to meet them, and the mix of requirements shapes the overall design selection.
This section discusses some of the most important connectivity use-cases and respective networking designs that are highly available, resilient, and fault-tolerant.
a. Fault-tolerant multi-VPC hybrid connectivity for small workloads
b. Fault-tolerant multi-VPC hybrid connectivity for bulk or consistent workloads
c. Highly available multi-region, multi-VPC hybrid connectivity for bulk or consistent workloads
d. Highly available multi-region, multi-VPC hybrid connectivity for bulk or consistent workloads with low operating cost
As the industry grows every day, infrastructure needs are becoming more dynamic and demand greater scalability. Networks must be designed well upfront, with foreseen business growth in mind, because network scalability is difficult and costly to achieve once connectivity is established.