## Introduction

You've probably heard the saying, "Time is money." In IT, this couldn't be truer: every minute a system is down means lost revenue and productivity. By some estimates, 98% of companies lose around $100,000 for every hour of downtime. That's a huge hit to the wallet.

The AWS cloud allows teams to build highly resilient applications, but it's up to those teams to design their systems to be highly available or fault-tolerant. High availability and fault tolerance might look similar, but they serve different purposes.

## AWS Scenarios: High Availability vs. Fault Tolerance

Both high availability and fault tolerance share the same goal: keeping your systems up and running when something fails within your architecture. Yet there are specific differences, which we'll explain below.

High availability can be defined as maintaining a percentage of uptime that preserves operational performance. It closely aligns with SLAs (service-level agreements); Amazon Web Services currently publishes 300+ service-level agreements on its official website, covering its many services. Fault tolerance refers to a workload's ability to remain operational during a disruption, with no downtime or data loss.

Let's look at a general scenario to understand the difference between high availability and fault tolerance.

### High Availability Scenario

Let's assume you have an application that needs to run across two Amazon EC2 instances, with an SLA of 99.99%. That allows for only about 4.4 minutes of downtime per month (a 99.9% SLA would allow roughly 43.8 minutes). We could then architect our infrastructure as described below.

We could use two different availability zones (AZs) within a region, with two EC2 instances in each AZ, all behind an Elastic Load Balancer. In this example, several elements contribute to a highly available solution: we use two availability zones and additional EC2 instances, so we have plenty of computing resources if an instance fails.
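As a quick sanity check on those SLA numbers, here is a small Python sketch that converts an uptime percentage into allowed monthly downtime. The helper name and the 30.44-day average month are our own assumptions, not an AWS formula:

```python
def allowed_downtime_minutes(sla_percent: float, days_per_month: float = 30.44) -> float:
    """Convert an uptime SLA percentage into the downtime it permits per month, in minutes."""
    minutes_per_month = days_per_month * 24 * 60
    return (1 - sla_percent / 100) * minutes_per_month

# 99.9% ("three nines") allows roughly 43.8 minutes per month,
# while 99.99% ("four nines") allows only about 4.4 minutes.
print(round(allowed_downtime_minutes(99.9), 1))
print(round(allowed_downtime_minutes(99.99), 2))
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why tightening an SLA usually demands more redundant infrastructure.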
Or, if an entire AZ fails, we still have at least two instances to maintain the required SLA.

### Fault Tolerance Scenario

Now let's look at fault tolerance. It expands on high availability to offer greater protection should components begin to fail in your infrastructure. There are usually additional cost implications due to the greater level of resiliency, but the upside is that your uptime percentage increases, and there is no service interruption even if one or more components fail.

In the scenario above, having two EC2 instances in each availability zone already provides a degree of fault tolerance: operations would be maintained even if an AZ were lost, and the minimum number of instances would still be running. However, if another failure occurred on top of that, the SLA would be impacted.

### High Availability With Fault Tolerance Scenario

So let's look at how we can adapt our high availability scenario with a more fault-tolerant design. Previously, we had a single-region approach. To increase the solution's uptime, we could deploy the app across an additional AWS region, mirroring the environment from the first region to a second one.

This means we still have spare compute resources if an EC2 instance fails, and enough computing capacity even if an AZ fails. But now we can also maintain operations if the entire region fails. Even with further EC2 or availability zone outages in the secondary region, we can keep two EC2 instances running at all times. This solution therefore offers greater uptime than the previous single-region solution.

Yet it comes with the increased cost of running two active environments that can tolerate any component failing. Remember, the secondary region needs to be running already (active-active) to avoid downtime if the primary region fails. Fault-tolerant systems are thus intrinsically highly available; however, as we have seen, a highly available solution is only partially fault-tolerant.
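The intuition that redundancy multiplies uptime can be sketched numerically. Assuming components fail independently (a simplification, and our own illustrative numbers), a redundant system is down only when every copy is down:

```python
from math import prod

def combined_availability(availabilities: list[float]) -> float:
    """Availability of redundant, independently failing copies:
    the system is unavailable only when every copy is unavailable."""
    return 1 - prod(1 - a for a in availabilities)

# One region at 99.9% uptime is down ~0.1% of the time.
# Two independent regions at 99.9% each are both down only 0.0001% of the time.
single = 0.999
dual = combined_availability([0.999, 0.999])
print(f"{single:.4%}")
print(f"{dual:.6%}")
```

Real failures are rarely fully independent (shared dependencies, correlated events), so treat this as an optimistic upper bound rather than a guarantee.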
It's up to you to decide what level of high availability or fault tolerance to implement. That decision depends on the business impact of components beginning to fail.

## Key Differences Between High Availability & Fault Tolerance in AWS

| Basis | High Availability | Fault Tolerance |
|---|---|---|
| Service interruption | Ensures minimal service interruption: whatever we run on AWS, we want it available for as much of the time as possible. | Ensures zero service interruption, using multiple active environments across expanded availability zones. |
| Hardware design | Designed with no single point of failure (redundancy). Things can fail even on AWS, so we build redundancy in. | Needs specialized, fully redundant hardware with instantaneous failover. |
| Uptime measurement | Measured as an uptime percentage, e.g. 99.99%, meaning the application should be available 99.99% of the time. | Aims for zero downtime: any failure must be handled instantaneously to avoid service interruption. |
| Replication | Synchronous or asynchronous. With synchronous replication, the sending system waits for confirmation that the data has been received and written; with asynchronous replication there is no such wait, so it is faster but can lose data if a failure occurs in that window. | Typically synchronous (since zero data loss is the goal), with components working in lockstep, allowing an immediate switch to a redundant component without delay. |
| Cost | Lower cost: high availability uses redundant components rather than fully redundant systems. | Significantly higher cost, due to fully redundant systems and more complex architectures; the trade-off buys zero downtime and uninterrupted service for mission-critical applications. |
| Service tools / examples | Elastic Load Balancing (distributes incoming connections to targets, which can be spread across multiple AZs), EC2 Auto Scaling (ensures enough targets are available and replaces failed ones), Amazon Route 53 (a DNS service that can answer queries in a load-balancing fashion or provide failover for disaster recovery). | Disk mirroring (RAID 1), synchronous database replication, redundant power. |

## Manage Fault Tolerance in AWS Using the AWS CLI

### Step 1: List availability zones

```shell
aws ec2 describe-availability-zones
```

### Step 2: Create an Application Load Balancer

```shell
aws elbv2 create-load-balancer \
  --name my-load-balancer \
  --subnets subnet-12345678 subnet-87654321 \
  --security-groups sg-12345678
```

### Step 3: Register instances with the load balancer

```shell
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-targets/1234567890123456 \
  --targets Id=i-12345678 Id=i-87654321
```

### Step 4: Create a launch configuration

Note that AWS now recommends launch templates over launch configurations; the launch-configuration command is shown here as in the original workflow.

```shell
aws autoscaling create-launch-configuration \
  --launch-configuration-name my-launch-config \
  --image-id ami-12345678 \
  --instance-type t2.micro
```

### Step 5: Create an Auto Scaling group

```shell
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --launch-configuration-name my-launch-config \
  --min-size 1 \
  --max-size 5 \
  --desired-capacity 2 \
  --vpc-zone-identifier subnet-12345678,subnet-87654321
```

### Step 6: Create a hosted zone

```shell
aws route53 create-hosted-zone \
  --name example.com \
  --caller-reference unique-string
```

### Step 7: Create a DNS record

```shell
aws route53 change-resource-record-sets \
  --hosted-zone-id Z3M3LMPEXAMPLE \
  --change-batch file://dns-record.json
```

Example `dns-record.json`:

```json
{
  "Comment": "Creating A record for example.com",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{ "Value": "192.0.2.44" }]
      }
    }
  ]
}
```

### Step 8: Create an S3 bucket

```shell
# Outside us-east-1, create-bucket requires an explicit location constraint.
aws s3api create-bucket \
  --bucket my-bucket \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2
```

### Step 9: Enable cross-region replication

```shell
# Replication requires versioning to be enabled on both the source
# and destination buckets before the replication configuration is applied.
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-replication \
  --bucket my-bucket \
  --replication-configuration file://replication-config.json
```

Example `replication-config.json`:

```json
{
  "Role": "arn:aws:iam::account-id:role/replication-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Prefix": "",
      "Destination": { "Bucket": "arn:aws:s3:::my-destination-bucket" }
    }
  ]
}
```

### Step 10: Create an RDS instance with Multi-AZ

```shell
aws rds create-db-instance \
  --db-instance-identifier mydbinstance \
  --db-instance-class db.m4.large \
  --engine mysql \
  --master-username admin \
  --master-user-password password \
  --allocated-storage 20 \
  --multi-az
```

### Step 11: Create a DynamoDB global table

```shell
aws dynamodb create-global-table \
  --global-table-name MyGlobalTable \
  --replication-group RegionName=us-east-1 RegionName=us-west-2
```

### Step 12: Create a Lambda function

```shell
aws lambda create-function \
  --function-name my-function \
  --runtime python3.8 \
  --role arn:aws:iam::account-id:role/service-role/MyLambdaRole \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://my-deployment-package.zip
```

### Step 13: Deploy a CloudFormation stack

```shell
aws cloudformation create-stack \
  --stack-name my-stack \
  --template-body file://template.json
```

### Step 14: Enable CloudWatch alarms

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name CPUAlarm \
  --alarm-description "Alarm when CPU exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:MyTopic
```

## Cost Analysis: High Availability vs. Fault Tolerance in AWS

### High Availability (HA)

**Components:** load balancers, Auto Scaling groups, multi-AZ deployments.

**Costs:**

- Elastic Load Balancer (ELB): charged per hour plus data processed.
- Auto Scaling: cost of the EC2 instances it runs.
- Multi-AZ RDS: higher cost due to replication across availability zones.

**Example (illustrative prices):**

- ELB: $0.0225 per ELB-hour + $0.008 per GB of data.
- Multi-AZ RDS: $0.20 per hour for a db.t3.micro instance.

### Fault Tolerance (FT)

**Components:** redundant components, data replication, multi-region deployments.

**Costs:**

- Redundant components: additional EC2 instances, databases, etc.
- Data replication: higher storage and data transfer costs.
- Multi-region deployments: cross-region data transfer fees.

**Example (illustrative prices):**

- S3 cross-region replication: $0.02 per GB replicated.
- Multi-region DynamoDB: $1.25 per WCU for global tables.

### Comparison

| Feature | High Availability | Fault Tolerance |
|---|---|---|
| Scope | Single region (multi-AZ) | Multi-region |
| Components | Load balancers, Auto Scaling, multi-AZ deployments | Redundant instances, cross-region replication |
| Costs | Moderate (e.g., ELB, Auto Scaling) | High (e.g., cross-region data transfer, replication) |
| Example cost | ELB: $0.0225/hr + $0.008/GB | S3 replication: $0.02/GB |

## Conclusion

We have walked through practical scenarios for both high availability and fault tolerance in AWS architecture to better understand how they compare. Their main benefit is protection against hardware failure (not software issues). Still, combining high availability (HA) and fault tolerance (FT) in AWS is significant for several reasons: it delivers the highest level of reliability, minimizes downtime, and enhances overall system resilience. Practical examples include Amazon RDS (Relational Database Service) with Multi-AZ deployments and read replicas.
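As a closing worked example of the cost trade-off discussed above, here is a rough Python sketch using the article's illustrative rates. The 730 hours/month figure and the 500 GB workload are our own assumptions, so treat the results as order-of-magnitude estimates only:

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month

def ha_monthly_cost(gb_processed: float) -> float:
    """HA sketch: one load balancer plus a Multi-AZ db.t3.micro RDS
    instance, priced at the illustrative rates from the cost analysis."""
    elb = 0.0225 * HOURS_PER_MONTH + 0.008 * gb_processed
    rds_multi_az = 0.20 * HOURS_PER_MONTH
    return elb + rds_multi_az

def ft_extra_cost(gb_replicated: float) -> float:
    """FT sketch: the S3 cross-region replication traffic ($0.02/GB)
    added on top of mirroring the whole environment in a second region."""
    return 0.02 * gb_replicated

# Example: 500 GB through the load balancer, 500 GB replicated cross-region.
print(f"HA baseline: ${ha_monthly_cost(500):.2f}/month")
print(f"FT replication traffic alone adds: ${ft_extra_cost(500):.2f}/month")
```

The replication line item looks small, but remember that true fault tolerance also roughly doubles the compute and database bill, since the second region runs active-active.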
**Read more:** https://devopsden.io/article/terraform-roadmap-2024

**Follow us on:** https://www.linkedin.com/company/devopsden/