High Availability Vs Fault Tolerance In AWS


You've probably heard the saying, "Time is money." In IT, this couldn't be truer. Every minute a system is down means lost revenue and productivity. It's estimated that 98% of companies lose around $100,000 for every hour of downtime. That's a huge hit to the wallet!

The AWS cloud allows teams to build highly resilient applications. To do so, teams need to ensure their systems are highly available or fault-tolerant as part of their disaster recovery approach.

High availability and fault tolerance might look similar but serve different purposes.

AWS Scenarios: High Availability Vs. Fault Tolerance

Both high availability and fault tolerance have the same goal: They keep your systems up and running if something fails within your architecture. Yet, there are specific differences, which we'll explain below.

High availability means keeping a system operational for an agreed percentage of time. It closely aligns with SLAs (service-level agreements). Amazon Web Services publishes SLAs for its individual services on its official website.

Fault tolerance refers to a workload's ability to remain operational in case of disruption with no downtime or data loss.

Let's look at a general scenario to understand the difference between high availability and fault tolerance!

High Availability Scenario

Let's assume you have an application that needs at least two Amazon EC2 instances running at all times. It has an SLA of 99.99%, which allows for about 4.38 minutes of downtime per month. We could then architect our infrastructure as described below.
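
The downtime budget behind an uptime percentage is simple arithmetic, and it can help to work it out rather than rely on memory. The sketch below assumes an average month of 365.25 / 12 days (43,830 minutes); the function name is illustrative and not part of any AWS tool.

```python
# Average month: 365.25 days per year / 12 months, expressed in minutes.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def monthly_downtime_minutes(sla_percent: float) -> float:
    """Return the downtime allowed per average month for an uptime SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999):
        minutes = monthly_downtime_minutes(sla)
        print(f"{sla}% uptime -> {minutes:.2f} minutes of downtime per month")
```

Under this assumption, 99.9% works out to roughly 43.83 minutes per month and 99.99% to roughly 4.38 minutes, so each extra "nine" cuts the budget by a factor of ten.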

We could use two different availability zones within a region. We could also have two EC2 instances in each AZ (availability zone). They are all associated with an elastic load balancer.

In this example, different elements contribute to a highly available solution. We have to use two availability zones and additional EC2 instances. 

So, we have plenty of computing resources if an instance fails. Or, if an entire AZ fails, we still have at least two instances to maintain the required SLA. 
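
As a rough illustration of this capacity reasoning, the following sketch checks whether a layout still meets its minimum instance count after losing any single AZ. The AZ names and function names are hypothetical, and the model only counts instances; it says nothing about load or health checks.

```python
def running_after_az_loss(instances_per_az, failed_az):
    """Count instances still running once one availability zone is lost."""
    return sum(count for az, count in instances_per_az.items() if az != failed_az)


def survives_any_single_az_loss(instances_per_az, required):
    """True if losing any one AZ still leaves at least `required` instances."""
    return all(
        running_after_az_loss(instances_per_az, az) >= required
        for az in instances_per_az
    )


if __name__ == "__main__":
    # Hypothetical AZ names; two instances in each of two AZs, two required.
    print(survives_any_single_az_loss({"az-a": 2, "az-b": 2}, required=2))
    # One instance per AZ cannot keep two running once an AZ is lost.
    print(survives_any_single_az_loss({"az-a": 1, "az-b": 1}, required=2))
```

The two-per-AZ layout passes the check, while one instance per AZ does not, which is why the scenario runs more instances than the minimum it needs.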

Fault Tolerance Scenario

Now, let's look at fault tolerance! It expands on high availability to offer greater protection should components begin to fail in your infrastructure. 

However, there are usually additional cost implications due to the greater level of resiliency offered. But the upside is that your uptime percentage increases. And there is no service interruption even if one or more components fail.

That said, having two EC2 instances in each availability zone already provides a degree of fault tolerance. Operations would be maintained even if an AZ were lost, because the minimum number of instances would still be running.

However, if another failure occurs, then the SLA will be impacted. 
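
To make the "another failure" case concrete, here is a small sketch that applies failures one after another and reports whether the minimum is still met after each. The event format and names are assumptions made for illustration only.

```python
def apply_failures(instances_per_az, failures, required):
    """Apply failures in order; return (kind, az, running, meets_minimum) per step."""
    state = dict(instances_per_az)  # copy so the caller's layout is untouched
    report = []
    for kind, az in failures:
        if kind == "az":
            state[az] = 0  # the whole zone is lost
        elif kind == "instance":
            state[az] = max(0, state[az] - 1)  # one instance in that zone is lost
        else:
            raise ValueError(f"unknown failure kind: {kind}")
        running = sum(state.values())
        report.append((kind, az, running, running >= required))
    return report


if __name__ == "__main__":
    layout = {"az-a": 2, "az-b": 2}  # hypothetical AZ names
    events = [("az", "az-a"), ("instance", "az-b")]
    for step in apply_failures(layout, events, required=2):
        print(step)
```

After the AZ loss two instances remain and the minimum holds; the next single instance failure leaves one, which is the point where the SLA is at risk.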

High Availability With Fault Tolerance Scenario

So, let's look at how we can adapt our high availability scenario with an increased fault-tolerant design approach. Previously, we had our single-region approach. 

To increase the solution's uptime, we could deploy the app across an additional AWS region, mirroring the environment from the first region in a second region.

It means we still have compute resources if an EC2 instance fails. And we still have enough computing capacity even if an AZ fails. But now, we can maintain operations even if an entire region fails.

Even after losing the primary region, we could suffer further EC2 and availability zone outages in the secondary region and still maintain two EC2 instances at all times. Thus, this solution offers greater uptime than the previous single-region solution.
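
One common way to reason about the uptime gain from a second active environment is the parallel-redundancy formula, 1 - (1 - a)^n. The sketch below assumes that environments fail independently and that failover is instantaneous, which real deployments only approximate; the 99.9% per-region figure is a made-up input, not an AWS number.

```python
def combined_availability(per_env_availability: float, environments: int) -> float:
    """Availability of `environments` independent, redundant environments."""
    if not 0.0 <= per_env_availability <= 1.0:
        raise ValueError("per_env_availability must be between 0 and 1")
    if environments < 1:
        raise ValueError("environments must be at least 1")
    # The system is down only when every environment is down at the same time.
    return 1.0 - (1.0 - per_env_availability) ** environments


if __name__ == "__main__":
    single = combined_availability(0.999, 1)
    dual = combined_availability(0.999, 2)
    print(f"one region:  {single:.6f}")
    print(f"two regions: {dual:.6f}")
```

Correlated failures or slow failover would make the real figure lower than this estimate, so treat it as an upper bound rather than a guarantee.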

Yet, it comes with an increased cost of running two active environments that can tolerate any component failing. Remember, we need to have the secondary region running to take advantage of avoiding any downtime if the primary region fails. 

Thus, fault-tolerant systems are intrinsically highly available. However, as we have seen, a highly available solution is only partially fault-tolerant.

It’s up to you to decide the level of high availability or fault tolerance you want to implement. The right level depends on the business impact of components beginning to fail.

Key Differences Between High Availability & Fault Tolerance in AWS

| Basis | High Availability | Fault Tolerance |
|---|---|---|
| Service interruption | It ensures minimal service interruption. Whatever we are running on AWS, we want to make sure that it's available for use as much of the time as possible. | It ensures zero service interruption. With fault tolerance, you aim for no service interruption by running multiple active environments across additional availability zones. |
| Design of hardware | Designed with no single point of failure (redundancy). Things might fail even on AWS, so we have to build in redundancy. | It needs specialized hardware with instantaneous failover. |
| Uptime measurement | High availability is measured in terms of uptime, expressed as a percentage, e.g. 99.99%. This means that the application should be available 99.99% of the time. | Fault-tolerant systems aim for zero downtime, meaning any failure must be addressed instantaneously to avoid service interruption. |
| Replication | Synchronous or asynchronous replication. With synchronous replication, when one system replicates data to another, it waits for confirmation that the data has been successfully received and written. With asynchronous replication, the sending system does not wait for that confirmation. So, it's a bit faster but can result in data loss if there's a failure during that time frame. | Fault-tolerant hardware often operates synchronously (as you want zero data loss), where all components work in lockstep. It further allows for an immediate switch to a redundant component without delay. |
| Cost | It incurs lower costs compared to fault tolerance, because high availability relies on some redundant components rather than the fully redundant systems that fault tolerance requires. | Fault tolerance prioritizes zero downtime and uninterrupted service, making it ideal for mission-critical applications where downtime is unacceptable. Thus, the trade-off is significantly higher costs due to the need for fully redundant systems and complex architectures. |

Service tools/examples

Elastic Load Balancing: It distributes incoming connections to different targets. We can also spread those targets across multiple availability zones.

Amazon EC2 Auto Scaling: It ensures enough targets are available, and if one fails, it replaces it.

Amazon Route 53: It is a DNS service that can respond to queries for the application's IP address in various ways, either to balance load or to fail over for disaster recovery.

Examples of services with built-in fault tolerance features appear in the CLI steps below, such as S3 cross-region replication, RDS Multi-AZ, and DynamoDB global tables.
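
The replication trade-off in the comparison above can be sketched with a toy model. This is only an in-memory illustration under simplified assumptions (a single primary, a single standby, no network errors); the class and function names are invented for the example and do not correspond to any AWS API.

```python
from collections import deque


class Store:
    """A trivial data store that keeps records in a list."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


def write_sync(primary, standby, record):
    """Synchronous: acknowledge only after the standby has written the record."""
    primary.write(record)
    standby.write(record)  # the caller waits for this confirmation
    return "ack"


class AsyncReplicator:
    """Asynchronous: acknowledge immediately and ship records to the standby later."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.pending = deque()

    def write(self, record):
        self.primary.write(record)
        self.pending.append(record)  # not yet on the standby
        return "ack"

    def flush(self):
        while self.pending:
            self.standby.write(self.pending.popleft())


if __name__ == "__main__":
    primary, standby = Store(), Store()
    replicator = AsyncReplicator(primary, standby)
    replicator.write("a")
    replicator.write("b")
    replicator.flush()
    replicator.write("c")  # the primary fails before the next flush
    print("standby has:", standby.records)
    print("lost on failover:", list(replicator.pending))
```

In the asynchronous run, the record still sitting in the pending queue is what would be lost if the primary failed at that moment; the synchronous path avoids that gap at the cost of waiting on every write.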



Manage fault tolerance in AWS using the AWS CLI

Step 1: Managing Availability Zones

aws ec2 describe-availability-zones

Step 2: Create an Application Load Balancer

aws elbv2 create-load-balancer \
    --name my-load-balancer \
    --subnets subnet-12345678 subnet-87654321 \
    --security-groups sg-12345678

Step 3: Register instances with the load balancer

aws elbv2 register-targets \
    --target-group-arn arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-targets/1234567890123456 \
    --targets Id=i-12345678 Id=i-87654321

Step 4: Create a launch configuration

aws autoscaling create-launch-configuration \
    --launch-configuration-name my-launch-config \
    --image-id ami-12345678 \
    --instance-type t2.micro

Step 5: Create an auto-scaling group

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --launch-configuration-name my-launch-config \
    --min-size 1 \
    --max-size 5 \
    --desired-capacity 2 \
    --vpc-zone-identifier subnet-12345678,subnet-87654321
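
To see how the --min-size, --max-size, and --desired-capacity values interact, here is a deliberately simplified model of the reconciliation idea: clamp the desired capacity to the allowed range, then compare it with the number of healthy instances. It is an assumption-laden sketch for intuition, not the Auto Scaling service's actual algorithm.

```python
def instances_to_change(healthy: int, desired: int, min_size: int, max_size: int) -> int:
    """Return how many instances to launch (positive) or terminate (negative)."""
    if min_size > max_size:
        raise ValueError("min_size cannot exceed max_size")
    # The effective target never leaves the [min_size, max_size] range.
    target = max(min_size, min(desired, max_size))
    return target - healthy


if __name__ == "__main__":
    # Values mirror the command above: min 1, max 5, desired 2.
    for healthy in (1, 2, 3):
        change = instances_to_change(healthy, desired=2, min_size=1, max_size=5)
        print(f"healthy={healthy} -> change={change}")
```

With one healthy instance the model asks for one replacement, which matches the idea that a failed target gets replaced to keep enough capacity available.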

Step 6: Create a hosted zone

# Set DOMAIN_NAME to the domain you want to host (placeholder)
aws route53 create-hosted-zone \
    --name "$DOMAIN_NAME" \
    --caller-reference unique-string

Step 7: Create a DNS record

aws route53 change-resource-record-sets \
    --hosted-zone-id Z3M3LMPEXAMPLE \
    --change-batch file://dns-record.json

Example:  dns-record.json

{
    "Comment": "Creating A record for",
    "Changes": [
        {
            "Action": "CREATE",
            "ResourceRecordSet": {
                "Name": "",
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [
                    { "Value": "" }
                ]
            }
        }
    ]
}
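
Because the record name and value are left blank in the example file, it may be easier to generate the change batch than to edit it by hand. The sketch below builds the same structure in Python; the domain www.example.com and the address 192.0.2.10 (from the documentation-only range) are placeholder assumptions, and the helper name is invented for illustration.

```python
import ipaddress
import json


def build_a_record_change(name: str, ip_address: str, ttl: int = 300) -> dict:
    """Build a change batch that creates one A record."""
    if not name:
        raise ValueError("name must not be empty")
    ipaddress.IPv4Address(ip_address)  # raises ValueError for a malformed IPv4 address
    return {
        "Comment": f"Creating A record for {name}",
        "Changes": [
            {
                "Action": "CREATE",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; replace them with your own record name and address.
    batch = build_a_record_change("www.example.com", "192.0.2.10")
    print(json.dumps(batch, indent=4))
```

Saving the printed output as dns-record.json would give a file that can be passed to the --change-batch option shown in Step 7.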

Step 8: Create an S3 bucket

aws s3api create-bucket --bucket my-bucket --region us-west-2 \
    --create-bucket-configuration LocationConstraint=us-west-2

Step 9: Enable cross-region replication

aws s3api put-bucket-replication \
    --bucket my-bucket \
    --replication-configuration file://replication-config.json

Example: replication-config.json

{
    "Role": "arn:aws:iam::account-id:role/replication-role",
    "Rules": [
        {
            "Status": "Enabled",
            "Prefix": "",
            "Destination": {
                "Bucket": "arn:aws:s3:::my-destination-bucket"
            }
        }
    ]
}

Step 10: Create an RDS instance with Multi-AZ

aws rds create-db-instance \
    --db-instance-identifier mydbinstance \
    --db-instance-class db.m4.large \
    --engine mysql \
    --master-username admin \
    --master-user-password password \
    --allocated-storage 20 \
    --multi-az

Step 11: Create a global table

aws dynamodb create-global-table \
    --global-table-name MyGlobalTable \
    --replication-group RegionName=us-east-1 RegionName=us-west-2

Step 12: Create a Lambda function

aws lambda create-function \
    --function-name my-function \
    --runtime python3.12 \
    --role arn:aws:iam::account-id:role/service-role/MyLambdaRole \
    --handler lambda_function.lambda_handler \
    --zip-file fileb://

Step 13: Deploy a CloudFormation stack

aws cloudformation create-stack \
    --stack-name my-stack \
    --template-body file://template.json

Step 14: Enable CloudWatch alarms

aws cloudwatch put-metric-alarm \
    --alarm-name CPUAlarm \
    --alarm-description "Alarm when CPU exceeds 80%" \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:us-west-2:123456789012:MyTopic
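
The alarm above fires when the average CPU over a 300-second period is greater than 80 for one evaluation period. The following sketch models that rule in a simplified way, taking a list of per-period averages as input; it is meant to build intuition about --threshold and --evaluation-periods and is not how CloudWatch is implemented.

```python
def alarm_state(period_averages, threshold=80.0, evaluation_periods=1):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for a GreaterThanThreshold rule."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(period_averages) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent `evaluation_periods` datapoints are considered.
    recent = period_averages[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    print(alarm_state([40.0, 55.0, 91.5]))  # latest period is above 80
    print(alarm_state([40.0, 91.5, 60.0]))  # latest period is back below 80
    print(alarm_state([85.0, 91.5], evaluation_periods=3))  # too few datapoints
```

Raising the number of evaluation periods makes the alarm less sensitive to short spikes, at the cost of reacting more slowly to a sustained problem.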


So, we have looked at practical scenarios for both high availability and fault tolerance in AWS architecture to better understand how they compare. Protection against hardware failure (not software issues) is the main benefit of high availability and fault tolerance.

Combining high availability (HA) and fault tolerance (FT) in AWS matters for several reasons. It ensures the highest level of reliability, minimizes downtime, and enhances overall system resilience.

Practical examples include Amazon RDS (Relational Database Service) with Multi-AZ and read replicas. 
