
Reliability Pillar

TL;DR

The Reliability pillar focuses on ensuring your workload can recover from failures and continue to function under adverse conditions. Key concepts:

  • Design for failure: Assume everything will fail and plan accordingly
  • Redundancy: Eliminate single points of failure
  • Self-healing: Automate recovery from failures
  • Graceful degradation: Maintain partial functionality during outages
  • Testing: Regularly test failure scenarios

Design Principles

Core Reliability Principles

| Principle | Description | Implementation |
|---|---|---|
| Design for failure | Assume components will fail | Redundancy, failover, retry logic |
| Observe application health | Know when something is wrong | Health probes, monitoring, alerting |
| Drive automation | Reduce human error in recovery | Auto-scaling, self-healing, IaC |
| Design for self-healing | Recover without intervention | Health checks, automatic restarts |
| Design for scale-out | Add capacity horizontally | Stateless design, load balancing |

Reliability Hierarchy


Key Concepts

Availability Targets

| Availability | Downtime/Year | Downtime/Month | Use Case |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Dev/Test |
| 99.9% | 8.76 hours | 43.8 minutes | Standard production |
| 99.95% | 4.38 hours | 21.9 minutes | Business critical |
| 99.99% | 52.6 minutes | 4.38 minutes | Mission critical |
| 99.999% | 5.26 minutes | 26.3 seconds | Life safety systems |
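
These figures follow directly from downtime = (1 − availability) × period: for 99.9%, that is 0.001 × 8,760 hours per year ≈ 8.76 hours, and 0.001 × 730 hours per month ≈ 43.8 minutes.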

RTO and RPO

| Metric | Definition | Question |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | How long can you be down? |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | How much data can you lose? |
| MTTR (Mean Time to Recovery) | Average recovery time | How fast do you typically recover? |
| MTBF (Mean Time Between Failures) | Average uptime between failures | How often do failures occur? |
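
For example, an RPO of 15 minutes implies backups or replication must capture changes at least every 15 minutes, while an RTO of 1 hour means detection, failover, and validation must all complete within an hour.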

Resiliency Patterns

Retry Pattern

Handle transient failures by retrying operations with exponential backoff.

// C# Example using Polly
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt)), // 2, 4, 8 seconds
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            _logger.LogWarning(
                "Retry {RetryCount} after {Delay}s due to {Exception}",
                retryCount, timeSpan.TotalSeconds, exception.Message);
        });

await retryPolicy.ExecuteAsync(async () =>
{
    await httpClient.GetAsync("https://api.example.com/data");
});

Circuit Breaker Pattern

Prevent cascading failures by stopping calls to failing services.

// C# Example using Polly
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (exception, duration) =>
        {
            _logger.LogError("Circuit opened for {Duration}s", duration.TotalSeconds);
        },
        onReset: () =>
        {
            _logger.LogInformation("Circuit closed, normal operation resumed");
        });
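
A sketch of combining the two policies with Polly's Policy.WrapAsync, so that once the circuit opens the retry policy fails fast instead of hammering the broken dependency:

// Retry is the outer policy, circuit breaker the inner one: an open circuit
// throws BrokenCircuitException, which the retry policy does not handle
var resiliencePolicy = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy);

await resiliencePolicy.ExecuteAsync(async () =>
{
    await httpClient.GetAsync("https://api.example.com/data");
});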

Bulkhead Pattern

Isolate failures to prevent them from affecting the entire system.
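
For example, Polly's bulkhead policy caps how many calls may run (and queue) against one dependency, so a saturated downstream service cannot exhaust the shared thread pool. The limits below are illustrative, not recommendations.

// C# Example using Polly (illustrative limits)
var bulkheadPolicy = Policy
    .BulkheadAsync(
        maxParallelization: 10,   // at most 10 concurrent calls to this dependency
        maxQueuingActions: 20,    // up to 20 further calls may wait for a slot
        onBulkheadRejectedAsync: context =>
        {
            _logger.LogWarning("Bulkhead full, call rejected; dependency is saturated");
            return Task.CompletedTask;
        });

await bulkheadPolicy.ExecuteAsync(async () =>
{
    await httpClient.GetAsync("https://api.example.com/data");
});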

Health Endpoint Pattern

Expose health status for monitoring and load balancer decisions.

// ASP.NET Core Health Checks
public void ConfigureServices(IServiceCollection services)
{
    services.AddHealthChecks()
        .AddSqlServer(connectionString, name: "database")
        .AddRedis(redisConnection, name: "cache")
        .AddAzureBlobStorage(blobConnection, name: "storage")
        .AddCheck<CustomHealthCheck>("custom");
}

public void Configure(IApplicationBuilder app)
{
    app.UseHealthChecks("/health", new HealthCheckOptions
    {
        ResponseWriter = async (context, report) =>
        {
            context.Response.ContentType = "application/json";
            await context.Response.WriteAsync(JsonSerializer.Serialize(new
            {
                status = report.Status.ToString(),
                checks = report.Entries.Select(e => new
                {
                    name = e.Key,
                    status = e.Value.Status.ToString(),
                    duration = e.Value.Duration.TotalMilliseconds
                })
            }));
        }
    });
}
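
If the same application is probed for both liveness and readiness, a common refinement is to split the endpoint in two so the load balancer only routes traffic once dependencies are reachable. A sketch, assuming the dependency checks above are registered with a "ready" tag via the tags parameter:

// Liveness: process is responsive; skip all registered checks
app.UseHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: run only checks tagged "ready" (e.g. database, cache, storage)
app.UseHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});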

Disaster Recovery Strategies

DR Strategy Comparison

| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours-Days | Hours | $ | Low |
| Pilot Light | Minutes-Hours | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds-Minutes | $$$ | Medium-High |
| Active-Active | Near-zero | Near-zero | $$$$ | High |

DR Architecture Patterns

Azure DR Services

| Service | Purpose | RPO | RTO |
|---|---|---|---|
| Azure Site Recovery | VM replication and failover | Minutes | Minutes |
| Geo-redundant Storage | Blob/file replication | ~15 minutes | Hours |
| SQL Geo-replication | Database replication | Seconds | Minutes |
| Cosmos DB Multi-region | Global database | Seconds | Automatic |
| Traffic Manager | DNS-based failover | N/A | Minutes |
| Front Door | Layer 7 global load balancing | N/A | Seconds |

Multi-Region Architecture

Active-Passive Configuration

Active-Active Configuration


Health Modeling

Health Model Components

Health States

| State | Description | Action |
|---|---|---|
| Healthy | All components functioning normally | None |
| Degraded | Partial functionality, non-critical issues | Alert, investigate |
| Unhealthy | Critical functionality impaired | Alert, auto-heal, failover |
| Unknown | Cannot determine health status | Investigate immediately |
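
These states map onto the HealthStatus values used by ASP.NET Core health checks (Healthy, Degraded, Unhealthy; a probe that never responds is effectively Unknown). A minimal sketch of the CustomHealthCheck registered earlier, with illustrative thresholds and a hypothetical GetQueueDepthAsync helper:

// C# Example: distinguishing Degraded from Unhealthy (illustrative thresholds)
public class CustomHealthCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // GetQueueDepthAsync is a hypothetical helper standing in for a real dependency query
        var queueDepth = await GetQueueDepthAsync(cancellationToken);

        if (queueDepth > 10_000)
            return HealthCheckResult.Unhealthy($"Queue backlog critical: {queueDepth}");

        if (queueDepth > 1_000)
            return HealthCheckResult.Degraded($"Queue backlog growing: {queueDepth}");

        return HealthCheckResult.Healthy("Queue depth nominal");
    }

    private Task<int> GetQueueDepthAsync(CancellationToken ct) =>
        Task.FromResult(0); // replace with a real dependency query
}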

Failure Mode Analysis (FMA)

| Component | Failure Mode | Impact | Mitigation | Detection |
|---|---|---|---|---|
| Database | Connection timeout | Service unavailable | Connection pooling, retry | Health probe |
| Cache | Node failure | Increased latency | Cluster mode, fallback to DB | Health check |
| API Gateway | Overload | Request failures | Rate limiting, auto-scale | Metrics |
| Storage | Region outage | Data inaccessible | Geo-redundancy | Azure status |

Azure Services for Reliability

Compute Reliability

| Service | Reliability Feature | SLA |
|---|---|---|
| Virtual Machines | Availability Sets, Zones | 99.95-99.99% |
| App Service | Multi-instance, slots | 99.95% |
| AKS | Node pools, pod replicas | 99.95% |
| Functions | Consumption auto-scale | 99.95% |

Data Reliability

| Service | Reliability Feature | Durability / SLA |
|---|---|---|
| Blob Storage | LRS, ZRS, GRS, GZRS | 99.999999999% |
| SQL Database | Geo-replication, auto-failover | 99.99% |
| Cosmos DB | Multi-region, automatic failover | 99.999% |
| Redis Cache | Clustering, geo-replication | 99.9% |

Networking Reliability

| Service | Purpose | Failover Time |
|---|---|---|
| Traffic Manager | DNS-based global load balancing | 30-60 seconds |
| Front Door | Layer 7 global load balancing | Seconds |
| Load Balancer | Regional load balancing | Seconds |
| Application Gateway | Layer 7 regional load balancing | Seconds |

Reliability Checklist

Design Phase

  • Define RTO and RPO for the workload
  • Identify single points of failure
  • Design for zone and region redundancy
  • Plan for graceful degradation
  • Document failure modes and mitigations

Implementation Phase

  • Implement retry logic with exponential backoff
  • Add circuit breakers for external dependencies
  • Configure health probes for all components
  • Set up auto-scaling rules
  • Enable diagnostic logging

Operations Phase

  • Configure monitoring and alerting
  • Create runbooks for common failures
  • Schedule regular DR drills
  • Review and update SLAs
  • Conduct chaos engineering tests

Assessment Questions

Use these questions to assess your workload's reliability:

| Area | Question |
|---|---|
| Availability | What is your target availability SLA? |
| Recovery | What are your RTO and RPO requirements? |
| Redundancy | Do you have redundancy at every tier? |
| Failover | Is failover automated or manual? |
| Testing | How often do you test failure scenarios? |
| Monitoring | Can you detect failures before users do? |
| Dependencies | How do you handle dependency failures? |
| Data | Is your data backed up and recoverable? |

Key Takeaways

  1. Assume failure: Design every component to handle failures gracefully
  2. Eliminate SPOFs: No single component should bring down the system
  3. Automate recovery: Manual recovery is slow and error-prone
  4. Test regularly: Untested DR plans are unreliable
  5. Monitor proactively: Detect issues before they become outages

Resources