Enterprise Case Study: Contoso Retail

TL;DR

This case study follows Contoso Retail, a fictional mid-size retailer, as it applies the Azure Well-Architected Framework (WAF) to transform its e-commerce platform. In the security sections, "WAF" instead refers to a web application firewall. You'll see:

  • Before/After architecture comparisons
  • Real assessment scores and findings
  • Prioritized remediation decisions
  • Implementation details for each pillar
  • Measurable outcomes and lessons learned

Company Background

About Contoso Retail

| Attribute | Details |
| --- | --- |
| Industry | Retail / E-commerce |
| Annual Revenue | $500M |
| Employees | 2,000 |
| Customers | 2M registered users |
| Daily Orders | 15,000 average, 50,000 peak (holidays) |
| Product Catalog | 50,000 SKUs |
| Tech Team | 25 engineers |

Business Context

Contoso Retail operates both physical stores and an e-commerce platform. Their online business has grown 40% year-over-year, but their legacy architecture is struggling to keep pace. Recent Black Friday outages cost them an estimated $2M in lost sales.

Key Business Drivers:

  • Eliminate revenue-impacting outages
  • Reduce cloud spending (currently $150K/month)
  • Accelerate feature delivery (currently 6-week release cycles)
  • Meet PCI-DSS compliance requirements
  • Support international expansion

Initial State Assessment

Current Architecture (Before WAF)

Initial WAF Assessment Scores

Detailed Findings by Pillar

Reliability Findings (Score: 35/100)

| Finding | Severity | Impact |
| --- | --- | --- |
| Single region deployment | Critical | Complete outage if region fails |
| No defined RTO/RPO | Critical | Unknown recovery capabilities |
| Database single point of failure | Critical | Data loss risk |
| Manual failover procedures | High | Extended downtime |
| No health monitoring | High | Reactive incident response |
| Monolithic application | Medium | Blast radius of failures |

Security Findings (Score: 42/100)

| Finding | Severity | Impact |
| --- | --- | --- |
| No web application firewall (WAF) | Critical | Vulnerable to web attacks |
| Secrets in config files | Critical | Credential exposure risk |
| No MFA for admin access | Critical | Account compromise risk |
| Public database endpoint | High | Data breach risk |
| No encryption at rest | High | Compliance violation |
| Overly permissive RBAC | Medium | Insider threat risk |

Cost Findings (Score: 38/100)

| Finding | Severity | Impact |
| --- | --- | --- |
| Over-provisioned VMs | High | $30K/month waste |
| No reserved instances | High | Missing 40% savings |
| No auto-scaling | Medium | Paying for peak capacity 24/7 |
| LRS storage for critical data | Medium | Risk vs cost mismatch |
| No cost allocation tags | Medium | No accountability |
| Dev/Test using production SKUs | Low | Unnecessary spend |

Operational Excellence Findings (Score: 45/100)

| Finding | Severity | Impact |
| --- | --- | --- |
| Manual deployments | High | 6-week release cycles |
| No Infrastructure as Code | High | Configuration drift |
| Limited monitoring | High | Blind to issues |
| No runbooks | Medium | Inconsistent incident response |
| Tribal knowledge | Medium | Key person dependency |
| No automated testing | Medium | Quality issues |

Performance Findings (Score: 50/100)

| Finding | Severity | Impact |
| --- | --- | --- |
| No caching layer | High | Database overload |
| No CDN for static content | High | Slow page loads |
| Unoptimized queries | High | 3-5 second response times |
| No connection pooling | Medium | Connection exhaustion |
| Synchronous processing | Medium | Blocking operations |
| No load testing | Medium | Unknown capacity limits |

Prioritization and Roadmap

Risk-Based Prioritization

Findings were prioritized with an impact vs. effort matrix:

| Priority | Items | Effort | Impact |
| --- | --- | --- | --- |
| Quick Wins | Web application firewall, Key Vault, MFA, Caching | Low | High |
| Plan Carefully | Database migration, CI/CD | High | High |
| Strategic | Geo-redundancy, Microservices | High | High |
| Defer | Documentation updates | Low | Low |
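
The matrix above can be expressed as a small classification function. This is only a sketch of the idea, not Contoso's actual tooling: the `Finding` type and `quadrant` function are hypothetical, and the "Avoid" label for high-effort, low-impact work is an assumption because the table has no such row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    effort: str  # "Low" or "High"
    impact: str  # "Low" or "High"


def quadrant(finding: Finding) -> str:
    """Map an (effort, impact) pair to a quadrant of the matrix."""
    if finding.impact == "High":
        # The table splits High/High items into "Plan Carefully" and
        # "Strategic" by judgment; this sketch does not reproduce that split.
        return "Quick Wins" if finding.effort == "Low" else "Plan Carefully / Strategic"
    # "Avoid" is a hypothetical label; the table only lists "Defer".
    return "Defer" if finding.effort == "Low" else "Avoid"


findings = [
    Finding("Key Vault", "Low", "High"),
    Finding("Geo-redundancy", "High", "High"),
    Finding("Documentation updates", "Low", "Low"),
]
by_quadrant = {f.name: quadrant(f) for f in findings}
print(by_quadrant)
```

Effort and impact alone cannot separate "Plan Carefully" from "Strategic", which suggests that the split in the table reflects sequencing and dependencies rather than the matrix itself.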

Phased Roadmap


Phase 1: Critical Security Fixes

1.1 Enable MFA for All Users

Before: Password-only authentication for Azure portal and admin access.

Implementation:

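# Option 1: enable Security Defaults (includes MFA)
# Option 2 (shown here): use Conditional Access for more control
# Assumes the Microsoft.Graph PowerShell module; adjust scopes to your tenant
Connect-MgGraph -Scopes "Policy.ReadWrite.ConditionalAccess", "User.Read.All"

# Conditional Access expects object IDs, not UPNs, in excludeUsers
$breakGlass = Get-MgUser -UserId "BreakGlassAccount@contoso.com"

# Conditional Access Policy via Graph API
$policy = @{
    displayName = "Require MFA for all users"
    state       = "enabled"
    conditions  = @{
        clientAppTypes = @("all")
        users = @{
            includeUsers = @("All")
            excludeUsers = @($breakGlass.Id)
        }
        applications = @{
            includeApplications = @("All")
        }
    }
    grantControls = @{
        operator        = "OR"
        builtInControls = @("mfa")
    }
}

# Create the policy (the original snippet only defined the hashtable)
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy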

Outcome: 100% of admin accounts now require MFA. The policy applies to all users, with the break-glass account as the only exclusion.


1.2 Migrate Secrets to Key Vault

Before: Connection strings and API keys in appsettings.json and environment variables.

// BEFORE - appsettings.json (INSECURE!)
{
  "ConnectionStrings": {
    "Database": "Server=sql.contoso.com;Database=Orders;User=admin;Password=P@ssw0rd123!"
  },
  "PaymentGateway": {
    "ApiKey": "sk_live_abc123xyz789"
  }
}

After: All secrets in Azure Key Vault with managed identity access.

// Key Vault with private endpoint
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-contoso-prod'
  location: location
  properties: {
    sku: { family: 'A', name: 'standard' }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true
    enableSoftDelete: true
    softDeleteRetentionInDays: 90
    enablePurgeProtection: true
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
    }
  }
}

// Private endpoint for Key Vault
resource kvPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-05-01' = {
  name: 'pe-kv-contoso'
  location: location
  properties: {
    subnet: { id: privateEndpointSubnet.id }
    privateLinkServiceConnections: [
      {
        name: 'kv-connection'
        properties: {
          privateLinkServiceId: keyVault.id
          groupIds: ['vault']
        }
      }
    ]
  }
}
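
// Application code - access secrets via managed identity
// Requires the Azure.Identity and Azure.Extensions.AspNetCore.Configuration.Secrets packages
using Azure.Identity;

var builder = WebApplication.CreateBuilder(args);

builder.Configuration.AddAzureKeyVault(
    new Uri("https://kv-contoso-prod.vault.azure.net/"),
    new DefaultAzureCredential());

// Secrets are now accessed like regular configuration
var connectionString = builder.Configuration["Database-ConnectionString"];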

Outcome: Zero secrets in code or config files. All secrets centrally managed with audit logging.


1.3 Deploy Web Application Firewall

Before: Direct internet access to load balancer, no web attack protection.

After: Azure Front Door with WAF in Prevention mode.

// Front Door with WAF
resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'fd-contoso-prod'
  location: 'global'
  sku: { name: 'Premium_AzureFrontDoor' }
}

resource wafPolicy 'Microsoft.Network/FrontDoorWebApplicationFirewallPolicies@2022-05-01' = {
  name: 'waf-contoso-prod'
  location: 'global'
  sku: { name: 'Premium_AzureFrontDoor' }
  properties: {
    policySettings: {
      enabledState: 'Enabled'
      mode: 'Prevention'
      requestBodyCheck: 'Enabled'
    }
    managedRules: {
      managedRuleSets: [
        {
          ruleSetType: 'Microsoft_DefaultRuleSet'
          ruleSetVersion: '2.1'
          // Default Rule Set 2.x expects an explicit rule set action
          ruleSetAction: 'Block'
        }
        {
          ruleSetType: 'Microsoft_BotManagerRuleSet'
          ruleSetVersion: '1.0'
        }
      ]
    }
    customRules: {
      rules: [
        {
          // Block clients exceeding 1,000 requests per minute to /api/
          name: 'RateLimitRule'
          priority: 1
          ruleType: 'RateLimitRule'
          rateLimitThreshold: 1000
          rateLimitDurationInMinutes: 1
          action: 'Block'
          matchConditions: [
            {
              matchVariable: 'RequestUri'
              operator: 'Contains'
              matchValue: ['/api/']
            }
          ]
        }
      ]
    }
  }
}

Outcome: Blocked 50,000+ malicious requests in first month. Zero successful web attacks.


1.4 Secure Database with Private Endpoint

Before: SQL Server accessible via public IP with firewall rules.

After: Private endpoint only, no public access.

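// Disable public access (administrator settings omitted for brevity)
resource sqlServer 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: 'sql-contoso-prod'
  location: location
  properties: {
    publicNetworkAccess: 'Disabled'
    minimalTlsVersion: '1.2'
  }
}

// Private endpoint for SQL
resource sqlPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-05-01' = {
  name: 'pe-sql-contoso'
  location: location
  properties: {
    subnet: { id: dataSubnet.id }
    privateLinkServiceConnections: [
      {
        name: 'sql-connection'
        properties: {
          privateLinkServiceId: sqlServer.id
          groupIds: ['sqlServer']
        }
      }
    ]
  }
}

// Private DNS zone for SQL
resource privateDnsZone 'Microsoft.Network/privateDnsZones@2020-06-01' = {
  name: 'privatelink.database.windows.net'
  location: 'global'
}

// Register the endpoint's private IP in the DNS zone.
// The zone must also be linked to the virtual network (not shown).
resource sqlDnsZoneGroup 'Microsoft.Network/privateEndpoints/privateDnsZoneGroups@2023-05-01' = {
  parent: sqlPrivateEndpoint
  name: 'default'
  properties: {
    privateDnsZoneConfigs: [
      {
        name: 'sql'
        properties: { privateDnsZoneId: privateDnsZone.id }
      }
    ]
  }
}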

Outcome: Database no longer accessible from internet. All access via private network.


Phase 2: Reliability Foundation

2.1 Define RTO and RPO

Working with business stakeholders, Contoso defined recovery objectives:

| Workload | RTO | RPO | Justification |
| --- | --- | --- | --- |
| E-commerce website | 15 minutes | 5 minutes | Revenue-critical |
| Order processing | 30 minutes | 0 (no data loss) | Financial transactions |
| Product catalog | 1 hour | 1 hour | Can rebuild from source |
| Analytics | 4 hours | 24 hours | Not customer-facing |
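
Recovery objectives are most useful when drill results are checked against them. The sketch below encodes the table above and compares a drill result with each target. The `meets_objectives` helper and the drill numbers are hypothetical, chosen only to illustrate the comparison.

```python
from datetime import timedelta

# Targets copied from the table above: workload -> (RTO, RPO).
OBJECTIVES = {
    "E-commerce website": (timedelta(minutes=15), timedelta(minutes=5)),
    "Order processing": (timedelta(minutes=30), timedelta(0)),
    "Product catalog": (timedelta(hours=1), timedelta(hours=1)),
    "Analytics": (timedelta(hours=4), timedelta(hours=24)),
}


def meets_objectives(workload: str, downtime: timedelta, data_loss_window: timedelta) -> bool:
    """True when both the measured downtime and data-loss window are within target."""
    rto, rpo = OBJECTIVES[workload]
    return downtime <= rto and data_loss_window <= rpo


# Hypothetical drill result: 12 minutes of downtime, 5 seconds of lost writes.
drill = (timedelta(minutes=12), timedelta(seconds=5))
website_ok = meets_objectives("E-commerce website", *drill)
orders_ok = meets_objectives("Order processing", *drill)
print(website_ok, orders_ok)  # True False
```

The same drill passes for the website but fails for order processing, because an RPO of zero tolerates no lost writes at all.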

2.2 Implement Health Checks

using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Comprehensive health checks
// (the Add* extensions come from the AspNetCore.HealthChecks.* packages)
builder.Services.AddHealthChecks()
    // Database connectivity
    .AddSqlServer(
        connectionString: builder.Configuration["Database-ConnectionString"],
        name: "database",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "db", "critical" })

    // Redis cache
    .AddRedis(
        redisConnectionString: builder.Configuration["Redis-ConnectionString"],
        name: "redis",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "cache" })

    // External payment gateway
    .AddUrlGroup(
        new Uri("https://api.paymentgateway.com/health"),
        name: "payment-gateway",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "external" })

    // Blob storage
    .AddAzureBlobStorage(
        connectionString: builder.Configuration["Storage-ConnectionString"],
        name: "blob-storage",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "storage" });

// Health check endpoints
// WriteHealthCheckResponse is an application-defined response writer (not shown)
app.MapHealthChecks("/health", new HealthCheckOptions
{
    Predicate = _ => true, // Run every registered check
    ResponseWriter = WriteHealthCheckResponse
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("critical"), // Only critical dependencies
    ResponseWriter = WriteHealthCheckResponse
});

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Runs no checks; just confirms the app is responding
});

2.3 Add Redis Caching Layer

Before: Every request hit the database directly.

After: Distributed caching with Redis reduces database load by 70%.

using System.Text.Json;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;

// Cache-aside pattern for product catalog
public class ProductService
{
    private readonly IDistributedCache _cache;
    private readonly IProductRepository _repository;
    private readonly ILogger<ProductService> _logger;

    private static readonly TimeSpan CacheDuration = TimeSpan.FromMinutes(15);

    // Dependencies are supplied by the DI container
    public ProductService(
        IDistributedCache cache,
        IProductRepository repository,
        ILogger<ProductService> logger)
    {
        _cache = cache;
        _repository = repository;
        _logger = logger;
    }

    public async Task<Product?> GetProductAsync(string productId)
    {
        var cacheKey = $"product:{productId}";

        // Try cache first
        var cached = await _cache.GetStringAsync(cacheKey);
        if (cached != null)
        {
            _logger.LogDebug("Cache hit for product {ProductId}", productId);
            return JsonSerializer.Deserialize<Product>(cached);
        }

        // Cache miss - get from database
        _logger.LogDebug("Cache miss for product {ProductId}", productId);
        var product = await _repository.GetByIdAsync(productId);

        if (product != null)
        {
            await _cache.SetStringAsync(
                cacheKey,
                JsonSerializer.Serialize(product),
                new DistributedCacheEntryOptions
                {
                    AbsoluteExpirationRelativeToNow = CacheDuration
                });
        }

        return product;
    }

    public async Task InvalidateProductCacheAsync(string productId)
    {
        await _cache.RemoveAsync($"product:{productId}");
        await _cache.RemoveAsync("products:featured"); // Invalidate related caches
    }
}

Outcome:

  • Database queries reduced by 70%
  • Average response time: 3.2s → 180ms
  • Database CPU: 85% → 25%

2.4 Database Geo-Replication

// Primary database
resource sqlDatabase 'Microsoft.Sql/servers/databases@2022-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'contoso-orders'
  location: 'eastus'
  sku: {
    name: 'BC_Gen5_4'
    tier: 'BusinessCritical'
  }
  properties: {
    zoneRedundant: true
  }
}

// Secondary server in different region
resource sqlServerSecondary 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: 'sql-contoso-secondary'
  location: 'westus'
  properties: {
    publicNetworkAccess: 'Disabled'
  }
}

// Geo-replication link
resource geoReplication 'Microsoft.Sql/servers/databases@2022-05-01-preview' = {
  parent: sqlServerSecondary
  name: 'contoso-orders'
  location: 'westus'
  properties: {
    createMode: 'Secondary'
    sourceDatabaseId: sqlDatabase.id
  }
}

// Auto-failover group
resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2022-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'fg-contoso'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      // How long Azure waits before an automatic failover that may lose data
      failoverWithDataLossGracePeriodMinutes: 60
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Enabled'
    }
    partnerServers: [
      { id: sqlServerSecondary.id }
    ]
    databases: [sqlDatabase.id]
  }
}
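Outcome: RPO reduced to ~5 seconds (asynchronous replication lag) with automatic failover capability. Because the 60-minute grace period delays automatic failover, meeting the 15-minute and 30-minute RTO targets still depends on a manually initiated failover.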


Phase 3: Operational Excellence

3.1 Infrastructure as Code with Bicep

All infrastructure now defined in Bicep modules:

infrastructure/
├── main.bicep
├── modules/
│   ├── networking.bicep
│   ├── compute.bicep
│   ├── data.bicep
│   ├── security.bicep
│   └── monitoring.bicep
├── environments/
│   ├── dev.bicepparam
│   ├── staging.bicepparam
│   └── prod.bicepparam
└── .github/
    └── workflows/
        └── infrastructure.yml

// main.bicep
targetScope = 'subscription'

@description('Environment name')
@allowed(['dev', 'staging', 'prod'])
param environment string

@description('Primary Azure region')
param primaryLocation string = 'eastus'

@description('Secondary Azure region for DR')
param secondaryLocation string = 'westus'

// Resource Group
resource rg 'Microsoft.Resources/resourceGroups@2023-07-01' = {
  name: 'rg-contoso-${environment}'
  location: primaryLocation
  tags: {
    Environment: environment
    CostCenter: 'IT-Engineering'
    Application: 'Contoso-Ecommerce'
  }
}

// Networking
module networking 'modules/networking.bicep' = {
  scope: rg
  name: 'networking'
  params: {
    environment: environment
    location: primaryLocation
  }
}

// Security (Key Vault, etc.)
module security 'modules/security.bicep' = {
  scope: rg
  name: 'security'
  params: {
    environment: environment
    location: primaryLocation
    subnetId: networking.outputs.privateEndpointSubnetId
  }
}

// Data (SQL, Redis, Storage)
module data 'modules/data.bicep' = {
  scope: rg
  name: 'data'
  params: {
    environment: environment
    primaryLocation: primaryLocation
    secondaryLocation: secondaryLocation
    subnetId: networking.outputs.dataSubnetId
  }
}

// Compute (App Service, Functions)
module compute 'modules/compute.bicep' = {
  scope: rg
  name: 'compute'
  params: {
    environment: environment
    location: primaryLocation
    subnetId: networking.outputs.appSubnetId
    keyVaultName: security.outputs.keyVaultName
  }
}

// Monitoring
module monitoring 'modules/monitoring.bicep' = {
  scope: rg
  name: 'monitoring'
  params: {
    environment: environment
    location: primaryLocation
  }
}

3.2 CI/CD Pipeline

# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  DOTNET_VERSION: '8.0.x'
  AZURE_WEBAPP_NAME: app-contoso

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # needed to upload CodeQL results
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}

      # CodeQL must be initialized before the build it analyzes
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: csharp

      - name: Restore dependencies
        run: dotnet restore

      - name: Build
        run: dotnet build --configuration Release --no-restore

      # --no-build needs the same configuration that was just built
      - name: Run unit tests
        run: dotnet test --configuration Release --no-build --verbosity normal --collect:"XPlat Code Coverage" --results-directory ./coverage

      - name: Run security scan
        uses: github/codeql-action/analyze@v3

      - name: Publish
        run: dotnet publish src/Contoso.Web/Contoso.Web.csproj -c Release -o ./publish

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: webapp
          path: ./publish

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      # Source and SDK are needed for the integration tests below
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}

      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp
          path: ./publish

      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      # Deploy to the "staging" slot of the app so the later slot swap applies
      - name: Deploy to staging slot
        uses: azure/webapps-deploy@v3
        with:
          app-name: ${{ env.AZURE_WEBAPP_NAME }}
          slot-name: staging
          package: ./publish

      - name: Run smoke tests
        run: |
          response=$(curl -s -o /dev/null -w "%{http_code}" https://${{ env.AZURE_WEBAPP_NAME }}-staging.azurewebsites.net/health)
          if [ "$response" != "200" ]; then
            echo "Health check failed with status $response"
            exit 1
          fi

      - name: Run integration tests
        run: |
          dotnet test tests/Contoso.IntegrationTests --filter Category=Smoke

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Swap staging to production
        run: |
          az webapp deployment slot swap \
            --name ${{ env.AZURE_WEBAPP_NAME }} \
            --resource-group rg-contoso-prod \
            --slot staging \
            --target-slot production

      - name: Verify production health
        run: |
          for i in {1..5}; do
            response=$(curl -s -o /dev/null -w "%{http_code}" https://${{ env.AZURE_WEBAPP_NAME }}.azurewebsites.net/health)
            if [ "$response" = "200" ]; then
              echo "Production health check passed"
              exit 0
            fi
            sleep 10
          done
          echo "Production health check failed"
          exit 1

Outcome:

  • Release cycle: 6 weeks → daily deployments
  • Deployment time: 2 hours manual → 15 minutes automated
  • Rollback time: 1 hour → 2 minutes (slot swap)
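
The post-swap verification step in the pipeline is a bounded retry loop: probe the health endpoint, succeed on the first HTTP 200, and give up after five attempts. A minimal Python sketch of that logic follows. The `wait_until_healthy` function is hypothetical, and the probe is injected so that the loop can be exercised without a network call.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], int],
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as probe() reports HTTP 200, else False after all attempts."""
    for attempt in range(1, attempts + 1):
        if probe() == 200:
            return True
        if attempt < attempts:
            sleep(delay_s)  # wait before the next attempt
    return False


# Hypothetical probe that fails twice before the swapped slot has warmed up.
statuses = iter([503, 503, 200])
waits = []
healthy = wait_until_healthy(lambda: next(statuses), sleep=waits.append)
print(healthy, waits)  # True [10.0, 10.0]
```

With the defaults used here the loop waits at most 40 seconds in total, so it tolerates a short warm-up after the swap but still fails quickly enough to trigger a rollback.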

3.3 Monitoring Stack

// Application Insights
resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'ai-contoso-${environment}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalytics.id
    RetentionInDays: 90
  }
}

// Log Analytics Workspace
resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-contoso-${environment}'
  location: location
  properties: {
    sku: { name: 'PerGB2018' }
    retentionInDays: 90
  }
}

// Alert for failed requests.
// requests/failed is a count, not a percentage; a true 5% error-rate alert
// would need a log query alert that divides failed by total requests.
resource errorRateAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-high-error-rate'
  location: 'global'
  properties: {
    description: 'Alert when more than 5 requests fail within 5 minutes'
    severity: 1
    enabled: true
    scopes: [appInsights.id]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThreshold'
          name: 'HighErrorRate'
          metricNamespace: 'microsoft.insights/components'
          metricName: 'requests/failed'
          operator: 'GreaterThan'
          threshold: 5
          timeAggregation: 'Count'
        }
      ]
    }
    actions: [{ actionGroupId: actionGroup.id }]
  }
}

Key Dashboards Created:

| Dashboard | Metrics | Audience |
| --- | --- | --- |
| Executive | Revenue, orders, availability | Leadership |
| Operations | Errors, latency, throughput | On-call team |
| Performance | Response times, DB queries, cache hits | Engineers |
| Security | Failed logins, blocked requests, anomalies | Security team |

Phase 4: Cost Optimization

4.1 Right-Sizing Analysis

Before:

| Resource | SKU | Utilization | Monthly Cost |
| --- | --- | --- | --- |
| Web VMs (2x) | D4s_v3 | 15% CPU | $560 |
| API VMs (2x) | D8s_v3 | 20% CPU | $1,120 |
| SQL VM | E16s_v3 | 25% CPU | $1,680 |
| Total | | | $3,360 |

After (PaaS Migration):

| Resource | SKU | Monthly Cost | Savings |
| --- | --- | --- | --- |
| App Service Plan | P1v3 (auto-scale 2-6) | $292 | 48% |
| Azure SQL | BC_Gen5_4 | $1,460 | 13% |
| Redis Cache | C1 Standard | $81 | N/A |
| Total | | $1,833 | 45% |
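
The savings percentages can be reproduced from the two tables. The sketch below assumes each percentage is (before - after) / before, rounded to the nearest whole number; the `savings_pct` helper is hypothetical.

```python
def savings_pct(before: float, after: float) -> int:
    """Percentage saved when a monthly cost drops from `before` to `after`."""
    return round((before - after) / before * 100)


# Monthly costs copied from the Before and After tables.
before = {"Web VMs (2x)": 560, "API VMs (2x)": 1120, "SQL VM": 1680}
after = {"App Service Plan": 292, "Azure SQL": 1460, "Redis Cache": 81}

total_before = sum(before.values())
total_after = sum(after.values())

overall = savings_pct(total_before, total_after)
app_vs_web = savings_pct(before["Web VMs (2x)"], after["App Service Plan"])
sql = savings_pct(before["SQL VM"], after["Azure SQL"])
print(total_before, total_after, overall, app_vs_web, sql)  # 3360 1833 45 48 13
```

Under this assumption, the 48% figure for the App Service plan matches a comparison with the web VMs alone. If the plan also replaces the API VMs, the per-resource saving would be higher than the table states.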

4.2 Reserved Instances

# Purchase 3-year reserved capacity for predictable workloads
# App Service Plan - 3 year reservation
# SQL Database - 3 year reserved capacity

# Estimated savings:
# - App Service: $292/mo → $117/mo (60% savings)
# - SQL Database: $1,460/mo → $584/mo (60% savings)
# - Total monthly: $1,833 → $782 (57% additional savings)
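
The reservation estimates above follow from applying a 60% discount to the two reservable services and leaving Redis at its on-demand price. A short sketch of that arithmetic, assuming the discount applies uniformly (the `reserved_price` helper is hypothetical):

```python
def reserved_price(on_demand: float, discount: float) -> int:
    """Monthly price after a reservation discount, rounded to whole dollars."""
    return round(on_demand * (1 - discount))


app_service = reserved_price(292, 0.60)
sql_database = reserved_price(1460, 0.60)
redis = 81  # left on demand in this scenario

total_reserved = app_service + sql_database + redis
additional_savings = round((1833 - total_reserved) / 1833 * 100)
print(app_service, sql_database, total_reserved, additional_savings)  # 117 584 782 57
```

The blended saving is 57% rather than 60% because the Redis cost is unchanged.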

4.3 Tagging Strategy Implementation

// Standard tags applied to all resources
var standardTags = {
  Environment: environment
  CostCenter: 'IT-Engineering'
  Application: 'Contoso-Ecommerce'
  Owner: 'platform-team@contoso.com'
  BusinessUnit: 'Digital'
  DataClassification: 'Confidential'
}

// Azure Policy to enforce required tags.
// Built-in definitions are identified by GUID; the ID below is the built-in
// "Require a tag on resources" definition (verify it in your tenant).
resource tagPolicy 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'require-tags'
  properties: {
    policyDefinitionId: '/providers/Microsoft.Authorization/policyDefinitions/871b6d14-10aa-478d-b590-94f262ecfa99'
    parameters: {
      tagName: { value: 'CostCenter' }
    }
    enforcementMode: 'Default'
  }
}

Cost Allocation Report:

| Cost Center | Monthly Spend | % of Total |
| --- | --- | --- |
| IT-Engineering | $12,500 | 45% |
| Marketing | $5,200 | 19% |
| Operations | $4,800 | 17% |
| Analytics | $3,100 | 11% |
| Dev/Test | $2,200 | 8% |
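
Once every resource carries a CostCenter tag, the share column is a simple aggregation. The sketch below recomputes it from the monthly spend figures in the report, assuming each share is rounded to the nearest whole percent.

```python
# Monthly spend per CostCenter tag, copied from the report above.
spend = {
    "IT-Engineering": 12_500,
    "Marketing": 5_200,
    "Operations": 4_800,
    "Analytics": 3_100,
    "Dev/Test": 2_200,
}

total = sum(spend.values())
share = {center: round(cost / total * 100) for center, cost in spend.items()}
print(total, share)
```

The tagged spend in this report totals $27,800 per month, so it covers only part of the overall cloud bill quoted elsewhere in the case study.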

Phase 5: Target Architecture

Final Architecture (After WAF)


Results and Outcomes

WAF Score Improvement

Score Comparison

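| Pillar | Before | After | Improvement |
| --- | --- | --- | --- |
| Reliability | 35 | 82 | +47 points |
| Security | 42 | 88 | +46 points |
| Cost Optimization | 38 | 75 | +37 points |
| Operational Excellence | 45 | 85 | +40 points |
| Performance Efficiency | 50 | 90 | +40 points |
| Overall | 42 | 84 | +42 points |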
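
The Overall row is consistent with an unweighted mean of the five pillar scores. Assuming that is how it was derived, the figures can be checked as follows:

```python
from statistics import mean

# Pillar -> (before, after), copied from the table above.
scores = {
    "Reliability": (35, 82),
    "Security": (42, 88),
    "Cost Optimization": (38, 75),
    "Operational Excellence": (45, 85),
    "Performance Efficiency": (50, 90),
}

overall_before = mean(before for before, _ in scores.values())
overall_after = mean(after for _, after in scores.values())
improvement = {pillar: after - before for pillar, (before, after) in scores.items()}
print(overall_before, overall_after, improvement)
```

Reliability and Security gained the most points, which matches the order in which the phases addressed them.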

Business Outcomes

| Metric | Before | After | Impact |
| --- | --- | --- | --- |
| Availability | 99.2% | 99.95% | $1.5M saved in prevented outages |
| Monthly Cloud Cost | $150,000 | $85,000 | $780K annual savings |
| Release Frequency | 6 weeks | Daily | 10x faster feature delivery |
| MTTR | 4 hours | 15 minutes | 94% reduction |
| Page Load Time | 3.2 seconds | 0.8 seconds | 75% faster |
| Security Incidents | 3/year | 0 | Zero breaches |
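
Availability percentages are easier to reason about as hours of downtime. The sketch below converts the before and after figures, assuming a 365-day year, and rechecks the cost, MTTR, and page-load impact columns. The helper name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


downtime_before = annual_downtime_hours(99.2)   # roughly 70 hours
downtime_after = annual_downtime_hours(99.95)   # roughly 4.4 hours

annual_cost_savings = (150_000 - 85_000) * 12
mttr_reduction_pct = round((240 - 15) / 240 * 100)      # minutes
page_load_improvement_pct = round((3.2 - 0.8) / 3.2 * 100)
print(round(downtime_before, 2), round(downtime_after, 2))
print(annual_cost_savings, mttr_reduction_pct, page_load_improvement_pct)
```

Moving from 99.2% to 99.95% cuts expected downtime from roughly 70 hours to about 4.4 hours per year.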

ROI Analysis

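| Investment | Cost | Annual Benefit | ROI |
| --- | --- | --- | --- |
| WAF Implementation | $200K (one-time) | | |
| Ongoing Operations | $50K/year | | |
| Total Cost | $250K | | |
| Cost Savings | | $780K | |
| Prevented Outages | | $1.5M | |
| Productivity Gains | | $300K | |
| Total Benefit | | $2.58M | 932% |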
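
The 932% figure matches the standard first-year ROI formula, (benefit - cost) / cost. A short check, assuming the one-time and first-year operating costs are summed as in the table:

```python
# Figures copied from the ROI table above (US dollars).
total_cost = 200_000 + 50_000                   # implementation + first-year operations
total_benefit = 780_000 + 1_500_000 + 300_000   # savings + prevented outages + productivity

roi_pct = round((total_benefit - total_cost) / total_cost * 100)
print(total_cost, total_benefit, roi_pct)  # 250000 2580000 932
```

In later years only the $50K operating cost recurs, so the ratio would be higher if the benefits hold.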

Lessons Learned

What Worked Well

  1. Phased approach: Tackling critical security first built confidence
  2. Quick wins: Early caching improvements showed immediate value
  3. Business alignment: Tying improvements to revenue impact got executive support
  4. Automation first: IaC and CI/CD accelerated subsequent phases

Challenges Faced

| Challenge | How We Addressed It |
| --- | --- |
| Legacy code dependencies | Incremental refactoring, strangler fig pattern |
| Team skill gaps | Training, pair programming, external consultants |
| Resistance to change | Demonstrated quick wins, involved team in decisions |
| Budget constraints | Prioritized by ROI, showed cost savings early |

Recommendations for Others

  1. Start with assessment: Know your baseline before making changes
  2. Prioritize ruthlessly: You can't fix everything at once
  3. Measure everything: Data drives better decisions
  4. Automate early: Manual processes don't scale
  5. Involve the business: Technical improvements need business context

Next Steps for Contoso

| Initiative | Timeline | Expected Outcome |
| --- | --- | --- |
| Kubernetes migration | Q3 2024 | Better resource utilization |
| AI-powered search | Q4 2024 | Improved customer experience |
| International expansion | Q1 2025 | Multi-region active-active |
| Sustainability optimization | Q2 2025 | Carbon footprint reduction |

Resources