Enterprise Case Study: Contoso Retail
TL;DR
This case study follows Contoso Retail, a fictional mid-size retailer, as it applies the Azure Well-Architected Framework (WAF) to transform its e-commerce platform. You'll see:
- Before/After architecture comparisons
- Real assessment scores and findings
- Prioritized remediation decisions
- Implementation details for each pillar
- Measurable outcomes and lessons learned
Company Background
About Contoso Retail
| Attribute | Details |
|---|---|
| Industry | Retail / E-commerce |
| Annual Revenue | $500M |
| Employees | 2,000 |
| Customers | 2M registered users |
| Daily Orders | 15,000 average, 50,000 peak (holidays) |
| Product Catalog | 50,000 SKUs |
| Tech Team | 25 engineers |
Business Context
Contoso Retail operates both physical stores and an e-commerce platform. Their online business has grown 40% year-over-year, but their legacy architecture is struggling to keep pace. Recent Black Friday outages cost them an estimated $2M in lost sales.
Key Business Drivers:
- Eliminate revenue-impacting outages
- Reduce cloud spending (currently $150K/month)
- Accelerate feature delivery (currently 6-week release cycles)
- Meet PCI-DSS compliance requirements
- Support international expansion
Initial State Assessment
Current Architecture (Before WAF)
Initial WAF Assessment Scores
Detailed Findings by Pillar
Reliability Findings (Score: 35/100)
| Finding | Severity | Impact |
|---|---|---|
| Single region deployment | Critical | Complete outage if region fails |
| No defined RTO/RPO | Critical | Unknown recovery capabilities |
| Database single point of failure | Critical | Data loss risk |
| Manual failover procedures | High | Extended downtime |
| No health monitoring | High | Reactive incident response |
| Monolithic application | Medium | Blast radius of failures |
Security Findings (Score: 42/100)
| Finding | Severity | Impact |
|---|---|---|
| No web application firewall | Critical | Vulnerable to web attacks |
| Secrets in config files | Critical | Credential exposure risk |
| No MFA for admin access | Critical | Account compromise risk |
| Public database endpoint | High | Data breach risk |
| No encryption at rest | High | Compliance violation |
| Overly permissive RBAC | Medium | Insider threat risk |
Cost Findings (Score: 38/100)
| Finding | Severity | Impact |
|---|---|---|
| Over-provisioned VMs | High | $30K/month waste |
| No reserved instances | High | Missing 40% savings |
| No auto-scaling | Medium | Paying for peak capacity 24/7 |
| LRS storage for critical data | Medium | Risk vs cost mismatch |
| No cost allocation tags | Medium | No accountability |
| Dev/Test using production SKUs | Low | Unnecessary spend |
Operational Excellence Findings (Score: 45/100)
| Finding | Severity | Impact |
|---|---|---|
| Manual deployments | High | 6-week release cycles |
| No Infrastructure as Code | High | Configuration drift |
| Limited monitoring | High | Blind to issues |
| No runbooks | Medium | Inconsistent incident response |
| Tribal knowledge | Medium | Key person dependency |
| No automated testing | Medium | Quality issues |
Performance Findings (Score: 50/100)
| Finding | Severity | Impact |
|---|---|---|
| No caching layer | High | Database overload |
| No CDN for static content | High | Slow page loads |
| Unoptimized queries | High | 3-5 second response times |
| No connection pooling | Medium | Connection exhaustion |
| Synchronous processing | Medium | Blocking operations |
| No load testing | Medium | Unknown capacity limits |
Prioritization and Roadmap
Risk-Based Prioritization
Findings were prioritized with an impact vs. effort matrix:
| Priority | Items | Effort | Impact |
|---|---|---|---|
| Quick Wins | Web application firewall, Key Vault, MFA, Caching | Low | High |
| Plan Carefully | Database migration, CI/CD | High | High |
| Strategic | Geo-redundancy, Microservices | High | High |
| Defer | Documentation updates | Low | Low |
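The quadrant assignment in the table can be expressed as a small lookup. The sketch below is illustrative only: the `classify` helper and the "Reconsider" label for high-effort, low-impact work are assumptions and not part of Contoso's process. The table rates both "Plan Carefully" and "Strategic" as high effort and high impact, so a lookup on these two ratings alone cannot tell them apart.

```python
# Hypothetical helper: maps (effort, impact) ratings to the table's priority buckets.
QUADRANTS = {
    ("Low", "High"): "Quick Wins",
    ("High", "High"): "Plan Carefully",  # the "Strategic" row carries the same ratings
    ("Low", "Low"): "Defer",
    ("High", "Low"): "Reconsider",  # assumption: this quadrant does not appear in the table
}


def classify(effort: str, impact: str) -> str:
    """Return the priority bucket for an effort/impact pair."""
    return QUADRANTS[(effort, impact)]


findings = {
    "Key Vault": ("Low", "High"),
    "CI/CD": ("High", "High"),
    "Documentation updates": ("Low", "Low"),
}

for name, (effort, impact) in findings.items():
    print(f"{name}: {classify(effort, impact)}")
```

Separating "Plan Carefully" from "Strategic" would need a third input, such as time horizon or dependency on earlier phases.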
Phased Roadmap
Phase 1: Critical Security Fixes
1.1 Enable MFA for All Users
Before: Password-only authentication for Azure portal and admin access.
Implementation:
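# Option 1: enable Security Defaults (includes MFA)
# Option 2: use Conditional Access for more control, as below
# Conditional Access policy via Microsoft Graph PowerShell
# (requires Connect-MgGraph -Scopes "Policy.ReadWrite.ConditionalAccess")
# Graph expects object IDs in excludeUsers, so resolve the break-glass account first
$breakGlassId = (Get-MgUser -UserId "BreakGlassAccount@contoso.com").Id
$policy = @{
    displayName = "Require MFA for all users"
    state = "enabled"
    conditions = @{
        clientAppTypes = @("all")
        users = @{
            includeUsers = @("All")
            excludeUsers = @($breakGlassId)
        }
        applications = @{
            includeApplications = @("All")
        }
    }
    grantControls = @{
        operator = "OR"
        builtInControls = @("mfa")
    }
}
# Create the policy
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy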
Outcome: 100% of admin accounts now require MFA.
1.2 Migrate Secrets to Key Vault
Before: Connection strings and API keys in appsettings.json and environment variables.
// BEFORE - appsettings.json (INSECURE!)
{
  "ConnectionStrings": {
    "Database": "Server=sql.contoso.com;Database=Orders;User=admin;Password=P@ssw0rd123!"
  },
  "PaymentGateway": {
    "ApiKey": "sk_live_abc123xyz789"
  }
}
After: All secrets in Azure Key Vault with managed identity access.
// Key Vault with private endpoint
resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-contoso-prod'
  location: location
  properties: {
    sku: { family: 'A', name: 'standard' }
    tenantId: subscription().tenantId
    enableRbacAuthorization: true
    enableSoftDelete: true
    softDeleteRetentionInDays: 90
    enablePurgeProtection: true
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
    }
  }
}
// Private endpoint for Key Vault
resource kvPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-05-01' = {
  name: 'pe-kv-contoso'
  location: location
  properties: {
    subnet: { id: privateEndpointSubnet.id }
    privateLinkServiceConnections: [
      {
        name: 'kv-connection'
        properties: {
          privateLinkServiceId: keyVault.id
          groupIds: ['vault']
        }
      }
    ]
  }
}
// Application code - access secrets via managed identity
// (requires the Azure.Identity and Azure.Extensions.AspNetCore.Configuration.Secrets packages)
builder.Configuration.AddAzureKeyVault(
    new Uri("https://kv-contoso-prod.vault.azure.net/"),
    new DefaultAzureCredential());
// Secrets are now accessed like regular configuration
var connectionString = builder.Configuration["Database-ConnectionString"];
Outcome: Zero secrets in code or config files. All secrets centrally managed with audit logging.
1.3 Deploy Web Application Firewall
Before: Direct internet access to load balancer, no web attack protection.
After: Azure Front Door with WAF in Prevention mode.
// Front Door with WAF
resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'fd-contoso-prod'
  location: 'global'
  sku: { name: 'Premium_AzureFrontDoor' }
}
resource wafPolicy 'Microsoft.Network/FrontDoorWebApplicationFirewallPolicies@2022-05-01' = {
  // Front Door WAF policy names allow letters and digits only, so no hyphens
  name: 'wafcontosoprod'
  location: 'global'
  sku: { name: 'Premium_AzureFrontDoor' }
  properties: {
    policySettings: {
      enabledState: 'Enabled'
      mode: 'Prevention'
      requestBodyCheck: 'Enabled'
    }
    managedRules: {
      managedRuleSets: [
        {
          ruleSetType: 'Microsoft_DefaultRuleSet'
          ruleSetVersion: '2.1'
          // Rule set versions 2.0 and later need an explicit action
          ruleSetAction: 'Block'
        }
        {
          ruleSetType: 'Microsoft_BotManagerRuleSet'
          ruleSetVersion: '1.0'
        }
      ]
    }
    customRules: {
      rules: [
        {
          // Block clients exceeding 1,000 requests per minute to /api/ paths
          name: 'RateLimitRule'
          priority: 1
          ruleType: 'RateLimitRule'
          rateLimitThreshold: 1000
          rateLimitDurationInMinutes: 1
          action: 'Block'
          matchConditions: [
            {
              matchVariable: 'RequestUri'
              operator: 'Contains'
              matchValue: ['/api/']
            }
          ]
        }
      ]
    }
  }
}
Outcome: Blocked 50,000+ malicious requests in first month. Zero successful web attacks.
1.4 Secure Database with Private Endpoint
Before: SQL Server accessible via public IP with firewall rules.
After: Private endpoint only, no public access.
// Disable public access
resource sqlServer 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: 'sql-contoso-prod'
  location: location
  properties: {
    publicNetworkAccess: 'Disabled'
    minimalTlsVersion: '1.2'
  }
}
// Private endpoint for SQL
resource sqlPrivateEndpoint 'Microsoft.Network/privateEndpoints@2023-05-01' = {
  name: 'pe-sql-contoso'
  location: location
  properties: {
    subnet: { id: dataSubnet.id }
    privateLinkServiceConnections: [
      {
        name: 'sql-connection'
        properties: {
          privateLinkServiceId: sqlServer.id
          groupIds: ['sqlServer']
        }
      }
    ]
  }
}
// Private DNS zone for SQL
resource privateDnsZone 'Microsoft.Network/privateDnsZones@2020-06-01' = {
  name: 'privatelink.database.windows.net'
  location: 'global'
}
Outcome: Database no longer accessible from internet. All access via private network.
Phase 2: Reliability Foundation
2.1 Define RTO and RPO
Working with business stakeholders, Contoso defined recovery objectives:
| Workload | RTO | RPO | Justification |
|---|---|---|---|
| E-commerce website | 15 minutes | 5 minutes | Revenue-critical |
| Order processing | 30 minutes | 0 (no data loss) | Financial transactions |
| Product catalog | 1 hour | 1 hour | Can rebuild from source |
| Analytics | 4 hours | 24 hours | Not customer-facing |
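Objectives like these are only useful if recovery drills are checked against them. The following is a minimal sketch of such a check, assuming drill results are recorded as minutes of downtime and minutes of lost data. The `meets_objectives` helper and the sample drill numbers are hypothetical and do not come from the case study.

```python
# Recovery objectives from the table above, expressed in minutes.
OBJECTIVES = {
    "E-commerce website": {"rto": 15, "rpo": 5},
    "Order processing": {"rto": 30, "rpo": 0},
    "Product catalog": {"rto": 60, "rpo": 60},
    "Analytics": {"rto": 240, "rpo": 1440},
}


def meets_objectives(workload: str, downtime_min: float, data_loss_min: float) -> bool:
    """True if measured downtime and data loss are within the workload's RTO and RPO."""
    target = OBJECTIVES[workload]
    return downtime_min <= target["rto"] and data_loss_min <= target["rpo"]


# Hypothetical drill results, not measurements from the case study.
print(meets_objectives("E-commerce website", downtime_min=12, data_loss_min=1))  # True
print(meets_objectives("Order processing", downtime_min=20, data_loss_min=0.1))  # False
```

The second call fails because an RPO of 0 tolerates no data loss at all, even a few seconds.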
2.2 Implement Health Checks
// Comprehensive health checks
builder.Services.AddHealthChecks()
    // Database connectivity
    .AddSqlServer(
        connectionString: builder.Configuration["Database-ConnectionString"],
        name: "database",
        failureStatus: HealthStatus.Unhealthy,
        tags: new[] { "db", "critical" })
    // Redis cache
    .AddRedis(
        redisConnectionString: builder.Configuration["Redis-ConnectionString"],
        name: "redis",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "cache" })
    // External payment gateway
    .AddUrlGroup(
        new Uri("https://api.paymentgateway.com/health"),
        name: "payment-gateway",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "external" })
    // Blob storage
    .AddAzureBlobStorage(
        connectionString: builder.Configuration["Storage-ConnectionString"],
        name: "blob-storage",
        failureStatus: HealthStatus.Degraded,
        tags: new[] { "storage" });
// Health check endpoints
// WriteHealthCheckResponse is a custom response writer that is not shown here
app.MapHealthChecks("/health", new HealthCheckOptions
{
    Predicate = _ => true, // Run every registered check
    ResponseWriter = WriteHealthCheckResponse
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("critical"), // Only checks tagged "critical"
    ResponseWriter = WriteHealthCheckResponse
});
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Runs no checks; just confirms the app is responding
});
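How the three endpoints differ is easier to see in a small model. The sketch below is a simplified Python model of the aggregation, not the ASP.NET Core implementation: it assumes the overall status is the worst status among the checks the predicate selects, and that an endpoint with no selected checks reports healthy. The snapshot of results is hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


def aggregate(results: dict, predicate) -> Status:
    """Worst status among the checks selected by predicate; HEALTHY if none are selected."""
    selected = [status for status, tags in results.values() if predicate(tags)]
    return min(selected, default=Status.HEALTHY)


# Hypothetical snapshot: Redis is failing, everything else responds.
results = {
    "database": (Status.HEALTHY, {"db", "critical"}),
    "redis": (Status.DEGRADED, {"cache"}),
    "payment-gateway": (Status.HEALTHY, {"external"}),
    "blob-storage": (Status.HEALTHY, {"storage"}),
}

print(aggregate(results, lambda tags: True).name)                # /health       -> DEGRADED
print(aggregate(results, lambda tags: "critical" in tags).name)  # /health/ready -> HEALTHY
print(aggregate(results, lambda tags: False).name)               # /health/live  -> HEALTHY
```

In this model a cache failure shows up on `/health` without taking the instance out of rotation through `/health/ready`, which is the point of tagging only the database as critical.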
2.3 Add Redis Caching Layer
Before: Every request hit the database directly.
After: Distributed caching with Redis reduces database load by 70%.
// Cache-aside pattern for product catalog
using System.Text.Json;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;

public class ProductService
{
    private readonly IDistributedCache _cache;
    private readonly IProductRepository _repository;
    private readonly ILogger<ProductService> _logger;
    private static readonly TimeSpan CacheDuration = TimeSpan.FromMinutes(15);

    // Dependencies are supplied by the DI container
    public ProductService(
        IDistributedCache cache,
        IProductRepository repository,
        ILogger<ProductService> logger)
    {
        _cache = cache;
        _repository = repository;
        _logger = logger;
    }

    public async Task<Product?> GetProductAsync(string productId)
    {
        var cacheKey = $"product:{productId}";
        // Try cache first
        var cached = await _cache.GetStringAsync(cacheKey);
        if (cached != null)
        {
            _logger.LogDebug("Cache hit for product {ProductId}", productId);
            return JsonSerializer.Deserialize<Product>(cached);
        }
        // Cache miss - get from database
        _logger.LogDebug("Cache miss for product {ProductId}", productId);
        var product = await _repository.GetByIdAsync(productId);
        if (product != null)
        {
            // Populate the cache so later reads skip the database until the entry expires
            await _cache.SetStringAsync(
                cacheKey,
                JsonSerializer.Serialize(product),
                new DistributedCacheEntryOptions
                {
                    AbsoluteExpirationRelativeToNow = CacheDuration
                });
        }
        return product;
    }

    // Call after a product is updated so stale data is not served
    public async Task InvalidateProductCacheAsync(string productId)
    {
        await _cache.RemoveAsync($"product:{productId}");
        await _cache.RemoveAsync("products:featured"); // Invalidate related caches
    }
}
Outcome:
- Database queries reduced by 70%
- Average response time: 3.2s → 180ms
- Database CPU: 85% → 25%
2.4 Database Geo-Replication
// Primary database
resource sqlDatabase 'Microsoft.Sql/servers/databases@2022-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'contoso-orders'
  location: 'eastus'
  sku: {
    name: 'BC_Gen5_4'
    tier: 'BusinessCritical'
  }
  properties: {
    zoneRedundant: true
  }
}
// Secondary server in different region
resource sqlServerSecondary 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: 'sql-contoso-secondary'
  location: 'westus'
  properties: {
    publicNetworkAccess: 'Disabled'
  }
}
// Geo-replication link
resource geoReplication 'Microsoft.Sql/servers/databases@2022-05-01-preview' = {
  parent: sqlServerSecondary
  name: 'contoso-orders'
  location: 'westus'
  properties: {
    createMode: 'Secondary'
    sourceDatabaseId: sqlDatabase.id
  }
}
// Auto-failover group
resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2022-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'fg-contoso'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Enabled'
    }
    partnerServers: [
      { id: sqlServerSecondary.id }
    ]
    databases: [sqlDatabase.id]
  }
}
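Outcome: RPO reduced to ~5 seconds with automatic failover capability. Because geo-replication is asynchronous, this alone does not satisfy the zero-data-loss RPO defined for order processing in 2.1.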
Phase 3: Operational Excellence
3.1 Infrastructure as Code with Bicep
All infrastructure is now defined in Bicep modules:
infrastructure/
├── main.bicep
├── modules/
│   ├── networking.bicep
│   ├── compute.bicep
│   ├── data.bicep
│   ├── security.bicep
│   └── monitoring.bicep
├── environments/
│   ├── dev.bicepparam
│   ├── staging.bicepparam
│   └── prod.bicepparam
└── .github/
    └── workflows/
        └── infrastructure.yml
// main.bicep
targetScope = 'subscription'

@description('Environment name')
@allowed(['dev', 'staging', 'prod'])
param environment string

@description('Primary Azure region')
param primaryLocation string = 'eastus'

@description('Secondary Azure region for DR')
param secondaryLocation string = 'westus'

// Resource Group
resource rg 'Microsoft.Resources/resourceGroups@2023-07-01' = {
  name: 'rg-contoso-${environment}'
  location: primaryLocation
  tags: {
    Environment: environment
    CostCenter: 'IT-Engineering'
    Application: 'Contoso-Ecommerce'
  }
}

// Networking
module networking 'modules/networking.bicep' = {
  scope: rg
  name: 'networking'
  params: {
    environment: environment
    location: primaryLocation
  }
}

// Security (Key Vault, etc.)
module security 'modules/security.bicep' = {
  scope: rg
  name: 'security'
  params: {
    environment: environment
    location: primaryLocation
    subnetId: networking.outputs.privateEndpointSubnetId
  }
}

// Data (SQL, Redis, Storage)
module data 'modules/data.bicep' = {
  scope: rg
  name: 'data'
  params: {
    environment: environment
    primaryLocation: primaryLocation
    secondaryLocation: secondaryLocation
    subnetId: networking.outputs.dataSubnetId
  }
}

// Compute (App Service, Functions)
module compute 'modules/compute.bicep' = {
  scope: rg
  name: 'compute'
  params: {
    environment: environment
    location: primaryLocation
    subnetId: networking.outputs.appSubnetId
    keyVaultName: security.outputs.keyVaultName
  }
}

// Monitoring
module monitoring 'modules/monitoring.bicep' = {
  scope: rg
  name: 'monitoring'
  params: {
    environment: environment
    location: primaryLocation
  }
}
3.2 CI/CD Pipeline
# .github/workflows/deploy.yml
name: Build and Deploy
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
env:
  DOTNET_VERSION: '8.0.x'
  AZURE_WEBAPP_NAME: app-contoso
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # lets CodeQL upload its results
    steps:
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}
      # CodeQL must be initialized before the build so it can observe compilation
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: csharp
      - name: Restore dependencies
        run: dotnet restore
      - name: Build
        run: dotnet build --configuration Release --no-restore
      - name: Run unit tests
        # --no-build requires the same configuration that was built above
        run: dotnet test --configuration Release --no-build --verbosity normal --collect:"XPlat Code Coverage" --results-directory ./coverage
      - name: Run security scan
        uses: github/codeql-action/analyze@v3
      - name: Publish
        run: dotnet publish src/Contoso.Web/Contoso.Web.csproj -c Release -o ./publish
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: webapp
          path: ./publish
  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      # Source and SDK are needed for the integration tests at the end of this job
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp
          path: ./publish
      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy to staging slot
        uses: azure/webapps-deploy@v3
        with:
          # Deploy to the "staging" slot of the production app so the later swap works
          app-name: ${{ env.AZURE_WEBAPP_NAME }}
          slot-name: staging
          package: ./publish
      - name: Run smoke tests
        run: |
          response=$(curl -s -o /dev/null -w "%{http_code}" https://${{ env.AZURE_WEBAPP_NAME }}-staging.azurewebsites.net/health)
          if [ "$response" != "200" ]; then
            echo "Health check failed with status $response"
            exit 1
          fi
      - name: Run integration tests
        run: |
          dotnet test tests/Contoso.IntegrationTests --filter Category=Smoke
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Swap staging to production
        run: |
          az webapp deployment slot swap \
            --name ${{ env.AZURE_WEBAPP_NAME }} \
            --resource-group rg-contoso-prod \
            --slot staging \
            --target-slot production
      - name: Verify production health
        run: |
          # Retry up to 5 times, 10 seconds apart, while the swapped slot warms up
          for i in {1..5}; do
            response=$(curl -s -o /dev/null -w "%{http_code}" https://${{ env.AZURE_WEBAPP_NAME }}.azurewebsites.net/health)
            if [ "$response" = "200" ]; then
              echo "Production health check passed"
              exit 0
            fi
            sleep 10
          done
          echo "Production health check failed"
          exit 1
Outcome:
- Release cycle: 6 weeks → daily deployments
- Deployment time: 2 hours manual → 15 minutes automated
- Rollback time: 1 hour → 2 minutes (slot swap)
3.3 Monitoring Stack
// Application Insights
resource appInsights 'Microsoft.Insights/components@2020-02-02' = {
  name: 'ai-contoso-${environment}'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: logAnalytics.id
    RetentionInDays: 90
  }
}
// Log Analytics Workspace
resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: 'log-contoso-${environment}'
  location: location
  properties: {
    sku: { name: 'PerGB2018' }
    retentionInDays: 90
  }
}
// Alert for a high number of failed requests
// (requests/failed is a count, not a percentage, so the threshold is in requests)
resource errorRateAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-high-error-rate'
  location: 'global'
  properties: {
    description: 'Alert when more than 5 requests fail within a 5-minute window'
    severity: 1
    enabled: true
    scopes: [appInsights.id]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HighErrorRate'
          criterionType: 'StaticThresholdCriterion'
          metricName: 'requests/failed'
          operator: 'GreaterThan'
          threshold: 5
          timeAggregation: 'Count'
        }
      ]
    }
    actions: [{ actionGroupId: actionGroup.id }]
  }
}
Key Dashboards Created:
| Dashboard | Metrics | Audience |
|---|---|---|
| Executive | Revenue, orders, availability | Leadership |
| Operations | Errors, latency, throughput | On-call team |
| Performance | Response times, DB queries, cache hits | Engineers |
| Security | Failed logins, blocked requests, anomalies | Security team |
Phase 4: Cost Optimization
4.1 Right-Sizing Analysis
Before:
| Resource | SKU | Utilization | Monthly Cost |
|---|---|---|---|
| Web VMs (2x) | D4s_v3 | 15% CPU | $560 |
| API VMs (2x) | D8s_v3 | 20% CPU | $1,120 |
| SQL VM | E16s_v3 | 25% CPU | $1,680 |
| Total | | | $3,360 |
After (PaaS Migration):
| Resource | SKU | Monthly Cost | Savings |
|---|---|---|---|
| App Service Plan | P1v3 (auto-scale 2-6) | $292 | 48% |
| Azure SQL | BC_Gen5_4 | $1,460 | 13% |
| Redis Cache | C1 Standard | $81 | N/A |
| Total | | $1,833 | 45% |
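The totals and savings percentages in the two tables can be reproduced with a few lines of arithmetic. The sketch below uses only the monthly costs listed above. The pairing of the App Service Plan with the Web VMs alone is an inference from the 48% figure, not something the tables state.

```python
def savings_pct(before: float, after: float) -> int:
    """Percentage saved moving from before to after, rounded to a whole percent."""
    return round((before - after) / before * 100)


before = {"Web VMs": 560, "API VMs": 1120, "SQL VM": 1680}
after = {"App Service Plan": 292, "Azure SQL": 1460, "Redis Cache": 81}

print(sum(before.values()))  # 3360
print(sum(after.values()))   # 1833
print(savings_pct(before["Web VMs"], after["App Service Plan"]))  # 48
print(savings_pct(before["SQL VM"], after["Azure SQL"]))          # 13
print(savings_pct(sum(before.values()), sum(after.values())))     # 45
```

Redis is a new line item with no "before" equivalent, which is why its savings column reads N/A.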
4.2 Reserved Instances
# Purchase 3-year reserved capacity for predictable workloads
# App Service Plan - 3 year reservation
# SQL Database - 3 year reserved capacity
# Estimated savings:
# - App Service: $292/mo → $117/mo (60% savings)
# - SQL Database: $1,460/mo → $584/mo (60% savings)
# - Total monthly: $1,833 → $782 (57% additional savings)
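The $782 total only adds up if the Redis cache stays at its on-demand price, which the estimate implies but does not say. A minimal sketch of that arithmetic follows; the dictionary names are illustrative, and the last line relates the result back to the $3,360 VM estate from section 4.1.

```python
on_demand = {"App Service": 292, "SQL Database": 1460, "Redis Cache": 81}
reserved = {"App Service": 117, "SQL Database": 584}  # no Redis reservation in the estimate

# Use the reserved price where one exists, otherwise keep the on-demand price.
monthly_after = sum(reserved.get(name, cost) for name, cost in on_demand.items())
monthly_before = sum(on_demand.values())

print(monthly_after)  # 782
print(round((monthly_before - monthly_after) / monthly_before * 100))  # 57
print(round((3360 - monthly_after) / 3360 * 100))  # 77, relative to the original VM estate
```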
4.3 Tagging Strategy Implementation
// Standard tags applied to all resources
var standardTags = {
  Environment: environment
  CostCenter: 'IT-Engineering'
  Application: 'Contoso-Ecommerce'
  Owner: 'platform-team@contoso.com'
  BusinessUnit: 'Digital'
  DataClassification: 'Confidential'
}
// Azure Policy to enforce required tags
resource tagPolicy 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'require-tags'
  properties: {
    // Built-in definitions are addressed by GUID. This should be the built-in
    // "Require a tag on resources" definition; verify the ID before deploying.
    policyDefinitionId: tenantResourceId('Microsoft.Authorization/policyDefinitions', '871b6d14-10aa-478d-b590-94f262ecfa99')
    parameters: {
      tagName: { value: 'CostCenter' }
    }
    enforcementMode: 'Default'
  }
}
Cost Allocation Report:
| Cost Center | Monthly Spend | % of Total |
|---|---|---|
| IT-Engineering | $12,500 | 45% |
| Marketing | $5,200 | 19% |
| Operations | $4,800 | 17% |
| Analytics | $3,100 | 11% |
| Dev/Test | $2,200 | 8% |
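A report like this is a group-by over billing line items keyed on the `CostCenter` tag. The sketch below shows the idea with hypothetical line items. The resource names, costs, and the `allocate` helper are illustrative assumptions and not Contoso's actual billing data.

```python
from collections import defaultdict

# Hypothetical billing line items; a real report would be built from a cost export.
line_items = [
    {"resource": "app-contoso", "cost": 292.0, "tags": {"CostCenter": "IT-Engineering"}},
    {"resource": "sql-contoso-prod", "cost": 1460.0, "tags": {"CostCenter": "IT-Engineering"}},
    {"resource": "stcampaignassets", "cost": 120.0, "tags": {"CostCenter": "Marketing"}},
    {"resource": "vm-legacy-01", "cost": 128.0, "tags": {}},  # untagged: nobody is accountable
]


def allocate(items):
    """Sum cost per CostCenter tag; resources without the tag land in 'Untagged'."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get("CostCenter", "Untagged")] += item["cost"]
    return dict(totals)


totals = allocate(line_items)
grand_total = sum(totals.values())
for center, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{center}: ${cost:,.0f} ({cost / grand_total:.0%})")
```

The "Untagged" bucket is what a tag-enforcement policy is meant to eliminate: spend that cannot be attributed to any owner.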
Phase 5: Target Architecture
Final Architecture (After WAF)
Results and Outcomes
WAF Score Improvement
Score Comparison
| Pillar | Before | After | Improvement |
|---|---|---|---|
| Reliability | 35 | 82 | +47 points |
| Security | 42 | 88 | +46 points |
| Cost Optimization | 38 | 75 | +37 points |
| Operational Excellence | 45 | 85 | +40 points |
| Performance Efficiency | 50 | 90 | +40 points |
| Overall | 42 | 84 | +42 points |
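The Overall row is consistent with an unweighted mean of the five pillar scores. Whether the assessment actually weights the pillars equally is an assumption; the sketch below only checks that the published numbers agree with that reading.

```python
# (before, after) scores per pillar, taken from the table above.
scores = {
    "Reliability": (35, 82),
    "Security": (42, 88),
    "Cost Optimization": (38, 75),
    "Operational Excellence": (45, 85),
    "Performance Efficiency": (50, 90),
}

before_avg = sum(before for before, _ in scores.values()) / len(scores)
after_avg = sum(after for _, after in scores.values()) / len(scores)

print(before_avg, after_avg)   # 42.0 84.0
print(after_avg - before_avg)  # 42.0
```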
Business Outcomes
| Metric | Before | After | Impact |
|---|---|---|---|
| Availability | 99.2% | 99.95% | $1.5M saved in prevented outages |
| Monthly Cloud Cost | $150,000 | $85,000 | $780K annual savings |
| Release Frequency | 6 weeks | Daily | 10x faster feature delivery |
| MTTR | 4 hours | 15 minutes | 94% reduction |
| Page Load Time | 3.2 seconds | 0.8 seconds | 75% faster |
| Security Incidents | 3/year | 0 | Zero breaches |
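Availability percentages are easier to judge as hours of downtime per year. The sketch below converts the two availability figures and rechecks the cost and MTTR rows, assuming a 365-day year. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability percentage."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


print(round(annual_downtime_hours(99.2), 1))   # 70.1
print(round(annual_downtime_hours(99.95), 1))  # 4.4
print((150_000 - 85_000) * 12)                 # 780000 annual cloud savings
print(round((240 - 15) / 240 * 100))           # 94 percent MTTR reduction
```

Going from 99.2% to 99.95% therefore means roughly 70 hours of downtime a year shrinking to under 5.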
ROI Analysis
| Item | Cost | Annual Benefit | ROI |
|---|---|---|---|
| WAF Implementation | $200K (one-time) | | |
| Ongoing Operations | $50K/year | | |
| Total Cost | $250K | | |
| Cost Savings | | $780K | |
| Prevented Outages | | $1.5M | |
| Productivity Gains | | $300K | |
| Total Benefit | | $2.58M | 932% |
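The 932% figure follows from the standard ROI formula, (benefit - cost) / cost. The sketch below reproduces it from the table, assuming a first-year view in which the one-time implementation cost and one year of operations are both counted. The `roi_pct` helper is illustrative.

```python
def roi_pct(total_benefit: float, total_cost: float) -> float:
    """Return on investment as a percentage: net benefit divided by cost."""
    return (total_benefit - total_cost) / total_cost * 100


total_cost = 200_000 + 50_000                  # implementation + first year of operations
total_benefit = 780_000 + 1_500_000 + 300_000  # savings + prevented outages + productivity

print(total_cost)                               # 250000
print(total_benefit)                            # 2580000
print(round(roi_pct(total_benefit, total_cost)))  # 932
```

In later years the one-time $200K drops out, so the same benefits would be measured against the $50K operating cost alone.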
Lessons Learned
What Worked Well
- Phased approach: Tackling critical security first built confidence
- Quick wins: Early caching improvements showed immediate value
- Business alignment: Tying improvements to revenue impact got executive support
- Automation first: IaC and CI/CD accelerated subsequent phases
Challenges Faced
| Challenge | How We Addressed It |
|---|---|
| Legacy code dependencies | Incremental refactoring, strangler fig pattern |
| Team skill gaps | Training, pair programming, external consultants |
| Resistance to change | Demonstrated quick wins, involved team in decisions |
| Budget constraints | Prioritized by ROI, showed cost savings early |
Recommendations for Others
- Start with assessment: Know your baseline before making changes
- Prioritize ruthlessly: You can't fix everything at once
- Measure everything: Data drives better decisions
- Automate early: Manual processes don't scale
- Involve the business: Technical improvements need business context
Next Steps for Contoso
| Initiative | Timeline | Expected Outcome |
|---|---|---|
| Kubernetes migration | Q3 2024 | Better resource utilization |
| AI-powered search | Q4 2024 | Improved customer experience |
| International expansion | Q1 2025 | Multi-region active-active |
| Sustainability optimization | Q2 2025 | Carbon footprint reduction |