Operational Excellence Pillar
TL;DR
The Operational Excellence pillar focuses on operating and improving your workloads effectively. Key concepts:
- DevOps culture: Collaboration between development and operations
- Infrastructure as Code: Version-controlled, repeatable deployments
- CI/CD pipelines: Automated build, test, and deployment
- Monitoring & observability: Understand system behavior and health
- Automation: Reduce manual effort and human error
Design Principles
Core Operational Excellence Principles
| Principle | Description | Implementation |
|---|---|---|
| Embrace DevOps | Break down silos | Shared ownership, collaboration |
| Use IaC | Treat infrastructure as software | Bicep, Terraform, ARM |
| Automate operations | Reduce manual intervention | Runbooks, auto-remediation |
| Monitor everything | Full visibility into systems | Metrics, logs, traces |
| Learn from failures | Continuous improvement | Blameless postmortems |
DevOps Lifecycle
Infrastructure as Code
IaC Benefits
| Benefit | Description |
|---|---|
| Version Control | Track changes, rollback if needed |
| Consistency | Same infrastructure every time |
| Automation | Deploy without manual steps |
| Documentation | Code is the documentation |
| Testing | Validate before deployment |
| Collaboration | Review changes via PRs |
IaC Tool Comparison
| Feature | Bicep | Terraform | ARM Templates |
|---|---|---|---|
| Syntax | Clean, concise | HCL | JSON (verbose) |
| Learning Curve | Low | Medium | High |
| Multi-cloud | Azure only | Yes | Azure only |
| State Management | Azure-managed | External state file | Azure-managed |
| Modularity | Modules | Modules | Linked templates |
| IDE Support | VS Code extension | VS Code extension | VS Code extension |
Bicep Example
// main.bicep - Deploy a Linux web app and a SQL logical server
@description('The Azure region for resources')
param location string = resourceGroup().location
@description('Environment name')
@allowed(['dev', 'staging', 'prod'])
param environment string = 'dev'
@description('SQL admin password')
@secure()
param sqlAdminPassword string
// Variables
var appServicePlanName = 'asp-${environment}-${uniqueString(resourceGroup().id)}'
var webAppName = 'app-${environment}-${uniqueString(resourceGroup().id)}'
var sqlServerName = 'sql-${environment}-${uniqueString(resourceGroup().id)}'
// App Service Plan
resource appServicePlan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: appServicePlanName
  location: location
  kind: 'linux'
  sku: {
    name: environment == 'prod' ? 'P1v3' : 'B1'
    tier: environment == 'prod' ? 'PremiumV3' : 'Basic'
  }
  properties: {
    reserved: true // Linux
  }
}
// Web App
resource webApp 'Microsoft.Web/sites@2022-09-01' = {
  name: webAppName
  location: location
  kind: 'app,linux'
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      linuxFxVersion: 'DOTNETCORE|8.0'
      alwaysOn: environment == 'prod'
      healthCheckPath: '/health'
    }
  }
  identity: {
    type: 'SystemAssigned'
  }
}
// SQL Server (logical server only; no database is defined in this example)
resource sqlServer 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: sqlServerName
  location: location
  properties: {
    administratorLogin: 'sqladmin'
    administratorLoginPassword: sqlAdminPassword
    minimalTlsVersion: '1.2'
  }
}
// Outputs
output webAppUrl string = 'https://${webApp.properties.defaultHostName}'
output webAppIdentityId string = webApp.identity.principalId
Terraform Example
# main.tf - Deploy a Linux web app (no SQL resources are defined in this example)
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    # Needed for the random_string resource used in the web app name
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatestorage"
    container_name       = "tfstate"
    key                  = "webapp.tfstate"
  }
}
provider "azurerm" {
  features {}
}
variable "environment" {
  type    = string
  default = "dev"
}
variable "location" {
  type    = string
  default = "eastus"
}
resource "azurerm_resource_group" "main" {
  name     = "rg-webapp-${var.environment}"
  location = var.location
}
resource "azurerm_service_plan" "main" {
  name                = "asp-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  os_type             = "Linux"
  sku_name            = var.environment == "prod" ? "P1v3" : "B1"
}
resource "azurerm_linux_web_app" "main" {
  name                = "app-${var.environment}-${random_string.suffix.result}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  service_plan_id     = azurerm_service_plan.main.id
  site_config {
    application_stack {
      dotnet_version = "8.0"
    }
    health_check_path = "/health"
  }
  identity {
    type = "SystemAssigned"
  }
}
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}
output "webapp_url" {
  value = "https://${azurerm_linux_web_app.main.default_hostname}"
}
CI/CD Pipelines
Pipeline Architecture
GitHub Actions Example
# .github/workflows/deploy.yml
name: Build and Deploy
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  AZURE_WEBAPP_NAME: my-web-app
  DOTNET_VERSION: '8.0.x'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}
      - name: Restore dependencies
        run: dotnet restore
      - name: Build
        run: dotnet build --configuration Release --no-restore
      - name: Test
        # Match the build configuration; with --no-build the default (Debug) output would not exist
        run: dotnet test --configuration Release --no-build --verbosity normal --collect:"XPlat Code Coverage"
      - name: Publish
        run: dotnet publish -c Release -o ./publish
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: webapp
          path: ./publish
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp
      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy to Staging
        uses: azure/webapps-deploy@v3
        with:
          app-name: ${{ env.AZURE_WEBAPP_NAME }}-staging
          package: .
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp
      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy to Production
        uses: azure/webapps-deploy@v3
        with:
          app-name: ${{ env.AZURE_WEBAPP_NAME }}
          package: .
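The `needs` keys in the workflow above form a dependency chain: build, then staging, then production. The following is a minimal Python sketch of how such a chain resolves to an execution order. The `WORKFLOW_NEEDS` mapping and the `job_order` helper are hypothetical names introduced for illustration; this is not how GitHub Actions itself schedules jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical mirror of the workflow's jobs: job name -> jobs it `needs`.
WORKFLOW_NEEDS = {
    "build": [],
    "deploy-staging": ["build"],
    "deploy-production": ["deploy-staging"],
}


def job_order(needs: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order for the jobs.

    Raises graphlib.CycleError if the `needs` relationships are circular.
    """
    return list(TopologicalSorter(needs).static_order())


if __name__ == "__main__":
    for position, job in enumerate(job_order(WORKFLOW_NEEDS), start=1):
        print(position, job)
```

Modelling the pipeline as a graph makes it easy to check, before a change is merged, that a new job cannot run ahead of the stage it depends on.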
Azure DevOps Pipeline
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
pool:
  vmImage: 'ubuntu-latest'
variables:
  buildConfiguration: 'Release'
  azureSubscription: 'Azure-Connection'
  webAppName: 'my-web-app'
stages:
  - stage: Build
    jobs:
      - job: BuildJob
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.0.x'
          - task: DotNetCoreCLI@2
            displayName: 'Restore'
            inputs:
              command: 'restore'
          - task: DotNetCoreCLI@2
            displayName: 'Build'
            inputs:
              command: 'build'
              arguments: '--configuration $(buildConfiguration)'
          - task: DotNetCoreCLI@2
            displayName: 'Test'
            inputs:
              command: 'test'
              arguments: '--configuration $(buildConfiguration) --collect:"XPlat Code Coverage"'
          - task: DotNetCoreCLI@2
            displayName: 'Publish'
            inputs:
              command: 'publish'
              publishWebProjects: true
              arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)'
          - task: PublishBuildArtifacts@1
            inputs:
              pathToPublish: '$(Build.ArtifactStagingDirectory)'
              artifactName: 'webapp'
  - stage: DeployStaging
    dependsOn: Build
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployStaging
        environment: 'staging'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    appType: 'webAppLinux'
                    appName: '$(webAppName)-staging'
                    package: '$(Pipeline.Workspace)/webapp/**/*.zip'
  - stage: DeployProduction
    dependsOn: DeployStaging
    jobs:
      - deployment: DeployProduction
        environment: 'production'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    appType: 'webAppLinux'
                    appName: '$(webAppName)'
                    package: '$(Pipeline.Workspace)/webapp/**/*.zip'
Safe Deployment Practices
Deployment Strategies
Strategy Comparison
| Strategy | Risk | Rollback Speed | Resource Cost |
|---|---|---|---|
| Blue-Green | Low | Instant | 2x during deploy |
| Canary | Very Low | Fast | Minimal |
| Rolling | Medium | Slow | Minimal |
| Recreate | High | Slow | Minimal |
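Canary's low risk in the table above comes from exposing only a small share of traffic to the new version and promoting it only when it looks healthy. The sketch below shows one way such a gate could be expressed in Python. The `Stats` type, the `canary_decision` function, and both thresholds (`min_requests`, `max_rate_increase`) are assumptions made for illustration, not values from any Azure service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Request counters for one version over the observation window."""
    requests: int
    failures: int

    @property
    def failure_rate(self) -> float:
        # Treat "no traffic yet" as a 0.0 failure rate.
        return self.failures / self.requests if self.requests else 0.0


def canary_decision(baseline: Stats, canary: Stats,
                    min_requests: int = 500,
                    max_rate_increase: float = 0.01) -> str:
    """Return 'wait', 'promote', or 'rollback' for a canary release.

    - 'wait': the canary has not served enough requests to judge.
    - 'rollback': the canary fails noticeably more often than the baseline.
    - 'promote': otherwise.
    """
    if canary.requests < min_requests:
        return "wait"
    if canary.failure_rate > baseline.failure_rate + max_rate_increase:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(Stats(10_000, 50), Stats(1_000, 30)))
```

Comparing against the baseline rather than a fixed number keeps the gate meaningful when the service already has a non-zero background failure rate.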
App Service Deployment Slots
# Create staging slot
az webapp deployment slot create \
  --name myWebApp \
  --resource-group myRG \
  --slot staging
# Deploy to staging
az webapp deployment source config-zip \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --src app.zip
# Swap staging to production
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --target-slot production
# Swap with preview (validate first)
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --action preview
# Complete or cancel swap
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --action swap # or 'reset' to cancel
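The preview step above leaves a decision to make: complete the swap or reset it. As a rough sketch, that decision could be driven by recent health probe results from the staging slot. The `swap_action` function and the `required_successes` parameter are hypothetical; how the status codes are collected (for example by polling the `/health` path) is left out.

```python
def swap_action(health_statuses: list[int], required_successes: int = 3) -> str:
    """Pick the value for `--action` after a previewed swap.

    `health_statuses` holds HTTP status codes from health probes, oldest first.
    Returns 'swap' only when the most recent `required_successes` probes all
    returned 200; otherwise returns 'reset' to cancel the swap.
    """
    recent = health_statuses[-required_successes:]
    if len(recent) == required_successes and all(code == 200 for code in recent):
        return "swap"
    return "reset"


if __name__ == "__main__":
    print(swap_action([500, 200, 200, 200]))
```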
Monitoring and Observability
Three Pillars of Observability
Observability rests on three kinds of telemetry: metrics, logs, and traces.
Azure Monitor Stack
| Service | Purpose | Data Type |
|---|---|---|
| Azure Monitor | Unified monitoring platform | All telemetry |
| Log Analytics | Log aggregation and querying | Logs |
| Application Insights | APM for applications | Traces, metrics |
| Metrics Explorer | Metric visualization | Metrics |
| Alerts | Proactive notifications | All |
| Workbooks | Custom dashboards | All |
Application Insights Setup
// Program.cs - Configure Application Insights
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;
var builder = WebApplication.CreateBuilder(args);
// Add Application Insights
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
// Register the custom telemetry initializer
builder.Services.AddSingleton<ITelemetryInitializer, CustomTelemetryInitializer>();
var app = builder.Build();
app.Run();
// Custom telemetry initializer: stamps every telemetry item with a role name and environment
public class CustomTelemetryInitializer : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        telemetry.Context.Cloud.RoleName = "OrderService";
        telemetry.Context.GlobalProperties["Environment"] = "Production";
    }
}
KQL Queries for Monitoring
// Request performance
requests
| where timestamp > ago(1h)
| summarize
    Count = count(),
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    FailureRate = countif(success == false) * 100.0 / count()
    by bin(timestamp, 5m)
| render timechart
// Dependency failures
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize Count = count() by target, type, resultCode
| order by Count desc
// Exception analysis
exceptions
| where timestamp > ago(24h)
| summarize Count = count() by type, outerMessage
| order by Count desc
| take 10
// Slow requests (duration is in milliseconds, so > 1000 means slower than 1 second)
requests
| where timestamp > ago(1h)
| where duration > 1000
| project timestamp, name, duration, resultCode, customDimensions
| order by duration desc
| take 20
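To make the request performance query's aggregations concrete, here is a small Python sketch that computes the same four values for one time bin. The `summarize_requests` function is a hypothetical helper, and it uses a simple nearest-rank percentile; KQL's `percentile()` may use a different estimate, so results can differ slightly on real data.

```python
from statistics import mean


def summarize_requests(requests: list[tuple[float, bool]]) -> dict[str, float]:
    """Aggregate one time bin of (duration_ms, success) pairs.

    Mirrors Count, AvgDuration, P95Duration and FailureRate from the query,
    using a nearest-rank 95th percentile.
    """
    if not requests:
        raise ValueError("no requests in the window")
    durations = sorted(duration for duration, _ in requests)
    failures = sum(1 for _, success in requests if not success)
    count = len(requests)
    # Nearest-rank: smallest 1-based rank r with r >= 0.95 * count (integer math).
    rank = (95 * count + 99) // 100
    return {
        "Count": count,
        "AvgDuration": mean(durations),
        "P95Duration": durations[rank - 1],
        "FailureRate": failures * 100.0 / count,
    }


if __name__ == "__main__":
    sample = [(100.0, True)] * 19 + [(2000.0, False)]
    print(summarize_requests(sample))
```

The sample shows why the query reports both the average and the 95th percentile: a single slow request shifts the average noticeably while the percentile stays at the typical value.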
Alert Configuration
// Bicep - Create metric alert
// Assumes appServicePlan and actionGroup are resources declared elsewhere in the same file
resource alert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'High-CPU-Alert'
  location: 'global'
  properties: {
    description: 'Alert when CPU exceeds 80%'
    severity: 2
    enabled: true
    scopes: [appServicePlan.id]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HighCPU'
          criterionType: 'StaticThresholdCriterion'
          metricName: 'CpuPercentage'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
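The alert above checks every 5 minutes whether the average CPU over the last 15 minutes is greater than 80. The following Python sketch illustrates that windowed comparison in its simplest form. The `alert_fires` function is hypothetical, and the choice to stay silent when the window has no data is an assumption; Azure Monitor's actual evaluation is more involved.

```python
from datetime import datetime, timedelta


def alert_fires(samples: list[tuple[datetime, float]], now: datetime,
                window: timedelta = timedelta(minutes=15),
                threshold: float = 80.0) -> bool:
    """Return True when the average of samples in (now - window, now] exceeds threshold."""
    in_window = [value for taken_at, value in samples if now - window < taken_at <= now]
    if not in_window:
        return False  # assumption: no data in the window means no alert
    return sum(in_window) / len(in_window) > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    cpu = [
        (now - timedelta(minutes=10), 90.0),
        (now - timedelta(minutes=5), 85.0),
        (now, 70.0),
    ]
    print(alert_fires(cpu, now))
```

Averaging over a window that is longer than the evaluation interval smooths out short spikes, which helps keep the alert actionable rather than noisy.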
Automation
Azure Automation Runbooks
# PowerShell Runbook - Auto-restart unhealthy VMs
param(
    [Parameter(Mandatory=$true)]
    [string]$ResourceGroupName
)
# Connect using managed identity
Connect-AzAccount -Identity
# Get all VMs in resource group
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName
foreach ($vm in $vms) {
    # Get VM status
    $status = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name -Status
    $powerState = ($status.Statuses | Where-Object { $_.Code -like "PowerState/*" }).DisplayStatus
    if ($powerState -eq "VM running") {
        # Check if VM is responsive (custom health check; assumes the VM name resolves
        # and port 443 is reachable from where the runbook runs)
        $healthCheck = Test-NetConnection -ComputerName $vm.Name -Port 443 -WarningAction SilentlyContinue
        if (-not $healthCheck.TcpTestSucceeded) {
            Write-Output "Restarting unhealthy VM: $($vm.Name)"
            Restart-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name
        }
    }
}
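A runbook like the one above restarts a VM every time the check fails, which can turn into a restart loop if the VM needs longer to recover. One possible guard is a cooldown between restarts, sketched below in Python. The `should_restart` function, its parameters, and the 30-minute default are all assumptions for illustration; the runbook itself does not track the last restart time.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_restart(power_state: str, port_open: bool,
                   last_restart: Optional[datetime], now: datetime,
                   cooldown: timedelta = timedelta(minutes=30)) -> bool:
    """Decide whether an automated restart is appropriate.

    Restart only when the VM reports "VM running", the health port is closed,
    and the previous automated restart (if any) is older than the cooldown.
    """
    if power_state != "VM running" or port_open:
        return False
    if last_restart is not None and now - last_restart < cooldown:
        return False  # restarted recently; give the VM time to recover
    return True


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    print(should_restart("VM running", False, now - timedelta(minutes=10), now))
```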
Logic Apps for Automation
// Logic App - Auto-remediation workflow
{
  "definition": {
    "triggers": {
      "When_alert_is_triggered": {
        "type": "Request",
        "kind": "Http"
      }
    },
    "actions": {
      "Parse_alert": {
        "type": "ParseJson",
        "inputs": {
          "content": "@triggerBody()",
          "schema": { /* alert schema */ }
        }
      },
      "Check_alert_type": {
        "type": "Switch",
        "expression": "@body('Parse_alert')?['data']?['essentials']?['alertRule']",
        "cases": {
          "High_CPU": {
            "actions": {
              "Scale_up": {
                "type": "Http",
                "inputs": {
                  "method": "POST",
                  "uri": "https://management.azure.com/...",
                  "authentication": { "type": "ManagedServiceIdentity" }
                }
              }
            }
          }
        }
      },
      "Send_notification": {
        "type": "SendEmail",
        "inputs": {
          "to": "ops-team@company.com",
          "subject": "Auto-remediation executed",
          "body": "Action taken: @{body('Check_alert_type')}"
        }
      }
    }
  }
}
Incident Management
Incident Response Process
Postmortem Template
| Section | Content |
|---|---|
| Summary | Brief description of the incident |
| Impact | Users affected, duration, severity |
| Timeline | Chronological events |
| Root Cause | What caused the incident |
| Resolution | How it was fixed |
| Action Items | Preventive measures |
| Lessons Learned | What we learned |
Operational Excellence Checklist
DevOps Practices
- Use version control for all code and configuration
- Implement code review process
- Automate builds and tests
- Use feature flags for safe releases
- Practice trunk-based development
Infrastructure as Code
- All infrastructure defined in code
- IaC templates in version control
- Automated infrastructure testing
- Environment parity (dev = staging = prod)
- Modular, reusable templates
CI/CD
- Automated build pipeline
- Automated testing (unit, integration, e2e)
- Security scanning in pipeline
- Automated deployments
- Deployment approvals for production
Monitoring
- Application performance monitoring
- Infrastructure monitoring
- Log aggregation and analysis
- Distributed tracing
- Alerting and on-call rotation
Automation
- Automated scaling
- Automated remediation for common issues
- Automated backups
- Automated security patching
- Runbooks for manual procedures
Assessment Questions
| Area | Question |
|---|---|
| DevOps | Do dev and ops teams collaborate effectively? |
| IaC | Is all infrastructure defined as code? |
| CI/CD | Are deployments fully automated? |
| Testing | What percentage of code is covered by tests? |
| Monitoring | Can you detect issues before users report them? |
| Alerting | Are alerts actionable and not noisy? |
| Automation | What manual tasks could be automated? |
| Incidents | Do you conduct blameless postmortems? |
Key Takeaways
- Automate everything: Manual processes are error-prone and slow
- Infrastructure as Code: Treat infrastructure like application code
- Monitor proactively: Detect issues before users do
- Deploy safely: Use progressive deployment strategies
- Learn from failures: Blameless postmortems drive improvement