Operational Excellence Pillar

TL;DR

The Operational Excellence pillar focuses on operating and improving your workloads effectively. Key concepts:

  • DevOps culture: Collaboration between development and operations
  • Infrastructure as Code: Version-controlled, repeatable deployments
  • CI/CD pipelines: Automated build, test, and deployment
  • Monitoring & observability: Understand system behavior and health
  • Automation: Reduce manual effort and human error

Design Principles

Core Operational Excellence Principles

| Principle | Description | Implementation |
|---|---|---|
| Embrace DevOps | Break down silos | Shared ownership, collaboration |
| Use IaC | Treat infrastructure as software | Bicep, Terraform, ARM |
| Automate operations | Reduce manual intervention | Runbooks, auto-remediation |
| Monitor everything | Full visibility into systems | Metrics, logs, traces |
| Learn from failures | Continuous improvement | Blameless postmortems |

DevOps Lifecycle


Infrastructure as Code

IaC Benefits

| Benefit | Description |
|---|---|
| Version Control | Track changes, roll back if needed |
| Consistency | Same infrastructure every time |
| Automation | Deploy without manual steps |
| Documentation | Code is the documentation |
| Testing | Validate before deployment |
| Collaboration | Review changes via PRs |

IaC Tool Comparison

| Feature | Bicep | Terraform | ARM Templates |
|---|---|---|---|
| Syntax | Clean, concise | HCL | JSON (verbose) |
| Learning Curve | Low | Medium | High |
| Multi-cloud | Azure only | Yes | Azure only |
| State Management | Azure-managed | External state file | Azure-managed |
| Modularity | Modules | Modules | Linked templates |
| IDE Support | VS Code extension | VS Code extension | VS Code extension |

Bicep Example

// main.bicep - Deploy a Linux web app and a SQL logical server
@description('The Azure region for resources')
param location string = resourceGroup().location

@description('Environment name')
@allowed(['dev', 'staging', 'prod'])
param environment string = 'dev'

@description('SQL admin password')
@secure()
param sqlAdminPassword string

// Variables: uniqueString() keeps names stable per resource group
var appServicePlanName = 'asp-${environment}-${uniqueString(resourceGroup().id)}'
var webAppName = 'app-${environment}-${uniqueString(resourceGroup().id)}'
var sqlServerName = 'sql-${environment}-${uniqueString(resourceGroup().id)}'

// App Service Plan: Premium in prod, Basic elsewhere
resource appServicePlan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: appServicePlanName
  location: location
  sku: {
    name: environment == 'prod' ? 'P1v3' : 'B1'
    tier: environment == 'prod' ? 'PremiumV3' : 'Basic'
  }
  properties: {
    reserved: true // Linux
  }
}

// Web App with a system-assigned managed identity
resource webApp 'Microsoft.Web/sites@2022-09-01' = {
  name: webAppName
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      linuxFxVersion: 'DOTNETCORE|8.0'
      alwaysOn: environment == 'prod'
      healthCheckPath: '/health'
    }
  }
  identity: {
    type: 'SystemAssigned'
  }
}

// SQL Server (logical server only; no database is declared in this example)
resource sqlServer 'Microsoft.Sql/servers@2022-05-01-preview' = {
  name: sqlServerName
  location: location
  properties: {
    administratorLogin: 'sqladmin'
    administratorLoginPassword: sqlAdminPassword
    minimalTlsVersion: '1.2'
  }
}

// Outputs
output webAppUrl string = 'https://${webApp.properties.defaultHostName}'
output webAppIdentityId string = webApp.identity.principalId

Terraform Example

# main.tf - Deploy a Linux web app (no SQL resources are declared in this example)
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    # Declared explicitly because random_string is used below
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
  # Remote state stored in an Azure Storage account
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatestorage"
    container_name       = "tfstate"
    key                  = "webapp.tfstate"
  }
}

provider "azurerm" {
  features {}
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "location" {
  type    = string
  default = "eastus"
}

resource "azurerm_resource_group" "main" {
  name     = "rg-webapp-${var.environment}"
  location = var.location
}

resource "azurerm_service_plan" "main" {
  name                = "asp-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  os_type             = "Linux"
  sku_name            = var.environment == "prod" ? "P1v3" : "B1"
}

resource "azurerm_linux_web_app" "main" {
  name                = "app-${var.environment}-${random_string.suffix.result}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  service_plan_id     = azurerm_service_plan.main.id

  site_config {
    application_stack {
      dotnet_version = "8.0"
    }
    health_check_path = "/health"
  }

  identity {
    type = "SystemAssigned"
  }
}

# Random suffix keeps the web app name globally unique
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

output "webapp_url" {
  value = "https://${azurerm_linux_web_app.main.default_hostname}"
}
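
Both the Bicep and Terraform examples above choose the App Service plan SKU from the environment name. The following is a minimal Python sketch of that rule, useful as a quick pre-deployment check. The function name `plan_sku` and the constant `ALLOWED_ENVIRONMENTS` are hypothetical; the SKU values and allowed environments are taken from the Bicep example.

```python
# Hypothetical helper mirroring the environment-to-SKU rule in the templates above.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


def plan_sku(environment: str) -> dict:
    """Return the App Service plan SKU for an environment.

    Mirrors the templates: prod gets P1v3 (PremiumV3), everything else B1 (Basic).
    """
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if environment == "prod":
        return {"name": "P1v3", "tier": "PremiumV3"}
    return {"name": "B1", "tier": "Basic"}


if __name__ == "__main__":
    for env in ALLOWED_ENVIRONMENTS:
        print(env, plan_sku(env))
```

Rejecting unknown names up front plays the same role as the `@allowed` decorator in the Bicep template.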

CI/CD Pipelines

Pipeline Architecture

GitHub Actions Example

# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AZURE_WEBAPP_NAME: my-web-app
  DOTNET_VERSION: '8.0.x'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}

      - name: Restore dependencies
        run: dotnet restore

      - name: Build
        run: dotnet build --configuration Release --no-restore

      # --no-build reuses the Release output, so the configuration must match
      - name: Test
        run: dotnet test --configuration Release --no-build --verbosity normal --collect:"XPlat Code Coverage"

      - name: Publish
        run: dotnet publish -c Release -o ./publish

      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: webapp
          path: ./publish

  # Runs only for pushes to main; pull requests stop after the build job
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp

      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Deploy to Staging
        uses: azure/webapps-deploy@v3
        with:
          app-name: ${{ env.AZURE_WEBAPP_NAME }}-staging
          package: .

  # Skipped automatically when deploy-staging is skipped or fails
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: webapp

      - name: Login to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Deploy to Production
        uses: azure/webapps-deploy@v3
        with:
          app-name: ${{ env.AZURE_WEBAPP_NAME }}
          package: .

Azure DevOps Pipeline

# azure-pipelines.yml
trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  buildConfiguration: 'Release'
  azureSubscription: 'Azure-Connection'
  webAppName: 'my-web-app'

stages:
  - stage: Build
    jobs:
      - job: BuildJob
        steps:
          - task: UseDotNet@2
            inputs:
              version: '8.0.x'

          - task: DotNetCoreCLI@2
            displayName: 'Restore'
            inputs:
              command: 'restore'

          - task: DotNetCoreCLI@2
            displayName: 'Build'
            inputs:
              command: 'build'
              arguments: '--configuration $(buildConfiguration)'

          - task: DotNetCoreCLI@2
            displayName: 'Test'
            inputs:
              command: 'test'
              arguments: '--configuration $(buildConfiguration) --collect:"XPlat Code Coverage"'

          - task: DotNetCoreCLI@2
            displayName: 'Publish'
            inputs:
              command: 'publish'
              publishWebProjects: true
              arguments: '--configuration $(buildConfiguration) --output $(Build.ArtifactStagingDirectory)'

          - task: PublishBuildArtifacts@1
            inputs:
              pathToPublish: '$(Build.ArtifactStagingDirectory)'
              artifactName: 'webapp'

  - stage: DeployStaging
    dependsOn: Build
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployStaging
        environment: 'staging'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    appType: 'webAppLinux'
                    appName: '$(webAppName)-staging'
                    package: '$(Pipeline.Workspace)/webapp/**/*.zip'

  - stage: DeployProduction
    dependsOn: DeployStaging
    jobs:
      - deployment: DeployProduction
        environment: 'production'
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: '$(azureSubscription)'
                    appType: 'webAppLinux'
                    appName: '$(webAppName)'
                    package: '$(Pipeline.Workspace)/webapp/**/*.zip'
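
Both pipelines above apply the same gating: the build always runs, staging deploys only from `main`, and production deploys only after staging succeeds. The sketch below models that flow in Python so the rule can be read in one place. The function `plan_stages` and its parameters are hypothetical, and environment approvals are not modeled.

```python
from typing import List

MAIN_REF = "refs/heads/main"


def plan_stages(ref: str, build_ok: bool, staging_ok: bool = True) -> List[str]:
    """Return the stages that would run, in order, for one pipeline trigger."""
    stages = ["build"]
    if not build_ok:
        return stages  # a failed build blocks every deployment
    if ref != MAIN_REF:
        return stages  # pull requests and other branches are build-only
    stages.append("deploy-staging")
    if staging_ok:
        stages.append("deploy-production")  # production depends on staging
    return stages


if __name__ == "__main__":
    print(plan_stages("refs/pull/42/merge", build_ok=True))
    print(plan_stages(MAIN_REF, build_ok=True))
```

Keeping production dependent on staging means a single branch condition on the staging stage protects both environments.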

Safe Deployment Practices

Deployment Strategies

Strategy Comparison

| Strategy | Risk | Rollback Speed | Resource Cost |
|---|---|---|---|
| Blue-Green | Low | Instant | 2x during deploy |
| Canary | Very Low | Fast | Minimal |
| Rolling | Medium | Slow | Minimal |
| Recreate | High | Slow | Minimal |
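
To make the canary row concrete, here is a minimal Python sketch of a canary rollout loop: traffic shifts to the new version in steps, and the rollout stops at the first step whose error rate exceeds a limit. The function `canary_rollout`, the step weights, and the error-rate callback are all assumptions for illustration; a real rollout would read the error rate from monitoring.

```python
from typing import Callable, List, Sequence, Tuple


def canary_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, List[int]]:
    """Shift traffic to the new version one weight at a time.

    Returns ("promoted", weights) if every step stays within the limit, or
    ("rolled_back", weights) as soon as one step exceeds it.
    """
    applied: List[int] = []
    for weight in steps:
        applied.append(weight)
        if error_rate_at(weight) > max_error_rate:
            return "rolled_back", applied
    return "promoted", applied


if __name__ == "__main__":
    # Simulated error rate (percent) that degrades once 25% of traffic is shifted.
    degraded = lambda weight: 5.0 if weight >= 25 else 0.1
    print(canary_rollout([5, 25, 50, 100], degraded, max_error_rate=1.0))
```

Because only a small share of traffic sees the new version at first, a bad release is caught while few users are affected, which is why the table rates canary risk as very low.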

App Service Deployment Slots

# Create staging slot
az webapp deployment slot create \
  --name myWebApp \
  --resource-group myRG \
  --slot staging

# Deploy to staging
az webapp deployment source config-zip \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --src app.zip

# Swap staging to production
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --target-slot production

# Swap with preview (validate first)
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --action preview

# Complete or cancel swap
az webapp deployment slot swap \
  --name myWebApp \
  --resource-group myRG \
  --slot staging \
  --action swap # or 'reset' to cancel
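
The swap-with-preview sequence above is a two-step decision: start the preview, validate, then either complete or cancel. The Python sketch below builds the corresponding `az` argument lists from a health-check result. The function `swap_with_preview` is hypothetical and only assembles commands; it uses the same flags shown above and does not call Azure.

```python
from typing import List


def swap_with_preview(app: str, group: str, slot: str, healthy: bool) -> List[List[str]]:
    """Return the two az commands for a previewed swap.

    The first command starts the preview; the second completes the swap when
    the staging slot validated as healthy, or resets it otherwise.
    """
    base = [
        "az", "webapp", "deployment", "slot", "swap",
        "--name", app,
        "--resource-group", group,
        "--slot", slot,
    ]
    final_action = "swap" if healthy else "reset"
    return [base + ["--action", "preview"], base + ["--action", final_action]]


if __name__ == "__main__":
    for command in swap_with_preview("myWebApp", "myRG", "staging", healthy=False):
        print(" ".join(command))
```

In a script, each list could be passed to `subprocess.run(command, check=True)`, with the health check run between the two calls.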

Monitoring and Observability

Three Pillars of Observability

Azure Monitor Stack

| Service | Purpose | Data Type |
|---|---|---|
| Azure Monitor | Unified monitoring platform | All telemetry |
| Log Analytics | Log aggregation and querying | Logs |
| Application Insights | APM for applications | Traces, metrics |
| Metrics Explorer | Metric visualization | Metrics |
| Alerts | Proactive notifications | All |
| Workbooks | Custom dashboards | All |

Application Insights Setup

// Program.cs - Configure Application Insights
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.Extensibility;

var builder = WebApplication.CreateBuilder(args);

// Add Application Insights
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

// Add custom telemetry
builder.Services.AddSingleton<ITelemetryInitializer, CustomTelemetryInitializer>();

var app = builder.Build();

app.Run();

// Custom telemetry initializer: stamps every telemetry item with role and environment
public class CustomTelemetryInitializer : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        telemetry.Context.Cloud.RoleName = "OrderService";
        telemetry.Context.GlobalProperties["Environment"] = "Production";
    }
}

KQL Queries for Monitoring

// Request performance
requests
| where timestamp > ago(1h)
| summarize
    Count = count(),
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    FailureRate = countif(success == false) * 100.0 / count()
    by bin(timestamp, 5m)
| render timechart

// Dependency failures
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize Count = count() by target, type, resultCode
| order by Count desc

// Exception analysis
exceptions
| where timestamp > ago(24h)
| summarize Count = count() by type, outerMessage
| order by Count desc
| take 10

// Slow requests (duration is in milliseconds)
requests
| where timestamp > ago(1h)
| where duration > 1000
| project timestamp, name, duration, resultCode, customDimensions
| order by duration desc
| take 20
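
To show what the request-performance query computes for a single time bin, here is a small Python sketch over in-memory rows. The names `summarize_requests` and `nearest_rank_percentile` and the sample rows are hypothetical. The percentile uses the nearest-rank method as a stand-in: KQL's `percentile()` returns an estimate, so values may differ slightly on real data.

```python
import math
from typing import Dict, List


def nearest_rank_percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the rows."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(rows: List[Dict]) -> Dict[str, float]:
    """Compute the same four aggregates as the KQL summarize, for one bin.

    Assumes rows is non-empty, as every bin produced by the query is.
    """
    durations = [row["duration"] for row in rows]
    failures = sum(1 for row in rows if not row["success"])
    return {
        "Count": len(rows),
        "AvgDuration": sum(durations) / len(rows),
        "P95Duration": nearest_rank_percentile(durations, 95),
        "FailureRate": failures * 100.0 / len(rows),
    }


if __name__ == "__main__":
    sample = [
        {"duration": 100.0, "success": True},
        {"duration": 200.0, "success": True},
        {"duration": 300.0, "success": False},
        {"duration": 400.0, "success": True},
    ]
    print(summarize_requests(sample))
```

Multiplying by `100.0` before dividing keeps the failure rate a floating-point percentage, as in the query.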

Alert Configuration

// Bicep - Create metric alert
// Assumes appServicePlan and actionGroup are declared elsewhere in the same template
resource alert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'High-CPU-Alert'
  location: 'global'
  properties: {
    description: 'Alert when CPU exceeds 80%'
    severity: 2
    enabled: true
    scopes: [appServicePlan.id]
    evaluationFrequency: 'PT5M' // evaluate every 5 minutes
    windowSize: 'PT15M' // over the last 15 minutes
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HighCPU'
          criterionType: 'StaticThresholdCriterion'
          metricName: 'CpuPercentage'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroup.id
      }
    ]
  }
}
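
The alert above fires when the average CPU over a 15-minute window is greater than 80, checked every 5 minutes. The Python sketch below models one such evaluation so the effect of the window is visible. It is a simplification with hypothetical names (`window_average`, `alert_fires`) and per-minute samples; Azure Monitor's actual aggregation details are not reproduced here.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (minute timestamp, CPU percentage)


def window_average(samples: List[Sample], now: int, window: int) -> Optional[float]:
    """Average of the samples in the half-open window (now - window, now]."""
    in_window = [value for minute, value in samples if now - window < minute <= now]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


def alert_fires(samples: List[Sample], now: int, window: int = 15, threshold: float = 80.0) -> bool:
    """True when the window average is strictly greater than the threshold."""
    average = window_average(samples, now, window)
    return average is not None and average > threshold


if __name__ == "__main__":
    # A 5-minute spike to 90% followed by 10 minutes at 70%.
    spike = [(m, 90.0) for m in range(1, 6)] + [(m, 70.0) for m in range(6, 16)]
    print(alert_fires(spike, now=15))
```

Averaging over the full window is what keeps short spikes from paging anyone, at the cost of slower detection of sustained load.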

Automation

Azure Automation Runbooks

# PowerShell Runbook - Auto-restart unhealthy VMs
param(
    [Parameter(Mandatory=$true)]
    [string]$ResourceGroupName
)

# Connect using managed identity
Connect-AzAccount -Identity

# Get all VMs in resource group
$vms = Get-AzVM -ResourceGroupName $ResourceGroupName

foreach ($vm in $vms) {
    # Get VM status
    $status = Get-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name -Status
    $powerState = ($status.Statuses | Where-Object { $_.Code -like "PowerState/*" }).DisplayStatus

    if ($powerState -eq "VM running") {
        # Check if VM is responsive (custom health check)
        $healthCheck = Test-NetConnection -ComputerName $vm.Name -Port 443 -WarningAction SilentlyContinue

        if (-not $healthCheck.TcpTestSucceeded) {
            Write-Output "Restarting unhealthy VM: $($vm.Name)"
            Restart-AzVM -ResourceGroupName $ResourceGroupName -Name $vm.Name
        }
    }
}
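
The runbook's decision rule is small: restart a VM only if it reports as running and fails a TCP probe on port 443. The Python sketch below isolates that rule so it can be tested without touching Azure. The names `tcp_probe` and `vms_to_restart` are hypothetical, and the power-state dictionary stands in for the output of `Get-AzVM -Status`.

```python
import socket
from typing import Callable, Dict, List


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def vms_to_restart(
    power_states: Dict[str, str],
    probe: Callable[[str], bool] = tcp_probe,
) -> List[str]:
    """Select VMs that are running but do not answer the health probe."""
    return [
        name
        for name, state in power_states.items()
        if state == "VM running" and not probe(name)
    ]


if __name__ == "__main__":
    states = {"vm-a": "VM running", "vm-b": "VM running", "vm-c": "VM deallocated"}
    # Inject a fake probe so the example runs offline: only vm-b is unresponsive.
    print(vms_to_restart(states, probe=lambda name: name != "vm-b"))
```

Passing the probe in as a parameter keeps the selection logic testable and makes it easy to swap the TCP check for an HTTP health endpoint.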

Logic Apps for Automation

// Logic App - Auto-remediation workflow
{
  "definition": {
    "triggers": {
      "When_alert_is_triggered": {
        "type": "Request",
        "kind": "Http"
      }
    },
    "actions": {
      "Parse_alert": {
        "type": "ParseJson",
        "inputs": {
          "content": "@triggerBody()",
          "schema": { /* alert schema */ }
        }
      },
      "Check_alert_type": {
        "type": "Switch",
        "expression": "@body('Parse_alert')?['data']?['essentials']?['alertRule']",
        "cases": {
          "High_CPU": {
            "actions": {
              "Scale_up": {
                "type": "Http",
                "inputs": {
                  "method": "POST",
                  "uri": "https://management.azure.com/...",
                  "authentication": { "type": "ManagedServiceIdentity" }
                }
              }
            }
          }
        }
      },
      "Send_notification": {
        "type": "SendEmail",
        "inputs": {
          "to": "ops-team@company.com",
          "subject": "Auto-remediation executed",
          "body": "Action taken: @{body('Check_alert_type')}"
        }
      }
    }
  }
}
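
The core of the workflow above is a switch on the alert rule name that selects a remediation. A minimal Python sketch of the same dispatch follows. The names `handle_alert`, `scale_up`, and `REMEDIATIONS` are hypothetical, and `scale_up` is a placeholder that returns a message instead of calling the Azure management API.

```python
from typing import Callable, Dict, Optional


def scale_up(alert: dict) -> str:
    """Placeholder remediation; a real workflow would call the management API here."""
    return "scale_up requested"


# Maps alert rule names to remediation handlers, like the cases of the Switch action.
REMEDIATIONS: Dict[str, Callable[[dict], str]] = {
    "High_CPU": scale_up,
}


def handle_alert(alert: dict) -> Optional[str]:
    """Look up the alert rule name and run the matching remediation, if any."""
    essentials = (alert.get("data") or {}).get("essentials") or {}
    rule = essentials.get("alertRule")
    action = REMEDIATIONS.get(rule)
    return action(alert) if action else None


if __name__ == "__main__":
    payload = {"data": {"essentials": {"alertRule": "High_CPU"}}}
    print(handle_alert(payload))
```

Returning `None` for unknown rules keeps unrecognized alerts from triggering any action, so they can fall through to a human.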

Incident Management

Incident Response Process

Postmortem Template

| Section | Content |
|---|---|
| Summary | Brief description of the incident |
| Impact | Users affected, duration, severity |
| Timeline | Chronological events |
| Root Cause | What caused the incident |
| Resolution | How it was fixed |
| Action Items | Preventive measures |
| Lessons Learned | What we learned |

Operational Excellence Checklist

DevOps Practices

  • Use version control for all code and configuration
  • Implement code review process
  • Automate builds and tests
  • Use feature flags for safe releases
  • Practice trunk-based development

Infrastructure as Code

  • All infrastructure defined in code
  • IaC templates in version control
  • Automated infrastructure testing
  • Environment parity (dev = staging = prod)
  • Modular, reusable templates

CI/CD

  • Automated build pipeline
  • Automated testing (unit, integration, e2e)
  • Security scanning in pipeline
  • Automated deployments
  • Deployment approvals for production

Monitoring

  • Application performance monitoring
  • Infrastructure monitoring
  • Log aggregation and analysis
  • Distributed tracing
  • Alerting and on-call rotation

Automation

  • Automated scaling
  • Automated remediation for common issues
  • Automated backups
  • Automated security patching
  • Runbooks for manual procedures

Assessment Questions

| Area | Question |
|---|---|
| DevOps | Do dev and ops teams collaborate effectively? |
| IaC | Is all infrastructure defined as code? |
| CI/CD | Are deployments fully automated? |
| Testing | What percentage of code is covered by tests? |
| Monitoring | Can you detect issues before users report them? |
| Alerting | Are alerts actionable and not noisy? |
| Automation | What manual tasks could be automated? |
| Incidents | Do you conduct blameless postmortems? |

Key Takeaways

  1. Automate everything: Manual processes are error-prone and slow
  2. Infrastructure as Code: Treat infrastructure like application code
  3. Monitor proactively: Detect issues before users do
  4. Deploy safely: Use progressive deployment strategies
  5. Learn from failures: Blameless postmortems drive improvement

Resources