
Case Study: Platform Migration

This case study walks through a Staff+ engineer's approach to leading a major platform migration from a monolithic architecture to microservices.

Context

The Situation

**Company:** E-commerce platform, $500M ARR
**Team Size:** 80 engineers across 12 teams
**Current State:**
- 8-year-old .NET Framework monolith
- 2M lines of code
- SQL Server database (5TB)
- 6-week release cycles
- 99.5% availability (not meeting 99.9% SLA)
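
To put the availability gap in concrete terms, the percentages can be converted into a downtime budget. The sketch below is illustrative only: the `AvailabilityBudget` class and its method name are hypothetical and not part of the case study's codebase.

```csharp
using System;

public static class AvailabilityBudget
{
    // Returns how much downtime a given availability percentage
    // allows over the supplied period.
    public static TimeSpan AllowedDowntime(double availabilityPercent, TimeSpan period)
    {
        if (availabilityPercent < 0 || availabilityPercent > 100)
        {
            throw new ArgumentOutOfRangeException(nameof(availabilityPercent));
        }

        var unavailableFraction = (100.0 - availabilityPercent) / 100.0;
        return TimeSpan.FromHours(period.TotalHours * unavailableFraction);
    }
}
```

Over a 365-day year, 99.5% allows about 43.8 hours of downtime, while the 99.9% SLA allows about 8.76 hours, so the platform is running at roughly five times its downtime budget.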

**Business Pressure:**
- Competitors releasing features weekly
- Global expansion blocked by single-region architecture
- Black Friday performance concerns
- Developer satisfaction declining

The Challenge

You've been asked to lead the platform modernization. The CEO wants "microservices" but the CTO wants a realistic plan that doesn't halt feature development.

Phase 1: Assessment (Weeks 1-4)

Stakeholder Interviews

## Interview Findings

### Engineering Teams
- "Deployments are terrifying - we break things constantly"
- "I can't understand how my change affects other parts"
- "Tests take 4 hours to run"
- "The database is a black box of stored procedures"

### Product
- "We can't experiment - everything takes too long"
- "Competitors ship features we've been planning for months"
- "We need to launch in Europe but can't"

### Operations
- "Black Friday is a nightmare"
- "We can't scale individual components"
- "Debugging production issues takes hours"

### Business
- "Time-to-market is killing us"
- "We're losing deals due to availability concerns"
- "Infrastructure costs are growing faster than revenue"

Technical Assessment

## Monolith Analysis

### Code Structure
- 15 "modules" but highly coupled
- Shared database with 800+ tables
- 50+ stored procedures with business logic
- No clear domain boundaries

### Hotspots (by change frequency + bug rate)
1. Order Processing - 40% of bugs
2. Inventory Management - 25% of bugs
3. Payment Integration - 15% of bugs
4. User Management - 10% of bugs
5. Other - 10%
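
A ranking like the one above can be produced by combining change frequency and bug counts per module. The following is a minimal sketch under stated assumptions: the `ModuleStats` record, the `HotspotRanking` class, and the scoring formula (share of changes plus share of bugs) are hypothetical, and the case study does not say how its ranking was actually computed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Per-module counts, e.g. taken from version control and the bug tracker.
public record ModuleStats(string Name, int Changes, int Bugs);

public static class HotspotRanking
{
    // Score = share of all changes + share of all bugs, so it lies in [0, 2].
    // Modules are returned from highest to lowest score.
    public static IReadOnlyList<(string Name, double Score)> Rank(
        IEnumerable<ModuleStats> modules)
    {
        var list = modules.ToList();
        double totalChanges = list.Sum(m => m.Changes);
        double totalBugs = list.Sum(m => m.Bugs);

        return list
            .Select(m => (m.Name, Score:
                (totalChanges > 0 ? m.Changes / totalChanges : 0) +
                (totalBugs > 0 ? m.Bugs / totalBugs : 0)))
            .OrderByDescending(x => x.Score)
            .ToList();
    }
}
```

Weighting both signals equally is a design choice. A team might reasonably weight bugs more heavily, or normalize by module size.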

### Dependencies
- 200+ NuGet packages (many outdated)
- 15 external integrations
- Custom ORM (undocumented)
- Legacy authentication system

### Database Analysis
- 800 tables, 500 views, 200 stored procedures
- Heavy use of triggers (150+)
- No foreign keys in many places
- 50GB of unused/legacy tables

Domain Analysis

Phase 2: Strategy Development (Weeks 5-8)

Options Analysis

## Migration Options

### Option 1: Big Bang Rewrite
- **Approach:** Build new system, switch over
- **Timeline:** 18-24 months
- **Risk:** Extremely high
- **Recommendation:** ❌ Reject

### Option 2: Strangler Fig
- **Approach:** Gradually replace pieces
- **Timeline:** 24-36 months
- **Risk:** Medium
- **Recommendation:** ✅ Recommended

### Option 3: Modular Monolith First
- **Approach:** Restructure monolith, then extract
- **Timeline:** 30-42 months
- **Risk:** Low-Medium
- **Recommendation:** ✅ Consider as hybrid

## Recommended Approach: Hybrid Strangler + Modular

1. Establish event backbone alongside monolith
2. Modularize monolith around domain boundaries
3. Extract services starting with lowest-risk domains
4. Migrate traffic gradually via feature flags

Migration Roadmap

## 3-Year Roadmap

### Year 1: Foundation
**Q1:** Platform foundation
- Event streaming infrastructure (Kafka)
- API Gateway deployment
- Observability platform (OpenTelemetry)
- CI/CD modernization

**Q2:** First extraction
- Notification Service (low risk, high value)
- Event publishing from monolith
- Dual-write pattern established

**Q3-Q4:** Core services
- Product Catalog Service
- Search Service (Elasticsearch)
- Monolith modularization begins

### Year 2: Acceleration
**Q1-Q2:** Order domain
- Order Service (event-sourced)
- Inventory Service
- Saga orchestration

**Q3-Q4:** Customer domain
- Customer Service
- Authentication modernization
- Payment Service extraction

### Year 3: Completion
**Q1-Q2:** Remaining services
- Shipping integration
- Analytics platform
- Legacy decommissioning begins

**Q3-Q4:** Optimization
- Multi-region deployment
- Performance optimization
- Monolith sunset
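
The roadmap names saga orchestration for the Order domain. As a rough illustration of the idea, the sketch below runs a sequence of steps and, if one fails, undoes the completed steps in reverse order. The `SagaStep` and `SagaOrchestrator` names are hypothetical, and a production orchestrator would also need to persist saga state and retry failed compensations, which this sketch does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// One saga step: an action plus the action that undoes it.
public record SagaStep(string Name, Func<Task> Execute, Func<Task> Compensate);

public static class SagaOrchestrator
{
    // Runs the steps in order. If a step throws, the steps that already
    // completed are compensated in reverse order and false is returned.
    public static async Task<bool> RunAsync(IEnumerable<SagaStep> steps)
    {
        var completed = new Stack<SagaStep>();

        foreach (var step in steps)
        {
            try
            {
                await step.Execute();
                completed.Push(step);
            }
            catch (Exception)
            {
                while (completed.Count > 0)
                {
                    await completed.Pop().Compensate();
                }
                return false;
            }
        }

        return true;
    }
}
```

For an order, the steps might be reserving inventory, charging the payment, and creating the shipment. If the payment is declined, only the inventory reservation is released.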

Phase 3: Execution

Team Structure

## Migration Team Structure

### Platform Team (6 engineers)
- Event infrastructure
- API Gateway
- Shared libraries
- Developer experience

### Migration Strike Team (4 engineers)
- Service extraction execution
- Data migration
- Integration testing
- Rollback procedures

### Domain Teams (existing)
- Own their extracted services
- Continue feature development
- Gradual skill building

Strangler Implementation

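// API Gateway routing during migration.
// IFeatureFlagService and the GetUserId() extension are application types not shown here.
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http; // assumed source of HttpRequest

public class StranglerRouter
{
    private readonly IFeatureFlagService _flags;

    public StranglerRouter(IFeatureFlagService flags) => _flags = flags;

    public async Task<HttpResponseMessage> RouteAsync(
        HttpRequest request,
        string feature)
    {
        var percentage = await _flags.GetRolloutPercentageAsync(
            $"route-to-new-{feature}");

        var useNewService = ShouldRouteToNew(request, percentage);

        if (useNewService)
        {
            // Route to new microservice
            return await RouteToNewServiceAsync(request, feature);
        }
        else
        {
            // Route to monolith
            return await RouteToMonolithAsync(request, feature);
        }
    }

    private static bool ShouldRouteToNew(HttpRequest request, int percentage)
    {
        // Consistent routing based on user ID (assumed to be a non-null string).
        // string.GetHashCode() is randomized per process on .NET Core and later
        // and can be negative, so the bucket comes from a stable hash instead.
        var userId = request.GetUserId();
        using var sha = SHA256.Create();
        var digest = sha.ComputeHash(Encoding.UTF8.GetBytes(userId));
        var bucket = (int)(BitConverter.ToUInt32(digest, 0) % 100);
        return bucket < percentage;
    }

    // RouteToNewServiceAsync and RouteToMonolithAsync forward the request
    // to the respective backend and are omitted here.
}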
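
Percentage-based routing only works if a user lands in the same bucket on every request and on every gateway instance. `string.GetHashCode()` does not give that guarantee on .NET Core and later, because it is randomized per process, and its result can be negative, which makes a `% 100` bucket negative as well. The sketch below shows one way to derive a stable bucket, assuming user IDs are strings. The `RolloutBucket` class is a hypothetical helper, not part of the case study's code.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class RolloutBucket
{
    // Maps a user ID to a bucket in [0, 99] that is identical
    // across processes, machines, and restarts.
    public static int For(string userId)
    {
        using var sha = SHA256.Create();
        var digest = sha.ComputeHash(Encoding.UTF8.GetBytes(userId));
        return (int)(BitConverter.ToUInt32(digest, 0) % 100);
    }

    // A user is in the rollout when their bucket is below the percentage,
    // so raising the percentage only ever adds users.
    public static bool IsInRollout(string userId, int percentage) =>
        For(userId) < percentage;
}
```

Because the bucket is fixed per user, anyone routed to the new service at 10% stays on it at 50%, which avoids users flipping between backends as the rollout widens.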

Data Migration Pattern

// Dual-write during migration
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public class DualWriteOrderRepository : IOrderRepository
{
    private readonly IMonolithOrderRepository _legacy;
    private readonly IOrderServiceClient _newService;
    private readonly IFeatureFlagService _flags;
    private readonly ILogger<DualWriteOrderRepository> _logger;
    private readonly IMigrationMetrics _metrics; // metrics abstraction; type name assumed

    // Constructor injection and the MapToDto helper are omitted for brevity.

    public async Task<Order> CreateAsync(Order order)
    {
        // Always write to legacy (source of truth during migration)
        var legacyOrder = await _legacy.CreateAsync(order);

        if (await _flags.IsEnabledAsync("dual-write-orders"))
        {
            try
            {
                // Also write to new service
                await _newService.CreateOrderAsync(MapToDto(legacyOrder));
            }
            catch (Exception ex)
            {
                // Log but don't fail - new service is secondary.
                // The reconciliation job surfaces the missing write later.
                _logger.LogWarning(ex,
                    "Dual-write to new Order service failed");
                _metrics.IncrementDualWriteFailure();
            }
        }

        return legacyOrder;
    }
}

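// Data reconciliation job.
// BackgroundService supplies the ExecuteAsync hook; IHostedService itself
// only defines StartAsync and StopAsync.
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public class OrderReconciliationJob : BackgroundService
{
    // Field types are assumed to match the dual-write repository.
    private readonly IMonolithOrderRepository _legacy;
    private readonly IOrderServiceClient _newService;
    private readonly ILogger<OrderReconciliationJob> _logger;

    // Constructor injection, FindDiscrepancies, and HandleDiscrepancyAsync
    // are omitted for brevity.

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var legacyOrders = await _legacy.GetRecentOrdersAsync(
                TimeSpan.FromHours(1));
            var newOrders = await _newService.GetRecentOrdersAsync(
                TimeSpan.FromHours(1));

            var discrepancies = FindDiscrepancies(legacyOrders, newOrders);

            foreach (var discrepancy in discrepancies)
            {
                _logger.LogWarning(
                    "Order discrepancy: {OrderId}, Legacy: {Legacy}, New: {New}",
                    discrepancy.OrderId,
                    discrepancy.LegacyState,
                    discrepancy.NewState);

                // Auto-fix or alert based on severity
                await HandleDiscrepancyAsync(discrepancy);
            }

            // Wait five minutes between passes; cancellation ends the wait early.
            await Task.Delay(TimeSpan.FromMinutes(5), ct);
        }
    }
}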
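
The reconciliation job relies on a `FindDiscrepancies` step that the case study does not show. One plausible shape for it is sketched below, assuming orders can be reduced to an ID and a status string. The `OrderSnapshot`, `Discrepancy`, and `OrderComparer` types are hypothetical stand-ins, with property names chosen to match the log message above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public record OrderSnapshot(string OrderId, string Status);

// A null state means the order is missing on that side.
public record Discrepancy(string OrderId, string? LegacyState, string? NewState);

public static class OrderComparer
{
    // Compares both sides by order ID and reports orders that are missing
    // on one side or whose status differs. Assumes IDs are unique per side
    // (ToDictionary throws on duplicates).
    public static IReadOnlyList<Discrepancy> FindDiscrepancies(
        IEnumerable<OrderSnapshot> legacyOrders,
        IEnumerable<OrderSnapshot> newOrders)
    {
        var legacyById = legacyOrders.ToDictionary(o => o.OrderId, o => o.Status);
        var newById = newOrders.ToDictionary(o => o.OrderId, o => o.Status);

        var result = new List<Discrepancy>();
        var allIds = legacyById.Keys.Union(newById.Keys)
            .OrderBy(id => id, StringComparer.Ordinal);

        foreach (var id in allIds)
        {
            legacyById.TryGetValue(id, out var legacyStatus);
            newById.TryGetValue(id, out var newStatus);

            if (legacyStatus != newStatus)
            {
                result.Add(new Discrepancy(id, legacyStatus, newStatus));
            }
        }

        return result;
    }
}
```

Orders created near the edge of the one-hour window may appear on only one side for a short time, so a real job would likely tolerate very recent orders before alerting.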

Phase 4: Challenges & Solutions

Challenge 1: Team Resistance

## Problem
Senior engineers comfortable with the monolith resist change.
"We've tried this before and failed."

## Solution
1. **Acknowledge history:** "You're right, past attempts failed. Here's what's different this time..."
2. **Involve skeptics:** Made the loudest critic the tech lead for the first service extraction
3. **Quick wins:** Notification service extracted in 6 weeks, deployed 10x faster
4. **Celebrate success:** Public recognition, team demos

Challenge 2: Feature Freeze Pressure

## Problem
Product wants to halt migration for "critical" features.
"We'll do migration after this release."

## Solution
1. **Parallel tracks:** Migration team doesn't block feature teams
2. **Migration enables features:** "Multi-region requires this work"
3. **Incremental value:** Each phase delivers measurable improvement
4. **Executive alignment:** Monthly steering committee with metrics

Challenge 3: Data Consistency

## Problem
During dual-write period, data gets out of sync.
Customer sees different order status in different places.

## Solution
1. **Single source of truth:** Legacy DB remains authoritative until cutover
2. **Reconciliation jobs:** Automated sync every 5 minutes
3. **Alerts:** Discrepancy alerts to on-call
4. **Gradual cutover:** 1% → 10% → 50% → 100% traffic migration
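
The gradual cutover can be tied to the reconciliation results so that traffic only advances while the two systems agree. The sketch below is one possible gate, not the case study's actual procedure: the `CutoverPlan` class, the mismatch-rate threshold, and the choice to drop back to 0% on a breach are all assumptions made for illustration.

```csharp
using System;

public static class CutoverPlan
{
    // Rollout stages from the case study: 1% -> 10% -> 50% -> 100%.
    private static readonly int[] Stages = { 1, 10, 50, 100 };

    // Decides the next traffic percentage for the new service:
    // - no compared orders yet: hold at the current percentage
    // - mismatch rate above the threshold: send all traffic back to the monolith
    // - otherwise: advance to the next stage (100% stays at 100%)
    public static int NextPercentage(
        int currentPercentage,
        long comparedOrders,
        long mismatchedOrders,
        double maxMismatchRate)
    {
        if (comparedOrders <= 0)
        {
            return currentPercentage;
        }

        var mismatchRate = (double)mismatchedOrders / comparedOrders;
        if (mismatchRate > maxMismatchRate)
        {
            return 0;
        }

        foreach (var stage in Stages)
        {
            if (stage > currentPercentage)
            {
                return stage;
            }
        }

        return 100;
    }
}
```

For example, with a hypothetical threshold of 0.1% (`0.001`), 2 mismatches in 10,000 compared orders at the 1% stage would allow a move to 10%, while 50 mismatches in 1,000 orders would roll traffic back.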

Results

Metrics After 18 Months

## Migration Progress

### Services Extracted: 8/12
- ✅ Notifications
- ✅ Product Catalog
- ✅ Search
- ✅ Orders
- ✅ Inventory
- ✅ Customer
- ✅ Payments
- ✅ Shipping
- 🔄 Analytics (in progress)
- ⏳ Reporting
- ⏳ Admin
- ⏳ Legacy integrations

### Performance Improvements
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Deploy frequency | Weekly | Daily | 7x |
| Lead time | 6 weeks | 3 days | -93% |
| Availability | 99.5% | 99.95% | +0.45 pp |
| P99 latency | 800ms | 150ms | -81% |
| MTTR | 4 hours | 15 min | -94% |
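
The relative changes in the table follow from the before and after values once both are expressed in the same unit. A small helper makes the arithmetic explicit. The `MetricChange` class is a hypothetical name used only for this illustration.

```csharp
using System;

public static class MetricChange
{
    // Relative change from a baseline, in percent.
    // Negative values mean the metric went down.
    public static double PercentChange(double before, double after)
    {
        if (before == 0)
        {
            throw new ArgumentException("Baseline must be non-zero.", nameof(before));
        }

        return (after - before) / before * 100.0;
    }
}
```

Lead time goes from 42 days to 3 days (about -93%), P99 latency from 800 ms to 150 ms (-81.25%), and MTTR from 240 minutes to 15 minutes (-93.75%). The availability gain is 0.45 percentage points, and the same change cuts unavailability from 0.5% to 0.05%, a 90% reduction in downtime.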

### Business Impact
- Time-to-market: 6 weeks → 1 week
- European launch: Completed (was blocked)
- Black Friday: Zero incidents (vs. 3 previous year)
- Developer satisfaction: 6/10 → 8/10

Key Learnings

## What Worked
1. **Incremental approach:** No big bang, continuous value delivery
2. **Platform team:** Dedicated team for shared infrastructure
3. **Feature flags:** Gradual rollout reduced risk
4. **Metrics focus:** Data-driven decisions, not opinions

## What I'd Do Differently
1. **Start observability earlier:** Should have been week 1
2. **More automation:** Manual reconciliation was painful
3. **Team training:** Underestimated learning curve
4. **Documentation:** Kept falling behind

## Advice for Others
1. Get executive sponsorship before starting
2. Find your skeptics and convert them to champions
3. Measure everything - you'll need the data
4. Celebrate small wins loudly
5. Plan for 2x the time you estimate

💡 Flashcard

What is the Strangler Fig pattern for migration?

✅ Answer

Gradually replace parts of a legacy system by routing traffic to new services while keeping the old system running. New functionality goes to new services; existing functionality is migrated piece by piece until the legacy system can be decommissioned.

💡 Flashcard

Why is a dual-write pattern used during migrations?

✅ Answer

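Dual-write keeps the new system populated and comparable during a migration by writing every change to both the old and new systems. It does not guarantee consistency on its own, because the two writes can fail independently, so it is paired with reconciliation. The legacy system remains the source of truth until the new system is proven reliable, then traffic is gradually shifted.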
