Cloud Provider Migration
The Interview Question
"Our company wants to move from AWS to GCP (or Azure). We have 50TB of data in S3/RDS, 200 microservices, and can't afford more than 15 minutes of downtime. How do you approach this?"
Asked at: Google, Microsoft, large enterprises
Time to solve: 40-45 minutes
Difficulty: ⭐⭐⭐⭐⭐ (Principal/Staff)
Clarifying Questions to Ask
- "Why are we migrating?" → Cost, features, vendor lock-in?
- "What's the timeline?" → 3 months vs 18 months changes everything
- "Is partial multi-cloud acceptable?" → Affects phasing strategy
- "What's our busiest period?" → Avoid Black Friday migrations
- "Are there compliance requirements?" → Data residency, etc.
Solution: Phased Migration Strategy
Phase 0: Assessment & Abstraction (Months 1-2)
Create abstraction layers to avoid rewriting twice:
```python
# BEFORE: Direct AWS SDK usage everywhere
import boto3

def upload_file(bucket, key, data):
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket, Key=key, Body=data)
```

```python
# AFTER: Abstraction layer
from storage import StorageClient

def upload_file(bucket, key, data):
    storage = StorageClient.create()  # Returns S3 or GCS client based on config
    storage.put(bucket, key, data)
```
Storage abstraction:
```python
# storage/client.py
from abc import ABC, abstractmethod
import os

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...

    @abstractmethod
    def get(self, bucket, key): ...

    @staticmethod
    def create():
        provider = os.getenv("CLOUD_PROVIDER", "aws")
        if provider == "aws":
            return S3Client()
        elif provider == "gcp":
            return GCSClient()
        raise ValueError(f"Unknown CLOUD_PROVIDER: {provider}")

class S3Client(StorageClient):
    def __init__(self):
        import boto3
        self.client = boto3.client('s3')

    def put(self, bucket, key, data):
        self.client.put_object(Bucket=bucket, Key=key, Body=data)

    def get(self, bucket, key):
        return self.client.get_object(Bucket=bucket, Key=key)["Body"].read()

class GCSClient(StorageClient):
    def __init__(self):
        from google.cloud import storage
        self.client = storage.Client()

    def put(self, bucket, key, data):
        blob = self.client.bucket(bucket).blob(key)
        blob.upload_from_string(data)

    def get(self, bucket, key):
        blob = self.client.bucket(bucket).blob(key)
        return blob.download_as_bytes()
```
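A side benefit worth mentioning in the interview: once the interface exists, tests and local development can run against an in-memory fake, so application code never touches a real cloud SDK. A minimal sketch (the `MemoryClient` name is illustrative, not part of the migration plan):

```python
from abc import ABC, abstractmethod

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...
    @abstractmethod
    def get(self, bucket, key): ...

# In-memory fake: satisfies the same interface, so code written against
# StorageClient runs unchanged in tests, on AWS, or on GCP.
class MemoryClient(StorageClient):
    def __init__(self):
        self._objects = {}
    def put(self, bucket, key, data):
        self._objects[(bucket, key)] = data
    def get(self, bucket, key):
        return self._objects[(bucket, key)]

storage = MemoryClient()
storage.put("my-bucket", "users/42.json", b'{"id": 42}')
print(storage.get("my-bucket", "users/42.json"))  # b'{"id": 42}'
```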
Phase 1: Data Sync Setup (Month 2)
Continuous data replication to GCP while still running on AWS:
```yaml
# Database sync using AWS DMS + GCP Database Migration Service,
# or pglogical / Debezium for real-time CDC
sync_strategy:
  database:
    type: continuous_replication
    source: aws_rds_postgres
    target: gcp_cloudsql_postgres
    method: pglogical          # Real-time logical replication
    lag_threshold: 30s
  blob_storage:
    type: continuous_sync
    tool: rclone               # or gsutil rsync
    schedule: every_5_minutes
    direction: aws_s3 -> gcp_gcs
```
rclone sync script:
```bash
#!/bin/bash
# sync_s3_to_gcs.sh — continuous sync from S3 to GCS
rclone sync \
  s3:my-bucket \
  gcs:my-bucket-gcp \
  --transfers 32 \
  --checkers 16 \
  --fast-list \
  --progress \
  --log-file /var/log/s3-gcs-sync.log

# Quick sanity check: object counts should match
aws s3 ls s3://my-bucket --recursive | wc -l
gsutil ls -r gs://my-bucket-gcp/** | wc -l
```
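Object counts can match while individual objects differ, so a stronger check compares per-key sizes (or checksums) from both sides. A self-contained sketch of the comparison logic; in practice the manifests would be built from the list APIs or the `ls` output above:

```python
def diff_manifests(src, dst):
    """Compare two {key: size} manifests and report what is out of sync."""
    missing = sorted(k for k in src if k not in dst)
    size_mismatch = sorted(k for k in src if k in dst and src[k] != dst[k])
    extra = sorted(k for k in dst if k not in src)
    return {"missing": missing, "size_mismatch": size_mismatch, "extra": extra}

# Toy manifests standing in for S3 (src) and GCS (dst) listings.
src = {"a.txt": 10, "b.txt": 20, "c.txt": 30}
dst = {"a.txt": 10, "b.txt": 99}
print(diff_manifests(src, dst))
# {'missing': ['c.txt'], 'size_mismatch': ['b.txt'], 'extra': []}
```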
Phase 2: Infrastructure Parity (Months 2-3)
Deploy services to BOTH clouds:
```yaml
# kubernetes/deployment.yaml — works on both EKS and GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: gcr.io/myproject/user-service:v1.2.3  # Use GCR
          env:
            - name: CLOUD_PROVIDER
              valueFrom:
                configMapKeyRef:
                  name: cloud-config
                  key: provider
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
```
Terraform for multi-cloud:
```hcl
# terraform/main.tf
variable "cloud_provider" {
  type = string
  validation {
    condition     = contains(["aws", "gcp"], var.cloud_provider)
    error_message = "Must be aws or gcp."
  }
}

# Terraform requires module `source` to be a literal string, so select
# the provider-specific module with `count` instead of a conditional source.
module "eks" {
  source = "./modules/eks"
  count  = var.cloud_provider == "aws" ? 1 : 0
  # ... common variables
}

module "gke" {
  source = "./modules/gke"
  count  = var.cloud_provider == "gcp" ? 1 : 0
  # ... common variables
}

module "rds" {
  source = "./modules/rds"
  count  = var.cloud_provider == "aws" ? 1 : 0
}

module "cloudsql" {
  source = "./modules/cloudsql"
  count  = var.cloud_provider == "gcp" ? 1 : 0
}
```
Phase 3: Traffic Splitting (Month 4)
Use a global load balancer for a gradual traffic shift. Implementation with Cloud DNS weighted routing:
```hcl
# GCP Cloud DNS weighted (wrr) routing
resource "google_dns_record_set" "weighted" {
  managed_zone = google_dns_managed_zone.prod.name
  name         = "api.example.com."
  type         = "A"
  ttl          = 60  # Low TTL so weight changes take effect quickly

  routing_policy {
    wrr {
      weight = 90
      # An ALB exposes a hostname, not a static IP; for an A record the
      # AWS leg needs static IPs (e.g. AWS Global Accelerator in front
      # of the ALB), supplied here via a variable.
      rrdatas = [var.aws_static_ip]
    }
    wrr {
      weight  = 10
      rrdatas = [google_compute_global_address.main.address]  # GCP
    }
  }
}
```
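In practice the weight change is scripted as a stepped ramp with a health gate between steps, not a single jump. A minimal sketch of that control loop (the step values and the `apply_weights` / `healthy` callbacks are illustrative, not from the plan above):

```python
def ramp_plan(steps=(10, 25, 50, 100)):
    """Yield (aws_weight, gcp_weight) pairs for a stepped shift to GCP."""
    for gcp in steps:
        yield (100 - gcp, gcp)

def shift_traffic(apply_weights, healthy, steps=(10, 25, 50, 100)):
    """Apply each step; fall back to the last known-good split on failure.

    apply_weights(aws, gcp) would update the DNS wrr weights;
    healthy() would check GCP-side error rates and latency.
    """
    applied = (100, 0)
    for aws, gcp in ramp_plan(steps):
        apply_weights(aws, gcp)
        if not healthy():
            apply_weights(*applied)  # revert to last safe split
            return applied
        applied = (aws, gcp)
    return applied

# Example with stub callbacks: everything healthy, ramp completes.
history = []
result = shift_traffic(lambda a, g: history.append((a, g)), lambda: True)
print(result)  # (0, 100)
```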
Phase 4: The Cutover (Month 5)
The 15-minute window:
T-24h: Take final full backups of both systems
T-1h: Shift all read traffic to GCP (AWS DNS weight → 0)
T-0: Enter maintenance mode; stop writes to AWS
T+1m: Final CDC sync; verify replication lag = 0
T+2m: Promote the GCP database to primary
T+3m: Point all services at the GCP database
T+5m: Run the automated verification suite
T+10m: Lift maintenance mode; production traffic flows on GCP
T+15m: Cutover complete ✅
Cutover script:
```python
#!/usr/bin/env python3
# cutover.py
import time

# Helpers such as log(), set_maintenance_mode(), get_replication_lag(),
# promote_database(), update_database_config(), run_smoke_tests(),
# update_dns_weight() and rollback() wrap the provider APIs and live in
# the migration tooling package.

def execute_cutover():
    log("Starting cutover procedure")

    # Step 1: Stop new writes to AWS
    log("Step 1: Enabling maintenance mode on AWS")
    set_maintenance_mode("aws", enabled=True)

    # Step 2: Wait for replication to catch up
    log("Step 2: Waiting for replication lag = 0")
    start = time.time()
    while get_replication_lag() > 0:
        time.sleep(1)
        if time.time() - start > 120:  # 2-minute timeout
            raise TimeoutError("Replication didn't catch up!")

    # Step 3: Promote GCP database
    log("Step 3: Promoting GCP Cloud SQL to primary")
    promote_database("gcp")

    # Step 4: Update service configs
    log("Step 4: Pointing services to GCP database")
    update_database_config(target="gcp")

    # Step 5: Verification
    log("Step 5: Running verification tests")
    results = run_smoke_tests()
    if not results.all_passed:
        log("VERIFICATION FAILED - INITIATING ROLLBACK")
        rollback()
        return

    # Step 6: Enable traffic
    log("Step 6: Enabling production traffic on GCP")
    set_maintenance_mode("gcp", enabled=False)
    update_dns_weight(aws=0, gcp=100)
    log("✅ Cutover complete!")
```
Phase 5: Cleanup (Month 6)
```yaml
cleanup_checklist:
  - Terminate AWS EC2/EKS resources
  - Delete AWS RDS (after 30-day backup retention)
  - Transfer Route 53 domains to Cloud DNS
  - Cancel AWS reserved instances
  - Update documentation
  - Archive AWS Terraform state
  - Update cost tracking
```
Risk Mitigation
Rollback Plan
```python
def rollback():
    """
    Emergency rollback procedure.
    Can execute at any point during cutover.
    """
    log("🚨 ROLLBACK INITIATED")

    # 1. Point traffic back to AWS
    update_dns_weight(aws=100, gcp=0)

    # 2. Re-enable AWS database as primary
    promote_database("aws")

    # 3. Update service configs back to AWS
    update_database_config(target="aws")

    # 4. Disable maintenance mode on AWS
    set_maintenance_mode("aws", enabled=False)

    # 5. Alert team
    send_alert("Rollback completed - investigate root cause")
```
Caveat: once GCP has accepted any writes, rolling back also means replaying those writes to AWS, so keep reverse (GCP → AWS) replication configured for the cutover window.
Data Consistency Verification
```python
def verify_data_consistency():
    """
    Run before cutover to ensure data is in sync.
    """
    tables = ["users", "orders", "payments", "products"]
    for table in tables:
        aws_count = aws_db.execute(f"SELECT COUNT(*) FROM {table}")
        gcp_count = gcp_db.execute(f"SELECT COUNT(*) FROM {table}")

        # ORDER BY must live inside STRING_AGG so both sides
        # hash the rows in the same order.
        checksum_sql = f"""
            SELECT MD5(STRING_AGG(id::text, '' ORDER BY id))
            FROM {table}
        """
        aws_checksum = aws_db.execute(checksum_sql)
        gcp_checksum = gcp_db.execute(checksum_sql)

        if aws_count != gcp_count or aws_checksum != gcp_checksum:
            raise DataMismatchError(f"Table {table} is out of sync!")

    log("✅ All tables verified in sync")
```
Trade-offs Discussion
| Approach | Pros | Cons |
|---|---|---|
| Big Bang | Fast, simple | High risk, long downtime |
| Parallel Run | Safe, reversible | Expensive (2x infra) |
| Strangler | Low risk | Slow (12-18 months) |
| Hybrid Long-term | Flexibility | Ongoing complexity |
Follow-up Questions
"What about vendor-specific services like Lambda/DynamoDB?"
Abstract or rewrite. Lambda → Cloud Functions, DynamoDB → Firestore/Spanner. Budget 2-3 months for serverless migration.
"How do you handle cross-cloud latency during transition?"
Place synchronous calls in same cloud, use async (queues) for cross-cloud communication.
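The sync-vs-async split can be sketched with a local queue standing in for a cross-cloud message bus (Pub/Sub, SQS); names here are illustrative:

```python
import queue

# Stand-in for a cross-cloud message bus (e.g. Pub/Sub or SQS).
cross_cloud_bus = queue.Queue()

def call_sync(dependency):
    # Synchronous dependency: keep caller and callee in the SAME cloud,
    # otherwise every request pays cross-cloud round-trip latency.
    return dependency()

def call_async(message):
    # Cross-cloud dependency: enqueue and return immediately; the
    # consumer in the other cloud processes it when it arrives.
    cross_cloud_bus.put(message)
    return "accepted"

print(call_async({"order_id": 1}))       # accepted
print(cross_cloud_bus.get()["order_id"])  # 1
```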
"What about cost comparison?"
Run both clouds in parallel for 1 month, measure actual costs, factor in egress fees.
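One number worth having ready: egress. Assuming roughly $0.09/GB for AWS internet data transfer out (the common on-demand rate; bulk and Direct Connect rates are lower), the one-time cost of moving the 50 TB is on the order of:

```python
GB_PER_TB = 1024      # binary convention; decimal TB gives ~$4,500
egress_rate = 0.09    # assumed USD per GB of AWS egress

one_time_egress = 50 * GB_PER_TB * egress_rate
print(f"${one_time_egress:,.0f}")  # $4,608
```

Small next to the parallel-run infrastructure bill, but it recurs for every cross-cloud data flow you leave in place, which is another argument for keeping chatty services co-located.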
Key Takeaways
- Abstraction first - Create cloud-agnostic interfaces
- Continuous replication - Keep data in sync throughout
- Parallel run - Full infrastructure in both clouds before cutover
- Gradual traffic shift - Use weighted DNS routing
- Automated verification - Don't trust, verify
- Clear rollback plan - Know exactly how to go back
Timeline: 5-6 months for a company with 200 services
Team size: 5-8 engineers dedicated to migration
Budget: Expect 2x infrastructure costs during parallel run period