Cloud Provider Migration

The Interview Question

"Our company wants to move from AWS to GCP (or Azure). We have 50TB of data in S3/RDS, 200 microservices, and can't afford more than 15 minutes of downtime. How do you approach this?"

Asked at: Google, Microsoft, large enterprises

Time to solve: 40-45 minutes

Difficulty: ⭐⭐⭐⭐⭐ (Principal/Staff)


Clarifying Questions to Ask

  1. "Why are we migrating?" → Cost, features, vendor lock-in?
  2. "What's the timeline?" → 3 months vs 18 months changes everything
  3. "Is partial multi-cloud acceptable?" → Affects phasing strategy
  4. "What's our busiest period?" → Avoid Black Friday migrations
  5. "Are there compliance requirements?" → Data residency, etc.

The Complexity Map


Solution: Phased Migration Strategy

Phase 0: Assessment & Abstraction (Months 1-2)

Create abstraction layers to avoid rewriting twice:

# BEFORE: Direct AWS SDK usage everywhere
import boto3

def upload_file(bucket, key, data):
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket, Key=key, Body=data)

# AFTER: Abstraction layer
from storage import StorageClient

def upload_file(bucket, key, data):
    storage = StorageClient.create()  # Returns an S3 or GCS client based on config
    storage.put(bucket, key, data)

Storage abstraction:

# storage/client.py
from abc import ABC, abstractmethod
import os

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...

    @abstractmethod
    def get(self, bucket, key): ...

    @staticmethod
    def create():
        provider = os.getenv("CLOUD_PROVIDER", "aws")
        if provider == "aws":
            return S3Client()
        elif provider == "gcp":
            return GCSClient()
        raise ValueError(f"Unknown CLOUD_PROVIDER: {provider}")

class S3Client(StorageClient):
    def __init__(self):
        import boto3
        self.client = boto3.client('s3')

    def put(self, bucket, key, data):
        self.client.put_object(Bucket=bucket, Key=key, Body=data)

    def get(self, bucket, key):
        return self.client.get_object(Bucket=bucket, Key=key)["Body"].read()

class GCSClient(StorageClient):
    def __init__(self):
        from google.cloud import storage
        self.client = storage.Client()

    def put(self, bucket, key, data):
        blob = self.client.bucket(bucket).blob(key)
        blob.upload_from_string(data)

    def get(self, bucket, key):
        return self.client.bucket(bucket).blob(key).download_as_bytes()
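One payoff of this interface is testability: call sites can be exercised against an in-memory fake without credentials for either cloud. A minimal sketch, with the interface restated so the snippet is self-contained (`InMemoryClient` is an illustrative name, not part of the codebase above):

```python
from abc import ABC, abstractmethod

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...

    @abstractmethod
    def get(self, bucket, key): ...

class InMemoryClient(StorageClient):
    """Fake backend for unit tests - no cloud credentials needed."""
    def __init__(self):
        self._objects = {}

    def put(self, bucket, key, data):
        self._objects[(bucket, key)] = data

    def get(self, bucket, key):
        return self._objects[(bucket, key)]

# Exercise the interface exactly as production code would:
client = InMemoryClient()
client.put("assets", "logo.png", b"\x89PNG")
print(client.get("assets", "logo.png"))  # b'\x89PNG'
```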

Phase 1: Data Sync Setup (Month 2)

Continuous data replication to GCP while still running on AWS:

# Database sync using AWS DMS + GCP Database Migration Service
# Or use pglogical / Debezium for real-time CDC

sync_strategy:
  database:
    type: continuous_replication
    source: aws_rds_postgres
    target: gcp_cloudsql_postgres
    method: pglogical  # Real-time logical replication
    lag_threshold: 30s

  blob_storage:
    type: continuous_sync
    tool: rclone  # or gsutil rsync
    schedule: every_5_minutes
    direction: aws_s3 -> gcp_gcs
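The `lag_threshold: 30s` line implies a watchdog that pages long before cutover day; a minimal sketch, where `get_lag_seconds` stands in for a real query against `pg_stat_replication`:

```python
LAG_THRESHOLD_SECONDS = 30  # mirrors lag_threshold: 30s above

def check_lag(get_lag_seconds, alert):
    """Poll replication lag once; fire an alert if it exceeds the threshold."""
    lag = get_lag_seconds()
    if lag > LAG_THRESHOLD_SECONDS:
        alert(f"Replication lag {lag}s exceeds {LAG_THRESHOLD_SECONDS}s")
        return False
    return True

# Stubbed example: one unhealthy poll, one healthy poll
alerts = []
check_lag(lambda: 45, alerts.append)
check_lag(lambda: 2, alerts.append)
print(alerts)  # ['Replication lag 45s exceeds 30s']
```

In production this would run on a schedule and feed the normal paging pipeline rather than a list.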

rclone sync script:

#!/bin/bash
# sync_s3_to_gcs.sh

# Continuous sync from S3 to GCS
rclone sync \
  s3:my-bucket \
  gcs:my-bucket-gcp \
  --transfers 32 \
  --checkers 16 \
  --fast-list \
  --progress \
  --log-file /var/log/s3-gcs-sync.log

# Verify sync: object counts should match
aws s3 ls s3://my-bucket --recursive | wc -l
gsutil ls -r gs://my-bucket-gcp/** | wc -l
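Matching counts can still hide divergent keys, so a key-level diff of the two listings is a stronger check; a sketch (the sample keys are illustrative):

```python
def diff_listings(source_keys, target_keys):
    """Return (missing_from_target, extra_on_target) as sorted lists."""
    src, dst = set(source_keys), set(target_keys)
    return sorted(src - dst), sorted(dst - src)

# The real inputs would be the key listings from `aws s3 ls` / `gsutil ls` above.
missing, extra = diff_listings(
    ["img/a.png", "img/b.png", "img/c.png"],
    ["img/a.png", "img/c.png", "img/d.png"],
)
print(missing, extra)  # ['img/b.png'] ['img/d.png']
```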

Phase 2: Infrastructure Parity (Months 2-3)

Deploy services to BOTH clouds:

# kubernetes/deployment.yaml - works on both EKS and GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: gcr.io/myproject/user-service:v1.2.3  # Use GCR so both clusters pull from one registry
          env:
            - name: CLOUD_PROVIDER
              valueFrom:
                configMapKeyRef:
                  name: cloud-config
                  key: provider
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url

Terraform for multi-cloud:

# terraform/main.tf
variable "cloud_provider" {
  type = string
  validation {
    condition     = contains(["aws", "gcp"], var.cloud_provider)
    error_message = "Must be aws or gcp."
  }
}

# Module sources must be literal strings in Terraform, so select the
# provider-specific module with count rather than a conditional source.
module "eks" {
  count  = var.cloud_provider == "aws" ? 1 : 0
  source = "./modules/eks"
  # ... common variables
}

module "gke" {
  count  = var.cloud_provider == "gcp" ? 1 : 0
  source = "./modules/gke"
  # ... common variables
}

module "rds" {
  count  = var.cloud_provider == "aws" ? 1 : 0
  source = "./modules/rds"
  # ... common variables
}

module "cloudsql" {
  count  = var.cloud_provider == "gcp" ? 1 : 0
  source = "./modules/cloudsql"
  # ... common variables
}

Phase 3: Traffic Splitting (Month 4)

Use weighted DNS routing as the global entry point for a gradual traffic shift:

Implementation with Cloud DNS weighted round-robin (wrr) routing:

# GCP Cloud DNS weighted round-robin routing policy
resource "google_dns_record_set" "weighted" {
  managed_zone = google_dns_managed_zone.prod.name  # required zone reference
  name         = "api.example.com."
  type         = "A"
  ttl          = 60  # Low TTL so weight changes propagate quickly

  routing_policy {
    wrr {
      weight  = 90
      # NOTE: an A record needs IP addresses, so the AWS ALB must be
      # fronted by static IPs (e.g. Global Accelerator) to appear here.
      rrdatas = [aws_lb.main.dns_name]  # AWS ALB
    }
    wrr {
      weight  = 10
      rrdatas = [google_compute_global_address.main.address]  # GCP
    }
  }
}
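In practice the weights are walked up over days rather than flipped once; a sketch of a ramp schedule that could drive the `wrr` weights above (the step values are illustrative):

```python
RAMP_STEPS = [1, 5, 10, 25, 50, 100]  # percent of traffic on GCP, illustrative

def dns_weights(gcp_percent):
    """Translate a GCP traffic percentage into (aws_weight, gcp_weight)."""
    if not 0 <= gcp_percent <= 100:
        raise ValueError("percentage must be 0-100")
    return 100 - gcp_percent, gcp_percent

for step in RAMP_STEPS:
    aws_w, gcp_w = dns_weights(step)
    # Each step would be applied (e.g. via terraform apply), then soaked
    # while GCP-side error rates and latency are watched before the next step.
    print(f"aws={aws_w} gcp={gcp_w}")
```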

Phase 4: The Cutover (Month 5)

The 15-minute window:

T-24h: Final full backup of both systems
T-1h:  Freeze deploys; begin draining in-flight work
T-0:   Enable maintenance mode, stop writes to AWS
T+1m:  Final CDC sync, verify replication lag = 0
T+2m:  Promote GCP database to primary
T+3m:  Point all services to the GCP database
T+5m:  Run automated verification suite
T+10m: Shift DNS to GCP, enable production traffic
T+15m: Cutover complete ✅

Cutover script:

#!/usr/bin/env python3
# cutover.py
# Helpers (log, set_maintenance_mode, get_replication_lag, promote_database,
# update_database_config, run_smoke_tests, update_dns_weight, rollback)
# are defined in the migration tooling package.

import time

def execute_cutover():
    log("Starting cutover procedure")

    # Step 1: Stop new writes to AWS
    log("Step 1: Enabling maintenance mode on AWS")
    set_maintenance_mode("aws", enabled=True)

    # Step 2: Wait for replication to catch up
    log("Step 2: Waiting for replication lag = 0")
    start = time.time()
    while get_replication_lag() > 0:
        time.sleep(1)
        if time.time() - start > 120:  # 2-minute timeout
            raise TimeoutError("Replication didn't catch up!")

    # Step 3: Promote GCP database
    log("Step 3: Promoting GCP Cloud SQL to primary")
    promote_database("gcp")

    # Step 4: Update service configs
    log("Step 4: Pointing services to GCP database")
    update_database_config(target="gcp")

    # Step 5: Verification
    log("Step 5: Running verification tests")
    results = run_smoke_tests()
    if not results.all_passed:
        log("VERIFICATION FAILED - INITIATING ROLLBACK")
        rollback()
        return

    # Step 6: Enable traffic
    log("Step 6: Enabling production traffic on GCP")
    set_maintenance_mode("gcp", enabled=False)
    update_dns_weight(aws=0, gcp=100)

    log("✅ Cutover complete!")

Phase 5: Cleanup (Month 6)

cleanup_checklist:
  - Terminate AWS EC2/EKS resources
  - Delete AWS RDS (after 30-day backup retention)
  - Transfer Route53 domains to Cloud DNS
  - Cancel AWS reserved instances
  - Update documentation
  - Archive AWS Terraform state
  - Update cost tracking

Risk Mitigation

Rollback Plan

def rollback():
    """
    Emergency rollback procedure.
    Can execute at any point during cutover.
    """
    log("🚨 ROLLBACK INITIATED")

    # 1. Point traffic back to AWS
    update_dns_weight(aws=100, gcp=0)

    # 2. Re-enable AWS database as primary
    promote_database("aws")

    # 3. Update service configs back to AWS
    update_database_config(target="aws")

    # 4. Disable maintenance mode on AWS
    set_maintenance_mode("aws", enabled=False)

    # 5. Alert team
    send_alert("Rollback completed - investigate root cause")

Data Consistency Verification

def verify_data_consistency():
    """
    Run before cutover to ensure data is in sync.
    """
    tables = ["users", "orders", "payments", "products"]

    for table in tables:
        aws_count = aws_db.execute(f"SELECT COUNT(*) FROM {table}")
        gcp_count = gcp_db.execute(f"SELECT COUNT(*) FROM {table}")

        # ORDER BY must go inside STRING_AGG so both sides
        # hash the rows in the same order.
        checksum_sql = f"""
            SELECT MD5(STRING_AGG(id::text, '' ORDER BY id))
            FROM {table}
        """
        aws_checksum = aws_db.execute(checksum_sql)
        gcp_checksum = gcp_db.execute(checksum_sql)

        if aws_count != gcp_count or aws_checksum != gcp_checksum:
            raise DataMismatchError(f"Table {table} is out of sync!")

    log("✅ All tables verified in sync")
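The same checksum can be recomputed application-side to cross-check the SQL result on a sampled table; a sketch (the ids are sorted before hashing, which is exactly what the SQL's `ORDER BY` must guarantee):

```python
import hashlib

def table_checksum(ids):
    """MD5 over the concatenation of sorted ids - mirrors
    MD5(STRING_AGG(id::text, '' ORDER BY id)) in PostgreSQL."""
    joined = "".join(str(i) for i in sorted(ids))
    return hashlib.md5(joined.encode()).hexdigest()

# Same rows in different physical order must hash identically:
assert table_checksum([3, 1, 2]) == table_checksum([1, 2, 3])
# A missing row changes the checksum:
assert table_checksum([1, 2]) != table_checksum([1, 2, 3])
```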

Trade-offs Discussion

| Approach         | Pros             | Cons                     |
| ---------------- | ---------------- | ------------------------ |
| Big Bang         | Fast, simple     | High risk, long downtime |
| Parallel Run     | Safe, reversible | Expensive (2x infra)     |
| Strangler        | Low risk         | Slow (12-18 months)      |
| Hybrid Long-term | Flexibility      | Ongoing complexity       |

Follow-up Questions

"What about vendor-specific services like Lambda/DynamoDB?"

Abstract or rewrite. Lambda → Cloud Functions, DynamoDB → Firestore/Spanner. Budget 2-3 months for serverless migration.

"How do you handle cross-cloud latency during transition?"

Keep synchronous call chains within a single cloud; route cross-cloud communication through asynchronous queues so the added latency never sits on a request path.

"What about cost comparison?"

Run both clouds in parallel for 1 month, measure actual costs, factor in egress fees.
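For the 50TB in the prompt, egress alone is worth estimating up front; a back-of-envelope sketch (the $/GB rate is an assumption, verify against current S3 egress pricing tiers):

```python
DATA_TB = 50
EGRESS_USD_PER_GB = 0.09  # assumed blended list price; actual tiers vary

# One-time cost to move the blob/database data out of AWS
egress_cost = DATA_TB * 1024 * EGRESS_USD_PER_GB
print(f"~${egress_cost:,.0f} one-time egress for {DATA_TB}TB")
```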


Key Takeaways

  1. Abstraction first - Create cloud-agnostic interfaces
  2. Continuous replication - Keep data in sync throughout
  3. Parallel run - Full infrastructure in both clouds before cutover
  4. Gradual traffic shift - Use weighted DNS routing
  5. Automated verification - Don't trust, verify
  6. Clear rollback plan - Know exactly how to go back

Timeline: 5-6 months for a company with 200 services

Team size: 5-8 engineers dedicated to migration

Budget: Expect 2x infrastructure costs during parallel run period