Cloud Provider Migration
The Interview Question
"Our company wants to move from AWS to GCP (or Azure). We have 50TB of data in S3/RDS, 200 microservices, and can't afford more than 15 minutes of downtime. How do you approach this?"
Asked at: Google, Microsoft, large enterprises
Time to solve: 40-45 minutes
Difficulty: ⭐⭐⭐⭐⭐ (Principal/Staff)
Clarifying Questions to Ask
- "Why are we migrating?" → Cost, features, vendor lock-in?
- "What's the timeline?" → 3 months vs 18 months changes everything
- "Is partial multi-cloud acceptable?" → Affects phasing strategy
- "What's our busiest period?" → Avoid Black Friday migrations
- "Are there compliance requirements?" → Data residency, etc.
Solution: Phased Migration Strategy
Phase 0: Assessment & Abstraction (Months 1-2)
Create abstraction layers to avoid rewriting twice:
```python
# BEFORE: Direct AWS SDK usage everywhere
import boto3

def upload_file(bucket, key, data):
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket, Key=key, Body=data)
```

```python
# AFTER: Abstraction layer
from storage import StorageClient

def upload_file(bucket, key, data):
    storage = StorageClient.create()  # Returns S3 or GCS client based on config
    storage.put(bucket, key, data)
```
Storage abstraction:
```python
# storage/client.py
from abc import ABC, abstractmethod
import os

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...

    @abstractmethod
    def get(self, bucket, key): ...

    @staticmethod
    def create():
        provider = os.getenv("CLOUD_PROVIDER", "aws")
        if provider == "aws":
            return S3Client()
        elif provider == "gcp":
            return GCSClient()
        raise ValueError(f"Unknown CLOUD_PROVIDER: {provider}")

class S3Client(StorageClient):
    def __init__(self):
        import boto3
        self.client = boto3.client('s3')

    def put(self, bucket, key, data):
        self.client.put_object(Bucket=bucket, Key=key, Body=data)

    def get(self, bucket, key):
        return self.client.get_object(Bucket=bucket, Key=key)["Body"].read()

class GCSClient(StorageClient):
    def __init__(self):
        from google.cloud import storage
        self.client = storage.Client()

    def put(self, bucket, key, data):
        blob = self.client.bucket(bucket).blob(key)
        blob.upload_from_string(data)

    def get(self, bucket, key):
        blob = self.client.bucket(bucket).blob(key)
        return blob.download_as_bytes()
```
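A side benefit worth mentioning in the interview: once the interface exists, tests and local development can run against an in-memory fake, so application code never touches a real cloud SDK. A minimal sketch (the `MemoryClient` name is illustrative, not part of the migration plan):

```python
from abc import ABC, abstractmethod

class StorageClient(ABC):
    @abstractmethod
    def put(self, bucket, key, data): ...
    @abstractmethod
    def get(self, bucket, key): ...

# In-memory fake: satisfies the same interface, so code written against
# StorageClient runs unchanged in tests, on AWS, or on GCP.
class MemoryClient(StorageClient):
    def __init__(self):
        self._objects = {}
    def put(self, bucket, key, data):
        self._objects[(bucket, key)] = data
    def get(self, bucket, key):
        return self._objects[(bucket, key)]

storage = MemoryClient()
storage.put("my-bucket", "users/42.json", b'{"id": 42}')
print(storage.get("my-bucket", "users/42.json"))  # b'{"id": 42}'
```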
Phase 1: Data Sync Setup (Month 2)
Continuous data replication to GCP while still running on AWS:
```yaml
# Database sync using AWS DMS + GCP Database Migration Service,
# or pglogical / Debezium for real-time CDC
sync_strategy:
  database:
    type: continuous_replication
    source: aws_rds_postgres
    target: gcp_cloudsql_postgres
    method: pglogical          # Real-time logical replication
    lag_threshold: 30s
  blob_storage:
    type: continuous_sync
    tool: rclone               # or gsutil rsync
    schedule: every_5_minutes
    direction: aws_s3 -> gcp_gcs
```
rclone sync script:
```bash
#!/bin/bash
# sync_s3_to_gcs.sh — continuous sync from S3 to GCS
rclone sync \
  s3:my-bucket \
  gcs:my-bucket-gcp \
  --transfers 32 \
  --checkers 16 \
  --fast-list \
  --progress \
  --log-file /var/log/s3-gcs-sync.log

# Quick sanity check: object counts should match
aws s3 ls s3://my-bucket --recursive | wc -l
gsutil ls -r gs://my-bucket-gcp/** | wc -l
```
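Object counts can match while individual objects differ, so a stronger check compares per-key sizes (or checksums) from both sides. A self-contained sketch of the comparison logic; in practice the manifests would be built from the list APIs or the `ls` output above:

```python
def diff_manifests(src, dst):
    """Compare two {key: size} manifests and report what is out of sync."""
    missing = sorted(k for k in src if k not in dst)
    size_mismatch = sorted(k for k in src if k in dst and src[k] != dst[k])
    extra = sorted(k for k in dst if k not in src)
    return {"missing": missing, "size_mismatch": size_mismatch, "extra": extra}

# Toy manifests standing in for S3 (src) and GCS (dst) listings.
src = {"a.txt": 10, "b.txt": 20, "c.txt": 30}
dst = {"a.txt": 10, "b.txt": 99}
print(diff_manifests(src, dst))
# {'missing': ['c.txt'], 'size_mismatch': ['b.txt'], 'extra': []}
```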
Phase 2: Infrastructure Parity (Months 2-3)
Deploy services to BOTH clouds:
```yaml
# kubernetes/deployment.yaml — works on both EKS and GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: gcr.io/myproject/user-service:v1.2.3  # Use GCR
          env:
            - name: CLOUD_PROVIDER
              valueFrom:
                configMapKeyRef:
                  name: cloud-config
                  key: provider
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: url
```
Terraform for multi-cloud:
```hcl
# terraform/main.tf
variable "cloud_provider" {
  type = string
  validation {
    condition     = contains(["aws", "gcp"], var.cloud_provider)
    error_message = "Must be aws or gcp."
  }
}

# Terraform requires module `source` to be a literal string, so select
# the provider-specific module with `count` instead of a conditional source.
module "eks" {
  source = "./modules/eks"
  count  = var.cloud_provider == "aws" ? 1 : 0
  # ... common variables
}

module "gke" {
  source = "./modules/gke"
  count  = var.cloud_provider == "gcp" ? 1 : 0
  # ... common variables
}

module "rds" {
  source = "./modules/rds"
  count  = var.cloud_provider == "aws" ? 1 : 0
}

module "cloudsql" {
  source = "./modules/cloudsql"
  count  = var.cloud_provider == "gcp" ? 1 : 0
}
```
Phase 3: Traffic Splitting (Month 4)
Use a global load balancer for a gradual traffic shift. Implementation with Cloud DNS weighted routing:
```hcl
# GCP Cloud DNS weighted (wrr) routing
resource "google_dns_record_set" "weighted" {
  managed_zone = google_dns_managed_zone.prod.name
  name         = "api.example.com."
  type         = "A"
  ttl          = 60  # Low TTL so weight changes take effect quickly

  routing_policy {
    wrr {
      weight = 90
      # An ALB exposes a hostname, not a static IP; for an A record the
      # AWS leg needs static IPs (e.g. AWS Global Accelerator in front
      # of the ALB), supplied here via a variable.
      rrdatas = [var.aws_static_ip]
    }
    wrr {
      weight  = 10
      rrdatas = [google_compute_global_address.main.address]  # GCP
    }
  }
}
```
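In practice the weight change is scripted as a stepped ramp with a health gate between steps, not a single jump. A minimal sketch of that control loop (the step values and the `apply_weights` / `healthy` callbacks are illustrative, not from the plan above):

```python
def ramp_plan(steps=(10, 25, 50, 100)):
    """Yield (aws_weight, gcp_weight) pairs for a stepped shift to GCP."""
    for gcp in steps:
        yield (100 - gcp, gcp)

def shift_traffic(apply_weights, healthy, steps=(10, 25, 50, 100)):
    """Apply each step; fall back to the last known-good split on failure.

    apply_weights(aws, gcp) would update the DNS wrr weights;
    healthy() would check GCP-side error rates and latency.
    """
    applied = (100, 0)
    for aws, gcp in ramp_plan(steps):
        apply_weights(aws, gcp)
        if not healthy():
            apply_weights(*applied)  # revert to last safe split
            return applied
        applied = (aws, gcp)
    return applied

# Example with stub callbacks: everything healthy, ramp completes.
history = []
result = shift_traffic(lambda a, g: history.append((a, g)), lambda: True)
print(result)  # (0, 100)
```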
Phase 4: The Cutover (Month 5)
The 15-minute window:
T-24h: Take final full backups of both systems
T-1h: Shift all read traffic to GCP (AWS DNS weight → 0)
T-0: Enter maintenance mode; stop writes to AWS
T+1m: Final CDC sync; verify replication lag = 0
T+2m: Promote the GCP database to primary
T+3m: Point all services at the GCP database
T+5m: Run the automated verification suite
T+10m: Lift maintenance mode; production traffic flows on GCP
T+15m: Cutover complete ✅
Cutover script:
```python
#!/usr/bin/env python3
# cutover.py
import time

# Helpers such as log(), set_maintenance_mode(), get_replication_lag(),
# promote_database(), update_database_config(), run_smoke_tests(),
# update_dns_weight() and rollback() wrap the provider APIs and live in
# the migration tooling package.

def execute_cutover():
    log("Starting cutover procedure")

    # Step 1: Stop new writes to AWS
    log("Step 1: Enabling maintenance mode on AWS")
    set_maintenance_mode("aws", enabled=True)

    # Step 2: Wait for replication to catch up
    log("Step 2: Waiting for replication lag = 0")
    start = time.time()
    while get_replication_lag() > 0:
        time.sleep(1)
        if time.time() - start > 120:  # 2-minute timeout
            raise TimeoutError("Replication didn't catch up!")

    # Step 3: Promote GCP database
    log("Step 3: Promoting GCP Cloud SQL to primary")
    promote_database("gcp")

    # Step 4: Update service configs
    log("Step 4: Pointing services to GCP database")
    update_database_config(target="gcp")

    # Step 5: Verification
    log("Step 5: Running verification tests")
    results = run_smoke_tests()
    if not results.all_passed:
        log("VERIFICATION FAILED - INITIATING ROLLBACK")
        rollback()
        return

    # Step 6: Enable traffic
    log("Step 6: Enabling production traffic on GCP")
    set_maintenance_mode("gcp", enabled=False)
    update_dns_weight(aws=0, gcp=100)
    log("✅ Cutover complete!")
```
Phase 5: Cleanup (Month 6)
```yaml
cleanup_checklist:
  - Terminate AWS EC2/EKS resources
  - Delete AWS RDS (after 30-day backup retention)
  - Transfer Route 53 domains to Cloud DNS
  - Cancel AWS reserved instances
  - Update documentation
  - Archive AWS Terraform state
  - Update cost tracking
```
Risk Mitigation
Rollback Plan
```python
def rollback():
    """
    Emergency rollback procedure.
    Can execute at any point during cutover.
    """
    log("🚨 ROLLBACK INITIATED")

    # 1. Point traffic back to AWS
    update_dns_weight(aws=100, gcp=0)

    # 2. Re-enable AWS database as primary
    promote_database("aws")

    # 3. Update service configs back to AWS
    update_database_config(target="aws")

    # 4. Disable maintenance mode on AWS
    set_maintenance_mode("aws", enabled=False)

    # 5. Alert team
    send_alert("Rollback completed - investigate root cause")
```
Caveat: once GCP has accepted any writes, rolling back also means replaying those writes to AWS, so keep reverse (GCP → AWS) replication configured for the cutover window.
Data Consistency Verification
```python
def verify_data_consistency():
    """
    Run before cutover to ensure data is in sync.
    """
    tables = ["users", "orders", "payments", "products"]
    for table in tables:
        aws_count = aws_db.execute(f"SELECT COUNT(*) FROM {table}")
        gcp_count = gcp_db.execute(f"SELECT COUNT(*) FROM {table}")

        # ORDER BY must live inside STRING_AGG so both sides
        # hash the rows in the same order.
        checksum_sql = f"""
            SELECT MD5(STRING_AGG(id::text, '' ORDER BY id))
            FROM {table}
        """
        aws_checksum = aws_db.execute(checksum_sql)
        gcp_checksum = gcp_db.execute(checksum_sql)

        if aws_count != gcp_count or aws_checksum != gcp_checksum:
            raise DataMismatchError(f"Table {table} is out of sync!")

    log("✅ All tables verified in sync")
```
Trade-offs Discussion
| Approach | Pros | Cons |
|---|---|---|
| Big Bang | Fast, simple | High risk, long downtime |
| Parallel Run | Safe, reversible | Expensive (2x infra) |
| Strangler | Low risk | Slow (12-18 months) |
| Hybrid Long-term | Flexibility | Ongoing complexity |
Follow-up Questions
"What about vendor-specific services like Lambda/DynamoDB?"
Abstract or rewrite. Lambda → Cloud Functions, DynamoDB → Firestore/Spanner. Budget 2-3 months for serverless migration.
"How do you handle cross-cloud latency during transition?"
Place synchronous calls in same cloud, use async (queues) for cross-cloud communication.
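The sync-vs-async split can be sketched with a local queue standing in for a cross-cloud message bus (Pub/Sub, SQS); names here are illustrative:

```python
import queue

# Stand-in for a cross-cloud message bus (e.g. Pub/Sub or SQS).
cross_cloud_bus = queue.Queue()

def call_sync(dependency):
    # Synchronous dependency: keep caller and callee in the SAME cloud,
    # otherwise every request pays cross-cloud round-trip latency.
    return dependency()

def call_async(message):
    # Cross-cloud dependency: enqueue and return immediately; the
    # consumer in the other cloud processes it when it arrives.
    cross_cloud_bus.put(message)
    return "accepted"

print(call_async({"order_id": 1}))       # accepted
print(cross_cloud_bus.get()["order_id"])  # 1
```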
"What about cost comparison?"
Run both clouds in parallel for 1 month, measure actual costs, factor in egress fees.
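One number worth having ready: egress. Assuming roughly $0.09/GB for AWS internet data transfer out (the common on-demand rate; bulk and Direct Connect rates are lower), the one-time cost of moving the 50 TB is on the order of:

```python
GB_PER_TB = 1024      # binary convention; decimal TB gives ~$4,500
egress_rate = 0.09    # assumed USD per GB of AWS egress

one_time_egress = 50 * GB_PER_TB * egress_rate
print(f"${one_time_egress:,.0f}")  # $4,608
```

Small next to the parallel-run infrastructure bill, but it recurs for every cross-cloud data flow you leave in place, which is another argument for keeping chatty services co-located.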
Key Takeaways
- Abstraction first - Create cloud-agnostic interfaces
- Continuous replication - Keep data in sync throughout
- Parallel run - Full infrastructure in both clouds before cutover
- Gradual traffic shift - Use weighted DNS routing
- Automated verification - Don't trust, verify
- Clear rollback plan - Know exactly how to go back
Timeline: 5-6 months for a company with 200 services
Team size: 5-8 engineers dedicated to migration
Budget: Expect 2x infrastructure costs during parallel run period