Data Pipelines

TL;DR

Data pipeline: Move and transform data from source to destination. Batch: Process large datasets periodically (ETL). Streaming: Process data in real-time. Use cases: Analytics, ML, reporting.

Core Concepts

1. Batch vs Streaming

| Aspect      | Batch (ETL)                | Streaming                     |
| ----------- | -------------------------- | ----------------------------- |
| Latency     | Hours/days                 | Seconds/minutes               |
| Data volume | Large (TB/PB)              | Smaller chunks                |
| Use case    | Daily reports, ML training | Real-time dashboards, alerts  |
| Tools       | Spark, Airflow             | Kafka, Flink, Spark Streaming |

2. ETL Pipeline

Example (Daily sales report):

  1. Extract: Pull orders from database (last 24 hours)
  2. Transform: Aggregate by product, region
  3. Load: Insert into analytics database

Tools: Apache Airflow, AWS Glue, dbt
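The three steps above can be sketched in plain Python. This is a minimal illustration, not a production job: sqlite3 stands in for both the source and analytics databases, and the table/column names (`orders`, `daily_sales`, etc.) are assumptions for the example.

```python
# Minimal ETL sketch for the daily sales report.
# sqlite3 stands in for the real source and analytics databases;
# table and column names here are illustrative assumptions.
import sqlite3
from datetime import datetime, timedelta

def run_daily_etl(conn):
    cutoff = (datetime.utcnow() - timedelta(hours=24)).isoformat()

    # 1. Extract: pull orders from the last 24 hours
    rows = conn.execute(
        "SELECT product, region, amount FROM orders WHERE created_at >= ?",
        (cutoff,),
    ).fetchall()

    # 2. Transform: aggregate revenue by (product, region)
    totals = {}
    for product, region, amount in rows:
        key = (product, region)
        totals[key] = totals.get(key, 0.0) + amount

    # 3. Load: insert aggregates into the analytics table
    conn.executemany(
        "INSERT INTO daily_sales (product, region, revenue) VALUES (?, ?, ?)",
        [(p, r, rev) for (p, r), rev in totals.items()],
    )
    conn.commit()
```

In a real pipeline, an orchestrator such as Airflow would schedule this function nightly and handle retries; the extract/transform/load structure stays the same.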

3. Streaming Pipeline

Example (Real-time fraud detection):

  • Kafka receives transaction events
  • Flink processes each transaction (check rules)
  • Alert if fraud detected

Tools: Apache Flink, Kafka Streams, Apache Storm
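The fraud-detection flow above can be sketched as a per-event check. Here a plain Python iterator stands in for the Kafka consumer, and the rule (flag unusually large amounts) plus the event shape are illustrative assumptions:

```python
# Sketch of the streaming rule check. A plain iterator stands in for
# the Kafka consumer; the threshold rule and event fields are
# illustrative assumptions, not a real fraud model.
FRAUD_THRESHOLD = 10_000

def detect_fraud(events, threshold=FRAUD_THRESHOLD):
    """Process each transaction as it arrives; yield an alert per suspect."""
    for event in events:  # in production: the consumer's poll loop
        if event["amount"] > threshold:
            yield {"alert": "possible_fraud", "tx_id": event["tx_id"]}
```

The key property of the streaming shape is that each event is handled as it arrives, so alert latency is bounded by per-event processing time rather than a batch schedule.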

Common Interview Questions

Q1: "Batch vs streaming?"

Answer:

  • Batch: Large data, tolerate latency (daily reports)
  • Streaming: Real-time, low latency (fraud detection)

Q2: "How would you build a data pipeline for analytics?"

Answer:

  1. Extract: CDC (Change Data Capture) from production DB
  2. Transform: Airflow runs Spark jobs (aggregate, join)
  3. Load: Store in data warehouse (Snowflake, BigQuery)
  4. Schedule: Run nightly (batch) or continuously (streaming)
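The extract step above hinges on pulling only *new* changes each run. A watermark over a change log is one simplified way to sketch that idea (the change-log shape and `seq` field are assumptions; real CDC tools read the database's transaction log):

```python
# Simplified stand-in for CDC-style incremental extraction: a watermark
# records the last change already loaded, so each run pulls only newer
# rows. The change-log shape and "seq" field are illustrative assumptions.
def extract_incremental(change_log, watermark):
    """Return changes newer than the watermark, plus the new watermark."""
    new_rows = [row for row in change_log if row["seq"] > watermark]
    new_watermark = max((row["seq"] for row in new_rows), default=watermark)
    return new_rows, new_watermark
```

Persisting the watermark between runs is what makes the pipeline resumable: a failed run can be retried without re-loading rows it already processed.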

Quick Reference

Batch: Periodic, large data (Spark, Airflow)
Streaming: Real-time, continuous (Kafka, Flink)


Next: Search Systems - Elasticsearch, full-text search.