Data Pipelines
TL;DR
Data pipeline: Move and transform data from a source to a destination. Batch: Process large, bounded datasets periodically (ETL). Streaming: Process events continuously as they arrive. Use cases: Analytics, ML, reporting.
Core Concepts
1. Batch vs Streaming
| Aspect | Batch (ETL) | Streaming |
|---|---|---|
| Latency | Minutes to hours/days | Milliseconds to seconds |
| Data volume | Large, bounded (TB/PB) | Unbounded, event-at-a-time |
| Use case | Daily reports, ML training | Real-time dashboards, alerts |
| Tools | Spark, Airflow | Kafka, Flink, Spark Streaming |
2. ETL Pipeline
Example (Daily sales report):
- Extract: Pull orders from database (last 24 hours)
- Transform: Aggregate by product, region
- Load: Insert into analytics database
Tools: Apache Airflow, AWS Glue, dbt
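The three ETL steps above can be sketched end to end in plain Python. This is a minimal illustration using in-memory SQLite; the table and column names (`orders`, `sales_by_product_region`) are hypothetical placeholders, and a real pipeline would read from a production database and load into a warehouse.

```python
import sqlite3
from datetime import datetime, timedelta

# Source database with some sample orders (schema is hypothetical).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (product TEXT, region TEXT, amount REAL, ordered_at TEXT)")
now = datetime(2024, 1, 2, 12, 0)
source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    ("widget", "us-east", 10.0, (now - timedelta(hours=3)).isoformat()),
    ("widget", "us-east", 15.0, (now - timedelta(hours=5)).isoformat()),
    ("gadget", "eu-west", 20.0, (now - timedelta(hours=30)).isoformat()),  # outside 24h window
])

# Extract: pull orders from the last 24 hours.
cutoff = (now - timedelta(hours=24)).isoformat()
extracted = source.execute(
    "SELECT product, region, amount FROM orders WHERE ordered_at >= ?", (cutoff,)
).fetchall()

# Transform: aggregate revenue by (product, region).
totals = {}
for product, region, amount in extracted:
    totals[(product, region)] = totals.get((product, region), 0.0) + amount

# Load: insert aggregates into the analytics database.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE sales_by_product_region (product TEXT, region TEXT, revenue REAL)")
analytics.executemany(
    "INSERT INTO sales_by_product_region VALUES (?, ?, ?)",
    [(p, r, rev) for (p, r), rev in totals.items()],
)
print(analytics.execute("SELECT * FROM sales_by_product_region").fetchall())
# [('widget', 'us-east', 25.0)]
```

At scale the same extract/filter/aggregate/insert pattern is what a Spark job does, with Airflow triggering it on a schedule.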
3. Streaming Pipeline
Example (Real-time fraud detection):
- Kafka receives transaction events
- Flink processes each transaction (check rules)
- Alert if fraud detected
Tools: Apache Flink, Kafka Streams, Apache Storm
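The per-event processing in the fraud example can be sketched like this. The generator stands in for a Kafka consumer and the two rules (amount threshold, sudden country change) are hypothetical; in production the rules would run inside Flink or Kafka Streams with managed keyed state.

```python
from typing import Dict, Iterator

def transaction_stream() -> Iterator[dict]:
    # Stand-in for a Kafka consumer loop; fields are illustrative.
    yield {"user": "alice", "amount": 40.0, "country": "US"}
    yield {"user": "bob", "amount": 9500.0, "country": "US"}
    yield {"user": "alice", "amount": 50.0, "country": "RU"}

def is_fraud(txn: dict, last_country: Dict[str, str]) -> bool:
    # Rule 1 (hypothetical threshold): unusually large amount.
    if txn["amount"] > 5000:
        return True
    # Rule 2: sudden country change for the same user.
    prev = last_country.get(txn["user"])
    last_country[txn["user"]] = txn["country"]
    return prev is not None and prev != txn["country"]

state: Dict[str, str] = {}  # per-user state; Flink would keep this in keyed state
alerts = [txn["user"] for txn in transaction_stream() if is_fraud(txn, state)]
print(alerts)  # ['bob', 'alice']
```

The key contrast with the batch example: each event is checked the moment it arrives, so alert latency is bounded by processing time per event, not by a schedule.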
Common Interview Questions
Q1: "Batch vs streaming?"
Answer:
- Batch: Large data, tolerate latency (daily reports)
- Streaming: Real-time, low latency (fraud detection)
Q2: "How would you build a data pipeline for analytics?"
Answer:
- Extract: CDC (Change Data Capture) from production DB
- Transform: Airflow runs Spark jobs (aggregate, join)
- Load: Store in data warehouse (Snowflake, BigQuery)
- Schedule: Run nightly (batch) or continuously (streaming)
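The nightly batch variant of this answer is typically expressed as an Airflow DAG. A hedged sketch, assuming Airflow 2.x; the task callables, DAG id, and schedule are placeholders, and a real pipeline would invoke CDC ingestion, a Spark job, and a warehouse loader in their bodies.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # placeholder: pull CDC changes from the production DB
def transform(): ...  # placeholder: run Spark aggregations/joins
def load(): ...       # placeholder: write results to Snowflake/BigQuery

with DAG(
    dag_id="nightly_analytics",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",            # the nightly batch run
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load  # extract -> transform -> load ordering
```

For the continuous variant, the extract step becomes a Kafka consumer and the transform runs as a long-lived Flink or Spark Streaming job instead of a scheduled DAG.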
Quick Reference
Batch: Periodic, large data (Spark, Airflow)
Streaming: Real-time, continuous (Kafka, Flink)
Next: Search Systems - Elasticsearch, full-text search.