Data Pipelines
TL;DR
Data pipeline: Move and transform data from a source to a destination. Batch: Process large, bounded datasets periodically (ETL). Streaming: Process events continuously as they arrive. Use cases: Analytics, ML, reporting.
Core Concepts
1. Batch vs Streaming
| Aspect | Batch (ETL) | Streaming |
|---|---|---|
| Latency | Minutes to hours/days | Milliseconds to seconds |
| Data volume | Large, bounded (TB/PB) | Unbounded, event-at-a-time |
| Use case | Daily reports, ML training | Real-time dashboards, alerts |
| Tools | Spark, Airflow | Kafka, Flink, Spark Streaming |
2. ETL Pipeline
Example (Daily sales report):
- Extract: Pull orders from database (last 24 hours)
- Transform: Aggregate by product, region
- Load: Insert into analytics database
Tools: Apache Airflow, AWS Glue, dbt
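The three ETL steps above can be sketched end to end in plain Python. This is a minimal illustration using in-memory SQLite; the table and column names (`orders`, `sales_by_product_region`) are hypothetical placeholders, and a real pipeline would read from a production database and load into a warehouse.

```python
import sqlite3
from datetime import datetime, timedelta

# Source database with some sample orders (schema is hypothetical).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (product TEXT, region TEXT, amount REAL, ordered_at TEXT)")
now = datetime(2024, 1, 2, 12, 0)
source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    ("widget", "us-east", 10.0, (now - timedelta(hours=3)).isoformat()),
    ("widget", "us-east", 15.0, (now - timedelta(hours=5)).isoformat()),
    ("gadget", "eu-west", 20.0, (now - timedelta(hours=30)).isoformat()),  # outside 24h window
])

# Extract: pull orders from the last 24 hours.
cutoff = (now - timedelta(hours=24)).isoformat()
extracted = source.execute(
    "SELECT product, region, amount FROM orders WHERE ordered_at >= ?", (cutoff,)
).fetchall()

# Transform: aggregate revenue by (product, region).
totals = {}
for product, region, amount in extracted:
    totals[(product, region)] = totals.get((product, region), 0.0) + amount

# Load: insert aggregates into the analytics database.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE sales_by_product_region (product TEXT, region TEXT, revenue REAL)")
analytics.executemany(
    "INSERT INTO sales_by_product_region VALUES (?, ?, ?)",
    [(p, r, rev) for (p, r), rev in totals.items()],
)
print(analytics.execute("SELECT * FROM sales_by_product_region").fetchall())
# [('widget', 'us-east', 25.0)]
```

At scale the same extract/filter/aggregate/insert pattern is what a Spark job does, with Airflow triggering it on a schedule.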
3. Streaming Pipeline
Example (Real-time fraud detection):
- Kafka receives transaction events
- Flink processes each transaction (check rules)
- Alert if fraud detected
Tools: Apache Flink, Kafka Streams, Apache Storm
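The per-event processing in the fraud example can be sketched like this. The generator stands in for a Kafka consumer and the two rules (amount threshold, sudden country change) are hypothetical; in production the rules would run inside Flink or Kafka Streams with managed keyed state.

```python
from typing import Dict, Iterator

def transaction_stream() -> Iterator[dict]:
    # Stand-in for a Kafka consumer loop; fields are illustrative.
    yield {"user": "alice", "amount": 40.0, "country": "US"}
    yield {"user": "bob", "amount": 9500.0, "country": "US"}
    yield {"user": "alice", "amount": 50.0, "country": "RU"}

def is_fraud(txn: dict, last_country: Dict[str, str]) -> bool:
    # Rule 1 (hypothetical threshold): unusually large amount.
    if txn["amount"] > 5000:
        return True
    # Rule 2: sudden country change for the same user.
    prev = last_country.get(txn["user"])
    last_country[txn["user"]] = txn["country"]
    return prev is not None and prev != txn["country"]

state: Dict[str, str] = {}  # per-user state; Flink would keep this in keyed state
alerts = [txn["user"] for txn in transaction_stream() if is_fraud(txn, state)]
print(alerts)  # ['bob', 'alice']
```

The key contrast with the batch example: each event is checked the moment it arrives, so alert latency is bounded by processing time per event, not by a schedule.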
Common Interview Questions
Q1: "Batch vs streaming?"
Answer:
- Batch: Large data, tolerate latency (daily reports)
- Streaming: Real-time, low latency (fraud detection)
Q2: "How would you build a data pipeline for analytics?"
Answer:
- Extract: CDC (Change Data Capture) from production DB
- Transform: Airflow runs Spark jobs (aggregate, join)
- Load: Store in data warehouse (Snowflake, BigQuery)
- Schedule: Run nightly (batch) or continuously (streaming)
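The nightly batch variant of this answer is typically expressed as an Airflow DAG. A hedged sketch, assuming Airflow 2.x; the task callables, DAG id, and schedule are placeholders, and a real pipeline would invoke CDC ingestion, a Spark job, and a warehouse loader in their bodies.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # placeholder: pull CDC changes from the production DB
def transform(): ...  # placeholder: run Spark aggregations/joins
def load(): ...       # placeholder: write results to Snowflake/BigQuery

with DAG(
    dag_id="nightly_analytics",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",            # the nightly batch run
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load  # extract -> transform -> load ordering
```

For the continuous variant, the extract step becomes a Kafka consumer and the transform runs as a long-lived Flink or Spark Streaming job instead of a scheduled DAG.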
Quick Reference
Batch: Periodic, large data (Spark, Airflow)
Streaming: Real-time, continuous (Kafka, Flink)
Next: Search Systems - Elasticsearch, full-text search.