What is Apache Spark?
/tldr: Unified, lightning-fast, distributed data processing engine
THE CORE IDEA
Apache Spark is the de facto standard for large-scale data processing. Born in 2009 at UC Berkeley's AMPLab, it has grown into one of the most active Apache projects.
Architecture in 60 Seconds
Driver (The Brain)
Runs your main() function
Creates SparkSession
Converts code → logical plan → physical plan
Cluster Manager
YARN • Kubernetes • Mesos • Standalone
Allocates resources across the cluster
Executors (The Muscle)
Live on worker nodes
Run tasks in parallel
Cache data in memory/disk
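Wiring these pieces together from the driver side looks roughly like the sketch below; the master URL, executor count, and sizes are illustrative assumptions, not tuned values.
from pyspark.sql import SparkSession

# The driver builds the session; the cluster manager hands it the executors it asks for.
spark = (SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")                        # stand-in; use "yarn" or "k8s://..." on a real cluster
    .config("spark.executor.instances", "4")   # executors to request (ignored in local mode)
    .config("spark.executor.memory", "4g")     # memory per executor (illustrative)
    .config("spark.executor.cores", "2")       # cores per executor (illustrative)
    .getOrCreate())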
Job → Stage → Task
Action → Job
Shuffle boundary → Stage
One partition → One Task
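In code (toy example, dataset made up): the groupBy below introduces a shuffle, so the single action launches one job split into two stages, with one task per partition in each stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task").getOrCreate()

# Transformations only build the plan; nothing runs yet.
df = spark.range(1_000_000)                                   # toy dataset
grouped = df.groupBy((df.id % 10).alias("bucket")).count()    # groupBy = shuffle boundary = stage split

# The action triggers one job: stage 1 pre-aggregates each input partition,
# stage 2 reads the shuffled data and finalizes the counts.
grouped.show()

print(df.rdd.getNumPartitions())   # number of tasks in the first stage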
One Stack to Rule Them All
Spark Core: RDDs • Task scheduling • Memory management
Spark SQL: DataFrames • Datasets • Catalyst Optimizer
Structured Streaming: Exactly-once • Event-time • Watermarking
MLlib & GraphX: Scalable ML • Graph processing
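A minimal Structured Streaming sketch showing event-time aggregation with a watermark; the path, schema, and checkpoint location are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Hypothetical stream of Parquet files carrying an event_time column.
events = (spark.readStream
    .schema("user STRING, event_time TIMESTAMP")
    .parquet("s3a://my-bucket/stream/"))

# Event-time windows with a 10-minute watermark: data later than that is dropped.
counts = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .count())

# The checkpoint is what enables end-to-end exactly-once with supported sinks.
query = (counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/demo")
    .start())
# query.awaitTermination()   # block until the stream stops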
Write Once, Run Anywhere
Scala (Spark's native language) • Java
PySpark (most popular)
SparkR
Spark SQL
Deploy on: Databricks • EMR • Kubernetes • On-prem • Cloud • Laptop
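For example, the same aggregation can be written through the DataFrame API or plain SQL; Catalyst compiles both to the same plan. The dataset path and columns reuse the hypothetical data from the hello-world snippet below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/events/")   # hypothetical dataset

# DataFrame API...
by_city = df.filter("country = 'BR'").groupBy("city").count()

# ...and the equivalent SQL; both produce the same optimized plan.
df.createOrReplaceTempView("events")
by_city_sql = spark.sql(
    "SELECT city, COUNT(*) AS count FROM events WHERE country = 'BR' GROUP BY city")

by_city.explain()       # compare the physical plans
by_city_sql.explain()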
Hello World in 10 Seconds (PySpark)
from pyspark.sql import SparkSession
# 1. Create session (the entry point)
spark = SparkSession.builder \
    .appName("Spark TLDR Demo") \
    .getOrCreate()
# 2. Read any data
df = spark.read.parquet("s3a://my-bucket/events/")
# 3. Transform
result = df.filter("country = 'BR'") \
    .groupBy("city") \
    .count() \
    .orderBy("count", ascending=False)
# 4. Action → triggers execution
result.show(10)
# 5. Stop
spark.stop()
That's it. The same code runs on one laptop or across 10,000 cores.
Use Spark When...
- Datasets > 100 GB
- You reuse data multiple times (caching; see the sketch after this list)
- Need SQL + ML + Streaming in same pipeline
- Iterative algorithms (ML, Graph)
- Don't use it for simple ETL on datasets under 10 GB
- Don't use for OLTP (use RDS/PostgreSQL)
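A minimal caching sketch (same hypothetical dataset as above): persist a DataFrame you hit more than once so later actions read it from executor memory instead of re-scanning the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Cache once, reuse many times, release when done.
events = spark.read.parquet("s3a://my-bucket/events/").cache()

events.count()                               # first action materializes the cache
events.filter("country = 'BR'").count()      # served from memory
events.groupBy("country").count().show()     # served from memory
events.unpersist()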
You're now Spark-ready.