What is Apache Spark?
/tldr: Unified, lightning-fast, distributed data processing engine
THE CORE IDEA
Apache Spark is the de facto standard for large-scale data processing. Born in 2009 at UC Berkeley's AMPLab, it has grown into one of the most active Apache projects.
Architecture in 60 Seconds
Driver (The Brain)
Runs your main() function
Creates SparkSession
Converts code → logical plan → physical plan
Cluster Manager
YARN • Kubernetes • Mesos • Standalone
Allocates resources across the cluster
Executors (The Muscle)
Live on worker nodes
Run tasks in parallel
Cache data in memory/disk
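Wiring these pieces together from the driver side looks roughly like the sketch below; the master URL, executor count, and sizes are illustrative assumptions, not tuned values.
from pyspark.sql import SparkSession

# The driver builds the session; the cluster manager hands it the executors it asks for.
spark = (SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")                        # stand-in; use "yarn" or "k8s://..." on a real cluster
    .config("spark.executor.instances", "4")   # executors to request (ignored in local mode)
    .config("spark.executor.memory", "4g")     # memory per executor (illustrative)
    .config("spark.executor.cores", "2")       # cores per executor (illustrative)
    .getOrCreate())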
Job → Stage → Task
Action → Job
Shuffle boundary → Stage
One partition → One Task
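In code (toy example, dataset made up): the groupBy below introduces a shuffle, so the single action launches one job split into two stages, with one task per partition in each stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task").getOrCreate()

# Transformations only build the plan; nothing runs yet.
df = spark.range(1_000_000)                                   # toy dataset
grouped = df.groupBy((df.id % 10).alias("bucket")).count()    # groupBy = shuffle boundary = stage split

# The action triggers one job: stage 1 pre-aggregates each input partition,
# stage 2 reads the shuffled data and finalizes the counts.
grouped.show()

print(df.rdd.getNumPartitions())   # number of tasks in the first stage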
One Stack to Rule Them All
Spark Core: RDDs • Task scheduling • Memory management
Spark SQL: DataFrames • Datasets • Catalyst Optimizer
Structured Streaming: Exactly-once • Event-time • Watermarking
MLlib & GraphX: Scalable ML • Graph processing
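A minimal Structured Streaming sketch showing event-time aggregation with a watermark; the path, schema, and checkpoint location are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Hypothetical stream of Parquet files carrying an event_time column.
events = (spark.readStream
    .schema("user STRING, event_time TIMESTAMP")
    .parquet("s3a://my-bucket/stream/"))

# Event-time windows with a 10-minute watermark: data later than that is dropped.
counts = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .count())

# The checkpoint is what enables end-to-end exactly-once with supported sinks.
query = (counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/demo")
    .start())
# query.awaitTermination()   # block until the stream stops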
Write Once, Run Anywhere
Scala (Spark's native language) • Java
PySpark (most popular)
SparkR
Spark SQL
Deploy on: Databricks • EMR • Kubernetes • On-prem • Cloud • Laptop
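For example, the same aggregation can be written through the DataFrame API or plain SQL; Catalyst compiles both to the same plan. The dataset path and columns reuse the hypothetical data from the hello-world snippet below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/events/")   # hypothetical dataset

# DataFrame API...
by_city = df.filter("country = 'BR'").groupBy("city").count()

# ...and the equivalent SQL; both produce the same optimized plan.
df.createOrReplaceTempView("events")
by_city_sql = spark.sql(
    "SELECT city, COUNT(*) AS count FROM events WHERE country = 'BR' GROUP BY city")

by_city.explain()       # compare the physical plans
by_city_sql.explain()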
Hello World in 10 Seconds (PySpark)
from pyspark.sql import SparkSession
# 1. Create session (the entry point)
spark = SparkSession.builder \
    .appName("Spark TLDR Demo") \
    .getOrCreate()
# 2. Read any data
df = spark.read.parquet("s3a://my-bucket/events/")
# 3. Transform
result = df.filter("country = 'BR'") \
    .groupBy("city") \
    .count() \
    .orderBy("count", ascending=False)
# 4. Action → triggers execution
result.show(10)
# 5. Stop
spark.stop()
That's it. The same code runs on one laptop or across 10,000 cores.
Use Spark When...
- Datasets > 100 GB
- You reuse data multiple times (caching; see the sketch after this list)
- Need SQL + ML + Streaming in same pipeline
- Iterative algorithms (ML, Graph)
- Don't use it for simple ETL on datasets under 10 GB
- Don't use for OLTP (use RDS/PostgreSQL)
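A minimal caching sketch (same hypothetical dataset as above): persist a DataFrame you hit more than once so later actions read it from executor memory instead of re-scanning the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Cache once, reuse many times, release when done.
events = spark.read.parquet("s3a://my-bucket/events/").cache()

events.count()                               # first action materializes the cache
events.filter("country = 'BR'").count()      # served from memory
events.groupBy("country").count().show()     # served from memory
events.unpersist()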
You're now Spark-ready.