Data Engineering TL;DR Cheatsheet
Learn Data Engineering 10× faster with the cleanest, zero-fluff content
Core Data Engineering
Pipeline Design Patterns
Batch vs Streaming • Lambda/Kappa • Idempotency • Late Data
Data Modeling
Star Schema • Kimball • Data Vault • Slowly Changing Dimensions
Partitioning & Bucketing
Date • Region • Skew • Z-Order • Hive vs Delta
Lakehouse & Medallion Architecture
Medallion Architecture
Bronze • Silver • Gold • Quality Gates • DLT
Data Quality & Testing
Great Expectations • dbt tests • Deequ • Monte Carlo
Data Lineage & Observability
Unity Catalog • OpenLineage • Monte Carlo • Anomaly Detection
Orchestration & CI/CD
Apache Airflow
DAGs • Operators • XComs • TaskFlow • Astronomer
dbt (data build tool)
Models • Tests • Docs • Jinja • Snapshots • Exposures
DE CI/CD Best Practices
GitOps • Databricks Repos • dbt Cloud • Terraform • Promotions
Streaming & Real-Time
Kafka Fundamentals
Topics • Partitions • Consumer Groups • Exactly-once • Schema Registry
Structured Streaming
Watermarks • Triggers • foreachBatch • Checkpointing
Apache Flink
Stateful • Event Time • Checkpoints • Table API
File Formats & Optimization
Parquet Deep Dive
Columnar • Compression • Row Groups • Predicate Pushdown
Delta vs Iceberg vs Hudi
ACID • Schema Evolution • Time Travel • MERGE Performance
Compaction & Vacuum
Small Files • OPTIMIZE • ZORDER • Retention • Bin Packing