FAQ
-
What is the difference between a Data Engineer and a Data Scientist?
Data Engineers focus on building and maintaining robust, scalable data pipelines (ETL/ELT) using tools like Apache Spark, AWS Glue, and Airflow. Their primary goal is data availability and reliability. Data Scientists focus on analyzing that clean data, building machine learning models (using MLflow or Spark MLlib), and extracting business insights.
-
What is a Data Lakehouse, and why is it replacing Data Warehouses?
The Data Lakehouse is a unified architecture that combines the low-cost storage and flexibility of a Data Lake (like AWS S3) with the structure, ACID transactions, and governance of a Data Warehouse (like AWS Redshift). It is typically powered by open table formats such as Delta Lake or Apache Iceberg, offering the best of both worlds for BI and ML.
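As a minimal sketch of the Lakehouse idea, the snippet below writes a table in an open format directly to data-lake storage. It assumes a Spark session configured with the Delta Lake package and S3 credentials; the bucket name is illustrative.

```python
# Minimal sketch: writing a Delta table to an S3 data lake path.
# Assumes Delta Lake is configured on the Spark session; the bucket is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "shipped", 120.0), (2, "pending", 45.5)],
    ["order_id", "status", "amount"],
)

# The same files on S3 now support ACID writes and can serve both BI queries and ML reads.
orders.write.format("delta").mode("overwrite").save("s3://my-data-lake/bronze/orders")
```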
-
What are the best programming languages for Data Engineering?
The primary languages for modern Data Engineering are Python (specifically PySpark for distributed processing and dbt integration) and SQL (for querying, transformation, and data modeling). Scala is also used, but Python/PySpark is now the industry standard for pipeline development.
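For illustration, here is the same aggregation expressed in both languages from a PySpark session; the data and column names are made up for the example.

```python
# Illustrative only: one aggregation written with the PySpark DataFrame API and with SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lang-demo").getOrCreate()
sales = spark.createDataFrame([("EU", 10.0), ("US", 20.0), ("EU", 5.0)], ["region", "amount"])

# PySpark DataFrame API
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# The equivalent in SQL
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```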
-
What is the core difference between RDD, DataFrame, and Dataset in Spark?
RDD is the original, low-level API: a distributed collection of objects with no schema and no Catalyst optimization. DataFrame is the current standard: a distributed collection of rows organized into named columns that runs through the Catalyst Optimizer for high performance. Dataset offers DataFrame performance plus compile-time type safety, but is only available in Scala and Java. DataFrames are recommended for most PySpark work.
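A short PySpark sketch of the two APIs you can actually use from Python (Datasets are Scala/Java only); the data is illustrative.

```python
# Sketch: the RDD API vs. the DataFrame API in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-demo").getOrCreate()

# RDD: low-level, no schema, no Catalyst optimization.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))

# DataFrame: named columns, optimized by Catalyst.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.withColumn("value", df.value * 2).show()
```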
-
How does the Apache Spark Catalyst Optimizer improve query performance?
The Catalyst Optimizer is the core component of Spark SQL and the DataFrame API. It takes your code (the logical plan) and applies a series of rule-based and cost-based optimizations to generate an efficient physical execution plan before any computation starts. This optimization is key to Spark's speed.
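You can inspect what Catalyst produces by calling explain() on a DataFrame; the plans are generated before any data is processed. A small sketch with illustrative columns:

```python
# Sketch: viewing the plans Catalyst generates for a query.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

query = df.filter(F.col("bucket") == 3).groupBy("bucket").count()

# "extended" prints the parsed, analyzed, and optimized logical plans
# along with the physical plan Catalyst selects.
query.explain(mode="extended")
```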
-
What is Data Shuffle and why is it a bottleneck in Spark jobs?
Data Shuffle is the process where Spark moves data between partitions across the network, typically required by wide transformations like groupBy(), join(), or repartition(). It is slow because it involves network I/O, serialization, and disk spills. Tuning configurations like spark.sql.shuffle.partitions is critical for performance.
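A brief sketch of a wide transformation that forces a shuffle, together with the partition setting mentioned above; the value 64 is illustrative and should be tuned to your data volume.

```python
# Sketch: a wide transformation that triggers a shuffle, plus shuffle-partition tuning.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Lower the default of 200 shuffle partitions for a smaller dataset (illustrative value).
spark.conf.set("spark.sql.shuffle.partitions", "64")

events = spark.range(10_000_000).withColumn("user_id", F.col("id") % 1000)

# groupBy is a wide transformation: rows with the same user_id must be moved
# to the same partition across the network before counting. The plan shows an Exchange step.
events.groupBy("user_id").count().explain()
```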
-
When should I use AWS Glue and when should I use AWS Glue DataBrew?
Use AWS Glue (specifically Glue ETL jobs running Spark) when you need to write complex, scalable, distributed code for ETL/ELT pipelines. Use AWS Glue DataBrew when you need a visual, code-free interface for data profiling, cleaning, and preparation, which is often preferred by Data Analysts.
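For context, a minimal sketch of a Glue ETL (Spark) job script; the catalog database, table, and output path are placeholders.

```python
# Sketch of a Glue ETL job: read a cataloged table, write it to S3 as Parquet.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Database, table, and output path below are placeholders.
dyf = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)
job.commit()
```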
-
What is the main use case for AWS Athena vs. AWS Redshift?
AWS Athena is a serverless query service best used for ad-hoc, interactive SQL querying directly on data stored in your AWS S3 Data Lake. AWS Redshift is a fully managed, scalable Cloud Data Warehouse built for complex, high-volume BI reporting and historical analysis.
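As a rough sketch, an ad-hoc Athena query can be submitted with boto3; the database, query, and S3 result location below are placeholders.

```python
# Sketch: submitting an ad-hoc SQL query to Athena over data in S3.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS orders FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() until the query finishes
```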
-
What is the purpose of an AWS Glue Crawler?
An AWS Glue Crawler automatically scans your data sources (like AWS S3 or RDS), determines the schema and data types, and registers this metadata into the AWS Glue Data Catalog. This allows services like AWS Athena or Redshift Spectrum to query the data using SQL.
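A minimal boto3 sketch of creating and running a crawler; the crawler name, IAM role, database, and S3 path are placeholders.

```python
# Sketch: registering S3 data in the Glue Data Catalog via a crawler.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/bronze/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")
# Once the crawl finishes, the inferred table appears in the Glue Data Catalog
# and can be queried from Athena or Redshift Spectrum.
```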
-
What problem does Databricks Unity Catalog solve?
Unity Catalog is the solution for centralized data governance and security across the entire Databricks Lakehouse Platform. It allows organizations to manage users, groups, and access controls for data (tables, views, files) and ML models using a single, standards-based interface.
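As an illustration of the SQL-based permission model, the snippet below assumes a Databricks notebook (where spark is predefined) and uses made-up catalog, schema, table, and group names.

```python
# Sketch: Unity Catalog access control with SQL GRANT statements (names are illustrative).
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

# Privileges are defined once in Unity Catalog and apply across all
# workspaces attached to the same metastore.
```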
-
What are the key features of Delta Lake?
Delta Lake is an open-source storage layer that brings reliability to data lakes. Its key features are ACID Transactions (Atomicity, Consistency, Isolation, Durability), Time Travel (data versioning for audits and rollbacks), Schema Enforcement, and unified batch and streaming processing on the same table.
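A short PySpark sketch of Time Travel and Schema Enforcement, assuming a Spark session with Delta Lake configured and an illustrative table path.

```python
# Sketch: Delta Lake time travel and schema enforcement (path is illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()
path = "s3://my-data-lake/silver/customers"

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema enforcement: appending rows whose columns don't match the table's
# schema raises an error instead of silently corrupting the data.
bad_rows = spark.createDataFrame([(1, "extra")], ["id", "unexpected_col"])
# bad_rows.write.format("delta").mode("append").save(path)  # would fail the write
```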
-
What are Delta Live Tables (DLT) used for?
Delta Live Tables (DLT) is a framework built into Databricks for building reliable, declarative ETL/ELT pipelines. DLT automatically manages job orchestration, auto-scaling, monitoring, data quality checks (expectations), and dependency management between tables, and is commonly used to implement the Medallion Architecture (Bronze, Silver, Gold tables).
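A minimal sketch of a DLT pipeline in Python; it only runs inside a Databricks DLT pipeline (not as a standalone script), and the source path and table names are illustrative.

```python
# Sketch: two declarative DLT tables forming a Bronze -> Silver step.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw orders loaded as-is from the landing zone.")
def bronze_orders():
    # Source path is a placeholder; `spark` is provided by the DLT runtime.
    return spark.read.format("json").load("s3://my-data-lake/landing/orders/")

@dlt.table(comment="Silver: cleaned orders with valid amounts only.")
@dlt.expect_or_drop("positive_amount", "amount > 0")  # data quality expectation
def silver_orders():
    return dlt.read("bronze_orders").withColumn("amount", F.col("amount").cast("double"))
```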