AWS Glue DataBrew
/tldr: Visual, no-code data preparation and cleaning tool.
1. Core Concepts: No-Code Preparation
DataBrew allows data analysts and data scientists to clean, normalize, and transform raw data without writing any code. It provides an interactive, visual interface to explore data quality and apply pre-built transformations.
Key Components
- **Projects & Datasets:** A Project is where you interactively clean your data. The Dataset is the connection to your raw data source (S3, RDS, etc.).
- **Recipes (Transformations):** A Recipe is the ordered set of steps (transformations) you apply to the data. It's automatically recorded as you click and select steps.
- **Jobs:** Once the Recipe is finalized, you run a DataBrew Job. This job executes the Recipe on the entire dataset, saving the cleaned result to a target location (usually S3).
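Although DataBrew is positioned as a no-code tool, each of these components is also exposed through the AWS SDK. A minimal boto3 sketch for registering a Dataset; the bucket, key, and dataset name are placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Dataset: a pointer to the raw source data -- here a CSV object in S3.
# Bucket and key are placeholders; swap in your own locations.
databrew.create_dataset(
    Name="sales-raw",
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": "my-raw-bucket", "Key": "sales/raw.csv"}},
)
```

Projects, Recipes, and Jobs have analogous `create_*` calls, shown in the later sketches.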
2. 250+ Built-in Transformations
The power of DataBrew lies in its large library of pre-built operations (AWS documents more than 250 of them), which greatly reduces the time spent on common data-cleaning tasks (a scripted example follows the list below).
Common Transformations
- **Standardization:** Converting dates to a consistent format, or ensuring all text fields are uppercase/lowercase.
- **Imputation:** Filling in missing values (nulls) based on statistical methods (mean, median) or a specific constant.
- **Anomaly Detection:** Quickly identifying and filtering out outliers in the data.
- **Encoding:** One-hot encoding or label encoding categorical features for Machine Learning model readiness.
- **Pivoting/Unpivoting:** Restructuring columns and rows for different analytical needs.
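The same transformations are available as recipe steps through the API. Below is a sketch of a two-step recipe via boto3; the step structure (`Action` with `Operation` and `Parameters`) is the documented shape, but treat the specific operation names and parameter keys as illustrative and confirm them against the DataBrew recipe actions reference:

```python
import boto3

databrew = boto3.client("databrew")

# A recipe is an ordered list of steps; each step names an operation and its parameters.
# Operation names and parameter keys below are illustrative -- check the DataBrew
# recipe actions reference for the exact spelling your transformation needs.
databrew.create_recipe(
    Name="sales-cleanup-recipe",
    Steps=[
        {   # Standardization: force a text column to upper case.
            "Action": {
                "Operation": "UPPER_CASE",
                "Parameters": {"sourceColumn": "country"},
            }
        },
        {   # Imputation: fill missing numeric values with the column mean.
            "Action": {
                "Operation": "FILL_WITH_AVERAGE",
                "Parameters": {"sourceColumn": "order_total"},
            }
        },
    ],
)
```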
3. Integrations and Workflow
DataBrew integrates seamlessly with the rest of the AWS Data Lake and Analytics stack, making it a powerful step in any ETL/ELT pipeline.
Inputs and Outputs
- **Sources:** Reads data from S3, JDBC databases (RDS, Redshift, Snowflake), and the Glue Data Catalog.
- **Destinations:** Writes cleaned data to S3, often in optimized formats like Parquet or ORC, which are ready for querying by Athena.
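A recipe job ties these inputs and outputs together. A sketch that runs the recipe above against the full dataset and writes Parquet back to S3; the role ARN, bucket, and names are placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Recipe jobs run against a published recipe version, so publish first.
databrew.publish_recipe(Name="sales-cleanup-recipe")

# Job definition: dataset in, recipe applied, Parquet written to S3 for Athena.
# The role ARN must grant DataBrew read access to the source and write access
# to the output bucket; both values here are placeholders.
databrew.create_recipe_job(
    Name="sales-cleanup-job",
    DatasetName="sales-raw",
    RecipeReference={"Name": "sales-cleanup-recipe"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[
        {
            "Location": {"Bucket": "my-clean-bucket", "Key": "sales/cleaned/"},
            "Format": "PARQUET",
            "Overwrite": True,
        }
    ],
)
```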
Pricing Model
- **Interactive Sessions:** Billed per session (in 30-minute increments) while you work in the visual project console.
- **Jobs:** Billed per node-hour of compute used to run the Recipe against the full dataset.
DataBrew Workflow Summary
1. Create Dataset (Connect to S3)
2. Create Project (Start visual cleaning session)
3. Build Recipe (Apply the 250+ built-in transformations)
4. Run Job (Execute Recipe on full dataset)
5. Output (Cleaned data ready for ML/Analytics)
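Scripted end to end, the workflow reduces to starting the job and polling its run state. A small sketch, assuming the dataset, recipe, and job names from the earlier snippets:

```python
import time
import boto3

databrew = boto3.client("databrew")

# Kick off the recipe job defined earlier and wait for it to finish.
run_id = databrew.start_job_run(Name="sales-cleanup-job")["RunId"]

while True:
    state = databrew.describe_job_run(Name="sales-cleanup-job", RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)  # poll every 30 seconds

print(f"DataBrew job finished with state: {state}")
```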
DataBrew is the fastest way to get raw data ready for analysis and ML without writing code.