AWS TL;DR: Redshift

Amazon Redshift

/tldr: Petabyte-scale, fully managed cloud data warehouse for fast analytics.

Data Warehouse · Analytics · MPP Architecture

1. Architecture for Performance (MPP)

Redshift is built for OLAP (Online Analytical Processing), meaning complex queries over massive datasets. Two architectural choices keep those queries fast even at petabyte scale: columnar storage and Massively Parallel Processing (MPP).

Key Components

  • **Leader Node:** Handles incoming queries, develops the execution plan, and distributes it to the compute nodes. It also aggregates the final results.
  • **Compute Nodes:** Execute the distributed query plan segments in parallel. Each node has its own dedicated CPU, memory, and storage, and is divided into slices that each process a portion of the data.
  • **Columnar Storage:** Unlike traditional row-based databases, Redshift stores data by column. A query reads only the columns it references, which is far more efficient for analytical queries that need a few columns from a wide table.
  • **Compression:** Values within a column share a data type and tend to be similar, so columnar storage compresses well. Less data is read from disk, which reduces I/O per query.
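
To make the components above concrete, here is a minimal sketch. It assumes a hypothetical `sales` table with `region`, `sale_date`, and `amount` columns, and the date literal is arbitrary. The query references only three columns, so the remaining columns are not read, and `EXPLAIN` prints the plan that the leader node would hand to the compute nodes.

```sql
-- Assumes a hypothetical table: sales(sale_id, product_name, region, sale_date, amount).

-- Only region, sale_date, and amount are referenced, so blocks of other columns are not read.
-- EXPLAIN returns the plan built by the leader node without running the query.
EXPLAIN
SELECT region,
       SUM(amount) AS total_amount
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY region;

-- Ask Redshift to sample the table and suggest a compression encoding for each column.
-- This takes an exclusive table lock while it runs, so avoid it on busy tables.
ANALYZE COMPRESSION sales;
```

In the plan output, steps that mention distribution or broadcasting show data moving between compute nodes, which is often the first thing to look for when a query is slow.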

2. Redshift Spectrum and RA3 Nodes

Redshift has evolved to work directly with data stored outside its cluster in S3, making it a critical component of a modern Data Lakehouse architecture.

Data Access Features

  • **Redshift Spectrum:** Lets you query data in place in S3 (Parquet, ORC, JSON, CSV) with standard SQL through external tables, without loading it into the Redshift cluster. Spectrum queries are billed by the amount of data scanned, so columnar formats such as Parquet and partitioned layouts keep costs down.
  • **RA3 Node Types:** Use Redshift Managed Storage, which separates compute from storage. You can scale compute independently of storage: hot data is cached on local SSDs while the full dataset is persisted in S3.
  • **Concurrency Scaling:** Automatically adds transient cluster capacity when concurrent queries start to queue, which helps keep performance consistent under heavy load.
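
As a rough sketch of how Spectrum is typically set up, the statements below register an external schema, describe Parquet files that already sit in S3, and query them. Every name here (the schema `spectrum_demo`, the catalog database `demo_db`, the IAM role ARN, and the bucket path) is a hypothetical placeholder, not something from a real account.

```sql
-- All names below (schema, database, role ARN, bucket) are hypothetical placeholders.

-- Register an external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe Parquet files that already live in S3; no data is loaded into the cluster.
CREATE EXTERNAL TABLE spectrum_demo.sales_history (
    sale_id      INTEGER,
    product_name VARCHAR(255),
    region       VARCHAR(50),
    sale_date    DATE,
    amount       DECIMAL(8, 2)
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sales_history/';

-- Query the S3 data with ordinary SQL; selecting few columns keeps the scanned bytes low.
SELECT region,
       SUM(amount) AS total_amount
FROM spectrum_demo.sales_history
WHERE sale_date >= '2023-01-01'
GROUP BY region;
```

External tables can also be joined with regular Redshift tables in the same query, which is what makes the lakehouse pattern practical.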

3. The Serverless Option

Redshift Serverless provides the same powerful analytics capabilities without requiring you to manage clusters, nodes, or scaling.

Serverless Advantages

  • **Pay-per-Usage:** You pay for the compute capacity consumed while workloads are running, measured in Redshift Processing Units (RPUs). Storage is billed separately.
  • **Automatic Scaling:** Capacity scales up and down automatically with query volume and complexity, with no manual resizing.
  • **Simplified Management:** Eliminates the need for tasks like cluster provisioning, sizing, and manual scaling.
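
One way to see the pay-per-usage model in practice is to total the charged compute time per day. The sketch below assumes the `SYS_SERVERLESS_USAGE` system view exposes `start_time` and `charged_seconds` columns, so verify the column names against the system view reference for your environment before relying on it.

```sql
-- Assumption: SYS_SERVERLESS_USAGE has start_time and charged_seconds columns.
-- Sums the charged compute time per day and converts seconds to hours.
SELECT TRUNC(start_time)             AS usage_day,
       SUM(charged_seconds) / 3600.0 AS charged_hours
FROM sys_serverless_usage
GROUP BY 1
ORDER BY 1;
```

Multiplying `charged_hours` by your region's RPU-hour price should give an approximate daily compute cost. The price is deliberately left out here because it varies by region.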

SQL Dialect Example

Redshift's SQL dialect is based on PostgreSQL, with warehouse-specific extensions such as distribution keys and sort keys.

-- Example: Creating a table with Distribution Key (DISTKEY) and Sort Key (SORTKEY)
-- These are critical for performance and data distribution optimization
CREATE TABLE sales (
    sale_id INTEGER,
    product_name VARCHAR(255),
    region VARCHAR(50),
    sale_date DATE,
    amount DECIMAL(8, 2)
)
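-- Note: a low-cardinality DISTKEY such as 'region' can spread rows unevenly; it is used here for illustration
DISTKEY (region)     -- Rows with the same 'region' value are stored on the same slice
SORTKEY (sale_date); -- Rows are stored in 'sale_date' order, which speeds up date-range scans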
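
As a short follow-up sketch against the `sales` table defined above, the first query filters on the sort key and the second checks how evenly the table is distributed. The date range is arbitrary and chosen only for illustration.

```sql
-- Range filter on the sort key: blocks whose min/max sale_date fall outside the range can be skipped.
SELECT region,
       SUM(amount) AS total_amount
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY region;

-- Inspect distribution and sort health in the SVV_TABLE_INFO system view.
-- skew_rows: ratio of rows on the fullest slice to rows on the emptiest slice.
-- unsorted:  percentage of rows not yet stored in sort-key order.
SELECT "table", diststyle, sortkey1, skew_rows, unsorted
FROM svv_table_info
WHERE "table" = 'sales';
```

A `skew_rows` value far above 1 suggests that the distribution key concentrates rows on a few slices, in which case a higher-cardinality column or a different distribution style may be worth trying.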
            

Redshift is AWS's primary service for large-scale, high-performance analytical processing.

AWS Fundamentals Series: Redshift