ELT Made Easy: A Step-by-Step Guide to Building a Scalable Data Warehouse on AWS

Modern data architectures are shifting from traditional ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) pipelines. This shift leverages the processing power of cloud data warehouses to transform data after loading, offering greater flexibility and scalability. In this guide, you'll learn how to build an ELT pipeline on AWS using Amazon Redshift, Amazon Athena, and AWS Glue.

Understanding ELT: Why It Matters

ELT inverts the traditional ETL approach by loading raw data into the target system first, then transforming it using the compute power of the data warehouse itself. This pattern has gained popularity with the rise of cloud data warehouses that offer massively parallel processing.

ETL vs ELT: Key Differences

Aspect             | ETL                                      | ELT
Transform Location | Staging server/ETL tool                  | Target data warehouse
Data Movement      | Transformed data moves to warehouse      | Raw data moves to warehouse
Flexibility        | Schema changes require pipeline updates  | Schema-on-read enables flexibility
Compute            | Dedicated ETL infrastructure             | Leverage warehouse compute
Historical Data    | Often discarded after transformation     | Raw data preserved for reprocessing

Benefits of ELT

  • Faster data availability: Raw data lands in the warehouse immediately without waiting for transformations
  • Flexibility: Transform data multiple ways without re-extracting from source systems
  • Cost efficiency: Use the scalable compute of your data warehouse instead of dedicated ETL servers
  • Data preservation: Keep raw data for audit trails, debugging, or future transformations
  • Simplified pipeline: Fewer moving parts between source and destination

Architecture Overview

Here's the ELT architecture you'll build:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Source    │────▶│   Amazon    │────▶│   AWS       │────▶│   Amazon    │
│   Systems   │     │   S3        │     │   Glue      │     │   Redshift  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                         │                                        │
                         │                                        │
                    ┌────▼────┐                              ┌────▼────┐
                    │  Raw    │                              │Transform│
                    │  Data   │                              │  in DW  │
                    │  Lake   │                              │  (SQL)  │
                    └────┬────┘                              └─────────┘
                         │
                    ┌────▼────┐
                    │ Amazon  │
                    │ Athena  │
                    │ Queries │
                    └─────────┘

In this architecture:

  1. Amazon S3 stores raw data as a data lake
  2. AWS Glue catalogs data and handles the Extract and Load phases
  3. Amazon Redshift serves as the data warehouse where transformations occur
  4. Amazon Athena provides serverless SQL queries directly on S3 data

Prerequisites

Before you begin, ensure you have:

  • An AWS account with appropriate IAM permissions
  • AWS CLI installed and configured
  • Basic knowledge of SQL
  • Understanding of AWS networking (VPC, subnets, security groups)

Step 1: Setting Up Amazon S3 as Your Data Lake

Amazon S3 serves as the foundation of your ELT pipeline, both storing raw ingested data and acting as a staging area for Redshift loads.

Create S3 Buckets

Create buckets with a clear naming convention:

# Create bucket for raw/landing data
aws s3 mb s3://your-company-elt-raw --region us-east-1

# Create bucket for processed/curated data
aws s3 mb s3://your-company-elt-processed --region us-east-1

# Create bucket for Glue scripts and temporary files
aws s3 mb s3://your-company-elt-artifacts --region us-east-1

Organize Data with Partitions

Structure your raw data using partitions for efficient querying:

s3://your-company-elt-raw/
├── sales/
│   ├── year=2025/
│   │   ├── month=01/
│   │   │   ├── day=15/
│   │   │   │   └── sales_20250115.parquet
│   │   │   └── day=16/
│   │   │       └── sales_20250116.parquet

Upload Sample Data

Create sample sales data in CSV format (the Parquet layout above is what you would use at scale; CSV keeps this walkthrough easy to inspect):

order_id,customer_id,product_id,quantity,unit_price,order_date,region
1001,C001,P100,2,29.99,2025-01-15,us-east
1002,C002,P101,1,49.99,2025-01-15,us-west
1003,C001,P102,3,19.99,2025-01-16,us-east
1004,C003,P100,1,29.99,2025-01-16,eu-west
1005,C002,P103,2,39.99,2025-01-17,us-west

Upload to S3:

aws s3 cp sales_data.csv s3://your-company-elt-raw/sales/year=2025/month=01/
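
If you want the raw zone to match the day-level Parquet layout shown earlier, a small script can convert the sample CSV and upload one Parquet file per day. This is an optional sketch, assuming pandas and pyarrow are installed locally and the bucket name matches the one created above:

import boto3
import pandas as pd  # to_parquet() below requires pyarrow (or fastparquet)

df = pd.read_csv("sales_data.csv", parse_dates=["order_date"])
s3 = boto3.client("s3")

# Write one Parquet file per order day into the year=/month=/day= layout
for day, day_df in df.groupby(df["order_date"].dt.day):
    filename = f"sales_202501{day:02d}.parquet"
    day_df.to_parquet(filename, index=False)
    key = f"sales/year=2025/month=01/day={day:02d}/{filename}"
    s3.upload_file(filename, "your-company-elt-raw", key)
    print(f"uploaded s3://your-company-elt-raw/{key}")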

Step 2: Setting Up Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service that uses columnar storage and parallel query execution for fast analytical performance.

Option A: Redshift Serverless

Redshift Serverless automatically provisions and scales data warehouse capacity, which suits variable or unpredictable workloads. According to AWS documentation, you pay only for the compute used while the warehouse is active, with no charges when it is idle.

# Create a namespace (logical container for database objects)
aws redshift-serverless create-namespace \
    --namespace-name elt-warehouse \
    --admin-username admin \
    --admin-user-password 'YourSecurePassword123!' \
    --db-name analytics

# Create a workgroup (compute resources)
aws redshift-serverless create-workgroup \
    --workgroup-name elt-workgroup \
    --namespace-name elt-warehouse \
    --base-capacity 32

Option B: Provisioned Cluster

For predictable workloads with consistent usage patterns:

# Create cluster subnet group
aws redshift create-cluster-subnet-group \
    --cluster-subnet-group-name elt-subnet-group \
    --description "Subnet group for ELT warehouse" \
    --subnet-ids subnet-xxxxxxxx subnet-yyyyyyyy

# Create the cluster
aws redshift create-cluster \
    --cluster-identifier elt-data-warehouse \
    --node-type ra3.xlplus \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'YourSecurePassword123!' \
    --db-name analytics \
    --cluster-subnet-group-name elt-subnet-group \
    --iam-roles arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole

Create IAM Role for Redshift

Redshift needs permissions to access S3 and the Glue Data Catalog:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-company-elt-*",
                "arn:aws:s3:::your-company-elt-*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetDatabase",
                "glue:GetDatabases"
            ],
            "Resource": "*"
        }
    ]
}
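
You can create the role in the console or with the CLI; the sketch below uses boto3 and assumes the policy document above is saved locally as redshift-elt-policy.json (a hypothetical filename). The role name matches the RedshiftSpectrumRole ARN referenced in the cluster and external schema commands:

import json
import boto3

iam = boto3.client("iam")

# Trust policy so the Redshift service can assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="RedshiftSpectrumRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Lets Redshift read the ELT buckets and the Glue Data Catalog",
)

# Attach the S3 + Glue policy shown above as an inline policy
with open("redshift-elt-policy.json") as f:
    iam.put_role_policy(
        RoleName="RedshiftSpectrumRole",
        PolicyName="EltS3AndGlueAccess",
        PolicyDocument=f.read(),
    )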

Create the Raw Data Landing Schema

In ELT, you first create tables to hold raw data before transformation:

-- Create schema for raw/staging data
CREATE SCHEMA IF NOT EXISTS raw_data;

-- Create staging table that mirrors source structure
CREATE TABLE raw_data.sales_landing (
    order_id INTEGER,
    customer_id VARCHAR(20),
    product_id VARCHAR(20),
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    order_date DATE,
    region VARCHAR(50),
    loaded_at TIMESTAMP DEFAULT GETDATE()
)
DISTSTYLE AUTO;

-- Create schema for transformed/analytics data
CREATE SCHEMA IF NOT EXISTS analytics;

-- Create fact table for transformed orders
CREATE TABLE analytics.fact_orders (
    order_key BIGINT IDENTITY(1,1),
    order_id INTEGER,
    customer_id VARCHAR(20),
    product_id VARCHAR(20),
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2),
    order_date DATE,
    order_year INTEGER,
    order_month INTEGER,
    region VARCHAR(50),
    loaded_at TIMESTAMP DEFAULT GETDATE()
)
DISTKEY(order_date)
SORTKEY(order_date, region);
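
If you don't have a SQL client connected yet, the Redshift Data API can run this DDL from a script. A minimal sketch, assuming Redshift Serverless (Option A) and that the statements above are saved as create_schemas.sql (a hypothetical filename); for a provisioned cluster, pass ClusterIdentifier and DbUser instead of WorkgroupName:

import time
import boto3

client = boto3.client("redshift-data")

# Naive split on ";" is fine here because the DDL contains no semicolons inside literals
statements = [s.strip() for s in open("create_schemas.sql").read().split(";") if s.strip()]

resp = client.batch_execute_statement(
    WorkgroupName="elt-workgroup",
    Database="analytics",
    Sqls=statements,
)

# Poll until the batch finishes
while True:
    status = client.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        print("DDL batch status:", status)
        break
    time.sleep(2)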

Step 3: Setting Up AWS Glue for Data Cataloging

AWS Glue is a serverless data integration service that discovers, prepares, and combines data from multiple sources. In an ELT pipeline, Glue handles the Extract and Load phases while cataloging data for both Athena and Redshift Spectrum.

Create a Glue Database

aws glue create-database \
    --database-input '{
        "Name": "elt_data_catalog",
        "Description": "Data catalog for ELT pipeline"
    }'

Create IAM Role for Glue

# Create the role
aws iam create-role \
    --role-name AWSGlueServiceRole-ELT \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }'

# Attach managed policy
aws iam attach-role-policy \
    --role-name AWSGlueServiceRole-ELT \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Attach S3 access policy
aws iam put-role-policy \
    --role-name AWSGlueServiceRole-ELT \
    --policy-name S3Access \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::your-company-elt-*",
                "arn:aws:s3:::your-company-elt-*/*"
            ]
        }]
    }'

Create a Glue Crawler

A crawler automatically discovers schema and creates table definitions in the Glue Data Catalog:

aws glue create-crawler \
    --name elt-sales-crawler \
    --role AWSGlueServiceRole-ELT \
    --database-name elt_data_catalog \
    --targets '{
        "S3Targets": [{
            "Path": "s3://your-company-elt-raw/sales/",
            "Exclusions": ["**/_temporary/**"]
        }]
    }' \
    --schema-change-policy '{
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    }'

# Run the crawler
aws glue start-crawler --name elt-sales-crawler
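
The crawler runs asynchronously. One way to wait for it from a script is to poll with boto3 until it returns to the READY state, then print its metrics:

import time
import boto3

glue = boto3.client("glue")

while True:
    state = glue.get_crawler(Name="elt-sales-crawler")["Crawler"]["State"]
    if state == "READY":  # READY means the crawler is no longer running
        metrics = glue.get_crawler_metrics(CrawlerNameList=["elt-sales-crawler"])
        print(metrics["CrawlerMetricsList"][0])
        break
    print("Crawler state:", state)  # RUNNING or STOPPING
    time.sleep(15)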

Create Glue Connection to Redshift

aws glue create-connection \
    --connection-input '{
        "Name": "redshift-elt-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://elt-data-warehouse.xxxxx.us-east-1.redshift.amazonaws.com:5439/analytics",
            "USERNAME": "admin",
            "PASSWORD": "YourSecurePassword123!"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxx"],
            "AvailabilityZone": "us-east-1a"
        }
    }'

Step 4: Creating the ELT Pipeline with Glue

In an ELT pattern, the Glue job focuses on extracting data from sources and loading it into Redshift with minimal transformation. Heavy transformations happen in Redshift using SQL.

Create the Extract-Load Script

This script handles the E and L of ELT:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# Initialize Glue context. Only JOB_NAME is resolved here because the source and
# target tables are referenced directly below; add arguments to getResolvedOptions
# if you parameterize the job.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Extract: Read raw data from S3 via Glue Data Catalog
raw_data = glueContext.create_dynamic_frame.from_catalog(
    database="elt_data_catalog",
    table_name="sales",
    transformation_ctx="raw_data"
)

# Minimal transformation: Add load timestamp and basic type casting
# Heavy transformations will happen in Redshift
from pyspark.sql.functions import current_timestamp, col

df = raw_data.toDF()
df = df.withColumn("loaded_at", current_timestamp())

# Convert back to DynamicFrame
load_ready = DynamicFrame.fromDF(df, glueContext, "load_ready")

# Apply basic mappings for Redshift compatibility
mapped_data = ApplyMapping.apply(
    frame=load_ready,
    mappings=[
        ("order_id", "long", "order_id", "int"),
        ("customer_id", "string", "customer_id", "string"),
        ("product_id", "string", "product_id", "string"),
        ("quantity", "long", "quantity", "int"),
        ("unit_price", "double", "unit_price", "decimal"),
        ("order_date", "string", "order_date", "date"),
        ("region", "string", "region", "string"),
        ("loaded_at", "timestamp", "loaded_at", "timestamp")
    ]
)

# Load: Write to Redshift staging table
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped_data,
    catalog_connection="redshift-elt-connection",
    connection_options={
        "dbtable": "raw_data.sales_landing",
        "database": "analytics"
    },
    redshift_tmp_dir="s3://your-company-elt-artifacts/temp/"
)

job.commit()

Create the Glue Job

# Upload the script to S3
aws s3 cp elt_load_job.py s3://your-company-elt-artifacts/scripts/

# Create the Glue job
aws glue create-job \
    --name elt-sales-load \
    --role AWSGlueServiceRole-ELT \
    --command '{
        "Name": "glueetl",
        "ScriptLocation": "s3://your-company-elt-artifacts/scripts/elt_load_job.py",
        "PythonVersion": "3"
    }' \
    --default-arguments '{
        "--job-language": "python",
        "--TempDir": "s3://your-company-elt-artifacts/temp/",
        "--enable-metrics": "true",
        "--enable-continuous-cloudwatch-log": "true"
    }' \
    --connections '{"Connections": ["redshift-elt-connection"]}' \
    --glue-version "4.0" \
    --number-of-workers 2 \
    --worker-type "G.1X"
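
To test the job outside of any schedule, start a run and poll it until it reaches a terminal state; a minimal boto3 sketch:

import time
import boto3

glue = boto3.client("glue")

run_id = glue.start_job_run(JobName="elt-sales-load")["JobRunId"]

while True:
    state = glue.get_job_run(JobName="elt-sales-load", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "ERROR", "STOPPED", "TIMEOUT"):
        print("Job run finished with state:", state)
        break
    time.sleep(30)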

Transform in Redshift (The T in ELT)

After raw data lands in Redshift, perform transformations using SQL:

-- Transform: Load from staging to analytics with calculations
INSERT INTO analytics.fact_orders (
    order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    total_amount,
    order_date,
    order_year,
    order_month,
    region,
    loaded_at
)
SELECT
    order_id,
    customer_id,
    product_id,
    quantity,
    unit_price,
    quantity * unit_price AS total_amount,
    order_date,
    EXTRACT(YEAR FROM order_date) AS order_year,
    EXTRACT(MONTH FROM order_date) AS order_month,
    region,
    loaded_at
FROM raw_data.sales_landing
WHERE loaded_at > (
    SELECT COALESCE(MAX(loaded_at), '1900-01-01')
    FROM analytics.fact_orders
);

-- Create aggregated views for common queries
CREATE OR REPLACE VIEW analytics.daily_sales_summary AS
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(quantity) AS total_units,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS avg_order_value
FROM analytics.fact_orders
GROUP BY order_date, region;

-- Create materialized view for frequently accessed aggregations
CREATE MATERIALIZED VIEW analytics.monthly_regional_sales AS
SELECT
    order_year,
    order_month,
    region,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM analytics.fact_orders
GROUP BY order_year, order_month, region;
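
These statements are typically run right after the load completes, either by a scheduler or a stored procedure. As one option, the sketch below submits the incremental INSERT through the Redshift Data API, assuming Redshift Serverless and that the INSERT statement above is saved as transform_fact_orders.sql (a hypothetical filename):

import time
import boto3

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="elt-workgroup",
    Database="analytics",
    Sql=open("transform_fact_orders.sql").read(),
)

# Wait for the transform to finish and surface any error message
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        print(desc["Status"], desc.get("Error", ""))
        break
    time.sleep(5)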

Step 5: Setting Up Amazon Athena

Amazon Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. According to AWS documentation, you pay only for the queries you run, based on the amount of data scanned.

Configure Athena

Set up query result location:

# Create bucket for Athena query results
aws s3 mb s3://your-company-athena-results --region us-east-1

# Configure Athena workgroup
aws athena create-work-group \
    --name elt-workgroup \
    --configuration '{
        "ResultConfiguration": {
            "OutputLocation": "s3://your-company-athena-results/"
        },
        "EnforceWorkGroupConfiguration": true,
        "PublishCloudWatchMetricsEnabled": true
    }'

Query Raw Data with Athena

Athena can query the same data cataloged by Glue, enabling analysis of raw data before it's loaded to Redshift:

-- Query raw data directly in S3
SELECT
    order_date,
    region,
    COUNT(*) as order_count,
    SUM(CAST(quantity AS INTEGER) * CAST(unit_price AS DECIMAL(10,2))) as revenue
FROM elt_data_catalog.sales
WHERE year = '2025' AND month = '01'
GROUP BY order_date, region
ORDER BY order_date;
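
The same query can be run programmatically against the workgroup created above; a minimal boto3 sketch:

import time
import boto3

athena = boto3.client("athena")

query = """
SELECT
    order_date,
    region,
    COUNT(*) AS order_count,
    SUM(CAST(quantity AS INTEGER) * CAST(unit_price AS DECIMAL(10,2))) AS revenue
FROM elt_data_catalog.sales
WHERE year = '2025' AND month = '01'
GROUP BY order_date, region
ORDER BY order_date
"""

qid = athena.start_query_execution(QueryString=query, WorkGroup="elt-workgroup")["QueryExecutionId"]

# Wait for the query, then print the result rows (the first row is the header)
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])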

Validate Data with Athena Queries

Athena also works as a quick validation layer for comparing the raw data in S3 against what has landed in Redshift. The S3 half is a plain Athena query. The Redshift half requires the Athena federated query connector for Redshift, registered as its own data source (shown here under the hypothetical catalog name redshift_dw); alternatively, run that half of the comparison in Redshift itself:

-- Compare raw data in S3 with data loaded into Redshift
SELECT
    'S3 Raw' AS source,
    COUNT(*) AS record_count,
    SUM(CAST(quantity AS INTEGER) * CAST(unit_price AS DECIMAL(10,2))) AS total_amount
FROM elt_data_catalog.sales

UNION ALL

-- Requires the Athena-Redshift federated connector registered as the redshift_dw catalog
SELECT
    'Redshift Loaded' AS source,
    COUNT(*) AS record_count,
    SUM(total_amount) AS total_amount
FROM redshift_dw.analytics.fact_orders;

Step 6: Using Redshift Spectrum for Unified Queries

Redshift Spectrum extends your data warehouse to query data directly in S3 without loading it. This enables a unified query layer across your data lake and warehouse.

Create External Schema

-- Create external schema pointing to Glue Data Catalog
CREATE EXTERNAL SCHEMA IF NOT EXISTS data_lake
FROM DATA CATALOG
DATABASE 'elt_data_catalog'
IAM_ROLE 'arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
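
To confirm that Spectrum can see the cataloged tables, list them from the SVV_EXTERNAL_TABLES system view. A minimal sketch via the Redshift Data API, assuming Redshift Serverless:

import time
import boto3

client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="elt-workgroup",
    Database="analytics",
    Sql="SELECT schemaname, tablename, location FROM svv_external_tables WHERE schemaname = 'data_lake'",
)

# Wait for the statement, then fetch the rows (results are only available for FINISHED queries)
while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

for record in client.get_statement_result(Id=resp["Id"])["Records"]:
    print([field.get("stringValue") for field in record])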

Query Across Data Lake and Warehouse

-- Join data lake (S3) with data warehouse (Redshift)
SELECT
    dl.region,
    dl.order_date,
    COUNT(dl.order_id) as raw_orders,
    dw.loaded_orders,
    COUNT(dl.order_id) - dw.loaded_orders as pending_load
FROM data_lake.sales dl
LEFT JOIN (
    SELECT
        region,
        order_date,
        COUNT(*) as loaded_orders
    FROM analytics.fact_orders
    GROUP BY region, order_date
) dw ON dl.region = dw.region AND dl.order_date = dw.order_date
GROUP BY dl.region, dl.order_date, dw.loaded_orders
HAVING COUNT(dl.order_id) <> COALESCE(dw.loaded_orders, 0);

Step 7: Orchestrating the Pipeline

Automate the complete ELT workflow using Glue workflows and triggers.

Create Glue Workflow

# Create the workflow
aws glue create-workflow \
    --name elt-sales-pipeline \
    --description "Complete ELT pipeline for sales data"

# Create trigger to start the crawler on a daily schedule (06:00 UTC);
# --start-on-creation activates the trigger immediately
aws glue create-trigger \
    --name elt-daily-trigger \
    --workflow-name elt-sales-pipeline \
    --type SCHEDULED \
    --schedule "cron(0 6 * * ? *)" \
    --start-on-creation \
    --actions '[{"CrawlerName": "elt-sales-crawler"}]'

# Create trigger to run the load job after the crawler succeeds
aws glue create-trigger \
    --name elt-load-trigger \
    --workflow-name elt-sales-pipeline \
    --type CONDITIONAL \
    --start-on-creation \
    --predicate '{
        "Logical": "ANY",
        "Conditions": [{
            "CrawlerName": "elt-sales-crawler",
            "CrawlState": "SUCCEEDED",
            "LogicalOperator": "EQUALS"
        }]
    }' \
    --actions '[{"JobName": "elt-sales-load"}]'

Set Up Monitoring with CloudWatch

# Create alarm for job failures (Glue publishes job metrics with
# JobName, JobRunId, and Type dimensions)
aws cloudwatch put-metric-alarm \
    --alarm-name "ELT-Job-Failure" \
    --metric-name "glue.driver.aggregate.numFailedTasks" \
    --namespace "Glue" \
    --statistic Sum \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 1 \
    --dimensions Name=JobName,Value=elt-sales-load Name=JobRunId,Value=ALL Name=Type,Value=count \
    --alarm-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:elt-alerts

# Create alarm for data freshness: fire if the job reported no elapsed-time
# metric in the last 24 hours (missing data is treated as breaching)
aws cloudwatch put-metric-alarm \
    --alarm-name "ELT-Data-Freshness" \
    --metric-name "glue.driver.aggregate.elapsedTime" \
    --namespace "Glue" \
    --statistic Maximum \
    --period 86400 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --evaluation-periods 1 \
    --dimensions Name=JobName,Value=elt-sales-load Name=JobRunId,Value=ALL Name=Type,Value=count \
    --treat-missing-data breaching \
    --alarm-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:elt-alerts

Best Practices for ELT on AWS

Data Organization

  • Partition strategically: Use date-based partitions (year/month/day) so queries and Glue jobs can prune partitions and push predicates down, minimizing the data scanned
  • Use columnar formats: Store data in Parquet or ORC format for better compression and query performance
  • Implement data zones: Maintain raw, staging, and curated data zones for data quality control

Performance Optimization

  • Use COPY for bulk loads: According to AWS documentation, the COPY command is the most efficient way to load data into Redshift (see the sketch after this list)
  • Choose appropriate distribution keys: Distribute large fact tables by commonly joined columns
  • Leverage materialized views: Pre-compute expensive aggregations that are queried frequently
  • Partition pruning: Structure Glue jobs to pass partition predicates that minimize S3 scans
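
As a sketch of the COPY bullet above: the statement below bulk-loads the sample CSV from Step 1 into the staging table through the Redshift Data API, reusing RedshiftSpectrumRole (this assumes the role's policy covers the raw bucket, as the example policy does, and that the role is associated with your warehouse):

import boto3

# COPY runs inside Redshift; the IAM role, not your local credentials, reads the S3 objects
copy_sql = """
COPY raw_data.sales_landing (order_id, customer_id, product_id, quantity,
                             unit_price, order_date, region)
FROM 's3://your-company-elt-raw/sales/year=2025/month=01/sales_data.csv'
IAM_ROLE 'arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole'
FORMAT AS CSV
IGNOREHEADER 1
"""

boto3.client("redshift-data").execute_statement(
    WorkgroupName="elt-workgroup",
    Database="analytics",
    Sql=copy_sql,
)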

Cost Management

  • Use Redshift Serverless: For variable or unpredictable workloads, serverless provides automatic scaling with no charges when idle
  • Optimize Athena queries: Use partitions and columnar formats to reduce data scanned
  • Implement lifecycle policies: Archive or delete raw data once it has been processed and its retention period has passed (see the sketch after this list)
  • Monitor Glue DPU usage: Right-size Glue jobs based on actual data volumes
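
As a sketch of the lifecycle bullet above, this call transitions raw sales objects to S3 Standard-IA after 90 days and expires them after two years (example retention values only; align them with your own retention policy):

import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="your-company-elt-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-out-raw-sales",
            "Status": "Enabled",
            "Filter": {"Prefix": "sales/"},
            "Transitions": [{"Days": 90, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 730},
        }]
    },
)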

Security

  • Use IAM roles: Avoid embedding credentials; use IAM roles for service-to-service authentication
  • Enable encryption: Use server-side encryption for S3 and encryption at rest for Redshift
  • Implement VPC endpoints: Keep traffic within AWS network using VPC endpoints for S3 and Glue
  • Audit access: Enable CloudTrail and Redshift audit logging for compliance

Conclusion

You've built a complete ELT pipeline on AWS that extracts data from sources, loads it into a Redshift data warehouse, and transforms it using SQL. This architecture provides the flexibility to transform data multiple ways, preserves raw data for reprocessing, and leverages the scalable compute of cloud services.

Key components in this architecture:

  • Amazon S3 as the data lake foundation
  • AWS Glue for data cataloging and the extract-load process
  • Amazon Redshift for data warehousing and SQL-based transformations
  • Amazon Athena for serverless queries on raw data
  • Redshift Spectrum for unified queries across the data lake and warehouse

As your data platform matures, consider extending this architecture with:

  • AWS Lake Formation for centralized data governance and fine-grained access control
  • Amazon QuickSight for business intelligence dashboards
  • Apache Airflow (via Amazon MWAA) for complex workflow orchestration
  • dbt for version-controlled transformation logic

Sources

  1. Amazon Redshift Data Loading Best Practices - AWS official documentation on efficient data loading strategies
  2. What is Amazon Athena? - AWS documentation on Athena's serverless query capabilities
  3. What is AWS Glue? - AWS Glue overview covering ETL/ELT capabilities and data catalog features
  4. Getting Started with Amazon Redshift Spectrum - Guide to querying S3 data directly from Redshift
  5. Build an ETL Service Pipeline from S3 to Redshift using AWS Glue - AWS prescriptive guidance on incremental data pipelines