Building a Scalable Data Warehouse on AWS: A Step-by-Step Guide

Data warehousing has become essential for organizations that need to analyze large volumes of data from multiple sources. Unlike traditional databases designed for transactional workloads, data warehouses are optimized for analytical queries across historical data. In this guide, you'll learn how to build a scalable data warehouse on AWS using Amazon Redshift and AWS Glue.

What is a Data Warehouse and Why Use One?

A data warehouse is a centralized repository designed for storing and analyzing large amounts of structured data from multiple sources. While traditional relational databases handle day-to-day transactions (OLTP), data warehouses are optimized for analytical processing (OLAP).

Key differences between traditional databases and data warehouses:

Aspect          Traditional Database      Data Warehouse
Purpose         Transactional operations  Analytical queries
Data structure  Normalized                Denormalized (star/snowflake schema)
Query type      Simple, frequent queries  Complex, aggregated queries
Data volume     Current operational data  Historical data from multiple sources
Optimization    Write-optimized           Read-optimized

Benefits of using a data warehouse:

  • Consolidated data: Combine data from multiple sources into a single source of truth
  • Improved query performance: Columnar storage and optimized indexes for analytical workloads
  • Historical analysis: Store and analyze trends over time
  • Separation of concerns: Keep analytical workloads separate from production databases

Architecture Overview

Before diving into the implementation, here's the architecture you'll build:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Source    │────▶│   Amazon    │────▶│   AWS       │────▶│   Amazon    │
│   Data      │     │   S3        │     │   Glue      │     │   Redshift  │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                         │                    │
                         │                    │
                         ▼                    ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Raw       │     │   AWS Glue  │
                    │   Data      │     │   Data      │
                    │   Storage   │     │   Catalog   │
                    └─────────────┘     └─────────────┘

This architecture follows AWS best practices for data warehousing workflows, where data flows from various sources into S3, gets transformed by AWS Glue, and lands in Amazon Redshift for analysis.

Prerequisites

Before you begin, ensure you have:

  • An AWS account with appropriate permissions
  • AWS CLI installed and configured
  • Basic knowledge of SQL and Python
  • Understanding of networking concepts (VPC, subnets, security groups)

Step 1: Setting Up Amazon S3 for Data Storage

Amazon S3 serves as the staging area for your raw data before transformation and loading into Redshift.

Create S3 Buckets

Create separate buckets for raw and processed data:

# Create bucket for raw data
aws s3 mb s3://your-company-data-warehouse-raw --region us-east-1

# Create bucket for processed/transformed data
aws s3 mb s3://your-company-data-warehouse-processed --region us-east-1

Upload Sample Data

For this guide, create a sample CSV file representing sales data:

order_id,customer_id,product_id,quantity,unit_price,order_date
1001,C001,P100,2,29.99,2025-01-15
1002,C002,P101,1,49.99,2025-01-15
1003,C001,P102,3,19.99,2025-01-16
1004,C003,P100,1,29.99,2025-01-16
1005,C002,P103,2,39.99,2025-01-17

Upload to S3:

aws s3 cp sales_data.csv s3://your-company-data-warehouse-raw/sales/

Step 2: Setting Up Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service. You can choose between provisioned clusters or Redshift Serverless depending on your workload patterns.

Option A: Redshift Serverless

Redshift Serverless automatically provisions and scales capacity based on your workload, making it ideal for variable or unpredictable usage patterns:

# Create a Redshift Serverless namespace
aws redshift-serverless create-namespace \
    --namespace-name my-data-warehouse \
    --admin-username admin \
    --admin-user-password 'YourSecurePassword123!' \
    --db-name analytics

# Create a workgroup
aws redshift-serverless create-workgroup \
    --workgroup-name my-workgroup \
    --namespace-name my-data-warehouse \
    --base-capacity 32
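
Provisioning takes a few minutes. If you want to script the wait, here is a minimal boto3 sketch (not part of the CLI flow above, and assuming credentials and region matching your CLI configuration) that polls the workgroup until it reports AVAILABLE and prints its endpoint:

# Poll the Redshift Serverless workgroup until it is AVAILABLE, then print
# the endpoint to connect to. Names match the CLI commands above.
import time
import boto3

client = boto3.client("redshift-serverless", region_name="us-east-1")

while True:
    workgroup = client.get_workgroup(workgroupName="my-workgroup")["workgroup"]
    status = workgroup["status"]
    print(f"Workgroup status: {status}")
    if status == "AVAILABLE":
        endpoint = workgroup["endpoint"]
        print(f"Endpoint: {endpoint['address']}:{endpoint['port']}")
        break
    time.sleep(30)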

Option B: Provisioned Cluster

For predictable workloads with consistent usage, a provisioned cluster gives you more control over node types and sizing.

Create a VPC and Subnet Group

# Create a cluster subnet group (assumes you have a VPC with subnets)
aws redshift create-cluster-subnet-group \
    --cluster-subnet-group-name my-redshift-subnet-group \
    --description "Subnet group for Redshift cluster" \
    --subnet-ids subnet-xxxxxxxx subnet-yyyyyyyy

Create the Redshift Cluster

aws redshift create-cluster \
    --cluster-identifier my-data-warehouse \
    --node-type dc2.large \
    --number-of-nodes 2 \
    --master-username admin \
    --master-user-password 'YourSecurePassword123!' \
    --db-name analytics \
    --cluster-subnet-group-name my-redshift-subnet-group \
    --publicly-accessible \
    --iam-roles arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftS3AccessRole
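
Cluster creation takes several minutes. A minimal boto3 sketch that blocks until the cluster is ready, using the built-in cluster_available waiter and the identifier from the command above, could look like this:

# Wait for the provisioned cluster to become available, then print its endpoint.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.get_waiter("cluster_available").wait(ClusterIdentifier="my-data-warehouse")

cluster = redshift.describe_clusters(
    ClusterIdentifier="my-data-warehouse"
)["Clusters"][0]
print(f"Endpoint: {cluster['Endpoint']['Address']}:{cluster['Endpoint']['Port']}")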

Create IAM Role for Redshift

Redshift needs permission to read your staging data in S3. The role referenced by the --iam-roles flag above must trust the redshift.amazonaws.com service and carry a permissions policy such as:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-company-data-warehouse-*",
                "arn:aws:s3:::your-company-data-warehouse-*/*"
            ]
        }
    ]
}
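
The JSON above is only the permissions policy; the role itself also needs a trust policy so the Redshift service can assume it. A hedged boto3 sketch that creates the RedshiftS3AccessRole referenced earlier and attaches this policy inline (the inline policy name is an arbitrary choice) might look like this:

# Create the IAM role Redshift assumes to read from S3: a trust policy for
# redshift.amazonaws.com plus the S3 read permissions shown above, attached inline.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:GetBucketLocation", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::your-company-data-warehouse-*",
            "arn:aws:s3:::your-company-data-warehouse-*/*",
        ],
    }],
}

iam.create_role(
    RoleName="RedshiftS3AccessRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="RedshiftS3AccessRole",
    PolicyName="RedshiftS3ReadAccess",   # arbitrary inline policy name
    PolicyDocument=json.dumps(s3_read_policy),
)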

Create the Data Warehouse Schema

Connect to your Redshift cluster and create the schema:

-- Create schema for analytics
CREATE SCHEMA IF NOT EXISTS analytics;

-- Create dimension table for customers
CREATE TABLE analytics.dim_customers (
    customer_id VARCHAR(10) PRIMARY KEY,
    customer_name VARCHAR(100),
    email VARCHAR(255),
    created_at TIMESTAMP DEFAULT GETDATE()
);

-- Create dimension table for products
CREATE TABLE analytics.dim_products (
    product_id VARCHAR(10) PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50),
    created_at TIMESTAMP DEFAULT GETDATE()
);

-- Create fact table for orders
CREATE TABLE analytics.fact_orders (
    order_id INTEGER PRIMARY KEY,
    customer_id VARCHAR(10) REFERENCES analytics.dim_customers(customer_id),
    product_id VARCHAR(10) REFERENCES analytics.dim_products(product_id),
    quantity INTEGER,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2),
    order_date DATE,
    loaded_at TIMESTAMP DEFAULT GETDATE()
)
DISTKEY(order_date)
SORTKEY(order_date, customer_id);

The DISTKEY controls how rows are distributed across compute nodes, and the SORTKEY keeps data ordered on disk so date-restricted scans read fewer blocks. (For larger fact tables you would typically distribute on a frequently joined column such as customer_id, or use DISTSTYLE AUTO and let Redshift choose.)
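
You can run this DDL from the Redshift Query Editor, from a SQL client (see Step 5), or programmatically. As a hedged example, the boto3 sketch below submits a statement through the Redshift Data API; it assumes the provisioned cluster from Option B and temporary credentials for the admin user (for Serverless you would pass WorkgroupName instead of ClusterIdentifier and DbUser):

# Submit SQL/DDL to Redshift via the Data API (no JDBC driver required) and
# poll until the statement finishes. Cluster, database, and user names match
# the provisioned-cluster setup above.
import time
import boto3

data_api = boto3.client("redshift-data", region_name="us-east-1")

response = data_api.execute_statement(
    ClusterIdentifier="my-data-warehouse",
    Database="analytics",
    DbUser="admin",
    Sql="CREATE SCHEMA IF NOT EXISTS analytics;",
)

while True:
    status = data_api.describe_statement(Id=response["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        print(f"Statement status: {status}")
        break
    time.sleep(2)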

Step 3: Setting Up AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics. It handles the ETL (Extract, Transform, Load) process for your data warehouse.

Create a Glue Database

aws glue create-database \
    --database-input '{
        "Name": "data_warehouse_db",
        "Description": "Database for data warehouse ETL"
    }'

Create a Glue Crawler

A crawler automatically discovers your data schema and creates table definitions in the Glue Data Catalog:

# Create IAM role for Glue first
aws iam create-role \
    --role-name AWSGlueServiceRole-DataWarehouse \
    --assume-role-policy-document '{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }]
    }'

# Attach required policies
aws iam attach-role-policy \
    --role-name AWSGlueServiceRole-DataWarehouse \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Create the crawler
aws glue create-crawler \
    --name sales-data-crawler \
    --role AWSGlueServiceRole-DataWarehouse \
    --database-name data_warehouse_db \
    --targets '{
        "S3Targets": [{
            "Path": "s3://your-company-data-warehouse-raw/sales/"
        }]
    }'

# Run the crawler
aws glue start-crawler --name sales-data-crawler
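
The crawler runs asynchronously. A minimal boto3 sketch that waits for it to finish (run it after the start-crawler call above has been accepted) and then lists the tables it registered might look like this:

# Wait for the crawler to return to READY, then list the tables and columns it
# created in the Glue Data Catalog. Names match the crawler and database above.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

while glue.get_crawler(Name="sales-data-crawler")["Crawler"]["State"] in ("RUNNING", "STOPPING"):
    print("Crawler still running...")
    time.sleep(30)

for table in glue.get_tables(DatabaseName="data_warehouse_db")["TableList"]:
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)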

Create a Glue Connection to Redshift

aws glue create-connection \
    --connection-input '{
        "Name": "redshift-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-data-warehouse.xxxxx.us-east-1.redshift.amazonaws.com:5439/analytics",
            "USERNAME": "admin",
            "PASSWORD": "YourSecurePassword123!"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxx"],
            "AvailabilityZone": "us-east-1a"
        }
    }'

Step 4: Creating the ETL Pipeline

Now create an AWS Glue ETL job to transform and load data from S3 into Redshift.

Create the Glue ETL Script

Create a Python script named sales_etl.py for the ETL job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from the Glue Data Catalog
sales_data = glueContext.create_dynamic_frame.from_catalog(
    database="data_warehouse_db",
    table_name="sales"
)

# Transform: Calculate total_amount
sales_df = sales_data.toDF()
sales_df = sales_df.withColumn(
    "total_amount",
    sales_df["quantity"] * sales_df["unit_price"]
)

# Convert back to DynamicFrame
sales_transformed = DynamicFrame.fromDF(
    sales_df,
    glueContext,
    "sales_transformed"
)

# Apply mappings to match Redshift schema
mapped_data = ApplyMapping.apply(
    frame=sales_transformed,
    mappings=[
        ("order_id", "long", "order_id", "int"),
        ("customer_id", "string", "customer_id", "string"),
        ("product_id", "string", "product_id", "string"),
        ("quantity", "long", "quantity", "int"),
        ("unit_price", "double", "unit_price", "decimal"),
        ("total_amount", "double", "total_amount", "decimal"),
        ("order_date", "string", "order_date", "date")
    ]
)

# Write to Redshift
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped_data,
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "analytics.fact_orders",
        "database": "analytics"
    },
    redshift_tmp_dir="s3://your-company-data-warehouse-processed/temp/"
)

job.commit()

Create the Glue Job

Upload sales_etl.py to s3://your-company-data-warehouse-processed/scripts/ (the ScriptLocation referenced below), then create the job:

aws glue create-job \
    --name sales-etl-job \
    --role AWSGlueServiceRole-DataWarehouse \
    --command '{
        "Name": "glueetl",
        "ScriptLocation": "s3://your-company-data-warehouse-processed/scripts/sales_etl.py",
        "PythonVersion": "3"
    }' \
    --default-arguments '{
        "--job-language": "python",
        "--TempDir": "s3://your-company-data-warehouse-processed/temp/"
    }' \
    --connections '{"Connections": ["redshift-connection"]}' \
    --glue-version "4.0" \
    --number-of-workers 2 \
    --worker-type "G.1X"

Run the ETL Job

aws glue start-job-run --job-name sales-etl-job

Monitor the job status:

aws glue get-job-runs --job-name sales-etl-job
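
If you prefer to drive and monitor the run from Python, a hedged boto3 sketch equivalent to the two commands above could look like this:

# Start the Glue job and poll until the run reaches a terminal state.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run_id = glue.start_job_run(JobName="sales-etl-job")["JobRunId"]

while True:
    state = glue.get_job_run(JobName="sales-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    print(f"Run {run_id}: {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(60)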

Step 5: Querying the Data Warehouse

Once your data is loaded into Redshift, you can run analytical queries using SQL.

Connect to Redshift

Use the Redshift Query Editor in the AWS Console, or connect via a SQL client:

# Using psql
psql -h my-data-warehouse.xxxxx.us-east-1.redshift.amazonaws.com \
     -p 5439 \
     -U admin \
     -d analytics

Sample Analytical Queries

Note that the ETL job above loads only fact_orders; the join queries below assume you have also populated dim_customers and dim_products, for example with similar Glue jobs or direct INSERT statements.

Total sales by date:

SELECT
    order_date,
    COUNT(*) as total_orders,
    SUM(total_amount) as revenue
FROM analytics.fact_orders
GROUP BY order_date
ORDER BY order_date;

Top customers by revenue:

SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) as total_orders,
    SUM(o.total_amount) as total_revenue
FROM analytics.fact_orders o
JOIN analytics.dim_customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY total_revenue DESC
LIMIT 10;

Product performance analysis:

SELECT
    p.product_id,
    p.product_name,
    p.category,
    SUM(o.quantity) as units_sold,
    SUM(o.total_amount) as revenue,
    AVG(o.unit_price) as avg_price
FROM analytics.fact_orders o
JOIN analytics.dim_products p ON o.product_id = p.product_id
GROUP BY p.product_id, p.product_name, p.category
ORDER BY revenue DESC;

Monthly revenue trend:

SELECT
    DATE_TRUNC('month', order_date) as month,
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(*) as total_orders,
    SUM(total_amount) as monthly_revenue
FROM analytics.fact_orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
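
These queries also work from application code. As a hedged sketch, assuming Amazon's redshift_connector driver is installed (pip install redshift-connector) and reusing the connection details from earlier, here is how the daily revenue query might be run from Python:

# Run the daily-revenue query with the redshift_connector driver. Host,
# database, and credentials are the values used earlier in this guide.
import redshift_connector

conn = redshift_connector.connect(
    host="my-data-warehouse.xxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="analytics",
    user="admin",
    password="YourSecurePassword123!",
)

cursor = conn.cursor()
cursor.execute("""
    SELECT order_date, COUNT(*) AS total_orders, SUM(total_amount) AS revenue
    FROM analytics.fact_orders
    GROUP BY order_date
    ORDER BY order_date;
""")
for order_date, total_orders, revenue in cursor.fetchall():
    print(order_date, total_orders, revenue)

cursor.close()
conn.close()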

Step 6: Automating the Pipeline

For production workloads, automate the ETL pipeline to run on a schedule.

Create a Glue Trigger

aws glue create-trigger \
    --name daily-sales-etl \
    --type SCHEDULED \
    --schedule "cron(0 6 * * ? *)" \
    --actions '[{
        "JobName": "sales-etl-job"
    }]' \
    --start-on-creation

This trigger runs the ETL job daily at 6 AM UTC.

Set Up Monitoring

Create a CloudWatch alarm that fires when a job run reports failed tasks. Glue publishes these job metrics under the Glue namespace when metrics are enabled for the job (the --enable-metrics job argument):

aws cloudwatch put-metric-alarm \
    --alarm-name "GlueJobFailure-sales-etl" \
    --metric-name "glue.driver.aggregate.numFailedTasks" \
    --namespace "Glue" \
    --statistic Sum \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 1 \
    --dimensions Name=JobName,Value=sales-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
    --alarm-actions arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:alerts

Best Practices

Data Organization

  • Use partitioned tables in S3 for better query performance
  • Implement a medallion architecture (bronze/silver/gold layers) for data quality
  • Maintain separate environments for development, staging, and production

Security

  • Enable encryption at rest for S3 buckets and Redshift clusters (see the sketch after this list)
  • Use IAM roles instead of access keys
  • Implement VPC endpoints for private connectivity
  • Enable audit logging for compliance
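
For the first point, a minimal boto3 sketch that enables default SSE-S3 encryption on the staging buckets from Step 1 (new buckets get this by default, but setting it explicitly does no harm) could look like this:

# Enable default server-side encryption (SSE-S3) on the staging buckets.
import boto3

s3 = boto3.client("s3")

for bucket in ("your-company-data-warehouse-raw", "your-company-data-warehouse-processed"):
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )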

Performance Optimization

  • Choose appropriate distribution and sort keys for Redshift tables
  • Use Redshift Spectrum for querying data directly in S3 without loading
  • Implement materialized views for frequently-run queries
  • Monitor and vacuum tables regularly

Cost Management

  • Use Redshift Serverless for variable workloads
  • Implement S3 lifecycle policies to archive old data (see the sketch after this list)
  • Schedule Glue jobs during off-peak hours
  • Monitor and optimize Glue DPU usage
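
As an example of the lifecycle-policy point, here is a hedged boto3 sketch that archives raw sales data to S3 Glacier after 90 days and deletes it after a year; the prefix and day counts are illustrative assumptions, not recommendations from this guide:

# Lifecycle rule for the raw bucket: transition objects under sales/ to
# Glacier after 90 days and expire them after 365. Day counts are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="your-company-data-warehouse-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-sales-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "sales/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)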

Conclusion

You've now built a scalable data warehouse on AWS using Amazon Redshift for storage and analytics, AWS Glue for ETL processing, and Amazon S3 as a staging layer. This architecture can handle growing data volumes and provides the foundation for advanced analytics and business intelligence.

As your needs grow, consider extending this architecture with:

  • Amazon QuickSight for business intelligence dashboards
  • Amazon Athena for ad-hoc queries directly on S3 data
  • AWS Lake Formation for centralized data governance
  • Apache Airflow (via Amazon MWAA) for complex workflow orchestration
