Building a Scalable ETL Pipeline on AWS: A Step-by-Step Guide
Extract, Transform, Load (ETL) pipelines are the backbone of modern data engineering. They move data from source systems, transform it into usable formats, and load it into target destinations for analysis. In this guide, you'll learn how to build a scalable, serverless ETL pipeline using AWS Glue and Amazon S3.
What is AWS Glue?
AWS Glue is a fully managed, serverless data integration service that simplifies discovering, preparing, and combining data for analytics and machine learning. It handles the undifferentiated heavy lifting of provisioning infrastructure, scaling resources, and managing job execution.
Key components of AWS Glue include:
- Data Catalog: A centralized metadata repository that stores table definitions, schema information, and data location
- Crawlers: Automated processes that scan your data sources and populate the Data Catalog with table definitions
- ETL Jobs: Apache Spark-based jobs that extract, transform, and load data between sources and targets
- Triggers: Mechanisms that start ETL jobs on a schedule or in response to events
AWS Glue supports various data formats including JSON, CSV, Parquet, and ORC, and scales automatically based on data volume.
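To make these components concrete, here is a minimal boto3 sketch (the region is an assumption) that lists the databases and tables registered in the Data Catalog, for example by a crawler:
import boto3

# The Data Catalog is exposed through the Glue API
glue = boto3.client("glue", region_name="us-east-1")

# List databases registered in the Data Catalog
for database in glue.get_databases()["DatabaseList"]:
    print("Database:", database["Name"])

    # List the tables each database contains, with their S3 locations
    for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
        print("  Table:", table["Name"], "->", table["StorageDescriptor"].get("Location", "n/a"))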
Architecture Overview
Here's the architecture you'll build in this guide:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Raw Data   │────▶│   S3 Raw    │────▶│  AWS Glue   │────▶│ S3 Processed│
│   Source    │     │   Bucket    │     │   ETL Job   │     │   Bucket    │
└─────────────┘     └──────┬──────┘     └──────▲──────┘     └─────────────┘
                           │                   │
                           ▼                   │
                    ┌─────────────┐     ┌──────┴──────┐
                    │  AWS Glue   │────▶│  AWS Glue   │
                    │   Crawler   │     │ Data Catalog│
                    └─────────────┘     └─────────────┘
This serverless architecture uses AWS managed services that scale automatically and require no infrastructure management.
Prerequisites
Before you begin, ensure you have:
- An AWS account with appropriate permissions
- AWS CLI installed and configured
- Basic knowledge of Python and SQL
- Familiarity with S3 and IAM concepts
Step 1: Setting Up S3 Buckets for Data Storage
Amazon S3 serves as both the source and destination for your ETL pipeline. Following AWS best practices, create separate buckets for raw and processed data.
Create S3 Buckets
# Create bucket for raw/landing data
aws s3 mb s3://your-company-etl-raw-data --region us-east-1
# Create bucket for processed/transformed data
aws s3 mb s3://your-company-etl-processed-data --region us-east-1
# Create bucket for Glue scripts and temporary files
aws s3 mb s3://your-company-etl-assets --region us-east-1
Configure Bucket Structure
Organize your buckets with a clear folder structure based on data source (in S3, "folders" are simply key prefixes):
# Create folder structure for raw data
aws s3api put-object --bucket your-company-etl-raw-data --key sales/
aws s3api put-object --bucket your-company-etl-raw-data --key customers/
aws s3api put-object --bucket your-company-etl-raw-data --key products/
# Create folder structure for processed data
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/sales/
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/customers/
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/products/
Upload Sample Data
Create sample CSV files to use throughout this guide:
sales_data.csv:
transaction_id,customer_id,product_id,quantity,unit_price,transaction_date,region
TXN001,CUST100,PROD001,2,29.99,2025-01-15,US-EAST
TXN002,CUST101,PROD002,1,49.99,2025-01-15,US-WEST
TXN003,CUST100,PROD003,3,19.99,2025-01-16,US-EAST
TXN004,CUST102,PROD001,1,29.99,2025-01-16,EU-WEST
TXN005,CUST101,PROD004,2,39.99,2025-01-17,US-WEST
Upload to S3:
aws s3 cp sales_data.csv s3://your-company-etl-raw-data/sales/
Step 2: Creating the Glue Database and Crawler
The AWS Glue Data Catalog serves as a central metadata repository for your data assets. You'll create a database and use a crawler to automatically discover the schema of your raw data.
Create a Glue Database
aws glue create-database \
--database-input '{
"Name": "etl_pipeline_db",
"Description": "Database for ETL pipeline data catalog"
}'
Create an IAM Role for Glue
AWS Glue needs permissions to access S3 and write to the Data Catalog:
# Create the trust policy document
cat > glue-trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the IAM role
aws iam create-role \
--role-name AWSGlueServiceRole-ETLPipeline \
--assume-role-policy-document file://glue-trust-policy.json
# Attach the AWS managed Glue service policy
aws iam attach-role-policy \
--role-name AWSGlueServiceRole-ETLPipeline \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Create and attach S3 access policy
cat > glue-s3-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-company-etl-*",
"arn:aws:s3:::your-company-etl-*/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name AWSGlueServiceRole-ETLPipeline \
--policy-name S3AccessPolicy \
--policy-document file://glue-s3-policy.json
Create a Glue Crawler
The crawler scans your S3 data and creates table definitions in the Data Catalog:
aws glue create-crawler \
--name sales-data-crawler \
--role AWSGlueServiceRole-ETLPipeline \
--database-name etl_pipeline_db \
--targets '{
"S3Targets": [
{
"Path": "s3://your-company-etl-raw-data/sales/"
}
]
}' \
--schema-change-policy '{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
}'
Run the Crawler
# Start the crawler
aws glue start-crawler --name sales-data-crawler
# Check crawler status
aws glue get-crawler --name sales-data-crawler --query 'Crawler.State'
Once the crawler completes, verify the table was created:
aws glue get-table --database-name etl_pipeline_db --name sales
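If you prefer to script the wait instead of re-running the status command by hand, a small boto3 polling loop works; this is a sketch that assumes the crawler and database names used above:
import time
import boto3

glue = boto3.client("glue")

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name="sales-data-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)

# Fetch the table definition the crawler created and print its schema
table = glue.get_table(DatabaseName="etl_pipeline_db", Name="sales")["Table"]
print([(col["Name"], col["Type"]) for col in table["StorageDescriptor"]["Columns"]])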
Step 3: Creating a Glue ETL Job
Now create an AWS Glue ETL job to transform your raw data and write it to S3 in an optimized Parquet format.
Write the ETL Script
Create a Python script named sales_etl_job.py that performs the following transformations:
- Reads raw CSV data from the Glue Data Catalog
- Adds a calculated total_amount field
- Partitions data by region for optimized queries
- Writes to S3 in Parquet format
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_date, round as spark_round
# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read data from the Glue Data Catalog
raw_data = glueContext.create_dynamic_frame.from_catalog(
database="etl_pipeline_db",
table_name="sales"
)
# Convert to Spark DataFrame for transformations
df = raw_data.toDF()
# Transform: Add calculated fields and clean data
transformed_df = df \
.withColumn("total_amount",
spark_round(col("quantity") * col("unit_price"), 2)) \
.withColumn("transaction_date",
to_date(col("transaction_date"), "yyyy-MM-dd"))
# Convert back to DynamicFrame
transformed_dynamic_frame = DynamicFrame.fromDF(
transformed_df,
glueContext,
"transformed_sales"
)
# Apply schema mapping
mapped_data = ApplyMapping.apply(
frame=transformed_dynamic_frame,
mappings=[
("transaction_id", "string", "transaction_id", "string"),
("customer_id", "string", "customer_id", "string"),
("product_id", "string", "product_id", "string"),
("quantity", "long", "quantity", "int"),
("unit_price", "double", "unit_price", "decimal(10,2)"),
("transaction_date", "date", "transaction_date", "date"),
("region", "string", "region", "string"),
("total_amount", "double", "total_amount", "decimal(10,2)")
]
)
# Write to S3 in Parquet format, partitioned by region
glueContext.write_dynamic_frame.from_options(
frame=mapped_data,
connection_type="s3",
connection_options={
"path": "s3://your-company-etl-processed-data/parquet/sales/",
"partitionKeys": ["region"]
},
format="parquet",
format_options={
"compression": "snappy"
}
)
job.commit()
Upload the Script to S3
aws s3 cp sales_etl_job.py s3://your-company-etl-assets/scripts/
Create the Glue Job
aws glue create-job \
--name sales-etl-job \
--role AWSGlueServiceRole-ETLPipeline \
--command '{
"Name": "glueetl",
"ScriptLocation": "s3://your-company-etl-assets/scripts/sales_etl_job.py",
"PythonVersion": "3"
}' \
--default-arguments '{
"--job-language": "python",
"--TempDir": "s3://your-company-etl-assets/temp/",
"--enable-metrics": "true",
"--enable-continuous-cloudwatch-log": "true"
}' \
--glue-version "4.0" \
--number-of-workers 2 \
--worker-type "G.1X"
Run and Monitor the Job
# Start the job
aws glue start-job-run --job-name sales-etl-job
# Monitor job status
aws glue get-job-runs --job-name sales-etl-job \
--query 'JobRuns[0].{Status:JobRunState,StartTime:StartedOn}'
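For scripted monitoring (for example from a CI pipeline), you can start a run and poll it until it reaches a terminal state. A minimal boto3 sketch, assuming the job name above:
import time
import boto3

glue = boto3.client("glue")

# Kick off a run and wait for it to finish
run_id = glue.start_job_run(JobName="sales-etl-job")["JobRunId"]
while True:
    state = glue.get_job_run(JobName="sales-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print("Final state:", state)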
Step 4: Deploying with CloudFormation
For production deployments, use AWS CloudFormation to define your infrastructure as code. This ensures consistency and makes it easy to replicate the pipeline across environments.
CloudFormation Template
Create a template that provisions all required resources:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Scalable ETL Pipeline with AWS Glue and S3'
Parameters:
Environment:
Type: String
Default: dev
AllowedValues:
- dev
- staging
- prod
Description: Deployment environment
ProjectName:
Type: String
Default: etl-pipeline
Description: Name prefix for all resources
Resources:
# S3 Buckets
RawDataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-raw-data-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
VersioningConfiguration:
Status: Enabled
ProcessedDataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-processed-data-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
AssetsBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-assets-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
# IAM Role for Glue
GlueServiceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub 'AWSGlueServiceRole-${ProjectName}-${Environment}'
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: glue.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Policies:
- PolicyName: S3Access
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket
Resource:
- !GetAtt RawDataBucket.Arn
- !Sub '${RawDataBucket.Arn}/*'
- !GetAtt ProcessedDataBucket.Arn
- !Sub '${ProcessedDataBucket.Arn}/*'
- !GetAtt AssetsBucket.Arn
- !Sub '${AssetsBucket.Arn}/*'
# Glue Database
GlueDatabase:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Sub '${ProjectName}_db_${Environment}'
Description: !Sub 'Database for ${ProjectName} ETL pipeline'
# Glue Crawler
SalesCrawler:
Type: AWS::Glue::Crawler
Properties:
Name: !Sub '${ProjectName}-sales-crawler-${Environment}'
Role: !GetAtt GlueServiceRole.Arn
DatabaseName: !Ref GlueDatabase
Targets:
S3Targets:
- Path: !Sub 's3://${RawDataBucket}/sales/'
SchemaChangePolicy:
UpdateBehavior: UPDATE_IN_DATABASE
DeleteBehavior: LOG
# Glue ETL Job
SalesETLJob:
Type: AWS::Glue::Job
Properties:
Name: !Sub '${ProjectName}-sales-etl-${Environment}'
Role: !GetAtt GlueServiceRole.Arn
Command:
Name: glueetl
ScriptLocation: !Sub 's3://${AssetsBucket}/scripts/sales_etl_job.py'
PythonVersion: '3'
DefaultArguments:
'--job-language': python
'--TempDir': !Sub 's3://${AssetsBucket}/temp/'
'--enable-metrics': 'true'
'--enable-continuous-cloudwatch-log': 'true'
'--OUTPUT_BUCKET': !Ref ProcessedDataBucket
'--DATABASE_NAME': !Ref GlueDatabase
GlueVersion: '4.0'
NumberOfWorkers: 2
WorkerType: G.1X
Timeout: 60
# Glue Trigger for scheduled execution
DailyTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: !Sub '${ProjectName}-daily-trigger-${Environment}'
Type: SCHEDULED
Schedule: 'cron(0 6 * * ? *)'
StartOnCreation: true
Actions:
- JobName: !Ref SalesETLJob
# CloudWatch Alarm for job failures
JobFailureAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub '${ProjectName}-job-failure-${Environment}'
AlarmDescription: Alert when Glue job fails
MetricName: glue.driver.aggregate.numFailedTasks
Namespace: Glue
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 1
ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: JobName
          Value: !Ref SalesETLJob
        - Name: JobRunId
          Value: ALL
        - Name: Type
          Value: count
Outputs:
RawDataBucketName:
Description: Name of the raw data S3 bucket
Value: !Ref RawDataBucket
ProcessedDataBucketName:
Description: Name of the processed data S3 bucket
Value: !Ref ProcessedDataBucket
GlueDatabaseName:
Description: Name of the Glue database
Value: !Ref GlueDatabase
ETLJobName:
Description: Name of the Glue ETL job
Value: !Ref SalesETLJob
Deploy the Stack
# Deploy the CloudFormation stack
aws cloudformation create-stack \
--stack-name etl-pipeline-dev \
--template-body file://etl-pipeline-template.yaml \
--parameters ParameterKey=Environment,ParameterValue=dev \
--capabilities CAPABILITY_NAMED_IAM
# Monitor deployment progress
aws cloudformation describe-stacks \
--stack-name etl-pipeline-dev \
--query 'Stacks[0].StackStatus'
Monitoring and Troubleshooting
CloudWatch Metrics
AWS Glue automatically publishes job metrics to CloudWatch. Key metrics to monitor include (a query sketch follows this list):
- glue.driver.aggregate.bytesRead: Data volume processed
- glue.driver.aggregate.numCompletedTasks: Successfully completed tasks
- glue.driver.aggregate.numFailedTasks: Failed tasks requiring attention
- glue.driver.jvm.heap.used: Memory usage
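Here is a boto3 sketch for pulling one of these metrics programmatically. It assumes the standard Glue metric dimensions (JobName, JobRunId set to ALL, Type set to count) and the job name used earlier:
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum of failed tasks for the job over the last 24 hours
stats = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "sales-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},  # aggregate across runs
        {"Name": "Type", "Value": "count"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])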
View Job Logs
# Get recent job runs
aws glue get-job-runs --job-name sales-etl-job --max-results 5
# View CloudWatch logs for a specific run
# (log streams in this group are named after the Glue job run ID, e.g. jr_...)
aws logs get-log-events \
--log-group-name /aws-glue/jobs/output \
--log-stream-name <job-run-id>
Common Issues and Solutions
Schema mismatch errors: Run the crawler after data format changes to update the Data Catalog.
Memory issues: Increase worker count or worker type (G.1X to G.2X) for larger datasets.
Permission errors: Verify the Glue service role has access to all required S3 paths.
Best Practices
Data Organization
- Partition data by frequently-filtered columns (date, region) to reduce scan costs
- Use Parquet or ORC formats for analytical workloads
- Implement a medallion architecture with bronze (raw), silver (cleansed), and gold (aggregated) layers
Security
- Enable S3 bucket encryption at rest
- Use IAM roles with least-privilege permissions
- Block public access to all data buckets
- Enable versioning for data recovery
Cost Optimization
- Use Glue job bookmarks for incremental processing (see the sketch after this list)
- Schedule jobs during off-peak hours
- Right-size worker types based on actual workload
- Set appropriate timeouts to prevent runaway jobs
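As an illustration of job bookmarks, the sketch below shows the script-side requirement: each source read needs a transformation_ctx so Glue can track which files have already been processed, and the job itself must be created or updated with the default argument --job-bookmark-option set to job-bookmark-enable (not shown here):
# Sketch: incremental reads with Glue job bookmarks.
# Requires the job to run with '--job-bookmark-option': 'job-bookmark-enable'
# in its default arguments; transformation_ctx below is the bookmark key.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Only files not seen by a previous successful run are read
incremental = glueContext.create_dynamic_frame.from_catalog(
    database="etl_pipeline_db",
    table_name="sales",
    transformation_ctx="read_sales_incremental",
)

# ... transforms and writes go here ...

job.commit()  # advances the bookmark only after a successful run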
Reliability
- Implement error handling in ETL scripts (see the sketch after this list)
- Set up CloudWatch alarms for job failures
- Use S3 versioning for data recovery
- Test jobs in a non-production environment first
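For error handling, one common pattern is to wrap the transform-and-write logic so failures are logged with context and then re-raised, which marks the job run as FAILED and lets the CloudWatch alarm fire. A minimal sketch:
import logging

logger = logging.getLogger("sales_etl")
logging.basicConfig(level=logging.INFO)

def run_pipeline():
    # read, transform, and write steps from the ETL script go here
    ...

try:
    run_pipeline()
except Exception:
    # Log the full stack trace to CloudWatch, then re-raise so the
    # job run is marked FAILED instead of silently succeeding
    logger.exception("sales-etl-job failed")
    raise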
Conclusion
You've now built a scalable, serverless ETL pipeline on AWS using AWS Glue and Amazon S3. This architecture provides automatic scaling, requires no infrastructure management, and can handle data volumes from megabytes to petabytes.
The key components work together seamlessly: S3 provides durable storage for raw and processed data, Glue Crawlers automatically discover schemas, and Glue ETL jobs transform data using distributed Spark processing. CloudFormation enables reproducible deployments across environments.
As your data needs grow, consider extending this pipeline with:
- AWS Step Functions for orchestrating complex multi-step workflows
- Amazon Athena for SQL queries directly on S3 data (see the sketch below)
- Amazon Redshift as a target for analytical data warehousing
- AWS Lambda for event-driven triggers when new data arrives
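For example, once a crawler has catalogued the processed Parquet data (a second crawler pointed at the processed bucket, which this guide does not create), you could query it from Athena with a boto3 sketch like this; the table name and output location are assumptions:
import boto3

athena = boto3.client("athena")

# Run a SQL aggregation directly against the Parquet files in S3
response = athena.start_query_execution(
    QueryString="""
        SELECT region, SUM(total_amount) AS revenue
        FROM sales
        GROUP BY region
    """,
    QueryExecutionContext={"Database": "etl_pipeline_db"},
    ResultConfiguration={
        "OutputLocation": "s3://your-company-etl-assets/athena-results/"
    },
)
print("QueryExecutionId:", response["QueryExecutionId"])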