Building a Scalable ETL Pipeline on AWS: A Step-by-Step Guide
Extract, Transform, Load (ETL) pipelines are the backbone of modern data engineering. They move data from source systems, transform it into usable formats, and load it into target destinations for analysis. In this guide, you'll learn how to build a scalable, serverless ETL pipeline using AWS Glue and Amazon S3.
What is AWS Glue?
AWS Glue is a fully managed, serverless data integration service that simplifies discovering, preparing, and combining data for analytics and machine learning. It handles the undifferentiated heavy lifting of provisioning infrastructure, scaling resources, and managing job execution.
Key components of AWS Glue include:
- Data Catalog: A centralized metadata repository that stores table definitions, schema information, and data location
- Crawlers: Automated processes that scan your data sources and populate the Data Catalog with table definitions
- ETL Jobs: Apache Spark-based jobs that extract, transform, and load data between sources and targets
- Triggers: Mechanisms that start ETL jobs on a schedule or in response to events
AWS Glue supports various data formats including JSON, CSV, Parquet, and ORC, and scales automatically based on data volume.
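To make these components concrete, here is a minimal boto3 sketch (the region is an assumption) that lists the databases and tables registered in the Data Catalog, for example by a crawler:
import boto3

# The Data Catalog is exposed through the Glue API
glue = boto3.client("glue", region_name="us-east-1")

# List databases registered in the Data Catalog
for database in glue.get_databases()["DatabaseList"]:
    print("Database:", database["Name"])

    # List the tables each database contains, with their S3 locations
    for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
        print("  Table:", table["Name"], "->", table["StorageDescriptor"].get("Location", "n/a"))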
Architecture Overview
Here's the architecture you'll build in this guide:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Raw Data   │────▶│   S3 Raw    │────▶│  AWS Glue   │────▶│ S3 Processed│
│   Source    │     │   Bucket    │     │   ETL Job   │     │   Bucket    │
└─────────────┘     └──────┬──────┘     └──────▲──────┘     └─────────────┘
                           │                   │
                           ▼                   │
                    ┌─────────────┐     ┌──────┴──────┐
                    │  AWS Glue   │────▶│  AWS Glue   │
                    │   Crawler   │     │ Data Catalog│
                    └─────────────┘     └─────────────┘
This serverless architecture uses AWS managed services that scale automatically and require no infrastructure management.
Prerequisites
Before you begin, ensure you have:
- An AWS account with appropriate permissions
- AWS CLI installed and configured
- Basic knowledge of Python and SQL
- Familiarity with S3 and IAM concepts
Step 1: Setting Up S3 Buckets for Data Storage
Amazon S3 serves as both the source and destination for your ETL pipeline. Following AWS best practices, create separate buckets for raw and processed data.
Create S3 Buckets
# Create bucket for raw/landing data
aws s3 mb s3://your-company-etl-raw-data --region us-east-1
# Create bucket for processed/transformed data
aws s3 mb s3://your-company-etl-processed-data --region us-east-1
# Create bucket for Glue scripts and temporary files
aws s3 mb s3://your-company-etl-assets --region us-east-1
Configure Bucket Structure
Organize your buckets with a clear folder structure based on data source (in S3, "folders" are simply key prefixes):
# Create folder structure for raw data
aws s3api put-object --bucket your-company-etl-raw-data --key sales/
aws s3api put-object --bucket your-company-etl-raw-data --key customers/
aws s3api put-object --bucket your-company-etl-raw-data --key products/
# Create folder structure for processed data
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/sales/
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/customers/
aws s3api put-object --bucket your-company-etl-processed-data --key parquet/products/
Upload Sample Data
Create sample CSV files to use throughout this guide:
sales_data.csv:
transaction_id,customer_id,product_id,quantity,unit_price,transaction_date,region
TXN001,CUST100,PROD001,2,29.99,2025-01-15,US-EAST
TXN002,CUST101,PROD002,1,49.99,2025-01-15,US-WEST
TXN003,CUST100,PROD003,3,19.99,2025-01-16,US-EAST
TXN004,CUST102,PROD001,1,29.99,2025-01-16,EU-WEST
TXN005,CUST101,PROD004,2,39.99,2025-01-17,US-WEST
Upload to S3:
aws s3 cp sales_data.csv s3://your-company-etl-raw-data/sales/
Step 2: Creating the Glue Database and Crawler
The AWS Glue Data Catalog serves as a central metadata repository for your data assets. You'll create a database and use a crawler to automatically discover the schema of your raw data.
Create a Glue Database
aws glue create-database \
--database-input '{
"Name": "etl_pipeline_db",
"Description": "Database for ETL pipeline data catalog"
}'
Create an IAM Role for Glue
AWS Glue needs permissions to access S3 and write to the Data Catalog:
# Create the trust policy document
cat > glue-trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the IAM role
aws iam create-role \
--role-name AWSGlueServiceRole-ETLPipeline \
--assume-role-policy-document file://glue-trust-policy.json
# Attach the AWS managed Glue service policy
aws iam attach-role-policy \
--role-name AWSGlueServiceRole-ETLPipeline \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# Create and attach S3 access policy
cat > glue-s3-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-company-etl-*",
"arn:aws:s3:::your-company-etl-*/*"
]
}
]
}
EOF
aws iam put-role-policy \
--role-name AWSGlueServiceRole-ETLPipeline \
--policy-name S3AccessPolicy \
--policy-document file://glue-s3-policy.json
Create a Glue Crawler
The crawler scans your S3 data and creates table definitions in the Data Catalog:
aws glue create-crawler \
--name sales-data-crawler \
--role AWSGlueServiceRole-ETLPipeline \
--database-name etl_pipeline_db \
--targets '{
"S3Targets": [
{
"Path": "s3://your-company-etl-raw-data/sales/"
}
]
}' \
--schema-change-policy '{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
}'
Run the Crawler
# Start the crawler
aws glue start-crawler --name sales-data-crawler
# Check crawler status
aws glue get-crawler --name sales-data-crawler --query 'Crawler.State'
Once the crawler completes, verify the table was created:
aws glue get-table --database-name etl_pipeline_db --name sales
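If you prefer to script the wait instead of re-running the status command by hand, a small boto3 polling loop works; this is a sketch that assumes the crawler and database names used above:
import time
import boto3

glue = boto3.client("glue")

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name="sales-data-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)

# Fetch the table definition the crawler created and print its schema
table = glue.get_table(DatabaseName="etl_pipeline_db", Name="sales")["Table"]
print([(col["Name"], col["Type"]) for col in table["StorageDescriptor"]["Columns"]])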
Step 3: Creating a Glue ETL Job
Now create an AWS Glue ETL job to transform your raw data and write it to S3 in an optimized Parquet format.
Write the ETL Script
Create a Python script named sales_etl_job.py that performs the following transformations:
- Reads raw CSV data from the Glue Data Catalog
- Adds a calculated total_amount field
- Partitions data by region for optimized queries
- Writes to S3 in Parquet format
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_date, round as spark_round
# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read data from the Glue Data Catalog
raw_data = glueContext.create_dynamic_frame.from_catalog(
database="etl_pipeline_db",
table_name="sales"
)
# Convert to Spark DataFrame for transformations
df = raw_data.toDF()
# Transform: Add calculated fields and clean data
transformed_df = df \
.withColumn("total_amount",
spark_round(col("quantity") * col("unit_price"), 2)) \
.withColumn("transaction_date",
to_date(col("transaction_date"), "yyyy-MM-dd"))
# Convert back to DynamicFrame
transformed_dynamic_frame = DynamicFrame.fromDF(
transformed_df,
glueContext,
"transformed_sales"
)
# Apply schema mapping
mapped_data = ApplyMapping.apply(
frame=transformed_dynamic_frame,
mappings=[
("transaction_id", "string", "transaction_id", "string"),
("customer_id", "string", "customer_id", "string"),
("product_id", "string", "product_id", "string"),
("quantity", "long", "quantity", "int"),
("unit_price", "double", "unit_price", "decimal(10,2)"),
("transaction_date", "date", "transaction_date", "date"),
("region", "string", "region", "string"),
("total_amount", "double", "total_amount", "decimal(10,2)")
]
)
# Write to S3 in Parquet format, partitioned by region
glueContext.write_dynamic_frame.from_options(
frame=mapped_data,
connection_type="s3",
connection_options={
"path": "s3://your-company-etl-processed-data/parquet/sales/",
"partitionKeys": ["region"]
},
format="parquet",
format_options={
"compression": "snappy"
}
)
job.commit()
Upload the Script to S3
aws s3 cp sales_etl_job.py s3://your-company-etl-assets/scripts/
Create the Glue Job
aws glue create-job \
--name sales-etl-job \
--role AWSGlueServiceRole-ETLPipeline \
--command '{
"Name": "glueetl",
"ScriptLocation": "s3://your-company-etl-assets/scripts/sales_etl_job.py",
"PythonVersion": "3"
}' \
--default-arguments '{
"--job-language": "python",
"--TempDir": "s3://your-company-etl-assets/temp/",
"--enable-metrics": "true",
"--enable-continuous-cloudwatch-log": "true"
}' \
--glue-version "4.0" \
--number-of-workers 2 \
--worker-type "G.1X"
Run and Monitor the Job
# Start the job
aws glue start-job-run --job-name sales-etl-job
# Monitor job status
aws glue get-job-runs --job-name sales-etl-job \
--query 'JobRuns[0].{Status:JobRunState,StartTime:StartedOn}'
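For scripted monitoring (for example from a CI pipeline), you can start a run and poll it until it reaches a terminal state. A minimal boto3 sketch, assuming the job name above:
import time
import boto3

glue = boto3.client("glue")

# Kick off a run and wait for it to finish
run_id = glue.start_job_run(JobName="sales-etl-job")["JobRunId"]
while True:
    state = glue.get_job_run(JobName="sales-etl-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print("Final state:", state)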
Step 4: Deploying with CloudFormation
For production deployments, use AWS CloudFormation to define your infrastructure as code. This ensures consistency and makes it easy to replicate the pipeline across environments.
CloudFormation Template
Create a template that provisions all required resources:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Scalable ETL Pipeline with AWS Glue and S3'
Parameters:
Environment:
Type: String
Default: dev
AllowedValues:
- dev
- staging
- prod
Description: Deployment environment
ProjectName:
Type: String
Default: etl-pipeline
Description: Name prefix for all resources
Resources:
# S3 Buckets
RawDataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-raw-data-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
VersioningConfiguration:
Status: Enabled
ProcessedDataBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-processed-data-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
AssetsBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: !Sub '${ProjectName}-assets-${Environment}-${AWS::AccountId}'
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: AES256
# IAM Role for Glue
GlueServiceRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub 'AWSGlueServiceRole-${ProjectName}-${Environment}'
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: glue.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Policies:
- PolicyName: S3Access
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
- s3:ListBucket
Resource:
- !GetAtt RawDataBucket.Arn
- !Sub '${RawDataBucket.Arn}/*'
- !GetAtt ProcessedDataBucket.Arn
- !Sub '${ProcessedDataBucket.Arn}/*'
- !GetAtt AssetsBucket.Arn
- !Sub '${AssetsBucket.Arn}/*'
# Glue Database
GlueDatabase:
Type: AWS::Glue::Database
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseInput:
Name: !Sub '${ProjectName}_db_${Environment}'
Description: !Sub 'Database for ${ProjectName} ETL pipeline'
# Glue Crawler
SalesCrawler:
Type: AWS::Glue::Crawler
Properties:
Name: !Sub '${ProjectName}-sales-crawler-${Environment}'
Role: !GetAtt GlueServiceRole.Arn
DatabaseName: !Ref GlueDatabase
Targets:
S3Targets:
- Path: !Sub 's3://${RawDataBucket}/sales/'
SchemaChangePolicy:
UpdateBehavior: UPDATE_IN_DATABASE
DeleteBehavior: LOG
# Glue ETL Job
SalesETLJob:
Type: AWS::Glue::Job
Properties:
Name: !Sub '${ProjectName}-sales-etl-${Environment}'
Role: !GetAtt GlueServiceRole.Arn
Command:
Name: glueetl
ScriptLocation: !Sub 's3://${AssetsBucket}/scripts/sales_etl_job.py'
PythonVersion: '3'
DefaultArguments:
'--job-language': python
'--TempDir': !Sub 's3://${AssetsBucket}/temp/'
'--enable-metrics': 'true'
'--enable-continuous-cloudwatch-log': 'true'
'--OUTPUT_BUCKET': !Ref ProcessedDataBucket
'--DATABASE_NAME': !Ref GlueDatabase
GlueVersion: '4.0'
NumberOfWorkers: 2
WorkerType: G.1X
Timeout: 60
# Glue Trigger for scheduled execution
DailyTrigger:
Type: AWS::Glue::Trigger
Properties:
Name: !Sub '${ProjectName}-daily-trigger-${Environment}'
Type: SCHEDULED
Schedule: 'cron(0 6 * * ? *)'
StartOnCreation: true
Actions:
- JobName: !Ref SalesETLJob
# CloudWatch Alarm for job failures
JobFailureAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: !Sub '${ProjectName}-job-failure-${Environment}'
AlarmDescription: Alert when Glue job fails
MetricName: glue.driver.aggregate.numFailedTasks
Namespace: Glue
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 1
ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: JobName
          Value: !Ref SalesETLJob
        - Name: JobRunId
          Value: ALL
        - Name: Type
          Value: count
Outputs:
RawDataBucketName:
Description: Name of the raw data S3 bucket
Value: !Ref RawDataBucket
ProcessedDataBucketName:
Description: Name of the processed data S3 bucket
Value: !Ref ProcessedDataBucket
GlueDatabaseName:
Description: Name of the Glue database
Value: !Ref GlueDatabase
ETLJobName:
Description: Name of the Glue ETL job
Value: !Ref SalesETLJob
Deploy the Stack
# Deploy the CloudFormation stack
aws cloudformation create-stack \
--stack-name etl-pipeline-dev \
--template-body file://etl-pipeline-template.yaml \
--parameters ParameterKey=Environment,ParameterValue=dev \
--capabilities CAPABILITY_NAMED_IAM
# Monitor deployment progress
aws cloudformation describe-stacks \
--stack-name etl-pipeline-dev \
--query 'Stacks[0].StackStatus'
Monitoring and Troubleshooting
CloudWatch Metrics
AWS Glue automatically publishes job metrics to CloudWatch. Key metrics to monitor include (a query sketch follows this list):
- glue.driver.aggregate.bytesRead: Data volume processed
- glue.driver.aggregate.numCompletedTasks: Successfully completed tasks
- glue.driver.aggregate.numFailedTasks: Failed tasks requiring attention
- glue.driver.jvm.heap.used: Memory usage
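Here is a boto3 sketch for pulling one of these metrics programmatically. It assumes the standard Glue metric dimensions (JobName, JobRunId set to ALL, Type set to count) and the job name used earlier:
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum of failed tasks for the job over the last 24 hours
stats = cloudwatch.get_metric_statistics(
    Namespace="Glue",
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "sales-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},  # aggregate across runs
        {"Name": "Type", "Value": "count"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])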
View Job Logs
# Get recent job runs
aws glue get-job-runs --job-name sales-etl-job --max-results 5
# View CloudWatch logs for a specific run
# (log streams in this group are named after the Glue job run ID, e.g. jr_...)
aws logs get-log-events \
--log-group-name /aws-glue/jobs/output \
--log-stream-name <job-run-id>
Common Issues and Solutions
Schema mismatch errors: Run the crawler after data format changes to update the Data Catalog.
Memory issues: Increase worker count or worker type (G.1X to G.2X) for larger datasets.
Permission errors: Verify the Glue service role has access to all required S3 paths.
Best Practices
Data Organization
- Partition data by frequently-filtered columns (date, region) to reduce scan costs
- Use Parquet or ORC formats for analytical workloads
- Implement a medallion architecture with bronze (raw), silver (cleansed), and gold (aggregated) layers
Security
- Enable S3 bucket encryption at rest
- Use IAM roles with least-privilege permissions
- Block public access to all data buckets
- Enable versioning for data recovery
Cost Optimization
- Use Glue job bookmarks for incremental processing (see the sketch after this list)
- Schedule jobs during off-peak hours
- Right-size worker types based on actual workload
- Set appropriate timeouts to prevent runaway jobs
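As an illustration of job bookmarks, the sketch below shows the script-side requirement: each source read needs a transformation_ctx so Glue can track which files have already been processed, and the job itself must be created or updated with the default argument --job-bookmark-option set to job-bookmark-enable (not shown here):
# Sketch: incremental reads with Glue job bookmarks.
# Requires the job to run with '--job-bookmark-option': 'job-bookmark-enable'
# in its default arguments; transformation_ctx below is the bookmark key.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Only files not seen by a previous successful run are read
incremental = glueContext.create_dynamic_frame.from_catalog(
    database="etl_pipeline_db",
    table_name="sales",
    transformation_ctx="read_sales_incremental",
)

# ... transforms and writes go here ...

job.commit()  # advances the bookmark only after a successful run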
Reliability
- Implement error handling in ETL scripts (see the sketch after this list)
- Set up CloudWatch alarms for job failures
- Use S3 versioning for data recovery
- Test jobs in a non-production environment first
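For error handling, one common pattern is to wrap the transform-and-write logic so failures are logged with context and then re-raised, which marks the job run as FAILED and lets the CloudWatch alarm fire. A minimal sketch:
import logging

logger = logging.getLogger("sales_etl")
logging.basicConfig(level=logging.INFO)

def run_pipeline():
    # read, transform, and write steps from the ETL script go here
    ...

try:
    run_pipeline()
except Exception:
    # Log the full stack trace to CloudWatch, then re-raise so the
    # job run is marked FAILED instead of silently succeeding
    logger.exception("sales-etl-job failed")
    raise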
Conclusion
You've now built a scalable, serverless ETL pipeline on AWS using AWS Glue and Amazon S3. This architecture provides automatic scaling, requires no infrastructure management, and can handle data volumes from megabytes to petabytes.
The key components work together seamlessly: S3 provides durable storage for raw and processed data, Glue Crawlers automatically discover schemas, and Glue ETL jobs transform data using distributed Spark processing. CloudFormation enables reproducible deployments across environments.
As your data needs grow, consider extending this pipeline with:
- AWS Step Functions for orchestrating complex multi-step workflows
- Amazon Athena for SQL queries directly on S3 data (see the sketch below)
- Amazon Redshift as a target for analytical data warehousing
- AWS Lambda for event-driven triggers when new data arrives
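For example, once a crawler has catalogued the processed Parquet data (a second crawler pointed at the processed bucket, which this guide does not create), you could query it from Athena with a boto3 sketch like this; the table name and output location are assumptions:
import boto3

athena = boto3.client("athena")

# Run a SQL aggregation directly against the Parquet files in S3
response = athena.start_query_execution(
    QueryString="""
        SELECT region, SUM(total_amount) AS revenue
        FROM sales
        GROUP BY region
    """,
    QueryExecutionContext={"Database": "etl_pipeline_db"},
    ResultConfiguration={
        "OutputLocation": "s3://your-company-etl-assets/athena-results/"
    },
)
print("QueryExecutionId:", response["QueryExecutionId"])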