Getting Started with S3: A Step-by-Step Guide to Setting Up a Scalable Data Store

Amazon S3 is the foundation of data engineering on AWS. Whether you're building a data lake, storing raw files for ETL processing, or archiving historical data, S3 provides the durability, scalability, and cost-effectiveness that modern data workloads demand. This guide walks you through setting up S3 for data engineering use cases, from creating your first bucket to implementing security best practices.

What is Amazon S3?

Amazon Simple Storage Service (S3) is an object storage service that stores data as objects within buckets. Unlike traditional file systems with hierarchical directories, S3 uses a flat structure where each object is identified by a unique key within a bucket.
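
For example, in the URI below (the names are illustrative), the bucket is your-org-data-lake-raw and everything after it is a single object key, even though it looks like a nested path:

s3://your-org-data-lake-raw/bronze/sales/orders.csv
    bucket: your-org-data-lake-raw
    key:    bronze/sales/orders.csv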

Key characteristics that make S3 suitable for data engineering:

  • Durability: S3 Standard is designed for 99.999999999% (11 nines) of durability by redundantly storing data across multiple Availability Zones
  • Scalability: Store unlimited amounts of data without capacity planning
  • Strong consistency: All read operations return the most recent version of an object immediately after a write
  • Integration: Native connectivity with AWS analytics services like Athena, Glue, EMR, and Redshift

S3 supports multiple storage classes optimized for different access patterns. S3 Standard works for frequently accessed data, while S3 Intelligent-Tiering automatically moves data between tiers based on access patterns. For archive use cases, S3 Glacier provides lower-cost storage with retrieval times ranging from minutes to hours.
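
As a quick illustration (the bucket and file names are placeholders), the storage class can be set at upload time:

# Upload directly into the Intelligent-Tiering storage class
aws s3 cp events.json s3://your-org-data-lake-raw/bronze/events/ \
    --storage-class INTELLIGENT_TIERING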

Setting Up an S3 Bucket for Data Storage

Prerequisites

Before you begin, ensure you have:

  • An AWS account with appropriate permissions
  • AWS CLI installed and configured with your credentials
  • Basic familiarity with command-line operations

Create Your First Bucket

S3 bucket names must be globally unique across all AWS accounts. Use a naming convention that includes your organization name and purpose to avoid conflicts.

Using the AWS CLI:

# Create a bucket for raw data landing zone
aws s3 mb s3://your-org-data-lake-raw --region us-east-1

# Create a bucket for processed/curated data
aws s3 mb s3://your-org-data-lake-processed --region us-east-1

Bucket names must follow these rules:

  • Be between 3 and 63 characters long
  • Contain only lowercase letters, numbers, and hyphens
  • Start and end with a letter or number
  • Not be formatted as an IP address

Organize Your Data with Prefixes

Although S3 uses a flat namespace, you can simulate folder structures using prefixes. A well-organized prefix structure improves data discoverability and enables partition pruning in query engines.

# Create zero-byte folder-marker objects for a data lake layout
# (optional: prefixes also appear implicitly when objects are uploaded under them)
aws s3api put-object --bucket your-org-data-lake-raw --key bronze/
aws s3api put-object --bucket your-org-data-lake-raw --key bronze/sales/
aws s3api put-object --bucket your-org-data-lake-raw --key bronze/customers/
aws s3api put-object --bucket your-org-data-lake-raw --key bronze/products/

# Structure for processed data with date partitioning
aws s3api put-object --bucket your-org-data-lake-processed --key silver/sales/
aws s3api put-object --bucket your-org-data-lake-processed --key gold/aggregates/

For time-series data, use a partitioning scheme that matches your query patterns:

s3://your-org-data-lake-raw/bronze/sales/year=2026/month=01/day=25/

This Hive-style partitioning enables AWS Glue, Athena, and Spark to prune partitions and scan only relevant data.
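
For example, a daily extract can land directly in its date partition (the file name is illustrative):

# Land a daily extract in its date partition
aws s3 cp sales_2026-01-25.parquet \
    s3://your-org-data-lake-raw/bronze/sales/year=2026/month=01/day=25/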

Configuring Access Control and Security

Security is critical when storing sensitive data. AWS provides multiple layers of access control that work together to protect your data.

Block Public Access

By default, new S3 buckets block all public access. Verify this setting is enabled:

aws s3api get-public-access-block --bucket your-org-data-lake-raw

If not already configured, enable it:

aws s3api put-public-access-block \
    --bucket your-org-data-lake-raw \
    --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Configure Server-Side Encryption

S3 encrypts all new objects by default using server-side encryption with Amazon S3 managed keys (SSE-S3). For additional control, you can use AWS Key Management Service (KMS) keys:

# Enable default encryption with SSE-S3
aws s3api put-bucket-encryption \
    --bucket your-org-data-lake-raw \
    --server-side-encryption-configuration '{
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "AES256"
                },
                "BucketKeyEnabled": true
            }
        ]
    }'

For regulatory requirements that mandate customer-managed keys:

# Enable default encryption with SSE-KMS
aws s3api put-bucket-encryption \
    --bucket your-org-data-lake-raw \
    --server-side-encryption-configuration '{
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
                },
                "BucketKeyEnabled": true
            }
        ]
    }'

Enforce HTTPS Connections

Add a bucket policy that denies requests made over unencrypted HTTP:

cat > bucket-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceHTTPS",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::your-org-data-lake-raw",
                "arn:aws:s3:::your-org-data-lake-raw/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        }
    ]
}
EOF

aws s3api put-bucket-policy \
    --bucket your-org-data-lake-raw \
    --policy file://bucket-policy.json

Enable Versioning for Data Protection

Versioning protects against accidental deletions and enables point-in-time recovery:

aws s3api put-bucket-versioning \
    --bucket your-org-data-lake-raw \
    --versioning-configuration Status=Enabled

With versioning enabled, deleting an object creates a delete marker rather than permanently removing the data. You can restore previous versions when needed.
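
For example (the object key reuses the path shown earlier; the version ID is a placeholder), you can list an object's versions and restore an older one by copying it over the current version:

# List all versions of an object
aws s3api list-object-versions \
    --bucket your-org-data-lake-raw \
    --prefix bronze/sales/sales_data.csv

# Restore a previous version by copying it over the current object
aws s3api copy-object \
    --bucket your-org-data-lake-raw \
    --key bronze/sales/sales_data.csv \
    --copy-source "your-org-data-lake-raw/bronze/sales/sales_data.csv?versionId=EXAMPLE-VERSION-ID"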

Grant Access Using IAM Policies

Use IAM policies to grant applications and users access to your buckets. Follow the principle of least privilege by granting only the permissions required:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-org-data-lake-processed",
                "arn:aws:s3:::your-org-data-lake-processed/*"
            ]
        }
    ]
}

For applications running on EC2 or ECS, attach this policy to an IAM role rather than using access keys.
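
As a sketch, assuming the policy above is saved as read-processed-policy.json and a role named etl-worker-role already exists, you can attach it as an inline policy from the CLI:

# Attach the read-only policy above to an existing role
aws iam put-role-policy \
    --role-name etl-worker-role \
    --policy-name ReadProcessedDataLake \
    --policy-document file://read-processed-policy.json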

Uploading and Managing Data

Upload Files

Upload individual files or entire directories using the AWS CLI:

# Upload a single file
aws s3 cp sales_data.csv s3://your-org-data-lake-raw/bronze/sales/

# Upload a directory recursively
aws s3 cp ./local-data/ s3://your-org-data-lake-raw/bronze/sales/ --recursive

# Upload with a specific storage class
aws s3 cp archive_data.parquet s3://your-org-data-lake-raw/archive/ \
    --storage-class STANDARD_IA

Sync Local and Remote Data

The sync command uploads only new or changed files:

# Sync local directory to S3
aws s3 sync ./daily-exports/ s3://your-org-data-lake-raw/bronze/daily/

# Sync with delete (removes files from S3 that don't exist locally)
aws s3 sync ./daily-exports/ s3://your-org-data-lake-raw/bronze/daily/ --delete

List and Query Objects

# List objects in a bucket
aws s3 ls s3://your-org-data-lake-raw/bronze/sales/

# List with human-readable sizes
aws s3 ls s3://your-org-data-lake-raw/bronze/sales/ --human-readable --summarize

# List recursively
aws s3 ls s3://your-org-data-lake-raw/ --recursive

Configure Lifecycle Policies

Lifecycle policies automate data movement between storage classes and object expiration:

cat > lifecycle-policy.json << 'EOF'
{
    "Rules": [
        {
            "ID": "MoveToIAAfter30Days",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "bronze/"
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                }
            ]
        },
        {
            "ID": "DeleteOldVersions",
            "Status": "Enabled",
            "Filter": {},
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 30
            }
        }
    ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
    --bucket your-org-data-lake-raw \
    --lifecycle-configuration file://lifecycle-policy.json

This policy moves objects in the bronze/ prefix to Infrequent Access after 30 days and to Glacier after 90 days. Non-current versions are deleted after 30 days.

Enable Server Access Logging

Track requests to your bucket for security auditing and access analysis:

# Create a bucket for logs
aws s3 mb s3://your-org-data-lake-logs --region us-east-1

# Grant the S3 logging service permission to write to the log bucket
# (new buckets have ACLs disabled by default, so a bucket policy replaces the legacy log-delivery ACL)
cat > logging-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3ServerAccessLogsPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "logging.s3.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::your-org-data-lake-logs/access-logs/*"
        }
    ]
}
EOF

aws s3api put-bucket-policy \
    --bucket your-org-data-lake-logs \
    --policy file://logging-policy.json

# Enable logging
aws s3api put-bucket-logging \
    --bucket your-org-data-lake-raw \
    --bucket-logging-status '{
        "LoggingEnabled": {
            "TargetBucket": "your-org-data-lake-logs",
            "TargetPrefix": "access-logs/raw-bucket/"
        }
    }'

Best Practices for Data Engineering

Use Columnar Formats

Store analytical data in columnar formats like Parquet or ORC instead of CSV or JSON. Columnar formats provide:

  • Compression ratios of 4-10x compared to row-based formats
  • Predicate pushdown that reduces data scanned
  • Column pruning that reads only required columns

Partition Strategically

Choose partition keys based on your most common query filters. For time-series data:

s3://bucket/table/year=2026/month=01/day=25/data.parquet

Avoid over-partitioning, which creates many small files. Aim for partition sizes of at least 128 MB.
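
One way to sanity-check this (the partition path is illustrative) is to summarize a single partition with the same ls command shown earlier:

# Report the total size and object count for one partition
aws s3 ls s3://your-org-data-lake-processed/silver/sales/year=2026/month=01/ \
    --recursive --human-readable --summarize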

Monitor with S3 Storage Lens

S3 Storage Lens provides visibility into storage usage and activity across your organization. Enable it in the S3 console to track metrics like:

  • Total storage by bucket and prefix
  • Request patterns and error rates
  • Cost optimization opportunities
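
Storage Lens is also accessible through the s3control API; as a small illustration (the account ID is a placeholder), you can list the configurations already defined in your account:

# List Storage Lens configurations for the account
aws s3control list-storage-lens-configurations --account-id 123456789012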

Use S3 Select for Filtering

When you need only a subset of data from large files, S3 Select retrieves specific rows and columns without downloading the entire object:

aws s3api select-object-content \
    --bucket your-org-data-lake-raw \
    --key bronze/sales/sales_2026.csv \
    --expression "SELECT * FROM s3object WHERE region = 'US-EAST'" \
    --expression-type SQL \
    --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
    --output-serialization '{"CSV": {}}' \
    output.csv

Conclusion

Amazon S3 provides the storage foundation for data engineering on AWS. By following the practices in this guide, you can set up buckets that are secure, cost-effective, and ready for integration with AWS analytics services.

Start with a clear prefix structure that matches your data processing stages. Enable encryption and block public access from the beginning. Use lifecycle policies to automate cost optimization. As your data volumes grow, S3 scales automatically without requiring capacity management.

With your S3 storage configured, you can connect it to AWS Glue for cataloging and ETL, query data directly with Athena, or load it into Redshift for warehousing. S3 serves as the durable, scalable layer that ties these services together.
