Getting Started with Athena: A Step-by-Step Guide to Setting Up a Scalable Query Service

Amazon Athena enables you to run SQL queries directly against data stored in S3 without provisioning servers or managing infrastructure. For data engineers who need to perform ad-hoc analysis on data lakes, Athena provides a fast path from raw data to insights. This guide walks you through setting up Athena, creating your first database and table, and writing queries to analyze your data.

What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that lets you analyze data in Amazon S3 using standard SQL. You point Athena at your data, define the schema, and start querying. There's no infrastructure to set up or manage, and you pay only for the queries you run.

Key characteristics that make Athena valuable for data engineering:

  • Serverless: No clusters to provision or manage. Athena automatically parallelizes and scales query execution
  • SQL-based: Uses standard SQL syntax through the Trino (formerly PrestoSQL) engine for queries and Apache Hive for DDL statements
  • Schema-on-read: Define your schema when you query rather than when you load data
  • Pay-per-query: Charged based on the amount of data scanned, with no upfront costs
  • Native S3 integration: Query data directly where it lives in your data lake
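Because billing is driven entirely by bytes scanned, the cost model is easy to estimate up front. The sketch below assumes the commonly cited $5 per TB rate, per-megabyte rounding, and a 10 MB per-query minimum — check current AWS pricing for your region before relying on these figures:

```python
# Rough Athena cost estimator. The $5/TB rate, per-MB rounding, and 10 MB
# per-query minimum are illustrative assumptions -- confirm against current
# AWS pricing for your region.
import math

PRICE_PER_TB = 5.00      # USD per TiB scanned (assumed)
MIN_MB_PER_QUERY = 10    # per-query minimum in MiB (assumed)

def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost of a single query from the bytes it scans."""
    # Round up to the nearest megabyte, subject to the per-query minimum.
    mb = max(math.ceil(bytes_scanned / 2**20), MIN_MB_PER_QUERY)
    return mb * PRICE_PER_TB / 2**20   # 2**20 MiB per TiB

# A query scanning a full 1 TiB costs the headline rate:
print(round(estimate_query_cost(2**40), 2))   # 5.0
```

This is also a useful back-of-the-envelope check when deciding whether converting a dataset to Parquet is worth it: scanning 10x less data means paying roughly 10x less per query.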

Athena fits naturally into the AWS analytics ecosystem. It uses the AWS Glue Data Catalog to store table metadata, integrates with IAM for access control, and can query data produced by services like Kinesis Data Firehose, AWS Glue ETL jobs, or EMR.

Setting Up Athena for Querying S3 Data

Prerequisites

Before you begin, ensure you have:

  • An AWS account with appropriate IAM permissions for Athena and S3
  • An S3 bucket containing data you want to query
  • AWS CLI installed and configured (optional but recommended)

Configure a Query Results Location

Athena stores query results in an S3 bucket you specify. Create a dedicated bucket for query results to keep them organized and apply appropriate lifecycle policies.

# Create a bucket for Athena query results
aws s3 mb s3://your-org-athena-results --region us-east-1
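Query results accumulate quickly, so attach a lifecycle rule that expires old result files. A sketch using the AWS CLI — the 30-day window is an arbitrary choice; pick a retention period that fits your team:

```shell
# Expire Athena query results after 30 days (adjust retention to your needs)
aws s3api put-bucket-lifecycle-configuration \
    --bucket your-org-athena-results \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "expire-athena-results",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Expiration": {"Days": 30}
        }]
    }'
```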

Configure the query results location in the Athena console:

  1. Open the Athena console at https://console.aws.amazon.com/athena/
  2. If this is your first time, you'll see a prompt to set up a query result location
  3. Choose Settings in the navigation panel
  4. Enter your S3 path: s3://your-org-athena-results/
  5. Optionally enable encryption for query results
  6. Save your settings

You can also configure this using the AWS CLI by creating a workgroup:

aws athena create-work-group \
    --name primary-workgroup \
    --configuration '{
        "ResultConfiguration": {
            "OutputLocation": "s3://your-org-athena-results/"
        }
    }'

Set Up IAM Permissions

Users and applications querying Athena need permissions for both Athena operations and S3 data access. A minimal IAM policy includes:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:StopQueryExecution",
                "athena:GetWorkGroup"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-data-bucket",
                "arn:aws:s3:::your-data-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-org-athena-results",
                "arn:aws:s3:::your-org-athena-results/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        }
    ]
}

This policy grants permission to run queries, read source data, write query results, and access the Glue Data Catalog for metadata.

Creating a Database and Table in Athena

Create a Database

A database in Athena is a logical container for your tables. Create one to organize tables for a specific project or data domain.

In the Athena query editor, run:

CREATE DATABASE IF NOT EXISTS sales_analytics
COMMENT 'Database for sales data analysis'
LOCATION 's3://your-data-bucket/sales/';

The LOCATION clause is optional but useful for specifying a default S3 path for tables in this database.

Create a Table

Athena tables are metadata definitions that describe the structure and location of your data in S3. The data remains in S3—Athena reads it at query time using the schema you define.

For CSV data, create an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.orders (
    order_id STRING,
    customer_id STRING,
    order_date DATE,
    product_id STRING,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-data-bucket/sales/orders/'
TBLPROPERTIES ('skip.header.line.count'='1');
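If you manage many similar tables, generating the DDL from a column list avoids copy-paste drift. The helper below is purely a string-building sketch (it is not part of any AWS SDK; the function name and signature are invented for illustration):

```python
# Generate a CREATE EXTERNAL TABLE statement for CSV data in S3.
# String construction only -- illustrative, not an AWS API.

def csv_table_ddl(database: str, table: str, columns: list[tuple[str, str]],
                  location: str, skip_header: bool = True) -> str:
    """Build Athena DDL for a delimited-text table from (name, type) pairs."""
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    props = "\nTBLPROPERTIES ('skip.header.line.count'='1')" if skip_header else ""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'{props};"
    )

ddl = csv_table_ddl(
    "sales_analytics", "orders",
    [("order_id", "STRING"), ("quantity", "INT")],
    "s3://your-data-bucket/sales/orders/",
)
print(ddl)
```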

For Parquet data (recommended for analytics workloads):

CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.orders_parquet (
    order_id STRING,
    customer_id STRING,
    order_date DATE,
    product_id STRING,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://your-data-bucket/sales/orders_parquet/';

Create a Partitioned Table

Partitioning reduces query costs by limiting the data Athena scans. For time-series data, partition by date:

CREATE EXTERNAL TABLE IF NOT EXISTS sales_analytics.orders_partitioned (
    order_id STRING,
    customer_id STRING,
    product_id STRING,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(10,2)
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://your-data-bucket/sales/orders_partitioned/';

Your S3 data should follow a Hive-style partition structure:

s3://your-data-bucket/sales/orders_partitioned/year=2026/month=01/day=25/data.parquet

Load the partition metadata:

MSCK REPAIR TABLE sales_analytics.orders_partitioned;

This command scans your S3 location and registers all partitions it finds.
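MSCK REPAIR TABLE rescans the entire prefix, which gets slow as partition counts grow. For pipelines that write one partition at a time, registering each partition explicitly with ALTER TABLE ADD PARTITION is faster. A sketch that builds the statement for a given date, following the table and bucket names used above:

```python
# Build an ALTER TABLE ADD PARTITION statement for one day of data.
# String construction only; submit the result to Athena yourself.
import datetime

def add_partition_ddl(date: datetime.date) -> str:
    """Return DDL registering the Hive-style partition for `date`."""
    y, m, d = f"{date.year:04d}", f"{date.month:02d}", f"{date.day:02d}"
    return (
        "ALTER TABLE sales_analytics.orders_partitioned ADD IF NOT EXISTS\n"
        f"PARTITION (year='{y}', month='{m}', day='{d}')\n"
        f"LOCATION 's3://your-data-bucket/sales/orders_partitioned/"
        f"year={y}/month={m}/day={d}/';"
    )

print(add_partition_ddl(datetime.date(2026, 1, 25)))
```

Note the zero-padding of month and day, which must match the `month=01`-style folder names in S3 exactly.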

Supported Data Types

Athena supports a range of data types for defining your tables:

Category      Types
Boolean       BOOLEAN
Numeric       TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String        STRING, VARCHAR, CHAR
Date/Time     DATE, TIMESTAMP
Complex       ARRAY, MAP, STRUCT

For DDL statements like CREATE TABLE, Athena uses Apache Hive data type syntax. For DML queries, it uses Trino syntax, though the engine handles the translation automatically.

Writing SQL Queries Using Athena

Basic Queries

Query your data using standard SQL:

-- Count total orders
SELECT COUNT(*) AS total_orders
FROM sales_analytics.orders;

-- Calculate revenue by product
SELECT
    product_id,
    SUM(total_amount) AS revenue,
    COUNT(*) AS order_count
FROM sales_analytics.orders
GROUP BY product_id
ORDER BY revenue DESC
LIMIT 10;
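The same queries can be run programmatically. Athena's API is asynchronous: you start a query execution, poll its status, and then fetch results. A minimal sketch using the boto3 Athena client (passed in as a parameter so the function can be exercised with a stub; error handling is kept deliberately thin):

```python
# Minimal asynchronous query runner for Athena. Pass a boto3 Athena client,
# e.g. boto3.client("athena"); injecting the client keeps the loop testable.
import time

def run_query(client, sql: str, workgroup: str = "primary",
              poll_seconds: float = 1.0) -> str:
    """Start a query, wait for a terminal state, and return its execution ID."""
    qid = client.start_query_execution(
        QueryString=sql, WorkGroup=workgroup
    )["QueryExecutionId"]
    while True:
        state = client.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} finished in state {state}")
    return qid
```

Once the query succeeds, `client.get_query_results(QueryExecutionId=qid)` pages through the result rows, and the full result set is also available as a CSV file in your query results bucket.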

Filtering with Partitions

When querying partitioned tables, include partition columns in your WHERE clause to reduce data scanned:

-- Query specific date range using partition filters
SELECT
    customer_id,
    SUM(total_amount) AS total_spent
FROM sales_analytics.orders_partitioned
WHERE year = '2026'
  AND month = '01'
  AND day BETWEEN '01' AND '15'
GROUP BY customer_id
ORDER BY total_spent DESC;

Athena prunes partitions that don't match your filter, scanning only the relevant data and reducing costs.
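One subtlety: partition columns declared as STRING compare lexicographically, so the day values must be zero-padded ('05', not '5') for range predicates like BETWEEN to behave as expected. A quick check of the ordering:

```python
# Partition values stored as strings compare lexicographically, so
# zero-padding matters: '5' sorts after '15', while '05' sorts before it.
print('5' <= '15')    # False -- unpadded value falls outside the range
print('05' <= '15')   # True  -- zero-padded values order correctly
```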

Joining Tables

Join multiple tables to combine data:

SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.region,
    o.total_amount
FROM sales_analytics.orders o
JOIN sales_analytics.customers c
    ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2026-01-01';

Using Common Table Expressions (CTEs)

CTEs improve query readability for complex analysis:

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        SUM(total_amount) AS revenue
    FROM sales_analytics.orders
    GROUP BY DATE_TRUNC('month', order_date)
),
growth_calc AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_sales
)
SELECT
    month,
    revenue,
    ROUND((revenue - prev_revenue) / prev_revenue * 100, 2) AS growth_pct
FROM growth_calc
WHERE prev_revenue IS NOT NULL
ORDER BY month;
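The month-over-month arithmetic in the final SELECT is easy to sanity-check outside Athena; for instance, revenue moving from 100,000 to 125,000 is 25% growth:

```python
# Month-over-month growth, mirroring the SQL expression above.
def growth_pct(revenue: float, prev_revenue: float) -> float:
    """Percentage change from prev_revenue to revenue, rounded to 2 places."""
    return round((revenue - prev_revenue) / prev_revenue * 100, 2)

print(growth_pct(125_000, 100_000))   # 25.0
print(growth_pct(90_000, 100_000))    # -10.0
```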

Creating Views

Save frequently used queries as views:

CREATE OR REPLACE VIEW sales_analytics.daily_summary AS
SELECT
    order_date,
    COUNT(*) AS order_count,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total_amount) AS revenue
FROM sales_analytics.orders
GROUP BY order_date;

Query the view like any table:

SELECT * FROM sales_analytics.daily_summary
WHERE order_date >= DATE '2026-01-01'
ORDER BY order_date;

Create Table As Select (CTAS)

Transform and materialize query results as a new table:

CREATE TABLE sales_analytics.top_customers
WITH (
    format = 'PARQUET',
    external_location = 's3://your-data-bucket/sales/top_customers/'
) AS
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS lifetime_value
FROM sales_analytics.orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;

CTAS is useful for creating derived datasets that you query frequently, converting data to columnar formats, or aggregating data for dashboards.

Best Practices for Athena

Optimize Data Format

Use columnar formats like Parquet or ORC instead of CSV or JSON. Columnar formats provide:

  • Better compression (4-10x smaller files)
  • Column pruning (read only the columns you need)
  • Predicate pushdown (filter data during scan)

Partition Strategically

Partition data by columns you frequently filter on. For time-series data, partitioning by date is common. Avoid over-partitioning—aim for partition sizes of at least 128 MB to prevent small file overhead.

Use Workgroups

Organize users and queries into workgroups to:

  • Set query result locations per team
  • Enforce data limits per query
  • Track costs by workgroup
  • Apply different IAM permissions

Compress Your Data

Athena supports gzip, Snappy, LZ4, and ZSTD compression. Compressing data reduces storage costs and the amount of data scanned, lowering query costs.

Monitor Costs

Since Athena charges based on data scanned, monitor your queries:

  • Use the AWS Cost Explorer to track Athena spending
  • Check the "Data scanned" metric for each query in the console
  • Optimize expensive queries by adding filters and using partitions

Conclusion

Amazon Athena provides a serverless path to querying data in S3 using standard SQL. You define your schema, point to your data, and start analyzing—without managing infrastructure or loading data into a separate system.

Start by setting up a query results bucket and creating a database for your tables. Use partitioned tables and columnar formats to control costs as your data grows. Write queries that filter on partition columns to scan only the data you need.

With Athena configured, you can explore data lakes interactively, build dashboards with Amazon QuickSight, or integrate queries into data pipelines. The serverless model means you can run occasional ad-hoc queries without maintaining always-on infrastructure, paying only when you actually query your data.
