Ready to explore the power of Apache Airflow? This step-by-step guide will walk you through setting up a weather data pipeline that fetches data from an API and stores it in PostgreSQL. Whether you're new to Airflow or looking to automate your workflows, this tutorial will help you build a fully functional pipeline in no time!
Apache Airflow is a powerful workflow orchestration tool that enables data engineers to automate data pipelines, schedule tasks, and monitor execution flows. In this blog post, we will explore Apache Airflow’s capabilities by building a weather data pipeline that:
✅ Fetches real-time weather data from an API.
✅ Stores it in PostgreSQL.
✅ Runs automatically on an hourly schedule.
We will use Docker Compose to set up Apache Airflow and PostgreSQL.
Apache Airflow provides several advantages for workflow orchestration:
✅ Dynamic Workflows – Define workflows as Python code using Directed Acyclic Graphs (DAGs).
✅ Scalability – Run tasks in parallel using distributed execution.
✅ Monitoring & UI – Track execution progress and logs via a web interface.
✅ Extensibility – Integrate with APIs, databases, and cloud platforms using built-in and custom operators.
For this tutorial, we will set up a weather data pipeline with Airflow and PostgreSQL using Docker Compose.
Before we start, make sure you have:
- Docker and Docker Compose installed.
- make available on your machine (the project uses a Makefile as a thin wrapper around Docker Compose).
- An API key for the weather API you plan to call.
A well-structured project is essential for maintainability. Here's how we'll organize our files:
airflow-weather-pipeline/
├── Makefile
├── .env.example              # Example environment variables
├── README.md                 # Project documentation
├── dags/                     # DAG scripts
│   ├── weather_pipeline.py   # The weather DAG code
│   └── weather_utils/        # Additional modules
├── docker-compose.yml        # Docker setup
└── scripts/                  # Config, fetch, and store scripts
    ├── init.sh               # The script that runs on Airflow initialization
    ├── schema.sql            # Database schema
    └── schema_version.txt    # Current schema version
Create a docker-compose.yml file to define the services:
services:
  postgres:
    image: postgres:17
    container_name: airflow-postgres
    ...

  airflow-init:
    image: apache/airflow:2.10.5
    container_name: airflow-init
    ...

  airflow-scheduler:
    image: apache/airflow:2.10.5
    container_name: airflow-scheduler
    ...

  airflow-webserver:
    image: apache/airflow:2.10.5
    container_name: airflow-webserver
    ...

volumes:
  ...

networks:
  ...
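The elided sections depend on your environment. As one illustration only (the credentials, ports, volume mounts, and environment variables below are placeholder assumptions, not the post's exact configuration; the airflow/airflow names simply mirror the verification command used later), the postgres service and the shared Airflow settings might look like this:

services:
  postgres:
    image: postgres:17
    container_name: airflow-postgres
    environment:
      POSTGRES_USER: airflow          # placeholder credentials
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.10.5
    container_name: airflow-webserver
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    volumes:
      - ./dags:/opt/airflow/dags        # mount the DAGs and scripts from the repo
      - ./scripts:/opt/airflow/scripts
    ports:
      - "8080:8080"
    command: webserver

volumes:
  postgres-data:

The scheduler and init services would share the same image, environment, and volume mounts, differing only in the command they run.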
This setup runs four containers: PostgreSQL, a one-off Airflow initialization job, the Airflow scheduler, and the Airflow webserver, which serves the UI at http://localhost:8080/. To start the services, run:
make up
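The Makefile itself is not shown in this post; a minimal version that simply wraps Docker Compose (the targets other than up are assumptions) could look like this:

# Start all services in the background
up:
	docker compose up -d

# Stop and remove the containers
down:
	docker compose down

# Tail logs from all services
logs:
	docker compose logs -f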
Before we run our DAG, let's create a PostgreSQL table for storing weather data.
Create a script scripts/init.sh:
#!/bin/bash
...
PGPASSWORD="$POSTGRES_PASSWORD" psql \
  -h "$POSTGRES_HOST" \
  -U "$POSTGRES_USER" \
  -d "$POSTGRES_DB" \
  -f /opt/airflow/scripts/schema.sql
...
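The post doesn't show how init.sh gets invoked or where its POSTGRES_* variables come from. One common approach (the exact command and values below are assumptions) is to mount the scripts directory into the airflow-init container, pass the connection details through the environment, and run the script after Airflow's own database migration:

  airflow-init:
    image: apache/airflow:2.10.5
    container_name: airflow-init
    environment:
      POSTGRES_HOST: postgres      # assumed values, matching the compose service name
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./scripts:/opt/airflow/scripts
    command: bash -c "airflow db migrate && /opt/airflow/scripts/init.sh"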
Create a script scripts/schema.sql:
CREATE TABLE IF NOT EXISTS weather_data (
    id SERIAL PRIMARY KEY,
    city VARCHAR(50) NOT NULL,
    temperature FLOAT NOT NULL,
    humidity INT NOT NULL,
    weather VARCHAR(50) NOT NULL,
    -- ... additional columns omitted ...
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
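For reference, the store step will issue inserts of roughly this shape (the city and values here are placeholders):

INSERT INTO weather_data (city, temperature, humidity, weather)
VALUES ('Berlin', 12.5, 78, 'Clouds');
-- id and timestamp are filled automatically by the SERIAL and DEFAULT clauses.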
Now, let's create the DAG to fetch and store weather data.
Edit dags/weather_pipeline.py:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

from weather_utils.fetch_weather import fetch_weather_data
from weather_utils.db_operations import store_weather_data

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2025, 3, 24),
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "weather_pipeline",
    default_args=default_args,
    schedule_interval="@hourly",
    catchup=False,
    description="Fetches weather data and stores it in PostgreSQL",
)

# Task 1: call the weather API
fetch_weather_task = PythonOperator(
    task_id="fetch_weather_data",
    python_callable=fetch_weather_data,
    dag=dag,
)

# Task 2: write the result to PostgreSQL
store_weather_task = PythonOperator(
    task_id="store_weather_data",
    python_callable=store_weather_data,
    dag=dag,
)

# Run the fetch task before the store task
fetch_weather_task >> store_weather_task
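The two helper modules in dags/weather_utils/ are not listed in the post. As a rough sketch of what they might contain (the OpenWeatherMap endpoint, the WEATHER_API_KEY and WEATHER_CITY environment variables, and the weather_postgres connection ID are all assumptions for illustration, not part of the original project), fetch_weather.py could call the API and return a dict, which Airflow pushes to XCom automatically, and db_operations.py could pull that dict and insert it using the Postgres provider's hook:

# dags/weather_utils/fetch_weather.py (hypothetical sketch)
import os
import requests


def fetch_weather_data():
    """Fetch current weather for one city and return it as a dict (pushed to XCom)."""
    city = os.environ.get("WEATHER_CITY", "London")          # assumed env var
    api_key = os.environ["WEATHER_API_KEY"]                  # assumed env var
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",   # assumed provider
        params={"q": city, "appid": api_key, "units": "metric"},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    return {
        "city": city,
        "temperature": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "weather": payload["weather"][0]["main"],
    }


# dags/weather_utils/db_operations.py (hypothetical sketch)
from airflow.providers.postgres.hooks.postgres import PostgresHook


def store_weather_data(ti=None):
    """Pull the fetched record from XCom and insert it into weather_data."""
    record = ti.xcom_pull(task_ids="fetch_weather_data")
    hook = PostgresHook(postgres_conn_id="weather_postgres")  # assumed connection ID
    hook.run(
        """
        INSERT INTO weather_data (city, temperature, humidity, weather)
        VALUES (%s, %s, %s, %s)
        """,
        parameters=(
            record["city"],
            record["temperature"],
            record["humidity"],
            record["weather"],
        ),
    )

In this sketch, id and timestamp are filled by PostgreSQL, matching the schema above, and a connection named weather_postgres would need to be created in the Airflow UI or via environment variables.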
Start the services with make up, open the Airflow UI at http://localhost:8080/, and enable the weather_pipeline DAG (or trigger it manually). Once a run has completed, verify that data landed in PostgreSQL:
docker exec -it airflow-postgres psql -U airflow -d airflow -c "SELECT * FROM weather_data;"
🎯 You just built a fully functional Apache Airflow pipeline! 🚀
✅ Fetches real-time weather data.
✅ Stores it in PostgreSQL.
✅ Runs automatically on an hourly schedule.
From here, you can extend the pipeline, for example by fetching data for multiple cities, adding data-quality checks, or visualizing the stored readings in a dashboard.