Learn how to use Data Load Tool (dlt) to extract weather data from a REST API and load it into DuckDB. This beginner-friendly project demonstrates a simple yet effective data loading pattern perfect for API integration workflows.
This document explains the dlt example provided in dlt_example.py.
The example demonstrates how to:
- Use dlt to extract data from a REST API.
- Load the extracted data into a local DuckDB database.

Before running the example, ensure you have installed the required packages. You can install dlt and duckdb using the following command:
```bash
pip install dlt duckdb
```
```python
import dlt
import requests
```
In this section, we import dlt, the Data Load Tool, which helps with data extraction and loading, and requests, a common library for making HTTP requests to APIs.
```python
@dlt.resource(write_disposition="append")
def weather_data():
    # Request hourly temperature readings for Tokyo from the Open-Meteo API
    api_url = "https://api.open-meteo.com/v1/forecast?latitude=35.6895&longitude=139.6917&hourly=temperature_2m"
    response = requests.get(api_url)
    data = response.json()
    # Yield the hourly temperature data one record at a time
    for timestamp, temperature in zip(data['hourly']['time'], data['hourly']['temperature_2m']):
        yield {
            "timestamp": timestamp,
            "temperature": temperature
        }
```
In this block, we define the weather_data function, which extracts hourly temperature data from the Open-Meteo API.
- `@dlt.resource(write_disposition="append")`: This decorator tells dlt to handle the function as a resource. `write_disposition="append"` ensures that new data is appended to the table instead of overwriting existing data (a variant that avoids duplicate rows on re-runs is sketched after the pipeline explanation below).
- The `yield` statement provides records one at a time, which is essential for loading large datasets efficiently.

```python
pipeline = dlt.pipeline(
    pipeline_name="weather_pipeline",
    destination="duckdb",
    dataset_name="weather_data",
    credentials={"database": "weather_data.duckdb"}
)

load_info = pipeline.run(weather_data)
print(load_info)
```
This block sets up the data pipeline using dlt:
pipeline_name="weather_pipeline": The name of the pipeline.destination="duckdb": Specifies DuckDB as the destination database.dataset_name="weather_data": This is the name of the dataset inside the DuckDB database.credentials={"database": "weather_data.duckdb"}: Specifies that the data should be saved in a local DuckDB file called weather_data.duckdb.The pipeline is executed using pipeline.run(weather_data), which loads the data into DuckDB.
After the pipeline has run, you can query the data from the DuckDB database like this:
```python
import duckdb

con = duckdb.connect("weather_data.duckdb")
# dlt creates the table inside the "weather_data" schema (the dataset_name),
# so the table is addressed as weather_data.weather_data
df = con.execute("SELECT * FROM weather_data.weather_data").fetchdf()
print(df)
```
This connects to the weather_data.duckdb file and runs a query to fetch the data stored in the weather_data table, which dlt created inside the weather_data schema (the dataset_name). The result is printed as a Pandas DataFrame.
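You can of course run any SQL that DuckDB supports. As a follow-up sketch (the double cast is an assumption to handle the timestamp column whether dlt stored it as TIMESTAMP or as ISO text), this aggregates the hourly readings into daily averages:

```python
import duckdb

con = duckdb.connect("weather_data.duckdb")
# Average temperature per calendar day, computed inside DuckDB.
# CAST(... AS TIMESTAMP) works for either a TIMESTAMP or an ISO-text column.
df = con.execute("""
    SELECT CAST(CAST(timestamp AS TIMESTAMP) AS DATE) AS day,
           AVG(temperature) AS avg_temperature
    FROM weather_data.weather_data
    GROUP BY day
    ORDER BY day
""").fetchdf()
print(df)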
This example shows how to use dlt to automate the process of extracting, transforming, and loading data from a REST API into a DuckDB database. The pipeline can be extended with more complex data transformations or additional data sources.
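For instance, dlt resources support `add_map`, which applies a function to every record before it is loaded. Here is a minimal sketch of such a transformation, reusing the pipeline and weather_data defined above (the Fahrenheit conversion and the add_fahrenheit helper are illustrative, not part of the original example):

```python
# Hypothetical transformation: enrich each record with a derived field.
def add_fahrenheit(item):
    item["temperature_f"] = item["temperature"] * 9 / 5 + 32
    return item

# add_map runs the function on every yielded record before loading
load_info = pipeline.run(weather_data.add_map(add_fahrenheit))
print(load_info)
```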