What will I learn from the Polars vs Pandas in Production Pipelines project?

Understand why Polars is structurally faster than pandas (GIL, Arrow, lazy evaluation). Read and interpret production benchmark results critically. Install and configure Polars with uv, including the Apple Silicon variant. Rewrite a pandas ETL pipeline in Polars LazyFrame API. Use streaming execution for larger-than-memory datasets. Apply sink_parquet for out-of-core writes. Set POLARS_MAX_THREADS for safe production deployments. Identify which workloads still belong in pandas. Use the Polars-to-pandas hybrid pattern for ML pipelines

What are the prerequisites for Polars vs Pandas in Production Pipelines?

Solid Python basics. Familiarity with pandas DataFrames and groupby operations. Basic understanding of ETL pipelines. Command-line comfort for running scripts

What difficulty level is the Polars vs Pandas in Production Pipelines project?

This is a intermediate-level project. It assumes familiarity with Python and basic data engineering concepts.

Where can I find the code for Polars vs Pandas in Production Pipelines?

The complete code for Polars vs Pandas in Production Pipelines is available on GitHub at https://github.com/nenalukic/polars-vs-pandas

Polars vs Pandas in Production Pipelines

The tool that made Python data engineering accessible is quietly becoming the thing that slows it down.

The Default That Is Slowing You Down

Most Python data engineers learned pandas first. I was one of them as well. It was the standard, it worked, and it kept working well enough that nobody questioned it. That is exactly the problem.

Pandas was designed in 2008 for a different kind of data problem. It assumes you want immediate results from every operation, that a single CPU core is enough, and that data fits comfortably in RAM. Those assumptions held for years. They hold less and less as pipelines grow.

The question is not whether pandas is good. I know it is. I used it for years and it served me well. The question is if it still matches the shape of the workloads you are actually running.

For file-based ETL above roughly 1 GB, the answer is increasingly no. Not because pandas is broken, but because the ecosystem has produced something structurally better for that specific job. That tool is Polars. And the gap between the two is not marginal.

Why Polars Is Structurally Faster

Before any benchmark numbers, the mechanism matters. I spent a long time treating Polars as "pandas but faster" before I understood why it is actually different. Once I did, the decision about when to use it became obvious. Understanding why it is faster tells you exactly when the advantage applies and when it does not.

Pandas vs Polars execution models compared side by side

The Pandas Execution Model

Pandas runs on NumPy, which is single-threaded for most operations. Python's Global Interpreter Lock prevents parallel execution even when multiple threads are available. Every operation runs immediately, which means intermediate results are created and stored in memory whether you need them or not. A chain of five transformations on a 5 GB DataFrame creates five intermediate copies. Memory usage compounds fast.

String columns make this worse. By default, pandas stores strings as NumPy object arrays: each value is a Python object pointer, consuming roughly 67 bytes per unique string versus 4 bytes of overhead in columnar storage. I first noticed this when a pipeline that should have used 3 GB of RAM was consistently consuming 9 GB. The culprit was a single string column on a wide DataFrame. A column of 1 million 18-character strings uses around 75 MB in pandas object dtype. The same column in Arrow format uses around 22 MB.

The Polars Execution Model

Polars is written in Rust and runs outside the GIL entirely. It stores data in Apache Arrow columnar format: each column is a contiguous block of typed memory, cache-friendly and SIMD-ready. When you call scan_parquet, no data is read yet. Polars builds a logical execution plan and optimizes it before touching a single byte: predicates are pushed down to the scan level, unused columns are pruned before reading, and joins are reordered for efficiency.

Execution is morsel-driven. Each CPU thread takes a chunk of input, processes it independently with its own local state, and merges results at the end. There is no lock contention during computation. All cores run simultaneously.

The practical consequence is something I measured myself, and the difference is not subtle. A pipeline that touches a 10 GB Parquet file in pandas reads all 10 GB, processes it sequentially on one core, and creates intermediate copies along the way. The same pipeline in Polars reads only the columns and row groups it needs, processes them in parallel across all cores, and never materializes an intermediate result it does not have to.

The Benchmark Evidence

Three named tests establish the range of the gap.

Benchmark results across different scales and workload types

The PDS-H Benchmark

The Polars team published updated PDS-H results in May 2025, run on an AWS c7a.24xlarge instance with 96 vCPUs and 192 GB of RAM. The benchmark runs all 22 TPC-H-derived queries against a 10 GB CSV dataset. Polars streaming finished in 3.89 seconds. Pandas 2.2.3 with PyArrow dtypes enabled finished in 365.71 seconds. That is a 94x gap on a full multi-step analytical pipeline.

At scale factor 100 (100 GB), pandas was excluded entirely. Its single-threaded execution and lack of a query optimizer caused out-of-memory failures before completing the benchmark. Polars streaming finished in 23.94 seconds.

This is a vendor-run benchmark and I want to be clear about that. Treat it as directional, not as a universal promise. Independent tests on modest hardware consistently show smaller gaps, typically 5 to 22x on individual operations. The headline multiple emerges when the query optimizer and multi-threading compound across a full pipeline.

The Production Evidence

Check Technologies, a Dutch mobility provider, migrated all 100+ Airflow DAGs from pandas to Polars over a single sprint. Their motivation was not performance benchmarks — it was repeated out-of-memory errors on their most data-intensive pipelines. The migration took under two weeks and delivered a 3.3x speed improvement on the most problematic DAG, roughly 2x gains across almost all others, and a 25 percent reduction in cloud infrastructure cost.

DB Systel, a Deutsche Bahn subsidiary, rewrote a train-scheduling reprocessing job in Polars 0.20. The job previously ran in 96 minutes. The Polars version ran in 5.5 minutes. That is a 17.5x improvement on a real production workload, not a synthetic benchmark.

The counterintuitive finding: on small data, the gap largely disappears and sometimes reverses. Polars' query optimizer adds planning overhead that exceeds the compute savings when the dataset is under a few hundred megabytes. For quick scripts on small frames, pandas is faster to write and fast enough to run.

Setting Up And Running Both

The setup is one command.

# Create project and add Polars
uv init polars-pipeline
cd polars-pipeline
uv add polars pyarrow  # standard install
uv run pipeline.py

No compilation. No system dependencies. Polars ships as a pre-built wheel with the Rust runtime included.

One thing worth knowing before you start: on Apple Silicon (M-series Macs), the standard polars wheel triggers a CPU compatibility warning and recommends the runtime-compatible variant.

# Apple Silicon: use this instead
uv add "polars[rtcompat]" pyarrow

On Linux and Intel Macs the standard install is fine. On M-series hardware, polars[rtcompat] avoids the warning and ensures the correct SIMD instructions are used for your processor.

Side-By-Side: The Same Operation In Both Libraries

When I first looked at Polars code, the syntax felt unfamiliar. The expression API takes a day to get used to. After that day, I found it more readable than the pandas equivalent, not less. The code below runs an identical ETL transformation: read a Parquet file, filter rows, aggregate by category, and sort the result.

# pandas version
import pandas as pd
import time

start = time.perf_counter()
df = pd.read_parquet("sales.parquet")
result = (
    df[df["revenue"] > 1000]
    .groupby("category")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_price=("price", "mean"),
        order_count=("order_id", "count"),
    )
    .sort_values("total_revenue", ascending=False)
    .reset_index()
)
print(f"pandas: {time.perf_counter() - start:.3f}s")
print(result)

# Polars version — lazy execution with predicate and projection pushdown
import polars as pl
import time

start = time.perf_counter()
result = (
    pl.scan_parquet("sales.parquet")    # no data read yet
    .filter(pl.col("revenue") > 1000)   # pushed into the scan
    .group_by("category")
    .agg(
        total_revenue=pl.col("revenue").sum(),
        avg_price=pl.col("price").mean(),
        order_count=pl.col("order_id").count(),
    )
    .sort("total_revenue", descending=True)
    .collect()  # execution happens here
)
print(f"Polars: {time.perf_counter() - start:.3f}s")
print(result)

The structural difference is in scan_parquet versus read_parquet. The pandas version reads the entire file immediately. The Polars version builds a plan and reads only the rows and columns required to satisfy the filter and aggregation. On a 1 GB file with 20 columns where you need 3, Polars may read less than 20 percent of the bytes pandas reads.

Streaming For Larger-Than-Memory Data

When your dataset exceeds available RAM, add engine="streaming" to collect. Polars processes the data in batches called morsels, spilling to disk adaptively and never holding the full dataset in memory at once.

result = (
    pl.scan_parquet("large_dataset/*.parquet")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("amount").sum())
    .sort("amount", descending=True)
    .collect(engine="streaming")  # out-of-core execution
)

For writing directly to disk without materializing in memory at all, use sink_parquet:

(
    pl.scan_parquet("raw/*.parquet")
    .filter(pl.col("error_code").is_null())
    .sink_parquet("clean/output.parquet")  # streams to disk in batches
)

Production Guardrails

Set thread count explicitly in production environments to avoid contention with other processes:

import os
os.environ["POLARS_MAX_THREADS"] = "8"  # MUST be set before importing polars
import polars as pl  # Polars initializes its thread pool at import time

Pin your Polars version in pyproject.toml. The API is stable at 1.x but minor versions introduce new features that change behavior on edge cases. Lock it, test in a container matching your production OS, and upgrade deliberately.

Where Pandas Still Wins

Polars is the better tool for file-based ETL on data above 1 GB. I still use pandas every week. There are three situations where I reach for it without hesitation.

Decision framework: Polars vs Pandas by data size and workload type

The ML Ecosystem Constraint

Most ML libraries accept pandas DataFrames as their native input format. scikit-learn, statsmodels, and many plotting libraries expect pandas either explicitly or by default. Rewriting your entire stack to avoid a single .to_pandas() call is not a trade-off worth making.

The practical pattern I use in every pipeline that feeds a model is the hybrid approach: Polars for the expensive compute work, then convert at the final step.

# Polars for the heavy lifting
features = (
    pl.scan_parquet("events/*.parquet")
    .group_by("user_id")
    .agg([
        pl.col("session_duration").mean().alias("avg_session"),
        pl.col("purchase").sum().alias("total_purchases"),
        pl.col("event_date").max().alias("last_seen"),
    ])
    .collect(engine="streaming")
)

# Zero-copy conversion at the boundary
X = features.to_pandas()  # Arrow-backed, near-instant

The conversion is near-instant because both Polars and recent pandas share the Apache Arrow memory format. No data is copied when both sides use Arrow types.

What Pandas 3.0 Changed

Pandas 3.0, released January 2026, made PyArrow-backed strings the default for string columns and made Copy-on-Write the only execution mode. These changes reduce memory usage on string-heavy datasets by up to 70 percent and eliminate a class of silent mutation bugs.

These are real improvements. They close the I/O and memory gap meaningfully. But they do not close the multi-core execution gap. Pandas aggregations remain single-threaded and eager with no query optimizer in any version. If your bottleneck is compute rather than memory, the 3.0 update does not change the calculus.

The Decision You Need To Make This Week

Three signals mean a pipeline is ready to migrate:

It runs out of memory on production-sized data, or you have added chunking logic to work around memory limits.
It takes minutes where it should take seconds, and profiling shows that pandas groupby or join operations dominate the runtime.
It runs on a machine with multiple CPU cores that sit idle during execution.

Any one of those signals is enough. All three together mean the migration pays back in the first sprint.

The starting point is narrow and concrete. Pick one pipeline — the slowest or the one that OOMs most often. Rewrite only the compute-heavy portion in Polars LazyFrame, keep the pandas boundary at the output for anything feeding into ML or plotting, and benchmark on production-sized data before promoting. If you do not see at least a 3x improvement on a dataset above 1 GB, the bottleneck is probably not the DataFrame library.

Sources

Polars team, "Updated PDS-H Benchmark Results," pola.rs, May 2025
Paul Duvenage, Check Technologies engineering blog, 2024
DB Systel engineering blog, 2024
Nahrstedt et al., "An Empirical Study on the Energy Usage and Performance of Pandas and Polars," EASE 2024
Polars 1.0 announcement, pola.rs, July 2024
Pandas 3.0 release notes, pandas.pydata.org, January 2026

Polars vs Pandas in Production Pipelines

The tool that made Python data engineering accessible is quietly becoming the thing that slows it down.

The Default That Is Slowing You Down

Most Python data engineers learned pandas first. I was one of them as well. It was the standard, it worked, and it kept working well enough that nobody questioned it. That is exactly the problem.

The question is not whether pandas is good. I know it is. I used it for years and it served me well. The question is if it still matches the shape of the workloads you are actually running.

Why Polars Is Structurally Faster

Pandas vs Polars execution models compared side by side

The Pandas Execution Model

The Polars Execution Model

The Benchmark Evidence

Three named tests establish the range of the gap.

Benchmark results across different scales and workload types

The PDS-H Benchmark

The Production Evidence

Setting Up And Running Both

The setup is one command.

# Create project and add Polars
uv init polars-pipeline
cd polars-pipeline
uv add polars pyarrow  # standard install
uv run pipeline.py

No compilation. No system dependencies. Polars ships as a pre-built wheel with the Rust runtime included.

One thing worth knowing before you start: on Apple Silicon (M-series Macs), the standard polars wheel triggers a CPU compatibility warning and recommends the runtime-compatible variant.

# Apple Silicon: use this instead
uv add "polars[rtcompat]" pyarrow

On Linux and Intel Macs the standard install is fine. On M-series hardware, polars[rtcompat] avoids the warning and ensures the correct SIMD instructions are used for your processor.

Side-By-Side: The Same Operation In Both Libraries

# pandas version
import pandas as pd
import time

start = time.perf_counter()
df = pd.read_parquet("sales.parquet")
result = (
    df[df["revenue"] > 1000]
    .groupby("category")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_price=("price", "mean"),
        order_count=("order_id", "count"),
    )
    .sort_values("total_revenue", ascending=False)
    .reset_index()
)
print(f"pandas: {time.perf_counter() - start:.3f}s")
print(result)

# Polars version — lazy execution with predicate and projection pushdown
import polars as pl
import time

start = time.perf_counter()
result = (
    pl.scan_parquet("sales.parquet")    # no data read yet
    .filter(pl.col("revenue") > 1000)   # pushed into the scan
    .group_by("category")
    .agg(
        total_revenue=pl.col("revenue").sum(),
        avg_price=pl.col("price").mean(),
        order_count=pl.col("order_id").count(),
    )
    .sort("total_revenue", descending=True)
    .collect()  # execution happens here
)
print(f"Polars: {time.perf_counter() - start:.3f}s")
print(result)

Streaming For Larger-Than-Memory Data

result = (
    pl.scan_parquet("large_dataset/*.parquet")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("amount").sum())
    .sort("amount", descending=True)
    .collect(engine="streaming")  # out-of-core execution
)

For writing directly to disk without materializing in memory at all, use sink_parquet:

(
    pl.scan_parquet("raw/*.parquet")
    .filter(pl.col("error_code").is_null())
    .sink_parquet("clean/output.parquet")  # streams to disk in batches
)

Production Guardrails

Set thread count explicitly in production environments to avoid contention with other processes:

import os
os.environ["POLARS_MAX_THREADS"] = "8"  # MUST be set before importing polars
import polars as pl  # Polars initializes its thread pool at import time

Where Pandas Still Wins

Polars is the better tool for file-based ETL on data above 1 GB. I still use pandas every week. There are three situations where I reach for it without hesitation.

Decision framework: Polars vs Pandas by data size and workload type

The ML Ecosystem Constraint

The practical pattern I use in every pipeline that feeds a model is the hybrid approach: Polars for the expensive compute work, then convert at the final step.

# Polars for the heavy lifting
features = (
    pl.scan_parquet("events/*.parquet")
    .group_by("user_id")
    .agg([
        pl.col("session_duration").mean().alias("avg_session"),
        pl.col("purchase").sum().alias("total_purchases"),
        pl.col("event_date").max().alias("last_seen"),
    ])
    .collect(engine="streaming")
)

# Zero-copy conversion at the boundary
X = features.to_pandas()  # Arrow-backed, near-instant

The conversion is near-instant because both Polars and recent pandas share the Apache Arrow memory format. No data is copied when both sides use Arrow types.

What Pandas 3.0 Changed

The Decision You Need To Make This Week

Three signals mean a pipeline is ready to migrate:

It runs out of memory on production-sized data, or you have added chunking logic to work around memory limits.
It takes minutes where it should take seconds, and profiling shows that pandas groupby or join operations dominate the runtime.
It runs on a machine with multiple CPU cores that sit idle during execution.

Any one of those signals is enough. All three together mean the migration pays back in the first sprint.

Sources

Polars team, "Updated PDS-H Benchmark Results," pola.rs, May 2025
Paul Duvenage, Check Technologies engineering blog, 2024
DB Systel engineering blog, 2024
Nahrstedt et al., "An Empirical Study on the Energy Usage and Performance of Pandas and Polars," EASE 2024
Polars 1.0 announcement, pola.rs, July 2024
Pandas 3.0 release notes, pandas.pydata.org, January 2026

Polars vs Pandas in Production Pipelines

Prerequisites

What You'll Learn

Polars vs Pandas in Production Pipelines

The Default That Is Slowing You Down

Why Polars Is Structurally Faster

The Pandas Execution Model

The Polars Execution Model

The Benchmark Evidence

The PDS-H Benchmark

The Production Evidence

Setting Up And Running Both

Side-By-Side: The Same Operation In Both Libraries

Streaming For Larger-Than-Memory Data

Production Guardrails

Where Pandas Still Wins

The ML Ecosystem Constraint

What Pandas 3.0 Changed

The Decision You Need To Make This Week

Sources

Category

Tools Used

Polars vs Pandas in Production Pipelines

Prerequisites

What You'll Learn

Polars vs Pandas in Production Pipelines

The Default That Is Slowing You Down

Why Polars Is Structurally Faster

The Pandas Execution Model

The Polars Execution Model

The Benchmark Evidence

The PDS-H Benchmark

The Production Evidence

Setting Up And Running Both

Side-By-Side: The Same Operation In Both Libraries

Streaming For Larger-Than-Memory Data

Production Guardrails

Where Pandas Still Wins

The ML Ecosystem Constraint

What Pandas 3.0 Changed

The Decision You Need To Make This Week

Sources

Category

Tools Used