- InfinitePy Newsletter 🇺🇸
Understanding the Speed and Efficiency of Polars
Learn how Polars achieves its remarkable speed and memory efficiency compared to pandas, leveraging mechanisms like optimized query execution, Apache Arrow integration, and parallel processing.
🕒 Estimated reading time: 6 minutes
Today, we will work through some advanced topics in Polars, a fast-growing Python library for highly efficient DataFrame processing. As we explore what Polars has to offer, you will discover how to process and analyze big data smoothly.
To delve into the first article, Introduction to Python Polars 🐻❄️: A High-Efficiency DataFrames Built to Scale, please click here.
All the examples are also explained in a corresponding Google Colab notebook 👨🔬, making your learning even more interactive.
The Rising Popularity of Polars in Python Data Science
Despite Polars' advantages, pandas remains a preferred choice due to its deep integration with the broader Python data science ecosystem. Its interoperability with various packages in the machine learning pipeline is unmatched. However, Polars is rapidly catching up, with growing compatibility with several plotting libraries, such as plotly, matplotlib, seaborn, altair, and hvplot, making it a viable option for exploratory data analysis.
Polars can also now be integrated into machine learning and deep learning pipelines. For example, scikit-learn's release 1.4.0 allows transformers to output Polars DataFrames. Moreover, Polars DataFrames can be converted to PyTorch data types, enabling easier integration into PyTorch workflows. This transformation can be done using the to_torch method on a Polars DataFrame.
Why is Polars so fast?
In essence, Polars is engineered for speed. It's been built from the ground up to be incredibly fast and can execute common operations up to 5 to 10 times faster than pandas. Furthermore, Polars uses significantly less memory for its operations compared to pandas. While pandas requires about 5 to 10 times more RAM than the size of the dataset for operations, Polars only needs 2 to 4 times as much.
To get a sense of how Polars compares to other dataframe libraries, you can check out this comparison. As the results show, Polars is 10 to 100 times faster than pandas for common operations and is one of the fastest DataFrame libraries available. Furthermore, it can handle larger datasets than pandas before encountering out-of-memory errors.
These results are truly remarkable, and you might be curious about how Polars achieves such high performance while still operating on a single machine. The library was designed with performance in mind from the start, and this is accomplished through several methods.
The Role of Apache Arrow in Enhancing Performance
Polars' impressive performance can be attributed in part to its use of Apache Arrow, a language-independent memory format co-created by Wes McKinney to address the limitations of pandas amid growing data sizes. This format also underpins pandas 2.0, which was released in March 2023 to enhance its performance. However, Polars takes a different route by implementing its own version of Arrow instead of relying on PyArrow like pandas 2.0.
One major benefit of using Arrow as a data library is the seamless interoperability it offers. Arrow standardizes in-memory data formats across various libraries and databases, which eliminates the need for data conversion while passing it between different steps in a data pipeline. This can be particularly advantageous when working with data science pipelines that utilize various tools.
Here's a simple example to illustrate this:
```python
import pyarrow as pa
import polars as pl

# Create an Arrow Table with two columns 'a' and 'b'
arrow_table = pa.table({
    'a': range(1000),
    'b': range(1000, 2000)
})

# Convert the Arrow Table to a Polars DataFrame
polars_df = pl.from_arrow(arrow_table)

# Display the Polars DataFrame
print(polars_df)
```
This not only boosts performance by avoiding the costly process of serialization and deserialization but also improves memory efficiency. In fact, serialization and deserialization are estimated to comprise about 80–90% of computing costs in typical data workflows, making Arrow’s role in Polars a significant performance booster.
Arrow also supports a broader range of data types compared to pandas. While pandas, built on NumPy, excels at handling integer and float columns, it struggles with other types. Arrow, on the other hand, efficiently manages datetime, boolean, binary, and even complex column types that contain lists. It can also natively handle missing data, which requires workarounds in NumPy. Additionally, Arrow’s use of columnar data storage, where all columns are stored in a continuous memory block, facilitates parallelism and speeds up data retrieval.
Query optimization
One of Polars’ strengths lies in how it processes code. Unlike pandas, which typically follows eager execution (executing operations in the order they are written), Polars can perform both eager and lazy execution. Lazy execution involves a query optimizer that evaluates all required operations and determines the most efficient execution order. For instance, consider this expression to compute the mean of a column Quantities for specific fruits, Apple and Banana:
```python
import polars as pl

# Create example DataFrame
df = pl.DataFrame({
    "Fruits": ["Apple", "Banana", "Apple", "Cherry", "Banana"],
    "Quantities": [10, 20, 15, 30, 25]
})

# Group by "Fruits", calculate the mean of "Quantities",
# then filter for the fruits "Apple" and "Banana"
result = (
    df.group_by("Fruits")
    .agg(pl.col("Quantities").mean())
    .filter(pl.col("Fruits").is_in(["Apple", "Banana"]))
)
print(result)
```
During eager execution, the groupby operation is first applied to the entire DataFrame, followed by a filtering step. In the case of lazy execution, however, the filtering happens first. This makes the groupby operation more efficient as it only processes the relevant data.
Expressive API
Polars also boasts an expressive API, allowing almost any operation to be performed using Polars methods. In contrast, pandas often requires the use of the apply method for complex operations, which iterates through DataFrame rows sequentially. By leveraging built-in methods, Polars can operate at the columnar level, enabling parallel processing through Single Instruction, Multiple Data (SIMD).
Polars provides users with a powerful expression API for intricate data manipulations, all while optimizing queries in the background to boost performance.
Let's take a look at an example using this feature:
```python
import polars as pl

# We start with a simple DataFrame
df = pl.DataFrame({
    "age": [25, 32, 45, 22, 18],
    "income": [50000, 150000, 70000, 30000, 20000]
})

# Use conditional logic to create a new column, 'income_level':
# if the income is greater than 100000, label it 'High', otherwise 'Low'
df = df.with_columns([
    pl.when(pl.col("income") > 100000)
    .then(pl.lit("High"))
    .otherwise(pl.lit("Low"))
    .alias("income_level")
])
print(df)
```
Another feature Polars offers is the ability to normalize data and create new columns based on that. Here's an example:
```python
import polars as pl

# Let's create a DataFrame
df = pl.DataFrame({
    "names": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "scores": [85, 95, 70, 65, 92, 88]
})

# Calculate the average score
average_score = df["scores"].mean()

# Normalize the scores and create a new column, 'normalized_scores'
df = df.with_columns([
    ((pl.col("scores") / average_score) * 100).alias("normalized_scores")
])

# Use the 'normalized_scores' column to create a 'grade' column
result = df.with_columns([
    pl.when(pl.col("normalized_scores") > 90).then(pl.lit("A"))
    .when(pl.col("normalized_scores") > 70).then(pl.lit("B"))
    .when(pl.col("normalized_scores") > 50).then(pl.lit("C"))
    .otherwise(pl.lit("D"))
    .alias("grade")
])
print(result)
```
Polars also supports a variety of join operations, making sure your data merging operations are both flexible and efficient. Here's an example:
```python
# Import the polars library
import polars as pl

# Create the first DataFrame
df1 = pl.DataFrame({
    "key": [1, 2, 3],
    "value": ["one", "two", "three"]
})

# Create the second DataFrame
df2 = pl.DataFrame({
    "key": [2, 3, 4],
    "value": ["two", "three", "four"]
})

# Perform an inner join on the 'key' column.
# The 'how' parameter defaults to 'inner', so it can be omitted;
# 'suffix' adds a suffix to overlapping column names from the second DataFrame
result = df1.join(df2, on="key", suffix="_df2")

# Print the result
print(result)
```
Conclusion
Polars is a powerful library for data manipulation that keeps performance at its core. Its advanced features, such as lazy evaluation, parallel processing, Arrow integration, expressive APIs, and efficient join operations, provide a robust toolkit for handling large and complex datasets. By leveraging these advancements, you can considerably elevate your data processing capabilities in Python.
So, the next time your data processing tasks start to drag, give Polars a try and experience the speedup for yourself. Keep experimenting and happy coding!