Introduction to Python Polars 🐻‍❄️: High-Efficiency DataFrames Built to Scale
Polars efficiently handles millions of rows, making Python code simpler and cleaner. In terms of speed, Polars is not just quick; it's incredibly fast.
Estimated reading time: 9 minutes
This article is meant for those who are already familiar with using pandas 🐼 and are curious about whether Polars 🐻‍❄️ could be a good addition to their workflow. If you are not yet familiar with pandas, we highly recommend starting with the article Working with Data in Python: From Basics to Advanced Techniques to gain a foundational understanding of pandas.
Today, there are plenty of Python libraries for working with data, and pandas is the most commonly used one.
Over the years, pandas has established itself as the go-to tool for data analysis in Python. The project, initiated by Wes McKinney in 2008, reached its major milestone with the 1.0 release in January 2020. Since then, it has remained a staple in the data analysis community and shows no signs of fading.
Despite its popularity, pandas is not without its flaws. Wes McKinney himself has highlighted several of these challenges, and online critiques generally focus on two main issues:
performance limitations and
a sometimes awkward or complex API.
In an effort to address these shortcomings, Ritchie Vink developed Polars 🐻‍❄️. In a detailed 2021 blog post, Vink presented metrics that substantiate his claims regarding Polars' improved performance and its more efficient design.
In this article, we will talk about what Polars is, some of its functionalities, and a practical use case where Polars performs outstandingly.
Why Polars 🐻‍❄️?
As data sizes grow and speed becomes a major factor, new libraries like Polars emerge to improve on their predecessors. Polars is an exceptionally fast DataFrame library designed for handling structured data. Its core is written in Rust, with bindings available for Python, R, and NodeJS.
Polars offers several benefits that make it an attractive choice for data manipulation and analysis:
Speed: Built from the ground up in Rust, Polars runs close to the machine and is free from external dependencies, ensuring high performance.
Versatile I/O: Supports various data storage systems, including local storage, cloud services, and databases.
User-Friendly API: Allows you to write queries naturally, with Polars' internal query optimizer figuring out the most efficient execution method.
Efficient Memory Usage: The streaming API processes results without loading all data into memory simultaneously.
Parallel Processing: Leverages all available CPU cores for workload distribution without needing extra configuration.
Optimized Query Engine: Uses Apache Arrow for columnar data processing and SIMD for peak CPU efficiency.
Apache Arrow establishes a columnar memory format that is platform-agnostic, catering to both flat and hierarchical data structures. This format is optimized for efficient analytical processing on contemporary hardware, including both CPUs and GPUs. Additionally, the Arrow memory format allows for zero-copy reads, enabling extremely fast data access without the burden of serialization overhead.
Single Instruction Multiple Data (SIMD) is an advanced microarchitecture method used in processors. This technique allows one instruction to simultaneously perform an operation on multiple data points. For example, it can multiply several numbers in just one clock cycle of the processor.
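To make the streaming and Arrow ideas above concrete, here is a minimal sketch (not from the original article) of a lazy query collected with the streaming engine, followed by a zero-copy handoff to Arrow. It assumes a recent Polars version where collect(streaming=True) is available and that the pyarrow package is installed.

import polars as pl

# A small in-memory DataFrame stands in for a much larger dataset
df = pl.DataFrame({"num": list(range(1, 1001)), "group": ["a", "b"] * 500})

# Lazy query: the optimizer rewrites the plan before anything runs;
# collect(streaming=True) asks the engine to process the data in batches
# instead of materializing everything in memory at once
result = (
    df.lazy()
      .filter(pl.col("num") > 500)
      .group_by("group")
      .agg(pl.sum("num").alias("total"))
      .collect(streaming=True)
)
print(result)

# Zero-copy conversion to an Apache Arrow table (requires pyarrow)
arrow_table = df.to_arrow()
print(type(arrow_table))  # <class 'pyarrow.lib.Table'>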
All the examples are also explained here 👨‍🔬 in a corresponding Google Colab notebook to make your learning even more interactive.
Basic Usage
To begin using Polars, you'll need to install it. This can be done easily with pip:
# Running the following line will install the 'polars' library
!pip install polars
Once installed, you can start using Polars just like any other DataFrame library.
# Import the polars library as pl to handle data frames efficiently
import polars as pl
Here's a simple example to demonstrate Polars' basic functionality.
# Create a DataFrame using polars (similar to pandas, but optimized for performance)
# The DataFrame contains three columns: 'name', 'age', and 'salary'
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000]
})

# Print the initial DataFrame to the console for visualization
print("Initial DataFrame:")
print(df)

# Filter the DataFrame to include only rows where the 'age' column is greater than 28
# This is achieved using the filter method and the col function from polars to select the 'age' column
filtered_df = df.filter(pl.col("age") > 28)

# Print the filtered DataFrame to show only those entries where age > 28
print("\nFiltered DataFrame (age > 28):")
print(filtered_df)

# Group the DataFrame by the 'age' column and aggregate the 'salary' column
# Specifically, calculate the sum of the 'salary' for each unique age
# The agg method is used for aggregation, and alias renames the resulting column to 'total_salary'
grouped_df = df.group_by("age").agg([pl.sum("salary").alias("total_salary")])

# Print the grouped DataFrame to display the total salary for each age group
print("\nGrouped DataFrame (total salary by age):")
print(grouped_df)
Running the code above will produce the following output.
Initial DataFrame:
shape: (3, 3)
┌─────────┬─────┬────────┐
│ name    ┆ age ┆ salary │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ i64    │
╞═════════╪═════╪════════╡
│ Alice   ┆ 25  ┆ 50000  │
│ Bob     ┆ 30  ┆ 60000  │
│ Charlie ┆ 35  ┆ 70000  │
└─────────┴─────┴────────┘
Filtered DataFrame (age > 28):
shape: (2, 3)
┌─────────┬─────┬────────┐
│ name    ┆ age ┆ salary │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ i64    │
╞═════════╪═════╪════════╡
│ Bob     ┆ 30  ┆ 60000  │
│ Charlie ┆ 35  ┆ 70000  │
└─────────┴─────┴────────┘
Grouped DataFrame (total salary by age):
shape: (3, 2)
┌─────┬──────────────┐
│ age ┆ total_salary │
│ --- ┆ ---          │
│ i64 ┆ i64          │
╞═════╪══════════════╡
│ 30  ┆ 60000        │
│ 35  ┆ 70000        │
│ 25  ┆ 50000        │
└─────┴──────────────┘
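Since this article assumes you already know pandas, it is worth noting that Polars can convert to and from pandas DataFrames, so you can adopt it incrementally. Here is a minimal sketch, assuming the pandas and pyarrow packages are installed alongside Polars:

import pandas as pd
import polars as pl

# Move an existing pandas DataFrame into Polars...
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
pldf = pl.from_pandas(pdf)

# ...do the heavy lifting in Polars...
older = pldf.filter(pl.col("age") > 28)

# ...and hand the result back to pandas for plotting or reporting
back_to_pandas = older.to_pandas()
print(back_to_pandas)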
Deep Dive into Functions
Let's explore some of Polars' advanced functionalities through examples:
1. Lazy Execution
Lazy execution allows you to declare a series of transformations and execute them all at once. This can significantly improve performance for complex workflows.
# Convert the DataFrame to a LazyFrame. LazyFrames allow you to build up
# a query (series of transformations) without executing them immediately.
# This can optimize performance by combining operations and reducing
# multiple scans through your data.
lf = df.lazy()

# Declare transformations on the LazyFrame.
# Transformation 1: Filter the rows where the 'age' column is greater than 28.
# Transformation 2: Group the filtered data by the 'age' column.
# Transformation 3: Aggregate each group by summing the 'salary' column and renaming the result to 'total_salary'.
lazy_result = lf.filter(pl.col("age") > 28).group_by("age").agg([
    pl.sum("salary").alias("total_salary")
])

# Execute transformations.
# The collect() method triggers the execution of the query built so far in the LazyFrame.
# This reads the data, applies the filter, group_by, and aggregation, and returns a conventional DataFrame.
result = lazy_result.collect()

# Print the result.
print("Lazy Execution Result:")
print(result)
Running the code above will produce the following output.
Lazy Execution Result:
shape: (2, 2)
┌─────┬──────────────┐
│ age ┆ total_salary │
│ --- ┆ ---          │
│ i64 ┆ i64          │
╞═════╪══════════════╡
│ 35  ┆ 70000        │
│ 30  ┆ 60000        │
└─────┴──────────────┘
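One advantage of declaring the query lazily is that you can inspect what the optimizer intends to do before running anything. In recent Polars versions, LazyFrame.explain() returns the optimized plan as a string; the following is a minimal sketch of that idea using the same small DataFrame:

# Build the same lazy query without executing it
lazy_query = (
    df.lazy()
      .filter(pl.col("age") > 28)
      .group_by("age")
      .agg(pl.sum("salary").alias("total_salary"))
)

# explain() returns the optimized query plan as a string, showing,
# for example, that the filter is applied before the aggregation
print(lazy_query.explain())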
2. Parallel Execution
Polars can automatically parallelize operations to take full advantage of multicore processors.
# Create a large DataFrame ('df_large') with a single column named 'num'.
# The column 'num' is populated with integers ranging from 1 to 1,000,000.
# 'list(range(1, 1000001))' generates a list starting from 1 up to and including 1,000,000.
df_large = pl.DataFrame({"num": list(range(1, 1000001))})

# Apply a transformation to the DataFrame using Polars' select method.
# pl.col("num") references the 'num' column of the DataFrame.
# Multiplying by 2 doubles each value in the 'num' column, producing a column of transformed values.
# Polars distributes this work across threads under the hood, which can speed up operations on large datasets.
parallel_result = df_large.select(pl.col("num") * 2)

# Print the transformed DataFrame 'parallel_result', which contains the doubled values of the 'num' column.
print("Parallel Execution:")
print(parallel_result)
Running the code above will produce the following output.
Parallel Execution:
shape: (1_000_000, 1)
┌─────────┐
│ num     │
│ ---     │
│ i64     │
╞═════════╡
│ 2       │
│ 4       │
│ 6       │
│ 8       │
│ …       │
│ 1999994 │
│ 1999996 │
│ 1999998 │
│ 2000000 │
└─────────┘
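To get a rough feel for this parallel, vectorized execution, you can time the Polars expression against a plain Python loop over the same data. This is an informal sketch rather than a rigorous benchmark; absolute numbers will vary by machine and Polars version:

import time

import polars as pl

nums = list(range(1, 1000001))
df_large = pl.DataFrame({"num": nums})

# Pure-Python baseline: one element at a time, single-threaded
start = time.perf_counter()
doubled_py = [n * 2 for n in nums]
py_seconds = time.perf_counter() - start

# Polars: the same operation as a vectorized, multi-threaded expression
start = time.perf_counter()
doubled_pl = df_large.select(pl.col("num") * 2)
pl_seconds = time.perf_counter() - start

print(f"Python list comprehension: {py_seconds:.4f} s")
print(f"Polars expression:         {pl_seconds:.4f} s")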
Real-World Use Case: Financial Data Analysis
Let's simulate a real-world use case where we analyze a large dataset of stock prices to find trends and calculate moving averages.
Data Loading: Load CSV file with historical stock prices.
Data Cleaning: Remove any rows with missing data.
Calculations: Calculate the 7-day and 30-day moving averages.
Analysis: Find dates where the 7-day moving average crosses above the 30-day moving average.
# Step 1: Data Loading
# Load the stock price data from the given URL and read it into a DataFrame using Polars
df = pl.read_csv("https://infinitepy.s3.amazonaws.com/samples/stock_price.csv")

# Print the initial data to get a quick look at the first few rows
print("Initial Data:")
print(df.head())  # head() shows the first 5 rows by default

# Step 2: Data Cleaning
# Remove rows with any null (missing) values and store the cleaned DataFrame
df_clean = df.drop_nulls()

# Print the cleaned data to inspect the first few rows after removing null values
print("\nCleaned Data:")
print(df_clean.head())

# Step 3: Calculate moving averages
# Calculate 7-day and 30-day moving averages for the 'Price' column
# The with_columns method is used to add new columns to the DataFrame
df_clean = df_clean.with_columns([
    # Calculate the 7-day moving average of 'Price' and name the resulting column '7_day_ma'
    pl.col("Price").rolling_mean(window_size=7).alias("7_day_ma"),
    # Calculate the 30-day moving average of 'Price' and name the resulting column '30_day_ma'
    pl.col("Price").rolling_mean(window_size=30).alias("30_day_ma")
])

# Print the first 40 rows of the data to see the moving averages
print("\nData with Moving Averages:")
print(df_clean.head(40))  # head(40) shows the first 40 rows

# Step 4: Find Crossovers
# A crossover occurs when the 7-day moving average crosses above the 30-day moving average
# Use the filter method to apply this condition
crossovers = df_clean.filter(
    (pl.col("7_day_ma") > pl.col("30_day_ma")) &                   # current row: 7-day MA is above the 30-day MA
    (pl.col("7_day_ma").shift(1) <= pl.col("30_day_ma").shift(1))  # previous row: 7-day MA was at or below the 30-day MA
    # shift(1) looks at the previous row; this helps to detect the crossover point
)

# Print the rows where crossovers are detected
print("\nCrossovers:")
print(crossovers)
Running the code above will produce the following output.
Initial Data:
shape: (5, 2)
┌────────────┬───────────┐
│ Date       ┆ Price     │
│ ---        ┆ ---       │
│ str        ┆ f64       │
╞════════════╪═══════════╡
│ 14-08-2018 ┆ 23.02     │
│ 15-08-2018 ┆ 23.15     │
│ 16-08-2018 ┆ 23.5      │
│ 17-08-2018 ┆ 23.4      │
│ 20-08-2018 ┆ 23.549999 │
└────────────┴───────────┘
Cleaned Data:
shape: (5, 2)
┌────────────┬───────────┐
│ Date       ┆ Price     │
│ ---        ┆ ---       │
│ str        ┆ f64       │
╞════════════╪═══════════╡
│ 14-08-2018 ┆ 23.02     │
│ 15-08-2018 ┆ 23.15     │
│ 16-08-2018 ┆ 23.5      │
│ 17-08-2018 ┆ 23.4      │
│ 20-08-2018 ┆ 23.549999 │
└────────────┴───────────┘
Data with Moving Averages:
shape: (40, 4)
┌────────────┬───────────┬───────────┬───────────┐
│ Date       ┆ Price     ┆ 7_day_ma  ┆ 30_day_ma │
│ ---        ┆ ---       ┆ ---       ┆ ---       │
│ str        ┆ f64       ┆ f64       ┆ f64       │
╞════════════╪═══════════╪═══════════╪═══════════╡
│ 14-08-2018 ┆ 23.02     ┆ null      ┆ null      │
│ 15-08-2018 ┆ 23.15     ┆ null      ┆ null      │
│ 16-08-2018 ┆ 23.5      ┆ null      ┆ null      │
│ 17-08-2018 ┆ 23.4      ┆ null      ┆ null      │
│ …          ┆ …         ┆ …         ┆ …         │
│ 04-10-2018 ┆ 20.780001 ┆ 21.427143 ┆ 22.306667 │
│ 05-10-2018 ┆ 20.76     ┆ 21.28     ┆ 22.224    │
│ 08-10-2018 ┆ 20.809999 ┆ 21.177142 ┆ 22.143667 │
│ 09-10-2018 ┆ 20.76     ┆ 21.064285 ┆ 22.047    │
└────────────┴───────────┴───────────┴───────────┘
Crossovers:
shape: (4, 4)
┌────────────┬───────────┬───────────┬───────────┐
│ Date       ┆ Price     ┆ 7_day_ma  ┆ 30_day_ma │
│ ---        ┆ ---       ┆ ---       ┆ ---       │
│ str        ┆ f64       ┆ f64       ┆ f64       │
╞════════════╪═══════════╪═══════════╪═══════════╡
│ 09-01-2019 ┆ 16.58     ┆ 15.945714 ┆ 15.837    │
│ 13-05-2019 ┆ 19.43     ┆ 18.917143 ┆ 18.896    │
│ 21-06-2019 ┆ 19.110001 ┆ 19.341429 ┆ 19.310333 │
│ 30-07-2019 ┆ 19.459999 ┆ 18.941429 ┆ 18.784333 │
└────────────┴───────────┴───────────┴───────────┘
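One caveat worth flagging: the Date column is loaded as plain strings, and rolling_mean simply runs over the rows in file order. If you want to be explicit, or your file is not already in chronological order, you can parse the dates into a proper Date dtype and sort before computing the moving averages. A minimal sketch, assuming the same df_clean and the day-month-year format shown above:

# Parse 'Date' from "dd-mm-YYYY" strings into a proper Date dtype,
# then sort chronologically so the rolling windows are meaningful
df_sorted = (
    df_clean.with_columns(
        pl.col("Date").str.strptime(pl.Date, "%d-%m-%Y")
    )
    .sort("Date")
)

print(df_sorted.head())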
Conclusion
Polars is a powerful DataFrame library that offers significant performance advantages over traditional libraries like pandas. Its ability to handle large datasets efficiently and its emphasis on speed and memory usage make it an excellent choice for data-intensive applications. Whether you're dealing with financial data, time series, or large-scale data analytics, Polars can help you achieve faster and more efficient results.
By incorporating Polars into your data analysis workflows, you can take full advantage of modern hardware capabilities and achieve better performance, giving you more time to focus on deriving insights from your data rather than worrying about execution speed.
👉 Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.
InfinitePy Newsletter - Your source for Python learning and inspiration.