Polars and pandas both provide DataFrame-based data analysis in Python, but they differ in syntax, performance, and features. In this tutorial on Polars vs pandas, you’ll compare their method chaining styles, run timed performance tests, explore LazyFrame optimizations in Polars, convert data between the two libraries, and create plots with their built-in tools. You’ll also examine scenarios where each library’s strengths make it the better choice.
By the end of this tutorial, you’ll understand that:
- Polars expressions and contexts let you build clear, optimized query pipelines without mutating your original data.
- LazyFrames with query optimization in Polars can outperform pandas for grouped and aggregated workloads.
- Streaming in Polars enables processing datasets that don’t fit in memory, which pandas can’t handle natively.
- .to_pandas() and from_pandas() let you convert between DataFrame formats, and Narwhals offers a library-agnostic API.
- Built-in plotting uses Altair for Polars and Matplotlib for pandas, allowing quick visualization directly from DataFrames.
To get the most out of this tutorial, you should already have a basic understanding of how to work with both pandas and Polars DataFrames, as well as Polars LazyFrames.
To complete the examples in this tutorial, you’ll use various tools and the Python REPL. You’ll use the command line to run some scripts that time your code and reveal how pandas and Polars compare. You’ll also take advantage of the plotting capabilities of Jupyter Notebook.
Much of the data you’ll use will be random and self-generated. You’ll also use a cleansed and reformatted Apache Parquet version of some freely available retail data from the UC Irvine Machine Learning Repository. Parquet is a columnar file format designed for efficient storage and analysis, so both pandas and Polars can read and process it quickly.
Before you start, you should download the online_retail.parquet file from the tutorial downloadables and place it into your project directory.
You’ll need to install the pandas and Polars libraries, as well as PyArrow, Matplotlib, Vega-Altair, and Narwhals, to make sure your code has everything it needs to run. You’ll also use NumPy, which is currently installed automatically when you install pandas.
You may also want to consider creating your own virtual environment within your project folder to install the necessary libraries. This will prevent them from interfering with your current setup.
You can install the required libraries by running this command at your command prompt:
$ python -m pip install polars \
pandas \
pyarrow \
narwhals \
altair \
jupyterlab \
matplotlib
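Once the installation finishes, you may want to confirm that each library can be found by your interpreter. The following is a minimal sketch that uses only the standard library. The helper name check_installed() is hypothetical and isn’t part of the tutorial’s downloadable materials, and the list assumes that each import name matches its pip package name:

```python
from importlib.util import find_spec

# Import names of the libraries installed above, plus NumPy, which is
# pulled in as a pandas dependency. Assumes import names match pip names.
REQUIRED = [
    "polars",
    "pandas",
    "pyarrow",
    "narwhals",
    "altair",
    "jupyterlab",
    "matplotlib",
    "numpy",
]


def check_installed(modules):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_installed(REQUIRED).items():
        print(f"{name:<12} {'ok' if found else 'MISSING'}")
```

Because find_spec() only locates a module without importing it, this check runs quickly and won’t trigger any library’s import-time side effects.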
All the code examples are provided in the downloadable materials for this tutorial, which you can download by clicking the link below:
Get Your Code: Click here to download the free sample code you’ll use to learn the differences between Polars and pandas.
Now that you’re set up, it’s time to get started and learn about the main differences between Polars and pandas.
Do Polars and pandas Use the Same Syntax?
There are similarities between Polars and pandas. For example, they both support Series and DataFrames and can perform many of the same data analysis computations. However, there are some differences in their syntax.
To explore this, you’ll use the order details in your online_retail.parquet file to analyze both pandas and Polars DataFrames. This file contains the following data:
Column Name | Description |
---|---|
InvoiceNo | Invoice number |
StockCode | Stock code of item |
Description | Item description |
Quantity | Quantity purchased |
InvoiceDate | Date invoiced |
UnitPrice | Item price |
CustomerID | Customer identifier |
Country | Country where the purchase was made |
Next, you’ll analyze some of this data with pandas and then with Polars.
Using Index-Based Syntax in pandas
Suppose you want a DataFrame with a new Total column that contains the total cost of each purchase. You also want to apply filtering so you can concentrate on specific data.
To achieve this, you might write the following pandas code in your REPL:
pandas_polars_demo.py
>>> import pandas as pd
>>> orders_pandas = pd.read_parquet("online_retail.parquet")
>>> orders_pandas["Total"] = (
... orders_pandas["Quantity"] * orders_pandas["UnitPrice"]
... )
>>> orders_pandas[["InvoiceNo", "Quantity", "UnitPrice", "Total"]][
... orders_pandas["Total"] > 100
... ].head(3)
InvoiceNo Quantity UnitPrice Total
46 536371 80 2.55 204.0
65 536374 32 10.95 350.4
82 536376 48 3.45 165.6
This code uses pandas index-based syntax, inspired by NumPy, on which pandas was originally built. First, you add a new Total column to your DataFrame. The column is calculated by multiplying the values of the Quantity and UnitPrice columns together. This operation permanently changes your original DataFrame.
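To see that in-place behavior in isolation, here’s a minimal sketch that uses a small, made-up DataFrame instead of the retail data. The column names mirror those in the file, but the rows are hypothetical and invented purely for illustration. For contrast, it also uses .assign(), a pandas method that returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical sample data: these rows are invented for illustration
# and do not come from online_retail.parquet.
orders = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A2", "A3"],
        "Quantity": [80, 2, 48],
        "UnitPrice": [2.55, 10.95, 3.45],
    }
)

# .assign() returns a new DataFrame, so orders still has three columns.
with_total = orders.assign(Total=orders["Quantity"] * orders["UnitPrice"])
columns_before = list(orders.columns)
print(columns_before)

# Index-based assignment adds the column to orders itself.
orders["Total"] = orders["Quantity"] * orders["UnitPrice"]
columns_after = list(orders.columns)
print(columns_after)

# Boolean indexing keeps only the rows whose Total exceeds 100.
large_orders = orders[orders["Total"] > 100]
print(large_orders["InvoiceNo"].tolist())
```

The first print() shows the original three columns, while the second also includes Total, which confirms that only the index-based assignment modified orders.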