1. Installation
pip install "polars[all]" jupyterlab   # quotes keep the shell from globbing [all]
2. Basic Usage in Jupyter
import polars as pl
# Read CSV
df = pl.read_csv("data.csv")
# Show head
df.head()
# Select and filter
df.select(["col1", "col2"]).filter(pl.col("col1") > 10)
3. Eager vs Lazy Mode

| Mode | Behavior |
| --- | --- |
| Eager (default) | Executes immediately (like Pandas) |
| Lazy | Builds a computation graph and optimizes it before execution |
# Lazy version: scan_csv builds the query without reading the file up front
ldf = pl.scan_csv("data.csv")
result = ldf.filter(pl.col("price") > 100).select(["item", "price"]).collect()
4. Advanced Polars Features

A. Custom Expressions
df.with_columns([
(pl.col("price") * 0.1).alias("tax"),
pl.col("name").str.to_uppercase()
])
B. GroupBy + Aggregations
# group_by was spelled groupby before Polars 0.19; pl.len() replaces the deprecated pl.count()
df.group_by("category").agg([
    pl.col("price").mean().alias("avg_price"),
    pl.len().alias("count")
])
C. Join Operations
df1.join(df2, on="id", how="inner")
| Strategy | Result |
| --- | --- |
| inner | Only matching rows |
| left / full | Keep left / all rows (full was named outer before Polars 1.0) |
| semi / anti | Row exists / doesn't exist in right |
D. Window Functions
df.sort("timestamp").with_columns([
pl.col("price").rolling_mean(window_size=3).over("category")
])
Useful for moving averages, cumulative stats, and rankings.
E. Pivot and Melt
# Pivot (newer Polars: the first argument is `on`; `aggregate_fn` is now `aggregate_function`)
df.pivot(on="category", index="date", values="value", aggregate_function="sum")

# Unpivot (formerly melt; the id_vars/value_vars arguments became index/on)
df.unpivot(index="date", on=["sales_a", "sales_b"])
5. Read/Write Support

| Format | Read / Write |
| --- | --- |
| CSV | pl.read_csv, df.write_csv |
| Parquet | pl.read_parquet, df.write_parquet |
| JSON | pl.read_json, df.write_json |
| IPC/Arrow | pl.read_ipc, df.write_ipc |
6. Interoperability

Convert between Polars and Pandas
import pandas as pd

df_pl = pl.from_pandas(pd_df)  # from pandas (pl.DataFrame(pd_df) also works)
pd_df = df_pl.to_pandas()      # to pandas
Plotting

Polars doesn't support plotting directly; convert to Pandas:

import matplotlib.pyplot as plt
df_pl.to_pandas().plot(x="date", y="sales")
plt.show()
8. Debugging & Inspection in Jupyter
# Schema inspection
df.schema
# Data types
df.dtypes
# Lazy query plan (describe_plan() is deprecated; explain() returns the optimized plan)
ldf.explain()
9. Streaming & Chunked Data (Advanced)

pl.read_csv always returns the whole file as one DataFrame; for incremental processing use read_csv_batched:

reader = pl.read_csv_batched("big.csv", batch_size=100_000)
while (batches := reader.next_batches(10)) is not None:
    for chunk in batches:
        process(chunk)

Useful for real-time or out-of-core processing.
10. Plugins & Ecosystem Integration

| Tool | Integration |
| --- | --- |
| DuckDB | Direct querying via pl.read_database() |
| Arrow | Native backend format; fast interop |
| SQL (via DuckDB) | duckdb.query("SELECT ...").pl() |
| Machine Learning | Use .to_pandas() or .to_numpy() |
11. Benchmarking Example

import time

df = pl.read_csv("big.csv")
start = time.perf_counter()  # perf_counter is a better timing clock than time()
df.group_by("category").agg(pl.col("price").mean())
print("Elapsed:", time.perf_counter() - start)
12. Cheat Sheet Summary

| Task | Code |
| --- | --- |
| Filter | df.filter(pl.col("x") > 5) |
| Select | df.select(["a", "b"]) |
| GroupBy + Mean | df.group_by("cat").agg(pl.col("x").mean()) |
| Join | df1.join(df2, on="id") |
| Convert to Pandas | df.to_pandas() |
| Lazy collect | ldf.collect() |
| Sort | df.sort("x", descending=True) |
| Window fn | pl.col("val").cum_sum().over("group") |