polars

📁 eyadsibai/ltk 📅 Jan 28, 2026
Install command
npx skills add https://github.com/eyadsibai/ltk --skill polars


Skill Documentation

Polars: Fast DataFrame Library

Lightning-fast DataFrame library with lazy evaluation and parallel execution.

When to Use

  • Pandas is too slow for your dataset
  • Working with 1-100GB datasets that fit in RAM
  • Need lazy evaluation for query optimization
  • Building ETL pipelines
  • Want parallel execution without extra config

Lazy vs Eager Evaluation

| Mode | Function | Executes | Use Case |
|------|----------|----------|----------|
| Eager | `read_csv()` | Immediately | Small data, exploration |
| Lazy | `scan_csv()` | On `.collect()` | Large data, pipelines |

Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
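
A minimal sketch of both modes, assuming a hypothetical sales.csv with region and amount columns:

```python
import polars as pl

# Eager: the file is read immediately; each step then runs in order.
df = pl.read_csv("sales.csv")
eager_result = df.filter(pl.col("amount") > 100).select("region", "amount")

# Lazy: scan_csv() returns a LazyFrame and nothing runs until .collect().
# The optimizer pushes the filter (predicate pushdown) and the column
# selection (projection pushdown) into the scan itself, so only the
# needed rows and columns are ever materialized.
lazy_result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .select("region", "amount")
    .collect()
)
```

Calling .explain() on a LazyFrame prints the optimized query plan before you collect.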


Core Operations

Data Selection

| Operation | Purpose |
|-----------|---------|
| `select()` | Choose columns |
| `filter()` | Choose rows by condition |
| `with_columns()` | Add/modify columns |
| `drop()` | Remove columns |
| `head(n)` / `tail(n)` | First/last n rows |
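
A short sketch chaining these operations on a toy DataFrame (column names are illustrative):

```python
import polars as pl

df = pl.DataFrame({
    "name": ["a", "b", "c"],
    "price": [1.0, 2.5, 3.0],
    "qty": [10, 0, 5],
})

out = (
    df.filter(pl.col("qty") > 0)  # choose rows by condition
    .with_columns((pl.col("price") * pl.col("qty")).alias("total"))  # add a column
    .select("name", "total")  # choose columns
    .head(2)  # first 2 rows
)
```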

Aggregation

| Operation | Purpose |
|-----------|---------|
| `group_by().agg()` | Group and aggregate |
| `pivot()` | Reshape long to wide |
| `melt()` | Reshape wide to long |
| `unique()` | Distinct rows |
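
For example, a group_by().agg() sketch over a made-up table:

```python
import polars as pl

df = pl.DataFrame({
    "region": ["east", "east", "west"],
    "amount": [10, 20, 5],
})

summary = df.group_by("region").agg(
    pl.col("amount").sum().alias("total"),
    pl.col("amount").mean().alias("avg"),
    pl.col("amount").count().alias("n"),  # rows per group
)
```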

Joins

| Join Type | Description |
|-----------|-------------|
| `inner` | Matching rows only |
| `left` | All left rows + matching right |
| `outer` | All rows from both (named `full` in Polars 1.x) |
| `cross` | Cartesian product |
| `semi` | Left rows with a match |
| `anti` | Left rows without a match |
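
A sketch of the common join types on two toy frames:

```python
import polars as pl

orders = pl.DataFrame({"user_id": [1, 2, 3], "total": [9.5, 3.0, 12.0]})
users = pl.DataFrame({"user_id": [1, 2], "name": ["Ada", "Bo"]})

inner = orders.join(users, on="user_id", how="inner")  # matching rows only
left = orders.join(users, on="user_id", how="left")    # keep every order
anti = orders.join(users, on="user_id", how="anti")    # orders with no matching user
```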

Expression API

Key concept: Polars uses expressions (pl.col()) instead of indexing. Expressions are lazily evaluated and optimized.

Common Expressions

| Expression | Purpose |
|------------|---------|
| `pl.col("name")` | Reference a column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.exclude(...)` | All columns except |
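
A small sketch of these building blocks inside select() (the data is illustrative):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "id": ["x", "y"]})

df.select(pl.col("a") + pl.lit(10))    # column reference plus a literal
df.select(pl.all().exclude("id") * 2)  # every column except "id", doubled
```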

Expression Methods

| Category | Methods |
|----------|---------|
| Aggregation | `.sum()`, `.mean()`, `.min()`, `.max()`, `.count()` |
| String | `.str.contains()`, `.str.replace()`, `.str.to_lowercase()` |
| DateTime | `.dt.year()`, `.dt.month()`, `.dt.day()` |
| Conditional | `.when().then().otherwise()` |
| Window | `.over()`, `.rolling_mean()`, `.shift()` |
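
For instance, a conditional plus a window expression over hypothetical columns:

```python
import polars as pl

df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 5, 3]})

out = df.with_columns(
    # Conditional: label each row against a threshold
    pl.when(pl.col("value") > 2)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("level"),
    # Window: per-group mean broadcast back onto every row
    pl.col("value").mean().over("group").alias("group_mean"),
)
```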

Pandas Migration

| Pandas | Polars |
|--------|--------|
| `df['col']` | `df.select('col')` |
| `df[df['col'] > 5]` | `df.filter(pl.col('col') > 5)` |
| `df['new'] = df['col'] * 2` | `df.with_columns((pl.col('col') * 2).alias('new'))` |
| `df.groupby('col').mean()` | `df.group_by('col').agg(pl.all().mean())` |
| `df.apply(func)` | `df.map_rows(func)` (avoid if possible) |

Key concept: Polars prefers explicit operations over implicit indexing. Use .alias() to name computed columns.
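
As a sketch, the pandas pattern of assigning a derived column and then filtering becomes one explicit chain:

```python
import polars as pl

df = pl.DataFrame({"col": [1, 6, 8]})

# pandas: df["new"] = df["col"] * 2; df[df["col"] > 5]
out = (
    df.with_columns((pl.col("col") * 2).alias("new"))
    .filter(pl.col("col") > 5)
)
```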


File I/O

| Format | Read | Write | Notes |
|--------|------|-------|-------|
| CSV | `read_csv()` / `scan_csv()` | `write_csv()` | Human readable |
| Parquet | `read_parquet()` / `scan_parquet()` | `write_parquet()` | Fast, compressed |
| JSON | `read_json()` / `scan_ndjson()` | `write_json()` | Lazy scan requires newline-delimited JSON |
| IPC/Arrow | `read_ipc()` / `scan_ipc()` | `write_ipc()` | Zero-copy |

Key concept: Use Parquet for performance. Use scan_* for large files to enable lazy optimization.
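
A minimal sketch of that flow, assuming hypothetical paths and columns: convert once to Parquet, then scan it lazily.

```python
import polars as pl

# One-time conversion (paths and columns are made up for illustration)
pl.read_csv("events.csv").write_parquet("events.parquet")

# Lazy scan: with predicate and projection pushdown, only the matching
# rows and the two selected columns are read from disk.
out = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("status") == "ok")
    .select("user_id", "status")
    .collect()
)
```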


Performance Tips

| Tip | Why |
|-----|-----|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | They break parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
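
Several of these tips combined in one hedged sketch; note the dtype parameter is schema_overrides in Polars 1.x (older releases called it dtypes):

```python
import polars as pl

# Declare dtypes up front to skip schema inference on a large file
# (parameter is `schema_overrides` in Polars 1.x, `dtypes` earlier).
lf = pl.scan_csv(
    "big.csv",  # hypothetical file
    schema_overrides={"user_id": pl.Int64, "amount": pl.Float64},
)

# Filter and select early (pushdown), keep the work in expressions
# rather than Python UDFs, and collect once at the end.
total = (
    lf.filter(pl.col("amount") > 0)
    .select(pl.col("amount").sum())
    .collect()
)
```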

vs Alternatives

| Tool | Best For | Limitations |
|------|----------|-------------|
| Polars | 1-100GB, speed critical | Must fit in RAM |
| Pandas | Small data, broad ecosystem | Slow, memory hungry |
| Dask | Larger-than-RAM data | More complex API |
| Spark | Cluster computing | Infrastructure overhead |
| DuckDB | SQL interface | Different API style |

Resources