polars
Install command
npx skills add https://github.com/eyadsibai/ltk --skill polars
Skill Documentation
Polars Fast DataFrame Library
Lightning-fast DataFrame library with lazy evaluation and parallel execution.
When to Use
- Pandas is too slow for your dataset
- Working with 1-100GB datasets that fit in RAM
- Need lazy evaluation for query optimization
- Building ETL pipelines
- Want parallel execution without extra config
Lazy vs Eager Evaluation
| Mode | Function | Executes | Use Case |
|---|---|---|---|
| Eager | read_csv() | Immediately | Small data, exploration |
| Lazy | scan_csv() | On .collect() | Large data, pipelines |
Key concept: Lazy mode builds a query plan that gets optimized before execution. The optimizer applies predicate pushdown (filter early) and projection pushdown (select columns early).
Core Operations
Data Selection
| Operation | Purpose |
|---|---|
| select() | Choose columns |
| filter() | Choose rows by condition |
| with_columns() | Add/modify columns |
| drop() | Remove columns |
| head(n) / tail(n) | First/last n rows |
Aggregation
| Operation | Purpose |
|---|---|
| group_by().agg() | Group and aggregate |
| pivot() | Reshape wide |
| melt() | Reshape long (renamed unpivot() in Polars ≥ 1.0) |
| unique() | Distinct values |
Joins
| Join Type | Description |
|---|---|
| inner | Matching rows only |
| left | All left + matching right |
| outer | All rows from both (called "full" in Polars ≥ 1.0) |
| cross | Cartesian product |
| semi | Left rows with match |
| anti | Left rows without match |
Expression API
Key concept: Polars uses expressions (pl.col()) instead of indexing. Expressions are lazily evaluated and optimized.
Common Expressions
| Expression | Purpose |
|---|---|
| pl.col("name") | Reference column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude(...) | All columns except those named |
Expression Methods
| Category | Methods |
|---|---|
| Aggregation | .sum(), .mean(), .min(), .max(), .count() |
| String | .str.contains(), .str.replace(), .str.to_lowercase() |
| DateTime | .dt.year(), .dt.month(), .dt.day() |
| Conditional | .when().then().otherwise() |
| Window | .over(), .rolling_mean(), .shift() |
Pandas Migration
| Pandas | Polars |
|---|---|
| df['col'] | df.select('col') |
| df[df['col'] > 5] | df.filter(pl.col('col') > 5) |
| df['new'] = df['col'] * 2 | df.with_columns((pl.col('col') * 2).alias('new')) |
| df.groupby('col').mean() | df.group_by('col').agg(pl.all().mean()) |
| df.apply(func) | df.map_rows(func) (avoid if possible) |
Key concept: Polars prefers explicit operations over implicit indexing. Use .alias() to name computed columns.
File I/O
| Format | Read | Write | Notes |
|---|---|---|---|
| CSV | read_csv() / scan_csv() | write_csv() | Human readable |
| Parquet | read_parquet() / scan_parquet() | write_parquet() | Fast, compressed |
| JSON | read_json() / scan_ndjson() | write_json() | scan_ndjson() expects newline-delimited JSON |
| IPC/Arrow | read_ipc() / scan_ipc() | write_ipc() | Zero-copy |
Key concept: Use Parquet for performance. Use scan_* for large files to enable lazy optimization.
Performance Tips
| Tip | Why |
|---|---|
| Use lazy mode | Query optimization |
| Use Parquet | Column-oriented, compressed |
| Select columns early | Projection pushdown |
| Filter early | Predicate pushdown |
| Avoid Python UDFs | Breaks parallelism |
| Use expressions | Vectorized operations |
| Set dtypes on read | Avoid inference overhead |
vs Alternatives
| Tool | Best For | Limitations |
|---|---|---|
| Polars | 1-100GB, speed critical | Must fit in RAM |
| Pandas | Small data, ecosystem | Slow, memory hungry |
| Dask | Larger than RAM | More complex API |
| Spark | Cluster computing | Infrastructure overhead |
| DuckDB | SQL interface | Different API style |
Resources
- Docs: https://pola.rs/
- User Guide: https://docs.pola.rs/user-guide/
- Cookbook: https://docs.pola.rs/user-guide/misc/cookbook/