Install command:
npx skills add https://github.com/tondevrel/scientific-agent-skills --skill xarray
Xarray – N-Dimensional Labeled Arrays
Xarray provides a pandas-like experience for multidimensional data. It is the core of the Pangeo ecosystem and is essential for working with NetCDF, GRIB, and Zarr formats.
When to Use
- Working with multi-dimensional scientific data (Time, Lat, Lon, Level, Ensemble).
- Analyzing climate, weather, or oceanographic datasets (NetCDF files).
- Handling large datasets that don’t fit in memory (via Dask integration).
- Performing complex broadcasting and alignment based on dimension names instead of axis indices.
- Storing metadata (units, descriptions) directly inside the data object.
- Remote sensing and geospatial imaging analysis.
Reference Documentation
Official docs: https://docs.xarray.dev/
Tutorials: https://tutorial.xarray.dev/
Search patterns: xr.DataArray, xr.Dataset, ds.sel, ds.groupby, ds.resample, xr.open_dataset
Core Principles
DataArray vs Dataset
| Structure | Description | Analogy |
|---|---|---|
| DataArray | A single labeled N-dimensional array. | Like a pandas.Series but N-D. |
| Dataset | A dict-like container of multiple DataArrays. | Like a pandas.DataFrame but N-D. |
Key Concepts
- Dimensions: Names of the axes (e.g., x, y, time).
- Coordinates: Values associated with dimensions (e.g., actual timestamps or latitude values).
- Attributes: Arbitrary metadata (e.g., units='Kelvin', standard_name='air_temperature').
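The three concepts are visible directly on a small object; a minimal sketch with made-up names and values:

```python
import numpy as np
import xarray as xr

# Toy 2x3 array illustrating dimensions, coordinates, and attributes
da = xr.DataArray(
    np.zeros((2, 3)),
    coords={"time": [0, 1], "lon": [-120, -110, -100]},
    dims=("time", "lon"),
    attrs={"units": "Kelvin", "standard_name": "air_temperature"},
)
print(da.dims)   # dimension names
print(da.attrs)  # metadata travels with the data
```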
Quick Reference
Installation
pip install xarray netCDF4 dask zarr
Standard Imports
import xarray as xr
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Basic Pattern – Creation
import xarray as xr
import numpy as np
import pandas as pd
# Create a DataArray
data = np.random.rand(4, 3)
times = pd.date_range("2023-01-01", periods=4)
lons = [-120, -110, -100]
da = xr.DataArray(
    data,
    coords={"time": times, "lon": lons},
    dims=("time", "lon"),
    name="temp",
    attrs={"units": "degC"},
)
# Convert to Dataset
ds = da.to_dataset()
print(ds)
Critical Rules
✅ DO
- Use Named Dimensions – Refer to axes by name (`dim="time"` in reductions, `dims=("time", "lat", "lon")` at creation) instead of integer positions.
- Select by Labels – Use `.sel()` for coordinate values and `.isel()` for integer indices.
- Lazy Loading – Pass `chunks={}` to `open_dataset` to handle large files with Dask.
- Keep Metadata – Populate `.attrs` so your data is self-describing.
- Alignment – Let Xarray handle broadcasting; it automatically aligns data by coordinate values.
- Accessor Power – Use `.dt` for datetime components and `.str` for string operations.
❌ DON'T
- Use Integer Indexing – Avoid `data[0, :, 5]` (unreadable and fragile). Use `.isel(time=0, lon=5)`.
- Ignore the Encoding – When saving to NetCDF, check `ds.encoding` for compression/scaling settings.
- Write Manual Loops – Don't loop over time steps; use `.groupby()` or `.resample()`.
- Forget Dask – For datasets larger than RAM, ensure Dask is installed and chunks are defined.
Anti-Patterns (NEVER)
# ❌ BAD: Positional indexing (What is axis 1? Lat or Lon?)
mean_val = ds.temp.mean(axis=1)
# ✅ GOOD: Named dimension reduction (clear and robust)
mean_val = ds.temp.mean(dim='lat')
# ❌ BAD: Manual time slicing with list comprehensions
# subset = [ds.sel(time=t) for t in my_times if t > '2020-01-01']
# ✅ GOOD: Built-in slicing
subset = ds.sel(time=slice('2020-01-01', '2021-12-31'))
# ❌ BAD: Losing metadata during numpy conversion
raw_data = ds.temp.values  # Now it's just a numpy array; units are gone!
# ✅ GOOD: Keep data in Xarray as long as possible
processed = ds.temp * 10  # Units and coords are preserved
Selection and Indexing
sel vs isel
# Select by coordinate values
subset = ds.sel(lat=45.0, lon=slice(-100, -80))
# Select by index (integer)
first_step = ds.isel(time=0)
# Nearest neighbor lookup
point = ds.sel(lat=42.1, lon=-71.2, method="nearest")
# Multi-dimensional selection
high_temp_days = ds.where(ds.temp > 30, drop=True)
Computation and Math
Broadcasting and Alignment
# Xarray aligns automatically by coordinate names
da1 = xr.DataArray([1, 2], coords=[[1, 2]], dims=['x'])
da2 = xr.DataArray([1, 2, 3], coords=[[1, 2, 3]], dims=['y'])
# result is a 2x3 matrix
result = da1 + da2
# Mathematical operations preserve coordinates
log_temp = np.log(ds.temp)
anomalies = ds.temp - ds.temp.mean(dim='time')
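One consequence of automatic alignment worth knowing: arithmetic between arrays with partially overlapping coordinates keeps only the intersection of labels. A minimal sketch with made-up values:

```python
import xarray as xr

a = xr.DataArray([1, 2, 3], coords={"x": [10, 20, 30]}, dims="x")
b = xr.DataArray([10, 20, 30], coords={"x": [20, 30, 40]}, dims="x")

# Only the shared labels x=20 and x=30 survive the addition
total = a + b
```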
GroupBy and Resampling
Time Series and Spatial Aggregation
# Monthly means
monthly = ds.resample(time="1MS").mean()
# Climatology (group by month regardless of year)
climatology = ds.groupby("time.month").mean()
# Calculate anomalies relative to climatology
anomalies = ds.groupby("time.month") - climatology
# Rolling window (Moving average)
rolling_mean = ds.rolling(time=7, center=True).mean()
File I/O (NetCDF, Zarr)
Reading and Writing
# Open a single file
ds = xr.open_dataset("weather_data.nc")
# Open multiple files (MFDataset)
ds_all = xr.open_mfdataset("data/*.nc", combine="by_coords", chunks={'time': 100})
# Write to NetCDF
ds.to_netcdf("output.nc")
# Write to Zarr (Cloud optimized)
ds.to_zarr("data.zarr")
Plotting
High-level wrapping of Matplotlib
# 1D plot
ds.temp.sel(lat=0, lon=0, method='nearest').plot()
# 2D map
ds.temp.isel(time=0).plot(cmap='RdBu_r', robust=True)
# Faceting (Subplots)
ds.temp.isel(time=slice(0, 4)).plot(col="time", col_wrap=2)
Integration with pandas and NumPy
# To Pandas
df = ds.to_dataframe()
# From Pandas
new_ds = xr.Dataset.from_dataframe(df)
# To NumPy (Lose coordinates)
arr = ds.temp.values
# Interoperability
# Xarray objects work in many SciPy/NumPy functions
from scipy.signal import detrend
detrended = xr.apply_ufunc(detrend, ds.temp, input_core_dims=[['time']], output_core_dims=[['time']])
Advanced: Dask for Big Data
Out-of-memory computation
# Opening with chunks creates Dask arrays
ds = xr.open_dataset("huge_file.nc", chunks={'time': 500, 'lat': 100, 'lon': 100})
# Computation is now lazy
result = ds.temp.mean(dim='time') # Returns immediately
# Trigger computation
final_val = result.compute()
Practical Workflows
1. Global Temperature Anomaly Workflow
def calculate_temp_anomaly(filepath):
    """Calculate monthly anomalies from NetCDF data."""
    ds = xr.open_dataset(filepath)
    # 1. Compute climatology (mean for each month of the year)
    climatology = ds.temp.groupby("time.month").mean("time")
    # 2. Subtract climatology from the original data
    anomalies = ds.temp.groupby("time.month") - climatology
    # 3. Global mean anomaly, weighted by cos(lat)
    #    because grid cells shrink toward the poles
    weights = np.cos(np.deg2rad(ds.lat))
    weights.name = "weights"
    anom_weighted = anomalies.weighted(weights)
    return anom_weighted.mean(("lat", "lon"))

# ts_anomaly = calculate_temp_anomaly("global_temps.nc")
2. Multi-Model Ensemble Analysis
def analyze_ensemble(file_list):
    """Combine multiple model runs into one dataset with a 'model' dimension."""
    datasets = [xr.open_dataset(f) for f in file_list]
    model_names = ["Model_A", "Model_B", "Model_C"]  # assumes three files
    # Concatenate along a new dimension
    combined = xr.concat(datasets, dim=pd.Index(model_names, name="model"))
    # Calculate ensemble mean and spread
    ens_mean = combined.mean(dim="model")
    ens_std = combined.std(dim="model")
    return ens_mean, ens_std
3. Satellite Image Processing (NDVI)
def calculate_ndvi(ds):
    """Calculate NDVI from Red and NIR bands in an Xarray Dataset."""
    # NDVI = (NIR - Red) / (NIR + Red)
    red = ds.sel(band='red')
    nir = ds.sel(band='nir')
    ndvi = (nir - red) / (nir + red)
    ndvi.attrs['long_name'] = "Normalized Difference Vegetation Index"
    return ndvi
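A usage sketch of the band pattern above with toy reflectance values (coordinates and numbers are made up):

```python
import numpy as np
import xarray as xr

# Two pixels along y, two spectral bands
refl = xr.DataArray(
    np.array([[0.1, 0.5], [0.2, 0.6]]),
    coords={"y": [0, 1], "band": ["red", "nir"]},
    dims=("y", "band"),
)
red = refl.sel(band="red")
nir = refl.sel(band="nir")
ndvi = (nir - red) / (nir + red)  # NDVI per pixel along y
```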
Performance Optimization
Chunking Strategies
# ❌ Problem: Small chunks lead to massive scheduler overhead
# ds = ds.chunk({'time': 1, 'lat': 1, 'lon': 1})
# ✅ Solution: Aim for roughly 10-100 MB per chunk
ds = ds.chunk({'time': -1, 'lat': 100, 'lon': 100})
Vectorization with apply_ufunc
# Wrap a custom numpy function to work on Xarray objects efficiently
def my_complex_stat(x):
    return np.median(x) * np.std(x)

result = xr.apply_ufunc(
    my_complex_stat,
    ds.temp,
    input_core_dims=[['time']],  # the dimension passed to (and reduced by) the function
    vectorize=True,
    dask="parallelized"
)
Common Pitfalls and Solutions
Coordinate Mismatch
# ❌ Problem: DataArrays don't align due to floating-point jitter in lat/lon
# ✅ Solution: Use .interp_like() or .reindex_like()
ds2_aligned = ds2.interp_like(ds1)
Memory Leak with values
# ❌ Problem: Calling .values on a huge Dask-backed array crashes the machine
# ✅ Solution: Subset first, then use .compute() or .values
subset_val = ds.temp.isel(time=0).values  # safer: loads a single step
Slicing issues (Start/End)
# ❌ Problem: slice(10, 0) returns an empty result because the order is wrong
# ✅ Solution: Check whether your index is ascending or descending
# ds.sortby('lat').sel(lat=slice(-90, 90))
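A concrete sketch of the descending-index case, with toy latitude values:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(4.0),
    coords={"lat": [90, 60, 30, 0]},  # descending, as in many model grids
    dims="lat",
)
# Descending index: the slice must also run high-to-low
desc = da.sel(lat=slice(90, 30))
# Or sort once, then use ascending slices everywhere
asc = da.sortby("lat").sel(lat=slice(0, 60))
```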
Xarray is the bridge between raw N-dimensional math and high-level data analysis. Its ability to handle labels and metadata makes scientific code self-documenting and significantly more reliable.