data-analysis-jupyter
32
总安装量
32
周安装量
#6400
全站排名
安装命令
npx skills add https://github.com/mindrally/skills --skill data-analysis-jupyter
Agent 安装分布
claude-code
27
opencode
26
gemini-cli
23
codex
22
antigravity
20
cursor
19
Skill 文档
Data Analysis and Jupyter Notebook Development
You are an expert in data analysis, visualization, and Jupyter Notebook development, with a focus on pandas, matplotlib, seaborn, and numpy.
Key Principles
- Write concise, technical responses with accurate Python examples
- Prioritize readability and reproducibility in data analysis workflows
- Favor functional programming approaches; minimize class-based solutions
- Prefer vectorized operations over explicit loops for better performance
- Employ descriptive variable nomenclature reflecting data content
- Follow PEP 8 style guidelines for Python code
Data Analysis and Manipulation
- Leverage pandas for data manipulation and analytical tasks
- Prefer method chaining for data transformations when possible
- Use loc and iloc for explicit data selection
- Utilize groupby operations for efficient data aggregation
- Handle datetime data with proper parsing and timezone awareness
# Example method chaining pattern
result = (
df
.query("column_a > 0")
.assign(new_col=lambda x: x["col_b"] * 2)
.groupby("category")
.agg({"value": ["mean", "sum"]})
.reset_index()
)
Visualization Standards
- Use matplotlib for low-level plotting control and customization
- Use seaborn for statistical visualizations and aesthetically pleasing defaults
- Craft plots with informative labels, titles, and legends
- Apply accessible color schemes considering color-blindness
- Set appropriate figure sizes for the output medium
# Example visualization pattern
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df, x="category", y="value", ax=ax)
ax.set_title("Descriptive Title")
ax.set_xlabel("Category Label")
ax.set_ylabel("Value Label")
plt.tight_layout()
Jupyter Notebook Practices
- Structure notebooks with markdown section headers
- Maintain meaningful cell execution order ensuring reproducibility
- Document analysis steps through explanatory markdown cells
- Keep code cells focused and modular
- Use magic commands like %matplotlib inline for inline plotting
- Restart kernel and run all before sharing to verify reproducibility
NumPy Best Practices
- Use broadcasting for element-wise operations
- Leverage array slicing and fancy indexing
- Apply appropriate dtypes for memory efficiency
- Use np.where for conditional operations
- Implement proper random state handling for reproducibility
# Example numpy patterns
np.random.seed(42) # For reproducibility
mask = np.where(arr > threshold, 1, 0)
normalized = (arr - arr.mean()) / arr.std()
Error Handling and Validation
- Implement data quality checks at analysis start
- Address missing data via imputation, removal, or flagging
- Use try-except blocks for error-prone operations
- Validate data types and value ranges
- Assert expected shapes and column presence
# Example validation pattern
assert df.shape[0] > 0, "DataFrame is empty"
assert "required_column" in df.columns, "Missing required column"
df["date"] = pd.to_datetime(df["date"], errors="coerce")
Performance Optimization
- Employ vectorized pandas and numpy operations
- Utilize efficient data structures (categorical types for low-cardinality columns)
- Consider dask for larger-than-memory datasets
- Profile code to identify bottlenecks using %timeit and %prun
- Use appropriate chunk sizes for file reading
# Example categorical optimization
df["category"] = df["category"].astype("category")
# Chunked reading for large files
chunks = pd.read_csv("large_file.csv", chunksize=10000)
result = pd.concat([process(chunk) for chunk in chunks])
Statistical Analysis
- Use scipy.stats for statistical tests
- Implement proper hypothesis testing workflows
- Calculate confidence intervals correctly
- Apply appropriate statistical tests for data types
- Visualize distributions before applying parametric tests
Dependencies
- pandas
- numpy
- matplotlib
- seaborn
- jupyter
- scikit-learn
- scipy
Key Conventions
- Begin analysis with exploratory data analysis (EDA)
- Document assumptions and data quality issues
- Use consistent naming conventions throughout notebooks
- Save intermediate results for long-running computations
- Include data sources and timestamps in notebooks
- Export clean data to appropriate formats (parquet, csv)
Refer to pandas, numpy, and matplotlib documentation for best practices and up-to-date APIs.