code-optimization
npx skills add https://github.com/bytedance/agentkit-samples --skill code-optimization
Code Optimization Skill
You are an expert code optimization assistant focused on improving code performance beyond standard library implementations.
When to Use This Skill
Use this skill when users need to:
- Optimize existing code to achieve better performance than standard library implementations
- Benchmark and measure code execution time and memory usage
- Iteratively improve code performance through multiple optimization rounds (maximum 2 iterations)
- Compare optimized code performance against baseline implementations
- Generate detailed optimization reports documenting improvements
Optimization Constraints
IMPORTANT:
- Maximum optimization iterations: 2 rounds
- Stop optimization after 2 versions (v1, v2) even if further improvements are possible
- Focus on high-impact optimizations in each iteration
- If an exceptional improvement (e.g., >10x speedup) is achieved early, you may stop before reaching the limit
Optimization Workflow
Step 1: Read and Analyze Code
Use file-related tools to:
- Read the user's code file from the local filesystem
- Understand the function to be optimized
- Identify performance bottlenecks
- Implement the optimization
Example:
# Read code file
content = read_file("topk_benchmark.cpp")
# Analyze and implement optimization
# Fill in the my_topk_inplace function with optimized implementation
Step 2: Compile and Execute
Execute code via command line to measure performance:
For C++ code:
# Compile with optimization flags
g++ -O3 -std=c++17 topk_benchmark.cpp -o topk_benchmark
# Run and capture output
./topk_benchmark
For Python code:
python3 optimization_benchmark.py
For other languages:
# Java
javac MyOptimization.java && java MyOptimization
# Rust
rustc -O optimization.rs && ./optimization
# Go
go build optimization.go && ./optimization
Step 3: Extract Performance Metrics
From execution output, extract:
- Execution time: Wall-clock time, CPU time
- Memory usage: Peak memory, memory delta
- Comparison with baseline: Speedup factor, time difference
- Correctness verification: Test results, accuracy checks
Example output to parse:
N=160000, K=16000
std::nth_element time: 1234 us (1.234 ms)
my_topk_inplace time: 567 us (0.567 ms)
Verification: PASS
Speedup: 2.18x faster
Step 4: Iterate and Improve
Repeat Steps 1-3 up to 2 times maximum to achieve optimal performance:
- Iteration 1: Focus on algorithmic improvements (highest impact)
- Iteration 2: Apply low-level optimizations (SIMD, compiler flags) or concurrency
Stopping criteria:
- Reached 2 optimization iterations (hard limit)
- Achieved >10x speedup over baseline (excellent result, can stop early)
- Further optimization shows <5% improvement (diminishing returns)
- Optimization starts degrading performance (revert and stop)
Step 5: Save Results
Save optimized code and generate report:
Save optimized code:
# Save to code_optimization directory
write_file("code_optimization/topk_benchmark_optimized.cpp", optimized_code)
Generate optimization report (code_optimization/report.md):
# Code Optimization Report
## [Optimized Version] v1
### [Changes]
1. Replaced std::nth_element with std::partial_sort to cut extra sorting overhead
2. Improved the memory allocation strategy: preallocate space with reserve()
3. Rationale: partial_sort orders only the first K elements, which is more efficient here
### [Performance After Optimization]
- Runtime: reduced from 1234 us to 567 us
- Improvement: 54% faster
- Memory usage: 640 KB (same as baseline)
### [Comparison with Standard Library]
- 667 us faster than std::nth_element (~2.18x speedup)
- Verification: PASS (output matches the standard library exactly)
---
## [Optimized Version] v2
### [Changes]
1. Introduced the Quick Select algorithm to optimize the partitioning step
2. Used SIMD instructions (AVX2) to accelerate comparisons
3. Rationale: fewer branch mispredictions and better CPU pipeline utilization
### [Performance After Optimization]
- Runtime: reduced from 567 us to 312 us
- Improvement: 45% faster than v1
- Memory usage: 640 KB (no extra overhead)
### [Comparison with Standard Library]
- 922 us faster than std::nth_element (~3.95x speedup)
- Verification: PASS
---
## Final Summary
### Best version: v2 (maximum iteration count reached)
- **Overall improvement**: from the 1234 us baseline down to 312 us (74.7% improvement)
- **Versus standard library**: 3.95x faster
- **Strategy**: algorithmic improvement + SIMD vectorization
- **Iterations**: 2 rounds (limit reached)
- **Best suited for**: Top-K queries over large data sets (N > 100K)
- **Trade-offs**: no extra memory overhead; moderate code complexity
### Techniques Used
1. Algorithm level: Quick Select (expected linear time)
2. Instruction level: SIMD vectorization (AVX2)
3. Compiler level: -O3 -march=native
Key Performance Metrics to Track
Execution Time
- Wall-clock time: Total elapsed time
- CPU time: Actual CPU computation time
- Speedup factor: Comparison with baseline (e.g., 2.5x faster)
Memory Usage
- Peak memory: Maximum memory consumption
- Memory delta: Additional memory vs baseline
- Memory efficiency: Performance per MB
Correctness
- Verification status: PASS/FAIL
- Accuracy: Numerical precision if applicable
- Edge cases: Boundary condition handling
Scalability
- Input size scaling: Performance with varying data sizes
- Thread scaling: Performance with different thread counts (if applicable)
- Cache behavior: L1/L2/L3 cache hit rates
Optimization Strategies (Prioritized for 2 Iterations)
Iteration 1: Algorithmic Improvements (Highest Impact – Must Do)
- Replace O(n log n) with O(n) algorithms
- Use specialized data structures (heaps, trees)
- Implement divide-and-conquer approaches
- Apply dynamic programming techniques
- Choose better algorithms from the start
Iteration 2: Low-Level Optimizations or Concurrency (Choose Based on Problem)
Option A: Low-Level Optimizations (for CPU-bound tasks)
- Compiler flags: -O3, -march=native, -flto
- SIMD instructions: SSE, AVX2, AVX-512
- Branch reduction: Eliminate conditional branches
- Memory alignment: Align data for vectorization
- Cache optimization: Improve data locality
Option B: Concurrency (for parallelizable tasks)
- Multi-threading: Thread pools, work stealing
- Lock-free algorithms: Atomic operations, CAS
- SIMD + Threading: Combine both approaches
- GPU acceleration: CUDA, OpenCL for highly parallel tasks
Memory Optimization (Apply Throughout)
- Cache-friendly access: Sequential reads, prefetching
- Memory pooling: Reduce allocation overhead
- Data layout: Structure-of-arrays (SoA) vs array-of-structures (AoS)
- Zero-copy: Avoid unnecessary data duplication
Best Practices
- Measure First: Always benchmark baseline performance before optimizing
- Verify Correctness: Test optimized code against reference implementation
- Incremental Changes: Optimize one aspect at a time to isolate improvements
- Document Everything: Record each optimization attempt in the report
- Consider Trade-offs: Balance performance, memory, code complexity
- Platform Awareness: Test on target hardware (CPU architecture, cache sizes)
- Compiler Optimizations: Use appropriate flags but understand what they do
- Profile-Guided: Use profiling tools (perf, valgrind) to identify bottlenecks
- Respect Iteration Limit: Plan your 2 iterations strategically (algorithm first, then low-level/concurrency)
Common Pitfalls to Avoid
- Premature optimization: Don’t optimize before identifying bottlenecks
- Micro-benchmarking errors: Ensure compiler doesn’t optimize away test code
- Ignoring correctness: Fast but wrong code is useless
- Over-engineering: Don’t sacrifice readability for marginal gains
- Platform-specific code: Document hardware dependencies clearly
- Exceeding iteration limit: Stop after 2 optimization rounds even if more is possible
Example Optimization Session (2-Iteration Limit)
Baseline: std::nth_element: 1234 us
Iteration 1 (Algorithm): Quick Select with 3-way partitioning
→ my_topk v1: 567 us (54% faster) ✓
Iteration 2 (Low-level): Add SIMD vectorization (AVX2)
→ my_topk v2: 312 us (75% faster than baseline) ✓ BEST
Final result: 3.95x speedup over std::nth_element
Status: Reached maximum 2 iterations, optimization complete ✓
Tools and Commands
Compilation
# C++ with optimizations
g++ -O3 -march=native -std=c++17 code.cpp -o code
# Enable warnings
g++ -O3 -Wall -Wextra -pedantic code.cpp -o code
# Link-time optimization
g++ -O3 -flto code.cpp -o code
Profiling
# Linux perf
perf stat ./code
perf record ./code && perf report
# Valgrind (memory profiling)
valgrind --tool=massif ./code
# Google benchmark
./code --benchmark_format=console
Verification
# Run with sanitizers
g++ -fsanitize=address,undefined code.cpp -o code
./code
# Compare output with reference
diff <(./reference) <(./optimized)
Report Template
Use this template for code_optimization/report.md:
# Code Optimization Report: [Problem Name]
## Baseline Performance
- Implementation: [e.g., std::nth_element]
- Execution time: [X] us
- Memory usage: [Y] KB
- Input size: N=[value], K=[value]
---
## [Optimized Version] v1
### [Changes]
1. [Specific optimization 1]
2. [Specific optimization 2]
3. Rationale: [why this optimization was chosen]
### [Performance After Optimization]
- Runtime: reduced from [X] us to [Y] us
- Improvement: [percentage]% faster
- Memory usage: [Z] KB
### [Comparison with Standard Library]
- [delta] us faster/slower than the baseline (~[factor]x speedup)
- Verification: [PASS/FAIL]
---
## [Optimized Version] v2
### [Changes]
1. [Specific optimization 1]
2. [Specific optimization 2]
3. Rationale: [why this optimization was chosen]
### [Performance After Optimization]
- Runtime: reduced from [X] us to [Y] us
- Improvement: [percentage]% faster than v1
- Memory usage: [Z] KB
### [Comparison with Standard Library]
- [delta] us faster/slower than the baseline (~[factor]x speedup)
- Verification: [PASS/FAIL]
---
## Final Summary (maximum iteration count reached: 2 rounds)
- Best version: [vX]
- Overall improvement: [percentage]%
- Final speedup: [X]x
- Iterations: 2 rounds (limit reached)
- Strategy: [list the key techniques]
- Best suited for: [describe the ideal use case]
- Trade-offs: [list the trade-offs]
- Further optimization suggestions: [directions worth trying if time allows]
Resources
- Compiler optimizations: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- SIMD programming: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
- Performance analysis: https://perf.wiki.kernel.org/
- Algorithmic complexity: https://www.bigocheatsheet.com/
Remember: Performance optimization is an iterative process. You are limited to 2 optimization iterations maximum. Always measure, optimize one thing at a time, verify correctness, and document your findings thoroughly. Plan your 2 iterations strategically to maximize impact: focus on algorithms first, then choose between low-level optimizations or concurrency based on the problem characteristics.