rust-performance-best-practices
npx skills add https://github.com/mcart13/dev-skills --skill rust-performance-best-practices
Rust Performance Best Practices
Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.
When to Apply
Reference these guidelines when:
- Investigating slow Rust programs or high latency
- Optimizing build times or binary size
- Reviewing allocation-heavy code
- Debugging lock contention or thread scaling issues
- Setting up release profiles for production
- Working with async runtimes (Tokio, async-std)
When NOT to Apply
Skip these optimizations when:
- Code isn’t in a hot path (profile first!)
- Readability would suffer significantly
- You haven’t measured a performance problem
- The optimization requires unsafe code you can’t verify
- Premature optimization would delay shipping
The Optimization Workflow
CRITICAL: Most Rust code doesn’t need optimization. Profile first, optimize second.
1. MEASURE FIRST
   ├── Profile before changing anything
   ├── Use cargo flamegraph, perf, or heaptrack
   └── Identify actual bottlenecks (don't guess!)
2. CHECK BUILD SETTINGS
   ├── Release mode? (10-100x vs debug)
   ├── LTO enabled? (5-20% improvement)
   └── Target CPU? (10-30% for SIMD)
3. FIX ALGORITHMIC ISSUES
   ├── O(n²) → O(n log n) matters more than micro-opts
   ├── Check data structure choices
   └── Avoid unnecessary work
4. REDUCE ALLOCATIONS
   ├── Pre-size collections (with_capacity)
   ├── Reuse buffers (clear + reuse)
   └── Avoid cloning (borrow instead)
5. OPTIMIZE HOT LOOPS
   ├── Iterators over indices
   ├── Reduce lock scope
   └── Batch I/O operations
6. MEASURE AGAIN
   ├── Verify improvement with benchmarks
   ├── Check for regressions elsewhere
   └── Document the optimization
Quick Profiling Commands
```bash
# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp   # view dhat.out.<pid> in DHAT's viewer

# Benchmark
cargo bench                  # all benchmarks
cargo bench hot_function     # specific benchmark

# Check allocations (glibc mtrace; the program must call mtrace())
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection (cargo-show-asm)
cargo asm my_crate::hot_function --rust

# Syscall count
strace -c ./target/release/myapp 2>&1 | head -20
```
Common Scenarios → Rules
“My Rust program is slow”
Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO → where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements
“My binary is too large”
1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
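The five steps above combine into a single release profile. A minimal sketch (exact savings depend on the crate; measure before and after):

```toml
# Cargo.toml — size-optimized release profile (illustrative)
[profile.release]
opt-level = "z"     # optimize for size rather than speed
lto = true          # cross-crate link-time optimization
codegen-units = 1   # single codegen unit: smaller, slower to compile
panic = "abort"     # drop stack-unwinding machinery
strip = true        # strip symbols from the binary
debug = 0           # no debug info
```

Note that `panic = "abort"` changes behavior: panics terminate the process instead of unwinding, so it is unsuitable if you rely on `catch_unwind`.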
“High memory usage”
1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate
“Lock contention / thread scaling”
1. Profile contention first (e.g. perf lock, or off-CPU flamegraphs)
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? â sync-use-rwlock
4. Simple counters? â sync-use-atomics
5. Message passing? â sync-use-channels
6. Thread-local + periodic flush for stats
“Slow file I/O”
1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate
Rule Categories
| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug→release) | build- |
| 2 | Benchmarking | Enables measurement | bench- |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Async/Await | Correctness and latency for async code | async- |
| 9 | Unsafe | 5-30% in tight loops (experts only) | unsafe- |
1. Build Profiles (CRITICAL)
These apply to ALL Rust code. Check these first.
| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level=3 for speed, "z" for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| build-codegen-units | 5-15% | codegen-units=1 for max optimization |
| build-panic-abort | Binary size | panic="abort" removes unwinding |
| build-target-cpu | 10-30% | target-cpu=native for SIMD |
| build-pgo | 5-20% | Profile-guided optimization |
| build-incremental-off | 5-10% | Disable for release builds |
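For throughput rather than binary size, the first rules in this table combine into a profile like the following. A sketch; verify each change with a benchmark, since LTO and codegen-units=1 occasionally regress specific workloads:

```toml
# Cargo.toml — speed-oriented release profile (illustrative)
[profile.release]
opt-level = 3        # optimize for speed
lto = "fat"          # whole-program link-time optimization
codegen-units = 1    # single codegen unit for maximum optimization
incremental = false  # incremental compilation off for release builds
```

target-cpu is not a profile key; it is passed via `RUSTFLAGS="-C target-cpu=native"` or `.cargo/config.toml`, and makes the binary non-portable across CPU generations.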
2. Benchmarking (REQUIRED)
You can’t optimize what you don’t measure.
| Rule | Purpose |
|---|---|
| bench-cargo-bench | Use cargo bench with criterion |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box | Prevent dead code elimination |
| bench-avoid-io | I/O variance destroys measurements |
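The black-box rule can be demonstrated with `std::hint::black_box` alone: without it, the optimizer may delete a computation whose result is never used, and the "benchmark" measures nothing. A minimal hand-rolled sketch (`sum_of_squares` is a stand-in for your real hot function; criterion applies black_box for you):

```rust
use std::hint::black_box;
use std::time::Instant;

// Stand-in for a hot function under test.
fn sum_of_squares(n: u64) -> u64 {
    (1..=n).map(|i| i * i).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // black_box hides the input from the optimizer and forces the
        // result to be materialized, so the loop cannot be deleted.
        black_box(sum_of_squares(black_box(1_000)));
    }
    println!("1000 iterations in {:?}", start.elapsed());
}
```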
3. Allocation
Heap allocations are not free: the allocator does bookkeeping on every call and periodically issues mmap/brk syscalls underneath. Reduce them.
| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n) not Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-10x | .clear() and reuse, don't reallocate (up to 50x in tight loops) |
| alloc-use-slices-in-apis | Flexibility | &[T] not Vec<T> in parameters |
| alloc-avoid-clone | 2-10x | Borrow &T instead of clone() (benefits scale with data size) |
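The reuse-buffers rule in practice: allocate once outside the loop, `clear()` inside it. `clear()` keeps the capacity, so the steady state performs zero allocations. A sketch; `format_record` is a hypothetical helper, not from the rules:

```rust
use std::fmt::Write;

// Hypothetical record formatter that reuses one String buffer.
fn format_record(buf: &mut String, id: u32, name: &str) {
    buf.clear(); // drops contents, keeps capacity: no reallocation
    let _ = write!(buf, "{id}:{name}");
}

fn main() {
    let mut buf = String::with_capacity(64); // one allocation up front
    for (id, name) in [(1, "alpha"), (2, "beta")] {
        format_record(&mut buf, id, name); // reused every iteration
        println!("{buf}");
    }
}
```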
4. Data Structures
The right data structure beats micro-optimization.
| Rule | When |
|---|---|
| data-avoid-linkedlist | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| data-choose-map-type | HashMap=O(1), BTreeMap=sorted |
| data-use-entry-api | Insert-or-update patterns |
| data-repr-transparent | FFI newtypes |
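The entry API avoids the double lookup of `contains_key` followed by `insert`: the slot is located once and either filled or updated. A minimal word-count sketch:

```rust
use std::collections::HashMap;

// Count word occurrences with a single hash lookup per word.
fn word_counts<'a>(words: &[&'a str]) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::new();
    for &w in words {
        // entry() finds the slot once; or_insert(0) fills it if vacant.
        *counts.entry(w).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts(&["a", "b", "a"]);
    assert_eq!(counts["a"], 2);
    assert_eq!(counts["b"], 1);
}
```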
5. Iteration
Iterators are as fast as loops and safer.
| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| iter-use-lazy-iterators | 2-3x | .filter().map() not intermediate vecs |
| iter-use-any-find | Short-circuit | .any() not .filter().count() > 0 |
| iter-use-retain | In-place | .retain() not .filter().collect() |
| iter-use-binary-search | O(log n) | .binary_search() on sorted data |
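Two of the iterator rules side by side: `.any()` short-circuits at the first match where `.filter().count() > 0` scans everything, and `.retain()` filters in place without a new allocation. A sketch:

```rust
// any() stops at the first match instead of counting all matches.
fn has_negative(xs: &[i32]) -> bool {
    xs.iter().any(|&x| x < 0) // short-circuits
}

// retain() keeps matching elements in place, reusing the allocation.
fn keep_even(xs: &mut Vec<i32>) {
    xs.retain(|&x| x % 2 == 0);
}

fn main() {
    assert!(has_negative(&[3, -1, 7]));
    let mut v = vec![1, 2, 3, 4];
    keep_even(&mut v);
    assert_eq!(v, vec![2, 4]);
}
```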
6. Synchronization
Locks are expensive. Minimize contention.
| Rule | Impact | When |
|---|---|---|
| sync-share-with-arc | Avoids copying | Share large (>64B) data across threads |
| sync-use-rwlock | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
| sync-use-parking-lot | 1.5-5x | Prefer parking_lot over std sync primitives |
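The atomics rule for the simple-counter case: an `AtomicU64` with `Relaxed` ordering replaces a `Mutex<u64>` when the counter guards nothing else. A sketch, with illustrative thread and iteration counts:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Count events from several threads with a lock-free atomic counter.
fn parallel_count(threads: u64, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed suffices for a plain counter: only the final
                    // total matters, not ordering with other memory.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(parallel_count(4, 1_000), 4_000);
}
```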
7. I/O
Every syscall costs. Buffer them.
| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
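The two writer rules together: wrap the `File` in `BufWriter` so each `writeln!` is a memory copy rather than a `write(2)` syscall, and flush explicitly before the writer is dropped, since `Drop` swallows flush errors. A sketch writing to a temp file:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Write n lines through a buffer; without BufWriter every
// writeln! would be its own syscall.
fn write_lines(path: &str, n: u32) -> std::io::Result<()> {
    let file = File::create(path)?;
    let mut w = BufWriter::new(file);
    for i in 0..n {
        writeln!(w, "line {i}")?;
    }
    // Flush explicitly: BufWriter's Drop flushes too, but it
    // silently discards any I/O error.
    w.flush()
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("bufwriter_demo.txt");
    let path = path.to_str().unwrap().to_owned();
    write_lines(&path, 3)?;
    assert_eq!(std::fs::read_to_string(&path)?.lines().count(), 3);
    Ok(())
}
```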
8. Async/Await (HIGH)
Critical for Tokio and async-std applications.
| Rule | Impact | Pattern |
|---|---|---|
| async-spawn-blocking | Prevents hang | Use spawn_blocking for CPU-bound work |
| async-cooperative | Latency | Yield periodically in long computations |
| async-mutex-choice | Correctness | tokio::sync::Mutex across .await points |
| async-avoid-blocking-io | Throughput | Use async I/O, not std::fs in async contexts |
| async-bounded-channels | Backpressure | Prefer bounded channels for flow control |
Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.
```rust
// BAD: blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data); // CPU-bound, blocks the executor!
    Ok(hash)
}

// GOOD: offload to the blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}
```
9. Unsafe (Expert Only)
Only after profiling proves these matter.
| Rule | Impact | Risk |
|---|---|---|
| unsafe-get-unchecked | 5-30% | UB if bounds wrong |
| unsafe-use-maybeuninit | 20-100x alloc | UB if read before write |
| unsafe-avoid-transmute | Correctness | Prefer safe alternatives |
| unsafe-repr-transparent | Zero-cost | Required for FFI newtypes |
Decision Trees
When to use with_capacity?
Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO → will it grow frequently?
    ├── YES → start bigger or use reserve()
    └── NO → Vec::new() is fine
Mutex vs RwLock vs Atomics?
Is it a simple counter/flag?
├── YES → atomics (20x faster)
└── NO → what's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)
Consider: parking_lot > std for all of these
When is unsafe get_unchecked worth it?
Did you profile and find bounds checks are the bottleneck?
├── NO → don't use it
└── YES → did you check whether LLVM already removed the bounds check?
    ├── NO → check the assembly first (cargo asm)
    └── YES, still there → can you use iterators instead?
        ├── YES → use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
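The "use iterators" branch above, concretely: iterating a slice by element typically compiles to the same machine code as get_unchecked would produce, with none of the UB risk, because no index exists to bounds-check. A sketch comparing the two safe forms:

```rust
// Indexed form: xs[i] carries a bounds check that the optimizer
// must prove away (it usually can here, but not always).
fn sum_indexed(xs: &[u64]) -> u64 {
    let mut total = 0;
    for i in 0..xs.len() {
        total += xs[i];
    }
    total
}

// Iterator form: no index exists, so no bounds check exists.
fn sum_iter(xs: &[u64]) -> u64 {
    xs.iter().sum()
}

fn main() {
    let xs = [1, 2, 3, 4];
    assert_eq!(sum_indexed(&xs), sum_iter(&xs));
    assert_eq!(sum_iter(&xs), 10);
}
```

Inspect both with `cargo asm` before reaching for unsafe; if the iterator version already has no bounds checks, get_unchecked buys nothing.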
Reading Rules
Each rule file in rules/ contains:
- Quantified impact with real benchmark numbers
- Visual explanations of how the optimization works
- Incorrect examples showing common mistakes
- Correct examples with best practices
- When NOT to apply – trade-offs and edge cases
- Common mistakes to avoid
- Profiling commands to identify the issue
- References to official docs
Full Compiled Document
For all rules in a single file: AGENTS.md