rust-performance-best-practices

📁 mcart13/dev-skills 📅 7 days ago
Install command
npx skills add https://github.com/mcart13/dev-skills --skill rust-performance-best-practices


Skill Documentation

Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.

When to Apply

Reference these guidelines when:

  • Investigating slow Rust programs or high latency
  • Optimizing build times or binary size
  • Reviewing allocation-heavy code
  • Debugging lock contention or thread scaling issues
  • Setting up release profiles for production
  • Working with async runtimes (Tokio, async-std)

When NOT to Apply

Skip these optimizations when:

  • Code isn’t in a hot path (profile first!)
  • Readability would suffer significantly
  • You haven’t measured a performance problem
  • The optimization requires unsafe code you can’t verify
  • Premature optimization would delay shipping

The Optimization Workflow

CRITICAL: Most Rust code doesn’t need optimization. Profile first, optimize second.

┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp && dh_view.py dhat.out.*   # DHAT writes dhat.out.<pid>

# Benchmark
cargo bench                          # All benchmarks
cargo bench hot_function             # Specific benchmark

# Check allocations (glibc mtrace; the program must call mtrace() at startup)
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection (requires cargo-show-asm: cargo install cargo-show-asm)
cargo asm my_crate::hot_function --rust

# syscall counts (strace -c prints its summary table on exit)
strace -c ./target/release/myapp 2>&1 | tail -20

Common Scenarios → Rules

“My Rust program is slow”

Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

“My binary is too large”

1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
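The size-focused settings above combine into a single release profile in Cargo.toml; a sketch (values illustrative, tune for your project):

```toml
[profile.release]
opt-level = "z"     # optimize for size, not speed
lto = true          # cross-crate optimization, smaller binary
codegen-units = 1   # one codegen unit: better optimization, slower compile
panic = "abort"     # drop unwinding machinery
strip = true        # strip symbols from the binary
debug = 0           # no debug info
```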

“High memory usage”

1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate
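Steps 1-2 above can be sketched together; function names here are illustrative:

```rust
// Pre-sizing: one allocation up front instead of repeated growth-doublings.
fn squares_presized(n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i * i);
    }
    v
}

// Buffer reuse: clear() resets the length but keeps the allocation,
// so the loop body allocates at most once.
fn process_lines(lines: &[&str]) -> usize {
    let mut buf = String::with_capacity(64);
    let mut total = 0;
    for line in lines {
        buf.clear();
        buf.push_str(line);
        buf.push('\n');
        total += buf.len();
    }
    total
}

fn main() {
    assert_eq!(squares_presized(4), vec![0, 1, 4, 9]);
    assert_eq!(process_lines(&["a", "bb"]), 5);
    println!("ok");
}
```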

“Lock contention / thread scaling”

1. Profile first: confirm time is spent acquiring locks (perf, cargo flamegraph)
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats
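For the "simple counters" case (step 4), an atomic replaces a Mutex<u64> entirely; a sketch with illustrative names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Shared counter as an atomic: no lock acquisition, no contention on a guard.
fn count_in_parallel(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed suffices for a pure counter with no ordering
                    // requirements on other memory.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(count_in_parallel(4, 10_000), 40_000);
    println!("ok");
}
```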

“Slow file I/O”

1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate
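Step 3 (reusing the line buffer) looks like this; a Cursor stands in for a File so the sketch is self-contained:

```rust
use std::io::{BufRead, Cursor};

// Reuse one String across read_line calls instead of allocating a fresh
// String per line the way lines() does.
fn count_nonempty<R: BufRead>(mut reader: R) -> std::io::Result<usize> {
    let mut line = String::new();
    let mut count = 0;
    loop {
        line.clear(); // keep the allocation, reset the length
        if reader.read_line(&mut line)? == 0 {
            break; // EOF
        }
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    let data = Cursor::new("a\n\nb\nc\n");
    assert_eq!(count_nonempty(data).unwrap(), 3);
    println!("ok");
}
```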

Rule Categories

| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug → release) | build- |
| 2 | Benchmarking | Enables measurement | bench- |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Async/Await | Correctness and latency in async code | async- |
| 9 | Unsafe | 5-30% in tight loops (experts only) | unsafe- |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level = 3 for speed, "z" for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| build-codegen-units | 5-15% | codegen-units = 1 for max optimization |
| build-panic-abort | Binary size | panic = "abort" removes unwinding |
| build-target-cpu | 10-30% | target-cpu=native for SIMD |
| build-pgo | 5-20% | Profile-guided optimization |
| build-incremental-off | 5-10% | Disable incremental for release builds |
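A speed-oriented release profile combining several of these rules might look like this (a sketch; values are illustrative starting points):

```toml
[profile.release]
opt-level = 3       # optimize for speed
lto = "fat"         # whole-program cross-crate inlining
codegen-units = 1   # single codegen unit, maximum optimization
incremental = false # incremental caching off for release builds

# target-cpu is passed via RUSTFLAGS, not the profile:
#   RUSTFLAGS="-C target-cpu=native" cargo build --release
```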

2. Benchmarking (REQUIRED)

You can’t optimize what you don’t measure.

| Rule | Purpose |
|---|---|
| bench-cargo-bench | Use cargo bench with criterion |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box | Prevent dead code elimination |
| bench-avoid-io | I/O variance destroys measurements |
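Criterion is the right tool for real benchmarks, but the black_box principle can be shown dependency-free: without it, the optimizer may delete the very work being measured. A sketch using only std (function name illustrative):

```rust
use std::hint::black_box;
use std::time::Instant;

fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // black_box hides the input and the result from the optimizer,
        // so the loop body cannot be constant-folded or removed.
        black_box(sum_of_squares(black_box(1_000)));
    }
    println!("1000 iterations in {:?}", start.elapsed());
    assert_eq!(sum_of_squares(4), 0 + 1 + 4 + 9);
}
```

Criterion's `black_box` serves the same purpose inside `bench_function` closures.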

3. Allocation

Allocations are expensive, and growing the heap triggers syscalls (brk/mmap). Reduce them.

| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n), not Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-10x | .clear() and reuse, don't reallocate (up to 50x in tight loops) |
| alloc-use-slices-in-apis | Flexibility | &[T], not Vec<T>, in parameters |
| alloc-avoid-clone | 2-10x | Borrow &T instead of clone() (benefits scale with data size) |

4. Data Structures

The right data structure beats micro-optimization.

| Rule | When |
|---|---|
| data-avoid-linkedlist | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| data-choose-map-type | HashMap = O(1), BTreeMap = sorted |
| data-use-entry-api | Insert-or-update patterns |
| data-repr-transparent | FFI newtypes |
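The entry API (data-use-entry-api) does one hash lookup for insert-or-update instead of the two a contains_key/insert pair costs; a sketch with an illustrative function name:

```rust
use std::collections::HashMap;

// Count occurrences with a single lookup per word: entry() finds or
// creates the slot, or_insert(0) initializes it on first sight.
fn word_counts(words: &[&str]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for w in words {
        *counts.entry(w.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let c = word_counts(&["a", "b", "a"]);
    assert_eq!(c["a"], 2);
    assert_eq!(c["b"], 1);
    println!("ok");
}
```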

5. Iteration

Iterators are as fast as loops and safer.

| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| iter-use-lazy-iterators | 2-3x | .filter().map(), not intermediate vecs |
| iter-use-any-find | Short-circuit | .any(), not .filter().count() > 0 |
| iter-use-retain | In-place | .retain(), not .filter().collect() |
| iter-use-binary-search | O(log n) | .binary_search() on sorted data |
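The lazy-chain rules can be contrasted directly; the fast version makes one pass with zero intermediate allocations (function names illustrative):

```rust
// Lazy adapters: filter and map fuse into a single pass, no Vec in between.
fn sum_even_squares(data: &[u64]) -> u64 {
    data.iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

// The collect-then-loop equivalent allocates two intermediate Vecs.
fn sum_even_squares_slow(data: &[u64]) -> u64 {
    let evens: Vec<u64> = data.iter().copied().filter(|x| x % 2 == 0).collect();
    let squares: Vec<u64> = evens.iter().map(|x| x * x).collect();
    squares.iter().sum()
}

fn main() {
    let data = [1, 2, 3, 4];
    assert_eq!(sum_even_squares(&data), 20); // 4 + 16
    assert_eq!(sum_even_squares(&data), sum_even_squares_slow(&data));
    println!("ok");
}
```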

6. Synchronization

Locks are expensive. Minimize contention.

| Rule | Impact | When |
|---|---|---|
| sync-share-with-arc | Avoids copying | Share large (>64B) data across threads |
| sync-use-rwlock | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
| sync-use-parking-lot | 1.5-5x | Prefer parking_lot over std sync primitives |
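One way to apply sync-keep-lock-scope-short: copy what the expensive work needs out of the lock, then release the guard before doing that work. A sketch (names illustrative; the clone costs an allocation, so this pays off when the work greatly outweighs the copy):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn snapshot_sum(shared: &Arc<Mutex<Vec<u64>>>) -> u64 {
    // Lock held only long enough to copy the data out.
    let snapshot = {
        let guard = shared.lock().unwrap();
        guard.clone()
    }; // guard dropped here, other threads can proceed
    // Expensive work runs with no lock held.
    snapshot.iter().map(|x| x * x).sum()
}

fn main() {
    let shared = Arc::new(Mutex::new(vec![1u64, 2, 3]));
    let s2 = Arc::clone(&shared);
    let h = thread::spawn(move || snapshot_sum(&s2));
    assert_eq!(snapshot_sum(&shared), 14); // 1 + 4 + 9
    assert_eq!(h.join().unwrap(), 14);
    println!("ok");
}
```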

7. I/O

Every syscall costs. Buffer them.

| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
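The BufWriter + explicit flush pattern looks like this; a Vec stands in for a File so the sketch is self-contained (over a File, each unbuffered write would be a syscall):

```rust
use std::io::{BufWriter, Write};

// BufWriter batches small writes. flush() pushes the buffer down and,
// crucially, surfaces any write error; relying on flush-on-drop
// swallows errors silently.
fn write_lines(lines: &[&str]) -> std::io::Result<Vec<u8>> {
    let mut w = BufWriter::new(Vec::new());
    for line in lines {
        writeln!(w, "{line}")?; // buffered: no syscall per line on a File
    }
    w.flush()?; // explicit flush: errors become visible here
    Ok(w.into_inner().map_err(|e| e.into_error())?)
}

fn main() {
    let out = write_lines(&["a", "b"]).unwrap();
    assert_eq!(out, b"a\nb\n");
    println!("ok");
}
```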

8. Async/Await (HIGH)

Critical for Tokio and async-std applications.

| Rule | Impact | Pattern |
|---|---|---|
| async-spawn-blocking | Prevents hang | Use spawn_blocking for CPU-bound work |
| async-cooperative | Latency | Yield periodically in long computations |
| async-mutex-choice | Correctness | tokio::sync::Mutex across .await points |
| async-avoid-blocking-io | Throughput | Use async I/O, not std::fs, in async contexts |
| async-bounded-channels | Backpressure | Prefer bounded channels for flow control |

Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.

// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data);  // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to the blocking thread pool (data is moved in, so 'static)
async fn process(data: Vec<u8>) -> Result<Hash> {
    // The JoinHandle yields Result<Hash, JoinError>, so `?` needs an error
    // type that converts from JoinError (e.g. anyhow::Error)
    Ok(tokio::task::spawn_blocking(move || expensive_hash(&data)).await?)
}

9. Unsafe (Expert Only)

Only after profiling proves these matter.

| Rule | Impact | Risk |
|---|---|---|
| unsafe-get-unchecked | 5-30% | UB if bounds wrong |
| unsafe-use-maybeuninit | 20-100x (allocation init) | UB if read before write |
| unsafe-avoid-transmute | Correctness | Prefer safe alternatives |
| unsafe-repr-transparent | Zero-cost | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine

Mutex vs RwLock vs Atomics?

Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot > std for all of these

When is unsafe get_unchecked worth it?

Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
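The three outcomes of this tree can be put side by side; the iterator version is usually just as fast as the unsafe one because zip bounds the loop by both lengths (function names illustrative):

```rust
// Indexed: the bounds check on b[i] may survive optimization.
fn dot_indexed(a: &[u64], b: &[u64]) -> u64 {
    let mut sum = 0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

// Iterator: zip ends at the shorter slice, so no checks inside the loop.
fn dot_iter(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Unsafe: only with a documented, checked invariant making it sound.
fn dot_unchecked(a: &[u64], b: &[u64]) -> u64 {
    assert!(b.len() >= a.len()); // invariant that justifies the unsafe below
    let mut sum = 0;
    for i in 0..a.len() {
        // SAFETY: i < a.len() <= b.len(), asserted above
        sum += unsafe { a.get_unchecked(i) * b.get_unchecked(i) };
    }
    sum
}

fn main() {
    let (a, b) = ([1u64, 2, 3], [4u64, 5, 6]);
    assert_eq!(dot_iter(&a, &b), 32); // 4 + 10 + 18
    assert_eq!(dot_indexed(&a, &b), 32);
    assert_eq!(dot_unchecked(&a, &b), 32);
    println!("ok");
}
```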

Reading Rules

Each rule file in rules/ contains:

  • Quantified impact with real benchmark numbers
  • Visual explanations of how the optimization works
  • Incorrect examples showing common mistakes
  • Correct examples with best practices
  • When NOT to apply – trade-offs and edge cases
  • Common mistakes to avoid
  • Profiling commands to identify the issue
  • References to official docs

Full Compiled Document

For all rules in a single file: AGENTS.md