rust-performance-best-practices

📁 mcart13/dev-skills 📅 7 days ago
Install command
npx skills add https://github.com/mcart13/dev-skills --skill rust-performance-best-practices


Skill Documentation

Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.

When to Apply

Reference these guidelines when:

  • Investigating slow Rust programs or high latency
  • Optimizing build times or binary size
  • Reviewing allocation-heavy code
  • Debugging lock contention or thread scaling issues
  • Setting up release profiles for production
  • Working with async runtimes (Tokio, async-std)

When NOT to Apply

Skip these optimizations when:

  • Code isn’t in a hot path (profile first!)
  • Readability would suffer significantly
  • You haven’t measured a performance problem
  • The optimization requires unsafe code you can’t verify
  • Premature optimization would delay shipping

The Optimization Workflow

CRITICAL: Most Rust code doesn’t need optimization. Profile first, optimize second.

┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp && dh_view.py dhat.out.*   # DHAT writes dhat.out.<pid>

# Benchmark
cargo bench                          # All benchmarks
cargo bench hot_function             # Specific benchmark

# Check allocations (glibc mtrace; the program must call mtrace() at startup)
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection (requires cargo-show-asm: cargo install cargo-show-asm)
cargo asm my_crate::hot_function --rust

# syscall counts (strace -c prints its summary table on exit)
strace -c ./target/release/myapp 2>&1 | tail -20

Common Scenarios → Rules

“My Rust program is slow”

Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

“My binary is too large”

1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0
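The size-focused settings above combine into a single release profile in Cargo.toml; a sketch (values illustrative, tune for your project):

```toml
[profile.release]
opt-level = "z"     # optimize for size, not speed
lto = true          # cross-crate optimization, smaller binary
codegen-units = 1   # one codegen unit: better optimization, slower compile
panic = "abort"     # drop unwinding machinery
strip = true        # strip symbols from the binary
debug = 0           # no debug info
```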

“High memory usage”

1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate
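Steps 1-2 above can be sketched together; function names here are illustrative:

```rust
// Pre-sizing: one allocation up front instead of repeated growth-doublings.
fn squares_presized(n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i * i);
    }
    v
}

// Buffer reuse: clear() resets the length but keeps the allocation,
// so the loop body allocates at most once.
fn process_lines(lines: &[&str]) -> usize {
    let mut buf = String::with_capacity(64);
    let mut total = 0;
    for line in lines {
        buf.clear();
        buf.push_str(line);
        buf.push('\n');
        total += buf.len();
    }
    total
}

fn main() {
    assert_eq!(squares_presized(4), vec![0, 1, 4, 9]);
    assert_eq!(process_lines(&["a", "bb"]), 5);
    println!("ok");
}
```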

“Lock contention / thread scaling”

1. Profile first: confirm time is spent acquiring locks (perf, cargo flamegraph)
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats
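For the "simple counters" case (step 4), an atomic replaces a Mutex<u64> entirely; a sketch with illustrative names:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Shared counter as an atomic: no lock acquisition, no contention on a guard.
fn count_in_parallel(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed suffices for a pure counter with no ordering
                    // requirements on other memory.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(count_in_parallel(4, 10_000), 40_000);
    println!("ok");
}
```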

“Slow file I/O”

1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate
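Step 3 (reusing the line buffer) looks like this; a Cursor stands in for a File so the sketch is self-contained:

```rust
use std::io::{BufRead, Cursor};

// Reuse one String across read_line calls instead of allocating a fresh
// String per line the way lines() does.
fn count_nonempty<R: BufRead>(mut reader: R) -> std::io::Result<usize> {
    let mut line = String::new();
    let mut count = 0;
    loop {
        line.clear(); // keep the allocation, reset the length
        if reader.read_line(&mut line)? == 0 {
            break; // EOF
        }
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    let data = Cursor::new("a\n\nb\nc\n");
    assert_eq!(count_nonempty(data).unwrap(), 3);
    println!("ok");
}
```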

Rule Categories

| Priority | Category | Typical Impact | Prefix |
|---|---|---|---|
| 1 | Build Profiles | 10-100x (debug → release) | build- |
| 2 | Benchmarking | Enables measurement | bench- |
| 3 | Allocation | 2-50x for allocation-heavy code | alloc- |
| 4 | Data Structures | 2-10x for hot paths | data- |
| 5 | Iteration | 2-5x for loop-heavy code | iter- |
| 6 | Synchronization | 5-100x for contended code | sync- |
| 7 | I/O | 10-100x for I/O-bound code | io- |
| 8 | Async/Await | Correctness and latency in async code | async- |
| 9 | Unsafe | 5-30% in tight loops (experts only) | unsafe- |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule | Impact | One-liner |
|---|---|---|
| build-release-profile | 10-100x | Always ship release builds |
| build-opt-level | 2-5x | opt-level = 3 for speed, "z" for size |
| build-enable-lto | 5-20% | LTO enables cross-crate optimization |
| build-codegen-units | 5-15% | codegen-units = 1 for max optimization |
| build-panic-abort | Binary size | panic = "abort" removes unwinding |
| build-target-cpu | 10-30% | target-cpu=native for SIMD |
| build-pgo | 5-20% | Profile-guided optimization |
| build-incremental-off | 5-10% | Disable incremental for release builds |
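A speed-oriented release profile combining several of these rules might look like this (a sketch; values are illustrative starting points):

```toml
[profile.release]
opt-level = 3       # optimize for speed
lto = "fat"         # whole-program cross-crate inlining
codegen-units = 1   # single codegen unit, maximum optimization
incremental = false # incremental caching off for release builds

# target-cpu is passed via RUSTFLAGS, not the profile:
#   RUSTFLAGS="-C target-cpu=native" cargo build --release
```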

2. Benchmarking (REQUIRED)

You can’t optimize what you don’t measure.

| Rule | Purpose |
|---|---|
| bench-cargo-bench | Use cargo bench with criterion |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box | Prevent dead code elimination |
| bench-avoid-io | I/O variance destroys measurements |
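Criterion is the right tool for real benchmarks, but the black_box principle can be shown dependency-free: without it, the optimizer may delete the very work being measured. A sketch using only std (function name illustrative):

```rust
use std::hint::black_box;
use std::time::Instant;

fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // black_box hides the input and the result from the optimizer,
        // so the loop body cannot be constant-folded or removed.
        black_box(sum_of_squares(black_box(1_000)));
    }
    println!("1000 iterations in {:?}", start.elapsed());
    assert_eq!(sum_of_squares(4), 0 + 1 + 4 + 9);
}
```

Criterion's `black_box` serves the same purpose inside `bench_function` closures.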

3. Allocation

Allocations are expensive, and growing the heap triggers syscalls (brk/mmap). Reduce them.

| Rule | Impact | Pattern |
|---|---|---|
| alloc-vec-with-capacity | 2-10x | Vec::with_capacity(n), not Vec::new() |
| alloc-string-with-capacity | 2-5x | String::with_capacity(n) |
| alloc-hashmap-with-capacity | 2-5x | HashMap::with_capacity(n) |
| alloc-reuse-buffers | 2-10x | .clear() and reuse, don't reallocate (up to 50x in tight loops) |
| alloc-use-slices-in-apis | Flexibility | &[T], not Vec<T>, in parameters |
| alloc-avoid-clone | 2-10x | Borrow &T instead of clone() (benefits scale with data size) |

4. Data Structures

The right data structure beats micro-optimization.

| Rule | When |
|---|---|
| data-avoid-linkedlist | Almost always (Vec wins) |
| data-choose-vecdeque-for-queue | FIFO queues |
| data-choose-map-type | HashMap = O(1), BTreeMap = sorted |
| data-use-entry-api | Insert-or-update patterns |
| data-repr-transparent | FFI newtypes |
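The entry API (data-use-entry-api) does one hash lookup for insert-or-update instead of the two a contains_key/insert pair costs; a sketch with an illustrative function name:

```rust
use std::collections::HashMap;

// Count occurrences with a single lookup per word: entry() finds or
// creates the slot, or_insert(0) initializes it on first sight.
fn word_counts(words: &[&str]) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for w in words {
        *counts.entry(w.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let c = word_counts(&["a", "b", "a"]);
    assert_eq!(c["a"], 2);
    assert_eq!(c["b"], 1);
    println!("ok");
}
```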

5. Iteration

Iterators are as fast as loops and safer.

| Rule | Impact | Pattern |
|---|---|---|
| iter-avoid-collect-then-loop | 2-3x | Chain iterators, don't collect |
| iter-use-lazy-iterators | 2-3x | .filter().map(), not intermediate vecs |
| iter-use-any-find | Short-circuit | .any(), not .filter().count() > 0 |
| iter-use-retain | In-place | .retain(), not .filter().collect() |
| iter-use-binary-search | O(log n) | .binary_search() on sorted data |
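The lazy-chain rules can be contrasted directly; the fast version makes one pass with zero intermediate allocations (function names illustrative):

```rust
// Lazy adapters: filter and map fuse into a single pass, no Vec in between.
fn sum_even_squares(data: &[u64]) -> u64 {
    data.iter()
        .filter(|&&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

// The collect-then-loop equivalent allocates two intermediate Vecs.
fn sum_even_squares_slow(data: &[u64]) -> u64 {
    let evens: Vec<u64> = data.iter().copied().filter(|x| x % 2 == 0).collect();
    let squares: Vec<u64> = evens.iter().map(|x| x * x).collect();
    squares.iter().sum()
}

fn main() {
    let data = [1, 2, 3, 4];
    assert_eq!(sum_even_squares(&data), 20); // 4 + 16
    assert_eq!(sum_even_squares(&data), sum_even_squares_slow(&data));
    println!("ok");
}
```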

6. Synchronization

Locks are expensive. Minimize contention.

| Rule | Impact | When |
|---|---|---|
| sync-share-with-arc | Avoids copying | Share large (>64B) data across threads |
| sync-use-rwlock | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x | Minimize code under lock |
| sync-use-channels | 3-4x | Message passing vs shared state |
| sync-use-atomics | 20x | Simple counters, flags |
| sync-use-parking-lot | 1.5-5x | Prefer parking_lot over std sync primitives |
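One way to apply sync-keep-lock-scope-short: copy what the expensive work needs out of the lock, then release the guard before doing that work. A sketch (names illustrative; the clone costs an allocation, so this pays off when the work greatly outweighs the copy):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn snapshot_sum(shared: &Arc<Mutex<Vec<u64>>>) -> u64 {
    // Lock held only long enough to copy the data out.
    let snapshot = {
        let guard = shared.lock().unwrap();
        guard.clone()
    }; // guard dropped here, other threads can proceed
    // Expensive work runs with no lock held.
    snapshot.iter().map(|x| x * x).sum()
}

fn main() {
    let shared = Arc::new(Mutex::new(vec![1u64, 2, 3]));
    let s2 = Arc::clone(&shared);
    let h = thread::spawn(move || snapshot_sum(&s2));
    assert_eq!(snapshot_sum(&shared), 14); // 1 + 4 + 9
    assert_eq!(h.join().unwrap(), 14);
    println!("ok");
}
```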

7. I/O

Every syscall costs. Buffer them.

| Rule | Impact | Pattern |
|---|---|---|
| io-use-bufreader | 50x | Wrap File in BufReader |
| io-use-bufwriter | 18x | Wrap File in BufWriter |
| io-flush-bufwriter | CRITICAL | Must flush or lose data! |
| io-read-line-with-bufread | 53x | Reuse String buffer with read_line |
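The BufWriter + explicit flush pattern looks like this; a Vec stands in for a File so the sketch is self-contained (over a File, each unbuffered write would be a syscall):

```rust
use std::io::{BufWriter, Write};

// BufWriter batches small writes. flush() pushes the buffer down and,
// crucially, surfaces any write error; relying on flush-on-drop
// swallows errors silently.
fn write_lines(lines: &[&str]) -> std::io::Result<Vec<u8>> {
    let mut w = BufWriter::new(Vec::new());
    for line in lines {
        writeln!(w, "{line}")?; // buffered: no syscall per line on a File
    }
    w.flush()?; // explicit flush: errors become visible here
    Ok(w.into_inner().map_err(|e| e.into_error())?)
}

fn main() {
    let out = write_lines(&["a", "b"]).unwrap();
    assert_eq!(out, b"a\nb\n");
    println!("ok");
}
```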

8. Async/Await (HIGH)

Critical for Tokio and async-std applications.

| Rule | Impact | Pattern |
|---|---|---|
| async-spawn-blocking | Prevents hang | Use spawn_blocking for CPU-bound work |
| async-cooperative | Latency | Yield periodically in long computations |
| async-mutex-choice | Correctness | tokio::sync::Mutex across .await points |
| async-avoid-blocking-io | Throughput | Use async I/O, not std::fs, in async contexts |
| async-bounded-channels | Backpressure | Prefer bounded channels for flow control |

Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.

// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data);  // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to the blocking thread pool (data is moved in, so 'static)
async fn process(data: Vec<u8>) -> Result<Hash> {
    // The JoinHandle yields Result<Hash, JoinError>, so `?` needs an error
    // type that converts from JoinError (e.g. anyhow::Error)
    Ok(tokio::task::spawn_blocking(move || expensive_hash(&data)).await?)
}

9. Unsafe (Expert Only)

Only after profiling proves these matter.

| Rule | Impact | Risk |
|---|---|---|
| unsafe-get-unchecked | 5-30% | UB if bounds wrong |
| unsafe-use-maybeuninit | 20-100x (allocation init) | UB if read before write |
| unsafe-avoid-transmute | Correctness | Prefer safe alternatives |
| unsafe-repr-transparent | Zero-cost | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine

Mutex vs RwLock vs Atomics?

Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot > std for all of these

When is unsafe get_unchecked worth it?

Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
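The three outcomes of this tree can be put side by side; the iterator version is usually just as fast as the unsafe one because zip bounds the loop by both lengths (function names illustrative):

```rust
// Indexed: the bounds check on b[i] may survive optimization.
fn dot_indexed(a: &[u64], b: &[u64]) -> u64 {
    let mut sum = 0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

// Iterator: zip ends at the shorter slice, so no checks inside the loop.
fn dot_iter(a: &[u64], b: &[u64]) -> u64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Unsafe: only with a documented, checked invariant making it sound.
fn dot_unchecked(a: &[u64], b: &[u64]) -> u64 {
    assert!(b.len() >= a.len()); // invariant that justifies the unsafe below
    let mut sum = 0;
    for i in 0..a.len() {
        // SAFETY: i < a.len() <= b.len(), asserted above
        sum += unsafe { a.get_unchecked(i) * b.get_unchecked(i) };
    }
    sum
}

fn main() {
    let (a, b) = ([1u64, 2, 3], [4u64, 5, 6]);
    assert_eq!(dot_iter(&a, &b), 32); // 4 + 10 + 18
    assert_eq!(dot_indexed(&a, &b), 32);
    assert_eq!(dot_unchecked(&a, &b), 32);
    println!("ok");
}
```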

Reading Rules

Each rule file in rules/ contains:

  • Quantified impact with real benchmark numbers
  • Visual explanations of how the optimization works
  • Incorrect examples showing common mistakes
  • Correct examples with best practices
  • When NOT to apply – trade-offs and edge cases
  • Common mistakes to avoid
  • Profiling commands to identify the issue
  • References to official docs

Full Compiled Document

For all rules in a single file: AGENTS.md