# llama-cpp

Install:

```bash
npx skills add https://github.com/tdimino/claude-code-minoan --skill llama-cpp
```
## llama.cpp – Secondary Inference Engine
Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.
## Prerequisites

```bash
brew install llama.cpp
```

Binaries: `llama-cli`, `llama-server`, `llama-embedding`, `llama-quantize`
## Quick Reference
### Resolve an Ollama Model to Its GGUF Path

To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:

```bash
~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b
```
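A manual equivalent, sketching what the helper script likely does: `ollama show --modelfile` prints a Modelfile whose `FROM` line points at the GGUF blob.

```shell
# Print the GGUF blob path behind an Ollama model (a sketch of the script's logic):
ollama show --modelfile qwen2.5:7b | awk '/^FROM / { print $2; exit }'
```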
### Run Inference

```bash
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-prompt
```
### Start an API Server

To start an OpenAI-compatible server (port 8081, which avoids Ollama's 11434):

```bash
~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>

# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
```

Test the server:

```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```
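To check only the model's reply rather than the full JSON envelope, the response can be piped through `jq` (assumed installed; `brew install jq`):

```shell
# Extract just the assistant's reply from the OpenAI-compatible response:
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}' \
  | jq -r '.choices[0].message.content'
```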
### Benchmark (llama.cpp vs. Ollama)

```bash
~/.claude/skills/llama-cpp/scripts/llama_bench.sh qwen2.5:7b
```

Reports prompt-processing and generation tok/s for both engines side by side.
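For a quick manual spot-check of the Ollama side without the full benchmark, `ollama run --verbose` reports its timing stats (including eval rate) alongside the reply:

```shell
# One-off Ollama timing check (a sketch; --verbose prints rates to stderr):
ollama run qwen2.5:7b "Summarize llama.cpp in one sentence." --verbose 2>&1 \
  | grep -i 'eval rate'
```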
### LoRA Adapter Inference

Load a LoRA adapter dynamically on top of a base GGUF model (no merge required):

```bash
~/.claude/skills/llama-cpp/scripts/llama_lora.sh <base.gguf> <lora.gguf> "Your prompt"
```
This is the key advantage over Ollama: hot-swap LoRA adapters without rebuilding models.
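Under the hood this relies on `llama-cli`'s `--lora` flag; a rough sketch of the raw command the wrapper likely issues (file names are placeholders):

```shell
# Base model plus adapter, loaded together at startup — no merged GGUF needed:
llama-cli -m base.gguf --lora adapter.gguf \
  -p "Your prompt" -n 128 --n-gpu-layers all \
  --single-turn --simple-io --no-display-prompt
```

Swapping adapters is then just a matter of pointing `--lora` at a different file.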
### Convert a Kothar LoRA to GGUF

Convert HuggingFace LoRA adapters from the Kothar training pipeline into a merged GGUF model:

```bash
python3 ~/.claude/skills/llama-cpp/scripts/convert_lora_to_gguf.py \
  --base NousResearch/Hermes-2-Mistral-7B-DPO \
  --lora <path-or-hf-id> \
  --output kothar-q4_k_m.gguf \
  --quantize q4_k_m
```
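After conversion, a quick smoke test confirms the merged model loads and generates (a sketch, reusing the subprocess-safe flags from this skill):

```shell
# Load the freshly converted GGUF and generate a few tokens:
llama-cli -m kothar-q4_k_m.gguf -p "Hello" -n 32 \
  --n-gpu-layers all --single-turn --simple-io --no-display-prompt
```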
## When to Use llama.cpp vs. Ollama

| Task | Use |
|---|---|
| RLAMA queries | Ollama (native integration) |
| Quick model chat | Ollama (`ollama run`) |
| LoRA adapter testing | llama.cpp (`llama_lora.sh`) |
| Benchmarking tok/s | llama.cpp (`llama_bench.sh`) |
| Maximum inference speed | llama.cpp (10–20% faster) |
| Custom server config | llama.cpp (`llama_serve.sh`) |
| Embedding generation | Either (Ollama is simpler; `llama-embedding` gives more control) |
| Kothar GGUF conversion | llama.cpp (`convert_lora_to_gguf.py`) |
## Architecture

```
Ollama (primary, port 11434)         llama.cpp (secondary, port 8081)
├── RLAMA RAG queries                ├── LoRA adapter hot-loading
├── Model management (pull/list)     ├── Benchmarking
├── General chat                     ├── Custom server configs
└── Embeddings (nomic-embed-text)    └── Kothar GGUF conversion
```

Both engines share the same GGUF model files (`~/.ollama/models/blobs/`).
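The shared storage can be confirmed directly: the path resolved by the helper script should live under Ollama's blob store. A minimal check:

```shell
# Verify the resolved GGUF path sits inside Ollama's blob directory:
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
case "$GGUF" in
  "$HOME"/.ollama/models/blobs/*) echo "shared blob: $GGUF" ;;
  *)                              echo "unexpected location: $GGUF" ;;
esac
```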
## Subprocess Best Practices (Build 7940+)

When calling `llama-cli` from scripts or subprocesses:

- Always use `--single-turn`: generates one response, then exits (prevents hanging in interactive chat mode).
- Always use `--simple-io`: suppresses the ANSI spinner that floods redirected output.
- Always use `--no-display-prompt`: suppresses the prompt echo.
- Use `--n-gpu-layers all` instead of the legacy `-ngl 999`.
- Use `--flash-attn on` (not bare `--flash-attn`); the flag now takes an argument.
- Timing stats appear in stdout as `[ Prompt: X t/s | Generation: Y t/s ]` (via `--show-timings`, default: on).
- Redirect stderr to a file, not a variable; spinner output can overflow bash variables.
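The guidelines above can be rolled into one helper (a sketch; `run_llama` is a hypothetical function name, not part of this skill):

```shell
# Safe, non-interactive llama-cli invocation for use from scripts:
run_llama() {
  model=$1
  prompt=$2
  errlog=$(mktemp)    # spinner/progress noise goes to a file, never a variable
  llama-cli -m "$model" -p "$prompt" -n 128 \
    --n-gpu-layers all --flash-attn on \
    --single-turn --simple-io --no-display-prompt \
    2>"$errlog"
}
```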