nl-sql-query
1
总安装量
1
周安装量
#44851
全站排名
安装命令
npx skills add https://github.com/realmarcin/linkml-coral --skill nl-sql-query
Agent 安装分布
amp
1
opencode
1
kimi-cli
1
codex
1
github-copilot
1
claude-code
1
Skill 文档
Natural Language SQL Query for CDM Database
Query the KBase CDM DuckDB database using natural language questions. This skill translates your questions to SQL and executes them against the database.
When to Use This Skill
Use this skill when:
- User asks to query the CDM database with natural language
- User wants to explore data without writing SQL
- User asks questions like “How many samples are there?” or “Show me reads with high counts”
- User wants to invoke
/nl-sql-querycommand
Prerequisites
Before using this skill:
- CDM Database: Must exist at
cdm_store.db(or specify custom path) - API Key:
ANTHROPIC_API_KEYenvironment variable must be set - Load Data: Run
just load-cdm-store-bricks-64gbif database doesn’t exist
How It Works
- Schema Inspection: Reads database schema (tables, columns, types, counts)
- AI Translation: Sends question + schema to Claude API
- SQL Generation: Claude generates optimized DuckDB SQL
- Execution: Runs query and formats results
- Display: Returns results in clean table format or JSON
Usage
Basic Query
just cdm-nl-query "How many samples are there?"
JSON Output
just cdm-nl-query-json "List all locations with sample counts"
Verbose (See Generated SQL)
just cdm-nl-query-verbose "Find reads with read_count over 50000"
Direct Python Usage
uv run python scripts/cdm_analysis/nl_sql_query.py \
--db cdm_store.db \
"Show me the top 10 samples by depth"
Example Queries
Quick Examples:
- “How many samples are in the database?”
- “Find samples with depth greater than 100”
- “Show me the top 10 locations by sample count”
- “List reads with read_count over 50000”
For complete examples: See EXAMPLES.md for 50+ example queries covering:
- Basic statistics and counts
- Filtered queries (numeric, string, date)
- Aggregations and grouping
- Joins and relationships
- Complex queries with subqueries
- Domain-specific analysis (sequencing, taxonomy, geography)
- Data quality and validation queries
Database Schema Overview
The CDM database contains these main table types:
Static Entity Tables (sdt_*)
sdt_location: Geographic locationssdt_sample: Sample metadatasdt_reads: Sequencing reads datasdt_assembly: Genome assembliessdt_genome: Genome annotationssdt_gene: Gene predictionssdt_otu(ASV): 16S amplicon sequence variants
System Tables (sys_*)
sys_oterm: Ontology termssys_type_def: Type definitionssys_process: Processing records
Dynamic Tables (ddt_*)
ddt_brick*: Measurement array data (21 brick tables)
Common Foreign Keys
sample_id: Links to sampleslocation_id: Links to locationsreads_id: Links to readsassembly_id: Links to assembliesgenome_id: Links to genomes
Output Formats
Text Format (Default)
Natural Query: How many samples are there?
Generated SQL:
SELECT COUNT(*) as total_samples FROM sdt_sample
Results (1 rows):
total_samples
-------------
1523
JSON Format
{
"natural_query": "How many samples are there?",
"sql_query": "SELECT COUNT(*) as total_samples FROM sdt_sample",
"result_count": 1,
"results": [
{"total_samples": 1523}
]
}
Implementation Steps
When user invokes this skill:
-
Check Prerequisites
# Verify database exists ls -l cdm_store.db || ls -l cdm_store_bricks_full.db # Check API key echo $ANTHROPIC_API_KEY | grep -q "sk-" && echo "â API key set" || echo "â Set ANTHROPIC_API_KEY" -
Execute Query
just cdm-nl-query "USER_QUESTION_HERE" -
Handle Results
- Display results to user
- If query fails, show generated SQL and error
- Suggest fixes if SQL is invalid
Error Handling
Database Not Found:
# Load the database first
just load-cdm-store-bricks-64gb
API Key Missing:
export ANTHROPIC_API_KEY="your-api-key-here"
Invalid SQL Generated:
- Run with
--verboseto see generated SQL - Check if table/column names are correct
- Rephrase the question more specifically
Advanced Options
Custom Database Path
uv run python scripts/cdm_analysis/nl_sql_query.py \
--db /path/to/custom.db \
"your question"
Custom API Key
uv run python scripts/cdm_analysis/nl_sql_query.py \
--api-key "sk-ant-..." \
"your question"
Save Results to File
just cdm-nl-query-json "your question" > results.json
Tips for Better Results
- Be Specific: “Show me samples with depth > 100” instead of “Show me deep samples”
- Use Table Names: Reference actual table names when known (sdt_sample, sdt_reads)
- Specify Limits: “Top 10 locations” instead of “Show locations”
- Use Field Names: “samples with read_count > 50000” uses actual column names
- Ask for Context: “Show samples with their locations” triggers JOINs automatically
Troubleshooting
Issue: Query returns no results
- Check if data actually exists:
just cdm-store-stats - Verify table/column names with verbose mode
Issue: SQL syntax error
- Run with
--verboseto see generated SQL - Check DuckDB documentation for syntax
- Rephrase question using simpler terms
Issue: Query too slow
- Add LIMIT clauses: “Show first 100 samples…”
- Use indexed columns when possible
- Consider using brick tables with caution (large datasets)
Related Commands
# Show database statistics
just cdm-store-stats
# Pre-defined queries
just cdm-find-samples Location0000001
just cdm-search-oterm "soil"
just cdm-lineage Assembly Assembly0000001
# View documentation
cat docs/CDM_PARQUET_STORE_GUIDE.md
Technical Details
Script Location: scripts/cdm_analysis/nl_sql_query.py
Dependencies:
anthropic>=0.39.0– Claude API clientduckdb– Database enginelinkml-store>=0.2.0– Data management
Model: Claude Sonnet 4 (claude-sonnet-4-20250514)
Token Limits:
- Schema description: ~1-2K tokens
- Query translation: ~1K tokens max
- Results returned up to 100 rows by default