nl-sql-query

📁 realmarcin/linkml-coral 📅 3 days ago

总安装量

周安装量

#44851

全站排名

安装命令

npx skills add https://github.com/realmarcin/linkml-coral --skill nl-sql-query

Agent 安装分布

amp 1

opencode 1

kimi-cli 1

codex 1

github-copilot 1

claude-code 1

Skill 文档

Natural Language SQL Query for CDM Database

Query the KBase CDM DuckDB database using natural language questions. This skill translates your questions to SQL and executes them against the database.

When to Use This Skill

Use this skill when:

User asks to query the CDM database with natural language
User wants to explore data without writing SQL
User asks questions like “How many samples are there?” or “Show me reads with high counts”
User wants to invoke /nl-sql-query command

Prerequisites

Before using this skill:

CDM Database: Must exist at cdm_store.db (or specify custom path)
API Key: ANTHROPIC_API_KEY environment variable must be set
Load Data: Run just load-cdm-store-bricks-64gb if database doesn’t exist

How It Works

Schema Inspection: Reads database schema (tables, columns, types, counts)
AI Translation: Sends question + schema to Claude API
SQL Generation: Claude generates optimized DuckDB SQL
Execution: Runs query and formats results
Display: Returns results in clean table format or JSON

Usage

Basic Query

just cdm-nl-query "How many samples are there?"

JSON Output

just cdm-nl-query-json "List all locations with sample counts"

Verbose (See Generated SQL)

just cdm-nl-query-verbose "Find reads with read_count over 50000"

Direct Python Usage

uv run python scripts/cdm_analysis/nl_sql_query.py \
  --db cdm_store.db \
  "Show me the top 10 samples by depth"

Example Queries

Quick Examples:

“How many samples are in the database?”
“Find samples with depth greater than 100”
“Show me the top 10 locations by sample count”
“List reads with read_count over 50000”

For complete examples: See EXAMPLES.md for 50+ example queries covering:

Basic statistics and counts
Filtered queries (numeric, string, date)
Aggregations and grouping
Joins and relationships
Complex queries with subqueries
Domain-specific analysis (sequencing, taxonomy, geography)
Data quality and validation queries

Database Schema Overview

The CDM database contains these main table types:

Static Entity Tables (sdt_*)

sdt_location: Geographic locations
sdt_sample: Sample metadata
sdt_reads: Sequencing reads data
sdt_assembly: Genome assemblies
sdt_genome: Genome annotations
sdt_gene: Gene predictions
sdt_otu (ASV): 16S amplicon sequence variants

System Tables (sys_*)

sys_oterm: Ontology terms
sys_type_def: Type definitions
sys_process: Processing records

Dynamic Tables (ddt_*)

ddt_brick*: Measurement array data (21 brick tables)

Common Foreign Keys

sample_id: Links to samples
location_id: Links to locations
reads_id: Links to reads
assembly_id: Links to assemblies
genome_id: Links to genomes

Output Formats

Text Format (Default)

Natural Query: How many samples are there?

Generated SQL:
SELECT COUNT(*) as total_samples FROM sdt_sample

Results (1 rows):
total_samples
-------------
1523

JSON Format

{
  "natural_query": "How many samples are there?",
  "sql_query": "SELECT COUNT(*) as total_samples FROM sdt_sample",
  "result_count": 1,
  "results": [
    {"total_samples": 1523}
  ]
}

Implementation Steps

When user invokes this skill:

Check Prerequisites

# Verify database exists
ls -l cdm_store.db || ls -l cdm_store_bricks_full.db

# Check API key
echo $ANTHROPIC_API_KEY | grep -q "sk-" && echo "â API key set" || echo "â Set ANTHROPIC_API_KEY"

Execute Query
```
just cdm-nl-query "USER_QUESTION_HERE"
```
Handle Results
- Display results to user
- If query fails, show generated SQL and error
- Suggest fixes if SQL is invalid

Error Handling

Database Not Found:

# Load the database first
just load-cdm-store-bricks-64gb

API Key Missing:

export ANTHROPIC_API_KEY="your-api-key-here"

Invalid SQL Generated:

Run with --verbose to see generated SQL
Check if table/column names are correct
Rephrase the question more specifically

Advanced Options

Custom Database Path

uv run python scripts/cdm_analysis/nl_sql_query.py \
  --db /path/to/custom.db \
  "your question"

Custom API Key

uv run python scripts/cdm_analysis/nl_sql_query.py \
  --api-key "sk-ant-..." \
  "your question"

Save Results to File

just cdm-nl-query-json "your question" > results.json

Tips for Better Results

Be Specific: “Show me samples with depth > 100” instead of “Show me deep samples”
Use Table Names: Reference actual table names when known (sdt_sample, sdt_reads)
Specify Limits: “Top 10 locations” instead of “Show locations”
Use Field Names: “samples with read_count > 50000” uses actual column names
Ask for Context: “Show samples with their locations” triggers JOINs automatically

Troubleshooting

Issue: Query returns no results

Check if data actually exists: just cdm-store-stats
Verify table/column names with verbose mode

Issue: SQL syntax error

Run with --verbose to see generated SQL
Check DuckDB documentation for syntax
Rephrase question using simpler terms

Issue: Query too slow

Add LIMIT clauses: “Show first 100 samples…”
Use indexed columns when possible
Consider using brick tables with caution (large datasets)

Related Commands

# Show database statistics
just cdm-store-stats

# Pre-defined queries
just cdm-find-samples Location0000001
just cdm-search-oterm "soil"
just cdm-lineage Assembly Assembly0000001

# View documentation
cat docs/CDM_PARQUET_STORE_GUIDE.md

Technical Details

Script Location: scripts/cdm_analysis/nl_sql_query.py

Dependencies:

anthropic>=0.39.0 – Claude API client
duckdb – Database engine
linkml-store>=0.2.0 – Data management

Model: Claude Sonnet 4 (claude-sonnet-4-20250514)

Token Limits:

Schema description: ~1-2K tokens
Query translation: ~1K tokens max
Results returned up to 100 rows by default

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台