ai-parsing-data

📁 lebsral/dspy-programming-not-prompting-lms-skills 📅 5 days ago

Install command

npx skills add https://github.com/lebsral/dspy-programming-not-prompting-lms-skills --skill ai-parsing-data

Skill documentation

Build an AI Data Parser

Guide the user through building an AI parser that pulls structured data out of messy text. It uses DSPy extraction: define the fields you want, and the model extracts them.

Step 1: Define what to extract

Ask the user:

What are you parsing? (emails, invoices, resumes, articles, forms, etc.)
What fields do you need? (names, dates, amounts, entities, etc.)
What’s the output format? (flat fields, list of objects, nested structure)

Step 2: Build the parser

Simple field extraction

import dspy

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""
    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)

Structured output with Pydantic

For complex or nested output, use Pydantic models:

from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)

List extraction

Extract multiple items from text:

class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""
    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)
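
Extracted lists often repeat an entity or vary the casing of its type. The sketch below shows one way to post-process such a list by grouping names under a normalized type and dropping repeats. It is an illustration, not part of DSPy: `EntityRecord` is a hypothetical plain-dataclass stand-in for the `Entity` model above, and `group_entities` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRecord:
    """Hypothetical stand-in for the Entity model defined above."""
    name: str
    type: str


def group_entities(entities):
    """Group entity names by lowercased type, skipping case-insensitive duplicates."""
    grouped = defaultdict(list)
    seen = set()
    for entity in entities:
        key = (entity.type.strip().lower(), entity.name.strip().lower())
        if key in seen:
            continue  # the same entity was extracted more than once
        seen.add(key)
        grouped[key[0]].append(entity.name.strip())
    return dict(grouped)


extracted = [
    EntityRecord("Ada Lovelace", "person"),
    EntityRecord("London", "Location"),
    EntityRecord("ada lovelace", "person"),
]
print(group_entities(extracted))
# {'person': ['Ada Lovelace'], 'location': ['London']}
```

Any object with `name` and `type` attributes works here, so the same helper should accept the items of `result.entities` as well.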

Step 3: Handle messy data

Use assertions to catch bad extractions:

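class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()  # initialize dspy.Module before registering sub-modules
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        result = self.parse(text=text)
        # dspy.Suggest is a soft check from older DSPy releases; newer releases
        # appear to replace assertions with dspy.Refine / dspy.BestOfN.
        dspy.Suggest(
            "@" in result.email,
            "Email should contain @"
        )
        # Count digits only, so separators such as "-" or spaces are ignored.
        dspy.Suggest(
            sum(ch.isdigit() for ch in result.phone) >= 10,
            "Phone number should have at least 10 digits"
        )
        return result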
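
The same checks can also live in a plain function, which keeps them testable without calling a model and reusable in whichever validation mechanism is in use. This is a minimal sketch under assumptions: `contact_problems` and `EMAIL_PATTERN` are hypothetical names, and the email pattern is deliberately loose rather than a full address validator. The 10-digit threshold is the one used above.

```python
import re

# Loose shape check (assumption): something@something.tld with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def contact_problems(email, phone):
    """Return a list of problems; an empty list means both fields look plausible."""
    problems = []
    if not EMAIL_PATTERN.match(email.strip()):
        problems.append("Email should look like name@domain.tld")
    # Count digits only, so "(555) 123-4567" passes with 10 digits.
    digit_count = sum(ch.isdigit() for ch in phone)
    if digit_count < 10:
        problems.append("Phone number should have at least 10 digits")
    return problems


print(contact_problems("jane@example.com", "(555) 123-4567"))  # []
print(len(contact_problems("jane at example", "555-1234")))    # 2
```

Returning messages instead of raising makes it easy to pass each message back to the model as feedback on a retry.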

Step 4: Test the quality

def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0

For Pydantic outputs, compare the model objects directly or field-by-field.
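
A field-by-field comparison of nested output could look like the sketch below. It assumes both the expected and predicted objects have already been converted to plain dicts (for Pydantic v2 models, `model_dump()` produces such dicts). The helper names `flatten`, `normalize`, and `nested_field_accuracy` are hypothetical, and lists are compared by position, which is a simplification.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: leaf_value} pairs."""
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        return {prefix.rstrip("."): value}
    leaves = {}
    for key, child in children:
        leaves.update(flatten(child, f"{prefix}{key}."))
    return leaves


def normalize(leaf):
    """Compare strings case-insensitively and ignore surrounding whitespace."""
    return leaf.strip().lower() if isinstance(leaf, str) else leaf


def nested_field_accuracy(expected, predicted):
    """Fraction of expected leaf fields that the prediction reproduces."""
    expected_leaves = flatten(expected)
    predicted_leaves = flatten(predicted)
    if not expected_leaves:
        return 0.0
    correct = sum(
        1
        for path, leaf in expected_leaves.items()
        if path in predicted_leaves
        and normalize(predicted_leaves[path]) == normalize(leaf)
    )
    return correct / len(expected_leaves)


expected = {
    "name": "John Doe",
    "age": 32,
    "address": {"city": "Springfield", "state": "IL"},
    "skills": ["Python", "SQL"],
}
predicted = {
    "name": "john doe",
    "age": 32,
    "address": {"city": "Springfield", "state": "MO"},
    "skills": ["Python"],
}
score = nested_field_accuracy(expected, predicted)
print(round(score, 2))  # 0.67 (4 of 6 expected leaf fields match)
```

Scoring only the expected leaves means extra predicted fields are not penalized; a stricter metric could divide by the union of both path sets instead.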

Step 5: Improve accuracy

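# trainset: a list of labeled dspy.Example objects, each built with .with_inputs("text")
# parser: the ParseContact parser from Step 2, since parsing_metric scores name, email, and phone
optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)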
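
The optimizer call takes a `trainset` that has to come from somewhere. One common approach, sketched here as an assumption rather than anything the skill prescribes, is to hold out part of the labeled data so the optimized parser can be scored on examples it was not tuned on. `split_examples` is a hypothetical helper, and the strings stand in for real labeled examples (in DSPy these would typically be `dspy.Example` objects).

```python
import random


def split_examples(examples, dev_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and return (trainset, devset)."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    dev_size = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Placeholder items; real code would pass labeled examples instead.
labeled = [f"example-{i}" for i in range(8)]
trainset, devset = split_examples(labeled)
print(len(trainset), len(devset))  # 6 2
```

Averaging the Step 4 metric over `devset` before and after optimization gives a rough indication of whether the few-shot demos actually helped.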

Key patterns

Use Pydantic models for complex structured output: DSPy handles serialization
Use list[Model] to extract variable-length lists of items
ChainOfThought helps: reasoning through which text maps to which field improves accuracy
Validate with assertions: dspy.Suggest and dspy.Assert catch malformed extractions
Use partial-credit metrics: score field by field rather than all-or-nothing

Additional resources

For worked examples (invoices, resumes, entities), see examples.md
Need summaries instead of structured data? Use /ai-summarizing
AI missing items on complex inputs? Use /ai-decomposing-tasks to break extraction into reliable subtasks
Next: /ai-improving-accuracy to measure and improve your parser
