ai-parsing-data
Total installs: 1
Weekly installs: 1
Site rank: #48053
Install command
npx skills add https://github.com/lebsral/dspy-programming-not-prompting-lms-skills --skill ai-parsing-data
Agent install distribution
replit: 1
opencode: 1
cursor: 1
github-copilot: 1
claude-code: 1
Skill documentation
Build an AI Data Parser
Guide the user through building AI that pulls structured data out of messy text. Uses DSPy extraction: define what you want, and the AI extracts it.
Step 1: Define what to extract
Ask the user:
- What are you parsing? (emails, invoices, resumes, articles, forms, etc.)
- What fields do you need? (names, dates, amounts, entities, etc.)
- What’s the output format? (flat fields, list of objects, nested structure)
Step 2: Build the parser
Simple field extraction
import dspy

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""

    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)
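Before the parser can run, DSPy needs a language model configured. A minimal setup sketch, assuming an OpenAI-compatible model and an API key in the environment (the model string is illustrative; any provider LiteLLM supports works):

```python
import dspy

# Assumes OPENAI_API_KEY is set in the environment.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

result = parser(text="Reach Jane Doe at jane@example.com or (555) 123-4567.")
print(result.name, result.email, result.phone)
```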
Structured output with Pydantic
For complex or nested output, use Pydantic models:
from pydantic import BaseModel, Field

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: int
    address: Address
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""

    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)
List extraction
Extract multiple items from text:
class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""

    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)
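Models often repeat entities on long inputs, so a cheap post-processing pass helps. A model-free dedup sketch (using plain dicts in place of the Entity model for illustration):

```python
def dedup_entities(entities: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (name, type) pair, case-insensitively."""
    seen = set()
    out = []
    for e in entities:
        key = (e["name"].lower().strip(), e["type"].lower().strip())
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

raw = [
    {"name": "Springfield", "type": "location"},
    {"name": "springfield", "type": "location"},  # duplicate, different casing
    {"name": "John Doe", "type": "person"},
]
print(dedup_entities(raw))  # keeps Springfield and John Doe
```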
Step 3: Handle messy data
Use assertions to catch bad extractions:
class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        result = self.parse(text=text)
        dspy.Suggest(
            "@" in result.email,
            "Email should contain @",
        )
        dspy.Suggest(
            sum(c.isdigit() for c in result.phone) >= 10,
            "Phone number should have at least 10 digits",
        )
        return result
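The phone check above counts digits rather than characters, since extracted numbers usually carry punctuation. These model-free helpers (a sketch, not part of DSPy) can normalize fields before the Suggest calls run:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip everything but digits so length checks count actual digits."""
    return re.sub(r"\D", "", raw)

def looks_like_email(raw: str) -> bool:
    """Cheap sanity check: one @ with text on both sides and a dot in the domain."""
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", raw.strip()))

print(normalize_phone("(555) 123-4567"))  # 5551234567
print(looks_like_email("jane@example.com"))  # True
```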
Step 4: Test the quality
def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0
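You can sanity-check the metric without an LM by feeding it hand-made stand-ins; SimpleNamespace works because the metric only uses getattr. The snippet repeats the metric definition so it runs standalone:

```python
from types import SimpleNamespace

def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy (same logic as above)."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0

gold = SimpleNamespace(name="Jane Doe", email="jane@example.com", phone="5551234567")
pred = SimpleNamespace(name="jane doe ", email="jane@example.com", phone="5550000000")
# name and email match after lowercasing/stripping; phone does not -> 2/3
print(parsing_metric(gold, pred))
```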
For Pydantic outputs, compare the model objects directly or field-by-field.
Step 5: Improve accuracy
optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)
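The optimizer needs labeled data: in DSPy, trainset is a list of dspy.Example objects with the input field marked. A sketch with one hypothetical labeled example (field names match the ParseContact signature above):

```python
import dspy

trainset = [
    dspy.Example(
        text="Reach Jane Doe at jane@example.com or (555) 123-4567.",
        name="Jane Doe",
        email="jane@example.com",
        phone="(555) 123-4567",
    ).with_inputs("text"),
    # ...add more labeled examples; even 10-20 helps few-shot bootstrapping
]
```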
Key patterns
- Use Pydantic models for complex structured output: DSPy handles serialization
- Use list[Model] to extract variable-length lists of items
- ChainOfThought helps: reasoning through which text maps to which fields improves accuracy
- Validate with assertions: dspy.Suggest and dspy.Assert catch malformed extractions
- Partial-credit metrics: score field-by-field rather than all-or-nothing
Additional resources
- For worked examples (invoices, resumes, entities), see examples.md
- Need summaries instead of structured data? Use /ai-summarizing
- AI missing items on complex inputs? Use /ai-decomposing-tasks to break extraction into reliable subtasks
- Next: /ai-improving-accuracy to measure and improve your parser