invoice-extractor
npx skills add https://github.com/ontos-ai/invoice-extract-skill --skill invoice-extractor
Invoice Extractor Skill
What It Does
Automatically extracts the following fields from PDF invoices/receipts and image files (JPG/PNG/WEBP/BMP) into structured JSON/CSV/Excel:
- Document: type, number, date, due date
- Vendor: name, tax ID
- Customer: name
- Financials: currency, subtotal (excl. VAT), VAT amount, total (incl. VAT)
- Line items: description, quantity, unit price, VAT rate, line total
- Status: payment status, notes
- Warnings: row-level flags for data inconsistencies (e.g. amount mismatches)
- Usage report: per-file input/output tokens and API call duration
Prerequisites
- An API key for any OpenAI-compatible VLM provider (Qwen, DeepSeek, OpenAI, etc.)
- Install dependencies:
pip install -r requirements.txt
Configuration
Create a .env file in the project root:
API_KEY=your-api-key-here
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL=qwen-vl-max-latest
Swap providers by changing these three values:
| Provider | MODEL | BASE_URL |
|---|---|---|
| Qwen (default) | qwen-vl-max-latest | https://dashscope.aliyuncs.com/compatible-mode/v1 |
| DeepSeek | deepseek-chat | https://api.deepseek.com/v1 |
| OpenAI | gpt-4o | https://api.openai.com/v1 |
All settings can also be overridden via CLI flags (--api-key, --base-url, --model).
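The override order can be pictured with a small sketch. It is not the skill's actual code: `resolve_setting` is a hypothetical helper, the CLI value is assumed to be `None` when the flag is absent, and `file_config` stands in for whatever a parsed config.yaml would contain (its key names are an assumption).

```python
import os


def resolve_setting(name, cli_value=None, env=None, file_config=None):
    """Pick a setting by priority: CLI flag, then environment, then config file."""
    env = os.environ if env is None else env
    file_config = file_config or {}
    if cli_value is not None:
        return cli_value           # e.g. --model gpt-4o
    if env.get(name):
        return env[name]           # e.g. MODEL=... from .env or the shell
    return file_config.get(name)   # lowest priority; None if unset everywhere
```

For example, `resolve_setting("MODEL", cli_value="gpt-4o", env={"MODEL": "deepseek-chat"})` returns `"gpt-4o"`, because the flag wins over the environment value.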
Usage
Basic
# Batch extract all files in a directory (PDF + images, 5 concurrent)
python cli.py -i "/path/to/invoices/"
# Single file (PDF or image)
python cli.py -i "invoice.pdf"
python cli.py -i "receipt.jpg"
Results are automatically saved to results/:
- extraction_<timestamp>.csv: Extracted invoice data
- usage_report_<timestamp>.json: Token usage and timing report
Output Format
Choose your preferred format with --format:
python cli.py -i "folder/" --format json
python cli.py -i "folder/" --format xlsx
python cli.py -i "folder/" --format csv # default
Custom CSV Columns
Only output the columns you need. Omit --columns to get every default column.
python cli.py -i "folder/" --format csv \
--columns item_seq,source_file,vendor_name,invoice_number,total_incl_vat,warnings
Available columns:
item_seq, source_file, document_type, invoice_number, invoice_date, due_date,
currency, vendor_name, vendor_tax_id,
customer_name, subtotal_excl_vat, total_vat_amount,
total_incl_vat,
item_description, item_quantity, item_unit_price, item_vat_rate, item_line_total,
payment_status, notes, warnings
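A typo in a column name is easy to make, so it can help to check a `--columns` value before starting a long batch. The sketch below is a hypothetical pre-flight helper (`parse_columns` is not part of the skill) built from the 21 names listed above. Whether the CLI performs its own validation is not stated here.

```python
AVAILABLE_COLUMNS = [
    "item_seq", "source_file", "document_type", "invoice_number",
    "invoice_date", "due_date", "currency", "vendor_name", "vendor_tax_id",
    "customer_name", "subtotal_excl_vat", "total_vat_amount", "total_incl_vat",
    "item_description", "item_quantity", "item_unit_price", "item_vat_rate",
    "item_line_total", "payment_status", "notes", "warnings",
]


def parse_columns(value):
    """Split a comma-separated --columns value and reject unknown names."""
    requested = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in requested if name not in AVAILABLE_COLUMNS]
    if unknown:
        raise ValueError(f"Unknown column(s): {', '.join(unknown)}")
    return requested
```

The returned list keeps the order in which the names were requested.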
Custom Output Directory
python cli.py -i "folder/" -o my_output/
Using a Different Provider
# DeepSeek
python cli.py -i "folder/" \
--api-key sk-xxx --base-url https://api.deepseek.com/v1 --model deepseek-chat
# OpenAI
python cli.py -i "folder/" \
--api-key sk-xxx --base-url https://api.openai.com/v1 --model gpt-4o
Advanced Options
python cli.py -i "folder/" --format xlsx \
--dpi 300 \
--max-pages 10 \
-c 10 \
-v # verbose logging
-c, --concurrency N: Max concurrent VLM API calls during batch processing (default: 5)
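A concurrency cap like this is commonly built with a semaphore. The sketch below shows that general pattern with `asyncio.Semaphore`. It is an illustration, not the skill's implementation: `run_bounded` and `fake_vlm_call` are hypothetical names, and the sleep stands in for a real API request.

```python
import asyncio


async def run_bounded(paths, worker, concurrency=5):
    """Run worker(path) for every path with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(path):
        async with semaphore:  # waits here once the cap is reached
            return await worker(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_vlm_call(path):
    # Hypothetical stand-in for one VLM API request.
    await asyncio.sleep(0.01)
    return f"extracted:{path}"


if __name__ == "__main__":
    files = [f"invoice_{i:03d}.pdf" for i in range(12)]
    results = asyncio.run(run_bounded(files, fake_vlm_call, concurrency=5))
    print(len(results), "files processed")
```

Raising the cap shortens wall-clock time only until the provider's rate limit becomes the bottleneck, so a higher `-c` value is worth testing against your own account limits.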
Environment Check
python scripts/check_env.py
Verifies: Python version, required packages, API key configuration.
Output
All outputs are saved to the results/ directory (or custom --output-dir):
results/
├── extraction_20260207_222600.csv # Extracted invoice data
└── usage_report_20260207_222600.json # Token usage report
Usage Report Format
{
  "summary": {
    "total_files": 3,
    "success_count": 3,
    "failure_count": 0,
    "total_input_tokens": 15234,
    "total_output_tokens": 2456,
    "total_tokens": 17690,
    "total_duration_seconds": 32.5
  },
  "per_file": [
    {
      "source_file": "invoice_001.pdf",
      "input_tokens": 5078,
      "output_tokens": 819,
      "total_tokens": 5897,
      "duration_seconds": 10.8
    }
  ]
}
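Because the report is plain JSON, it is easy to post-process. The following sketch assumes the field names shown in the sample above. `token_mismatches` and `slowest_file` are hypothetical helpers, not functions shipped with the skill.

```python
import json


def load_usage_report(path):
    """Load a usage_report_<timestamp>.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def token_mismatches(report):
    """Return labels of entries where input + output tokens != total tokens."""
    bad = []
    summary = report["summary"]
    if summary["total_input_tokens"] + summary["total_output_tokens"] != summary["total_tokens"]:
        bad.append("summary")
    for entry in report["per_file"]:
        if entry["input_tokens"] + entry["output_tokens"] != entry["total_tokens"]:
            bad.append(entry["source_file"])
    return bad


def slowest_file(report):
    """Return the per-file entry with the longest API call, or None if empty."""
    return max(report["per_file"], key=lambda e: e["duration_seconds"], default=None)
```

An empty list from `token_mismatches` means every entry adds up, which is the case for the sample report above.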
Programmatic Integration
from invoice_extractor.extractor import InvoiceExtractor
from invoice_extractor.exporter import to_csv, to_excel, export_usage_report
# Initialize with any OpenAI-compatible API
extractor = InvoiceExtractor(
    api_key="your-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    model="qwen-vl-max-latest",
)
# Extract single file
result = extractor.extract_single("invoice.pdf")
print(result.data.model_dump_json(indent=2))
if result.warnings:
    print(f"Warnings: {result.warnings}")
if result.usage:
    print(f"Tokens: in={result.usage.input_tokens}, out={result.usage.output_tokens}")
# Batch extract: custom CSV + usage report
batch = extractor.extract_batch("/path/to/folder/")
to_csv(batch, "results/output.csv", columns=[
    "item_seq", "source_file", "vendor_name",
    "invoice_number", "total_incl_vat", "warnings",
])
export_usage_report(batch, "results/usage_report.json")
# Batch extract: Excel
to_excel(batch, "results/output.xlsx")
Warning System
- Amount mismatch: Flags when subtotal + VAT ≠ total (tolerance: 0.05)
- Missing fields: Left blank in CSV (no “null” or “None” strings)
- Extraction failure: Entire row shows the failure reason in the warnings column
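The amount-mismatch rule can be sketched as a tolerance comparison. This is an illustration of the rule as stated above, not the skill's own validator: `amount_mismatch` is a hypothetical name, and treating a difference of exactly 0.05 as acceptable is an assumption.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.05")  # tolerance stated in the list above


def amount_mismatch(subtotal, vat, total, tolerance=TOLERANCE):
    """Return True when subtotal + VAT differs from total by more than tolerance.

    Pass amounts as strings (or Decimal) to avoid float rounding noise.
    """
    difference = abs(Decimal(subtotal) + Decimal(vat) - Decimal(total))
    return difference > tolerance
```

With a subtotal of 100.00 and VAT of 21.00, a total of 121.03 passes (difference 0.03), while a total of 121.10 is flagged (difference 0.10).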
Supported Document Types
| Type | Languages | Examples |
|---|---|---|
| Digital invoices (PDF) | NL/DE/FR/EN | Standard electronic invoice PDFs |
| Scanned receipts (PDF/JPG/PNG) | Any | Photographed or scanned paper receipts |
| Credit notes | NL/DE/FR/EN | Creditnota / Gutschrift |
| Payment confirmations | NL/DE/FR/EN | Betalingsbevestiging |
| Image files (JPG/PNG/WEBP/BMP) | Any | Direct photo of invoice/receipt |
Notes
- Compatible with any OpenAI-compatible VLM API (default: Qwen qwen-vl-max-latest)
- Supports PDF and image files (JPG, PNG, WEBP, BMP) as input
- Async concurrent batch processing (default: 5 concurrent, configurable via -c)
- European number formats auto-handled (comma = decimal, period = thousands separator)
- Basic amount consistency validation (subtotal + tax = total)
- Config priority: CLI flags > environment variables > config.yaml
- Token usage tracked per file; usage report saved as JSON alongside results
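To make the European number convention from the notes concrete, here is a minimal sketch of that conversion. `parse_european_number` is a hypothetical helper, not the skill's parser, and it assumes the input always uses a comma for decimals and periods for thousands. A US-style value such as 1,234.56 would be misread.

```python
from decimal import Decimal


def parse_european_number(text):
    """Convert European-formatted text such as '1.234,56' to a Decimal."""
    # Drop thousands separators first, then turn the decimal comma into a point.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return Decimal(cleaned)
```

Returning `Decimal` rather than `float` keeps amounts exact, which matters when they are later compared against a small tolerance.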