extracting-mistral-ocr
Install command
npx skills add https://github.com/tristanmanchester/agent-skills --skill extracting-mistral-ocr
Skill documentation
Mistral OCR PDF extraction
Quick start (default)
Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:
python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr
Output directory layout:
- combined.md (all pages concatenated)
- pages/page-000.md (per-page markdown)
- raw_response.json (full OCR response)
- images/ (decoded embedded images, if requested)
- tables/ (separate tables, if requested)
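The layout above can be consumed with a few lines of Python. This is a minimal sketch, assuming only the file names listed above (combined.md and pages/page-NNN.md); the helper name load_ocr_output is hypothetical and is not part of the bundled script.

```python
from pathlib import Path


def load_ocr_output(out_dir):
    """Read combined.md and the per-page markdown files from an OCR output directory."""
    root = Path(out_dir)
    combined = (root / "combined.md").read_text(encoding="utf-8")
    # Sorting by file name keeps pages in order as long as the page numbers are zero-padded.
    page_files = sorted((root / "pages").glob("page-*.md"))
    pages = [p.read_text(encoding="utf-8") for p in page_files]
    return combined, pages
```

Calling load_ocr_output("out/ocr") after the quick-start command would return the concatenated markdown plus a list with one string per page.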
Workflow
- Pick input mode
  - Local PDF (most common): upload via Files API, then OCR via file_id.
  - Public URL: OCR directly via document_url.
- Choose output fidelity (defaults are safe for RAG)
  - Keep table_format=inline unless the user explicitly wants tables split out.
  - Set --include-image-base64 when the user needs figures/diagrams extracted.
  - Use --extract-header/--extract-footer if header/footer noise hurts downstream search.
- Run OCR
  - Use scripts/mistral_ocr_extract.py to produce a deterministic on-disk artefact set.
- (Optional) Structured extraction from the whole document
  - If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
  - The OCR API can return a document-level document_annotation in addition to page markdown.
Example:
python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object
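Once a run like the one above has finished, the annotation can be pulled back out of the saved response. The sketch below assumes that raw_response.json keeps document_annotation as a top-level key and that the value may be a JSON-encoded string; neither detail is confirmed here, and read_document_annotation is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_document_annotation(out_dir):
    """Return the document-level annotation from raw_response.json, or None if absent."""
    raw_path = Path(out_dir) / "raw_response.json"
    raw = json.loads(raw_path.read_text(encoding="utf-8"))
    annotation = raw.get("document_annotation")  # assumed top-level key
    if isinstance(annotation, str):
        # Assumption: the annotation may be stored as a JSON-encoded string.
        try:
            return json.loads(annotation)
        except json.JSONDecodeError:
            return annotation  # plain text; hand it back unchanged
    return annotation
```

For the invoice example, read_document_annotation("out/invoice") would be expected to yield a dict with keys such as supplier_name and total_amount, provided the model followed the prompt.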
Decision rules
- If the PDF is local and not publicly accessible, upload it (the script does this automatically).
- If the PDF URL is private or requires authentication, do not pass it as document_url; upload instead.
- If output quality is critical, prefer table_format=html for downstream parsing over brittle regex.
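The rules above can be written down as two small pure functions. This is only an illustrative sketch: choose_input_mode and choose_table_format are hypothetical names, and whether a URL is publicly accessible is something the caller has to know, since the code makes no network request.

```python
from urllib.parse import urlparse


def choose_input_mode(source, publicly_accessible=False):
    """Return "document_url" for public http(s) URLs, otherwise "upload"."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https") and publicly_accessible:
        return "document_url"
    # Local paths and private or authenticated URLs go through the upload route.
    return "upload"


def choose_table_format(quality_critical=False):
    """Prefer html when output quality is critical, else keep the inline default."""
    return "html" if quality_critical else "inline"
```

Keeping the decision separate from the API call makes the routing easy to unit-test before any file is uploaded.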
Common failure modes
- Missing MISTRAL_API_KEY: set it in the environment before running.
- URL OCR fails: the URL likely is not publicly accessible; upload the file.
- Large files: upload supports large files, but very large PDFs may need page selection (--pages) or batch processing.
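A wrapper can catch the first failure mode before the script starts. The sketch below assumes only the flags named in this document (--input, --out, --pages); build_ocr_command is a hypothetical helper, and the value syntax accepted by --pages is not specified here, so it is passed through unchanged.

```python
import os
import sys


def build_ocr_command(script_path, input_pdf, out_dir, pages=None, env=None):
    """Check for MISTRAL_API_KEY, then build the argv list for the OCR script."""
    env = os.environ if env is None else env
    if not env.get("MISTRAL_API_KEY"):
        raise RuntimeError("MISTRAL_API_KEY is not set; export it before running the OCR script.")
    cmd = [sys.executable, str(script_path), "--input", str(input_pdf), "--out", str(out_dir)]
    if pages is not None:
        # The page-selection syntax is an assumption left to the caller; see the API reference.
        cmd += ["--pages", str(pages)]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True), which raises if the script exits with a non-zero status.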
References
- API + parameters: references/mistral_ocr_api.md
- Output mapping rules (placeholders to extracted images/tables): references/output_mapping.md
- Example annotation prompts for common document types: references/annotation_prompts.md