data-import-parsers
Total installs: 4
Weekly installs: 3
Site rank: #49901
Install command:
npx skills add https://github.com/janjaszczak/cursor --skill data-import-parsers
Install distribution by agent: gemini-cli 3, github-copilot 3, codex 3, kimi-cli 3, cursor 3, amp 3
Skill documentation
Data Import & Validation (Streaming + Error CSV + Metrics + Idempotency)
When to use this skill
Use when working on:
- ETL / import pipelines (CSV/XLSX/JSON) into a DB
- Parsers that currently load whole files into memory
- Data validation/coercion, error isolation, auditability, and reproducibility
- Any import job that must be safe to re-run (idempotent)
Non-negotiables (contract)
- Process input files sequentially (no "load everything then insert") to control memory.
- Validate and coerce types explicitly (define allowed coercions; reject ambiguous cases).
- Irreparable records must be skipped and logged to an error CSV containing:
- all original columns exactly as seen in input
- extra columns:
timestamp,file,line,error
- Emit metrics at minimum:
rows_ok, rows_skipped, parse_errors
- DB writes must be idempotent: re-running the import must not duplicate or corrupt data.
Standard workflow
1) Plan-first (before coding)
- Identify input formats, volume, and "row identity" rules (keys/dedup strategy).
- Locate current import entrypoints + DB write layer.
- Define:
- schema mapping (source -> target columns)
- validation rules per field
- coercion rules per field (what is allowed, what is not)
- error taxonomy (what counts as "irreparable")
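The planning artifacts above can be captured as a small declarative spec before any pipeline code is written. A minimal sketch in Python; all column names and rules here are hypothetical examples, not part of this skill's contract:

```python
# Hypothetical import spec: maps source columns to target columns and
# declares, per field, which validations and coercions are allowed.
IMPORT_SPEC = {
    "schema_mapping": {          # source column -> target column
        "Customer ID": "customer_id",
        "Order Total": "order_total",
        "Order Date": "order_date",
    },
    "validation": {              # target column -> rule description
        "customer_id": "non-empty string",
        "order_total": "decimal >= 0",
        "order_date": "ISO-8601 date",
    },
    "coercion": {                # explicitly allowed coercions only
        "order_total": ["strip whitespace", "parse decimal"],
        "order_date": ["parse YYYY-MM-DD"],
        # ambiguous cases (e.g. "1,5" vs "1.5") are rejected, never guessed
    },
    "irreparable": [             # error taxonomy: any of these forces a skip
        "missing required column",
        "coercion failed",
        "validation failed",
    ],
}
```

Writing the spec down first makes the later review checklist verifiable: every skip in the error CSV should trace back to one entry in the error taxonomy.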
2) Streaming architecture
- Read one file at a time, row by row (or chunked) and write in bounded batches.
- Never accumulate full datasets in memory.
- Ensure progress logging is monotonic (e.g., file + row counters).
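The streaming requirement can be sketched as a generator that yields bounded batches; memory is then proportional to the batch size, not the file size. Function and field names here are illustrative, assuming CSV input:

```python
import csv
from typing import Iterator

def stream_batches(path: str, batch_size: int = 500) -> Iterator[list[dict]]:
    """Yield rows from one CSV file in bounded batches.

    Rows are read lazily from csv.DictReader and flushed as soon as a
    batch fills, so memory never scales with total row count. Each row
    carries its source line number as a monotonic progress marker.
    """
    batch: list[dict] = []
    with open(path, newline="", encoding="utf-8") as fh:
        # start=2 because line 1 is the header row
        for line_no, row in enumerate(csv.DictReader(fh), start=2):
            row["_line"] = line_no
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch
```

The caller writes each yielded batch to the database before requesting the next one, which also gives natural checkpoints for progress logging.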
3) Validation & coercion
- Treat raw row values as the immutable "source of truth".
- Perform coercions in a controlled layer:
- return (ok, parsed_record) or (error, reason) per row
- Separate concerns:
- parsing (raw -> typed)
- business validation (typed -> acceptable)
- persistence (acceptable -> DB)
4) Error CSV logging
- On any irreparable row:
- write original row values (unmodified)
- add:
timestamp,file,line,error
- Avoid partial writes that drop context; every skipped row must be explainable from the CSV alone.
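A minimal sketch of the error-CSV writer: it appends the original row unmodified and adds the four context columns. The function name and append-with-header-on-first-write behavior are one possible design, not a prescribed API:

```python
import csv
import os
from datetime import datetime, timezone

def log_error_row(error_path: str, raw_row: dict, source_file: str,
                  line_no: int, error: str) -> None:
    """Append one irreparable row to the error CSV.

    Original columns are written exactly as seen in the input; the extra
    columns timestamp,file,line,error make every skip explainable from
    the CSV alone.
    """
    record = dict(raw_row)  # copy: never mutate the raw row
    record.update({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": source_file,
        "line": line_no,
        "error": error,
    })
    write_header = not os.path.exists(error_path)
    with open(error_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(record))
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```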
5) Metrics
Maintain counters:
- rows_ok: successfully persisted rows
- rows_skipped: irreparable rows skipped
- parse_errors: count of parse/validation errors (can equal rows_skipped, or be a superset if you track recoverable warnings separately)
Emit metrics:
- end-of-file summary
- end-of-run summary (aggregate over files)
Optionally persist metrics to JSON or a DB table for observability.
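The three required counters, plus per-file-to-run aggregation, fit in a small dataclass. A sketch; the class and method names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ImportMetrics:
    rows_ok: int = 0        # successfully persisted rows
    rows_skipped: int = 0   # irreparable rows skipped
    parse_errors: int = 0   # parse/validation errors encountered

    def merge(self, other: "ImportMetrics") -> "ImportMetrics":
        """Aggregate per-file metrics into a run-level summary."""
        return ImportMetrics(
            self.rows_ok + other.rows_ok,
            self.rows_skipped + other.rows_skipped,
            self.parse_errors + other.parse_errors,
        )

    def summary(self, scope: str) -> str:
        return (f"{scope}: rows_ok={self.rows_ok} "
                f"rows_skipped={self.rows_skipped} "
                f"parse_errors={self.parse_errors}")
```

Emitting `metrics.summary(file_name)` at end-of-file and the merged total at end-of-run satisfies both summary requirements with one structure.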
6) Idempotent persistence
Pick and implement one clear strategy:
- Upsert by natural key / business key
- Insert with unique constraint + conflict handling
- Staging table + merge
- Per-file "ingestion ledger" (hash/checkpoint) + skip already-ingested files
Idempotency must hold across:
- reruns after crashes
- partial completion
- duplicate input files
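The first strategy (upsert by natural key) can be sketched with SQLite's `INSERT ... ON CONFLICT` clause; the table, key, and column names are hypothetical. The same idea maps to PostgreSQL's `ON CONFLICT` or a `MERGE` statement elsewhere:

```python
import sqlite3

def upsert_orders(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Idempotent write: insert by natural key, update on conflict.

    Re-running with the same rows leaves the table unchanged (no
    duplicates), which is the property required across reruns,
    crashes, and duplicate input files.
    """
    conn.executemany(
        """
        INSERT INTO orders (order_id, customer_id, total)
        VALUES (?, ?, ?)
        ON CONFLICT(order_id) DO UPDATE SET
            customer_id = excluded.customer_id,
            total = excluded.total
        """,
        rows,
    )
    conn.commit()
```

Committing per batch (rather than per row) keeps throughput reasonable while ensuring a crash mid-run leaves only whole batches behind, all of which are safe to replay.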
Review checklist (definition of done)
- Demonstrably streaming: memory does not scale with total rows.
- Every skipped record is present in error CSV with full context.
- Metrics are emitted and consistent with observed behavior.
- Re-running import on same inputs does not create duplicates (verified).
- Failures are isolated: one bad row/file does not poison the full run.