json-data-validation-test-design
Total installs: 3
Weekly installs: 3
Site-wide rank: #54684
Install command
npx skills add https://github.com/shimo4228/claude-code-learned-skills --skill json-data-validation-test-design
Install distribution by agent
replit: 3
openclaw: 3
mcpjam: 2
claude-code: 2
windsurf: 2
zencoder: 2
Skill documentation
JSON Data File Validation Test Design
Extracted: 2026-02-11. Context: validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules.
Problem
JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:
- Stray characters from OCR/copy-paste (e.g., mojibake like "Ã" mixed into Japanese text)
- Schema violations that the app silently swallows
- Cross-reference mismatches (source data vs generated output)
- Missing or duplicate entries
- Business rule violations (e.g., correct answer not in choices)
Manual review of large files (60+ entries, 3000+ lines) is unreliable.
Solution: Layered Pytest Validation
Structure tests in layers from structural to semantic:
Layer 1: Top-level structure
class TestTopLevelStructure:
    def test_required_fields(self, data): ...

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])
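The layer tests above assume `data` and `entries` fixtures. A minimal conftest.py sketch is shown below; the file name `data/questions.json` and the `items` key are assumptions, not part of the skill, and the `scope="module"` setting implements the once-per-module parse described under Key Design Decisions.

```python
import json
from pathlib import Path

import pytest

DATA_FILE = Path("data/questions.json")  # hypothetical location

def load_json(path: Path) -> dict:
    """Read and parse the generated data file."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)

@pytest.fixture(scope="module")
def data():
    # Parsed once per test module, shared by every test in it
    return load_json(DATA_FILE)

@pytest.fixture(scope="module")
def entries(data):
    return data["items"]
```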
Layer 2: Per-entry schema validation
class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES
Layer 3: Cross-entry consistency
class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids))

    def test_references_resolve(self, entries, categories):
        # Every entry's category must exist in the categories list
        # (assumes category objects expose an "id" field)
        known = {c["id"] for c in categories}
        for e in entries:
            assert e["category"] in known, f"Entry {e['id']}: unknown category"
Layer 4: Source cross-reference
class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(f"{e['id']}: {e['answer']!r} != {source_data[e['id']]!r}")
        assert not mismatches, f"{len(mismatches)} mismatches"
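The source-parsing step is elided above because it is project-specific. As one possible shape, a sketch assuming the source is a CSV with "id" and "answer" columns:

```python
import csv
from pathlib import Path

def parse_source(path: Path) -> dict:
    """Map entry id -> expected answer from the original source file."""
    with path.open(encoding="utf-8", newline="") as f:
        return {row["id"]: row["answer"] for row in csv.DictReader(f)}
```

The same id-to-value mapping can be built from text, a PDF export, or an API dump; only the parsing body changes.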
Layer 5: Content quality heuristics
class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"Ã", "â¬", "£"}  # characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues
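A fixed stray-character set only catches mojibake you have already seen. A broader heuristic (an extension, not part of the original skill) exploits the fact that UTF-8 text misdecoded as Latin-1 lands in the U+0080–U+00FF range, which should be empty in text that is only ASCII plus Japanese:

```python
def find_stray_chars(text: str) -> list:
    """Return Latin-1 supplement characters (U+0080-U+00FF) found in text.

    These are typical UTF-8-as-Latin-1 mojibake markers ("Ã", "â", "¬", ...)
    and should not appear in ASCII-plus-Japanese content. Tune the range
    if the domain legitimately uses accented Latin characters.
    """
    return [ch for ch in text if "\u0080" <= ch <= "\u00ff"]
```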
Key Design Decisions
- Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading the file for every test
- Collect-all-errors pattern: accumulate issues in a list and assert at the end, so one test run surfaces all problems
- Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent
- Domain-aware thresholds: the minimum text length depends on the domain (e.g., 2 characters for short Japanese terms like "学習")
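The graceful-degradation point can be sketched as a small guard helper; the `sources/original.txt` path is hypothetical:

```python
from pathlib import Path

import pytest

def require_path(path: Path) -> None:
    """Skip the calling test/fixture if a needed source file is missing."""
    if not path.exists():
        pytest.skip(f"{path} not present; skipping cross-reference layer")

@pytest.fixture(scope="module")
def source_data():
    require_path(Path("sources/original.txt"))  # assumed location
    ...  # parse and return the id -> answer mapping
```

Skipping (rather than failing) keeps the structural and content layers useful in environments where the original sources are not checked out.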
When to Use
- After generating/rebuilding JSON data files from external sources
- As a CI gate for data files that feed into apps
- When a data file is too large for manual review
- When data is parsed from inconsistent sources (OCR, PDF export, manual entry)