json-data-validation-test-design

📁 shimo4228/claude-code-learned-skills 📅 12 days ago
Total installs: 3 · Weekly installs: 3 · Site-wide rank: #54684
Install command
npx skills add https://github.com/shimo4228/claude-code-learned-skills --skill json-data-validation-test-design

Agent install distribution

replit 3
openclaw 3
mcpjam 2
claude-code 2
windsurf 2
zencoder 2

Skill documentation

JSON Data File Validation Test Design

Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules

Problem

JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:

  • Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
  • Schema violations that the app silently swallows
  • Cross-reference mismatches (source data vs generated output)
  • Missing or duplicate entries
  • Business rule violations (e.g., correct answer not in choices)

Manual review of large files (60+ entries, 3000+ lines) is unreliable.

Solution: Layered Pytest Validation

Structure tests in layers from structural to semantic:
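The layers below assume shared `data` and `entries` fixtures. A minimal `conftest.py` sketch, assuming the generated file lives at `data/questions.json` (the path is hypothetical; the `totalItems`/`items` field names follow the examples below):

```python
# conftest.py -- shared fixtures for all validation layers
import json
from pathlib import Path

import pytest

DATA_PATH = Path("data/questions.json")  # hypothetical location of the generated file


def load_json(path):
    """Parse a JSON file as UTF-8 and return the decoded object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture(scope="module")
def data():
    # scope="module": the file is parsed once and shared by every test in a module
    return load_json(DATA_PATH)


@pytest.fixture(scope="module")
def entries(data):
    return data["items"]
```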

Layer 1: Top-level structure

class TestTopLevelStructure:
    def test_required_fields(self, data):
        for field in ("totalItems", "items"):
            assert field in data, f"Missing top-level field: {field}"

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])

Layer 2: Per-entry schema validation

class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES, f"Entry {e['id']}: invalid type {e['type']!r}"

Layer 3: Cross-entry consistency

class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        dupes = {i for i in ids if ids.count(i) > 1}
        assert not dupes, f"Duplicate ids: {sorted(dupes)}"

    def test_references_resolve(self, entries, categories):
        # Every entry's category exists in the categories list
        unknown = [e["id"] for e in entries if e["category"] not in categories]
        assert not unknown, f"Unresolved category references: {unknown}"

Layer 4: Source cross-reference

class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(f"{e['id']}: {e['answer']!r} != {source_data[e['id']]!r}")
        assert not mismatches, f"{len(mismatches)} mismatches: {mismatches}"
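The `source_data` fixture can degrade gracefully when the originals are absent, skipping rather than failing the whole cross-reference layer. A sketch, assuming a hypothetical `sources/` directory of text files with `<id>: <answer>` lines (both the path and the line format are illustrative):

```python
import re
from pathlib import Path

import pytest

SOURCE_DIR = Path("sources")  # hypothetical directory of original source files


@pytest.fixture(scope="module")
def source_data():
    # Graceful degradation: skip dependent tests instead of erroring out
    if not SOURCE_DIR.exists():
        pytest.skip(f"source files not found under {SOURCE_DIR}")
    answers = {}
    for path in SOURCE_DIR.glob("*.txt"):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Assumed line format: "<id>: <answer>"
            m = re.match(r"(\w+):\s*(.+)", line)
            if m:
                answers[m.group(1)] = m.group(2)
    return answers
```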

Layer 5: Content quality heuristics

class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues, f"Stray characters found: {issues}"

Key Design Decisions

  • Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading per test
  • Collect-all-errors pattern: accumulate issues in a list, assert at end, so one test run shows all problems
  • Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent
  • Domain-aware thresholds: min length for text depends on the domain (e.g., 2 chars for Japanese terms like “過学習”)
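The collect-all-errors decision above can be factored into a small helper so every per-entry check reports all violations in a single run. A sketch; `collect_violations`, `check_type`, and the `VALID_TYPES` enum are illustrative names, not part of the skill:

```python
VALID_TYPES = {"single", "multiple"}  # illustrative enum


def collect_violations(entries, check):
    """Run `check` on every entry and gather messages instead of
    stopping at the first failure."""
    issues = []
    for e in entries:
        msg = check(e)
        if msg:
            issues.append(f"{e['id']}: {msg}")
    return issues


def check_type(entry):
    if entry["type"] not in VALID_TYPES:
        return f"invalid type {entry['type']!r}"
    return None


entries = [
    {"id": "q1", "type": "single"},
    {"id": "q2", "type": "essay"},
]
issues = collect_violations(entries, check_type)
# In a test, a single assertion then reports every problem at once:
# assert not issues, "\n".join(issues)
```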

When to Use

  • After generating/rebuilding JSON data files from external sources
  • As a CI gate for data files that feed into apps
  • When a data file is too large for manual review
  • When data is parsed from inconsistent sources (OCR, PDF export, manual entry)