json-data-validation-test-design

📁 shimo4228/claude-code-learned-skills 📅 12 days ago
Total installs: 3 · Weekly installs: 3 · Site-wide rank: #54684
Install command
npx skills add https://github.com/shimo4228/claude-code-learned-skills --skill json-data-validation-test-design

Agent install distribution

replit 3
openclaw 3
mcpjam 2
claude-code 2
windsurf 2
zencoder 2

Skill documentation

JSON Data File Validation Test Design

Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules

Problem

JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:

  • Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
  • Schema violations that the app silently swallows
  • Cross-reference mismatches (source data vs generated output)
  • Missing or duplicate entries
  • Business rule violations (e.g., correct answer not in choices)

Manual review of large files (60+ entries, 3000+ lines) is unreliable.

Solution: Layered Pytest Validation

Structure tests in layers from structural to semantic:
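The layers below assume shared `data` and `entries` fixtures. A minimal `conftest.py` sketch, assuming the generated file lives at `data/questions.json` (the path is hypothetical; the `totalItems`/`items` field names follow the examples below):

```python
# conftest.py -- shared fixtures for all validation layers
import json
from pathlib import Path

import pytest

DATA_PATH = Path("data/questions.json")  # hypothetical location of the generated file


def load_json(path):
    """Parse a JSON file as UTF-8 and return the decoded object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture(scope="module")
def data():
    # scope="module": the file is parsed once and shared by every test in a module
    return load_json(DATA_PATH)


@pytest.fixture(scope="module")
def entries(data):
    return data["items"]
```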

Layer 1: Top-level structure

class TestTopLevelStructure:
    def test_required_fields(self, data):
        for field in ("totalItems", "items"):
            assert field in data, f"Missing top-level field: {field}"

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])

Layer 2: Per-entry schema validation

class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES, f"Entry {e['id']}: invalid type {e['type']!r}"

Layer 3: Cross-entry consistency

class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        dupes = {i for i in ids if ids.count(i) > 1}
        assert not dupes, f"Duplicate ids: {sorted(dupes)}"

    def test_references_resolve(self, entries, categories):
        # Every entry's category exists in the categories list
        unknown = [e["id"] for e in entries if e["category"] not in categories]
        assert not unknown, f"Unresolved category references: {unknown}"

Layer 4: Source cross-reference

class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(f"{e['id']}: {e['answer']!r} != {source_data[e['id']]!r}")
        assert not mismatches, f"{len(mismatches)} mismatches: {mismatches}"
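The `source_data` fixture can degrade gracefully when the originals are absent, skipping rather than failing the whole cross-reference layer. A sketch, assuming a hypothetical `sources/` directory of text files with `<id>: <answer>` lines (both the path and the line format are illustrative):

```python
import re
from pathlib import Path

import pytest

SOURCE_DIR = Path("sources")  # hypothetical directory of original source files


@pytest.fixture(scope="module")
def source_data():
    # Graceful degradation: skip dependent tests instead of erroring out
    if not SOURCE_DIR.exists():
        pytest.skip(f"source files not found under {SOURCE_DIR}")
    answers = {}
    for path in SOURCE_DIR.glob("*.txt"):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Assumed line format: "<id>: <answer>"
            m = re.match(r"(\w+):\s*(.+)", line)
            if m:
                answers[m.group(1)] = m.group(2)
    return answers
```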

Layer 5: Content quality heuristics

class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues, f"Stray characters found: {issues}"

Key Design Decisions

  • Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading per test
  • Collect-all-errors pattern: accumulate issues in a list, assert at end, so one test run shows all problems
  • Graceful degradation: source cross-reference tests skip with pytest.skip() if source files are absent
  • Domain-aware thresholds: min length for text depends on the domain (e.g., 2 chars for Japanese terms like “過学習”)
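The collect-all-errors decision above can be factored into a small helper so every per-entry check reports all violations in a single run. A sketch; `collect_violations`, `check_type`, and the `VALID_TYPES` enum are illustrative names, not part of the skill:

```python
VALID_TYPES = {"single", "multiple"}  # illustrative enum


def collect_violations(entries, check):
    """Run `check` on every entry and gather messages instead of
    stopping at the first failure."""
    issues = []
    for e in entries:
        msg = check(e)
        if msg:
            issues.append(f"{e['id']}: {msg}")
    return issues


def check_type(entry):
    if entry["type"] not in VALID_TYPES:
        return f"invalid type {entry['type']!r}"
    return None


entries = [
    {"id": "q1", "type": "single"},
    {"id": "q2", "type": "essay"},
]
issues = collect_violations(entries, check_type)
# In a test, a single assertion then reports every problem at once:
# assert not issues, "\n".join(issues)
```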

When to Use

  • After generating/rebuilding JSON data files from external sources
  • As a CI gate for data files that feed into apps
  • When a data file is too large for manual review
  • When data is parsed from inconsistent sources (OCR, PDF export, manual entry)