extract-thailand-election-policies
npx skills add https://github.com/nuttea/thailand-election-skills --skill extract-thailand-election-policies
Extract Thailand Election Policies
Specialized skill for extracting Thai political party policy data from PDF documents using Google Gemini 3 Pro Preview with structured output.
Agent Instructions – CRITICAL Setup Steps
BEFORE running ANY Python script in this skill, the agent MUST execute these commands IN ORDER:
# Step 1: Navigate to skill directory
cd .claude/skills/extract-thailand-election-policies
# Step 2: Create virtual environment (skip if .venv/ exists)
python3 -m venv .venv
# Step 3: ACTIVATE virtual environment (CRITICAL - don't skip!)
source .venv/bin/activate
# Step 4: Install requirements (CRITICAL - must be in activated venv!)
pip install -r requirements.txt
# Step 5: Verify installation succeeded
python -c "import google.genai, pydantic, requests; print('Core packages ready')"
# Step 6: Optional - Install ddtrace for Datadog LLMObs tracing
pip install ddtrace
IMPORTANT NOTES:
- The `source .venv/bin/activate` command MUST be run before `pip install`
- Each new terminal session needs `source .venv/bin/activate` again
- Check for the `(.venv)` prefix in the terminal prompt to confirm activation
- If installation fails, try without version constraints: `pip install google-genai pydantic requests`
After setup, run Python scripts normally:
python scripts/extract_policy.py --pdf-path party.pdf --output-file party.json
When done, deactivate:
deactivate
Quick Start (After Venv Setup)
# Extract single party
python scripts/extract_policy.py \
--pdf-path "เบอร์ 9 พรรคเพื่อไทย.pdf" \
--output-file "party_9_policies.json"
Features
- Thai Language OCR – Handles Thai text and numerals
- Structured Output – Pydantic validation with JSON Schema
- 9-Field Extraction – Complete policy data model
- Budget Normalization – Converts Thai units to Baht
- Category Assignment – 15 predefined categories
- Stream Monitoring – Timeout detection and auto-retry
- Error Logging – Detailed debugging information
Policy Data Model
Fields Extracted
- policy_seq (int) – Policy sequence number (Thai numerals → Arabic)
- policy_category (str) – One of 15 predefined categories
- policy_name (str) – Policy title/name
- budget_baht (int) – Budget in Baht (pure integer, 0 if none)
- funding_source (str) – Funding source details
- cost_effectiveness (str) – Cost-effectiveness analysis
- benefits (str) – Benefits description
- impacts (str) – Impact analysis
- risks (str) – Risk assessment
Policy Categories (15)
- เศรษฐกิจและการค้า (Economy & Trade)
- เกษตรกรรมและประมง (Agriculture & Fisheries)
- สาธารณสุข (Public Health)
- การศึกษา (Education)
- โครงสร้างพื้นฐาน (Infrastructure)
- สิ่งแวดล้อมและพลังงาน (Environment & Energy)
- สวัสดิการสังคม (Social Welfare)
- ธรรมาภิบาลและการต่อต้านคอร์รัปชัน (Governance & Anti-Corruption)
- กลาโหมและความมั่นคง (Defense & Security)
- การท่องเที่ยวและวัฒนธรรม (Tourism & Culture)
- ที่ดินและที่อยู่อาศัย (Land & Housing)
- แรงงานและการจ้างงาน (Labor & Employment)
- ยุติธรรม (Justice)
- การต่างประเทศ (Foreign Affairs)
- อื่นๆ (Others)
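As a rough illustration of the data model above, the nine fields and the category check could be written as follows. The skill itself validates with Pydantic; this sketch swaps in a standard-library dataclass so it runs without dependencies. The names `Policy` and `CATEGORIES` are hypothetical and are not taken from `extract_policy.py`.

```python
from dataclasses import dataclass

# The 15 predefined categories listed above.
CATEGORIES = (
    "เศรษฐกิจและการค้า", "เกษตรกรรมและประมง", "สาธารณสุข", "การศึกษา",
    "โครงสร้างพื้นฐาน", "สิ่งแวดล้อมและพลังงาน", "สวัสดิการสังคม",
    "ธรรมาภิบาลและการต่อต้านคอร์รัปชัน", "กลาโหมและความมั่นคง",
    "การท่องเที่ยวและวัฒนธรรม", "ที่ดินและที่อยู่อาศัย",
    "แรงงานและการจ้างงาน", "ยุติธรรม", "การต่างประเทศ", "อื่นๆ",
)


@dataclass(frozen=True)
class Policy:
    """One extracted policy (hypothetical stand-in for the skill's Pydantic model)."""
    policy_seq: int
    policy_category: str
    policy_name: str
    budget_baht: int = 0          # pure integer, 0 if no budget is stated
    funding_source: str = ""
    cost_effectiveness: str = ""
    benefits: str = ""
    impacts: str = ""
    risks: str = ""

    def __post_init__(self) -> None:
        # Reject categories outside the predefined list.
        if self.policy_category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.policy_category!r}")
        # Reject formatted strings such as "350,000 ล้าน" in the budget field.
        if not isinstance(self.budget_baht, int):
            raise ValueError("budget_baht must be an integer number of Baht")
```

Constructing `Policy(policy_seq=1, policy_category="โครงสร้างพื้นฐาน", policy_name="...")` succeeds, while an unlisted category raises `ValueError`.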
Usage
Extract Single Party
python scripts/extract_policy.py \
--pdf-path "เบอร์ 27 พรรคประชาธิปัตย์.pdf" \
--output-file "party_27_policies.json"
Batch Extract All Parties
bash scripts/batch_extract_all.sh
Features:
- Processes all PDFs in directory
- Skips already-extracted files
- Auto-retry on failures (up to 3 times)
- Moves processed PDFs to `processed/` directory
- Creates consolidated CSV at end
- 3-second delays between extractions
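The batch behaviour listed above can be sketched in Python for readers who want to see the control flow. The real script is Bash and its internals are not shown here, so treat this as an approximation. The function `run_batch`, the `extract` callable, and the reading of "up to 3 times" as three attempts in total are all assumptions. CSV consolidation is left out.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def run_batch(pdf_dir: Path, out_dir: Path,
              extract: Callable[[Path, Path], None],
              force: bool = False, max_attempts: int = 3,
              delay_s: float = 3.0) -> List[Path]:
    """Extract every PDF in pdf_dir and return the PDFs that still failed."""
    processed = pdf_dir / "processed"
    processed.mkdir(exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed: List[Path] = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".json")
        if target.exists() and not force:
            continue  # skip already-extracted files
        for attempt in range(1, max_attempts + 1):
            try:
                extract(pdf, target)
                # Move the PDF out of the way once extraction succeeded.
                shutil.move(str(pdf), str(processed / pdf.name))
                break
            except Exception as exc:
                print(f"{pdf.name}: attempt {attempt} failed: {exc}")
        else:
            failed.append(pdf)  # every attempt failed
        time.sleep(delay_s)  # pause between PDFs
    return failed
```

Passing `force=True` mirrors the `--force` flag: existing JSON files no longer cause a skip.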
Convert to CSV
python scripts/json_to_csv.py \
--json-file "party_9_policies.json" \
--output-file "party_9_policies.csv"
Send to Datadog
python scripts/send_to_datadog.py \
--csv-file "consolidated_all_parties.csv"
Script Arguments
extract_policy.py
| Argument | Required | Description |
|---|---|---|
| `--pdf-path` | Yes | Path to PDF file |
| `--output-file` | Yes | Output JSON file path |
| `--max-retries` | No | Max retry attempts (default: 2) |
batch_extract_all.sh
| Argument | Description |
|---|---|
| (none) | Extract all PDFs, skip existing |
| `--force` | Re-extract all files |
json_to_csv.py
| Argument | Required | Description |
|---|---|---|
| `--json-file` | Yes | Input JSON file |
| `--output-file` | Yes | Output CSV file |
| `--delimiter` | No | CSV delimiter (default: `\|`) |
| `--preserve-newlines` | No | Convert newlines to literal `\n` |
send_to_datadog.py
| Argument | Required | Description |
|---|---|---|
| `--csv-file` | Yes | CSV file to send |
| `--batch-size` | No | Batch size (default: 50) |
| `--dry-run` | No | Test without sending |
Extraction Rules
Thai Numeral Conversion
Convert to Arabic ONLY in policy_seq:
- ๐ → 0, ๑ → 1, ๒ → 2, ๓ → 3, ๔ → 4
- ๕ → 5, ๖ → 6, ๗ → 7, ๘ → 8, ๙ → 9
Preserve in all other fields:
- ๑) ๒) ๓) in lists
- Thai numerals in text content
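A minimal sketch of the conversion rule above, applied only to the sequence number. The helper name `parse_policy_seq` is hypothetical. In the skill itself the conversion is presumably requested from the model via the prompt rather than done in post-processing.

```python
# Thai digits ๐-๙ mapped to ASCII 0-9.
THAI_TO_ARABIC = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")


def parse_policy_seq(raw: str) -> int:
    """Turn a sequence label such as '๑๒' or '๓)' into an int."""
    converted = raw.translate(THAI_TO_ARABIC)
    digits = "".join(ch for ch in converted if ch in "0123456789")
    if not digits:
        raise ValueError(f"no sequence number in {raw!r}")
    return int(digits)


print(parse_policy_seq("๑๒"))   # 12
print(parse_policy_seq("๓)"))   # 3
```

Text fields such as `funding_source` are simply never passed through this helper, so their Thai numerals stay intact.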
Budget Normalization
Convert to pure integer in Baht:
- ล้าน (million) = × 1,000,000
- พันล้าน (billion) = × 1,000,000,000
- แสนล้าน (hundred billion) = × 100,000,000,000
- ล้านล้าน (trillion) = × 1,000,000,000,000
- ไม่ใช้เงินงบประมาณ (no budget funds used) = 0
Examples:
- 40,000 ล้าน → 40,000,000,000
- 3.5 แสนล้าน → 350,000,000,000
- ไม่ระบุ (not specified) → 0
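The multipliers above can be checked with a small helper. This is a sketch under assumptions: `normalize_budget` is a hypothetical name, the input is assumed to use ASCII digits, and any text without a number is treated as 0. `Decimal` is used so that "3.5" scales exactly instead of picking up floating-point error.

```python
import re
from decimal import Decimal

# Longest unit first so that "ล้านล้าน" is not mistaken for "ล้าน".
UNITS = (
    ("ล้านล้าน", 10**12),
    ("แสนล้าน", 10**11),
    ("พันล้าน", 10**9),
    ("ล้าน", 10**6),
)
NUMBER = re.compile(r"[0-9][0-9,]*(?:\.[0-9]+)?")


def normalize_budget(text: str) -> int:
    """Convert a Thai budget phrase to an integer number of Baht."""
    match = NUMBER.search(text)
    if match is None:
        return 0  # e.g. "ไม่ระบุ" or "ไม่ใช้เงินงบประมาณ"
    amount = Decimal(match.group().replace(",", ""))
    rest = text[match.end():]
    for unit, multiplier in UNITS:
        if unit in rest:
            return int(amount * multiplier)
    return int(amount)  # no unit word: already in Baht


print(normalize_budget("40,000 ล้าน"))   # 40000000000
print(normalize_budget("3.5 แสนล้าน"))   # 350000000000
print(normalize_budget("ไม่ระบุ"))        # 0
```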
Text Extraction
- Extract word-by-word for accuracy
- Preserve Thai formatting
- Include all policies (no TOTAL rows)
- Maintain numbered lists (๑) ๒) ๓))
Output Format
JSON Structure
{
  "policies": [
    {
      "policy_seq": 1,
      "policy_category": "โครงสร้างพื้นฐาน",
      "policy_name": "ระบบรางความเร็วสูง",
      "budget_baht": 350000000000,
      "funding_source": "๑) งบประมาณแผ่นดิน\n๒) PPP\n๓) พันธบัตร",
      "cost_effectiveness": "ลดต้นทุนโลจิสติกส์...",
      "benefits": "๑) เพิ่มการเชื่อมต่อ\n๒) กระตุ้นเศรษฐกิจ",
      "impacts": "ผลกระทบระยะยาว...",
      "risks": "ความเสี่ยงทางการเงิน..."
    }
  ]
}
CSV Format
Pipe-delimited with columns:
party_number|party_name|policy_seq|policy_category|policy_name|budget_baht|funding_source|cost_effectiveness|benefits|impacts|risks
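To make the column layout concrete, here is one way the JSON structure could be flattened into that pipe-delimited form. It is a sketch, not the code in `json_to_csv.py`: the function `policies_to_csv` is hypothetical, and passing `party_number` and `party_name` as arguments is an assumption, since the source does not say where the script takes them from.

```python
import csv
import io
import json

COLUMNS = [
    "party_number", "party_name", "policy_seq", "policy_category",
    "policy_name", "budget_baht", "funding_source", "cost_effectiveness",
    "benefits", "impacts", "risks",
]


def policies_to_csv(json_text: str, party_number: int, party_name: str,
                    delimiter: str = "|",
                    preserve_newlines: bool = True) -> str:
    """Flatten {"policies": [...]} into delimited text with a header row."""
    policies = json.loads(json_text)["policies"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    for policy in policies:
        row = {"party_number": party_number, "party_name": party_name, **policy}
        if preserve_newlines:
            # Keep each record on one physical line: newline -> literal \n.
            row = {k: v.replace("\n", "\\n") if isinstance(v, str) else v
                   for k, v in row.items()}
        writer.writerow(row)
    return buf.getvalue()
```

With `preserve_newlines=False` the `csv` module quotes multi-line fields instead, which most spreadsheet tools also accept.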
Performance
- Average extraction time: 3-5 minutes per PDF
- Large PDFs (50MB): 5-8 minutes
- Small PDFs (<5MB): 2-3 minutes
- Batch processing: ~4 hours for 51 parties
Error Handling
Automatic Retry
- Detects incomplete responses (1-chunk with invalid JSON)
- Retries up to 3 times
- 3-second delay between retries
- Logs all errors to `.error.log` files
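The retry behaviour described above could look roughly like this. It is a sketch under assumptions: `call_model` stands in for whatever function streams the Gemini response and returns its text chunks, and both helper names are hypothetical. An attempt counts as incomplete when the joined chunks do not parse as a JSON object with a `policies` list.

```python
import json
import time
from typing import Callable, List


def is_complete_json(chunks: List[str]) -> bool:
    """True when the joined chunks parse as an object with a 'policies' list."""
    try:
        data = json.loads("".join(chunks))
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("policies"), list)


def extract_with_retry(call_model: Callable[[], List[str]],
                       max_retries: int = 3, delay_s: float = 3.0) -> dict:
    """Call the model, retrying when the response is truncated or invalid."""
    last_error = "no attempt made"
    for attempt in range(max_retries + 1):  # first try plus retries
        if attempt:
            time.sleep(delay_s)  # fixed pause before each retry
        chunks = call_model()
        if is_complete_json(chunks):
            return json.loads("".join(chunks))
        last_error = f"attempt {attempt + 1}: {len(chunks)} chunk(s), invalid JSON"
    raise RuntimeError(last_error)
```

The final `RuntimeError` message is the kind of detail that would be worth writing to the error log.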
Stream Monitoring
- Tracks time between chunks
- Timeout if no chunks for >3 minutes
- Auto-retry on timeout
- Shows chunk count and content preview
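One way to detect a stalled stream is to read chunks on a background thread and wait on a queue with a timeout. The sketch below shows that technique; it is not necessarily how `extract_policy.py` does it, and `iter_with_stall_timeout` and `StreamStalled` are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


class StreamStalled(TimeoutError):
    """Raised when no chunk arrives within the allowed gap."""


def iter_with_stall_timeout(chunks: Iterable[str],
                            timeout_s: float = 180.0) -> Iterator[str]:
    """Yield chunks, raising StreamStalled if the gap between two exceeds timeout_s."""
    q: queue.Queue = queue.Queue()

    def pump() -> None:
        # Runs in the background so a blocked stream cannot block the caller.
        try:
            for chunk in chunks:
                q.put(chunk)
        finally:
            q.put(_DONE)  # also sent if the stream itself raises

    threading.Thread(target=pump, daemon=True).start()
    count = 0
    while True:
        try:
            item = q.get(timeout=timeout_s)
        except queue.Empty:
            raise StreamStalled(
                f"no chunk for {timeout_s}s after {count} chunk(s)") from None
        if item is _DONE:
            return
        count += 1
        yield item
```

A caller would catch `StreamStalled` and start a fresh request; the default of 180 seconds matches the 3-minute threshold listed above.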
Error Logs
Location: output_dir/party_N_NAME.error.log
Contains:
- Python exceptions
- API errors
- Validation failures
- Timeout information
Workflow
1. Extract PDFs to JSON
cd /path/to/pdfs
bash /path/to/skills/extract-thailand-election-policies/scripts/batch_extract_all.sh
Output:
- 51 JSON files in `all_parties_output/`
- PDFs moved to `processed/` directory
- Error logs for any failures
2. Convert to CSV
# Individual CSVs created automatically during extraction
# Create consolidated CSV
python scripts/consolidate_csv.py \
--input-dir "all_parties_output" \
--output-file "consolidated_all_parties.csv"
3. Send to Datadog
python scripts/send_to_datadog.py \
--csv-file "consolidated_all_parties.csv"
Result: All policies searchable in Datadog with tags:
- source:custom-log
- service:th-election-policy
- version:YYYYMMDD-HHMM
- env:prod
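For orientation, the sketch below shows how CSV rows might be turned into batched log entries carrying those tags. It assumes the script targets Datadog's HTTP log intake, whose entries accept `ddsource`, `service`, `ddtags` and `message` fields; the source does not confirm this. The function `build_log_batches` is hypothetical, and the sending step is omitted.

```python
import csv
import io
import json
from datetime import datetime
from typing import Dict, List, Optional


def build_log_batches(csv_text: str, batch_size: int = 50, env: str = "prod",
                      version: Optional[str] = None) -> List[List[Dict[str, str]]]:
    """Turn pipe-delimited CSV text into batches of Datadog-style log entries."""
    # Default version tag follows the YYYYMMDD-HHMM pattern shown above.
    version = version or datetime.now().strftime("%Y%m%d-%H%M")
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="|")
    logs = [
        {
            "ddsource": "custom-log",
            "service": "th-election-policy",
            "ddtags": f"version:{version},env:{env}",
            # One policy per log entry, kept as JSON so fields stay queryable.
            "message": json.dumps(row, ensure_ascii=False),
        }
        for row in reader
    ]
    return [logs[i:i + batch_size] for i in range(0, len(logs), batch_size)]
```

Printing the batches instead of posting them is roughly what a `--dry-run` would be expected to do.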
Example: Complete Workflow
# 1. Set up
cd /Users/nuttee.jirattivongvibul/Projects/nuttee-se-gemini-cli/temp_working/OTHERS/THAILAND_ELECTION_2026
# 2. Extract all parties
bash scripts/batch_extract_all.sh
# 3. Check status
./CHECK_STATUS.sh
# 4. Generate consolidated CSV
# (automatically done by batch script)
# 5. Send to Datadog
python scripts/send_to_datadog.py \
--csv-file "all_parties_output/consolidated_all_parties.csv"
# 6. Analyze in Datadog
# Go to: https://app.datadoghq.com/logs
# Query: source:custom-log service:th-election-policy
Real-World Results
Thailand 2026 Election Extraction
Completed: 2026-01-29
Results:
- 51 parties extracted (100%)
- 587 policies total
- All data in Datadog
- Analysis notebook created
Processing Time: ~6-7 hours total
Success Factors:
- Stream timeout detection
- Incomplete response retry
- Proper error logging
- 3-second delays
Troubleshooting
Issue: Incomplete JSON (1 chunk)
Symptom: Only 1 chunk received, invalid JSON
Solution: Script automatically detects and retries (up to 3 times)
Manual fix:
python scripts/extract_policy.py \
--pdf-path "problem.pdf" \
--output-file "output.json" \
--max-retries 5
Issue: Stream Stalls
Symptom: No chunks for >3 minutes
Solution: Script automatically detects timeout and retries
Check logs:
cat all_parties_output/party_N_NAME.error.log
Issue: API Rate Limits
Symptom: Multiple failures in a row
Solution:
- Increase delay in batch script (change `DELAY_BETWEEN_PDFS`)
- Wait 1 hour and retry
- Use different API key
Advanced Usage
Custom Extraction
For different policy document formats, modify:
- Pydantic Models (lines 28-43 in `extract_policy.py`)
- Instructions (in batch script or command line)
- Categories (update predefined list)
Batch Processing Options
Skip existing files:
./batch_extract_all.sh
Force re-extract all:
./batch_extract_all.sh --force
Custom delays:
Edit DELAY_BETWEEN_PDFS in script (default: 3 seconds)
Integration
With Google Sheets
- Use comma-separated CSV
- Import with UTF-8 encoding
- Find & Replace: `\n` → Ctrl+Enter
- Format budget column with thousands separator
With Datadog
- Send logs with `send_to_datadog.py`
- Query: `source:custom-log service:th-election-policy`
- Create dashboards and monitors
- Export for further analysis
With Other Tools
Python/Pandas:
import pandas as pd
df = pd.read_csv('consolidated.csv', delimiter='|')
Excel:
- Open CSV with delimiter: `|`
- Convert text to columns if needed
Files in This Skill
extract-thailand-election-policies/
├── SKILL.md                    # This file
├── README.md                   # Quick reference
├── WORKFLOW.md                 # Step-by-step guide
├── scripts/
│   ├── extract_policy.py       # Single PDF extraction
│   ├── batch_extract_all.sh    # Batch processing
│   ├── json_to_csv.py          # JSON to CSV conversion
│   ├── send_to_datadog.py      # Datadog integration
│   └── CHECK_STATUS.sh         # Progress monitoring
└── examples/
    ├── sample_output.json      # Example JSON
    ├── sample_output.csv       # Example CSV
    └── datadog_queries.md      # Query examples
Success Metrics
Thailand 2026 Project
- Extraction: 51/51 parties (100%)
- Policies: 587 total
- Data Quality: 100% valid JSON
- Datadog: 587 logs ingested
- Analysis: Notebook with 35 cells created
See Also
- Analysis Guide: `DATADOG_ANALYSIS_GUIDE.md`
- Project Summary: `PROJECT_COMPLETE_SUMMARY.md`
- Datadog Notebook: https://app.datadoghq.com/notebook/13821543
Agent Workflow
When the user requests Thai election policy extraction:
- Use `scripts/extract_policy.py` for a single PDF
- Use `scripts/batch_extract_all.sh` for multiple PDFs
- Convert to CSV with `scripts/json_to_csv.py`
- Send to Datadog with `scripts/send_to_datadog.py`
- Analyze using the Datadog notebook
This skill has been run end to end on 51 real political party PDFs.