health-checks
4
总安装量
4
周安装量
#52862
全站排名
安装命令
npx skills add https://github.com/doubleslashse/claude-marketplace --skill health-checks
Agent 安装分布
opencode
4
gemini-cli
3
claude-code
3
codex
3
qoder
2
antigravity
2
Skill 文档
Health Checks Skill
Overview
This skill provides knowledge and procedures for monitoring infrastructure health across GitHub Actions, Railway, Supabase, and Postgres.
Health Check Philosophy
Why Regular Health Checks?
- Proactive Detection: Find issues before users do
- Trend Identification: Spot degradation early
- Capacity Planning: Know when to scale
- Compliance: Maintain system hygiene
- Documentation: Track system state over time
Health Check Frequency
| Check Type | Frequency | When |
|---|---|---|
| Quick | Every deploy | After any deployment |
| Daily | Daily | Morning/start of business |
| Weekly | Weekly | Beginning of week |
| Deep | Monthly | Beginning of month |
| Full Audit | Quarterly | Scheduled maintenance window |
Health Status Framework
Traffic Light System
GREEN - All systems healthy
- No critical issues
- Metrics within normal ranges
- Advisory count: 0
YELLOW - Warning state
- Non-critical issues present
- Metrics approaching limits
- Performance advisories present
RED - Critical state
- Service impaired or unavailable
- Critical metrics exceeded
- Security advisories present
- Immediate action required
Status Determination Rules
| Condition | Status |
|---|---|
| Security advisory exists | RED |
| Service unavailable | RED |
| Error rate > 5% | RED |
| Connection utilization > 85% | RED |
| CI success rate < 75% | RED |
| Performance advisory exists | YELLOW |
| Error rate 1-5% | YELLOW |
| Connection utilization 70-85% | YELLOW |
| CI success rate 75-90% | YELLOW |
| Long-running queries present | YELLOW |
| All metrics normal | GREEN |
Health Metrics
Key Performance Indicators
| Platform | Metric | Good | Warning | Critical |
|---|---|---|---|---|
| Database | Connection % | <70% | 70-85% | >85% |
| Database | Query Duration | <100ms | 100-500ms | >500ms |
| Database | Dead Rows % | <10% | 10-20% | >20% |
| API | Error Rate | <1% | 1-5% | >5% |
| API | Response Time P95 | <500ms | 500-2000ms | >2000ms |
| CI/CD | Success Rate | >90% | 75-90% | <75% |
| CI/CD | Build Time | <5min | 5-15min | >15min |
Platform-Specific Metrics
Supabase
- API error rate
- Auth failure rate
- Storage utilization
- Edge function cold starts
- Realtime connection count
- Advisory count (security/performance)
GitHub Actions
- Workflow success rate
- Average build time
- Queue wait time
- Cache hit rate
- Failed workflow count
Railway
- Service uptime
- Deploy success rate
- Memory utilization
- CPU utilization
- Health check pass rate
Postgres
- Connection utilization
- Query duration distribution
- Lock contention
- Dead tuple ratio
- Index usage efficiency
- Table bloat
Health Check Procedures
Quick Health Check (5 min)
Purpose: Verify basic system functionality
1. [ ] Check for active incidents (any platform)
2. [ ] Verify all services responding
3. [ ] Check for critical advisories
4. [ ] Review last hour error rate
5. [ ] Check connection pool status
Daily Health Check (15 min)
Purpose: Assess overall system health
1. [ ] Run quick health check
2. [ ] Review 24-hour error trends
3. [ ] Check CI/CD success rate
4. [ ] Review all advisories
5. [ ] Check slow query log
6. [ ] Verify backups completed
7. [ ] Review resource utilization
Weekly Health Check (30 min)
Purpose: Comprehensive review and trending
1. [ ] Run daily health check
2. [ ] Analyze weekly error patterns
3. [ ] Review index usage stats
4. [ ] Check for table bloat
5. [ ] Review connection patterns
6. [ ] Assess capacity trends
7. [ ] Review deployment frequency
8. [ ] Check certificate expirations
Monthly Deep Check (1+ hours)
Purpose: Full system audit
1. [ ] Run weekly health check
2. [ ] Full index analysis
3. [ ] Query performance review
4. [ ] Security configuration audit
5. [ ] Capacity planning review
6. [ ] Cost analysis
7. [ ] Documentation review
8. [ ] Disaster recovery test
Alert Thresholds
Immediate Alerts (Page)
- Service unavailable > 1 minute
- Error rate > 10%
- Database connections > 90%
- Security advisory created
- Deployment failure (production)
- Health check failure > 5 minutes
Warning Alerts (Slack/Email)
- Error rate > 2%
- Database connections > 75%
- Performance advisory created
- Build time increase > 50%
- Response time P95 > 1s
- Disk usage > 80%
Info Alerts (Daily Digest)
- New advisory (any type)
- Build time change
- Resource trend change
- Configuration change
Health Report Template
# Infrastructure Health Report
**Generated**: {TIMESTAMP}
**Report Type**: {Quick | Daily | Weekly | Monthly}
**Overall Status**: {GREEN | YELLOW | RED}
## Executive Summary
{2-3 sentence overview}
## Platform Status
| Platform | Status | Issues | Warnings |
|----------|--------|--------|----------|
| GitHub Actions | {STATUS} | {N} | {N} |
| Railway | {STATUS} | {N} | {N} |
| Supabase | {STATUS} | {N} | {N} |
| Postgres | {STATUS} | {N} | {N} |
## Key Metrics
### Database
- Connections: {N}/{MAX} ({PCT}%)
- Query P95: {MS}ms
- Dead Rows: {PCT}%
### API
- Error Rate: {PCT}%
- Response Time P95: {MS}ms
### CI/CD
- Success Rate: {PCT}%
- Avg Build Time: {MIN}m
## Advisories
### Security
{List or "None"}
### Performance
{List or "None"}
## Issues Requiring Attention
### Immediate
{List or "None"}
### This Week
{List or "None"}
## Trends
{Notable changes from previous period}
## Recommendations
{Specific actions to improve health}
---
*Next health check: {TIMESTAMP}*
Remediation Playbooks
High Connection Utilization
1. Check for connection leaks
2. Identify idle connections
3. Review connection pool settings
4. Consider connection pooler (PgBouncer/Supavisor)
5. Optimize application connection handling
High Error Rate
1. Identify error types
2. Check recent deployments
3. Review affected endpoints
4. Check downstream dependencies
5. Roll back if deployment-related
Slow Queries
1. Identify slow queries (pg_stat_statements)
2. Run EXPLAIN ANALYZE
3. Check for missing indexes
4. Review query patterns
5. Consider query optimization or caching
Build Failures
1. Review failure logs
2. Check for flaky tests
3. Verify dependencies available
4. Check for environment issues
5. Review recent changes
See checklists.md for detailed health check checklists.