monitor

📁 whawkinsiv/claude-code-skills 📅 7 days ago

总安装量

周安装量

#52143

全站排名

安装命令

npx skills add https://github.com/whawkinsiv/claude-code-skills --skill monitor

Agent 安装分布

cursor 3

claude-code 3

github-copilot 3

mcpjam 2

openhands 2

zencoder 2

Skill 文档

Monitor

Monitoring Checklist

Basic Monitoring:
- [ ] Uptime monitoring (is site up?)
- [ ] Error tracking (are errors happening?)
- [ ] Performance monitoring (is it slow?)
- [ ] User activity (are people using it?)
- [ ] Critical alerts configured
- [ ] Check dashboard daily

See MONITORING-SETUP.md for implementation.

Why Monitor?

Without monitoring:

Users hit errors, you don’t know
Site goes down, you find out from Twitter
Slow performance, users leave silently
Security issues, no alert

With monitoring:

Errors show in dashboard immediately
Get text when site goes down
See performance degradation
Catch issues before users complain

Goal: Know about problems before users tell you.

Three Essential Monitors

1. Is It Up?

Uptime monitoring – Pings your app every minute

Free tools:

UptimeRobot (free, 50 monitors)
Pingdom (limited free tier)
Vercel/Netlify (built-in for deployed apps)

Setup:

1. Sign up for UptimeRobot
2. Add monitor for https://yourapp.com
3. Add your email for alerts
4. Get texted if site is down

2. Are There Errors?

Error tracking – Captures JavaScript errors and API failures

Free tools:

Sentry (free tier: 5k errors/month)
LogRocket (limited free)
Vercel/Netlify logs (for deployed apps)

Setup:

Tell AI:
"Add Sentry error tracking:
- Capture all frontend errors
- Capture all API errors
- Include user context
- Send to Sentry dashboard"

3. Is It Slow?

Performance monitoring – Tracks page load times

Free tools:

Vercel Analytics (built-in)
Google PageSpeed Insights (free)
Cloudflare Analytics (free tier)

Setup:

Usually automatic with hosting platform
Check dashboard weekly

What to Monitor

Critical Metrics

Must monitor:

Site uptime (99%+)
Error rate (< 1% of requests)
API response time (< 500ms)
Page load time (< 3s)

Nice to have:

Active users
Feature usage
Conversion rates
User paths

For MVP: Focus on the “must monitor” only.

Setting Up Alerts

Configure alerts for:

Critical (text me immediately):

Site is down
Error rate spike (10x normal)
Database connection lost
Payment processing failing

Important (email within hour):

API slow (>2 seconds)
Error rate elevated (2x normal)
Disk space low (>80%)

Informational (daily digest):

New errors discovered
Performance trending down
Traffic patterns

Tell AI:

Configure monitoring alerts:
- Critical: Text to [phone]
- Important: Email to [email]
- Send summary: Daily at 9am

Daily Monitoring Routine

5-minute morning check:

Daily Check:
1. Open monitoring dashboard
2. Check uptime (should be 100% yesterday)
3. Check error count (any spikes?)
4. Check performance (slower than usual?)
5. Review any alerts from overnight

If all green: You’re done, 5 minutes.

If red: Investigate using debug skill.

Reading Monitoring Dashboards

Uptime Dashboard

Green: Site responding
Red: Site down or slow to respond

What to check:

Uptime percentage (target: 99%+)
Response time (target: <500ms)
Recent downtime incidents

Error Dashboard

Look for:

Error count spikes (sudden jump)
New error types (didn’t see before)
Affected users (how many hit this?)
Error frequency (happening a lot?)

Priority:

Affecting many users â High priority
Blocking key features â High priority
Edge case error â Lower priority

Performance Dashboard

Look for:

Load time trending up (getting slower)
Slow endpoints (which API calls)
Slow pages (which routes)
Geographic differences (slow in specific regions)

Error Investigation

When errors spike:

1. Open error tracking dashboard (Sentry)
2. Find the most frequent error
3. Read error message and stack trace
4. Note: How many users affected?
5. Note: Started when?
6. Check: Did we deploy recently?

Give to AI:

Error in production:
[Paste error message and stack trace]

Affected: [X] users in last [Y] hours
Started: [timestamp]
Recent deploys: [any?]

Please:
1. Explain what's wrong
2. Propose hotfix
3. How to test before deploying

User-Reported Issues

When user reports problem:

User Report Investigation:
1. Can you reproduce it?
2. Check monitoring for errors at that time
3. Check logs for that user
4. Check if others affected
5. Determine severity

Then use debug skill to fix.

Tell AI:

User reported: [issue description]
User: [email or ID]
Timestamp: [when it happened]

Check monitoring and logs for this user at this time.
What errors or issues do you see?

Proactive Monitoring

Catch issues before users:

Weekly checks:

Weekly Review:
- [ ] Error trends (going up or down?)
- [ ] Performance trends (slower?)
- [ ] New error types introduced
- [ ] Uptime issues resolved
- [ ] Alert noise (too many false alerts?)

Monthly checks:

Monthly Health:
- [ ] Compare to last month
- [ ] Any degradation?
- [ ] Any improvements?
- [ ] Monitoring gaps (what's not tracked?)

Free Monitoring Stack

Recommended for MVP:

Uptime:

UptimeRobot (free) – 50 monitors

Errors:

Sentry (free) – 5k errors/month

Performance:

Vercel Analytics (free on Vercel)
Cloudflare Analytics (free)

Logs:

Platform logs (Vercel, Netlify, Railway)

Cost: $0/month until you need more.

When to Upgrade Monitoring

Upgrade when:

Hitting free tier limits
Need more detailed analytics
Need faster alert response
Need advanced features (session replay, etc.)

Paid tiers (typically $20-50/mo):

Sentry Pro ($26/mo)
LogRocket ($99/mo – session replay)
DataDog ($15/host/mo)

For < 1000 users: Free tiers sufficient.

Common Monitoring Mistakes

Mistake	Fix
No monitoring set up	Set up before launch
Alert fatigue (too many alerts)	Only alert on critical issues
Checking once a month	Check daily (5 minutes)
Ignoring trends	Watch for degradation over time
No alerts configured	Set up text alerts for downtime
Monitoring but not acting	Use monitoring to find and fix issues

Interpreting Trends

Good trends:

Errors decreasing
Performance improving
Uptime stable at 99.9%+

Warning trends:

Errors slowly increasing
Performance slowly degrading
Uptime dipping below 99%

Critical trends:

Sudden error spike
Sudden performance drop
Multiple downtime incidents

Action: Address warning trends before they become critical.

Logging vs Monitoring

Logging:

Records what happened
For debugging specific issues
Detailed, verbose
Review when investigating

Monitoring:

Tracks overall health
For catching issues early
High-level metrics
Review daily

Both needed: Monitoring alerts you, logs help debug.

Setting Up Logging

Tell AI:

Add application logging:
- Log all errors with context
- Log API requests/responses
- Log slow operations (>1s)
- Log authentication events
- Don't log sensitive data

Format: JSON with timestamp, level, message, context
Send to: [Platform logs or external service]

Log levels:

ERROR: Something broke
WARN: Something concerning
INFO: Normal operations
DEBUG: Detailed debugging info

Production: Log ERROR and WARN only.

Monitoring Integrations

Third-party services:

Payments (Stripe):

Failed payments alert
Refund requests alert
Subscription cancellations (daily digest)

Email (SendGrid):

Delivery failures alert
Bounce rate elevated alert
Spam complaints alert

Database:

Connection pool exhausted
Slow queries (>1s)
Disk space low

Tell AI:

Add monitoring for [service]:
- Alert on failures
- Track success rate
- Log errors with context

Incident Response

When alerts fire:

Incident Response:
1. Acknowledge alert (mark as seen)
2. Assess severity:
   - Critical: Site down, payments failing
   - High: Errors affecting many users
   - Medium: Isolated issues
3. Immediate action:
   - Critical: Hotfix or rollback
   - High: Fix within hours
   - Medium: Fix in next deploy
4. Update users if needed
5. Post-mortem after resolved

Critical incidents:

1. Assess impact (how many affected?)
2. Quick fix or rollback
3. Deploy hotfix
4. Verify fixed
5. Monitor closely for hour
6. Update status page if you have one

Success Looks Like

â Know about issues before users report them
â Uptime >99.9%
â Errors caught and fixed quickly
â Performance trends stable or improving
â Daily monitoring routine (5 minutes)
â Alerts configured and actionable
â Issues resolved proactively

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台