infrastructure-monitoring-setup
npx skills add https://github.com/dawiddutoit/custom-claude --skill infrastructure-monitoring-setup
Works with infrastructure-monitor.sh script, systemd timer, ntfy.sh push notifications,
Infrastructure Monitoring Setup Skill
Complete setup and configuration of automated infrastructure monitoring with mobile push notifications and auto-recovery capabilities.
Quick Start
Quick setup for monitoring (5 minutes):
# 1. Create unique ntfy topic
TOPIC="infra-$(openssl rand -hex 8)"
echo "Your topic: $TOPIC"
# 2. Add to .env
echo "ALERT_ENABLED=true" >> /home/dawiddutoit/projects/network/.env
echo "NTFY_SERVER=https://ntfy.sh" >> /home/dawiddutoit/projects/network/.env
echo "NTFY_TOPIC=$TOPIC" >> /home/dawiddutoit/projects/network/.env
echo "AUTO_RECOVER=true" >> /home/dawiddutoit/projects/network/.env
# 3. Install systemd service
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.* /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now infrastructure-monitor.timer
# 4. Test
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
Then install the ntfy app on your phone and subscribe to your topic.
Table of Contents
- When to Use This Skill
- What This Skill Does
- Instructions
- 3.1 Install ntfy Mobile App
- 3.2 Configure Monitoring in .env
- 3.3 Install Systemd Timer
- 3.4 Test Monitoring and Alerts
- 3.5 Configure Home Assistant Integration (Optional)
- 3.6 Verify Auto-Recovery
- 3.7 View Monitoring Logs
- Supporting Files
- Expected Outcomes
- Requirements
- Red Flags to Avoid
When to Use This Skill
Explicit Triggers:
- “Setup monitoring”
- “Configure mobile alerts”
- “Enable auto-recovery”
- “Setup ntfy notifications”
- “Configure Home Assistant alerts”
Implicit Triggers:
- Want to be notified of infrastructure failures
- Need automated recovery for common issues
- Infrastructure has been down without detection
- Want proactive monitoring
Debugging Triggers:
- “Why am I not getting alerts?”
- “Is monitoring working?”
- “How to test notifications?”
What This Skill Does
- Mobile Alerts – Configures ntfy.sh push notifications to phone
- Auto-Recovery – Enables automatic fixes for common failures
- HA Integration – Optional Home Assistant notification integration
- Systemd Service – Installs timer to run monitoring every 5 minutes
- Tests Setup – Verifies notifications and recovery work
- Logs Access – Shows how to view monitoring logs
- Troubleshooting – Diagnoses alert delivery issues
Instructions
3.1 Install ntfy Mobile App
Install app:
- Android: https://play.google.com/store/apps/details?id=io.heckel.ntfy
- iOS: https://apps.apple.com/us/app/ntfy/id1625396347
Subscribe to topic:
- Open ntfy app
- Tap “+” to add subscription
- Enter topic: infra-YOUR-RANDOM-ID (you’ll generate this in step 3.2)
- Server: https://ntfy.sh
- Tap “Subscribe”
Note: You need the topic ID from step 3.2 before subscribing. Come back here after generating it.
3.2 Configure Monitoring in .env
Generate unique topic ID:
TOPIC="infra-$(openssl rand -hex 8)"
echo "Your unique topic: $TOPIC"
Save this topic ID – you’ll use it in the ntfy app.
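If `openssl` is not installed, a topic of the same shape can be produced from `/dev/urandom`. The sketch below is one way to do that; the function name `generate_topic` is hypothetical and not part of the project's scripts.

```shell
# Hypothetical helper: prints a topic like infra-<16 hex chars>.
generate_topic() {
  local hex
  if command -v openssl >/dev/null 2>&1; then
    hex="$(openssl rand -hex 8)"
  else
    # Fallback: read 8 random bytes and print them as lowercase hex.
    hex="$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
  fi
  printf 'infra-%s\n' "$hex"
}

TOPIC="$(generate_topic)"
echo "Your unique topic: $TOPIC"
```

Either branch yields 64 bits of randomness, which is what makes the topic hard to guess.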
Add monitoring configuration to .env:
# Navigate to project directory
cd /home/dawiddutoit/projects/network
# Add monitoring variables
cat >> .env << EOF
# Monitoring & Alerts
ALERT_ENABLED=true
NTFY_SERVER=https://ntfy.sh
NTFY_TOPIC=$TOPIC
AUTO_RECOVER=true
EOF
Verify configuration:
grep -A4 "Monitoring & Alerts" /home/dawiddutoit/projects/network/.env
Expected:
# Monitoring & Alerts
ALERT_ENABLED=true
NTFY_SERVER=https://ntfy.sh
NTFY_TOPIC=infra-a3f7d92b4c8e1f56
AUTO_RECOVER=true
Configuration options:
| Variable | Purpose | Default |
|---|---|---|
| ALERT_ENABLED | Enable mobile push notifications | false |
| NTFY_SERVER | ntfy server URL | https://ntfy.sh |
| NTFY_TOPIC | Unique topic for your alerts | None (required) |
| AUTO_RECOVER | Enable automatic recovery | true |
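The options above can be sanity-checked before the timer is installed. The following is a minimal sketch, not part of the project's scripts: `env_value` and `check_monitoring_env` are hypothetical helper names, and the checks only cover the rules stated in the table (a topic is required when alerts are enabled, and the server should be a URL).

```shell
# Hypothetical helper: read the last assignment of KEY from an env file
# without sourcing it (sourcing would execute whatever the file contains).
env_value() {
  grep -E "^$2=" "$1" 2>/dev/null | tail -n 1 | cut -d= -f2-
}

# Report problems with the monitoring settings; returns 0 when none are found.
check_monitoring_env() {
  local file="$1" problems=0
  if [ "$(env_value "$file" ALERT_ENABLED)" = "true" ] &&
     [ -z "$(env_value "$file" NTFY_TOPIC)" ]; then
    echo "ALERT_ENABLED=true but NTFY_TOPIC is empty or missing"
    problems=1
  fi
  case "$(env_value "$file" NTFY_SERVER)" in
    ''|http://*|https://*) ;;   # an empty value is left to the documented default
    *) echo "NTFY_SERVER should start with http:// or https://"; problems=1 ;;
  esac
  return "$problems"
}

# Usage (path as used elsewhere in this document):
# check_monitoring_env /home/dawiddutoit/projects/network/.env && echo "config looks OK"
```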
To disable auto-recovery but keep alerts:
# Edit .env
nano /home/dawiddutoit/projects/network/.env
# Change: AUTO_RECOVER=false
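The same change can be made without an editor. The sketch below assumes a hypothetical helper, `set_env_var`, that replaces an existing assignment instead of appending a second one, so running it twice does not leave duplicate lines in `.env`.

```shell
# Hypothetical helper: set KEY=VALUE in an env file, replacing any existing
# assignment so repeated runs do not leave duplicate lines.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp="$(mktemp)"
  # Keep every line except existing assignments of KEY, then append the new one.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (path as used elsewhere in this document):
# set_env_var /home/dawiddutoit/projects/network/.env AUTO_RECOVER false
```

Note that this moves the updated variable to the end of the file; the order of assignments does not matter for simple KEY=VALUE files.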
3.3 Install Systemd Timer
Install systemd service and timer to run monitoring every 5 minutes:
# Copy service files
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.service /etc/systemd/system/
sudo cp /home/dawiddutoit/projects/network/systemd/infrastructure-monitor.timer /etc/systemd/system/
# Reload systemd
sudo systemctl daemon-reload
# Enable and start timer
sudo systemctl enable infrastructure-monitor.timer
sudo systemctl start infrastructure-monitor.timer
Verify timer is active:
# Check timer status
systemctl list-timers infrastructure-monitor.timer
# Check service status
sudo systemctl status infrastructure-monitor.timer
Expected:
● infrastructure-monitor.timer - Run infrastructure monitoring every 5 minutes
Loaded: loaded (/etc/systemd/system/infrastructure-monitor.timer; enabled)
Active: active (waiting) since...
Timer configuration:
- Runs every 5 minutes
- Starts 1 minute after boot
- Persistent (survives reboots)
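For orientation, a timer unit with the first two properties might look like the sketch below. This is an assumption about the file's contents, not a copy of the project's `infrastructure-monitor.timer`; `systemctl cat infrastructure-monitor.timer` shows the real one. The sketch writes to a scratch directory (the hypothetical variable `UNIT_DIR`) so it can be inspected without touching `/etc/systemd/system`.

```shell
# Write an illustrative timer unit to a scratch directory (assumed contents).
UNIT_DIR="$(mktemp -d)"
cat > "$UNIT_DIR/infrastructure-monitor.timer" <<'EOF'
[Unit]
Description=Run infrastructure monitoring every 5 minutes

[Timer]
# First run 1 minute after boot
OnBootSec=1min
# Then again 5 minutes after the service was last started
OnUnitActiveSec=5min
Unit=infrastructure-monitor.service

[Install]
WantedBy=timers.target
EOF
cat "$UNIT_DIR/infrastructure-monitor.timer"
```

In systemd, `Persistent=` only takes effect for timers defined with `OnCalendar=`, so a timer written this way resumes after a reboot because it is enabled, not because of that setting.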
3.4 Test Monitoring and Alerts
Test monitoring script:
# Run monitoring manually
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
Expected output shows:
- Docker containers checked
- Tunnel connectivity tested
- Service health verified
- Network interface status
- Alert sent to ntfy topic
Test alert delivery:
Within 30 seconds of running the script, you should receive a push notification on your phone with the infrastructure status.
If no notification is received:
Check ntfy topic subscription:
# Test sending to the topic directly.
# $TOPIC must be set in this shell; otherwise use the NTFY_TOPIC value from .env.
curl -d "Test from infrastructure monitoring" "https://ntfy.sh/$TOPIC"
If direct curl works but monitoring doesn’t:
- Check ALERT_ENABLED=true in .env
- Verify NTFY_TOPIC matches app subscription
- Check script has network access
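When debugging delivery, it can help to send a message with an explicit title and priority and to see exactly what would be posted. The sketch below is a hypothetical helper (`send_ntfy_alert`), not the code used by infrastructure-monitor.sh; it relies on ntfy's `Title` and `Priority` request headers and adds an assumed `NTFY_DRY_RUN` switch that prints the request instead of sending it.

```shell
# Hypothetical helper: publish an alert to ntfy with a title and priority.
# Set NTFY_DRY_RUN=1 to print the request instead of sending it.
send_ntfy_alert() {
  local title="$1" priority="$2" message="$3" url
  if [ -z "${NTFY_TOPIC:-}" ]; then
    echo "NTFY_TOPIC is not set" >&2
    return 1
  fi
  url="${NTFY_SERVER:-https://ntfy.sh}/$NTFY_TOPIC"
  if [ "${NTFY_DRY_RUN:-0}" = "1" ]; then
    printf 'POST %s\nTitle: %s\nPriority: %s\n\n%s\n' "$url" "$title" "$priority" "$message"
    return 0
  fi
  # --fail turns HTTP errors into a non-zero exit; --max-time bounds a hung request.
  curl --silent --show-error --fail --max-time 10 \
    -H "Title: $title" -H "Priority: $priority" \
    -d "$message" "$url" >/dev/null
}

# Usage:
# NTFY_TOPIC="$TOPIC" send_ntfy_alert "Monitoring test" default "Test from infrastructure monitoring"
```

A non-zero exit status from the real request points at network access or the server URL rather than at the app subscription.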
3.5 Configure Home Assistant Integration (Optional)
Why use Home Assistant integration:
- Centralized home automation alerts
- Can trigger automations based on infrastructure status
- Redundancy with ntfy.sh
- Integration with existing HA notifications
Prerequisites:
- Home Assistant running and accessible
- HA mobile app installed (for notify.mobile_app_* service)
Step 1: Create Long-Lived Access Token
- Go to Home Assistant: http://192.168.68.123:8123
- Click your profile (bottom left)
- Scroll to “Long-Lived Access Tokens”
- Click “Create Token”
- Name: “Infrastructure Monitoring”
- Copy token (shown only once)
Step 2: Find Notification Service Name
- In Home Assistant: Developer Tools → Services
- Filter by “notify”
- Find your mobile app service:
notify.mobile_app_your_phone
Step 3: Add to .env
# Edit .env
nano /home/dawiddutoit/projects/network/.env
# Add HA configuration
HA_NOTIFICATIONS_ENABLED=true
HA_BASE_URL=http://192.168.68.123:8123
HA_ACCESS_TOKEN=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
HA_NOTIFY_SERVICE=notify.mobile_app_your_phone
Step 4: Test HA Notifications
# Run monitoring (should send to both ntfy and HA)
/home/dawiddutoit/projects/network/scripts/infrastructure-monitor.sh
Check that you receive a notification in the Home Assistant companion app.
Troubleshooting HA notifications:
# Test HA API access
curl -H "Authorization: Bearer YOUR_TOKEN" \
http://192.168.68.123:8123/api/
# Test notification service
curl -X POST \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"message": "Test from infrastructure monitoring"}' \
http://192.168.68.123:8123/api/services/notify/mobile_app_your_phone
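A common mistake is pasting the dotted service name (`notify.mobile_app_your_phone`) into the URL, whereas the REST endpoint separates the domain and the service with a slash. The sketch below shows the mapping; `ha_service_url` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: build the REST endpoint for a service such as
# notify.mobile_app_your_phone (domain "notify", service "mobile_app_your_phone").
ha_service_url() {
  local base="${1%/}" service="$2"
  printf '%s/api/services/%s/%s\n' "$base" "${service%%.*}" "${service#*.}"
}

ha_service_url "http://192.168.68.123:8123" "notify.mobile_app_your_phone"
# prints http://192.168.68.123:8123/api/services/notify/mobile_app_your_phone
```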
3.6 Verify Auto-Recovery
Monitor logs to see auto-recovery in action:
# View live monitoring logs
sudo journalctl -u infrastructure-monitor.service -f
# Or check persistent log
tail -f /var/log/infrastructure-monitor.log
Auto-recovery capabilities:
| Issue | Detection | Recovery Action |
|---|---|---|
| Stuck cloudflared | No registrations in 10 min | Restart cloudflared container |
| Docker network isolation | Ping fails between containers | Recreate bridge network |
| Inactive Ethernet | WiFi used instead of eth0 | Activate Ethernet connection |
| Service failures | HTTP health checks fail | Restart affected containers |
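Each row of the table follows the same pattern: detect, attempt a fix, then check again. The sketch below illustrates that pattern under stated assumptions; it is not the logic of infrastructure-monitor.sh. `check_and_recover` and the helper names in the usage comment are hypothetical, and the message prefixes only mirror the warning and critical levels this document mentions.

```shell
# Hypothetical sketch of one check/recover cycle. CHECK and RECOVER are names
# of commands or shell functions that take the target name as their argument.
check_and_recover() {
  local name="$1" check="$2" recover="$3"
  if "$check" "$name"; then
    echo "ok: $name"
    return 0
  fi
  if [ "${AUTO_RECOVER:-true}" != "true" ]; then
    echo "critical: $name failed and AUTO_RECOVER is off"
    return 2
  fi
  echo "warning: $name failed, attempting recovery"
  "$recover" "$name" || echo "warning: recovery command for $name returned an error"
  # Re-check so a recovery that did not help is escalated rather than assumed.
  if "$check" "$name"; then
    echo "recovered: $name"
    return 0
  fi
  echo "critical: $name still failing after recovery"
  return 2
}

# Example wiring for the cloudflared row (illustrative only):
# is_running() { [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]; }
# restart_container() { docker restart "$1" >/dev/null; }
# check_and_recover cloudflared is_running restart_container
```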
Test auto-recovery:
# Simulate a tunnel failure by stopping the container
docker stop cloudflared
# Wait 5 minutes (next monitoring run)
# Check logs - should show tunnel restarted
# Verify tunnel recovered
docker ps | grep cloudflared
# cloudflared logs may arrive on stderr, so merge it into stdout before grepping
docker logs cloudflared 2>&1 | grep "Registered tunnel"
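Rather than waiting a fixed five minutes, the verification step can poll until the container is back. This is a minimal sketch assuming a hypothetical helper, `wait_until`; the 420-second limit in the usage comment is an arbitrary choice slightly longer than one timer cycle.

```shell
# Hypothetical helper: run a command every INTERVAL seconds until it succeeds
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Usage: allow a bit more than one 5-minute timer cycle for the restart.
# container_running() { docker ps -q --filter name=cloudflared --filter status=running | grep -q .; }
# wait_until 420 15 container_running && echo "cloudflared is running again"
```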
3.7 View Monitoring Logs
View systemd service logs:
# Live monitoring logs
sudo journalctl -u infrastructure-monitor.service -f
# Last 50 lines
sudo journalctl -u infrastructure-monitor.service -n 50
# Logs from today
sudo journalctl -u infrastructure-monitor.service --since today
# Logs with timestamps
sudo journalctl -u infrastructure-monitor.service -o short-iso
View persistent log file:
# Live tail
tail -f /var/log/infrastructure-monitor.log
# Last 100 lines
tail -n 100 /var/log/infrastructure-monitor.log
# Search for errors
grep -i error /var/log/infrastructure-monitor.log
# Search for recoveries
grep -i "recovered" /var/log/infrastructure-monitor.log
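The two searches above can be combined into a quick summary. The sketch below assumes a hypothetical helper, `summarize_log`, and counts lines using the same case-insensitive patterns as the `grep` commands in this section; the real log format may call for different patterns.

```shell
# Hypothetical helper: count error and recovery lines in a log file.
summarize_log() {
  local file="$1" errors recoveries
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  errors="$(grep -ci 'error' "$file" 2>/dev/null || true)"
  recoveries="$(grep -ci 'recovered' "$file" 2>/dev/null || true)"
  # A missing or unreadable file yields empty counts; report those as 0.
  printf 'errors=%s recoveries=%s\n' "${errors:-0}" "${recoveries:-0}"
}

# Usage:
# summarize_log /var/log/infrastructure-monitor.log
```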
Check timer schedule:
# Show next run time
systemctl list-timers infrastructure-monitor.timer
# Show timer configuration
systemctl cat infrastructure-monitor.timer
Monitoring controls:
# Stop monitoring temporarily
sudo systemctl stop infrastructure-monitor.timer
# Restart monitoring
sudo systemctl start infrastructure-monitor.timer
# Disable monitoring (survives reboot)
sudo systemctl disable infrastructure-monitor.timer
# Re-enable monitoring
sudo systemctl enable infrastructure-monitor.timer
Supporting Files
| File | Purpose |
|---|---|
| references/reference.md | Monitoring architecture, recovery strategies, ntfy.sh details |
| examples/examples.md | Example configurations, alert formats, log outputs |
| scripts/test-notifications.sh | Test script for alert delivery |
Expected Outcomes
Success:
- ntfy app receives push notifications
- Monitoring runs every 5 minutes
- Auto-recovery fixes common failures within 5 minutes
- Logs show monitoring activity
- Home Assistant notifications working (if configured)
Partial Success:
- Monitoring runs but alerts not received (check topic subscription)
- Alerts received but auto-recovery disabled (set AUTO_RECOVER=true)
Failure Indicators:
- No notifications received after 10 minutes
- Timer not running (check systemctl status)
- Script fails with errors (check logs)
- HA notifications not working (check token/service name)
Requirements
- Infrastructure server running Linux with systemd
- Mobile device with ntfy app installed
- Internet connectivity for ntfy.sh
- .env file with monitoring configuration
- Home Assistant (optional, for HA integration)
Red Flags to Avoid
- Do not use a public or guessable ntfy topic (security risk)
- Do not share the ntfy topic publicly (anyone who knows it can subscribe)
- Do not disable monitoring without alternative alerting
- Do not ignore persistent alerts (investigate the root cause)
- Do not run the monitoring script too frequently (causes noise)
- Do not commit .env with the ntfy topic to git (privacy)
- Do not use AUTO_RECOVER=false without manual monitoring
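The point about committing `.env` can be guarded against with an ignore rule. The sketch below assumes the project directory is a git repository and uses a hypothetical helper, `ensure_gitignored`, that adds the entry only if it is not already present.

```shell
# Hypothetical helper: add an entry to DIR/.gitignore unless it is already there.
ensure_gitignored() {
  local dir="$1" entry="$2"
  touch "$dir/.gitignore"
  # -x matches whole lines, -F treats the entry as a literal string.
  if ! grep -qxF -- "$entry" "$dir/.gitignore"; then
    printf '%s\n' "$entry" >> "$dir/.gitignore"
  fi
}

# Usage:
# ensure_gitignored /home/dawiddutoit/projects/network .env
```

An ignore rule does not affect files git already tracks; if `.env` was committed earlier, `git rm --cached .env` is needed as well, and the topic should be treated as exposed.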
Notes
- Monitoring checks run every 5 minutes via systemd timer
- ntfy.sh is free and doesn’t require an account
- Topic ID should be random and private (security by obscurity)
- Auto-recovery attempts fixes before alerting as critical
- Alert levels: 🔴 Critical (manual intervention), ⚠️ Warning (recovery in progress)
- HA integration is optional and works alongside ntfy.sh
- Logs persist across reboots at /var/log/infrastructure-monitor.log
- Maximum detection time: 5 minutes (timer interval)
- Monitoring survives server reboots (systemd timer enabled)
- Use infrastructure-health-check skill for manual on-demand checks