rancher-cluster-inspection
npx skills add https://github.com/futuretea/rancher-assistant --skill rancher-cluster-inspection
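Rancher Cluster Inspection (Multi-Agent Parallel Edition)
Runs health inspections on Kubernetes clusters using a multi-agent parallel architecture. Six dimensions are inspected at the same time, which speeds up inspection and deepens coverage.
Architecture Overview
User request → Skill identifies the inspection type → dimension Agents start in parallel → reports are aggregated
    ├── cluster-info-inspector (cluster information)
    ├── node-health-inspector (node health)
    ├── capacity-inspector (resource capacity)
    ├── workload-inspector (workloads)
    ├── event-inspector (abnormal events)
    └── system-inspector (system components)
Available Sub-Agents (6 dimension Agents)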
1. rancher-cluster-info-inspector
Dimension: basic cluster information
Checks: cluster state, K8s version, project count, namespace count, Provider information
2. rancher-node-health-inspector
Dimension: node health
Checks: Ready status, MemoryPressure, DiskPressure, PIDPressure, Taints/Cordoned, kubelet version consistency
3. rancher-capacity-inspector
Dimension: resource capacity
Checks: CPU/memory requests, limits, and utilization; Pod count; overcommitment detection
4. rancher-workload-inspector
Dimension: workload health
Checks: Deployment/StatefulSet/DaemonSet availability, abnormal Pods, Pods with high restart counts
5. rancher-event-inspector
Dimension: abnormal events
Checks: Warning events, OOMKilling, FailedScheduling, Evicted, frequently repeated events
6. rancher-system-inspector
Dimension: system components
Checks: CoreDNS, kube-proxy, metrics-server, cattle-agent, fleet-agent, Ingress Controller
Decision Tree
User request:
├── "cluster inspection" / "health check" / "cluster checkup"
│   └── Start 6 dimension Agents in parallel (full inspection)
│
├── "quick check" / "take a quick look at the cluster status"
│   └── Start 3 dimension Agents in parallel (cluster-info + node-health + event)
│
├── "node inspection" / "check all nodes"
│   └── Start 2 dimension Agents in parallel (node-health + capacity)
│
├── "workload inspection" / "application health check"
│   └── Start 2 dimension Agents in parallel (workload + event)
│
├── "event inspection" / "check abnormal events"
│   └── Start 1 dimension Agent (event)
│
├── "system inspection" / "system component inspection" / "check system components"
│   └── Start 1 dimension Agent (system)
│
├── "inspect all clusters" / "check up on every cluster"
│   └── Fetch the cluster list → start 6 dimension Agents in parallel for each cluster
│
├── "pre-change check"
│   └── Start 6 dimension Agents in parallel (full inspection, recorded as the baseline)
│
└── "post-change check"
    └── Start 6 dimension Agents in parallel (full inspection, compared against the baseline)
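The routing in the decision tree can be sketched as a phrase lookup. This is only an illustration: `TRIGGERS` and `resolveScope` are hypothetical names, the phrases are the English triggers listed above, and the scope labels `all-clusters`, `pre-change`, and `post-change` are assumptions added for the sketch (the remaining labels follow the skill's scope table).

```javascript
// Hypothetical trigger table. More specific phrases come first so that
// "inspect all clusters" is not swallowed by a broader match.
const TRIGGERS = [
  { scope: "all-clusters", phrases: ["inspect all clusters"] },
  { scope: "pre-change", phrases: ["pre-change check"] },
  { scope: "post-change", phrases: ["post-change check"] },
  { scope: "quick", phrases: ["quick check"] },
  { scope: "nodes", phrases: ["node inspection"] },
  { scope: "workloads", phrases: ["workload inspection"] },
  { scope: "events", phrases: ["event inspection"] },
  { scope: "system", phrases: ["system inspection"] },
  { scope: "full", phrases: ["cluster inspection"] },
];

// Return the first scope whose phrase appears in the request,
// or null when nothing matches.
function resolveScope(request) {
  const text = request.toLowerCase();
  for (const { scope, phrases } of TRIGGERS) {
    if (phrases.some((p) => text.includes(p))) return scope;
  }
  return null;
}
```

In practice the Skill relies on the model to interpret the request, so a table like this is best read as a summary of the intended routing rather than a literal matcher.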
Parallel Execution Modes
Mode 1: Full single-cluster inspection (6 Agents in parallel)
User: "Run a full inspection of the production cluster"
→ Step 1: Determine the cluster ID (search with cluster_list if needed)
→ Step 2: Start 6 dimension Agents at the same time
    Agent 1: rancher-cluster-info-inspector (cluster c-abc123)
    Agent 2: rancher-node-health-inspector (cluster c-abc123)
    Agent 3: rancher-capacity-inspector (cluster c-abc123)
    Agent 4: rancher-workload-inspector (cluster c-abc123)
    Agent 5: rancher-event-inspector (cluster c-abc123)
    Agent 6: rancher-system-inspector (cluster c-abc123)
→ Step 3: Aggregate the 6 dimension reports, compute the overall score, and generate the full inspection report
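Mode 2: Parallel multi-cluster inspection (N clusters × 6 Agents)
User: "Inspect all clusters"
→ Step 1: Call cluster_list to fetch all clusters
→ Step 2: Start 6 dimension Agents for every cluster at the same time
    Cluster production (c-abc123): 6 dimension Agents
    Cluster staging (c-def456): 6 dimension Agents
    Cluster dev (c-ghi789): 6 dimension Agents
    (18 Agents run in parallel in total)
→ Step 3: Aggregate the inspection report of each cluster separately
→ Step 4: Generate the multi-cluster inspection overview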
Mode 3: Quick inspection (3 Agents in parallel)
User: "Do a quick check of the production cluster"
→ Start 3 dimension Agents in parallel:
    Agent 1: rancher-cluster-info-inspector
    Agent 2: rancher-node-health-inspector
    Agent 3: rancher-event-inspector
→ Aggregate the report (3 dimensions only)
Mode 4: Namespace-scoped inspection
User: "Inspect the app and monitoring namespaces of the production cluster"
→ Start the dimension Agents in parallel (passing the namespaces parameter):
    Agent 1: rancher-workload-inspector (namespaces: ["app", "monitoring"])
    Agent 2: rancher-event-inspector (namespaces: ["app", "monitoring"])
→ Focus the results on the specified namespaces
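Mode 5: Before-and-after change comparison
User: "Run a pre-change inspection"
→ Start 6 dimension Agents in parallel (full inspection)
→ Save the report as the baseline
User (after the change): "Run the post-change check"
→ Start 6 dimension Agents in parallel (full inspection)
→ Compare against the saved baseline and highlight what changed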
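The baseline comparison in Mode 5 could look roughly like the sketch below. It assumes each run is an array of dimension reports carrying the `dimension` and `score` fields named in the agent prompts, with `score` as a letter grade; `diffReports` is a hypothetical helper, not part of the skill.

```javascript
// Compare two inspection runs dimension by dimension and list what changed.
function diffReports(baseline, current) {
  const before = new Map(baseline.map((r) => [r.dimension, r.score]));
  const changes = [];
  for (const { dimension, score } of current) {
    const previous = before.get(dimension);
    if (previous !== score) {
      // `before` is null when the dimension was absent from the baseline.
      changes.push({ dimension, before: previous ?? null, after: score });
    }
  }
  return changes;
}

const baseline = [
  { dimension: "node-health", score: "A" },
  { dimension: "workload", score: "A" },
];
const afterChange = [
  { dimension: "node-health", score: "A" },
  { dimension: "workload", score: "B" },
];
console.log(diffReports(baseline, afterChange));
// → [{ dimension: "workload", before: "A", after: "B" }]
```

Only grades are compared here; a fuller comparison would also diff the issue lists.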
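Workflow
Step 1: Identify the inspection type
- Full, quick, or targeted inspection?
- Single cluster or multiple clusters?
- Are specific namespaces requested?
Step 2: Get cluster information
If the user gives a cluster name rather than an ID:
→ Search with cluster_list (name: "keyword")
→ Take the ID of the matching cluster
If the user asks to inspect "all clusters":
→ Use cluster_list to fetch the full list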
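Name-to-ID resolution in Step 2 might be handled as in the following sketch. The shape of the `cluster_list` response is not documented here, so the `{ id, name }` objects are an assumption, and `resolveClusterId` is a hypothetical helper.

```javascript
// Pick the cluster ID for a user-supplied name.
// Prefers an exact (case-insensitive) name match, then falls back to a substring match.
function resolveClusterId(clusters, keyword) {
  const needle = keyword.toLowerCase();
  const exact = clusters.filter((c) => c.name.toLowerCase() === needle);
  const matches =
    exact.length > 0
      ? exact
      : clusters.filter((c) => c.name.toLowerCase().includes(needle));
  if (matches.length === 1) return matches[0].id;
  if (matches.length === 0) throw new Error(`No cluster matches "${keyword}"`);
  // More than one candidate: ask the user instead of guessing.
  const names = matches.map((c) => c.name).join(", ");
  throw new Error(`Ambiguous cluster name "${keyword}": ${names}`);
}
```

Failing on an ambiguous match keeps an inspection from silently running against the wrong cluster.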
Step 3: Start the dimension Agents in parallel
Full inspection (6 Agents in parallel):
// Start all 6 dimension Agents at the same time.
// `cluster` is the cluster ID and `name` is the cluster's display name.
const tasks = [
  Task({
    subagent_type: "general-purpose",
    description: "Inspect basic cluster information",
    prompt: `You are rancher-cluster-info-inspector. Run the basic cluster information inspection on cluster ${cluster} (${name}). Check cluster state, K8s version, and the project and namespace overview. Return a standardized dimension report (with dimension, score, status, items, issues, recommendations).`
  }),
  Task({
    subagent_type: "general-purpose",
    description: "Inspect node health",
    prompt: `You are rancher-node-health-inspector. Run the node health inspection on cluster ${cluster} (${name}). Check Ready status, Conditions, Taints, and kubelet version consistency. Return a standardized dimension report.`
  }),
  Task({
    subagent_type: "general-purpose",
    description: "Inspect resource capacity",
    prompt: `You are rancher-capacity-inspector. Run the resource capacity inspection on cluster ${cluster} (${name}). Check CPU/memory requests, limits, and utilization, Pod count, and overcommitment. Return a standardized dimension report.`
  }),
  Task({
    subagent_type: "general-purpose",
    description: "Inspect workloads",
    prompt: `You are rancher-workload-inspector. Run the workload health inspection on cluster ${cluster} (${name}). Check Deployment/StatefulSet/DaemonSet availability and abnormal Pods. Return a standardized dimension report.`
  }),
  Task({
    subagent_type: "general-purpose",
    description: "Inspect abnormal events",
    prompt: `You are rancher-event-inspector. Run the abnormal event inspection on cluster ${cluster} (${name}). Check Warning events and key events such as OOMKilling and FailedScheduling. Return a standardized dimension report.`
  }),
  Task({
    subagent_type: "general-purpose",
    description: "Inspect system components",
    prompt: `You are rancher-system-inspector. Run the system component inspection on cluster ${cluster} (${name}). Check the state of the core components in kube-system and cattle-system. Return a standardized dimension report.`
  })
];
const results = await Promise.all(tasks);
Quick inspection (3 Agents in parallel):
const tasks = [
  Task({ ... description: "Inspect basic cluster information", prompt: "rancher-cluster-info-inspector ..." }),
  Task({ ... description: "Inspect node health", prompt: "rancher-node-health-inspector ..." }),
  Task({ ... description: "Inspect abnormal events", prompt: "rancher-event-inspector ..." })
];
Multi-cluster inspection (N × 6 Agents in parallel):
const clusters = await cluster_list();
const tasks = clusters.flatMap(c => [
  Task({ ... prompt: `rancher-cluster-info-inspector for ${c.id}` }),
  Task({ ... prompt: `rancher-node-health-inspector for ${c.id}` }),
  Task({ ... prompt: `rancher-capacity-inspector for ${c.id}` }),
  Task({ ... prompt: `rancher-workload-inspector for ${c.id}` }),
  Task({ ... prompt: `rancher-event-inspector for ${c.id}` }),
  Task({ ... prompt: `rancher-system-inspector for ${c.id}` })
]);
const results = await Promise.all(tasks);
// Group and aggregate the results by cluster
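One way to do the per-cluster grouping is sketched below. It relies on `flatMap` and `Promise.all` both preserving order, so each cluster owns a run of consecutive entries in `results`. `groupByCluster` is a hypothetical helper, and the only field assumed on a cluster object is `id`, as used in the snippet above.

```javascript
// Split the flat results array back into one bucket per cluster.
// Each cluster contributes `perCluster` consecutive results (6 for a full inspection).
function groupByCluster(clusters, results, perCluster = 6) {
  const grouped = {};
  clusters.forEach((c, i) => {
    grouped[c.id] = results.slice(i * perCluster, (i + 1) * perCluster);
  });
  return grouped;
}
```

If every dimension report also carried its cluster ID, grouping by that field would be more robust than grouping by position.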
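Step 4: Aggregate the inspection report
- Collect the results returned by every dimension Agent
- Build the score overview table
- Merge the detailed check results of each dimension
- Merge the issue lists (sorted by severity)
- Merge the improvement recommendations (sorted by priority)
- Compute the overall score (the lowest score across all dimensions)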
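The aggregation rules above can be sketched as follows. Several details are assumptions made for illustration: the grade scale beyond the A, B, and C that appear in the sample reports, the severity labels, and the `{ severity, message }` shape of an issue. `aggregate` is a hypothetical name.

```javascript
// Grades ordered from best to worst (D and F are assumed).
const GRADE_ORDER = ["A", "B", "C", "D", "F"];
// Hypothetical severity labels, most severe first.
const SEVERITY_ORDER = ["critical", "warning", "info"];

function aggregate(reports) {
  if (reports.length === 0) return { overall: null, issues: [] };
  // The overall grade is the lowest (worst) grade among the dimensions.
  const worstIndex = Math.max(...reports.map((r) => GRADE_ORDER.indexOf(r.score)));
  // Merge every dimension's issues, tag them with their dimension,
  // and sort the most severe first.
  const issues = reports
    .flatMap((r) => (r.issues || []).map((issue) => ({ dimension: r.dimension, ...issue })))
    .sort(
      (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity)
    );
  return { overall: GRADE_ORDER[worstIndex], issues };
}
```

With dimension grades A, B, A, B, A, A this yields an overall B, which matches the sample single-cluster report.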
Response Format
Single-cluster inspection report
## Cluster Inspection Report: production (c-abc123)
### Inspection Overview
- Inspection time: 2025-01-15 10:30
- Inspection scope: full inspection (6 dimensions in parallel)
- **Overall score: B (Good)**
### Score Overview
| Dimension | Agent | Score | Status |
|------|-------|------|------|
| Basic cluster information | cluster-info-inspector | A | ✅ OK |
| Node health | node-health-inspector | B | ⚠️ Attention |
| Resource capacity | capacity-inspector | A | ✅ OK |
| Workload health | workload-inspector | B | ⚠️ Attention |
| Abnormal events | event-inspector | A | ✅ OK |
| System components | system-inspector | A | ✅ OK |
### Issue List
| Severity | Dimension | Issue | Recommendation |
|----------|------|------|------|
| ⚠️ | Node health | node-5 NotReady | Check kubelet |
| ⚠️ | Workloads | 2 Pods in CrashLoopBackOff | Check the logs |
### Improvement Recommendations
1. **[Urgent]** Fix node-5
2. **[Recommended]** Investigate the crashing Pods
Multi-cluster inspection overview
## Multi-Cluster Inspection Overview
| Cluster | Score | Cluster info | Nodes | Capacity | Workloads | Events | System | Key issues |
|------|------|----------|------|------|----------|------|------|----------|
| production | B | ✅ | ⚠️ | ✅ | ⚠️ | ✅ | ✅ | 1 node NotReady |
| staging | A | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | None |
| dev | C | ✅ | ⚠️ | ⚠️ | ⚠️ | ⚠️ | ✅ | Insufficient capacity |
### Cluster Details
[Individual inspection report for each cluster...]
Inspection Scope → Agent Quick Reference
| Scope | Agent count | Dimension Agents |
|---|---|---|
| full | 6 | cluster-info + node-health + capacity + workload + event + system |
| quick | 3 | cluster-info + node-health + event |
| nodes | 2 | node-health + capacity |
| workloads | 2 | workload + event |
| events | 1 | event |
| system | 1 | system |
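The table above translates directly into a lookup. The sketch below copies the mapping as data and expands each short dimension name into the full sub-agent name; `SCOPE_AGENTS` and `agentsForScope` are hypothetical names.

```javascript
// Scope-to-agent mapping, copied from the quick reference table.
const SCOPE_AGENTS = {
  full: ["cluster-info", "node-health", "capacity", "workload", "event", "system"],
  quick: ["cluster-info", "node-health", "event"],
  nodes: ["node-health", "capacity"],
  workloads: ["workload", "event"],
  events: ["event"],
  system: ["system"],
};

// Expand a scope into full sub-agent names such as "rancher-event-inspector".
function agentsForScope(scope) {
  const dims = SCOPE_AGENTS[scope];
  if (!Array.isArray(dims)) throw new Error(`Unknown inspection scope: ${scope}`);
  return dims.map((d) => `rancher-${d}-inspector`);
}
```

Keeping the mapping in one place means the decision tree and the parallel launch step cannot drift apart.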
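Inspection Best Practices
- Daily: run a quick inspection (3 Agents) every day, focusing on nodes and events
- Weekly: run a full inspection (6 Agents) every week to cover all dimensions
- Around changes: run a full inspection before and after each major change and compare the differences
- Event-driven: after receiving an alert, run the matching targeted inspection
- Multi-cluster: periodically run a full inspection on all clusters to build health trends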
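Error Handling
- Dimension Agent failure: mark that dimension as "inspection failed" in the report; the scores of the other dimensions are unaffected
- metrics-server not installed: capacity-inspector skips actual utilization and notes this in the report
- Cluster unreachable: mark the inspection as failed and report the cluster connectivity problem
- Insufficient permissions: each Agent inspects whatever resources it can access and notes the permission limits
- Incomplete data: generate the report from the available data and flag the missing items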
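The first rule, isolating a failed dimension Agent, fits `Promise.allSettled` better than `Promise.all`, which rejects as soon as any task fails. A minimal sketch follows; the `"failed"` status string and the placeholder report shape are assumptions, and `collectDimensionResults` and `runTask` are hypothetical names.

```javascript
// Convert Promise.allSettled output into per-dimension results.
// A rejected agent becomes a placeholder marked as failed, so the
// remaining dimensions can still be scored and reported.
function collectDimensionResults(agentNames, settled) {
  return settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? outcome.value
      : {
          dimension: agentNames[i],
          status: "failed",
          score: null,
          issues: [],
          error: String(outcome.reason),
        }
  );
}

// Usage sketch (runTask stands in for launching one dimension agent):
// const settled = await Promise.allSettled(agentNames.map((n) => runTask(n)));
// const reports = collectDimensionResults(agentNames, settled);
```

Placeholders with a `null` score should be left out when the overall grade is computed, so that a failed dimension does not change the other dimensions' scoring.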
Relationship to Other Skills
| Inspection finding | Follow-up action | Skill to use |
|---|---|---|
| Node NotReady | Analyze the node in depth | capacity-analysis |
| Pod CrashLoopBackOff | Diagnose the Pod | resource-troubleshooting |
| Deployment unavailable | Review deployment changes | deployment-management |
| Insufficient resources | Capacity planning | capacity-analysis |
| Suspicious events | Trace resource changes | resource-discovery |