azure-infra-engineer
npx skills add https://github.com/404kidwiz/claude-supercode-skills --skill azure-infra-engineer
Azure Infrastructure Engineer
Purpose
Provides Microsoft Azure cloud expertise specializing in Bicep/ARM templates, Enterprise Landing Zones, and Cloud Adoption Framework (CAF) implementations. Designs and deploys enterprise-grade Azure environments with governance, networking, and infrastructure as code.
When to Use
- Deploying Azure resources using Bicep or ARM templates
- Designing Hub-and-Spoke network topologies (Virtual WAN, ExpressRoute)
- Implementing Azure Policy and Management Groups (Governance)
- Migrating workloads to Azure (Azure Site Recovery, Azure Migrate)
- Automating Azure DevOps pipelines for infrastructure
- Configuring Microsoft Entra ID (formerly Azure Active Directory) RBAC and PIM
2. Decision Framework
IaC Tool Selection (Azure Context)
| Tool | Status | Recommendation |
|---|---|---|
| Bicep | Recommended | Native, first-class support, concise syntax. |
| Terraform | Alternative | Best for multi-cloud strategies. |
| ARM Templates | Legacy | Verbose JSON. Avoid for new projects (compile Bicep instead). |
| PowerShell/CLI | Scripting | Use for ad-hoc tasks or pipeline glue, not state management. |
Networking Architecture
What is the connectivity need?
│
├─ **Hub-and-Spoke** (Standard)
│  ├─ Central Hub: Firewall, VPN Gateway, Bastion
│  └─ Spokes: Workload VNets (Peered to Hub)
│
├─ **Virtual WAN** (Global Scale)
│  ├─ Multi-region connectivity? → **Yes**
│  └─ Branch-to-Branch (SD-WAN)? → **Yes**
│
└─ **Private Access**
   ├─ PaaS Services? → **Private Link / Private Endpoints**
   └─ Service Endpoints? → Legacy (Use Private Link where possible)
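The "Spokes peered to Hub" branch can be sketched in Bicep as a pair of peerings, one in each direction. This is a minimal sketch, not a reference implementation: the parameter names are hypothetical, and it assumes both VNets already exist in the resource group the file is deployed to (in a real landing zone the hub usually lives in a separate connectivity subscription).

```bicep
// Hypothetical parameters: names of two existing VNets in this resource group.
param hubVnetName string
param spokeVnetName string

resource hubVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: hubVnetName
}

resource spokeVnet 'Microsoft.Network/virtualNetworks@2023-09-01' existing = {
  name: spokeVnetName
}

// Hub -> spoke peering.
resource hubToSpoke 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-09-01' = {
  parent: hubVnet
  name: 'peer-hub-to-${spokeVnetName}'
  properties: {
    remoteVirtualNetwork: {
      id: spokeVnet.id
    }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    // Gateway sharing is left off here; enable it once the hub has a gateway.
    allowGatewayTransit: false
    useRemoteGateways: false
  }
}

// Spoke -> hub peering. A peering only becomes connected when both sides exist.
resource spokeToHub 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-09-01' = {
  parent: spokeVnet
  name: 'peer-${spokeVnetName}-to-hub'
  properties: {
    remoteVirtualNetwork: {
      id: hubVnet.id
    }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
    allowGatewayTransit: false
    useRemoteGateways: false
  }
}
```

Spokes are deliberately not peered with each other; spoke-to-spoke traffic is expected to route through the hub firewall.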
Governance Strategy (CAF)
- Management Groups: Hierarchy for policy inheritance (Root > Geo > Landing Zones).
- Azure Policy: Use the “Deny” effect to block non-compliant resources (e.g., allow deployments only in the East US region).
- RBAC: Least privilege access via Entra ID Groups.
- Blueprints: Rapid deployment of compliant environments. Azure Blueprints is being retired; prefer Template Specs and Deployment Stacks for new work.
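As one illustration of the Azure Policy item, the built-in “Allowed locations” definition can be assigned at management group scope so that every subscription below it inherits the restriction. This is a sketch under assumptions: the assignment name and the default region list are illustrative, and the file is assumed to be deployed with `az deployment mg create --management-group-id <id> --location <region> --template-file <file>`.

```bicep
targetScope = 'managementGroup'

// Illustrative default; adjust to the regions your organization permits.
param allowedLocations array = [
  'eastus'
]

// Resource ID of the built-in "Allowed locations" policy definition.
var allowedLocationsDefinitionId = '/providers/Microsoft.Authorization/policyDefinitions/e56962a6-4747-49cd-b67b-bf8b01975c4c'

resource allowedLocationsAssignment 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  // Assignment names at management group scope are limited to 24 characters.
  name: 'allowed-locations'
  properties: {
    displayName: 'Allowed locations'
    policyDefinitionId: allowedLocationsDefinitionId
    parameters: {
      listOfAllowedLocations: {
        value: allowedLocations
      }
    }
  }
}
```

The built-in definition denies resource creation in any region outside the list, which is what makes it a guardrail rather than an audit-only report.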
Red Flags → Escalate to security-engineer:
- Public access enabled on Storage Accounts or SQL Databases
- Management Ports (RDP/SSH) open to internet
- Subscription Owner permissions granted to individual users (use the Contributor role or PIM-eligible assignments instead)
- No cost controls/budgets configured
4. Core Workflows
Workflow 1: Bicep Resource Deployment
Goal: Deploy a secure Storage Account with Private Endpoint.
Steps:
- Define Bicep Module (modules/storage.bicep)

      // Location defaults to the resource group's region.
      param location string = resourceGroup().location
      param name string

      resource stg 'Microsoft.Storage/storageAccounts@2023-01-01' = {
        name: name
        location: location
        sku: {
          name: 'Standard_LRS'
        }
        kind: 'StorageV2'
        properties: {
          minimumTlsVersion: 'TLS1_2'
          supportsHttpsTrafficOnly: true
          publicNetworkAccess: 'Disabled' // Secure by default
        }
      }

      output id string = stg.id

- Main Deployment (main.bicep)

      module storage './modules/storage.bicep' = {
        name: 'deployStorage'
        params: {
          name: 'stappprod001'
        }
      }

- Deploy via CLI

      az deployment group create --resource-group rg-prod --template-file main.bicep
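The workflow goal mentions a Private Endpoint, which the steps above do not define. A minimal sketch of one follows, written as a separate module. The file name (`modules/private-endpoint.bicep`), the endpoint name, and the `subnetId` parameter are hypothetical, and the subnet is assumed to exist already. Private DNS zone integration is omitted and would be needed for name resolution.

```bicep
// modules/private-endpoint.bicep (hypothetical file name)
param location string = resourceGroup().location

// Resource ID of the storage account, e.g. the `id` output of the storage module.
param storageAccountId string

// Resource ID of an existing subnet that will host the endpoint (assumption).
param subnetId string

resource blobEndpoint 'Microsoft.Network/privateEndpoints@2023-09-01' = {
  name: 'pe-stappprod001-blob'
  location: location
  properties: {
    subnet: {
      id: subnetId
    }
    privateLinkServiceConnections: [
      {
        name: 'blob'
        properties: {
          privateLinkServiceId: storageAccountId
          // 'blob' targets the Blob service; other sub-resources need their own endpoint.
          groupIds: [
            'blob'
          ]
        }
      }
    ]
  }
}
```

In `main.bicep`, this module could be called with `storageAccountId: storage.outputs.id` so the endpoint is created after the storage account.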
Workflow 3: Landing Zone Setup (CAF)
Goal: Establish the foundational hierarchy.
Steps:
- Create Management Groups
  - MG-Root
    - MG-Platform (Identity, Connectivity, Management)
    - MG-LandingZones (Online, Corp)
    - MG-Sandbox (Playground)
- Assign Policies
  - Assign “Allowed Locations” to MG-Root.
  - Assign “Enable Azure Monitor” to MG-LandingZones.
- Deploy Hub Network
  - Deploy VNet in connectivity subscription.
  - Deploy Azure Firewall and VPN Gateway.
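The first step, creating the management group hierarchy, might look like the following tenant-scope Bicep file. It is a sketch that assumes the three groups sit directly under `MG-Root` and that the deploying identity has permission to deploy at tenant scope (for example via `az deployment tenant create`). The second-level groups in parentheses (Identity, Online, and so on) are left out for brevity.

```bicep
targetScope = 'tenant'

// Top of the hierarchy; policies assigned here are inherited by everything below.
resource root 'Microsoft.Management/managementGroups@2021-04-01' = {
  name: 'MG-Root'
  properties: {
    displayName: 'MG-Root'
  }
}

var childGroupNames = [
  'MG-Platform'
  'MG-LandingZones'
  'MG-Sandbox'
]

// Each child group names MG-Root as its parent.
resource childGroups 'Microsoft.Management/managementGroups@2021-04-01' = [for mgName in childGroupNames: {
  name: mgName
  properties: {
    displayName: mgName
    details: {
      parent: {
        id: root.id
      }
    }
  }
}]
```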
5. Anti-Patterns & Gotchas
Anti-Pattern 1: “ClickOps”
What it looks like:
- Creating resources manually in the Azure Portal.
Why it fails:
- Unrepeatable.
- Configuration drift.
- Disaster recovery is impossible (no code to redeploy).
Correct approach:
- Everything as Code: Even if prototyping, export the ARM template or write basic Bicep.
Anti-Pattern 2: One Giant Resource Group
What it looks like:
- rg-production contains VNets, VMs, Databases, and Web Apps for 5 different projects.
Why it fails:
- IAM nightmare (cannot grant access to Project A without Project B).
- Tagging and cost analysis becomes difficult.
- Risk of accidental deletion.
Correct approach:
- Lifecycle Grouping: Group resources that share a lifecycle (e.g., rg-network, rg-app1-prod, rg-app1-dev).
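Lifecycle grouping can itself be expressed as code. The sketch below creates the three example resource groups from a subscription-scope deployment; the default region is an illustrative assumption.

```bicep
targetScope = 'subscription'

// Illustrative default region.
param location string = 'eastus'

// One resource group per lifecycle: shared network, then one per app and environment.
var resourceGroupNames = [
  'rg-network'
  'rg-app1-prod'
  'rg-app1-dev'
]

resource resourceGroups 'Microsoft.Resources/resourceGroups@2022-09-01' = [for rgName in resourceGroupNames: {
  name: rgName
  location: location
}]
```

With this split, access to `rg-app1-dev` can be granted without touching production, and deleting a dev environment is a single resource group deletion.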
Anti-Pattern 3: Ignoring Naming Conventions
What it looks like:
- myvm1, test-storage, sql-server.
Why it fails:
- Cannot identify resource type, environment, or region from name.
- Name collisions (Storage accounts must be globally unique).
Correct approach:
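- CAF Naming Standard: [Resource Type]-[Workload]-[Environment]-[Region]-[Instance]
- Example: vnet-myapp-prod-eus-001 (Virtual Network, MyApp, Prod, East US, 001).
- Storage accounts allow only lowercase letters and digits (3-24 characters), so drop the hyphens: stmyappprodeus001.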
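A naming pattern like this can be centralized in Bicep variables so that modules do not hand-assemble names. The sketch below uses hypothetical parameter names and defaults. Note that storage account names cannot contain hyphens, so that name is built from a compacted suffix.

```bicep
// Hypothetical parameters describing the workload.
param workload string = 'myapp'
param env string = 'prod'
param regionCode string = 'eus'
param instance string = '001'

// Shared suffix: [Workload]-[Environment]-[Region]-[Instance]
var suffix = '${workload}-${env}-${regionCode}-${instance}'

// Most resource types accept hyphens.
var vnetName = 'vnet-${suffix}'

// Storage accounts: 3-24 characters, lowercase letters and digits only.
var compactSuffix = replace(suffix, '-', '')
var storageName = toLower(take('st${compactSuffix}', 24))

output vnetName string = vnetName       // vnet-myapp-prod-eus-001 with the defaults
output storageName string = storageName // stmyappprodeus001 with the defaults
```

Storage account names must also be globally unique, so real deployments often append a short unique token; that is left out here.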
7. Quality Checklist
Governance:
- Naming: Resources follow CAF naming conventions.
- Tagging: Resources tagged with CostCenter, Environment, Owner.
- Policies: Azure Policy enforces compliance (e.g., allowed SKUs).
Security:
- Network: No public IPs on backend resources (VMs, DBs).
- Identity: Managed Identities used instead of Service Principals/Keys where possible.
- Encryption: CMK (Customer Managed Keys) enabled for sensitive data.
Reliability:
- Availability Zones: Critical resources deployed zone-redundant (e.g., ZRS for storage).
- Backup: Azure Backup enabled for VMs and SQL.
- Locks: Resource Locks (CanNotDelete) on critical production resources.
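A `CanNotDelete` lock can be declared next to the resource it protects. The following sketch assumes an existing storage account in the target resource group; the parameter name and lock name are hypothetical.

```bicep
// Name of an existing storage account in this resource group (assumption).
param storageAccountName string

resource stg 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: storageAccountName
}

// The lock is an extension resource scoped to the storage account.
resource deleteLock 'Microsoft.Authorization/locks@2020-05-01' = {
  name: 'lock-${storageAccountName}-nodelete'
  scope: stg
  properties: {
    level: 'CanNotDelete'
    notes: 'Critical production resource; remove the lock before deleting.'
  }
}
```

The lock blocks deletion but still allows reads and updates, so routine configuration changes keep working.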
Cost:
- Sizing: Resources right-sized based on metrics.
- Reservations: Reserved Instances purchased for steady workloads.
- Cleanup: Unused resources (orphaned disks/NICs) deleted.
Examples
Example 1: Multi-Subscription Landing Zone Setup
Scenario: A healthcare company needs to deploy a compliant landing zone for HIPAA-regulated workloads across three environments (dev, staging, prod).
Architecture:
- Management Group Hierarchy: Root > Organization > Environments > Workloads
- Network Design: Hub-and-spoke with Azure Firewall, separate VNets per environment
- Policy Enforcement: Azure Policy to enforce HIPAA compliance (encryption, backup, private endpoints)
- CI/CD Pipeline: Azure DevOps pipeline with approval gates for prod deployments
Key Components:
- Azure Firewall Manager for centralized policy
- Private DNS Zones for app-internal resolution
- Azure Backup with immutable vaults for compliance
- Cost Management tags for departmental chargebacks
Example 2: Zero-Trust Network Architecture
Scenario: A financial services firm needs to replace their VPN-based access with a Zero Trust architecture using Azure Private Link and Conditional Access.
Implementation:
- Private Endpoints: All PaaS services accessed via Private Endpoints (SQL, Storage, Key Vault)
- Identity-Based Access: Conditional Access policies requiring compliant device and MFA
- Micro-segmentation: NSG rules denying all traffic by default, allowing only required flows
- Monitoring: Microsoft Sentinel (formerly Azure Sentinel) for security analytics and anomaly detection
Security Controls:
- Microsoft Entra Conditional Access with device compliance
- Just-In-Time VM access for administration
- Microsoft Defender for Cloud threat protection
- Comprehensive audit logging to Log Analytics
Example 3: Cost-Optimized Dev/Test Environment
Scenario: A software company wants to reduce their Azure dev/test environment costs by 60% while maintaining developer productivity.
Optimization Strategy:
- Auto-Shutdown: Dev VMs auto-shutdown evenings and weekends via Automation Runbooks
- Reserved Capacity: Prod-like dev environments use Reserved Instances
- Dev-Optimized SKUs: Development uses Dev/Test SKUs where available
- Tagging and Governance: Required tags for cost allocation, orphaned resource cleanup
Cost Savings Results:
- 65% reduction in dev/test compute costs
- Automated cleanup of unused resources saving $2K/month
- Reserved Instance savings for stable environments
- Developer productivity maintained with auto-start capabilities
Best Practices
Infrastructure as Code
- Everything as Code: Every resource defined in Bicep, never manual portal changes
- Module Library: Create reusable Bicep modules for common patterns
- Parameter Files: Separate parameter files per environment (dev, staging, prod)
- GitOps Workflow: Infrastructure changes via PR and approval process
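- State Management: Bicep keeps no state file (Azure Resource Manager holds the deployed state, and Deployment Stacks can track managed resources); with Terraform, use a remote backend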
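For the per-environment parameter files mentioned above, a `.bicepparam` file is one option. The sketch below is hypothetical: it assumes a file named `main.dev.bicepparam` and that `main.bicep` declares `name` and `location` parameters, which the `main.bicep` shown earlier in this document does not.

```bicep
// main.dev.bicepparam (hypothetical file name)
// Binds this parameter file to the template it supplies values for.
using './main.bicep'

// Values for the dev environment; prod would live in its own .bicepparam file.
param name = 'stappdev001'
param location = 'eastus'
```

A deployment would then pass the file with `--parameters main.dev.bicepparam` alongside `--template-file main.bicep`, with one parameter file per environment and the template itself unchanged.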
Networking Excellence
- Hub-and-Spoke Default: Standard architecture for most workloads
- Private by Default: All PaaS access via Private Endpoints
- DNS Planning: Private DNS Zones with VNet links, avoid host file modifications
- Firewall Integration: Centralized threat protection with Azure Firewall
- Hybrid Connectivity: ExpressRoute for production, VPN for secondary
Security Hardening
- Least Privilege: RBAC with specific roles, avoid Subscription Owner
- Managed Identities: Prefer over Service Principals with secrets
- Secrets Management: Key Vault for all secrets, never environment variables
- Encryption Everywhere: CMK for sensitive data, TLS 1.2+ everywhere
- Network Isolation: NSG rules denying by default, allow-listing required traffic
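The network isolation item can be sketched as an NSG with one explicit allow rule and a catch-all deny. The names and the source address range are illustrative assumptions. Note that a deny-all rule at this priority overrides the platform default rules, including the one that admits Azure Load Balancer health probes, so add an allow rule for the `AzureLoadBalancer` service tag if the subnet sits behind a load balancer.

```bicep
param location string = resourceGroup().location

resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'nsg-app-prod' // illustrative name
  location: location
  properties: {
    securityRules: [
      {
        // Allow-list entry: HTTPS from a hypothetical hub address range.
        name: 'Allow-Https-From-Hub'
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.0.0/16'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '443'
        }
      }
      {
        // Lowest custom priority: everything not explicitly allowed is denied.
        name: 'Deny-All-Inbound'
        properties: {
          priority: 4096
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}
```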
Cost Management
- Right-Sizing: Regular review of actual utilization vs allocated size
- Reservation Planning: Identify stable workloads for Reserved Instances
- Auto-Shutdown: Dev/test resources off during off-hours
- Tagging Strategy: Required tags for cost center, environment, owner
- Budget Alerts: Budget thresholds with alerts at 50%, 75%, 90%
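The budget thresholds above could be declared as a subscription-scope budget. This is a sketch under assumptions: the budget name, amount, and parameter names are illustrative, `startDate` must be the first day of a month, and the API version is one believed to be available for `Microsoft.Consumption/budgets`.

```bicep
targetScope = 'subscription'

param budgetName string = 'budget-monthly'

// Monthly budget amount in the subscription's billing currency (illustrative).
param amount int = 1000

// First day of a month, e.g. '2025-01-01'.
param startDate string

// Recipients of the alert emails.
param contactEmails array

resource budget 'Microsoft.Consumption/budgets@2023-11-01' = {
  name: budgetName
  properties: {
    category: 'Cost'
    amount: amount
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: startDate
    }
    // One notification per threshold: 50%, 75%, and 90% of actual spend.
    notifications: {
      actual50: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 50
        thresholdType: 'Actual'
        contactEmails: contactEmails
      }
      actual75: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 75
        thresholdType: 'Actual'
        contactEmails: contactEmails
      }
      actual90: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 90
        thresholdType: 'Actual'
        contactEmails: contactEmails
      }
    }
  }
}
```

Budgets only alert; they do not stop spending, so they complement rather than replace auto-shutdown and right-sizing.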
Governance and Compliance
- Policy as Guardrails: Azure Policy for prevention, not just detection
- Management Groups: Hierarchy reflecting organizational structure
- Environment Templates: Template Specs and Deployment Stacks for standard compliant environments (Azure Blueprints is being replaced)
- Monitoring Strategy: Centralized logging to Log Analytics workspace
- Automation: Runbooks for routine operational tasks