oci best practices and architecture
npx skills add https://github.com/acedergren/oci-agent-skills --skill OCI Best Practices and Architecture
Skill Documentation
OCI Best Practices and Architecture Skill
You are an expert in Oracle Cloud Infrastructure best practices and well-architected framework principles. This skill provides official Oracle guidance for security, cost, performance, operations, and reliability based on https://www.oracle.com/cloud/oci-best-practices-guide/ and https://docs.oracle.com/solutions/.
Core Best Practice Categories
- Security – Zero trust, IAM, encryption, Cloud Guard
- Cost Optimization – Right-sizing, reserved capacity, monitoring
- Performance – Shape selection, caching, network optimization
- Operational Excellence – IaC, monitoring, CI/CD, automation
- Reliability – HA, DR, backups, multi-AD architecture
Security Best Practices
Zero Trust Architecture
- Apply the "never trust, always verify" principle
- Implement identity domains with MFA
- Use microsegmentation with NSGs
- Encrypt all data at rest and in transit
- Enable Cloud Guard for threat detection
CIS OCI Foundations Benchmark
# Deploy CIS-compliant landing zone
oci resource-manager stack create \
  --compartment-id <compartment-ocid> \
  --config-source <cis-template-zip> \
  --display-name "CIS-Landing-Zone"
# Key controls:
- Least privilege IAM policies
- MFA for all users
- Security zones enabled
- Encryption by default
- Audit logging active
- Network security groups
Preventive Controls
- OCI Network Firewall: Perimeter protection with layer 7 inspection
- Security Zones: Enforce policies at infrastructure level
- WAF: Protect internet-facing applications from OWASP Top 10
- DDoS Prevention: Built-in layer 3/4/7 protection
Detective Controls
- Oracle Cloud Guard: Continuous monitoring and automated response
- Vulnerability Scanning: Regular OS and app scanning
- Data Safe: Database security assessments and auditing
- Logging Analytics: Centralized log aggregation and analysis
Cost Optimization
Right-Sizing Strategy
Workload | Start With | Scale Based On
Development | VM.Standard.E4.Flex (1 OCPU) | CPU >70%
Production | VM.Standard.E4.Flex (2 OCPU) | Sustained metrics
Cost-optimized | VM.Standard.A1.Flex (Arm) | 40% savings
HPC/GPU | BM.GPU4.8 | Workload needs
Free Tier Resources
Always Free (no time limit):
- 2 Autonomous Databases (1 OCPU each)
- 2 AMD Compute VMs (1/8 OCPU, 1GB)
- 4 Arm Ampere A1 cores (24GB total)
- 200GB Block Storage
- 10GB Object Storage + 10GB Archive
- Load Balancer (10 Mbps)
Strategy: Use for dev/test/POC
Storage Tiering
# Lifecycle policy for automatic tiering
oci os object-lifecycle-policy put \
  --bucket-name <bucket> \
  --items '[{
    "name":"ArchiveOldLogs",
    "action":"ARCHIVE",
    "timeAmount":90,
    "timeUnit":"DAYS",
    "isEnabled":true,
    "objectNameFilter":{"inclusionPrefixes":["logs/"]}
  }]'
# Cost comparison:
Standard Storage: $0.0255/GB/month
Infrequent Access: $0.0125/GB/month (51% savings)
Archive Storage: $0.0024/GB/month (91% savings)
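The savings figures follow directly from the per-GB rates. A minimal sketch of that arithmetic, assuming the rates quoted above are still current (they are copied from this section, not fetched from a price list) and ignoring retrieval fees and minimum-retention charges; the 10 TB volume is a hypothetical example:

```python
# Per-GB monthly rates copied from the comparison above (USD); treat them as
# illustrative and check the current OCI price list before budgeting.
RATES_USD_PER_GB_MONTH = {
    "standard": 0.0255,
    "infrequent_access": 0.0125,
    "archive": 0.0024,
}


def monthly_cost(size_gb: float, tier: str) -> float:
    """Return the monthly storage cost in USD for size_gb stored in tier."""
    return size_gb * RATES_USD_PER_GB_MONTH[tier]


def savings_vs_standard(tier: str) -> float:
    """Return the fractional saving of tier relative to Standard storage."""
    standard = RATES_USD_PER_GB_MONTH["standard"]
    return 1 - RATES_USD_PER_GB_MONTH[tier] / standard


if __name__ == "__main__":
    for tier in RATES_USD_PER_GB_MONTH:
        cost = monthly_cost(10_000, tier)  # hypothetical 10 TB of logs
        print(f"{tier}: ${cost:,.2f}/month, {savings_vs_standard(tier):.0%} saved")
```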
Budget Monitoring
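# Set a monthly budget (created in the root compartment, tracking the target compartment)
oci budgets budget create \
  --compartment-id <tenancy-ocid> \
  --amount 1000 \
  --reset-period MONTHLY \
  --target-type COMPARTMENT \
  --targets '["<compartment-ocid>"]'
# Alert rules are a separate resource; 80% of actual spend is an example threshold
oci budgets alert-rule create \
  --budget-id <budget-ocid> \
  --type ACTUAL \
  --threshold 80 \
  --threshold-type PERCENTAGE \
  --recipients "team@example.com"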
# Track costs by tags
Tags: Environment, CostCenter, Owner, Application
Performance Excellence
Compute Performance
Shape Selection Guide:
- VM.Standard.E4.Flex: General purpose, flexible
- VM.DenseIO.E4.Flex: High IOPS workloads
- BM.Standard.E4.128: Bare metal for maximum performance
- VM.GPU.A10.1: AI/ML inference
- BM.GPU.A100-v2.8: Training large models
Start small, then scale based on measured metrics rather than estimates.
Block Volume Performance
# Ultra High Performance volume
oci bv volume create \
  --compartment-id <compartment-ocid> \
  --availability-domain <ad> \
  --size-in-gbs 1000 \
  --vpus-per-gb 120 # 225 IOPS/GB, 300K IOPS max
Performance Tiers (VPUs/GB):
Balanced (10): 60 IOPS/GB
Higher Performance (20): 75 IOPS/GB
Ultra High (30-120): 90-225 IOPS/GB
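The per-GB figures for each tier sit on a straight line in the VPU setting. The sketch below assumes that linear relationship (10 VPUs/GB gives 60 IOPS/GB, 20 gives 75, 30 gives 90, 120 gives 225); it is an interpolation for planning, not an official formula, and it ignores per-volume IOPS caps. The function names are illustrative.

```python
def iops_per_gb(vpus_per_gb: int) -> float:
    """Approximate IOPS per GB for a block volume at the given VPU setting.

    Assumption: IOPS/GB rises linearly with VPUs/GB through the points
    10 -> 60, 20 -> 75, 30 -> 90 and 120 -> 225.
    """
    if vpus_per_gb < 10 or vpus_per_gb > 120 or vpus_per_gb % 10:
        raise ValueError("expected a multiple of 10 between 10 and 120")
    return 45 + 1.5 * vpus_per_gb


def tier_name(vpus_per_gb: int) -> str:
    """Map a VPU setting to the tier names used in this section."""
    iops_per_gb(vpus_per_gb)  # reuse the range validation
    if vpus_per_gb == 10:
        return "Balanced"
    if vpus_per_gb == 20:
        return "Higher Performance"
    return "Ultra High"
```

For example, `iops_per_gb(120)` evaluates to 225.0, the top of the Ultra High range.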
Network Optimization
Best Practices:
- Regional subnets for flexibility
- Adequate CIDR space (/16 VCN, /24 subnets)
- FastConnect for predictable latency
- Place resources in same AD for <1ms latency
- Use flexible load balancers
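To see how much room the /16 VCN with /24 subnets sizing gives, the standard-library `ipaddress` module can do the arithmetic. This is a sketch only: the `10.0.0.0/16` range is a hypothetical example, and the count of three reserved addresses per subnet is an assumption to confirm against the VCN documentation.

```python
import ipaddress

# Hypothetical VCN range; substitute the CIDR block allocated to your VCN.
vcn = ipaddress.ip_network("10.0.0.0/16")

# Carve the /16 into /24 subnets.
subnets = list(vcn.subnets(new_prefix=24))


def usable_hosts(subnet: ipaddress.IPv4Network, reserved: int = 3) -> int:
    """Addresses left in a subnet after `reserved` are set aside.

    Assumption: three addresses per subnet are reserved (the first two and
    the last one).
    """
    return subnet.num_addresses - reserved


if __name__ == "__main__":
    print(len(subnets), "subnets, first:", subnets[0])
    print("usable addresses per subnet:", usable_hosts(subnets[0]))
```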
Latency Guide:
Same AD: <1ms
Different AD: 1-2ms
Cross-region: Geography-dependent
FastConnect: 5-20ms (vs 30-50ms internet)
Caching Strategies
Implementation:
- Redis/Memcached for session data
- Database query result caching
- CDN for static content
- Browser caching headers
- OCI Object Storage with CDN
Benefits: 10-100x faster, reduced DB load, lower costs
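Query result caching can be illustrated with a small in-process cache that expires entries after a fixed time. This is a minimal sketch, not a replacement for Redis or Memcached; the class name `TTLCache` and its `clock` parameter (injected so expiry can be tested) are illustrative choices.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # cache hit: the backing query is skipped
        value = loader()             # cache miss: run the expensive call once
        self._store[key] = (now, value)
        return value
```

A caller would wrap the expensive query, for example `cache.get_or_load(("orders", customer_id), run_query)`, where `run_query` is whatever function hits the database.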
Operational Excellence
Infrastructure as Code
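# Terraform state in OCI Object Storage (S3-compatible API)
terraform {
  backend "s3" {
    bucket   = "terraform-state"
    key      = "prod/terraform.tfstate"
    region   = "us-phoenix-1"
    endpoint = "https://<namespace>.compat.objectstorage.us-phoenix-1.oraclecloud.com"
    # The backend's AWS-specific checks do not apply to OCI, so skip them
    skip_region_validation      = true
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    force_path_style            = true
  }
}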
Best Practices:
- Use modules for reusability
- Separate state files per environment
- Implement proper variable management
- Use Resource Manager for team collaboration
- Tag all resources consistently
Tagging Strategy
# Comprehensive tagging
oci compute instance update \
  --instance-id <instance-ocid> \
  --defined-tags '{
    "Operations":{
      "Environment":"Production",
      "CostCenter":"ENG-001",
      "Owner":"platform-team",
      "Application":"web-app",
      "BackupPolicy":"daily",
      "Compliance":"PCI-DSS"
    }
  }'
Use tags for:
- Cost allocation and reporting
- Automated backup policies
- Resource grouping
- Compliance tracking
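Tag-based cost allocation only works if every resource carries the expected keys. One way to check is a small validator run over each resource's defined tags. This sketch assumes the `Operations` namespace from the example above and treats the four cost-tracking keys (Environment, CostCenter, Owner, Application) as mandatory; both are illustrative choices.

```python
REQUIRED_TAG_KEYS = {"Environment", "CostCenter", "Owner", "Application"}


def missing_tags(defined_tags: dict, namespace: str = "Operations") -> set:
    """Return required keys that are absent or empty in the tag namespace."""
    tags = defined_tags.get(namespace, {})
    return {key for key in REQUIRED_TAG_KEYS if not tags.get(key)}


if __name__ == "__main__":
    resource_tags = {"Operations": {"Environment": "Production", "Owner": "platform-team"}}
    print(sorted(missing_tags(resource_tags)))
```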
Monitoring and Alarms
# Critical alarm (page immediately)
oci monitoring alarm create \
  --compartment-id <compartment-ocid> \
  --metric-compartment-id <compartment-ocid> \
  --display-name "Database Critical CPU" \
  --namespace "oci_autonomous_database" \
  --query-text 'CpuUtilization[1m].mean() > 90' \
  --severity "CRITICAL" \
  --destinations '["<pager-topic-ocid>"]' \
  --is-enabled true \
  --pending-duration "PT5M" \
  --repeat-notification-duration "PT5M"
Monitoring Stack:
- Service logs → OCI Logging
- Application logs → Logging Analytics
- Metrics → OCI Monitoring
- APM → Application Performance Monitoring
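The alarm above combines a one-minute mean with a five-minute pending duration, so a single spike should not page anyone. The sketch below models that behaviour in simplified form: it assumes one datapoint per minute and fires only when every datapoint in the pending window breaches the threshold. It illustrates the idea and does not reproduce the Monitoring service's evaluation logic.

```python
from typing import Sequence


def alarm_fires(cpu_means: Sequence[float], threshold: float = 90.0,
                pending_minutes: int = 5) -> bool:
    """Return True if the last pending_minutes one-minute means all exceed threshold."""
    if len(cpu_means) < pending_minutes:
        return False  # not enough history to cover the pending window
    return all(value > threshold for value in cpu_means[-pending_minutes:])
```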
CI/CD Pipeline
DevSecOps Stages:
1. Code commit → Git webhook
2. Build → Compile and package
3. Security scan → SAST, dependency check
4. Unit tests → Code coverage
5. Container build → Docker image
6. Push to OCIR → Registry
7. Deploy to dev → Automated
8. Integration tests → Automated
9. Deploy to prod → Manual approval
Security Checkpoints:
- Static analysis
- Dependency scanning
- Container image scanning
- IAM policy validation
- Secrets detection
Reliability and High Availability
Multi-AD Architecture
Highly Available Pattern:
Region: us-phoenix-1
AD-1 | AD-2 | AD-3
Load Balancer | Load Balancer | Load Balancer
Web tier (2) | Web tier (2) | Web tier (2)
App tier (2) | App tier (2) | App tier (2)
Database (primary) | Database (standby) | Database (standby)
Benefits:
- 99.99% availability SLA
- Automatic failover
- No single point of failure
- Performance optimization
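An availability percentage is easier to reason about as a downtime budget. A small sketch of the conversion; the function name is illustrative and the calculation assumes a 365-day year unless told otherwise:

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    return (1 - availability_percent / 100) * days * 24 * 60


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.99):.2f} minutes per year")
    print(f"{allowed_downtime_minutes(99.99, days=30):.2f} minutes per 30 days")
```

At 99.99% this works out to roughly 52.6 minutes per year, or about 4.3 minutes in a 30-day month.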
Database High Availability
# Autonomous Database (built-in HA)
- Automatic 3-way replication across ADs
- RPO ~0, RTO <2 minutes
- Automated backups with 60-day retention
- Point-in-time recovery
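# Cross-region DR with Autonomous Data Guard
# Run against the DR region; confirm the subcommand with
# `oci db autonomous-database --help` for your CLI version
oci db autonomous-database create-adb-cross-region-data-guard-details \
  --region <dr-region> \
  --compartment-id <compartment-ocid> \
  --source-id <primary-adb-ocid>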
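# DB System with RAC (2-node cluster); RAC requires the Extreme Performance edition
# Other required parameters (compartment, AD, subnet, SSH key, hostname,
# database name, version, admin password) are omitted here
oci db system launch \
  --shape "VM.Standard2.2" \
  --database-edition ENTERPRISE_EDITION_EXTREME_PERFORMANCE \
  --node-count 2 \
  --cluster-name "proddb" \
  --cpu-core-count 4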
Backup Strategy (3-2-1 Rule)
3 copies of data
2 different storage types
1 offsite (different region)
OCI Implementation:
- Primary: Production data (region A)
- Secondary: Boot volume backups (region A)
- Tertiary: Object storage replication (region B)
Retention Policy:
Daily: 7 days
Weekly: 4 weeks
Monthly: 12 months
Yearly: 7 years (compliance)
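The retention schedule can be expressed as a single predicate that a cleanup job evaluates per backup. The policy does not say which backups count as weekly, monthly, or yearly, so this sketch assumes Sundays, the 1st of each month, and 1 January respectively, and approximates a month as 30 days and a year as 365 days.

```python
from datetime import date


def should_retain(backup_day: date, today: date) -> bool:
    """Decide whether a backup taken on backup_day is still retained today.

    Assumptions (not stated in the policy): weekly backups are those taken on
    Sundays, monthly on the 1st of a month, yearly on 1 January.
    """
    age = (today - backup_day).days
    if age < 0:
        raise ValueError("backup_day is in the future")
    if age <= 7:
        return True                                   # daily: 7 days
    if backup_day.weekday() == 6 and age <= 4 * 7:
        return True                                   # weekly: 4 weeks
    if backup_day.day == 1 and age <= 12 * 30:
        return True                                   # monthly: 12 months
    if backup_day.month == 1 and backup_day.day == 1 and age <= 7 * 365:
        return True                                   # yearly: 7 years
    return False
```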
Disaster Recovery
RTO/RPO by Business Tier:
Tier | RTO | RPO | Strategy
1 | <1 hour | <15 min | Active-active + Data Guard
2 | <4 hours | <1 hour | Hot standby
3 | <24 hrs | <4 hours | Warm standby
4 | <72 hrs | <24 hrs | Backup/restore
Implementation:
Tier 1: Multi-region, automatic failover
Tier 2: Standby region, manual failover
Tier 3: Regular backups, restore to new region
Tier 4: Archive storage, restore on demand
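Choosing a tier amounts to finding the least demanding strategy that still meets a workload's recovery objectives. The sketch below transcribes the table into data and picks that tier. It treats the "<" bounds in the table as upper limits, and the `DrTier` type and `cheapest_tier` function are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrTier:
    tier: int
    rto_hours: float   # recovery time the strategy is expected to meet
    rpo_hours: float   # data-loss window the strategy is expected to meet
    strategy: str


# Values transcribed from the tier table above, most to least demanding.
TIERS = [
    DrTier(1, 1, 0.25, "Active-active + Data Guard"),
    DrTier(2, 4, 1, "Hot standby"),
    DrTier(3, 24, 4, "Warm standby"),
    DrTier(4, 72, 24, "Backup/restore"),
]


def cheapest_tier(required_rto_hours: float,
                  required_rpo_hours: float) -> Optional[DrTier]:
    """Return the highest-numbered tier whose RTO and RPO meet the requirement."""
    for tier in reversed(TIERS):
        if (tier.rto_hours <= required_rto_hours
                and tier.rpo_hours <= required_rpo_hours):
            return tier
    return None  # requirement is stricter than Tier 1 can promise
```

A workload that tolerates 8 hours of downtime and 2 hours of data loss lands on Tier 2 (hot standby), since Tier 3's 24-hour RTO is too slow.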
Service-Specific Best Practices
Generative AI on OCI
Model Selection:
- Cohere Command R+: Chat applications
- Cohere Command: Content generation
- CodeLlama: Code generation
- Cohere Embed v3: Semantic search
RAG (Retrieval Augmented Generation):
1. Generate embeddings from user query
2. Vector search in knowledge base
3. Retrieve relevant context
4. Augment prompt with context
5. Generate response with LLM
Benefits: Up-to-date info, reduced hallucinations
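The five RAG steps can be sketched end to end without any cloud dependency. In this toy version the `embed` function is a bag-of-words stand-in for a real embedding model, the knowledge base is a plain list searched linearly rather than a vector store, and `llm` is whatever callable wraps the generation endpoint; all of these names are placeholders, not OCI SDK calls.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Steps 1-3: embed the query, rank documents by similarity, keep the top k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Step 4: augment the prompt with the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def answer(query: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Step 5: pass the augmented prompt to the supplied LLM callable."""
    return llm(build_prompt(query, retrieve(query, documents)))
```

In a real deployment `embed` would call an embedding model and `retrieve` would query a vector index, but the control flow stays the same.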
Cost Optimization:
- On-demand for development
- Dedicated endpoints for production
- Cache responses where appropriate
- Implement rate limiting
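Rate limiting in front of a metered inference endpoint is commonly done with a token bucket. The sketch below is one such client-side limiter; it is a generic technique rather than an OCI feature, and the class name and the injected `clock` (used to make the behaviour testable) are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means the caller should back off."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A caller checks `bucket.allow()` before each model request and queues or rejects the request when it returns False.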
Containers on OCI (OKE)
# Production cluster
oci ce cluster create \
  --compartment-id <compartment-ocid> \
  --vcn-id <vcn-ocid> \
  --name "prod-cluster" \
  --kubernetes-version "v1.28.2" \
  --service-lb-subnet-ids '["<lb-subnet-ocid>"]' \
  --dashboard-enabled false
Best Practices:
- Private subnets for worker nodes
- Public subnet for load balancers only
- Enforce Pod Security Standards (pod security policies were removed in Kubernetes 1.25)
- Use Kubernetes secrets + OCI Vault
- Implement network policies
- Enable cluster autoscaling
- Separate node pools by workload type
E-Business Suite on OCI
Migration Approach:
- Use EBS Cloud Manager for automation
- Lift-and-shift initially, optimize later
- Private subnets for app and database tiers
- Enable SSO with OCI IAM Identity Domains
- Implement DR with OCI native services
HA Architecture:
- Application: Multiple instances across ADs
- Database: RAC or Data Guard
- Load balancer: Traffic distribution
- File storage: FSS with replication
- Backups: Automated OCI services
Well-Architected Framework Checklist
Security
- IAM policies follow least privilege
- MFA enabled for all privileged users
- Zero trust principles implemented
- Cloud Guard enabled and monitored
- All data encrypted at rest
- TLS 1.2+ for data in transit
- Regular vulnerability scanning
- Security zones for critical workloads
Cost Optimization
- Resources right-sized based on metrics
- Budget alerts configured
- Cost tracking tags applied
- Unused resources identified and removed
- Storage tiering policies in place
- Reserved capacity for predictable workloads
- Free tier utilized for dev/test
Performance
- Appropriate shapes selected
- Caching implemented where beneficial
- Database auto-scaling enabled
- Network optimized (FastConnect if needed)
- Performance testing completed
- Bottlenecks identified and resolved
Operational Excellence
- Infrastructure as code (Terraform)
- CI/CD pipeline implemented
- Comprehensive monitoring and alarms
- Logging and log analytics enabled
- Automated backups configured
- Runbooks documented
- Disaster recovery tested
Reliability
- Multi-AD deployment for critical workloads
- Database HA (RAC or Data Guard)
- Load balancers with health checks
- Backup and restore tested
- DR plan documented and tested
- Chaos engineering practiced
- SLAs defined and monitored
Reference Architectures
Based on https://docs.oracle.com/solutions/:
Three-Tier Web Application
Internet → WAF → Load Balancer (public subnet)
↓
Web Tier (private subnet, multi-AD)
↓
App Tier (private subnet, multi-AD)
↓
Database (private subnet, Data Guard)
Hybrid Cloud with FastConnect
On-Premises Data Center
↓ (FastConnect)
DRG (Dynamic Routing Gateway)
↓
OCI VCN (hub-spoke topology)
- Shared services VCN
- Production VCN
- Development VCN
AI/ML Platform
Data Sources → Data Integration
↓
Object Storage (raw data)
↓
Data Science notebooks
↓
Model training (GPU instances)
↓
Model catalog → Inference endpoints
When to Use This Skill
Activate when user asks about:
- Architecture design and recommendations
- Best practices for any OCI service
- Security hardening or compliance
- Cost optimization strategies
- Performance tuning
- High availability and disaster recovery
- Operational excellence
- Well-architected framework
- CIS benchmarks
- Service-specific guidance (GenAI, OKE, EBS)
Example Interactions
User: “What are OCI security best practices?” Response: Cover zero trust, IAM, Cloud Guard, encryption, CIS compliance with specific implementation examples.
User: “How do I optimize my OCI costs?” Response: Explain right-sizing, reserved capacity, storage tiering, free tier, with CLI commands and cost comparisons.
User: “Design a highly available application” Response: Multi-AD architecture with load balancer, compute distribution, database HA, and DR strategy.
User: “Best practices for GenAI on OCI?” Response: Model selection, RAG implementation, cost optimization, dedicated vs on-demand endpoints.
Official Resources
- OCI Best Practices Guide: https://www.oracle.com/cloud/oci-best-practices-guide/
- Oracle Solutions: https://docs.oracle.com/solutions/
- Architecture Center: https://www.oracle.com/cloud/architecture-center/
- CIS OCI Benchmark: https://www.cisecurity.org/benchmark/oracle_cloud