Installation
gh skills-hub install azure-kubernetes
Don't have the extension? Run gh extension install samueltauil/skills-hub first.
Alternatively, download and extract to your repository:
.github/skills/azure-kubernetes/
Extract the ZIP to .github/skills/ in your repo. The folder name must match azure-kubernetes for Copilot to auto-discover it.
Skill Files (11)
SKILL.md 10.2 KB
---
name: azure-kubernetes
license: MIT
metadata:
author: Microsoft
version: "1.1.2"
description: "Plan, create, and configure production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 checklist, SKU selection (Automatic vs Standard), networking options (private API server, Azure CNI Overlay, egress configuration), security, and operations (autoscaling, upgrade strategy, cost analysis). WHEN: create AKS environment, provision AKS environment, enable AKS observability, design AKS networking, choose AKS SKU, secure AKS, optimize AKS, rightsize AKS pod, AKS spot nodes, AKS cluster-autoscaler."
---
# Azure Kubernetes Service
> **AUTHORITATIVE GUIDANCE: MANDATORY COMPLIANCE**
>
> This skill produces a **recommended AKS cluster configuration** based on user requirements, distinguishing **Day-0 decisions** (networking and API server access, which are hard to change later) from **Day-1 features** (which can be enabled after creation). See [CLI reference](./references/cli-reference.md) for commands.
## Quick Reference
| Property | Value |
|----------|-------|
| Best for | AKS cluster planning and Day-0 decisions |
| MCP Tools | `mcp_azure_mcp_aks` |
| CLI | `az aks create`, `az aks show`, `kubectl get`, `kubectl describe` |
| Related skills | azure-diagnostics (troubleshooting AKS), azure-validate (readiness checks), azure-kubernetes-automatic-readiness (migrate existing cluster to AKS Automatic) |
## When to Use This Skill
Activate this skill when the user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (Container Insights, Managed Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration
## Rules
1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
2. Use the `azure` MCP server and select `mcp_azure_mcp_aks` first to discover the exact AKS-specific MCP tools surfaced by the client. Choose the smallest discovered AKS tool that fits the task, and fall back to Azure CLI (`az aks`) only when the needed functionality is not exposed through the AKS MCP surface.
3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).
## Required Inputs (Ask only what's needed)
If the user is unsure, use safe defaults.
- AKS environment type: dev/test or production
- Region(s), availability zones, preferred node VM sizes
- Expected scale (node/cluster count, workload size)
- Networking requirements (API server access, pod IP model, ingress/egress control)
- Security and identity requirements, including image registry
- Upgrade and observability preferences
- Cost constraints
## Workflow
### 1. Cluster Type
- **AKS Automatic** (default): Best for most production workloads, provides a curated experience with pre-configured best practices for security, reliability, and performance. Use unless you have specific custom requirements for networking, autoscaling, or node pool configurations not supported by Node Auto-Provisioning (NAP).
- **AKS Standard**: Use if you need full control over environment configuration, which requires additional overhead to set up and manage.
### 2. Networking (Pod IP, Egress, Ingress, Dataplane)
**Pod IP Model** (Key Day-0 decision):
- **Azure CNI Overlay** (recommended): pod IPs come from a private overlay range and are not VNet-routable; scales to large environments and suits most workloads
- **Azure CNI (VNet-routable)**: pod IPs directly from VNet (pod subnet or node subnet), use when pods must be directly addressable from VNet or on-prem
- Docs: https://learn.microsoft.com/azure/aks/azure-cni-overlay
**Dataplane & Network Policy**:
- **Azure CNI powered by Cilium** (recommended): eBPF-based for high-performance packet processing, network policies, and observability
**Egress**:
- **Static Egress Gateway** for stable, predictable outbound IPs
- For restricted egress: UDR + Azure Firewall or NVA
**Ingress**:
- **App Routing addon with Gateway API**: recommended default for HTTP/HTTPS workloads
- **Istio service mesh with Gateway API**: for advanced traffic management, mTLS, canary releases
- **Application Gateway for Containers**: for L7 load balancing with WAF integration
**DNS**:
- Enable **LocalDNS** on all node pools for reliable, performant DNS resolution
### 3. Security
- Use **Microsoft Entra ID** everywhere (control plane, Workload Identity for pods, node access). Avoid static credentials.
- Azure Key Vault via **Secrets Store CSI Driver** for secrets
- Enable **Azure Policy** + **Deployment Safeguards**
- Enable **Encryption at rest** for etcd/API server; **in-transit** for node-to-node
- Allow only signed, policy-approved images (Azure Policy + Ratify), prefer **Azure Container Registry**
- **Isolation**: Use namespaces, network policies, scoped logging
### 4. Observability
- Use Managed Prometheus and Container Insights with Grafana for AKS observability (logs + metrics).
- Enable Diagnostic Settings to collect control plane logs and audit logs in a Log Analytics workspace for security monitoring and troubleshooting.
- For other monitoring and troubleshooting needs, use the Agentic CLI for AKS, Application Insights, Resource Health Center, AppLens detectors, and Azure Advisor.
### 5. Upgrades & Patching
- Configure **Maintenance Windows** for controlled upgrade timing
- Enable **auto-upgrades** for control plane and node OS to stay up-to-date with security patches and Kubernetes versions
- Consider **LTS versions** for enterprise stability (2-year support) by upgrading your AKS environment to the Premium tier
- **Fleet upgrades**: Use **AKS Fleet Manager** for staged rollout across test to production environments
### 6. Performance
- Use **Ephemeral OS disks** (`--node-osdisk-type Ephemeral`) for faster node startup
- Select **Azure Linux** as node OS (smaller footprint, faster boot)
- Enable **KEDA** for event-driven autoscaling beyond HPA
### 7. Node Pools & Compute
- **Dedicated system node pool**: At least 2 nodes, tainted for system workloads only (`CriticalAddonsOnly`)
- Enable **Node Auto Provisioning (NAP)** on all pools for cost savings and responsive scaling
- Use **latest generation SKUs (v5/v6)** for host-level optimizations
- **Avoid B-series VMs**: burstable SKUs cause performance/reliability issues
- Use SKUs with **at least 4 vCPUs** for production workloads
- Set **topology spread constraints** to distribute pods across hosts/zones per SLO
### 8. Reliability
- Deploy across **3 Availability Zones** (`--zones 1 2 3`)
- Use **Standard tier** for zone-redundant control plane + 99.95% SLA for API server availability
- Enable **Microsoft Defender for Containers** for runtime protection
- Configure **PodDisruptionBudgets** for all production workloads
- Use **topology spread constraints** to ensure pod distribution across failure domains
### 9. Cost Controls
- Use **Spot node pools** for batch/interruptible workloads (up to 90% savings)
- **Stop/Start** dev/test clusters: `az aks stop` / `az aks start`
- Consider **Reserved Instances** or **Savings Plans** for steady-state workloads
**Deep-dive scenarios** (load only the relevant reference file):
| Scenario | Trigger Keywords | Reference |
|----------|-----------------|-----------|
| Pod Rightsizing | over-provisioned pods, CPU requests, memory requests, rightsize workloads | [azure-aks-rightsizing.md](./references/azure-aks-rightsizing.md) |
| VPA Setup | vertical pod autoscaler, VPA recommendations, VPA enable | [azure-aks-vpa.md](./references/azure-aks-vpa.md) |
| Cluster Autoscaler | idle nodes, CAS off, enable autoscaler, scale-down profile, node utilization | [azure-aks-autoscaler.md](./references/azure-aks-autoscaler.md) |
| Spot Node Pools | Spot VMs, Spot nodes, batch workloads, cheaper nodes | [azure-aks-spot.md](./references/azure-aks-spot.md) |
> **Disambiguation:** If a prompt matches multiple rows (e.g., "cheaper nodes" could suggest both Spot and autoscaler), prefer the most specific match. If ambiguous, ask the user to clarify their intent before loading a reference file.
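The disambiguation rule can be sketched roughly as follows. This is an illustration only: the `REFERENCES` table and `pickReference` function are hypothetical names, and using the length of the longest matched keyword as a stand-in for "most specific" is an assumption, not something the skill defines.

```javascript
// Hypothetical routing table built from the trigger keywords in the table above.
const REFERENCES = [
  { file: "azure-aks-rightsizing.md", keywords: ["over-provisioned pods", "cpu requests", "memory requests", "rightsize workloads"] },
  { file: "azure-aks-vpa.md", keywords: ["vertical pod autoscaler", "vpa recommendations", "vpa enable"] },
  { file: "azure-aks-autoscaler.md", keywords: ["idle nodes", "cas off", "enable autoscaler", "scale-down profile", "node utilization"] },
  { file: "azure-aks-spot.md", keywords: ["spot vms", "spot nodes", "batch workloads", "cheaper nodes"] },
];

// Returns { file } for a clear winner, { ask: [...] } when the best matches tie,
// or { file: null } when no keyword matches.
function pickReference(prompt) {
  const text = prompt.toLowerCase();
  const scored = REFERENCES
    .map((ref) => ({
      file: ref.file,
      // Assumption: a longer matched keyword counts as a more specific match.
      score: Math.max(0, ...ref.keywords.filter((k) => text.includes(k)).map((k) => k.length)),
    }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score);

  if (scored.length === 0) return { file: null };
  if (scored.length > 1 && scored[0].score === scored[1].score) {
    return { ask: scored.map((r) => r.file) }; // ambiguous: ask the user
  }
  return { file: scored[0].file };
}
```

When the result contains `ask`, the agent would put the candidate scenarios to the user instead of loading a reference file.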
## Guardrails / Safety
- Do not request or output secrets (tokens, keys).
- Do not ask the user to paste subscription IDs. Discover subscription and resource scope via MCP tools (e.g., list subscriptions, list resource groups) or `az account show` / `az account list` so the agent can resolve context without exposing identifiers.
- If requirements are ambiguous for Day-0 critical decisions, ask the user clarifying questions. For Day-1 features, propose 2–3 safe options with tradeoffs and choose a conservative default.
- Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance.
## MCP Tools
| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `mcp_azure_mcp_aks` | AKS MCP entry point used to discover the exact AKS-specific tools exposed by the client | Discover the callable AKS tool first, then use that tool's parameters |
## Error Handling
| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| MCP tool call fails or times out | Invalid credentials, subscription, or AKS context | Verify `az login`, confirm the active subscription context with `az account show`, and check the target resource group without echoing subscription identifiers back to the user |
| Quota exceeded | Regional vCPU or resource limits | Request quota increase or select different region/VM SKU |
| Networking conflict (IP exhaustion) | Pod subnet too small for overlay/CNI | Re-plan IP ranges; may require cluster recreation (Day-0) |
| Workload Identity not working | Missing OIDC issuer or federated credential | Enable `--enable-oidc-issuer --enable-workload-identity`, configure federated identity |
SKILL.md 14.1 KB
---
name: azure-kubernetes-automatic-readiness
license: MIT
metadata:
author: Microsoft
version: "1.0.0"
description: "Assess Kubernetes workloads and cluster configuration for AKS Automatic compatibility. Identifies incompatibilities, generates fixes, and guides migration from AKS Standard to AKS Automatic. WHEN: migrate to AKS Automatic, check AKS Automatic readiness, validate manifests for Automatic, assess cluster for Automatic compatibility, fix deployment for Automatic compatibility, identify AKS Automatic migration blockers, is my cluster ready for AKS Automatic."
---
# AKS Automatic Readiness Assessment
> **AUTHORITATIVE GUIDANCE: MANDATORY COMPLIANCE**
>
> This skill assesses existing AKS clusters or local manifests for AKS Automatic compatibility.
> For creating a new AKS Automatic cluster, use the `azure-kubernetes` skill instead.
> See [constraint spec](./references/constraint-spec-v1.yaml) for all safeguard rules, [common fixes](./references/common-fixes.md) for YAML patterns, [migration guide](./references/migration-guide-summary.md) for end-to-end steps, and [MCP integration](./references/mcp-integration.md) for tool details and fallback handling.
You are an AKS Automatic compatibility assessment agent. Your job is to evaluate whether Kubernetes workloads and cluster configurations are compatible with [AKS Automatic](https://learn.microsoft.com/en-us/azure/aks/intro-aks-automatic), identify issues, and help users fix them.
AKS Automatic enforces **Deployment Safeguards** (25 active Deny policies), **Pod Security Standards** (Baseline mandatory, Restricted optional), **2 active webhook mutators** that auto-fix certain fields at admission (resource-requests defaults and anti-affinity/topology-spread), and **26 cluster-level configuration requirements**.
## Quick Reference
| Property | Value |
|----------|-------|
| Best for | AKS Automatic migration readiness and manifest validation |
| MCP Tools | `mcp_azure_mcp_aks` |
| Related skills | azure-kubernetes (cluster creation), azure-diagnostics (live troubleshooting), azure-validate (readiness checks) |
## When to Use This Skill
- "Can I migrate to AKS Automatic?"
- "Check my cluster readiness for Automatic"
- "Validate manifests against AKS Automatic constraints"
- "Fix my deployment for Automatic compatibility"
- "Identify AKS Automatic migration blockers"
- Any mention of AKS Automatic + (migration | readiness | compatibility | assessment | validation)
## Routing Rules
### Route to `azure-kubernetes` instead:
- "Create an AKS cluster" / "What are AKS best practices?" / "How do I deploy to AKS?"
- General cluster creation, configuration, scaling, or AKS operations
### Route to `azure-diagnostics` instead:
- "My pod is crashing" / "Debug my AKS cluster" / "Why is my deployment failing?"
- Live troubleshooting, debugging, error diagnosis on a running cluster
## Guardrails (READ FIRST)
1. **Read-only**: NEVER modify cluster state. Assessment is read-only. Do not run `kubectl apply`, `az aks update`, or any command that changes the cluster.
2. **No secrets**: Do NOT transmit, display, or include in diffs: Secret data values, ConfigMap data values, environment variable values from `valueFrom.secretKeyRef`, service account tokens, or connection strings.
3. **User approval for file changes**: Present every fix as a diff. The user must explicitly accept before you write to any file.
4. **Scope boundaries**: Route cluster creation/deletion questions to the `azure-kubernetes` skill. Route live troubleshooting to the `azure-diagnostics` skill.
## MCP Tools
| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `mcp_azure_mcp_aks` | AKS MCP entry point: call `discover` first, then use the assessment action name returned in the response | `subscriptionId`, `resourceGroupName`, `resourceName`, `scope` |
## Workflow
### Step 1: Determine Scope
Ask the user what they want to assess:
**Option A: Cluster-connected assessment (via AKS MCP)**
Use when the user has a connected cluster context (subscription + resource group + cluster name).
**Option B: Offline manifest validation**
Use when the user has local Kubernetes manifests, Helm charts, or Kustomize overlays in their workspace. Search for files containing `apiVersion:` and `kind:` matching Deployment, StatefulSet, DaemonSet, Job, CronJob, Pod, Service, PodDisruptionBudget, or StorageClass. For Helm charts, look for `Chart.yaml` and rendered templates under `templates/`.
**Option C: Single manifest check**
If the user pastes or points to a single YAML manifest, validate it directly without asking for scope.
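As a rough illustration of the Option B file search, the sketch below scans a file's text for `apiVersion:` and a top-level `kind:` that is on the assessed list. The names `ASSESSED_KINDS` and `assessedKindsIn` are hypothetical, and the regex-based matching is a simplification (it ignores indented or commented `kind:` lines); a real implementation would more likely parse the YAML.

```javascript
// Kinds named in the offline-validation scope above.
const ASSESSED_KINDS = [
  "Deployment", "StatefulSet", "DaemonSet", "Job", "CronJob",
  "Pod", "Service", "PodDisruptionBudget", "StorageClass",
];

// Returns the assessed kinds found in one (possibly multi-document) YAML file.
function assessedKindsIn(fileText) {
  // A Kubernetes manifest has a top-level apiVersion key.
  if (!/^apiVersion:/m.test(fileText)) return [];
  const kinds = new Set();
  for (const match of fileText.matchAll(/^kind:\s*["']?(\w+)["']?\s*$/gm)) {
    if (ASSESSED_KINDS.includes(match[1])) kinds.add(match[1]);
  }
  return [...kinds];
}
```

Files for which this returns an empty array can be skipped before loading the constraint spec.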
### Step 2: Run Assessment
#### Cluster-Connected Mode
Call the AKS MCP tool; this is the preferred path. Always call `discover` first to get the available actions, then use the assessment action name returned in the response:
```javascript
// Step 1: Discover available actions
mcp_azure_mcp_aks({ action: "discover" })
// Step 2: Use the assessment action name from the discover response
mcp_azure_mcp_aks({
  action: "<action-from-discover>",
  subscriptionId: "<subscription-id>",
  resourceGroupName: "<resource-group>",
  resourceName: "<cluster-name>",
  scope: {
    excludeNamespaces: ["kube-system", "gatekeeper-system"],
    workloadTypes: ["Deployment", "StatefulSet", "DaemonSet", "CronJob", "Job"]
  }
})
```
**Required permissions:**
- `Microsoft.ContainerService/managedClusters/read`
- `Microsoft.ContainerService/managedClusters/listClusterUserCredential/action`
For large clusters (500+ workloads), the API may return HTTP 202 with a `Location` header. Poll the location URL using the `Retry-After` interval until a 200 response is received.
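The polling behaviour described above might look like the sketch below. It is an assumption-laden illustration: `pollUntilComplete`, the injected `fetchFn`, the 5-second fallback delay, and the attempt limit are all hypothetical, and only the seconds form of `Retry-After` is handled.

```javascript
// Polls the Location URL of a long-running operation until it returns HTTP 200.
// `fetchFn` is injected so the sketch can run against a stub; pass the global
// `fetch` (Node 18+) for real requests.
async function pollUntilComplete(locationUrl, fetchFn, sleep = defaultSleep, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetchFn(locationUrl);
    if (response.status === 200) return response.json();
    if (response.status !== 202) {
      throw new Error(`Unexpected status ${response.status} while polling`);
    }
    // Retry-After is read as seconds; fall back to 5s (an assumed default).
    const retryAfterSeconds = Number(response.headers.get("Retry-After")) || 5;
    await sleep(retryAfterSeconds * 1000);
  }
  throw new Error("Assessment did not complete within the polling budget");
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

Bounding the number of attempts keeps a stuck operation from blocking the assessment indefinitely, at which point the offline path remains available.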
**Parsing the MCP response:**
1. **`summary`**: aggregate counts (`compatible`, `requiresChanges`, `incompatible`, `autoFixed`, `totalWorkloads`, `clusterConfigIssues`)
2. **`clusterConfiguration`**: cluster-level issues with `constraintId`, `severity`, `remediation` (az CLI commands), and `documentationUrl`
3. **`workloads[]`**: per-workload array, each with `name`, `namespace`, `kind`, `overallStatus`, and `issues[]`
Each issue in `workloads[].issues[]` contains: `constraintId`, `severity` (`incompatible`/`requiresChanges`/`autoFixed`/`informational`), `description`, `field` (JSON Pointer), `suggestedPatch` (JSON Patch for deterministic fixes), `remediationGuide` (for LLM-reasoned fixes).
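A minimal sketch of walking that response shape, assuming the field names listed above and nothing more about the payload. `flattenIssues` and `splitByFixType` are hypothetical helper names.

```javascript
// Flattens workloads[].issues[] into one list, tagging each issue with its workload.
function flattenIssues(assessment) {
  return (assessment.workloads ?? []).flatMap((w) =>
    (w.issues ?? []).map((issue) => ({
      workload: `${w.namespace}/${w.name} (${w.kind})`,
      ...issue,
    }))
  );
}

// Issues carrying a suggestedPatch are candidates for deterministic fixes;
// the remaining actionable ones need app context via remediationGuide.
function splitByFixType(issues) {
  const actionable = issues.filter(
    (i) => i.severity === "incompatible" || i.severity === "requiresChanges"
  );
  return {
    deterministic: actionable.filter((i) => i.suggestedPatch !== undefined),
    needsContext: actionable.filter((i) => i.suggestedPatch === undefined),
  };
}
```

Keeping the workload label on each flattened issue makes it straightforward to fill the "Affected" field when presenting findings.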
#### Fallback Chain
```
1. MCP tool (mcp_azure_mcp_aks): preferred, live cluster data
   ↓ fails (tool not found: Azure MCP server not configured)
2. Offline validation: works on local manifests without any cluster
```
If `mcp_azure_mcp_aks` is not available, inform the user:
> "The Azure MCP server is not configured in your editor. To enable live cluster assessment, follow the setup guide at [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup). For now, I can validate your local manifests offline."
Then proceed to offline mode.
#### Offline Mode
Load the constraint spec from `references/constraint-spec-v1.yaml` and evaluate each manifest. Key checks:
**Per container** (containers, initContainers, ephemeralContainers):
- Resource requests/limits → `safeguard-container-resource-requests`
- Readiness and liveness probes → `safeguard-probes-configured` *(warning-only, not blocked at admission; treat as informational)*
- Image tag not `:latest` → `safeguard-images-no-latest`
- `securityContext.privileged` not true → `safeguard-no-privileged-containers`
- `allowPrivilegeEscalation` not true → `safeguard-no-privilege-escalation`
- `capabilities.add` empty → `safeguard-container-capabilities`
- `seccompProfile` is RuntimeDefault/Localhost → `safeguard-allowed-seccomp-profiles`
**Per pod spec:**
- `hostPID`/`hostIPC` not true → `safeguard-block-host-namespaces` (incompatible)
- `hostNetwork`/`hostPort` not true → `safeguard-host-network-ports` (incompatible)
- No `hostPath` volumes → `safeguard-no-host-path-volumes` (incompatible)
- Volume types are standard → `safeguard-allowed-volume-types`
**Per workload type:**
- Deployments/StatefulSets with replicas > 1: podAntiAffinity or topologySpreadConstraints → `safeguard-pod-enforce-antiaffinity`
- StorageClass: CSI provisioner (not in-tree) → `safeguard-csi-driver-storage-class`
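To make the per-container checks concrete, here is a small sketch that evaluates an already-parsed container object against a subset of them. The function names are hypothetical, and treating an untagged image as `:latest` is an assumption about how the policy behaves; the authoritative rules live in `references/constraint-spec-v1.yaml`.

```javascript
// True when the image reference resolves to the "latest" tag.
// Assumption: an image with no tag and no digest is treated as :latest.
function usesLatestTag(image) {
  if (image.includes("@")) return false; // pinned by digest
  const lastSegment = image.slice(image.lastIndexOf("/") + 1);
  const tag = lastSegment.includes(":") ? lastSegment.split(":")[1] : "latest";
  return tag === "latest";
}

// Returns the constraint IDs violated by one container spec.
function checkContainer(container) {
  const violations = [];
  const sc = container.securityContext ?? {};
  if (!container.resources?.requests) violations.push("safeguard-container-resource-requests");
  if (container.image && usesLatestTag(container.image)) violations.push("safeguard-images-no-latest");
  if (sc.privileged === true) violations.push("safeguard-no-privileged-containers");
  if (sc.allowPrivilegeEscalation === true) violations.push("safeguard-no-privilege-escalation");
  if ((sc.capabilities?.add ?? []).length > 0) violations.push("safeguard-container-capabilities");
  return violations;
}

// Applies the checks to containers, initContainers, and ephemeralContainers.
function checkPodSpec(podSpec) {
  const containers = [
    ...(podSpec.containers ?? []),
    ...(podSpec.initContainers ?? []),
    ...(podSpec.ephemeralContainers ?? []),
  ];
  return containers.map((c) => ({ name: c.name, violations: checkContainer(c) }));
}
```

The sketch leaves out probes, seccomp, and the pod-level and workload-level checks, which would follow the same pattern.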
### Severity Classification
| Severity | Meaning | Action |
|----------|---------|--------|
| `incompatible` | Fundamental architecture issue; cannot run on Automatic without redesign | Must fix before migration; flag prominently |
| `requiresChanges` | Manifest changes needed; will be denied at admission | Generate fix diffs |
| `autoFixed` | AKS Automatic will mutate this at admission; no user action needed | Informational; show what will change |
| `informational` | No enforcement | Mention briefly |
### Step 3: Present Findings
Always start with the summary:
```
## AKS Automatic Readiness Assessment
| Status | Count |
|--------|-------|
| ✅ Compatible | X workloads |
| ⚠️ Requires changes | Y workloads |
| ❌ Incompatible | Z workloads |
| 🔧 Auto-fixed by Automatic | W workloads |
| 🏗️ Cluster config issues | N issues |
```
Grouping: ≤ 10 issues: list individually; > 10 issues: group by constraint ID. Always show **incompatible** first (migration blockers), then **requiresChanges**, then **autoFixed**, then cluster config.
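The ordering and grouping rule for workload issues could be sketched like this. `SEVERITY_ORDER` and `arrangeFindings` are hypothetical names, and cluster config issues are assumed to be appended separately after the workload findings.

```javascript
const SEVERITY_ORDER = ["incompatible", "requiresChanges", "autoFixed", "informational"];

// Sorts issues by severity, then lists them individually (10 or fewer)
// or groups them by constraint ID (more than 10).
function arrangeFindings(issues) {
  const sorted = [...issues].sort(
    (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity)
  );
  if (sorted.length <= 10) {
    return sorted.map((issue) => ({
      constraintId: issue.constraintId,
      severity: issue.severity,
      issues: [issue],
    }));
  }
  // A Map keeps insertion order, so groups stay in severity order.
  const groups = new Map();
  for (const issue of sorted) {
    if (!groups.has(issue.constraintId)) {
      groups.set(issue.constraintId, {
        constraintId: issue.constraintId,
        severity: issue.severity,
        issues: [],
      });
    }
    groups.get(issue.constraintId).issues.push(issue);
  }
  return [...groups.values()];
}
```

Each returned entry maps onto one per-issue block in the output, with `issues.length` giving the number of affected resources when grouped.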
Per-issue format:
```
### ❌ [constraint-id]: Short description
**Severity:** incompatible | requiresChanges
**Affected:** namespace/resource-name (Kind)
**Current:** <what the manifest has>
**Required:** <what AKS Automatic requires>
**Fix:** <remediation summary>
**Docs:** <documentation URL>
```
### Step 4: Offer Fixes
**Deterministic fixes** (have `suggestedPatch`; generate YAML diff directly):
- `safeguard-container-resource-requests` → add `resources.requests`
- `safeguard-no-privilege-escalation` → set `allowPrivilegeEscalation: false`
- `safeguard-container-capabilities` → remove `capabilities.add`
- `safeguard-allowed-seccomp-profiles` → add `seccompProfile: RuntimeDefault`
- `safeguard-enforce-apparmor` → add AppArmor annotation
- `safeguard-csi-driver-storage-class` → replace in-tree provisioner
Use patterns in `references/common-fixes.md` and generate a before/after diff. Starting resource values use safe defaults; VPA (enabled on Automatic) will auto-tune them after deployment.
**LLM-reasoned fixes** (require app context; use `remediationGuide`):
- `safeguard-images-no-latest` → the correct tag is user- and release-specific; ask the user: _"What specific version tag or SHA digest should I pin this image to?"_ Do not guess.
- `safeguard-pod-enforce-antiaffinity` → needs app labels for the selector
- `safeguard-no-host-path-volumes` → replacement depends on what hostPath is used for
- `safeguard-block-host-namespaces` → may require architecture redesign
- `safeguard-host-network-ports` → needs an alternative networking approach
For incompatible findings (e.g., hostPath volumes), explain the issue and propose alternatives. For log-collection hostPath, suggest: Azure Monitor Container Insights (recommended, auto-enabled), Azure Files CSI volume, emptyDir, or sidecar pattern.
**Fix application flow:**
1. Generate the fix as a YAML diff
2. Show the diff with explanation
3. Wait for explicit approval: "apply", "edit", or "skip"
4. On approval, apply the change to the file
5. Move to the next finding
If the user says "fix all" or "apply all deterministic fixes", first generate a single combined diff containing all eligible `suggestedPatch`-based fixes, show that combined diff with an explanation, and wait for one explicit approval before applying any writes. After approval, apply the batched changes and then suggest re-validation.
### Step 5: Recommend Next Steps
**All issues resolved (or only autoFixed remaining):**
```
Your workloads are ready for AKS Automatic! Next steps:
1. Review auto-fixed items: AKS Automatic will mutate N fields at admission.
2. Apply cluster configuration changes (see cluster config issues above).
3. Perform the SKU switch: follow the migration guide.
4. Verify: after migration, check all workloads are running and healthy.
```
See `references/migration-guide-summary.md` for the full migration checklist.
**Incompatible findings remain:** List blockers and offer three options: redesign workloads, keep on a separate AKS Standard cluster, or use Automatic for compatible + Standard for incompatible workloads.
**Cluster config issues remain (Day-0 decisions):** API Server VNet Integration, node pool OS SKU (requires recreating system node pools), and ephemeral OS disks require a new cluster; redirect to the `azure-kubernetes` skill for cluster creation help.
## Error Handling
| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| MCP tool call fails or times out | Invalid credentials or subscription context | Verify `az login`, confirm active subscription with `az account show`; if MCP remains unavailable, continue with offline validation using local or exported manifests and the bundled constraint spec |
| HTTP 403 on assessment action | Missing permission | Ensure caller has sufficient RBAC access to read and assess the cluster via AKS APIs |
| API returns HTTP 202 | Large cluster (500+ workloads), handled as an async operation | Poll the `Location` header URL using the `Retry-After` interval |
| Helm chart uses Go templating and cannot be evaluated | Template values not resolved | Ask user for rendered output (`helm template`) or values files |
| Constraint spec version mismatch | Skill bundles spec v1.1.1 (2026-03-15) | Note version in output; recommend re-running after spec update |
## Reference Files
| File | When to load |
|------|--------------|
| `references/constraint-spec-v1.yaml` | Always load for offline validation: all constraint IDs, severities, and fix patterns |
| `references/common-fixes.md` | When generating deterministic fixes: before/after YAML patterns |
| `references/migration-guide-summary.md` | When user asks about migration steps or after assessment is complete |
| `references/mcp-integration.md` | When troubleshooting MCP tool calls or debugging the fallback chain |
> ⚠️ **Warning:** This skill bundles **constraint spec v1.1.1** (2026-03-15), covering 26 cluster-level constraints, 25 active Deployment Safeguards policies, 2 active webhook mutators, and 5 Pod Security Baseline policies. Always note the spec version in assessment output.
common-fixes.md 6.0 KB
# Common Fix Patterns for AKS Automatic Compatibility
Loaded on demand when generating YAML fixes during assessment.
Maps to constraint IDs in `constraint-spec-v1.yaml`.
---
## `safeguard-container-resource-requests`: Add resource requests/limits
**Before:**
```yaml
containers:
  - name: web
    image: myapp:v1.0.0
```
**After:**
```yaml
containers:
  - name: web
    image: myapp:v1.0.0
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```
> 💡 **Tip:** Use safe minimums as starting values. VPA (auto-enabled on AKS Automatic) will tune these after deployment based on actual usage.
---
## `safeguard-no-privilege-escalation`: Disable privilege escalation
**Before:**
```yaml
securityContext: {}
```
**After:**
```yaml
securityContext:
  allowPrivilegeEscalation: false
```
---
## `safeguard-container-capabilities`: Drop all capabilities
**Before:**
```yaml
securityContext:
  capabilities:
    add: ["NET_ADMIN"]
```
**After:**
```yaml
securityContext:
  capabilities:
    drop: ["ALL"]
  allowPrivilegeEscalation: false
```
> ⚠️ **Warning:** If the app genuinely requires `NET_ADMIN` or similar, it is **incompatible** with AKS Automatic. Do not silently drop the capability; explain the incompatibility and suggest a redesign.
---
## `safeguard-allowed-seccomp-profiles`: Add seccomp profile
**Before:**
```yaml
spec:
  containers:
    - name: web
```
**After:**
```yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: web
```
---
## `safeguard-enforce-apparmor`: Add AppArmor annotation
**Before:**
```yaml
metadata:
  name: my-deployment
```
**After:**
```yaml
metadata:
  name: my-deployment
  annotations:
    container.apparmor.security.beta.kubernetes.io/web: runtime/default
```
> 💡 **Tip:** Replace `web` with the actual container name. Add one annotation per container. For a Deployment or other controller, the annotation only takes effect on the pod template (`spec.template.metadata.annotations`).
---
## `safeguard-images-no-latest`: Pin image tag *(LLM-reasoned: ask user)*
**Before:**
```yaml
image: myapp:latest
```
**After:**
```yaml
image: myapp:v1.2.3 # version confirmed with user
```
> ⚠️ **Warning:** Do not guess the version. Ask the user: _"What specific version tag or SHA digest should I pin this image to?"_ If from a public registry, suggest checking Docker Hub or the registry for the latest stable tag.
---
## `safeguard-probes-configured`: Add probes *(best-practice recommendation; warning-only, not blocked at admission)*
**HTTP app (most common):**
```yaml
readinessProbe:
  httpGet:
    path: /healthz # ask user for their health endpoint
    port: 8080 # ask user for port
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
```
**TCP-only app (databases, Redis, etc.):**
```yaml
readinessProbe:
  tcpSocket:
    port: 6379 # service port
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 6379
  initialDelaySeconds: 15
  periodSeconds: 20
```
**gRPC app:**
```yaml
readinessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 5
  periodSeconds: 10
```
---
## `safeguard-pod-enforce-antiaffinity`: Add topology spread *(LLM-reasoned: ask user for label)*
Ask user: _"What label key/value identifies your workload's pods?"_
```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: <app-label> # from user
      containers:
        - name: web
```
---
## `safeguard-csi-driver-storage-class`: Migrate in-tree to CSI
**Before (Azure Disk in-tree):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate
```
**After (Azure Disk CSI):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer # preferred for zonal disks
```
| In-tree provisioner | CSI replacement |
|---|---|
| `kubernetes.io/azure-disk` | `disk.csi.azure.com` |
| `kubernetes.io/azure-file` | `file.csi.azure.com` |
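
The mapping table lends itself to a mechanical rewrite. The sketch below is illustrative only: `CSI_REPLACEMENTS` and `migrateStorageClass` are hypothetical names, the input is assumed to be an already-parsed StorageClass object, and `volumeBindingMode` is left unchanged for the user to decide.

```javascript
// In-tree provisioner -> CSI replacement, taken from the table above.
const CSI_REPLACEMENTS = new Map([
  ["kubernetes.io/azure-disk", "disk.csi.azure.com"],
  ["kubernetes.io/azure-file", "file.csi.azure.com"],
]);

// Returns a copy of the StorageClass with the provisioner swapped,
// or the original object when no in-tree provisioner is used.
function migrateStorageClass(storageClass) {
  const replacement = CSI_REPLACEMENTS.get(storageClass.provisioner);
  if (replacement === undefined) return storageClass;
  return { ...storageClass, provisioner: replacement };
}
```

Returning a copy leaves the original object untouched, so a before/after diff can be generated for user approval.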
---
## PodDisruptionBudget: Add missing PDB
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <app-name>-pdb
  namespace: <namespace>
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: <app-label>
```
## PodDisruptionBudget: Fix blocking `maxUnavailable: 0`
**Before:**
```yaml
spec:
  maxUnavailable: 0
```
**After:**
```yaml
spec:
  maxUnavailable: 1
```
> ⚠️ **Warning:** `maxUnavailable: 0` completely blocks node drain during AKS Automatic upgrades. At least 1 pod must be allowed to be unavailable for upgrades to proceed.
---
## `safeguard-no-host-path-volumes`: Replace hostPath *(incompatible; suggest alternatives)*
| hostPath use case | Recommended replacement |
|---|---|
| Log collection (`/var/log`) | Azure Monitor Container Insights (auto-enabled on AKS Automatic) |
| Container runtime socket (`/var/run/docker.sock`) | Use the AKS Automatic node observability features; direct socket access is not supported |
| Shared config files | `configMap` volume |
| Secrets / credentials | Kubernetes `secret` volume or Azure Key Vault CSI Driver |
| Ephemeral scratch space | `emptyDir` volume |
| Persistent app data | Azure Disk CSI via PVC (`disk.csi.azure.com`) |
| Shared file storage across pods | Azure Files CSI via PVC (`file.csi.azure.com`) |
**emptyDir example:**
```yaml
volumes:
  - name: scratch
    emptyDir: {}
```
**Azure Files CSI PVC example:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 10Gi
```
constraint-spec-v1.yaml 14.3 KB
# AKS Automatic Compatibility Constraint Spec: Condensed Reference
# Version: 1.1.1 | AKS: 2026-03-15
# This condensed version is optimized for LLM context.
apiVersion: aks-automatic.azure.com/v1
kind: ConstraintSpecReference
metadata:
  version: "1.1.1"
  aksVersion: "2026-03-15"
  policyInitiatives:
    deploymentSafeguards: c047ea8e-9c78-49b2-958b-37e56d291a44
    podSecurityBaseline: a8640138-9b0a-4a28-b8cb-1666c838647d
    podSecurityRestricted: 42b8ef37-b724-4e24-bbc8-7a7708edfe00
# =============================================================================
# CLUSTER CONSTRAINTS (26 total, 3 are internal/HOBO)
# =============================================================================
clusterConstraints:
# -- Addons --
- id: cluster-azure-policy-addon
  severity: requiresChanges
  field: addonProfiles.azurepolicy.enabled
  required: true
  fix: "az aks addon enable --addon azure-policy"
- id: cluster-keyvault-secrets-provider
  severity: requiresChanges
  field: addonProfiles.azureKeyvaultSecretsProvider.enabled
  required:
    enabled: true
    enableSecretRotation: true
  fix: "az aks addon enable --addon azure-keyvault-secrets-provider --enable-secret-rotation"
# -- Networking --
- id: cluster-api-server-vnet-integration
  severity: requiresChanges
  field: privateConnectProfile.enabled
  required: true
  fix: "az aks update --enable-apiserver-vnet-integration --apiserver-subnet-id <subnet-id>"
- id: cluster-azure-cni-overlay-cilium
  severity: requiresChanges
  field: networkPlugin/networkPluginMode/networkPolicy/ebpfDataplane
  required: azure/overlay/cilium/cilium
  fix: |
    Step 1: az aks update --network-plugin-mode overlay --pod-cidr 192.168.0.0/16
    Step 2: az aks update --network-dataplane cilium
    Note: Irreversible. Disable NAP before Cilium update.
- id: cluster-standard-load-balancer
  severity: requiresChanges
  field: loadBalancerSku
  required: standard
  fix: "az aks update --load-balancer-sku standard (in-place upgrade from Basic supported)"
- id: cluster-nat-gateway-managed-vnet
  severity: requiresChanges
  condition: AKS-managed VNet only
  field: outboundType
  required: managedNATGateway
  fix: "az aks update --outbound-type managedNATGateway"
# -- Upgrades --
- id: cluster-auto-upgrade
  severity: requiresChanges
  field: autoUpgradeProfile
  required: upgradeChannel=stable, nodeOSUpgradeChannel=NodeImage
  fix: "az aks update --auto-upgrade-channel stable --node-os-upgrade-channel NodeImage"
# -- Ingress --
- id: cluster-web-app-routing
  severity: requiresChanges
  field: ingressProfile.webAppRouting.enabled
  required: true
  fix: "az aks addon enable --addon web_application_routing"
# -- Identity --
- id: cluster-workload-identity-oidc
  severity: requiresChanges
  field: workloadIdentity.enabled + oidcProfile.enabled
  required: true
  fix: "az aks update --enable-oidc-issuer --enable-workload-identity"
- id: cluster-azure-rbac
  severity: requiresChanges
  field: aadProfile (managed + enableAzureRBAC)
  required: true
  fix: "az aks update --enable-aad --enable-azure-rbac"
- id: cluster-disable-local-accounts
  severity: requiresChanges
  field: disableLocalAccounts
  required: true
  fix: "az aks update --disable-local-accounts"
- id: cluster-system-assigned-managed-identity
severity: requiresChanges
condition: AKS-managed VNet only
field: identity.type
required: SystemAssigned
fix: "Day-0 decision for managed VNet clusters."
# -- Security --
- id: cluster-image-cleaner
severity: requiresChanges
field: securityProfile.imageCleaner.enabled
required: true
fix: "az aks update --enable-image-cleaner"
# -- Autoscaling --
- id: cluster-vpa
severity: requiresChanges
field: verticalPodAutoscaler
required: enabled=true, updateMode=Off
fix: "az aks update --enable-vpa"
- id: cluster-keda
severity: requiresChanges
field: keda.enabled
required: true
fix: "az aks update --enable-keda"
- id: cluster-node-auto-provisioning
severity: requiresChanges
field: nodeProvisioningProfile.mode
required: Auto
fix: "az aks update --node-provisioning-mode Auto"
# -- Governance --
- id: cluster-node-rg-readonly
severity: requiresChanges
field: nodeResourceGroupProfile.restrictionLevel
required: ReadOnly
fix: "Day-0 setting. May require new cluster."
# -- Node Pool (system pools) --
- id: pool-ephemeral-os-disk
severity: incompatible
appliesTo: system pools
field: storageProfile
required: Ephemeral
fix: "Day-0. Recreate system node pool."
- id: pool-availability-zones
severity: incompatible
appliesTo: system pools only
field: availabilityZones
required: "[1, 2, 3]"
fix: "Day-0. Recreate system pool in 3-AZ region. User pools not affected."
- id: pool-critical-addons-taint
severity: requiresChanges
appliesTo: system pools
field: taints
required: CriticalAddonsOnly=true:NoSchedule
fix: "az aks nodepool update --node-taints CriticalAddonsOnly=true:NoSchedule"
- id: pool-vmss-type
severity: incompatible
appliesTo: system pools
field: type
required: VirtualMachineScaleSets
fix: "Day-0. Recreate as VMSS."
- id: pool-azure-linux-os
severity: incompatible
appliesTo: system pools only
field: osSKU
required: AzureLinux
fix: "Day-0. Recreate system pool with --os-sku AzureLinux. User pools can use any OS."
- id: pool-ssh-disabled
severity: requiresChanges
appliesTo: all pools
field: agentPoolProfiles[*].securityProfile.sshAccess
required: Disabled
fix: "az aks nodepool update --cluster-name CLUSTER --name POOL_NAME --ssh-access disabled"
# -- HOBO (internal, skip in assessment) --
- id: hobo-critical-addons-noschedule
severity: internal
note: AKS-managed, not customer-facing
- id: hobo-critical-addons-noexecute
severity: internal
note: AKS-managed, not customer-facing
- id: hobo-hosted-vm-taint
severity: internal
note: AKS-managed, not customer-facing
# =============================================================================
# WORKLOAD CONSTRAINTS - Deployment Safeguards (25 active policies)
# Initiative: c047ea8e | Effect: Deny on Automatic
# =============================================================================
safeguards:
# -- AKS Best Practices (10 policies) --
- id: safeguard-restricted-node-edits
policyId: 53a4a537
severity: requiresChanges
category: nodeProtection
check: Blocks unauthorized Node object mutations
fix: Manage node pools through the AKS API (az aks nodepool) instead of direct Node object edits
- id: safeguard-container-resource-requests
policyId: 03a4ecdb
severity: autoFixed
category: resources
check: Every container must have cpu + memory requests and limits
effect: "ResourceRequestsWorkloadMutator sets defaults cpu=500m, memory=2Gi for requests+limits; enforces minimums cpu=100m, memory=100Mi; fixes QoS if requests > limits"
- id: safeguard-pod-enforce-antiaffinity
policyId: 34c88cd4
severity: autoFixed
category: availability
check: Replicated workloads need podAntiAffinity or topologySpreadConstraints
effect: "AntiAffinityTopologySpreadWorkloadMutator adds preferred anti-affinity (weight=100, hostname) + topology spread (maxSkew=1, hostname, ScheduleAnyway) if neither exists"
- id: safeguard-restricted-labels
policyId: a22123bd
severity: requiresChanges
category: labeling
check: AKS-reserved label prefixes blocked
fix: Remove/rename labels with kubernetes.azure.com/ prefix
- id: safeguard-restricted-taints
policyId: 48940d92
severity: requiresChanges
category: nodeProtection
check: AKS-reserved taint keys blocked for users
fix: Remove reserved taints, use custom taint keys
- id: safeguard-probes-configured
policyId: b1a9997f
severity: informational
enforcement: warn # Warning-only: deployments are admitted with a kubectl warning, not denied
category: reliability
check: Every container should have readinessProbe + livenessProbe (recommended best practice)
fix: Add probes (app-specific HTTP, TCP, or exec); recommended, not required for migration
- id: safeguard-csi-driver-storage-class
policyId: 4f3823b6
severity: requiresChanges
category: storage
check: StorageClass must use CSI provisioner (not in-tree)
fix: "Replace kubernetes.io/azure-disk with disk.csi.azure.com"
- id: safeguard-unique-service-selectors
policyId: b0fdedee
severity: requiresChanges
category: networking
check: Services must have unique selectors per namespace
fix: Deduplicate Service selectors
- id: safeguard-images-no-latest
policyId: 021f8078
severity: requiresChanges
category: imagePolicy
check: Image tag must not be :latest or untagged (no colon)
patch: "replace image tag with specific version or sha256 digest"
# -- Baseline PSS policies in Safeguards (15 policies) --
- id: safeguard-block-host-namespaces
policyId: 47a1ee2f
severity: incompatible
category: podSecurity
check: hostPID and hostIPC must be false
fix: Remove hostPID/hostIPC; redesign if required
- id: safeguard-host-network-ports
policyId: 82985f06
severity: incompatible
category: podSecurity
check: hostNetwork must be false, no hostPort allowed
fix: Use ClusterIP Services or Ingress instead
- id: safeguard-allowed-sysctls
policyId: 5e5a0673
severity: requiresChanges
category: podSecurity
check: Only safe sysctls allowed (10 specific ones)
fix: Remove disallowed sysctls
- id: safeguard-allowed-users-groups
policyId: f06ddb64
severity: requiresChanges
category: podSecurity
check: RunAsUser must be non-root (MustRunAsNonRoot)
patch: "add securityContext.runAsNonRoot: true, runAsUser: 1000"
- id: safeguard-windows-block-container-admin
policyId: 5485eac0
severity: requiresChanges
category: podSecurity
check: Windows containers must not run as ContainerAdministrator
- id: safeguard-no-privilege-escalation
policyId: 1c6e92c9
severity: requiresChanges
category: podSecurity
check: allowPrivilegeEscalation must not be true
patch: "add securityContext.allowPrivilegeEscalation: false"
- id: safeguard-no-host-path-volumes
policyId: 098fc59e
severity: incompatible
category: podSecurity
check: hostPath volumes blocked (empty allowed list)
fix: Replace with PVC, ConfigMap, CSI, or sidecar logging
- id: safeguard-enforce-apparmor
policyId: 511f5417
severity: requiresChanges
category: podSecurity
check: Must use runtime/default or RuntimeDefault AppArmor profile
patch: "add securityContext.appArmorProfile.type: RuntimeDefault (K8s 1.30+) or annotation container.apparmor.security.beta.kubernetes.io/{name}: runtime/default"
- id: safeguard-enforce-selinux
policyId: e1e6c427
severity: informational
category: podSecurity
check: Restricted PSS only; SELinux type must be container_t, container_init_t, container_kvm_t, or container_engine_t. Not enforced by AKS Automatic baseline.
fix: "Optional hardening: remove custom seLinuxOptions or use allowed types"
- id: safeguard-windows-block-host-process
policyId: 077f0ce1
severity: incompatible
category: podSecurity
check: Windows HostProcess pods blocked
fix: Remove hostProcess; incompatible if required
- id: safeguard-no-privileged-containers
policyId: 95edb821
severity: incompatible
category: podSecurity
check: privileged=true blocked
fix: Remove privileged mode; use specific capabilities instead
- id: safeguard-no-custom-proc-mount
policyId: f85eb0dd
severity: requiresChanges
category: podSecurity
check: Only Default procMount allowed
patch: "remove securityContext.procMount"
- id: safeguard-container-capabilities
policyId: c26596ff
severity: requiresChanges
category: podSecurity
check: No capabilities may be added (allowedCapabilities=[])
patch: "remove securityContext.capabilities.add"
- id: safeguard-allowed-seccomp-profiles
policyId: 975ce327
severity: requiresChanges
category: podSecurity
check: Only RuntimeDefault and Localhost seccomp profiles
patch: "add seccompProfile.type: RuntimeDefault"
- id: safeguard-allowed-volume-types
policyId: 16697877
severity: informational
category: podSecurity
check: Restricted PSS recommendation (not enforced by AKS Automatic baseline)
fix: "Optional hardening: replace non-standard volumes with supported alternatives"
# =============================================================================
# WEBHOOK MUTATIONS (2 active mutators) - auto-applied at admission
# =============================================================================
mutations:
- id: mutation-anti-affinity-topology-spread
policyId: implicit
target: [Deployment, StatefulSet, ReplicaSet]
effect: "Adds preferred pod anti-affinity (weight=100, kubernetes.io/hostname) + topology spread (maxSkew=1, kubernetes.io/hostname, ScheduleAnyway). Skips if any existing anti-affinity or topology spread. Label priority: app > app.kubernetes.io/name > default-antiaffinity-applabel."
- id: mutation-resource-requests-default
policyId: implicit
target: containers
effect: "Sets resources.requests+limits defaults cpu=500m, memory=2Gi. Minimums cpu=100m, memory=100Mi. If only limits set, requests=limits. If requests > limits, requests capped at limits (QoS fix)."
# =============================================================================
# POD SECURITY STANDARDS
# =============================================================================
podSecurityBaseline:
initiative: a8640138-9b0a-4a28-b8cb-1666c838647d
enforcement: mandatory (Deny)
note: All 5 policies overlap with Deployment Safeguards
policies: [NoPrivilegedContainers, BlockUsingHostNetwork, BlockUsingHostProcessIDAndIPC, ContainerCapabilities, NoHostPathVolume]
podSecurityRestricted:
initiative: 42b8ef37-b724-4e24-bbc8-7a7708edfe00
enforcement: optional (Audit only)
note: Not enforced by default. Opt-in for stricter security.
additionalPolicies: [NoPrivilegeEscalation, AllowedVolumeTypes, AllowedUsersGroups, AllowedSeccompProfiles]
mcp-integration.md 6.5 KB
# MCP Integration Reference
Loaded when troubleshooting MCP tool calls, debugging the fallback chain, or understanding the API response format.
---
## Tool Discovery
Always call `mcp_azure_mcp_aks` first to discover the currently available tool surface. Do not assume a fixed action name; the available actions depend on the MCP server version deployed to the client.
```javascript
mcp_azure_mcp_aks({ action: "discover" })
```
The response lists available actions and their parameter schemas. Use the returned schema; do not hardcode parameter names.
---
## Assessment Call
After calling `discover`, use the assessment action name returned in the response. Pass parameters according to the discovered schema; do not hardcode action names or API versions.
Typical parameters include:
- `subscriptionId`: Azure subscription ID
- `resourceGroupName`: resource group containing the cluster
- `resourceName`: AKS cluster name
- `scope` (optional): filter by namespaces or workload types
Example shape (use actual action name and schema from discover output):
```javascript
mcp_azure_mcp_aks({
  action: "<action-from-discover>",
  subscriptionId: "<subscription-id>",
  resourceGroupName: "<resource-group>",
  resourceName: "<cluster-name>",
  scope: {
    excludeNamespaces: ["kube-system", "gatekeeper-system", "azure-arc"],
    workloadTypes: ["Deployment", "StatefulSet", "DaemonSet", "CronJob", "Job"]
  }
})
```
All `scope` parameters are optional. If omitted, the API assesses all workloads excluding `kube-system` and `gatekeeper-system`.
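
As a rough illustration of that default, the sketch below mirrors the scope filter on the client side, for example to preview which exported workloads an assessment would cover. It is not the API's implementation: `inScope` and `DEFAULT_EXCLUDED` are hypothetical names, and the assumption that an explicit `excludeNamespaces` list replaces the default exclusions (rather than extending them) is mine, not the source's.

```javascript
// Hypothetical client-side mirror of the scope filter; the real filtering happens server-side.
const DEFAULT_EXCLUDED = ["kube-system", "gatekeeper-system"];

function inScope(workload, scope = {}) {
  // Assumption: an explicit excludeNamespaces list replaces the defaults.
  const excluded = scope.excludeNamespaces ?? DEFAULT_EXCLUDED;
  const kinds = scope.workloadTypes; // undefined means "all workload kinds"
  if (excluded.includes(workload.namespace)) return false;
  if (kinds && !kinds.includes(workload.kind)) return false;
  return true;
}

const sample = [
  { name: "coredns", namespace: "kube-system", kind: "Deployment" },
  { name: "web", namespace: "default", kind: "Deployment" },
  { name: "nightly", namespace: "batch", kind: "CronJob" },
];

// Only Deployments outside the default-excluded namespaces remain.
const selected = sample.filter((w) => inScope(w, { workloadTypes: ["Deployment"] }));
console.log(selected.map((w) => `${w.namespace}/${w.name}`)); // [ 'default/web' ]
```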
---
## Required Permissions
```bash
# Check current role assignments
az role assignment list \
--assignee $(az ad signed-in-user show --query id -o tsv) \
--scope /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>
# Minimum permissions required:
# - Microsoft.ContainerService/managedClusters/read
# - Microsoft.ContainerService/managedClusters/listClusterUserCredential/action
# Assign if missing (requires Owner or User Access Administrator)
az role assignment create \
--assignee <principal-id> \
--role "Azure Kubernetes Service Cluster User Role" \
--scope /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>
```
---
## Response Schema
The API returns three top-level sections:
### `summary`
```json
{
  "summary": {
    "totalWorkloads": 42,
    "compatible": 27,
    "requiresChanges": 12,
    "incompatible": 3,
    "autoFixed": 8,
    "clusterConfigIssues": 4
  }
}
```
### `clusterConfiguration`
```json
{
  "clusterConfiguration": [
    {
      "constraintId": "cluster-oidc-issuer",
      "severity": "requiresChanges",
      "description": "OIDC issuer not enabled",
      "remediation": "az aks update --enable-oidc-issuer --resource-group <rg> --name <cluster>",
      "documentationUrl": "https://learn.microsoft.com/azure/aks/..."
    }
  ]
}
```
### `workloads[]`
```json
{
  "workloads": [
    {
      "name": "sample-app",
      "namespace": "default",
      "kind": "Deployment",
      "overallStatus": "requiresChanges",
      "issues": [
        {
          "constraintId": "safeguard-images-no-latest",
          "severity": "requiresChanges",
          "description": "Container 'web' uses :latest image tag",
          "field": "/spec/containers/0/image",
          "suggestedPatch": null,
          "remediationGuide": "Pin the image to a specific version or SHA digest"
        }
      ]
    }
  ]
}
```
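
A response in this shape can be reduced to a quick triage view. The sketch below is one possible way to do that, assuming only the fields shown above (`workloads[].issues[].severity` and `overallStatus`). The helper names `tallyIssues` and `listBlockers` are hypothetical, and the `legacy-agent` workload is made-up sample data.

```javascript
// Sample shaped like the documented response; "legacy-agent" is invented for illustration.
const response = {
  workloads: [
    {
      name: "sample-app", namespace: "default", kind: "Deployment",
      overallStatus: "requiresChanges",
      issues: [{ constraintId: "safeguard-images-no-latest", severity: "requiresChanges" }],
    },
    {
      name: "legacy-agent", namespace: "ops", kind: "DaemonSet",
      overallStatus: "incompatible",
      issues: [
        { constraintId: "safeguard-no-host-path-volumes", severity: "incompatible" },
        { constraintId: "safeguard-allowed-seccomp-profiles", severity: "requiresChanges" },
      ],
    },
  ],
};

// Count issues per severity across all workloads.
function tallyIssues(resp) {
  const counts = {};
  for (const workload of resp.workloads ?? []) {
    for (const issue of workload.issues ?? []) {
      counts[issue.severity] = (counts[issue.severity] ?? 0) + 1;
    }
  }
  return counts;
}

// Return "namespace/name" for every workload whose overall status is a hard blocker.
function listBlockers(resp) {
  return (resp.workloads ?? [])
    .filter((w) => w.overallStatus === "incompatible")
    .map((w) => `${w.namespace}/${w.name}`);
}

console.log(tallyIssues(response));  // { requiresChanges: 2, incompatible: 1 }
console.log(listBlockers(response)); // [ 'ops/legacy-agent' ]
```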
---
## Async Response Handling (HTTP 202, Large Clusters)
For clusters with 500+ workloads, the API returns HTTP 202 Accepted with a `Location` header. Poll until complete:
```javascript
// Initial call returns: { status: 202, headers: { Location: "...", "Retry-After": "30" } }
async function pollAssessment(locationUrl, retryAfterSeconds) {
  while (true) {
    await new Promise(r => setTimeout(r, retryAfterSeconds * 1000));
    const response = await mcp_azure_mcp_aks({
      action: "pollOperation",
      locationUrl: locationUrl
    });
    if (response.status === "Succeeded") return response.result;
    if (response.status === "Failed") throw new Error(response.error.message);
    retryAfterSeconds = response.retryAfter ?? retryAfterSeconds;
  }
}
```
---
## Fallback Chain
Attempt each step in order. Do not ask the user which is available; just try:
```
1. mcp_azure_mcp_aks -> discover, then call the assessment action returned
   if that fails (tool not found: Azure MCP server not configured):
2. Inform user to install Azure MCP, then fall back to offline validation
   kubectl get deployment,statefulset,daemonset,job,cronjob -A -o yaml > /tmp/workloads.yaml
   kubectl get pdb,storageclass -A -o yaml > /tmp/policies.yaml
```
If `mcp_azure_mcp_aks` is not available, say:
> "The Azure MCP server is not configured. To enable live cluster assessment, install it following [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup). For now, I can validate your local manifests offline: export them with `kubectl get ... -o yaml` or share your manifest files."
Then proceed to offline manifest validation against `constraint-spec-v1.yaml`.
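
To make the offline path concrete, here is a minimal sketch of checking a single pod spec (already parsed from YAML into a plain object) against a handful of the constraints in `constraint-spec-v1.yaml`. It covers only five safeguards and is not the assessment API's logic. The function names `imageIsPinned` and `checkPodSpec` are hypothetical, and treating any `@sha256:` reference as pinned is an assumption.

```javascript
// An image counts as pinned if it has a digest, or a tag other than "latest".
function imageIsPinned(image) {
  if (image.includes("@sha256:")) return true; // assumption: digest references are always acceptable
  // Look only at the last path segment so a registry port (host:5000/app) is not mistaken for a tag.
  const lastSegment = image.slice(image.lastIndexOf("/") + 1);
  const colon = lastSegment.lastIndexOf(":");
  if (colon === -1) return false; // untagged
  return lastSegment.slice(colon + 1) !== "latest";
}

// Return the IDs of the constraints (from the spec) that this pod spec violates.
function checkPodSpec(podSpec) {
  const findings = [];
  if (podSpec.hostPID || podSpec.hostIPC) findings.push("safeguard-block-host-namespaces");
  if (podSpec.hostNetwork) findings.push("safeguard-host-network-ports");
  if ((podSpec.volumes ?? []).some((v) => v.hostPath)) findings.push("safeguard-no-host-path-volumes");
  for (const c of podSpec.containers ?? []) {
    if (!imageIsPinned(c.image)) findings.push("safeguard-images-no-latest");
    if (c.securityContext?.privileged === true) findings.push("safeguard-no-privileged-containers");
  }
  return [...new Set(findings)]; // de-duplicate across containers
}

const findings = checkPodSpec({
  hostNetwork: true,
  volumes: [{ name: "logs", hostPath: { path: "/var/log" } }],
  containers: [{ name: "web", image: "nginx:latest", securityContext: { privileged: true } }],
});
console.log(findings);
```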
---
## Prerequisites Verification
Run these before attempting MCP or CLI assessment:
```bash
# 1. Verify Azure login
az account show --query "{name:name, id:id, state:state}" -o table
# 2. Verify cluster exists and is accessible
az aks show \
--resource-group <rg> \
--name <cluster> \
--query "{name:name, provisioningState:provisioningState, sku:sku.name}" \
-o table
# 3. Verify kubectl context
kubectl config current-context
kubectl cluster-info
```
```javascript
// 4. Verify MCP server is reachable (Azure MCP)
// If this returns available actions, MCP is configured
mcp_azure_mcp_aks({ action: "discover" })
```
---
## Common MCP Errors
| Error | Cause | Fix |
|---|---|---|
| `tool not found: mcp_azure_mcp_aks` | Azure MCP server not configured | Guide user to install: [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup), then fall back to offline |
| `HTTP 401 Unauthorized` | Not logged in | `az login` |
| `HTTP 403 Forbidden` | Insufficient RBAC permissions | Ensure caller has read access to the cluster via AKS APIs |
| `HTTP 404 Not Found` | Wrong subscription, RG, or cluster name | Verify with `az aks list -o table` |
| `HTTP 202` with no Location header | API version mismatch | Ensure the MCP server version supports async polling; retry with the latest server |
| Timeout after 30s | Cluster too large (500+ workloads) | Implement async polling (see the Async Response Handling section above) |
migration-guide-summary.md 4.4 KB
# AKS Automatic Migration Guide
Loaded when user asks about migration steps or after assessment is complete.
---
## Migration Checklist
### Phase 1: Assessment (this skill)
- [ ] Run the AKS Automatic compatibility assessment (via `mcp_azure_mcp_aks({ action: "discover" })` then the assessment action returned, or the offline manifest scan)
- [ ] Resolve all `incompatible` findings; these are hard blockers
- [ ] Apply all `requiresChanges` fixes; these will be denied at admission
- [ ] Review `autoFixed` items to understand what AKS Automatic will mutate at runtime
- [ ] Address cluster-level Day-0 config issues (see below)
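
The checklist above amounts to a simple gate over the assessment `summary`. A minimal sketch, assuming the summary fields documented in the MCP integration reference; the function name `migrationGate` and the four return labels are hypothetical:

```javascript
// Map an assessment summary to the next migration step.
function migrationGate(summary) {
  if (summary.incompatible > 0) return "blocked"; // hard blockers must be resolved first
  if (summary.requiresChanges > 0 || summary.clusterConfigIssues > 0) return "fix-required"; // would be denied at admission
  if (summary.autoFixed > 0) return "review-mutations"; // admitted, but mutated at runtime
  return "ready";
}

// The sample summary from the response schema is blocked by its 3 incompatible workloads.
console.log(migrationGate({
  totalWorkloads: 42, compatible: 27, requiresChanges: 12,
  incompatible: 3, autoFixed: 8, clusterConfigIssues: 4,
})); // blocked
```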
### Phase 2: Create AKS Automatic Cluster (use `azure-kubernetes` skill)
```bash
az aks create \
--resource-group <resource-group> \
--name <new-cluster-name> \
--sku automatic \
--location <location> \
--generate-ssh-keys
```
> 💡 **Tip:** AKS Automatic auto-enables OIDC issuer, workload identity, Azure CNI Overlay, NAP, VPA, Azure Monitor Container Insights, Deployment Safeguards, and Pod Security Standards (Baseline). No manual configuration is needed for these.
### Phase 3: Validate on New Cluster
```bash
# Get credentials
az aks get-credentials \
--resource-group <resource-group> \
--name <new-cluster-name>
# Dry-run server-side apply: catches admission policy rejections
kubectl apply --dry-run=server -f <manifests-directory>/
# Deploy to a staging namespace first
kubectl create namespace staging
kubectl apply -f <manifests-directory>/ -n staging
# Watch pod startup
kubectl get pods -n staging -w
# Check events for admission rejections
kubectl get events -n staging --sort-by=.lastTimestamp | grep -i "denied\|error\|failed"
```
> ⚠️ **Keep the old cluster running** for a rollback window (recommended: 48 hours minimum) while you validate workloads on the new AKS Automatic cluster.
### Phase 4: Decommission Old Cluster
```bash
# Only after confirming workloads are stable on AKS Automatic
az aks delete \
--resource-group <resource-group> \
--name <old-cluster-name> \
--yes --no-wait
```
---
## Day-0 Decisions: Cluster-Level Configuration Requirements
Some settings require creating a **new** cluster; others can be enabled on existing clusters. Route to `azure-kubernetes` skill for cluster creation.
| Requirement | AKS Automatic default | What to do |
|---|---|---|
| API Server VNet Integration | Required, auto-enabled | Requires a new cluster |
| Network plugin | Azure CNI Overlay | Requires a new cluster if currently on kubenet |
| System node pool OS | Azure Linux | Recreate system node pool (user pools unaffected) |
| OIDC Issuer | Auto-enabled | Can be enabled on existing: `az aks update --enable-oidc-issuer` |
| Workload Identity | Auto-enabled | Can be enabled on existing: `az aks update --enable-workload-identity` |
---
## What AKS Automatic Auto-Enables
No manual setup is needed for these. Show this list when the user asks "what do I get for free":
| Feature | Benefit |
|---|---|
| Node Auto Provisioning (NAP) | Replaces cluster autoscaler; right-sizes node pools automatically |
| Vertical Pod Autoscaler (VPA) | Auto-tunes resource requests after deployment |
| Azure Monitor Container Insights | Logs, metrics, and dashboards out of the box |
| Deployment Safeguards | 25 active deny policies + 2 webhook mutators at admission (resource-requests defaults + anti-affinity/topology-spread) |
| Pod Security Standards (Baseline) | Enforced cluster-wide; Restricted available opt-in |
| Managed OIDC Issuer | Required for workload identity |
| Azure Key Vault CSI Driver | Secret injection without static credentials |
| Ephemeral OS disks | Faster node provisioning by default |
| Azure Linux node OS | Smaller footprint, faster boot times |
---
## Post-Migration Verification Commands
```bash
# Verify all pods running
kubectl get pods -A | grep -v Running | grep -v Completed
# Check for pods stuck in Pending (may indicate resource quota or node issues)
kubectl get pods -A --field-selector status.phase=Pending
# Check Deployment Safeguards are active
kubectl get constrainttemplate -A
# Verify VPA is running
kubectl get vpa -A
# Check NAP node pools
az aks nodepool list \
--resource-group <resource-group> \
--cluster-name <cluster-name> \
--query "[].{name:name, mode:mode, osType:osType, count:count}" \
-o table
# View Container Insights metrics
az aks show \
--resource-group <resource-group> \
--name <cluster-name> \
--query addonProfiles.omsagent.enabled
```
azure-aks-autoscaler.md 3.0 KB
# AKS Cluster Autoscaler (CAS)
Enable and tune the Cluster Autoscaler to automatically scale down idle nodes.
## Check CAS Status
```bash
az aks show \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--query "agentPoolProfiles[].{name:name, casEnabled:enableAutoScaling, min:minCount, max:maxCount, count:count}" \
-o table
az aks show \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--query "autoScalerProfile" -o json
```
## Check Node Utilization (7 days)
Follow the metrics discovery steps in [azure-aks-rightsizing.md](./azure-aks-rightsizing.md#historical-metrics-azure-monitor--use-when-prometheus-or-container-insights-is-enabled) to list available metric definitions and query node CPU utilization. Use metric names such as `node_cpu_usage_percentage` or `cpuUsagePercentage` depending on what's available on the cluster.
## Enable CAS
```bash
# Cluster-level
az aks update \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--enable-cluster-autoscaler \
--min-count <MIN_NODES> --max-count <MAX_NODES>
# Specific node pool
az aks nodepool update \
--cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--name "<NODEPOOL_NAME>" \
--enable-cluster-autoscaler \
--min-count <MIN_NODES> --max-count <MAX_NODES>
```
## Recommended min/max Defaults
| Scenario | min-count | max-count |
|----------|-----------|-----------|
| Dev/test | 1 | current_count |
| Production (web/API) | 2 | current_count * 3 |
| Production (batch) | 0 | current_count * 5 |
> Risk: Low. CAS only scales down when pods can be safely rescheduled. Set min-count >= 2 for production HA.
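
The defaults table can be expressed as a small lookup, which is handy when generating `--min-count` / `--max-count` values for several pools. This is only a sketch: `recommendedCounts` and the scenario keys are hypothetical names, and clamping `max` so it never drops below `min` is an added guard that the table does not state.

```javascript
// Derive min/max node counts from the scenario and the pool's current node count.
function recommendedCounts(scenario, currentCount) {
  const table = {
    "dev-test": { min: 1, factor: 1 },
    "production-web": { min: 2, factor: 3 },
    "production-batch": { min: 0, factor: 5 },
  };
  if (!Object.prototype.hasOwnProperty.call(table, scenario)) {
    throw new Error(`unknown scenario: ${scenario}`);
  }
  const { min, factor } = table[scenario];
  // Added guard (not in the table): max must never be smaller than min.
  return { min, max: Math.max(min, currentCount * factor) };
}

console.log(recommendedCounts("production-web", 4));   // { min: 2, max: 12 }
console.log(recommendedCounts("production-batch", 2)); // { min: 0, max: 10 }
```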
## Tune CAS Profile
Apply when CAS is already on but idle nodes persist:
> ⚠️ **Warning:** Setting `skip-nodes-with-system-pods=false` allows CAS to evict system pods. Ensure all system pods in `kube-system` have PodDisruptionBudgets before enabling this.
```bash
az aks update \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--cluster-autoscaler-profile \
scale-down-delay-after-add=10m \
scale-down-unneeded-time=10m \
scale-down-utilization-threshold=0.5 \
max-graceful-termination-sec=600 \
skip-nodes-with-system-pods=false
```
To roll back to CAS defaults:
```bash
az aks update \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--cluster-autoscaler-profile ""
```
## Profile Comparison
| Profile | scale-down-delay-after-add | scale-down-unneeded-time | utilization-threshold | Best For |
|---------|----------------------------|--------------------------|----------------------|----------|
| Default | 10m | 10m | 0.5 | General workloads |
| Cost-Optimized | 5m | 5m | 0.5 | Cost-sensitive, non-critical |
| Conservative | 30m | 30m | 0.7 | Stateful / production |
| Aggressive | 2m | 2m | 0.4 | Dev/test, batch |
> Risk: High for aggressive tuning. Ensure PodDisruptionBudgets (PDBs) are set on critical workloads before tuning. Always confirm with user before applying.
>
> Check existing PDBs before tuning:
> ```bash
> kubectl get pdb --all-namespaces
> ```
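
To turn a row of the comparison table into arguments for `--cluster-autoscaler-profile`, something like the following sketch could be used. The profile keys and the helper `casProfileArgs` are hypothetical; the setting names are the same ones used in the tuning command earlier in this file.

```javascript
// Values copied from the Profile Comparison table.
const CAS_PROFILES = new Map([
  ["default", { delayAfterAdd: "10m", unneededTime: "10m", threshold: 0.5 }],
  ["cost-optimized", { delayAfterAdd: "5m", unneededTime: "5m", threshold: 0.5 }],
  ["conservative", { delayAfterAdd: "30m", unneededTime: "30m", threshold: 0.7 }],
  ["aggressive", { delayAfterAdd: "2m", unneededTime: "2m", threshold: 0.4 }],
]);

// Build the space-separated key=value list passed after --cluster-autoscaler-profile.
function casProfileArgs(profileName) {
  const p = CAS_PROFILES.get(profileName);
  if (!p) throw new Error(`unknown profile: ${profileName}`);
  return [
    `scale-down-delay-after-add=${p.delayAfterAdd}`,
    `scale-down-unneeded-time=${p.unneededTime}`,
    `scale-down-utilization-threshold=${p.threshold}`,
  ].join(" ");
}

console.log(casProfileArgs("conservative"));
// scale-down-delay-after-add=30m scale-down-unneeded-time=30m scale-down-utilization-threshold=0.7
```

As the risk note says, confirm with the user and check PDBs before applying anything other than the default profile.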
azure-aks-rightsizing.md 4.3 KB
# AKS Pod Rightsizing
Identify pods requesting far more CPU/memory than they use and recommend reduced resource requests.
## Prerequisites: Check Monitoring State First
Before collecting usage data, determine what monitoring is available on the cluster:
```bash
# 1. Check if Azure Managed Prometheus is enabled
az aks show \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--query "azureMonitorProfile.metrics.enabled" -o tsv
# 2. Check if Container Insights (Log Analytics) is enabled
az aks show \
--name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--query "addonProfiles.omsagent.enabled" -o tsv
# 3. Check if Metrics Server is running (pre-installed on AKS, but may be unhealthy)
kubectl get deployment metrics-server -n kube-system
```
Based on the result, follow the appropriate path:
| State | Rightsizing Possible? | Data Source | Accuracy |
|-------|-----------------------|-------------|----------|
| Azure Managed Prometheus enabled | Yes | Prometheus metrics via Azure Monitor | Best: full P95/7-day history |
| Container Insights (Log Analytics) enabled | Yes | KQL queries on `Perf` / `KubePodInventory` | Good: 7-day trends |
| Only Metrics Server (no Azure Monitor) | Limited | `kubectl top pods` (live data only) | Low: no historical trends |
> If nothing is enabled, Metrics Server is pre-installed on AKS. Confirm it is healthy and use it for live rightsizing data:
> ```bash
> kubectl get deployment metrics-server -n kube-system
> kubectl top pods --all-namespaces --sort-by=cpu
> ```
> For historical P95 trends (more accurate rightsizing), recommend enabling Azure Managed Prometheus. Warn user this incurs cost and wait for confirmation before proceeding.
---
## Detection
```bash
# Authenticate to cluster
az aks get-credentials --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>"
# List requests/limits for ALL containers per pod (includes sidecars)
# Using [*] ensures multi-container pods are not misrepresented
kubectl get pods --all-namespaces \
-o custom-columns="NAMESPACE:.metadata.namespace,POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory"
# Live per-container usage (shows each container individually, including sidecars)
kubectl top pods --all-namespaces --containers --sort-by=cpu
```
## Historical Metrics (Azure Monitor – use when Prometheus or Container Insights is enabled)
First discover available metric names, then query:
```bash
az monitor metrics list-definitions \
--resource "<AKS_RESOURCE_ID>" \
--query "[].name.value" -o tsv
```
```bash
az monitor metrics list \
--resource "<AKS_RESOURCE_ID>" \
--metric "<METRIC_NAME_FROM_ABOVE>" \
--interval PT1H --aggregation Average \
--start-time "<YYYY-MM-DDTHH:mm:ssZ>" \
--end-time "<YYYY-MM-DDTHH:mm:ssZ>"
```
## Optimization Rules
| Condition | Recommendation | Risk |
|-----------|----------------|------|
| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | Medium |
| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | Medium |
| CPU request >2x P95 actual | Recommend rightsizing with 20% buffer | Low |
| No resource limits set | Add limits to prevent noisy-neighbor waste | Low |
| No VPA/HPA configured | Suggest enabling Vertical Pod Autoscaler | Low |
> For VPA setup and configuration, see [azure-aks-vpa.md](./azure-aks-vpa.md).
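
The first three rules in the table can be sketched as a single function. This is an illustration rather than a prescribed algorithm: `rightsize` is a hypothetical name, inputs are assumed to be plain numbers in one unit per resource (for example millicores for CPU, MiB for memory), and rounding the target up is my choice.

```javascript
// Apply the optimization rules to one resource of one container.
// Returns null when no rule fires or when there is no usable P95 data.
function rightsize(resource, request, p95) {
  if (!(p95 > 0)) return null;
  const ratio = request / p95;
  const target = Math.ceil((p95 * 12) / 10); // P95 * 1.2, rounded up
  if (resource === "cpu" && ratio > 5) return { target, risk: "Medium" };
  if (resource === "memory" && ratio > 3) return { target, risk: "Medium" };
  if (resource === "cpu" && ratio > 2) return { target, risk: "Low" };
  return null;
}

console.log(rightsize("cpu", 1000, 100));    // { target: 120, risk: 'Medium' }  (10x over P95)
console.log(rightsize("cpu", 300, 100));     // { target: 120, risk: 'Low' }     (3x over P95)
console.log(rightsize("memory", 1024, 256)); // { target: 308, risk: 'Medium' }  (4x over P95)
console.log(rightsize("memory", 512, 256));  // null (2x is within tolerance for memory)
```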
## YAML Patch Format
```yaml
# Rightsizing patch for <NAMESPACE>/<DEPLOYMENT_NAME>
# Current: CPU request=<CURRENT>, P95 actual=<ACTUAL>
# Recommended: CPU request=<NEW> (P95 * 1.2 buffer)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOYMENT_NAME>
  namespace: <NAMESPACE>
spec:
  template:
    spec:
      containers:
        - name: <CONTAINER_NAME>
          resources:
            requests:
              cpu: "<NEW_CPU>"
              memory: "<NEW_MEM>"
            limits:
              cpu: "<NEW_CPU_LIMIT>"  # e.g. CPU limit = 1.5x CPU request, or preserve existing limit-to-request ratio
              memory: "<NEW_MEM_LIMIT>"  # e.g. memory limit = 1.25x memory request, or preserve existing limit-to-request ratio
```
> Risk: Medium-High. Always review patches before applying. Test in non-production first. Get explicit user confirmation before applying to production.
azure-aks-spot.md 3.7 KB
# AKS Spot Node Pools
Recommend and create Spot VM node pools for batch, dev/test, or fault-tolerant workloads (60-90% cost reduction vs regular nodes).
## Check Existing Node Pools
```bash
az aks nodepool list \
--cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
--query "[].{name:name, vmSize:vmSize, priority:scaleSetPriority, count:count, mode:mode}" \
-o table
```
## Identify Spot-Suitable Workloads
Before creating a Spot pool, identify which workloads can tolerate interruptions:
```bash
# List single-replica deployments (one eviction takes the whole workload down, so these are poor Spot candidates)
kubectl get deployments --all-namespaces -o json | \
jq -r '.items[] | select(.spec.replicas == 1) | "\(.metadata.namespace)/\(.metadata.name)"'
# Check which pods already have spot tolerations
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.tolerations[]?.key == "kubernetes.azure.com/scalesetpriority") | "\(.metadata.namespace)/\(.metadata.name)"'
```
Use the suitability table below to decide which workloads to migrate.
## Mixed Node Pool Pattern (Spot + Regular)
For workloads that need resilience but want cost savings, use a mixed approach:
```bash
# Keep existing regular node pool as fallback (min 1-2 nodes)
# If the autoscaler is already enabled on this pool, use --update-cluster-autoscaler instead
az aks nodepool update \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --name "<REGULAR_POOL>" \
  --enable-cluster-autoscaler --min-count 1 --max-count 3
# Add Spot pool for the majority of workload capacity
# -1 means pay up to on-demand price (no cap); set e.g. 0.05 to cap hourly spend
# AKS applies the kubernetes.azure.com/scalesetpriority=spot:NoSchedule taint and the
# kubernetes.azure.com/scalesetpriority=spot label to Spot pools automatically
az aks nodepool add \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --name "<SPOT_POOL_NAME>" \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --node-vm-size "<VM_SIZE>" \
  --node-count 3 --min-count 0 --max-count 10 \
  --enable-cluster-autoscaler
```
Pods that tolerate Spot but don't require it (no `nodeSelector` or required node affinity pinning them to the Spot pool) will be rescheduled onto the regular pool after eviction. Pods pinned to Spot via `nodeSelector` cannot reschedule and will remain pending until a Spot node is available again.
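One way to express "tolerates Spot but does not require it" is a toleration combined with a preferred (not required) node affinity. The sketch below writes such a Deployment patch to a file for review; the file name is a hypothetical choice, and the weight of 100 is an illustrative value.

```bash
#!/usr/bin/env bash
# Sketch: write a Deployment patch that tolerates Spot and prefers, but does not require, Spot nodes.
# The file name is a hypothetical choice.
patch_file="spot-preferred-patch.yaml"

cat > "$patch_file" <<'EOF'
spec:
  template:
    spec:
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          # "preferred" lets the scheduler fall back to regular nodes when no Spot node is available
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"
                    operator: In
                    values: ["spot"]
EOF

echo "Wrote $patch_file"
# Review the file, then apply it:
# kubectl patch deployment "<DEPLOYMENT_NAME>" -n "<NAMESPACE>" --patch-file "$patch_file"
```

Because the affinity is only a preference, evicted pods can land on the regular pool instead of staying pending.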
## Workload Toleration (add to Deployment YAML)
```yaml
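# Goes under spec.template.spec of the Deployment
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
# Optional: nodeSelector pins pods to Spot nodes only; omit it to allow fallback to the regular pool
nodeSelector:
  kubernetes.azure.com/scalesetpriority: spot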
```
## Suitability
| Workload | Spot-Suitable? |
|----------|----------------|
| Batch / data processing | Yes |
| Dev / test environments | Yes |
| Stateless web/API (replicas >= 2) | Yes (with care) |
| Jobs with checkpointing | Yes |
| Stateful workloads (databases) | No |
| Single-replica critical services | No |
> Risk: Low for batch/dev. High for production stateful workloads. Spot VMs evict with 30-second notice. Eviction policy Delete is recommended for AKS.
## Handling Eviction Gracefully
Configure workloads to handle the 30-second eviction notice:
```yaml
# Add to Deployment spec: terminationGracePeriodSeconds should be < 30s for Spot
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 25
      containers:
        - name: <CONTAINER_NAME>
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]  # Drain in-flight requests
```
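The timing constraint above comes down to simple arithmetic: the grace period countdown starts before the `preStop` hook runs, so the `preStop` delay plus the time the application needs to shut down must fit inside `terminationGracePeriodSeconds`, which in turn should stay under the 30-second notice. A minimal sketch, where `check_grace_budget` and the 15-second shutdown figure are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: verify that preStop delay + app shutdown time fit the Spot eviction window.
# check_grace_budget is a hypothetical helper; all values are in seconds.
check_grace_budget() {
  local grace="$1" prestop="$2" shutdown="$3" notice="${4:-30}"

  if (( grace >= notice )); then
    echo "FAIL: grace period ${grace}s is not below the ${notice}s eviction notice"
    return 1
  fi
  if (( prestop + shutdown > grace )); then
    echo "FAIL: preStop ${prestop}s + shutdown ${shutdown}s exceeds grace period ${grace}s"
    return 1
  fi
  echo "OK: $(( grace - prestop - shutdown ))s of headroom"
}

# Values from the Deployment snippet above (25s grace, 5s preStop) plus an assumed 15s shutdown
check_grace_budget 25 5 15
```

Measure the real shutdown time of the container before relying on a figure like the assumed 15 seconds.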
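Set a PodDisruptionBudget to limit how many replicas voluntary disruptions (node drains, upgrades, autoscaler scale-down) can take down at once. A PDB cannot block the Spot eviction itself, so pair it with multiple replicas: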
```bash
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <APP_NAME>-pdb
  namespace: <NAMESPACE>
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: <APP_NAME>
EOF
```
azure-aks-vpa.md 1.4 KB
# AKS Vertical Pod Autoscaler (VPA)
Use VPA to get data-driven resource recommendations for rightsizing pods. Always start in recommendation-only mode before considering auto-apply.
## Enable VPA (Recommendation Mode)
```bash
# Enable VPA addon on AKS cluster (if not already enabled)
az aks update --enable-vpa --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>
# Create a VPA object in recommendation mode for a deployment
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: <DEPLOYMENT_NAME>-vpa
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <DEPLOYMENT_NAME>
  updatePolicy:
    updateMode: "Off"  # Recommendation only; does not modify pods
EOF
# Read recommendations after 24+ hours of data collection
kubectl describe vpa <DEPLOYMENT_NAME>-vpa -n <NAMESPACE>
```
> Risk: Low in "Off" mode. **Do not use `updateMode: Auto` in production** without thorough testing and explicit user confirmation.
## Read VPA Recommendations
```bash
kubectl get vpa <DEPLOYMENT_NAME>-vpa -n <NAMESPACE> -o jsonpath='{.status.recommendation}'
```
The output shows `lowerBound`, `target`, and `upperBound` for CPU and memory. Use the `target` values as rightsized requests.
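To pull out only the `target` values per container, a jsonpath range expression can be wrapped in a small function. This is a sketch: `vpa_targets` is a hypothetical helper name, and it assumes the VPA object already has a populated `status.recommendation`.

```bash
#!/usr/bin/env bash
# Sketch: print per-container VPA targets as tab-separated columns (container, cpu, memory).
# vpa_targets is a hypothetical helper name.
vpa_targets() {
  local vpa_name="$1" namespace="$2"
  kubectl get vpa "$vpa_name" -n "$namespace" \
    -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{"\t"}{.target.cpu}{"\t"}{.target.memory}{"\n"}{end}'
}

# Usage (requires cluster access):
# vpa_targets "<DEPLOYMENT_NAME>-vpa" "<NAMESPACE>"
```

Empty output usually means the VPA has not collected enough data yet.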
## Apply Recommendations Manually
After reviewing VPA output, patch the deployment; see [azure-aks-rightsizing.md](./azure-aks-rightsizing.md#yaml-patch-format) for the patch format.
cli-reference.md 1.2 KB
# CLI Reference for AKS
```bash
# List AKS clusters
az aks list --output table
# Show cluster details
az aks show --name <cluster-name> --resource-group <resource-group>
# Get available Kubernetes versions
az aks get-versions --location <location> --output table
# Create AKS Automatic cluster
az aks create --name <cluster-name> --resource-group <resource-group> --sku automatic \
  --network-plugin azure --network-plugin-mode overlay \
  --enable-oidc-issuer --enable-workload-identity
# Create AKS Standard cluster
az aks create --name <cluster-name> --resource-group <resource-group> \
  --node-count 3 --zones 1 2 3 \
  --network-plugin azure --network-plugin-mode overlay \
  --enable-cluster-autoscaler --min-count 1 --max-count 10 \
  --enable-oidc-issuer --enable-workload-identity
# Get credentials
az aks get-credentials --name <cluster-name> --resource-group <resource-group>
# List node pools
az aks nodepool list --cluster-name <cluster-name> --resource-group <resource-group> --output table
# Enable monitoring
az aks enable-addons --name <cluster-name> --resource-group <resource-group> \
  --addons monitoring --workspace-resource-id <workspace-resource-id>
```
License (MIT)
MIT License Copyright 2025 (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.