Azure Kubernetes

Verified intermediate

Plan, create, and configure production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 checklist, SKU selection (Automatic vs Standard), networking options (private API server, Azure CNI Overlay, egress configuration), security, and operations (autoscaling, upgrade strategy, cost analysis). WHEN: create AKS environment, provision AKS, enable AKS observability, design AKS networking, choose AKS SKU, secure AKS, optimize AKS, AKS spot nodes, AKS cluster-autoscaler, rightsize AKS pod, pod rightsizing, over-provisioned AKS pod, pod resource requests and limits, Vertical Pod Autoscaler, VPA recommendations.

🔌 API & Backend · MIT · 11 files

Installation

Install with CLI Recommended

gh skills-hub install azure-kubernetes

Don't have the extension? Run gh extension install samueltauil/skills-hub first.

Download and extract to your repository:

.github/skills/azure-kubernetes/

Extract the ZIP to .github/skills/ in your repo. The folder name must match azure-kubernetes for Copilot to auto-discover it.

Skill Files (11)

SKILL.md 10.3 KB

---
name: azure-kubernetes
license: MIT
metadata:
  author: Microsoft
  version: "1.1.4"
description: "Plan, create, and configure production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 checklist, SKU selection (Automatic vs Standard), networking options (private API server, Azure CNI Overlay, egress configuration), security, and operations (autoscaling, upgrade strategy, cost analysis). WHEN: create AKS environment, provision AKS, enable AKS observability, design AKS networking, choose AKS SKU, secure AKS, optimize AKS, AKS spot nodes, AKS cluster-autoscaler, rightsize AKS pod, pod rightsizing, over-provisioned AKS pod, pod resource requests and limits, Vertical Pod Autoscaler, VPA recommendations."
---

# Azure Kubernetes Service

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This skill produces a **recommended AKS cluster configuration** based on user requirements, distinguishing **Day-0 decisions** (networking, API server — hard to change later) from **Day-1 features** (can enable post-creation). See [CLI reference](./references/cli-reference.md) for commands.

## Quick Reference
| Property | Value |
|----------|-------|
| Best for | AKS cluster planning and Day-0 decisions |
| MCP Tools | `mcp_azure_mcp_aks` |
| CLI | `az aks create`, `az aks show`, `kubectl get`, `kubectl describe` |
| Related skills | azure-diagnostics (troubleshooting AKS), azure-validate (readiness checks), azure-kubernetes-automatic-readiness (migrate existing cluster to AKS Automatic) |

## When to Use This Skill
Activate this skill when user wants to:
- Create a new AKS cluster
- Plan AKS cluster configuration for production workloads
- Design AKS networking (API server access, pod IP model, egress)
- Set up AKS identity and secrets management
- Configure AKS governance (Azure Policy, Deployment Safeguards)
- Enable AKS observability (Container Insights, Managed Prometheus, Grafana)
- Define AKS upgrade and patching strategy
- Understand AKS Automatic vs Standard SKU differences
- Get a Day-0 checklist for AKS cluster setup and configuration

## Rules
1. Start with the user's requirements for provisioning compute, networking, security, and other settings.
2. Use the `azure` MCP server and select `mcp_azure_mcp_aks` first to discover the exact AKS-specific MCP tools surfaced by the client. Choose the smallest discovered AKS tool that fits the task, and fall back to Azure CLI (`az aks`) only when the needed functionality is not exposed through the AKS MCP surface.
3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required.
4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access).


## Required Inputs (Ask only what’s needed)
If the user is unsure, use safe defaults.
- AKS environment type: dev/test or production
- Region(s), availability zones, preferred node VM sizes
- Expected scale (node/cluster count, workload size)
- Networking requirements (API server access, pod IP model, ingress/egress control)
- Security and identity requirements, including image registry
- Upgrade and observability preferences
- Cost constraints

## Workflow

### 1. Cluster Type
- **AKS Automatic** (default): Best for most production workloads. Provides a curated experience with preconfigured best practices for security, reliability, and performance. Use it unless you have custom requirements for networking, autoscaling, or node pool configuration that Node Auto-Provisioning (NAP) does not support.
- **AKS Standard**: Use when you need full control over cluster configuration, at the cost of additional setup and management overhead.

### 2. Networking (Pod IP, Egress, Ingress, Dataplane)

**Pod IP Model** (Key Day-0 decision):
- **Azure CNI Overlay** (recommended): pod IPs come from a private overlay range and are not VNet-routable; scales to large environments and suits most workloads
  - Docs: https://learn.microsoft.com/azure/aks/azure-cni-overlay
- **Azure CNI (VNet-routable)**: pod IPs come directly from the VNet (pod subnet or node subnet); use when pods must be directly addressable from the VNet or on-premises

**Dataplane & Network Policy**:
- **Azure CNI powered by Cilium** (recommended): eBPF-based for high-performance packet processing, network policies, and observability

**Egress**:
- **Static Egress Gateway** for stable, predictable outbound IPs
- For restricted egress: UDR + Azure Firewall or NVA

**Ingress**:
- **App Routing addon with Gateway API**: recommended default for HTTP/HTTPS workloads
- **Istio service mesh with Gateway API**: for advanced traffic management, mTLS, canary releases
- **Application Gateway for Containers**: for L7 load balancing with WAF integration

**DNS**:
- Enable **LocalDNS** on all node pools for reliable, performant DNS resolution

### 3. Security
- Use **Microsoft Entra ID** everywhere (control plane, Workload Identity for pods, node access). Avoid static credentials.
- Azure Key Vault via **Secrets Store CSI Driver** for secrets
- Enable **Azure Policy** + **Deployment Safeguards**
- Enable **encryption at rest** for etcd/API server data and **encryption in transit** for node-to-node traffic
- Allow only signed, policy-approved images (Azure Policy + Ratify), prefer **Azure Container Registry**
- **Isolation**: Use namespaces, network policies, scoped logging

### 4. Observability
- Use Managed Prometheus and Container Insights with Grafana for AKS observability (logs + metrics).
- Enable Diagnostic Settings to collect control plane logs and audit logs in a Log Analytics workspace for security monitoring and troubleshooting.
- For other monitoring and troubleshooting needs, use the Agentic CLI for AKS, Application Insights, Resource Health Center, AppLens detectors, and Azure Advisor.

### 5. Upgrades & Patching
- Configure **Maintenance Windows** for controlled upgrade timing
- Enable **auto-upgrades** for control plane and node OS to stay up-to-date with security patches and Kubernetes versions
- Consider **LTS versions** for enterprise stability (2-year support); LTS requires upgrading your AKS environment to the Premium tier
- **Fleet upgrades**: Use **AKS Fleet Manager** for staged rollout across test to production environments

### 6. Performance
- Use **Ephemeral OS disks** (`--node-osdisk-type Ephemeral`) for faster node startup
- Select **Azure Linux** as node OS (smaller footprint, faster boot)
- Enable **KEDA** for event-driven autoscaling beyond HPA

### 7. Node Pools & Compute
- **Dedicated system node pool**: At least 2 nodes, tainted for system workloads only (`CriticalAddonsOnly`)
- Enable **Node Auto-Provisioning (NAP)** on all pools for cost savings and responsive scaling
- Use **latest generation SKUs (v5/v6)** for host-level optimizations
- **Avoid B-series VMs** — burstable SKUs cause performance/reliability issues
- Use SKUs with **at least 4 vCPUs** for production workloads
- Set **topology spread constraints** to distribute pods across hosts/zones per SLO

### 8. Reliability
- Deploy across **3 Availability Zones** (`--zones 1 2 3`)
- Use **Standard tier** for zone-redundant control plane + 99.95% SLA for API server availability
- Enable **Microsoft Defender for Containers** for runtime protection
- Configure **PodDisruptionBudgets** for all production workloads
- Use **topology spread constraints** to ensure pod distribution across failure domains
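
A minimal sketch of how the CLI flags quoted in this document could be assembled for a Standard cluster when the MCP surface does not expose the needed functionality. The resource group and cluster names are hypothetical, and `--resource-group` / `--name` are the only flags not quoted elsewhere in this skill. The command is printed for review, not executed.

```python
import shlex


def build_create_command(resource_group: str, name: str) -> list[str]:
    """Assemble `az aks create` arguments from the flags cited in this skill."""
    return [
        "az", "aks", "create",
        "--resource-group", resource_group,
        "--name", name,
        "--zones", "1", "2", "3",           # Reliability: deploy across 3 Availability Zones
        "--node-osdisk-type", "Ephemeral",  # Performance: ephemeral OS disks
        "--enable-oidc-issuer",             # needed for Workload Identity (see Error Handling)
        "--enable-workload-identity",
    ]


# Hypothetical names, for illustration only.
command = build_create_command("rg-demo", "aks-demo")
print(shlex.join(command))
```

Returning a list rather than a single string keeps each argument separate, so the same value can be passed to a process runner without shell quoting issues.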

### 9. Cost Controls
- Use **Spot node pools** for batch/interruptible workloads (up to 90% savings)
- **Stop/Start** dev/test clusters: `az aks stop` / `az aks start`
- Consider **Reserved Instances** or **Savings Plans** for steady-state workloads

**Deep-dive scenarios** — load only the relevant reference file:

| Scenario | Trigger Keywords | Reference |
|----------|-----------------|-----------|
| Pod Rightsizing | over-provisioned pods, CPU requests, memory requests, rightsize workloads | [azure-aks-rightsizing.md](./references/azure-aks-rightsizing.md) |
| VPA Setup | vertical pod autoscaler, VPA recommendations, VPA enable | [azure-aks-vpa.md](./references/azure-aks-vpa.md) |
| Cluster Autoscaler | idle nodes, CAS off, enable autoscaler, scale-down profile, node utilization | [azure-aks-autoscaler.md](./references/azure-aks-autoscaler.md) |
| Spot Node Pools | Spot VMs, Spot nodes, batch workloads, cheaper nodes | [azure-aks-spot.md](./references/azure-aks-spot.md) |

> **Disambiguation:** If a prompt matches multiple rows (e.g., "cheaper nodes" could suggest both Spot and autoscaler), prefer the most specific match. If ambiguous, ask the user to clarify their intent before loading a reference file.

## Guardrails / Safety
- Do not request or output secrets (tokens, keys).
- Do not ask the user to paste subscription IDs. Discover subscription and resource scope via MCP tools (e.g., list subscriptions, list resource groups) or `az account show` / `az account list` so the agent can resolve context without exposing identifiers.
- If requirements are ambiguous for Day-0 critical decisions, ask the user clarifying questions. For features that can be enabled on Day-1, propose 2–3 safe options with tradeoffs and choose a conservative default.
- Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance.

## MCP Tools
| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `mcp_azure_mcp_aks` | AKS MCP entry point used to discover the exact AKS-specific tools exposed by the client | Discover the callable AKS tool first, then use that tool's parameters |

## Error Handling
| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| MCP tool call fails or times out | Invalid credentials, subscription, or AKS context | Verify `az login`, confirm the active subscription context with `az account show`, and check the target resource group without echoing subscription identifiers back to the user |
| Quota exceeded | Regional vCPU or resource limits | Request quota increase or select different region/VM SKU |
| Networking conflict (IP exhaustion) | Pod subnet too small for overlay/CNI | Re-plan IP ranges; may require cluster recreation (Day-0) |
| Workload Identity not working | Missing OIDC issuer or federated credential | Enable `--enable-oidc-issuer --enable-workload-identity`, configure federated identity |

azure-kubernetes-automatic-readiness/

SKILL.md 14.4 KB

---
name: azure-kubernetes-automatic-readiness
license: MIT
metadata:
  author: Microsoft
  version: "1.0.1"
description: "Assess Kubernetes workloads and cluster configuration for AKS Automatic compatibility. Identifies incompatibilities, generates fixes, and guides migration from AKS Standard to AKS Automatic. WHEN: migrate to AKS Automatic, check AKS Automatic readiness, validate manifests for Automatic, assess cluster for Automatic compatibility, fix deployment for Automatic compatibility, identify AKS Automatic migration blockers, is my cluster ready for AKS Automatic."
---

# AKS Automatic Readiness Assessment

> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This skill assesses existing AKS clusters or local manifests for AKS Automatic compatibility.
> For creating a new AKS Automatic cluster, use the `azure-kubernetes` skill instead.
> See [constraint spec](./references/constraint-spec-v1.yaml) for all safeguard rules, [common fixes](./references/common-fixes.md) for YAML patterns, [migration guide](./references/migration-guide-summary.md) for end-to-end steps, and [MCP integration](./references/mcp-integration.md) for tool details and fallback handling.

You are an AKS Automatic compatibility assessment agent. Your job is to evaluate whether Kubernetes workloads and cluster configurations are compatible with [AKS Automatic](https://learn.microsoft.com/en-us/azure/aks/intro-aks-automatic), identify issues, and help users fix them.

AKS Automatic enforces **Deployment Safeguards** (21 active policies, some deny, some warn only), **Pod Security Standards** (Baseline mandatory, Restricted optional), **2 active webhook mutators** that auto-fix certain fields at admission (resource-requests defaults and anti-affinity/topology-spread), and **23 cluster-level configuration requirements**.

## Quick Reference
| Property | Value |
|----------|-------|
| Best for | AKS Automatic migration readiness and manifest validation |
| MCP Tools | `mcp_azure_mcp_aks` |
| Related skills | azure-kubernetes (cluster creation), azure-diagnostics (live troubleshooting), azure-validate (readiness checks) |

## When to Use This Skill
- "Can I migrate to AKS Automatic?"
- "Check my cluster readiness for Automatic"
- "Validate manifests against AKS Automatic constraints"
- "Fix my deployment for Automatic compatibility"
- "Identify AKS Automatic migration blockers"
- Any mention of AKS Automatic + (migration | readiness | compatibility | assessment | validation)

## Routing Rules

### Route to `azure-kubernetes` instead:
- "Create an AKS cluster" / "What are AKS best practices?" / "How do I deploy to AKS?"
- General cluster creation, configuration, scaling, or AKS operations

### Route to `azure-diagnostics` instead:
- "My pod is crashing" / "Debug my AKS cluster" / "Why is my deployment failing?"
- Live troubleshooting, debugging, error diagnosis on a running cluster

## Guardrails — READ FIRST

1. **Read-only**: NEVER modify cluster state. Assessment is read-only. Do not run `kubectl apply`, `az aks update`, or any command that changes the cluster.
2. **No secrets**: Do NOT transmit, display, or include in diffs: Secret data values, ConfigMap data values, environment variable values from `valueFrom.secretKeyRef`, service account tokens, or connection strings.
3. **User approval for file changes**: Present every fix as a diff. The user must explicitly accept before you write to any file.
4. **Scope boundaries**: Route cluster creation/deletion questions → `azure-kubernetes` skill. Route live troubleshooting → `azure-diagnostics` skill.

## MCP Tools
| Tool | Purpose | Key Parameters |
|------|---------|----------------|
| `mcp_azure_mcp_aks` | AKS MCP entry point — call `discover` first, then use the assessment action name returned in the response | `subscriptionId`, `resourceGroupName`, `resourceName`, `scope` |

## Workflow

### Step 1: Determine Scope

Ask the user what they want to assess:

**Option A — Cluster-connected assessment (via AKS MCP)**
Use when the user has a connected cluster context (subscription + resource group + cluster name).

**Option B — Offline manifest validation**
Use when the user has local Kubernetes manifests, Helm charts, or Kustomize overlays in their workspace. Search for files containing `apiVersion:` and `kind:` matching Deployment, StatefulSet, DaemonSet, Job, CronJob, Pod, Service, PodDisruptionBudget, or StorageClass. For Helm charts, look for `Chart.yaml` and rendered templates under `templates/`.

**Option C — Single manifest check**
If the user pastes or points to a single YAML manifest, validate it directly without asking for scope.

### Step 2: Run Assessment

#### Cluster-Connected Mode

Call the AKS MCP tool — this is the preferred path. Always call `discover` first to get the available actions, then use the assessment action name returned in the response:

```javascript
// Step 1: Discover available actions
mcp_azure_mcp_aks({ action: "discover" })

// Step 2: Use the assessment action name from the discover response
mcp_azure_mcp_aks({
  action: "<action-from-discover>",
  subscriptionId: "<subscription-id>",
  resourceGroupName: "<resource-group>",
  resourceName: "<cluster-name>",
  scope: {
    excludeNamespaces: ["kube-system", "gatekeeper-system"],
    workloadTypes: ["Deployment", "StatefulSet", "DaemonSet", "CronJob", "Job"]
  }
})
```

**Required permissions:**
- `Microsoft.ContainerService/managedClusters/read`
- `Microsoft.ContainerService/managedClusters/listClusterUserCredential/action`

For large clusters (500+ workloads), the API may return HTTP 202 with a `Location` header. Poll the location URL using the `Retry-After` interval until a 200 response is received.
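
The polling behaviour described above can be sketched as follows. This is an illustration under assumptions, not the MCP client's actual implementation: responses are modelled as `(status, headers, body)` tuples, `fetch` is a hypothetical callable that performs the GET, and `Retry-After` is assumed to carry a number of seconds.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], dict]  # (status, headers, body)


def poll_until_done(
    first: Response,
    fetch: Callable[[str], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 60,
    default_delay: float = 5.0,
) -> dict:
    """Follow HTTP 202 + Location until a 200 arrives, then return its body."""
    status, headers, body = first
    polls = 0
    while status == 202:
        if polls >= max_polls:
            raise TimeoutError("assessment did not finish within max_polls")
        # Wait for the interval the server asked for (assumed to be seconds).
        sleep(float(headers.get("Retry-After", default_delay)))
        status, headers, body = fetch(headers["Location"])
        polls += 1
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return body


# Demo with a fake transport: one more 202, then the final 200.
_location = "https://example.invalid/operations/1"  # hypothetical URL
_queued = iter([
    (202, {"Location": _location, "Retry-After": "2"}, {}),
    (200, {}, {"summary": {"totalWorkloads": 3}}),
])
slept: list = []
result = poll_until_done(
    (202, {"Location": _location, "Retry-After": "1"}, {}),
    fetch=lambda url: next(_queued),
    sleep=slept.append,  # record delays instead of actually sleeping
)
print(result, slept)
```

Injecting `fetch` and `sleep` keeps the loop testable without network access or real delays.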

**Parsing the MCP response:**
1. **`summary`** — aggregate counts: `compatible`, `requiresChanges`, `incompatible`, `autoFixed`, `totalWorkloads`, `clusterConfigIssues`
2. **`clusterConfiguration`** — cluster-level issues with `constraintId`, `severity`, `remediation` (az CLI commands), and `documentationUrl`
3. **`workloads[]`** — per-workload array, each with `name`, `namespace`, `kind`, `overallStatus`, and `issues[]`

Each issue in `workloads[].issues[]` contains: `constraintId`, `severity` (`incompatible`/`requiresChanges`/`autoFixed`/`informational`), `description`, `field` (JSON Pointer), `suggestedPatch` (JSON Patch for deterministic fixes), `remediationGuide` (for LLM-reasoned fixes).
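
To illustrate how a `suggestedPatch` might be turned into a manifest change, here is a minimal sketch that applies a subset of JSON Patch (`add`, `replace`, `remove`) to a manifest held as nested dicts and lists. The patch shown is a hypothetical example of the shape, not real MCP output, and whole-document (empty path) operations are not handled.

```python
import copy


def parse_pointer(pointer: str) -> list:
    """Split a JSON Pointer (RFC 6901) into unescaped reference tokens."""
    if not pointer.startswith("/"):
        raise ValueError(f"unsupported pointer: {pointer!r}")
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def apply_patch(document: dict, operations: list) -> dict:
    """Return a patched copy of document; the input is left untouched."""
    doc = copy.deepcopy(document)
    for op in operations:
        *parents, last = parse_pointer(op["path"])
        target = doc
        for token in parents:  # walk down to the parent container
            target = target[int(token)] if isinstance(target, list) else target[token]
        kind = op["op"]
        if kind not in ("add", "replace", "remove"):
            raise ValueError(f"unsupported op: {kind}")
        if isinstance(target, list):
            index = len(target) if last == "-" else int(last)
            if kind == "add":
                target.insert(index, op["value"])
            elif kind == "replace":
                target[index] = op["value"]
            else:
                del target[index]
        elif kind == "remove":
            del target[last]
        else:
            target[last] = op["value"]
    return doc


pod = {"spec": {"containers": [{"name": "web", "image": "myapp:v1.0.0"}]}}
# Hypothetical patch for safeguard-container-resource-requests.
suggested_patch = [{
    "op": "add",
    "path": "/spec/containers/0/resources",
    "value": {"requests": {"cpu": "250m", "memory": "256Mi"}},
}]
patched = apply_patch(pod, suggested_patch)
print(patched["spec"]["containers"][0]["resources"])
```

Working on a deep copy matches the read-only guardrail: the original manifest stays unchanged until the user approves the resulting diff.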

#### Fallback Chain

```
1. MCP tool (mcp_azure_mcp_aks)  → preferred, live cluster data
   ↓ fails (tool not found — Azure MCP server not configured)
2. Offline validation            → works on local manifests without any cluster
```

If `mcp_azure_mcp_aks` is not available, inform the user:
> "The Azure MCP server is not configured in your editor. To enable live cluster assessment, follow the setup guide at [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup). For now, I can validate your local manifests offline."

Then proceed to offline mode.

#### Offline Mode

Load the constraint spec from `references/constraint-spec-v1.yaml` and evaluate each manifest. Each constraint's check field describes what to look for and which manifest fields to inspect; its fix field lists allowed values and possible fixes. Evaluate every safeguard against every manifest to determine compatibility, and suggest any fixes that are needed.

Key checks:

**Per container** (containers, initContainers, ephemeralContainers):
- Resource requests/limits → `safeguard-container-resource-requests`
- Readiness and liveness probes → `safeguard-probes-configured` *(warning-only — not blocked at admission; treat as informational)*
- Image tag not `:latest` → `safeguard-images-no-latest`
- `securityContext.privileged` not true → `safeguard-no-privileged-containers`
- `capabilities.add` only adds allowed capabilities → `safeguard-container-capabilities`
- `seccompProfile` is RuntimeDefault/Localhost → `safeguard-allowed-seccomp-profiles`
- No `host` field in any container probe or lifecycle hook → `safeguard-host-probes`

**Per pod spec:**
- `hostPID`/`hostIPC` not true → `safeguard-block-host-namespaces` (incompatible)
- `hostNetwork` not true and no `hostPort` set → `safeguard-host-network-ports` (incompatible)
- No `hostPath` volumes → `safeguard-no-host-path-volumes` (incompatible)

**Per workload type:**
- Deployments/StatefulSets with replicas > 1 must define `podAntiAffinity` or `topologySpreadConstraints` → `safeguard-pod-enforce-antiaffinity`
- StorageClass: CSI provisioner (not in-tree) → `safeguard-csi-driver-storage-class`
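
A few of the checks above can be sketched as a small function over a pod spec that has already been parsed into a dict. This is a simplified illustration, not the constraint spec's evaluation logic: the function name and messages are hypothetical, only a handful of safeguards are covered, and treating an untagged image as `:latest` is an assumption.

```python
def check_pod_spec(pod_spec: dict) -> list:
    """Return (constraint_id, message) findings for one pod spec."""
    findings = []
    if pod_spec.get("hostPID") or pod_spec.get("hostIPC"):
        findings.append(("safeguard-block-host-namespaces", "hostPID/hostIPC is enabled"))
    if pod_spec.get("hostNetwork"):
        findings.append(("safeguard-host-network-ports", "hostNetwork is enabled"))
    for volume in pod_spec.get("volumes") or []:
        if "hostPath" in volume:
            findings.append(
                ("safeguard-no-host-path-volumes", f"volume {volume.get('name')} uses hostPath")
            )

    # Per-container checks apply to all three container lists.
    containers = []
    for key in ("containers", "initContainers", "ephemeralContainers"):
        containers.extend(pod_spec.get(key) or [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        # Look only at the last path segment so a registry port is not read as a tag.
        ref = container.get("image", "").rsplit("/", 1)[-1]
        untagged = ":" not in ref  # assumption: no tag is treated like :latest
        if "@" not in ref and (untagged or ref.endswith(":latest")):
            findings.append(("safeguard-images-no-latest", f"{name}: image is not pinned"))
        if (container.get("securityContext") or {}).get("privileged"):
            findings.append(("safeguard-no-privileged-containers", f"{name}: privileged"))
        if not (container.get("resources") or {}).get("requests"):
            findings.append(
                ("safeguard-container-resource-requests", f"{name}: no resource requests")
            )
    return findings


risky = {
    "hostNetwork": True,
    "volumes": [{"name": "logs", "hostPath": {"path": "/var/log"}}],
    "containers": [{"name": "web", "image": "myapp:latest"}],
}
for constraint_id, message in check_pod_spec(risky):
    print(constraint_id, "-", message)
```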


### Severity Classification

| Severity | Meaning | Action |
|----------|---------|--------|
| `incompatible` | Fundamental architecture issue; cannot run on Automatic without redesign | Must fix before migration — flag prominently |
| `requiresChanges` | Manifest changes needed; will be denied at admission | Generate fix diffs |
| `autoFixed` | AKS Automatic will mutate this at admission; no user action needed | Informational — show what will change |
| `informational` | No enforcement | Mention briefly |

### Step 3: Present Findings

Always start with the summary:

```
## AKS Automatic Readiness Assessment

| Status | Count |
|--------|-------|
| ✅ Compatible | X workloads |
| ⚠️ Requires changes | Y workloads |
| ❌ Incompatible | Z workloads |
| 🔧 Auto-fixed by Automatic | W workloads |
| 🏗️ Cluster config issues | N issues |
```

Grouping: ≤ 10 issues → list individually; > 10 → group by constraint ID. Always show **incompatible** first (migration blockers), then **requiresChanges**, then **autoFixed**, then cluster config.
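
The ordering and grouping rule can be expressed as a short sketch. It assumes each issue is a dict with the `severity` and `constraintId` keys described earlier; the function name and the list-of-groups return shape are hypothetical.

```python
SEVERITY_ORDER = ["incompatible", "requiresChanges", "autoFixed"]


def arrange_issues(issues: list, threshold: int = 10) -> list:
    """Sort issues by severity, then group by constraint ID when there are many.

    Returns a list of groups. With `threshold` or fewer issues, every group
    holds a single issue, i.e. issues are listed individually.
    """
    rank = {severity: i for i, severity in enumerate(SEVERITY_ORDER)}
    # sorted() is stable, so issues of equal severity keep their input order.
    ordered = sorted(issues, key=lambda issue: rank.get(issue["severity"], len(rank)))
    if len(ordered) <= threshold:
        return [[issue] for issue in ordered]
    groups: dict = {}
    for issue in ordered:
        groups.setdefault(issue["constraintId"], []).append(issue)
    # Dicts preserve insertion order, so the most severe groups come first.
    return list(groups.values())


few = [
    {"constraintId": "safeguard-images-no-latest", "severity": "requiresChanges"},
    {"constraintId": "safeguard-no-host-path-volumes", "severity": "incompatible"},
]
many = few * 6  # 12 issues, 2 distinct constraint IDs
print(len(arrange_issues(few)), len(arrange_issues(many)))
```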

Per-issue format:
```
### ❌ [constraint-id] — Short description
**Severity:** incompatible | requiresChanges
**Affected:** namespace/resource-name (Kind)
**Current:** <what the manifest has>
**Required:** <what AKS Automatic requires>
**Fix:** <remediation summary>
**Docs:** <documentation URL>
```

### Step 4: Offer Fixes

**Deterministic fixes** (have `suggestedPatch` — generate YAML diff directly):
- `safeguard-container-resource-requests` — add `resources.requests`
- `safeguard-container-capabilities` — remove `capabilities.add`
- `safeguard-allowed-seccomp-profiles` — patch only when `seccompProfile.type: Unconfined` is present, or when the MCP `suggestedPatch` explicitly requires a seccomp change
- `safeguard-enforce-apparmor` — add AppArmor annotation
- `safeguard-csi-driver-storage-class` — replace in-tree provisioner

Use patterns in `references/common-fixes.md` and generate a before/after diff. Starting resource values use safe defaults — VPA (enabled on Automatic) will auto-tune after deployment.

**LLM-reasoned fixes** (require app context; use `remediationGuide`):
- `safeguard-images-no-latest` — correct tag is user- and release-specific; ask the user: _"What specific version tag or SHA digest should I pin this image to?"_ Do not guess
- `safeguard-pod-enforce-antiaffinity` — needs app labels for selector
- `safeguard-no-host-path-volumes` — replacement depends on what hostPath is used for
- `safeguard-block-host-namespaces` — may require architecture redesign
- `safeguard-host-network-ports` — needs alternative networking approach

For incompatible findings (e.g., hostPath volumes), explain the issue and propose alternatives. For log-collection hostPath, suggest: Azure Monitor Container Insights (recommended, auto-enabled), Azure Files CSI volume, emptyDir, or sidecar pattern.

**Fix application flow:**
1. Generate the fix as a YAML diff
2. Show the diff with explanation
3. Wait for explicit approval: "apply", "edit", or "skip"
4. On approval, apply the change to the file
5. Move to the next finding

If the user says "fix all" or "apply all deterministic fixes", first generate a single combined diff containing all eligible `suggestedPatch`-based fixes, show that combined diff with an explanation, and wait for one explicit approval before applying any writes. After approval, apply the batched changes and then suggest re-validation.

### Step 5: Recommend Next Steps

**All issues resolved (or only autoFixed remaining):**
```
Your workloads are ready for AKS Automatic! Next steps:
1. Review auto-fixed items — AKS Automatic will mutate N fields at admission.
2. Apply cluster configuration changes (see cluster config issues above).
3. Perform the SKU switch — follow the migration guide.
4. Verify — after migration, check all workloads are running and healthy.
```
See `references/migration-guide-summary.md` for the full migration checklist.

**Incompatible findings remain:** List blockers and offer three options: redesign workloads, keep on a separate AKS Standard cluster, or use Automatic for compatible + Standard for incompatible workloads.

**Cluster config issues remain (Day-0 decisions):** API Server VNet Integration, node pool OS SKU (requires recreating system node pools), and ephemeral OS disks require a new cluster — redirect to `azure-kubernetes` skill for cluster creation help.

## Error Handling

| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| MCP tool call fails or times out | Invalid credentials or subscription context | Verify `az login`, confirm active subscription with `az account show`; if MCP remains unavailable, continue with offline validation using local or exported manifests and the bundled constraint spec |
| HTTP 403 on assessment action | Missing permission | Ensure caller has sufficient RBAC access to read and assess the cluster via AKS APIs |
| API returns HTTP 202 | Large cluster (500+ workloads) — async operation | Poll the `Location` header URL using `Retry-After` interval |
| Helm chart uses Go templating — cannot evaluate | Template values not resolved | Ask user for rendered output (`helm template`) or values files |
| Constraint spec version mismatch | Skill bundles spec v1.1.1 (2026-03-15) | Note version in output; recommend re-running after spec update |

## Reference Files

| File | When to load |
|------|--------------|
| `references/constraint-spec-v1.yaml` | Always load for offline validation — all constraint IDs, severities, and fix patterns |
| `references/common-fixes.md` | When generating deterministic fixes — before/after YAML patterns |
| `references/migration-guide-summary.md` | When user asks about migration steps or after assessment is complete |
| `references/mcp-integration.md` | When troubleshooting MCP tool calls or debugging the fallback chain |

> ⚠️ **Warning:** This skill bundles **constraint spec v1.1.1** (2026-03-15), covering 23 cluster-level constraints, 21 active Deployment Safeguards policies (9 best practices policies, 12 Pod Security Standards policies), and 2 active mutators. Always note the spec version in assessment output.

azure-kubernetes-automatic-readiness/references/

common-fixes.md 6.6 KB

# Common Fix Patterns for AKS Automatic Compatibility

Loaded on demand when generating YAML fixes during assessment.
Maps to constraint IDs in `constraint-spec-v1.yaml`.

---

## `safeguard-container-resource-requests` — Add resource requests/limits

**Before:**
```yaml
containers:
  - name: web
    image: myapp:v1.0.0
```

**After:**
```yaml
containers:
  - name: web
    image: myapp:v1.0.0
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
```

> 💡 **Tip:** Use safe minimums as starting values. VPA (auto-enabled on AKS Automatic) will tune these after deployment based on actual usage.

---

## `safeguard-container-capabilities` — Drop all capabilities

**Before:**
```yaml
securityContext:
  capabilities:
    add: ["NET_ADMIN"]
```

**After:**
```yaml
securityContext:
  capabilities:
    drop: ["ALL"]
```

> ⚠️ **Warning:** If the app genuinely requires `NET_ADMIN` or similar, it is **incompatible** with AKS Automatic. Do not silently drop — explain the incompatibility and suggest redesign.

---

## `safeguard-allowed-seccomp-profiles` — Add seccomp profile

**Before:**
```yaml
spec:
  containers:
    - name: web
```

**After:**
```yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: web
```

---

## `safeguard-allowed-seccomp-profiles` — Remove 'Unconfined' seccomp profile

**Before:**
```yaml
spec:
  securityContext:
    seccompProfile:
      type: Unconfined
  containers:
    - name: web
```

**After:**
```yaml
spec:
  containers:
    - name: web
```

---

## `safeguard-enforce-apparmor` — Add AppArmor annotation

**Before:**
```yaml
metadata:
  name: my-deployment
```

**After:**
```yaml
metadata:
  name: my-deployment
  annotations:
    container.apparmor.security.beta.kubernetes.io/web: runtime/default
```

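> 💡 **Tip:** Replace `web` with the actual container name. Add one annotation per container. For a Deployment or other controller, the annotation must sit on the pod template (`spec.template.metadata.annotations`) to reach the pods.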
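
As an illustration of "one annotation per container", the sketch below derives the annotation map from a pod spec held as a dict. The function name and sample spec are hypothetical; the annotation key and value are the ones shown in the pattern above.

```python
APPARMOR_PREFIX = "container.apparmor.security.beta.kubernetes.io/"


def apparmor_annotations(pod_spec: dict) -> dict:
    """Build one runtime/default AppArmor annotation per container name."""
    names = [
        container["name"]
        for key in ("containers", "initContainers")
        for container in pod_spec.get(key, [])
    ]
    return {f"{APPARMOR_PREFIX}{name}": "runtime/default" for name in names}


# Hypothetical pod spec with two containers.
pod_spec = {"containers": [{"name": "web"}, {"name": "sidecar"}]}
annotations = apparmor_annotations(pod_spec)
for key, value in annotations.items():
    print(f"{key}: {value}")
```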

---

## `safeguard-images-no-latest` — Pin image tag *(LLM-reasoned — ask user)*

**Before:**
```yaml
image: myapp:latest
```

**After:**
```yaml
image: myapp:v1.2.3   # ← version confirmed with user
```

> ⚠️ **Warning:** Do not guess the version. Ask the user: _"What specific version tag or SHA digest should I pin this image to?"_ If from a public registry, suggest checking Docker Hub or the registry for the latest stable tag.

---

## `safeguard-probes-configured` — Add probes *(best-practice recommendation — warning-only, not blocked at admission)*

**HTTP app (most common):**
```yaml
readinessProbe:
  httpGet:
    path: /healthz        # ← ask user for their health endpoint
    port: 8080            # ← ask user for port
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
```

**TCP-only app (databases, Redis, etc.):**
```yaml
readinessProbe:
  tcpSocket:
    port: 6379           # ← service port
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 6379
  initialDelaySeconds: 15
  periodSeconds: 20
```

**gRPC app:**
```yaml
readinessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## `safeguard-host-probes` — Remove host field in probes and lifecycle hooks

**Before:**
```yaml
spec:
  containers:
  - name: my-container
    image: nginx:v1.2.3
    livenessProbe:
      httpGet:
        host: "my-host"
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 3
```

**After:**

Remove the `host` field, for example:
```yaml
spec:
  containers:
  - name: my-container
    image: nginx:v1.2.3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 3
```

---

## `safeguard-pod-enforce-antiaffinity` — Add topology spread *(LLM-reasoned — ask user for label)*

Ask user: _"What label key/value identifies your workload's pods?"_

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: <app-label>     # ← from user
      containers:
        - name: web
```

---

## `safeguard-csi-driver-storage-class` — Migrate in-tree to CSI

**Before (Azure Disk in-tree):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

**After (Azure Disk CSI):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # ← preferred for zonal disks
```

| In-tree provisioner | CSI replacement |
|---|---|
| `kubernetes.io/azure-disk` | `disk.csi.azure.com` |
| `kubernetes.io/azure-file` | `file.csi.azure.com` |
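
The mapping table lends itself to a small lookup. The sketch below swaps the provisioner on a StorageClass held as a dict and leaves every other field alone; the function name is hypothetical, and reviewing `parameters` and `volumeBindingMode` after the swap is left to the reader.

```python
import copy

# In-tree provisioner -> CSI replacement, taken from the table above.
CSI_REPLACEMENTS = {
    "kubernetes.io/azure-disk": "disk.csi.azure.com",
    "kubernetes.io/azure-file": "file.csi.azure.com",
}


def migrate_provisioner(storage_class: dict) -> dict:
    """Return a copy of the StorageClass with an in-tree provisioner replaced."""
    migrated = copy.deepcopy(storage_class)
    replacement = CSI_REPLACEMENTS.get(storage_class.get("provisioner"))
    if replacement is not None:
        migrated["provisioner"] = replacement
    return migrated


in_tree = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "fast-storage"},
    "provisioner": "kubernetes.io/azure-disk",
    "parameters": {"skuName": "Premium_LRS"},
}
migrated = migrate_provisioner(in_tree)
print(migrated["provisioner"])
```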

---

## PodDisruptionBudget — Add missing PDB

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <app-name>-pdb
  namespace: <namespace>
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: <app-label>
```

## PodDisruptionBudget — Fix blocking `maxUnavailable: 0`

**Before:**
```yaml
spec:
  maxUnavailable: 0
```

**After:**
```yaml
spec:
  maxUnavailable: 1
```
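A quick check for the blocking pattern can run before an upgrade. This sketch assumes a parsed PDB `spec` and a known replica count; `pdbBlocksDrain` is a hypothetical helper, and its `minAvailable` cases go beyond the `maxUnavailable: 0` case shown here (they block evictions for the same reason):

```javascript
// Hypothetical helper: returns true when a PDB leaves no room for voluntary evictions.
function pdbBlocksDrain(pdbSpec, replicas) {
  const { maxUnavailable, minAvailable } = pdbSpec;

  // Zero allowed disruptions, as an integer or a percentage.
  if (maxUnavailable === 0 || maxUnavailable === "0%") return true;

  // Every replica must stay available, so none can be evicted.
  if (minAvailable === "100%") return true;
  if (Number.isInteger(minAvailable) && minAvailable >= replicas) return true;

  return false;
}
```

Other percentage values are not evaluated here; they would need Kubernetes' rounding rules to be judged accurately.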

> ⚠️ **Warning:** A PDB with `maxUnavailable: 0` blocks every voluntary eviction, so node drain cannot complete during AKS Automatic upgrades. Allow at least one pod to be unavailable so upgrades can proceed.

---

## `safeguard-no-host-path-volumes` — Replace hostPath *(incompatible — suggest alternatives)*

| hostPath use case | Recommended replacement |
|---|---|
| Log collection (`/var/log`) | Azure Monitor Container Insights (auto-enabled on AKS Automatic) |
| Container runtime socket (`/var/run/docker.sock`) | Use the AKS Automatic node observability features — direct socket access not supported |
| Shared config files | `configMap` volume |
| Secrets / credentials | Kubernetes `secret` volume or Azure Key Vault CSI Driver |
| Ephemeral scratch space | `emptyDir` volume |
| Persistent app data | Azure Disk CSI via PVC (`disk.csi.azure.com`) |
| Shared file storage across pods | Azure Files CSI via PVC (`file.csi.azure.com`) |

**emptyDir example:**
```yaml
volumes:
  - name: scratch
    emptyDir: {}
```

**Azure Files CSI PVC example:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: logs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 10Gi
```

constraint-spec-v1.yaml 18.9 KB

# AKS Automatic Compatibility Constraint Spec — Condensed Reference
# Version: 1.1.1 | AKS: 2026-03-15
# This condensed version is optimized for LLM context.

apiVersion: aks-automatic.azure.com/v1
kind: ConstraintSpecReference
metadata:
  version: "1.1.1"
  aksVersion: "2026-03-15"
  policyInitiatives:
    deploymentSafeguards: c047ea8e-9c78-49b2-958b-37e56d291a44
    podSecurityBaseline: a8640138-9b0a-4a28-b8cb-1666c838647d
    podSecurityRestricted: 42b8ef37-b724-4e24-bbc8-7a7708edfe00

# =============================================================================
# CLUSTER CONSTRAINTS (23 total)
# =============================================================================
clusterConstraints:
  # -- Addons --
  - id: cluster-azure-policy-addon
    severity: requiresChanges
    field: addonProfiles.azurepolicy.enabled
    required: true
    fix: "az aks addon enable --addon azure-policy"

  - id: cluster-keyvault-secrets-provider
    severity: requiresChanges
    field: addonProfiles.azureKeyvaultSecretsProvider.enabled
    required:
      enabled: true
      enableSecretRotation: true
    fix: "az aks addon enable --addon azure-keyvault-secrets-provider --enable-secret-rotation"

  # -- Networking --
  - id: cluster-api-server-vnet-integration
    severity: requiresChanges
    field: privateConnectProfile.enabled
    required: true
    fix: "az aks update --enable-apiserver-vnet-integration --apiserver-subnet-id <subnet-id>"

  - id: cluster-azure-cni-overlay-cilium
    severity: requiresChanges
    field: networkPlugin/networkPluginMode/networkPolicy/ebpfDataplane
    required: azure/overlay/cilium/cilium
    fix: |
      Step 1: az aks update --network-plugin-mode overlay --pod-cidr 192.168.0.0/16
      Step 2: az aks update --network-dataplane cilium
      Note: Irreversible. Disable NAP before Cilium update.

  - id: cluster-standard-load-balancer
    severity: requiresChanges
    field: loadBalancerSku
    required: standard
    fix: "az aks update --load-balancer-sku standard (in-place upgrade from Basic supported)"

  - id: cluster-nat-gateway-managed-vnet
    severity: requiresChanges
    condition: AKS-managed VNet only
    field: outboundType
    required: managedNATGateway
    fix: "az aks update --outbound-type managedNATGateway"

  # -- Upgrades --
  - id: cluster-auto-upgrade
    severity: requiresChanges
    field: autoUpgradeProfile
    required: upgradeChannel=stable, nodeOSUpgradeChannel=NodeImage
    fix: "az aks update --auto-upgrade-channel stable --node-os-upgrade-channel NodeImage"

  # -- Ingress --
  - id: cluster-web-app-routing
    severity: requiresChanges
    field: ingressProfile.webAppRouting.enabled
    required: true
    fix: "az aks addon enable --addon web_application_routing"

  # -- Identity --
  - id: cluster-workload-identity-oidc
    severity: requiresChanges
    field: workloadIdentity.enabled + oidcProfile.enabled
    required: true
    fix: "az aks update --enable-oidc-issuer --enable-workload-identity"

  - id: cluster-azure-rbac
    severity: requiresChanges
    field: aadProfile (managed + enableAzureRBAC)
    required: true
    fix: "az aks update --enable-aad --enable-azure-rbac"

  - id: cluster-disable-local-accounts
    severity: requiresChanges
    field: disableLocalAccounts
    required: true
    fix: "az aks update --disable-local-accounts"

  - id: cluster-system-assigned-managed-identity
    severity: requiresChanges
    condition: AKS-managed VNet only
    field: identity.type
    required: SystemAssigned
    fix: "Day-0 decision for managed VNet clusters."

  # -- Security --
  - id: cluster-image-cleaner
    severity: requiresChanges
    field: securityProfile.imageCleaner.enabled
    required: true
    fix: "az aks update --enable-image-cleaner"

  # -- Autoscaling --
  - id: cluster-vpa
    severity: requiresChanges
    field: verticalPodAutoscaler
    required: enabled=true, updateMode=Off
    fix: "az aks update --enable-vpa"

  - id: cluster-keda
    severity: requiresChanges
    field: keda.enabled
    required: true
    fix: "az aks update --enable-keda"

  - id: cluster-node-auto-provisioning
    severity: requiresChanges
    field: nodeProvisioningProfile.mode
    required: Auto
    fix: "az aks update --node-provisioning-mode Auto"

  # -- Governance --
  - id: cluster-node-rg-readonly
    severity: requiresChanges
    field: nodeResourceGroupProfile.restrictionLevel
    required: ReadOnly
    fix: "Day-0 setting. May require new cluster."

  # -- Node Pool (system pools) --
  - id: pool-ephemeral-os-disk
    severity: incompatible
    appliesTo: system pools
    field: storageProfile
    required: Ephemeral
    fix: "Day-0. Recreate system node pool."

  - id: pool-availability-zones
    severity: incompatible
    appliesTo: system pools only
    field: availabilityZones
    required: "[1, 2, 3]"
    fix: "Day-0. Recreate system pool in 3-AZ region. User pools not affected."

  - id: pool-critical-addons-taint
    severity: requiresChanges
    appliesTo: system pools
    field: taints
    required: CriticalAddonsOnly=true:NoSchedule
    fix: "az aks nodepool update --node-taints CriticalAddonsOnly=true:NoSchedule"

  - id: pool-vmss-type
    severity: incompatible
    appliesTo: system pools
    field: type
    required: VirtualMachineScaleSets
    fix: "Day-0. Recreate as VMSS."

  - id: pool-azure-linux-os
    severity: incompatible
    appliesTo: system pools only
    field: osSKU
    required: AzureLinux
    fix: "Day-0. Recreate system pool with --os-sku AzureLinux. User pools can use any OS."

  - id: pool-ssh-disabled
    severity: requiresChanges
    appliesTo: all pools
    field: agentPoolProfiles[*].securityProfile.sshAccess
    required: Disabled
    fix: "az aks nodepool update --cluster-name CLUSTER --name POOL_NAME --ssh-access disabled"

# =============================================================================
# WORKLOAD CONSTRAINTS — Deployment Safeguards (21 active policies)
# Initiative: c047ea8e | Effect: Mixed(Deny/Warn/Mutate) on Automatic
# =============================================================================
safeguards:
  # -- AKS Best Practices (9 policies) --
  - id: safeguard-restricted-node-edits
    policyId: 53a4a537
    severity: requiresChanges
    category: nodeProtection
    check: Check if a rolebinding for a service account references a role with node edit permissions. The application might try to edit node objects directly
    fix: Manage node pools through the AKS API (az aks nodepool) instead of direct Node object edits

  - id: safeguard-container-resource-requests
    policyId: 03a4ecdb
    severity: autoFixed
    category: resources
    check: Every container must have cpu + memory requests and limits
    effect: "ResourceRequestsWorkloadMutator sets defaults cpu=500m, memory=2Gi for requests+limits; enforces minimums cpu=100m, memory=100Mi; fixes QoS if requests > limits"

  - id: safeguard-pod-enforce-antiaffinity
    policyId: 34c88cd4
    severity: autoFixed
    category: availability
    check: Replicated workloads with >1 replica should have podAntiAffinity or topologySpreadConstraints
    effect: "AntiAffinityTopologySpreadWorkloadMutator adds preferred anti-affinity (weight=100, hostname) + topology spread (maxSkew=1, hostname, ScheduleAnyway) if neither exists"

  - id: safeguard-restricted-labels
    policyId: a22123bd
    severity: requiresChanges
    category: labeling
    check: AKS-reserved label prefixes blocked
    fix: Remove/rename labels with kubernetes.azure.com/ prefix

  - id: safeguard-restricted-taints
    policyId: 48940d92
    severity: requiresChanges
    category: nodeProtection
    check: AKS-reserved taint CriticalAddonsOnly key blocked for users
    fix: Remove reserved taints, use custom taint keys

  - id: safeguard-probes-configured
    policyId: b1a9997f
    severity: informational
    enforcement: warn  # Warning-only — deployments are admitted with a kubectl warning, not denied
    category: reliability
    check: Every container should have readinessProbe + livenessProbe (recommended best practice)
    fix: Add probes (app-specific — HTTP/TCP/exec/gRPC) — recommended, not required for migration

  - id: safeguard-csi-driver-storage-class
    policyId: 4f3823b6
    severity: requiresChanges
    category: storage
    check: StorageClass must use CSI provisioner (not in-tree)
    fix: "Replace kubernetes.io/azure-disk → disk.csi.azure.com, also replace kubernetes.io/azure-file with file.csi.azure.com"

  - id: safeguard-unique-service-selectors
    policyId: b0fdedee
    severity: requiresChanges
    category: networking
    check: Services must have unique selectors per namespace
    fix: Deduplicate Service selectors

  - id: safeguard-images-no-latest
    policyId: 021f8078
    severity: requiresChanges
    category: imagePolicy
    check: Image tag must not be :latest or untagged (no colon)
    patch: "replace image tag with specific version or sha256 digest"

  # -- PSS-related policies in Safeguards (12 policies) --
  - id: safeguard-block-host-namespaces
    policyId: 47a1ee2f
    severity: incompatible
    category: podSecurity
    check: |
      Sharing the host PID or IPC namespaces is disallowed in the Baseline policy.
      Check the following fields:
      - spec.hostPID
      - spec.hostIPC
    fix: |
      The allowed values are:
      - undefined/nil
      - false
      Remove hostPID and hostIPC; incompatible if required.

  - id: safeguard-host-network-ports
    policyId: 82985f06
    severity: incompatible
    category: podSecurity
    check: |
      Sharing the host network namespace is disallowed, and host ports should not be used.
      Check the following fields:
      - spec.hostNetwork
      - spec.containers[*].ports[*].hostPort
      - spec.initContainers[*].ports[*].hostPort
      - spec.ephemeralContainers[*].ports[*].hostPort
    fix: |
      The allowed values are:
      - spec.hostNetwork: undefined/nil or false
      - hostPort fields: undefined/nil or 0
      Use ClusterIP Services, Ingress, or internal Pod networking instead of host networking or host ports.

  - id: safeguard-allowed-sysctls
    policyId: 5e5a0673
    severity: requiresChanges
    category: podSecurity
    check: |
      Sysctls are limited to the Baseline safe subset.
      Check the following field:
      - spec.securityContext.sysctls[*].name
    fix: |
      The allowed values are:
      - undefined/nil
      - kernel.shm_rmid_forced
      - net.ipv4.ip_local_port_range
      - net.ipv4.ip_unprivileged_port_start
      - net.ipv4.tcp_syncookies
      - net.ipv4.ping_group_range
      - net.ipv4.ip_local_reserved_ports
      - net.ipv4.tcp_keepalive_time
      - net.ipv4.tcp_fin_timeout
      - net.ipv4.tcp_keepalive_intvl
      - net.ipv4.tcp_keepalive_probes
      Remove any sysctl not in this list.

  - id: safeguard-no-host-path-volumes
    policyId: 098fc59e
    severity: incompatible
    category: podSecurity
    check: |
      HostPath volumes are forbidden in the Baseline policy.
      Check the following field:
      - spec.volumes[*].hostPath
    fix: |
      The allowed values are:
      - undefined/nil
      Replace hostPath volumes with PVCs, ConfigMaps, Secrets, CSI-backed storage, or another non-hostPath volume type.

  - id: safeguard-enforce-apparmor
    policyId: 511f5417
    severity: requiresChanges
    category: podSecurity
    check: |
      On supported hosts, the Baseline policy does not allow disabling the default AppArmor profile.
      Check the following fields:
      - spec.securityContext.appArmorProfile.type
      - spec.containers[*].securityContext.appArmorProfile.type
      - spec.initContainers[*].securityContext.appArmorProfile.type
      - spec.ephemeralContainers[*].securityContext.appArmorProfile.type
      - metadata.annotations["container.apparmor.security.beta.kubernetes.io/*"]
    fix: |
      The allowed values are:
      - appArmorProfile.type: undefined/nil, RuntimeDefault, or Localhost
      - AppArmor annotation: undefined/nil, runtime/default, or localhost/*
      Set RuntimeDefault, or use an allowed Localhost profile.

  - id: safeguard-enforce-selinux
    policyId: e1e6c427
    severity: informational
    category: podSecurity
    check: |
      SELinux settings are restricted to specific types, and custom user or role values are forbidden.
      Check the following fields:
      - spec.securityContext.seLinuxOptions.type
      - spec.containers[*].securityContext.seLinuxOptions.type
      - spec.initContainers[*].securityContext.seLinuxOptions.type
      - spec.ephemeralContainers[*].securityContext.seLinuxOptions.type
      - spec.securityContext.seLinuxOptions.user
      - spec.containers[*].securityContext.seLinuxOptions.user
      - spec.initContainers[*].securityContext.seLinuxOptions.user
      - spec.ephemeralContainers[*].securityContext.seLinuxOptions.user
      - spec.securityContext.seLinuxOptions.role
      - spec.containers[*].securityContext.seLinuxOptions.role
      - spec.initContainers[*].securityContext.seLinuxOptions.role
      - spec.ephemeralContainers[*].securityContext.seLinuxOptions.role
    fix: |
      The allowed values are:
      - seLinuxOptions.type: undefined/"", container_t, container_init_t, container_kvm_t, or container_engine_t
      - seLinuxOptions.user: undefined/""
      - seLinuxOptions.role: undefined/""
      Optional hardening only: remove custom seLinuxOptions or use one of the allowed types.

  - id: safeguard-windows-block-host-process
    policyId: 077f0ce1
    severity: incompatible
    category: podSecurity
    check: |
      Windows Pods offer the ability to run HostProcess containers which enables privileged access to the Windows host machine. Privileged access to the host is disallowed in the Baseline policy.
      Check the following fields:
      - spec.securityContext.windowsOptions.hostProcess
      - spec.containers[*].securityContext.windowsOptions.hostProcess
      - spec.initContainers[*].securityContext.windowsOptions.hostProcess
      - spec.ephemeralContainers[*].securityContext.windowsOptions.hostProcess
    fix: |
      The allowed values are:
      - undefined/nil
      - false
      Remove hostProcess; incompatible if required.

  - id: safeguard-no-privileged-containers
    policyId: 95edb821
    severity: incompatible
    category: podSecurity
    check: |
      Privileged containers are disallowed in the Baseline policy.
      Check the following fields:
      - spec.containers[*].securityContext.privileged
      - spec.initContainers[*].securityContext.privileged
      - spec.ephemeralContainers[*].securityContext.privileged
    fix: |
      The allowed values are:
      - undefined/nil
      - false
      Remove privileged mode or set privileged to false; incompatible if privileged access is required.

  - id: safeguard-no-custom-proc-mount
    policyId: f85eb0dd
    severity: requiresChanges
    category: podSecurity
    check: |
      Custom /proc mount types are disallowed.
      Check the following fields:
      - spec.containers[*].securityContext.procMount
      - spec.initContainers[*].securityContext.procMount
      - spec.ephemeralContainers[*].securityContext.procMount
    fix: |
      The allowed values are:
      - undefined/nil
      - Default
      Remove custom procMount values or set procMount to Default.

  - id: safeguard-container-capabilities
    policyId: c26596ff
    severity: requiresChanges
    category: podSecurity
    check: |
      Adding capabilities is limited to the Baseline allowlist.
      Check the following fields:
      - spec.containers[*].securityContext.capabilities.add
      - spec.initContainers[*].securityContext.capabilities.add
      - spec.ephemeralContainers[*].securityContext.capabilities.add
    fix: |
      The allowed values are:
      - undefined/nil
      - AUDIT_WRITE
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - FSETID
      - KILL
      - MKNOD
      - NET_BIND_SERVICE
      - SETFCAP
      - SETGID
      - SETPCAP
      - SETUID
      - SYS_CHROOT
      Remove any added capability outside this list.

  - id: safeguard-host-probes
    policyId: acdf8909
    severity: requiresChanges
    category: podSecurity
    check: |
      The host field in probes and lifecycle hooks is disallowed.
      Restricted fields:
        - spec.containers[*].livenessProbe.httpGet.host
        - spec.containers[*].readinessProbe.httpGet.host
        - spec.containers[*].startupProbe.httpGet.host
        - spec.containers[*].livenessProbe.tcpSocket.host
        - spec.containers[*].readinessProbe.tcpSocket.host
        - spec.containers[*].startupProbe.tcpSocket.host
        - spec.containers[*].lifecycle.postStart.tcpSocket.host
        - spec.containers[*].lifecycle.preStop.tcpSocket.host
        - spec.containers[*].lifecycle.postStart.httpGet.host
        - spec.containers[*].lifecycle.preStop.httpGet.host
        - spec.initContainers[*].livenessProbe.httpGet.host
        - spec.initContainers[*].readinessProbe.httpGet.host
        - spec.initContainers[*].startupProbe.httpGet.host
        - spec.initContainers[*].livenessProbe.tcpSocket.host
        - spec.initContainers[*].readinessProbe.tcpSocket.host
        - spec.initContainers[*].startupProbe.tcpSocket.host
        - spec.initContainers[*].lifecycle.postStart.tcpSocket.host
        - spec.initContainers[*].lifecycle.preStop.tcpSocket.host
        - spec.initContainers[*].lifecycle.postStart.httpGet.host
        - spec.initContainers[*].lifecycle.preStop.httpGet.host
    fix: |
      The allowed values are:
        - undefined/nil
        - ""
      Remove the `host` field from probes and lifecycle hooks; the kubelet uses the pod IP by default.

  - id: safeguard-allowed-seccomp-profiles
    policyId: 975ce327
    severity: requiresChanges
    category: podSecurity
    check: |
      Seccomp must not be explicitly set to Unconfined.
      Check the following fields:
      - spec.securityContext.seccompProfile.type
      - spec.containers[*].securityContext.seccompProfile.type
      - spec.initContainers[*].securityContext.seccompProfile.type
      - spec.ephemeralContainers[*].securityContext.seccompProfile.type
    fix: |
      The allowed values are:
      - undefined/nil
      - RuntimeDefault
      - Localhost
      Remove Unconfined, or set seccompProfile.type to RuntimeDefault or Localhost.

# =============================================================================
# WEBHOOK MUTATIONS (2 active mutators) — auto-applied at admission
# =============================================================================
mutations:
  - id: mutation-anti-affinity-topology-spread
    policyId: implicit
    target: [Deployment, StatefulSet, ReplicaSet]
    effect: "Adds preferred pod anti-affinity (weight=100, kubernetes.io/hostname) + topology spread (maxSkew=1, kubernetes.io/hostname, ScheduleAnyway). Skips if any existing anti-affinity or topology spread. Label priority: app > app.kubernetes.io/name > default-antiaffinity-applabel."

  - id: mutation-resource-requests-default
    policyId: implicit
    target: containers
    effect: "Sets resources.requests+limits defaults cpu=500m, memory=2Gi. Minimums cpu=100m, memory=100Mi. If only limits set, requests=limits. If requests > limits, requests capped at limits (QoS fix)."

mcp-integration.md 6.5 KB

# MCP Integration Reference

Loaded when troubleshooting MCP tool calls, debugging the fallback chain, or understanding the API response format.

---

## Tool Discovery

Always call `mcp_azure_mcp_aks` first to discover the currently available tool surface. Do not assume a fixed action name: the available actions depend on the MCP server version deployed to the client.

```javascript
mcp_azure_mcp_aks({ action: "discover" })
```

The response lists available actions and their parameter schemas. Use the returned schema — do not hardcode parameter names.

---

## Assessment Call

After calling `discover`, use the assessment action name returned in the response. Pass parameters according to the discovered schema — do not hardcode action names or API versions.

Typical parameters include:
- `subscriptionId` — Azure subscription ID
- `resourceGroupName` — resource group containing the cluster
- `resourceName` — AKS cluster name
- `scope` (optional) — filter by namespaces or workload types

Example shape (use actual action name and schema from discover output):
```javascript
mcp_azure_mcp_aks({
  action: "<action-from-discover>",
  subscriptionId: "<subscription-id>",
  resourceGroupName: "<resource-group>",
  resourceName: "<cluster-name>",
  scope: {
    excludeNamespaces: ["kube-system", "gatekeeper-system", "azure-arc"],
    workloadTypes: ["Deployment", "StatefulSet", "DaemonSet", "CronJob", "Job"]
  }
})
```

All `scope` parameters are optional. If omitted, the API assesses all workloads excluding `kube-system` and `gatekeeper-system`.

---

## Required Permissions

```bash
# Check current role assignments
az role assignment list \
  --assignee $(az ad signed-in-user show --query id -o tsv) \
  --scope /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>

# Minimum permissions required:
# - Microsoft.ContainerService/managedClusters/read
# - Microsoft.ContainerService/managedClusters/listClusterUserCredential/action

# Assign if missing (requires Owner or User Access Administrator)
az role assignment create \
  --assignee <principal-id> \
  --role "Azure Kubernetes Service Cluster User Role" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>
```

---

## Response Schema

The API returns three top-level sections:

### `summary`
```json
{
  "summary": {
    "totalWorkloads": 42,
    "compatible": 27,
    "requiresChanges": 12,
    "incompatible": 3,
    "autoFixed": 8,
    "clusterConfigIssues": 4
  }
}
```

### `clusterConfiguration`
```json
{
  "clusterConfiguration": [
    {
      "constraintId": "cluster-workload-identity-oidc",
      "severity": "requiresChanges",
      "description": "OIDC issuer not enabled",
      "remediation": "az aks update --enable-oidc-issuer --resource-group <rg> --name <cluster>",
      "documentationUrl": "https://learn.microsoft.com/azure/aks/..."
    }
  ]
}
```

### `workloads[]`
```json
{
  "workloads": [
    {
      "name": "sample-app",
      "namespace": "default",
      "kind": "Deployment",
      "overallStatus": "requiresChanges",
      "issues": [
        {
          "constraintId": "safeguard-images-no-latest",
          "severity": "requiresChanges",
          "description": "Container 'web' uses :latest image tag",
          "field": "/spec/containers/0/image",
          "suggestedPatch": null,
          "remediationGuide": "Pin the image to a specific version or SHA digest"
        }
      ]
    }
  ]
}
```
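To turn a response of this shape into a work list, issues can be grouped by severity. A minimal sketch, assuming the `workloads[]` structure shown above; `groupIssuesBySeverity` is a hypothetical helper and not part of the MCP server:

```javascript
// Hypothetical helper: groups assessment issues by severity for reporting.
function groupIssuesBySeverity(response) {
  const bySeverity = {};
  for (const workload of response.workloads ?? []) {
    for (const issue of workload.issues ?? []) {
      if (!bySeverity[issue.severity]) bySeverity[issue.severity] = [];
      bySeverity[issue.severity].push(
        `${workload.namespace}/${workload.name}: ${issue.constraintId}`
      );
    }
  }
  return bySeverity;
}
```

Entries under `incompatible` are the hard blockers to resolve first; `requiresChanges` entries follow.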

---

## Async Response Handling (HTTP 202 — Large Clusters)

For clusters with 500+ workloads, the API returns HTTP 202 Accepted with a `Location` header. Poll until complete:

```javascript
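// Initial call returns: { status: 202, headers: { Location: "...", "Retry-After": "30" } }
// "pollOperation" is illustrative; use the polling action name returned by discover.
// maxAttempts is an assumed safety cap so a stuck operation cannot poll forever.
async function pollAssessment(locationUrl, retryAfterSeconds, maxAttempts = 40) {
  let delaySeconds = Number(retryAfterSeconds) || 30; // Retry-After arrives as a string
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await new Promise(r => setTimeout(r, delaySeconds * 1000));
    const response = await mcp_azure_mcp_aks({
      action: "pollOperation",
      locationUrl: locationUrl
    });
    if (response.status === "Succeeded") return response.result;
    if (response.status === "Failed") throw new Error(response.error?.message ?? "Assessment failed");
    // Honor an updated retry interval when the server provides one
    delaySeconds = Number(response.retryAfter) || delaySeconds;
  }
  throw new Error(`Assessment still running after ${maxAttempts} polls`);
}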
```

---

## Fallback Chain

Attempt each step in order. Do not ask the user which is available — just try:

```
1. mcp_azure_mcp_aks → discover, then call the assessment action returned
   ↓ fails (tool not found — Azure MCP server not configured)

2. Inform user to install Azure MCP, then fall back to offline validation
   kubectl get deployment,statefulset,daemonset,job,cronjob -A -o yaml > /tmp/workloads.yaml
   kubectl get pdb,storageclass -A -o yaml > /tmp/policies.yaml
```
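The chain above can be sketched as a single function. This is an illustration under stated assumptions: `callTool` stands in for however the client invokes `mcp_azure_mcp_aks`, and a missing tool is assumed to surface as an error whose message contains "tool not found" (the form listed under Common MCP Errors):

```javascript
// Sketch only: `callTool` is a hypothetical stand-in for the client's MCP tool invocation.
const OFFLINE_EXPORT_COMMANDS = [
  "kubectl get deployment,statefulset,daemonset,job,cronjob -A -o yaml > /tmp/workloads.yaml",
  "kubectl get pdb,storageclass -A -o yaml > /tmp/policies.yaml",
];

async function assessWithFallback(callTool) {
  try {
    // Step 1: discover the tool surface; the assessment action comes from this response.
    const discovery = await callTool({ action: "discover" });
    return { mode: "live", discovery };
  } catch (err) {
    // Step 2: fall back to offline validation only when the tool itself is missing.
    if (/tool not found/i.test(String(err?.message))) {
      return { mode: "offline", commands: OFFLINE_EXPORT_COMMANDS };
    }
    throw err; // auth or lookup failures need their own fix, not a silent fallback
  }
}
```

Rethrowing other errors keeps 401, 403, and 404 failures visible instead of masking them behind the offline path.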

If `mcp_azure_mcp_aks` is not available, say:
> "The Azure MCP server is not configured. To enable live cluster assessment, install it following [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup). For now, I can validate your local manifests offline — export them with `kubectl get ... -o yaml` or share your manifest files."

Then proceed to offline manifest validation against `constraint-spec-v1.yaml`.

---

## Prerequisites Verification

Run these before attempting MCP or CLI assessment:

```bash
# 1. Verify Azure login
az account show --query "{name:name, id:id, state:state}" -o table

# 2. Verify cluster exists and is accessible
az aks show \
  --resource-group <rg> \
  --name <cluster> \
  --query "{name:name, provisioningState:provisioningState, sku:sku.name}" \
  -o table

# 3. Verify kubectl context
kubectl config current-context
kubectl cluster-info
```

```javascript
// 4. Verify MCP server is reachable (Azure MCP)
// If this returns available actions, MCP is configured
mcp_azure_mcp_aks({ action: "discover" })
```

---

## Common MCP Errors

| Error | Cause | Fix |
|---|---|---|
| `tool not found: mcp_azure_mcp_aks` | Azure MCP server not configured | Guide user to install: [aka.ms/azure-mcp-setup](https://aka.ms/azure-mcp-setup), then fall back to offline |
| `HTTP 401 Unauthorized` | Not logged in | `az login` |
| `HTTP 403 Forbidden` | Insufficient RBAC permissions | Ensure caller has read access to the cluster via AKS APIs |
| `HTTP 404 Not Found` | Wrong subscription, RG, or cluster name | Verify with `az aks list -o table` |
| `HTTP 202` with no Location header | API version mismatch | Ensure the MCP server version supports async polling; retry with the latest server |
| Timeout after 30s | Cluster too large (500+ workloads) | Implement async polling — see section above |

migration-guide-summary.md 4.4 KB

# AKS Automatic Migration Guide

Loaded when user asks about migration steps or after assessment is complete.

---

## Migration Checklist

### Phase 1 — Assessment (this skill)

- [ ] Run the AKS Automatic compatibility assessment (via `mcp_azure_mcp_aks({ action: "discover" })` then the assessment action returned, or the offline manifest scan)
- [ ] Resolve all `incompatible` findings — these are hard blockers
- [ ] Apply all `requiresChanges` fixes: workloads left unfixed will be denied at admission
- [ ] Review `autoFixed` items — understand what AKS Automatic will mutate at runtime
- [ ] Address cluster-level Day-0 config issues (see below)

### Phase 2 — Create AKS Automatic Cluster (use `azure-kubernetes` skill)

```bash
az aks create \
  --resource-group <resource-group> \
  --name <new-cluster-name> \
  --sku automatic \
  --location <location> \
  --generate-ssh-keys
```

> 💡 **Tip:** AKS Automatic auto-enables: OIDC issuer, workload identity, Azure CNI Overlay, NAP, VPA, Azure Monitor Container Insights, Deployment Safeguards, and Pod Security Standards (Baseline). No manual configuration needed for these.

### Phase 3 — Validate on New Cluster

```bash
# Get credentials
az aks get-credentials \
  --resource-group <resource-group> \
  --name <new-cluster-name>

# Dry-run server-side apply — catches admission policy rejections
kubectl apply --dry-run=server -f <manifests-directory>/

# Deploy to a staging namespace first
kubectl create namespace staging
kubectl apply -f <manifests-directory>/ -n staging

# Watch pod startup
kubectl get pods -n staging -w

# Check events for admission rejections
kubectl get events -n staging --sort-by=.lastTimestamp | grep -i "denied\|error\|failed"
```

> ⚠️ **Keep the old cluster running** for a rollback window (recommended: 48 hours minimum) while you validate workloads on the new AKS Automatic cluster.

### Phase 4 — Decommission Old Cluster

```bash
# Only after confirming workloads are stable on AKS Automatic
az aks delete \
  --resource-group <resource-group> \
  --name <old-cluster-name> \
  --yes --no-wait
```

---

## Day-0 Decisions — Cluster-Level Configuration Requirements

Some settings require creating a **new** cluster; others can be enabled on existing clusters. Route to `azure-kubernetes` skill for cluster creation.

| Requirement | AKS Automatic default | What to do |
|---|---|---|
| API Server VNet Integration | Required, auto-enabled | Requires a new cluster |
| Network plugin | Azure CNI Overlay | Requires a new cluster if currently on kubenet |
| System node pool OS | Azure Linux | Recreate system node pool (user pools unaffected) |
| OIDC Issuer | Auto-enabled | Can be enabled on existing: `az aks update --enable-oidc-issuer` |
| Workload Identity | Auto-enabled | Can be enabled on existing: `az aks update --enable-workload-identity` |

---

## What AKS Automatic Auto-Enables

No manual setup needed for these — show this list when user asks "what do I get for free":

| Feature | Benefit |
|---|---|
| Node Auto Provisioning (NAP) | Replaces cluster autoscaler; right-sizes node pools automatically |
| Vertical Pod Autoscaler (VPA) | Auto-tunes resource requests after deployment |
| Azure Monitor Container Insights | Logs, metrics, and dashboards out of the box |
| Deployment Safeguards | 21 active policies (mixed Deny/Warn/Mutate effects) + 2 webhook mutators at admission (resource-requests defaults + anti-affinity/topology-spread) |
| Pod Security Standards (Baseline) | Enforced cluster-wide; Restricted available opt-in |
| Managed OIDC Issuer | Required for workload identity |
| Azure Key Vault CSI Driver | Secret injection without static credentials |
| Ephemeral OS disks | Faster node provisioning by default |
| Azure Linux node OS | Smaller footprint, faster boot times |

---

## Post-Migration Verification Commands

```bash
# List pods that are not Running or Completed (expect only the header row)
kubectl get pods -A | grep -v Running | grep -v Completed

# Check for pods stuck in Pending (may indicate resource quota or node issues)
kubectl get pods -A --field-selector status.phase=Pending

# Check Deployment Safeguards are active
kubectl get constrainttemplate -A

# List VPA objects (an empty list still confirms the VPA CRD is installed)
kubectl get vpa -A

# Check NAP node pools
az aks nodepool list \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --query "[].{name:name, mode:mode, osType:osType, count:count}" \
  -o table

# Confirm the Container Insights (omsagent) addon is enabled
az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query addonProfiles.omsagent.enabled
```

references/

azure-aks-autoscaler.md 3.0 KB

# AKS Cluster Autoscaler (CAS)

Enable and tune the Cluster Autoscaler (CAS) so that idle nodes are removed automatically and capacity is added only when pods cannot be scheduled.

## Check CAS Status

```bash
az aks show \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --query "agentPoolProfiles[].{name:name, casEnabled:enableAutoScaling, min:minCount, max:maxCount, count:count}" \
  -o table

az aks show \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --query "autoScalerProfile" -o json
```

## Check Node Utilization (7 days)

Follow the metrics discovery steps in [azure-aks-rightsizing.md](./azure-aks-rightsizing.md#historical-metrics-azure-monitor--use-when-prometheus-or-container-insights-is-enabled) to list available metric definitions and query node CPU utilization. Use metric names such as `node_cpu_usage_percentage` or `cpuUsagePercentage` depending on what's available on the cluster.

## Enable CAS

```bash
# Cluster-level (works only when the cluster has a single node pool; otherwise use the per-pool command below)
az aks update \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --enable-cluster-autoscaler \
  --min-count <MIN_NODES> --max-count <MAX_NODES>

# Specific node pool
az aks nodepool update \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --name "<NODEPOOL_NAME>" \
  --enable-cluster-autoscaler \
  --min-count <MIN_NODES> --max-count <MAX_NODES>
```

## Recommended min/max Defaults

| Scenario | min-count | max-count |
|----------|-----------|-----------|
| Dev/test | 1 | current_count |
| Production (web/API) | 2 | current_count * 3 |
| Production (batch) | 0 | current_count * 5 |

> Risk: Low. CAS only scales down when pods can be safely rescheduled. Set min-count >= 2 for production HA. A min-count of 0 is valid only for user node pools; system node pools must keep at least one node.
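
The table above can be expressed as a small helper. This is a minimal sketch: the function name `cas_bounds` and the scenario labels `dev`, `prod-web`, and `prod-batch` are assumptions made for illustration, and the function only prints the suggested values.

```bash
# Hypothetical helper: print "<min-count> <max-count>" for a scenario,
# following the recommended defaults table. Scenario names are illustrative.
cas_bounds() {
  scenario="$1"
  current="$2"
  case "$scenario" in
    dev)        echo "1 $current" ;;
    prod-web)   echo "2 $((current * 3))" ;;
    prod-batch) echo "0 $((current * 5))" ;;
    *)          echo "unknown scenario: $scenario" >&2; return 1 ;;
  esac
}
```

For example, `set -- $(cas_bounds prod-web 4)` leaves the values in `$1` and `$2`, ready to pass as `--min-count "$1" --max-count "$2"`.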

## Tune CAS Profile

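Apply when CAS is already on but idle nodes persist. The timing and threshold values shown here match the Default row of the profile comparison table, so the effective change is `skip-nodes-with-system-pods=false`; substitute values from another profile to change scale-down timing: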

> ⚠️ **Warning:** Setting `skip-nodes-with-system-pods=false` allows CAS to evict system pods. Ensure all system pods in `kube-system` have PodDisruptionBudgets before enabling this.

```bash
az aks update \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    scale-down-utilization-threshold=0.5 \
    max-graceful-termination-sec=600 \
    skip-nodes-with-system-pods=false
```

To roll back to CAS defaults:

```bash
az aks update \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --cluster-autoscaler-profile ""
```

## Profile Comparison

| Profile | scale-down-delay-after-add | scale-down-unneeded-time | utilization-threshold | Best For |
|---------|----------------------------|--------------------------|----------------------|----------|
| Default | 10m | 10m | 0.5 | General workloads |
| Cost-Optimized | 5m | 5m | 0.5 | Cost-sensitive, non-critical |
| Conservative | 30m | 30m | 0.7 | Stateful / production |
| Aggressive | 2m | 2m | 0.4 | Dev/test, batch |

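> Risk: High for aggressive tuning. `scale-down-utilization-threshold` is the utilization below which a node becomes a removal candidate, so a higher value removes nodes more readily. Ensure PodDisruptionBudgets (PDBs) are set on critical workloads before tuning. Always confirm with user before applying.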
>
> Check existing PDBs before tuning:
> ```bash
> kubectl get pdb --all-namespaces
> ```

azure-aks-rightsizing.md 4.3 KB

# AKS Pod Rightsizing

Identify pods requesting far more CPU/memory than they use and recommend reduced resource requests.

## Prerequisites — Check Monitoring State First

Before collecting usage data, determine what monitoring is available on the cluster:

```bash
# 1. Check if Azure Managed Prometheus is enabled
az aks show \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --query "azureMonitorProfile.metrics.enabled" -o tsv

# 2. Check if Container Insights (Log Analytics) is enabled
az aks show \
  --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --query "addonProfiles.omsagent.enabled" -o tsv

# 3. Check if Metrics Server is running (pre-installed on AKS, but may be unhealthy)
kubectl get deployment metrics-server -n kube-system
```

Based on the result, follow the appropriate path:

| State | Rightsizing Possible? | Data Source | Accuracy |
|-------|-----------------------|-------------|----------|
| Azure Managed Prometheus enabled | Yes | Prometheus metrics via Azure Monitor | Best (full P95/7-day history) |
| Container Insights (Log Analytics) enabled | Yes | KQL queries on `Perf` / `KubePodInventory` | Good (7-day trends) |
| Only Metrics Server (no Azure Monitor) | Limited | `kubectl top pods` (live data only) | Low (no historical trends) |

> If nothing is enabled, Metrics Server is pre-installed on AKS — confirm it is healthy and use it for live rightsizing data:
> ```bash
> kubectl get deployment metrics-server -n kube-system
> kubectl top pods --all-namespaces --sort-by=cpu
> ```
> For historical P95 trends (more accurate rightsizing), recommend enabling Azure Managed Prometheus. Warn user this incurs cost and wait for confirmation before proceeding.

---

## Detection

```bash
# Authenticate to cluster
az aks get-credentials --name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>"

# List requests/limits for ALL containers per pod (includes sidecars)
# Using [*] ensures multi-container pods are not misrepresented
kubectl get pods --all-namespaces \
  -o custom-columns="NAMESPACE:.metadata.namespace,POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory"

# Live per-container usage (shows each container individually, including sidecars)
kubectl top pods --all-namespaces --containers --sort-by=cpu
```

## Historical Metrics (Azure Monitor — use when Prometheus or Container Insights is enabled)

First discover available metric names, then query:

```bash
az monitor metrics list-definitions \
  --resource "<AKS_RESOURCE_ID>" \
  --query "[].name.value" -o tsv
```

```bash
az monitor metrics list \
  --resource "<AKS_RESOURCE_ID>" \
  --metric "<METRIC_NAME_FROM_ABOVE>" \
  --interval PT1H --aggregation Average \
  --start-time "<YYYY-MM-DDTHH:mm:ssZ>" \
  --end-time "<YYYY-MM-DDTHH:mm:ssZ>"
```
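
The `--start-time` and `--end-time` placeholders expect UTC ISO-8601 timestamps. A minimal sketch for computing a lookback window follows, assuming GNU or BSD `date` is available. The helper name `days_ago_utc` is hypothetical.

```bash
# Hypothetical helper: print a UTC ISO-8601 timestamp N days in the past.
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
days_ago_utc() {
  date -u -d "$1 days ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"$1"d +%Y-%m-%dT%H:%M:%SZ
}
```

For a 7-day window, pass `--start-time "$(days_ago_utc 7)" --end-time "$(days_ago_utc 0)"` to `az monitor metrics list`.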

## Optimization Rules

| Condition | Recommendation | Risk |
|-----------|----------------|------|
| CPU request >5x P95 actual | Reduce to `P95 * 1.2` | Medium |
| Memory request >3x P95 actual | Reduce to `P95 * 1.2` | Medium |
| CPU request >2x P95 actual | Recommend rightsizing with 20% buffer | Low |
| No resource limits set | Add limits to prevent noisy-neighbor waste | Low |
| No VPA/HPA configured | Suggest enabling Vertical Pod Autoscaler | Low |

> For VPA setup and configuration, see [azure-aks-vpa.md](./azure-aks-vpa.md).
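
The CPU rows of the table can be sketched as a small classifier. This is an illustration only: the function name `cpu_rightsize` is hypothetical, inputs are assumed to be integer millicores, and the memory rule (3x threshold) would follow the same pattern.

```bash
# Hypothetical helper: classify a CPU request against observed P95 usage.
# Both arguments are integer millicores (e.g. 500 for "500m").
# Prints "<risk> <recommended-request-in-millicores>", or "ok" if the
# request is at most 2x the P95 value.
cpu_rightsize() {
  request_m="$1"
  p95_m="$2"
  # P95 * 1.2 in integer arithmetic, rounded up
  recommended=$(( (p95_m * 12 + 9) / 10 ))
  if [ "$request_m" -gt $((p95_m * 5)) ]; then
    echo "medium $recommended"
  elif [ "$request_m" -gt $((p95_m * 2)) ]; then
    echo "low $recommended"
  else
    echo "ok"
  fi
}
```

With a 1000m request and a 100m P95, `cpu_rightsize 1000 100` prints `medium 120`, matching the ">5x" row.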

## YAML Patch Format
```yaml
# Rightsizing patch for <NAMESPACE>/<DEPLOYMENT_NAME>
# Current: CPU request=<CURRENT>, P95 actual=<ACTUAL>
# Recommended: CPU request=<NEW> (P95 * 1.2 buffer)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <DEPLOYMENT_NAME>
  namespace: <NAMESPACE>
spec:
  template:
    spec:
      containers:
      - name: <CONTAINER_NAME>
        resources:
          requests:
            cpu: "<NEW_CPU>"
            memory: "<NEW_MEM>"
          limits:
            cpu: "<NEW_CPU_LIMIT>"     # e.g. CPU limit = 1.5x CPU request, or preserve existing limit-to-request ratio
            memory: "<NEW_MEM_LIMIT>"  # e.g. memory limit = 1.25x memory request, or preserve existing limit-to-request ratio
```

> Risk: Medium-High. Always review patches before applying. Test in non-production first. Get explicit user confirmation before applying to production.
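
As an alternative to editing YAML by hand, the same change can be previewed with `kubectl set resources`. The sketch below only prints the command. The function name `rightsize_cmd` is hypothetical, and the 1.5x CPU and 1.25x memory limit ratios are the example ratios from the patch comments, not fixed rules.

```bash
# Hypothetical helper: print (not run) a `kubectl set resources` command with
# limits derived from the requests using the example ratios above.
# Arguments: namespace, deployment, container, CPU request (millicores),
# memory request (Mi).
rightsize_cmd() {
  cpu_limit=$(( ($4 * 3 + 1) / 2 ))   # 1.5x CPU request, rounded up
  mem_limit=$(( ($5 * 5 + 3) / 4 ))   # 1.25x memory request, rounded up
  printf 'kubectl set resources deployment %s -n %s -c %s --requests=cpu=%sm,memory=%sMi --limits=cpu=%sm,memory=%sMi --dry-run=server\n' \
    "$2" "$1" "$3" "$4" "$5" "$cpu_limit" "$mem_limit"
}
```

The printed command includes `--dry-run=server`, so it validates against the API server without changing the Deployment; remove that flag only after review and explicit user confirmation.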

azure-aks-spot.md 3.7 KB

# AKS Spot Node Pools

Recommend and create Spot VM node pools for batch, dev/test, or fault-tolerant workloads (60-90% cost reduction vs regular nodes).

## Check Existing Node Pools

```bash
az aks nodepool list \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --query "[].{name:name, vmSize:vmSize, priority:scaleSetPriority, count:count, mode:mode}" \
  -o table
```

## Identify Spot-Suitable Workloads

Before creating a Spot pool, identify which workloads can tolerate interruptions:

```bash
# List single-replica deployments (no redundancy, so a Spot eviction causes downtime)
kubectl get deployments --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.replicas == 1) | "\(.metadata.namespace)/\(.metadata.name)"'

# Check which pods already have spot tolerations
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.tolerations[]?.key == "kubernetes.azure.com/scalesetpriority") | "\(.metadata.namespace)/\(.metadata.name)"'
```

Use the suitability table below to decide which workloads to migrate.

## Mixed Node Pool Pattern (Spot + Regular)

For workloads that need resilience but want cost savings, use a mixed approach:

```bash
# Keep existing regular node pool as fallback (min 1-2 nodes)
az aks nodepool update \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --name "<REGULAR_POOL>" \
  --enable-cluster-autoscaler --min-count 1 --max-count 3

# Add Spot pool for the majority of workload capacity
# -1 means pay up to on-demand price (no cap); set e.g. 0.05 to cap hourly spend
az aks nodepool add \
  --cluster-name "<CLUSTER_NAME>" --resource-group "<RESOURCE_GROUP>" \
  --name "<SPOT_POOL_NAME>" \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --node-vm-size "<VM_SIZE>" \
  --node-count 3 --min-count 0 --max-count 10 \
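  --enable-cluster-autoscaler
# AKS automatically adds the taint kubernetes.azure.com/scalesetpriority=spot:NoSchedule
# and the label kubernetes.azure.com/scalesetpriority=spot to Spot node pools.
# The kubernetes.azure.com/ label prefix is reserved, so do not pass them via --node-taints or --labels.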
```

Pods that tolerate Spot but don't require it (no `nodeSelector` or required node affinity pinning them to the Spot pool) will be rescheduled onto the regular pool after eviction. Pods pinned to Spot via `nodeSelector` cannot reschedule and will remain pending until a Spot node is available again.
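
To let a workload prefer Spot nodes without being pinned to them, a preferred node affinity can replace the `nodeSelector`. The sketch below prints a JSON merge patch for that purpose. The function name `spot_preferred_patch` and the weight of 100 are illustrative assumptions.

```bash
# Hypothetical helper: print a JSON merge patch that tolerates the Spot taint
# and prefers (but does not require) Spot nodes, so evicted pods can fall
# back to the regular pool.
spot_preferred_patch() {
  cat <<'EOF'
{
  "spec": {
    "template": {
      "spec": {
        "tolerations": [
          {"key": "kubernetes.azure.com/scalesetpriority", "operator": "Equal", "value": "spot", "effect": "NoSchedule"}
        ],
        "affinity": {
          "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
              {
                "weight": 100,
                "preference": {
                  "matchExpressions": [
                    {"key": "kubernetes.azure.com/scalesetpriority", "operator": "In", "values": ["spot"]}
                  ]
                }
              }
            ]
          }
        }
      }
    }
  }
}
EOF
}
```

One way to apply it is `kubectl patch deployment <DEPLOYMENT_NAME> -n <NAMESPACE> --type merge --patch "$(spot_preferred_patch)"`. A merge patch replaces any existing `tolerations` and `affinity` on the pod template, so review the current spec first.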

## Workload Toleration (add to Deployment YAML)

```yaml
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"
nodeSelector:
  kubernetes.azure.com/scalesetpriority: spot
```

## Suitability

| Workload | Spot-Suitable? |
|----------|----------------|
| Batch / data processing | Yes |
| Dev / test environments | Yes |
| Stateless web/API (replicas >= 2) | Yes (with care) |
| Jobs with checkpointing | Yes |
| Stateful workloads (databases) | No |
| Single-replica critical services | No |

> Risk: Low for batch/dev. High for production stateful workloads. Spot VMs evict with 30-second notice. Eviction policy Delete is recommended for AKS.

## Handling Eviction Gracefully

Configure workloads to handle the 30-second eviction notice:

```yaml
# Add to Deployment spec — terminationGracePeriodSeconds should be < 30s for Spot
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 25
      containers:
      - name: <CONTAINER_NAME>
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # Drain in-flight requests
```

Set a PodDisruptionBudget to limit simultaneous evictions:

```bash
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <APP_NAME>-pdb
  namespace: <NAMESPACE>
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: <APP_NAME>
EOF
```

azure-aks-vpa.md 1.4 KB

# AKS Vertical Pod Autoscaler (VPA)

Use VPA to get data-driven resource recommendations for rightsizing pods. Always start in recommendation-only mode before considering auto-apply.

## Enable VPA (Recommendation Mode)

```bash
# Enable VPA addon on AKS cluster (if not already enabled)
az aks update --enable-vpa --resource-group <RESOURCE_GROUP> --name <CLUSTER_NAME>

# Create a VPA object in recommendation mode for a deployment
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: <DEPLOYMENT_NAME>-vpa
  namespace: <NAMESPACE>
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <DEPLOYMENT_NAME>
  updatePolicy:
    updateMode: "Off"   # Recommendation only — does not modify pods
EOF

# Read recommendations after 24+ hours of data collection
kubectl describe vpa <DEPLOYMENT_NAME>-vpa -n <NAMESPACE>
```

> Risk: Low in "Off" mode. **Do not use `updateMode: Auto` in production** without thorough testing and explicit user confirmation.

## Read VPA Recommendations

```bash
kubectl get vpa <DEPLOYMENT_NAME>-vpa -n <NAMESPACE> -o jsonpath='{.status.recommendation}'
```

The output shows `lowerBound`, `target`, and `upperBound` for CPU and memory. Use the `target` values as rightsized requests.
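
For multi-container pods it can help to print one line per container. A minimal sketch using a `kubectl` JSONPath `range` expression follows, assuming the VPA status already contains recommendations. The function name `vpa_targets` is hypothetical.

```bash
# Hypothetical helper: print "<container> <target cpu> <target memory>"
# (tab-separated) for each container in a VPA object's recommendation.
# Arguments: VPA object name, namespace.
vpa_targets() {
  kubectl get vpa "$1" -n "$2" \
    -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{"\t"}{.target.cpu}{"\t"}{.target.memory}{"\n"}{end}'
}
```

Call it as `vpa_targets <DEPLOYMENT_NAME>-vpa <NAMESPACE>`; empty output usually means the VPA has not collected enough data yet.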

## Apply Recommendations Manually

After reviewing VPA output, patch the deployment — see [azure-aks-rightsizing.md](./azure-aks-rightsizing.md#yaml-patch-format) for the patch format.

cli-reference.md 1.2 KB

# CLI Reference for AKS

```bash
# List AKS clusters
az aks list --output table

# Show cluster details
az aks show --name <cluster-name> --resource-group <resource-group>

# Get available Kubernetes versions
az aks get-versions --location <location> --output table

# Create AKS Automatic cluster
az aks create --name <cluster-name> --resource-group <resource-group> --sku automatic \
  --network-plugin azure --network-plugin-mode overlay \
  --enable-oidc-issuer --enable-workload-identity

# Create AKS Standard cluster
az aks create --name <cluster-name> --resource-group <resource-group> \
  --node-count 3 --zones 1 2 3 \
  --network-plugin azure --network-plugin-mode overlay \
  --enable-cluster-autoscaler --min-count 1 --max-count 10 \
  --enable-oidc-issuer --enable-workload-identity

# Get credentials
az aks get-credentials --name <cluster-name> --resource-group <resource-group>

# List node pools
az aks nodepool list --cluster-name <cluster-name> --resource-group <resource-group> --output table

# Enable monitoring
az aks enable-addons --name <cluster-name> --resource-group <resource-group> \
  --addons monitoring --workspace-resource-id <workspace-resource-id>
```

License (MIT)

MIT Source: microsoft/azure-skills

MIT License

Copyright 2025 (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

Security Scan

Passed

Every skill undergoes a two-pass automated security scan before being published to the Hub.

How does it work?

Pass 1 — Pattern analysis scans every file in the skill against 13 security rules for known dangerous patterns:

Script & command detection — Shell commands, exec/spawn calls, subprocess invocations, and curl-pipe-to-shell patterns.
Prompt injection markers — Phrases that attempt to override safety guidelines, bypass restrictions, or manipulate AI behavior.
Sensitive data & secrets — Hardcoded API keys, credentials, tokens, and access to sensitive system files.
Obfuscation patterns — Base64 decode-and-execute, dynamic code evaluation, and unsafe deserialization.
Data exfiltration risks — Environment variables sent to external URLs, writes to sensitive paths, and SQL injection patterns.

Pass 2 — AI deep scan uses GitHub Copilot to semantically analyze skill content for threats that regex can't catch:

Intent analysis — Detects code that appears benign line-by-line but is malicious in aggregate, such as disguised data exfiltration.
Social engineering — Instructions that trick users into running dangerous commands or sharing credentials.
Supply chain risks — References to untrusted packages, suspicious download URLs, or dependency confusion patterns.