# Installation

## Install with CLI (recommended)

```bash
gh skills-hub install azure-diagnostics
```

Don't have the extension? Run `gh extension install samueltauil/skills-hub` first.

## Install manually

Download and extract to your repository:

```text
.github/skills/azure-diagnostics/
```

Extract the ZIP to `.github/skills/` in your repo. The folder name must match `azure-diagnostics` for Copilot to auto-discover it.
Skill Files (13)
SKILL.md 4.4 KB
---
name: azure-diagnostics
description: "Debug Azure production issues using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors."
license: MIT
metadata:
  author: Microsoft
  version: "1.0.4"
---
# Azure Diagnostics
> **AUTHORITATIVE GUIDANCE – MANDATORY COMPLIANCE**
>
> This document is the **official source** for debugging and troubleshooting Azure production issues. Follow these instructions to diagnose and resolve common Azure service problems systematically.
## Triggers
Activate this skill when user wants to:
- Debug or troubleshoot production issues
- Diagnose errors in Azure services
- Analyze application logs or metrics
- Fix image pull, cold start, or health probe issues
- Investigate why Azure resources are failing
- Find root cause of application errors
- Troubleshoot Azure Function Apps (invocation failures, timeouts, binding errors)
- Find the App Insights or Log Analytics workspace linked to a Function App
- Troubleshoot AKS clusters, nodes, pods, ingress, or Kubernetes networking issues
## Rules
1. Start with systematic diagnosis flow
2. Use AppLens (MCP) for AI-powered diagnostics when available
3. Check resource health before deep-diving into logs
4. Select appropriate troubleshooting guide based on service type
5. Document findings and attempted remediation steps
6. Route AKS incidents to the dedicated AKS troubleshooting document
---
## Quick Diagnosis Flow
1. **Identify symptoms** - What's failing?
2. **Check resource health** - Is Azure healthy?
3. **Review logs** - What do logs show?
4. **Analyze metrics** - Performance patterns?
5. **Investigate recent changes** - What changed?
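
The flow above can be sketched as a dry-run shell helper that only prints the read-only checks it would perform; the function name and its placeholder arguments are illustrative, not part of any tool:

```bash
# Dry-run triage helper for steps 2-5: prints the read-only commands
# it would run, so the plan can be reviewed before executing anything.
# The resource ID and resource group passed in are placeholders.
triage_plan() {
  local resource_id="$1" rg="$2"
  echo "az resource show --ids ${resource_id}"                 # step 2: resource health
  echo "az monitor metrics list --resource ${resource_id}"     # step 4: metric patterns
  echo "az monitor activity-log list -g ${rg} --max-events 20" # step 5: recent changes
}
triage_plan "<resource-id>" "<resource-group>"
```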
---
## Troubleshooting Guides by Service
| Service | Common Issues | Reference |
|---------|---------------|-----------|
| **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) |
| **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) |
| **AKS** | Cluster access, nodes, `kube-system`, scheduling, crash loops, ingress, DNS, upgrades | [AKS Troubleshooting](aks-troubleshooting/aks-troubleshooting.md) |
---
## Routing
- Keep Container Apps and Function Apps diagnostics in this parent skill.
- Route active AKS incidents, AKS-specific intake, evidence gathering, and remediation guidance to [AKS Troubleshooting](aks-troubleshooting/aks-troubleshooting.md).
---
## Quick Reference
### Common Diagnostic Commands
```bash
# Check resource health
az resource show --ids RESOURCE_ID
# View activity log
az monitor activity-log list -g RG --max-events 20
# Container Apps logs
az containerapp logs show --name APP -g RG --follow
# Function App logs (query App Insights traces)
az monitor app-insights query --apps APP-INSIGHTS -g RG \
--analytics-query "traces | where timestamp > ago(1h) | order by timestamp desc | take 50"
```
### AppLens (MCP Tools)
For AI-powered diagnostics, use:
```
mcp_azure_mcp_applens
  intent: "diagnose issues with <resource-name>"
  command: "diagnose"
  parameters:
    resourceId: "<resource-id>"
```

Provides:
- Automated issue detection
- Root cause analysis
- Remediation recommendations
### Azure Monitor (MCP Tools)
For querying logs and metrics:
```
mcp_azure_mcp_monitor
  intent: "query logs for <resource-name>"
  command: "logs_query"
  parameters:
    workspaceId: "<workspace-id>"
    query: "<KQL-query>"
```
See [kql-queries.md](references/kql-queries.md) for common diagnostic queries.
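
For flavor, here is one query of the kind that library collects (recent exceptions grouped by problem ID, assuming the resource sends telemetry to App Insights and the standard `exceptions` table schema):

```kusto
// Most frequent exceptions in the last hour, assuming the standard
// App Insights schema; count_ is the auto-named column from count().
exceptions
| where timestamp > ago(1h)
| summarize count() by problemId, outerMessage
| order by count_ desc
| take 20
```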
---
## Check Azure Resource Health
### Using MCP
```
mcp_azure_mcp_resourcehealth
  intent: "check health status of <resource-name>"
  command: "get"
  parameters:
    resourceId: "<resource-id>"
```
### Using CLI
```bash
# Check specific resource health
az resource show --ids RESOURCE_ID
# Check recent activity
az monitor activity-log list -g RG --max-events 20
```
---
## References
- [KQL Query Library](references/kql-queries.md)
- [Azure Resource Graph Queries](references/azure-resource-graph.md)
- [Function Apps Troubleshooting](references/functions/README.md)
aks-troubleshooting/
aks-troubleshooting.md 4.4 KB
# AKS Troubleshooting Guide
Primary AKS troubleshooting guide for incidents routed from [../SKILL.md](../SKILL.md).
## When to Use This Guide
- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy
## Scenario Playbooks
| Scenario | Reference |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md) |
| node health, scaling, pressure, upgrade, or zone issues | [node-issues.md](node-issues.md) |
| service, ingress, DNS, or network policy issues | [networking.md](networking.md) |
## Tool Selection For Diagnostics
When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.
See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), and [references/command-flows.md](references/command-flows.md).
## Required Inputs
- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known
If cluster identity is missing, stop and ask for it.
## Scope Buckets
- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints
## Evidence Order
1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
## Workflow
1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.
## Error Patterns
- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.
## Safe Fallback Checks
```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
Keep these read-only unless the user explicitly asks for remediation.
## Guardrails
- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it
## Output Checklist
Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.
general-diagnostics.md 1.7 KB
# General AKS Investigation & Diagnostics
## "What happened in my cluster?"
When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this systematic flow:
```bash
# 1. Cluster health
az aks show -g <rg> -n <cluster> --query "provisioningState"
# 2. Recent events
kubectl get events -A --sort-by='.lastTimestamp' | head -40
# 3. Node status
kubectl get nodes -o wide
# 4. Unhealthy pods
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# 5. All pods overview
kubectl get pods -A -o wide
# 6. System pods health
kubectl get pods -n kube-system -o wide
# 7. Activity log
az monitor activity-log list -g <rg> --max-events 20 -o table
```
---
## AKS CLI Tools
```bash
# Get cluster credentials (required before kubectl commands)
az aks get-credentials -g <rg> -n <cluster>
# View node pools
az aks nodepool list -g <rg> --cluster-name <cluster> -o table
```
### AppLens (MCP) for AKS
For AI-powered diagnostics:
```text
mcp_azure_mcp_applens
  intent: "diagnose AKS cluster issues"
  command: "diagnose"
  parameters:
    resourceId: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```
> 💡 **Tip:** AppLens automatically detects common issues and provides remediation recommendations using the cluster resource ID.
---
## Best Practices
1. **Start with kubectl get/describe** - Always check basic status first
2. **Check events** - `kubectl get events -A` reveals recent issues
3. **Use systematic isolation** - Pod -> Node -> Cluster -> Network
4. **Document changes** - Note what you tried and what worked
5. **Escalate when needed** - For control plane issues, contact Azure support
networking.md 8.4 KB
# Networking Troubleshooting
For CNI-specific issues, check CNI pod health and review [AKS networking concepts](https://learn.microsoft.com/azure/aks/concepts-network).
## Service Unreachable / Connection Refused
**Diagnostics - always start here:**
```bash
# 1. Verify service exists and has endpoints (read-only)
kubectl get svc <service-name> -n <ns>
kubectl get endpoints <service-name> -n <ns>
# 2. Optional connectivity test from inside the namespace
# This creates a temporary pod. Prefer read-only checks first.
# Only use it after the user explicitly approves a mutating test.
kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz
```
**Decision tree:**
| Observation | Cause | Fix |
| --------------------------------------- | ---------------------------------- | ----------------------------------------------- |
| Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos |
| Endpoints has IPs but unreachable | Port mismatch or app not listening | Confirm `targetPort` = actual container port |
| Works from some pods, fails from others | Network policy blocking | See Network Policy section |
| Works inside cluster, fails externally | Load balancer issue | See Load Balancer section |
| `ECONNREFUSED` immediately | App not listening on that port | Check listening ports in the pod |
Pods that are running but not Ready are removed from Endpoints. Check `kubectl get pod <pod> -n <ns>`.
---
## DNS Resolution Failures
**Diagnostics:**
```bash
# Confirm CoreDNS is running and healthy (read-only)
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl top pod -n kube-system -l k8s-app=kube-dns
# Optional live DNS test from the same namespace as the failing pod
# This creates a temporary pod. Prefer get/describe/logs or exec into an existing pod first.
# Only use it after the user explicitly approves creating the test pod.
kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \
nslookup <service-name>.<ns>.svc.cluster.local
# CoreDNS logs - errors show here first
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
```
**DNS failure patterns:**
| Symptom | Cause | Fix |
| ------------------------------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `NXDOMAIN` for `svc.cluster.local` | CoreDNS down or pod network broken | After confirming the diagnostics above, coordinate with the cluster operator to restart or redeploy CoreDNS and verify CNI |
| Internal resolves, external NXDOMAIN | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder |
| Intermittent SERVFAIL under load | CoreDNS CPU throttled | Remove CPU limits or add replicas |
| Private cluster - external names fail | Custom DNS missing privatelink forwarder | Add conditional forwarder to Azure DNS |
| `i/o timeout` not `NXDOMAIN` | Port 53 blocked by NetworkPolicy or NSG | Allow UDP/TCP 53 from pods to kube-dns ClusterIP |
> ⚠️ **Warning:** The fixes in this table can change cluster state. Use them only after performing the read-only diagnostics above, and only with explicit confirmation from the cluster owner or operator.
```bash
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```
Custom VNet DNS must forward `.cluster.local` to the CoreDNS ClusterIP and other lookups to `168.63.129.16`.
---
## Load Balancer Stuck in Pending
**Diagnostics:**
```bash
kubectl describe svc <svc> -n <ns>
# Events section reveals the actual Azure error
kubectl logs -n kube-system -l component=cloud-controller-manager --tail=100
```
**Error decision table:**
| Error in Events / CCM Logs | Cause | Fix |
| ------------------------------------------------------ | -------------------------------------- | ---------------------------------------------------------------------------- |
| `InsufficientFreeAddresses` | Subnet has no free IPs | Expand subnet CIDR; use Azure CNI Overlay; use NAT gateway instead |
| `ensure(default/svc): failed... PublicIPAddress quota` | Public IP quota exhausted | Request quota increase for Public IP Addresses in the region |
| `cannot find NSG` | NSG name changed or detached | Re-associate NSG to the AKS subnet; check `az aks show` for NSG name |
| `reconciling NSG rules: failed` | NSG is locked or has conflicting rules | Remove resource lock; check for deny-all rules above AKS-managed rules |
| `subnet not found` | Wrong subnet name in annotation | Verify subnet name: `az network vnet subnet list -g <rg> --vnet-name <vnet>` |
| No events, stuck Pending | CCM can't authenticate to Azure | Check cluster managed identity access on the VNet resource group |
---
## Ingress Not Routing Traffic
**Diagnostics:**
```bash
# Confirm controller is running
kubectl get pods -n <ingress-ns> -l 'app.kubernetes.io/name in (ingress-nginx,nginx-ingress)'
kubectl logs -n <ingress-ns> -l app.kubernetes.io/name=ingress-nginx --tail=100
# Check the ingress resource state
kubectl describe ingress <name> -n <ns>
kubectl get ingress <name> -n <ns>
# Check backend
kubectl get endpoints <backend-svc> -n <ns>
```
**Ingress failure patterns:**
| Symptom | Cause | Fix |
| -------------------------------- | ---------------------------------------------- | ------------------------------------------------------------ |
| ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` |
| 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` |
| 404 for some paths               | Prefix matching is element-wise                | `pathType: Prefix` with `/api` matches `/api` and `/api/foo` but not `/apifoo` |
| 502 Bad Gateway | Backend pods unhealthy or wrong port | Verify Endpoints has IPs; confirm `targetPort` and readiness |
| 503 Service Unavailable | All backend pods down | Check pod restarts and readiness probe |
| TLS handshake fail | cert-manager not issuing | Check certificate status and ACME challenge |
| Works for host-a, 404 for host-b | DNS not pointing to ingress IP | Verify `nslookup <host>` resolves to the ingress address |
---
## Network Policy Blocking Traffic
```bash
# List all policies in the namespace - check both ingress and egress
kubectl get networkpolicy -n <ns> -o yaml
# Check for a default-deny policy (blocks everything unless explicitly allowed)
kubectl get networkpolicy -n <ns> -o jsonpath='{range .items[?(@.spec.podSelector=={})]}{.metadata.name}{"\n"}{end}'
```
**AKS network policy engine check:** Azure NPM (Azure CNI): `kubectl get pods -n kube-system -l k8s-app=azure-npm`. Calico: `kubectl get pods -n calico-system`.
Policy audit: source labels, destination labels, destination ingress rules, and source egress rules must all line up. With default-deny, explicitly allow UDP/TCP 53 to kube-dns.
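
With a default-deny policy in place, that DNS allowance has to be an explicit egress rule. A minimal sketch, with an illustrative policy name and namespace and the standard `kube-dns` labels:

```yaml
# Hypothetical example: allow DNS from every pod in this namespace to kube-dns.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress   # illustrative name
  namespace: my-namespace  # replace with the affected namespace
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```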
node-issues.md 7.8 KB
# Node & Cluster Troubleshooting
## Node NotReady
**Diagnostics:**
```bash
kubectl get nodes -o wide
kubectl describe node <node-name>
# Look for: Conditions, Taints, Events, resource usage, kubelet status
```
**Condition decision tree:**
| Condition | Value | Meaning | Fix Path |
| -------------------- | ------- | --------------------------------- | ------------------------------------------------------------- |
| `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |
\*Only after explicit user request for remediation and confirmation of workload impact.
**AKS-specific - SSH to a node:**
```bash
# Create a privileged debug pod on the node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
# Check kubelet status inside the node
chroot /host systemctl status kubelet
chroot /host journalctl -u kubelet -n 50
```
**Optional remediation if kubelet can't recover (after confirmation):** cordon -> drain -> delete. AKS auto-replaces via node pool VMSS.
> ⚠️ **Warning:** These commands are disruptive. By default, stay in read-only diagnostic mode. Only suggest or run them if the user has explicitly requested remediation and confirmed they understand the workload and PodDisruptionBudget impact.
```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```
---
## Node Pool Not Scaling
### Cluster Autoscaler Not Triggering
**Diagnostics:**
```bash
# Autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100
# Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Verify autoscaler is enabled on the node pool
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
```
**Autoscaler won't scale up - common reasons:**
- Node pool already at `maxCount`
- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
- Pod `nodeAffinity` is unsatisfiable on any new node template
- 10-minute cooldown period still active after last scale event
**Autoscaler won't scale down - common reasons:**
- Pods with `emptyDir` local storage (configure `--skip-nodes-with-local-storage=false` if safe)
- Standalone pods with no controller (not in a ReplicaSet)
- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod
### Manual Scaling
```bash
az aks nodepool scale -g <rg> --cluster-name <cluster> -n <nodepool> --node-count <n>
```
---
## Resource Pressure & Capacity Planning
**Check actual vs allocatable:**
```bash
kubectl describe node <node> | grep -A6 "Allocated resources:"
```
See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for allocatable math.
**Ephemeral storage pressure:**
```bash
# Check what's consuming ephemeral storage on a node
kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
```
Common culprit: high-volume container logs accumulating in `/var/log/containers`.
---
## Node Image / OS Upgrade Issues
```bash
# Check current node image versions
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "{nodeImageVersion:nodeImageVersion, osType:osType}"
# Check available upgrades
az aks nodepool get-upgrades -g <rg> --cluster-name <cluster> --nodepool-name <nodepool>
# Upgrade node image (non-disruptive with surge)
az aks nodepool upgrade -g <rg> --cluster-name <cluster> -n <nodepool> --node-image-only
```
---
## Kubernetes Version Upgrade Failures
**Pre-upgrade check:**
```bash
# Check for deprecated API usage before upgrading
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Verify available upgrade paths (minor versions cannot be skipped; upgrade one at a time)
az aks get-upgrades -g <rg> -n <cluster> -o table
```
**Upgrade stuck or failed:**
```bash
# Check control plane provisioning state
az aks show -g <rg> -n <cluster> --query "provisioningState"
# If stuck: check AKS diagnostics blade in portal
# Azure Portal -> AKS cluster -> Diagnose and solve problems -> Upgrade
```
Common causes: PDB blocking drain (`kubectl get pdb -A`), deprecated APIs in use, custom admission webhooks failing (`kubectl get validatingwebhookconfiguration`).
---
## Spot Node Pool Evictions
AKS spot nodes use Azure Spot VMs - they can be evicted with 30 seconds notice when Azure needs capacity.
**Diagnose spot eviction:**
```bash
# Spot nodes carry this taint automatically
kubectl describe node <node> | grep "Taint"
# kubernetes.azure.com/scalesetpriority=spot:NoSchedule
# Check eviction events
kubectl get events -A --field-selector reason=SpotEviction
kubectl get events -A | grep -i "evict\|spot\|preempt"
```
**Spot workload pattern:** pods must tolerate the spot taint. Prefer PDBs and avoid stateful PVC workloads on spot.
```yaml
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
operator: Equal
value: spot
effect: NoSchedule
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: kubernetes.azure.com/scalesetpriority
operator: In
values: ["spot"]
```
---
## Multi-AZ Node Pool & Zone-Related Failures
**Check zone distribution:**
```bash
kubectl get nodes -L topology.kubernetes.io/zone
```
**Zone-related failure patterns:**
| Symptom | Cause | Fix |
| ------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------------ |
| Pods stack on one zone after node failures       | Scheduling imbalance after zone failure              | `kubectl rollout restart deployment/<name>` to rebalance     |
| PVC pending with `volume node affinity conflict` | Azure Disk is zonal; pod scheduled in different zone | Use ZRS storage class or ensure PVC and pod are in same zone |
| Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `service.spec.trafficDistribution` or TopologyKeys |
| Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings |
Use `Premium_ZRS` or `StandardSSD_ZRS` in custom StorageClasses to reduce zonal PVC conflicts. See [AKS storage best practices](https://learn.microsoft.com/azure/aks/operator-best-practices-storage).
---
## Zero-Downtime Node Pool Upgrades
`maxSurge` controls how many extra nodes are provisioned during upgrade.
```bash
# Check current maxSurge
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "upgradeSettings.maxSurge"
az aks nodepool update -g <rg> --cluster-name <cluster> -n <nodepool> \
--max-surge 33%
```
**Upgrade stuck / nodes not draining:**
```bash
kubectl get pdb -A
kubectl describe pdb <pdb-name> -n <ns>
```
If `DisruptionsAllowed: 0`, scale up the workload or temporarily relax `minAvailable`.
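
As an illustration of how a PDB reaches `DisruptionsAllowed: 0` (the names and labels here are hypothetical):

```yaml
# Hypothetical PDB: with a 2-replica Deployment and minAvailable: 2, no pod may be
# evicted voluntarily, so node drains stall. Lower minAvailable or raise replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # illustrative
  namespace: my-namespace  # illustrative
spec:
  minAvailable: 2          # equals the replica count -> DisruptionsAllowed: 0
  selector:
    matchLabels:
      app: web
```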
pod-failures.md 6.9 KB
# Pod Failures & Application Issues
## Common Pod Diagnostic Commands
```bash
# List unhealthy pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# All pods wide view
kubectl get pods -A -o wide
# Detailed pod status - events section is critical
kubectl describe pod <pod-name> -n <namespace>
# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
---
## CrashLoopBackOff
Pod starts, crashes, restarts with exponential backoff (10s, 20s, 40s... up to 5m).
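
That backoff schedule can be sketched quickly (the kubelet doubles the delay per restart and caps it at five minutes):

```bash
# Print the CrashLoopBackOff delay sequence: doubles from 10s, capped at 300s.
crashloop_backoffs() {
  local delay=10 attempt
  for attempt in 1 2 3 4 5 6 7; do
    echo "restart ${attempt}: backoff ${delay}s"
    delay=$(( delay * 2 ))
    (( delay > 300 )) && delay=300   # cap at 5 minutes
  done
}
crashloop_backoffs
```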
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events
kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```
**Decision tree:**
| Exit Code | Meaning | Fix Path |
| --------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1` | Application error | Read logs - unhandled exception, missing config, bad startup |
| `137` | OOMKilled (SIGKILL) | Increase `resources.limits.memory`; check for memory leaks |
| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug |
| `143` | SIGTERM - graceful shutdown | Pod was terminated; check if liveness probe killed it |
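
The exit-code table can be folded into a small helper for triage notes (a sketch; extend the cases as needed):

```bash
# Map a container exit code to its usual meaning, per the table above.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit - unexpected for a long-running app; check the entrypoint" ;;
    1)   echo "application error - read kubectl logs --previous" ;;
    137) echo "OOMKilled (SIGKILL) - raise the memory limit or find the leak" ;;
    139) echo "segfault (SIGSEGV) - binary compatibility or native code bug" ;;
    143) echo "SIGTERM - terminated externally; check the liveness probe" ;;
    *)   echo "exit code $1 not in the table - check application docs" ;;
  esac
}
explain_exit_code 137
```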
**OOMKilled specifically:**
```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded memory limit
```
Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.
---
## ImagePullBackOff
Pod can't pull the container image.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows the exact pull error
```
| Error Message | Cause | Fix |
| --------------------------------------- | ---------------------------- | -------------------------------------------------------------- |
| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify image name and tag exist in the registry |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown` | Tag doesn't exist | Check available tags in the registry |
| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS -> ACR integration |
**ACR integration check:**
```bash
# Verify AKS is attached to ACR
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```
---
## Pending Pods
Pod stays in `Pending` - scheduler can't place it.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows why scheduling failed
```
| Event Message | Cause | Fix |
| ---------------------------------------------------------------------- | ----------------------------------- | --------------------------------------------------------------- |
| `Insufficient cpu` / `Insufficient memory` | No node has enough resources | Scale node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add matching toleration or use a different node pool |
| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` or `nodeAffinity` rules |
| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify storage class exists |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk vs pod in different zone | Use ZRS storage class or ensure same zone |
---
## Readiness & Liveness Probe Failures
**Readiness probe failure** -> pod removed from Service endpoints (no traffic). **Liveness probe failure** -> pod killed and restarted.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for: "Readiness probe failed" or "Liveness probe failed" in Events
# Check the pod's READY column - must show n/n
kubectl get pod <pod-name> -n <namespace>
```
| Symptom | Cause | Fix |
| ------------------------------------ | ----------------------- | ---------------------------------------------------------- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check probe path, port, and app health endpoint |
| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check if app starts slowly |
| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance |
> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing.
---
## Resource Constraints (CPU/Memory)
**Check actual usage vs limits:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory
# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```
| Symptom | Cause | Fix |
| -------------------------------- | --------------------------------------- | --------------------------------------------------- |
| OOMKilled (exit code 137) | Container exceeded memory limit | Increase `limits.memory` or fix memory leak |
| CPU throttling (slow responses) | Container hitting CPU limit | Increase `limits.cpu` or remove CPU limits |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes |
> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
aks-troubleshooting/references/
aks-mcp.md 1.5 KB
# AKS MCP Reference
Use this reference when AKS-aware MCP tools are available in the client.
## Preference Order
1. `mcp_azure_mcp_aks`
2. The AKS-MCP tools that surface after discovery in the client
3. Supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, and `mcp_azure_mcp_resourcehealth`
4. Raw `az aks` and `kubectl` only when required functionality is missing from MCP
## Happy Path
After selecting `mcp_azure_mcp_aks`, let the client enumerate the exact AKS-MCP tools it exposes and choose the smallest tool that fits the task.
Favor the obvious read paths first:
- cluster and Azure-side inspection
- detector or diagnostic workflows
- monitoring, metrics, or control-plane-log checks
- kubectl-style read operations
## Authentication And Access
AKS-MCP is Azure CLI-backed. Expect service principal, workload identity, managed identity, or existing `az login` auth, usually keyed by `AZURE_CLIENT_ID`. If `AZURE_SUBSCRIPTION_ID` is set, expect the server to select that subscription after login.
Default to `readonly`. Only suggest `readwrite` or `admin` when the current diagnostic step truly requires it.
## Detector Notes
For detector-style workflows, use the cluster resource ID, keep the time window within the last 30 days, cap each run to 24 hours, and stay within the supported AKS detector categories.
## Fallback Rule
If the client does not expose the AKS-MCP surface needed for a check, then fall back to:
- `az aks` for Azure-side AKS operations
- raw `kubectl` for Kubernetes-side inspection
command-flows.md 2.5 KB
# AKS Command Flows
## Cluster Baseline Flow
```text
Resolve subscription -> resolve resource group -> resolve cluster -> inspect cluster state -> inspect node pools -> inspect resource health -> inspect recent operations
```
CLI fallback when AKS-MCP cannot perform the cluster baseline read:
```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
az monitor activity-log list -g <resource-group> --max-events 20
```
## Kubernetes Baseline Flow
```text
Check API reachability -> inspect nodes -> inspect kube-system -> inspect events -> inspect affected namespace -> inspect pod details and logs
```
CLI fallback when AKS-MCP cannot perform the Kubernetes baseline read:
```bash
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
## Connectivity Flow
```text
pod -> service -> endpoints -> ingress or load balancer -> DNS -> network controls
```
CLI fallback when AKS-MCP cannot perform the connectivity read:
```bash
kubectl get pods -n <namespace> -o wide
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>
```
## Detector Flow
```text
resolve cluster resource ID -> list detectors or choose category -> select a focused time window -> run the detector or category -> rank critical findings above warnings -> ignore emerging issues when choosing the primary root cause
```
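CLI fallback to resolve the cluster resource ID when AKS-MCP cannot perform the lookup:
```bash
az aks show -g <resource-group> -n <cluster-name> --query id -o tsv
```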
## Monitoring Flow
```text
check resource health -> inspect metrics -> verify diagnostics settings -> inspect control plane logs if available -> correlate with Application Insights or namespace symptoms
```
## Scheduling Flow
```text
pod events -> node capacity -> taints and tolerations -> affinity rules -> PVC state -> quotas
```
CLI fallback when AKS-MCP cannot perform the scheduling read:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl get pvc -n <namespace>
kubectl describe quota -n <namespace>
```
## Safety Boundary
Treat the following as change operations and avoid them unless the user explicitly asks for remediation:
- deleting or restarting pods
- cordon and drain operations
- scaling workloads or node pools
- cluster upgrade operations
- DNS, route, NSG, or firewall changes
structured-input-modes.md 1.5 KB
# AKS Structured Input Modes
Use this reference when the troubleshooting request already contains structured inputs.
## Detector-backed Mode
Use when AKS-aware detectors or AppLens-style insights are available.
Decision rules:
- Ignore findings where the detector is `emergingIssues`.
- Prefer critical findings over warnings.
- Prefer findings with more concrete remediation detail when choosing the likely root problem.
- Preserve per-insight output: problem summary, root-problem flag, affected resources, suggested commands.
## Warning Events Mode
Use when the request includes Kubernetes warning events.
Expected output:
- summary of the events and their impact
- likely cause or causes
- next kubectl checks
- monitoring follow-up
## Metrics Scan Mode
Use when the request includes CPU or memory time-series data.
Expected output:
- healthy or unhealthy status
- anomaly timestamps and explanations
- suggestion tied to the observed metric pressure
## Generic Symptoms Mode
Use when the request includes resource symptoms but not detector results, warning events, or time-series metrics.
Expected output:
- symptom summary by resource
- likely failure domain
- next evidence-collection steps
## Learn Grounding Fallback
If the first troubleshooting pass is incomplete, search Microsoft Learn using:
- the user prompt
- the parsed problem names
- the AKS troubleshooting context
Use Learn grounding to refine or validate the root-cause hypothesis, not to replace observed evidence.
references/
azure-resource-graph.md 2.9 KB
# Azure Resource Graph Queries for Diagnostics
Azure Resource Graph (ARG) enables fast, cross-subscription resource querying using KQL via `az graph query`. Use it to check resource health, find degraded resources, and correlate incidents.
## How to Query
Use the `extension_cli_generate` MCP tool to generate `az graph query` commands:
```yaml
mcp_azure_mcp_extension_cli_generate
intent: "query Azure Resource Graph to <describe what you want to diagnose>"
cli-type: "az"
```
Or construct directly:
```bash
az graph query -q "<KQL>" --query "data[].{name:name, type:type}" -o table
```
> ⚠️ **Prerequisite:** `az extension add --name resource-graph`
## Key Tables
| Table | Contains |
|-------|----------|
| `Resources` | All ARM resources (name, type, location, properties, tags) |
| `HealthResources` | Resource health availability status |
| `ServiceHealthResources` | Azure service health events and incidents |
| `ResourceContainers` | Subscriptions, resource groups, management groups |
## Diagnostics Query Patterns
**Check resource health status across resources:**
```kql
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| project name, availabilityState=properties.availabilityState, reasonType=properties.reasonType
```
**Find resources in unhealthy or degraded state:**
```kql
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| where properties.availabilityState != 'Available'
| project name, state=properties.availabilityState, reason=properties.reasonType, summary=properties.summary
```
**Query active service health incidents:**
```kql
ServiceHealthResources
| where type =~ 'microsoft.resourcehealth/events'
| where properties.Status == 'Active'
| project name, title=properties.Title, impact=properties.Impact, status=properties.Status
```
**Find resources by provisioning state (failed/stuck deployments):**
```kql
Resources
| where properties.provisioningState != 'Succeeded'
| project name, type, resourceGroup, provisioningState=properties.provisioningState
```
**Find App Services in stopped or error state:**
```kql
Resources
| where type =~ 'microsoft.web/sites'
| where properties.state != 'Running'
| project name, state=properties.state, resourceGroup, location
```
**Find Container Apps with provisioning issues:**
```kql
Resources
| where type =~ 'microsoft.app/containerapps'
| where properties.provisioningState != 'Succeeded'
| project name, provisioningState=properties.provisioningState, resourceGroup
```
## Tips
- Use `=~` for case-insensitive type matching (resource types are lowercase)
- Navigate properties with `properties.fieldName`
- Use `--first N` to limit result count
- Use `--subscriptions` to scope to specific subscriptions
- Combine ARG health data with Azure Monitor metrics for full picture
- Check `HealthResources` before deep-diving into application logs
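For example, an unhealthy-resource scan combining several of these tips (a sketch; substitute a real subscription ID):

```bash
az graph query \
  -q "HealthResources | where type =~ 'microsoft.resourcehealth/availabilitystatuses' | where properties.availabilityState != 'Available' | project name, state=properties.availabilityState" \
  --first 50 \
  --subscriptions <subscription-id> \
  -o table
```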
kql-queries.md 1.3 KB
# KQL Query Reference
Essential Kusto Query Language (KQL) queries for diagnosing Azure application issues.
## Prerequisites
- Application Insights or Log Analytics workspace configured
- Diagnostic settings enabled on Azure resources
---
## Recent Errors
```kql
// Recent errors
AppExceptions
| where TimeGenerated > ago(1h)
| project TimeGenerated, Message, StackTrace
| order by TimeGenerated desc
```
## Failed Requests
```kql
// Failed requests
AppRequests
| where Success == false
| where TimeGenerated > ago(1h)
| summarize count() by Name, ResultCode
| order by count_ desc
```
## Slow Requests
```kql
// Slow requests
AppRequests
| where TimeGenerated > ago(1h)
| where DurationMs > 5000
| project TimeGenerated, Name, DurationMs
| order by DurationMs desc
```
## Dependency Failures
```kql
// Dependency failures
AppDependencies
| where Success == false
| where TimeGenerated > ago(1h)
| summarize count() by Name, ResultCode, Target
```
---
## Tips
- Always include time filter: `TimeGenerated > ago(Xh)`
- Limit results with `take 50` for large datasets
- Use `summarize` to aggregate data before analyzing
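Putting these tips together (a sketch against the workspace-based `AppRequests` table used above):

```kql
AppRequests
| where TimeGenerated > ago(1h)                   // time filter first
| summarize failures = countif(Success == false) by Name
| order by failures desc
| take 50                                         // cap the result set
```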
## More Resources
- [KQL Quick Reference](https://learn.microsoft.com/azure/data-explorer/kql-quick-reference)
- [Application Insights Queries](https://learn.microsoft.com/azure/azure-monitor/logs/queries)
references/container-apps/
README.md 2.8 KB
# Container Apps Troubleshooting
### Common Issues Matrix
| Symptom | Likely Cause | Quick Fix |
|---------|--------------|-----------|
| Image pull failure | ACR credentials missing | `az containerapp registry set --identity system` |
| ACR build fails | ACR Tasks disabled (free sub) | Build locally with Docker |
| Cold start timeout | min-replicas=0 | `az containerapp update --min-replicas 1` |
| Port mismatch | Wrong target port | Check Dockerfile EXPOSE matches ingress |
| App keeps restarting | Health probe failing | Verify `/health` endpoint |
### Image Pull Failures
**Diagnose:**
```bash
# Check registry configuration
az containerapp show --name APP -g RG --query "properties.configuration.registries"
# Check revision status
az containerapp revision list --name APP -g RG --output table
```
**Fix:**
```bash
az containerapp registry set \
--name APP -g RG \
--server ACR.azurecr.io \
--identity system
```
### ACR Tasks Disabled (Free Subscriptions)
**Symptom:** `az acr build` fails with "ACR Tasks is not supported"
**Fix: Build locally instead:**
```bash
docker build -t ACR.azurecr.io/myapp:v1 .
az acr login --name ACR
docker push ACR.azurecr.io/myapp:v1
```
### Cold Start Issues
**Symptom:** First request very slow or times out
**Fix:**
```bash
az containerapp update --name APP -g RG --min-replicas 1
```
### Health Probe Failures
**Symptom:** Container keeps restarting
**Check:**
```bash
# View health probe config
az containerapp show --name APP -g RG --query "properties.configuration.ingress"
# Check if /health endpoint responds
curl https://APP.REGION.azurecontainerapps.io/health
```
**Fix:** Ensure app has health endpoint returning 200:
```javascript
app.get('/health', (req, res) => res.sendStatus(200));
```
### Port Mismatch
**Symptom:** App starts but returns 502/503
**Check:**
```bash
az containerapp show --name APP -g RG --query "properties.configuration.ingress.targetPort"
```
**Verify:** App must listen on this exact port. Check:
- Dockerfile `EXPOSE` statement
- `process.env.PORT` or hardcoded port in app
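As a sketch (Node, mirroring the health-endpoint example above), read the port from the environment with an explicit fallback; the fallback value is an assumption and must match the ingress `targetPort`:

```javascript
// Resolve the listen port: use PORT when the platform injects it, otherwise
// fall back to the ingress targetPort (8080 here is an assumed default --
// replace it with your app's configured targetPort).
const raw = parseInt(process.env.PORT, 10);
const port = Number.isInteger(raw) ? raw : 8080;
// app.listen(port);  // e.g. with the Express handler shown above
```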
### View Logs
```bash
# Stream logs (with scale-to-zero, a replica must start before logs appear)
az containerapp logs show --name APP -g RG --follow
# Recent logs
az containerapp logs show --name APP -g RG --tail 100
# System logs (startup issues)
az containerapp logs show --name APP -g RG --type system
```
### Get All Diagnostic Info
```bash
# Combined diagnostic command
echo "=== Container App Diagnostics ===" && \
echo "Revisions:" && az containerapp revision list --name APP -g RG -o table && \
echo "Registry Config:" && az containerapp show --name APP -g RG --query "properties.configuration.registries" && \
echo "Ingress Config:" && az containerapp show --name APP -g RG --query "properties.configuration.ingress" && \
echo "Recent Logs:" && az containerapp logs show --name APP -g RG --tail 20
```
references/functions/
README.md 3.6 KB
# Function Apps Troubleshooting
## Find Linked App Insights / Log Analytics
### Preferred: Use Azure Resource Graph
A single ARG query returns the App Insights name, instrumentation key, connection string, and Log Analytics workspace for a given function app:
```bash
az graph query -q "
resources
| where type =~ 'microsoft.web/sites' and name == '<func-app-name>'
| project funcName=name, rg=resourceGroup
| join kind=inner (
resources
| where type =~ 'microsoft.insights/components'
| project appiName=name, rg=resourceGroup,
instrumentationKey=properties.InstrumentationKey,
connectionString=properties.ConnectionString,
workspaceId=properties.WorkspaceResourceId
) on rg
| project funcName, appiName, instrumentationKey, connectionString, workspaceId
" -o json
```
> 💡 **Tip:** This join matches by resource group. If App Insights is in a different resource group, use the CLI fallback below.
### Fallback: CLI Commands
#### Step 1: Get the App Insights connection string from app settings
```bash
az functionapp config appsettings list \
--name <func-app-name> -g <rg-name> \
--query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING' || name=='APPINSIGHTS_INSTRUMENTATIONKEY']"
```
#### Step 2: Find the App Insights resource by instrumentation key
```bash
# With --app omitted, this lists all components in the subscription,
# so the JMESPath filter can match on the instrumentation key
az monitor app-insights component show \
  --query "[?instrumentationKey=='<key>'] | [0].{name:name, rg:resourceGroup, workspaceId:workspaceResourceId}"
```
#### Step 3: Find the Log Analytics workspace
```bash
az monitor app-insights component show --app <appinsights-name> -g <rg-name> \
--query "workspaceResourceId" -o tsv
```
### Confirm logs are flowing
Query App Insights `traces` table to verify the function app is sending telemetry:
```bash
az monitor app-insights query --apps <appinsights-name> -g <rg-name> \
--analytics-query "traces | where operation_Name != '' | take 1 | project timestamp, operation_Name, message"
```
For `FunctionAppLogs` (available in Log Analytics only, not App Insights), query the workspace directly:
```bash
az monitor log-analytics query -w <workspace-guid> \
--analytics-query "FunctionAppLogs | where _ResourceId contains '<func-app-name>' | take 5 | project TimeGenerated, FunctionName, Message, Level"
```
If results are returned, logs are flowing. If empty, verify the `APPLICATIONINSIGHTS_CONNECTION_STRING` app setting matches this App Insights instance.
> ⚠️ **Classic App Insights:** Some function apps use classic App Insights without a linked Log Analytics workspace (`workspaceId` is null). In this case, `FunctionAppLogs` is **not available**; use the `traces`, `requests`, and `exceptions` tables via `az monitor app-insights query` instead. As a last resort, `az webapp log tail --name <func-app-name> -g <rg-name>` can stream live logs directly.
> ⚠️ **Always prefer querying App Insights or Log Analytics** for function app logs. `az webapp log tail` can stream live logs directly, but App Insights provides richer data, historical queries, and correlation across requests.
> 💡 **Tip:** App Insights logs can be delayed by a few minutes. If you don't see recent data, wait 3-5 minutes and query again.
---
## Check Recent Deployments
Correlate issues with recent deployments by listing deployment history:
```bash
az rest --method get \
--uri "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Web/sites/<func-app-name>/deployments?api-version=2023-12-01"
```
Compare deployment timestamps against when errors started appearing in App Insights to identify if a deployment caused the issue.
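One way to eyeball that correlation, assuming the workspace-based `AppExceptions` table from the KQL reference: bin error counts and look for a step change at the deployment timestamp:

```kql
AppExceptions
| where TimeGenerated > ago(24h)
| summarize errors = count() by bin(TimeGenerated, 15m)
| order by TimeGenerated asc
```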
License (MIT)
MIT License

Copyright (c) 2025 Microsoft Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.