Installation
gh skills-hub install azure-diagnostics
Don't have the extension? Run gh extension install samueltauil/skills-hub first.
Download and extract to your repository:
.github/skills/azure-diagnostics/
Extract the ZIP to .github/skills/ in your repo. The folder name must match azure-diagnostics for Copilot to auto-discover it.
Skill Files (30)
SKILL.md 5.5 KB
---
name: azure-diagnostics
description: "Debug production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot app service, app service high CPU, app service deployment failure, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors, troubleshoot event hubs, troubleshoot service bus, messaging SDK error, AMQP connection failure, message lock lost, service bus dead letter."
license: MIT
metadata:
author: Microsoft
version: "1.1.6"
---
# Azure Diagnostics
> **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE**
>
> This document is the **official source** for debugging and troubleshooting Azure production issues. Follow these instructions to diagnose and resolve common Azure service problems systematically.
## Triggers
Activate this skill when the user wants to:
- Debug or troubleshoot production issues
- Diagnose errors in Azure services
- Analyze application logs or metrics
- Fix image pull, cold start, or health probe issues
- Investigate why Azure resources are failing
- Find root cause of application errors
- Troubleshoot App Service issues (high CPU, deployment failures, crashes, slow responses, TLS/custom domains)
- Respond to prompts like "troubleshoot app service", "app service high CPU", or "app service deployment failure"
- Troubleshoot Azure Function Apps (invocation failures, timeouts, binding errors)
- Find the App Insights or Log Analytics workspace linked to a Function App
- Troubleshoot AKS clusters, nodes, pods, ingress, or Kubernetes networking issues
- Troubleshoot Azure Messaging SDK issues (Event Hubs, Service Bus connection failures, AMQP errors, message lock issues)
## Rules
1. Start with systematic diagnosis flow
2. Use AppLens (MCP) for AI-powered diagnostics when available
3. Check resource health before deep-diving into logs
4. Select appropriate troubleshooting guide based on service type
5. Document findings and attempted remediation steps
6. Route AKS incidents to the dedicated AKS troubleshooting document
---
## Quick Diagnosis Flow
1. **Identify symptoms** - What's failing?
2. **Check resource health** - Is Azure healthy?
3. **Review logs** - What do logs show?
4. **Analyze metrics** - Performance patterns?
5. **Investigate recent changes** - What changed?
---
## Troubleshooting Guides by Service
| Service | Common Issues | Reference |
|---------|---------------|-----------|
| **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) |
| **App Service** | High CPU, deployment failures, crashes, slow responses, TLS/custom domains | [app-service/](references/app-service/README.md) |
| **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) |
| **AKS** | Cluster access, nodes, `kube-system`, scheduling, crash loops, ingress, DNS, upgrades | [AKS Troubleshooting](troubleshooting/aks/aks-troubleshooting.md) |
| **Messaging** | Event Hubs & Service Bus SDK errors, AMQP failures, message lock, connectivity | [Messaging Troubleshooting](troubleshooting/messaging/README.md) |
---
## Routing
- Keep Container Apps and Function Apps diagnostics in this parent skill.
- Route active AKS incidents, AKS-specific intake, evidence gathering, and remediation guidance to [AKS Troubleshooting](troubleshooting/aks/aks-troubleshooting.md).
- Route Azure Messaging SDK troubleshooting (Event Hubs, Service Bus) to [Messaging Troubleshooting](troubleshooting/messaging/README.md).
---
## Quick Reference
### Common Diagnostic Commands
```bash
# Show resource properties and provisioning state (for platform availability, use mcp_azure_mcp_resourcehealth)
az resource show --ids RESOURCE_ID
# View activity log
az monitor activity-log list -g RG --max-events 20
# Container Apps logs
az containerapp logs show --name APP -g RG --follow
# Function App logs (query App Insights traces)
az monitor app-insights query --apps APP-INSIGHTS -g RG \
--analytics-query "traces | where timestamp > ago(1h) | order by timestamp desc | take 50"
```
### AppLens (MCP Tools)
For AI-powered diagnostics, use:
```
mcp_azure_mcp_applens
  intent: "diagnose issues with <resource-name>"
  command: "diagnose"
  parameters:
    resourceId: "<resource-id>"
```
Provides:
- Automated issue detection
- Root cause analysis
- Remediation recommendations
### Azure Monitor (MCP Tools)
For querying logs and metrics:
```
mcp_azure_mcp_monitor
  intent: "query logs for <resource-name>"
  command: "logs_query"
  parameters:
    workspaceId: "<workspace-id>"
    query: "<KQL-query>"
```
See [kql-queries.md](references/kql-queries.md) for common diagnostic queries.
---
## Check Azure Resource Health
### Using MCP
```
mcp_azure_mcp_resourcehealth
  intent: "check health status of <resource-name>"
  command: "get"
  parameters:
    resourceId: "<resource-id>"
```
### Using CLI
```bash
# Show the resource's properties and provisioning state (availability status comes from the MCP tool above)
az resource show --ids RESOURCE_ID
# Check recent activity
az monitor activity-log list -g RG --max-events 20
```
---
## References
- [KQL Query Library](references/kql-queries.md)
- [Azure Resource Graph Queries](references/azure-resource-graph.md)
- [App Service Troubleshooting](references/app-service/README.md)
- [Function Apps Troubleshooting](references/functions/README.md)
- [Messaging Troubleshooting](troubleshooting/messaging/README.md)
README.md 7.2 KB
# App Service Troubleshooting
## Common Issues Matrix
| Symptom | Likely Cause | Action |
|---------|--------------|-----------|
| High CPU / memory | Runaway process, inefficient code | Use Process Explorer via Kudu, scale up |
| Deployment failure | Build error, locked files, quota | Check Kudu logs at `https://APP.scm.azurewebsites.net/api/deployments` to look for details on build errors, locked files or lack of storage quota |
| App crash / restart | Unhandled exception, OOM kill | Review Event Log and STDERR in Diagnose & Solve |
| Slow responses | Downstream dependency, no caching | Enable request tracing, check dependency calls |
| 502 / 503 errors | App not starting, port conflict | Check STDERR logs, verify startup command |
| TLS / domain errors | Certificate expired, DNS mismatch | `az webapp config ssl list`, verify CNAME |
| Health check failure | Endpoint not returning 200 | Verify health check path responds within 2 min |
---
## High CPU / Memory Diagnosis
**Diagnose:**
```bash
# Check plan-level CPU and memory (CpuPercentage and MemoryPercentage are App Service plan metrics)
az monitor metrics list --resource PLAN_RESOURCE_ID \
  --metrics CpuPercentage MemoryPercentage --interval PT1M --output table
# View running processes via ARM Processes API (Entra ID auth)
az rest --method get \
--uri "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>/processes?api-version=2024-04-01"
```
**Fix:** Scale up (`az appservice plan update -n <app-service-plan-name> -g <resource-group> --sku P1V3`) or profile the app via Kudu Process Explorer at `https://APP.scm.azurewebsites.net/ProcessExplorer/` to identify hot paths.
---
## Deployment Failure Analysis
**Diagnose:**
```bash
# List deployment history
az webapp log deployment list -n APP -g RG --output table
# View deployment log for a specific deployment
az webapp log deployment show -n APP -g RG --deployment-id DEPLOY_ID
# Stream live application and container logs
az webapp log tail -n APP -g RG
```
**KQL — Failed deployments:**
```kql
// Replace <app-service-resource-id> with the full resource ID, for example:
// /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>
AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where Level == "Error" and _ResourceId =~ "<app-service-resource-id>"
| project TimeGenerated, Level, Message
| order by TimeGenerated desc
```
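The resource ID placeholder in the query comment follows a fixed pattern, so it can be assembled from the subscription ID, resource group, and app name. A minimal sketch; `app_service_resource_id` is a hypothetical helper name and the values in the example call are placeholders:

```bash
# Build the full ARM resource ID for an App Service app.
# app_service_resource_id is a hypothetical helper, not an az command.
app_service_resource_id() {
  local subscription_id="$1"
  local resource_group="$2"
  local app_name="$3"
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Web/sites/%s\n' \
    "$subscription_id" "$resource_group" "$app_name"
}

# Example with placeholder values:
app_service_resource_id "00000000-0000-0000-0000-000000000000" "my-rg" "my-app"
```

The printed value can be pasted into the `_ResourceId` filter or passed to commands that take `--resource` or `--ids`.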
**Common deployment failures:**
| Error Message | Cause | Fix |
|---------------|-------|-----|
| `WEBSITE_RUN_FROM_PACKAGE=1` but no package | Missing zip deploy artifact | Redeploy with `az webapp deploy --src-path app.zip` |
| `Error building on server` | Oryx build failure | Check build logs, pin runtime version |
| `Locked file` during deploy | Files in use | Set the app setting `MSDEPLOY_RENAME_LOCKED_FILES` to `1` on the App Service resource so that MSDeploy renames locked files. |
---
## Application Crash / Restart Diagnosis
**Diagnose:**
```bash
# Check recent restarts via activity log (--resource-id cannot be combined with -g)
az monitor activity-log list --resource-id APP_RESOURCE_ID \
  --max-events 10 --query "[?operationName.value=='Microsoft.Web/sites/restart/action']"
# View STDERR/STDOUT (Linux)
az webapp log download -n APP -g RG --log-file logs.zip
```
**KQL — App crashes and errors:**
```kql
AppServiceConsoleLogs
| where TimeGenerated > ago(1h)
| where ResultDescription contains "error" or ResultDescription contains "fatal"
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc
| take 50
```
**Health check failures:**
```bash
# Show health check config
az webapp show -n APP -g RG --query "siteConfig.healthCheckPath"
# Test the endpoint directly
curl -s -o /dev/null -w "%{http_code}" https://APP.azurewebsites.net/health
```
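The curl probe above prints only the status code. A minimal sketch for interpreting it, assuming any 200-299 response counts as healthy (the range App Service health check accepts); `is_healthy_status` is a hypothetical helper name:

```bash
# Classify an HTTP status code: any 2xx counts as healthy,
# everything else (including curl's 000 on connection failure) as unhealthy.
# is_healthy_status is a hypothetical helper name.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) echo "healthy" ;;
    *)           echo "unhealthy" ;;
  esac
}

# Typical use with the probe shown above (APP is a placeholder):
# is_healthy_status "$(curl -s -o /dev/null -w '%{http_code}' https://APP.azurewebsites.net/health)"
is_healthy_status 200
is_healthy_status 503
```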
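> ⚠️ **Warning:** An instance that stays unhealthy for 1 hour is replaced. To protect the remaining instances, App Service removes no more than half of them from load balancer rotation at a time, even if more are failing the health check.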
---
## Slow Response Time Investigation
**Diagnose:**
```bash
# Check average response time
az monitor metrics list --resource APP_RESOURCE_ID \
  --metrics HttpResponseTime --interval PT5M --aggregation Average --output table
# Enable failed request tracing
az webapp log config -n APP -g RG --failed-request-tracing true
```
**KQL — Slow requests with dependency analysis:**
```kql
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where TimeTaken > 5000
| project TimeGenerated, CsUriStem, ScStatus, TimeTaken, CsHost
| order by TimeTaken desc
| take 20
```
**Auto-Heal — Automatic mitigation:**
```bash
# Configure auto-heal to recycle on slow requests
az webapp config set -n APP -g RG \
--auto-heal-enabled true \
--generic-configurations '{"autoHealRules":{"triggers":{"slowRequests":{"timeTaken":"00:00:30","count":10,"timeInterval":"00:02:00"}},"actions":{"actionType":"Recycle"}}}'
```
---
## Custom Domain / TLS Certificate Issues
**Diagnose:**
```bash
# List custom domains
az webapp config hostname list -g RG --webapp-name APP --output table
# List TLS certificates
az webapp config ssl list -g RG --output table
# Show details for a specific certificate
az webapp config ssl show --certificate-name CERT -g RG
```
| Symptom | Cause | Fix |
|---------|-------|-----|
| `ERR_CERT_DATE_INVALID` | Certificate expired | If the certificate came from an external certificate authority, renew it and upload the new certificate with `az webapp config ssl upload`, or switch to a free App Service managed certificate |
| `DNS_PROBE_FINISHED_NXDOMAIN` | CNAME not configured | Add CNAME record pointing to `APP.azurewebsites.net` |
| `SSL binding not found` | Missing SNI binding | Add the missing SNI binding using `az webapp config ssl bind --certificate-thumbprint THUMB --ssl-type SNI -n APP -g RG` |
| Managed cert pending | DNS validation incomplete | Verify TXT record `asuid.DOMAIN` matches custom domain verification ID |
---
## Azure CLI and MCP Tools for App Service Diagnostics
| Tool | Command | Use When |
|----------|---------|----------|
| `Azure CLI` | `az webapp list` | List all web apps in subscription |
| `Azure CLI` | `az webapp show -n APP -g RG` | Get app config, stack, status |
| `Azure CLI` | `az webapp config appsettings list -n APP -g RG` | Check env vars and connection strings |
| `Azure CLI` | `az webapp deployment slot list -n APP -g RG` | Compare slot configurations |
| `mcp_azure_mcp_appservice` | `appservice_webapp_diagnostic_diagnose` | AI-powered root cause analysis |
| `mcp_azure_mcp_monitor` | `monitor_resource_log_query` | Run KQL against Log Analytics |
| `mcp_azure_mcp_resourcehealth` | `get` | Check platform-level health status |
> 💡 **Tip:** Start with `mcp_azure_mcp_appservice` (`diagnose`) — it automatically runs relevant detectors and surfaces the most likely root cause before you dig into logs manually.
---
## Combined Diagnostic Script
```bash
echo "=== App Service Diagnostics ===" && \
echo "App Config:" && az webapp show -n APP -g RG --query "{state:state, runtime:siteConfig.linuxFxVersion, healthCheck:siteConfig.healthCheckPath, alwaysOn:siteConfig.alwaysOn}" -o table && \
echo "Recent Deployments:" && az webapp log deployment list -n APP -g RG --query "[:3].{id:id, status:status, time:end_time}" -o table && \
echo "App Settings:" && az webapp config appsettings list -n APP -g RG --query "[].name" -o tsv && \
echo "Custom Domains:" && az webapp config hostname list -g RG --webapp-name APP -o table
```
azure-resource-graph.md 2.9 KB
# Azure Resource Graph Queries for Diagnostics
Azure Resource Graph (ARG) enables fast, cross-subscription resource querying using KQL via `az graph query`. Use it to check resource health, find degraded resources, and correlate incidents.
## How to Query
Use the `extension_cli_generate` MCP tool to generate `az graph query` commands:
```yaml
mcp_azure_mcp_extension_cli_generate
  intent: "query Azure Resource Graph to <describe what you want to diagnose>"
  cli-type: "az"
```
Or construct directly:
```bash
az graph query -q "<KQL>" --query "data[].{name:name, type:type}" -o table
```
> ⚠️ **Prerequisite:** `az extension add --name resource-graph`
## Key Tables
| Table | Contains |
|-------|----------|
| `Resources` | All ARM resources (name, type, location, properties, tags) |
| `HealthResources` | Resource health availability status |
| `ServiceHealthResources` | Azure service health events and incidents |
| `ResourceContainers` | Subscriptions, resource groups, management groups |
## Diagnostics Query Patterns
**Check resource health status across resources:**
```kql
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| project name, availabilityState=properties.availabilityState, reasonType=properties.reasonType
```
**Find resources in unhealthy or degraded state:**
```kql
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| where properties.availabilityState != 'Available'
| project name, state=properties.availabilityState, reason=properties.reasonType, summary=properties.summary
```
**Query active service health incidents:**
```kql
ServiceHealthResources
| where type =~ 'microsoft.resourcehealth/events'
| where properties.Status == 'Active'
| project name, title=properties.Title, impact=properties.Impact, status=properties.Status
```
**Find resources by provisioning state (failed/stuck deployments):**
```kql
Resources
| where properties.provisioningState != 'Succeeded'
| project name, type, resourceGroup, provisioningState=properties.provisioningState
```
**Find App Services in stopped or error state:**
```kql
Resources
| where type =~ 'microsoft.web/sites'
| where properties.state != 'Running'
| project name, state=properties.state, resourceGroup, location
```
**Find Container Apps with provisioning issues:**
```kql
Resources
| where type =~ 'microsoft.app/containerapps'
| where properties.provisioningState != 'Succeeded'
| project name, provisioningState=properties.provisioningState, resourceGroup
```
## Tips
- Use `=~` for case-insensitive type matching (resource types are lowercase)
- Navigate properties with `properties.fieldName`
- Use `--first N` to limit result count
- Use `--subscriptions` to scope to specific subscriptions
- Combine ARG health data with Azure Monitor metrics for full picture
- Check `HealthResources` before deep-diving into application logs
kql-queries.md 1.3 KB
# KQL Query Reference
Essential Kusto Query Language (KQL) queries for diagnosing Azure application issues.
## Prerequisites
- Application Insights or Log Analytics workspace configured
- Diagnostic settings enabled on Azure resources
---
## Recent Errors
```kql
// Recent errors (stack frames are in the Details column)
AppExceptions
| where TimeGenerated > ago(1h)
| project TimeGenerated, ExceptionType, OuterMessage, Details
| order by TimeGenerated desc
```
## Failed Requests
```kql
// Failed requests
AppRequests
| where Success == false
| where TimeGenerated > ago(1h)
| summarize count() by Name, ResultCode
| order by count_ desc
```
## Slow Requests
```kql
// Slow requests
AppRequests
| where TimeGenerated > ago(1h)
| where DurationMs > 5000
| project TimeGenerated, Name, DurationMs
| order by DurationMs desc
```
## Dependency Failures
```kql
// Dependency failures
AppDependencies
| where Success == false
| where TimeGenerated > ago(1h)
| summarize count() by Name, ResultCode, Target
```
---
## Tips
- Always include time filter: `TimeGenerated > ago(Xh)`
- Limit results with `take 50` for large datasets
- Use `summarize` to aggregate data before analyzing
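The first two tips can be baked into a small query builder so that no ad hoc query goes out without a time filter and a row cap. A minimal sketch; `kql_recent` is a hypothetical helper that only prints query text, and the defaults (1 hour, 50 rows) are assumptions:

```bash
# Build a KQL query that always carries a time filter and a row cap.
# Arguments: table name, look-back in hours (default 1), row limit (default 50).
# kql_recent is a hypothetical helper; it prints the query and runs nothing.
kql_recent() {
  local table="$1"
  local hours="${2:-1}"
  local limit="${3:-50}"
  printf '%s | where TimeGenerated > ago(%sh) | order by TimeGenerated desc | take %s\n' \
    "$table" "$hours" "$limit"
}

kql_recent AppExceptions 2 20
```

The output can be passed to a query command, for example `az monitor log-analytics query -w <workspace-guid> --analytics-query "$(kql_recent AppRequests)"`.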
## More Resources
- [KQL Quick Reference](https://learn.microsoft.com/azure/data-explorer/kql-quick-reference)
- [Application Insights Queries](https://learn.microsoft.com/azure/azure-monitor/logs/queries)
README.md 2.8 KB
# Container Apps Troubleshooting
### Common Issues Matrix
| Symptom | Likely Cause | Quick Fix |
|---------|--------------|-----------|
| Image pull failure | ACR credentials missing | `az containerapp registry set --identity system` |
| ACR build fails | ACR Tasks disabled (free sub) | Build locally with Docker |
| Cold start timeout | min-replicas=0 | `az containerapp update --min-replicas 1` |
| Port mismatch | Wrong target port | Check Dockerfile EXPOSE matches ingress |
| App keeps restarting | Health probe failing | Verify `/health` endpoint |
### Image Pull Failures
**Diagnose:**
```bash
# Check registry configuration
az containerapp show --name APP -g RG --query "properties.configuration.registries"
# Check revision status
az containerapp revision list --name APP -g RG --output table
```
**Fix:**
```bash
az containerapp registry set \
--name APP -g RG \
--server ACR.azurecr.io \
--identity system
```
### ACR Tasks Disabled (Free Subscriptions)
**Symptom:** `az acr build` fails with "ACR Tasks is not supported"
**Fix: Build locally instead:**
```bash
docker build -t ACR.azurecr.io/myapp:v1 .
az acr login --name ACR
docker push ACR.azurecr.io/myapp:v1
```
### Cold Start Issues
**Symptom:** First request very slow or times out
**Fix:**
```bash
az containerapp update --name APP -g RG --min-replicas 1
```
### Health Probe Failures
**Symptom:** Container keeps restarting
**Check:**
```bash
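# View health probe config (probes are defined per container in the template)
az containerapp show --name APP -g RG --query "properties.template.containers[].probes"
# Check if /health endpoint responds (look up the app's FQDN first)
FQDN=$(az containerapp show --name APP -g RG --query "properties.configuration.ingress.fqdn" -o tsv)
curl "https://$FQDN/health"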
```
**Fix:** Ensure app has health endpoint returning 200:
```javascript
app.get('/health', (req, res) => res.sendStatus(200));
```
### Port Mismatch
**Symptom:** App starts but returns 502/503
**Check:**
```bash
az containerapp show --name APP -g RG --query "properties.configuration.ingress.targetPort"
```
**Verify:** App must listen on this exact port. Check:
- Dockerfile `EXPOSE` statement
- `process.env.PORT` or hardcoded port in app
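The Dockerfile half of this check can be scripted. A minimal sketch that compares the ports declared with `EXPOSE` against the ingress target port; both function names are hypothetical helpers. `EXPOSE` is only a declaration, so a match here still leaves the port the app actually binds to be verified:

```bash
# Print each port declared with EXPOSE, one per line, dropping any /tcp or /udp suffix.
# dockerfile_exposed_ports and port_matches_dockerfile are hypothetical helper names.
dockerfile_exposed_ports() {
  awk 'toupper($1) == "EXPOSE" { for (i = 2; i <= NF; i++) { sub(/\/.*/, "", $i); print $i } }' "$1"
}

# Report whether the ingress target port appears among the EXPOSE'd ports.
port_matches_dockerfile() {
  local dockerfile="$1"
  local target_port="$2"
  if dockerfile_exposed_ports "$dockerfile" | grep -qx "$target_port"; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Example (TARGET_PORT would come from the az containerapp show query above):
# port_matches_dockerfile ./Dockerfile "$TARGET_PORT"
```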
### View Logs
```bash
# Stream logs (wait for replicas if scale-to-zero)
az containerapp logs show --name APP -g RG --follow
# Recent logs
az containerapp logs show --name APP -g RG --tail 100
# System logs (startup issues)
az containerapp logs show --name APP -g RG --type system
```
### Get All Diagnostic Info
```bash
# Combined diagnostic command
echo "=== Container App Diagnostics ===" && \
echo "Revisions:" && az containerapp revision list --name APP -g RG -o table && \
echo "Registry Config:" && az containerapp show --name APP -g RG --query "properties.configuration.registries" && \
echo "Ingress Config:" && az containerapp show --name APP -g RG --query "properties.configuration.ingress" && \
echo "Recent Logs:" && az containerapp logs show --name APP -g RG --tail 20
```
README.md 3.5 KB
# Function Apps Troubleshooting
## Find Linked App Insights / Log Analytics
### Preferred: Use Azure Resource Graph
A single ARG query returns the App Insights name, instrumentation key, connection string, and Log Analytics workspace for a given function app:
```bash
az graph query -q "
resources | where type =~ 'microsoft.web/sites' and name == '<func-app-name>'
| project funcName=name, rg=resourceGroup
| join kind=inner (resources | where type =~ 'microsoft.insights/components' | project appiName=name, rg=resourceGroup, instrumentationKey=properties.InstrumentationKey, connectionString=properties.ConnectionString, workspaceId=properties.WorkspaceResourceId) on rg
| project funcName, appiName, instrumentationKey, connectionString, workspaceId
" -o json
```
> 💡 **Tip:** This join matches by resource group. If App Insights is in a different resource group, use the CLI fallback below.
### Fallback: CLI Commands
#### Step 1: Get the App Insights connection string from app settings
```bash
az functionapp config appsettings list \
--name <func-app-name> -g <rg-name> \
--query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING' || name=='APPINSIGHTS_INSTRUMENTATIONKEY']"
```
#### Step 2: Find the App Insights resource by instrumentation key
```bash
az monitor app-insights component show \
--query "[?instrumentationKey=='<key>'] | [0].{name:name, rg:resourceGroup, workspaceId:workspaceResourceId}"
```
#### Step 3: Find the Log Analytics workspace
```bash
az monitor app-insights component show --app <appinsights-name> -g <rg-name> \
--query "workspaceResourceId" -o tsv
```
### Confirm logs are flowing
Query App Insights `traces` table to verify the function app is sending telemetry:
```bash
az monitor app-insights query --apps <appinsights-name> -g <rg-name> \
--analytics-query "traces | where operation_Name != '' | take 1 | project timestamp, operation_Name, message"
```
For `FunctionAppLogs` (available in Log Analytics only, not App Insights), query the workspace directly:
```bash
az monitor log-analytics query -w <workspace-guid> \
--analytics-query "FunctionAppLogs | where _ResourceId contains '<func-app-name>' | take 5 | project TimeGenerated, FunctionName, Message, Level"
```
> ⚠️ **Classic App Insights:** Some function apps use classic App Insights without a linked Log Analytics workspace (`workspaceId` is null). In this case, `FunctionAppLogs` is **not available** — use the `traces`, `requests`, and `exceptions` tables via `az monitor app-insights query` instead. As a last resort, `az webapp log tail --name <func-app-name> -g <rg-name>` can stream live logs directly.
If results are returned, logs are flowing. If empty, verify the `APPLICATIONINSIGHTS_CONNECTION_STRING` app setting matches this App Insights instance.
> ⚠️ **Always prefer querying App Insights or Log Analytics** for function app logs. `az webapp log tail` can stream live logs directly but App Insights provides richer data, historical queries, and correlation across requests.
> 💡 **Tip:** App Insights logs can be delayed by a few minutes. If you don't see recent data, wait 3-5 minutes and query again.
---
## Check Recent Deployments
Correlate issues with recent deployments by listing deployment history:
```bash
az rest --method get \
--uri "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Web/sites/<func-app-name>/deployments?api-version=2023-12-01"
```
Compare deployment timestamps against when errors started appearing in App Insights to identify if a deployment caused the issue.
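The timestamp comparison can be made explicit. A minimal sketch that flags a deployment when errors begin within a window after it finished; it assumes GNU `date` (the `-d` flag) and ISO 8601 UTC timestamps, and `deployment_precedes_errors` and the 30-minute default window are hypothetical choices:

```bash
# Report whether errors began within a window after a deployment finished.
# Arguments: deployment end time, first error time (ISO 8601, UTC),
# and an optional window in minutes (default 30).
# Assumes GNU date; deployment_precedes_errors is a hypothetical helper name.
deployment_precedes_errors() {
  local deploy_time="$1"
  local error_time="$2"
  local window_minutes="${3:-30}"
  local deploy_epoch error_epoch delta
  deploy_epoch=$(date -u -d "$deploy_time" +%s) || return 2
  error_epoch=$(date -u -d "$error_time" +%s) || return 2
  delta=$(( error_epoch - deploy_epoch ))
  if [ "$delta" -ge 0 ] && [ "$delta" -le $(( window_minutes * 60 )) ]; then
    echo "within-window"
  else
    echo "outside-window"
  fi
}

deployment_precedes_errors "2024-05-01T10:00:00Z" "2024-05-01T10:12:00Z"
```

A `within-window` result is only a lead worth checking, since unrelated errors can also start shortly after a deployment.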
aks-troubleshooting.md 4.9 KB
# AKS Troubleshooting Guide
Primary AKS troubleshooting guide for incidents routed from [../../SKILL.md](../../SKILL.md).
## When to Use This Guide
- lifecycle, access, node, `kube-system`, workload, ingress, DNS, or scaling issues
- `kubectl` cannot connect, nodes are `NotReady`, or pods are unhealthy
## Scenario Playbooks
| Scenario | Reference |
| ------------------------------------------------------------- | ------------------------------------------------ |
| broad cluster investigation | [general-diagnostics.md](general-diagnostics.md) |
| workload, crash, image pull, readiness, or pending pod issues | [pod-failures.md](pod-failures.md) |
| node health, scaling, pressure, upgrade, or zone issues | [node-issues.md](node-issues.md) |
| service, ingress, DNS, or network policy issues | [networking.md](networking.md) |
## Tool Selection For Diagnostics
When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.
When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.
See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), and [references/command-flows.md](references/command-flows.md).
## Required Inputs
- subscription or active Azure context
- resource group and cluster name
- symptom summary
- first observed time or recent change window
- impacted namespace, workload, service, or ingress when known
If cluster identity is missing, stop and ask for it.
## Scope Buckets
- Lifecycle: create, update, start, stop, upgrade, or provisioning failures
- API access: kubeconfig, auth, private endpoint, DNS, or reachability problems
- Nodes: missing nodes, `NotReady`, pressure, CNI, kubelet, certificate, or VMSS drift
- `kube-system`: CoreDNS, metrics-server, konnectivity, ingress, CNI, CSI, or add-on failures
- Workloads: `Pending`, `CrashLoopBackOff`, `OOMKilled`, PVC, quota, secret, readiness, or dependency issues
- Connectivity and DNS: pod -> service -> endpoints -> ingress/load balancer -> DNS -> network controls
- Scaling: node pool sizing, pending pods, autoscaler config, metrics, quota, or subnet constraints
## Evidence Order
1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
4. Deep diagnostics: when steps 1–3 do not reveal root cause, use [inspektor-gadget.md](references/inspektor-gadget.md) for real-time tracing and snapshots on the affected node.
## Workflow
1. Get cluster context.
2. Classify the problem by scope bucket.
3. Prefer Azure-side evidence before Kubernetes-side evidence.
4. Use the matching AKS-MCP path first, then the documented CLI fallback if MCP cannot perform that read.
5. Return evidence, failure domain, confidence, next checks, remediation, and escalation.
## Error Patterns
- No cluster context: ask for subscription, resource group, and cluster name.
- MCP unavailable: fall back to safe `az aks` and `kubectl` reads.
- `kubectl` blocked: separate auth problems from network reachability.
- Logs or metrics missing: use events, node state, and resource descriptions.
- Detector noise: ignore `emergingIssues`, prefer critical findings, rank the most actionable signal first.
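For the "`kubectl` blocked" pattern, a first split between auth and reachability can come from the error text itself. A minimal sketch; the patterns are common kubectl and client-go messages rather than an exhaustive list, and `classify_kubectl_error` is a hypothetical helper name:

```bash
# Sort a kubectl error message into "auth", "network", or "unknown".
# Heuristic only: patterns cover common messages, not every failure mode.
# classify_kubectl_error is a hypothetical helper name.
classify_kubectl_error() {
  case "$1" in
    *Unauthorized*|*Forbidden*|*forbidden*|*"must be logged in"*)
      echo "auth" ;;
    *"i/o timeout"*|*"no such host"*|*"connection refused"*)
      echo "network" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example: capture only stderr from a read-only call and classify it.
# classify_kubectl_error "$(kubectl cluster-info 2>&1 >/dev/null)"
```

An `unknown` result means the message needs to be read directly, for example for certificate errors.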
## Safe Fallback Checks
```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
Keep these read-only unless the user explicitly asks for remediation.
## Guardrails
- default to read-only diagnostics
- do not restart, delete, cordon, drain, scale, upgrade, or reconfigure resources unless the user explicitly asks for remediation
- do not conclude root cause without quoting the evidence that supports it
## Output Checklist
Return scope and impact, evidence, failure domain, root cause, confidence, next checks, remediation, and escalation.
general-diagnostics.md 1.7 KB
# General AKS Investigation & Diagnostics
## "What happened in my cluster?"
When a user asks a broad question like "what happened in my AKS cluster?" or "check my AKS status", follow this systematic flow:
1. Cluster health
2. Recent events
3. Node status
4. Unhealthy pods
5. All pods overview
6. System pods health
7. Activity log
```bash
az aks show -g <rg> -n <cluster> --query "provisioningState"
kubectl get events -A --sort-by='.lastTimestamp' | head -40
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl get pods -A -o wide
kubectl get pods -n kube-system -o wide
az monitor activity-log list -g <rg> --max-events 20 -o table
```
---
## AKS CLI Tools
```bash
# Get cluster credentials (required before kubectl commands)
az aks get-credentials -g <rg> -n <cluster>
# View node pools
az aks nodepool list -g <rg> --cluster-name <cluster> -o table
```
### AppLens (MCP) for AKS
For AI-powered diagnostics:
```text
mcp_azure_mcp_applens
  intent: "diagnose AKS cluster issues"
  command: "diagnose"
  parameters:
    resourceId: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"
```
> 💡 **Tip:** AppLens automatically detects common issues and provides remediation recommendations using the cluster resource ID.
---
## Best Practices
1. **Start with kubectl get/describe** - Always check basic status first
2. **Check events** - `kubectl get events -A` reveals recent issues
3. **Use systematic isolation** - Pod -> Node -> Cluster -> Network
4. **Document changes** - Note what you tried and what worked
5. **Escalate when needed** - For control plane issues, contact Azure support
load-balancer-and-ingress.md 3.6 KB
# Load Balancer And Ingress Troubleshooting
Use this guide when AKS networking symptoms point at Azure load balancer provisioning, ingress controller behavior, or backend routing.
## Load Balancer Stuck In Pending
**Diagnostics:**
```bash
kubectl describe svc <svc> -n <ns>
# Events section reveals the actual Azure error
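# On AKS the cloud controller manager usually runs in the managed control plane; if this
# returns nothing, read the cloud-controller-manager resource log category instead
kubectl logs -n kube-system -l component=cloud-controller-manager --tail=100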
```
**Error decision table:**
| Error in Events / CCM Logs | Cause | Fix |
| ------------------------------------------------------ | -------------------------------------- | ---------------------------------------------------------------------------- |
| `InsufficientFreeAddresses` | Subnet has no free IPs | Expand subnet CIDR; use Azure CNI Overlay; use NAT gateway instead |
| `ensure(default/svc): failed... PublicIPAddress quota` | Public IP quota exhausted | Request quota increase for Public IP Addresses in the region |
| `cannot find NSG` | NSG name changed or detached | Re-associate NSG to the AKS subnet; check `az aks show` for NSG name |
| `reconciling NSG rules: failed` | NSG is locked or has conflicting rules | Remove resource lock; check for deny-all rules above AKS-managed rules |
| `subnet not found` | Wrong subnet name in annotation | Verify subnet name: `az network vnet subnet list -g <rg> --vnet-name <vnet>` |
| No events, stuck Pending | CCM can't authenticate to Azure | Check cluster managed identity access on the VNet resource group |
---
## Ingress Not Routing Traffic
**Diagnostics:**
```bash
# Confirm controller is running
kubectl get pods -n <ingress-ns> -l 'app.kubernetes.io/name in (ingress-nginx,nginx-ingress)'
kubectl logs -n <ingress-ns> -l app.kubernetes.io/name=ingress-nginx --tail=100
# Check the ingress resource state
kubectl describe ingress <name> -n <ns>
kubectl get ingress <name> -n <ns>
# Check backend
kubectl get endpoints <backend-svc> -n <ns>
```
**Ingress failure patterns:**
| Symptom | Cause | Fix |
| -------------------------------- | ---------------------------------------------- | ------------------------------------------------------------ |
| ADDRESS empty | LB not provisioned or wrong `ingressClassName` | Check controller service; set correct `ingressClassName` |
| 404 for all paths | No matching host rule | Check `host` field; `pathType: Prefix` vs `Exact` |
| 404 for some paths | Trailing slash or `pathType` mismatch | `Exact /api` does not match `/api/`; `Prefix /api` matches `/api`, `/api/`, and `/api/foo` but not `/apifoo` |
| 502 Bad Gateway | Backend pods unhealthy or wrong port | Verify Endpoints has IPs; confirm `targetPort` and readiness |
| 503 Service Unavailable | All backend pods down | Check pod restarts and readiness probe |
| TLS handshake fail | cert-manager not issuing | Check certificate status and ACME challenge |
| Works for host-a, 404 for host-b | No rule for host-b, or DNS not pointing to ingress IP | Check the `host` rules; verify `nslookup <host>` resolves to the ingress address |
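The path-related 404 rows depend on how `pathType` matching works. The following is a minimal sketch of the Kubernetes `Exact` and `Prefix` rules as documented upstream (element-wise prefix match, trailing slash on the rule ignored). The function name `path_matches` is hypothetical, and controller-specific behavior such as regex paths or rewrite annotations is not modeled.

```bash
# path_matches <pathType> <rule-path> <request-path>
# Returns 0 when the request path matches the rule, 1 when it does not,
# and 2 for a pathType this sketch does not handle.
path_matches() {
  local rule
  case "$1" in
    Exact)
      [ "$2" = "$3" ] ;;
    Prefix)
      rule=${2%/}                       # a trailing slash on the rule is ignored
      [ -z "$rule" ] && return 0        # rule "/" matches every path
      case "$3" in
        "$rule"|"$rule"/*) return 0 ;;  # whole path elements only
        *) return 1 ;;
      esac ;;
    *)
      return 2 ;;
  esac
}
```

For example, `path_matches Prefix /api /apifoo` fails because `/apifoo` is a different path element, while `path_matches Exact /api /api/` fails because of the trailing slash.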
network-policy.md 0.8 KB
# Network Policy Troubleshooting
Use this guide when pod-to-pod or pod-to-service traffic is selectively blocked and the symptom points at ingress or egress filtering.
```bash
# List all policies in the namespace - check both ingress and egress
kubectl get networkpolicy -n <ns> -o yaml
# Check for a default-deny policy (an empty podSelector selects every pod in the namespace)
# kubectl JSONPath cannot compare a field against {}, so filter with jq (requires jq)
kubectl get networkpolicy -n <ns> -o json | jq -r '.items[] | select(.spec.podSelector == {}) | .metadata.name'
```
**AKS network policy engine check:** Azure NPM (Azure CNI): `kubectl get pods -n kube-system -l k8s-app=azure-npm`. Calico: `kubectl get pods -n calico-system`.
Policy audit: source labels, destination labels, destination ingress rules, and source egress rules must all line up. With default-deny, explicitly allow UDP/TCP 53 to kube-dns.
networking.md 5.5 KB
# Networking Troubleshooting
For CNI-specific issues, check CNI pod health and review [AKS networking concepts](https://learn.microsoft.com/azure/aks/concepts-network).
## Service Unreachable / Connection Refused
**Diagnostics - always start here:**
```bash
# 1. Verify service exists and has endpoints (read-only)
kubectl get svc <service-name> -n <ns>
kubectl get endpoints <service-name> -n <ns>
# 2. Optional connectivity test from inside the namespace
# This creates a temporary pod. Prefer read-only checks first.
# Only use it after the user explicitly approves a mutating test.
kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
curl -sv http://<service>.<ns>.svc.cluster.local:<port>/healthz
```
**Decision tree:**
| Observation | Cause | Fix |
| --------------------------------------- | ---------------------------------- | ----------------------------------------------- |
| Endpoints shows `<none>` | Label selector mismatch | Align selector with pod labels; check for typos |
| Endpoints has IPs but unreachable | Port mismatch or app not listening | Confirm `targetPort` = actual container port |
| Works from some pods, fails from others | Network policy blocking | See [Network Policy Troubleshooting](network-policy.md) |
| Works inside cluster, fails externally | Load balancer issue | See [Load Balancer And Ingress Troubleshooting](load-balancer-and-ingress.md) |
| `ECONNREFUSED` immediately | App not listening on that port | Check listening ports in the pod |
Pods that are running but not Ready are removed from Endpoints. Check `kubectl get pod <pod> -n <ns>`.
**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and these gadgets:
- `snapshot_socket` (timeout 5) — check what ports the pod is listening on
- `trace_tcp` (timeout 30) — trace connect/accept/close events
- `trace_tcpretrans` (timeout 30) — packet retransmissions
See [references/inspektor-gadget.md](references/inspektor-gadget.md).
---
## DNS Resolution Failures
**Diagnostics:**
The live DNS test creates a temporary pod. Prefer `get`, `describe`, `logs`, or `exec` into an existing pod first. Only use it after the user explicitly approves creating the test pod.
```bash
# Confirm CoreDNS is running and healthy (read-only)
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl top pod -n kube-system -l k8s-app=kube-dns
# Optional live DNS test from the same namespace as the failing pod
kubectl run dnstest --image=busybox:1.28 -it --rm -n <ns> -- \
nslookup <service-name>.<ns>.svc.cluster.local
# CoreDNS logs - errors show here first
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
```
**DNS failure patterns:**
| Symptom | Cause | Fix |
| ------------------------------------- | -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `NXDOMAIN` for `svc.cluster.local` | CoreDNS down or pod network broken | After confirming the diagnostics above, coordinate with the cluster operator to restart or redeploy CoreDNS and verify CNI |
| Internal resolves, external NXDOMAIN | Custom DNS not forwarding to `168.63.129.16` | Fix upstream forwarder |
| Intermittent SERVFAIL under load | CoreDNS CPU throttled | Remove CPU limits or add replicas |
| Private cluster - external names fail | Custom DNS missing privatelink forwarder | Add conditional forwarder to Azure DNS |
| `i/o timeout` not `NXDOMAIN` | Port 53 blocked by NetworkPolicy or NSG | Allow UDP/TCP 53 from pods to kube-dns ClusterIP |
> ⚠️ **Warning:** The fixes in this table can change cluster state. Use them only after performing the read-only diagnostics above, and only with explicit confirmation from the cluster owner or operator.
```bash
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```
Custom VNet DNS must forward `.cluster.local` to the CoreDNS ClusterIP and other lookups to `168.63.129.16`.
**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and `trace_dns` (timeout 30). Key signals: `rcode=3` (NXDOMAIN), `rcode=2` (SERVFAIL), high `latency` values, queries going to unexpected destinations.
See [references/inspektor-gadget.md](references/inspektor-gadget.md).
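To read numeric `rcode` values in trace output, a small lookup can help. This is a sketch only: `dns_rcode_name` is a hypothetical helper, and it covers just the most common response codes from the DNS specification.

```bash
# dns_rcode_name <rcode>: print the standard name for a numeric DNS response code.
dns_rcode_name() {
  case "$1" in
    0) echo "NOERROR" ;;
    1) echo "FORMERR" ;;
    2) echo "SERVFAIL" ;;   # resolver or upstream failure
    3) echo "NXDOMAIN" ;;   # name does not exist
    5) echo "REFUSED" ;;
    *) echo "rcode $1" ;;   # not covered by this sketch
  esac
}
```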
---
## Detailed Networking Guides
- [Load Balancer And Ingress Troubleshooting](load-balancer-and-ingress.md) for pending services, ingress controller state, backend routing, and TLS failures.
- [Network Policy Troubleshooting](network-policy.md) for default-deny checks, Azure NPM or Calico validation, and ingress or egress rule audits.
node-issues.md 4.6 KB
# Node & Cluster Troubleshooting
## Node NotReady
**Diagnostics:**
```bash
kubectl get nodes -o wide
kubectl describe node <node-name>
# Look for: Conditions, Taints, Events, resource usage, kubelet status
```
**Condition decision tree:**
| Condition | Value | Meaning | Fix Path |
| -------------------- | ------- | --------------------------------- | ------------------------------------------------------------- |
| `Ready` | `False` / `Unknown` | kubelet unhealthy (`False`) or stopped reporting (`Unknown`) | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
| `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
| `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
| `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |
\*Only after explicit user request for remediation and confirmation of workload impact.
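The condition table can be checked in one read-only pass. The sketch below assumes the conditions have been printed as `Type=Status` lines; `flag_node_conditions` is a hypothetical name, and it flags `Ready` when it is anything other than `True` and every other condition when it is `True`.

```bash
# Reads "Type=Status" lines on stdin and prints only the conditions that need attention.
flag_node_conditions() {
  local type status
  while IFS='=' read -r type status; do
    case "$type" in
      "") continue ;;
      Ready)
        # Ready is healthy only when True (False and Unknown both need attention)
        [ "$status" = "True" ] || echo "Ready=$status" ;;
      *)
        # Pressure and NetworkUnavailable conditions are healthy when False
        [ "$status" = "True" ] && echo "$type=$status" ;;
    esac
  done
  return 0
}
```

One way to feed it, using only a read-only query: `kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | flag_node_conditions`.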
**AKS-specific - SSH to a node:**
> ⚠️ **Warning:** `kubectl debug node/...` creates a privileged debug pod on the node and is not a read-only diagnostic step. Default to read-only evidence gathering first. Only suggest or run this after the user explicitly asks for remediation or approves a privileged diagnostic action and understands the change-control impact.
```bash
# Create a privileged debug pod on the node
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
# Check kubelet status inside the node
chroot /host systemctl status kubelet
chroot /host journalctl -u kubelet -n 50
```
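**Optional remediation if kubelet can't recover (after confirmation):** cordon -> drain -> delete. Deleting the node object does not by itself remove the underlying VMSS instance; replacement capacity comes from the node pool VMSS through AKS node auto-repair or a node pool scale operation.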
> ⚠️ **Warning:** These commands are disruptive. By default, stay in read-only diagnostic mode. Only suggest or run them if the user has explicitly requested remediation and confirmed they understand the workload and PodDisruptionBudget impact.
```bash
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
```
---
## Node Pool Not Scaling
### Cluster Autoscaler Not Triggering
**Diagnostics:**
```bash
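# Autoscaler logs (on AKS the autoscaler is managed; if no pods match this selector,
# query the cluster-autoscaler resource log category from diagnostic settings instead)
kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100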
# Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Verify autoscaler is enabled on the node pool
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "{autoscaleEnabled:enableAutoScaling, min:minCount, max:maxCount}"
```
**Autoscaler won't scale up - common reasons:**
- Node pool already at `maxCount`
- VM quota exhausted: `az vm list-usage -l <region> -o table | grep -i "DSv3\|quota"`
- Pod `nodeAffinity` is unsatisfiable on any new node template
**Autoscaler won't scale down - common reasons:**
- `scale-down-delay-after-add` (10 minutes by default) still active after the last scale-up
- Pods with `emptyDir` local storage (configure `skip-nodes-with-local-storage=false` in the cluster autoscaler profile if safe)
- Standalone pods with no controller (not in a ReplicaSet)
- `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation on a pod
### Manual Scaling
```bash
az aks nodepool scale -g <rg> --cluster-name <cluster> -n <nodepool> --node-count <n>
```
---
## Resource Pressure & Capacity Planning
**Check actual vs allocatable:**
```bash
kubectl describe node <node> | grep -A6 "Allocated resources:"
```
See [AKS resource reservations](https://learn.microsoft.com/azure/aks/concepts-clusters-workloads#resource-reservations) for allocatable math.
**Ephemeral storage pressure:**
```bash
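# Open a privileged debug pod on the node (needs explicit approval; see the warning under Node NotReady)
kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.0
# Inside the debug pod: list the largest pod log directories on the host
chroot /host sh -c 'du -xsh /var/log/pods/* | sort -h | tail -n 10'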
```
Common culprit: high-volume container logs accumulating in `/var/log/containers`.
**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):
Use `snapshot_process` (timeout 5) to list all processes on the node. For node-wide scope, omit pod filters. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
---
## Detailed Node And Cluster Guides
- [Upgrade Operations](upgrade-operations.md) for node images, Kubernetes version upgrades, surge settings, and PDB-related drain blockers.
- [Spot And Zone Issues](spot-and-zone-issues.md) for spot evictions, tolerations, zone skew, and zonal storage or service behavior.
pod-failures.md 7.7 KB
# Pod Failures & Application Issues
## Common Pod Diagnostic Commands
```bash
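# List pods whose phase is not Running or Succeeded, across all namespaces
# (CrashLoopBackOff pods still report phase Running, so also scan the STATUS column below)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded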
# All pods wide view
kubectl get pods -A -o wide
# Detailed pod status - events section is critical
kubectl describe pod <pod-name> -n <namespace>
# Pod logs (current and previous crash)
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
---
## CrashLoopBackOff
Pod starts, crashes, restarts with exponential backoff (10s, 20s, 40s... up to 5m).
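The backoff sequence above (doubling from 10 seconds, capped at 5 minutes) can be sketched as a small function. `crashloop_delay` is a hypothetical name used only for illustration; it helps estimate how long to wait before the next restart attempt shows up in events.

```bash
# crashloop_delay <restart-number>: seconds of backoff before that restart (1 = first restart).
crashloop_delay() {
  local n=$1 delay=10
  while [ "$n" -gt 1 ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))   # double on every further restart
    n=$((n - 1))
  done
  [ "$delay" -gt 300 ] && delay=300   # cap at 5 minutes
  echo "$delay"
}
```

For instance, `for i in 1 2 3 4 5 6; do crashloop_delay "$i"; done` walks through the delays up to the cap.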
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Check: Exit Code, Reason, Last State, Events
kubectl logs <pod-name> -n <namespace> --previous
# Shows stdout/stderr from the last crashed container
```
**Decision tree:**
| Exit Code | Meaning | Fix Path |
| --------- | ----------------------------------------------------- | ------------------------------------------------------------- |
| `0` | App exited successfully (unexpected for long-running) | Check if entrypoint/command is correct; app may be a one-shot |
| `1` | Application error | Read logs - unhandled exception, missing config, bad startup |
| `137` | SIGKILL, usually OOMKilled | Confirm `Reason: OOMKilled`; increase `resources.limits.memory`; check for memory leaks |
| `139` | Segfault (SIGSEGV) | Binary compatibility issue or native code bug |
| `143` | SIGTERM - graceful shutdown | Pod was terminated; check if liveness probe killed it |
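Exit codes above 128 follow the shell convention of 128 plus the signal number, which is how `137`, `139`, and `143` in the table map to SIGKILL (9), SIGSEGV (11), and SIGTERM (15). A minimal sketch of that arithmetic, with the hypothetical name `describe_exit_code`:

```bash
# describe_exit_code <code>: explain a container exit code using the 128+signal convention.
describe_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "application exit status $code"
  fi
}
```

The code itself can be read without changing anything, for example with `kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` (index `0` assumes a single-container pod).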
**OOMKilled specifically:**
```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
# Reason: OOMKilled -> container exceeded memory limit
```
Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.
**OOM kill tracing with Inspektor Gadget:** Use `trace_oomkill` (timeout 30) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` to see which process was killed and memory at kill time. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
**Deep diagnostics with Inspektor Gadget** (when logs and describe are inconclusive):
Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` and these gadgets:
- `trace_exec` (timeout 30) — see what the container executes at startup
- `trace_open` (timeout 30) — find missing configs/secrets (retval -2 = ENOENT, -13 = EACCES)
- `snapshot_process` (timeout 5) — list running processes in the pod
See [references/inspektor-gadget.md](references/inspektor-gadget.md).
---
## ImagePullBackOff
Pod can't pull the container image.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows the exact pull error
```
| Error Message | Cause | Fix |
| --------------------------------------- | ---------------------------- | -------------------------------------------------------------- |
| `ErrImagePull` / `ImagePullBackOff` | Image name or tag is wrong | Verify image name and tag exist in the registry |
| `unauthorized: authentication required` | Missing or wrong pull secret | Create/update `imagePullSecrets` on the pod or service account |
| `manifest unknown` | Tag doesn't exist | Check available tags in the registry |
| `context deadline exceeded` | Registry unreachable | Check network/firewall; for ACR, verify AKS -> ACR integration |
**ACR integration check:**
```bash
# Verify AKS is attached to ACR
az aks check-acr -g <rg> -n <cluster> --acr <acr-name>.azurecr.io
```
---
## Pending Pods
Pod stays in `Pending` - scheduler can't place it.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Events section shows why scheduling failed
```
| Event Message | Cause | Fix |
| ---------------------------------------------------------------------- | ----------------------------------- | --------------------------------------------------------------- |
| `Insufficient cpu` / `Insufficient memory` | No node has enough resources | Scale node pool; reduce resource requests; check for overcommit |
| `node(s) had taint ... that the pod didn't tolerate` | Taint/toleration mismatch | Add matching toleration or use a different node pool |
| `node(s) didn't match Pod's node affinity/selector` | Affinity rule unsatisfiable | Check `nodeSelector` or `nodeAffinity` rules |
| `persistentvolumeclaim ... not found` / `unbound` | PVC not ready | Check PVC status; verify storage class exists |
| `0/N nodes are available: N node(s) had volume node affinity conflict` | Zonal disk vs pod in different zone | Use ZRS storage class or ensure same zone |
---
## Readiness & Liveness Probe Failures
**Readiness probe failure** -> pod removed from Service endpoints (no traffic). **Liveness probe failure** -> pod killed and restarted.
**Diagnostics:**
```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for: "Readiness probe failed" or "Liveness probe failed" in Events
# Check the pod's READY column - must show n/n
kubectl get pod <pod-name> -n <namespace>
```
| Symptom | Cause | Fix |
| ------------------------------------ | ----------------------- | ---------------------------------------------------------- |
| READY shows `0/1` but pod is Running | Readiness probe failing | Check probe path, port, and app health endpoint |
| Pod restarts repeatedly | Liveness probe failing | Increase `initialDelaySeconds`; check if app starts slowly |
| Probe timeout errors | App responds too slowly | Increase `timeoutSeconds`; check app performance |
> 💡 **Tip:** Set `initialDelaySeconds` on liveness probes to be longer than your app's startup time. A common mistake is killing pods before they finish initializing.
---
## Resource Constraints (CPU/Memory)
**Check actual usage vs limits:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl top pod -n <namespace> --sort-by=memory
# Compare with requests/limits
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```
| Symptom | Cause | Fix |
| -------------------------------- | --------------------------------------- | --------------------------------------------------- |
| OOMKilled (exit code 137) | Container exceeded memory limit | Increase `limits.memory` or fix memory leak |
| CPU throttling (slow responses) | Container hitting CPU limit | Increase `limits.cpu` or remove CPU limits |
| Pending - insufficient resources | Requests exceed available node capacity | Lower requests, scale nodes, or use larger VM sizes |
> ⚠️ **Warning:** Setting CPU limits can cause unnecessary throttling even when the node has spare capacity. Many teams set CPU requests but not limits. Memory limits should always be set.
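
To compare `kubectl top` output against a configured limit, the two quantities need a common unit. The sketch below is illustrative only: `to_mib` and `mem_percent` are hypothetical helpers, they handle only the binary suffixes `Ki`, `Mi`, and `Gi`, and they use integer arithmetic, so limits below 1 MiB or decimal suffixes such as `M` are out of scope.

```bash
# to_mib <quantity>: convert a Ki/Mi/Gi quantity to whole MiB.
to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# mem_percent <usage> <limit>: usage as an integer percentage of the limit.
mem_percent() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}
```

A pod reporting `412Mi` against a `512Mi` limit is at roughly 80 percent, which is close enough to the limit that an OOM kill under load is plausible.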
spot-and-zone-issues.md 2.5 KB
# Spot And Zone Issues
Use this guide when workload placement, evictions, or zonal behavior is causing node-pool instability.
## Spot Node Pool Evictions
AKS spot nodes use Azure Spot VMs - they can be evicted with 30 seconds notice when Azure needs capacity.
**Diagnose spot eviction:**
```bash
# Spot nodes carry this taint automatically
kubectl describe node <node> | grep "Taint"
# kubernetes.azure.com/scalesetpriority=spot:NoSchedule
# Check eviction events
kubectl get events -A --field-selector reason=SpotEviction
kubectl get events -A | grep -i "evict\|spot\|preempt"
```
**Spot workload pattern:** pods must tolerate the spot taint. Prefer PDBs and avoid stateful PVC workloads on spot.
```yaml
tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: Equal
    value: spot
    effect: NoSchedule
```
Add this preferred node affinity when you want the workload to bias toward spot nodes:
```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: kubernetes.azure.com/scalesetpriority
              operator: In
              values: ["spot"]
```
---
## Multi-AZ Node Pool & Zone-Related Failures
**Check zone distribution:**
```bash
kubectl get nodes -L topology.kubernetes.io/zone
```
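Zone skew is easier to spot as a count per zone. This is a rough sketch: `count_nodes_per_zone` is a hypothetical helper that expects `<node> <zone>` pairs on stdin, and the `custom-columns` expression in the comment is one assumed way to produce them.

```bash
# Reads "<node> <zone>" lines on stdin and prints "<zone> <count>", sorted by zone.
count_nodes_per_zone() {
  awk 'NF >= 2 { n[$2]++ } END { for (z in n) print z, n[z] }' | sort
}

# Possible usage (read-only); nodes without the label show up as <none>:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' |
#     count_nodes_per_zone
```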
**Zone-related failure patterns:**
| Symptom | Cause | Fix |
| ------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------------ |
| Pods stack on one zone after node failures | Scheduling imbalance after zone failure | `kubectl rollout restart deployment/<n>` to rebalance |
| PVC pending with `volume node affinity conflict` | Azure Disk is zonal; pod scheduled in different zone | Use ZRS storage class or ensure PVC and pod are in same zone |
| Service endpoints unreachable from one zone | Topology-aware routing misconfigured | Check `spec.trafficDistribution` or the `service.kubernetes.io/topology-mode` annotation (the older `topologyKeys` field was removed) |
| Upgrade causing zone imbalance | Surge nodes in one zone | Configure `maxSurge` in node pool upgrade settings |
Use `Premium_ZRS` or `StandardSSD_ZRS` in custom StorageClasses to reduce zonal PVC conflicts. See [AKS storage best practices](https://learn.microsoft.com/azure/aks/operator-best-practices-storage).
upgrade-operations.md 2.2 KB
# Upgrade Operations
Use this guide when node image rotation, Kubernetes version changes, or node-pool upgrade settings appear to be the failure domain.
## Node Image / OS Upgrade Issues
> ⚠️ **Warning:** `az aks nodepool upgrade` and `az aks nodepool update --max-surge ...` change cluster state. During diagnostics, do not recommend or run upgrade actions by default. Only surface these commands after the user explicitly approves remediation or confirms the change window / change-control context.
```bash
# Check current node image versions
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "{nodeImageVersion:nodeImageVersion, osType:osType}"
# Check available upgrades
az aks nodepool get-upgrades -g <rg> --cluster-name <cluster> --nodepool-name <nodepool>
# Upgrade node image (rolling: nodes are cordoned and drained in surge-sized batches)
az aks nodepool upgrade -g <rg> --cluster-name <cluster> -n <nodepool> --node-image-only
```
---
## Kubernetes Version Upgrade Failures
**Pre-upgrade check:**
```bash
# Check for deprecated API usage before upgrading
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Verify available upgrade paths (minor versions cannot be skipped; upgrade one minor version at a time)
az aks get-upgrades -g <rg> -n <cluster> -o table
```
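A quick sanity check on a proposed target version can be sketched as follows, assuming the AKS rule that a supported cluster upgrades at most one minor version at a time. `minor_of` and `is_sequential_upgrade` are hypothetical names; the check does not compare patch versions and does not replace the `az aks get-upgrades` output, which stays authoritative.

```bash
# minor_of <version>: print the minor component of "major.minor.patch".
minor_of() {
  echo "$1" | cut -d. -f2
}

# is_sequential_upgrade <current> <target>
# Returns 0 when the target stays on the same minor version or moves up exactly one.
is_sequential_upgrade() {
  local cur_major tgt_major cur_minor tgt_minor step
  cur_major=${1%%.*}
  tgt_major=${2%%.*}
  cur_minor=$(minor_of "$1")
  tgt_minor=$(minor_of "$2")
  [ "$cur_major" = "$tgt_major" ] || return 1
  step=$((tgt_minor - cur_minor))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}
```

For example, `is_sequential_upgrade 1.28.5 1.30.0` fails, which suggests planning an intermediate upgrade to a 1.29 release first.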
**Upgrade stuck or failed:**
```bash
# Check control plane provisioning state
az aks show -g <rg> -n <cluster> --query "provisioningState"
# If stuck: check AKS diagnostics blade in portal
# Azure Portal -> AKS cluster -> Diagnose and solve problems -> Upgrade
```
Common causes: PDB blocking drain (`kubectl get pdb -A`), deprecated APIs in use, custom admission webhooks failing (`kubectl get validatingwebhookconfiguration`).
---
## Zero-Downtime Node Pool Upgrades
`maxSurge` controls how many extra nodes are provisioned during upgrade.
```bash
# Check current maxSurge
az aks nodepool show -g <rg> --cluster-name <cluster> -n <nodepool> \
--query "upgradeSettings.maxSurge"
# Set maxSurge (changes cluster state; run only after explicit approval)
az aks nodepool update -g <rg> --cluster-name <cluster> -n <nodepool> \
--max-surge 33%
```
**Upgrade stuck / nodes not draining:**
```bash
kubectl get pdb -A
kubectl describe pdb <pdb-name> -n <ns>
```
If `Allowed disruptions` is `0` (`status.disruptionsAllowed`), scale up the workload or temporarily relax `minAvailable`.
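To find every PodDisruptionBudget that currently blocks a drain, the status field can be filtered in one pass. A minimal sketch, where `blocking_pdbs` is a hypothetical helper that reads `<namespace>/<name> <disruptionsAllowed>` lines:

```bash
# Reads "<namespace>/<name> <disruptionsAllowed>" lines and prints PDBs that allow no disruptions.
blocking_pdbs() {
  awk 'NF == 2 && $2 == 0 { print $1 }'
}
```

One read-only way to produce the input: `kubectl get pdb -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}' | blocking_pdbs`.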
aks-mcp.md 1.5 KB
# AKS MCP Reference
Use this reference when AKS-aware MCP tools are available in the client.
## Preference Order
1. `mcp_azure_mcp_aks`
2. The AKS-MCP tools that surface after discovery in the client
3. Supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, and `mcp_azure_mcp_resourcehealth`
4. Raw `az aks` and `kubectl` only when required functionality is missing from MCP
## Happy Path
After selecting `mcp_azure_mcp_aks`, let the client enumerate the exact AKS-MCP tools it exposes and choose the smallest tool that fits the task.
Favor the obvious read paths first:
- cluster and Azure-side inspection
- detector or diagnostic workflows
- monitoring, metrics, or control-plane-log checks
- kubectl-style read operations
## Authentication And Access
AKS-MCP is Azure CLI-backed. Expect service principal, workload identity, managed identity, or existing `az login` auth, usually keyed by `AZURE_CLIENT_ID`. If `AZURE_SUBSCRIPTION_ID` is set, expect the server to select that subscription after login.
Default to `readonly`. Only suggest `readwrite` or `admin` when the current diagnostic step truly requires it.
## Detector Notes
For detector-style workflows, use the cluster resource ID, keep the time window within the last 30 days, cap each run to 24 hours, and stay within the supported AKS detector categories.
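The time-window limits above can be checked before a detector run. This is a sketch under stated assumptions: `valid_detector_window` is a hypothetical helper, all arguments are Unix epoch seconds, and rejecting an end time in the future is an added assumption rather than something the limits state.

```bash
# valid_detector_window <now> <start> <end>   (all Unix epoch seconds)
# Returns 0 when the window fits the limits: inside the last 30 days, at most 24 hours long.
valid_detector_window() {
  local now=$1 start=$2 end=$3
  local day=86400
  [ "$start" -lt "$end" ] || return 1                 # window must not be empty
  [ "$end" -le "$now" ] || return 1                   # assumption: no future end time
  [ $((now - start)) -le $((30 * day)) ] || return 1  # start within the last 30 days
  [ $((end - start)) -le "$day" ]                     # at most 24 hours per run
}
```

A typical call passes `$(date +%s)` as the first argument and splits a longer investigation into several 24-hour windows.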
## Fallback Rule
If the client does not expose the AKS-MCP surface needed for a check, then fall back to:
- `az aks` for Azure-side AKS operations
- raw `kubectl` for Kubernetes-side inspection
command-flows.md 3.0 KB
# AKS Command Flows
## Cluster Baseline Flow
```text
Resolve subscription -> resolve resource group -> resolve cluster -> inspect cluster state -> inspect node pools -> inspect resource health -> inspect recent operations
```
CLI fallback when AKS-MCP cannot perform the cluster baseline read:
```bash
az aks show -g <resource-group> -n <cluster-name>
az aks nodepool list -g <resource-group> --cluster-name <cluster-name>
az monitor activity-log list -g <resource-group> --max-events 20
```
## Kubernetes Baseline Flow
```text
Check API reachability -> inspect nodes -> inspect kube-system -> inspect events -> inspect affected namespace -> inspect pod details and logs
```
CLI fallback when AKS-MCP cannot perform the Kubernetes baseline read:
```bash
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```
## Connectivity Flow
```text
pod -> service -> endpoints -> ingress or load balancer -> DNS -> network controls
```
CLI fallback when AKS-MCP cannot perform the connectivity read:
```bash
kubectl get pods -n <namespace> -o wide
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
kubectl get ingress -n <namespace>
kubectl describe ingress <ingress-name> -n <namespace>
```
## Detector Flow
```text
resolve cluster resource ID -> list detectors or choose category -> select a focused time window -> run the detector or category -> rank critical findings above warnings -> ignore emerging issues when choosing the primary root cause
```
## Monitoring Flow
```text
check resource health -> inspect metrics -> verify diagnostics settings -> inspect control plane logs if available -> correlate with Application Insights or namespace symptoms
```
## Scheduling Flow
```text
pod events -> node capacity -> taints and tolerations -> affinity rules -> PVC state -> quotas
```
CLI fallback when AKS-MCP cannot perform the scheduling read:
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl get pvc -n <namespace>
kubectl describe quota -n <namespace>
```
## Deep Diagnostics Flow (Inspektor Gadget)
```text
Standard diagnostics inconclusive -> resolve target node -> select gadget from symptom-to-gadget map -> run IG command with namespace/pod filters -> interpret output -> correlate with prior evidence
```
Use when steps 1–3 of the evidence order (Azure-side, Kubernetes-side, and detector evidence) do not reveal root cause. See [inspektor-gadget.md](inspektor-gadget.md) for the full gadget catalog and command patterns.
## Safety Boundary
Treat the following as change operations and avoid them unless the user explicitly asks for remediation:
- deleting or restarting pods
- cordon and drain operations
- scaling workloads or node pools
- cluster upgrade operations
- DNS, route, NSG, or firewall changes
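A simple guard can flag commands that fall on the wrong side of this boundary before they are suggested or run. The sketch below is deliberately incomplete: `is_change_operation` is a hypothetical helper, its pattern list covers only the operations listed above, and other mutating commands (`kubectl apply`, `kubectl patch`, `kubectl run`, `kubectl debug`, network changes through `az network`) would need their own entries.

```bash
# is_change_operation "<command line>"
# Returns 0 when the command matches a known change operation, 1 otherwise.
is_change_operation() {
  case "$1" in
    "kubectl delete "*|"kubectl rollout restart "*)        return 0 ;;  # deleting or restarting pods
    "kubectl cordon "*|"kubectl drain "*)                  return 0 ;;  # cordon and drain
    "kubectl scale "*|"az aks scale "*)                    return 0 ;;  # scaling workloads or pools
    "az aks nodepool scale "*|"az aks nodepool update "*)  return 0 ;;
    "az aks upgrade "*|"az aks nodepool upgrade "*)        return 0 ;;  # upgrade operations
    *)                                                     return 1 ;;
  esac
}
```

A deny-list like this only catches known patterns; an allow-list of read-only verbs (`get`, `describe`, `logs`, `top`) is the stricter design when commands are generated automatically.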
inspektor-gadget.md 7.0 KB
# Inspektor Gadget (IG) Reference
Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient.
## IG Version
`<ig-version>` = `v0.51.0` — substitute this exact tag (with `v` prefix) wherever `<ig-version>` appears. Bump this line only.
## Base Command Pattern
```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
--image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
-- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...]
```
Always set `--timeout` after `--` to cap runtime. Use `--timeout 5` for snapshot/top, `--timeout 30` for trace/profile.
> **Note:** IG uses `kubectl debug --profile=sysadmin` (privileged debug pod). Only run with explicit user approval and appropriate RBAC.
**Required:** Resolve the node name first:
```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```
## Common Filters
| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |
> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.
## Gadget Catalog
### Networking
| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |
#### tcpdump gadget
Outputs raw pcap-ng data. Pipe to `tcpdump` for readable output:
```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
--image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
-- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
--timeout 30 --pf "port 80" \
| tcpdump -nvr -
```
Use `--pf "<expr>"` for tcpdump filters (e.g., `port 80`, `host 10.0.0.1`). Output must be `-o pcap-ng` (not `-o json`).
### Process & Workload
| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods — see which process was killed, memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |
### File & Storage
| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |
### Security & Audit
| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |
## Symptom-to-Gadget Map
| Symptom | Gadget(s) |
|---|---|
| DNS resolution failures | `trace_dns` |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` |
| Silent connection drops | `trace_tcpretrans` |
| High network latency | `trace_tcpretrans` |
| TLS / HTTPS routing issues | `trace_sni` |
| Port already in use | `trace_bind` + `snapshot_socket` |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` |
| OOMKilled pods | `trace_oomkill` + `top_process` |
| Pod killed unexpectedly | `trace_signal` |
| PID pressure on node | `snapshot_process` + `top_process` |
| "Too many open files" | `top_file` |
| Missing config / secret mount | `trace_open` |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` |
| Permission denied (capabilities) | `trace_capabilities` |
| High CPU (unknown cause) | `profile_cpu` + `top_process` |
| Deep packet inspection | `tcpdump` |
| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) |
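For scripted triage, part of the map above can be encoded as a keyword lookup. This is a minimal Python sketch: the `SYMPTOM_GADGETS` dictionary, its keyword keys, and the `gadgets_for` helper are hypothetical names chosen for illustration, and only a subset of the table rows is included.

```python
# Hypothetical helper: encodes a subset of the symptom-to-gadget table.
# Keys are lowercase keywords (an assumption), values are gadget names from the table.
SYMPTOM_GADGETS = {
    "dns": ["trace_dns"],
    "connection refused": ["trace_tcp", "snapshot_socket"],
    "port already in use": ["trace_bind", "snapshot_socket"],
    "crashloopbackoff": ["trace_exec", "trace_open"],
    "oomkilled": ["trace_oomkill", "top_process"],
    "too many open files": ["top_file"],
    "high cpu": ["profile_cpu", "top_process"],
}

# Catch-all row of the table; run it with --syscall-filters to limit data volume.
FALLBACK = ["traceloop"]


def gadgets_for(symptom):
    """Return the gadgets whose keyword appears in the free-text symptom."""
    text = symptom.lower()
    matches = []
    for keyword, gadgets in SYMPTOM_GADGETS.items():
        if keyword in text:
            for gadget in gadgets:
                if gadget not in matches:  # keep first-seen order, drop duplicates
                    matches.append(gadget)
    return matches or list(FALLBACK)
```

For example, `gadgets_for("Pod stuck in CrashLoopBackOff")` returns `["trace_exec", "trace_open"]`, and an unrecognized symptom falls back to `traceloop`.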
## Gadget Type Reference
| Type | Behavior | IG --timeout |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real-time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `special` (`tcpdump`) | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |
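The timeouts above can be applied mechanically when assembling an IG command. The sketch below follows the shape of the `kubectl debug ... -- ig run ...` invocation shown in the tcpdump section. The `build_ig_command` helper is hypothetical, and it assumes that other gadgets accept the same `--k8s-namespace`, `--k8s-podname`, and `--timeout` flags as the tcpdump example.

```python
import shlex

# Timeouts copied from the Gadget Type Reference table.
TIMEOUT_BY_TYPE = {"snapshot": 5, "top": 5, "trace": 30, "profile": 30, "tcpdump": 30}

IG_IMAGE = "mcr.microsoft.com/oss/v2/inspektor-gadget/ig"


def build_ig_command(node, gadget, gadget_type, ig_version, namespace, pod):
    """Build the argv list for running one gadget through a node debug pod."""
    timeout = TIMEOUT_BY_TYPE[gadget_type]  # an unknown type raises KeyError on purpose
    cmd = [
        "kubectl", "debug", "--profile=sysadmin", f"node/{node}", "--attach", "--quiet",
        f"--image={IG_IMAGE}:{ig_version}",
        "--", "ig", "run", f"{gadget}:{ig_version}",
        "--k8s-namespace", namespace, "--k8s-podname", pod,
        "--timeout", str(timeout),
    ]
    if gadget_type == "tcpdump":
        cmd += ["-o", "pcap-ng"]  # tcpdump output must be pcap-ng, not json
    return cmd


if __name__ == "__main__":
    # "<ig-version>" is left as a placeholder, as in the command above.
    argv = build_ig_command("<node-name>", "trace_dns", "trace", "<ig-version>", "<ns>", "<pod>")
    print(shlex.join(argv))
```

Returning a list rather than a string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.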
## Guardrails
- IG gadgets are **read-only** — they do not modify cluster or application state.
- Resolve the correct node name before running any IG command.
- Always set `--timeout` to cap runtime. Prefer snapshot/top for quick checks; trace/profile for behavior over time.
- For reproduction: launch a trace gadget first, then reproduce the problem. The debug pod persists after the gadget exits, so run `kubectl logs <debug-pod>` to retrieve the captured output afterward.
structured-input-modes.md 1.5 KB
# AKS Structured Input Modes
Use this reference when the troubleshooting request already contains structured inputs.
## Detector-backed Mode
Use when AKS-aware detectors or AppLens-style insights are available.
Decision rules:
- Ignore findings where the detector is `emergingIssues`.
- Prefer critical findings over warnings.
- Prefer findings with more concrete remediation detail when choosing the likely root problem.
- Preserve per-insight output: problem summary, root-problem flag, affected resources, suggested commands.
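The decision rules above can be sketched as a filter followed by a sort. The field names (`detector`, `severity`, `suggested_commands`) are hypothetical, and counting suggested commands is only one possible proxy for "more concrete remediation detail".

```python
# Lower rank sorts first; anything that is neither critical nor warning sorts last.
SEVERITY_RANK = {"critical": 0, "warning": 1}


def rank_findings(findings):
    """Drop emergingIssues findings, then order by severity and remediation detail."""
    kept = [f for f in findings if f.get("detector") != "emergingIssues"]
    return sorted(
        kept,
        key=lambda f: (
            SEVERITY_RANK.get(str(f.get("severity", "")).lower(), 2),
            -len(f.get("suggested_commands", [])),  # more commands means more detail
        ),
    )
```

The first element of the returned list is the candidate root problem. Each finding is returned whole, so the per-insight output is preserved.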
## Warning Events Mode
Use when the request includes Kubernetes warning events.
Expected output:
- summary of the events and their impact
- likely cause or causes
- next kubectl checks
- monitoring follow-up
## Metrics Scan Mode
Use when the request includes CPU or memory time-series data.
Expected output:
- healthy or unhealthy status
- anomaly timestamps and explanations
- suggestion tied to the observed metric pressure
## Generic Symptoms Mode
Use when the request includes resource symptoms but not detector results, warning events, or time-series metrics.
Expected output:
- symptom summary by resource
- likely failure domain
- next evidence-collection steps
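Choosing among the four modes can be expressed as a small dispatcher. In this sketch the request keys (`detector_findings`, `warning_events`, `metrics`) are hypothetical. The precedence among the first three modes is also an assumption, since the reference only states that Generic Symptoms Mode applies when none of the others do.

```python
def select_mode(request):
    """Pick a structured input mode from the inputs present in the request."""
    if request.get("detector_findings"):
        return "detector-backed"
    if request.get("warning_events"):
        return "warning-events"
    if request.get("metrics"):
        return "metrics-scan"
    # No detector results, warning events, or time-series metrics were supplied.
    return "generic-symptoms"
```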
## Learn Grounding Fallback
If the first troubleshooting pass is incomplete, search Microsoft Learn using:
- the user prompt
- the parsed problem names
- the AKS troubleshooting context
Use Learn grounding to refine or validate the root-cause hypothesis, not to replace observed evidence.
README.md 1.6 KB
# Azure Messaging Troubleshooting
Diagnose and resolve issues with Azure Event Hubs and Service Bus SDKs.
## Routing
| Symptom | Guide |
|---------|-------|
| Connection failures, firewall, IP/VNet, WebSocket | [service-troubleshooting.md](service-troubleshooting.md) |
| SDK-specific errors (see language below) | Language guide |
## SDK Troubleshooting by Language
- **Event Hubs**: [Python](azure-eventhubs-py.md) | [Java](azure-eventhubs-java.md) | [JS](azure-eventhubs-js.md) | [.NET](azure-eventhubs-dotnet.md)
- **Service Bus**: [Python](azure-servicebus-py.md) | [Java](azure-servicebus-java.md) | [JS](azure-servicebus-js.md) | [.NET](azure-servicebus-dotnet.md)
## Common Issues
| Issue | Category |
|-------|----------|
| AMQP link detach, idle timeout, connection inactive | [service-troubleshooting.md](service-troubleshooting.md) |
| Message lock lost/expired, lock renewal failures | Language-specific SDK guide |
| Session lock errors, session receiver detach | Language-specific SDK guide |
| Duplicate events, checkpoint/offset reset | Language-specific SDK guide |
| Batch >1 MB rejected, partition key conflicts | [service-troubleshooting.md](service-troubleshooting.md) |
## MCP Tools
| Tool | Use |
|------|-----|
| `mcp_azure_mcp_eventhubs` | List namespaces, hubs, consumer groups |
| `mcp_azure_mcp_servicebus` | List namespaces, queues, topics, subscriptions |
| `mcp_azure_mcp_monitor` | Query diagnostic logs with KQL |
| `mcp_azure_mcp_resourcehealth` | Check service health status |
| `mcp_azure_mcp_documentation` | Search Microsoft Learn for troubleshooting docs |
auth-best-practices.md 6.0 KB
# Azure Authentication Best Practices
> Source: [Microsoft — Passwordless connections for Azure services](https://learn.microsoft.com/azure/developer/intro/passwordless-overview) and [Azure Identity client libraries](https://learn.microsoft.com/dotnet/azure/sdk/authentication/).
## Golden Rule
Use **managed identities** and **Azure RBAC** in production. Reserve `DefaultAzureCredential` for **local development only**.
## Authentication by Environment
| Environment | Recommended Credential | Why |
|---|---|---|
| **Production (Azure-hosted)** | `ManagedIdentityCredential` (system- or user-assigned) | No secrets to manage; auto-rotated by Azure |
| **Production (on-premises)** | `ClientCertificateCredential` or `WorkloadIdentityCredential` | Deterministic; no fallback chain overhead |
| **CI/CD pipelines** | `AzurePipelinesCredential` / `WorkloadIdentityCredential` | Scoped to pipeline identity |
| **Local development** | `DefaultAzureCredential` | Chains CLI, PowerShell, and VS Code credentials for convenience |
## Why Not `DefaultAzureCredential` in Production?
1. **Unpredictable fallback chain** — walks through multiple credential types, adding latency and making failures harder to diagnose.
2. **Broad surface area** — checks environment variables, CLI tokens, and other sources that should not exist in production.
3. **Non-deterministic** — which credential actually authenticates depends on the environment, making behavior inconsistent across deployments.
4. **Performance** — each failed credential attempt adds network round-trips before falling back to the next.
## Production Patterns
### .NET
```csharp
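using System;
using Azure.Core;
using Azure.Identity;
// Declare the shared base type: `var` cannot infer a type from two unrelated branch types (needs C# 9+).
TokenCredential credential = Environment.GetEnvironmentVariable("AZURE_FUNCTIONS_ENVIRONMENT") == "Development"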
? new DefaultAzureCredential() // local dev — uses CLI/VS credentials
: new ManagedIdentityCredential(); // production — deterministic, no fallback chain
// For user-assigned identity: new ManagedIdentityCredential("<client-id>")
```
### TypeScript / JavaScript
```typescript
import { DefaultAzureCredential, ManagedIdentityCredential } from "@azure/identity";
const credential = process.env.NODE_ENV === "development"
? new DefaultAzureCredential() // local dev — uses CLI/VS credentials
: new ManagedIdentityCredential(); // production — deterministic, no fallback chain
// For user-assigned identity: new ManagedIdentityCredential("<client-id>")
```
### Python
```python
import os
from azure.identity import DefaultAzureCredential, ManagedIdentityCredential
credential = (
DefaultAzureCredential() # local dev — uses CLI/VS credentials
if os.getenv("AZURE_FUNCTIONS_ENVIRONMENT") == "Development"
else ManagedIdentityCredential() # production — deterministic, no fallback chain
)
# For user-assigned identity: ManagedIdentityCredential(client_id="<client-id>")
```
### Java
```java
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.identity.ManagedIdentityCredentialBuilder;
var credential = "Development".equals(System.getenv("AZURE_FUNCTIONS_ENVIRONMENT"))
? new DefaultAzureCredentialBuilder().build() // local dev — uses CLI/VS credentials
: new ManagedIdentityCredentialBuilder().build(); // production — deterministic, no fallback chain
// For user-assigned identity: new ManagedIdentityCredentialBuilder().clientId("<client-id>").build()
```
## Local Development Setup
`DefaultAzureCredential` is ideal for local dev because it automatically picks up credentials from developer tools:
1. **Azure CLI** — `az login`
2. **Azure Developer CLI** — `azd auth login`
3. **Azure PowerShell** — `Connect-AzAccount`
4. **Visual Studio / VS Code** — sign in via Azure extension
```typescript
import { DefaultAzureCredential } from "@azure/identity";
// Local development only — uses CLI/PowerShell/VS Code credentials
const credential = new DefaultAzureCredential();
```
## Environment-Aware Pattern
Detect the runtime environment and select the appropriate credential. The key principle: use `DefaultAzureCredential` only when running locally, and a specific credential in production.
> **Tip:** Azure Functions sets `AZURE_FUNCTIONS_ENVIRONMENT` to `"Development"` when running locally. For App Service or containers, use any environment variable you control (e.g. `NODE_ENV`, `ASPNETCORE_ENVIRONMENT`).
```typescript
import { DefaultAzureCredential, ManagedIdentityCredential } from "@azure/identity";
function getCredential() {
if (process.env.NODE_ENV === "development") {
return new DefaultAzureCredential(); // picks up az login / VS Code creds
}
return process.env.AZURE_CLIENT_ID
? new ManagedIdentityCredential(process.env.AZURE_CLIENT_ID) // user-assigned
: new ManagedIdentityCredential(); // system-assigned
}
```
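The same selection logic can be written in Python. In this sketch the helper names `choose_credential_kind` and `get_credential` are illustrative, and the decision is split from construction so it can be checked without the `azure-identity` package installed. It uses `AZURE_FUNCTIONS_ENVIRONMENT` as the local-development signal, as in the Python pattern earlier in this guide. Substitute whichever variable you control.

```python
import os


def choose_credential_kind(env=None):
    """Return 'default', 'user-assigned', or 'system-assigned' for the given environment."""
    env = os.environ if env is None else env
    if env.get("AZURE_FUNCTIONS_ENVIRONMENT") == "Development":
        return "default"  # local development only
    if env.get("AZURE_CLIENT_ID"):
        return "user-assigned"
    return "system-assigned"


def get_credential(env=None):
    """Build the credential object that matches choose_credential_kind()."""
    # Imported lazily so the selection logic has no hard dependency on azure-identity.
    from azure.identity import DefaultAzureCredential, ManagedIdentityCredential

    env = os.environ if env is None else env
    kind = choose_credential_kind(env)
    if kind == "default":
        return DefaultAzureCredential()
    if kind == "user-assigned":
        return ManagedIdentityCredential(client_id=env["AZURE_CLIENT_ID"])
    return ManagedIdentityCredential()
```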
## Security Checklist
- [ ] Use managed identity for all Azure-hosted apps
- [ ] Never hardcode credentials, connection strings, or keys
- [ ] Apply least-privilege RBAC roles at the narrowest scope
- [ ] Use `ManagedIdentityCredential` (not `DefaultAzureCredential`) in production
- [ ] Store any required secrets in Azure Key Vault
- [ ] Rotate secrets and certificates on a schedule
- [ ] Enable Microsoft Defender for Cloud on production resources
## Further Reading
- [Passwordless connections overview](https://learn.microsoft.com/azure/developer/intro/passwordless-overview)
- [Managed identities overview](https://learn.microsoft.com/entra/identity/managed-identities-azure-resources/overview)
- [Azure RBAC overview](https://learn.microsoft.com/azure/role-based-access-control/overview)
- [.NET authentication guide](https://learn.microsoft.com/dotnet/azure/sdk/authentication/)
- [Python identity library](https://learn.microsoft.com/python/api/overview/azure/identity-readme)
- [JavaScript identity library](https://learn.microsoft.com/javascript/api/overview/azure/identity-readme)
- [Java identity library](https://learn.microsoft.com/java/api/overview/azure/identity-readme)
azure-eventhubs-dotnet.md 3.5 KB
# Azure Event Hubs SDK — .NET (C#)
Package: `Azure.Messaging.EventHubs` | [README](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/eventhub/Azure.Messaging.EventHubs/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/eventhub/Azure.Messaging.EventHubs/TROUBLESHOOTING.md)
## Common Errors
| Exception | Reason | Fix |
|-----------|--------|-----|
| `EventHubsException` (ServiceTimeout) | Service didn't respond in time | Transient — retried automatically. Verify state if persists |
| `EventHubsException` (QuotaExceeded) | Too many active readers per consumer group | Reduce concurrent receivers or upgrade tier |
| `EventHubsException` (ConsumerDisconnected) | Higher priority consumer took ownership | Expected during load balancing; check if scaling |
| `EventHubsException` (MessageSizeExceeded) | Event too large | Reduce event payload; unlikely in practice since the client caps at the service link limit |
| `UnauthorizedAccessException` | Bad credentials | Verify connection string, SAS token, or RBAC roles |
## Exception Filtering
```csharp
try { /* receive events */ }
catch (EventHubsException ex) when (ex.Reason == EventHubsException.FailureReason.ConsumerDisconnected)
{
// Handle consumer disconnection
}
```
## Retry Configuration
Configure via `EventHubsRetryOptions` when creating the client. See [Configuring retry thresholds sample](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/eventhub/Azure.Messaging.EventHubs/samples).
## Key Issues
- **Socket exhaustion**: Treat clients as singletons. Share `EventHubConnection` across clients if needed. Always call `CloseAsync` or `DisposeAsync`.
- **HTTP 412/409 from storage**: Normal during checkpoint store operations — not an error.
- **Partitions closing frequently**: Expected when scaling. If persists >5 min without scaling, investigate. See [Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/eventhub/Azure.Messaging.EventHubs/TROUBLESHOOTING.md) for detailed diagnostics.
- **High CPU**: Limit to 1.5–3 partitions per CPU core and test at scale thoroughly if above that threshold.
- **Azure Functions**: After upgrading to v5.0+ extensions, update binding types. Reduce logging noise by filtering `Azure.Messaging.EventHubs` to Warning.
- **WebSockets**: Use `EventHubsTransportType.AmqpWebSockets` to connect over port 443 when AMQP ports (5671, 5672) are blocked.
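The partitions-per-core guidance in the High CPU item is simple arithmetic. A rough Python sketch follows: the `partitions_per_core` and `within_guidance` helpers are hypothetical, and the calculation assumes the load balancer spreads partitions evenly across processor instances.

```python
def partitions_per_core(partition_count, instance_count, cores_per_instance):
    """Estimate how many partitions each CPU core handles under even balancing."""
    per_instance = partition_count / instance_count
    return per_instance / cores_per_instance


def within_guidance(partition_count, instance_count, cores_per_instance, upper=3.0):
    """True when the estimate is at or below the upper end of the 1.5-3 range."""
    return partitions_per_core(partition_count, instance_count, cores_per_instance) <= upper
```

For instance, 32 partitions across 4 instances with 2 cores each gives 4 partitions per core, which is above the range. Moving to 4 cores per instance brings it to 2.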
## Checkpointing (BlobCheckpointStore)
Package: `Azure.Messaging.EventHubs.Processor` (includes `EventProcessorClient` + blob checkpoint store)
> **Auth:** `DefaultAzureCredential` is for local development. See [auth-best-practices.md](auth-best-practices.md) for production patterns.
```csharp
var credential = new DefaultAzureCredential();
var storageClient = new BlobContainerClient(
new Uri("https://<storage-account>.blob.core.windows.net/<checkpoint-container>"),
credential);
var processor = new EventProcessorClient(
storageClient,
"$Default",
"<your-namespace>.servicebus.windows.net",
"<your-eventhub>",
credential);
processor.ProcessEventAsync += async (args) =>
{
    // process event
    // Simplified: this checkpoints after every event. In production, checkpoint once per batch.
    await args.UpdateCheckpointAsync();
};
```
**Common issues:**
- **Soft delete / blob versioning**: Disable both on the storage account — they cause delays during load balancing.
- **HTTP 412/409 from storage**: Normal during partition ownership negotiation; not an error.
- **Checkpoint frequency**: Call `UpdateCheckpointAsync()` per batch, not per event, to reduce storage calls.
azure-eventhubs-java.md 3.2 KB
# Azure Event Hubs SDK — Java
Package: `azure-messaging-eventhubs` | [README](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/TROUBLESHOOTING.md)
> ⚠️ **Note:** The detailed Java troubleshooting guide has moved to [Microsoft Learn](https://learn.microsoft.com/azure/developer/java/sdk/troubleshooting-messaging-event-hubs-overview).
## Common Errors
| Exception | Cause | Fix |
|-----------|-------|-----|
| `AmqpException` (connection:forced) | Idle connection disconnected | Auto-recovers; no action needed |
| `AmqpException` (unauthorized-access) | Bad credentials or missing permissions | Verify connection string, SAS, or RBAC roles |
| `AmqpException` (resource-limit-exceeded) | Too many concurrent receivers | Reduce receiver count or upgrade tier |
| `OperationTimeoutException` | Network issue or throttling | Check firewall, try AMQP over WebSockets (port 443) |
## Enable Logging
Configure via SLF4J. Add `logback-classic` dependency and set level for `com.azure.messaging.eventhubs`:
```xml
<logger name="com.azure.messaging.eventhubs" level="DEBUG"/>
```
For AMQP frame tracing:
```xml
<logger name="com.azure.core.amqp" level="DEBUG"/>
```
See [Java SDK logging docs](https://learn.microsoft.com/azure/developer/java/sdk/troubleshooting-messaging-event-hubs-overview) for details.
## Key Issues
- **High CPU / partition imbalance**: Limit to 1.5–3 partitions per CPU core.
- **Consumer disconnected**: Higher priority consumer took ownership. Expected during load balancing. Persistent issues without scaling indicate a problem.
- **Connection sharing**: Reuse `EventHubClientBuilder` connections; avoid creating new clients per operation.
## Checkpointing (BlobCheckpointStore)
Package: `azure-messaging-eventhubs-checkpointstore-blob`
> **Auth:** `DefaultAzureCredential` is for local development. See [auth-best-practices.md](auth-best-practices.md) for production patterns.
```java
TokenCredential credential = new DefaultAzureCredentialBuilder().build();
BlobContainerAsyncClient blobClient = new BlobContainerClientBuilder()
.endpoint("https://<storage-account>.blob.core.windows.net/<checkpoint-container>")
.credential(credential)
.buildAsyncClient();
EventProcessorClient processor = new EventProcessorClientBuilder()
.credential("<your-namespace>.servicebus.windows.net", "<your-eventhub>", credential)
.consumerGroup("$Default")
.checkpointStore(new BlobCheckpointStore(blobClient))
    .processEvent(eventContext -> {
        // process event
        // Simplified: this checkpoints after every event. In production, checkpoint once per batch.
        eventContext.updateCheckpoint();
    })
.buildEventProcessorClient();
```
**Common issues:**
- **Soft delete / blob versioning**: Disable both on the storage account — they cause delays during load balancing.
- **HTTP 412/409 from storage**: Normal during partition ownership negotiation; not an error.
- **Checkpoint frequency**: Call `updateCheckpoint()` per batch, not per event, to reduce storage calls.
## Filing Issues
Include: partition count, machine specs, instance count, max heap (`-Xmx`), average `EventData` size, traffic pattern, and DEBUG-level logs (±10 min from issue).
azure-eventhubs-js.md 2.7 KB
# Azure Event Hubs SDK — JavaScript
Package: `@azure/event-hubs` | [README](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/eventhub/event-hubs/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/eventhub/event-hubs/TROUBLESHOOTING.md)
## Common Errors
| Error | Cause | Fix |
|-------|------|-----|
| `MessagingError` (connection:forced) | Idle disconnect | Auto-recovers; no action needed |
| `MessagingError` (Unauthorized) | Bad credentials | Verify connection string, SAS, or RBAC roles |
| `MessagingError` (retryable: true) | Transient issue | Auto-retried per `RetryOptions`. If surfaced, all retries exhausted |
`MessagingError` fields: `name`, `code`, `retryable`, `info`, `address`, `errno`, `port`, `syscall`.
## Enable Logging
```bash
# All SDK logs
export AZURE_LOG_LEVEL=verbose
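# Or use DEBUG for granular control (quoted so the shell does not glob-expand *)
export DEBUG='azure*,rhea*'
# Errors and warnings only (quoted: the parentheses and | are shell metacharacters)
export DEBUG='azure:*:(error|warning),rhea-promise:error,rhea:events,rhea:frames,rhea:io,rhea:flow'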
```
Browser:
```javascript
localStorage.debug = "azure:*:info";
```
## Key Issues
- **Socket exhaustion**: Treat clients as singletons. Each new client creates a new AMQP connection/socket. Always call `close()`.
- **412 precondition failures**: Normal while subscriptions negotiate partition ownership; not an error.
- **Partition ownership churn**: Expected when scaling instances. Should stabilize within minutes.
- **High CPU**: Limit to 1.5–3 partitions per CPU core.
- **Subscription stops receiving**: Often a symptom of an underlying race condition during error recovery. File a GitHub issue with DEBUG logs.
- **WebSockets**: Pass `webSocketOptions` to client constructor to connect over port 443.
## Checkpointing (BlobCheckpointStore)
Package: `@azure/eventhubs-checkpointstore-blob`
> **Auth:** `DefaultAzureCredential` is for local development. See [auth-best-practices.md](auth-best-practices.md) for production patterns.
```javascript
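const { EventHubConsumerClient } = require("@azure/event-hubs");
const { BlobCheckpointStore } = require("@azure/eventhubs-checkpointstore-blob");
const { BlobServiceClient } = require("@azure/storage-blob");
const { DefaultAzureCredential } = require("@azure/identity");

const credential = new DefaultAzureCredential(); // local development only
const storageEndpoint = "https://<storage-account>.blob.core.windows.net";
const containerClient = new BlobServiceClient(storageEndpoint, credential)
  .getContainerClient("checkpointstore");
const checkpointStore = new BlobCheckpointStore(containerClient);
const consumerClient = new EventHubConsumerClient(
  "$Default", "<your-namespace>.servicebus.windows.net", "<your-eventhub>", credential, checkpointStore
);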
```
**Common issues:**
- **Soft delete / blob versioning**: Disable both on the storage account — they cause delays during load balancing.
- **412 precondition failures**: Normal during partition ownership negotiation; not an error.
- **Checkpoint frequency**: Call `updateCheckpoint()` per batch, not per event, to reduce storage calls.
azure-eventhubs-py.md 4.3 KB
# Azure Event Hubs SDK — Python
Package: `azure-eventhub` | [README](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md)
## Common Errors
| Exception | Cause | Fix |
|-----------|-------|-----|
| `EventHubError` | Base exception wrapping AMQP errors | Check `message`, `error`, `details` fields |
| `ConnectionLostError` | Idle connection disconnected | Auto-recovers on next operation; no action needed |
| `AuthenticationError` | Bad credentials or expired SAS | Regenerate key, check RBAC roles, verify connection string |
| `OperationTimeoutError` | Network or throttling | Check firewall, try WebSockets (port 443), increase timeout |
## Retry Configuration
> **Auth:** `DefaultAzureCredential` is for local development. See [auth-best-practices.md](auth-best-practices.md) for production patterns.
```python
from azure.eventhub import EventHubProducerClient
from azure.identity import DefaultAzureCredential
client = EventHubProducerClient(
fully_qualified_namespace="<your-namespace>.servicebus.windows.net",
eventhub_name="<your-eventhub>",
credential=DefaultAzureCredential(),
retry_total=3,
retry_backoff_factor=0.8,
retry_backoff_max=120,
retry_mode='exponential'
)
```
## Consumer Client Retry Configuration
> **Auth:** `DefaultAzureCredential` is for local development. See [auth-best-practices.md](auth-best-practices.md) for production patterns.
Under heavy load, tune the retry policy on `EventHubConsumerClient` to reduce timeouts:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `retry_total` | 3 | Max retry attempts per operation |
| `retry_backoff_factor` | 0.8 | Backoff multiplier between retries (seconds) |
| `retry_backoff_max` | 120 | Max backoff interval (seconds) |
| `retry_mode` | `exponential` | `fixed` or `exponential` |
```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
checkpoint_store = BlobCheckpointStore(
blob_account_url="https://<storage-account>.blob.core.windows.net",
container_name="<checkpoint-container>",
credential=credential
)
client = EventHubConsumerClient(
fully_qualified_namespace="<your-namespace>.servicebus.windows.net",
eventhub_name="<your-eventhub>",
consumer_group="$Default",
credential=credential,
checkpoint_store=checkpoint_store,
retry_total=5,
retry_backoff_factor=1.0,
retry_backoff_max=120,
retry_mode='exponential'
)
```
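To get a feel for how these parameters interact, the sketch below computes a delay schedule. It assumes the conventional formula of `retry_backoff_factor * 2**attempt` capped at `retry_backoff_max` for exponential mode, and a constant `retry_backoff_factor` for fixed mode. The SDK's internal schedule may differ, so treat the numbers as an estimate. `backoff_schedule` is a hypothetical helper, not an SDK function.

```python
def backoff_schedule(retry_total, retry_backoff_factor, retry_backoff_max, retry_mode="exponential"):
    """Return the assumed delay in seconds before each retry attempt."""
    delays = []
    for attempt in range(retry_total):
        if retry_mode == "fixed":
            delay = retry_backoff_factor
        else:
            delay = retry_backoff_factor * (2 ** attempt)
        delays.append(min(retry_backoff_max, delay))
    return delays
```

Under this assumption, the consumer settings above (`retry_total=5`, `retry_backoff_factor=1.0`) give delays of 1, 2, 4, 8, and 16 seconds, so an operation can spend roughly 31 seconds backing off before the error surfaces.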
## Enable Logging
```python
import logging, sys
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s | %(threadName)s | %(levelname)s | %(name)s | %(message)s"))
logger = logging.getLogger('azure.eventhub')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
# Enable AMQP frame tracing
client = EventHubProducerClient(..., logging_enable=True)
```
## Key Issues
- **Buffered producer not sending**: Ensure enough `ThreadPoolExecutor` workers (one per partition). Use `buffer_concurrency` kwarg.
- **Blocking calls in async**: Run CPU-bound code in an executor; blocking the event loop impacts load balancing and checkpointing.
- **Consumer disconnected**: Expected during load balancing. If persistent with no scaling, file an issue.
- **Soft delete on checkpoint store**: Disable "soft delete" and "blob versioning" on the storage account used for checkpointing.
- **Always close clients**: Use `with` statement or call `close()` to avoid socket/connection leaks.
## Checkpointing (BlobCheckpointStore)
Package: `azure-eventhub-checkpointstoreblob` (sync) / `azure-eventhub-checkpointstoreblob-aio` (async)
See the [Consumer Client Retry Configuration](#consumer-client-retry-configuration) section above for a full `EventHubConsumerClient` example with `BlobCheckpointStore`.
**Common issues:**
- **Soft delete / blob versioning**: Disable both on the storage account — they cause large delays during load balancing.
- **HTTP 412/409 from storage**: Normal during partition ownership negotiation; not an error.
- **Checkpoint frequency**: Checkpoint after processing each batch, not each event, to avoid storage throttling.
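One way to checkpoint per batch is to count events per partition and only call `update_checkpoint` every N events. This is a minimal sketch: `BatchCheckpointer` and its `batch_size` default are illustrative, and it relies only on the `partition_id` attribute and `update_checkpoint(event)` method of the partition context passed to `on_event`.

```python
class BatchCheckpointer:
    """Checkpoint once every `batch_size` events per partition instead of per event."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self._counts = {}  # partition_id -> events seen since the last checkpoint

    def on_event(self, partition_context, event):
        if event is None:  # can happen when a max wait time elapses with no event
            return
        # ... process the event here ...
        pid = partition_context.partition_id
        self._counts[pid] = self._counts.get(pid, 0) + 1
        if self._counts[pid] >= self.batch_size:
            partition_context.update_checkpoint(event)
            self._counts[pid] = 0
```

Pass the bound method as the callback, for example `client.receive(on_event=BatchCheckpointer(batch_size=50).on_event)`. A larger batch size means fewer storage calls but more events reprocessed after a restart.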
azure-servicebus-dotnet.md 2.5 KB
# Azure Service Bus SDK — .NET (C#)
Package: `Azure.Messaging.ServiceBus` | [README](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/servicebus/Azure.Messaging.ServiceBus/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/servicebus/Azure.Messaging.ServiceBus/TROUBLESHOOTING.md)
## Common Errors
| Exception | Reason | Fix |
|-----------|--------|-----|
| `ServiceBusException` (ServiceTimeout) | Service didn't respond | Transient — auto-retried. For session accept, means no unlocked sessions |
| `ServiceBusException` (MessageLockLost) | Lock expired or link detached | Renew lock, reduce processing time, check network |
| `ServiceBusException` (SessionLockLost) | Session lock expired | Re-accept session, renew lock before expiry |
| `ServiceBusException` (QuotaExceeded) | Too many concurrent receives | Reduce receivers or use batch receives |
| `ServiceBusException` (MessageSizeExceeded) | Message or batch too large | Reduce payload. Premium tier supports individual messages up to 100MB. Batch limit is artificially computed on the client from the max message size sent by the service, so batches can also be impacted |
| `ServiceBusException` (ServiceBusy) | Request throttled | Auto-retried with 10s backoff. See [throttling docs](https://learn.microsoft.com/azure/service-bus-messaging/service-bus-throttling) |
| `UnauthorizedAccessException` | Bad credentials | Verify connection string, SAS, or RBAC roles |
## Exception Filtering
```csharp
try { /* receive messages */ }
catch (ServiceBusException ex) when (ex.Reason == ServiceBusFailureReason.ServiceTimeout)
{
// Handle timeout
}
```
## Key Issues
- **Socket exhaustion**: Treat `ServiceBusClient` as singleton. Each creates a new AMQP connection. Always call `CloseAsync`/`DisposeAsync`.
- **Lock lost before expiry**: Can happen on link detach (transient network) or 10-min idle timeout.
- **Processor high concurrency**: May cause hangs with extreme concurrency settings. Test with moderate values.
- **Session processor slow switching**: Tune `SessionIdleTimeout` to reduce wait time between sessions.
- **Batch size limits**: Batch limit is artificially computed on the client from the max message size sent by the service. Send large messages individually if batch creation fails.
- **Transactions across entities**: All entities must be in the same namespace. Create the `ServiceBusClient` with `ServiceBusClientOptions.EnableCrossEntityTransactions` set to `true`; the first entity used in the transaction then acts as the send-via entity.
- **WebSockets**: Use `ServiceBusTransportType.AmqpWebSockets` when AMQP ports (5671, 5672) are blocked.
azure-servicebus-java.md 2.2 KB
# Azure Service Bus SDK — Java
Package: `azure-messaging-servicebus` | [README](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/servicebus/azure-messaging-servicebus/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/servicebus/azure-messaging-servicebus/TROUBLESHOOTING.md)
## Common Errors
| Exception | Cause | Fix |
|-----------|-------|-----|
| `AmqpException` (unauthorized-access) | Bad credentials or missing permissions | Verify connection string, SAS, or RBAC roles |
| `AmqpException` (connection:forced) | Idle connection or transient network issue | Auto-recovers; no action needed |
| `ServiceBusException` (MESSAGE_LOCK_LOST) | Lock expired during processing | Reduce processing time, disable auto-complete, settle manually |
## Key Issues
### Processor hangs with high prefetch + maxConcurrentCalls
`Update disposition request timed out.` — Client stops processing new messages.
**Cause**: Thread starvation when thread pool size ≤ `maxConcurrentCalls`.
**Fix**:
```bash
# Increase reactor thread pool
-Dreactor.schedulers.defaultBoundedElasticSize=<value greater than concurrency>
```
Also set `prefetchCount(0)` to disable prefetch. This hang is reported more often in AKS environments.
### Implicit prefetch in ServiceBusReceiverClient
Even with prefetch disabled in the builder, `receiveMessages` API can re-enable prefetch implicitly. See [SyncReceiveAndPrefetch](https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/servicebus/azure-messaging-servicebus/docs/SyncReceiveAndPrefetch.md).
### Autocomplete issues
Autocomplete and auto-lock-renewal have known issues with buffered/prefetched messages.
**Fix**: Call `disableAutoComplete()` and `maxAutoLockRenewalDuration(Duration.ZERO)` on the builder, then settle messages explicitly.
## Enable Logging
Configure via SLF4J:
```xml
<logger name="com.azure.messaging.servicebus" level="DEBUG"/>
```
See [Java SDK logging docs](https://learn.microsoft.com/azure/developer/java/sdk/troubleshooting-messaging-service-bus-overview) for details.
## Filing Issues
Include: namespace tier, entity type/config, machine specs, max heap (`-Xmx`), `maxConcurrentCalls`, `prefetchCount`, autoComplete setting, traffic pattern, and DEBUG-level logs (±10 min from issue).
azure-servicebus-js.md 2.4 KB
# Azure Service Bus SDK — JavaScript
Package: `@azure/service-bus` | [README](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/servicebus/service-bus/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-js/blob/main/sdk/servicebus/service-bus/TROUBLESHOOTING.md)
## Common Errors
| Error Code | Cause | Fix |
|------------|-------|-----|
| `ServiceTimeout` | Service didn't respond; or no unlocked sessions | Transient — auto-retried. Verify state if persists |
| `MessageLockLost` | Processing exceeded lock duration or link detached | Reduce processing time, ensure autolock renewal works |
| `SessionLockLost` | Session lock expired or link detached | Re-accept session, keep renewing lock |
| `QuotaExceeded` | Too many concurrent receives | Reduce receivers or use batch receives |
| `MessageSizeExceeded` | Message or batch > max size | Reduce payload. Premium supports individual messages up to 100MB. Batch limit is computed from max message size on the client, so batches can also be impacted |
| `UnauthorizedAccess` | Bad credentials | Verify connection string, SAS, or RBAC roles |
`ServiceBusError` fields: `code`, `retryable`, `name`, `info`, `address`.
## Enable Logging
```bash
# All SDK logs
export AZURE_LOG_LEVEL=verbose
# Or granular control
export DEBUG='azure*,rhea*'
# Errors only
export DEBUG=azure:service-bus:error,azure:core-amqp:error,rhea-promise:error,rhea:events,rhea:frames,rhea:io,rhea:flow
```
Log to file:
```bash
node app.js > out.log 2>debug.log
```
## Key Issues
- **Socket exhaustion**: Treat `ServiceBusClient` as singleton. Each creates a new AMQP connection. Always call `close()`.
- **Lock lost before expiry**: Can happen on link detach (transient network issue or 10-min idle timeout). Not always due to processing time.
- **Batch receive returns fewer messages**: After first message arrives, receiver waits only 1s for additional messages. `maxWaitTimeInMs` controls wait for the *first* message only.
- **Autolock renewal not working**: Ensure system clock is accurate. Autolock relies on system time.
- **Batch size limits**: The batch limit is computed on the client from the max message size reported by the service. If batch creation fails, send large messages individually.
- **WebSockets**: Pass `webSocketOptions` to `ServiceBusClient` constructor for port 443 connectivity.
- **Distributed tracing**: Experimental OpenTelemetry support via `@azure/opentelemetry-instrumentation-azure-sdk`.
azure-servicebus-py.md 2.3 KB
# Azure Service Bus SDK — Python
Package: `azure-servicebus` | [README](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/servicebus/azure-servicebus/) | [Full Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/servicebus/azure-servicebus/TROUBLESHOOTING.md)
## Common Errors
| Exception | Cause | Fix |
|-----------|-------|-----|
| `ServiceBusAuthenticationError` | Invalid credentials | Check connection string, regenerate SAS key |
| `ServiceBusAuthorizationError` | Missing Send/Listen claim | Assign `Azure Service Bus Data Owner/Sender/Receiver` RBAC role |
| `ServiceBusConnectionError` | Network or firewall | Check AMQP port 5671, try `TransportType.AmqpOverWebsocket` |
| `OperationTimeoutError` | Service didn't respond in time | Adjust retry config, verify network |
| `MessageLockLostError` | Processing exceeded lock duration | Use `AutoLockRenewer`, reduce processing time |
| `SessionLockLostError` | Session lock expired | Reconnect to session, keep renewing lock |
| `MessageSizeExceededError` | Message or batch too large | Reduce payload. Premium supports individual messages up to 100MB. Batch limit is computed from max message size on the client, so batches can also be impacted |
## Enable Logging
```python
import logging, sys
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s | %(threadName)s | %(levelname)s | %(name)s | %(message)s"))
logger = logging.getLogger('azure.servicebus')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
# Enable AMQP frame tracing
from azure.servicebus import ServiceBusClient
client = ServiceBusClient(..., logging_enable=True)
```
## AutoLockRenewer
```python
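from azure.servicebus import AutoLockRenewer

# `receiver` is an existing ServiceBusReceiver created elsewhere.
renewer = AutoLockRenewer()
with receiver:
    for message in receiver.receive_messages(max_message_count=10):
        # Keep renewing the lock for up to 300 seconds while processing.
        renewer.register(receiver, message, max_lock_renewal_duration=300)
        # process message
        receiver.complete_message(message)
renewer.close()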
```
## Key Issues
- **Mixing sync/async**: Don't use `time.sleep()` in async code; use `await asyncio.sleep()`.
- **Dead letter debugging**: Use `sub_queue=ServiceBusSubQueue.DEAD_LETTER` to inspect `dead_letter_reason` and `dead_letter_error_description`.
- **Always close clients**: Use a `with` statement or call `close()` to avoid connection leaks.
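
As a rough illustration of the dead-letter tip above, the sketch below tallies dead-lettered messages by reason. `summarize_dead_letters` is a hypothetical helper, not part of `azure-servicebus`; it only assumes each message exposes the `dead_letter_reason` and `dead_letter_error_description` attributes. The reason strings in the sample data are illustrative values, and the SDK calls appear only in comments as assumed usage.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_dead_letters(messages):
    """Count dead-lettered messages per (reason, description) pair.

    Hypothetical helper: it only reads the two attributes named above.
    """
    counts = Counter()
    for msg in messages:
        reason = getattr(msg, "dead_letter_reason", None) or "<no reason>"
        detail = getattr(msg, "dead_letter_error_description", None) or ""
        counts[(reason, detail)] += 1
    return counts


# With the SDK, messages would come from the dead-letter sub-queue, e.g.
# (assumed usage, not executed here):
#   receiver = client.get_queue_receiver(
#       queue_name, sub_queue=ServiceBusSubQueue.DEAD_LETTER)
#   with receiver:
#       summary = summarize_dead_letters(
#           receiver.peek_messages(max_message_count=50))

# Stand-in objects so the sketch runs without a namespace.
sample = [
    SimpleNamespace(dead_letter_reason="MaxDeliveryCountExceeded",
                    dead_letter_error_description="Delivery count exceeded"),
    SimpleNamespace(dead_letter_reason="MaxDeliveryCountExceeded",
                    dead_letter_error_description="Delivery count exceeded"),
    SimpleNamespace(dead_letter_reason="TTLExpiredException",
                    dead_letter_error_description=""),
]
for (reason, detail), n in summarize_dead_letters(sample).most_common():
    print(f"{n} x {reason} {detail}".rstrip())
```

Grouping by reason first usually shows whether the dead letters share one cause (for example repeated delivery failures) before any individual message is inspected.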
service-troubleshooting.md 4.1 KB
# Service-Level Troubleshooting
Covers connectivity, firewall, and network issues that apply regardless of SDK language.
## Permanent Connectivity Issues
If the client **cannot connect at all**:
1. **Verify connection string** — Get from Azure portal. For **Event Hubs (Kafka endpoint)** clients, also check `producer.config` / `consumer.config`.
2. **Check service outage** — [Azure status page](https://azure.status.microsoft/status)
3. **Firewall / ports** — Open AMQP 5671 and 5672, HTTPS 443. For **Event Hubs (Kafka endpoint)** only, also open Kafka 9093. Use WebSockets (port 443) as fallback.
4. **IP firewall** — If enabled on namespace, ensure client IP is allowed.
5. **VNet / private endpoints** — Confirm app runs in correct subnet. Check service endpoint and NSG rules.
6. **Proxy / SSL** — Intercepting proxies can cause SSL handshake failures. Test with proxy disabled.
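
Step 1 can be partly automated. The sketch below is a heuristic sanity check, assuming the common `Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...` connection string layout. `parse_connection_string` and `check_connection_string` are hypothetical helper names, and the sample namespace and key are made up. It checks shape only and cannot tell whether the key is actually valid.

```python
def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' pairs into a dict with lower-cased keys."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue
        # Split on the first '=' only; base64 keys may end with '='.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip().lower()] = value.strip()
    return parts


def check_connection_string(conn_str):
    """Return a list of problems found; an empty list means it looks sane."""
    problems = []
    parts = parse_connection_string(conn_str)
    if not parts.get("endpoint", "").startswith("sb://"):
        problems.append("Endpoint is missing or does not start with sb://")
    has_key = "sharedaccesskeyname" in parts and "sharedaccesskey" in parts
    has_sas = "sharedaccesssignature" in parts
    if not (has_key or has_sas):
        problems.append("No SharedAccessKeyName/SharedAccessKey or "
                        "SharedAccessSignature present")
    return problems


# Made-up example values, for illustration only.
example = ("Endpoint=sb://example-ns.servicebus.windows.net/;"
           "SharedAccessKeyName=RootManageSharedAccessKey;"
           "SharedAccessKey=abc123=")
print(check_connection_string(example))  # []
print(check_connection_string("Endpoint=https://example-ns/"))
```

The helper returns problems instead of printing the parsed values, so the key itself never ends up in logs.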
### Quick Connectivity Test
```bash
# Test endpoint reachability (expect Atom feed XML on success)
curl -v https://<namespace>.servicebus.windows.net/
# Resolve namespace IP
nslookup <namespace>.servicebus.windows.net
```
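
Where `curl` or `nslookup` are unavailable, a similar check can be sketched with the Python standard library. The helper names below (`resolve`, `can_connect`, `check_namespace`) are hypothetical, and the default port list is taken from the firewall step above. This only tests TCP reachability; it does not perform a TLS or AMQP handshake, so a successful result does not rule out proxy or SSL interception problems.

```python
import socket


def resolve(host):
    """Return the sorted set of IP addresses the host name resolves to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_namespace(host, ports=(5671, 5672, 443)):
    """Map each port to its TCP reachability for the given host."""
    return {port: can_connect(host, port) for port in ports}


# Example (replace the placeholder with a real namespace before running):
#   host = "<namespace>.servicebus.windows.net"
#   print(resolve(host))
#   print(check_namespace(host))
```

If 5671 fails but 443 succeeds, switching the client to WebSockets is the usual fallback.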
## Transient Connectivity Issues
If connectivity is **intermittent**:
1. **Upgrade SDK** — Use latest version; transient issues may already be fixed.
2. **Check dropped packets** — `netstat -s` (Linux) or `netsh interface ipv4 show subinterface` (Windows).
3. **Capture network traces** — Use Wireshark or `tcpdump` filtered on namespace IP.
4. **Idle disconnect** — Service disconnects idle AMQP connections. Clients auto-reconnect; this is expected.
## WebSocket Configuration by Language
| Language | Setting |
|----------|---------|
| .NET | `EventHubsTransportType.AmqpWebSockets` / `ServiceBusTransportType.AmqpWebSockets` |
| Java | `AmqpTransportType.AMQP_WEB_SOCKETS` |
| Python | `transport_type=TransportType.AmqpOverWebsocket` |
| JavaScript | `webSocketOptions` in client constructor |
## Authentication Checklist
| Issue | Fix |
|-------|-----|
| Invalid connection string | Re-copy from Azure portal |
| Expired SAS token | Regenerate or increase validity |
| Missing RBAC role | Assign the corresponding *Azure Event Hubs Data Owner/Sender/Receiver* or *Azure Service Bus Data Owner/Sender/Receiver* role |
| Managed Identity not configured | Enable system/user-assigned identity, assign role on namespace |
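
To confirm the "Expired SAS token" case, the expiry can be read from the token itself. The sketch below assumes the standard `SharedAccessSignature sr=...&sig=...&se=<unix-seconds>&skn=...` token layout; `sas_seconds_remaining` is a hypothetical helper and the sample token is fabricated for illustration.

```python
import time
from urllib.parse import parse_qs


def sas_seconds_remaining(token, now=None):
    """Return seconds until the token's `se` expiry (negative if expired)."""
    prefix = "SharedAccessSignature "
    if not token.startswith(prefix):
        raise ValueError("Not a SharedAccessSignature token")
    fields = parse_qs(token[len(prefix):])
    if "se" not in fields:
        raise ValueError("Token has no 'se' (expiry) field")
    expiry = int(fields["se"][0])  # Unix epoch seconds
    current = time.time() if now is None else now
    return expiry - current


# Fabricated token; the signature is not real.
token = ("SharedAccessSignature "
         "sr=sb%3A%2F%2Fexample-ns.servicebus.windows.net%2F"
         "&sig=fake&se=1700000000&skn=example-policy")
print(sas_seconds_remaining(token, now=1700000600))  # -600
```

A negative result means the token has already expired. A small positive result combined with client clock skew can produce the same authentication failure.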
## Sender Issues (All Languages)
- **Batch >1MB fails** — Service rejects batches over 1MB even with Premium large message support. Send large messages individually.
- **Multiple partition keys in batch** — Not allowed. Group messages by `partitionKey` (or `sessionId`) into separate batches.
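
The two sender rules above can be sketched as a planning step that runs before any SDK call. `plan_batches` is a hypothetical helper, and the 1 MB ceiling is taken from the note above. It counts payload bytes only and ignores AMQP framing and property overhead, so the SDK's own batch type remains the authority on what actually fits.

```python
from collections import defaultdict

MAX_BATCH_BYTES = 1024 * 1024  # 1 MB ceiling from the note above


def plan_batches(messages, max_bytes=MAX_BATCH_BYTES):
    """Group (key, payload_bytes) pairs into per-key batches under max_bytes.

    Returns (batches, oversized). `batches` is a list of (key, [payloads]);
    `oversized` lists (key, payload) pairs that must be sent individually.
    """
    by_key = defaultdict(list)
    for key, payload in messages:
        by_key[key].append(payload)

    batches, oversized = [], []
    for key, payloads in by_key.items():
        current, size = [], 0
        for payload in payloads:
            if len(payload) > max_bytes:
                oversized.append((key, payload))
                continue
            # Start a new batch when adding this payload would exceed the cap.
            if current and size + len(payload) > max_bytes:
                batches.append((key, current))
                current, size = [], 0
            current.append(payload)
            size += len(payload)
        if current:
            batches.append((key, current))
    return batches, oversized


msgs = [("a", b"x" * 10), ("b", b"y" * 10), ("a", b"z" * 10), ("a", b"w" * 30)]
batches, oversized = plan_batches(msgs, max_bytes=25)
print([(key, len(items)) for key, items in batches])  # [('a', 2), ('b', 1)]
print([(key, len(p)) for key, p in oversized])        # [('a', 30)]
```

Each resulting batch holds a single key, so it satisfies the one-partition-key rule by construction.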
## Receiver Issues (All Languages)
- **Batch receive returns fewer messages** — After the first message arrives, the receiver waits briefly (20ms–1s depending on SDK) for more. `maxWaitTime` only controls the wait for the *first* message.
- **Lock lost before expiry** — Can occur on AMQP link detach (transient network or 10-min idle timeout), not only when processing exceeds lock duration.
- **Socket exhaustion** — Treat clients as singletons. Each new client creates a new AMQP connection. Always close/dispose clients when done.
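
Because a single batch receive can return early, code that needs a minimum number of messages typically loops. The sketch below is one way to do that. `receive_at_least` is a hypothetical helper, and `receive_once` stands in for an SDK batch-receive call (in Python this might be `lambda n: receiver.receive_messages(max_message_count=n, max_wait_time=5)`, shown here only as assumed usage).

```python
import time


def receive_at_least(receive_once, target, deadline_s, clock=time.monotonic):
    """Call receive_once(n) until `target` messages arrive or time runs out.

    receive_once takes the number of messages still wanted and returns a
    list, which may be shorter than requested or empty.
    """
    collected = []
    end = clock() + deadline_s
    while len(collected) < target and clock() < end:
        collected.extend(receive_once(target - len(collected)))
    return collected


# Fake receiver that returns short and empty batches, for illustration.
_chunks = iter([["m1", "m2"], [], ["m3"]])


def fake_receive(max_count):
    return next(_chunks, [])[:max_count]


print(receive_at_least(fake_receive, target=3, deadline_s=1.0))
# ['m1', 'm2', 'm3']
```

In peek-lock mode, messages collected early stay locked while the loop waits for the rest, so keep the deadline well below the lock duration or renew locks while accumulating.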
## Further Reading
- [Event Hubs troubleshooting guide](https://learn.microsoft.com/azure/event-hubs/troubleshooting-guide)
- [Service Bus troubleshooting guide](https://learn.microsoft.com/azure/service-bus-messaging/service-bus-troubleshooting-guide)
- [Event Hubs quotas and limits](https://learn.microsoft.com/azure/event-hubs/event-hubs-quotas)
- [Service Bus quotas and limits](https://learn.microsoft.com/azure/service-bus-messaging/service-bus-quotas)
- [Event Hubs AMQP troubleshooting](https://learn.microsoft.com/azure/event-hubs/event-hubs-amqp-troubleshoot)
- [Service Bus AMQP troubleshooting](https://learn.microsoft.com/azure/service-bus-messaging/service-bus-amqp-troubleshoot)
- [Event Hubs IP addresses and service tags](https://learn.microsoft.com/azure/event-hubs/troubleshooting-guide#what-ip-addresses-do-i-need-to-allow)
- [Service Bus IP addresses](https://learn.microsoft.com/azure/service-bus-messaging/service-bus-faq#what-ip-addresses-do-i-need-to-add-to-allowlist-)
License (MIT)
MIT License Copyright 2025 (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.