Azure Reliability


Assess and improve the reliability posture of PaaS Applications (Azure Functions and Azure App Service). Scans deployed resources for zone redundancy, ZRS storage, health probes, and multi-region failover. Presents a feature-pivoted checklist, then drives staged remediation (CLI or IaC patches) end-to-end with user confirmation. WHEN: "assess reliability", "check reliability", "zone redundant", "multi-region failover", "high availability", "disaster recovery", "single points of failure", "reliability posture", "resiliency".

🚀 DevOps & CI/CD · MIT · 13 files

Installation

Install with CLI (recommended)

gh skills-hub install azure-reliability

Don't have the extension? Run gh extension install samueltauil/skills-hub first.

Download and extract to your repository:

.github/skills/azure-reliability/

Extract the ZIP to .github/skills/ in your repo. The folder name must match azure-reliability for Copilot to auto-discover it.

Skill Files (13)

SKILL.md 23.1 KB

---
name: azure-reliability
description: "Assess and improve the reliability posture of PaaS Applications (Azure Functions and Azure App Service). Scans deployed resources for zone redundancy, ZRS storage, health probes, and multi-region failover. Presents a feature-pivoted checklist, then drives staged remediation (CLI or IaC patches) end-to-end with user confirmation. WHEN: \"assess reliability\", \"check reliability\", \"zone redundant\", \"multi-region failover\", \"high availability\", \"disaster recovery\", \"single points of failure\", \"reliability posture\", \"resiliency\"."
license: MIT
metadata:
  author: Microsoft
  version: "1.0.2"
---

# Azure Reliability Assessment & Configuration

## Quick Reference

| Property | Details |
|---|---|
| Best for | Reliability posture assessment, zone redundancy enablement, multi-region failover setup |
| Primary capabilities | Reliability assessment table, zone redundancy configuration, multi-region IaC generation |
| Supported services | Azure Functions, App Service (Container Apps planned for a future version) |
| MCP tools | Azure Resource Graph queries, Azure CLI commands |

## When to Use This Skill

Activate this skill when the user makes a request such as:
- "Assess my Function app's reliability"
- "Assess my Web app's reliability"
- "Check the reliability of my resource group" (App Service and Functions resources only)
- "Is my app zone redundant?" (App Service and Functions resources only)
- "Is my app service plan zone redundant?" 
- "Make my app zone redundant" (App Service and Functions resources only)
- "Make my app service plan zone redundant"
- "Set up multi-region failover for my app" (App Service and Functions resources only)
- "Check my reliability posture"
- "Find single points of failure" (App Service and Functions resources only)
- "Enable high availability for my app" (App Service and Functions resources only)
- "Check disaster recovery readiness"
- "Improve my app's resilience" (App Service and Functions resources only)

> **Scope note:** This skill currently covers **Azure Functions and Azure App Service** only. If the user asks about Azure Container Apps reliability, acknowledge that support is planned but not yet available, and only proceed with the parts that apply to App Service and Functions resources in scope.

## Prerequisites

- Authentication: user is logged in to Azure via `az login`
- Permissions: Reader access on target subscription/resource group (for assessment)
- Permissions: Contributor access (for configuration changes)
- Azure Resource Graph extension: `az extension add --name resource-graph`
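
The first and last prerequisites can be checked up front. The sketch below is one possible preflight, assuming the Azure CLI is on the PATH; `check_prereqs` is a hypothetical helper name, and it does not verify the Reader or Contributor role assignments.

```shell
# Preflight sketch: confirm login and the Resource Graph extension.
# check_prereqs is a hypothetical helper, not part of the skill itself.
check_prereqs() {
  # `az account show` fails when nobody is logged in.
  if ! az account show --output none 2>/dev/null; then
    echo "Not logged in: run 'az login' and retry" >&2
    return 1
  fi
  # Install the Resource Graph extension only when it is missing.
  if ! az extension show --name resource-graph --output none 2>/dev/null; then
    az extension add --name resource-graph || return 1
  fi
  echo "prerequisites ok"
}
```

Calling `check_prereqs` before Phase 1 lets the workflow stop early with the `az login` hint instead of failing on the first query.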

## MCP Tools

| Tool | Purpose |
|------|---------|
| `mcp_azure_mcp_extension_cli_generate` | Generate `az` CLI commands for resource queries and configuration |
| `mcp_azure_mcp_subscription_list` | List available subscriptions |
| `mcp_azure_mcp_group_list` | List resource groups |

Primary query method: Azure Resource Graph via `az graph query` (requires `az extension add --name resource-graph`).

## Assessment Workflow

### Phase 1: Discover Resources

1. **Identify scope** — Ask user for resource group, subscription, or app name
2. **Query Azure Resource Graph** to discover all resources in scope
3. **Classify resources** by service type (Functions, App Service, Storage, etc.). If unsupported compute (Container Apps) is found, **note it but do not deep-dive**; that service is planned for a future version of this skill.

**Important:** Always scope queries to the user's specified resource group or subscription. Add these filters to every Resource Graph query:
- Resource group: `| where resourceGroup =~ '<rg-name>'`
- Subscription: Use `--subscriptions <sub-id>` flag on `az graph query`
- App name: `| where name =~ '<app-name>'`
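
As an illustration of the scoping rule, the sketch below composes a query that always carries the resource-group filter and adds the app-name filter only when a name is given. `build_scope_query` and the projected columns are assumptions made for this example; the real assessment queries live in the reference files.

```shell
# build_scope_query is a hypothetical helper: it returns a Resource Graph
# (KQL) query scoped to a resource group and, optionally, an app name.
build_scope_query() {
  rg="$1"
  app="${2:-}"
  query="resources | where resourceGroup =~ '$rg'"
  if [ -n "$app" ]; then
    query="$query | where name =~ '$app'"
  fi
  printf '%s | project name, type, kind, location\n' "$query"
}

# Usage (the subscription scope goes on the CLI flag, not in the query):
#   az graph query -q "$(build_scope_query rg-example)" --subscriptions <sub-id>
```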

### Phase 2: Assess Reliability

Two-step assessment: **platform-level discovery first, then per-service deep dive.**

**Step 1 — Platform discovery (find what's there).** Use these to enumerate resources in scope and detect cross-cutting reliability gaps:

| Platform check | Reference |
|---|---|
| Zone redundancy — discovery | [references/zone-redundancy-checks.md](references/zone-redundancy-checks.md) |
| Storage redundancy (cross-service) | [references/storage-redundancy-checks.md](references/storage-redundancy-checks.md) |
| Multi-region & global load balancers | [references/multi-region-checks.md](references/multi-region-checks.md) |
| Front Door / Traffic Manager / App Insights probes | [references/health-probe-checks.md](references/health-probe-checks.md) |

**Step 2 — Per-service deep dive.** For each compute resource discovered in Step 1, load the matching service reference. The service reference is the single source of truth for that service's plan/SKU rules, assessment queries, CLI commands, IaC patches (Bicep + Terraform + AVM), and reporting hints.

This skill version ships **only the Azure Functions and App Service** per-service references. Other compute services are listed below explicitly so the dispatch logic is unambiguous: if a resource matches an unsupported row, do **not** attempt to load a reference, fabricate CLI commands, or generate IaC patches for it.

| Service detected | Reference |
|---|---|
| Azure Functions (`microsoft.web/sites` with `kind contains 'functionapp'`, `microsoft.web/serverfarms` with `kind contains 'functionapp'`) | [references/services/functions/reliability.md](references/services/functions/reliability.md) |
| Azure App Service (non-Functions sites: `microsoft.web/sites` without `kind contains 'functionapp'`, `microsoft.web/serverfarms` without `kind contains 'functionapp'`) | [references/services/app-service/reliability.md](references/services/app-service/reliability.md) |
| Azure Container Apps (`microsoft.app/containerapps`, `microsoft.app/managedenvironments`) | ⚪ Not yet shipped — planned for a future version |

> **Handling unsupported services:** If a resource matches an unsupported row above, surface it in the discovery summary, mark it as `⚪ not assessed (planned)` in the Phase 3 table, and skip the per-service remediation steps for it. Do **not** attempt to fabricate CLI commands or IaC patches for those services.
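
The dispatch table can be read as a small lookup. The sketch below is one way to express it, assuming the `type` and `kind` values come from a Resource Graph row; `reference_for` is a hypothetical name, and the fall-through branch for other resource types is an assumption.

```shell
# reference_for is a hypothetical helper mirroring the dispatch table:
# it maps a resource's type and kind to the per-service reference to load.
reference_for() {
  rtype=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  rkind=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$rtype" in
    microsoft.web/sites|microsoft.web/serverfarms)
      case "$rkind" in
        *functionapp*) echo "references/services/functions/reliability.md" ;;
        *)             echo "references/services/app-service/reliability.md" ;;
      esac ;;
    microsoft.app/containerapps|microsoft.app/managedenvironments)
      # Unsupported row: surface it, never generate fixes for it.
      echo "not assessed (planned)" ;;
    *)
      echo "no per-service reference" ;;
  esac
}
```

For example, `reference_for microsoft.web/sites "functionapp,linux"` selects the Functions reference, while a Container Apps resource only ever yields the "not assessed" marker.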

### Phase 3: Generate Reliability Checklist

Present findings as a **feature-pivoted** table: one row per reliability feature (Zone redundancy on compute, Zone-redundant storage, Health probes, Multi-region failover), with a single status indicator and the **specific resources** that are relevant to that feature. This avoids the noise of one-row-per-resource with mostly `n/a` cells. Do **not** assign numeric scores or grades.

```
🔍 Reliability Assessment — {scope}
─────────────────────────────────────────────────────────────────────────────────────────────
Reliability Feature              Status      Resources
─────────────────────────────────────────────────────────────────────────────────────────────
Zone redundancy — compute        🔴 OFF      • plan-web-ii5trxva2ark4 (P1v3)
                                              • plan-ii5trxva2ark4 (FC1)

Zone-redundant storage           🔴 GRS      • stii5trxva2ark4 (defaulted; no SKU set in IaC)

Health probes                    🔴 OFF      • func-api-ii5trxva2ark4 — needs code change (FC1)
                                              • app-web-ii5trxva2ark4 — no health check path

Multi-region failover            🔴 OFF      • Single region (eastus) only — Front Door not configured
─────────────────────────────────────────────────────────────────────────────────────────────

Want me to fix the 🔴 items? I'll do the quick wins first (App
plan zone redundancy + health checks on supported plans), then ask before
storage migration and multi-region setup. (yes/no)
```

**Rules for the table:**

- **Four feature rows, in this order:** Zone redundancy — compute · Zone-redundant storage · Health probes · Multi-region failover. Omit a row entirely only if no resource in scope could ever apply to it.
- **Status column** is one symbol + one short word, no other characters:
  - `🟢 ON` — feature is fully enabled across all relevant resources in scope
  - `🟡 PARTIAL` — some resources have it, some don't (or partial config like liveness-only)
  - `🔴 OFF` — feature is missing on all relevant resources
  - For storage, replace `OFF` with the current SKU when relevant (`🔴 LRS`, `🔴 GRS`, `🟢 ZRS`, `🟢 GZRS`). When no SKU is set in IaC, label as `🔴 GRS` (ARM/AVM default) and note that in the resource line.
- **Resources column** lists only what's relevant to that feature, one bullet per resource:
  - For "needs fixing" resources, include a short inline reason (`(FC1)`, `(defaulted; no SKU set)`, `liveness only`, `needs code change (FC1)`).
  - For resources that are **already ON** for that feature, mention them on the same row with `— already ON` so the user sees credit for what's right.
- **Do not** include `n/a`, `—`, or empty cells. If a feature doesn't apply to any resource in scope, drop the row.
- **Do not** include numeric scores, grades, or point totals.
- End the assessment with a **single yes/no question** that kicks off the staged remediation flow. Do not enumerate the per-resource fix list here — the user will see it after they say yes (Configuration Workflow Step 1).
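
The status rules above reduce to a simple roll-up. The sketch below shows one possible encoding; `feature_status` and `storage_status` are hypothetical helpers, and the counts are assumed to come from the Phase 2 results.

```shell
# feature_status: hypothetical helper that rolls per-resource results up
# into one status cell. Arguments: <resources with feature on> <relevant resources>.
feature_status() {
  on="$1"; total="$2"
  if [ "$total" -eq 0 ]; then
    return 1            # no relevant resources: the row is dropped, not printed
  elif [ "$on" -eq "$total" ]; then
    echo "🟢 ON"
  elif [ "$on" -eq 0 ]; then
    echo "🔴 OFF"
  else
    echo "🟡 PARTIAL"
  fi
}

# storage_status: the storage row shows the current SKU instead of ON/OFF.
storage_status() {
  case "$1" in
    Standard_ZRS)    echo "🟢 ZRS" ;;
    Standard_GZRS)   echo "🟢 GZRS" ;;
    Standard_LRS)    echo "🔴 LRS" ;;
    ""|Standard_GRS) echo "🔴 GRS" ;;   # empty = no SKU set in IaC, labelled as the default
    *)               return 1 ;;        # SKUs the rules above do not cover
  esac
}
```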

> **UX Note:** If the assessment finds the app **already has** all core reliability features (zone redundancy, ZRS/GZRS storage, health probes), skip the fix-it question and jump straight to Configuration Workflow [Step 3](#step-3-both-paths-multi-region-follow-up--ask-and-wait) (Multi-region follow-up). Do **NOT** start any multi-region work without explicit consent.

## Configuration Workflow

When the user wants to **fix** findings from the assessment:

> **⛔ ALWAYS confirm with user before executing changes.** Show what will change, any cost implications, and any destructive actions (e.g., environment recreation).

### Step 1: Present Fix Plan + Choose Path

After the assessment, if the user says "fix it" / "improve my reliability" / "enable zone redundancy":

1. List each fixable finding with the specific action
2. Flag any cost implications or breaking changes
3. **Ask user which path they want:**

```
I'll start with the quick wins (no downtime, fast):

1. ✏️  Enable zone redundancy on plan-ii5trxva2ark4 (Flex Consumption — no cost change)
2. ✏️  Set health check path to /api/health on func-api-ii5trxva2ark4

Then, separately, I'll ask if you want to upgrade storage:

3. 🕒  Upgrade stii5trxva2ark4 from LRS → ZRS (small cost increase, migration takes hours)
   — Required for full zone redundancy, but I'll confirm with you before starting.

How would you like to apply these changes?

  A) Fix now — Run az CLI commands against your live resources (immediate, one-time)
  B) Patch my IaC — Update your Bicep/Terraform files so changes persist across deploys

(If you use azd or Terraform, option B is recommended so `azd up` won't overwrite changes.)
```

### Path A: Fix Now (CLI)

Run fixes against live resources using `az` CLI commands. **Quick wins first, then ask before the slow storage migration.**

The exact CLI commands per service live in the per-service references — pick the one(s) matching the resources discovered in Phase 2:

| Fix | Reference |
|---|---|
| Enable zone redundancy / configure health probes (Functions) | [references/services/functions/reliability.md](references/services/functions/reliability.md) |
| Enable zone redundancy / configure health probes (App Service) | [references/services/app-service/reliability.md](references/services/app-service/reliability.md) |
| Upgrade storage replication (cross-service) | [references/configure-storage.md](references/configure-storage.md) |
| Set up multi-region (cross-service) | [references/configure-multi-region.md](references/configure-multi-region.md) |
| Platform overview / verification | [references/configure-zone-redundancy.md](references/configure-zone-redundancy.md), [references/configure-health-probes.md](references/configure-health-probes.md) |

**Execution order — always quick wins first:**

1. **Zone redundancy on compute** (fast, in-place property update on the App's plan).
2. **Health probes** (Premium / Dedicated only — in-place; for FC1 / Consumption, follow the consent gate in [configure-health-probes.md](references/configure-health-probes.md)).
3. **Verify** the compute changes succeeded before doing anything else.
4. **⛔ STOP — Ask about storage upgrade.** Compute is now zone-redundant, but storage may still be LRS or GRS. Ask the user explicitly:

   ```
   ✅ Compute is now zone-redundant.

   To be **fully zone-redundant**, your storage account also needs to be upgraded:
     • stii5trxva2ark4: currently `Standard_LRS` → needs `Standard_ZRS`

   ⚠️  This is a live storage redundancy conversion:
      • Takes hours to days depending on data volume
      • Small ongoing cost increase (~$0.01/GB/month more)
      • Only supported for Standard general-purpose v2 accounts

   Do you want me to start the storage migration now? (yes / no / later)
   ```

   - **yes** → run `az storage account update --sku Standard_ZRS` (or `az storage account migration start` if a conversion request is needed); poll `az storage account show --query sku.name` until it reports `Standard_ZRS`.
   - **no / later** → leave storage as-is; note in the re-assessment that ZR storage remains a gap.

5. **Multi-region** — do NOT auto-run. Handled in **Step 3** below as an explicit follow-up after re-assessment.

> **⚠️ Warning:** If the user uses `azd up` or `terraform apply` later, CLI-only changes may be overwritten by the IaC definitions. Recommend also patching IaC after CLI fixes.

### Path B: Patch IaC

Update the user's Bicep or Terraform files so reliability settings are persistent.

**Step 1: Detect IaC type**
1. Look for `infra/` folder in project root
2. If not found, check project root for `*.bicep` or `*.tf` files
3. If still not found, ask user: "Where are your IaC files located?"
4. Check for `*.bicep` files → use Bicep patching
5. Check for `*.tf` files → use Terraform patching
6. If both exist, ask user which to patch
7. If no IaC exists, fall back to Path A (CLI) and inform user
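
The detection steps can be sketched as a single function. This is a minimal version that assumes a `find` supporting `-maxdepth` (GNU and BSD both do); `detect_iac` is a hypothetical name, and asking the user remains the fallback when it prints `none` or `both`.

```shell
# detect_iac: hypothetical helper. Prints bicep, terraform, both, or none
# for a project root, searching infra/ when that folder exists.
detect_iac() {
  root="${1:-.}"
  dir="$root"
  if [ -d "$root/infra" ]; then
    dir="$root/infra"
  fi
  bicep=$(find "$dir" -maxdepth 2 -name '*.bicep' | head -n 1)
  tf=$(find "$dir" -maxdepth 2 -name '*.tf' | head -n 1)
  if [ -n "$bicep" ] && [ -n "$tf" ]; then
    echo both          # ask the user which one to patch
  elif [ -n "$bicep" ]; then
    echo bicep
  elif [ -n "$tf" ]; then
    echo terraform
  else
    echo none          # ask for the location, or fall back to Path A
  fi
}
```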

**Step 2: Classify each fix by risk level**

| Fix | Risk Level | What Happens |
|-----|-----------|--------------|
| Zone redundancy (App plan) | 🟢 Safe patch | In-place property update on next deploy |
| Storage LRS → ZRS | 🟡 Pre-migration required | Live storage migration must complete before the IaC SKU change can deploy. **Never bundle with safe patches** — use the two-deploy flow in Steps 3–5. |
| Health check path (Basic/Standard/Premium / Dedicated) | 🟢 Safe patch | In-place update, but causes app restart |
| Health check path (FC1 / Consumption) | ⚪ Code-only — ask first | `healthCheckPath` is unsupported. Adding a health endpoint requires adding an HTTP-triggered `/api/health` function to **app code**. **Always ask the user for explicit consent before touching source code.** Do **not** patch IaC. |
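
To make the health-check rows concrete, the sketch below classifies a plan SKU into the two cases from the table. `health_check_mode` is a hypothetical helper, and the SKU-name patterns (for example `P1v3` or `EP1` for Premium) are an assumption about how the tiers appear in query results; anything unrecognized is reported rather than guessed.

```shell
# health_check_mode: hypothetical helper. Given a plan SKU name, reports
# whether the health check is a property patch or needs an app-code change.
health_check_mode() {
  case "$1" in
    FC1|Y1)
      echo "code-only" ;;    # healthCheckPath unsupported: ask before touching source
    B[0-9]*|S[0-9]*|P[0-9]*|EP[0-9]*)
      echo "safe-patch" ;;   # in-place update, causes an app restart
    *)
      echo "unknown" ;;      # check the per-service reference
  esac
}
```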

**Step 3: Apply patches in two deploys (quick wins first)**

The IaC patching framework (detection, AVM-module guidance, deploy-order rule, storage SKU patch) lives in:

| IaC Type | Framework reference |
|---|---|
| Bicep | [references/iac-patching-bicep.md](references/iac-patching-bicep.md) |
| Terraform | [references/iac-patching-terraform.md](references/iac-patching-terraform.md) |

The actual **per-service compute patches** (Function App plan ZR, App Service Plan ZR, etc.) live in the per-service references — load the matching service file from Phase 2 for the exact Bicep / Terraform / AVM snippets. Only Azure Functions and App Service have per-service references in this skill version; Container Apps is out of scope.

**Deploy 1 — Quick wins only.** Patch the 🟢 Safe items (zone redundancy on the App Service/Function App plan, health probes on Basic/Standard/Premium / Dedicated). Do **NOT** include the storage SKU patch in this deploy.

After patching, **the skill runs the deploy itself** (do not stop and tell the user to run it). Detect the deployment tool and confirm once before executing:

```
📦 Patches applied to your IaC. Ready to deploy:
   Tool detected: azd (found azure.yaml)
   Command:       azd up

Proceed with deployment? (yes / no)
```

On **yes**, run the appropriate command, stream output back to the user, and continue to the next step on success:
- AZD project (has `azure.yaml`): `azd up`
- Bicep-only: `az deployment group create --resource-group <rg> --template-file infra/main.bicep --parameters @infra/main.parameters.json`
- Terraform: `terraform plan -out tfplan` → (show plan summary) → `terraform apply tfplan`

On **no**, stop and report the patched files; do not proceed to Step 4 / Re-Assess.

If deployment fails, surface the error and stop — do not continue to the storage step.
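
Tool detection for the confirmation prompt might look like the sketch below. `deploy_command` is a hypothetical helper that only prints the command for the user to confirm and does not run it; the `infra/main.bicep` path follows the Bicep-only command shown above.

```shell
# deploy_command: hypothetical helper. Prints the deploy command that the
# confirmation prompt would show for a project root and resource group.
deploy_command() {
  root="${1:-.}"; rg="${2:-<rg>}"
  if [ -f "$root/azure.yaml" ]; then
    echo "azd up"
  elif [ -f "$root/infra/main.bicep" ]; then
    echo "az deployment group create --resource-group $rg --template-file infra/main.bicep --parameters @infra/main.parameters.json"
  elif [ -n "$(find "$root" -maxdepth 2 -name '*.tf' | head -n 1)" ]; then
    echo "terraform plan -out tfplan && terraform apply tfplan"
  else
    return 1   # nothing detected: ask the user instead of guessing
  fi
}
```

Checking `azure.yaml` first matches the order of the list above, so an azd project that also contains Bicep files is still deployed with `azd up`.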

**⛔ STOP — Ask about storage upgrade before Deploy 2.** After Deploy 1 succeeds, ask the user explicitly:

```
✅ Quick-win patches deployed. Compute is now zone-redundant.

To be **fully zone-redundant**, your storage account also needs to be upgraded:
  • stii5trxva2ark4: currently `Standard_LRS` → needs `Standard_ZRS`

⚠️  This is a two-part change:
   1. Live storage migration (`az storage account migration start`) — takes hours to days
   2. A second deploy to update your IaC's storage SKU to match

Do you want me to start the storage migration now? (yes / no / later)
```

- **yes** → the skill runs the migration command itself, polls until complete, then patches the storage SKU in IaC and runs **Deploy 2** (now a no-op confirmation). The user does not need to run anything manually.
- **no / later** → leave the storage SKU patch unapplied. Note in the re-assessment that ZR storage remains a gap; suggest revisiting later.

**Step 4: Storage migration (only if user said yes in Step 3)**

The skill runs these commands itself — do not ask the user to run them. Show progress as you go:

```
🔄 Starting storage migration (this can take up to 72 hours)...

   az storage account migration start --account-name stii5trxva2ark4 \
     --resource-group rg-example --sku Standard_ZRS --no-wait

   Polling: az storage account show --name stii5trxva2ark4 --query sku.name
   ...
   ✅ Migration complete: sku.name = Standard_ZRS
```

For very long migrations, you may surface a checkpoint to the user ("this is still running, check back later") rather than blocking the entire conversation.
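
A bounded polling loop is one way to combine the migration command with the checkpoint idea. In the sketch below, `migrate_storage_to_zrs`, the polling interval, and the attempt limit are all hypothetical choices; the account is passed with `--account-name`, which is worth confirming against `az storage account migration start --help` for the installed CLI version.

```shell
# migrate_storage_to_zrs: hypothetical helper. Starts the conversion, then
# polls the SKU until it reads Standard_ZRS or the attempts run out.
# Arguments: <account> <resource-group> [poll-seconds] [max-attempts]
migrate_storage_to_zrs() {
  account="$1"; rg="$2"; interval="${3:-300}"; attempts="${4:-12}"
  az storage account migration start --account-name "$account" \
    --resource-group "$rg" --sku Standard_ZRS --no-wait || return 1
  i=0
  while [ "$i" -lt "$attempts" ]; do
    sku=$(az storage account show --name "$account" --resource-group "$rg" \
      --query sku.name --output tsv)
    if [ "$sku" = "Standard_ZRS" ]; then
      echo "Migration complete: sku.name = $sku"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  # Checkpoint instead of blocking: the caller can report and resume later.
  echo "Still migrating; check back later" >&2
  return 2
}
```

A return code of 2 distinguishes "still running" from a failed start (1), so the caller can surface the checkpoint message without treating it as an error.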

**Step 5: Deploy 2 — storage SKU patch**

After the migration completes, the skill patches the storage SKU in IaC and runs the same deploy command as Step 3 (e.g. `azd up`). This deploy is a no-op confirmation that the IaC matches the live state. Confirm once with the user before executing, then run it directly.

### Step 2 (both paths): Re-Assess

After changes are applied (CLI) or deployed (IaC), automatically re-run the assessment and show the **same feature-pivoted table** as Phase 3, with each feature row's status updated to reflect the new state. Briefly call out what changed since the previous run.

```
🔄 Reliability Re-Assessment — rg-eventhubs-python-jan13 (eastus)
───────────────────────────────────────────────────────────────────────────────────────
Reliability Feature              Status      Resources
───────────────────────────────────────────────────────────────────────────────────────
Zone redundancy — compute        🟢 ON       • plan-ii5trxva2ark4 (FC1)              — now ON
                                             • plan-web-ii5trxva2ark4 (P1v3)         — now ON

Zone-redundant storage           🟢 ZRS      • stii5trxva2ark4                       — GRS → ZRS

Health probes                    🟡 PARTIAL  • func-api-ii5trxva2ark4                — still off (FC1, code change declined)
                                             • app-web-ii5trxva2ark4                 — now ON

Multi-region failover            🔴 OFF      • Single region (eastus) only
───────────────────────────────────────────────────────────────────────────────────────

What changed: Function App and App Service plan zone redundancy, storage replication and health probes on App Service.
(Multi-region offered next — see Step 3.)
```

### Step 3 (both paths): Multi-region follow-up — ASK and WAIT

Multi-region is a significant cost/complexity step. Do **NOT** start it automatically. After re-assessment, only if **all core single-region reliability features are 🟢 ON** (zone-redundant compute, ZRS/GZRS storage, health probes), explicitly ask the user and **wait for their response** before doing anything:

```
🟢 Your app is now fully zone-redundant in {region}.

The next step (optional) is multi-region failover with Azure Front Door:
   • Deploys compute + storage in a second region (paired region recommended)
   • Adds Azure Front Door for global load balancing with health-probe-driven failover
   • Protects against full region outages
   • Estimated additional cost: ~2x compute (active-passive); Front Door ~$35/month base

Do you want me to set up multi-region failover now? (yes / no / later)
```

- **yes** → proceed with [references/configure-multi-region.md](references/configure-multi-region.md). Confirm secondary region choice with the user, then:
  1. Generate the multi-region IaC (Bicep / Terraform additions for the secondary region + Front Door).
  2. Confirm once with the user: `📦 Multi-region IaC generated. Ready to deploy with \`azd up\`. Proceed? (yes / no)`
  3. On **yes**, **the skill runs the deploy itself** (`azd up` / `az deployment group create` / `terraform apply`) and streams output. Do not stop and tell the user to run it.
  4. After successful deploy, run a final re-assessment so the user sees Multi-region failover flip to 🟢 ON.
- **no / later** → leave the deployment as-is. Note that single-region zone-redundant is a reliable end state; multi-region can be revisited anytime.

> **⛔ Do not skip the wait.** Do not generate multi-region IaC, deploy a Front Door, or modify any files until the user has explicitly said yes. If core reliability is not yet all 🟢, do **not** ask about multi-region — finish the core gaps first.
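
The gating rule can be expressed as a check over the three core status cells from the re-assessment. This is a sketch under that assumption; `may_offer_multi_region` is a hypothetical helper that decides only whether to ask, never whether to proceed.

```shell
# may_offer_multi_region: hypothetical helper. Succeeds only when every core
# single-region status cell passed to it is green.
may_offer_multi_region() {
  for status in "$@"; do
    case "$status" in
      "🟢 "*) ;;          # ON, ZRS, or GZRS
      *) return 1 ;;      # any red or partial row: finish core gaps first
    esac
  done
  [ "$#" -gt 0 ]          # no rows at all is not a pass
}
```

With the re-assessment example above, `may_offer_multi_region "🟢 ON" "🟢 ZRS" "🟡 PARTIAL"` fails, so the multi-region question is not asked yet.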

## Priority Classification

| Priority | Criteria | Action |
|---|---|---|
| Critical | No zone redundancy AND production workload | Fix immediately |
| High | LRS storage on zone-redundant compute | Fix within days |
| Medium | No multi-region (single region but zone-redundant) | Plan for next sprint |
| Low | Missing health probes or monitoring gaps | Track and fix |

## Error Handling

| Error | Message | Remediation |
|---|---|---|
| Authentication required | "Please login" | Run `az login` and retry |
| Access denied | "Forbidden" | Confirm Reader/Contributor role assignment |
| Plan doesn't support ZR | "Upgrade required" | Inform user of plan upgrade path + cost delta |
| Region doesn't support AZ | "Region limitation" | Suggest supported regions |

## Best Practices

- Run reliability assessments after every significant infrastructure change
- Test failover scenarios periodically (at least quarterly)

## Skill Boundaries

| Action | This skill does | Hand off to |
|---|---|---|
| Assess reliability posture | ✅ Yes | — |
| Recommend improvements | ✅ Yes | — |
| Enable zone redundancy (CLI commands) | ✅ Yes | — |
| Patch Bicep/Terraform for reliability | ✅ Yes | — |
| Generate multi-region IaC | ✅ Yes (additions for the secondary region + Front Door) | `azure-prepare` for full new-app IaC scaffolding |
| Deploy IaC for reliability changes | ✅ Yes (runs `azd up` / `terraform apply` / `az deployment` itself, after user confirmation) | `azure-deploy` for general/non-reliability deploys |
| Validate pre-deployment | Reliability checks only | `azure-validate` for full validation |

references/

configure-health-probes.md 2.1 KB

# Configure Health Probes — Platform Notes

## What "health probe" means per service

| Service | Mechanism | Where |
|---|---|---|
| App Service (Basic / Standard / Premium / Dedicated) | `siteConfig.healthCheckPath` (platform health check) | [services/app-service/reliability.md](services/app-service/reliability.md) |
| Functions Premium / Dedicated | `siteConfig.healthCheckPath` (platform health check) | [services/functions/reliability.md](services/functions/reliability.md) |
| Functions Flex Consumption (FC1) / Consumption (Y1) | HTTP-triggered `/api/health` function in **app code** — `healthCheckPath` is unsupported | [services/functions/reliability.md](services/functions/reliability.md) |
| Azure Front Door | `healthProbeSettings` on origin group | [health-probe-checks.md](health-probe-checks.md) |
| Traffic Manager | `monitorConfig` on profile | [health-probe-checks.md](health-probe-checks.md) |

> Container Apps (`liveness` / `readiness` probes) deep-dive references are planned for a future version of this skill but are not yet shipped.

## ⛔ STOP — Code-only fixes require user consent

For any case where enabling health probing requires **modifying app source code** rather than IaC — most notably Functions on FC1 / Y1, and Container Apps where the image doesn't already serve a `/health` route — **always ask the user for explicit consent before touching source files**. The exact prompt and decision tree are documented in the relevant per-service reference.

Do not generate or modify code without an explicit yes.

## Best Practices for Health Endpoints

These apply to any service:

1. **Keep health endpoints lightweight** — return 200 quickly; don't run heavy DB or downstream-dependency queries on every probe.
2. **Use anonymous auth** — platform health probes can't pass auth tokens.
3. **Two endpoints, not one** — a fast `/health` for the load balancer, and an optional `/health/deep` for on-call diagnostics.
4. **For Container Apps, both liveness AND readiness** — liveness alone restarts the container without taking it out of rotation first.
5. **Test the endpoint** before relying on it: `curl https://<app-url>/api/health`.
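
The endpoint test in item 5 can be made a little stricter by checking the status code and response time. The sketch below assumes `curl` is available; `probe_health` and the 5-second default limit are hypothetical choices, not platform values.

```shell
# probe_health: hypothetical helper. Calls the endpoint anonymously and
# succeeds only on HTTP 200 within the time limit (default 5 seconds).
probe_health() {
  url="$1"; limit="${2:-5}"
  code=$(curl --silent --output /dev/null --max-time "$limit" \
    --write-out '%{http_code}' "$url") || code="000"
  if [ "$code" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy (HTTP $code)" >&2
    return 1
  fi
}

# Usage: probe_health "https://<app-url>/api/health"
```

No credentials are sent, which matches the anonymous-auth guidance: an endpoint that returns 401 here would also fail a platform probe.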

configure-multi-region.md 18.5 KB

# Configure Multi-Region — Active-Passive with Azure Front Door

## When to Use

Use this reference when:
- Core single-region reliability is already in place (zone-redundant compute, ZRS storage, health probes) and the user wants to go further
- User explicitly asks for "multi-region", "global reliability", or "region failover"
- User wants protection against a full Azure region outage

## Prerequisites

Before enabling multi-region:
1. App must already be zone-redundant in the primary region with ZRS/GZRS storage and health probes configured
2. App should have a health endpoint (`/api/health` or similar)
3. User must choose a secondary region (suggest paired region)

## Workflow

### Step 1: Gather Information

Ask the user:
```
To set up multi-region active-passive failover, I need:

1. Secondary region — Where should the standby deployment go?
   Suggested: [paired region for primary] (e.g., eastus2 → centralus); if the pair lacks availability zones, a nearby zone-enabled region (e.g., westus2 → westus3)

2. Pattern — Active-Passive (recommended, lower cost) or Active-Active?

3. Health endpoint — What path should Front Door probe?
   Default: /api/health
```

### Step 2: Choose Path (CLI vs IaC)

Same dual-path as zone redundancy:
- **Path A (Fix now):** Deploy secondary resources via CLI + create Front Door
- **Path B (Patch IaC):** Add secondary region module + Front Door to Bicep/Terraform

**Recommend Path B** for multi-region — it's complex enough that IaC is essential for maintainability.

### Step 3: What Gets Created

| Resource | Primary Region | Secondary Region |
|----------|---------------|-----------------|
| Resource Group | Existing | New (same name + `-secondary`) |
| App Service Plan | Existing (ZR) | New (ZR, same SKU) |
| Function App / Web App | Existing | New (same code, same config) |
| Storage Account | Existing (ZRS) | New (ZRS) |
| Event Hubs / other deps | Existing | Depends on service (some are global) |
| Azure Front Door | Global (new) | — |
| Managed Identity | Existing | New |

---

## Bicep: Full Multi-Region Module

### Add to `infra/main.bicep`:

```bicep
// ===== MULTI-REGION RELIABILITY =====

@description('Enable multi-region active-passive deployment')
param multiRegionEnabled bool = false

@description('Secondary region for failover deployment')
param secondaryLocation string = 'centralus'

@description('Health check path for Front Door probes')
param healthCheckPath string = '/api/health'

var secondaryResourceToken = toLower(uniqueString(subscription().id, environmentName, secondaryLocation))

// Secondary resource group
resource rgSecondary 'Microsoft.Resources/resourceGroups@2021-04-01' = if (multiRegionEnabled) {
  name: '${rg.name}-secondary'
  location: secondaryLocation
  tags: tags
}

// Secondary storage account
module storageSecondary 'br/public:avm/res/storage/storage-account:0.13.2' = if (multiRegionEnabled) {
  name: 'storage-secondary'
  scope: rgSecondary
  params: {
    name: 'st${secondaryResourceToken}'
    location: secondaryLocation
    tags: tags
    kind: 'StorageV2'
    skuName: 'Standard_ZRS'
    allowBlobPublicAccess: false
    allowSharedKeyAccess: false
    blobServices: {
      containers: [
        {
          name: deploymentStorageContainerName
        }
      ]
    }
  }
}

// Secondary managed identity
module apiUserAssignedIdentitySecondary 'br/public:avm/res/managed-identity/user-assigned-identity:0.4.1' = if (multiRegionEnabled) {
  name: 'apiUserAssignedIdentity-secondary'
  scope: rgSecondary
  params: {
    name: '${abbrs.managedIdentityUserAssignedIdentities}api-${secondaryResourceToken}'
    location: secondaryLocation
    tags: tags
  }
}

// Secondary App Service Plan (zone redundant)
module appServicePlanSecondary 'br/public:avm/res/web/serverfarm:0.5.0' = if (multiRegionEnabled) {
  name: 'appserviceplan-secondary'
  scope: rgSecondary
  params: {
    name: '${abbrs.webServerFarms}${secondaryResourceToken}'
    location: secondaryLocation
    tags: tags
    skuName: 'FC1'
    reserved: true
    zoneRedundant: true
  }
}

// Secondary Function App
module apiSecondary './app/api.bicep' = if (multiRegionEnabled) {
  name: 'api-secondary'
  scope: rgSecondary
  params: {
    name: '${abbrs.webSitesFunctions}api-${secondaryResourceToken}'
    location: secondaryLocation
    tags: tags
    applicationInsightsName: monitoring.outputs.name
    appServicePlanId: appServicePlanSecondary.outputs.resourceId
    runtimeName: 'node'
    runtimeVersion: '22'
    storageAccountName: storageSecondary.outputs.name
    enableBlob: true
    enableQueue: false
    enableTable: true
    deploymentStorageContainerName: deploymentStorageContainerName
    identityId: apiUserAssignedIdentitySecondary.outputs.resourceId
    identityClientId: apiUserAssignedIdentitySecondary.outputs.clientId
    virtualNetworkSubnetId: ''
    eventHubNamespaceName: eventHubs.outputs.eventHubNamespaceName
    inputEventHubName: 'input-events'
    outputEventHubName: 'output-events'
    appSettings: []
  }
}

// Azure Front Door for global load balancing
module frontDoor './app/front-door.bicep' = if (multiRegionEnabled) {
  name: 'front-door'
  scope: rg
  params: {
    name: 'afd-${resourceToken}'
    tags: tags
    primaryAppHostName: '${api.outputs.SERVICE_API_NAME}.azurewebsites.net'
    secondaryAppHostName: multiRegionEnabled ? '${abbrs.webSitesFunctions}api-${secondaryResourceToken}.azurewebsites.net' : ''
    healthCheckPath: healthCheckPath
  }
}

// Outputs for multi-region
output FRONT_DOOR_ENDPOINT string = multiRegionEnabled ? frontDoor.outputs.endpoint : ''
output SECONDARY_REGION string = multiRegionEnabled ? secondaryLocation : ''
output SECONDARY_FUNCTION_APP string = multiRegionEnabled ? '${abbrs.webSitesFunctions}api-${secondaryResourceToken}' : ''
```

### Create `infra/app/front-door.bicep`:

```bicep
@description('Name of the Front Door profile')
param name string

@description('Tags for the resource')
param tags object = {}

@description('Primary app hostname')
param primaryAppHostName string

@description('Secondary app hostname')
param secondaryAppHostName string

@description('Health check path')
param healthCheckPath string = '/api/health'

resource frontDoor 'Microsoft.Cdn/profiles@2024-02-01' = {
  name: name
  location: 'global'
  tags: tags
  sku: {
    name: 'Standard_AzureFrontDoor'
  }
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2024-02-01' = {
  parent: frontDoor
  name: '${name}-endpoint'
  location: 'global'
  properties: {
    enabledState: 'Enabled'
  }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2024-02-01' = {
  parent: frontDoor
  name: 'app-origins'
  properties: {
    healthProbeSettings: {
      probePath: healthCheckPath
      probeProtocol: 'Https'
      probeRequestType: 'GET'
      probeIntervalInSeconds: 30
    }
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
  }
}

resource primaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
  parent: originGroup
  name: 'primary'
  properties: {
    hostName: primaryAppHostName
    originHostHeader: primaryAppHostName
    priority: 1
    weight: 1000
    httpPort: 80
    httpsPort: 443
    enabledState: 'Enabled'
  }
}

resource secondaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
  parent: originGroup
  name: 'secondary'
  properties: {
    hostName: secondaryAppHostName
    originHostHeader: secondaryAppHostName
    priority: 2
    weight: 1000
    httpPort: 80
    httpsPort: 443
    enabledState: 'Enabled'
  }
}

resource route 'Microsoft.Cdn/profiles/afdEndpoints/routes@2024-02-01' = {
  parent: endpoint
  name: 'default-route'
  properties: {
    originGroup: {
      id: originGroup.id
    }
    supportedProtocols: [
      'Http'
      'Https'
    ]
    patternsToMatch: [
      '/*'
    ]
    forwardingProtocol: 'HttpsOnly'
    httpsRedirect: 'Enabled'
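    linkToDefaultDomain: 'Enabled'
  }
  // Origins must exist before the route is created; otherwise the origin group is still empty.
  dependsOn: [
    primaryOrigin
    secondaryOrigin
  ]
}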

output endpoint string = 'https://${endpoint.properties.hostName}'
output frontDoorId string = frontDoor.id
```

---

## Terraform: Multi-Region Module

### Add to `infra/main.tf`:

```hcl
variable "multi_region_enabled" {
  description = "Enable multi-region active-passive deployment"
  type        = bool
  default     = false
}

variable "secondary_location" {
  description = "Secondary region for failover"
  type        = string
  default     = "centralus"
}

variable "health_check_path" {
  description = "Health check path for Front Door probes"
  type        = string
  default     = "/api/health"
}

# Secondary resource group
resource "azurerm_resource_group" "secondary" {
  count    = var.multi_region_enabled ? 1 : 0
  name     = "${azurerm_resource_group.rg.name}-secondary"
  location = var.secondary_location
  tags     = local.tags
}

# Secondary storage account
resource "azurerm_storage_account" "secondary" {
  count                    = var.multi_region_enabled ? 1 : 0
  name                     = "st${random_string.secondary_token.result}"
  resource_group_name      = azurerm_resource_group.secondary[0].name
  location                 = var.secondary_location
  account_tier             = "Standard"
  account_replication_type = "ZRS"
  tags                     = local.tags
}

# Secondary App Service Plan (zone redundant)
resource "azurerm_service_plan" "secondary" {
  count                  = var.multi_region_enabled ? 1 : 0
  name                   = "plan-${random_string.secondary_token.result}"
  location               = var.secondary_location
  resource_group_name    = azurerm_resource_group.secondary[0].name
  os_type                = "Linux"
  sku_name               = "FC1"
  zone_balancing_enabled = true
  tags                   = local.tags
}

# Secondary Function App
resource "azurerm_linux_function_app" "secondary" {
  count                      = var.multi_region_enabled ? 1 : 0
  name                       = "func-api-${random_string.secondary_token.result}"
  location                   = var.secondary_location
  resource_group_name        = azurerm_resource_group.secondary[0].name
  service_plan_id            = azurerm_service_plan.secondary[0].id
  storage_account_name       = azurerm_storage_account.secondary[0].name
  storage_account_access_key = azurerm_storage_account.secondary[0].primary_access_key
  tags                       = local.tags

  site_config {
    application_stack {
      node_version = "22"
    }
  }
}

# Azure Front Door
resource "azurerm_cdn_frontdoor_profile" "main" {
  count               = var.multi_region_enabled ? 1 : 0
  name                = "afd-${random_string.token.result}"
  resource_group_name = azurerm_resource_group.rg.name
  sku_name            = "Standard_AzureFrontDoor"
  tags                = local.tags
}

resource "azurerm_cdn_frontdoor_endpoint" "main" {
  count                    = var.multi_region_enabled ? 1 : 0
  name                     = "afd-${random_string.token.result}-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main[0].id
}

resource "azurerm_cdn_frontdoor_origin_group" "main" {
  count                    = var.multi_region_enabled ? 1 : 0
  name                     = "app-origins"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.main[0].id

  health_probe {
    path                = var.health_check_path
    protocol            = "Https"
    request_type        = "GET"
    interval_in_seconds = 30
  }

  load_balancing {
    sample_size                        = 4
    successful_samples_required        = 3
    additional_latency_in_milliseconds = 50
  }
}

resource "azurerm_cdn_frontdoor_origin" "primary" {
  count                          = var.multi_region_enabled ? 1 : 0
  name                           = "primary"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main[0].id
  host_name                      = azurerm_linux_function_app.main.default_hostname
  origin_host_header             = azurerm_linux_function_app.main.default_hostname
  certificate_name_check_enabled = true
  priority                       = 1
  weight                         = 1000
  https_port                     = 443
  enabled                        = true
}

resource "azurerm_cdn_frontdoor_origin" "secondary" {
  count                          = var.multi_region_enabled ? 1 : 0
  name                           = "secondary"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main[0].id
  host_name                      = azurerm_linux_function_app.secondary[0].default_hostname
  origin_host_header             = azurerm_linux_function_app.secondary[0].default_hostname
  certificate_name_check_enabled = true
  priority                       = 2
  weight                         = 1000
  https_port                     = 443
  enabled                        = true
}

resource "azurerm_cdn_frontdoor_route" "main" {
  count                          = var.multi_region_enabled ? 1 : 0
  name                           = "default-route"
  cdn_frontdoor_endpoint_id      = azurerm_cdn_frontdoor_endpoint.main[0].id
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.main[0].id
  cdn_frontdoor_origin_ids       = [
    azurerm_cdn_frontdoor_origin.primary[0].id,
    azurerm_cdn_frontdoor_origin.secondary[0].id
  ]
  supported_protocols            = ["Http", "Https"]
  patterns_to_match              = ["/*"]
  forwarding_protocol            = "HttpsOnly"
  https_redirect_enabled         = true
  link_to_default_domain         = true
}
```

---

## CLI: Quick Setup (Path A)

For users who want to deploy multi-region without IaC:

```bash
# Variables
PRIMARY_RG="rg-reliability-test"
SECONDARY_LOCATION="centralus"
SECONDARY_RG="${PRIMARY_RG}-secondary"
PRIMARY_APP="func-api-32mpw2gtw7lye"
RESOURCE_TOKEN=$(openssl rand -hex 6)
SECONDARY_APP="func-api-${RESOURCE_TOKEN}"
FRONT_DOOR_NAME="afd-${RESOURCE_TOKEN}"

# Step 1: Create secondary RG
az group create --name $SECONDARY_RG --location $SECONDARY_LOCATION

# Step 2: Create secondary storage (ZRS)
az storage account create \
  --name "st${RESOURCE_TOKEN}" \
  --resource-group $SECONDARY_RG \
  --location $SECONDARY_LOCATION \
  --sku Standard_ZRS \
  --kind StorageV2

# Step 3: Create secondary plan (zone redundant)
az functionapp plan create \
  --name "plan-${RESOURCE_TOKEN}" \
  --resource-group $SECONDARY_RG \
  --location $SECONDARY_LOCATION \
  --sku FC1 \
  --is-linux true

# Step 4: Create secondary function app
az functionapp create \
  --name $SECONDARY_APP \
  --resource-group $SECONDARY_RG \
  --plan "plan-${RESOURCE_TOKEN}" \
  --storage-account "st${RESOURCE_TOKEN}" \
  --runtime node \
  --runtime-version 22 \
  --functions-version 4

# Step 5: Deploy code to secondary (same zip as primary)
# az functionapp deployment source config-zip ...

# Step 6: Create Front Door
az afd profile create \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --sku Standard_AzureFrontDoor

# Step 7: Create endpoint
az afd endpoint create \
  --endpoint-name "${FRONT_DOOR_NAME}-endpoint" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG

# Step 8: Create origin group with health probe
az afd origin-group create \
  --origin-group-name "app-origins" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --probe-path "/api/health" \
  --probe-protocol Https \
  --probe-request-type GET \
  --probe-interval-in-seconds 30 \
  --sample-size 4 \
  --successful-samples-required 3 \
  --additional-latency-in-milliseconds 50

# Step 9: Add primary origin (priority 1)
az afd origin create \
  --origin-name "primary" \
  --origin-group-name "app-origins" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --host-name "${PRIMARY_APP}.azurewebsites.net" \
  --origin-host-header "${PRIMARY_APP}.azurewebsites.net" \
  --priority 1 \
  --weight 1000 \
  --https-port 443 \
  --enabled-state Enabled

# Step 10: Add secondary origin (priority 2)
az afd origin create \
  --origin-name "secondary" \
  --origin-group-name "app-origins" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --host-name "${SECONDARY_APP}.azurewebsites.net" \
  --origin-host-header "${SECONDARY_APP}.azurewebsites.net" \
  --priority 2 \
  --weight 1000 \
  --https-port 443 \
  --enabled-state Enabled

# Step 11: Create route
az afd route create \
  --route-name "default-route" \
  --endpoint-name "${FRONT_DOOR_NAME}-endpoint" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --origin-group "app-origins" \
  --supported-protocols Http Https \
  --patterns-to-match "/*" \
  --forwarding-protocol HttpsOnly \
  --https-redirect Enabled \
  --link-to-default-domain Enabled
```

---

## Cost Implications

Present this to the user before proceeding:

| Component | Approximate Monthly Cost |
|-----------|-------------------------|
| Secondary Function App (FC1 Flex) | Pay-per-execution only (standby = ~$0 if idle) |
| Secondary Storage (ZRS) | ~$0.02/GB/month (minimal if just app package) |
| Azure Front Door (Standard) | ~$35/month base + $0.01/10K requests |
| **Total additional cost** | **~$35-40/month** for active-passive with idle standby |

> **Note for Flex Consumption:** The secondary app costs near-zero when idle since Flex Consumption is pay-per-execution. This makes active-passive very cost-effective for Functions.
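
The table's figures can be combined into a quick estimate before presenting the cost to the user. This is a minimal sketch that only uses the approximate prices listed above; `estimate_monthly_cost` is a hypothetical helper name, and real prices vary by region and over time.

```bash
#!/usr/bin/env bash
# estimate_monthly_cost <requests-per-month> <storage-gb>
# Prints the approximate additional monthly cost in USD for an idle standby.
estimate_monthly_cost() {
  local requests_per_month="$1"   # requests served through Front Door
  local storage_gb="$2"           # data held in the secondary ZRS account
  awk -v r="$requests_per_month" -v g="$storage_gb" 'BEGIN {
    base = 35.00                  # Front Door Standard base fee (approximate)
    req  = (r / 10000) * 0.01     # about $0.01 per 10K requests
    sto  = g * 0.02               # about $0.02 per GB per month for ZRS
    printf "%.2f\n", base + req + sto
  }'
}
```

For example, `estimate_monthly_cost 1000000 5` covers one million requests and 5 GB of standby storage. The secondary Function App is left out because the table treats it as roughly $0 while idle.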

---

## Verification After Setup

After multi-region is configured, verify:

```bash
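# Resolve the endpoint hostname (Front Door adds a generated hash to the endpoint name)
AFD_HOST=$(az afd endpoint show \
  --endpoint-name "${FRONT_DOOR_NAME}-endpoint" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --query hostName -o tsv)

# Check Front Door endpoint is responding
curl -I "https://${AFD_HOST}/api/health"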

# Show the health probe settings configured on the origin group
az afd origin-group show \
  --origin-group-name "app-origins" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --query "healthProbeSettings"

# List origins with their priority and enabled state (this does not report probe health)
az afd origin list \
  --origin-group-name "app-origins" \
  --profile-name $FRONT_DOOR_NAME \
  --resource-group $PRIMARY_RG \
  --query "[].{name:name, hostName:hostName, priority:priority, enabled:enabledState}"
```
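
Because Front Door changes can take several minutes to propagate, a single `curl` right after setup may fail even when the configuration is correct. The sketch below retries a check a fixed number of times. `retry` and `check_health` are hypothetical helper names, and the attempt count and delay are assumptions to tune.

```bash
#!/usr/bin/env bash
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# check_health <url>
# Succeeds only when the URL answers with a non-error HTTP status.
check_health() {
  curl --silent --fail --output /dev/null --max-time 10 "$1"
}
```

A possible invocation is `retry 20 30 check_health "https://<endpoint-hostname>/api/health"`, which waits up to roughly ten minutes.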

---

## Reliability Checklist Impact

After multi-region is configured, the **Multi-Region** column flips to ✅ for:
- Each compute resource that now has a paired deployment in the secondary region
- The Front Door / Traffic Manager profile (with health probes enabled)

---

## Limitations & Gotchas

1. **Event Hubs:** The primary Event Hub namespace is in one region. For true region failover, consider Event Hubs Geo-DR pairing (separate setup).
2. **Code deployment:** Both apps need the same code. If using `azd deploy`, you'll need to deploy to both apps.
3. **App Settings:** Both apps must have matching configuration. Consider using a shared Key Vault for secrets.
4. **Cold start:** Secondary Flex Consumption app may have cold start on failover since it's idle. Consider periodic health pings.
5. **Front Door propagation:** DNS/config changes take 5-10 minutes to propagate globally.
6. **Storage independence:** Each region uses its own storage. Event Hub checkpoints are per-app, so the secondary will process from its own checkpoint on failover.

configure-storage.md 5.3 KB

# Configure Storage Redundancy

## Overview

Storage accounts must match or exceed the redundancy level of the compute they support. Zone-redundant compute requires at minimum ZRS storage.

## Upgrade Paths

| Current | Target | Method | Downtime |
|---|---|---|---|
| Standard_LRS | Standard_ZRS | Live migration or manual | None (live) or planned (manual) |
| Standard_LRS | Standard_GRS | In-place update | None |
| Standard_LRS | Standard_GZRS | In-place update | None |
| Premium_LRS | Premium_ZRS | Manual migration only | Planned |
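
The table can be encoded as a small lookup so a script picks the right path before touching an account. This is a minimal sketch; `upgrade_method` is a hypothetical helper name, and it covers only the four rows listed above.

```bash
#!/usr/bin/env bash
# upgrade_method <current-sku> <target-sku>
# Prints the upgrade method from the table, or "unknown" (exit 1) for any
# combination the table does not list.
upgrade_method() {
  local current="$1" target="$2"
  case "${current}:${target}" in
    Standard_LRS:Standard_ZRS)  echo "live-or-manual" ;;
    Standard_LRS:Standard_GRS)  echo "in-place" ;;
    Standard_LRS:Standard_GZRS) echo "in-place" ;;
    Premium_LRS:Premium_ZRS)    echo "manual" ;;
    *)
      echo "unknown"
      return 1
      ;;
  esac
}
```

Treating unlisted combinations as an error keeps the script from guessing a path the table does not document.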

## In-Place Upgrade (LRS → GRS/GZRS)

GRS and GZRS upgrades can be done in-place immediately:

```bash
# Upgrade to GZRS (zone + region redundant — recommended)
az storage account update \
  --name <account-name> \
  --resource-group <rg> \
  --sku Standard_GZRS

# Upgrade to GRS (region redundant only — NOT zone redundant)
az storage account update \
  --name <account-name> \
  --resource-group <rg> \
  --sku Standard_GRS
```

## Live Migration (LRS → ZRS)

ZRS conversion uses the storage account migration API (not `--sku` update):

```bash
# Check if account supports live migration
# Requirements: Standard general-purpose v2, in a supported region
az storage account show \
  --name <account-name> \
  --resource-group <rg> \
  --query "{name:name, sku:sku.name, kind:kind}"

# Start live migration to ZRS
az storage account migration start \
  --account-name <account-name> \
  -g <rg> \
  --sku Standard_ZRS \
  --name default \
  --no-wait

# Monitor migration status (can take hours to days)
az storage account migration show \
  --account-name <account-name> \
  -g <rg> \
  --migration-name default \
  --query "migrationStatus"
```

> **⛔ HARD GATE: Do NOT proceed to enable zone-redundant compute until migration status is `Succeeded` and `az storage account show --query sku.name` returns `Standard_ZRS`.**

⚠️ **Live migration limitations:**
- Only supported for **Standard general-purpose v2** accounts
- Premium storage accounts require manual migration
- BlobStorage and Storage (classic) kinds require manual migration
- Migration can take hours to days depending on data volume
- Account remains accessible during migration
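
The hard gate above can be expressed as a check that takes both observed values and only passes when they agree. This is a sketch; `migration_gate` is a hypothetical helper name, and the two inputs are assumed to come from the `migration show` and `account show` commands already shown.

```bash
#!/usr/bin/env bash
# migration_gate <migration-status> <sku-name>
# Succeeds only when the migration reports Succeeded AND the account SKU is
# already Standard_ZRS; otherwise prints why it is still waiting.
migration_gate() {
  local status="$1" sku="$2"
  if [[ "$status" == "Succeeded" && "$sku" == "Standard_ZRS" ]]; then
    echo "proceed"
    return 0
  fi
  echo "wait (status=${status:-unknown}, sku=${sku:-unknown})"
  return 1
}
```

In practice the arguments would be filled from the CLI, for example by capturing `az storage account migration show ... --query migrationStatus -o tsv` and `az storage account show ... --query sku.name -o tsv` into variables first.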

## Manual Migration (When Live Migration Not Available)

For Premium_LRS or unsupported account types:

```bash
# 1. Create new ZRS storage account
az storage account create \
  --name <new-account-name> \
  --resource-group <rg> \
  --location <location> \
  --sku Standard_ZRS \
  --kind StorageV2
```

### For Function Apps (special handling required)

Function App host storage uses blobs, queues, tables, and potentially files. A simple blob-only copy is NOT sufficient.

**⛔ Full Function App storage migration workflow:**

```bash
# 1. Stop the function app to quiesce triggers and drain orchestrations
az functionapp stop --name <app-name> --resource-group <rg>

# 2. If using Durable Functions, wait for all orchestrations to complete
#    or terminate them before proceeding

# 3. Copy storage data to the new account. AzCopy covers blobs; see the note below for tables/queues.
#    Authenticate first:
azcopy login

#    Copy blobs:
azcopy copy \
  "https://<old-account>.blob.core.windows.net/*" \
  "https://<new-account>.blob.core.windows.net/" \
  --recursive

#    Note: AzCopy does not support table/queue copy directly.
#    Use Azure Data Factory or Storage Explorer for tables/queues if needed.

# 4. Get new storage connection string
az storage account show-connection-string \
  --name <new-account-name> \
  --resource-group <rg> \
  --query connectionString -o tsv

# 5. Update app settings to point to new storage
az functionapp config appsettings set \
  --name <app-name> \
  --resource-group <rg> \
  --settings "AzureWebJobsStorage=<new-connection-string>"

# 6. Start the function app
az functionapp start --name <app-name> --resource-group <rg>

# 7. Verify app works — check logs, test triggers
az functionapp show --name <app-name> --resource-group <rg> --query "state"

# 8. Only delete old storage after confirming everything works
```

⚠️ **Warn user:** This involves app downtime. For zero-downtime migration, use live migration (Standard_ZRS) instead.

### For non-Function App workloads (simpler)

```bash
# Copy blobs
azcopy login
azcopy copy \
  "https://<old-account>.blob.core.windows.net/*" \
  "https://<new-account>.blob.core.windows.net/" \
  --recursive

# Update app connection strings as needed
# Delete old account after verification
```

## Function App Storage Considerations

Function Apps have a host storage account used for:
- Function code storage
- Timer trigger leases
- Event Hub checkpoints
- Durable Functions state

**Critical:** When upgrading Function App storage:
- The `AzureWebJobsStorage` connection string must point to the upgraded/new account
- If using a separate deployment storage account, upgrade that too
- Durable Functions state is stored in the host storage — ensure no active orchestrations during manual migration

```bash
# Check current storage connection
az functionapp config appsettings list \
  --name <app-name> \
  --resource-group <rg> \
  --query "[?name=='AzureWebJobsStorage'].value" -o tsv
```

## Verification

```bash
az graph query -q "
Resources
| where resourceGroup =~ '<rg>'
| where type =~ 'microsoft.storage/storageaccounts'
| project name, replication=sku.name, kind
" --query "data[]" -o json
```

Expected: All accounts show `Standard_ZRS`, `Standard_GZRS`, or `Premium_ZRS`.
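
The expected result can also be asserted in a script rather than read by eye. A minimal sketch: `all_zone_redundant` is a hypothetical helper that reads one SKU name per line on stdin and fails if any is outside the three values listed above.

```bash
#!/usr/bin/env bash
# all_zone_redundant
# Reads SKU names from stdin (one per line). Returns 0 when every SKU is
# zone redundant, 1 otherwise, naming each offender on stderr.
all_zone_redundant() {
  local sku failed=0
  while IFS= read -r sku || [[ -n "$sku" ]]; do
    [[ -z "$sku" ]] && continue          # ignore blank lines
    case "$sku" in
      Standard_ZRS | Standard_GZRS | Premium_ZRS) ;;
      *)
        echo "not zone-redundant: $sku" >&2
        failed=1
        ;;
    esac
  done
  return "$failed"
}
```

One way to feed it is `az storage account list --resource-group <rg> --query "[].sku.name" -o tsv | all_zone_redundant`.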

configure-zone-redundancy.md 1.9 KB

# Configure Zone Redundancy — Platform Notes

## Storage redundancy is part of the same fix — discover it now, migrate it later

Zone-redundant compute backed by LRS/GRS storage still suffers downtime in a zone failure, so the storage SKU **must** be assessed alongside compute. However, do **not** block the compute fix on a storage migration — they happen in separate steps.

**Required order (matches the parent skill's [Configuration Workflow](../SKILL.md#configuration-workflow)):**

1. **Discover** the current storage SKU during assessment (Phase 2) so the user sees both gaps in one checklist. Use [storage-redundancy-checks.md](storage-redundancy-checks.md).
2. **Enable compute ZR first** — fast, in-place property update, no downtime. This is the quick win and runs without any storage prerequisite.
3. **Verify** compute is `zoneRedundant: true`.
4. **Then ask the user** before starting the storage migration (hours-to-days, small cost increase). Commands live in [configure-storage.md](configure-storage.md).

## Per-service configuration commands

The `az` CLI commands, plan-upgrade paths, blue/green migration steps, and verification commands all live in the per-service references because the syntax differs per service:

| Service | Reference |
|---|---|
| Azure App Service (P1v2+, P0v3+, P0v4+, ASEv3) | [services/app-service/reliability.md](services/app-service/reliability.md) |
| Azure Functions (FC1, EP1–EP3) | [services/functions/reliability.md](services/functions/reliability.md) |

## Verification

After enabling zone redundancy on any compute resource, confirm with:

```bash
az graph query -q "
Resources
| where resourceGroup =~ '<rg>'
| where type =~ 'microsoft.web/serverfarms' or type =~ 'microsoft.app/managedenvironments'
| extend zoneRedundant = tobool(properties.zoneRedundant)
| project name, type, zoneRedundant
" --query "data[]" -o json
```

All patched resources should show `zoneRedundant = true`.

health-probe-checks.md 3.4 KB

# Health Probe & Monitoring — Platform-Level Checks

## Overview

Health probes enable automated failover and recovery. Without them, load balancers and platform services cannot detect failures automatically.

This file covers **global / platform-level** probe checks (Azure Front Door, Traffic Manager, Application Insights connectivity). For service-specific health-probe checks, configuration commands, and IaC patches, see:

| Service | Reference |
|---|---|
| Azure Functions | [services/functions/reliability.md](services/functions/reliability.md) |

> Azure App Service and Azure Container Apps per-service references are planned but not yet shipped in this skill version.

> **⚠️ Output format:** Use `--query "data[]" -o json` for `az graph query`. Standard `az afd` / `az network traffic-manager` commands work fine with `-o table`.

## Check Front Door Health Probe Configuration

```bash
az afd origin-group list \
  --profile-name <front-door-name> \
  --resource-group <rg> \
  --query "[].{name:name, probePath:healthProbeSettings.probePath, probeProtocol:healthProbeSettings.probeProtocol, intervalSeconds:healthProbeSettings.probeIntervalInSeconds}" -o table
```

**Interpretation:**
- `probePath` empty / null → ❌ No active health probing → no automatic failover
- `probePath = /api/health` (or similar) → ✅ Probe configured

## Check Traffic Manager Endpoint Monitoring

```bash
az graph query -q "
Resources
| where type =~ 'microsoft.network/trafficmanagerprofiles'
| extend monitorPath = tostring(properties.monitorConfig.path)
| extend monitorProtocol = tostring(properties.monitorConfig.protocol)
| extend monitorPort = tostring(properties.monitorConfig.port)
| project name, resourceGroup, monitorProtocol, monitorPort, monitorPath
" --query "data[]" -o json
```

## Check Application Insights Connectivity

App settings are not reliably queryable via Resource Graph. Use Azure CLI directly:

```bash
az webapp config appsettings list \
  --name <app-name> \
  --resource-group <rg> \
  --query "[?contains(name, 'APPINSIGHTS') || contains(name, 'APPLICATIONINSIGHTS')].{name:name}" -o table
```

For Function Apps:
```bash
az functionapp config appsettings list \
  --name <app-name> \
  --resource-group <rg> \
  --query "[?contains(name, 'APPINSIGHTS') || contains(name, 'APPLICATIONINSIGHTS')].{name:name}" -o table
```

## Best Practices for Health Endpoints

These apply across all services:

1. **Keep health endpoints lightweight** — return 200 quickly, no heavy DB/dependency queries on every probe.
2. **Use anonymous auth** — health probes can't pass auth tokens.
3. **Two endpoints, not one** — fast `/health` for the load balancer, optional `/health/deep` for on-call diagnostics.
4. **For Container Apps, both liveness AND readiness** — liveness alone restarts the container without taking it out of rotation.
5. **Test the endpoint** before relying on it: `curl https://<app-url>/api/health`.

## Reporting (for the Multi-Region row)

For the `Multi-region failover` row of the assessment table:
- ✅ — Front Door (or Traffic Manager) exists AND has a non-empty `probePath` / `monitorConfig.path`
- ⚠️ Partial — global load balancer exists but has no health probe configured (manual failover only)
- ❌ — no global load balancer
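
The three reporting outcomes map directly onto two inputs: whether a global load balancer exists and what its probe path is. This is a sketch under those assumptions. `multi_region_status` is a hypothetical helper name, and treating the strings `null` and `None` as empty is an assumption about how the CLI prints missing values.

```bash
#!/usr/bin/env bash
# multi_region_status <has-global-lb: true|false> <probe-path>
# Prints the cell value for the Multi-region failover row.
multi_region_status() {
  local has_global_lb="$1" probe_path="$2"
  if [[ "$has_global_lb" != "true" ]]; then
    echo "❌"
  elif [[ -z "$probe_path" || "$probe_path" == "null" || "$probe_path" == "None" ]]; then
    echo "⚠️ Partial"
  else
    echo "✅"
  fi
}
```

For example, `multi_region_status true "/api/health"` reports a configured probe, while `multi_region_status true ""` reports the manual-failover-only case.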

Per-service `Health probes` row reporting for Azure Functions is documented in [services/functions/reliability.md](services/functions/reliability.md). App Service and Container Apps per-service reporting is planned but not yet available.

iac-patching-bicep.md 6.8 KB

# IaC Patching — Bicep

## When to Use

Use this reference when the user chooses **"Patch my IaC"** instead of "Fix now" (CLI).
This patches Bicep files in the project's `infra/` folder so reliability settings persist across `azd up`.

## Detection

1. Look for `infra/` folder in the project root
2. Check for `*.bicep` files (especially `main.bicep`, `main.parameters.json`)
3. Confirm with user: "I found Bicep files in `infra/`. Want me to patch them for reliability?"

## File Discovery

Search for resources to patch using these patterns:

```powershell
# Find all Bicep files
Get-ChildItem -Path infra -Recurse -Filter *.bicep

# Common file structure:
# infra/main.bicep          — orchestrator, references modules
# infra/main.parameters.json — parameters
# infra/app/                 — app-specific modules
# infra/core/               — shared modules (host, storage, monitoring)
```

The resource definitions may be in module files, not `main.bicep`. Search all `.bicep` files for the resource type.

---

## ⚠️ AVM Modules vs Raw Bicep — Parameter Naming Differs

If the project uses **Azure Verified Modules** (`br/public:avm/res/...`), the parameter names will **not** match the raw ARM property names shown in the patches below. Per-service AVM examples live in the per-service references. The most universally-applicable mapping is:

| Raw ARM/Bicep property | AVM module parameter |
|---|---|
| `sku.name` (storage) | `skuName` (top-level) |

**Before patching, always:**

1. Detect AVM usage:
   ```powershell
   Get-ChildItem -Path infra -Recurse -Filter *.bicep | Select-String -Pattern "br/public:avm/res/" -List
   ```
2. For each AVM module reference, open the module's published README (the version is in the registry path, e.g. `br/public:avm/res/storage/storage-account:0.x.y`) or run:
   ```powershell
   # Show the module call so you can see which params it currently passes
   Get-ChildItem -Path infra -Recurse -Filter *.bicep | Select-String -Pattern "avm/res/storage/storage-account" -Context 0,15
   ```
3. Map the reliability property to the **module's parameter name**, not the ARM property name. When in doubt, search the actual module call for the property and patch what's already in use.
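
On a machine without PowerShell, the detection step can be approximated with `grep`. This is a sketch; `find_avm_modules` is a hypothetical helper name, and the regular expression is an assumption about how AVM registry paths are written (`br/public:avm/res/<path>:<version>`).

```bash
#!/usr/bin/env bash
# find_avm_modules [dir]
# Lists each distinct AVM module reference (with its version) found in the
# .bicep files under the directory (default: infra).
find_avm_modules() {
  local dir="${1:-infra}"
  grep -rhoE --include='*.bicep' \
    "br/public:avm/res/[A-Za-z0-9/_-]+:[0-9A-Za-z.-]+" "$dir" | sort -u
}
```

An empty result suggests the project uses raw Bicep resources, so the ARM property names apply directly.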

## Per-service Bicep patches

The patches for compute (zone redundancy on the App Service plan or Function App plan, health check path) live in the per-service references because the SKU rules and ARM types differ:

| Service | Reference |
|---|---|
| Azure App Service | [services/app-service/reliability.md](services/app-service/reliability.md) |
| Azure Functions | [services/functions/reliability.md](services/functions/reliability.md) |

> Azure Container Apps per-service Bicep patches are planned for a future version of this skill.

The one truly cross-service patch — **storage** — lives below.

---

## Patch: Storage Account — LRS / GRS → ZRS / GZRS

**Find:** `Microsoft.Storage/storageAccounts`

**Search pattern:** `resource .* 'Microsoft.Storage/storageAccounts@`

> **💡 No `sku` block in the IaC?** If the storage resource (or AVM module call) does not specify a SKU, Azure deploys it as **`Standard_GRS`** by default. The patch in that case is to **add** the `sku` block (raw Bicep) or **add** the `skuName` parameter (AVM), not find-and-replace an existing value. Always grep for the absence of `sku` / `skuName` before assuming there's a value to swap.

**Before:**
```bicep
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
}
```

**After — change to ZRS:**
```bicep
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_ZRS'
  }
  kind: 'StorageV2'
}
```

### Case: SKU not specified at all (defaulted to GRS)

**Before (no `sku` block):**
```bicep
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: storageAccountName
  location: location
  kind: 'StorageV2'
}
```

**After — add an explicit `sku`:**
```bicep
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_ZRS'
  }
  kind: 'StorageV2'
}
```

**AVM module equivalent:**
```bicep
module storage 'br/public:avm/res/storage/storage-account:<version>' = {
  // ...
  params: {
    name: storageAccountName
    skuName: 'Standard_ZRS'   // ← ADD THIS (defaults to Standard_GRS if omitted)
    // ...
  }
}
```

### Parameterized SKU (common pattern)

If the SKU is parameterized, update the default value:

**Before:**
```bicep
param storageSku string = 'Standard_LRS'
```

**After:**
```bicep
param storageSku string = 'Standard_ZRS'
```

Also check `main.parameters.json` for overrides:
```json
{
  "storageSku": {
    "value": "Standard_ZRS"
  }
}
```

### ⚠️ Important: Existing Deployed Storage

Changing SKU in Bicep expresses the **desired end state**, but does NOT automatically migrate existing storage.

- **New storage account** → deploys as ZRS directly ✅
- **Existing storage account** → ARM may attempt an in-place SKU update, but LRS→ZRS is a **storage redundancy conversion**, not a simple property change. For supported StorageV2/GPv2 accounts in supported regions, Azure can perform live conversion, but this is not guaranteed and the deployment may fail for unsupported account kinds.

**Always follow this order for existing storage:**
1. Patch the Bicep to `Standard_ZRS` (desired end state)
2. Run `az storage account migration start` to initiate the live conversion
3. Wait for migration to complete (`az storage account migration show`)
4. Then run `azd up` / deploy — the Bicep now matches the actual state

> ⛔ Do NOT run `azd up` before the migration completes. The deployment may fail or conflict with the in-progress migration.

---

## Deploy Plan (Skill executes this directly)

After patching, **the skill executes the deploys itself** — do not stop and tell the user to run commands. Confirm once with the user before each deploy, then run it.

Summarize the plan for the user:
```
✅ Bicep files patched for reliability.

Deploy plan (the skill will run these for you after your confirmation):
  1. Deploy 1 — safe patches only (zone redundancy, health probes).
     Command: `azd up` (or `az deployment group create ...`).
  2. Storage migration (only if upgrading LRS → ZRS).
     Command: `az storage account migration start ...`, then poll until `sku.name = Standard_ZRS`.
  3. Deploy 2 — storage SKU patch (no-op confirmation that IaC matches live state).

Do NOT bundle the storage SKU change with the safe patches — a failed storage redundancy update can fail the whole deployment.

⚠️ Note: If you have an existing Container Apps environment without zone redundancy,
   the environment name was changed to force recreation. Your apps will be migrated
   to the new environment on next deploy.

Ready for Deploy 1? (yes / no)
```
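
The "confirm once before each deploy, then run it" rule can be sketched as a wrapper that refuses to run a stage without an explicit `yes`. `confirm_and_run` is a hypothetical helper name, and the exit code 2 for a declined stage is an arbitrary choice made for this illustration.

```bash
#!/usr/bin/env bash
# confirm_and_run <description> <command...>
# Prompts on stderr, reads one line from stdin, and runs the command only
# when the answer is exactly "yes". Returns 2 when the stage is declined.
confirm_and_run() {
  local description="$1"
  shift
  local answer=""
  printf '%s\nProceed? (yes / no) ' "$description" >&2
  read -r answer || true
  if [[ "$answer" != "yes" ]]; then
    echo "skipped: $description" >&2
    return 2
  fi
  "$@"
}
```

A staged run might then look like `confirm_and_run "Deploy 1: safe patches only" azd up`, followed by separate calls for the storage migration and Deploy 2, so a declined or failed stage never triggers the next one implicitly.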

iac-patching-terraform.md 4.7 KB

# IaC Patching — Terraform

## When to Use

Use this reference when the user chooses **"Patch my IaC"** instead of "Fix now" (CLI).
This patches Terraform files in the project's `infra/` folder so reliability settings persist across `terraform apply` / `azd up`.

## Detection

1. Look for `infra/` folder in the project root
2. Check for `*.tf` files (especially `main.tf`, `variables.tf`)
3. Confirm with user: "I found Terraform files in `infra/`. Want me to patch them for reliability?"

## File Discovery

```powershell
# Find all Terraform files
Get-ChildItem -Path infra -Recurse -Filter *.tf

# Common file structure:
# infra/main.tf              — main resources
# infra/variables.tf          — input variables
# infra/terraform.tfvars      — variable values
# infra/modules/              — reusable modules
```

Resource definitions may be in module files. Search all `.tf` files for the resource type.
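
As an illustration of that search, here is a small Python sketch that walks a folder and reports which `.tf` files declare a given resource type. `find_resource_files` is a hypothetical helper name, and the regular expression assumes the conventional `resource "<type>" "<name>"` layout at the start of a line.

```python
import re
from pathlib import Path


def find_resource_files(root: str, resource_type: str) -> list[str]:
    """Return sorted paths of .tf files under root that declare resource_type."""
    pattern = re.compile(
        r'^\s*resource\s+"' + re.escape(resource_type) + r'"', re.MULTILINE
    )
    hits = []
    for tf_file in Path(root).rglob("*.tf"):
        if pattern.search(tf_file.read_text(encoding="utf-8")):
            hits.append(str(tf_file))
    return sorted(hits)
```

For example, `find_resource_files("infra", "azurerm_storage_account")` would list every file, including module files, that needs the storage patch.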

## Per-service Terraform patches

The patches for compute (zone redundancy on the App Service Plans / environments, Function App plan, health check path) live in the per-service references because the SKU rules and resource types differ:

| Service | Reference |
|---|---|
| Azure App Service | [services/app-service/reliability.md](services/app-service/reliability.md) |
| Azure Functions | [services/functions/reliability.md](services/functions/reliability.md) |

> Azure Container Apps per-service Terraform patches are planned for a future version of this skill.

The one truly cross-service patch — **storage** — lives below.

---

## Patch: Storage Account — LRS / GRS → ZRS / GZRS

**Find:** `azurerm_storage_account`

**Search pattern:** `resource "azurerm_storage_account"`

**Before:**
```hcl
resource "azurerm_storage_account" "storage" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```

**After — change to ZRS:**
```hcl
resource "azurerm_storage_account" "storage" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "ZRS"
}
```

### Parameterized Replication Type

If parameterized, update the default:

**variables.tf — Before:**
```hcl
variable "storage_replication_type" {
  default = "LRS"
}
```

**After:**
```hcl
variable "storage_replication_type" {
  default = "ZRS"
}
```

Also check `terraform.tfvars` for overrides.

### ⚠️ Existing Deployed Storage

Changing `account_replication_type` in Terraform expresses the **desired end state**, but LRS→ZRS is a **storage redundancy conversion**, not a simple property change. Terraform may attempt an in-place update that fails, or worse, plan a destroy+recreate (data loss risk).

**Always follow this order for existing storage:**
1. Patch Terraform to `account_replication_type = "ZRS"` (desired end state)
2. Run `az storage account migration start` to initiate the live conversion
3. Wait for migration to complete (`az storage account migration show`)
4. Run `terraform plan` — confirm it shows **no changes** (state now matches desired)
5. If the plan still shows changes, run `terraform apply -refresh-only` (the replacement for the deprecated `terraform refresh`) to sync state, then re-plan

> ⛔ Do NOT run `terraform apply` before the migration completes. It may fail or attempt to recreate the storage account.
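
The "no changes" confirmation in step 4 can be checked mechanically with `terraform plan -detailed-exitcode`, which exits with 0 when the diff is empty, 1 on error, and 2 when changes are pending. A minimal Python sketch of interpreting that code follows; `interpret_plan_exit_code` is a hypothetical helper name.

```python
def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "no-changes"       # state already matches the configuration
    if code == 2:
        return "changes-pending"  # plan succeeded but a diff remains
    return "error"                # 1 (or anything unexpected) means the plan failed
```

One way to use it is to pass the `returncode` from `subprocess.run(["terraform", "plan", "-detailed-exitcode"])` and only continue when the result is `"no-changes"`.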

---

## Deploy Plan (Skill executes this directly)

After patching, **the skill executes the deploys itself** — do not stop and tell the user to run commands. Confirm once with the user before each deploy, then run it.

Summarize the plan for the user:
```
✅ Terraform files patched for reliability.

Deploy plan (the skill will run these for you after your confirmation):
  1. `terraform plan -out tfplan` (skill will show the plan summary)
  2. Deploy 1 — `terraform apply tfplan` for the safe patches.
  3. Storage migration (only if upgrading LRS → ZRS).
     Command: `az storage account migration start ...`, then poll until `sku.name = Standard_ZRS`.
  4. Deploy 2 — second `terraform plan` + `apply` for the storage SKU patch (no-op confirmation).

Do NOT bundle the storage SKU change with the safe patches — a failed storage redundancy update can fail the whole apply.

⚠️ Note: If you have an existing Container Apps environment without zone redundancy,
   the environment name was changed to force recreation. The skill will surface the
   `terraform plan` summary before applying so you can confirm — apps will be recreated
   in the new environment.

Ready to run `terraform plan`? (yes / no)
```

multi-region-checks.md 5.9 KB

# Multi-Region & Failover Checks

## Overview

Multi-region deployment protects against entire region outages. This requires deploying compute in multiple regions and using a global load balancer (Azure Front Door or Traffic Manager) to route traffic.

## Resource Graph Queries

> **⚠️ Output format:** Use `--query "data[]" -o json` (not `-o table`). `az graph query -o table` only renders summary columns and does not show projected fields.

### Check if App is Deployed in Multiple Regions

```bash
az graph query -q "
Resources
| where type in~ ('microsoft.web/sites', 'microsoft.app/containerapps')
| extend appKind = case(
    type =~ 'microsoft.web/sites' and kind contains 'functionapp', 'FunctionApp',
    type =~ 'microsoft.web/sites', 'WebApp',
    type =~ 'microsoft.app/containerapps', 'ContainerApp',
    'Other')
| extend baseName = extract('^(.+?)(-[a-z]+\\d*)?$', 1, name)
| summarize regions=make_list(location), regionCount=dcount(location), apps=make_list(name) by baseName, appKind
| where regionCount > 1
| project baseName, appKind, regionCount, regions, apps
" --query "data[]" -o json
```

**Interpretation:**
- Results show apps with the same base name deployed across multiple regions → ✅ Multi-region
- No results → ❌ All apps are single-region

**Important:** The `baseName` extraction uses a naming convention (e.g., `my-app-eastus`, `my-app-westus`). If apps don't follow this pattern, also check by resource tags:

```bash
az graph query -q "
Resources
| where type in~ ('microsoft.web/sites', 'microsoft.app/containerapps')
| where isnotempty(tags['app-group']) or isnotempty(tags['application'])
| extend appGroup = coalesce(tostring(tags['app-group']), tostring(tags['application']))
| summarize regions=make_list(location), regionCount=dcount(location) by appGroup
| where regionCount > 1
| project appGroup, regionCount, regions
" --query "data[]" -o json
```
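
To see why the tag-based query is a useful fallback, the naming-convention grouping can be reproduced offline. The Python sketch below applies the same regular expression as the first query to a list of `(name, location)` pairs; `base_name` and `multi_region_groups` are hypothetical helper names, and the sample app names in the note are made up.

```python
import re
from collections import defaultdict

# Same pattern as the KQL extract(): lazy base name plus an optional "-suffix" tail.
BASE_NAME = re.compile(r"^(.+?)(-[a-z]+\d*)?$")


def base_name(name: str) -> str:
    match = BASE_NAME.match(name)
    return match.group(1) if match else name


def multi_region_groups(apps: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group (name, location) pairs by base name; keep groups spanning >1 region."""
    regions = defaultdict(set)
    for name, location in apps:
        regions[base_name(name)].add(location)
    return {base: locs for base, locs in regions.items() if len(locs) > 1}
```

Note that the pattern strips any trailing hyphenated segment, so an app called `my-app` with no region suffix is reduced to `my`. Names like that can be grouped incorrectly, which is where the tag query helps.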

### Check for Azure Front Door

```bash
az graph query -q "
Resources
| where type =~ 'microsoft.cdn/profiles'
| where sku.name =~ 'Standard_AzureFrontDoor' or sku.name =~ 'Premium_AzureFrontDoor'
| project name, resourceGroup, sku=sku.name
" --query "data[]" -o json
```

### Check for Traffic Manager Profiles

```bash
az graph query -q "
Resources
| where type =~ 'microsoft.network/trafficmanagerprofiles'
| extend routingMethod = tostring(properties.trafficRoutingMethod)
| extend endpoints = array_length(properties.endpoints)
| project name, resourceGroup, routingMethod, endpoints, status=properties.profileStatus
" --query "data[]" -o json
```

### Check Front Door Origins/Backends

```bash
# List Front Door origin groups and origins
az afd origin-group list \
  --profile-name <front-door-name> \
  --resource-group <rg> \
  --query "[].{name:name, origins:length(origins)}" -o table

az afd origin list \
  --profile-name <front-door-name> \
  --resource-group <rg> \
  --origin-group-name <group-name> \
  --query "[].{name:name, hostName:hostName, priority:priority, weight:weight}" -o table
```

## Assessment Criteria

| Check | Pass | Fail |
|---|---|---|
| App deployed in ≥2 regions | ✅ Multi-region | ❌ Single region |
| Global load balancer exists (Front Door or TM) | ✅ Traffic routing | ❌ No failover mechanism |
| Health probes configured on load balancer | ✅ Auto-failover | ⚠️ Manual failover only |
| Storage is geo-redundant (GRS/GZRS) | ✅ Data survives region failure | ❌ Data loss risk |

## Multi-Region Patterns

### Active-Passive (Recommended starting point)

```
Users → Azure Front Door → Primary Region (priority 1)
                        → Secondary Region (priority 2, standby)
```

- Primary serves all traffic
- Front Door health probes detect primary failure
- Automatic failover to secondary
- Lower cost (secondary can be scaled down)

### Active-Active

```
Users → Azure Front Door → Region A (weight 50)
                        → Region B (weight 50)
```

- Both regions serve traffic simultaneously
- Better performance (route to nearest)
- Higher cost (both at full capacity)
- Requires stateless design or data sync
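
A rough mental model of how priority and weight produce these two patterns is sketched below in Python. This is a simplification under stated assumptions, not Front Door's actual algorithm (which also takes latency into account): traffic goes only to healthy origins with the lowest priority value and is split among them by weight. `Origin` and `traffic_share` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Origin:
    name: str
    priority: int
    weight: int
    healthy: bool = True


def traffic_share(origins: list[Origin]) -> dict[str, float]:
    """Share of traffic per origin: healthy origins at the best priority, split by weight."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return {}
    best = min(o.priority for o in healthy)  # lower value = preferred
    active = [o for o in healthy if o.priority == best]
    total = sum(o.weight for o in active)
    return {o.name: o.weight / total for o in active}
```

With priorities 1 and 2 the secondary receives traffic only after the primary is marked unhealthy (active-passive). With equal priorities and equal weights, both regions share traffic (active-active).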

## Remediation: Generate Multi-Region IaC

When user wants multi-region, generate Bicep that includes:

1. **Secondary region compute** — Same service type as primary
2. **Secondary region storage** — ZRS in secondary region
3. **Azure Front Door profile** — With:
   - Origin group containing both regions
   - Health probe (HTTP/HTTPS to health endpoint)
   - Routing rule (priority for active-passive, weighted for active-active)
4. **DNS configuration** — Custom domain on Front Door

### Bicep Skeleton for Active-Passive Front Door

```bicep
resource frontDoor 'Microsoft.Cdn/profiles@2024-02-01' = {
  name: frontDoorName
  location: 'global'
  sku: {
    name: 'Standard_AzureFrontDoor'
  }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2024-02-01' = {
  parent: frontDoor
  name: 'primary-group'
  properties: {
    healthProbeSettings: {
      probePath: '/api/health'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 30
    }
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
    }
  }
}

resource primaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
  parent: originGroup
  name: 'primary'
  properties: {
    hostName: primaryAppHostName
    priority: 1
    weight: 1000
    originHostHeader: primaryAppHostName
  }
}

resource secondaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2024-02-01' = {
  parent: originGroup
  name: 'secondary'
  properties: {
    hostName: secondaryAppHostName
    priority: 2
    weight: 1000
    originHostHeader: secondaryAppHostName
  }
}
```

## Reporting

For the reliability checklist, mark the **Multi-Region** column per resource:
- ✅ — resource is deployed in ≥2 regions AND fronted by Azure Front Door / Traffic Manager with health probes
- ❌ — single region OR multi-region without an active global load balancer / health probes
- For Front Door / Traffic Manager rows: ✅ if configured with health probes, ❌ if absent or missing health probes
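
The per-resource rule above can be written as a small decision function. This is a Python sketch only; `multi_region_mark` and its parameters are hypothetical names for the three facts gathered by the queries in this file.

```python
def multi_region_mark(region_count: int, has_global_lb: bool,
                      has_health_probes: bool) -> str:
    """Return the Multi-Region column mark for one resource."""
    if region_count >= 2 and has_global_lb and has_health_probes:
        return "✅"
    # Single region, or multi-region without an active load balancer / probes.
    return "❌"
```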

storage-redundancy-checks.md 5.0 KB

# Storage Redundancy Checks

## Overview

Storage accounts underpin Azure Functions, Container Apps (for host storage), and App Service. If compute is zone-redundant but storage is not, a zone failure can still cause downtime.

## Replication Types

| Type | Zone Redundancy | Region Redundancy | Description |
|---|---|---|---|
| LRS | ❌ None | ❌ None | 3 copies in one datacenter. No zone or region protection. |
| ZRS | ✅ Zone-redundant | ❌ None | 3 copies across 3 availability zones in one region. |
| GRS | ❌ None (LRS per region) | ✅ Region-redundant | LRS in primary + LRS in secondary region. Zone failure in primary = risk. |
| GZRS | ✅ Zone-redundant | ✅ Region-redundant | ZRS in primary region + LRS in secondary region. Best protection. |
| RA-GRS | ❌ None (LRS per region) | ✅ Region + read | Like GRS but secondary is readable. Still LRS within each region. |
| RA-GZRS | ✅ Zone-redundant | ✅ Region + read | GZRS + read access to secondary. Maximum redundancy. |

## Minimum Requirement

- If compute is zone-redundant → storage MUST be at least **ZRS** (not GRS — GRS uses LRS in each region and is NOT zone-redundant)
- For multi-region failover → storage should be **GZRS** (zone + region) or **GRS** (region only, accepts zone risk)
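
These two rules can be encoded directly from the replication table. The Python sketch below is illustrative; `replication` and `meets_minimum` are hypothetical helper names, and the SKU names are assumed to follow the `<tier>_<replication>` form returned in `sku.name`.

```python
ZONE_REDUNDANT = {"ZRS", "GZRS", "RAGZRS"}
REGION_REDUNDANT = {"GRS", "RAGRS", "GZRS", "RAGZRS"}


def replication(sku_name: str) -> str:
    """Extract the replication part, e.g. 'Standard_RAGZRS' -> 'RAGZRS'."""
    return sku_name.split("_", 1)[1].upper()


def meets_minimum(sku_name: str, compute_zone_redundant: bool,
                  multi_region: bool) -> bool:
    """Check a storage SKU against the zone and region requirements."""
    rep = replication(sku_name)
    if compute_zone_redundant and rep not in ZONE_REDUNDANT:
        return False
    if multi_region and rep not in REGION_REDUNDANT:
        return False
    return True
```

For instance, `meets_minimum("Standard_GRS", True, False)` is `False` because GRS keeps LRS copies inside each region.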

## Resource Graph Queries

> **⚠️ Output format:** Use `--query "data[]" -o json` (not `-o table`). `az graph query -o table` only renders summary columns and does not show projected fields.

### Find All Storage Accounts and Their Replication

```bash
az graph query -q "
Resources
| where type =~ 'microsoft.storage/storageaccounts'
| extend replication = tostring(sku.name)
| extend tier = tostring(sku.tier)
| project name, resourceGroup, location, replication, tier, kind
| order by replication asc
" --query "data[]" -o json
```

> **💡 No SKU specified?** If a storage account was deployed without an explicit `sku.name` (raw ARM/Bicep) or `skuName` (AVM module), Azure defaults to **`Standard_GRS`**. Treat any storage account showing `Standard_GRS` as potentially "defaulted" rather than intentionally chosen — check the IaC source to confirm and recommend setting it explicitly to `Standard_ZRS` or `Standard_GZRS`.

### Find Storage Accounts Using LRS (Not Zone Redundant)

```bash
az graph query -q "
Resources
| where type =~ 'microsoft.storage/storageaccounts'
| where sku.name =~ 'Standard_LRS' or sku.name =~ 'Premium_LRS'
| project name, resourceGroup, location, replication=sku.name
" --query "data[]" -o json
```

### Find Function App Host Storage Accounts

```bash
# List function apps and their storage connections
az graph query -q "
Resources
| where type =~ 'microsoft.web/sites'
| where kind contains 'functionapp'
| project name, resourceGroup, location
" --query "data[]" -o json

# Then for each function app, check its storage:
az functionapp config appsettings list \
  --name <app-name> \
  --resource-group <rg> \
  --query "[?name=='AzureWebJobsStorage'].value" -o tsv
```

### Cross-Reference: Zone-Redundant Compute with Non-ZRS Storage

This is a critical gap detection query — zone-redundant compute paired with LRS storage:

```bash
# Step 1: Find zone-redundant plans
az graph query -q "
Resources
| where type =~ 'microsoft.web/serverfarms'
| where tobool(properties.zoneRedundant) == true
| project planName=name, resourceGroup, location
" --query "data[]" -o json

# Step 2: Check if associated storage accounts are ZRS
# (Requires app-level inspection of AzureWebJobsStorage setting)
```
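
For step 2, the storage account name usually has to be pulled out of the `AzureWebJobsStorage` connection string. A minimal Python sketch follows; `storage_account_from_connection_string` is a hypothetical helper name, and the sample value in the note is made up. Apps that use identity-based connections may not have a connection string at all (they typically expose `AzureWebJobsStorage__accountName` instead), so treat a `None` result as "inspect manually".

```python
from typing import Optional


def storage_account_from_connection_string(value: str) -> Optional[str]:
    """Return the AccountName segment of a storage connection string, if present."""
    for part in value.split(";"):
        # partition() splits on the first "=", so base64 padding in keys is preserved.
        key, sep, val = part.partition("=")
        if sep and key.strip().lower() == "accountname":
            return val.strip()
    return None
```

Given `DefaultEndpointsProtocol=https;AccountName=stfuncprod01;AccountKey=...`, the function returns `stfuncprod01`, which can then be matched against the replication query results.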

## Remediation

### Upgrade Storage from LRS to ZRS

```bash
# Inspect the account first: conversion needs a supported account kind (see limitations below)
az storage account show \
  --name <account-name> \
  --resource-group <rg> \
  --query "{name:name, sku:sku.name, kind:kind, location:location}"

# Request the live conversion (Standard_LRS → Standard_ZRS)
az storage account migration start \
  --account-name <account-name> \
  --resource-group <rg> \
  --sku Standard_ZRS \
  --no-wait

# Check the conversion status
az storage account migration show \
  --account-name <account-name> \
  --resource-group <rg>
```

⚠️ **Limitations:**
- Live migration from LRS to ZRS is only supported for Standard general-purpose v2 accounts
- Premium accounts and legacy account types require manual migration (create new ZRS account + copy data)
- Migration can take hours to days depending on data volume

### Upgrade Storage from LRS to GRS/GZRS

```bash
# Upgrade to GRS
az storage account update \
  --name <account-name> \
  --resource-group <rg> \
  --sku Standard_GRS

# Upgrade to GZRS (zone + region redundant).
# From LRS this may need two steps: convert to ZRS first (see above), then update to GZRS.
az storage account update \
  --name <account-name> \
  --resource-group <rg> \
  --sku Standard_GZRS
```

## Reporting

For the reliability checklist, mark the **ZRS Storage** column per storage account:
- ✅ — SKU is `Standard_ZRS`, `Standard_GZRS`, or `Standard_RAGZRS`
- ❌ (LRS) — `Standard_LRS` (no zone redundancy)
- ❌ (GRS) — `Standard_GRS` or `Standard_RAGRS` (region-redundant but uses LRS in each region; still a zone-failure risk). Also flag this when no SKU is set in IaC at all — ARM/AVM defaults to `Standard_GRS`.

⚠️ **Key distinction:** GRS provides region redundancy but uses LRS in each region. If compute is zone-redundant but storage is GRS (not ZRS/GZRS), a zone failure can still impact storage availability.

zone-redundancy-checks.md 2.8 KB

# Zone Redundancy — Platform Overview

## Overview

Zone redundancy distributes compute instances across availability zones within a region. If one zone fails, instances in other zones continue serving traffic automatically.

This file covers **platform-level discovery and concepts**. For service-specific assessment queries, configuration commands, and IaC patches, see:

| Service | Reference |
|---|---|
| Azure App Service | [services/app-service/reliability.md](services/app-service/reliability.md) |
| Azure Functions | [services/functions/reliability.md](services/functions/reliability.md) |

> Azure Container Apps deep-dive references are planned for a future version of this skill. The discovery query below still surfaces those resources — just don't dispatch to a per-service reference for them yet.

## Discovery: Find All Non-Zone-Redundant Compute

> **⚠️ Output format:** Use `--query "data[]" -o json` for `az graph query`. `-o table` only renders summary columns (`Count`, `Total_records`) and hides projected fields. Pipe JSON through `jq` if you need a table view.

Use this single query to discover every compute resource in scope that is **not** zone-redundant. Use it during Phase 1 (Discover Resources) to decide which service references to load.

```bash
az graph query -q "
Resources
| where type in~ ('microsoft.web/serverfarms', 'microsoft.app/managedenvironments')
| extend zoneRedundant = tobool(properties.zoneRedundant)
| where zoneRedundant == false or isnull(zoneRedundant)
| project name, type, resourceGroup, location, sku=sku.name
| order by type asc
" --query "data[]" -o json
```

For each row in the result, dispatch to the matching service reference:
- `microsoft.web/serverfarms` with `kind contains 'functionapp'` → Functions reference
- `microsoft.web/serverfarms` (other kinds) → App Service reference
- `microsoft.app/managedenvironments` → _planned (Container Apps)_ — surface in the discovery summary, do not deep-dive
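
The dispatch rules above amount to a small lookup. A Python sketch is shown below; `service_reference` is a hypothetical helper name, and the returned paths are the per-service references listed earlier in this file.

```python
from typing import Optional


def service_reference(resource_type: str, kind: str = "") -> Optional[str]:
    """Return the per-service reference to load, or None when none exists yet."""
    rtype = resource_type.lower()
    if rtype == "microsoft.web/serverfarms":
        if "functionapp" in (kind or "").lower():
            return "services/functions/reliability.md"
        return "services/app-service/reliability.md"
    # e.g. microsoft.app/managedenvironments: list in the summary, no deep dive.
    return None
```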

## Regions Supporting Availability Zones

```bash
az functionapp list-flexconsumption-locations --zone-redundant=true
```

Common regions with AZ support across Functions, App Service, and Container Apps:

- East US, East US 2, West US 2, West US 3
- Central US, South Central US
- North Europe, West Europe, UK South
- France Central, Germany West Central, Sweden Central
- Southeast Asia, Japan East, Australia East

> **Service-specific region availability differs.** Always confirm support for the specific SKU/plan in the target region using the per-service reference.

## Reporting

For the assessment table's `Zone redundancy — compute` row, the per-service references define exactly what `🟢 ON / 🟡 PARTIAL / 🔴 OFF` mean for that service (e.g. ZR + minimum instance count for App Service, Premium Functions). Only App Service and Functions have a per-service reference in this skill version; Container Apps support is planned.

references/services/app-service/

reliability.md 10.1 KB

# Azure App Service — Reliability Reference

## Supported Plans & Zone Redundancy

| Plan | Zone Redundancy | Min Instances | Health Check |
|------|----------------|---------------|--------------|
| Free/Shared (F1/D1) | ❌ Not supported | N/A | ❌ |
| Basic (B1/B2/B3) | ❌ Not supported | N/A | ✅ |
| Standard (S1/S2/S3) | ❌ Not supported | N/A | ✅ |
| Premium v2 (P1v2+) | ✅ `zoneRedundant: true` + `capacity: 2` | 2 | ✅ |
| Premium v3 (P0v3+) | ✅ `zoneRedundant: true` + `capacity: 2` | 2 (recommended) | ✅ |
| Premium v4 (P0v4+) | ✅ `zoneRedundant: true` + `capacity: 2` | 2 (recommended) | ✅ |
| Isolated v2 (I1v2+) | ✅ `zoneRedundant: true` + `capacity: 2`  | 2 | ✅ |

## Assessment Queries

> **⚠️ Output format:** Use `--query "data[]" -o json` for `az graph query`. `-o table` only shows summary columns (`Count`, `Total_records`) and hides projected fields. Standard `az webapp` commands work fine with `-o table`.

### Plan Zone Redundancy
```bash
az graph query -q "
resources
| where resourceGroup =~ '<rg>'
| where type =~ 'microsoft.web/serverfarms'
| where kind !contains 'functionapp'
| project name, sku=sku.name, capacity=sku.capacity, zoneRedundant=properties.zoneRedundant, location
" --subscriptions <sub-id> --query "data[]" -o json
```

### Health Check Configuration
```bash
az webapp config show --name <app> --resource-group <rg> \
  --query "{healthCheckPath:healthCheckPath, alwaysOn:alwaysOn}" -o table
```

### Client Affinity (ARR Affinity) — should be **disabled** for ZR / multi-region
```bash
az webapp show --name <app> --resource-group <rg> \
  --query "clientAffinityEnabled" -o tsv
```
When `true`, sticky sessions pin clients to a single instance and defeat zone load balancing.

### Deployment Slots (for zero-downtime deploys)
```bash
az webapp deployment slot list --name <app> --resource-group <rg> \
  --query "[].{name:name, state:state}" -o table
```

### Auto Heal

Auto Heal automatically restarts or otherwise mitigates your web app when it hits defined thresholds. It can be configured via the Azure portal, the CLI, or ARM/Bicep.

**Azure portal:** App Service -> Diagnose and solve problems -> Auto-heal (under Diagnostic Tools), or directly via App Service -> Configuration -> General settings -> Auto Heal.

**CLI:**
```bash
az webapp config set --resource-group <rg> --name <app> --auto-heal-enabled true

# Rules must be set via ARM PATCH (CLI doesn't expose autoHealRules directly)
az rest --method patch \
  --uri "https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Web/sites/{app}/config/web?api-version=2022-03-01" \
  --body '{"properties":{"autoHealEnabled":true,"autoHealRules":{...}}}'
```

## Configure: Zone Redundancy

### Upgrade Plan (if needed)
```bash
# Check current SKU
az appservice plan show --name <plan> --resource-group <rg> --query "sku"

# Upgrade to Premium v3 (if currently on lower tier)
az appservice plan update \
  --name <plan> \
  --resource-group <rg> \
  --sku P1v3
```

### Enable Zone Redundancy
```bash
# Set min instances (required for ZR)
az appservice plan update \
  --name <plan> \
  --resource-group <rg> \
  --number-of-workers 2

# Enable ZR
az resource update \
  --resource-group <rg> \
  --name <plan> \
  --resource-type "Microsoft.Web/serverfarms" \
  --set properties.zoneRedundant=true
```

⚠️ Enabling zone redundancy may require scaling up first — for the supported App Service plans listed above, set the plan to at least 2 instances before enabling ZR.

## Configure: Health Check

```bash
# Enable health check
az webapp config set \
  --name <app> \
  --resource-group <rg> \
  --generic-configurations '{"healthCheckPath": "/api/health"}'
```

⚠️ **Warning:** Enabling health check causes an app restart. Configure during maintenance window.

### Health Check Behavior
- Ping interval: **1 minute**
- Failure threshold: **10 consecutive failures** (configurable via `WEBSITE_HEALTHCHECK_MAXPINGFAILURES`)
- After threshold: instance marked unhealthy, replaced within **1 hour**
- Healthy threshold: **1 successful response** restores instance
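
As a worked example of these numbers, the time before an instance is marked unhealthy is roughly the failure threshold multiplied by the ping interval. The Python sketch below assumes the 1-minute interval stated above and an allowed range of 2 to 10 for `WEBSITE_HEALTHCHECK_MAXPINGFAILURES`. The range is an assumption to verify against current App Service documentation, and `minutes_until_unhealthy` is a hypothetical helper name.

```python
PING_INTERVAL_MINUTES = 1


def minutes_until_unhealthy(max_ping_failures: int = 10) -> int:
    """Approximate minutes of consecutive failures before an instance is marked unhealthy."""
    if not 2 <= max_ping_failures <= 10:  # assumed valid range for the app setting
        raise ValueError("max_ping_failures is expected to be between 2 and 10")
    return max_ping_failures * PING_INTERVAL_MINUTES
```

With the default threshold an instance keeps receiving traffic for about 10 minutes of failed pings. Lowering the setting to 2 shortens that to about 2 minutes at the cost of more sensitivity to transient errors.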

### Recommended: Always On
```bash
az webapp config set \
  --name <app> \
  --resource-group <rg> \
  --always-on true
```

## Configure: Disable Client Affinity (ARR Affinity)

App Service enables ARR affinity by default, which pins each client to a single instance via the `ARRAffinity` cookie. **This defeats zone-load-balancing and any multi-region routing**, so it should be disabled for stateless apps:

```bash
az webapp update --name <app> --resource-group <rg> \
  --client-affinity-enabled false
```

Leave it on **only** if your app stores state in instance memory and you cannot move it to a shared cache / database.

## Configure: Deployment Slots (Zero-Downtime)

Deployment slots complement reliability by enabling safe deployments:

```bash
# Create staging slot
az webapp deployment slot create \
  --name <app> \
  --resource-group <rg> \
  --slot staging

# Deploy to staging first, then swap
az webapp deployment slot swap \
  --name <app> \
  --resource-group <rg> \
  --slot staging \
  --target-slot production
```

## Backup Support by SKU

| Plan | Automatic Backup | Custom Backup |
|------|----------------|---------------|
| Free/Shared (F1/D1) | ❌ Not supported | ❌ Not supported |
| Basic (B1/B2/B3) | ✅ | ✅ Configuration required |
| Standard (S1/S2/S3) | ✅ | ✅ Configuration required |
| Premium v2 (P1v2+) | ✅ | ✅ Configuration required |
| Premium v3 (P0v3+) | ✅ | ✅ Configuration required |
| Premium v4 (P0v4+) | ✅ | ✅ Configuration required |
| Isolated v2 (I1v2+) | ✅ | ✅ Configuration required |

- Automatic backups are recommended because they require no configuration and are enabled automatically.

## IaC Patching: Bicep

> **AVM modules:** If the project uses `br/public:avm/res/web/serverfarm` or `br/public:avm/res/web/site`, the parameter names differ from raw ARM (e.g. `zoneRedundant` and `skuCapacity` are top-level params; `siteConfig` is usually preserved). Always grep the actual module call (`Select-String -Path infra -Recurse -Pattern "avm/res/web/" -Context 0,15`) and patch the params already in use. The raw-Bicep examples below show the property paths to translate.

### App Service Plan
```bicep
resource appServicePlan 'Microsoft.Web/serverfarms@2023-12-01' = {
  name: planName
  location: location
  sku: {
    name: 'P0v3'
    capacity: 2              // ← ADD (min 2 for ZR)
  }
  properties: {
    reserved: true           // Linux
    zoneRedundant: true      // ← ADD
  }
}
```

### Web App — Health Check
```bicep
resource webApp 'Microsoft.Web/sites@2023-12-01' = {
  name: appName
  location: location
  properties: {
    serverFarmId: appServicePlan.id
    siteConfig: {
      healthCheckPath: '/api/health'  // ← ADD
      alwaysOn: true                  // ← ADD (recommended)
    }
  }
}
```

## IaC Patching: Terraform

### App Service Plan
```hcl
resource "azurerm_service_plan" "plan" {
  name                   = var.plan_name
  location               = azurerm_resource_group.rg.location
  resource_group_name    = azurerm_resource_group.rg.name
  os_type                = "Linux"
  sku_name               = "P1v3"
  worker_count           = 2                   # ← ADD (min 2 for ZR)
  zone_balancing_enabled = true                # ← ADD
}
```

### Web App — Health Check
```hcl
resource "azurerm_linux_web_app" "app" {
  name                = var.app_name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  service_plan_id     = azurerm_service_plan.plan.id

  site_config {
    health_check_path = "/api/health"    # ← ADD
    always_on         = true             # ← ADD
  }
}
```

## Virtual Network Integration notes

- Subnet sizing is important. VNet integration consumes IPs during scale-out and slot swaps.
- An undersized subnet causes scale or deployment failures during regional stress or failover. Recommend /26 minimum, /24 for larger plans. Zone-redundant plans require the integration subnet to be sized for zone redundancy (more IPs).
- Subnets cannot be resized after assignment without reconfiguring VNet integration.
- Dependencies reached over private endpoints must have a per-region private endpoint and private DNS zone. Sharing a single or global private DNS zone linked to the primary VNet will break failover.
- Recommend Azure DNS Private Resolver per region, or per-region forwarders. Verify `WEBSITE_DNS_SERVER` / `WEBSITE_DNS_ALT_SERVER` are set with a fallback.
- For predictable outbound traffic flow during failover, attach a NAT Gateway to the subnet in each region so that partner allow lists work for all regions. NAT Gateway also helps avoid SNAT port exhaustion under load.
- Service Endpoints vs Private Endpoints: service endpoints are regional and don't fail over. Use Private Endpoints per region for resiliency.
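
To put the subnet sizes in concrete terms: Azure reserves 5 addresses in every subnet, so the usable count is the block size minus 5. A short Python sketch using the standard `ipaddress` module follows; `usable_addresses` is a hypothetical helper name and the CIDR blocks in the note are examples only.

```python
import ipaddress

# Azure reserves the network address, the default gateway, two DNS addresses,
# and the broadcast address in each subnet.
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for workloads in an Azure subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET
```

A /26 such as `10.0.0.0/26` leaves 59 usable addresses and a /24 leaves 251, which is the headroom scale-out and slot swaps draw from.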

## Multi-Region Notes

- App Service supports deployment slots — use slot swap for safe regional deployments
- Consider auto-scale rules to handle failover traffic surge
- App Service Managed Certificates don't support custom domains on Front Door — use App Service Certificate or Key Vault
- Client affinity (ARR Affinity) must be disabled for multi-region (see Configure: Disable Client Affinity above)
- An App Service Environment (v3) lives in one subnet and is regional; multi-region still requires one ASE per region with Azure Front Door or Traffic Manager in front.


## Reporting (for the assessment table)

When the parent skill builds the feature-pivoted assessment table, report each App Service resource on the relevant rows:

| Feature row | What to report |
|---|---|
| Zone redundancy — compute | `🟢 ON` if the **plan** has `zoneRedundant: true` AND `sku.capacity ≥ 2`. `🔴 OFF` if either is missing or the plan tier doesn't support ZR (Free / Shared / Basic / Standard). Annotate `(needs plan upgrade)` for unsupported tiers. |
| Health probes | `🟢 ON` if `siteConfig.healthCheckPath` is set on the **app**. `🔴 OFF` if empty. Basic tier and above support it; Free/Shared do not — annotate `(needs plan upgrade)` in that case. |
| Multi-region failover | `🟢 ON` if the same app is deployed in ≥2 regions behind Front Door / Traffic Manager. `🟡 PARTIAL` if multi-region is set up but `clientAffinityEnabled` is still `true` (sticky sessions break failover). `🔴 OFF` otherwise. |
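
The zone-redundancy row can be derived from the three plan fields returned by the Plan Zone Redundancy query. The Python sketch below is illustrative; `zone_redundancy_status` is a hypothetical helper name, and the SKU pattern covers only the unsupported tiers named in the plan table (F1, D1, B1 to B3, S1 to S3).

```python
import re

# Free, Shared, Basic and Standard SKUs do not support zone redundancy.
UNSUPPORTED_SKU = re.compile(r"(F1|D1|B[1-3]|S[1-3])", re.IGNORECASE)


def zone_redundancy_status(sku_name: str, zone_redundant, capacity: int) -> str:
    """Return the assessment-table value for 'Zone redundancy — compute'."""
    if UNSUPPORTED_SKU.fullmatch(sku_name):
        return "🔴 OFF (needs plan upgrade)"
    # zone_redundant may be None when the property is absent from the query result.
    if zone_redundant and capacity >= 2:
        return "🟢 ON"
    return "🔴 OFF"
```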

references/services/functions/

reliability.md 6.2 KB

# Azure Functions — Reliability Reference

## Supported Plans & Zone Redundancy

| Plan | Zone Redundancy | Min Instances | Health Check |
|------|----------------|---------------|--------------|
| Flex Consumption (FC1) | ✅ `zoneRedundant: true` | Auto-managed | ❌ Platform health check not supported |
| Premium (EP1/EP2/EP3) | ✅ `zoneRedundant: true` + `sku.capacity: 2` | `minimumElasticInstanceCount: 2` per app | ✅ `healthCheckPath` |
| Consumption (Y1) | ❌ Not supported | N/A | ❌ Not supported |
| Dedicated (P1v2+) | ✅ (treated as App Service) | `sku.capacity: 2` | ✅ `healthCheckPath` |

## Assessment Queries

### Zone Redundancy Check
```bash
az graph query -q "
resources
| where resourceGroup =~ '<rg>'
| where type =~ 'microsoft.web/serverfarms'
| where kind contains 'functionapp' or kind =~ 'linux' or kind =~ 'elastic'
| project name, sku=sku.name, zoneRedundant=properties.zoneRedundant, location
" --subscriptions <sub-id> --query "data[]" -o json
```

### Function App Instance Count (Premium)
```bash
az functionapp show --name <app> --resource-group <rg> \
  --query "{minInstances:siteConfig.minimumElasticInstanceCount}" -o table
```

## Configure: Zone Redundancy

### Flex Consumption (FC1)
```bash
# Enable zone redundancy on plan
az resource update \
  --resource-group <rg> \
  --name <plan-name> \
  --resource-type "Microsoft.Web/serverfarms" \
  --set properties.zoneRedundant=true
```

### Premium (EP1/EP2/EP3)
```bash
# Enable zone redundancy + set min capacity
az appservice plan update \
  --name <plan-name> \
  --resource-group <rg> \
  --number-of-workers 2

az resource update \
  --resource-group <rg> \
  --name <plan-name> \
  --resource-type "Microsoft.Web/serverfarms" \
  --set properties.zoneRedundant=true

# Set minimum elastic instances per app
az resource update \
  --resource-group <rg> \
  --name <app-name> \
  --resource-type "Microsoft.Web/sites" \
  --set properties.siteConfig.minimumElasticInstanceCount=2
```

### Consumption (Y1) — upgrade path required

Consumption (Y1) plans do **not** support zone redundancy. The user must upgrade the plan first:

- **Recommended:** Upgrade to **Flex Consumption** — similar serverless model, supports ZR, no per-app minimum cost.
- **Alternative:** Upgrade to **Premium (EP1+)** — more control, higher base cost (always-ready instances charged 24/7).

⚠️ Inform the user of cost implications **before** initiating any plan change.

## Configure: Health Endpoint

Flex Consumption does NOT support platform health check (`healthCheckPath`). Instead, add an HTTP endpoint in code:

### TypeScript (v4 programming model)
```typescript
import { app } from "@azure/functions";

app.http('health', {
  methods: ['GET'],
  authLevel: 'anonymous',
  route: 'health',
  handler: async () => ({ status: 200, body: 'OK' })
});
```

### Python (v2 programming model)
```python
import azure.functions as func

app = func.FunctionApp()

@app.route(route="health", methods=["GET"], auth_level=func.AuthLevel.ANONYMOUS)
def health(req: func.HttpRequest) -> func.HttpResponse:
    return func.HttpResponse("OK", status_code=200)
```

### C# (isolated worker)
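```csharp
// HttpRequest/IActionResult need Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Functions.Worker;
public class HealthFunctions
{
    [Function("Health")]
    public IActionResult Health([HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "health")] HttpRequest req)
    {
        return new OkObjectResult("OK");
    }
}
```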

### Premium Functions — Platform Health Check
```bash
az webapp config set \
  --name <app-name> \
  --resource-group <rg> \
  --generic-configurations '{"healthCheckPath": "/api/health"}'
```

⚠️ Enabling health check causes an app restart.
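
Once a health endpoint is deployed, it can be verified from outside the app. Below is a minimal sketch using only the Python standard library; the function name `probe_health` and its injectable `opener` parameter are assumptions introduced for illustration, and the URL is whatever hostname the app actually exposes with `/api/health` appended.

```python
import http.client
import urllib.error
import urllib.request


def probe_health(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True only if the endpoint answers HTTP 200 within the timeout.

    `opener` defaults to urllib's urlopen; it is injectable so the check
    can be exercised without a live app.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (non-2xx), and timeouts are OSError subclasses.
        return False
```

Calling `probe_health("https://<your-app-hostname>/api/health")` returns `False` for connection failures, timeouts, and non-200 responses alike, so any failure mode counts as unhealthy.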

## IaC Patching: Bicep

### App Service Plan (AVM module)
```bicep
module appServicePlan 'br/public:avm/res/web/serverfarm:0.5.0' = {
  params: {
    skuName: 'FC1'
    reserved: true
    zoneRedundant: true  // ← ADD
  }
}
```

### Premium Plan — extra settings
```bicep
module appServicePlan 'br/public:avm/res/web/serverfarm:0.5.0' = {
  params: {
    skuName: 'EP1'
    reserved: true
    zoneRedundant: true      // ← ADD
    skuCapacity: 2           // ← ADD (min 2 for ZR)
  }
}

// On the function app resource:
resource functionApp 'Microsoft.Web/sites@2023-12-01' = {
  properties: {
    siteConfig: {
      minimumElasticInstanceCount: 2  // ← ADD
    }
  }
}
```

## IaC Patching: Terraform

```hcl
resource "azurerm_service_plan" "plan" {
  sku_name               = "FC1"
  os_type                = "Linux"
  zone_balancing_enabled = true  # ← ADD
}

# Premium plan (use instead of the FC1 block above, not alongside it):
resource "azurerm_service_plan" "plan" {
  sku_name               = "EP1"
  os_type                = "Linux"
  zone_balancing_enabled = true  # ← ADD
  worker_count           = 2     # ← ADD
}

resource "azurerm_linux_function_app" "func" {
  site_config {
    minimum_elastic_instance_count = 2  # ← ADD (Premium only)
  }
}
```

## Multi-Region Notes

- Flex Consumption standby costs ~$0 (pay-per-execution) — ideal for active-passive
- Code must be deployed to both regions separately
- Event Hub checkpoints are per-app — secondary starts from its own checkpoint on failover
- Consider Event Hubs Geo-DR for true event replication

## Reporting (for the assessment table)

When the parent skill builds the feature-pivoted assessment table, report each Functions resource on the relevant rows:

| Feature row | What to report |
|---|---|
| Zone redundancy — compute | `🟢 ON` if the **plan** has `zoneRedundant: true`. For Premium plans, also requires `sku.capacity ≥ 2` AND each Function App has `minimumElasticInstanceCount ≥ 2`. `🔴 OFF` if the plan tier doesn't support ZR (Consumption Y1) — annotate `(needs plan upgrade to Flex / Premium)`. |
| Health probes | For Premium / Dedicated: `🟢 ON` if `siteConfig.healthCheckPath` is set, `🔴 OFF` otherwise. For Flex Consumption (FC1) / Consumption (Y1): always annotate `🔴 OFF (code-only fix)` — `healthCheckPath` is not supported on these plans, so an HTTP-triggered `/api/health` function must be added in app code (gated by user consent — see [configure-health-probes.md](../../configure-health-probes.md)). |
| Multi-region failover | `🟢 ON` if the same Function App is deployed in ≥2 regions behind Front Door / Traffic Manager; otherwise `🔴 OFF`. |

## Additional References

- [Reliability in Azure Functions (Microsoft Learn)](https://learn.microsoft.com/en-us/azure/reliability/reliability-functions)

License (MIT)

MIT Source: microsoft/azure-skills

MIT License

Copyright 2025 (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

Security Scan

2 issues found

Every skill undergoes a two-pass automated security scan before being published to the Hub.

medium Path Traversal

file:references/configure-zone-redundancy.md

Fix: Use path.resolve() or os.path.abspath() to normalize paths and verify the result stays within the intended directory.

medium Path Traversal

file:references/services/functions/reliability.md

Fix: Use path.resolve() or os.path.abspath() to normalize paths and verify the result stays within the intended directory.

How does it work?

Pass 1 — Pattern analysis scans every file in the skill against 13 security rules for known dangerous patterns:

Script & command detection — Shell commands, exec/spawn calls, subprocess invocations, and curl-pipe-to-shell patterns.
Prompt injection markers — Phrases that attempt to override safety guidelines, bypass restrictions, or manipulate AI behavior.
Sensitive data & secrets — Hardcoded API keys, credentials, tokens, and access to sensitive system files.
Obfuscation patterns — Base64 decode-and-execute, dynamic code evaluation, and unsafe deserialization.
Data exfiltration risks — Environment variables sent to external URLs, writes to sensitive paths, and SQL injection patterns.

Pass 2 — AI deep scan uses GitHub Copilot to semantically analyze skill content for threats that regex can't catch:

Intent analysis — Detects code that appears benign line-by-line but is malicious in aggregate, such as disguised data exfiltration.
Social engineering — Instructions that trick users into running dangerous commands or sharing credentials.
Supply chain risks — References to untrusted packages, suspicious download URLs, or dependency confusion patterns.