Installation

Install with CLI Recommended
gh skills-hub install airunway-aks-setup

Don't have the extension? Run gh extension install samueltauil/skills-hub first.

Download and extract to your repository:

.github/skills/airunway-aks-setup/

Extract the ZIP to .github/skills/ in your repo. The folder name must match airunway-aks-setup for Copilot to auto-discover it.

Skill Files (11)

SKILL.md 4.0 KB
---
name: airunway-aks-setup
description: "Set up AI Runway on AKS โ€” from bare cluster to running model. Covers cluster verification, controller install, GPU assessment, provider setup, and first deployment. WHEN: \"setup AI Runway\", \"onboard AKS cluster\", \"install AI Runway\", \"airunway setup\", \"deploy model to AKS\", \"GPU inference on AKS\", \"KAITO setup on AKS\", \"run LLM on AKS\", \"vLLM on AKS\", \"set up model serving on AKS\", \"AI Runway controller\"."
license: MIT
metadata:
  author: Microsoft
  version: "1.0.1"
argument-hint: "[skip-to-step N]"
---

# AI Runway AKS Setup

This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.

> **Cost awareness:** GPU node pools incur significant compute charges (A100-80GB can cost $3โ€“5+/hr). Confirm the user understands cost implications before provisioning GPU resources.

## Prerequisites

This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the `azure-kubernetes` skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.

## Quick Reference

| Property | Value |
|----------|-------|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | `kubectl`, `make`, `curl` |
| MCP tools | None |
| Related skills | `azure-kubernetes` (cluster setup), `azure-diagnostics` (troubleshooting) |

## When to Use This Skill

Use this skill when the user wants to:
- Set up AI Runway on an existing AKS cluster from scratch
- Install the AI Runway controller and CRDs
- Assess GPU hardware compatibility for model deployment
- Choose and install an inference provider (KAITO, Dynamo, KubeRay)
- Deploy their first AI model to AKS via AI Runway
- Resume a partially-complete AI Runway setup from a specific step

## MCP Tools

This skill uses no MCP tools. All cluster operations are performed directly via `kubectl` and `make`.

## Rules

1. Execute steps in sequence โ€” load the reference for each step as you reach it
2. Report cluster state at each step: โœ“ healthy, โœ— missing/failed
3. Ask for user confirmation before any install or deployment action
4. If a step is already complete, report status and skip to the next step
5. If the user provides `skip-to-step N`, start at step N; assume prior steps are complete

## Steps

| # | Step | Reference |
|---|------|-----------|
| 1 | **Cluster Verification** โ€” context check, node inventory, GPU detection | [step-1-verify.md](references/steps/step-1-verify.md) |
| 2 | **Controller Installation** โ€” CRD + controller deployment | [step-2-controller.md](references/steps/step-2-controller.md) |
| 3 | **GPU Assessment** โ€” detect GPU models, flag dtype/attention constraints | [step-3-gpu.md](references/steps/step-3-gpu.md) |
| 4 | **Provider Setup** โ€” recommend and install inference provider | [step-4-provider.md](references/steps/step-4-provider.md) |
| 5 | **First Deployment** โ€” pick a model, deploy, verify Ready | [step-5-deploy.md](references/steps/step-5-deploy.md) |
| 6 | **Summary** โ€” recap, smoke test, next steps | [step-6-summary.md](references/steps/step-6-summary.md) |

## Error Handling

| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl logs <pod-name> -n <namespace>` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment <name> -n <namespace>` events |
| `bfloat16` errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to serving args |

For full error handling and rollback procedures, see [troubleshooting.md](references/troubleshooting.md).
references/
gpu-profiles.md 2.5 KB
# GPU Compatibility Reference

This reference is used by the `airunway-aks-setup` skill during **Step 3 โ€” GPU Assessment** to match detected GPU hardware to known compatibility profiles and surface constraints before model deployment.

## GPU Profiles

| GPU Model | VRAM (GB) | bfloat16 | float16 | Attention Backends | Compute Capability | Notes |
|-----------|-----------|----------|---------|-------------------|-------------------|-------|
| T4 | 16 | No | Yes | XFORMERS | 7.5 | No bfloat16 โ€” must use float16 or float32. |
| V100 (16 GB) | 16 | No | Yes | XFORMERS | 7.0 | Limited flash attention. Prefer xformers. |
| V100 (32 GB) | 32 | No | Yes | XFORMERS | 7.0 | Same dtype constraints as 16 GB variant. |
| A10 | 24 | Yes | Yes | FLASH_ATTN, XFORMERS | 8.6 | Single-slot Ampere GPU. |
| A10G | 24 | Yes | Yes | FLASH_ATTN, XFORMERS | 8.6 | Good general-purpose GPU. Common in AWS. |
| L4 | 24 | Yes | Yes | FLASH_ATTN, XFORMERS | 8.9 | Inference-optimized. Common in GCP and Azure. |
| L40S | 48 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 8.9 | Ada Lovelace. Growing availability on Azure. |
| A100 (40 GB) | 40 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 8.0 | High-performance training and inference. |
| A100 (80 GB) | 80 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 8.0 | Recommended for large models (70B+). |
| H100 | 80 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 9.0 | Highest single-GPU performance. |
| H200 (SXM) | 141 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 9.0 | Maximum memory (HBM3e). |
| H20 | 96 | Yes | Yes | FLASH_ATTN, TRITON_ATTN, XFORMERS | 9.0 | China-market H100 derivative. 96 GB HBM3. |

### Attention Backends

| Backend | Description | Min Compute Capability |
|---------|-------------|----------------------|
| FLASH_ATTN | FlashAttention-2 โ€” fastest, lowest memory | 8.0 (Ampere+) |
| TRITON_ATTN | Triton-based attention โ€” good performance | 8.0 (Ampere+) |
| XFORMERS | Memory-efficient attention โ€” works on older GPUs | 7.0 (Volta+) |

## Compatibility Warnings

Surface these warnings when the following GPUs are detected:

### T4
> **Warning:** T4 GPUs do not support bfloat16. You must configure `--dtype float16` in serving arguments. Failure to do so causes errors or silent dtype casting.

### V100
> **Warning:** V100 GPUs do not support bfloat16 and have limited flash attention support. Use xformers backend and `--dtype float16`.

## Model Sizing & Recommendations

For VRAM sizing estimates and starter model recommendations, see [model-sizing.md](model-sizing.md).
model-sizing.md 1.8 KB
# Model Sizing & Starter Recommendations

## VRAM Sizing Guide

Estimate how much VRAM a model requires. Values are **weights only** โ€” add 20โ€“30% overhead for KV cache and activations during inference.

| Parameters | float16 / bfloat16 | int8 | int4 / GGUF Q4 |
|-----------|-------------------|------|----------------|
| 1B | ~2 GB | ~1 GB | ~0.5 GB |
| 3B | ~6 GB | ~3 GB | ~1.5 GB |
| 7โ€“8B | ~14โ€“16 GB | ~7โ€“8 GB | ~3.5โ€“4 GB |
| 13B | ~26 GB | ~13 GB | ~6.5 GB |
| 34B | ~68 GB | ~34 GB | ~17 GB |
| 70B | ~140 GB | ~70 GB | ~35 GB |

**Example:** 4ร— A100 80 GB = 320 GB total. Llama-3.1-70B at bfloat16 โ‰ˆ 168 GB with overhead โ€” fits across 4 GPUs with tensor parallelism.

## Starter Model Recommendations

| Cluster Capacity | Model | Provider | Notes |
|-----------------|-------|----------|-------|
| CPU-only | `google/gemma-3-1b-it-qat-q8_0-gguf` | KAITO (llama.cpp) | GGUF Q8; runs on CPU |
| 1ร— T4 (16 GB) | `microsoft/Phi-3-mini-4k-instruct` | KAITO (vLLM) | ~8 GB float16; fits with headroom |
| 1ร— A10G/L4 (24 GB) | `meta-llama/Llama-3.1-8B-Instruct` | KAITO (vLLM) | ~16 GB bfloat16; gated โ€” needs HF token |
| 1ร— A100 40 GB | `microsoft/Phi-3-medium-128k-instruct` | KAITO (vLLM) | ~28 GB float16; non-gated; MIT license |
| 1ร— A100 80 GB / H100 | `meta-llama/Llama-3.1-8B-Instruct` | KAITO (vLLM) | Oversized; upgrade to 70B if more GPUs available |
| 4ร— A100 80 GB | `meta-llama/Llama-3.1-70B-Instruct` | KAITO (vLLM, TP) | ~168 GB; tensor parallelism; gated |

### Gated Models

These models are **gated on HuggingFace** and require an access token:
- `meta-llama/Llama-3.1-8B-Instruct`
- `meta-llama/Llama-3.1-70B-Instruct`

**Non-gated alternatives** (no token required):
- `microsoft/Phi-3-mini-4k-instruct` (MIT license)
- `microsoft/Phi-3-medium-128k-instruct` (MIT license)
- `google/gemma-3-1b-it-qat-q8_0-gguf` (Gemma license)
powershell-notes.md 3.1 KB
# PowerShell Command Variants

This reference provides PowerShell equivalents for commands in steps 1, 4, 5, and 6 that use Bash-specific syntax.

## Prerequisite Check (Step 1)

```powershell
@('kubectl', 'make', 'curl') | ForEach-Object {
  if (Get-Command $_ -ErrorAction SilentlyContinue) {
    Write-Host "โœ“ $_ found"
  } else {
    Write-Host "โœ— $_ NOT FOUND โ€” install before continuing"
  }
}
```

## GPU Detection (Step 1)

```powershell
# Extract GPU count and model per node (requires NVIDIA device plugin labels)
# PowerShell does not need the bash single-quote escaping โ€” pass the jsonpath directly
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
```

> **Note:** If the output shows empty GPU fields, use the fallback:

```powershell
kubectl describe nodes | Select-String -Pattern "nvidia" -Context 0,2
```

## Provider Check (Step 4)

```powershell
# Check if providers are already registered (errors indicate CRD not yet installed โ€” expected on a fresh cluster)
kubectl get inferenceproviderconfigs --all-namespaces
```

## Provider Discovery (Step 4)

```powershell
# List available providers
Get-ChildItem providers/

# Check default image for a provider
Get-Content providers/<provider>/Makefile | Select-String -Pattern 'IMG\s*\?='
```

## HuggingFace Token Secret (Step 5)

```powershell
$token = Read-Host -Prompt "HuggingFace token" -AsSecureString
$bstr = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($token)
try {
  [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($bstr) |
    Set-Content -NoNewline -Encoding UTF8 -Path "$env:TEMP\hf-token.txt"
} finally {
  [System.Runtime.InteropServices.Marshal]::ZeroFreeBSTR($bstr)
}

kubectl create secret generic hf-token `
  --from-file=token="$env:TEMP\hf-token.txt" `
  -n <namespace> `
  --dry-run=client -o yaml | kubectl apply -f -

Remove-Item -Force "$env:TEMP\hf-token.txt"
```

## ModelDeployment CR (Step 5)

**For gated models (Llama etc.):**

```powershell
$manifest = @"
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: <model-name>
  namespace: <namespace>
spec:
  model:
    id: <model-id>
    huggingFaceTokenSecretRef:
      name: hf-token
      key: token
  provider:
    name: <provider-name>
"@
$manifest | kubectl apply -f -
```

**For non-gated models (Phi-3, Gemma etc.):**

```powershell
$manifest = @"
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: <model-name>
  namespace: <namespace>
spec:
  model:
    id: <model-id>
  provider:
    name: <provider-name>
"@
$manifest | kubectl apply -f -
```

## Smoke Test (Step 6)

```powershell
$endpoint = kubectl get modeldeployment <model-name> -n <namespace> -o jsonpath='{.status.endpoint}'

if ([string]::IsNullOrEmpty($endpoint)) {
  Write-Host "Endpoint not yet available โ€” model may still be starting"
} else {
  $body = @{
    model    = "<model-id>"
    messages = @(@{ role = "user"; content = "Hello!" })
  } | ConvertTo-Json -Depth 3

  Invoke-RestMethod -Method Post -Uri "$endpoint/v1/chat/completions" `
    -ContentType "application/json" -Body $body
}
```
troubleshooting.md 1.5 KB
# Troubleshooting & Rollback

## Error Handling

| Error / Symptom | Likely Cause | Remediation |
|-----------------|--------------|-------------|
| No kubeconfig context | Not connected to a cluster | Run `az aks get-credentials` or equivalent |
| `make: *** No rule to make target` | Not in AI Runway repo root | `cd` to repo root and retry |
| Controller in CrashLoopBackOff | Config or RBAC issue | `kubectl logs -n airunway-system -l control-plane=controller-manager --previous` |
| Provider not ready | Image pull or RBAC issue | `kubectl describe pod` for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | `kubectl describe modeldeployment` events |
| Pod shows ImagePullBackOff | Wrong image reference or missing pull secret | `kubectl describe pod` for the model pod |
| 401 from HuggingFace at model load | Gated model, token secret not wired into CR | Ensure `huggingFaceTokenSecretRef` is set in the CR |
| `bfloat16` errors at inference | T4 or V100 lacks bfloat16 support | Add `--dtype float16` to serving args |

## Rollback

If a step fails and you need to undo a partial setup, work backwards through the steps:

| What to undo | Command |
|--------------|---------|
| Model deployment | `kubectl delete modeldeployment <model-name> -n <namespace>` |
| HuggingFace token secret | `kubectl delete secret hf-token -n <namespace>` |
| Provider | `cd providers/<provider> && make undeploy` (from repo root) |
| Controller | `make controller-undeploy && make controller-uninstall` (from repo root) |
references/steps/
step-1-verify.md 1.9 KB
# Step 1 โ€” Cluster Verification

**Goal**: Confirm prerequisites are met, validate the cluster connection, and inventory available nodes and GPUs.

## Prerequisites

Verify required CLI tools are available before proceeding:

```bash
# Check all required tools are installed
for tool in kubectl make curl; do
  command -v "$tool" >/dev/null 2>&1 && echo "โœ“ $tool found" || echo "โœ— $tool NOT FOUND โ€” install before continuing"
done
```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

If any tool is missing, **STOP** and tell the user which tools to install before continuing.

## Cluster Connection

```bash
# Confirm active context
kubectl config current-context

# Inventory all nodes
kubectl get nodes -o wide
```

## GPU Detection

```bash
# Extract GPU count and model per node (requires NVIDIA device plugin labels)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable['\''nvidia.com/gpu'\'']}{"\t"}{.metadata.labels['\''nvidia.com/gpu.product'\'']}{"\n"}{end}'
```

> **Note:** The above command relies on `nvidia.com/gpu.product` node labels, which are set by the NVIDIA device plugin or GPU operator. If the output shows empty GPU fields, try the fallback:
>
> ```bash
> # Fallback: check node descriptions for GPU capacity
> kubectl describe nodes | grep -A 5 "Allocatable:" | grep -i nvidia
> ```
>
> If neither approach shows GPUs but you know the nodes have GPU hardware, the NVIDIA device plugin may not be installed yet. Guide the user to install it before proceeding.

**Report to user:**
- Cluster context name
- Total node count and GPU node count
- Per GPU type: model, count, VRAM per card, total cluster VRAM

**Decision logic:**
- No kubeconfig context โ†’ **STOP**. Tell user to configure `kubeconfig` (e.g., `az aks get-credentials`) and retry.
- No GPU nodes detected โ†’ Note "CPU-only cluster" and proceed; CPU-only inference is available via KAITO + llama.cpp.
step-2-controller.md 1.0 KB
# Step 2 โ€” Controller Status & Installation

**Goal**: Ensure the AI Runway controller and CRDs are installed and healthy.

```bash
# Check CRD presence
kubectl get crd modeldeployments.airunway.ai

# Check controller pod health
kubectl get pods -n airunway-system -l control-plane=controller-manager
```

**If already installed and healthy:** Report version and pod status, skip to Step 3.

**If not installed:** Ask user to confirm, then from the **repository root** run:

```bash
make controller-install   # Install CRDs
make controller-deploy    # Deploy controller manager
```

> **Note:** `make` targets must be run from the AI Runway repository root.

Verify rollout:

```bash
kubectl rollout status deployment/airunway-controller-manager -n airunway-system --timeout=120s
```

**Error handling:**
- `CrashLoopBackOff` โ†’ `kubectl logs -n airunway-system -l control-plane=controller-manager --previous`
- Rollout timeout โ†’ `kubectl describe deployment airunway-controller-manager -n airunway-system`
- `No rule to make target` โ†’ Navigate to repo root and retry
step-3-gpu.md 0.8 KB
# Step 3 โ€” GPU Assessment

**Goal**: Match detected hardware to known profiles and surface compatibility constraints.

Load and consult [gpu-profiles.md](../gpu-profiles.md).

For each GPU type detected in Step 1:
1. Look up the model in `gpu-profiles.md`
2. Report: VRAM per card, total cluster VRAM, supported dtypes, recommended attention backend
3. Surface compatibility warnings:

| GPU | Warning |
|-----|---------|
| T4 | Does not support bfloat16. Set `--dtype float16` in serving args. |
| V100 | Does not support bfloat16; limited flash attention. Use xformers backend. |

**CPU-only path:** Note CPU-only inference is available via KAITO + llama.cpp. Recommend `google/gemma-3-1b-it-qat-q8_0-gguf`.

Use the **Model Sizing Guide** in [model-sizing.md](../model-sizing.md) to calculate maximum model size for the cluster.
step-4-provider.md 2.0 KB
# Step 4 โ€” Provider Recommendation & Installation

**Goal**: Select and install the right inference provider for the detected hardware.

```bash
# Check if providers are already registered
kubectl get inferenceproviderconfigs --all-namespaces 2>/dev/null || kubectl get inferenceproviderconfigs
```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

**If providers already registered:** Show name and status, skip installation.

**Provider recommendation logic:**

| Hardware | Use Case | Recommended Provider |
|----------|----------|---------------------|
| CPU-only | Any | KAITO (llama.cpp) |
| GPUs | Standard inference (most users start here) | KAITO |
| GPUs | High-throughput serving with separate prefill/decode phases | Dynamo |
| GPUs | Already using Ray for ML workloads | KubeRay |

> **Default to KAITO** unless the user has a specific reason to choose otherwise. KAITO is the simplest to set up and handles most use cases. Dynamo is for teams that need to independently scale the prefill and decode stages of inference for high throughput. KubeRay is for teams already invested in the Ray ecosystem.

Present recommendation with reasoning. Ask user to confirm before installing.

**Installation** โ€” from the **repository root**:

First, check the provider's Makefile or README for the default image:

```bash
# List available providers and their default images
ls providers/
cat providers/<provider>/Makefile | grep -E 'IMG\s*\?='
```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

Then deploy:

```bash
cd providers/<provider>
make deploy IMG=<image>
```

> **Tip:** If the Makefile defines a default `IMG`, you can omit the `IMG=` argument and just run `make deploy`.

**Verify registration:**

```bash
kubectl get inferenceproviderconfigs <provider> -o yaml
```

Check `status.ready: true`. If not ready within 2 minutes, inspect provider pod logs with `kubectl logs <pod-name>` and optionally use `kubectl describe pod <pod-name>` to review events and pod status.
step-5-deploy.md 3.5 KB
# Step 5 โ€” First Model Deployment

**Goal**: Deploy a starter model that fits the cluster's GPU capacity and reach `Ready` status.

**Model recommendations** (based on cluster VRAM from Step 3 โ€” see [gpu-profiles.md](../gpu-profiles.md) for full sizing guide):

| Cluster Capacity | Recommended Model | Provider | Notes |
|-----------------|-------------------|----------|-------|
| CPU-only | `google/gemma-3-1b-it-qat-q8_0-gguf` | KAITO (llama.cpp) | GGUF Q8 quantized; runs on CPU |
| 1ร— T4 (16 GB) | `microsoft/Phi-3-mini-4k-instruct` | KAITO (vLLM) | Small, fast; fits T4 with headroom |
| 1ร— A10G or L4 (24 GB) | `meta-llama/Llama-3.1-8B-Instruct` | KAITO (vLLM) | Good general-purpose starter (gated โ€” needs HF token) |
| 1ร— A100 40 GB | `microsoft/Phi-3-medium-128k-instruct` | KAITO (vLLM) | ~28 GB float16; non-gated; MIT license |
| 1ร— A100 80 GB / H100 | `meta-llama/Llama-3.1-8B-Instruct` | KAITO (vLLM) | Oversized for 8B; suggest 70B with tensor parallelism if more GPUs available |
| 4ร— A100 80 GB | `meta-llama/Llama-3.1-70B-Instruct` | KAITO (vLLM, tensor parallel) | Large model; requires tensor parallelism; gated โ€” needs HF token |

Present recommendation. Ask user to confirm or choose their own model.

## Naming Convention

- **`model-name`**: A DNS-safe Kubernetes resource name derived from the model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct` โ†’ `llama-3-1-8b-instruct`)
- **`model-id`**: The full HuggingFace model identifier (e.g., `meta-llama/Llama-3.1-8B-Instruct`)

## Gated Models (HuggingFace Token)

Llama-family models are gated on HuggingFace and require an access token. Read the token interactively to avoid exposing it in shell history:

```bash
# Read token without echo, store in a securely permissioned temp file
hf_token_file="$(mktemp)"
trap 'rm -f "$hf_token_file"' EXIT

read -r -s -p "Enter HuggingFace token: " hf_token
printf '\n'
printf '%s' "$hf_token" > "$hf_token_file"
chmod 600 "$hf_token_file"
unset hf_token

kubectl create secret generic hf-token \
  --from-file=token="$hf_token_file" \
  -n <namespace> \
  --dry-run=client -o yaml | kubectl apply -f -
```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

## Apply the ModelDeployment CR

Use the appropriate manifest based on whether the model is gated.

**For gated models (Llama etc. โ€” requires HuggingFace token):**

```bash
kubectl apply -f - <<EOF
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: <model-name>
  namespace: <namespace>
spec:
  model:
    id: <model-id>
    huggingFaceTokenSecretRef:
      name: hf-token
      key: token
  provider:
    name: <provider-name>
EOF
```

**For non-gated models (Phi-3, Gemma etc. โ€” no token required):**

```bash
kubectl apply -f - <<EOF
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: <model-name>
  namespace: <namespace>
spec:
  model:
    id: <model-id>
  provider:
    name: <provider-name>
EOF
```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

## Monitor Deployment

```bash
kubectl get modeldeployment <model-name> -n <namespace> -w
```

Wait until `STATUS = Ready`. If not ready within 10 minutes:
- `kubectl describe modeldeployment <model-name> -n <namespace>` โ€” check events
- `kubectl get pods -l modeldeployment=<model-name> -n <namespace>` โ€” check pod status

> **Note:** Large models (13B+) may take significantly longer than 10 minutes because weights must be downloaded first. For 70B models, expect 20โ€“40+ minutes depending on network speed. Check pod logs to confirm download progress rather than assuming failure.
step-6-summary.md 1.7 KB
# Step 6 โ€” Summary & Smoke Test

**Goal**: Confirm everything is working and guide the user on next steps.

Report:
- Cluster context and node inventory (Step 1)
- Controller version and namespace (Step 2)
- GPU types and constraints in effect (Step 3)
- Provider installed and registration status (Step 4)
- Model deployed and endpoint URL (Step 5)

## Smoke Test

Retrieve the endpoint and test it (only after `STATUS = Ready`):

> Replace `<namespace>` with the namespace used during deployment (default: `default`).

```bash
ENDPOINT=$(kubectl get modeldeployment <model-name> -n <namespace> -o jsonpath='{.status.endpoint}')

if [ -z "$ENDPOINT" ]; then
  echo "Endpoint not yet available โ€” model may still be starting"
else
  curl -X POST "${ENDPOINT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "<model-id>", "messages": [{"role": "user", "content": "Hello!"}]}'
fi
```

> **If the endpoint is a cluster-internal URL** (e.g., `http://svc-name.namespace:port`), set up port-forwarding first:
>
> ```bash
> # Discover the service port (vLLM: 8000, llama.cpp: 8080)
> SERVICE_PORT=$(kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.ports[0].port}')
>
> kubectl port-forward svc/<service-name> 8080:${SERVICE_PORT} -n <namespace> &
> # Then use http://localhost:8080 as the endpoint
> ```

> See [powershell-notes.md](../powershell-notes.md) for the PowerShell equivalent.

**Expected output:** A JSON response with a `choices` array containing the model's reply.

**Suggest next steps:**
- Open the AI Runway Web UI to browse and deploy additional models
- Configure an ingress gateway for external access
- Review `controller/config/samples/` for advanced ModelDeployment options

License (MIT)

View full license text
MIT License

Copyright 2025 (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.