## Installation

### Install with CLI (recommended)
Run `gh skills-hub install phoenix-evals`. Don't have the extension? Run `gh extension install samueltauil/skills-hub` first.

### Manual install
Download the ZIP and extract it to `.github/skills/phoenix-evals/` in your repository. The folder name must match `phoenix-evals` for Copilot to auto-discover it.
Skill Files (35)
SKILL.md 4.4 KB
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
compatibility: Requires Phoenix server. Python skills need phoenix and openai packages; TypeScript skills need @arizeai/phoenix-client.
metadata:
  author: oss@arize.com
  version: "1.0.0"
  languages: "Python, TypeScript"
---
# Phoenix Evals
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
## Quick Reference
| Task | Files |
| ---- | ----- |
| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |
| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |
| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |
| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |
| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |
| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |
| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |
| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |
| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |
| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |
| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |
| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |
| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |
| RAG evals | [evaluators-rag](references/evaluators-rag.md) |
| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |
| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |
## Workflows
**Starting Fresh:**
[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)
**Building Evaluator:**
[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
**RAG Systems:**
[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
**Production:**
[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)
## Reference Categories
| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |
## Key Principles
| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
references/
axial-coding.md 2.3 KB
# Axial Coding
Group open-ended notes into structured failure taxonomies.
## Process
1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category
## Example Taxonomy
```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]
  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]
  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]
  safety:
    missing_disclaimers: [legal, medical, financial]
```
## Add Annotation (Python)
```python
from phoenix.client import Client
client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)
```
## Add Annotation (TypeScript)
```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";
await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  }
});
```
## Agent Failure Taxonomy
```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```
## Transition Matrix (Agents)
Shows where failures occur between states:
```python
from collections import defaultdict
import pandas as pd

def build_transition_matrix(conversations, states):
    # find_last_success / find_first_failure are helpers you define to
    # return the state names on either side of the failure point
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    return pd.DataFrame(matrix).fillna(0)
```
## Principles
- **MECE** - Each failure fits ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data
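The Quantify step needs only a few lines of plain Python once labels are collected; the label list below is illustrative:

```python
from collections import Counter

# Illustrative failure labels gathered during axial coding
labels = [
    "hallucination", "incompleteness", "hallucination",
    "tone_mismatch", "hallucination", "incompleteness",
]

counts = Counter(labels)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```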
common-mistakes-python.md 6.9 KB
# Common Mistakes (Python)
Patterns that LLMs frequently generate incorrectly because their training data reflects the legacy 1.0 API.
## Legacy Model Classes
```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")
# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```
**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`.
The `LLM` class is provider-agnostic and is the current 2.0 API.
## Using run_evals Instead of evaluate_dataframe
```python
# WRONG: legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames
# RIGHT: current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```
**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current
2.0 function with a different return format.
## Wrong Result Column Names
```python
# WRONG: column doesn't exist
score = results_df["relevance"].mean()
# WRONG: column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()
# RIGHT: extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```
**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts
like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.
## Deprecated project_name Parameter
```python
# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")
# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")
```
**Why**: `project_name` is deprecated in favor of `project_identifier`, which also
accepts project IDs.
## Wrong Client Constructor
```python
# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")
# RIGHT: for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")
# ALSO RIGHT: for local Phoenix (falls back to env vars or localhost:6006)
client = Client()
```
**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances,
`Client()` with no args works fine. For remote instances, `base_url` and `api_key` are required.
## Too-Aggressive Time Filters
```python
# WRONG: often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)
# RIGHT: use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)
```
**Why**: Traces may be from any time period. A 1-hour window frequently returns
nothing. Use `limit=` to control result size instead.
## Not Filtering Spans Appropriately
```python
# WRONG: fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")
# RIGHT for end-to-end evaluation: filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)
# RIGHT for RAG evaluation: fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]
```
**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`.
For RAG systems, you often need child spans separately: retriever spans for
DocumentRelevance and LLM spans for Faithfulness. Choose the right span level
for your evaluation target.
## Assuming Span Output is Plain Text
```python
# WRONG: output may be JSON, not plain text
df["output"] = df["attributes.output.value"]
# RIGHT: parse JSON and extract the answer field
import json
def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value
df["output"] = df["attributes.output.value"].apply(extract_answer)
```
**Why**: LangChain and other frameworks often output structured JSON from root spans,
like `{"context": "...", "question": "...", "answer": "..."}`. Evaluators need
the actual answer text, not the raw JSON.
## Using @create_evaluator for LLM-Based Evaluation
```python
# WRONG: @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved
# RIGHT: use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM
relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
**Why**: `@create_evaluator` wraps a plain Python function. Setting `kind="llm"`
marks it as LLM-based but you must implement the LLM call yourself.
For LLM-based evaluation, prefer `ClassificationEvaluator` which handles
the LLM call, structured output parsing, and explanations automatically.
## Using llm_classify Instead of ClassificationEvaluator
```python
# WRONG: legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)
# RIGHT: current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM
classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])
```
**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create
an evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.
## Using HallucinationEvaluator
```python
# WRONG: deprecated
from phoenix.evals import HallucinationEvaluator
evaluator = HallucinationEvaluator(model)
# RIGHT: use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
evaluator = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))
```
**Why**: `HallucinationEvaluator` is deprecated; `FaithfulnessEvaluator` is its replacement.
It uses "faithful"/"unfaithful" labels with a maximize score direction (1.0 = faithful).
error-analysis-multi-turn.md 1.3 KB
# Error Analysis: Multi-Turn Conversations
Debugging complex multi-turn conversation traces.
## The Approach
1. **End-to-end first** - Did the conversation achieve the goal?
2. **Find first failure** - Trace backwards to root cause
3. **Simplify** - Try single-turn before multi-turn debug
4. **N-1 testing** - Isolate turn-specific vs capability issues
## Find First Upstream Failure
```
Turn 1: User asks about flights ✓
Turn 2: Assistant asks for dates ✓
Turn 3: User provides dates ✓
Turn 4: Assistant searches WRONG dates ✗ FIRST FAILURE
Turn 5: Shows wrong flights (consequence)
Turn 6: User frustrated (consequence)
```
Focus on Turn 4, not Turn 6.
## Simplify First
Before debugging multi-turn, test single-turn:
```python
# If single-turn also fails → problem is retrieval/knowledge
# If single-turn passes → problem is conversation context
response = chat("What's the return policy for electronics?")
```
## N-1 Testing
Give turns 1 to N-1 as context, test turn N:
```python
# chat_with_context is your app's chat entry point (history + new message)
context = conversation[:n - 1]
response = chat_with_context(context, user_message_n)
# Compare response to the actual logged turn N
```
This isolates whether the error stems from the conversation context or from the model's underlying capability.
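A minimal N-1 harness might look like this; the conversation shape and `chat_fn` stub are illustrative, and in practice `chat_fn` would call your application:

```python
def n_minus_1_test(conversation, n, chat_fn):
    """Replay turns 1..n-1 as context, regenerate turn n, compare to the log."""
    context = conversation[:n - 1]
    actual = conversation[n - 1]
    generated = chat_fn(context, actual["user"])
    return generated == actual["assistant"]

# Stubbed chat_fn for illustration; a real one calls your application
conversation = [
    {"user": "Find flights to Tokyo", "assistant": "What dates?"},
    {"user": "March 3-10", "assistant": "Searching March 3-10..."},
]
ok = n_minus_1_test(conversation, 2, lambda ctx, msg: "Searching March 3-10...")
```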
## Checklist
1. Did conversation achieve goal? (E2E)
2. Which turn first went wrong?
3. Can you reproduce with single-turn?
4. Is error from context or capability? (N-1 test)
error-analysis.md 4.2 KB
# Error Analysis
Review traces to discover failure modes before building evaluators.
## Process
1. **Sample** - 100+ traces (errors, negative feedback, random)
2. **Open Code** - Write free-form notes per trace
3. **Axial Code** - Group notes into failure categories
4. **Quantify** - Count failures per category
5. **Prioritize** - Rank by frequency × severity
## Sample Traces
### Span-level sampling (Python → DataFrame)
```python
from phoenix.client import Client
import pandas as pd
# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")
# Build representative sample
sample = pd.concat([
    spans_df[spans_df["status_code"] == "ERROR"].sample(30),
    spans_df[spans_df["feedback"] == "negative"].sample(30),
    spans_df.sample(40),
]).drop_duplicates("span_id").head(100)
```
### Span-level sampling (TypeScript)
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans: errors } = await getSpans({
  project: { projectName: "my-app" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-app" },
  limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);
```
### Trace-level sampling (Python)
When errors span multiple spans (e.g., agent workflows), sample whole traces:
```python
from datetime import datetime, timedelta
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,
    sort="latency_ms",
    order="desc",
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans
```
### Trace-level sampling (TypeScript)
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});
```
## Add Notes (Python)
```python
client.spans.add_span_note(
    span_id="abc123",
    note="wrong timezone - said 3pm EST but user is PST",
)
```
## Add Notes (TypeScript)
```typescript
import { addSpanNote } from "@arizeai/phoenix-client/spans";
await addSpanNote({
  spanNote: {
    spanId: "abc123",
    note: "wrong timezone - said 3pm EST but user is PST",
  }
});
```
## What to Note
| Type | Examples |
| ---- | -------- |
| Factual errors | Wrong dates, prices, made-up features |
| Missing info | Didn't answer question, omitted details |
| Tone issues | Too casual/formal for context |
| Tool issues | Wrong tool, wrong parameters |
| Retrieval | Wrong docs, missing relevant docs |
## Good Notes
```
BAD: "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"
```
## Group into Categories
```python
categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity
```
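The frequency × severity formula can be sketched directly; the severity weights and counts below are illustrative:

```python
# Illustrative severity weights (1 = minor, 3 = severe) and observed counts
severity = {"factual_inaccuracy": 2, "hallucination": 3, "tone_mismatch": 1}
frequency = {"factual_inaccuracy": 12, "hallucination": 5, "tone_mismatch": 20}

# Priority = frequency * severity, sorted highest first
priority = {cat: frequency[cat] * severity[cat] for cat in severity}
ranked = sorted(priority.items(), key=lambda kv: kv[1], reverse=True)
# Fix the top-ranked category's failures before the others
```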
## Retrieve Existing Annotations
### Python
```python
# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=sample,
    project_identifier="my-app",
    include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation
# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span-id-1", "span-id-2"],
    project_identifier="my-app",
)
```
### TypeScript
```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";
const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-app" },
  spanIds: ["span-id-1", "span-id-2"],
  includeAnnotationNames: ["quality", "correctness"],
});
for (const ann of annotations) {
  console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}
```
## Saturation
Stop when new traces reveal no new failure modes. Minimum: 100 traces.
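One way to make this stopping rule concrete: track how many previously unseen categories each review batch adds. This is a sketch; the batch contents are illustrative:

```python
def new_categories_per_batch(batches):
    # Count how many previously unseen failure categories each batch adds
    seen, counts = set(), []
    for batch in batches:
        fresh = set(batch) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts

counts = new_categories_per_batch([
    ["hallucination", "tone_mismatch"],
    ["tone_mismatch", "retrieval_miss"],
    ["tone_mismatch"],
])
# A run of zeros suggests saturation: new traces reveal no new failure modes
```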
evaluate-dataframe-python.md 4.3 KB
# Batch Evaluation with evaluate_dataframe (Python)
Run evaluators across a DataFrame. The core 2.0 batch evaluation API.
## Preferred: async_evaluate_dataframe
For batch evaluations (especially with LLM evaluators), prefer the async version
for better throughput:
```python
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(
    dataframe=df,              # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2], # List of evaluators
    concurrency=5,             # Max concurrent LLM calls (default 3)
    exit_on_error=False,       # Optional: stop on first error (default True)
    max_retries=3,             # Optional: retry failed LLM calls (default 10)
)
```
## Sync Version
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
    dataframe=df,              # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2], # List of evaluators
    exit_on_error=False,       # Optional: stop on first error (default True)
    max_retries=3,             # Optional: retry failed LLM calls (default 10)
)
```
## Result Column Format
`async_evaluate_dataframe` / `evaluate_dataframe` returns a copy of the input DataFrame with added columns.
**Result columns contain dicts, NOT raw numbers.**
For each evaluator named `"foo"`, two columns are added:
| Column | Type | Contents |
| ------ | ---- | -------- |
| `foo_score` | `dict` | `{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}` |
| `foo_execution_details` | `dict` | `{"status": "success", "exceptions": [], "execution_seconds": 0.001}` |
Only non-None fields appear in the score dict.
### Extracting Numeric Scores
```python
# WRONG: these will fail or produce unexpected results
score = results_df["relevance"].mean()        # KeyError!
score = results_df["relevance_score"].mean()  # Tries to average dicts!
# RIGHT: extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()
```
### Extracting Labels
```python
labels = results_df["relevance_score"].apply(
    lambda x: x.get("label", "") if isinstance(x, dict) else ""
)
```
### Extracting Explanations (LLM evaluators)
```python
explanations = results_df["relevance_score"].apply(
    lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)
```
### Finding Failures
```python
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]
```
## Input Mapping
Evaluators receive each row as a dict. Column names must match the evaluator's
expected parameter names. If they don't match, use `.bind()` or `bind_evaluator`:
```python
from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe
@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
    return len(response.strip()) > 0
# Option 1: .bind() returns a bound copy of the evaluator
bound = check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
# Option 2: Use the bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])
```
Or simply rename columns to match:
```python
df = df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
})
```
## DO NOT use run_evals
```python
# WRONG: legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame]: one per evaluator
# RIGHT: current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```
Key differences:
- `run_evals` returns a **list** of DataFrames (one per evaluator)
- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged
- `async_evaluate_dataframe` uses `{name}_score` dict column format
- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not `input_mapping=` param)
evaluators-code-python.md 2.6 KB
# Evaluators: Code Evaluators in Python
Deterministic evaluators without LLM. Fast, cheap, reproducible.
## Basic Pattern
```python
import re
import json
from phoenix.evals import create_evaluator
@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```
## Parameter Binding
| Parameter | Description |
| --------- | ----------- |
| `output` | Task output |
| `input` | Example input |
| `expected` | Expected output |
| `metadata` | Example metadata |
```python
@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()
```
## Common Patterns
- **Regex**: `re.search(pattern, output)`
- **JSON schema**: `jsonschema.validate()`
- **Keywords**: `keyword in output.lower()`
- **Length**: `len(output.split())`
- **Similarity**: `editdistance.eval()` or Jaccard
## Return Types
| Return type | Result |
| ----------- | ------ |
| `bool` | `True` → score=1.0, label="True"; `False` → score=0.0, label="False" |
| `float`/`int` | Used as the `score` value directly |
| `str` (short, ≤3 words) | Used as the `label` value |
| `str` (long, ≥4 words) | Used as the `explanation` value |
| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |
| `Score` object | Used as-is |
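As an illustration of the dict return type, a plain function like the following (hypothetical name and regex) fills all three Score fields; in practice you would wrap it with `@create_evaluator`:

```python
import re

def cites_sources(output: str) -> dict:
    # The returned dict's keys map onto the Score fields listed above
    found = bool(re.search(r"\[\d+\]", output))
    return {
        "score": 1.0 if found else 0.0,
        "label": "cited" if found else "uncited",
        "explanation": "bracketed citation found" if found else "no [n]-style citation",
    }

result = cites_sources("See the docs [1] for details.")
```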
## Important: Code vs LLM Evaluators
The `@create_evaluator` decorator wraps a plain Python function.
- `kind="code"` (default): For deterministic evaluators that don't call an LLM.
- `kind="llm"`: Marks the evaluator as LLM-based, but **you** must implement the LLM
call inside the function. The decorator does not call an LLM for you.
For most LLM-based evaluation, prefer `ClassificationEvaluator` which handles
the LLM call, structured output parsing, and explanations automatically:
```python
from phoenix.evals import ClassificationEvaluator, LLM
relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Pre-Built
```python
from phoenix.experiments.evaluators import ContainsAnyKeyword, JSONParseable, MatchesRegex
evaluators = [
    ContainsAnyKeyword(keywords=["disclaimer"]),
    JSONParseable(),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
]
```
evaluators-code-typescript.md 1.2 KB
# Evaluators: Code Evaluators in TypeScript
Deterministic evaluators without LLM. Fast, cheap, reproducible.
## Basic Pattern
```typescript
import { createEvaluator } from "@arizeai/phoenix-evals";
const containsCitation = createEvaluator<{ output: string }>(
  ({ output }) => (/\[\d+\]/.test(output) ? 1 : 0),
  { name: "contains_citation", kind: "CODE" }
);
```
## With Full Results (asExperimentEvaluator)
```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";
const jsonValid = asExperimentEvaluator({
  name: "json_valid",
  kind: "CODE",
  evaluate: async ({ output }) => {
    try {
      JSON.parse(String(output));
      return { score: 1.0, label: "valid_json" };
    } catch (e) {
      return { score: 0.0, label: "invalid_json", explanation: String(e) };
    }
  },
});
```
## Parameter Types
```typescript
interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
```
## Common Patterns
- **Regex**: `/pattern/.test(output)`
- **JSON**: `JSON.parse()` + zod schema
- **Keywords**: `output.includes(keyword)`
- **Similarity**: `fastest-levenshtein`
evaluators-custom-templates.md 1.1 KB
# Evaluators: Custom Templates
Design LLM judge prompts.
## Complete Template Pattern
```python
TEMPLATE = """Evaluate faithfulness of the response to the context.
<context>{{context}}</context>
<response>{{output}}</response>
CRITERIA:
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context
EXAMPLES:
Context: "Price is $10" → Response: "It costs $10" → faithful
Context: "Price is $10" → Response: "About $15" → unfaithful
EDGE CASES:
- Empty context → cannot_evaluate
- "I don't know" when appropriate → faithful
- Partial faithfulness → unfaithful (strict)
Answer (faithful/unfaithful):"""
```
## Template Structure
1. Task description
2. Input variables in XML tags
3. Criteria definitions
4. Examples (2-4 cases)
5. Edge cases
6. Output format
## XML Tags
```
<question>{{input}}</question>
<response>{{output}}</response>
<context>{{context}}</context>
<reference>{{reference}}</reference>
```
## Common Mistakes
| Mistake | Fix |
| ------- | --- |
| Vague criteria | Define each label exactly |
| No examples | Include 2-4 cases |
| Ambiguous format | Specify exact output |
| No edge cases | Address ambiguity |
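To see how the `{{variable}}` placeholders resolve, here is a minimal Mustache-style substitution sketch. Phoenix performs this rendering internally; the `render` helper is only illustrative:

```python
import re

def render(template: str, variables: dict) -> str:
    # Replace each {{name}} placeholder with the corresponding value
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)

prompt = render(
    "<context>{{context}}</context>\n<response>{{output}}</response>",
    {"context": "Price is $10", "output": "It costs $10"},
)
```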
evaluators-llm-python.md 2.6 KB
# Evaluators: LLM Evaluators in Python
LLM evaluators use a language model to judge outputs. Use when criteria are subjective.
## Quick Start
```python
from phoenix.evals import ClassificationEvaluator, LLM
llm = LLM(provider="openai", model="gpt-4o")
HELPFULNESS_TEMPLATE = """Rate how helpful the response is.
<question>{{input}}</question>
<response>{{output}}</response>
"helpful" means directly addresses the question.
"not_helpful" means does not address the question.
Your answer (helpful/not_helpful):"""
helpfulness = ClassificationEvaluator(
    name="helpfulness",
    prompt_template=HELPFULNESS_TEMPLATE,
    llm=llm,
    choices={"not_helpful": 0, "helpful": 1},
)
```
## Template Variables
Use XML tags to wrap variables for clarity:
| Variable | XML Tag |
| -------- | ------- |
| `{{input}}` | `<question>{{input}}</question>` |
| `{{output}}` | `<response>{{output}}</response>` |
| `{{reference}}` | `<reference>{{reference}}</reference>` |
| `{{context}}` | `<context>{{context}}</context>` |
## create_classifier (Factory)
Shorthand factory that returns a `ClassificationEvaluator`. Prefer direct
`ClassificationEvaluator` instantiation for more parameters/customization:
```python
from phoenix.evals import create_classifier, LLM
relevance = create_classifier(
    name="relevance",
    prompt_template="""Is this response relevant to the question?
<question>{{input}}</question>
<response>{{output}}</response>
Answer (relevant/irrelevant):""",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
```
## Input Mapping
Column names must match template variables. Rename columns or use `bind_evaluator`:
```python
# Option 1: Rename columns to match template variables
df = df.rename(columns={"user_query": "input", "ai_response": "output"})
# Option 2: Use bind_evaluator
from phoenix.evals import bind_evaluator
bound = bind_evaluator(
    evaluator=helpfulness,
    input_mapping={"input": "user_query", "output": "ai_response"},
)
```
## Running
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])
```
## Best Practices
1. **Be specific** - Define exactly what pass/fail means
2. **Include examples** - Show concrete cases for each label
3. **Explanations by default** - `ClassificationEvaluator` includes explanations automatically
4. **Study built-in prompts** - See
`phoenix.evals.__generated__.classification_evaluator_configs` for examples
of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)
evaluators-llm-typescript.md 1.4 KB
# Evaluators: LLM Evaluators in TypeScript
LLM evaluators use a language model to judge outputs. Uses Vercel AI SDK.
## Quick Start
```typescript
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const helpfulness = await createClassificationEvaluator<{
  input: string;
  output: string;
}>({
  name: "helpfulness",
  model: openai("gpt-4o"),
  promptTemplate: `Rate helpfulness.
<question>{{input}}</question>
<response>{{output}}</response>
Answer (helpful/not_helpful):`,
  choices: { not_helpful: 0, helpful: 1 },
});
```
## Template Variables
Use XML tags: `<question>{{input}}</question>`, `<response>{{output}}</response>`, `<context>{{context}}</context>`
## Custom Evaluator with asExperimentEvaluator
```typescript
import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";
const customEval = asExperimentEvaluator({
name: "custom",
kind: "LLM",
evaluate: async ({ input, output }) => {
// Your LLM call here
return { score: 1.0, label: "pass", explanation: "..." };
},
});
```
## Pre-Built Evaluators
```typescript
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";
const faithfulnessEvaluator = createFaithfulnessEvaluator({
model: openai("gpt-4o"),
});
```
## Best Practices
- Be specific about criteria
- Include examples in prompts
- Use `<thinking>` for chain of thought
evaluators-overview.md 1.1 KB
# Evaluators: Overview
When and how to build automated evaluators.
## Decision Framework
```
Should I Build an Evaluator?
            │
            ▼
Can I fix it with a prompt change?
  YES → Fix the prompt first
  NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist
```
**Don't automate prematurely.** Many issues are simple prompt fixes.
## Evaluator Requirements
1. **Clear criteria** - Specific, not "Is it good?"
2. **Labeled test set** - 100+ examples with human labels
3. **Measured accuracy** - Know TPR/TNR before deploying
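Requirement 3 can be made concrete with a small helper. This is a minimal sketch in plain Python; the `tpr_tnr` helper and the example labels are illustrative, not part of the Phoenix API:

```python
def tpr_tnr(human_labels, judge_labels, positive="pass"):
    """True-positive and true-negative rates of a judge against human labels."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    tn = sum(1 for h, j in pairs if h != positive and j != positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

# Judge misses one real pass: TPR suffers, TNR is perfect
humans = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
tpr, tnr = tpr_tnr(humans, judge)  # (0.5, 1.0)
```

Report both rates separately; a judge that labels everything "fail" has perfect TNR but zero TPR.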
## Evaluator Lifecycle
1. **Discover** - Error analysis reveals pattern
2. **Design** - Define criteria and test cases
3. **Implement** - Build code or LLM evaluator
4. **Calibrate** - Validate against human labels
5. **Deploy** - Add to experiment/CI pipeline
6. **Monitor** - Track accuracy over time
7. **Maintain** - Update as product evolves
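For step 5, one common pattern is gating CI on an aggregate pass rate. A minimal sketch, assuming binary 0/1 scores and an illustrative 80% threshold:

```python
def ci_gate(scores, threshold=0.8):
    """Fail the CI job when the aggregate pass rate drops below the threshold."""
    pass_rate = sum(scores) / len(scores)
    if pass_rate < threshold:
        raise SystemExit(f"Eval gate failed: pass rate {pass_rate:.0%} < {threshold:.0%}")
    return pass_rate
```

Wire this to the scores returned by your experiment run; the threshold should come from the calibration step, not a guess.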
## What NOT to Automate
- **Rare issues** - <5 instances? Watchlist, don't build
- **Quick fixes** - Fixable by prompt change? Fix it
- **Evolving criteria** - Stabilize definition first
evaluators-pre-built.md 2.4 KB
# Evaluators: Pre-Built
Use for exploration only. Validate before production.
## Python
```python
from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator
llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
```
**Note**: `HallucinationEvaluator` is deprecated. Use `FaithfulnessEvaluator` instead.
It uses "faithful"/"unfaithful" labels with score 1.0 = faithful.
## TypeScript
```typescript
import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";
const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });
```
## Available (2.0)
| Evaluator | Type | Description |
| --------- | ---- | ----------- |
| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |
| `CorrectnessEvaluator` | LLM | Is the response correct? |
| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |
| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |
| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? |
| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |
| `MatchesRegex` | Code | Does output match a regex pattern? |
| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |
| `exact_match` | Code | Exact string match |
Legacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`,
`ToxicityEvaluator`, `SummarizationEvaluator`) are in `phoenix.evals.legacy` and deprecated.
## When to Use
| Situation | Recommendation |
| --------- | -------------- |
| Exploration | Find traces to review |
| Find outliers | Sort by scores |
| Production | Validate first (>80% human agreement) |
| Domain-specific | Build custom |
## Exploration Pattern
```python
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])
# Score columns contain dicts → extract numeric scores
scores = results_df["faithfulness_score"].apply(
lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5] # Review these
high_scores = results_df[scores > 0.9] # Also sample
```
## Validation Required
```python
from sklearn.metrics import classification_report
print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement
```
evaluators-rag.md 3.1 KB
# Evaluators: RAG Systems
RAG has two distinct components requiring different evaluation approaches.
## Two-Phase Evaluation
```
RETRIEVAL                         GENERATION
─────────                         ──────────
Query → Retriever → Docs          Docs + Query → LLM → Answer
          │                                  │
      IR Metrics                  LLM Judges / Code Checks
```
**Debug retrieval first** using IR metrics, then tackle generation quality.
## Retrieval Evaluation (IR Metrics)
Use traditional information retrieval metrics:
| Metric | What It Measures |
| ------ | ---------------- |
| Recall@k | Of all relevant docs, how many in top k? |
| Precision@k | Of k retrieved docs, how many relevant? |
| MRR | How high is first relevant doc? |
| NDCG | Quality weighted by position |
```python
# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
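MRR from the table above can be sketched the same way. The `mrr` helper below is illustrative: it scores a single query, and the dataset-level metric is the mean over queries:

```python
def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant doc for one query (0.0 if none)."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Dataset-level MRR is the mean over queries
queries = [(["d3", "d1", "d2"], ["d1"]), (["d5", "d6"], ["d9"])]
mean_rr = sum(mrr(r, rel) for r, rel in queries) / len(queries)  # (0.5 + 0.0) / 2
```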
## Creating Retrieval Test Data
Generate query-document pairs synthetically:
```python
# Reverse process: document → questions that the document answers
def generate_retrieval_test(documents):
    test_pairs = []
    for doc in documents:
        # Extract facts, generate questions
        # (assumes llm() returns a list of question strings)
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
        for q in questions:
            test_pairs.append({"query": q, "relevant_doc_id": doc.id})
    return test_pairs
```
## Generation Evaluation
Use LLM judges for qualities code can't measure:
| Eval | Question |
| ---- | -------- |
| **Faithfulness** | Are all claims supported by retrieved context? |
| **Relevance** | Does answer address the question? |
| **Completeness** | Does answer cover key points from context? |
```python
from phoenix.evals import ClassificationEvaluator, LLM
FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?
<context>{{context}}</context>
<answer>{{output}}</answer>
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context
Answer (faithful/unfaithful):"""
faithfulness = ClassificationEvaluator(
name="faithfulness",
prompt_template=FAITHFULNESS_TEMPLATE,
llm=LLM(provider="openai", model="gpt-4o"),
choices={"unfaithful": 0, "faithful": 1}
)
```
## RAG Failure Taxonomy
Common failure modes to evaluate:
```yaml
retrieval_failures:
- no_relevant_docs: Query returns unrelated content
- partial_retrieval: Some relevant docs missed
- wrong_chunk: Right doc, wrong section
generation_failures:
- hallucination: Claims not in retrieved context
- ignored_context: Answer doesn't use retrieved docs
- incomplete: Missing key information from context
- wrong_synthesis: Misinterprets or miscombines sources
```
## Evaluation Order
1. **Retrieval first** - If wrong docs, generation will fail
2. **Faithfulness** - Is answer grounded in context?
3. **Answer quality** - Does answer address the question?
Fix retrieval problems before debugging generation.
experiments-datasets-python.md 3.5 KB
# Experiments: Datasets in Python
Creating and managing evaluation datasets.
## Creating Datasets
```python
from phoenix.client import Client
client = Client()
# From examples
dataset = client.datasets.create_dataset(
name="qa-test-v1",
examples=[
{
"input": {"question": "What is 2+2?"},
"output": {"answer": "4"},
"metadata": {"category": "math"},
},
],
)
# From DataFrame
dataset = client.datasets.create_dataset(
dataframe=df,
name="qa-test-v1",
input_keys=["question"],
output_keys=["answer"],
metadata_keys=["category"],
)
```
## From Production Traces
```python
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")
dataset = client.datasets.create_dataset(
dataframe=spans_df[["input.value", "output.value"]],
name="production-sample-v1",
input_keys=["input.value"],
output_keys=["output.value"],
)
```
## Retrieving Datasets
```python
dataset = client.datasets.get_dataset(name="qa-test-v1")
df = dataset.to_dataframe()
```
## Key Parameters
| Parameter | Description |
| --------- | ----------- |
| `input_keys` | Columns for task input |
| `output_keys` | Columns for expected output |
| `metadata_keys` | Additional context |
## Using Evaluators in Experiments
### Evaluators as experiment evaluators
Pass phoenix-evals evaluators directly to `run_experiment` as the `evaluators` argument:
```python
from functools import partial
from phoenix.client import AsyncClient
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator
# Define an LLM evaluator
refusal = ClassificationEvaluator(
name="refusal",
prompt_template="Is this a refusal?\nQuestion: {{query}}\nResponse: {{response}}",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"refusal": 0, "answer": 1},
)
# Bind to map dataset columns to evaluator params
refusal_evaluator = bind_evaluator(refusal, {"query": "input.query", "response": "output"})
# Define experiment task
async def run_rag_task(input, rag_engine):
    return rag_engine.query(input["query"])
# Run experiment with the evaluator
experiment = await AsyncClient().experiments.run_experiment(
dataset=ds,
task=partial(run_rag_task, rag_engine=query_engine),
experiment_name="baseline",
evaluators=[refusal_evaluator],
concurrency=10,
)
```
### Evaluators as the task (meta evaluation)
Use an LLM evaluator as the experiment **task** to test the evaluator itself
against human annotations:
```python
from phoenix.evals import create_evaluator
# The evaluator IS the task being tested
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0]

# A simple heuristic checks judge vs. human agreement
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output["score"]) == float(expected["refusal_score"])
# Run: evaluator is the task, exact_match evaluates it
experiment = await AsyncClient().experiments.run_experiment(
dataset=annotated_dataset,
task=partial(run_refusal_eval, evaluator=refusal),
experiment_name="judge-v1",
evaluators=[exact_match],
concurrency=10,
)
```
This pattern lets you iterate on evaluator prompts until they align with human judgments.
See `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example.
## Best Practices
- **Versioning**: Create new datasets (e.g., `qa-test-v2`), don't modify
- **Metadata**: Track source, category, difficulty
- **Balance**: Ensure diverse coverage across categories
experiments-datasets-typescript.md 1.6 KB
# Experiments: Datasets in TypeScript
Creating and managing evaluation datasets.
## Creating Datasets
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";
const client = createClient();
const { datasetId } = await createDataset({
client,
name: "qa-test-v1",
examples: [
{
input: { question: "What is 2+2?" },
output: { answer: "4" },
metadata: { category: "math" },
},
],
});
```
## Example Structure
```typescript
interface DatasetExample {
input: Record<string, unknown>; // Task input
output?: Record<string, unknown>; // Expected output
metadata?: Record<string, unknown>; // Additional context
}
```
## From Production Traces
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans } = await getSpans({
project: { projectName: "my-app" },
parentId: null, // root spans only
limit: 100,
});
const examples = spans.map((span) => ({
input: { query: span.attributes?.["input.value"] },
output: { response: span.attributes?.["output.value"] },
metadata: { spanId: span.context.span_id },
}));
await createDataset({ client, name: "production-sample", examples });
```
## Retrieving Datasets
```typescript
import { getDataset, listDatasets } from "@arizeai/phoenix-client/datasets";
const dataset = await getDataset({ client, datasetId: "..." });
const all = await listDatasets({ client });
```
## Best Practices
- **Versioning**: Create new datasets, don't modify existing
- **Metadata**: Track source, category, provenance
- **Type safety**: Use TypeScript interfaces for structure
experiments-overview.md 1.2 KB
# Experiments: Overview
Systematic testing of AI systems with datasets, tasks, and evaluators.
## Structure
```
DATASET     → Examples: {input, expected_output, metadata}
TASK        → function(input) → output
EVALUATORS  → (input, output, expected) → score
EXPERIMENT  → Run task on all examples, score results
```
## Basic Usage
```python
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
dataset=my_dataset,
task=my_task,
evaluators=[accuracy, faithfulness],
experiment_name="improved-retrieval-v2",
)
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
```
## Workflow
1. **Create dataset** - From traces, synthetic data, or manual curation
2. **Define task** - The function to test (your LLM pipeline)
3. **Select evaluators** - Code and/or LLM-based
4. **Run experiment** - Execute and score
5. **Analyze & iterate** - Review, modify task, re-run
## Dry Runs
Test setup before full execution:
```python
experiment = run_experiment(dataset, task, evaluators, dry_run=3) # Just 3 examples
```
## Best Practices
- **Name meaningfully**: `"improved-retrieval-v2-2024-01-15"` not `"test"`
- **Version datasets**: Don't modify existing
- **Multiple evaluators**: Combine perspectives
experiments-running-python.md 1.6 KB
# Experiments: Running Experiments in Python
Execute experiments with `run_experiment`.
## Basic Usage
```python
from phoenix.client import Client
from phoenix.client.experiments import run_experiment
client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")
def my_task(example):
    return call_llm(example.input["question"])

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=[exact_match],
experiment_name="qa-experiment-v1",
)
```
## Task Functions
```python
# Basic task
def task(example):
    return call_llm(example.input["question"])

# With context (RAG)
def rag_task(example):
    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")
```
## Evaluator Parameters
| Parameter | Access |
| --------- | ------ |
| `output` | Task output |
| `expected` | Example expected output |
| `input` | Example input |
| `metadata` | Example metadata |
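As a sketch, a code evaluator can declare just the parameters it needs from the table above, and they are bound by name. The `complexity` metadata key below is an assumption for illustration:

```python
# Parameters are bound by name, so declare only what the evaluator needs.
# The "complexity" metadata key is an illustrative assumption.
def within_budget(output, metadata):
    limit = 200 if metadata.get("complexity") == "simple" else 1000
    return 1.0 if len(str(output)) <= limit else 0.0
```

Pass it in `evaluators=[within_budget]` like any other code evaluator.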
## Options
```python
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=evaluators,
experiment_name="my-experiment",
dry_run=3, # Test with 3 examples
repetitions=3, # Run each example 3 times
)
```
## Results
```python
print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}
for run in experiment.runs:
    print(run.output, run.scores)
```
## Add Evaluations Later
```python
from phoenix.client.experiments import evaluate_experiment
evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])
```
experiments-running-typescript.md 1.8 KB
# Experiments: Running Experiments in TypeScript
Execute experiments with `runExperiment`.
## Basic Usage
```typescript
import { createClient } from "@arizeai/phoenix-client";
import {
runExperiment,
asExperimentEvaluator,
} from "@arizeai/phoenix-client/experiments";
const client = createClient();
const task = async (example: { input: Record<string, unknown> }) => {
return await callLLM(example.input.question as string);
};
const exactMatch = asExperimentEvaluator({
name: "exact_match",
kind: "CODE",
evaluate: async ({ output, expected }) => ({
score: output === expected?.answer ? 1.0 : 0.0,
label: output === expected?.answer ? "match" : "no_match",
}),
});
const experiment = await runExperiment({
client,
experimentName: "qa-experiment-v1",
dataset: { datasetId: "your-dataset-id" },
task,
evaluators: [exactMatch],
});
```
## Task Functions
```typescript
// Basic task
const task = async (example) => await callLLM(example.input.question as string);
// With context (RAG)
const ragTask = async (example) => {
const prompt = `Context: ${example.input.context}\nQ: ${example.input.question}`;
return await callLLM(prompt);
};
```
## Evaluator Parameters
```typescript
interface EvaluatorParams {
input: Record<string, unknown>;
output: unknown;
expected: Record<string, unknown>;
metadata: Record<string, unknown>;
}
```
## Options
```typescript
const experiment = await runExperiment({
client,
experimentName: "my-experiment",
dataset: { datasetName: "qa-test-v1" },
task,
evaluators,
repetitions: 3, // Run each example 3 times
maxConcurrency: 5, // Limit concurrent executions
});
```
## Add Evaluations Later
```typescript
import { evaluateExperiment } from "@arizeai/phoenix-client/experiments";
await evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });
```
experiments-synthetic-python.md 1.8 KB
# Experiments: Generating Synthetic Test Data
Creating diverse, targeted test data for evaluation.
## Dimension-Based Approach
Define axes of variation, then generate combinations:
```python
dimensions = {
"issue_type": ["billing", "technical", "shipping"],
"customer_mood": ["frustrated", "neutral", "happy"],
"complexity": ["simple", "moderate", "complex"],
}
```
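The combinations can be enumerated mechanically rather than hand-written; a minimal sketch using `itertools.product`, sampling a subset when the full grid is large:

```python
import itertools
import random

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}
# Full grid: 3 x 3 x 3 = 27 tuples
tuples = list(itertools.product(*dimensions.values()))
# Sample a manageable subset when the grid is large
subset = random.sample(tuples, 10)
```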
## Two-Step Generation
1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)
```python
# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]
# Step 2: Convert to natural query
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}
Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)
```
## Target Failure Modes
Dimensions should target known failures from error analysis:
```python
# From error analysis findings
dimensions = {
"timezone": ["EST", "PST", "UTC", "ambiguous"], # Known failure
"date_format": ["ISO", "US", "EU", "relative"], # Known failure
}
```
## Quality Control
- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
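A minimal Python sketch of the first two checks; exact matching on normalized text is used here as a cheap stand-in for the embedding-based deduplication mentioned above:

```python
import re

def validate_query(query: str, min_length: int = 20) -> bool:
    """Reject queries with placeholder text (e.g. "[ISSUE]", "<mood>") or too short."""
    has_placeholder = bool(re.search(r"\[.*?\]|<.*?>", query))
    return len(query) >= min_length and not has_placeholder

def dedupe(queries):
    """Drop exact near-duplicates via whitespace/case normalization."""
    seen, unique = set(), []
    for q in queries:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```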
## When to Use
| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |
## Sample Sizes
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
experiments-synthetic-typescript.md 2.2 KB
# Experiments: Generating Synthetic Test Data (TypeScript)
Creating diverse, targeted test data for evaluation.
## Dimension-Based Approach
Define axes of variation, then generate combinations:
```typescript
const dimensions = {
issueType: ["billing", "technical", "shipping"],
customerMood: ["frustrated", "neutral", "happy"],
complexity: ["simple", "moderate", "complex"],
};
```
## Two-Step Generation
1. **Generate tuples** (combinations of dimension values)
2. **Convert to natural queries** (separate LLM call per tuple)
```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
// Step 1: Create tuples
type Tuple = [string, string, string];
const tuples: Tuple[] = [
["billing", "frustrated", "complex"],
["shipping", "neutral", "simple"],
];
// Step 2: Convert to natural query
async function tupleToQuery(t: Tuple): Promise<string> {
const { text } = await generateText({
model: openai("gpt-4o"),
prompt: `Generate a realistic customer message:
Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}
Write naturally, include typos if appropriate. Don't be formulaic.`,
});
return text;
}
```
## Target Failure Modes
Dimensions should target known failures from error analysis:
```typescript
// From error analysis findings
const dimensions = {
timezone: ["EST", "PST", "UTC", "ambiguous"], // Known failure
dateFormat: ["ISO", "US", "EU", "relative"], // Known failure
};
```
## Quality Control
- **Validate**: Check for placeholder text, minimum length
- **Deduplicate**: Remove near-duplicate queries using embeddings
- **Balance**: Ensure coverage across dimension values
```typescript
function validateQuery(query: string): boolean {
const minLength = 20;
const hasPlaceholder = /\[.*?\]|<.*?>/.test(query);
return query.length >= minLength && !hasPlaceholder;
}
```
## When to Use
| Use Synthetic | Use Real Data |
| ------------- | ------------- |
| Limited production data | Sufficient traces |
| Testing edge cases | Validating actual behavior |
| Pre-launch evals | Post-launch monitoring |
## Sample Sizes
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Comprehensive eval | 100-500 |
| Per-dimension | 10-20 per combination |
fundamentals-anti-patterns.md 1.2 KB
# Anti-Patterns
Common mistakes and fixes.
| Anti-Pattern | Problem | Fix |
| ------------ | ------- | --- |
| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |
| Vibe-based | No quantification | Measure with experiments |
| Ignoring humans | Uncalibrated LLM judges | Validate >80% TPR/TNR |
| Premature automation | Evaluators for imagined problems | Let observed failures drive |
| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |
| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |
| Model switching | Hoping a model works better | Error analysis first |
## Quantify Changes
```python
baseline = run_experiment(dataset, old_prompt, evaluators)
improved = run_experiment(dataset, new_prompt, evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")
```
## Don't Use Similarity for Generation
```python
# BAD
score = bertscore(output, reference)
# GOOD
correct_facts = check_facts_against_source(output, context)
```
## Error Analysis Before Model Change
```python
# BAD
for model in models:
results = test(model)
# GOOD
failures = analyze_errors(results)
# Then decide if model change is warranted
```
fundamentals-model-selection.md 1.2 KB
# Model Selection
Error analysis first, model changes last.
## Decision Tree
```
Performance Issue?
        │
        ▼
Error analysis suggests model problem?
  NO  → Fix prompts, retrieval, tools
  YES → Is it a capability gap?
          YES → Consider model change
          NO  → Fix the actual problem
```
## Judge Model Selection
| Principle | Action |
| --------- | ------ |
| Start capable | Use gpt-4o first |
| Optimize later | Test cheaper after criteria stable |
| Same model OK | Judge does different task |
```python
# Start with capable model
judge = ClassificationEvaluator(
llm=LLM(provider="openai", model="gpt-4o"),
...
)
# After validation, test cheaper
judge_cheap = ClassificationEvaluator(
llm=LLM(provider="openai", model="gpt-4o-mini"),
...
)
# Compare TPR/TNR on same test set
```
## Don't Model Shop
```python
# BAD
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
results = run_experiment(dataset, task, model)
# GOOD
failures = analyze_errors(results)
# "Ignores context" → Fix prompt
# "Can't do math"   → Maybe try better model
```
## When Model Change Is Warranted
- Failures persist after prompt optimization
- Capability gaps (reasoning, math, code)
- Error analysis confirms model limitation
fundamentals.md 1.8 KB
# Fundamentals
Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.
## Evaluator Types
| Type | Speed | Cost | Use Case |
| ---- | ----- | ---- | -------- |
| **Code** | Fast | Cheap | Regex, JSON, format, exact match |
| **LLM** | Medium | Medium | Subjective quality, complex criteria |
| **Human** | Slow | Expensive | Ground truth, calibration |
**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.
## Score Structure
| Property | Required | Description |
| -------- | -------- | ----------- |
| `name` | Yes | Evaluator name |
| `kind` | Yes | `"code"`, `"llm"`, `"human"` |
| `score` | No* | 0-1 numeric |
| `label` | No* | `"pass"`, `"fail"` |
| `explanation` | No | Rationale |
*One of `score` or `label` required.
## Binary > Likert
Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.
```python
# Multiple binary checks instead of one Likert scale
evaluators = [
AnswersQuestion(), # Yes/No
UsesContext(), # Yes/No
NoHallucination(), # Yes/No
]
```
## Quick Patterns
### Code Evaluator
```python
import re

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))
```
### LLM Evaluator
```python
from phoenix.evals import ClassificationEvaluator, LLM
evaluator = ClassificationEvaluator(
name="helpfulness",
prompt_template="...",
llm=LLM(provider="openai", model="gpt-4o"),
choices={"not_helpful": 0, "helpful": 1}
)
```
### Run Experiment
```python
from phoenix.client.experiments import run_experiment
experiment = run_experiment(
dataset=dataset,
task=my_task,
evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)
```
observe-sampling-python.md 2.4 KB
# Observe: Sampling Strategies
How to efficiently sample production traces for review.
## Strategies
### 1. Failure-Focused (Highest Priority)
```python
errors = spans_df[spans_df["status_code"] == "ERROR"]
negative_feedback = spans_df[spans_df["feedback"] == "negative"]
```
### 2. Outliers
```python
long_responses = spans_df.nlargest(50, "response_length")
slow_responses = spans_df.nlargest(50, "latency_ms")
```
### 3. Stratified (Coverage)
```python
# Sample equally from each category
by_query_type = spans_df.groupby("metadata.query_type").apply(
lambda x: x.sample(min(len(x), 20))
)
```
### 4. Metric-Guided
```python
# Review traces flagged by automated evaluators
flagged = spans_df[eval_results["label"] == "hallucinated"]
borderline = spans_df[(eval_results["score"] > 0.3) & (eval_results["score"] < 0.7)]
```
## Building a Review Queue
```python
import pandas as pd

def build_review_queue(spans_df, max_traces=100):
    queue = pd.concat([
        spans_df[spans_df["status_code"] == "ERROR"],
        spans_df[spans_df["feedback"] == "negative"],
        spans_df.nlargest(10, "response_length"),
        spans_df.sample(min(30, len(spans_df))),
    ]).drop_duplicates("span_id").head(max_traces)
    return queue
```
## Sample Size Guidelines
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |
**Saturation:** Stop when new traces show the same failure patterns.
## Trace-Level Sampling
When you need whole requests (all spans per trace), use `get_traces`:
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# Recent traces with full span trees
traces = client.traces.get_traces(
project_identifier="my-app",
limit=100,
include_spans=True,
)
# Time-windowed sampling (e.g., last hour)
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
limit=50,
include_spans=True,
)
# Filter by session (multi-turn conversations)
traces = client.traces.get_traces(
project_identifier="my-app",
session_id="user-session-abc",
include_spans=True,
)
# Sort by latency to find slowest requests
traces = client.traces.get_traces(
project_identifier="my-app",
sort="latency_ms",
order="desc",
limit=50,
)
```
observe-sampling-typescript.md 3.9 KB
# Observe: Sampling Strategies (TypeScript)
How to efficiently sample production traces for review.
## Strategies
### 1. Failure-Focused (Highest Priority)
Use server-side filters to fetch only what you need:
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
// Server-side filter → only ERROR spans are returned
const { spans: errors } = await getSpans({
project: { projectName: "my-project" },
statusCode: "ERROR",
limit: 100,
});
// Fetch only LLM spans
const { spans: llmSpans } = await getSpans({
project: { projectName: "my-project" },
spanKind: "LLM",
limit: 100,
});
// Filter by span name
const { spans: chatSpans } = await getSpans({
project: { projectName: "my-project" },
name: "chat_completion",
limit: 100,
});
```
### 2. Outliers
```typescript
const { spans } = await getSpans({
project: { projectName: "my-project" },
limit: 200,
});
const latency = (s: (typeof spans)[number]) =>
new Date(s.end_time).getTime() - new Date(s.start_time).getTime();
const sorted = [...spans].sort((a, b) => latency(b) - latency(a));
const slowResponses = sorted.slice(0, 50);
```
### 3. Stratified (Coverage)
```typescript
// Sample equally from each category
function stratifiedSample<T>(items: T[], groupBy: (item: T) => string, perGroup: number): T[] {
const groups = new Map<string, T[]>();
for (const item of items) {
const key = groupBy(item);
if (!groups.has(key)) groups.set(key, []);
groups.get(key)!.push(item);
}
return [...groups.values()].flatMap((g) => g.slice(0, perGroup));
}
const { spans } = await getSpans({
project: { projectName: "my-project" },
limit: 500,
});
const byQueryType = stratifiedSample(spans, (s) => s.attributes?.["metadata.query_type"] ?? "unknown", 20);
```
### 4. Metric-Guided
```typescript
import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";
// Fetch annotations for your spans, then filter by label
const { annotations } = await getSpanAnnotations({
project: { projectName: "my-project" },
spanIds: spans.map((s) => s.context.span_id),
includeAnnotationNames: ["hallucination"],
});
const flaggedSpanIds = new Set(
annotations.filter((a) => a.result?.label === "hallucinated").map((a) => a.span_id)
);
const flagged = spans.filter((s) => flaggedSpanIds.has(s.context.span_id));
```
## Trace-Level Sampling
When you need whole requests (all spans in a trace), use `getTraces`:
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
// Recent traces with full span trees
const { traces } = await getTraces({
project: { projectName: "my-project" },
limit: 100,
includeSpans: true,
});
// Filter by session (e.g., multi-turn conversations)
const { traces: sessionTraces } = await getTraces({
project: { projectName: "my-project" },
sessionId: "user-session-abc",
includeSpans: true,
});
// Time-windowed sampling
const { traces: recentTraces } = await getTraces({
project: { projectName: "my-project" },
startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
limit: 50,
includeSpans: true,
});
```
## Building a Review Queue
```typescript
// Combine server-side filters into a review queue
const { spans: errorSpans } = await getSpans({
project: { projectName: "my-project" },
statusCode: "ERROR",
limit: 30,
});
const { spans: allSpans } = await getSpans({
project: { projectName: "my-project" },
limit: 100,
});
const random = [...allSpans].sort(() => Math.random() - 0.5).slice(0, 30); // copy first: .sort() mutates
const combined = [...errorSpans, ...random];
const unique = [...new Map(combined.map((s) => [s.context.span_id, s])).values()];
const reviewQueue = unique.slice(0, 100);
```
## Sample Size Guidelines
| Purpose | Size |
| ------- | ---- |
| Initial exploration | 50-100 |
| Error analysis | 100+ (until saturation) |
| Golden dataset | 100-500 |
| Judge calibration | 100+ per class |
**Saturation:** Stop when new traces show the same failure patterns.
observe-tracing-setup.md 3.3 KB
# Observe: Tracing Setup
Configure tracing to capture data for evaluation.
## Quick Setup
```python
# Python
from phoenix.otel import register
register(project_name="my-app", auto_instrument=True)
```
```typescript
// TypeScript
import { registerPhoenix } from "@arizeai/phoenix-otel";
registerPhoenix({ projectName: "my-app", autoInstrument: true });
```
## Essential Attributes
| Attribute | Why It Matters |
| --------- | -------------- |
| `input.value` | User's request |
| `output.value` | Response to evaluate |
| `retrieval.documents` | Context for faithfulness |
| `tool.name`, `tool.parameters` | Agent evaluation |
| `llm.model_name` | Track by model |
## Custom Attributes for Evals
```python
span.set_attribute("metadata.client_type", "enterprise")
span.set_attribute("metadata.query_category", "billing")
```
## Exporting for Evaluation
### Spans (Python → DataFrame)
```python
from phoenix.client import Client
# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(
project_identifier="my-app", # NOT project_name= (deprecated)
root_spans_only=True,
)
dataset = client.datasets.create_dataset(
name="error-analysis-set",
dataframe=spans_df[["input.value", "output.value"]],
input_keys=["input.value"],
output_keys=["output.value"],
)
```
### Spans (TypeScript)
```typescript
import { getSpans } from "@arizeai/phoenix-client/spans";
const { spans } = await getSpans({
project: { projectName: "my-app" },
parentId: null, // root spans only
limit: 100,
});
```
### Traces (Python → structured)
Use `get_traces` when you need full trace trees (e.g., multi-turn conversations, agent workflows):
```python
from datetime import datetime, timedelta
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=24),
include_spans=True, # includes all spans per trace
limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)
```
### Traces (TypeScript)
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
```
## Uploading Evaluations as Annotations
### Python
```python
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe
# Run evaluations
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])
# Format results for Phoenix annotations
annotations_df = to_annotation_dataframe(results_df)
# Upload to Phoenix
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```
### TypeScript
```typescript
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";
await logSpanAnnotations({
spanAnnotations: [
{
spanId: "abc123",
name: "quality",
label: "good",
score: 0.95,
annotatorKind: "LLM",
},
],
});
```
Annotations are visible in the Phoenix UI alongside your traces.
## Verify
Required attributes: `input.value`, `output.value`, `status_code`
For RAG: `retrieval.documents`
For agents: `tool.name`, `tool.parameters`
production-continuous.md 3.5 KB
# Production: Continuous Evaluation
Capability vs regression evals and the ongoing feedback loop.
## Two Types of Evals
| Type | Pass Rate Target | Purpose | Update |
| ---- | ---------------- | ------- | ------ |
| **Capability** | 50-80% | Measure improvement | Add harder cases |
| **Regression** | 95-100% | Catch breakage | Add fixed bugs |
## Saturation
When capability evals hit >95% pass rate, they're saturated:
1. Graduate passing cases to regression suite
2. Add new challenging cases to capability suite
## Feedback Loop
```
Production → Sample traffic → Run evaluators → Find failures
    ↑                                               ↓
Deploy ← Run CI evals ← Create test cases ← Error analysis
```
## Implementation
Build a continuous monitoring loop:
1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)
2. **Run evaluators** on sampled traces
3. **Log results** to Phoenix for tracking
4. **Queue concerning results** for human review
5. **Create test cases** from recurring failure patterns
### Python
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
root_spans_only=True,
limit=100,
)
# 2. Run evaluators
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(
dataframe=spans_df,
evaluators=[quality_eval, safety_eval],
)
# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe
annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```
### TypeScript
```typescript
import { getSpans, logSpanAnnotations } from "@arizeai/phoenix-client/spans";
// 1. Sample recent spans
const { spans } = await getSpans({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
parentId: null, // root spans only
limit: 100,
});
// 2. Run evaluators (user-defined)
const results = await Promise.all(
spans.map(async (span) => ({
spanId: span.context.span_id,
...await runEvaluators(span, [qualityEval, safetyEval]),
}))
);
// 3. Upload results as annotations
await logSpanAnnotations({
spanAnnotations: results.map((r) => ({
spanId: r.spanId,
name: "quality",
score: r.qualityScore,
label: r.qualityLabel,
annotatorKind: "LLM" as const,
})),
});
```
For trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to identify traces:
```python
# Python: identify slow traces
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
sort="latency_ms",
order="desc",
limit=50,
)
```
```typescript
// TypeScript: identify slow traces
import { getTraces } from "@arizeai/phoenix-client/traces";
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
limit: 50,
});
```
## Alerting
| Condition | Severity | Action |
| --------- | -------- | ------ |
| Regression < 98% | Critical | Page oncall |
| Capability declining | Warning | Slack notify |
| Capability > 95% for 7d | Info | Schedule review |
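The table above maps to a simple severity classifier. A sketch, assuming you track pass rates in [0, 1], a trend value (change over the last window), and a streak counter; all names are illustrative:

```python
def classify_alert(regression_rate, capability_rate, capability_trend, days_above_95):
    """Map suite metrics to the alert levels in the table above."""
    if regression_rate < 0.98:
        return "critical"  # page oncall
    if capability_trend < 0:
        return "warning"   # Slack notify
    if capability_rate > 0.95 and days_above_95 >= 7:
        return "info"      # schedule review: suite is saturating
    return "ok"

print(classify_alert(0.97, 0.80, 0.0, 0))  # critical
```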
## Key Principles
- **Two suites** - Capability + Regression always
- **Graduate cases** - Move consistent passes to regression
- **Track trends** - Monitor over time, not just snapshots
production-guardrails.md 1.3 KB
# Production: Guardrails vs Evaluators
Guardrails block in real-time. Evaluators measure asynchronously.
## Key Distinction
```
Request → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response
                                            │
                                            └──→ ASYNC EVALUATOR (background)
```
## Guardrails
| Aspect | Requirement |
| ------ | ----------- |
| Timing | Synchronous, blocking |
| Latency | < 100ms |
| Purpose | Prevent harm |
| Type | Code-based (deterministic) |
**Use for:** PII detection, prompt injection, profanity, length limits, format validation.
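A minimal code-based input guardrail sketch: deterministic, synchronous, no LLM calls. The regex patterns are illustrative only, not production-grade PII detection:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MAX_CHARS = 4000

def input_guardrail(text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Runs before the LLM call and blocks on failure."""
    if len(text) > MAX_CHARS:
        return False, "too_long"
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        return False, "possible_pii"
    return True, "ok"

print(input_guardrail("My SSN is 123-45-6789"))  # (False, 'possible_pii')
```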
## Evaluators
| Aspect | Characteristic |
| ------ | -------------- |
| Timing | Async, background |
| Latency | Can be seconds |
| Purpose | Measure quality |
| Type | Can use LLMs |
**Use for:** Helpfulness, faithfulness, tone, completeness, citation accuracy.
## Decision
| Question | Answer |
| -------- | ------ |
| Must block harmful content? | Guardrail |
| Measuring quality? | Evaluator |
| Need LLM judgment? | Evaluator |
| < 100ms required? | Guardrail |
| False positives = angry users? | Evaluator |
## LLM Guardrails: Rarely
Only use LLM guardrails if:
- Latency budget > 1s
- Error cost >> LLM cost
- Low volume
- Fallback exists
**Key Principle:** Guardrails prevent harm (block). Evaluators measure quality (log).
production-overview.md 2.2 KB
# Production: Overview
CI/CD evals vs production monitoring - complementary approaches.
## Two Evaluation Modes
| Aspect | CI/CD Evals | Production Monitoring |
| ------ | ----------- | -------------------- |
| **When** | Pre-deployment | Post-deployment, ongoing |
| **Data** | Fixed dataset | Sampled traffic |
| **Goal** | Prevent regression | Detect drift |
| **Response** | Block deploy | Alert & analyze |
## CI/CD Evaluations
```python
# Fast, deterministic checks
ci_evaluators = [
has_required_format,
no_pii_leak,
safety_check,
regression_test_suite,
]
# Small but representative dataset (~100 examples)
run_experiment(dataset=ci_dataset, task=task, evaluators=ci_evaluators)
```
Set thresholds: regression=0.95, safety=1.0, format=0.98.
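Those thresholds can gate the deploy directly in CI. A sketch, assuming you can compute an aggregate pass rate per suite (the suite names mirror the thresholds above):

```python
import sys

THRESHOLDS = {"regression": 0.95, "safety": 1.0, "format": 0.98}

def ci_gate(pass_rates: dict[str, float]) -> list[str]:
    """Return the suites that fall below threshold (empty list = deploy allowed)."""
    return [name for name, minimum in THRESHOLDS.items()
            if pass_rates.get(name, 0.0) < minimum]

failures = ci_gate({"regression": 0.97, "safety": 1.0, "format": 0.99})
if failures:
    print(f"Blocking deploy: {failures}")
    sys.exit(1)  # non-zero exit fails the CI job
print("CI evals passed")
```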
## Production Monitoring
### Python
```python
from phoenix.client import Client
from datetime import datetime, timedelta
client = Client()
# Sample recent traces (last hour)
traces = client.traces.get_traces(
project_identifier="my-app",
start_time=datetime.now() - timedelta(hours=1),
include_spans=True,
limit=100,
)
# Run evaluators on sampled traffic
# run_evaluators_async and alert_on_failure are user-defined helpers
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)
```
### TypeScript
```typescript
import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";
// Sample recent traces (last hour)
const { traces } = await getTraces({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
includeSpans: true,
limit: 100,
});
// Or sample spans directly for evaluation
const { spans } = await getSpans({
project: { projectName: "my-app" },
startTime: new Date(Date.now() - 60 * 60 * 1000),
limit: 100,
});
// Run evaluators on sampled traffic
for (const span of spans) {
const results = await runEvaluators(span, productionEvaluators);
if (results.some((r) => r.score < 0.5)) {
await alertOnFailure(span, results);
}
}
```
Prioritize: errors → negative feedback → random sample.
## Feedback Loop
```
Production finds failure → Error analysis → Add to CI dataset → Prevents future regression
```
setup-python.md 1.9 KB
# Setup: Python
Packages required for Phoenix evals and experiments.
## Installation
```bash
# Core Phoenix package (includes client, evals, otel)
pip install arize-phoenix
# Or install individual packages
pip install arize-phoenix-client # Phoenix client only
pip install arize-phoenix-evals # Evaluation utilities
pip install arize-phoenix-otel # OpenTelemetry integration
```
## LLM Providers
For LLM-as-judge evaluators, install your provider's SDK:
```bash
pip install openai # OpenAI
pip install anthropic # Anthropic
pip install google-generativeai # Google
```
## Validation (Optional)
```bash
pip install scikit-learn # For TPR/TNR metrics
```
## Quick Verify
```python
from phoenix.client import Client
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.otel import register
# All imports should work
print("Phoenix Python setup complete")
```
## Key Imports (Evals 2.0)
```python
from phoenix.client import Client
from phoenix.evals import (
ClassificationEvaluator, # LLM classification evaluator (preferred)
LLM, # Provider-agnostic LLM wrapper
async_evaluate_dataframe, # Batch evaluate a DataFrame (preferred, async)
evaluate_dataframe, # Batch evaluate a DataFrame (sync)
create_evaluator, # Decorator for code-based evaluators
create_classifier, # Factory for LLM classification evaluators
bind_evaluator, # Map column names to evaluator params
Score, # Score dataclass
)
from phoenix.evals.utils import to_annotation_dataframe # Format results for Phoenix annotations
```
**Prefer**: `ClassificationEvaluator` over `create_classifier` (more parameters/customization).
**Prefer**: `async_evaluate_dataframe` over `evaluate_dataframe` (better throughput for LLM evals).
**Do NOT use** legacy 1.0 imports: `OpenAIModel`, `AnthropicModel`, `run_evals`, `llm_classify`.
setup-typescript.md 1.0 KB
# Setup: TypeScript
Packages required for Phoenix evals and experiments.
## Installation
```bash
# Using npm
npm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
# Using pnpm
pnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel
```
## LLM Providers
For LLM-as-judge evaluators, install Vercel AI SDK providers:
```bash
npm install ai @ai-sdk/openai # Vercel AI SDK + OpenAI
npm install @ai-sdk/anthropic # Anthropic
npm install @ai-sdk/google # Google
```
Or use direct provider SDKs:
```bash
npm install openai # OpenAI direct
npm install @anthropic-ai/sdk # Anthropic direct
```
## Quick Verify
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { registerPhoenix } from "@arizeai/phoenix-otel";
// All imports should work
console.log("Phoenix TypeScript setup complete");
```
validation-evaluators-python.md 1.0 KB
# Validating Evaluators (Python)
Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.
## Calculate Metrics
```python
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(human_labels, evaluator_predictions))
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")
```
## Correct Production Estimates
```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))
```
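A worked example of the correction (the helper is redefined here so the snippet runs standalone; the numbers are illustrative):

```python
def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))

# Judge reports an 85% pass rate; the judge itself has TPR 0.90, TNR 0.80.
# (0.85 - 0.20) / (0.90 - 0.20) = 0.65 / 0.70 ~= 0.93 true pass rate
print(round(correct_estimate(0.85, tpr=0.90, tnr=0.80), 3))  # 0.929
```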
## Find Misclassified
```python
# False Positives: Evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]
# False Negatives: Evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]
```
## Red Flags
- TPR or TNR < 70%
- Large gap between TPR and TNR
- Kappa < 0.6
validation-evaluators-typescript.md 4.4 KB
# Validating Evaluators (TypeScript)
Validate an LLM evaluator against human-labeled examples before deploying it.
Target: **>80% TPR and >80% TNR**.
Roles are inverted compared to a normal task experiment:
| Normal experiment | Evaluator validation |
|---|---|
| Task = agent logic | Task = run the evaluator under test |
| Evaluator = judge output | Evaluator = exact-match vs human ground truth |
| Dataset = agent examples | Dataset = golden hand-labeled examples |
## Golden Dataset
Use a separate dataset name so validation experiments don't mix with task experiments in Phoenix.
Store human ground truth in `metadata.groundTruthLabel`. Aim for ~50/50 balance:
```typescript
import type { Example } from "@arizeai/phoenix-client/types/datasets";
const goldenExamples: Example[] = [
{ input: { q: "Capital of France?" }, output: { answer: "Paris" }, metadata: { groundTruthLabel: "correct" } },
{ input: { q: "Capital of France?" }, output: { answer: "Lyon" }, metadata: { groundTruthLabel: "incorrect" } },
{ input: { q: "Capital of France?" }, output: { answer: "Major city..." }, metadata: { groundTruthLabel: "incorrect" } },
];
const VALIDATOR_DATASET = "my-app-qa-evaluator-validation"; // separate from task dataset
const POSITIVE_LABEL = "correct";
const NEGATIVE_LABEL = "incorrect";
```
## Validation Experiment
```typescript
import { createClient } from "@arizeai/phoenix-client";
import { createOrGetDataset, getDatasetExamples } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";
import { myEvaluator } from "./myEvaluator.js";
const client = createClient();
const { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });
const { examples } = await getDatasetExamples({ client, dataset: { datasetId } });
const groundTruth = new Map(examples.map((ex) => [ex.id, ex.metadata?.groundTruthLabel as string]));
// Task: invoke the evaluator under test
const task = async (example: (typeof examples)[number]) => {
const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });
return result.label ?? "unknown";
};
// Evaluator: exact-match against human ground truth
const exactMatch = asExperimentEvaluator({
name: "exact-match", kind: "CODE",
evaluate: ({ output, metadata }) => {
const expected = metadata?.groundTruthLabel as string;
const predicted = typeof output === "string" ? output : "unknown";
return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };
},
});
const experiment = await runExperiment({
client, experimentName: `evaluator-validation-${Date.now()}`,
dataset: { datasetId }, task, evaluators: [exactMatch],
});
// Compute confusion matrix
const runs = Object.values(experiment.runs);
const predicted = new Map((experiment.evaluationRuns ?? [])
.filter((e) => e.name === "exact-match")
.map((e) => [e.experimentRunId, e.result?.label ?? null]));
let tp = 0, fp = 0, tn = 0, fn = 0;
for (const run of runs) {
if (run.error) continue;
const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);
if (!p || !a) continue;
if (a === POSITIVE_LABEL && p === POSITIVE_LABEL) tp++;
else if (a === NEGATIVE_LABEL && p === POSITIVE_LABEL) fp++;
else if (a === NEGATIVE_LABEL && p === NEGATIVE_LABEL) tn++;
else if (a === POSITIVE_LABEL && p === NEGATIVE_LABEL) fn++;
}
const total = tp + fp + tn + fn;
const tpr = tp + fn > 0 ? (tp / (tp + fn)) * 100 : 0;
const tnr = tn + fp > 0 ? (tn / (tn + fp)) * 100 : 0;
console.log(`TPR: ${tpr.toFixed(1)}% TNR: ${tnr.toFixed(1)}% Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);
```
## Results & Quality Rules
| Metric | Target | Low value means |
|---|---|---|
| TPR (sensitivity) | >80% | Misses real failures (false negatives) |
| TNR (specificity) | >80% | Flags good outputs (false positives) |
| Accuracy | >80% | General weakness |
**Golden dataset rules:** ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20–50 examples is enough.
**Re-validate when:** prompt template changes · judge model changes · criteria updated · production FP/FN spike.
## See Also
- `validation.md` – Metric definitions and concepts
- `experiments-running-typescript.md` – `runExperiment` API
- `experiments-datasets-typescript.md` – `createOrGetDataset` / `getDatasetExamples`
validation.md 1.7 KB
# Validation
Validate LLM judges against human labels before deploying. Target >80% agreement.
## Requirements
| Requirement | Target |
| ----------- | ------ |
| Test set size | 100+ examples |
| Balance | ~50/50 pass/fail |
| Accuracy | >80% |
| TPR/TNR | Both >70% |
## Metrics
| Metric | Formula | Use When |
| ------ | ------- | -------- |
| **Accuracy** | (TP+TN) / Total | General |
| **TPR (Recall)** | TP / (TP+FN) | Quality assurance |
| **TNR (Specificity)** | TN / (TN+FP) | Safety-critical |
| **Cohen's Kappa** | Agreement beyond chance | Comparing evaluators |
## Quick Validation
```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
print(classification_report(human_labels, evaluator_predictions))
print(f"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}")
# Get TPR/TNR
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
```
## Golden Dataset Structure
```python
golden_example = {
"input": "What is the capital of France?",
"output": "Paris is the capital.",
"ground_truth_label": "correct",
}
```
## Building Golden Datasets
1. Sample production traces (errors, negative feedback, edge cases)
2. Balance ~50/50 pass/fail
3. Expert labels each example
4. Version datasets (never modify existing)
```python
# GOOD - create new version
golden_v2 = golden_v1 + [new_examples]
# BAD - never modify existing
golden_v1.append(new_example)
```
## Warning Signs
- All pass or all fail → too lenient/strict
- Random results → criteria unclear
- TPR/TNR < 70% → needs improvement
## Re-Validate When
- Prompt template changes
- Judge model changes
- Criteria changes
- Monthly
License (MIT)
MIT License Copyright GitHub, Inc. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.