Installation
Install with CLI (recommended)
`gh skills-hub install eval-driven-dev`
Don't have the extension? Run `gh extension install samueltauil/skills-hub` first.
Download and extract to your repository:
`.github/skills/eval-driven-dev/`
Extract the ZIP to `.github/skills/` in your repo. The folder name must match `eval-driven-dev` for Copilot to auto-discover it.
Skill Files (19)
SKILL.md 17.3 KB
---
name: eval-driven-dev
description: >
  Improve an AI application with evaluation-driven development. Define eval criteria, instrument the application, build golden datasets, observe and evaluate application runs, analyze results, and produce a concrete action plan for improvements.
  ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals,
  evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM.
license: MIT
compatibility: Python 3.10+
metadata:
  version: 0.8.4
  pixie-qa-version: ">=0.8.4,<0.9.0"
  pixie-qa-source: https://github.com/yiouli/pixie-qa/
---
# Eval-Driven Development for Python LLM Applications
You're building an **automated evaluation pipeline** that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.
**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.
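To make the contrast with `assertEqual` concrete, here is a minimal sketch of a similarity-score check. It uses only the standard library's `difflib` and is not a pixie evaluator; the function names and the `0.8` threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity_score(expected: str, actual: str) -> float:
    """Return a 0.0-1.0 ratio of how closely two strings match (case-insensitive)."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def passes(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Pass when the similarity score meets the illustrative threshold."""
    return similarity_score(expected, actual) >= threshold


expected = "Your order ships on Monday."
actual = "Your order ships on Monday!"

exact_match = expected == actual        # what an assertEqual-style check compares
fuzzy_match = passes(expected, actual)  # what a similarity-style check compares
```

The two outputs differ by one punctuation mark, so strict equality rejects a response that a graded score accepts. Character-level similarity is only a stand-in here; semantic criteria usually need an LLM-as-judge evaluator.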
During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.
**Rule: The app's LLM calls must go to a real LLM.** Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.
**The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
---
## Before you start
**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the `setup.sh` script included in the skill's `resources/` directory.
The script updates the `eval-driven-dev` skill and the `pixie-qa` Python package to the latest version, initializes the pixie working directory if it is not already initialized, and starts a web server in the background to show updates to the user.
**Setup error handling — what you can skip vs. what must succeed:**
- **Skill update fails** → OK to continue. The existing skill version is sufficient.
- **pixie-qa upgrade fails but was already installed** → OK to continue with the existing version.
- **pixie-qa is NOT installed and installation fails** → **STOP.** Ask the user for help. The workflow cannot proceed without the `pixie` package.
- **`pixie init` fails** → **STOP.** Ask the user for help.
- **`pixie start` (web server) fails** → **STOP.** Ask the user for help. Check `server.log` in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results.
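The rules above amount to a small decision table. The sketch below encodes it so the continue/stop outcome is explicit; the step labels (`skill_update`, `pixie_qa_install`, `pixie_init`, `pixie_start`) are hypothetical names invented for this example, not identifiers used by `setup.sh`.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    STOP = "stop and ask the user"


def on_setup_failure(step: str, pixie_already_installed: bool = False) -> Action:
    """Map a failed setup step to the action the rules above prescribe.

    The step names are hypothetical labels chosen for this sketch.
    """
    if step == "skill_update":
        # The existing skill version is sufficient.
        return Action.CONTINUE
    if step == "pixie_qa_install":
        # A failed upgrade is tolerable only if some version is already present.
        return Action.CONTINUE if pixie_already_installed else Action.STOP
    if step in ("pixie_init", "pixie_start"):
        return Action.STOP
    raise ValueError(f"unknown setup step: {step}")
```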
---
## The workflow
Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
**How to work — read this before doing anything else:**
- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–6 while working on Step 1.
- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.
- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.
**When to stop and ask for help:**
Some blockers cannot and should not be worked around. When you encounter any of the following, **stop immediately and ask the user for help** — do not attempt workarounds:
- **Application won't run due to missing environment variables or configuration**: The app requires environment variables or configuration that are not set and cannot be inferred. Do NOT work around this by mocking, faking, or replacing application components — the eval must exercise real production code. Ask the user to fix the environment setup.
- **App import failures that indicate a broken project**: If the app's core modules cannot be imported due to missing system dependencies or incompatible Python versions (not just missing pip packages you can install), ask the user to fix the project setup.
- **Ambiguous entry point**: If the app has multiple equally plausible entry points and the project analysis doesn't clarify which one matters most, ask the user which to target.
Blockers you SHOULD resolve yourself (do not ask): missing Python packages (install them), missing `pixie` package (install it), port conflicts (pick a different port), file permission issues (fix them).
**Run Steps 1–6 in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.
---
### Step 1: Understand the app and define eval criteria
**First, check the user's prompt for specific requirements.** Before reading app code, examine what the user asked for:
- **Referenced documents or specs**: Does the prompt mention a file to follow (e.g., "follow the spec in EVAL_SPEC.md", "use the methodology in REQUIREMENTS.md")? If so, **read that file first** — it may specify datasets, evaluation dimensions, pass criteria, or methodology that override your defaults.
- **Specified datasets or data sources**: Does the prompt reference specific data files (e.g., "use questions from eval_inputs/research_questions.json", "use the scenarios in call_scenarios.json")? If so, **read those files** — you must use them as the basis for your eval dataset, not fabricate generic alternatives.
- **Specified evaluation dimensions**: Does the prompt name specific quality aspects to evaluate (e.g., "evaluate on factuality, completeness, and bias", "test identity verification and tool call correctness")? If so, **every named dimension must have a corresponding evaluator** in your test file.
If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.
Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**
#### Sub-step 1a: Project analysis
> **Reference**: Read `references/1-a-project-analysis.md` now.
Before looking at code structure or entry points, understand what this software does in the real world — its purpose, its users, the complexity of real inputs, and where it fails. This understanding drives every downstream decision: which entry points matter most, what eval criteria to define, what trace inputs to use, and what dataset entries to create. Write the detailed context file before moving on. **Note**: the project may contain `tests/`, `fixtures/`, `examples/`, mock servers, and documentation — these are the project's own development infrastructure, NOT data sources for your eval pipeline. Ignore them when sourcing trace inputs and dataset content.
> **Checkpoint**: `pixie_qa/00-project-analysis.md` written — covering what the software does, target users, capability inventory (at least 3 capabilities if the project has them), realistic input characteristics, and hard problems / failure modes (at least 2).
#### Sub-step 1b: Entry point & execution flow
> **Reference**: Read `references/1-b-entry-point.md` now.
Read the source code to understand how the app starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize entry points — focus on the entry point(s) that exercise the most valuable capabilities, not just the first one found. Write the detailed context file before moving on.
> **Checkpoint**: `pixie_qa/01-entry-point.md` written — covering entry point, execution flow, user-facing interface, and env requirements.
#### Sub-step 1c: Eval criteria
> **Reference**: Read `references/1-c-eval-criteria.md` now.
Define the app's use cases and eval criteria. Derive use cases from the **capability inventory** in `pixie_qa/00-project-analysis.md`. Derive eval criteria from the **hard problems / failure modes** — not generic quality dimensions. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write the detailed context file before moving on.
> **Checkpoint**: `pixie_qa/02-eval-criteria.md` written — covering use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet.
---
### Step 2: Instrument, run application, and capture a reference trace
Step 2 has three sub-steps. Each reads its own reference file. **Complete each sub-step before starting the next.**
#### Sub-step 2a: Instrument with `wrap`
> **Reference**: Read `references/2a-instrumentation.md` now.
Add `wrap()` calls at the app's data boundaries so the eval harness can inject controlled inputs and capture outputs. This makes the app testable without changing its logic.
> **Checkpoint**: `wrap()` calls added at all data boundaries. Every eval criterion from `pixie_qa/02-eval-criteria.md` has a corresponding data point.
#### Sub-step 2b: Implement the Runnable
> **Reference**: Read `references/2b-implement-runnable.md` now.
Write a Runnable class that lets the eval harness invoke the app exactly as a real user would. The Runnable should be simple — it just wires up the app's real entry point to the harness interface. If it's getting complicated, something is wrong.
> **Checkpoint**: `pixie_qa/run_app.py` written. The Runnable calls the app's real entry point with real LLM configuration — no mocking, no faking, no component replacement.
#### Sub-step 2c: Capture and verify a reference trace
> **Reference**: Read `references/2c-capture-and-verify-trace.md` now.
Run the app through the Runnable and capture a trace. The trace proves instrumentation and the Runnable are working correctly, and provides the data shapes needed for dataset creation in Step 4.
> **Checkpoint**: `pixie_qa/reference-trace.jsonl` exists. All expected `wrap` entries and `llm_span` entries appear. `pixie format` shows all data points needed for evaluation. Do NOT read Step 3 instructions yet.
---
### Step 3: Define evaluators
> **Reference**: Read `references/3-define-evaluators.md` now for the detailed sub-steps.
**Goal**: Turn the qualitative eval criteria from Step 1c into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator, an **agent evaluator** (the default for any semantic or qualitative criterion), or a manual custom function (only for mechanical/deterministic checks like regex or field existence). The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer. Select evaluators that measure the **hard problems** identified in `pixie_qa/00-project-analysis.md` — not just generic quality dimensions.
> **Checkpoint**: All evaluators implemented. `pixie_qa/03-evaluator-mapping.md` written with criterion-to-evaluator mapping and decision rationale. Do NOT read Step 4 instructions yet.
---
### Step 4: Build the dataset
> **Reference**: Read `references/4-build-dataset.md` now for the detailed sub-steps.
**Goal**: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names. Cover entries from the **capability inventory** in `pixie_qa/00-project-analysis.md` and include entries targeting the **failure modes** identified there. **Do NOT use the project's own test fixtures, mock servers, or example data as dataset `eval_input` content** — source real-world data instead. **Every `wrap(purpose="input")` in the app must have pre-captured content in each entry's `eval_input`** — do NOT leave `eval_input` empty when the app has input wraps.
> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/<name>.json` with diverse entries covering all use cases. **Dataset realism audit passed** — entries use real-world data at representative scale, no project test fixtures contamination, at least one entry targets a failure mode with uncertain outcome, and every `eval_input` has captured content for all input wraps. Do NOT read Step 5 instructions yet.
---
### Step 5: Run `pixie test` and fix mechanical issues
> **Reference**: Read `references/5-run-tests.md` now for the detailed sub-steps.
**Goal**: Execute the full pipeline end-to-end and get it running without mechanical errors. This step is strictly about fixing setup and data issues in the pixie QA components (dataset, runnable, custom evaluators) — NOT about fixing the application itself or evaluating result quality. Once `pixie test` completes without errors and produces real evaluator scores for every entry, this step is done.
> **Checkpoint**: `pixie test` runs to completion. Every dataset entry has evaluator scores (real `EvaluationResult` or `PendingEvaluation`). No setup errors, no import failures, no data validation errors.
>
> If the test errors out, that's a mechanical bug in your QA components — fix and re-run. But once tests produce scores, move on. Do NOT assess result quality here — that's Step 6.
**Always proceed to Step 6 after tests produce scores.** Analysis is the essential final step — without it, pending evaluations are never completed and the user gets uninterpreted raw scores with no actionable insights. Do NOT stop here and ask the user whether to continue.
**Cycle rule for iterative runs**: Every successful `pixie test` invocation creates a concrete `pixie_qa/results/<test_id>` directory and starts a new analysis cycle. Before you edit application code, prompts, datasets, evaluators, or rerun `pixie test`, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.
---
### Step 6: Analyze outcomes
> **Reference**: Read `references/6-analyze-outcomes.md` now — it has the complete three-phase analysis process, writing guidelines, and output format requirements.
**Goal**: Analyze `pixie test` results in a structured, data-driven process to produce actionable insights on test case quality, evaluator quality, and application quality. This step completes pending evaluations, writes per-dataset analysis, and produces a prioritized action plan for the test run. Every statement must be backed by concrete data from the evaluation run: no speculation, no hand-waving.
**Persisted analysis artifacts**: In this trimmed workflow, persist analysis only at the dataset level and test-run level. Those artifacts still use a **detailed version** (for agent consumption: data points, evidence trails, reasoning chains) plus a **summary version** (for human review: concise TLDR readable in under 2 minutes). Do not create per-entry analysis files.
**Hard completion gate**: Step 6 is **not complete** until all of the following are true:
- Every `"status": "pending"` entry in every `pixie_qa/results/<test_id>/dataset-*/entry-*/evaluations.jsonl` has been replaced with a scored result containing `score` and `reasoning`.
- Every dataset directory has `analysis.md` and `analysis-summary.md`.
- The test run root has `action-plan.md` and `action-plan-summary.md`.
- You have run the Step 6 verifier script from this skill's `resources/` directory against `pixie_qa/results/<test_id>`, and it reports success.
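As a rough self-check before running the skill's own verifier, the first three gate conditions can be approximated with a short script. This is a sketch, not the verifier shipped in `resources/`: it assumes each line of `evaluations.jsonl` is a JSON object and that a pending evaluation carries `"status": "pending"`, as described above. The function name is hypothetical.

```python
import json
from pathlib import Path


def unmet_gate_conditions(results_dir: Path) -> list[str]:
    """List unmet Step 6 gate conditions for one results directory (sketch)."""
    problems: list[str] = []

    # Any evaluation line still marked pending blocks completion.
    for path in sorted(results_dir.glob("dataset-*/entry-*/evaluations.jsonl")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if isinstance(record, dict) and record.get("status") == "pending":
                rel = path.relative_to(results_dir).as_posix()
                problems.append(f"pending evaluation in {rel}")
                break

    # Each dataset directory needs both analysis files.
    for dataset in sorted(p for p in results_dir.glob("dataset-*") if p.is_dir()):
        for name in ("analysis.md", "analysis-summary.md"):
            if not (dataset / name).is_file():
                problems.append(f"missing {dataset.name}/{name}")

    # The test run root needs both action-plan files.
    for name in ("action-plan.md", "action-plan-summary.md"):
        if not (results_dir / name).is_file():
            problems.append(f"missing {name}")

    return problems
```

An empty list means these three conditions look satisfied. The fourth condition still requires running the actual verifier script.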
**Explicitly not sufficient**:
- Writing a single top-level file such as `pixie_qa/06-analysis.md`
- Saying pending evaluations are for the user to review in the web UI
- Saying an entry "likely passes" without updating `evaluations.jsonl`
---
## Web Server Management
pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script (via `pixie start`, which launches a detached background process and returns immediately).
When the user is done with the eval-driven-dev workflow, tell them that the web server is still running and that you can stop it with:
```bash
pixie stop
```
IMPORTANT: after the web server is stopped, the web UI becomes inaccessible. So only stop the server if the user confirms they're done with all web UI features. If they want to keep using the web UI, do NOT stop the server.
Whenever you restart the workflow, always run the `setup.sh` script in the skill's `resources/` directory again to ensure the web server is running.
references/
1-a-project-analysis.md 5.7 KB
# Step 1a: Project Analysis
Before looking at code structure, entry points, or writing any instrumentation, understand what this software does in the real world. This analysis is the foundation for every subsequent step — it determines which entry points to prioritize, what eval criteria to define, what trace inputs to use, and what dataset entries to build.
---
## What to investigate
Read the project's README, documentation, and top-level source files. You're looking for answers to five questions:
### 1. What does this software do?
Write a one-paragraph plain-language summary. What problem does it solve? What does a successful run look like?
### 2. Who uses it and why?
Who are the target users? What's the primary use case? What problem does this solve that alternatives don't? This helps you understand what "quality" means for this app — a chatbot that chats with customers has different quality requirements than a research agent that synthesises multi-source reports.
### 3. Capability inventory
List the distinct capabilities, modes, or features the app offers. Be specific. For example:
- For a scraping library: single-page scraping, multi-page scraping, search-based scraping, speech output, script generation
- For a voice agent: greeting, FAQ handling, account lookup, transfer to human, call summarization
- For a research agent: topic research, multi-source synthesis, citation generation, report formatting
Each capability may need its own entry point, its own trace, and its own dataset entries. This list directly feeds Step 1c (use cases) and Step 4 (dataset diversity).
### 4. What are realistic inputs?
Characterize the real-world inputs the app processes — not toy examples:
- For a web scraper: "messy HTML pages with navigation, ads, dynamic content, tables, nested structures — typically 5KB-500KB of HTML"
- For a research agent: "open-ended research questions requiring multi-source synthesis, with 3-10 sub-questions"
- For a voice agent: "multi-turn conversations with background noise, interruptions, and ambiguous requests"
Be specific about **scale** (how large), **complexity** (how messy/diverse), and **variety** (what kinds). This directly feeds trace input selection (Step 2) — if you don't characterize realistic inputs here, you'll end up using toy inputs that bypass the app's real logic.
**This section is an operational constraint, not just documentation.** Steps 2c (trace input) and 4c (dataset entries) will cross-reference these characteristics to verify that trace inputs and dataset entries match real-world scale and complexity. Be concrete and quantitative — write "5KB–500KB HTML pages," not "various HTML pages."
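A quantitative characterization can be checked mechanically later. The sketch below tests whether a candidate input falls inside a documented size range, using the 5KB–500KB figure from the scraper example as default bounds. The function names are hypothetical, and size is only one of the three dimensions; complexity and variety still need human judgment.

```python
def size_in_kb(payload: str) -> float:
    """Size of the payload in kilobytes when encoded as UTF-8."""
    return len(payload.encode("utf-8")) / 1024


def within_documented_scale(payload: str, min_kb: float = 5, max_kb: float = 500) -> bool:
    """True when the payload size falls inside the documented range."""
    return min_kb <= size_in_kb(payload) <= max_kb


# A toy page of a few dozen bytes: far below the documented range.
toy_page = "<html><body><p>Hello</p></body></html>"

# Padded only to exercise the size check; repetition is not realistic content.
padded_page = "<html><body>" + "<div class='row'><td>cell</td></div>" * 400 + "</body></html>"

toy_ok = within_documented_scale(toy_page)
padded_ok = within_documented_scale(padded_page)
```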
### 5. What are the hard problems / failure modes?
What makes this app's job difficult? Where does it fail in practice? These become the most valuable eval scenarios:
- For a scraper: "malformed HTML, dynamic JS-rendered content, complex nested schemas, very large pages that exceed context windows"
- For a research agent: "conflicting sources, questions requiring multi-step reasoning, hallucinating citations"
- For a voice agent: "ambiguous caller intent, account lookup failures, simultaneous tool calls"
Each failure mode should map to at least one eval criterion (Step 1c) and at least one dataset entry (Step 4).
---
## Output: `pixie_qa/00-project-analysis.md`
Write your findings to this file. **Complete all five sections before moving to sub-step 1b.** This document is referenced by every subsequent step.
### Template
```markdown
# Project Analysis
## What this software does
<One paragraph: what it does, in plain language. Not class names or file paths — what problem does it solve for its users?>
## Target users and value proposition
<Who uses it, why, what problem it solves that alternatives don't>
## Capability inventory
1. <Capability name>: <one-line description>
2. <Capability name>: <one-line description>
3. ...
## Realistic input characteristics
<What real-world inputs look like — size, complexity, messiness, variety. Be specific about scale and structure.>
## Hard problems and failure modes
1. <Failure mode>: <why it's hard, what goes wrong>
2. <Failure mode>: <why it's hard, what goes wrong>
3. ...
```
### Quality check
Before moving on, verify:
- The "What this software does" section describes the app's purpose in terms a non-technical user would understand — not just "it runs a graph" or "it calls OpenAI"
- The capability inventory lists at least 3 capabilities (if the project has them) — if you only found 1, you may have only looked at one part of the codebase
- The realistic input characteristics describe real-world scale and complexity, not the simplest possible input
- The failure modes are specific to this app's domain, not generic ("bad input" is not a failure mode; "malformed HTML with unclosed tags that breaks the parser" is)
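Part of this check is mechanical and can be sketched in code: confirm that every template section is present and non-empty, and that the two numbered sections meet the minimum counts named in the Step 1 checkpoint (3 capabilities, 2 failure modes). The helper names are hypothetical. The sketch cannot judge whether the content is specific enough, so it supplements the review above rather than replacing it.

```python
import re

# Section heading -> minimum number of numbered items expected beneath it.
REQUIRED_SECTIONS = {
    "What this software does": 0,
    "Target users and value proposition": 0,
    "Capability inventory": 3,
    "Realistic input characteristics": 0,
    "Hard problems and failure modes": 2,
}


def split_sections(markdown: str) -> dict[str, str]:
    """Map each '## ' heading to the text beneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections


def analysis_gaps(markdown: str) -> list[str]:
    """Return a list of structural gaps; an empty list means none were found."""
    sections = split_sections(markdown)
    gaps = []
    for heading, min_items in REQUIRED_SECTIONS.items():
        body = sections.get(heading)
        if body is None or not body.strip():
            gaps.append(f"missing or empty section: {heading}")
            continue
        items = re.findall(r"^\d+\.\s+\S", body, flags=re.MULTILINE)
        if len(items) < min_items:
            gaps.append(f"{heading}: {len(items)} items, expected at least {min_items}")
    return gaps
```

A capability count below 3 is a prompt to look at more of the codebase, not an automatic failure, since some projects genuinely have fewer capabilities.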
### What to ignore in the project
The project may contain directories and files that are part of its own development/test infrastructure — `tests/`, `fixtures/`, `examples/`, `mock_server/`, `docs/`, demo scripts, etc. These exist for the project's developers, not for your eval pipeline.
**Critical**: Do NOT use the project's test fixtures, mock servers, example data, or unit test infrastructure as inputs for your eval traces or dataset entries. They are designed for development speed and isolation — small, clean, deterministic data that bypasses every real-world difficulty. Using them produces trivially easy evaluations that cannot catch real quality issues.
When you encounter these directories during analysis, note their existence but treat them as implementation details of the project — not as data sources for your QA pipeline. Your QA pipeline must test the app against real-world conditions, not against the project's own test shortcuts.
1-b-entry-point.md 2.3 KB
# Step 1b: Entry Point & Execution Flow
Identify how the application starts and how a real user invokes it. Use the **capability inventory** from `pixie_qa/00-project-analysis.md` to prioritize — focus on the entry point(s) that exercise the most valuable and frequently-used capabilities, not just the first one you find.
---
## What to investigate
### 1. How the software runs
What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
Look for:
- `if __name__ == "__main__"` blocks
- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
- CLI entry points in `pyproject.toml` (`[project.scripts]`)
- Docker/compose configs that reveal startup commands
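The first item in the list can be detected programmatically. The sketch below uses the standard library's `ast` module to report whether a module's source has a top-level `if __name__ == "__main__":` block. The function name is hypothetical, and the check deliberately ignores less common spellings such as the reversed comparison.

```python
import ast


def has_main_guard(source: str) -> bool:
    """Return True if the module has a top-level `if __name__ == "__main__":` block."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and test.left.id == "__name__"
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and isinstance(test.comparators[0], ast.Constant)
            and test.comparators[0].value == "__main__"
        ):
            return True
    return False
```

A match only shows that the file can be run as a script. Whether it is the entry point a real user invokes is the question the next section addresses.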
### 2. The real user entry point
How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.
- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
- **CLI**: What command-line arguments does the user provide?
- **Library/function**: What function does the caller import and call? What arguments?
### 3. Environment and configuration
- What env vars does the app require? (service endpoints, database URLs, feature flags)
- What config files does it read?
- What has sensible defaults vs. what must be explicitly set?
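Once the required variables are known, a short check can surface missing ones before the app is run. This is a minimal sketch: the function name and the variable names in the usage note are hypothetical.

```python
import os
from collections.abc import Mapping


def missing_required_vars(
    required: list[str], environ: Mapping[str, str] = os.environ
) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]
```

For example, `missing_required_vars(["DATABASE_URL", "SERVICE_ENDPOINT"])` returns the names that still need to be set. A non-empty result is a reason to ask the user to fix the environment, not a reason to mock the missing component.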
---
## Output: `pixie_qa/01-entry-point.md`
Write your findings to this file. Keep it focused — only entry point and execution flow.
### Template
```markdown
# Entry Point & Execution Flow
## How to run
<Command to start the app, required env vars, config files>
## Entry point
- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>
## User-facing endpoints / interface
<For each way a user interacts with the app:>
- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>
## Environment requirements
| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ... | ... | ... | ... |
```
1-c-eval-criteria.md 7.5 KB
# Step 1c: Eval Criteria
Define what quality dimensions matter for this app — based on the project analysis (`00-project-analysis.md`) and the entry point (`01-entry-point.md`) you've already documented.
This document serves two purposes:
1. **Dataset creation (Step 4)**: The use cases tell you what kinds of items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 3)**: The eval criteria tell you what evaluators to choose and how to map them.
**Derive use cases from the capability inventory** in `pixie_qa/00-project-analysis.md`. **Derive eval criteria from the hard problems / failure modes** — not generic quality dimensions like "factuality" or "relevance".
Keep this concise — it's a planning artifact, not a comprehensive spec.
---
## What to define
### 1. Use cases
List the distinct scenarios the app handles. Derive these from the **capability inventory** in `pixie_qa/00-project-analysis.md` — each capability should map to at least one use case. Each use case becomes a category of dataset items. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.
When possible, indicate the **expected difficulty level** for each use case — e.g., "routine" for straightforward cases, "challenging" for edge cases or failure-mode scenarios. This guides dataset creation (Step 4) to include entries across a range of difficulty levels rather than clustering at easy cases.
**Good use case descriptions:**
- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"
**Bad use case descriptions (too vague):**
- "Handle billing questions"
- "Edge case"
- "Error handling"
### 2. Eval criteria
Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.
**Good criteria are specific to the app's purpose** and derived from the **hard problems / failure modes** in `pixie_qa/00-project-analysis.md`. Examples:
- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"
- Web scraper: "Does the extracted data match the requested schema fields?", "Does it handle malformed HTML without crashing or losing data?"
**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app. If your criteria could apply to any chatbot (e.g., "Groundedness", "PromptRelevance"), they're too generic — go back to the failure modes in `00-project-analysis.md` and derive criteria from those.
At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.
### 3. Check criteria applicability and observability
For each criterion:
1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:
- **Universal criteria** → become dataset-level default evaluators
- **Case-specific criteria** → become item-level evaluators on relevant rows only
2. **Verify observability** — for each criterion, identify what data point in the app needs to be captured as a `wrap()` call to evaluate it. This drives the wrap coverage in Step 2.
- If the criterion is about the app's final response → captured by `wrap(purpose="output", name="response")`
- If it's about a routing decision → captured by `wrap(purpose="state", name="routing_decision")`
- If it's about data the app fetched and used → captured by `wrap(purpose="input", name="...")`
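The two checks can be pictured as operations on a small record per criterion. The sketch below is a planning aid, not part of the pixie API: the `Criterion` class and its fields are hypothetical, with an empty `applies_to` standing for "all use cases" and an empty `wrap_name` standing for "no data point identified yet".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    applies_to: tuple[str, ...]  # empty tuple means "all use cases"
    wrap_name: str               # name of the wrap() data point; "" if none yet


def split_by_scope(criteria: list[Criterion]) -> tuple[list[Criterion], list[Criterion]]:
    """Separate universal criteria from case-specific ones."""
    universal = [c for c in criteria if not c.applies_to]
    case_specific = [c for c in criteria if c.applies_to]
    return universal, case_specific


def unobservable(criteria: list[Criterion]) -> list[Criterion]:
    """Criteria with no data point to capture; these need wrap coverage in Step 2."""
    return [c for c in criteria if not c.wrap_name]


criteria = [
    Criterion("Responses are concise enough for phone conversation", (), "response"),
    Criterion("Agent verifies identity before transferring", ("account lookup",), "routing_decision"),
    Criterion("Answers reflect the customer's plan details", (), ""),
]
universal, case_specific = split_by_scope(criteria)
needs_wrap = unobservable(criteria)
```

Here two criteria would become dataset-level defaults, one would attach only to account-lookup entries, and one still lacks a data point to observe.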
---
## Projects with multiple capabilities
If the project analysis (`pixie_qa/00-project-analysis.md`) lists multiple capabilities, you should evaluate at minimum the **2-3 most important / commonly used** capabilities. Don't limit the dataset to a single capability when the project's value comes from breadth.
For each additional capability beyond the first:
- Add use cases in `02-eval-criteria.md`
- Plan for a separate trace (run `pixie trace` with different entry points / configs) in Step 2
- Plan dataset entries covering that capability in Step 4
If time or context constraints make it impractical to cover all capabilities, **document which ones you covered and which you skipped** (with rationale) at the end of `02-eval-criteria.md`.
---
## Criteria quality gate (mandatory self-check)
Before writing `02-eval-criteria.md`, run this check on every criterion:
> **For each criterion, ask: "If the app returned a structurally correct but semantically wrong or hallucinated answer, would this criterion catch it?"**
- If the answer is "no" for ALL criteria, your criteria set is **structural-only** — it checks plumbing (fields exist, data flowed through) but not quality (content is correct, complete, non-hallucinated). **You must add at least one semantic criterion** that evaluates the _content_ of the app's output, not just its shape.
- Structural criteria (field existence, JSON validity, format checks) are useful but insufficient. They pass even when the app returns fabricated or incorrect data.
**Examples of structural vs semantic criteria:**
| Structural (checks shape) | Semantic (checks quality) |
| ------------------------------------------- | -------------------------------------------------------------------------- |
| "Required fields are present in the output" | "Extracted values match the source content — no hallucinated data" |
| "Source type matches expected type" | "The app correctly interpreted noisy input without losing key facts" |
| "Output is valid JSON" | "The summary accurately captures the main points of the document" |
| "Response contains at least N characters" | "The response addresses the user's specific question, not a generic topic" |
A good criteria set has **both** structural and semantic criteria. Structural criteria catch gross failures (app crashed, returned empty output). Semantic criteria catch quality failures (app ran but returned wrong/hallucinated/incomplete content).
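The gap between the two kinds of criteria is easy to show with a fabricated answer. The sketch below is illustrative only: the source text, the output, and both check functions are made up for this example, and the substring-based grounding check is a toy stand-in for a semantic criterion, which in this workflow would normally be judged by an evaluator rather than by string matching.

```python
import json

SOURCE_TEXT = "Acme Corp was founded in 1999 and is based in Oslo."

# A fabricated answer: well-formed, but the year and city are wrong.
hallucinated_output = json.dumps(
    {"company": "Acme Corp", "founded": 2005, "city": "Paris"}
)

REQUIRED_FIELDS = {"company", "founded", "city"}


def structural_check(raw: str) -> bool:
    """Pass when the output is a JSON object containing the required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= set(data)


def naive_grounding_check(raw: str, source: str) -> bool:
    """Toy semantic check: every extracted value must appear in the source."""
    data = json.loads(raw)
    return all(str(value) in source for value in data.values())


print(structural_check(hallucinated_output))                    # True
print(naive_grounding_check(hallucinated_output, SOURCE_TEXT))  # False
```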
---
## Output: `pixie_qa/02-eval-criteria.md`
Write your findings to this file. **Keep it short** — the template below is the maximum length.
### Template
```markdown
# Eval Criteria
## Use cases
1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...
## Eval criteria
| # | Criterion | Applies to | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1 | ... | All | wrap name: ... |
| 2 | ... | Use case 1, 3 | wrap name: ... |
## Capability coverage
Capabilities covered: <list>
Capabilities skipped (with rationale): <list or "none">
```
2a-instrumentation.md 7.2 KB
# Step 2a: Instrument with `wrap`
> For the full `wrap()` API reference, see `wrap-api.md`.
**Goal**: Add `wrap()` calls at data boundaries so the eval harness can (1) inject controlled inputs in place of real external dependencies, and (2) capture outputs for scoring.
---
## Data-flow analysis
Starting from LLM call sites, trace backwards and forwards through the code to find:
- **Dependency input**: data from external systems (databases, APIs, caches, file systems, network fetches)
- **App output**: data going out to users or external systems
- **Intermediate state**: internal decisions relevant to evaluation (routing, tool calls)
You do **not** need to wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation.
## Adding `wrap()` calls
For each data point found, add a `wrap()` call in the application code:
```python
import pixie

# External dependency data — function form (prevents the real call in eval mode)
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile",
                     description="Customer profile fetched from database")(user_id)

# External dependency data — function form (prevents the real call in eval mode)
history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history",
                     description="Conversation history from Redis")(session_id)

# App output — what the user receives
response = pixie.wrap(response_text, purpose="output", name="response",
                      description="The assistant's response to the user")

# Intermediate state — internal decision relevant to evaluation
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
                            description="Which agent was selected to handle this request")
```
### Value vs. function wrapping
```python
# Value form: wrap a data value (result already computed)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")
# Function form: wrap the callable — in eval mode the original function is
# NOT called; the registry value is returned instead.
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id)
```
**CRITICAL: Always use function form for `purpose="input"` wraps on external calls** — HTTP requests, database queries, API calls, file reads, cache lookups. Function form prevents the real call from executing in eval mode, so the dataset value is returned directly without making a live network request or database query. Value form still executes the real call first and only replaces the result afterwards — this wastes time, creates flaky tests, and makes evals dependent on external service availability.
The only case where value form is acceptable for `purpose="input"` is when the wrapped value is a local computation (no I/O, no side effects) that is cheap to recompute.
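The difference in call behavior can be seen with a toy model. This sketch does not use pixie and is not pixie's implementation: `toy_wrap`, `EVAL_REGISTRY`, and `get_profile` are hypothetical stand-ins that only mimic the two call shapes described above.

```python
# Toy model of the two call shapes; this is NOT pixie's implementation.
EVAL_REGISTRY = {"customer_profile": {"name": "Test User"}}  # hypothetical dataset values
real_calls = []  # records every time the "external" function really runs


def toy_wrap(target, *, purpose, name, eval_mode=True):
    """Mimic wrap(): accept either a value or a callable."""
    if callable(target):
        def wrapper(*args, **kwargs):
            if eval_mode and purpose == "input":
                return EVAL_REGISTRY[name]  # real function is never called
            return target(*args, **kwargs)
        return wrapper
    if eval_mode and purpose == "input":
        return EVAL_REGISTRY[name]  # the caller already computed the value
    return target


def get_profile(user_id):
    """Stand-in for a slow or flaky external call."""
    real_calls.append(user_id)
    return {"name": "Live User"}


# Value form: get_profile runs before toy_wrap ever sees the result.
value_form = toy_wrap(get_profile("u1"), purpose="input", name="customer_profile")

# Function form: toy_wrap receives the callable and skips it in eval mode.
function_form = toy_wrap(get_profile, purpose="input", name="customer_profile")("u2")

print(value_form, function_form)
print(real_calls)  # only the value-form call reached the "external" function
```

Both forms return the registry value, but only the function form avoids the external call.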
### Placement rules
1. **Wrap at the data boundary** — where data enters or exits the application, not deep inside utility functions.
2. **Names must be unique** across the entire application (used as registry keys and dataset field names).
3. **Use `lower_snake_case`** for names.
4. **Don't change the function's interface** — `wrap()` is purely additive, returns the same type.
### Placement by purpose
#### `purpose="input"` — where external data enters
Place input wraps at the **boundary where external data enters the app**, not at intermediate processing stages. In a pipeline architecture (fetch → process → extract → format):
- **Correct**: `wrap(fetch_page, purpose="input", name="fetched_page")(url)` using **function form** at the HTTP fetch boundary — in eval mode, the fetch is skipped entirely and the dataset value is returned; in trace mode, the real fetch runs and the result is captured.
- **Incorrect**: `wrap(html_content, purpose="input", name="fetched_page")` using value form — the HTTP fetch still runs in eval mode (wasting time and creating flaky tests), and only the result is replaced afterwards.
- **Incorrect**: `wrap(processed_chunks, purpose="input", name="chunks")` after parsing — eval mode bypasses parsing and chunking entirely, so that internal logic is never exercised.
**Principle**: `wrap(purpose="input")` replaces the _minimum external dependency_ while exercising the _maximum internal logic_. Push the boundary as far upstream as possible. **Always use function form** for input wraps on external calls — this prevents the real call from executing in eval mode.
#### `purpose="output"` — where processed data exits
Track **downstream** from the LLM response to find where data leaves the app — sent to the user, written to storage, rendered in UI, or passed to an external system. Wrap at that exit boundary.
- Don't wrap raw LLM responses — those are already captured by OpenInference auto-instrumentation as `llm_span` entries.
- Wrap the app's **final processed result** — after any post-processing, formatting, or transformation the app applies to the LLM output.
- If the app has multiple output channels (e.g., a response to the user AND a side-effect write to a database), wrap each one separately.
```python
# Final response after the app's formatting pipeline
response = pixie.wrap(formatted_response, purpose="output", name="response",
                      description="Final response sent to the user")

# Side-effect output — data written to external storage
pixie.wrap(saved_record, purpose="output", name="saved_summary",
           description="Summary record saved to the database")
```
**Principle**: output wraps are observation-only — they capture what the app produced so evaluators can score it. They are never mocked or injected during eval runs.
#### `purpose="state"` — internal decisions relevant to evaluation
Some eval criteria need to judge the app's internal reasoning — not just what went in or came out, but _how_ the app made decisions. Wrap internal state when an eval criterion requires it and the data isn't visible in inputs or outputs.
Common examples:
- **Agent routing**: which sub-agent or tool was selected to handle a request
- **Plan/step decisions**: what steps the agent chose to execute
- **Memory updates**: what the agent added to or removed from its working memory
- **Retrieval results**: which documents/chunks were retrieved before being fed to the LLM
```python
# Agent routing decision
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
                            description="Which agent was selected to handle this request")

# Retrieved context fed to LLM
pixie.wrap(retrieved_chunks, purpose="state", name="retrieved_context",
           description="Document chunks retrieved by RAG before LLM call")
```
**Principle**: only wrap state that an eval criterion actually needs. Don't wrap every variable — state wraps are for internal data that evaluators must see but that doesn't appear in the app's inputs or outputs.
### Coverage check
After adding all `wrap()` calls, go through each eval criterion from `pixie_qa/02-eval-criteria.md` and verify:
1. Every criterion that judges **what went in** has a corresponding `input` or `entry` wrap.
2. Every criterion that judges **what came out** has a corresponding `output` wrap.
3. Every criterion that judges **how the app decided** has a corresponding `state` wrap.
If a criterion needs data that isn't captured, add the wrap now — don't defer.
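The coverage check can also be run mechanically. A minimal sketch, assuming you keep a hand-written mapping of criteria to the `(purpose, name)` pair each one needs; `required`, `instrumented`, and `missing_wraps` are hypothetical names for this illustration, not part of pixie.

```python
# Hypothetical planning data: the (purpose, name) pair each criterion needs.
required = {
    "Response addresses the question": ("output", "response"),
    "Correct agent handled the request": ("state", "routing_decision"),
    "Answer is grounded in the customer profile": ("input", "customer_profile"),
}

# (purpose, name) pairs for the wrap() calls actually added to the app.
instrumented = {
    ("output", "response"),
    ("input", "customer_profile"),
}


def missing_wraps(required, instrumented):
    """Return the criteria whose data point has no matching wrap."""
    return {
        criterion: need
        for criterion, need in required.items()
        if need not in instrumented
    }


for criterion, (purpose, name) in missing_wraps(required, instrumented).items():
    print(f"MISSING {purpose} wrap '{name}' for: {criterion}")
```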
---
## Output
Modified application source files with `wrap()` calls at data boundaries.
2b-implement-runnable.md 7.3 KB
# Step 2b: Implement the Runnable
> For the full `Runnable` protocol and `wrap()` API, see `wrap-api.md`.
**Goal**: Write a Runnable class that lets the eval harness invoke the application exactly as a real user would.
---
## The core idea
The Runnable is how `pixie test` and `pixie trace` run your application. Think of it as a programmatic stand-in for a real user: it starts the app, sends it a request, and lets the app do its thing. The eval harness calls `run()` for each test case, passing in the user's input parameters. The app processes those parameters through its real code — real routing, real prompt assembly, real LLM calls, real response formatting — and the harness observes what happens via the `wrap()` instrumentation from Step 2a.
**This means the Runnable should be simple.** It just wires up the app's real entry point to the harness interface. If your Runnable is getting complicated — if you're building custom logic, reimplementing app behavior, or replacing components — something is wrong.
## Four requirements
### 1. Run the real production code
The Runnable calls the app's actual entry point — the same function, class, or endpoint a real user would trigger. It does not reimplement, shortcut, or substitute any part of the application.
This includes the LLM. The app's LLM calls must go through the real code path — do not mock, fake, or replace application components. The whole point of eval-based testing is that LLM outputs are non-deterministic, so you use evaluators (not assertions) to score them. If you replace any component with a fake, you've eliminated the real behavior and the eval measures nothing.
**If the app won't run due to missing environment variables or configuration that you cannot resolve, stop and ask the user to fix the environment setup.** Do not work around it by mocking components.
### 2. Represent start-up args with a Pydantic BaseModel
The `run()` method receives a Pydantic `BaseModel` whose fields are populated from the dataset's `input_data`. Define a subclass with the fields the app needs:
```python
from pydantic import BaseModel


class AppArgs(BaseModel):
    user_message: str
    # Add more fields as the app's entry point requires.
    # These map 1:1 to the dataset input_data keys.
```
**The fields must reflect what a real user actually provides.** Read `pixie_qa/00-project-analysis.md` — the "Realistic input characteristics" section describes the complexity, scale, and variety of real inputs. Design the model to accept inputs at that level of realism, not simplified toy versions.
Understand the boundary between user-provided parameters and world data:
- **User-provided parameters** (fields on the BaseModel): what a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions.
- **World data** (handled by `wrap(purpose="input")` in Step 2a): content the app fetches from external sources during execution — web pages, database records, API responses. This is NOT part of the BaseModel.
| App type | BaseModel fields (user provides) | World data (wrap provides) |
| -------------------- | ------------------------------------- | ------------------------------------------------------------------ |
| Web scraper | URL + prompt + schema definition | The HTML page content |
| Research agent | Research question + scope constraints | Source documents, search results |
| Customer support bot | Customer's spoken message | Customer profile from CRM, conversation history from session store |
| Code review tool | PR URL + review criteria | The actual diff, file contents, CI results |
If a field ends up holding data the app would normally fetch itself, it probably belongs in a `wrap(purpose="input")` call instead of on the BaseModel.
### 3. Be concurrency-safe
`run()` is called concurrently for multiple dataset entries (up to 4 in parallel). If the app uses shared mutable state — SQLite, file-based DBs, global caches — protect access with `asyncio.Semaphore`:
```python
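import asyncio

import pixie

# AppArgs is the BaseModel from requirement 2; call_app stands in for the
# app's real async entry point.


class AppRunnable(pixie.Runnable[AppArgs]):
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> "AppRunnable":
        inst = cls()
        inst._sem = asyncio.Semaphore(1)  # allow one run() at a time
        return inst

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await call_app(args.user_message)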
```
Only add the semaphore when the app actually has shared mutable state. If the app uses per-request state (keyed by unique IDs) or is inherently stateless, concurrent calls are naturally isolated.
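To see the effect of the semaphore in isolation, here is a stand-alone sketch that uses only `asyncio`. `measure_peak` and `fake_run` are hypothetical names; `fake_run` plays the role of `run()` touching shared state, and the four workers mirror the parallelism mentioned above.

```python
import asyncio


async def measure_peak(limit: int, workers: int = 4) -> int:
    """Run `workers` tasks under a semaphore and report the peak overlap."""
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fake_run() -> None:
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the app touching shared state
            active -= 1

    await asyncio.gather(*(fake_run() for _ in range(workers)))
    return peak


print(asyncio.run(measure_peak(limit=1)))  # 1: calls are serialized
print(asyncio.run(measure_peak(limit=4)))  # 4: calls overlap freely
```

With `Semaphore(1)` the calls run one at a time, which is safe for shared state but removes the speed-up from parallel execution. That trade-off is why the semaphore is only worth adding when shared mutable state exists.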
### 4. Adhere to the Runnable interface
```python
class AppRunnable(pixie.Runnable[AppArgs]):
    @classmethod
    def create(cls) -> "AppRunnable": ...  # construct instance
    async def setup(self) -> None: ...  # once, before first run()
    async def run(self, args: AppArgs) -> None: ...  # per dataset entry, concurrent
    async def teardown(self) -> None: ...  # once, after last run()
```
- `create()` — class method, returns a new instance. Use a quoted return type (`-> "AppRunnable"`) to avoid forward reference errors.
- `setup()` — optional async; initialize shared resources (HTTP clients, DB connections, servers).
- `run(args)` — async; called per dataset entry. Invoke the app's real entry point here.
- `teardown()` — optional async; clean up resources from `setup()`.
## Minimal example
```python
# pixie_qa/run_app.py
from pydantic import BaseModel

import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives the application for tracing and evaluation."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp import handle_request

        await handle_request(args.user_message)
```
That's it. The Runnable imports the app's real entry point and calls it. No custom logic, no component replacement, no clever workarounds.
## Architecture-specific examples
Based on how the application runs, read the corresponding example file:
| App type | Entry point | Example file |
| ----------------------------------- | ----------------------- | ---------------------------------------------------------- |
| **Standalone function** (no server) | Python function | Read `references/runnable-examples/standalone-function.md` |
| **Web server** (FastAPI, Flask) | HTTP/WebSocket endpoint | Read `references/runnable-examples/fastapi-web-server.md` |
| **CLI application** | Command-line invocation | Read `references/runnable-examples/cli-app.md` |
Read **only** the example file that matches your app type.
## File placement
- Place the file at `pixie_qa/run_app.py`.
- The dataset's `"runnable"` field references: `"pixie_qa/run_app.py:AppRunnable"`.
- The project root is automatically on `sys.path`, so use normal imports (`from app import service`).
## Technical note
Do NOT use `from __future__ import annotations` in runnable files — it breaks Pydantic's model resolution for nested models. Use quoted return types where needed instead.
---
## Output
`pixie_qa/run_app.py` — the Runnable class.
2c-capture-and-verify-trace.md 5.7 KB
# Step 2c: Capture and verify a reference trace
**Goal**: Run the app through the Runnable, capture a trace, and verify that instrumentation and the Runnable are working correctly. The trace proves everything is wired up and provides the exact data shapes needed for dataset creation in Step 4.
---
## Choose the trace input
The trace input determines what code paths are captured. A trivial input produces a trivial trace that misses the app's real behavior.
The input must reflect the "Realistic input characteristics" section of `pixie_qa/00-project-analysis.md`, which you read in Step 2b.
The input has two parts — understand the boundary between them:
- **User-provided parameters** (you author): What a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions. Write these to be representative of real usage.
- **World data** (captured from production code, not fabricated): Content the app fetches from external sources during execution — database records, API responses, files, etc. Run the production code once to capture this data into the trace. Only resort to synthetic data generation when:
- The user explicitly instructs you to use synthetic data, OR
- Fetching from real sources is impractical (too many fetches, incurs real monetary cost, or takes unreasonably long — more than ~30 minutes)
**Quick check before writing input**: "Would a real user create this data, or would the app get it from somewhere else?" If the app gets it, let the production code run and capture it.
| App type | User provides (you author) | World provides (you source) |
| -------------------- | ------------------------------------- | ------------------------------------------------------------------ |
| Web scraper | URL + prompt + schema definition | The HTML page content |
| Research agent | Research question + scope constraints | Source documents, search results |
| Customer support bot | Customer's spoken message | Customer profile from CRM, conversation history from session store |
| Code review tool | PR URL + review criteria | The actual diff, file contents, CI results |
### Capture multiple traces
Capture **at least 2 traces** with different input characteristics before building the dataset:
- Different complexity (simple case vs. complex case)
- Different capabilities (see `00-project-analysis.md` capability inventory)
- Different edge conditions (missing optional data, unusually large input)
This calibration prevents dataset homogeneity — you see what the app actually does with varied inputs.
---
## Run `pixie trace`
**First**, verify the app can be imported: `python -c "from <module> import <class>"`. Catch missing packages before entering a trace-install-retry loop.
```bash
# Create a JSON file with input data
echo '{"user_message": "a realistic sample input"}' > pixie_qa/sample-input.json
uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \
--input pixie_qa/sample-input.json \
--output pixie_qa/reference-trace.jsonl
```
The `--input` flag takes a **file path** to a JSON file (not inline JSON). The JSON keys become kwargs for the Pydantic model.
For additional traces:
```bash
uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \
--input pixie_qa/sample-input-complex.json \
--output pixie_qa/trace-complex.jsonl
```
---
## Verify the trace
### Quick inspection
The trace JSONL contains a `kwargs` line with the runnable input, one line per `wrap()` event, and one line per LLM span:
```jsonl
{"type": "kwargs", "value": {"user_message": "What are your hours?"}}
{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {...}, ...}
{"type": "llm_span", "request_model": "gpt-4o", "input_messages": [...], ...}
{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are...", ...}
```
Check that:
- Expected `wrap` entries appear (one per `wrap()` call in the code)
- At least one `llm_span` entry appears (confirms real LLM calls were made)
- Missing entries indicate the execution path was different than expected — fix before continuing
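The checks above can be scripted instead of eyeballed. A minimal sketch, assuming the line shapes shown in the sample trace (`type`, `name`, `purpose`); `check_trace` is a hypothetical helper, and the sample lines are shortened stand-ins for a real trace file.

```python
import json

# Sample trace lines in the shape shown above (values shortened).
SAMPLE_TRACE = """\
{"type": "kwargs", "value": {"user_message": "What are your hours?"}}
{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {"tier": "gold"}}
{"type": "llm_span", "request_model": "gpt-4o", "input_messages": []}
{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are..."}
"""


def check_trace(text, expected_wraps):
    """Return (missing wrap names, number of llm_span entries)."""
    events = [json.loads(line) for line in text.splitlines() if line.strip()]
    seen = {e["name"] for e in events if e.get("type") == "wrap"}
    llm_spans = sum(1 for e in events if e.get("type") == "llm_span")
    return sorted(set(expected_wraps) - seen), llm_spans


missing, llm_spans = check_trace(
    SAMPLE_TRACE, ["customer_profile", "routing_decision", "response"]
)
print(missing)    # ['routing_decision']
print(llm_spans)  # 1
```

For a real trace, read the file contents (for example `pixie_qa/reference-trace.jsonl`) and pass the text to `check_trace` with the wrap names you added in Step 2a.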
### Format and verify coverage
Run `pixie format` to see the data in dataset-entry format:
```bash
pixie format --input trace.jsonl --output dataset_entry.json
```
The output shows:
- `input_data`: the exact keys/values for runnable arguments
- `eval_input`: data from `wrap(purpose="input")` calls
- `eval_output`: the actual app output (from `wrap(purpose="output")`)
For each eval criterion from `pixie_qa/02-eval-criteria.md`, verify the format output contains the data needed. If a data point is missing, go back to Step 2a and add the `wrap()` call.
### Trace audit
Before proceeding to Step 3, audit every trace:
1. **World data check**: For each `wrap(purpose="input")` field, is the data realistically complex? Compare against `00-project-analysis.md` "Realistic input characteristics." If the analysis says inputs are 5KB–500KB and yours is under 5KB, it's not representative.
2. **LLM span check**: Do `llm_span` entries appear? If not, the app's LLM calls didn't fire — the Runnable may be misconfigured or the LLM may be mocked/faked. Fix this before continuing.
3. **Complexity check**: Does the trace exercise the hard problems from `00-project-analysis.md`? If it only exercises the happy path, capture an additional trace with harder inputs.
If any check fails, go back and fix the input or Runnable, then re-capture.
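The world data check lends itself to a quick size comparison. This is a rough sketch under stated assumptions: `undersized_inputs` is a hypothetical helper, the size is measured on the JSON-serialized `data` field, and the 5,000-byte threshold is only the example figure from the check above; take the real threshold from your own project analysis.

```python
import json


def undersized_inputs(trace_text, min_bytes):
    """List input wraps whose serialized data is smaller than min_bytes."""
    small = []
    for line in trace_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "wrap" and event.get("purpose") == "input":
            size = len(json.dumps(event.get("data")).encode("utf-8"))
            if size < min_bytes:
                small.append((event["name"], size))
    return small


# Two illustrative trace lines: a tiny fetched page and an output wrap.
trace = "\n".join([
    json.dumps({"type": "wrap", "name": "fetched_page",
                "purpose": "input", "data": "<html></html>"}),
    json.dumps({"type": "wrap", "name": "response",
                "purpose": "output", "data": "ok"}),
])

for name, size in undersized_inputs(trace, min_bytes=5_000):
    print(f"{name}: {size} bytes is below the expected minimum")
```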
---
## Output
- `pixie_qa/reference-trace.jsonl` — reference trace with all expected wrap events and LLM spans
- Additional trace files for varied inputs
3-define-evaluators.md 10.3 KB
# Step 3: Define Evaluators
**Why this step**: With the app instrumented (Step 2), you now map each eval criterion to a concrete evaluator — implementing custom ones where needed — so the dataset (Step 4) can reference them by name.
---
## 3a. Map criteria to evaluators
**Every eval criterion from Step 1c — including any dimensions specified by the user in the prompt — must have a corresponding evaluator.** If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension. Prioritize evaluators that measure the **hard problems / failure modes** identified in `pixie_qa/00-project-analysis.md` — these are more valuable than generic quality evaluators.
For each eval criterion, choose an evaluator using this decision order:
1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog.
2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.
**Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.
For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are non-deterministic.
`AnswerRelevancy` is **RAG-only**: it requires a `context` value in the trace and returns 0.0 without one. For general relevance, use an agent evaluator with clear criteria.
## 3b. Implement custom evaluators
If any criterion requires a custom evaluator, implement it now. Place custom evaluators in `pixie_qa/evaluators.py` (or a sub-module if there are many).
### Agent evaluators (`create_agent_evaluator`) — the default
Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.
```python
from pixie import create_agent_evaluator

extraction_accuracy = create_agent_evaluator(
    name="ExtractionAccuracy",
    criteria="The extracted data accurately reflects the source content. All fields "
    "contain correct values from the source — no hallucinated, fabricated, or "
    "placeholder values. Compare the final_answer against the fetched_content "
    "and parsed_content to verify every claimed fact.",
)

noise_handling = create_agent_evaluator(
    name="NoiseHandling",
    criteria="The app correctly ignored navigation chrome, boilerplate, ads, and other "
    "non-content elements from the source. The extracted data contains only "
    "information relevant to the user's prompt, not noise from the page structure.",
)

schema_compliance = create_agent_evaluator(
    name="SchemaCompliance",
    criteria="The output contains all fields requested in the prompt with appropriate "
    "types and non-trivial values. Missing fields, null values for required data, "
    "or fields with generic placeholder text indicate failure.",
)
```
Reference agent evaluators in the dataset via `filepath:callable_name` (e.g., `"pixie_qa/evaluators.py:extraction_accuracy"`).
During `pixie test`, agent evaluators show as `⏳` in the console. They are graded in Step 5d.
**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 5d. Make it specific and actionable:
- **Bad**: "Check if the output is good" — too vague to grade consistently
- **Bad**: "The response should be accurate" — doesn't say what to compare against
- **Good**: "Compare the extracted fields against the source HTML/document. Each field must have a corresponding passage in the source. Flag any field whose value cannot be traced back to the source content."
- **Good**: "The app should preserve the structural hierarchy of the source document. If the source has sections/subsections, the extraction should reflect that nesting, not flatten everything into a single level."
### Manual custom evaluator — for mechanical checks only
Use manual custom evaluators **only** for deterministic, programmatic checks where a simple function definitively gives the right answer. Examples: field existence, regex matching, JSON schema validation, numeric range checks, type verification.
**Do NOT use manual custom evaluators for semantic quality.** If the check requires _judgment_ about whether content is correct, relevant, complete, or well-written, use an agent evaluator instead. The litmus test: "Could a regex, string match, or comparison operator implement this check perfectly?" If not, it's semantic — use an agent evaluator.
Custom evaluators can be **sync or async functions**. Assign them to module-level variables in `pixie_qa/evaluators.py`:
```python
from pixie import Evaluation, Evaluable


def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")
```
Reference by `filepath:callable_name` in the dataset: `"pixie_qa/evaluators.py:my_evaluator"`.
**Accessing `eval_metadata` and captured data**: Custom evaluators access per-entry metadata and `wrap()` outputs via the `Evaluable` fields:
- `evaluable.eval_metadata` — dict from the entry's `eval_metadata` field (e.g., `{"expected_tool": "endCall"}`)
- `evaluable.eval_output` — `list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values. Each item has `.name` (str) and `.value` (JsonValue). Use the helper below to look up by name.
```python
from typing import Any

from pixie import Evaluation, Evaluable


def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None


def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    expected = evaluable.eval_metadata.get("expected_call_ended") if evaluable.eval_metadata else None
    actual = _get_output(evaluable, "call_ended")
    if expected is None:
        return Evaluation(score=1.0, reasoning="No expected_call_ended in eval_metadata")
    match = bool(actual) == bool(expected)
    return Evaluation(
        score=1.0 if match else 0.0,
        reasoning=f"Expected call_ended={expected}, got {actual}",
    )
```
### ValidJSON and string expectations conflict
`ValidJSON` treats the dataset entry's `expectation` field as a JSON Schema when present. If your entries use **string** expectations (e.g., for `Factuality`), adding `ValidJSON` as a dataset-level default evaluator will cause failures — it cannot validate a plain string as a JSON Schema. Either apply `ValidJSON` only to entries with object/boolean expectations, or omit it when the dataset relies on string expectations.
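One way to decide where `ValidJSON` can be attached is to sort entries by the type of their `expectation`. The helper below is a hypothetical sketch, not part of pixie; treating a missing or null `expectation` as safe is an assumption based on the "when present" wording above.

```python
def valid_json_is_safe(entry: dict) -> bool:
    """True when the entry's expectation can be read as a JSON Schema, or is absent."""
    expectation = entry.get("expectation")
    if expectation is None:
        return True  # assumption: nothing to interpret as a schema
    # Object and boolean schemas are fine; plain strings are the conflict case.
    return isinstance(expectation, (dict, bool))


# Illustrative entries; only the fields relevant to this check are shown.
entries = [
    {"input_data": {"user_message": "hours?"}, "expectation": "Open 9-5 on weekdays"},
    {"input_data": {"user_message": "export"}, "expectation": {"type": "object"}},
    {"input_data": {"user_message": "hello"}},
]

print([valid_json_is_safe(e) for e in entries])  # [False, True, True]
```

If any entry returns `False`, keep `ValidJSON` off the dataset-level defaults and attach it only to the entries that pass.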
## 3c. Produce the evaluator mapping artifact
Write the criterion-to-evaluator mapping to `pixie_qa/03-evaluator-mapping.md`. This artifact bridges between the eval criteria (Step 1c) and the dataset (Step 4).
**CRITICAL**: Use the exact evaluator names as they appear in the `evaluators.md` reference — built-in evaluators use their short name (e.g., `Factuality`, `ClosedQA`), and custom evaluators use `filepath:callable_name` format (e.g., `pixie_qa/evaluators.py:ConciseVoiceStyle`).
### Template
```markdown
# Evaluator Mapping
## Built-in evaluators used
| Evaluator name | Criterion it covers | Applies to |
| -------------- | ------------------- | -------------------------- |
| Factuality | Factual accuracy | All items |
| ClosedQA | Answer correctness | Items with expected_output |
## Agent evaluators
| Evaluator name | Criterion it covers | Applies to | Source file |
| ------------------------------------------ | ---------------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:extraction_accuracy | Content accuracy vs source | All items | pixie_qa/evaluators.py |
| pixie_qa/evaluators.py:noise_handling | Navigation/boilerplate noise | All items | pixie_qa/evaluators.py |
## Manual custom evaluators (mechanical checks only)
| Evaluator name | Criterion it covers | Applies to | Source file |
| ---------------------------------------------- | -------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:required_fields_present | Required field check | All items | pixie_qa/evaluators.py |
## Applicability summary
- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:extraction_accuracy
- **Item-specific** (apply to subset): ClosedQA (only items with expected_output)
```
## Output
- Custom evaluator implementations in `pixie_qa/evaluators.py` (if any custom evaluators needed)
- `pixie_qa/03-evaluator-mapping.md` — the criterion-to-evaluator mapping
---
> **Evaluator selection guide**: See `evaluators.md` for the full built-in evaluator catalog and `create_agent_evaluator` reference.
>
> **If you hit an unexpected error** when implementing evaluators (import failures, API mismatch), read `evaluators.md` for the authoritative evaluator reference and `wrap-api.md` for API details before guessing at a fix.
4-build-dataset.md 21.4 KB
# Step 4: Build the Dataset
**Why this step**: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c) — into concrete test scenarios. At test time, `pixie test` calls the runnable with `input_data`, the wrap registry is populated with `eval_input`, and evaluators score the resulting captured outputs.
**Before building entries**, review:
- **`pixie_qa/00-project-analysis.md`**: the capability inventory and failure modes. Dataset entries should cover the capabilities in the inventory and target the listed failure modes.
- **`pixie_qa/02-eval-criteria.md`**: use cases and their capability coverage. Ensure every listed use case has representative entries.
---
## Understanding `input_data`, `eval_input`, and `expectation`
Before building the dataset, understand what these terms mean:
- **`input_data`** = the kwargs passed to `Runnable.run()` as a Pydantic model. These are the input data (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for `run(args: T)`.
- **`eval_input`** = a list of `{"name": ..., "value": ...}` objects corresponding to `wrap(purpose="input")` calls in the app. At test time, these are injected automatically by the wrap registry; `wrap(purpose="input")` calls in the app return the registry value instead of calling the real external dependency.
`eval_input` **may be an empty list** only when the app has no `wrap(purpose="input")` calls. **If the app HAS input wraps, every dataset entry MUST provide corresponding `eval_input` values with pre-captured content** — otherwise the app makes live external calls during eval, which is slow, flaky, and non-reproducible. See section 4b′ for how to capture this content.
Each item is a `NamedData` object with `name` (str) and `value` (any JSON-serializable value).
- **`expectation`** (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., `Factuality`, `ClosedQA`). Not needed for output-quality evaluators that don't require a reference.
- **eval output** = what the app actually produces, captured at runtime by `wrap(purpose="output")` and `wrap(purpose="state")` calls. **Not stored in the dataset** — it's produced when `pixie test` runs the app.
The **reference trace** at `pixie_qa/reference-trace.jsonl` is your primary source for data shapes:
- Filter it to see the exact serialized format for `eval_input` values
- Read the `kwargs` record to understand the `input_data` structure
- Read `purpose="output"/"state"` events to understand what outputs the app produces, so you can write meaningful `expectation` values
---
## 4a. Derive evaluator assignments
The eval criteria artifact (`pixie_qa/02-eval-criteria.md`) maps each criterion to use cases. The evaluator mapping artifact (`pixie_qa/03-evaluator-mapping.md`) maps each criterion to a concrete evaluator name. Combine these:
1. **Dataset-level default evaluators**: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level `"evaluators"` array.
2. **Item-level evaluators**: Criteria that apply to only a subset → their evaluator names go in `"evaluators"` on the relevant rows only, using `"..."` to also include the defaults.
## 4b. Inspect data shapes with `pixie format`
Use `pixie format` on the reference trace to see the exact data shapes **and** the real app output in dataset-entry format:
```bash
uv run pixie format --input reference-trace.jsonl --output dataset-sample.json
```
The output looks like:
```json
{
"input_data": {
"user_message": "What are your business hours?"
},
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice", "tier": "gold" }
},
{
"name": "conversation_history",
"value": [{ "role": "user", "content": "What are your hours?" }]
}
],
"expectation": null,
"eval_output": {
"response": "Our business hours are Monday to Friday, 9am to 5pm..."
}
}
```
**Important**: The `eval_output` in this template is the **full real output** produced by the running app. Do NOT copy `eval_output` into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:
- Use `input_data` and `eval_input` as exact templates for data keys and format
- Look at `eval_output` to understand what the app produces — then write a **concise `expectation` description** that captures the key quality criteria for each scenario
**Example**: if `eval_output.response` is `"Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM."`, write `expectation` as `"Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours"` — a short description a human or LLM evaluator can compare against.
## 4b′. Capture external content for `eval_input` (mandatory)
**CRITICAL**: If the app has ANY `wrap(purpose="input")` calls, every dataset entry MUST provide corresponding `eval_input` values with **pre-captured real content**. An empty `eval_input` list means the app will make live external calls (HTTP requests, database queries, API calls) during every eval run — this makes evals slow, flaky, and non-reproducible.
### Why this matters
During `pixie test`, each `wrap(purpose="input", name="X")` call in the app checks the wrap registry for a value named `"X"`:
- **If found**: the registered value is returned directly (no external call)
- **If not found**: the real external call executes (non-deterministic, slow, may fail)
An `eval_input: []` entry means NOTHING is in the registry, so every external dependency runs live. This defeats the purpose of instrumentation.
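The lookup behaviour described above can be pictured with a toy model. This is a minimal sketch, not pixie's implementation: `toy_wrap`, `registry`, and `fetch_profile` are hypothetical names invented for illustration, and the real `wrap()` API may differ.

```python
from typing import Any, Callable

# Toy model of the lookup described above. This is NOT pixie's implementation;
# `toy_wrap` and `registry` are hypothetical names used only for illustration.
registry: dict[str, Any] = {}


def toy_wrap(name: str, real_call: Callable[[], Any]) -> Any:
    """Return the registered value for `name`, else fall back to the real call."""
    if name in registry:
        return registry[name]  # found: no external call is made
    return real_call()  # not found: the live dependency runs


calls: list[str] = []


def fetch_profile() -> dict:
    calls.append("live")  # records that the external dependency was hit
    return {"name": "Alice", "tier": "gold"}


# Equivalent of eval_input: [] -> empty registry -> the live call executes.
registry.clear()
live_value = toy_wrap("customer_profile", fetch_profile)

# Equivalent of an eval_input item named "customer_profile" -> value is injected.
registry["customer_profile"] = {"name": "Bob", "tier": "basic"}
injected_value = toy_wrap("customer_profile", fetch_profile)
```

After both calls, `calls` holds a single `"live"` entry: only the lookup against the empty registry reached the external dependency.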
### How to capture content
For each `wrap(purpose="input", name="X")` in the app, you must capture the real data once and embed it in the dataset. Choose one of these approaches:
**Option A — Use the reference trace** (preferred):
The reference trace from Step 2c already contains captured values for every `purpose="input"` wrap. Extract them:
```bash
# View the reference trace to find input wrap values
grep '"purpose": "input"' pixie_qa/reference-trace.jsonl
```
Or use `pixie format` to see the data in dataset-entry format — the `eval_input` array in the output already has the captured values with correct names and shapes.
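If you prefer to extract the values programmatically, a small script can do the same filtering as the `grep` above. This is a sketch under an assumption: that each trace line is a JSON object with `purpose`, `name`, and `data` keys. Confirm the record layout against your own `reference-trace.jsonl` before relying on it; `eval_input_from_trace` is a hypothetical helper, not part of pixie.

```python
import json


def eval_input_from_trace(lines):
    """Build an eval_input list from reference-trace JSONL lines.

    Assumes each line is a JSON object with "purpose", "name", and "data"
    keys. That record layout is an assumption; check your own trace.
    """
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict) and record.get("purpose") == "input":
            items.append({"name": record["name"], "value": record["data"]})
    return items


# Two made-up trace lines standing in for a real reference trace.
sample = [
    '{"purpose": "input", "name": "customer_profile", "data": {"tier": "gold"}}',
    '{"purpose": "output", "name": "response", "data": "Hello"}',
]
print(json.dumps(eval_input_from_trace(sample), indent=2))
```

To run it on the real trace, pass the open file object (for example `open("pixie_qa/reference-trace.jsonl", encoding="utf-8")`) instead of `sample`.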
**Option B — Fetch content directly** (for new entries with different inputs):
When creating dataset entries with different input sources (e.g., different URLs, different queries), capture the content by running the dependency code once:
```python
# Example: for a web scraper, run the app's own fetch logic once
from myapp.fetcher import fetch_page
page_content = fetch_page(target_url) # use the app's real code path
```
Then include the captured content in the entry's `eval_input`:
```json
{
"eval_input": [
{
"name": "fetch_result",
"value": "<captured page content here>"
}
]
}
```
**Option C — Run `pixie trace` with each input** (most thorough):
For each set of `input_data`, run `pixie trace` to execute the app with real dependencies and capture all values:
```bash
uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json
```
Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.
### Content format
The `eval_input` value must match the **exact type and format** that the `wrap()` call returns. Check the reference trace to see what format the app produces:
- If the wrap captures a string (e.g., HTML content, markdown text), the value is a string
- If the wrap captures a dict (e.g., database record), the value is a JSON object
- If the wrap captures a list, the value is a JSON array
**Do NOT skip this step.** Every `wrap(purpose="input")` in the app must have a corresponding `eval_input` entry in every dataset row. If you proceed with empty `eval_input` when the app has input wraps, evals will be unreliable.
## 4c. Generate dataset items
Create diverse entries guided by the reference trace and use cases:
- **`input_data` keys** must match the fields of the Pydantic model used in `Runnable.run(args: T)`
- **`eval_input`** must be a list of `{"name": ..., "value": ...}` objects matching the `name` values of `wrap(purpose="input")` calls in the app
- **Cover each use case** from `pixie_qa/02-eval-criteria.md` — at least one entry per use case, with meaningfully diverse inputs across entries
**If the user specified a dataset or data source in the prompt** (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the `input_data` / `eval_input` shape, and incorporate them into the dataset. Do NOT ignore specified data.
### Entry quality checklist
Before finalizing the dataset, verify each entry against these criteria:
**Input realism**:
- Does `eval_input` contain world data that respects the synthesization boundary (see Step 2c)? User-authored parameters are fine; world data should be sourced, not fabricated from scratch.
- Does the world data in `eval_input` match the scale and complexity described in `00-project-analysis.md` "Realistic input characteristics"? If the analysis says inputs are typically 5KB–500KB, a 200-char input is not realistic.
- Is the answer to the prompt non-trivial to extract from the input? A test where the answer is in a clearly labeled HTML tag or the first sentence doesn't test extraction quality.
**Scenario diversity**:
- Do entries cover meaningfully different difficulty levels — not just different topics with the same difficulty?
- Does at least one entry target a failure mode from `00-project-analysis.md` that you expect might actually cause degraded scores (not a guaranteed pass)?
- Do entries use different structural patterns in the input data (not just different content poured into the same template)?
**Difficulty calibration**:
- Is there at least one entry you are genuinely uncertain whether the app will handle correctly? If you're confident every entry will pass trivially, the dataset is too easy.
- Consider including one intentionally challenging entry that probes a known limitation — a "stress test" entry. If it passes, great. If it fails, the eval has demonstrated it can catch real issues.
### Anti-patterns for dataset entries
- **Fabricating world data**: Hand-authoring content the app would normally fetch from external sources (e.g., writing HTML for a web scraper, writing "retrieved documents" for a RAG system). This removes real-world complexity.
- **Uniform difficulty**: All entries have the same complexity level. Real workloads have a distribution — some easy, some hard, some edge cases.
- **Obvious answers**: Every entry has the target information cleanly labeled and unambiguous. Real data often has the answer scattered, partially present, duplicated with variations, or embedded in noise.
- **Round-trip authorship**: You wrote both the input and the expected output, so you know exactly what's there. A real evaluator tests whether the app can find information it hasn't seen before.
- **Only happy paths**: No entry tests error conditions, edge cases, or known failure modes.
- **Building all entries from the same toy trace with minor rephrasing**: If all entries have similar `input_data` and similar `eval_input` data, the dataset tests nothing meaningful. Each entry should represent a meaningfully different scenario.
- **Reusing the project's own test fixtures as eval data**: The project's `tests/`, `fixtures/`, `examples/`, and `mock_server/` directories contain data designed for unit/integration tests — small, clean, deterministic, and trivially easy. Using them as `eval_input` data guarantees 100% pass rates and zero quality signal. Even if these fixtures look convenient, they bypass every real-world difficulty that makes the app's job hard. **Run the production code to capture realistic data instead**, or generate synthetic data that matches the scale/complexity from `00-project-analysis.md`.
- **Using a project's mock/fake implementations**: If the project includes mock LLMs, fake HTTP servers, or stub services in its test infrastructure, do NOT use them in your eval pipeline. Your eval must exercise the app's real code paths with realistically complex data — not the project's own test shortcuts.
## 4c′. Verify coverage against project analysis
Before writing the final dataset JSON, open `pixie_qa/00-project-analysis.md` and check:
1. **Realistic input characteristics**: For each characteristic listed (size, complexity, noise, variety), confirm at least one dataset entry reflects it. If the analysis says "messy inputs with navigation and ads," at least one entry's `eval_input` should contain messy data with navigation and ads.
2. **Failure modes**: For each failure mode listed, confirm at least one dataset entry is designed to exercise it. The entry doesn't need to guarantee failure — but it should create conditions where that failure mode _could_ manifest. If a failure mode cannot be exercised with the current instrumentation setup, add a note in `02-eval-criteria.md` explaining why.
3. **Capability coverage**: Confirm the dataset covers the capabilities listed in the eval criteria (Step 1c). Each covered capability should have at least one entry.
If any gap is found, add entries to close it before proceeding to 4d.
## 4c″. STOP CHECK — Dataset realism audit (hard gate)
**This is a hard gate.** Do NOT proceed to 4d until every check passes. If any check fails, revise the dataset and re-audit.
Before writing the final dataset JSON, perform this self-audit:
1. **Cross-reference `00-project-analysis.md`**: Open the "Realistic input characteristics" section. For each characteristic (size, complexity, noise, structure), verify at least one dataset entry's `eval_input` reflects it. If the analysis says "5KB–500KB HTML pages with navigation chrome and ads" and your largest `eval_input` is 1KB of clean HTML, **the dataset is not realistic — add harder entries.**
2. **Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.
3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).
- **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine.
- **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.
4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry.
- **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3.
- If coverage is below threshold, add entries targeting the uncovered capabilities.
5. **Project fixture contamination check**: Scan every `eval_input` value. Did any data originate from the project's `tests/`, `fixtures/`, `examples/`, or mock server directories? If yes, **replace it with real-world data.** These fixtures are designed for development convenience, not evaluation realism.
6. **Tautology check**: Will the test pipeline produce meaningful scores, or is it a closed loop? If you authored both the input data and the evaluator logic such that passing is guaranteed by construction (e.g., regex extractor + exact-match evaluator on hand-authored HTML), **the pipeline is tautological** and cannot catch real issues. The app's real LLM should produce the output, and evaluators should assess quality dimensions that can genuinely fail.
7. **`eval_input` completeness check**: For every `wrap(purpose="input", name="X")` call in the instrumented app code, verify that EVERY dataset entry provides a corresponding `eval_input` item with `"name": "X"` and a non-empty `"value"`. If any entry has `eval_input: []` while the app has input wraps, **the dataset is incomplete — captured content is missing.** Go back to step 4b′ and capture the content.
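Checks 2, 3, and 7 are mechanical enough to script. The following is a rough sketch, not a pixie feature: `audit_dataset` is a hypothetical helper, the list of wrap names must be supplied by you from the instrumented app, and the difficulty labels are hand-assigned because the dataset format has no such field.

```python
import json
from collections import Counter


def audit_dataset(dataset, required_names, difficulty_labels):
    """Return a list of problems for checks 2, 3, and 7 above.

    required_names: names of the app's wrap(purpose="input") calls.
    difficulty_labels: "routine" / "moderate" / "challenging", one per entry
    (hand-assigned; the dataset format itself has no such field).
    """
    problems = []
    entries = dataset["entries"]

    # Check 7: every entry supplies a non-empty value for every input wrap.
    for idx, entry in enumerate(entries):
        provided = {
            item["name"]
            for item in entry.get("eval_input", [])
            if item.get("value") not in (None, "", [], {})
        }
        missing = sorted(set(required_names) - provided)
        if missing:
            problems.append(f"entry {idx}: missing eval_input for {missing}")

    # Check 2: no single eval_input payload shared by more than 50% of entries.
    payloads = Counter(
        json.dumps(entry.get("eval_input", []), sort_keys=True) for entry in entries
    )
    if entries and max(payloads.values()) * 2 > len(entries):
        problems.append("more than 50% of entries share the same eval_input")

    # Check 3: at most 60% routine, at least one challenging.
    routine = sum(1 for label in difficulty_labels if label == "routine")
    if entries and routine * 10 > len(entries) * 6:
        problems.append("more than 60% of entries are routine")
    if "challenging" not in difficulty_labels:
        problems.append("no challenging entry")
    return problems


# Made-up two-entry dataset: the second entry has no captured content.
dataset = {
    "entries": [
        {"eval_input": [{"name": "customer_profile", "value": {"tier": "gold"}}]},
        {"eval_input": []},
    ]
}
problems = audit_dataset(dataset, ["customer_profile"], ["routine", "routine"])
```

For this example the audit reports three problems: the missing `customer_profile` value on entry 1, too many routine entries, and no challenging entry. The remaining checks (realism, fixture contamination, tautology) still need your judgment.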
## 4d. Build the dataset JSON file
Create the dataset at `pixie_qa/datasets/<name>.json`:
```json
{
"name": "qa-golden-set",
"runnable": "pixie_qa/run_app.py:AppRunnable",
"evaluators": ["Factuality", "pixie_qa/evaluators.py:ConciseVoiceStyle"],
"entries": [
{
"input_data": {
"user_message": "What are your business hours?"
},
"description": "Customer asks about business hours with gold tier account",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Alice Johnson", "tier": "gold" }
}
],
"expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
},
{
"input_data": {
"user_message": "I want to change something"
},
"description": "Ambiguous change request from basic tier customer",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Bob Smith", "tier": "basic" }
}
],
"expectation": "Should ask for clarification",
"evaluators": ["...", "ClosedQA"]
},
{
"input_data": {
"user_message": "I want to end this call"
},
"description": "User requests call end after failed verification",
"eval_input": [
{
"name": "customer_profile",
"value": { "name": "Charlie Brown", "tier": "basic" }
}
],
"expectation": "Agent should call endCall tool and end the conversation",
"eval_metadata": {
"expected_tool": "endCall",
"expected_call_ended": true
},
"evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
]
}
```
### Key fields
**Entry structure** — all fields are top-level on each entry (flat structure — no nesting):
```
entry:
├── input_data (required) — args for Runnable.run()
├── eval_input (optional) — list of {"name": ..., "value": ...} objects (default: [])
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
```
**Top-level fields:**
- **`runnable`** (required): `filepath:ClassName` reference to the `Runnable` class from Step 2 (e.g., `"pixie_qa/run_app.py:AppRunnable"`). Path is relative to the project root.
- **`evaluators`** (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.
**Per-entry fields (all top-level on each entry):**
- **`input_data`** (required): Keys match the Pydantic model fields for `Runnable.run(args: T)`. These are the app's input data.
- **`eval_input`** (optional, default `[]`): List of `{"name": ..., "value": ...}` objects. Names match `wrap(purpose="input")` names in the app. The runner automatically prepends `input_data` when building the `Evaluable`.
- **`description`** (required): Use case one-liner from `pixie_qa/02-eval-criteria.md`.
- **`expectation`** (optional): Case-specific expectation text for evaluators that need a reference.
- **`eval_metadata`** (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as `evaluable.eval_metadata`.
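- **`evaluators`** (optional): Evaluator names for this entry only. Include `"..."` to keep the dataset-level defaults and add to them; leave `"..."` out to replace the defaults entirely. Omit the field to use the defaults unchanged.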
### Evaluator assignment rules
1. Evaluators that apply to ALL items go in the top-level `"evaluators"` array.
2. Items that need **additional** evaluators use `"evaluators": ["...", "ExtraEval"]` — `"..."` expands to defaults.
3. Items that need a **completely different** set use `"evaluators": ["OnlyThis"]` without `"..."`.
4. Items using only defaults: omit the `"evaluators"` field.
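The four rules can be read as a small expansion function. This sketch only illustrates the semantics stated above; `resolve_evaluators` is a hypothetical name, and details the rules do not cover (such as de-duplication) are left out rather than guessed.

```python
def resolve_evaluators(defaults, entry_evaluators=None):
    """Expand an entry's evaluator list according to the four rules above."""
    if entry_evaluators is None:  # rule 4: field omitted -> defaults only
        return list(defaults)
    resolved = []
    for name in entry_evaluators:
        if name == "...":  # rule 2: "..." expands to the dataset defaults
            resolved.extend(defaults)
        else:
            resolved.append(name)
    return resolved  # rule 3: without "...", the defaults are not included


defaults = ["Factuality", "pixie_qa/evaluators.py:ConciseVoiceStyle"]

only_defaults = resolve_evaluators(defaults)
with_extra = resolve_evaluators(defaults, ["...", "ClosedQA"])
replaced = resolve_evaluators(defaults, ["OnlyThis"])
```

Here `with_extra` contains both defaults followed by `ClosedQA`, while `replaced` contains `OnlyThis` alone.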
---
## Dataset Creation Reference
### Using `eval_input` values
The `eval_input` values are `{"name": ..., "value": ...}` objects. Use the reference trace as a template. Copy the `"data"` field from the relevant `purpose="input"` event and adapt the values:
**Simple dict**:
```json
{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }
```
**List of dicts** (e.g., conversation history):
```json
{
"name": "conversation_history",
"value": [
{ "role": "user", "content": "Hello" },
{ "role": "assistant", "content": "Hi there!" }
]
}
```
**Important**: The exact format depends on what the `wrap(purpose="input")` call captures. Always copy from the reference trace rather than constructing from scratch.
### Crafting diverse eval scenarios
Cover different aspects of each use case. Refer to **`pixie_qa/00-project-analysis.md`** for the capability inventory and failure modes:
- **Cover each capability** — at least one entry per capability from the capability inventory, not just the primary capability
- **Target failure modes** — include entries that exercise the hard problems / failure modes listed in the project analysis (e.g., malformed input, edge cases, complex scenarios)
- Different user phrasings of the same request
- Edge cases (ambiguous input, missing information, error conditions)
- Entries that stress-test specific eval criteria
- At least one entry per use case from Step 1c
---
## Output
`pixie_qa/datasets/<name>.json` — the dataset file.
5-run-tests.md 5.9 KB
# Step 5: Run `pixie test` and Fix Mechanical Issues
**Why this step**: Run `pixie test` and fix mechanical issues in your QA components — dataset format problems, runnable implementation bugs, and custom evaluator errors — until every entry produces real scores. This step is NOT about assessing result quality or fixing the application itself.
---
## 5a. Run tests
```bash
uv run pixie test
```
For verbose output with per-case scores and evaluator reasoning:
```bash
uv run pixie test -v
```
`pixie test` automatically loads the `.env` file before running tests.
The evaluation harness:
1. Resolves the `Runnable` class from the dataset's `runnable` field
2. Calls `Runnable.create()` to construct an instance, then `setup()` once
3. Runs all dataset entries **concurrently** (up to 4 in parallel):
a. Reads `input_data` and `eval_input` from the entry
b. Populates the wrap input registry with `eval_input` data
c. Initialises the capture registry
d. Validates `input_data` into the Pydantic model and calls `Runnable.run(args)`
e. `wrap(purpose="input")` calls in the app return registry values instead of calling external services
f. `wrap(purpose="output"/"state")` calls capture data for evaluation
g. Builds `Evaluable` from captured data
h. Runs evaluators
4. Calls `Runnable.teardown()` once
Because entries run concurrently, the Runnable's `run()` method must be concurrency-safe. If you see `sqlite3.OperationalError`, `"database is locked"`, or similar errors, serialize access by adding an `asyncio.Semaphore(1)` to your Runnable (see the concurrency section in the Step 2 reference).
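As a rough illustration of the semaphore approach, the sketch below serializes the body of `run()` so that only one call touches the shared resource at a time. It assumes `run()` is an `async` method; `SerializedRunnable` is a hypothetical stand-in, not pixie's `Runnable` base class, and the `sleep` call stands in for the real work.

```python
import asyncio


class SerializedRunnable:
    """Hypothetical stand-in for a Runnable; not pixie's base class."""

    def __init__(self):
        self._gate = asyncio.Semaphore(1)  # at most one run() body at a time
        self.active = 0
        self.max_active = 0

    async def run(self, args):
        async with self._gate:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            await asyncio.sleep(0.01)  # stands in for the SQLite-backed work
            self.active -= 1
            return f"handled {args}"


async def main():
    runnable = SerializedRunnable()
    # Four concurrent calls, mirroring the harness's stated parallelism.
    results = await asyncio.gather(*(runnable.run(i) for i in range(4)))
    return runnable.max_active, results


max_active, results = asyncio.run(main())
```

Even with four calls in flight, `max_active` stays at 1. The trade-off is that entries effectively run one at a time through the guarded section, so guard only the part that shares state.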
## 5b. Fix mechanical issues only
This step is strictly about fixing what you built in previous steps — the dataset, the runnable, and any custom evaluators. You are fixing mechanical problems that prevent the pipeline from running, NOT assessing or improving the application's output quality.
**What counts as a mechanical issue** (fix these):
| Error | Cause | Fix |
| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| `WrapRegistryMissError: name='<key>'` | Dataset entry missing an `eval_input` item with the `name` that the app's `wrap(purpose="input", name="<key>")` expects | Add the missing `{"name": "<key>", "value": ...}` to `eval_input` in every affected entry |
| `WrapTypeMismatchError` | Deserialized type doesn't match what the app expects | Fix the value in the dataset |
| Runnable resolution failure | `runnable` path or class name is wrong, or the class doesn't implement the `Runnable` protocol | Fix `filepath:ClassName` in the dataset; ensure the class has `create()` and `run()` methods |
| Import error | Module path or syntax error in runnable/evaluator | Fix the referenced file |
| `ModuleNotFoundError: pixie_qa` | `pixie_qa/` directory missing `__init__.py` | Run `pixie init` to recreate it |
| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances |
| `sqlite3.OperationalError` | Concurrent `run()` calls sharing a SQLite connection | Add `asyncio.Semaphore(1)` to the Runnable (see Step 2 concurrency section) |
| Custom evaluator crashes | Bug in your custom evaluator implementation | Fix the evaluator code |
**What is NOT a mechanical issue** (do NOT fix these here):
- Application produces wrong/low-quality output → that's the application's behavior, analyzed in Step 6
- Evaluator scores are low → that's a quality signal, analyzed in Step 6
- LLM calls fail inside the application → report in Step 6, do not mock or work around
- Evaluator scores fluctuate between runs → normal LLM non-determinism, not a bug
Iterate — fix errors, re-run, fix the next error — until `pixie test` runs to completion with real evaluator scores for all entries.
## Output
After `pixie test` completes successfully, results are stored in a per-entry directory structure:
```
{PIXIE_ROOT}/results/<test_id>/
meta.json # test run metadata
dataset-{idx}/
metadata.json # dataset name, path, runnable
entry-{idx}/
config.json # evaluators, description, expectation
eval-input.jsonl # input data fed to evaluators
eval-output.jsonl # output data captured from app
evaluations.jsonl # evaluation results (scored + pending)
trace.jsonl # LLM call traces (if captured)
```
The `<test_id>` is printed in console output. You will reference this directory in Step 6.
---
> **If you hit an unexpected error** when running tests (wrong parameter names, import failures, API mismatch), read `wrap-api.md`, `evaluators.md`, or `testing-api.md` for the authoritative API reference before guessing at a fix.
6-analyze-outcomes.md 16.1 KB
# Step 6: Analyze Outcomes
**Why this step**: `pixie test` produced raw scores. Now you analyze those results to understand what they mean — completing pending evaluations, identifying patterns, validating hypotheses, and producing an actionable improvement plan. The analysis is structured in three phases that build on each other: entry-level → dataset-level → action plan.
---
## Result directory structure
After `pixie test`, the result directory looks like:
```text
{PIXIE_ROOT}/results/<test_id>/
meta.json
dataset-{idx}/
metadata.json
entry-{idx}/
config.json # evaluators, description, expectation
eval-input.jsonl # input data fed to evaluators
eval-output.jsonl # output data captured from app
evaluations.jsonl # scored + pending evaluations
trace.jsonl # LLM call traces
```
Locate the directory using the `<test_id>` printed by `pixie test` in Step 5, then read `meta.json` for the run metadata. All the data you need for analysis is in this directory.
---
## Hard completion gate
You are the grader for Step 6. **Pending evaluations are not a handoff to the user, and the web UI is not a substitute for grading.** You may use the web UI to browse traces and outputs, but completion happens by writing files on disk.
Step 6 is incomplete until all of the following are true:
- Every `"status": "pending"` entry in every `evaluations.jsonl` has been replaced with a scored entry that contains both `score` and `reasoning`.
- Every dataset directory contains `analysis.md` and `analysis-summary.md`.
- The test run root contains `action-plan.md` and `action-plan-summary.md`.
- The verifier script in this skill's `resources/` directory passes for the target results directory.
**Forbidden shortcuts**:
- Leaving any `"status": "pending"` entries in place
- Telling the user to review pending evaluations in the web UI
- Writing a single top-level substitute file such as `pixie_qa/06-analysis.md`
- Writing phrases like "likely passes" or "probably fails" without scoring the evaluation and updating `evaluations.jsonl`
If you do any of the above, Step 6 is not done.
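Before declaring the step complete, it can help to scan for leftover pending rows yourself. The helper below is a sketch and does not replace the skill's verifier script; it assumes the `dataset-*/entry-*/evaluations.jsonl` layout shown above with one JSON object per line, and `find_pending` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def find_pending(results_dir):
    """List (file, evaluator) pairs still marked pending under a results dir.

    Assumes the dataset-*/entry-*/evaluations.jsonl layout shown above and one
    JSON object per line. This is not the skill's verifier script.
    """
    pending = []
    pattern = "dataset-*/entry-*/evaluations.jsonl"
    for path in sorted(Path(results_dir).glob(pattern)):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            if row.get("status") == "pending":
                pending.append((str(path), row.get("evaluator")))
    return pending


# Demo against a throwaway directory with one scored and one pending row.
with tempfile.TemporaryDirectory() as tmp:
    entry_dir = Path(tmp) / "dataset-0" / "entry-0"
    entry_dir.mkdir(parents=True)
    (entry_dir / "evaluations.jsonl").write_text(
        '{"evaluator": "Factuality", "score": 1.0, "reasoning": "ok"}\n'
        '{"evaluator": "ResponseQuality", "status": "pending", "criteria": "..."}\n',
        encoding="utf-8",
    )
    remaining = find_pending(tmp)
```

An empty result means no pending rows remain; the other gate conditions (analysis files, action plan, verifier) still have to be met separately.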
## Iteration rule
If you are iterating across multiple fix/test cycles, every successful `pixie test` run creates a new `pixie_qa/results/<test_id>` directory and a new Step 6 obligation. The moment that directory exists, it becomes the analysis target for the current cycle.
Before you edit application code, prompts, datasets, evaluators, or rerun `pixie test`, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.
**Additional forbidden shortcut**:
- Do not create a newer `pixie_qa/results/<test_id>` and leave an older one from the same task without Step 6 artifacts.
---
## Writing principles
Every **detailed** analysis artifact you produce must follow these principles:
- **Data-driven**: Every opinion or statement must be backed by concrete data from the evaluation run. Quote scores, cite entry indices, reference specific eval input/output content. No hand-waving. It is better to write nothing than to write something unsubstantiated.
- **Evidence-first**: Present the raw data and evidence before drawing conclusions. The reader (another coding agent) should be able to independently verify your conclusions from the evidence you cite.
- **Traceable**: For every conclusion, provide the chain: data source → observation → reasoning → conclusion. Another agent should be able to follow this chain backward to verify or challenge any claim.
- **No selling**: Do not advocate, promote, or use value-laden language ("excellent", "robust", "impressive", "well-designed"). State what the data shows and what actions it implies. Let the reader form quality judgments.
- **Action-oriented**: Every analysis should contribute to the end goal of concrete improvements to the evaluation pipeline or application. Do not write observations that don't lead somewhere.
Every persisted analysis **summary** artifact must follow these principles:
- **Concise**: The human reader should be able to understand the key findings and actions in under 2 minutes for any single artifact.
- **Conclusions-first**: Lead with what the reader needs to know (results, findings, actions), not with methodology or background.
- **Plain language**: Avoid jargon. A non-technical stakeholder should be able to follow the summary.
- **Consistent**: Summary conclusions must match the detailed version's evidence. Never add claims in the summary that aren't supported in the detailed version.
### Dual-variant pattern
Every persisted analysis artifact in this step has two files:
| Artifact | Detailed file (for agent) | Summary file (for human) |
| ---------------- | --------------------------- | ----------------------------------- |
| Dataset analysis | `dataset-{idx}/analysis.md` | `dataset-{idx}/analysis-summary.md` |
| Action plan | `action-plan.md` | `action-plan-summary.md` |
**Always write the detailed version first**, then derive the summary from it. The summary is a strict subset of the detailed version's content — it should never contain claims or conclusions not present in the detailed version.
---
## Phase 1: Entry-level grading pass
Process each dataset entry individually. For each `dataset-{idx}/entry-{idx}/`:
### 1a. Read the entry data
Read these files for the entry:
- `config.json` — what evaluators were configured, the description, the expectation
- `eval-input.jsonl` — what data was fed to the app/evaluators
- `eval-output.jsonl` — what the app produced
- `evaluations.jsonl` — current evaluation results (scored and pending)
- `trace.jsonl` — what LLM calls the app made (if available)
### 1b. Complete pending evaluations
If `evaluations.jsonl` contains entries with `"status": "pending"`, you must grade them:
1. Read the `criteria` field of the pending evaluation
2. Apply the criteria to the entry's eval input, eval output, and trace data
3. Assign a **score** between 0.0 and 1.0:
- `1.0`: fully meets the criteria
- `0.5` up to (but not including) `1.0`: partially meets the criteria (explain what's missing)
- below `0.5`: does not meet the criteria
4. Write a **reasoning** string (1–3 sentences citing specific evidence from the output or trace)
5. Replace the pending entry in `evaluations.jsonl` with the scored result. **Do not append a second row and leave the pending row in place. Overwrite the pending row itself.**
**Before** (pending):
```json
{
"evaluator": "ResponseQuality",
"status": "pending",
"criteria": "The response should..."
}
```
**After** (scored):
```json
{
"evaluator": "ResponseQuality",
"score": 0.85,
"reasoning": "Response addresses the main question but omits..."
}
```
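The in-place overwrite from step 5 can be sketched as below. This is a minimal illustration, assuming `evaluations.jsonl` holds one JSON object per line with the `evaluator` and `status` keys shown above; the helper name `score_pending` is hypothetical and not part of pixie.

```python
import json
from pathlib import Path


def score_pending(path: Path, evaluator: str, score: float, reasoning: str) -> None:
    """Overwrite the pending row for `evaluator` with a scored row (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    replaced = False
    for i, row in enumerate(rows):
        if row.get("evaluator") == evaluator and row.get("status") == "pending":
            # Replace the row itself so no pending duplicate is left behind.
            rows[i] = {"evaluator": evaluator, "score": score, "reasoning": reasoning}
            replaced = True
    if not replaced:
        raise LookupError(f"no pending row found for {evaluator!r}")
    path.write_text("".join(json.dumps(row) + "\n" for row in rows))
```

Rewriting the whole file keeps the row count unchanged, which is the property step 5 asks for.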
**Grading guidelines**:
- Be evidence-based — every score must reference specific output or trace content
- Use the criteria literally — do not expand or reinterpret beyond what's written
- Consider the trace — distinguish between app logic problems and LLM quality issues
- Be calibrated — reserve 1.0 for outputs that genuinely satisfy criteria fully
- Do not penalize LLM non-determinism — different phrasing of a correct answer is not a failure
- Do not defer to the user — if the evidence is sufficient to write "likely passes", it is sufficient to assign a score and update `evaluations.jsonl`
### 1c. Do not persist entry-level analysis files
In this trimmed workflow, **do not write `entry-{idx}/analysis.md` or `entry-{idx}/analysis-summary.md`**. Phase 1 is only for reading evidence and converting every pending evaluation into a scored row in `evaluations.jsonl`.
You may take temporary scratch notes while reasoning, but they are not deliverables. Persist only:
- updated `evaluations.jsonl` in each entry directory
- dataset-level analysis files in Phase 2
- run-level action plan files in Phase 3
---
## Phase 2: Dataset-level analysis
After every entry in a dataset has been graded in Phase 1, produce the dataset-level analysis. Write `analysis.md` in the dataset directory (`dataset-{idx}/analysis.md`).
### 2a. Aggregate the data
Summarize across all entries in the dataset:
- Pass/fail counts and overall pass rate
- Per-evaluator statistics (pass rate, min/max/mean scores)
- Which entries failed which evaluators (failure clusters)
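One way to compute the per-evaluator statistics is sketched below. It assumes the scored rows from every entry's `evaluations.jsonl` have already been loaded into a list of dicts, and it uses a pass threshold of 0.5, which is an assumption: this document does not state the threshold, so substitute whatever your run actually uses.

```python
from statistics import mean

PASS_THRESHOLD = 0.5  # assumption: not specified in this document


def per_evaluator_stats(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Group scored rows by evaluator and compute pass rate and min/max/mean."""
    by_evaluator: dict[str, list[float]] = {}
    for row in rows:
        by_evaluator.setdefault(row["evaluator"], []).append(row["score"])
    return {
        name: {
            "pass_rate": sum(s >= PASS_THRESHOLD for s in scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
        }
        for name, scores in by_evaluator.items()
    }
```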
### 2b. Form and validate hypotheses
Come up with **exactly 3 high-confidence hypotheses**, one for each of these three dimensions:
1. **Test cases quality** — Does the set of test cases sufficiently and efficiently verify the application's capabilities? Does it cover the important failure modes? Are there blind spots?
2. **Evaluation criteria/evaluator quality** — Do the evaluators have proper granularity and grading to catch real issues? Are there rubber-stamp evaluators (all 1.0)? Are there flaky evaluators (high variance without code changes)? Are criteria too vague or too strict?
3. **Application quality** — Based on the evaluation results, what are the application's strengths and weaknesses? Where does it produce high-quality output? Where does it fail?
For each hypothesis:
- **State the hypothesis** clearly in one sentence
- **Cite the evidence** — entry indices, evaluator names, scores, reasoning quotes, trace data
- **Validate or invalidate** — look at the actual eval input/output data and code to confirm or refute
- **Conclusion** — what action does this hypothesis imply?
It is always possible to produce 3 hypotheses even when the data is limited. If the evaluation data doesn't give a conclusive answer on application quality, that itself is a signal about test case or evaluator gaps.
### 2c. Write the dataset analysis (two files)
Produce **two files** for the dataset analysis. Write the detailed version first, then derive the summary.
#### Detailed version: `dataset-{idx}/analysis.md`
This file is for **agent consumption** — it provides the complete data aggregation, hypothesis formation with evidence chains, and validated conclusions that a coding agent can act on directly.
**Writing principles:**
- **Show all the data before interpreting it.** Start with the raw aggregation (pass/fail, per-evaluator stats, failure clusters) before any hypotheses. The data should stand on its own.
- **For each hypothesis, present: data → reasoning → conclusion.** The reader should be able to follow your logic step by step and arrive at the same conclusion independently.
- **Cross-reference raw entry evidence directly.** When citing evidence, reference the specific entry index and the underlying files/data points (for example: `entry-3/evaluations.jsonl`, `entry-3/eval-output.jsonl`, or `entry-3/trace.jsonl`).
- **Distinguish correlation from causation.** If two entries fail the same evaluator, that's a pattern. But the root cause might differ, so verify by checking the actual output data rather than assuming.
- **Do not speculate without marking it.** If a conclusion is uncertain, say "Hypothesis (unvalidated): ..." and explain what additional data would confirm or refute it.
**Content:**
1. **Overview** — dataset name, entry count, overall pass rate
2. **Raw aggregation data**
- Per-evaluator statistics table (pass rate, score range, mean, standard deviation)
- Failure matrix: entries × evaluators showing scores, highlighting failures
- Failure clusters: entries grouped by shared failed evaluators
3. **Hypothesis 1: Test cases** — hypothesis statement, evidence with entry/evaluator references, validation steps taken, conclusion with specific action
4. **Hypothesis 2: Evaluators** — same structure
5. **Hypothesis 3: Application** — same structure
6. **Open questions** — anything the data doesn't conclusively answer, with suggestions for what additional data would help
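The failure clusters in item 2 can be derived mechanically from the failure matrix. The sketch below assumes the matrix is a dict mapping an entry id to its per-evaluator scores, and again uses a hypothetical 0.5 pass threshold; neither the shape nor the threshold is prescribed by this document.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.5  # assumption: not specified in this document


def failure_clusters(scores: dict[str, dict[str, float]]) -> dict[frozenset[str], list[str]]:
    """Group entries by the exact set of evaluators they failed.

    `scores` maps entry id -> {evaluator name -> score}.
    """
    clusters = defaultdict(list)
    for entry, by_evaluator in scores.items():
        failed = frozenset(n for n, s in by_evaluator.items() if s < PASS_THRESHOLD)
        if failed:  # entries that passed everything belong to no cluster
            clusters[failed].append(entry)
    return dict(clusters)
```

Entries that land in the same cluster share a symptom, not necessarily a cause, so each cluster still needs the validation step described in 2b.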
#### Summary version: `dataset-{idx}/analysis-summary.md`
This file is for **human review** — a scannable overview of the dataset results, key findings, and recommended actions.
**Template:**
```markdown
# Dataset Analysis — Summary
**Dataset**: <name> | **Entries**: <N> | **Pass rate**: <X/N (Y%)>
## Results at a glance
| Evaluator | Pass rate | Avg score | Notes |
| --------- | --------- | --------- | ---------------------- |
| ... | ... | ... | <one-liner if notable> |
## Key findings
1. <Finding>: <1-2 sentences with the conclusion and its implication>
2. ...
3. ...
## Recommended actions (priority order)
1. <Action>: <what to do and expected impact, 1-2 sentences>
2. ...
3. ...
```
Maximum ~40 lines for the summary.
---
## Phase 3: Action plan (two files)
After all datasets are analyzed, produce the action plan. Write **two files** at the test run root. Write the detailed version first, then derive the summary.
### Detailed version: `{PIXIE_ROOT}/results/<test_id>/action-plan.md`
This file is for **agent consumption** — it provides specific, implementable improvement items with full evidence trails, so a coding agent can pick up any item and execute it without additional context-gathering.
**Writing principles:**
- **Each item must be self-contained.** A coding agent reading just one priority item should have enough context (evidence references, file paths, expected changes) to implement it.
- **Trace every item back to evidence.** Each priority must reference: which hypothesis (from which dataset analysis), which entries/evaluators provided the evidence, and what the specific data showed.
- **Be concrete about "How".** Don't say "improve the prompt" — say "In `scrapegraphai/prompts/generate_answer.py` line 45, add instruction: '...'". The more specific, the more actionable.
- **Do not include speculative items.** Every item must have validated evidence. If an item is based on an unvalidated hypothesis, either validate it first or exclude it.
**Structure:**
```markdown
# Action Plan (Detailed)
## Summary
- X datasets analyzed, Y total entries, Z% overall pass rate
- [1-2 sentence high-level assessment]
## Priority 1: [Most impactful improvement]
- **What**: [specific change to make]
- **Why**: [which hypothesis from which dataset analysis, with entry/evaluator references]
- **Evidence**: [specific scores, output excerpts, trace data that support this]
- **Expected impact**: [which entries/evaluators this will improve, and predicted score change]
- **How**: [concrete implementation steps with file paths and line numbers]
- **Verification**: [how to verify the fix worked — which entries to re-run, what scores to expect]
## Priority 2: ...
...
```
### Summary version: `{PIXIE_ROOT}/results/<test_id>/action-plan-summary.md`
This file is for **human review** — a prioritized list of improvements that a human can understand and approve in under 2 minutes.
**Template:**
```markdown
# Action Plan — Summary
**Overall**: <X entries, Y% pass rate. 1-sentence assessment.>
## Actions (priority order)
1. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
2. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
3. ...
```
Maximum ~30 lines for the summary.
**Prioritization criteria**:
- Systemic issues (affecting multiple entries/datasets) before isolated ones
- Issues with clear, validated evidence before speculative ones
- Application quality gaps before evaluator refinements before test case additions
- Quick fixes before large refactors
The action plan should have 3–5 items. Each must trace back to a validated hypothesis from Phase 2. Do not include items that are speculative or lack evidence.
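The prioritization criteria can be read as a lexicographic sort key. The sketch below is one possible interpretation, not a prescribed algorithm: the `ActionItem` record, its fields, and the effort scale are all hypothetical, and it assumes speculative items were already excluded before sorting.

```python
from dataclasses import dataclass

# Lower rank sorts first, following the category order in the criteria above.
CATEGORY_RANK = {"application": 0, "evaluator": 1, "test_case": 2}


@dataclass
class ActionItem:
    """Hypothetical record for one validated improvement item."""
    title: str
    affected_entries: int  # entries whose evidence supports this item
    category: str          # "application", "evaluator", or "test_case"
    effort: int            # rough size of the fix: 1 = quick, 3 = large refactor


def prioritize(items: list[ActionItem], limit: int = 5) -> list[ActionItem]:
    """Sort systemic before isolated, then by category, then quick fixes first."""
    ranked = sorted(
        items,
        key=lambda i: (i.affected_entries <= 1, CATEGORY_RANK[i.category], i.effort),
    )
    return ranked[:limit]  # the plan holds at most 5 items
```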
---
## Process summary
1. **Phase 1** (per entry): Read data → grade pending evaluations → update `evaluations.jsonl`
2. **Phase 2** (per dataset): Aggregate → form 3 hypotheses → validate → write `dataset-{idx}/analysis.md` + `dataset-{idx}/analysis-summary.md`
3. **Phase 3** (per test run): Synthesize → prioritize → write `action-plan.md` + `action-plan-summary.md`
Process entries within a dataset concurrently (using subagents if available). Process phases sequentially — Phase 2 depends on Phase 1 outputs, Phase 3 depends on Phase 2 outputs.
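If the grading is scripted rather than delegated to subagents, the same ordering constraint can be sketched with a thread pool. This is only an analogy under stated assumptions: `grade_entry` is a hypothetical placeholder for the Phase 1 work on one entry, and the `entry-*` directory naming follows the layout described earlier.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def grade_entry(entry_dir: Path) -> str:
    """Placeholder for the Phase 1 work on one entry (hypothetical)."""
    return entry_dir.name


def run_phase1(dataset_dir: Path, max_workers: int = 4) -> list[str]:
    """Grade all entries of one dataset concurrently and wait for all of them."""
    entry_dirs = sorted(p for p in dataset_dir.glob("entry-*") if p.is_dir())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() blocks until every entry is graded, so Phase 2 starts only afterwards.
        return list(pool.map(grade_entry, entry_dirs))
```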
---
## Final verification
Before you end your turn, run the Step 6 verifier script that ships beside `setup.sh` in this skill's `resources/` directory against the exact test run directory you analyzed.
Example shape:
```bash
python /path/to/eval-driven-dev/resources/verify_step6_completion.py pixie_qa/results/<test_id>
```
If the verifier reports any error, keep working. Step 6 is not complete until the verifier passes.
evaluators.md 17.5 KB
# Built-in Evaluators
> Auto-generated from pixie source code docstrings.
> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`.
Autoevals adapters — pre-made evaluators wrapping `autoevals` scorers.
This module provides `AutoevalsAdapter`, which bridges the
autoevals `Scorer` interface to pixie's `Evaluator` protocol, and
a set of factory functions for common evaluation tasks.
Public API (all are also re-exported from `pixie.evals`):
**Core adapter:**
- `AutoevalsAdapter` — generic wrapper for any autoevals `Scorer`.
**Heuristic scorers (no LLM required):**
- `LevenshteinMatch` — edit-distance string similarity.
- `ExactMatch` — exact value comparison.
- `NumericDiff` — normalised numeric difference.
- `JSONDiff` — structural JSON comparison.
- `ValidJSON` — JSON syntax / schema validation.
- `ListContains` — overlap between two string lists.
**Embedding scorer:**
- `EmbeddingSimilarity` — cosine similarity via embeddings.
**LLM-as-judge scorers:**
- `Factuality`, `ClosedQA`, `Battle`, `Humor`, `Security`, `Sql`, `Summary`, `Translation`, `Possible`.
**Moderation:**
- `Moderation` — OpenAI content-moderation check.
**RAGAS metrics:**
- `ContextRelevancy`, `Faithfulness`, `AnswerRelevancy`, `AnswerCorrectness`.
## Evaluator Selection Guide
Choose evaluators based on the **output type** and eval criteria:
| Output type | Evaluator category | Examples |
| -------------------------------------------- | ----------------------------------------------------------- | -------------------------------------- |
| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines |
| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
| Trace-dependent quality (tool use, routing) | Agent evaluator via `create_agent_evaluator` | Tool correctness, multi-step reasoning |
Critical rules:
- For open-ended LLM text, **never** use `ExactMatch` — LLM outputs are
non-deterministic.
- `AnswerRelevancy` is **RAG-only** — requires `context` in the trace.
Returns 0.0 without it. For general relevance, use `create_llm_evaluator`.
- Do NOT use comparison evaluators (`Factuality`, `ClosedQA`,
`ExactMatch`) on items without `expected_output` — they produce
meaningless scores.
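The last rule lends itself to a quick pre-flight check. The sketch below assumes each dataset entry is a dict whose `evaluators` list is already resolved and whose reference answer lives under an `expectation` key; both are assumptions about the dataset shape, and `find_missing_references` is a hypothetical helper, not a pixie function.

```python
# Comparison evaluators named in the rule above.
COMPARISON_EVALUATORS = {"Factuality", "ClosedQA", "ExactMatch"}


def find_missing_references(entries: list[dict]) -> list[tuple[int, str]]:
    """Return (entry index, evaluator) pairs that need a reference but lack one."""
    problems = []
    for idx, entry in enumerate(entries):
        if entry.get("expectation") is not None:
            continue  # a reference exists, so comparison evaluators are fine
        for name in entry.get("evaluators", []):
            if name in COMPARISON_EVALUATORS:
                problems.append((idx, name))
    return problems
```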
---
## Evaluator Reference
### `AnswerCorrectness`
```python
AnswerCorrectness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Answer correctness evaluator (RAGAS).
Judges whether `eval_output` is correct compared to
`expected_output`, combining factual similarity and semantic
similarity.
**When to use**: QA scenarios in RAG pipelines where you have a
reference answer and want a comprehensive correctness score.
**Requires `expected_output`**: Yes.
**Requires `eval_metadata["context"]`**: Optional (improves accuracy).
Args:
client: OpenAI client instance.
### `AnswerRelevancy`
```python
AnswerRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Answer relevancy evaluator (RAGAS).
Judges whether `eval_output` directly addresses the question in
`eval_input`.
**When to use**: RAG pipelines only — requires `context` in the
trace. Returns 0.0 without it. For general (non-RAG) response
relevance, use `create_llm_evaluator` with a custom prompt instead.
**Requires `expected_output`**: No.
**Requires `eval_metadata["context"]`**: Yes — **RAG pipelines only**.
Args:
client: OpenAI client instance.
### `Battle`
```python
Battle(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Head-to-head comparison evaluator (LLM-as-judge).
Uses an LLM to compare `eval_output` against `expected_output`
and determine which is better given the instructions in `eval_input`.
**When to use**: A/B testing scenarios, comparing model outputs,
or ranking alternative responses.
**Requires `expected_output`**: Yes.
Args:
model: LLM model name.
client: OpenAI client instance.
### `ClosedQA`
```python
ClosedQA(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Closed-book question-answering evaluator (LLM-as-judge).
Uses an LLM to judge whether `eval_output` correctly answers the
question in `eval_input` compared to `expected_output`. Optionally
forwards `eval_metadata["criteria"]` for custom grading criteria.
**When to use**: QA scenarios where the answer should match a reference —
e.g. customer support answers, knowledge-base queries.
**Requires `expected_output`**: Yes — do NOT use on items without
`expected_output`; produces meaningless scores.
Args:
model: LLM model name.
client: OpenAI client instance.
### `ContextRelevancy`
```python
ContextRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Context relevancy evaluator (RAGAS).
Judges whether the retrieved context is relevant to the query.
Forwards `eval_metadata["context"]` to the underlying scorer.
**When to use**: RAG pipelines — evaluating retrieval quality.
**Requires `expected_output`**: Yes.
**Requires `eval_metadata["context"]`**: Yes (RAG pipelines only).
Args:
client: OpenAI client instance.
### `EmbeddingSimilarity`
```python
EmbeddingSimilarity(*, prefix: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Embedding-based semantic similarity evaluator.
Computes cosine similarity between embedding vectors of `eval_output`
and `expected_output`.
**When to use**: Comparing semantic meaning of two texts when exact
wording doesn't matter. More robust than Levenshtein for paraphrased
content but less nuanced than LLM-as-judge evaluators.
**Requires `expected_output`**: Yes.
Args:
prefix: Optional text to prepend for domain context.
model: Embedding model name.
client: OpenAI client instance.
### `ExactMatch`
```python
ExactMatch() -> 'AutoevalsAdapter'
```
Exact value comparison evaluator.
Returns 1.0 if `eval_output` exactly equals `expected_output`,
0.0 otherwise.
**When to use**: Deterministic, structured outputs (classification labels,
yes/no answers, fixed-format strings). **Never** use for open-ended LLM
text — LLM outputs are non-deterministic, so exact match will almost always
fail.
**Requires `expected_output`**: Yes.
### `Factuality`
```python
Factuality(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Factual accuracy evaluator (LLM-as-judge).
Uses an LLM to judge whether `eval_output` is factually consistent
with `expected_output` given the `eval_input` context.
**When to use**: Open-ended text where factual correctness matters
(chatbot responses, QA answers, summaries). Preferred over
`ExactMatch` for LLM-generated text.
**Requires `expected_output`**: Yes — do NOT use on items without
`expected_output`; produces meaningless scores.
Args:
model: LLM model name.
client: OpenAI client instance.
### `Faithfulness`
```python
Faithfulness(*, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Faithfulness evaluator (RAGAS).
Judges whether `eval_output` is faithful to (i.e. supported by)
the provided context. Forwards `eval_metadata["context"]`.
**When to use**: RAG pipelines — ensuring the answer doesn't
hallucinate beyond what the retrieved context supports.
**Requires `expected_output`**: No.
**Requires `eval_metadata["context"]`**: Yes (RAG pipelines only).
Args:
client: OpenAI client instance.
### `Humor`
```python
Humor(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Humor quality evaluator (LLM-as-judge).
Uses an LLM to judge the humor quality of `eval_output` against
`expected_output`.
**When to use**: Evaluating humor in creative writing, chatbot
personality, or entertainment applications.
**Requires `expected_output`**: Yes.
Args:
model: LLM model name.
client: OpenAI client instance.
### `JSONDiff`
```python
JSONDiff(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'
```
Structural JSON comparison evaluator.
Recursively compares two JSON structures and produces a similarity
score. Handles nested objects, arrays, and mixed types.
**When to use**: Structured JSON outputs where field-level comparison
is needed (e.g. extracted data, API response schemas, tool call arguments).
**Requires `expected_output`**: Yes.
Args:
string_scorer: Optional pairwise scorer for string fields.
### `LevenshteinMatch`
```python
LevenshteinMatch() -> 'AutoevalsAdapter'
```
Edit-distance string similarity evaluator.
Computes a normalised Levenshtein distance between `eval_output` and
`expected_output`. Returns 1.0 for identical strings and decreasing
scores as edit distance grows.
**When to use**: Deterministic or near-deterministic outputs where small
textual variations are acceptable (e.g. formatting differences, minor
spelling). Not suitable for open-ended LLM text — use an LLM-as-judge
evaluator instead.
**Requires `expected_output`**: Yes.
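To make the scoring behaviour concrete, here is a self-contained sketch of a normalised edit-distance similarity. Dividing by the longer string's length is an assumption made for illustration; the exact normalisation used by the underlying `autoevals` scorer is not stated in this document.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]


def normalised_similarity(output: str, expected: str) -> float:
    """1.0 for identical strings, falling towards 0.0 as edit distance grows."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(output, expected) / longest
```

A paraphrased but correct answer can score very low under this kind of metric, which is why the guidance above steers open-ended text towards LLM-as-judge evaluators.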
### `ListContains`
```python
ListContains(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'
```
List overlap evaluator.
Checks whether `eval_output` contains all items from
`expected_output`. Scores based on overlap ratio.
**When to use**: Outputs that produce a list of items where completeness
matters (e.g. extracted entities, search results, recommendations).
**Requires `expected_output`**: Yes.
Args:
pairwise_scorer: Optional scorer for pairwise element comparison.
allow_extra_entities: If True, extra items in output are not penalised.
### `Moderation`
```python
Moderation(*, threshold: 'float | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Content moderation evaluator.
Uses the OpenAI moderation API to check `eval_output` for unsafe
content (hate speech, violence, self-harm, etc.).
**When to use**: Any application where output safety is a concern —
chatbots, content generation, user-facing AI.
**Requires `expected_output`**: No.
Args:
threshold: Custom flagging threshold.
client: OpenAI client instance.
### `NumericDiff`
```python
NumericDiff() -> 'AutoevalsAdapter'
```
Normalised numeric difference evaluator.
Computes a normalised numeric distance between `eval_output` and
`expected_output`. Returns 1.0 for identical numbers and decreasing
scores as the difference grows.
**When to use**: Numeric outputs where approximate equality is acceptable
(e.g. price calculations, scores, measurements).
**Requires `expected_output`**: Yes.
### `Possible`
```python
Possible(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Feasibility / plausibility evaluator (LLM-as-judge).
Uses an LLM to judge whether `eval_output` is a plausible or
feasible response.
**When to use**: General-purpose quality check when you want to
verify outputs are reasonable without a specific reference answer.
**Requires `expected_output`**: No.
Args:
model: LLM model name.
client: OpenAI client instance.
### `Security`
```python
Security(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Security vulnerability evaluator (LLM-as-judge).
Uses an LLM to check `eval_output` for security vulnerabilities
based on the instructions in `eval_input`.
**When to use**: Code generation, SQL output, or any scenario
where output must be checked for injection or vulnerability risks.
**Requires `expected_output`**: No.
Args:
model: LLM model name.
client: OpenAI client instance.
### `Sql`
```python
Sql(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
SQL equivalence evaluator (LLM-as-judge).
Uses an LLM to judge whether `eval_output` SQL is semantically
equivalent to `expected_output` SQL.
**When to use**: Text-to-SQL applications where the generated SQL
should be functionally equivalent to a reference query.
**Requires `expected_output`**: Yes.
Args:
model: LLM model name.
client: OpenAI client instance.
### `Summary`
```python
Summary(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Summarisation quality evaluator (LLM-as-judge).
Uses an LLM to judge the quality of `eval_output` as a summary
compared to the reference summary in `expected_output`.
**When to use**: Summarisation tasks where the output must capture
key information from the source material.
**Requires `expected_output`**: Yes.
Args:
model: LLM model name.
client: OpenAI client instance.
### `Translation`
```python
Translation(*, language: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'
```
Translation quality evaluator (LLM-as-judge).
Uses an LLM to judge the translation quality of `eval_output`
compared to `expected_output` in the target language.
**When to use**: Machine translation or multilingual output scenarios.
**Requires `expected_output`**: Yes.
Args:
language: Target language (e.g. `"Spanish"`).
model: LLM model name.
client: OpenAI client instance.
### `ValidJSON`
```python
ValidJSON(*, schema: 'Any' = None) -> 'AutoevalsAdapter'
```
JSON syntax and schema validation evaluator.
Returns 1.0 if `eval_output` is valid JSON (and optionally matches
the provided schema), 0.0 otherwise.
**When to use**: Outputs that must be valid JSON — optionally conforming
to a specific schema (e.g. tool call responses, structured extraction).
**Requires `expected_output`**: No.
Args:
schema: Optional JSON Schema to validate against.
---
## Custom Evaluators: `create_llm_evaluator`
Factory for custom LLM-as-judge evaluators from prompt templates.
Usage::
    from pixie import create_llm_evaluator

    concise_voice_style = create_llm_evaluator(
        name="ConciseVoiceStyle",
        prompt_template="""
        You are evaluating whether a voice agent response is concise and
        phone-friendly.
        User said: {eval_input}
        Agent responded: {eval_output}
        Expected behavior: {expectation}
        Score 1.0 if the response is concise (under 3 sentences), directly
        addresses the question, and uses conversational language suitable for
        a phone call. Score 0.0 if it's verbose, off-topic, or uses
        written-style formatting.
        """,
    )
### `create_llm_evaluator`
```python
create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'
```
Create a custom LLM-as-judge evaluator from a prompt template.
The template may reference these variables (populated from the
`pixie.storage.evaluable.Evaluable` fields):
- `{eval_input}` — the evaluable's input data. Single-item lists expand
to that item's value; multi-item lists expand to a JSON dict of
`name → value` pairs.
- `{eval_output}` — the evaluable's output data (same rule as
`eval_input`).
- `{expectation}` — the evaluable's expected output
Args:
name: Display name for the evaluator (shown in scorecard).
prompt_template: A string template with `{eval_input}`,
`{eval_output}`, and/or `{expectation}` placeholders.
model: OpenAI model name (default: `gpt-4o-mini`).
client: Optional pre-configured OpenAI client instance.
Returns:
An evaluator callable satisfying the `Evaluator` protocol.
Raises:
ValueError: If the template uses nested field access like
`{eval_input[key]}` (only top-level placeholders are supported).
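The placeholder rules above can be illustrated with the standard library alone. This is a sketch of how such validation and expansion might work, not pixie's actual implementation: `check_template` and `expand` are hypothetical names, and the named items are modelled as plain dicts with `name` and `value` keys.

```python
import json
from string import Formatter


def check_template(template: str) -> set[str]:
    """Return the placeholders used; reject nested access such as {eval_input[key]}."""
    used = set()
    for _, field, _, _ in Formatter().parse(template):
        if field is None:
            continue  # literal text with no placeholder
        if "[" in field or "." in field:
            raise ValueError(f"nested field access is not supported: {{{field}}}")
        used.add(field)
    return used


def expand(items: list[dict]) -> str:
    """Single-item lists expand to the item's value; longer lists to a JSON dict."""
    if len(items) == 1:
        value = items[0]["value"]
        return value if isinstance(value, str) else json.dumps(value)
    return json.dumps({item["name"]: item["value"] for item in items})
```

Checking the template once at construction time, rather than at scoring time, surfaces a bad placeholder before any LLM call is made.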
### `create_agent_evaluator`
```python
create_agent_evaluator(name: 'str', criteria: 'str') -> '_AgentEvaluator'
```
Create an evaluator whose grading is deferred to a coding agent.
During `pixie test`, agent evaluators are not scored automatically.
Instead, they raise `AgentEvaluationPending` and record a
`PendingEvaluation` with the evaluation criteria. The coding agent
(guided by Step 6) reviews each entry's trace and output, then
grades the pending evaluations.
**When to use**: Quality dimensions that require holistic review of
the LLM trace — tool call correctness, multi-step reasoning quality,
routing decisions — where an automated LLM-as-judge prompt can't
capture the nuance.
**When NOT to use**: Simple text quality checks (use
`create_llm_evaluator` instead), deterministic checks (use heuristic
evaluators), or any criterion that can be scored from input + output
alone without trace context.
Args:
name: Display name for the evaluator (shown in scorecard as ⏳ pending).
criteria: What to evaluate — the grading instructions the agent
will follow when reviewing results. Be specific and actionable.
Returns:
An evaluator callable satisfying the `Evaluator` protocol. Its
`__call__` raises `AgentEvaluationPending` instead of returning an
`Evaluation`.
Example:
```python
from pixie import create_agent_evaluator
ResponseQuality = create_agent_evaluator(
name="ResponseQuality",
criteria="The response directly addresses the user's question with "
"accurate, well-structured information. No hallucinations "
"or off-topic content.",
)
ToolUsageCorrectness = create_agent_evaluator(
name="ToolUsageCorrectness",
criteria="The app called the correct tools in the right order based "
"on the user's intent. No unnecessary or missed tool calls.",
)
```
testing-api.md 16.1 KB
# Testing API Reference
> Auto-generated from pixie source code docstrings.
> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`.
pixie.evals — evaluation harness for LLM applications.
Public API:
- `Evaluation` — result dataclass for a single evaluator run
- `Evaluator` — protocol for evaluation callables
- `evaluate` — run one evaluator against one evaluable
- `run_and_evaluate` — evaluate spans from a MemoryTraceHandler
- `assert_pass` — batch evaluation with pass/fail criteria
- `assert_dataset_pass` — load a dataset and run assert_pass
- `EvalAssertionError` — raised when assert_pass fails
- `capture_traces` — context manager for in-memory trace capture
- `MemoryTraceHandler` — InstrumentationHandler that collects spans
- `ScoreThreshold` — configurable pass criteria
- `last_llm_call` / `root` — trace-to-evaluable helpers
- `DatasetEntryResult` — evaluation results for a single dataset entry
- `DatasetScorecard` — per-dataset scorecard with non-uniform evaluators
- `generate_dataset_scorecard_html` — render a scorecard as HTML
- `save_dataset_scorecard` — write scorecard HTML to disk
Pre-made evaluators (autoevals adapters):
- `AutoevalsAdapter` — generic wrapper for any autoevals `Scorer`
- `LevenshteinMatch` — edit-distance string similarity
- `ExactMatch` — exact value comparison
- `NumericDiff` — normalised numeric difference
- `JSONDiff` — structural JSON comparison
- `ValidJSON` — JSON syntax / schema validation
- `ListContains` — list overlap
- `EmbeddingSimilarity` — embedding cosine similarity
- `Factuality` — LLM factual accuracy check
- `ClosedQA` — closed-book QA evaluation
- `Battle` — head-to-head comparison
- `Humor` — humor detection
- `Security` — security vulnerability check
- `Sql` — SQL equivalence
- `Summary` — summarisation quality
- `Translation` — translation quality
- `Possible` — feasibility check
- `Moderation` — content moderation
- `ContextRelevancy` — RAGAS context relevancy
- `Faithfulness` — RAGAS faithfulness
- `AnswerRelevancy` — RAGAS answer relevancy
- `AnswerCorrectness` — RAGAS answer correctness
## Dataset JSON Format
The dataset is a JSON object with these top-level fields:
```json
{
  "name": "customer-faq",
  "runnable": "pixie_qa/run_app.py:AppRunnable",
  "evaluators": ["Factuality"],
  "entries": [
    {
      "input_data": { "question": "Hello" },
      "description": "Basic greeting",
      "eval_input": [{ "name": "input", "value": "Hello" }],
      "expectation": "A friendly greeting that offers to help",
      "evaluators": ["...", "ClosedQA"]
    }
  ]
}
```
### Entry structure
All fields are top-level on each entry (flat structure — no nesting):
```
entry:
├── input_data (required) — args for Runnable.run()
├── eval_input (optional) — list of {"name": ..., "value": ...} objects (default: [])
├── description (required) — human-readable label for the test case
├── expectation (optional) — reference for comparison-based evaluators
├── eval_metadata (optional) — extra per-entry data for custom evaluators
└── evaluators (optional) — evaluator names for THIS entry
```
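A quick structural check of one entry against the layout above might look like the sketch below. It only covers the required fields and the shape of `eval_input` items; `entry_problems` is a hypothetical helper and makes no claim about what `pixie dataset validate` actually checks.

```python
REQUIRED_FIELDS = ("input_data", "description")


def entry_problems(entry: dict) -> list[str]:
    """Return human-readable problems for one dataset entry (empty list if none)."""
    problems = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in entry]
    for item in entry.get("eval_input", []):  # optional, defaults to []
        if not (isinstance(item, dict) and "name" in item and "value" in item):
            problems.append(f"eval_input item is not a name/value object: {item!r}")
    evaluators = entry.get("evaluators", [])
    if not all(isinstance(name, str) for name in evaluators):
        problems.append("evaluators must be a list of names")
    return problems
```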
### Field reference
- `runnable` (required): `filepath:ClassName` reference to the `Runnable`
subclass that drives the app during evaluation.
- `evaluators` (dataset-level, optional): Default evaluator names — applied to
every entry that does not declare its own `evaluators`.
- `entries[].input_data` (required): Kwargs passed to `Runnable.run()` as a
Pydantic model. Keys must match the fields of the Pydantic model used in
`run(args: T)`.
- `entries[].description` (required): Human-readable label for the test case.
- `entries[].eval_input` (optional, default `[]`): List of `{"name": ..., "value": ...}`
objects. Used to populate the wrap input registry — `wrap(purpose="input")`
calls in the app return registry values keyed by `name`. The runner
automatically prepends `input_data` when building the `Evaluable`.
- `entries[].expectation` (optional): Concise expectation description
for comparison-based evaluators. Should describe what a correct output looks
like, **not** copy the verbatim output. Use `pixie format` on the trace to
see the real output shape, then write a shorter description.
- `entries[].eval_metadata` (optional): Extra per-entry data for custom
evaluators — e.g., expected tool names, boolean flags, thresholds. Accessed in
evaluators as `evaluable.eval_metadata`.
- `entries[].evaluators` (optional): Row-level evaluator override. Rules:
- Omit → entry inherits dataset-level `evaluators`.
- `["...", "ClosedQA"]` → dataset defaults **plus** ClosedQA.
- `["OnlyThis"]` (no `"..."`) → **only** OnlyThis, no defaults.
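The three override rules can be expressed in a few lines. In this sketch, `resolve_evaluators` is a hypothetical helper, and expanding `"..."` at the position where it appears (without de-duplicating names) is an assumption; the rules above do not say how ordering or duplicates are handled.

```python
def resolve_evaluators(dataset_defaults: list[str], entry: dict) -> list[str]:
    """Apply the row-level override rules to one entry."""
    if "evaluators" not in entry:
        return list(dataset_defaults)  # omitted: inherit the dataset defaults
    resolved = []
    for name in entry["evaluators"]:
        if name == "...":
            resolved.extend(dataset_defaults)  # "..." stands for the defaults
        else:
            resolved.append(name)
    return resolved
```

For the example dataset above, the single entry resolves to `Factuality` plus `ClosedQA`.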
## Evaluator Name Resolution
In dataset JSON, evaluator names are resolved as follows:
- **Built-in names** (bare names like `"Factuality"`, `"ExactMatch"`) are
resolved to `pixie.{Name}` automatically.
- **Custom evaluators** use `filepath:callable_name` format
(e.g. `"pixie_qa/evaluators.py:my_evaluator"`).
- Custom evaluator references point to module-level callables — classes
(instantiated automatically), factory functions (called if zero-arg),
evaluator functions (used as-is), or pre-instantiated callables (e.g.
`create_llm_evaluator` results — used as-is).
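As a rough sketch of the naming convention, the two forms can be told apart by the presence of a `:` separator. `parse_evaluator_name` is a hypothetical helper written for illustration; how pixie actually loads the file and instantiates the callable is not shown here.

```python
def parse_evaluator_name(name: str) -> tuple[str, str, str]:
    """Hypothetical helper: split a dataset evaluator name into its parts.

    Returns (kind, location, attribute), where kind is "builtin" or "custom".
    """
    if ":" in name:
        # Custom evaluator in "filepath:callable_name" form. Splitting on the
        # last colon keeps any colon inside the file path intact.
        filepath, _, callable_name = name.rpartition(":")
        return ("custom", filepath, callable_name)
    # Bare name: treated as a built-in exposed as pixie.{Name}.
    return ("builtin", "pixie", name)


builtin = parse_evaluator_name("Factuality")
custom = parse_evaluator_name("pixie_qa/evaluators.py:my_evaluator")
```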
## CLI Commands
| Command | Description |
| ------------------------------------------- | ------------------------------------- |
| `pixie test [path] [-v] [--no-open]` | Run eval tests on dataset files |
| `pixie dataset create <name>` | Create a new empty dataset |
| `pixie dataset list` | List all datasets |
| `pixie dataset save <name> [--select MODE]` | Save a span to a dataset |
| `pixie dataset validate [path]` | Validate dataset JSON files |
| `pixie analyze <test_run_id>` | Generate analysis and recommendations |
---
## Types
### `Evaluable`
```python
class Evaluable(TestCase):
    eval_output: list[NamedData]  # wrap(purpose="output") + wrap(purpose="state") values
    # Inherited from TestCase:
    # eval_input: list[NamedData]  # from eval_input in dataset entry
    # expectation: JsonValue | _Unset  # from expectation in dataset entry
    # eval_metadata: dict[str, JsonValue] | None  # from eval_metadata in dataset entry
    # description: str | None
```
Data carrier for evaluators. Extends `TestCase` with actual output.
- `eval_input` — `list[NamedData]` populated from the entry's `eval_input` field plus `input_data` (prepended by the runner). Always has at least one item.
- `eval_output` — `list[NamedData]` containing ALL `wrap(purpose="output")` and `wrap(purpose="state")` values captured during the run. Each item has `.name` (str) and `.value` (JsonValue). Use `_get_output(evaluable, "name")` to look up by name.
- `eval_metadata` — `dict[str, JsonValue] | None` from the entry's `eval_metadata` field (`None` when absent).
- `expected_output` — expectation text from the dataset, used as the expected/reference output for evaluation. Defaults to `UNSET` (not provided); may be explicitly set to `None` to indicate "there is no expected output".
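The difference between "not provided" and "explicitly `None`" is the usual sentinel pattern. The sketch below uses a stand-in `UNSET` object defined locally for illustration; it is not pixie's own sentinel, and `describe_expected_output` is a hypothetical function.

```python
from typing import Any


class _Unset:
    """Stand-in sentinel type (pixie's real UNSET is not reproduced here)."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def describe_expected_output(expected_output: Any = UNSET) -> str:
    """Distinguish 'never set' from 'deliberately set to None'."""
    if expected_output is UNSET:
        return "not provided"
    if expected_output is None:
        return "explicitly no expected output"
    return f"expected: {expected_output}"


not_provided = describe_expected_output()
explicit_none = describe_expected_output(None)
with_value = describe_expected_output("Agent should call endCall tool")
```

An evaluator that compares against a reference would typically check for the sentinel with `is` before using the value, since `None` is itself a meaningful setting.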
### How `wrap()` maps to `Evaluable` fields at test time
When `pixie test` runs a dataset entry, `wrap()` calls in the app populate the `Evaluable` that evaluators receive:
| `wrap()` call in app code | Evaluable field | Type | How to access in evaluator |
| ---------------------------------------- | ----------------- | ----------------- | ---------------------------------------------------- |
| `wrap(data, purpose="input", name="X")` | `eval_input` | `list[NamedData]` | Pre-populated from `eval_input` in the dataset entry |
| `wrap(data, purpose="output", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — see helper below |
| `wrap(data, purpose="state", name="X")` | `eval_output` | `list[NamedData]` | `_get_output(evaluable, "X")` — same list as output |
| (from dataset entry `expectation`) | `expected_output` | `str \| None` | `evaluable.expected_output` |
| (from dataset entry `eval_metadata`) | `eval_metadata` | `dict \| None` | `evaluable.eval_metadata` |
**Key insight**: Both `purpose="output"` and `purpose="state"` wrap values end up in `eval_output` as `NamedData` items. There is no separate `captured_output` or `captured_state` dict. Use the helper function below to look up values by wrap name:
```python
def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None
```
**`eval_metadata`** is for passing extra per-entry data to evaluators that is neither input nor output data — e.g., expected tool names, boolean flags, thresholds. It is defined as a top-level field on the entry and accessed as `evaluable.eval_metadata`.
**Complete custom evaluator example** (tool call check + dataset entry):
```python
from typing import Any

from pixie import Evaluation, Evaluable


def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None


def tool_call_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    # Expected tool name comes from the entry's eval_metadata (may be absent).
    expected = evaluable.eval_metadata.get("expected_tool") if evaluable.eval_metadata else None
    actual = _get_output(evaluable, "function_called")
    if expected is None:
        return Evaluation(score=1.0, reasoning="No expected_tool specified")
    match = str(actual) == str(expected)
    return Evaluation(
        score=1.0 if match else 0.0,
        reasoning=f"Expected {expected}, got {actual}",
    )
```
Corresponding dataset entry:
```json
{
  "input_data": { "user_message": "I want to end this call" },
  "description": "User requests call end after failed verification",
  "eval_input": [{ "name": "user_input", "value": "I want to end this call" }],
  "expectation": "Agent should call endCall tool",
  "eval_metadata": {
    "expected_tool": "endCall",
    "expected_call_ended": true
  },
  "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
}
```
### `Evaluation`
```python
Evaluation(score: 'float', reasoning: 'str', details: 'dict[str, Any]' = <factory>) -> None
```
The result of a single evaluator applied to a single test case.
Attributes:
score: Evaluation score between 0.0 and 1.0.
reasoning: Human-readable explanation (required).
details: Arbitrary JSON-serializable metadata.
### `ScoreThreshold`
```python
ScoreThreshold(threshold: 'float' = 0.5, pct: 'float' = 1.0) -> None
```
Pass criteria: _pct_ fraction of inputs must score >= _threshold_ on all evaluators.
Attributes:
threshold: Minimum score an individual evaluation must reach.
pct: Fraction of test-case inputs (0.0–1.0) that must pass.
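To make the rule concrete, here is a minimal re-implementation of the described pass criteria over a results matrix of shape `[inputs][evaluators]`. It is a sketch under assumptions, not pixie's code: `FakeEvaluation` is a stand-in for `Evaluation`, `score_threshold_criteria` is a hypothetical name, and treating an empty matrix as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FakeEvaluation:
    """Stand-in for pixie's Evaluation with only the fields used here."""

    score: float
    reasoning: str


def score_threshold_criteria(
    results: list[list[FakeEvaluation]],
    threshold: float = 0.5,
    pct: float = 1.0,
) -> tuple[bool, str]:
    """Pass when at least `pct` of inputs score >= threshold on ALL evaluators."""
    if not results:
        return False, "no inputs evaluated"  # assumption: empty run fails
    passing = sum(
        1 for row in results if all(ev.score >= threshold for ev in row)
    )
    fraction = passing / len(results)
    return fraction >= pct, f"{passing}/{len(results)} inputs passed"


matrix = [
    [FakeEvaluation(1.0, "ok"), FakeEvaluation(0.8, "ok")],   # passes both
    [FakeEvaluation(0.9, "ok"), FakeEvaluation(0.2, "bad")],  # fails one evaluator
]
strict = score_threshold_criteria(matrix)            # pct=1.0
lenient = score_threshold_criteria(matrix, pct=0.5)  # half may fail
```

A callable with this `(results) -> (passed, message)` shape matches the `pass_criteria` parameter type shown in the function signatures in this reference.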
## Eval Functions
### `pixie.run_and_evaluate`
```python
pixie.run_and_evaluate(evaluator: 'Callable[..., Any]', runnable: 'Callable[..., Any]', eval_input: 'Any', *, expected_output: 'Any' = <object object at 0x7788c2ad5c80>, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'Evaluation'
```
Run _runnable(eval_input)_ while capturing traces, then evaluate.
Convenience wrapper combining `_run_and_capture` and `evaluate`.
The runnable is called exactly once.
Args:
evaluator: An evaluator callable (sync or async).
runnable: The application function to test.
eval_input: The single input passed to _runnable_.
expected_output: Optional expected value merged into the
evaluable.
from_trace: Optional callable to select a specific span from
the trace tree for evaluation.
Returns:
The `Evaluation` result.
Raises:
ValueError: If no spans were captured during execution.
### `pixie.assert_pass`
```python
pixie.assert_pass(runnable: 'Callable[..., Any]', eval_inputs: 'list[Any]', evaluators: 'list[Callable[..., Any]]', *, evaluables: 'list[Evaluable] | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```
Run evaluators against a runnable over multiple inputs.
For each input, runs the runnable once via `_run_and_capture`,
then evaluates with every evaluator concurrently via
`asyncio.gather`.
The results matrix has shape `[eval_inputs][evaluators]`.
If the pass criteria are not met, raises `EvalAssertionError`
carrying the matrix.
When `evaluables` is provided, behaviour depends on whether each
item already has `eval_output` populated:
- **eval_output is None** — the `runnable` is called via
`run_and_evaluate` to produce an output from traces, and
`expected_output` from the evaluable is merged into the result.
- **eval_output is not None** — the evaluable is used directly
(the runnable is not called for that item).
Args:
runnable: The application function to test.
eval_inputs: List of inputs, each passed to _runnable_.
evaluators: List of evaluator callables.
evaluables: Optional list of `Evaluable` items, one per input.
When provided, their `expected_output` is forwarded to
`run_and_evaluate`. Must have the same length as
_eval_inputs_.
pass_criteria: Receives the results matrix, returns
`(passed, message)`. Defaults to `ScoreThreshold()`.
from_trace: Optional span selector forwarded to
`run_and_evaluate`.
Raises:
EvalAssertionError: When pass criteria are not met.
ValueError: When _evaluables_ length does not match _eval_inputs_.
### `pixie.assert_dataset_pass`
```python
pixie.assert_dataset_pass(runnable: 'Callable[..., Any]', dataset_name: 'str', evaluators: 'list[Callable[..., Any]]', *, dataset_dir: 'str | None' = None, pass_criteria: 'Callable[[list[list[Evaluation]]], tuple[bool, str]] | None' = None, from_trace: 'Callable[[list[ObservationNode]], Evaluable] | None' = None) -> 'None'
```
Load a dataset by name, then run `assert_pass` with its items.
This is a convenience wrapper that:
1. Loads the dataset from the `DatasetStore`.
2. Extracts `eval_input` from each item as the runnable inputs.
3. Uses the full `Evaluable` items (which carry `expected_output`)
as the evaluables.
4. Delegates to `assert_pass`.
Args:
runnable: The application function to test.
dataset_name: Name of the dataset to load.
evaluators: List of evaluator callables.
dataset_dir: Override directory for the dataset store.
When `None`, reads from `PixieConfig.dataset_dir`.
pass_criteria: Receives the results matrix, returns
`(passed, message)`.
from_trace: Optional span selector forwarded to
`assert_pass`.
Raises:
FileNotFoundError: If no dataset with _dataset_name_ exists.
EvalAssertionError: When pass criteria are not met.
## Trace Helpers
### `pixie.last_llm_call`
```python
pixie.last_llm_call(trace: 'list[ObservationNode]') -> 'Evaluable'
```
Find the `LLMSpan` with the latest `ended_at` in the trace tree.
Args:
trace: The trace tree (list of root `ObservationNode` instances).
Returns:
An `Evaluable` wrapping the most recently ended `LLMSpan`.
Raises:
ValueError: If no `LLMSpan` exists in the trace.
### `pixie.root`
```python
pixie.root(trace: 'list[ObservationNode]') -> 'Evaluable'
```
Return the first root node's span as `Evaluable`.
Args:
trace: The trace tree (list of root `ObservationNode` instances).
Returns:
An `Evaluable` wrapping the first root node's span.
Raises:
ValueError: If the trace is empty.
### `pixie.capture_traces`
```python
pixie.capture_traces() -> 'Generator[MemoryTraceHandler, None, None]'
```
Context manager that installs a `MemoryTraceHandler` and yields it.
Calls `init()` (no-op if already initialised) then registers the
handler via `add_handler()`. On exit the handler is removed and
the delivery queue is flushed so that all spans are available on
`handler.spans`.
wrap-api.md 8.8 KB
# Wrap API Reference
> Auto-generated from pixie source code docstrings.
> Do not edit by hand — run `uv run python scripts/generate_skill_docs.py`.
`pixie.wrap` — data-oriented observation API.
`wrap()` observes a data value or callable at a named point in the
processing pipeline. Its behavior depends on the active mode:
- **No-op** (tracing disabled, no eval registry): returns `data` unchanged.
- **Tracing** (during `pixie trace`): writes to the trace file, emits an
OTel event (via span event if a span is active, or via OTel logger
otherwise), and returns `data` unchanged (or wraps a callable so the
event fires on call).
- **Eval** (eval registry active): injects dependency data for
`purpose="input"`, captures output/state for `purpose="output"`/
`purpose="state"`.
---
## CLI Commands
| Command | Description |
| ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `pixie trace --runnable <filepath:ClassName> --input <kwargs.json> --output <file.jsonl>` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). |
| `pixie format --input <trace.jsonl> --output <dataset_entry.json>` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). |
| `pixie trace filter <file.jsonl> --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. |
---
## Classes
### `pixie.Runnable`
```python
class pixie.Runnable(Protocol[T]):
    @classmethod
    def create(cls) -> Runnable[Any]: ...
    async def setup(self) -> None: ...
    async def run(self, args: T) -> None: ...
    async def teardown(self) -> None: ...
```
Protocol for structured runnables used by the evaluation harness. `T` is a
`pydantic.BaseModel` subclass whose fields match the `input_data` keys
in the dataset JSON.
Lifecycle:
1. `create()` — class method to construct and return a runnable instance.
2. `setup()` — **async**, called **once** before the first `run()` call.
Initialize shared resources here (e.g., `TestClient`, database connections).
Optional — has a default no-op implementation.
3. `run(args)` — **async**, called **concurrently for each dataset entry**
(up to 4 entries in parallel). `args` is a validated Pydantic model
built from `input_data`. Invoke the application's real entry point.
4. `teardown()` — **async**, called **once** after the last `run()` call.
Release any resources acquired in `setup()`.
Optional — has a default no-op implementation.
`setup()` and `teardown()` have default no-op implementations;
you only need to override them when shared resources are required.
**Concurrency**: `run()` is called concurrently via `asyncio.gather`. Your
implementation **must be concurrency-safe**. If it uses shared mutable state
(e.g., a SQLite connection, an in-memory cache, a file handle), protect it
with `asyncio.Semaphore` or `asyncio.Lock`:
```python
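import asyncio

import pixie

# AppArgs (a pydantic model) and call_app (the app's async entry point)
# are assumed to be defined elsewhere in the project.
class AppRunnable(pixie.Runnable[AppArgs]):
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> "AppRunnable":
        inst = cls()
        inst._sem = asyncio.Semaphore(1)  # serialise DB access
        return inst

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await call_app(args.message)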
```
Common concurrency pitfalls:
- **SQLite**: not safe for concurrent writes — use `Semaphore(1)` or `aiosqlite` with WAL mode.
- **Global mutable state**: module-level dicts/lists modified in `run()` need protection.
- **Rate-limited APIs**: add a semaphore to avoid 429 errors.
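For the rate-limited case, a semaphore with a count above 1 bounds how many calls are in flight without fully serialising them. The sketch below is standalone and does not use pixie; `fake_api_call` stands in for the real request, and the limit of 2 is an arbitrary illustrative value.

```python
import asyncio


async def run_all(n_entries: int, max_in_flight: int) -> int:
    """Run n_entries fake calls concurrently; return the peak number in flight."""
    sem = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def fake_api_call() -> None:
        nonlocal in_flight, peak
        async with sem:  # at most max_in_flight coroutines get past this point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the real network call
            in_flight -= 1

    # Same scheduling style the text describes for run(): asyncio.gather.
    await asyncio.gather(*(fake_api_call() for _ in range(n_entries)))
    return peak


peak_in_flight = asyncio.run(run_all(n_entries=8, max_in_flight=2))
```

In a `Runnable`, the same `async with` guard would wrap only the rate-limited call inside `run()`, so the rest of the entry's work still runs concurrently.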
**Import resolution**: The project root directory (where `pixie test` / `pixie trace`
is invoked) is automatically added to `sys.path` before loading runnables and
evaluators. This means your runnable can use normal `import` statements to
reference project modules (e.g., `from app import service`).
**Example**:
```python
# pixie_qa/run_app.py
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp import handle_request
        await handle_request(args.user_message)
```
**Web server example** (using an async HTTP client):
```python
import httpx
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    _client: httpx.AsyncClient

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def setup(self) -> None:
        self._client = httpx.AsyncClient(base_url="http://localhost:8000")

    async def run(self, args: AppArgs) -> None:
        await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()
```
---
## Functions
### `pixie.wrap`
```python
pixie.wrap(data: 'T', *, purpose: "Literal['input', 'output', 'state']", name: 'str', description: 'str | None' = None) -> 'T'
```
Observe a data value or data-provider callable at a point in the processing pipeline.
`data` can be either a plain value or a callable that produces a value.
In both cases the return type is `T` — the caller gets back exactly the
same type it passed in when in no-op or tracing modes.
In eval mode with `purpose="input"`, the returned value (or callable) is
replaced with the deserialized registry value. When `data` is callable
the returned wrapper ignores the original function and returns the injected
value on every call; in all other modes the returned callable wraps the
original and adds tracing or capture behaviour.
Args:
data: A data value or a data-provider callable.
purpose: Classification of the data point:
- "input": data from external dependencies (DB records, API responses)
- "output": data going out to external systems or users
- "state": intermediate state for evaluation (routing decisions, etc.)
name: Unique identifier for this data point. Used as the key in the
eval registry and in trace logs.
description: Optional human-readable description of what this data is.
Returns:
The original data unchanged (tracing / no-op modes), or the
registry value (eval mode with purpose="input"). When `data`
is callable the return value is also callable.
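The mode-dependent behaviour described above can be modelled with a toy function. This is only a mental model, not pixie's implementation: `toy_wrap` and `ToyEvalContext` are hypothetical names, tracing mode is left out, and the registry holds plain Python values rather than serialized ones.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyEvalContext:
    """Hypothetical stand-in for an active eval registry."""

    registry: dict[str, Any]
    captured: dict[str, Any] = field(default_factory=dict)


def toy_wrap(data: Any, *, purpose: str, name: str,
             ctx: Optional[ToyEvalContext] = None) -> Any:
    if ctx is None:
        return data  # no-op mode: value or callable comes back untouched

    if purpose == "input":
        injected = ctx.registry[name]  # a KeyError here plays the registry-miss role
        if callable(data):
            # The wrapper ignores the original function on every call.
            return lambda *args, **kwargs: injected
        return injected

    # purpose is "output" or "state": capture, then return the value unchanged.
    if callable(data):
        def capturing(*args: Any, **kwargs: Any) -> Any:
            result = data(*args, **kwargs)
            ctx.captured[name] = result
            return result
        return capturing
    ctx.captured[name] = data
    return data


def fetch_profile() -> dict[str, str]:
    raise RuntimeError("the real database must not be hit in eval mode")


ctx = ToyEvalContext(registry={"customer_profile": {"name": "Ada"}})
get_profile = toy_wrap(fetch_profile, purpose="input", name="customer_profile", ctx=ctx)
profile = get_profile()  # injected value; fetch_profile is never called
reply = toy_wrap("Hello Ada", purpose="output", name="reply", ctx=ctx)
passthrough = toy_wrap("raw value", purpose="input", name="anything")  # no-op mode
```

The point of the callable case is that the data provider itself is bypassed in eval mode, so a test entry fully controls what the app sees at that point.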
---
## Error Types
### `WrapRegistryMissError`
```python
WrapRegistryMissError(name: 'str') -> 'None'
```
Raised when a wrap(purpose="input") name is not found in the eval registry.
### `WrapTypeMismatchError`
```python
WrapTypeMismatchError(name: 'str', expected_type: 'type', actual_type: 'type') -> 'None'
```
Raised when the deserialized registry value doesn't match the expected type.
---
## Trace File Utilities
Pydantic model for wrap log entries and JSONL loading utilities.
`WrapLogEntry` is the typed representation of a single `wrap()` event
as recorded in a JSONL trace file. Multiple places in the codebase load
these objects — the `pixie trace filter` CLI, the dataset loader, and
the verification scripts — so they share this single model.
### `pixie.WrapLogEntry`
```python
pixie.WrapLogEntry(*, type: str = 'wrap', name: str, purpose: str, data: Any, description: str | None = None, trace_id: str | None = None, span_id: str | None = None) -> None
```
A single wrap() event as logged to a JSONL trace file.
Attributes:
type: Always `"wrap"` for wrap events.
name: The wrap point name (matches `wrap(name=...)`).
purpose: One of `"input"`, `"output"`, `"state"`.
data: The serialized data (jsonpickle string).
description: Optional human-readable description.
trace_id: OTel trace ID (if available).
span_id: OTel span ID (if available).
### `pixie.load_wrap_log_entries`
```python
pixie.load_wrap_log_entries(jsonl_path: 'str | Path') -> 'list[WrapLogEntry]'
```
Load all wrap log entries from a JSONL file.
Skips non-wrap lines (e.g. `type=llm_span`) and malformed lines.
Args:
jsonl_path: Path to a JSONL trace file.
Returns:
List of `WrapLogEntry` objects.
### `pixie.filter_by_purpose`
```python
pixie.filter_by_purpose(entries: 'list[WrapLogEntry]', purposes: 'set[str]') -> 'list[WrapLogEntry]'
```
Filter wrap log entries by purpose.
Args:
entries: List of wrap log entries.
purposes: Set of purpose values to include.
Returns:
Filtered list.
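To illustrate the loading and filtering behaviour without depending on pixie, the sketch below re-creates it with the standard library on plain dicts. `load_wrap_events` and `filter_events_by_purpose` are hypothetical stand-ins for the documented functions, and the sample trace lines are invented for the example.

```python
import json
import tempfile
from pathlib import Path


def load_wrap_events(jsonl_path: Path) -> list[dict]:
    """Keep only well-formed lines whose type is "wrap"."""
    events: list[dict] = []
    for line in jsonl_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped, as the reference describes
        if isinstance(obj, dict) and obj.get("type") == "wrap":
            events.append(obj)
    return events


def filter_events_by_purpose(events: list[dict], purposes: set[str]) -> list[dict]:
    return [event for event in events if event.get("purpose") in purposes]


# Invented sample trace: two wrap events, one non-wrap line, one broken line.
sample_lines = [
    json.dumps({"type": "wrap", "name": "customer_profile", "purpose": "input", "data": "{}"}),
    json.dumps({"type": "llm_span", "name": "chat"}),
    "{not valid json",
    json.dumps({"type": "wrap", "name": "reply", "purpose": "output", "data": "\"hi\""}),
]
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.jsonl"
    trace_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    events = load_wrap_events(trace_path)

input_names = [e["name"] for e in filter_events_by_purpose(events, {"input"})]
```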
references/runnable-examples/
cli-app.md 2.0 KB
# Runnable Example: CLI Application
**When the app is invoked from the command line** (e.g., `python -m myapp`, a CLI tool with argparse/click).
**Approach**: Use `asyncio.create_subprocess_exec` to invoke the CLI and capture output.
```python
# pixie_qa/run_app.py
import asyncio
import sys

from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    query: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives a CLI application via subprocess."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-m", "myapp", "--query", args.query,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120)
        if proc.returncode != 0:
            raise RuntimeError(f"App failed (exit {proc.returncode}): {stderr.decode()}")
```
## When the CLI needs patched dependencies
If the CLI reads from external services, create a wrapper entry point that patches dependencies before running the real CLI:
```python
# pixie_qa/patched_app.py
"""Entry point that patches external deps before running the real CLI."""
import myapp.config as config
config.redis_url = "mock://localhost"
from myapp.main import main
main()
```
Then point your Runnable at the wrapper:
```python
async def run(self, args: AppArgs) -> None:
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-m", "pixie_qa.patched_app", "--query", args.query,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120)
```
**Note**: For CLI apps, `wrap(purpose="input")` injection only works when the app runs in the same process. If using subprocess, you may need to pass test data via environment variables or config files instead.
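One way to pass test data across the process boundary is a JSON-encoded environment variable. The sketch below is an assumption-laden illustration: the variable name `MYAPP_TEST_DATA` is hypothetical, the child here is a one-line stand-in for the real CLI, and the real app would need code that reads the variable.

```python
import asyncio
import json
import os
import sys

# Hypothetical variable name; the app under test must be written to read it.
TEST_DATA_ENV = "MYAPP_TEST_DATA"

# Stand-in for the real CLI: reads the variable and prints one field.
CHILD_CODE = (
    "import json, os; "
    f"data = json.loads(os.environ['{TEST_DATA_ENV}']); "
    "print(data['customer_profile']['name'])"
)


async def run_child_with_test_data(test_data: dict) -> str:
    # Extend (rather than replace) the parent environment so PATH etc. survive.
    env = {**os.environ, TEST_DATA_ENV: json.dumps(test_data)}
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        env=env,
    )
    stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=30)
    if proc.returncode != 0:
        raise RuntimeError(f"child failed (exit {proc.returncode}): {stderr.decode()}")
    return stdout.decode().strip()


child_output = asyncio.run(
    run_child_with_test_data({"customer_profile": {"name": "Ada"}})
)
```

Environment variables suit small payloads; for larger fixtures a temporary config file whose path is passed the same way is likely a better fit.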
fastapi-web-server.md 4.0 KB
# Runnable Example: FastAPI / Web Server
**When the app is a web server** (FastAPI, Flask, Starlette) and you need to exercise the full HTTP request pipeline.
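**Approach**: Use `httpx.AsyncClient` with `ASGITransport` to run the ASGI app in-process. This is the fastest and most reliable approach — no subprocess, no port management. It applies to ASGI apps (FastAPI, Starlette); a WSGI app such as Flask cannot be mounted on `ASGITransport` directly, so use the external-server alternative described later in this file.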
```python
# pixie_qa/run_app.py
import httpx
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives a FastAPI app via in-process ASGI transport."""

    _client: httpx.AsyncClient

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def setup(self) -> None:
        from myapp.main import app  # your FastAPI/Starlette app instance
        transport = httpx.ASGITransport(app=app)
        self._client = httpx.AsyncClient(transport=transport, base_url="http://test")

    async def run(self, args: AppArgs) -> None:
        await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()
```
## ASGITransport skips lifespan events
`httpx.ASGITransport` does **not** trigger ASGI lifespan events (`startup` / `shutdown`). If the app initializes resources in its lifespan (database connections, caches, service clients), you must replicate that initialization manually in `setup()`:
```python
async def setup(self) -> None:
    # Manually replicate what the app's lifespan does
    from myapp.db import get_connection, init_db, seed_data
    import myapp.main as app_module

    conn = get_connection()
    init_db(conn)
    seed_data(conn)
    app_module.db_conn = conn  # set the module-level global the app expects
    transport = httpx.ASGITransport(app=app_module.app)
    self._client = httpx.AsyncClient(transport=transport, base_url="http://test")

async def teardown(self) -> None:
    await self._client.aclose()
    # Clean up the manually-initialized resources
    import myapp.main as app_module
    if hasattr(app_module, "db_conn") and app_module.db_conn:
        app_module.db_conn.close()
```
## Concurrency with shared mutable state
If the app uses shared mutable state (in-memory SQLite, file-based DB, global caches), add a semaphore to serialise access:
```python
import asyncio


class AppRunnable(pixie.Runnable[AppArgs]):
    _client: httpx.AsyncClient
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> "AppRunnable":
        inst = cls()
        inst._sem = asyncio.Semaphore(1)
        return inst

    async def setup(self) -> None:
        from myapp.main import app
        transport = httpx.ASGITransport(app=app)
        self._client = httpx.AsyncClient(transport=transport, base_url="http://test")

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()
```
Only use the semaphore when needed — if the app uses per-session state keyed by unique IDs (call_sid, session_id), concurrent calls are naturally isolated and no lock is needed.
## Alternative: External server with httpx
When the app can't be imported directly (complex startup, `uvicorn.run()` in `__main__`), start it as a subprocess and hit it with HTTP:
```python
class AppRunnable(pixie.Runnable[AppArgs]):
    _client: httpx.AsyncClient

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def setup(self) -> None:
        # Assumes the server is already running (started via run-with-timeout.sh)
        self._client = httpx.AsyncClient(base_url="http://localhost:8000")

    async def run(self, args: AppArgs) -> None:
        await self._client.post("/chat", json={"message": args.user_message})

    async def teardown(self) -> None:
        await self._client.aclose()
```
Start the server before running `pixie trace` or `pixie test`:
```bash
bash resources/run-with-timeout.sh 120 uv run python -m myapp.server
sleep 3 # wait for readiness
```
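A fixed `sleep` can be too short on a slow machine and wasteful on a fast one. As a possible alternative, the readiness wait can poll the port until it accepts connections. `wait_for_port` is a hypothetical helper, and an open TCP port is only a proxy for readiness (the app may still be warming up), so treat this as a sketch.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.2) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; retry shortly
    return False


# Demo against a throwaway local listener instead of a real server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
ready = wait_for_port("127.0.0.1", listener.getsockname()[1], timeout=5.0)
listener.close()
```

Such a check could run at the start of `setup()`, before the `httpx.AsyncClient` is created, with the port matching the `base_url` used above.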
standalone-function.md 1.8 KB
# Runnable Example: Standalone Function (No Server)
**When the app is a plain Python function or module** — no web framework, no server, no infrastructure.
**Approach**: Import and call the function directly from `run()`. This is the simplest case.
```python
# pixie_qa/run_app.py
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    question: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives a standalone function for tracing and evaluation."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp.agent import answer_question
        await answer_question(args.question)
```
If the function is synchronous, wrap it with `asyncio.to_thread`:
```python
import asyncio

async def run(self, args: AppArgs) -> None:
    from myapp.agent import answer_question
    await asyncio.to_thread(answer_question, args.question)
```
If the function depends on an external service (e.g., a vector store), the `wrap(purpose="input")` calls you added in Step 2a handle it automatically — the registry injects test data in eval mode.
### When to use `setup()` / `teardown()`
Most standalone functions don't need lifecycle methods. Use them only when the function requires a shared resource (e.g., a pre-loaded embedding model, a database connection):
```python
class AppRunnable(pixie.Runnable[AppArgs]):
    _model: SomeModel

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def setup(self) -> None:
        from myapp.models import load_model
        self._model = load_model()

    async def run(self, args: AppArgs) -> None:
        from myapp.agent import answer_question
        await answer_question(args.question, model=self._model)
```
resources/
setup.sh 3.1 KB
#!/usr/bin/env bash
# Setup script for eval-driven-dev skill.
# Updates the skill, installs/upgrades pixie-qa[all], initializes the
# pixie working directory, and starts the web UI server in the background.
#
# Error handling:
# - Skill update failure → non-fatal (continue with existing version)
# - pixie-qa upgrade failure when already installed → non-fatal
# - pixie-qa NOT installed and install fails → FATAL (exit 1)
# - pixie init failure → FATAL (exit 1)
# - pixie start failure → FATAL (exit 1)
set -u
echo "=== Updating skill ==="
npx skills update github/awesome-copilot --skill eval-driven-dev -g -y && npx skills update github/awesome-copilot --skill eval-driven-dev -p -y || {
    echo "(skill update failed — proceeding with existing version)"
}
echo ""
echo "=== Installing / upgrading pixie-qa[all] ==="
# Helper: check if pixie CLI is importable
_pixie_available() {
    if [ -f uv.lock ]; then
        uv run python -c "import pixie" 2>/dev/null
    elif [ -f poetry.lock ]; then
        poetry run python -c "import pixie" 2>/dev/null
    else
        python -c "import pixie" 2>/dev/null
    fi
}
# Check if pixie is already installed before attempting upgrade
PIXIE_WAS_INSTALLED=false
if _pixie_available; then
    PIXIE_WAS_INSTALLED=true
fi
INSTALL_OK=false
if [ -f uv.lock ]; then
    # uv add does universal resolution across all Python versions in
    # requires-python. If the host project supports a Python version
    # where pixie-qa is unavailable (e.g. <3.10), uv add fails.
    # Fall back to uv pip install which only targets the active interpreter.
    if uv add "pixie-qa[all]>=0.8.4,<0.9.0" --upgrade 2>&1; then
        INSTALL_OK=true
    else
        echo "(uv add failed — falling back to uv pip install)"
        if uv pip install "pixie-qa[all]>=0.8.4,<0.9.0" 2>&1; then
            INSTALL_OK=true
        fi
    fi
elif [ -f poetry.lock ]; then
    if poetry add "pixie-qa[all]>=0.8.4,<0.9.0"; then
        INSTALL_OK=true
    fi
else
    if pip install --upgrade "pixie-qa[all]>=0.8.4,<0.9.0"; then
        INSTALL_OK=true
    fi
fi
if [ "$INSTALL_OK" = false ]; then
    if [ "$PIXIE_WAS_INSTALLED" = true ]; then
        echo "(pixie-qa upgrade failed — proceeding with existing version)"
    else
        echo ""
        echo "ERROR: pixie-qa is not installed and installation failed."
        echo "The eval-driven-dev workflow requires the pixie-qa package."
        echo "Please install it manually and re-run this script."
        exit 1
    fi
fi
echo ""
echo "=== Initializing pixie working directory ==="
if [ -f uv.lock ]; then
    uv run pixie init
elif [ -f poetry.lock ]; then
    poetry run pixie init
else
    pixie init
fi
if [ $? -ne 0 ]; then
    echo ""
    echo "ERROR: Failed to initialize pixie working directory."
    echo "Please check the error above and fix it before continuing."
    exit 1
fi
echo ""
echo "=== Starting web UI server (background) ==="
if [ -f uv.lock ]; then
    uv run pixie start
elif [ -f poetry.lock ]; then
    poetry run pixie start
else
    pixie start
fi
if [ $? -ne 0 ]; then
    echo ""
    echo "ERROR: Failed to start the web UI server."
    echo "Please check the error above and fix it before continuing."
    exit 1
fi
echo ""
echo "=== Setup complete ==="
verify_step6_completion.py 4.4 KB
#!/usr/bin/env python3
"""Validate that eval-driven-dev Step 6 artifacts are complete.
Usage:
python verify_step6_completion.py /path/to/pixie_qa/results/<test_id>
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
ENTRY_REQUIRED_FILES = ("evaluations.jsonl",)
DATASET_ANALYSIS_FILES = ("analysis.md", "analysis-summary.md")
ROOT_ANALYSIS_FILES = ("action-plan.md", "action-plan-summary.md", "meta.json")
def _dataset_dirs(results_dir: Path) -> list[Path]:
return sorted(
path
for path in results_dir.iterdir()
if path.is_dir() and path.name.startswith("dataset-")
)
def _entry_dirs(dataset_dir: Path) -> list[Path]:
return sorted(
path
for path in dataset_dir.iterdir()
if path.is_dir() and path.name.startswith("entry-")
)
def _read_jsonl(path: Path, errors: list[str]) -> list[dict[str, object]]:
rows: list[dict[str, object]] = []
try:
for index, line in enumerate(
path.read_text(encoding="utf-8").splitlines(), start=1
):
if not line.strip():
continue
obj = json.loads(line)
if not isinstance(obj, dict):
errors.append(f"{path}: line {index} is not a JSON object")
continue
rows.append(obj)
except OSError as exc:
errors.append(f"{path}: could not read file ({exc})")
except json.JSONDecodeError as exc:
errors.append(f"{path}: invalid JSONL ({exc})")
return rows


def validate_results_dir(results_dir: Path) -> list[str]:
    """Return a list of validation errors for a pixie results directory."""
    errors: list[str] = []
    if not results_dir.is_dir():
        return [f"{results_dir}: results directory not found"]
    for file_name in ROOT_ANALYSIS_FILES:
        if not (results_dir / file_name).is_file():
            errors.append(f"Missing root artifact: {results_dir / file_name}")
    datasets = _dataset_dirs(results_dir)
    if not datasets:
        errors.append(f"{results_dir}: no dataset-* directories found")
        return errors
    for dataset_dir in datasets:
        for file_name in DATASET_ANALYSIS_FILES:
            if not (dataset_dir / file_name).is_file():
                errors.append(f"Missing dataset artifact: {dataset_dir / file_name}")
        entry_dirs = _entry_dirs(dataset_dir)
        if not entry_dirs:
            errors.append(f"{dataset_dir}: no entry-* directories found")
            continue
        for entry_dir in entry_dirs:
            for file_name in ENTRY_REQUIRED_FILES:
                if not (entry_dir / file_name).is_file():
                    errors.append(f"Missing entry artifact: {entry_dir / file_name}")
            evaluations_path = entry_dir / "evaluations.jsonl"
            if not evaluations_path.is_file():
                continue
            evaluations = _read_jsonl(evaluations_path, errors)
            for row in evaluations:
                status = row.get("status")
                if status == "pending":
                    errors.append(
                        "Pending evaluation remains: "
                        f"{evaluations_path} ({row.get('evaluator', 'unknown evaluator')})"
                    )
                    continue
                if "score" not in row:
                    errors.append(
                        "Missing score in scored evaluation: "
                        f"{evaluations_path} ({row.get('evaluator', 'unknown evaluator')})"
                    )
                if "reasoning" not in row:
                    errors.append(
                        "Missing reasoning in scored evaluation: "
                        f"{evaluations_path} ({row.get('evaluator', 'unknown evaluator')})"
                    )
    return errors


def main(argv: list[str] | None = None) -> int:
    """CLI entry point."""
    parser = argparse.ArgumentParser(
        description="Validate Step 6 completion for a pixie results directory"
    )
    parser.add_argument(
        "results_dir",
        type=Path,
        help="Path to pixie_qa/results/<test_id>",
    )
    args = parser.parse_args(argv)
    errors = validate_results_dir(args.results_dir)
    if errors:
        print("Step 6 completion check failed:")
        for error in errors:
            print(f"- {error}")
        return 1
    print("Step 6 completion check passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
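
The checks in the script above imply a particular directory layout. As a rough illustration only, the sketch below builds the smallest tree that those checks appear to accept. The helper name `build_minimal_results_dir`, the test id `example-test-id`, the directory names `dataset-0` and `entry-0`, the evaluator name, the `"scored"` status value, and all file contents are hypothetical placeholders; real pixie output will contain much more than this.

```python
import json
import tempfile
from pathlib import Path


def build_minimal_results_dir(root: Path) -> Path:
    """Create a minimal layout matching what the validator inspects (illustrative only)."""
    results_dir = root / "results" / "example-test-id"  # hypothetical test id
    dataset_dir = results_dir / "dataset-0"  # any "dataset-" prefixed name
    entry_dir = dataset_dir / "entry-0"  # any "entry-" prefixed name
    entry_dir.mkdir(parents=True)

    # Root-level artifacts: the validator only checks that these files exist.
    for name in ("action-plan.md", "action-plan-summary.md"):
        (results_dir / name).write_text("placeholder\n", encoding="utf-8")
    (results_dir / "meta.json").write_text("{}\n", encoding="utf-8")  # placeholder content

    # Per-dataset artifacts: again, only existence is checked.
    for name in ("analysis.md", "analysis-summary.md"):
        (dataset_dir / name).write_text("placeholder\n", encoding="utf-8")

    # One evaluation row per line; a row whose status is not "pending"
    # needs both "score" and "reasoning" keys.
    row = {
        "evaluator": "ExampleEvaluator",  # hypothetical evaluator name
        "status": "scored",  # assumed value; the script only rejects "pending"
        "score": 1.0,
        "reasoning": "placeholder reasoning",
    }
    (entry_dir / "evaluations.jsonl").write_text(
        json.dumps(row) + "\n", encoding="utf-8"
    )
    return results_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results_dir = build_minimal_results_dir(Path(tmp))
        for path in sorted(results_dir.rglob("*")):
            print(path.relative_to(results_dir))
```

Passing a directory built this way to `validate_results_dir` should return an empty list, while changing the row's status to `"pending"` or dropping its `score` key should produce one error for that row.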
License (MIT)
MIT License Copyright GitHub, Inc. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.