---
version: "1.0.0"
name: promptfoo-evaluation
description: Configures and runs LLM evaluation using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
---
# Promptfoo Evaluation
## Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
## Quick Start
```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```
## Configuration Structure
A typical Promptfoo project structure:
```
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
```
## Core Configuration (promptfooconfig.yaml)
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```
## Prompt Formats
### Text Prompt (system.md)
```
You are a helpful assistant.

Task: {{task}}

Context: {{context}}
```
### Chat Format (chat.json)
[{"role": "system", "content": "{{system_prompt}}"},{"role": "user", "content": "{{user_input}}"}]
### Few-Shot Pattern
Embed examples directly in the prompt, or use the chat format with assistant messages:
[{"role": "system", "content": "{{system_prompt}}"},{"role": "user", "content": "Example input: {{example_input}}"},{"role": "assistant", "content": "{{example_output}}"},{"role": "user", "content": "Now process: {{actual_input}}"}]
## Test Cases (tests/cases.yaml)
```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```
## Python Custom Assertions
Create a Python file for custom assertions (e.g., `scripts/metrics.py`):
```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})

    # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```
Key points:
- The default function name is `get_assert`
- Specify a function with `file://path.py:function_name`
- Return `bool`, `float` (score), or a `dict` with `pass`/`score`/`reason` (see the sketch below)
- Access test variables via `context['vars']`
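For example, a minimal sketch of the two simpler return forms (the function names and the 2000-character budget are illustrative, not part of Promptfoo's API):

```python
def exact_match(output: str, context: dict) -> bool:
    """Bool form: True passes, False fails."""
    expected = context['vars'].get('expected', '')
    return output.strip() == expected


def brevity_score(output: str, context: dict) -> float:
    """Float form: interpreted as a 0-1 score."""
    # Score decays linearly with output length, reaching 0.0
    # at 2000 characters (an arbitrary illustrative budget).
    return max(0.0, 1.0 - len(output) / 2000)
```

Reference them like any named assertion, e.g. `value: file://scripts/metrics.py:exact_match`.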
## LLM-as-Judge (llm-rubric)
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```
When using a relay/proxy API, each `llm-rubric` assertion needs its own `provider` config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
```
Best practices:
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses whichever API keys are available (OpenAI → Anthropic → Google); a global override is sketched below
- When using a relay/proxy: every `llm-rubric` must have its own `provider` with `apiBaseUrl`; the main provider's `apiBaseUrl` is NOT inherited
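If every rubric should use the same grader, Promptfoo also supports setting a grading provider once under `defaultTest.options` instead of per assertion. A minimal sketch; verify the exact key against your Promptfoo version's docs:

```yaml
defaultTest:
  options:
    # Grader used by llm-rubric and other model-graded assertions
    provider: openai:gpt-4.1
```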
## Common Assertion Types
| Type | Usage | Example |
|------|-------|---------|
| `contains` | Check substring | `value: "hello"` |
| `icontains` | Case-insensitive substring | `value: "HELLO"` |
| `equals` | Exact match | `value: "42"` |
| `regex` | Pattern match | `value: "\\d{4}"` |
| `python` | Custom logic | `value: file://script.py` |
| `llm-rubric` | LLM grading | `value: "Is professional"` |
| `latency` | Response time (ms) | `threshold: 1000` |
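Assertions can be stacked on a single test case; a minimal sketch combining a few of the types above (the values are illustrative):

```yaml
assert:
  - type: icontains
    value: "refund"      # must mention refunds, any casing
  - type: regex
    value: "\\d{4}"      # must contain a 4-digit number
  - type: latency
    threshold: 1000      # must respond within 1000 ms
```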
## File References
All `file://` paths are resolved relative to the `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file: the `file://` paths inside that test file still resolve from the config root.
```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load a prompt from a file
prompts:
  - file://prompts/main.md

# Load test cases from a file
tests: file://tests/cases.yaml

# Load a Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```
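To make the gotcha concrete, assume this layout (paths illustrative):

```
project/
├── promptfooconfig.yaml   # tests: file://tests/cases.yaml
├── tests/
│   └── cases.yaml         # contains: context: file://data/context.txt
└── data/
    └── context.txt        # resolved as project/data/context.txt,
                           # NOT project/tests/data/context.txt
```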
## Running Evaluations
```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to a file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```
## Relay / Proxy API Configuration
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages

# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1  # Respect relay rate limits
```
Key rules:
- `apiBaseUrl` goes in `providers[].config`; Promptfoo appends `/v1/messages` automatically
- `maxConcurrency` must be under `commandLineOptions:`; placing it at the top level is silently ignored
- When using a relay with LLM-as-judge, set `maxConcurrency: 1` to avoid hitting concurrent request limits (generation and grading share the same pool)
- Pass the relay token as the `ANTHROPIC_API_KEY` env var (see the sketch below)
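Putting these rules together, a typical invocation might look like this (the token and config path are placeholders):

```bash
# The relay token takes the place of a real Anthropic API key
export ANTHROPIC_API_KEY=your-relay-token

npx promptfoo@latest eval --config promptfooconfig.yaml
```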
## Troubleshooting
**Python not found:**
```bash
export PROMPTFOO_PYTHON=python3
```
**Large outputs truncated:** Outputs over 30,000 characters are truncated. Use `head_limit` in assertions.
**File not found errors:** All `file://` paths resolve relative to the `promptfooconfig.yaml` location.
**maxConcurrency ignored (shows "up to N at a time"):** `maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level. This is a common mistake.
**LLM-as-judge returns 401 with relay API:** Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
**HTML tags in model output inflating metrics:** Models may output `<br>`, `<b>`, etc. in structured content. Strip HTML in Python assertions before measuring:
```python
import re

clean_text = re.sub(r'<[^>]+>', '', raw_text)
```
## Echo Provider (Preview Mode)
Use the `echo` provider to preview rendered prompts without making API calls:
```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns the prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```
Use cases:
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure
```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```
**Cost:** Free (no API tokens consumed).
## Advanced Few-Shot Implementation
### Multi-turn Conversation Pattern
For complex few-shot learning with full examples:
[{"role": "system", "content": "{{system_prompt}}"},// Few-shot Example 1{"role": "user", "content": "Task: {{example_input_1}}"},{"role": "assistant", "content": "{{example_output_1}}"},// Few-shot Example 2 (optional){"role": "user", "content": "Task: {{example_input_2}}"},{"role": "assistant", "content": "{{example_output_2}}"},// Actual test{"role": "user", "content": "Task: {{actual_input}}"}]
Test case configuration:
```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```
Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use echo provider first to verify structure
## Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```
Python assertion for text metrics:
```python
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = (1 - output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
## Real-World Example
**Project:** Chinese short-video content curation from long transcripts
Structure:
```
tiaogaoren/
├── promptfooconfig.yaml           # Production config
├── promptfooconfig-preview.yaml   # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json     # Chat format with few-shot
│   └── v4/system-v4.md            # System prompt
├── tests/cases.yaml               # 3 test samples
├── scripts/metrics.py             # Custom metrics (reduction ratio, etc.)
├── data/                          # 5 samples (2 few-shot, 3 eval)
└── results/
```
See `./tiaogaoren/` (example project root) for the full implementation.
## Resources
For detailed API reference and advanced patterns, see `references/promptfoo_api.md`.