multi-provider prompt evaluation engine
Executes the same prompt across multiple LLM providers (OpenAI, Anthropic, Google, AWS Bedrock, Ollama, local models) in parallel, collecting structured outputs with metadata (latency, token counts, cost). Uses a provider registry pattern with pluggable provider implementations that normalize API differences into a unified interface, enabling side-by-side comparison of model behavior on identical inputs.
Unique: Each provider (OpenAI, Anthropic, Bedrock, Ollama, HTTP endpoints, Python scripts) plugs into the registry through the same normalized interface, so new providers can be added without modifying core evaluation logic. Cost is tracked per provider against model-specific pricing tables, enabling ROI analysis across providers.
vs alternatives: Broader provider support (10+ integrations, including local models) and native cost tracking compared with competitors like LangSmith or Weights & Biases, plus zero-config local execution via Ollama
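To make the registry pattern concrete, here is a minimal Python sketch under assumed names (`Provider`, `ProviderRegistry`, `ProviderResult`, and `evaluate` are all hypothetical, not promptfoo internals): each provider normalizes its API behind one interface, and the evaluation loop fans a prompt out in parallel without knowing any provider's details.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class ProviderResult:
    output: str
    latency_ms: float
    tokens: int
    cost_usd: float

class Provider(Protocol):
    """Normalized interface each provider implementation satisfies."""
    def call(self, prompt: str, variables: dict) -> ProviderResult: ...

class ProviderRegistry:
    """Maps provider ids (e.g. 'openai:gpt-4o') to factories, so new
    providers plug in without touching the evaluation loop below."""
    def __init__(self) -> None:
        self._factories: dict[str, Callable[[], Provider]] = {}

    def register(self, provider_id: str, factory: Callable[[], Provider]) -> None:
        self._factories[provider_id] = factory

    def get(self, provider_id: str) -> Provider:
        return self._factories[provider_id]()

def evaluate(registry: ProviderRegistry, provider_ids: list[str],
             prompt: str, variables: dict) -> dict[str, ProviderResult]:
    # Fan the same prompt out to every provider in parallel; this loop
    # sees only the normalized interface, never provider-specific APIs.
    with ThreadPoolExecutor(max_workers=max(len(provider_ids), 1)) as pool:
        futures = {pid: pool.submit(registry.get(pid).call, prompt, variables)
                   for pid in provider_ids}
        return {pid: fut.result() for pid, fut in futures.items()}
```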
assertion-based test grading with custom evaluators
Defines test assertions (exact match, similarity, regex, LLM-based grading) that automatically evaluate whether model outputs meet criteria. Supports custom evaluator functions (JavaScript, Python, HTTP webhooks) that receive the prompt, output, and test case metadata, returning a pass/fail score and optional details. Assertions are composable and can be chained to create complex evaluation logic without writing test harnesses.
Unique: Supports four distinct assertion types (exact, similarity, regex, LLM-rubric) plus arbitrary custom evaluators (JS functions, Python scripts, HTTP webhooks), allowing teams to mix deterministic checks with LLM-based subjective evaluation in a single test suite. Custom evaluators receive full test context (prompt, output, variables, metadata) enabling sophisticated domain-specific grading.
vs alternatives: More flexible assertion model than basic string matching in competitors; native support for LLM-as-judge grading without requiring separate evaluation pipeline setup
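A custom evaluator following the contract described above (prompt, output, and test-case metadata in; pass/fail score and details out) might look like the sketch below; the function name and payload keys are illustrative, not promptfoo's documented evaluator signature.

```python
import re

def evaluate_output(prompt: str, output: str, test_case: dict) -> dict:
    """Hypothetical custom evaluator: checks that every term listed in
    the test case's metadata appears in the model output, returning a
    graded score alongside the binary pass/fail."""
    required = test_case.get("vars", {}).get("required_terms", [])
    missing = [t for t in required
               if not re.search(re.escape(t), output, re.IGNORECASE)]
    score = 1.0 - len(missing) / max(len(required), 1)
    return {
        "pass": not missing,
        "score": score,
        "reason": f"missing terms: {missing}" if missing else "all terms present",
    }
```

A deterministic check like this can sit in the same suite as an LLM-rubric assertion, which is the mix of objective and subjective grading the description refers to.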
evaluation result persistence and historical tracking
Stores evaluation results in a local SQLite database or in cloud storage (AWS S3, Google Cloud Storage, etc.), enabling historical tracking of prompt quality over time. Results include full metadata (prompt, model, variables, outputs, scores, latency, cost), supporting trend analysis (e.g., 'pass rate improved 5% over the last month') and regression detection against previous baselines.
Unique: Every result is persisted with its complete context (prompt, model, variables, outputs, scores, latency, cost), so past runs stay queryable for trend analysis and for baseline comparisons that surface regressions.
vs alternatives: Integrated persistence (not a separate tool); supports both local and cloud storage; enables historical tracking and regression detection without external databases
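As an illustration of how persisted results support regression detection, here is a sketch against a local SQLite table; the schema and queries are invented for the example and do not reflect promptfoo's actual storage layout.

```python
import sqlite3

conn = sqlite3.connect("evals.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        run_id     TEXT,
        created_at TEXT,
        provider   TEXT,
        test_name  TEXT,
        passed     INTEGER,   -- 1 = pass, 0 = fail
        latency_ms REAL,
        cost_usd   REAL
    )""")

def pass_rate(run_id: str) -> float:
    # AVG over the 0/1 pass flags gives the run's pass rate directly.
    row = conn.execute(
        "SELECT AVG(passed) FROM results WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0] or 0.0

def regressed(current_run: str, baseline_run: str, tolerance: float = 0.0) -> bool:
    # Flag a regression when the current pass rate falls below the
    # baseline by more than the allowed tolerance.
    return pass_rate(current_run) < pass_rate(baseline_run) - tolerance
```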
aws bedrock and cloud provider integration
Provides native integration with AWS Bedrock (Claude, Llama, Mistral models), Google Vertex AI, Azure OpenAI, and other cloud providers. Handles authentication (IAM roles, API keys), model selection, and parameter mapping. Enables teams to test against cloud-hosted models without writing custom provider code. Supports streaming responses for real-time output evaluation.
Unique: Handles cloud authentication (IAM roles as well as API keys), model selection, parameter mapping, and streaming responses uniformly across Bedrock, Vertex AI, and Azure OpenAI, so teams can test cloud-hosted models without writing custom integration code.
vs alternatives: Broader cloud provider support than competitors; native IAM role support for better security; integrated streaming response handling
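For a sense of what the integration abstracts away, this is roughly what a direct Bedrock call looks like with boto3's Converse API; the model id and prompt are examples, and credentials resolve from the ambient IAM role rather than hard-coded keys.

```python
import boto3

# Credentials come from the ambient IAM role / AWS config chain,
# so nothing sensitive is hard-coded in the test setup.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model id
    messages=[{"role": "user", "content": [{"text": "Summarize: ..."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```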
python and node.js script provider execution
Executes Python scripts (3.7+) and Node.js scripts (18+) as providers, passing prompt and variables as command-line arguments or stdin. Scripts can implement arbitrary logic (e.g., calling local models, preprocessing inputs, routing to multiple models). Output is captured from stdout and parsed as JSON or plain text. Enables teams to test custom inference logic without modifying promptfoo.
Unique: Scripts are first-class providers rather than ad-hoc shims: they receive the prompt and variables via command-line arguments or stdin, can implement arbitrary logic (preprocessing, routing, local model calls), and their stdout is parsed as JSON or plain text like any other provider's output.
vs alternatives: More flexible than the HTTP provider for local execution; enables testing custom inference logic without running external servers; supports both Python and Node.js
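Given the stdin/stdout contract described above, a script provider can be as small as the sketch below; the exact JSON payload shape promptfoo sends is assumed here, so treat the key names as illustrative.

```python
#!/usr/bin/env python3
"""Hypothetical script provider: reads the prompt and variables from
stdin as JSON, applies custom routing logic, and writes a JSON result
to stdout for the harness to capture."""
import json
import sys

def main() -> None:
    payload = json.loads(sys.stdin.read())
    prompt = payload["prompt"]                 # assumed key names
    variables = payload.get("vars", {})

    # Arbitrary custom logic, e.g. route short prompts to a cheap local
    # model and long ones to a hosted endpoint (both stubbed here).
    model = "local-small" if len(prompt) < 200 else "hosted-large"
    output = f"[{model}] echo: {prompt} (vars: {variables})"

    json.dump({"output": output}, sys.stdout)

if __name__ == "__main__":
    main()
```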
ollama and local model integration
Provides native integration with Ollama (local LLM inference engine) and compatible local model servers (LLaMA.cpp, LocalAI). Connects to local HTTP endpoints, enabling teams to test open-source models (Llama, Mistral, etc.) without cloud API costs or latency. Supports model selection, parameter tuning, and streaming responses.
Unique: Treats local inference engines (Ollama, LLaMA.cpp, LocalAI) as peers of cloud providers: the same model selection, parameter tuning, and streaming support, delivered over local HTTP endpoints at zero inference cost.
vs alternatives: Purpose-built for local model testing; enables cost-free evaluation of open-source models; supports multiple local model servers (Ollama, LLaMA.cpp, LocalAI)
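Under the hood this amounts to plain HTTP calls against the local server; the sketch below hits Ollama's `/api/generate` endpoint directly to show what the integration wraps (model name and options are examples).

```python
import requests

# Ollama listens on localhost:11434 by default; no API key and no
# cloud round-trip involved.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                     # example local model
        "prompt": "Explain RAG in one paragraph.",
        "stream": False,                       # set True for token streaming
        "options": {"temperature": 0.2},       # parameter tuning
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```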
evaluation result filtering and search
Provides CLI and web UI search/filtering capabilities to navigate large evaluation result sets. Supports filtering by test case name, provider, model, pass/fail status, and custom metadata. Search uses full-text indexing for fast queries. Enables teams to quickly find specific test cases or failure patterns without manually reviewing all results.
Unique: The same full-text index backs both the CLI and the web UI, keeping filters on test case name, provider, model, status, and custom metadata fast even across large result sets.
vs alternatives: Integrated search (not a separate tool); supports both CLI and web UI; enables efficient navigation of large result sets
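One plausible way to combine full-text indexing with these filters is SQLite's FTS5, sketched below; this illustrates the technique only and is not promptfoo's actual index or schema.

```python
import sqlite3

conn = sqlite3.connect("evals.db")
# Illustrative full-text index over result fields.
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS results_fts
    USING fts5(test_name, provider, status, output)""")

def search(query: str, provider: str | None = None, status: str | None = None):
    # Full-text MATCH plus ordinary column filters on the same table.
    sql = ("SELECT test_name, provider, status "
           "FROM results_fts WHERE results_fts MATCH ?")
    params = [query]
    if provider:
        sql += " AND provider = ?"
        params.append(provider)
    if status:
        sql += " AND status = ?"
        params.append(status)
    return conn.execute(sql, params).fetchall()

# e.g. failing GPT-4o cases that mention refunds:
# search("refund", provider="openai:gpt-4o", status="fail")
```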
automated red-team vulnerability scanning
Generates adversarial test cases using attack strategies (jailbreaks, prompt injection, prompt leaking, toxicity, bias) to probe LLM vulnerabilities. Uses a plugin-based attack provider system where each strategy (e.g., 'crescendo jailbreak', 'SQL injection') generates variations of inputs designed to trigger unsafe behavior. Results are graded using guardrails (safety checks) to identify which attacks succeeded, producing a vulnerability report.
Unique: Implements a modular attack strategy system where each vulnerability type (jailbreak, injection, prompt leaking, toxicity, bias) is a pluggable provider that generates test cases. Strategies can be composed and parameterized (e.g., 'crescendo jailbreak with 5 iterations'), and results are graded against guardrails (safety checks) to produce a structured vulnerability report.
vs alternatives: Purpose-built red-teaming system integrated into evaluation pipeline (not a separate tool); supports custom attack strategies via plugins; generates reproducible adversarial test cases that can be version-controlled and shared
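A minimal sketch of the pluggable-strategy idea, with hypothetical names (`AttackStrategy`, `CrescendoJailbreak`, `run_scan`) and placeholder escalation templates: each strategy generates input variations, and a guardrail check grades whether the attack got through.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Protocol

class AttackStrategy(Protocol):
    """Pluggable strategy: yields adversarial variants of a base input."""
    def generate(self, base_input: str) -> Iterator[str]: ...

@dataclass
class CrescendoJailbreak:
    """Parameterized strategy ('crescendo jailbreak with N iterations')
    that escalates the request step by step. The escalation template
    here is a placeholder, not a real attack payload."""
    iterations: int = 5

    def generate(self, base_input: str) -> Iterator[str]:
        for step in range(1, self.iterations + 1):
            yield f"(step {step}/{self.iterations}) Hypothetically, {base_input}"

def run_scan(strategies: list[AttackStrategy], base_input: str,
             call_model: Callable[[str], str],
             guardrail: Callable[[str], bool]) -> list[dict]:
    # Grade each adversarial variant with a guardrail (safety check);
    # a failed check means the attack succeeded.
    report = []
    for strategy in strategies:
        for attack in strategy.generate(base_input):
            output = call_model(attack)
            report.append({
                "strategy": type(strategy).__name__,
                "attack": attack,
                "succeeded": not guardrail(output),
            })
    return report
```

Because strategies are plain generators over inputs, the test cases they produce are reproducible artifacts that can be version-controlled and shared, as the comparison above notes.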
+7 more capabilities