promptfoo
CLI Tool · Free. LLM prompt testing and evaluation: compare models, detect regressions, run assertions, integrate with CI/CD.
Capabilities (15 decomposed)
multi-provider prompt evaluation engine
Medium confidence: Executes the same prompt across multiple LLM providers (OpenAI, Anthropic, Google, AWS Bedrock, Ollama, local models) in parallel, collecting structured outputs with metadata (latency, token counts, cost). Uses a provider registry pattern with pluggable provider implementations that normalize API differences into a unified interface, enabling side-by-side comparison of model behavior on identical inputs.
Uses a pluggable provider registry pattern where each provider (OpenAI, Anthropic, Bedrock, Ollama, HTTP, Python scripts) implements a normalized interface, allowing new providers to be added without modifying core evaluation logic. Tracks cost per provider using model-specific pricing tables, enabling ROI analysis across providers.
Broader provider support (10+ integrations including local models) and native cost tracking than competitors like LangSmith or Weights & Biases, with zero-config local execution via Ollama
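For concreteness, a minimal promptfooconfig.yaml comparing one prompt across several providers might look like the sketch below. The provider ID syntax follows promptfoo's documented provider:model convention, but the specific model identifiers are illustrative:

```yaml
# promptfooconfig.yaml: run one templated prompt against three providers
prompts:
  - "Summarize the following text in two sentences: {{text}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
  - ollama:chat:llama3   # local model served by Ollama, no API cost

tests:
  - vars:
      text: "Large language models generate text by predicting the next token..."
```

Running `npx promptfoo eval` against this file produces a side-by-side matrix of outputs, latency, and cost per provider.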
assertion-based test grading with custom evaluators
Medium confidence: Defines test assertions (exact match, similarity, regex, LLM-based grading) that automatically evaluate whether model outputs meet criteria. Supports custom evaluator functions (JavaScript, Python, HTTP webhooks) that receive the prompt, output, and test case metadata, returning a pass/fail score and optional details. Assertions are composable and can be chained to create complex evaluation logic without writing test harnesses.
Supports four distinct assertion types (exact, similarity, regex, LLM-rubric) plus arbitrary custom evaluators (JS functions, Python scripts, HTTP webhooks), allowing teams to mix deterministic checks with LLM-based subjective evaluation in a single test suite. Custom evaluators receive full test context (prompt, output, variables, metadata) enabling sophisticated domain-specific grading.
More flexible assertion model than basic string matching in competitors; native support for LLM-as-judge grading without requiring separate evaluation pipeline setup
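As a sketch of how deterministic and subjective checks combine in a single test case (assertion type names follow the documented set; the Python grader path is hypothetical):

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains        # deterministic substring check
        value: "Paris"
      - type: similar         # embedding similarity against a reference answer
        value: "The capital of France is Paris."
        threshold: 0.8
      - type: llm-rubric      # LLM-as-judge against a freeform rubric
        value: "Answers the question correctly and concisely"
      - type: python          # custom evaluator in a local file (hypothetical name)
        value: file://grade_answer.py
```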
evaluation result persistence and historical tracking
Medium confidence: Stores evaluation results in a local SQLite database or cloud storage (AWS S3, Google Cloud Storage, etc.), enabling historical tracking of prompt quality over time. Results include full metadata (prompt, model, variables, outputs, scores, latency, cost). Enables trend analysis (e.g., 'pass rate improved 5% over last month') and regression detection by comparing against previous baselines.
Stores evaluation results in local SQLite or cloud storage with full metadata (prompt, model, variables, outputs, scores, latency, cost). Enables historical tracking and trend analysis. Results can be queried to detect regressions by comparing against previous baselines.
Integrated persistence (not a separate tool); supports both local and cloud storage; enables historical tracking and regression detection without external databases
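Persistence is automatic: each `promptfoo eval` run is written to the local results store, and `promptfoo view` serves the accumulated history for browsing. A config can additionally emit a portable copy per run; a minimal sketch, assuming the documented outputPath option:

```yaml
# Also write each run's results to a JSON file (filename arbitrary)
outputPath: eval-results.json
```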
aws bedrock and cloud provider integration
Medium confidence: Provides native integration with AWS Bedrock (Claude, Llama, Mistral models), Google Vertex AI, Azure OpenAI, and other cloud providers. Handles authentication (IAM roles, API keys), model selection, and parameter mapping. Enables teams to test against cloud-hosted models without writing custom provider code. Supports streaming responses for real-time output evaluation.
Native integration with AWS Bedrock, Google Vertex AI, and Azure OpenAI with support for cloud provider authentication (IAM roles). Handles model selection, parameter mapping, and streaming responses. Enables teams to test cloud-hosted models without custom integration code.
Broader cloud provider support than competitors; native IAM role support for better security; integrated streaming response handling
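A provider list mixing cloud-hosted models might look like the following sketch. Model and deployment identifiers are illustrative, and the Bedrock and Vertex entries assume credentials are already available through the usual AWS and GCP mechanisms (IAM roles, application default credentials):

```yaml
providers:
  - "bedrock:anthropic.claude-3-5-sonnet-20240620-v1:0"  # authenticates via AWS credentials/IAM
  - vertex:gemini-1.5-pro                                # uses gcloud application default credentials
  - azureopenai:chat:my-gpt4o-deployment                 # Azure uses the deployment name, not the model name
```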
python and node.js script provider execution
Medium confidence: Executes Python scripts (3.7+) and Node.js scripts (18+) as providers, passing prompt and variables as command-line arguments or stdin. Scripts can implement arbitrary logic (e.g., calling local models, preprocessing inputs, routing to multiple models). Output is captured from stdout and parsed as JSON or plain text. Enables teams to test custom inference logic without modifying promptfoo.
Supports Python and Node.js scripts as first-class providers, receiving prompt and variables as command-line arguments or stdin. Scripts can implement arbitrary logic (preprocessing, routing, local model calls). Output is captured from stdout and parsed as JSON or plain text.
More flexible than HTTP provider for local execution; enables testing of custom inference logic without external servers; supports both Python and Node.js
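Script providers are referenced by path in the config; a sketch, with hypothetical filenames and the entry-point contracts as documented (verify against the current provider docs):

```yaml
providers:
  - python:my_provider.py   # Python file expected to define call_api(prompt, options, context)
  - file://my_provider.js   # Node module exporting a provider with a callApi function
```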
ollama and local model integration
Medium confidence: Provides native integration with Ollama (local LLM inference engine) and compatible local model servers (LLaMA.cpp, LocalAI). Connects to local HTTP endpoints, enabling teams to test open-source models (Llama, Mistral, etc.) without cloud API costs or latency. Supports model selection, parameter tuning, and streaming responses.
Native Ollama integration with support for local model servers (LLaMA.cpp, LocalAI). Connects to local HTTP endpoints, enabling zero-cost local inference. Supports model selection, parameter tuning, and streaming responses.
Purpose-built for local model testing; enables cost-free evaluation of open-source models; supports multiple local model servers (Ollama, LLaMA.cpp, LocalAI)
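Assuming an Ollama server on its default local port, a provider entry with tuned sampling parameters might look like this sketch (the model name is illustrative, and the parameter names follow Ollama's API with pass-through assumed):

```yaml
providers:
  - id: ollama:chat:llama3
    config:
      temperature: 0.2
      num_predict: 256   # Ollama's max-output-tokens option
```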
evaluation result filtering and search
Medium confidence: Provides CLI and web UI search/filtering capabilities to navigate large evaluation result sets. Supports filtering by test case name, provider, model, pass/fail status, and custom metadata. Search uses full-text indexing for fast queries. Enables teams to quickly find specific test cases or failure patterns without manually reviewing all results.
Provides both CLI and web UI search/filtering with full-text indexing. Supports filtering by test case name, provider, model, status, and custom metadata. Enables fast navigation of large result sets without manual review.
Integrated search (not a separate tool); supports both CLI and web UI; enables efficient navigation of large result sets
automated red-team vulnerability scanning
Medium confidence: Generates adversarial test cases using attack strategies (jailbreaks, prompt injection, prompt leaking, toxicity, bias) to probe LLM vulnerabilities. Uses a plugin-based attack provider system where each strategy (e.g., 'crescendo jailbreak', 'SQL injection') generates variations of inputs designed to trigger unsafe behavior. Results are graded using guardrails (safety checks) to identify which attacks succeeded, producing a vulnerability report.
Implements a modular attack strategy system where each vulnerability type (jailbreak, injection, prompt leaking, toxicity, bias) is a pluggable provider that generates test cases. Strategies can be composed and parameterized (e.g., 'crescendo jailbreak with 5 iterations'), and results are graded against guardrails (safety checks) to produce a structured vulnerability report.
Purpose-built red-teaming system integrated into evaluation pipeline (not a separate tool); supports custom attack strategies via plugins; generates reproducible adversarial test cases that can be version-controlled and shared
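A declarative red-team configuration might look like the sketch below. Plugin and strategy names follow the documented naming scheme but vary by version, so treat these identifiers as illustrative; generation and execution then run via the redteam subcommands (e.g., `promptfoo redteam run`):

```yaml
redteam:
  purpose: "Customer-support chatbot for a retail bank"
  plugins:
    - harmful            # vulnerability classes to generate probes for
    - pii
  strategies:
    - jailbreak          # attack techniques applied to the generated cases
    - prompt-injection
```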
test configuration and variable substitution
Medium confidence: Defines test suites as YAML/JSON files with templated prompts, test cases, and variables. Supports variable substitution using {{variable}} syntax, allowing a single prompt template to be tested against multiple input combinations. Test cases can include expected outputs, assertions, and metadata. Configuration is declarative and version-controllable, enabling teams to track prompt changes over time.
Uses declarative YAML/JSON configuration with {{variable}} substitution syntax, allowing test suites to be defined without code. Configuration files are first-class artifacts that can be version-controlled, reviewed, and shared. Supports nested variables, array expansion, and metadata annotations on test cases.
More human-readable and version-control-friendly than programmatic test definition; enables non-technical stakeholders to contribute test cases without writing code
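A single templated prompt fanned out over several variable sets, with an assertion attached to one case:

```yaml
prompts:
  - "Translate '{{phrase}}' into {{language}}."

tests:
  - vars: { phrase: "good morning", language: "French" }
  - vars: { phrase: "good morning", language: "Japanese" }
    assert:
      - type: llm-rubric
        value: "Reads as a natural Japanese translation"
```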
ci/cd pipeline integration with regression detection
Medium confidence: Integrates evaluation results into CI/CD workflows via GitHub Actions, GitLab CI, or generic webhook triggers. Compares current evaluation results against baseline results to detect regressions (e.g., 'pass rate dropped from 95% to 90%'). Fails CI builds if regressions exceed configured thresholds, preventing degraded prompts from being merged. Results can be stored locally or uploaded to cloud storage for historical tracking.
Provides native GitHub Actions integration and generic webhook support for CI/CD platforms. Regression detection compares current results against baseline using configurable thresholds (pass rate, latency, cost). Results can be stored as artifacts or uploaded to cloud storage, enabling historical tracking and trend analysis.
Purpose-built for prompt evaluation in CI/CD (not a generic testing framework); detects regressions specific to LLM outputs (quality, latency, cost) rather than just test pass/fail
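promptfoo ships a dedicated GitHub Action, but a plain CLI invocation is enough to sketch the shape of a CI job without guessing at action inputs; the CLI exits non-zero when assertions fail, which is what fails the build:

```yaml
# .github/workflows/prompt-eval.yml (sketch; secret names are illustrative)
name: prompt-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run prompt evaluation
        run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```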
web-based results viewer and comparison ui
Medium confidence: Provides a local web interface (React-based frontend) for visualizing evaluation results, filtering by test case or provider, and comparing model outputs side-by-side. Results can be shared via shareable URLs (with optional cloud storage backend) or self-hosted. The UI supports real-time updates when new evaluation results are available, and includes search/filtering to navigate large result sets.
React-based frontend with real-time updates via WebSocket, supporting side-by-side comparison of model outputs with filtering/search. Results can be shared via shareable URLs (with optional cloud backend) or self-hosted. Includes red-team setup UI for configuring attack strategies interactively.
Integrated web UI (not a separate tool) with native support for sharing and self-hosting; real-time updates enable collaborative evaluation workflows
provider-agnostic http and script execution
Medium confidence: Supports custom providers via HTTP endpoints (POST requests with prompt/variables, returns output) and script execution (Python, Node.js, shell scripts). Allows teams to test against proprietary models, internal APIs, or custom inference servers without modifying promptfoo code. Scripts receive prompt and variables as arguments, execute locally, and return output to be graded.
Pluggable provider system allowing HTTP endpoints and local scripts (Python, Node.js, shell) to be treated as first-class providers. Scripts receive full test context (prompt, variables, metadata) and can implement arbitrary logic. HTTP provider enables integration with any inference server without code changes.
More flexible than competitors for integrating custom models; supports both HTTP APIs and local script execution, enabling teams to test proprietary or fine-tuned models
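An HTTP provider sketch, with a hypothetical endpoint and response shape; the body template and response transform follow the documented HTTP provider pattern:

```yaml
providers:
  - id: https://inference.internal.example.com/v1/generate   # hypothetical endpoint
    config:
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: "{{prompt}}"
        max_tokens: 256
      transformResponse: json.output   # extract the completion from the JSON reply
```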
cost and latency tracking across providers
Medium confidence: Automatically tracks API costs per provider using model-specific pricing tables (OpenAI, Anthropic, Google, AWS, etc.), and measures latency for each API call. Aggregates costs and latency by provider, test case, and overall suite. Enables cost-benefit analysis (e.g., 'GPT-4 is 10x more expensive but only 5% more accurate'). Pricing tables are updated with each release to reflect current API costs.
Maintains model-specific pricing tables for 10+ providers (OpenAI, Anthropic, Google, AWS, Azure, etc.) and automatically calculates costs based on token counts. Tracks latency per API call and aggregates by provider/test case. Pricing tables are updated with each release to reflect current API costs.
Native cost tracking (not a separate tool) with support for multiple providers; enables cost-benefit analysis across models without manual calculation
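Cost and latency are also assertable, so a budget regression fails the suite just like a quality regression; a sketch using the documented cost and latency assertion types (threshold values illustrative: dollars per test for cost, milliseconds for latency):

```yaml
defaultTest:
  assert:
    - type: cost
      threshold: 0.002   # fail any test that costs more than $0.002
    - type: latency
      threshold: 3000    # fail any call slower than 3000 ms
```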
prompt template processing with variable expansion
Medium confidence: Processes prompt templates with {{variable}} syntax, supporting variable substitution, array expansion (cartesian product of multiple variable values), and nested variable references. Allows a single prompt template to generate multiple test cases by expanding variables. Supports both simple string substitution and complex variable structures (objects, arrays).
Supports {{variable}} syntax with array expansion (cartesian product) and nested variable references. Allows a single prompt template to generate multiple test cases by expanding variable combinations. Handles both simple strings and complex variable structures (objects, arrays).
More flexible than simple string substitution; supports array expansion and nested variables, enabling compact test suite definitions
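A sketch of the expansion, assuming array-valued vars fan out combinatorially as described above:

```yaml
tests:
  - vars:
      tone: ["formal", "casual"]
      language: ["French", "German"]
    # expands into 4 test cases: formal/French, formal/German,
    # casual/French, casual/German
```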
llm-based grading with custom rubrics
Medium confidence: Uses another LLM (OpenAI, Anthropic, Google, etc.) to grade model outputs against custom rubrics. Rubrics are defined as text descriptions of evaluation criteria (e.g., 'Is the response accurate? Is it helpful? Is it concise?'). The grading LLM receives the prompt, output, and rubric, and returns a score (0-1) and reasoning. Enables subjective quality evaluation without manual review.
Integrates LLM-as-judge grading directly into evaluation pipeline using custom rubrics. Grading LLM receives full context (prompt, output, rubric) and returns score + reasoning. Supports any LLM provider, enabling teams to choose grading model independently of evaluation model.
Native LLM-based grading (not a separate tool); supports custom rubrics and any LLM provider; enables subjective quality evaluation at scale
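The grading model can be pinned independently of the models under test; a sketch, with the grader set via defaultTest options (grader model ID illustrative):

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o   # model used to grade llm-rubric assertions

tests:
  - vars: { question: "Explain DNS in one paragraph." }
    assert:
      - type: llm-rubric
        value: "Accurate, helpful, and no longer than one paragraph"
```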
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with promptfoo, ranked by overlap. Discovered automatically through the match graph.
Promptmetheus
ChatGPT prompt engineering...
Anthropic courses
Anthropic's educational courses.
Pezzo
Accelerate AI development with streamlined collaboration and deployment...
prompt-optimizer
An AI prompt optimizer for writing better prompts and getting better AI results.
Best For
- ✓teams evaluating which LLM provider to use for production
- ✓prompt engineers optimizing prompts across model families
- ✓organizations with multi-model strategies needing unified testing
- ✓QA engineers building automated test suites for LLM applications
- ✓teams with domain-specific grading logic (e.g., SQL correctness, code compilation)
- ✓organizations needing reproducible, version-controlled evaluation criteria
- ✓teams iterating on prompts over weeks/months and needing trend analysis
- ✓organizations with compliance requirements needing audit trails of prompt changes
Known Limitations
- ⚠Parallel execution speed limited by slowest provider (no timeout per provider by default)
- ⚠Cost accumulates across all providers — testing 10 prompts × 5 models = 50 API calls
- ⚠Provider API rate limits may throttle concurrent requests; no built-in backoff strategy per provider
- ⚠LLM-based graders add latency (~1-5s per test) and cost (additional API calls)
- ⚠Custom evaluator hooks are invoked synchronously; asynchronous work must be wrapped in a promise that resolves to the grade
- ⚠No built-in support for probabilistic or threshold-based grading (e.g., 'pass if 70% of evaluators agree')
About
CLI and library for testing and evaluating LLM prompts. Run prompts against multiple models, compare outputs, detect regressions. Features assertions, scoring, red teaming, and CI/CD integration. The standard tool for prompt testing.