MBPP+
Dataset · Free
Enhanced Python coding benchmark with rigorous testing.
Capabilities (10 decomposed)
extended-test-case-generation-for-code-problems
Medium confidence: Generates 35x more test cases per problem than the original MBPP benchmark by creating edge-case and boundary-condition tests beyond base inputs. The system uses a contract-based validation approach with input constraints (contract field), floating-point tolerance specifications (atol), and canonical solution execution to derive comprehensive test suites that expose fragile implementations passing only base tests.
Multiplies test coverage by 35x through systematic generation of plus_input test cases derived from canonical solutions and input contracts, rather than relying on manually curated test suites. Includes atol (absolute tolerance) fields for floating-point comparisons and contract specifications for input validation, enabling detection of solutions that pass base tests but fail on boundary conditions.
Provides roughly 35x more test cases per problem than the original MBPP (which ships only ~3 tests per task), catching incorrect implementations that pass minimal test suites and that HumanEval or raw MBPP would therefore miss.
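A minimal sketch of how such a test-expansion step can work, using hypothetical helper names (expand_tests, mutate, check) rather than the actual EvalPlus internals: candidate inputs are mutated from the base tests, filtered through the problem's contract, and expected outputs come from running the canonical solution, with atol applied to float comparisons.

```python
# Hypothetical sketch of contract-based test expansion; not EvalPlus internals.
import math
import random


def expand_tests(canonical_solution, contract, base_inputs, n_extra=100, atol=None):
    """Generate extra (input, expected_output, atol) triples beyond the base tests."""
    candidates = list(base_inputs)
    # Mutate base inputs to probe boundary conditions (zero, negative, large).
    for args in base_inputs:
        candidates.extend(mutate(args))
    random.shuffle(candidates)

    tests = []
    for args in candidates[:n_extra]:
        try:
            contract(*args)                       # reject inputs outside the spec
        except AssertionError:
            continue
        expected = canonical_solution(*args)      # ground truth defines the oracle
        tests.append((args, expected, atol))
    return tests


def mutate(args):
    """Very small example mutator that only perturbs integer arguments."""
    out = []
    for i, a in enumerate(args):
        if isinstance(a, int):
            for v in (0, -a, a + 1, a * 1000):
                out.append(tuple(args[:i]) + (v,) + tuple(args[i + 1:]))
    return out


def check(result, expected, atol):
    """Compare with optional floating-point tolerance, mirroring the atol field."""
    if atol is not None and isinstance(expected, float):
        return math.isclose(result, expected, abs_tol=atol)
    return result == expected
```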
safe-isolated-code-execution-with-resource-limits
Medium confidence: Executes untrusted LLM-generated Python code in isolated processes with multi-layer sandboxing: process isolation via multiprocessing, memory limits (default 4GB via EVALPLUS_MAX_MEMORY_BYTES), dynamically calculated time limits based on canonical solution execution time, I/O suppression via swallow_io, and system call guards via reliability_guard. Each sample runs in a separate process with shared memory for inter-process communication.
Combines process isolation, memory limits, dynamic timeout calculation (based on canonical solution execution), I/O suppression, and system call guards in a single execution pipeline. Timeout is not fixed but derived from ground-truth execution time, preventing both premature termination of slow-but-correct solutions and runaway execution of inefficient code.
More comprehensive than simple timeout-based execution (e.g., raw subprocess calls) by adding memory limits, I/O suppression, and system call guards; more flexible than fixed timeouts by dynamically calibrating to canonical solution performance.
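A hedged sketch of the same idea: process isolation plus a memory cap and a timeout derived from the ground-truth runtime. It mirrors the description above rather than EvalPlus's actual execution code, and the resource module it relies on is POSIX-only.

```python
# Illustrative process-isolated execution with memory cap and dynamic timeout.
import multiprocessing as mp
import resource


def _worker(code, entry_point, args, result_queue, max_memory_bytes):
    # Cap the address space of this child process only (POSIX systems).
    resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    namespace = {}
    try:
        exec(code, namespace)                     # define the candidate function
        result_queue.put(("ok", namespace[entry_point](*args)))
    except BaseException as exc:                  # includes MemoryError
        result_queue.put(("error", repr(exc)))


def run_isolated(code, entry_point, args, gt_time_s, max_memory_bytes=4 * 2**30):
    """Run untrusted code in a child process; timeout scales with ground-truth time."""
    timeout = max(1.0, 4.0 * gt_time_s)           # dynamic, not a fixed constant
    queue = mp.Queue()
    proc = mp.Process(target=_worker,
                      args=(code, entry_point, args, queue, max_memory_bytes))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        return ("timeout", None)
    return queue.get() if not queue.empty() else ("crash", None)
```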
pass-at-k-metric-calculation-for-code-generation
Medium confidence: Calculates pass@k metrics by executing k independent code samples per problem and computing the probability that at least one passes all test cases. Aggregates results across the full problem set to produce benchmark-wide pass@k scores. Supports multiple k values (k=1, 5, 10, etc.) to measure model robustness and sample efficiency.
Implements pass@k calculation across extended test suites (35x more tests than original MBPP), making the metric more stringent and revealing model weaknesses that pass@k on minimal test coverage would miss. Aggregates results across 378 problems with comprehensive test coverage per problem.
More rigorous than pass@k on original MBPP (which uses ~3 tests per problem) because extended test suites expose fragile solutions; comparable to HumanEval+ but with 2.3x more problems (378 vs 164 tasks).
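For reference, the standard unbiased pass@k estimator (introduced with the HumanEval/Codex work and commonly used by EvalPlus-style harnesses): with n samples per problem of which c pass all tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems.

```python
# Standard unbiased pass@k estimator from the Codex/HumanEval paper.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """results maps task_id -> per-sample pass/fail over the extended test suite."""
    scores = [pass_at_k(len(s), sum(s), k) for s in results.values()]
    return sum(scores) / len(scores)


# Example: 10 samples for one task, 6 of which pass, evaluated at k = 1.
print(benchmark_pass_at_k({"Mbpp/2": [True] * 6 + [False] * 4}, k=1))  # 0.6
```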
code-sanitization-and-safety-preprocessing
Medium confidence: Preprocesses LLM-generated code before execution by removing or neutralizing potentially dangerous constructs: strips import statements that could access system resources, removes eval/exec calls, sanitizes file I/O operations, and disables network access. The sanitize.py module applies these transformations while preserving functional code logic, enabling safe execution of untrusted code without manual review.
Applies pattern-based sanitization to remove dangerous constructs (imports, eval/exec, file I/O, network access) before execution, complementing process-level isolation. Works in conjunction with reliability_guard's system-call filtering to provide defense in depth against malicious or accidentally harmful code.
Combines code-level sanitization (removing dangerous constructs) with process-level isolation (memory/time limits, system call guards), providing layered defense; simpler than full AST-based code analysis but faster and more practical for high-volume evaluation.
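An illustrative regex-based sanitization pass in the spirit of the description above; this is not the actual sanitize.py, just a sketch of commenting out dangerous constructs before handing code to the sandbox.

```python
# Illustrative pattern-based sanitizer; not the actual evalplus sanitize module.
import re

DANGEROUS = [
    r"^\s*import\s+(os|sys|subprocess|socket|shutil)\b",
    r"^\s*from\s+(os|sys|subprocess|socket|shutil)\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"\bopen\s*\(",
]


def sanitize(code: str) -> str:
    """Comment out lines matching known-dangerous patterns, keep the rest intact."""
    out = []
    for line in code.splitlines():
        if any(re.search(pattern, line) for pattern in DANGEROUS):
            out.append("# [sanitized] " + line)
        else:
            out.append(line)
    return "\n".join(out)
```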
multi-backend-llm-code-generation-with-provider-abstraction
Medium confidence: Provides unified interface for code generation across 8+ LLM providers (vLLM, HuggingFace, OpenAI, Anthropic, Google Gemini, AWS Bedrock, Ollama) through a provider abstraction layer. Each provider implements a common interface for prompt submission, sampling, and result retrieval, enabling seamless switching between models without changing evaluation code. Supports batch generation and configurable sampling parameters (temperature, top_p, max_tokens).
Implements provider abstraction layer supporting 8+ LLM backends (vLLM, HuggingFace, OpenAI, Anthropic, Google Gemini, AWS Bedrock, Ollama) through common interface in evalplus/provider/__init__.py, enabling single evaluation pipeline to work across local and cloud models without code changes. Supports both local inference (vLLM, Ollama) and cloud APIs with unified sampling parameter handling.
More comprehensive provider support than single-model evaluation frameworks; more flexible than hardcoded provider integrations by using abstraction layer pattern; enables fair comparison across providers by normalizing sampling parameters and result formats.
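A minimal sketch of the provider-abstraction pattern, using hypothetical class names rather than the evalplus/provider package: every backend exposes the same generate() signature, so the evaluation pipeline never branches on the provider.

```python
# Hypothetical provider abstraction; actual backend calls are omitted.
from abc import ABC, abstractmethod


class CodeProvider(ABC):
    def __init__(self, model: str, temperature: float = 0.0, max_tokens: int = 1024):
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens

    @abstractmethod
    def generate(self, prompt: str, num_samples: int) -> list[str]:
        """Return num_samples code completions for one problem prompt."""


class OpenAIProvider(CodeProvider):
    def generate(self, prompt, num_samples):
        # A real implementation would call the OpenAI API here.
        raise NotImplementedError


class VllmProvider(CodeProvider):
    def generate(self, prompt, num_samples):
        # A real implementation would run a local vLLM engine here.
        raise NotImplementedError


def make_provider(backend: str, **kwargs) -> CodeProvider:
    """Dispatch on a backend name so evaluation code stays provider-agnostic."""
    registry = {"openai": OpenAIProvider, "vllm": VllmProvider}
    return registry[backend](**kwargs)
```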
performance-evaluation-via-cpu-instruction-counting
Medium confidence: Measures code efficiency using CPU instruction counting (via Linux perf) rather than wall-clock time, providing hardware-independent performance metrics. Generates performance-exercising inputs with exponential scaling (2^1 to 2^26) to stress-test algorithms, filters tasks based on profile size and compute cost, and produces EvalPerf dataset with instruction count baselines for each problem.
Uses CPU instruction counting via Linux perf instead of wall-clock time, providing hardware-independent performance metrics. Generates exponentially-scaled performance-exercising inputs (2^1 to 2^26) to stress-test algorithms and expose inefficient implementations. Filters tasks based on profile size, compute cost, coefficient of variation, and performance clustering to create manageable EvalPerf dataset.
More rigorous than wall-clock time measurement (which varies with system load) and more practical than full algorithmic complexity analysis; provides objective hardware-independent performance baseline for comparing generated code efficiency.
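A hedged sketch of counting CPU instructions with Linux perf stat instead of wall-clock timing; it assumes perf is installed with access to hardware counters, and it is an illustration rather than the EvalPerf profiler.

```python
# Illustrative instruction counting via `perf stat`; requires Linux perf.
import subprocess


def count_instructions(script_path: str) -> int:
    """Run a Python script under `perf stat` and parse the instruction count."""
    cmd = ["perf", "stat", "-e", "instructions:u", "-x", ",", "python3", script_path]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # perf stat writes its CSV report to stderr; the first field is the counter value.
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if "instructions" in line and fields and fields[0].strip().isdigit():
            return int(fields[0])
    raise RuntimeError("could not parse perf output")


# Feeding the script exponentially scaled inputs (2^1 .. 2^26) then lets you
# compare instruction counts across candidate solutions independent of hardware.
```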
structured-dataset-management-with-metadata-fields
Medium confidence: Organizes code problems as structured objects with standardized metadata fields: base_input (original test cases), plus_input (extended test cases), contract (input validation constraints), atol (floating-point tolerance), canonical_solution (ground truth implementation), and entry_point (function name). Provides dataset loading, filtering, and iteration utilities through evalplus/data/__init__.py, enabling programmatic access to 378 MBPP+ problems with consistent schema.
Provides standardized schema for 378 MBPP+ problems with fields for base/extended test cases (base_input, plus_input), input validation (contract), floating-point tolerance (atol), ground truth (canonical_solution), and function entry point. Enables programmatic dataset access through consistent interface rather than raw JSON files.
More structured than raw JSON dataset files; provides consistent schema across all problems enabling reliable programmatic access; includes extended test cases (plus_input) and validation constraints (contract) not present in original MBPP.
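Loading the dataset through the documented evalplus.data interface looks roughly like the following (per the EvalPlus README; exact field names may differ across releases).

```python
# Dataset access sketch; field names follow the EvalPlus docs and may vary by version.
from evalplus.data import get_mbpp_plus

problems = get_mbpp_plus()                 # dict: task_id -> problem record
task_id, problem = next(iter(problems.items()))

print(task_id, problem["entry_point"])
print("base tests:", len(problem["base_input"]))
print("extra tests:", len(problem["plus_input"]))
print("float tolerance:", problem.get("atol", 0))
```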
command-line-evaluation-pipeline-orchestration
Medium confidence: Provides CLI tools (evalplus.evaluate, evalplus.codegen, evalplus.evalperf, evalplus.sanitize) that orchestrate the complete evaluation workflow: code generation from LLM → sanitization → correctness evaluation → optional performance evaluation. Each CLI tool accepts configuration parameters (model, dataset, sampling params) and produces structured output (JSON results, pass@k metrics, performance data). Enables end-to-end benchmark execution without writing custom Python code.
Provides four integrated CLI tools (evalplus.codegen, evalplus.evaluate, evalplus.evalperf, evalplus.sanitize) that chain together to form complete evaluation pipeline: generation → sanitization → correctness evaluation → performance evaluation. Each tool accepts configuration parameters and produces structured JSON output, enabling end-to-end benchmark execution from command line.
More integrated than individual tools (e.g., separate code generation and evaluation scripts); more accessible than programmatic API for non-developers; enables reproducible evaluation workflows via CLI commands.
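A sketch of driving that pipeline from Python via subprocess; the flag names are assumed from the EvalPlus README as of this writing and should be checked against `evalplus.evaluate --help` on the installed version.

```python
# Assumed CLI flags; verify against your installed EvalPlus version.
import subprocess

model = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder; any supported model id

# Generate samples, then evaluate them against the MBPP+ extended tests.
subprocess.run(
    ["evalplus.codegen", "--model", model, "--dataset", "mbpp",
     "--backend", "vllm", "--greedy"],
    check=True,
)
subprocess.run(
    ["evalplus.evaluate", "--model", model, "--dataset", "mbpp",
     "--backend", "vllm", "--greedy"],
    check=True,
)
```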
batch-code-generation-with-configurable-sampling
Medium confidence: Generates multiple code samples per problem with configurable sampling parameters (temperature, top_p, max_tokens, num_samples) through evalplus.codegen CLI and codegen.py module. Supports batch processing across all 378 MBPP+ problems, with results organized by problem ID and sample index. Integrates with multi-provider LLM abstraction to support diverse model backends without code changes.
Integrates with multi-provider LLM abstraction to generate code samples across vLLM, OpenAI, Anthropic, Google, AWS, and Ollama without provider-specific code. Supports configurable sampling (temperature, top_p, max_tokens) and batch processing across 378 problems with results organized by problem ID and sample index.
More flexible than fixed-temperature generation by supporting configurable sampling parameters; more convenient than manual per-provider code generation by using unified provider abstraction; enables fair comparison across models by normalizing generation parameters.
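A short sketch of the batching layer, assuming a hypothetical generate() callable supplied by the provider abstraction: one JSON line is written per (task_id, sample index) pair so downstream evaluation can group samples.

```python
# Illustrative batch sampling loop; generate() is a hypothetical provider callable.
import json


def batch_generate(problems: dict, generate, num_samples=10,
                   temperature=0.8, top_p=0.95, max_tokens=1024,
                   out_path="samples.jsonl"):
    """Write one JSON line per (task_id, sample) so results stay grouped by problem."""
    with open(out_path, "w") as f:
        for task_id, problem in problems.items():
            completions = generate(problem["prompt"], num_samples,
                                   temperature=temperature, top_p=top_p,
                                   max_tokens=max_tokens)
            for idx, code in enumerate(completions):
                f.write(json.dumps({"task_id": task_id,
                                    "sample_index": idx,
                                    "solution": code}) + "\n")
```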
comprehensive-test-result-aggregation-and-reporting
Medium confidence: Aggregates execution results across all 378 problems and k samples to produce comprehensive benchmark metrics: pass@k scores, per-problem pass/fail results, sample-level execution details (timeout, memory exceeded, exception), and statistical summaries (mean, std dev, confidence intervals). Results are organized hierarchically (benchmark → problem → sample) and exported as structured JSON for further analysis and visualization.
Aggregates execution results hierarchically (benchmark → problem → sample) with detailed error classification (timeout, memory exceeded, exception) and produces pass@k metrics across extended test suites (35x more tests than original MBPP). Exports structured JSON results enabling downstream analysis and visualization.
More detailed than simple pass/fail counting by including error classification and per-sample execution details; more structured than flat result lists by organizing results hierarchically; enables fine-grained analysis of model failures.
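A sketch of such hierarchical aggregation with simple error classification; the field names are illustrative rather than the exact EvalPlus results schema, and the pass@k term uses the standard unbiased estimator.

```python
# Illustrative result aggregation: benchmark -> problem -> sample, with error counts.
from collections import Counter
from math import comb


def aggregate(sample_results, k=1):
    """sample_results: list of dicts with task_id, sample_index, and status
    ('pass', 'fail', 'timeout', 'memory_exceeded', 'exception')."""
    by_task = {}
    for r in sample_results:
        by_task.setdefault(r["task_id"], []).append(r)

    report = {"problems": {}, "error_breakdown": Counter(), "pass_at_k": None}
    scores = []
    for task_id, samples in by_task.items():
        n = len(samples)
        c = sum(s["status"] == "pass" for s in samples)
        report["error_breakdown"].update(
            s["status"] for s in samples if s["status"] != "pass")
        score = 1.0 if n - c < k else 1.0 - comb(n - c, k) / comb(n, k)
        scores.append(score)
        report["problems"][task_id] = {"n": n, "passed": c,
                                       "pass_at_k": score, "samples": samples}
    report["pass_at_k"] = sum(scores) / len(scores) if scores else 0.0
    report["error_breakdown"] = dict(report["error_breakdown"])
    # The returned dict is JSON-serializable for downstream analysis and plots.
    return report
```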
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with MBPP+, ranked by overlap. Discovered automatically through the match graph.
MBPP (Mostly Basic Python Problems)
974 basic Python problems complementing HumanEval for code evaluation.
CodeContests
13K competitive programming problems from AlphaCode research.
phantom-lens
A Cluely / Interview Coder alternative with features we probably shouldn’t talk about, built for winning exams.
HumanEval
OpenAI's code generation benchmark — 164 Python problems with unit tests, pass@k evaluation.
Kwaipilot: KAT-Coder-Pro V2
KAT-Coder-Pro V2 is the latest high-performance model in KwaiKAT’s KAT-Coder series, designed for complex enterprise-grade software engineering and SaaS integration. It builds on the agentic coding strengths of earlier versions,...
Mutable AI
AI-Accelerated Software Development
Best For
- ✓ ML researchers evaluating code generation models (GPT, Claude, open-source LLMs)
- ✓ Benchmark designers creating rigorous evaluation suites
- ✓ Teams building code synthesis systems that need ground-truth validation
- ✓ Code evaluation platforms running untrusted LLM-generated code
- ✓ Benchmark evaluation systems requiring safe execution of thousands of code samples
- ✓ Research teams evaluating code generation models in isolated environments
- ✓ ML researchers publishing code generation benchmarks
- ✓ Model evaluation teams comparing LLM code generation capabilities
Known Limitations
- ⚠ Test case generation is Python-specific; no support for other programming languages
- ⚠ Extended test suites run ~35x more tests per problem, substantially increasing evaluation runtime compared to original MBPP
- ⚠ Floating-point tolerance (atol) handling may not cover all numerical precision edge cases across different implementations
- ⚠ Contract-based validation requires manual specification per problem; it is not automatically inferred
- ⚠ Process isolation adds overhead (~50-200 ms per execution); not suitable for real-time code execution
- ⚠ Memory limits are global (4 GB default) and may be insufficient for memory-intensive algorithms
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Enhanced version of the Mostly Basic Python Problems benchmark with 35x more test cases per problem, providing rigorous evaluation of code generation models by catching solutions that pass original tests but are incorrect.
Categories
Alternatives to MBPP+
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
FLUX, Stable Diffusion, SDXL, SD3, LoRA, Fine Tuning, DreamBooth, Training, Automatic1111, Forge WebUI, SwarmUI, DeepFake, TTS, Animation, Text To Video, Tutorials, Guides, Lectures, Courses, ComfyUI, Google Colab, RunPod, Kaggle, NoteBooks, ControlNet, Voice Cloning, AI, AI News, ML, ML News
Data Sources