Pass At K Scoring With Multiple Generation Attempts

1

xCodeEval · Benchmark · 64/100

via “program synthesis task generation and evaluation with pass@k metrics”

Multilingual code evaluation across 17 languages.

Unique: Implements a three-phase evaluation pipeline (Generation → Execution → Metrics) with explicit pass@k computation that measures the probability of finding a correct solution within k attempts, rather than just binary pass/fail. Supports multi-sample evaluation across 17 languages with language-specific compiler configurations and timeout handling.

vs others: More rigorous than HumanEval's simple pass@k because it handles language-specific compilation errors and timeouts explicitly, and scales to 25M training examples vs HumanEval's 164 problems.
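The execution phase described above can be sketched as follows. This is a minimal, single-language stand-in and not xCodeEval's actual execution engine: the function names `run_candidate` and `count_correct`, the verdict strings, and the 5-second default timeout are all assumptions made for illustration. It runs each candidate as a Python subprocess, separates timeouts and crashes from wrong answers, and returns the counts that a pass@k computation would need.

```python
import subprocess
import sys


def run_candidate(source: str, stdin_text: str, expected: str,
                  timeout_s: float = 5.0) -> str:
    """Run one candidate program and return a verdict string (hypothetical labels)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return "timeout"
    if proc.returncode != 0:
        # For a compiled language, a separate compile step would report
        # "compile_error" before the program is ever executed.
        return "runtime_error"
    return "passed" if proc.stdout.strip() == expected.strip() else "wrong_answer"


def count_correct(candidates, stdin_text, expected):
    """Return (n, c, verdicts): samples run, samples passed, per-sample verdicts."""
    verdicts = [run_candidate(src, stdin_text, expected) for src in candidates]
    return len(verdicts), verdicts.count("passed"), verdicts
```

Keeping the verdicts distinct, rather than collapsing everything into pass/fail, is what allows failures to be broken down by cause before the metrics phase.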

2

MBPP+ · Benchmark · 63/100

via “pass@k metric calculation with configurable sample aggregation”

Enhanced Python coding benchmark with rigorous testing.

Unique: Implements the pass@k metric with the combinatorial formula 1 - C(n-c, k)/C(n, k), where n is the number of samples generated per problem and c is the number that pass, rather than empirical sampling, enabling exact calculation without Monte Carlo approximation. Supports configurable k values and aggregation across problems, enabling multi-level analysis (per-problem, per-category, dataset-wide).

vs others: More statistically rigorous than simple accuracy metrics because it accounts for sampling variance and model reliability; enables fair comparison between models with different single-shot accuracy but similar pass@k. Combinatorial calculation is faster and more precise than empirical sampling approaches.
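As a rough illustration of the combinatorial formula, the sketch below computes pass@k for a single problem with Python's standard `math.comb`. The function name `pass_at_k` and its argument checks are assumptions for this example, not MBPP+'s own code; `n` is the number of samples generated, `c` the number that passed, and `k` the attempt budget.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the result is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3 up to floating-point rounding, which matches the plain pass rate c/n.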

3

LiveCodeBench · Benchmark · 62/100

via “pass@k scoring with multiple generation attempts”

Continuously updated coding benchmark: new competitive programming problems are added over time to prevent contamination.

Unique: Applies pass@k metric from prior code generation benchmarks (HumanEval, MBPP) to LiveCodeBench's continuously-updated problem set, enabling fair comparison of models with different generation strategies while accounting for sampling variance inherent in LLM outputs.

vs others: More realistic than pass@1 metrics because it acknowledges that LLMs generate stochastically and users can sample multiple times; fairer than fixed-temperature evaluation because it doesn't penalize models with higher generation diversity.

4

HumanEval · Benchmark · 61/100

via “pass@k metric calculation with unbiased statistical estimation”

OpenAI's code generation benchmark — 164 Python problems with unit tests, pass@k evaluation.

Unique: Implements the unbiased pass@k estimator: generate n ≥ k samples per problem, count the c samples that pass the unit tests, and compute 1 - C(n-c, k)/C(n, k). Because this models drawing k of the n samples without replacement (a hypergeometric draw), it avoids the bias of the naive plug-in estimate 1 - (1 - c/n)^k, which tends to underestimate pass@k.

vs others: More statistically rigorous than the naive pass@k calculation (which plugs the empirical pass rate into an independence formula) because it uses the unbiased estimator, enabling fair comparison of models with different sample budgets.
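To make the difference between the two calculations concrete, here is a small sketch that puts the naive plug-in estimate next to the unbiased one. The function names are hypothetical. The unbiased version uses the product form of C(n-c, k)/C(n, k), which avoids very large binomial coefficients when n is in the hundreds.

```python
def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k)/C(n, k), evaluated as a running product for stability."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    ratio = 1.0
    # C(n-c, k)/C(n, k) equals the product of (1 - k/i) for i = n-c+1 .. n.
    for i in range(n - c + 1, n + 1):
        ratio *= 1.0 - k / i
    return 1.0 - ratio


def naive_pass_at_k(n: int, c: int, k: int) -> float:
    """Plug the empirical pass rate c/n into 1 - (1 - p)^k (biased)."""
    return 1.0 - (1.0 - c / n) ** k
```

With n = 10, c = 3, and k = 5, the naive estimate is about 0.832 while the unbiased estimate is about 0.917, which illustrates the direction of the gap on a small sample.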

5

MBPP (Mostly Basic Python Problems) · Dataset · 56/100

via “pass@k metric computation and aggregation”

974 basic Python problems complementing HumanEval for code evaluation.

Unique: Implements the standard pass@k metric used across code generation research, enabling direct comparison with published results; a problem counts as solved if any of k attempts passes its tests, reflecting real-world usage where multiple attempts are feasible.

vs others: More realistic than pass@1 alone because it accounts for the fact that code generation models can produce multiple solutions; standardized metric enables comparison across papers and research groups; computationally tractable for k up to 100 on 974 problems.
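The aggregation step can be sketched as a mean of per-problem pass@k scores, reported for several values of k. Everything named here is an assumption for illustration: the `results` mapping from a problem ID to `(n, c)` (samples generated, samples passed), the function names, and the default `ks`. It is not the dataset's reference evaluation script.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem pass@k from n samples with c correct; requires k <= n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def dataset_pass_at_k(results: dict[str, tuple[int, int]],
                      ks: tuple[int, ...] = (1, 10, 100)) -> dict[int, float]:
    """Average per-problem pass@k over all problems, for each requested k."""
    return {
        k: mean(pass_at_k(n, c, k) for n, c in results.values())
        for k in ks
    }
```

A call such as `dataset_pass_at_k({"task_1": (10, 3), "task_2": (10, 0)}, ks=(1, 5))` returns one dataset-wide score per k. Every problem must have at least max(ks) samples, otherwise `pass_at_k` raises `ValueError` rather than silently changing the denominator.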

6

AlphaCodium · Repository · 46/100

via “batch dataset processing with pass@k evaluation metrics”

Official implementation for the paper "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering".

Unique: Implements pass@K evaluation as a first-class metric, generating multiple solution candidates per problem and evaluating them to compute pass rates at different K values. This enables measuring the probability that at least one of K attempts solves the problem, which is more realistic than single-attempt metrics.

vs others: Provides pass@K metrics that account for multiple attempts, giving a more realistic picture of system performance than single-attempt pass rates, and enables comparison with other code generation systems using standard evaluation methodology.
