Mathematical discoveries from program search with large language models (FunSearch)
Capabilities (6 decomposed)
program-space search with llm-guided exploration
Medium confidence: Searches through discrete program spaces (e.g., algorithm implementations, mathematical proofs) by using an LLM as a heuristic guide to propose candidate programs, then evaluates them against test cases or mathematical constraints. The system iteratively refines the search by learning from successful and failed program attempts, effectively treating program synthesis as a guided exploration problem rather than pure generation.
Uses LLM as a learned heuristic within a structured search loop rather than as a one-shot generator, combining neural guidance with deterministic evaluation to explore discrete program spaces. Implements iterative refinement where the LLM learns from failed attempts through in-context examples, enabling discovery of solutions outside typical training data distributions.
Outperforms pure LLM code generation by grounding proposals in executable feedback, and outperforms traditional program synthesis by leveraging learned heuristics to prune the search space intelligently rather than relying on exhaustive enumeration or hand-crafted rules.
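The propose-evaluate loop described above can be sketched in a few lines. This is a minimal illustration, not FunSearch's implementation: `mock_llm_propose` is a hypothetical stand-in that perturbs the current best program the way a real LLM would rewrite it, and the toy task (fitting `f(x) = 2*x + 1` against four test cases) is an assumption chosen to keep the example self-contained.

```python
import random

TESTS = [(0, 1), (1, 3), (2, 5), (3, 7)]  # toy target: f(x) = 2*x + 1

def evaluate(src):
    """Deterministic evaluator: score a candidate by test cases passed."""
    env = {}
    try:
        exec(src, env)                      # compile the candidate program
        f = env["f"]
        return sum(1 for x, y in TESTS if f(x) == y)
    except Exception:
        return -1                           # broken programs score worst

def mock_llm_propose(best_src, rng):
    """Hypothetical LLM proxy: perturb coefficients of the best program."""
    a, b = rng.randint(0, 4), rng.randint(0, 4)
    return f"def f(x):\n    return {a} * x + {b}"

def guided_search(iterations=300, seed=0):
    rng = random.Random(seed)
    best_src = "def f(x):\n    return x"
    best_score = evaluate(best_src)
    for _ in range(iterations):
        # Proposals are conditioned on search progress (the current best),
        # not generated one-shot; grounding comes from the evaluator.
        cand = mock_llm_propose(best_src, rng)
        score = evaluate(cand)
        if score > best_score:              # keep improvements only
            best_src, best_score = cand, score
    return best_src, best_score
```

The key structural point is the separation of concerns: the (mocked) model only proposes, while a deterministic evaluator decides what survives.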
iterative program refinement with failure-driven learning
Medium confidence: Maintains a feedback loop where failed program attempts are converted into in-context examples that guide the LLM toward better proposals in subsequent iterations. The system tracks which program structures, algorithmic patterns, and constraint violations led to failures, then uses this history to steer the LLM away from unpromising regions of the solution space.
Implements a closed-loop learning system where failure information is explicitly encoded into prompts as negative examples, allowing the LLM to adapt its generation strategy without fine-tuning. Uses the LLM's in-context learning capability as a lightweight alternative to gradient-based optimization.
More sample-efficient than pure random search because failures directly inform future proposals, and faster than fine-tuning-based approaches because it avoids retraining overhead while still adapting to problem-specific constraints.
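The failure-to-prompt mechanism can be sketched as follows. The prompt format and the `FailureMemory` class are assumptions for illustration; the point is that failed attempts live in a bounded buffer (since context windows are finite, as noted under Known Limitations) and are rendered as negative examples in the next prompt, so adaptation happens in-context rather than via fine-tuning.

```python
from collections import deque

class FailureMemory:
    """Bounded buffer of failed attempts, rendered into the next prompt."""

    def __init__(self, max_examples=10):
        # deque(maxlen=...) drops the oldest failure once full, modeling
        # the context-window cap on how many examples can be retained.
        self.failures = deque(maxlen=max_examples)

    def record(self, program_src, error):
        self.failures.append((program_src, error))

    def build_prompt(self, task_description):
        lines = [task_description, ""]
        if self.failures:
            lines.append("Avoid the mistakes in these failed attempts:")
            for i, (src, err) in enumerate(self.failures, 1):
                lines.append(f"# Failure {i}: {err}")
                lines.append(src)
        lines.append("Write an improved program:")
        return "\n".join(lines)

memory = FailureMemory(max_examples=2)
memory.record("def f(x): return x", "wrong on all test cases")
memory.record("def f(x): return 2*x", "off by one on every input")
prompt = memory.build_prompt("Implement f(x) = 2*x + 1.")
```

Because the buffer is bounded, a production system would also need a diversity policy for which failures to keep, per the overfitting caveat in Known Limitations.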
constraint-aware program generation with multi-objective evaluation
Medium confidence: Generates program candidates that must satisfy multiple evaluation criteria simultaneously (e.g., correctness on test cases, runtime performance, code simplicity, mathematical elegance). The system ranks candidates by a composite score that balances these objectives, allowing users to explore trade-offs between solution quality dimensions.
Embeds multi-objective evaluation directly into the program search loop, allowing the LLM to see composite scores and trade-offs during generation. This differs from post-hoc ranking because the LLM can learn which objective combinations are achievable and adjust proposals accordingly.
More nuanced than single-metric optimization because it exposes solution trade-offs, and more practical than pure Pareto enumeration because the LLM's guidance reduces the number of candidates that need evaluation.
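A composite-score ranking of the kind described above might look like this. The metric names and weights are hypothetical, not FunSearch's actual objective; the sketch only shows how a weighted sum exposes trade-offs between candidates during ranking.

```python
def composite_score(candidate, weights):
    """Weighted sum across objectives (assumed metric names)."""
    return sum(weights[k] * candidate[k] for k in weights)

def rank_candidates(candidates, weights):
    # Highest composite score first; ties broken by correctness alone.
    return sorted(
        candidates,
        key=lambda c: (composite_score(c, weights), c["correctness"]),
        reverse=True,
    )

weights = {"correctness": 0.7, "speed": 0.2, "simplicity": 0.1}
candidates = [
    {"name": "brute_force", "correctness": 1.0, "speed": 0.2, "simplicity": 0.9},
    {"name": "heuristic",   "correctness": 0.8, "speed": 0.9, "simplicity": 0.5},
    {"name": "broken",      "correctness": 0.1, "speed": 1.0, "simplicity": 1.0},
]
ranked = rank_candidates(candidates, weights)
```

Feeding the composite scores (not just the winner) back into the prompt is what distinguishes in-loop evaluation from post-hoc ranking.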
domain-specific program synthesis with problem-aware prompting
Medium confidence: Tailors LLM prompts to specific problem domains (e.g., combinatorial optimization, mathematical sequences, algorithm design) by embedding domain knowledge, common patterns, and successful solution templates into the prompt context. The system adapts its generation strategy based on the problem class, improving proposal quality without retraining.
Encodes domain expertise as structured prompt context rather than as hard-coded rules or fine-tuned models, enabling rapid adaptation to new domains while maintaining the generality of the underlying LLM. Uses problem-aware prompting to guide the LLM toward domain-appropriate solutions.
More flexible than domain-specific code generators because it leverages the LLM's general reasoning, and more practical than generic program synthesis because domain knowledge directly improves proposal quality and reduces search time.
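Problem-aware prompt assembly can be sketched as a small registry of domain context. The registry contents and template text here are illustrative assumptions; the structural point is that domain expertise lives in prompt context rather than in model weights or hard-coded rules.

```python
# Hypothetical domain registry; entries would be curated per problem class.
DOMAIN_CONTEXT = {
    "combinatorial": (
        "Common patterns: greedy construction, local search, "
        "symmetry-aware pruning."
    ),
    "sequences": (
        "Common patterns: recurrence relations, generating functions, "
        "modular-arithmetic checks."
    ),
}

def build_domain_prompt(domain, problem_statement, exemplar=None):
    """Assemble a prompt from domain knowledge plus the concrete problem."""
    parts = [f"Domain: {domain}", DOMAIN_CONTEXT[domain], problem_statement]
    if exemplar:
        # Optionally include a previously successful solution template.
        parts.append("Reference solution template:\n" + exemplar)
    return "\n\n".join(parts)

prompt = build_domain_prompt(
    "combinatorial",
    "Find a large cap set in dimension 8.",
)
```

Swapping the registry entry is all that is needed to retarget the same search loop to a new domain.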
mathematical conjecture validation through program discovery
Medium confidence: Automatically discovers programs (algorithms, constructions, proofs) that either validate or refute mathematical conjectures by searching for counterexamples or constructive proofs. The system translates mathematical statements into executable test cases or constraint specifications, then uses program search to find solutions that satisfy or violate the conjecture.
Bridges mathematical reasoning and program synthesis by translating conjectures into executable specifications, then using program search to explore the solution space. Treats mathematical discovery as a search problem rather than a pure reasoning task.
More systematic than manual exploration because it exhaustively searches bounded domains, and more practical than formal theorem proving because it uses heuristic search rather than requiring hand-crafted proofs.
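The conjecture-to-executable-specification step can be illustrated with a classical example. The conjecture below, that Euler's polynomial n² + n + 41 is prime for every n ≥ 0, is a well-known false claim (it first fails at n = 40); the search harness is an illustrative sketch of the refutation workflow, not FunSearch's machinery.

```python
def is_prime(n):
    """Trial-division primality check, adequate for small bounds."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def conjecture_holds(n):
    # Executable form of the mathematical statement under test.
    return is_prime(n * n + n + 41)

def find_counterexample(predicate, bound):
    """Exhaustively search a bounded domain for a refuting witness."""
    for n in range(bound):
        if not predicate(n):
            return n
    return None  # conjecture survives this bounded search

witness = find_counterexample(conjecture_holds, 100)
```

Here exhaustive enumeration suffices; the guided-search loop above takes over when the witness space is too large to enumerate.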
scalable evaluation and ranking of program candidates
Medium confidence: Efficiently evaluates large numbers of program candidates (100s to 1000s) against test suites and performance metrics, then ranks them by quality scores. The system uses parallel evaluation, caching, and early termination to reduce computational overhead while maintaining ranking accuracy.
Implements a scalable evaluation pipeline that treats program testing as a data processing problem, using caching, parallelization, and early termination to handle large candidate pools efficiently. Decouples evaluation from generation, allowing flexible ranking strategies.
More efficient than sequential evaluation because it parallelizes test execution, and more flexible than hard-coded ranking because it supports pluggable evaluation metrics and ranking algorithms.
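The pipeline described above (parallel test execution, memoized scoring, early termination on first failure) can be sketched as follows. The candidate representation (an id-to-function mapping) and the thread-pool sizing are assumptions; a real system would sandbox and time-limit each execution.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

TESTS = [(0, 1), (1, 3), (2, 5), (3, 7)]  # expect f(x) = 2*x + 1

def run_tests(fn):
    passed = 0
    for x, expected in TESTS:
        if fn(x) != expected:
            return passed          # early termination at first failure
        passed += 1
    return passed

CANDIDATES = {
    "good":  lambda x: 2 * x + 1,
    "close": lambda x: 2 * x,
    "bad":   lambda x: 0,
}

@lru_cache(maxsize=None)
def cached_score(candidate_id):
    # Cache keyed by candidate identity: re-ranking never re-runs tests.
    return run_tests(CANDIDATES[candidate_id])

def rank_all(candidate_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = dict(zip(candidate_ids, pool.map(cached_score, candidate_ids)))
    return sorted(candidate_ids, key=scores.get, reverse=True)

ranking = rank_all(list(CANDIDATES))
```

Decoupling `run_tests` (evaluation) from `rank_all` (ranking) is what makes the ranking strategy pluggable, as the text notes.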
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Mathematical discoveries from program search with large language models (FunSearch), ranked by overlap. Discovered automatically through the match graph.
Large Language Models as Optimizers (OPRO)
Eureka
Human-Level Reward Design via Coding Large Language Models (10/2023, https://arxiv.org/abs/2310.12931)
AlphaCodium
Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering"
Voyager
LLM-powered lifelong learning agent in Minecraft
BabyElfAGI
Mod of BabyDeerAGI, with ~895 lines of code
code-act
Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji.
Mistral: Devstral 2 2512
Devstral 2 is a state-of-the-art open-source model by Mistral AI specializing in agentic coding. It is a 123B-parameter dense transformer model supporting a 256K context window. Devstral 2 supports exploring...
Best For
- ✓ Research teams exploring mathematical conjectures and algorithm discovery
- ✓ Optimization specialists seeking novel solutions to NP-hard or combinatorial problems
- ✓ Academic institutions validating computational mathematics hypotheses
- ✓ Teams running long-horizon program search experiments where iteration count is high (100s to 1000s of attempts)
- ✓ Researchers studying how LLMs learn from negative examples in structured domains
- ✓ Optimization workflows where best-so-far solution quality improves monotonically with iteration
- ✓ Algorithm researchers optimizing for both theoretical and practical performance
- ✓ Mathematicians seeking proofs that are both correct and insightful
Known Limitations
- ⚠ Requires well-defined evaluation metrics or test suites to judge program correctness; works poorly on subjective or open-ended problems
- ⚠ Search time grows exponentially with program complexity and constraint count; practical for small-to-medium programs only
- ⚠ LLM guidance is probabilistic and may miss solution regions if training data doesn't cover similar problem structures
- ⚠ No guarantees of optimality or completeness; discovered solutions are heuristically good, not proven optimal
- ⚠ Context window limits the number of failure examples that can be retained, typically 10-50 examples before context overflow
- ⚠ LLM may overfit to recent failures and miss alternative solution strategies if failure patterns are not diverse