Watch LLMs play 21,000 hands of Poker
PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, an
- Best for: simulated poker gameplay analysis
- Type: Benchmark
- Score: 28/100
- Best alternative: v0
Capabilities (1 decomposed)
simulated poker gameplay analysis
Medium confidence. This capability lets users observe and analyze how LLMs perform across 21,000 hands of poker. A custom-built simulation environment combines LLM decision-making with the rules of the game and tracks strategies, outcomes, and player interactions in real time. Detailed logging and performance metrics give insight into each model's decision-making patterns and its effectiveness in different scenarios.
The implementation leverages a specialized simulation engine that combines LLM outputs with poker game mechanics, allowing for a comprehensive analysis of AI strategies over a large number of hands.
It is more extensive and detailed than other poker AI benchmarks, owing to the volume of hands played and the depth of analysis provided.
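The per-hand logging and performance metrics described above can be illustrated with a minimal sketch. The listing does not say which metrics PokerBench reports or how its logs are structured, so everything here is an assumption: `HandResult` and `bb_per_100` are hypothetical names, and big blinds won per 100 hands (bb/100) is used only because it is a common poker win-rate measure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HandResult:
    """One model's outcome in one hand (hypothetical log record)."""
    hand_id: int
    model: str
    net_chips: float  # chips won (+) or lost (-) in this hand
    big_blind: float  # size of the big blind for this hand


def bb_per_100(results):
    """Return each model's win rate in big blinds per 100 hands."""
    net_bb = defaultdict(float)  # running total of big blinds won per model
    hands = defaultdict(int)     # number of hands logged per model
    for r in results:
        net_bb[r.model] += r.net_chips / r.big_blind
        hands[r.model] += 1
    return {m: 100.0 * net_bb[m] / hands[m] for m in net_bb}


if __name__ == "__main__":
    # Illustrative data only, not results from the benchmark.
    log = [
        HandResult(1, "model_a", +30.0, 10.0),
        HandResult(1, "model_b", -30.0, 10.0),
        HandResult(2, "model_a", -10.0, 10.0),
        HandResult(2, "model_b", +10.0, 10.0),
    ]
    print(bb_per_100(log))
```

Normalizing by the big blind keeps results comparable if stakes differ between tables; with only a few hands per model the figure is dominated by variance, which is one reason a large hand count matters.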
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Watch LLMs play 21,000 hands of Poker, ranked by overlap. Discovered automatically through the match graph.
Suspicion Agent
Paper on imperfect information games
Gave Claude a casino bankroll – it gambles till it's too broke to think
Inspired by ALMA. As Claude loses money gambling on provably-fair slots, it's forced to downgrade from Opus → Sonnet → Haiku, making worse decisions and accelerating the spiral. Donations go to gambling support charities.
Chess
Enhance chess skills with AI-driven analysis and strategic...
AgentBench
8-environment benchmark for evaluating LLM agents.
nephyr-backtest
Strategy backtesting with real on-chain Polymarket data. Backtest weather-based prediction market strategies, simulate copy-trading top wallets, and query available historical data. Validate your strategies against real market outcomes before risking capital.
dino-game-chatgpt-app
MCP server: dino-game-chatgpt-app
Best For
- ✓ researchers studying AI decision-making in games
- ✓ developers building AI models for strategic gameplay
Known Limitations
- ⚠ Limited to poker; does not support other card games or variations
- ⚠ Performance metrics may not reflect real-world poker scenarios
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Show HN: Watch LLMs play 21,000 hands of Poker