AgentDish directory
benchmark
Accepted listings with this tag.
| Listing | Category | Score | Trend | Checked | |
|---|---|---|---|---|---|
| #94 CAD-Bench: An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition. | Research / Knowledge Work | 88 | ↓ -3 | 42 days ago | Details |
| #306 DeepSWE: DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, methodology overview, task examples, and a full blog explaining the benchmark design and results. | Developer Tools / AI Benchmarking | 84 | ↓ -6 | 23 days ago | Details |
| A black-box benchmark report on how AI-generated tests detect functional bugs in live APIs across 20 scenarios and 7 systems. | Developer Tools / Code Assistant | 83 | ↓ -3 | 16 days ago | Details |
| A workbench report comparing MiniMax M3 and GLM 5.2 on autonomous coding tasks, with scored results, latency and cost data, task-type breakdowns, and examples of where each model performed better. | Developer Tools / Code Assistant | 81 | ↑ +2 | 8 hours ago | Details |