AgentDish directory

benchmark

Accepted listings with this tag.

Rank	Listing	Description	Category	Score	Trend	Checked
#94	CAD-Bench	An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition.	Research / Knowledge Work	88	↓ -3	42 days ago
#306	DeepSWE	DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page shows a leaderboard, methodology overview, task examples, and a full blog explaining the benchmark design and results.	Developer Tools / AI Benchmarking	84	↓ -6	23 days ago
#397	AI Agent Benchmark: API Bug Detection | KushoAI	A black-box benchmark report on how AI-generated tests detect functional bugs in live APIs across 20 scenarios and 7 systems.	Developer Tools / Code Assistant	83	↓ -3	16 days ago
#492	MiniMax M3 vs. GLM 5.2: Codegen comparison across autonomous coding tasks	A workbench report comparing MiniMax M3 and GLM 5.2 on autonomous coding tasks, with scored results, latency and cost data, task-type breakdowns, and examples of where each model performed better.	Developer Tools / Code Assistant	81	↑ +2	8 hours ago