AgentDish directory

benchmark

Accepted listings with this tag.

Listing Category Score Trend Checked
#94 ↓ -3
CAD-Bench

An open benchmark and leaderboard for AI CAD agents, with 308 prompts across 20 categories and layered scoring for geometry, engineering, manufacturability, and cognition.

Category: Research / Knowledge Work | Score: 88 | Trend: ↓ -3 | Checked: 42 days ago
#306 ↓ -6
DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks. The page includes a leaderboard, a methodology overview, task examples, and a full blog post explaining the benchmark design and results.

Category: Developer Tools / AI Benchmarking | Score: 84 | Trend: ↓ -6 | Checked: 23 days ago

A black-box benchmark report on how AI-generated tests detect functional bugs in live APIs across 20 scenarios and 7 systems.

Category: Developer Tools / Code Assistant | Score: 83 | Trend: ↓ -3 | Checked: 16 days ago

A workbench report comparing MiniMax M3 and GLM 5.2 on autonomous coding tasks, with scored results, latency and cost data, task-type breakdowns, and examples of where each model performed better.

Category: Developer Tools / Code Assistant | Score: 81 | Trend: ↑ +2 | Checked: 8 hours ago