Real-world agent benchmark — 100 tasks across 8 domains. [qwenclawbench-v1.1-100]
Three grading modes, depending on what the task requires:
Automated. Python functions verify outputs directly: binary pass/fail, with optional partial credit.
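A minimal sketch of what a grader of this shape could look like. The task (checking a CSV export), the function name, and the partial-credit scheme are illustrative assumptions, not the benchmark's actual interface.

```python
import csv

def _nonneg_number(s: str) -> bool:
    try:
        return float(s) >= 0
    except ValueError:
        return False

def grade_csv_export(output_path: str) -> float:
    """Return a score in [0, 1]: 1.0 for a full pass, 0.0 for a hard
    fail, fractional values for partial credit."""
    try:
        with open(output_path, newline="") as f:
            rows = list(csv.DictReader(f))
    except OSError:
        return 0.0  # unreadable output is a hard fail

    checks = [
        len(rows) > 0,                                  # produced any rows at all
        all("id" in r and "total" in r for r in rows),  # required columns present
        all(_nonneg_number(r.get("total", "")) for r in rows),  # sane values
    ]
    # A binary grader would return float(all(checks)); partial credit
    # instead awards the fraction of checks that passed.
    return sum(checks) / len(checks)
```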
LLM-judged. Claude scores the agent's output against a rubric, handling open-ended or hard-to-verify answers.
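A sketch of rubric-based judging using the Anthropic Python SDK. The rubric format, prompt wording, and model name are assumptions about how such a judge might be wired up, not the benchmark's actual prompt.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(task: str, rubric: list[str], agent_output: str) -> float:
    """Ask Claude to grade the output against each rubric item and
    return the fraction of items satisfied."""
    prompt = (
        f"Task: {task}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        "For each rubric item below, decide whether the output satisfies it. "
        "Reply with a JSON list of booleans, one per item, and nothing else.\n"
        + "\n".join(f"{i + 1}. {item}" for i, item in enumerate(rubric))
    )
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed judge model; substitute your own
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model complies with the JSON-only instruction; production
    # code would validate and retry on malformed replies.
    verdicts = json.loads(resp.content[0].text)
    return sum(bool(v) for v in verdicts) / len(rubric)
```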
Hybrid. A weighted combination of automated and LLM scores. If the automated score falls below a minimum threshold, the LLM component is zeroed: the agent must pass the hard checks first.
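The gating logic in a few lines. The weight and threshold values are placeholders, since the section only specifies that the LLM component is zeroed when the automated score is too low.

```python
def hybrid_score(
    auto_score: float,
    llm_score: float,
    auto_weight: float = 0.6,  # assumed split between the two components
    min_auto: float = 0.5,     # assumed gating threshold
) -> float:
    """Weighted blend of automated and LLM scores, gated on the hard checks."""
    if auto_score < min_auto:
        llm_score = 0.0  # judge credit only counts once the hard checks pass
    return auto_weight * auto_score + (1 - auto_weight) * llm_score
```

With these placeholder defaults, an agent that aces the rubric but scores 0.3 on the hard checks gets 0.6 * 0.3 = 0.18 rather than the ungated 0.58, which is what makes the hard checks binding.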