Real-world agent benchmark — 100 tasks across 8 domains. [qwenclawbench-v1.1-100]
Three grading modes, depending on what the task requires:
Automated. Python functions verify outputs directly: binary pass/fail, with optional partial credit.
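A minimal sketch of what a grader of this shape could look like. The task (checking a CSV export), the function name, and the partial-credit scheme are illustrative assumptions, not the benchmark's actual interface.

```python
import csv

def _nonneg_number(s: str) -> bool:
    try:
        return float(s) >= 0
    except ValueError:
        return False

def grade_csv_export(output_path: str) -> float:
    """Return a score in [0, 1]: 1.0 for a full pass, 0.0 for a hard
    fail, fractional values for partial credit."""
    try:
        with open(output_path, newline="") as f:
            rows = list(csv.DictReader(f))
    except OSError:
        return 0.0  # unreadable output is a hard fail

    checks = [
        len(rows) > 0,                                  # produced any rows at all
        all("id" in r and "total" in r for r in rows),  # required columns present
        all(_nonneg_number(r.get("total", "")) for r in rows),  # sane values
    ]
    # A binary grader would return float(all(checks)); partial credit
    # instead awards the fraction of checks that passed.
    return sum(checks) / len(checks)
```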
LLM-judged. Claude scores the agent's output against a rubric, handling open-ended or hard-to-verify answers.
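A sketch of rubric-based judging using the Anthropic Python SDK. The rubric format, prompt wording, and model name are assumptions about how such a judge might be wired up, not the benchmark's actual prompt.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(task: str, rubric: list[str], agent_output: str) -> float:
    """Ask Claude to grade the output against each rubric item and
    return the fraction of items satisfied."""
    prompt = (
        f"Task: {task}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        "For each rubric item below, decide whether the output satisfies it. "
        "Reply with a JSON list of booleans, one per item, and nothing else.\n"
        + "\n".join(f"{i + 1}. {item}" for i, item in enumerate(rubric))
    )
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed judge model; substitute your own
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model complies with the JSON-only instruction; production
    # code would validate and retry on malformed replies.
    verdicts = json.loads(resp.content[0].text)
    return sum(bool(v) for v in verdicts) / len(rubric)
```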
Hybrid. A weighted combination of automated and LLM scores. If the automated score falls below a minimum threshold, the LLM component is zeroed: the agent must pass the hard checks first.
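The gating logic in a few lines. The weight and threshold values are placeholders, since the section only specifies that the LLM component is zeroed when the automated score is too low.

```python
def hybrid_score(
    auto_score: float,
    llm_score: float,
    auto_weight: float = 0.6,  # assumed split between the two components
    min_auto: float = 0.5,     # assumed gating threshold
) -> float:
    """Weighted blend of automated and LLM scores, gated on the hard checks."""
    if auto_score < min_auto:
        llm_score = 0.0  # judge credit only counts once the hard checks pass
    return auto_weight * auto_score + (1 - auto_weight) * llm_score
```

With these placeholder defaults, an agent that aces the rubric but scores 0.3 on the hard checks gets 0.6 * 0.3 = 0.18 rather than the ungated 0.58, which is what makes the hard checks binding.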