Reflect on the design of {{benchmark_name}} for evaluating {{capability}}. Consider critically:

- Does this benchmark actually measure what we claim?
- What artifacts might inflate scores?
- How might teaching to the test occur?
- What legitimate capabilities might fail this test?
- Is the difficulty calibrated correctly?

After reflection, propose design improvements and identify which findings require action for {{evaluation_goals}}.
Details
Category: Analysis
Use Cases: Design critique, Benchmark validation, Quality improvement
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash