Benchmark Design Review

Reflect on benchmark design for potential flaws.

Reflect on the design of {{benchmark_name}} for evaluating {{capability}}.

Consider critically:
- Does this benchmark actually measure what we claim?
- What artifacts might inflate scores?
- How might teaching to the test occur?
- Which legitimate capabilities might this test penalize or fail to detect?
- Is the difficulty calibrated correctly?

After reflection, propose design improvements and identify which findings require action for {{evaluation_goals}}.
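
The prompt above uses `{{name}}` placeholders. A minimal sketch of filling them before sending the prompt to a model follows, assuming plain string substitution is all that is needed. The helper name `fill_template` and the example values are hypothetical, not part of the original prompt, and the template is abbreviated to its first and last sentences.

```python
import re

# Abbreviated copy of the prompt: first and last sentences only.
TEMPLATE = (
    "Reflect on the design of {{benchmark_name}} for evaluating {{capability}}.\n"
    "After reflection, propose design improvements and identify which findings "
    "require action for {{evaluation_goals}}.\n"
)

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with values[name].

    Raises KeyError if a placeholder has no value, so an unfilled
    variable never reaches the model silently.
    """
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing value for placeholder: {key}")
        return values[key]

    return PLACEHOLDER.sub(replace, template)


# Hypothetical example values, for illustration only.
prompt = fill_template(
    TEMPLATE,
    {
        "benchmark_name": "ExampleBench",
        "capability": "multi-step arithmetic reasoning",
        "evaluation_goals": "a release decision",
    },
)
print(prompt)
```

Failing loudly on a missing value is a design choice here: a leftover `{{capability}}` in the sent prompt would likely produce a vague or confused critique.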

Details

Category

Analysis

Use Cases

Design critique, Benchmark validation, Quality improvement

Works Best With

claude-opus-4.5, gpt-5.2, gemini-2.0-flash