As a Benchmark Designer, create an evaluation suite for {{model}} on {{capability}}.

Benchmark design:
- Test case diversity across {{difficulty_levels}}
- Ground truth generation methodology
- Scoring rubric development
- Statistical validity requirements
- Administration protocol

Provide a benchmark specification with validation criteria.
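The `{{model}}`, `{{capability}}`, and `{{difficulty_levels}}` placeholders have to be filled in before the prompt is sent. The following is a minimal sketch of one way to do that in Python, assuming plain `{{name}}` substitution; the `fill_template` helper, the abbreviated `TEMPLATE` string, and the sample values are all hypothetical and not part of the original prompt or of any particular tool.

```python
import re

# Abbreviated copy of the prompt for illustration; the {{name}} placeholders
# are the ones used in the original, everything else here is an assumption.
TEMPLATE = (
    "As a Benchmark Designer, create an evaluation suite for {{model}} "
    "on {{capability}}. Cover test case diversity across {{difficulty_levels}}."
)


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} placeholder; raise KeyError if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


# Sample values are placeholders chosen for the example only.
prompt = fill_template(
    TEMPLATE,
    {
        "model": "example-model",
        "capability": "multi-step arithmetic",
        "difficulty_levels": "easy, medium, hard",
    },
)
print(prompt)
```

Raising on a missing value, rather than leaving the placeholder in place, is a design choice here: it keeps a half-filled prompt from being sent by accident.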
Details
Category: Analysis
Use Cases: Benchmark design, Test creation, Evaluation development
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash