Decompose creating {{benchmark_name}} for {{capability}} into steps. Phase 1 - Design: - Define test categories - Specify difficulty levels - Create rubric drafts Phase 2 - Development: - Generate test cases - Validate ground truth - Calibrate scoring Phase 3 - Validation: - Pilot with {{pilot_models}} - Refine based on results - Document administration Output detailed task list with effort estimates and {{resource_requirements}}.
95 copies0 forks
Details
Category
AnalysisUse Cases
Benchmark planningPipeline designProject management
Works Best With
claude-opus-4.5gpt-5.2gemini-2.0-flash
Created Shared