Design a custom benchmark for evaluating {{model}} on {{capability_area}}.

Step 1: Define what success looks like for {{use_case}}.
Step 2: Identify measurable dimensions of performance.
Step 3: Create diverse test cases covering edge cases.
Step 4: Establish scoring rubrics with clear criteria.
Step 5: Validate the benchmark against {{reference_models}}.
Step 6: Document administration and scoring procedures.

Explain the design rationale at each step.
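The steps above can be sketched as a small harness. This is a minimal illustration, not part of the prompt: all names (`TestCase`, `Rubric`, `run_benchmark`) are hypothetical, and the rubric here is a weighted average of per-criterion checks (Step 4) run over tagged test cases (Step 3).

```python
# Illustrative sketch of the benchmark structure the steps describe.
# All names here are hypothetical, not a real evaluation API.
from dataclasses import dataclass


@dataclass
class TestCase:
    prompt: str
    tags: list       # e.g. ["edge_case", "long_input"]; used to track coverage (Step 3)
    reference: str   # expected answer or key points, consumed by rubric checks


@dataclass
class Rubric:
    # Step 4: each criterion maps to a weight and a check function
    # returning a score in [0, 1] for one (response, case) pair.
    criteria: dict   # name -> (weight, check_fn)

    def score(self, response: str, case: TestCase) -> float:
        total_weight = sum(w for w, _ in self.criteria.values())
        weighted = sum(w * fn(response, case) for w, fn in self.criteria.values())
        return weighted / total_weight


def run_benchmark(model_fn, cases, rubric) -> float:
    """Administer the benchmark (Step 6): score every case, return the mean."""
    scores = [rubric.score(model_fn(c.prompt), c) for c in cases]
    return sum(scores) / len(scores)
```

A toy run with a single exact-match criterion: `run_benchmark(lambda p: "4", [TestCase("2+2?", ["edge_case"], "4")], Rubric({"correct": (1.0, lambda r, c: 1.0 if c.reference in r else 0.0)}))` yields the mean rubric score across cases. Step 5 (validation) would then compare these means across {{reference_models}} to confirm the benchmark separates known-stronger from known-weaker systems.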
Details
Category: Analysis
Use Cases: Benchmark creation, Evaluation design, Test development
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash