Review evaluation results for {{model}} on {{benchmark}} critically. Reflect on:

- What assumptions did the evaluation make?
- What scenarios were not tested?
- Could the metrics be gamed?
- Are there confounding variables?
- What would invalidate these results?

After reflection, identify the top 3 evaluation blind spots and recommend additional tests to address them for {{deployment_context}}.
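As a minimal sketch of how the template might be used, the Python below fills the {{placeholder}} slots before the prompt is sent to a model. The fill_prompt helper, the MMLU benchmark value, and the deployment-context string are illustrative assumptions, not part of this entry; a real pipeline might use Jinja2 or another templating engine instead of plain string replacement.

```python
# Sketch: substitute the template's {{key}} placeholders with concrete values.
# str.replace keeps the example dependency-free.

PROMPT_TEMPLATE = """\
Review evaluation results for {{model}} on {{benchmark}} critically. Reflect on:

- What assumptions did the evaluation make?
- What scenarios were not tested?
- Could the metrics be gamed?
- Are there confounding variables?
- What would invalidate these results?

After reflection, identify the top 3 evaluation blind spots and recommend
additional tests to address them for {{deployment_context}}.
"""


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace each {{key}} placeholder with its corresponding value."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template


# Example values are hypothetical, chosen only to illustrate the call.
prompt = fill_prompt(PROMPT_TEMPLATE, {
    "model": "claude-opus-4.5",
    "benchmark": "MMLU",
    "deployment_context": "a customer-facing support assistant",
})
print(prompt)
```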
Details

Category: Analysis
Use Cases: Result validation, Blind spot detection, Evaluation improvement
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash