Score {{model}} output quality on {{evaluation_criteria}} using multiple evaluators. Run {{evaluator_count}} independent quality assessments:
- Each evaluator scores 1-10 per criterion
- Record all scores
- Calculate inter-evaluator agreement

Report aggregated scores with confidence intervals. Flag criteria with high disagreement. Weight final scores by evaluator reliability on {{calibration_set}}.
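A minimal sketch of how the aggregation step behind this prompt could work, assuming per-criterion scores and evaluator reliability weights have already been collected. The variable names (`scores`, `reliability`) and the disagreement threshold are illustrative placeholders, not part of the template.

```python
import math
import statistics

# Hypothetical example data: scores[criterion] -> one 1-10 rating per evaluator.
scores = {
    "accuracy":  [8, 7, 9],
    "relevance": [6, 9, 5],
}
# Assumed per-evaluator reliability, measured beforehand on a calibration set.
reliability = [0.9, 0.7, 0.8]

DISAGREEMENT_THRESHOLD = 1.5  # flag criteria whose score std-dev exceeds this

for criterion, ratings in scores.items():
    mean = statistics.mean(ratings)
    stdev = statistics.stdev(ratings)
    # Approximate 95% confidence interval for the mean (normal approximation).
    ci = 1.96 * stdev / math.sqrt(len(ratings))
    # Reliability-weighted final score.
    weighted = sum(r * w for r, w in zip(ratings, reliability)) / sum(reliability)
    flag = " [HIGH DISAGREEMENT]" if stdev > DISAGREEMENT_THRESHOLD else ""
    print(f"{criterion}: mean={mean:.2f} ±{ci:.2f}, weighted={weighted:.2f}{flag}")
```

In practice the agreement measure could be a standard deviation as above or a formal inter-rater statistic; the prompt leaves that choice to the model.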
53 copies · 0 forks
Details
Category
Analysis
Use Cases
Quality aggregation, Multi-evaluator scoring, Agreement analysis
Works Best With
claude-opus-4.5, gpt-5.2, gemini-2.0-flash