Compare {{model_a}} vs {{model_b}} on {{evaluation_set}} robustly.

Conduct 5 comparison trials:
- Randomize evaluation order each trial
- Use different prompt phrasings
- Score both models per trial

Determine winner by majority vote. Calculate win margin confidence. Report comparison only if winner is consistent across {{agreement_threshold}} trials.
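The trial procedure above can be sketched in Python as a rough illustration. Everything named here is an assumption made for the example: `compare_models`, the `score_a`/`score_b` callables (stand-ins for however each model is actually scored), and the `phrasings` list. The prompt does not define "win margin confidence", so this sketch assumes it is the difference in trial wins divided by the number of trials.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def compare_models(
    score_a: Callable[[str, str], float],  # hypothetical scorer: (phrasing, item) -> score
    score_b: Callable[[str, str], float],  # hypothetical scorer for the second model
    items: Sequence[str],                  # the evaluation set
    phrasings: Sequence[str],              # alternative prompt phrasings
    agreement_threshold: int,              # trials the winner must take to be reported
    trials: int = 5,
    seed: int = 0,
) -> dict:
    rng = random.Random(seed)
    outcomes = []
    for t in range(trials):
        # Randomize evaluation order each trial.
        order = list(items)
        rng.shuffle(order)
        # Use a different prompt phrasing each trial (cycling if fewer than trials).
        phrasing = phrasings[t % len(phrasings)]
        # Score both models on the same items in the same order.
        total_a = sum(score_a(phrasing, item) for item in order)
        total_b = sum(score_b(phrasing, item) for item in order)
        if total_a > total_b:
            outcomes.append("a")
        elif total_b > total_a:
            outcomes.append("b")
        else:
            outcomes.append("tie")

    counts = Counter(outcomes)
    wins_a, wins_b = counts["a"], counts["b"]

    # Majority vote over trials; equal win counts yield no winner.
    winner: Optional[str] = None
    if wins_a != wins_b:
        winner = "a" if wins_a > wins_b else "b"

    # Assumed definition of win margin confidence: win difference over trial count.
    margin = abs(wins_a - wins_b) / trials

    # Report only if the winner took at least agreement_threshold trials.
    reportable = winner is not None and max(wins_a, wins_b) >= agreement_threshold

    return {
        "winner": winner,
        "wins": {"a": wins_a, "b": wins_b, "tie": counts["tie"]},
        "margin": margin,
        "reportable": reportable,
    }
```

With real scorers that ignore item order, the shuffle has no effect on the totals; it matters only when scoring depends on order, for example when a judge model sees earlier items as context.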
Details
Category: Analysis
Use Cases: Robust comparison, Winner determination, Confidence scoring
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash