Measure {{model}} stability on {{benchmark}} through repeated runs. Execute the benchmark {{run_count}} times under identical conditions:

- Record the score for each run
- Calculate the score variance
- Identify outlier runs

Report a stable score estimate with confidence bounds. Flag if the standard deviation exceeds {{stability_threshold}}. Recommend the minimum number of runs for a reliable measurement.
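The analysis this template asks for can be sketched in a few lines. This is a minimal illustration, not part of the template: the function name, the default threshold, and the z-score outlier cutoff are all assumptions.

```python
import statistics

# Sketch of the requested analysis: mean, standard deviation,
# 95% confidence bounds, outlier runs, and a stability flag.
# Defaults (stability_threshold, z_outlier) are illustrative only.
def stability_report(scores, stability_threshold=0.5, z_outlier=2.0):
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if n > 1 else 0.0
    # 95% confidence bounds on the mean (normal approximation)
    margin = 1.96 * std / n ** 0.5
    # Flag runs more than z_outlier standard deviations from the mean
    outliers = [s for s in scores if std > 0 and abs(s - mean) / std > z_outlier]
    return {
        "mean": round(mean, 4),
        "std": round(std, 4),
        "ci": (round(mean - margin, 4), round(mean + margin, 4)),
        "outliers": outliers,
        "stable": std <= stability_threshold,
    }
```

For example, `stability_report([10.0, 10.1, 9.9, 10.0])` reports a mean of 10.0, no outliers, and a stable result under the default threshold.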
Category: Analysis

Use Cases: Stability measurement, Score reliability, Benchmark validation

Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash