Benchmark Score Stability Check

Check benchmark stability through repeated measurements.

Measure the stability of {{model}}'s scores on {{benchmark}} across repeated runs.

Run the benchmark {{run_count}} times under identical conditions:
- Record the score for each run
- Calculate the mean, variance, and standard deviation of the scores
- Identify outlier runs

Report a stable score estimate with confidence bounds. Flag the result as unstable if the standard deviation exceeds {{stability_threshold}}. Recommend the minimum number of runs needed for a reliable measurement.
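
The statistics this prompt asks for can be sketched in Python as below. This is a minimal illustration, not part of the prompt itself. The function names `summarize_runs` and `min_runs`, the 1.5 x IQR outlier rule, and the normal-approximation 95% interval (z = 1.96) are all assumptions chosen for the example; with few runs a t-based interval would be wider.

```python
import math
import statistics


def summarize_runs(scores, stability_threshold, z=1.96, outlier_k=1.5):
    """Summarize repeated benchmark scores (hypothetical helper).

    Assumptions: a normal-approximation confidence interval and the
    1.5 * IQR rule for outliers. Neither is prescribed by the prompt.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")

    mean = statistics.fmean(scores)
    variance = statistics.variance(scores)  # sample variance (n - 1)
    stdev = math.sqrt(variance)

    # Confidence bounds on the mean: mean +/- z * stdev / sqrt(n)
    half_width = z * stdev / math.sqrt(n)

    # Outliers: runs outside [Q1 - k * IQR, Q3 + k * IQR]
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low, high = q1 - outlier_k * iqr, q3 + outlier_k * iqr
    outliers = [i for i, s in enumerate(scores) if s < low or s > high]

    return {
        "mean": mean,
        "variance": variance,
        "stdev": stdev,
        "ci": (mean - half_width, mean + half_width),
        "outlier_runs": outliers,  # zero-based run indices
        "unstable": stdev > stability_threshold,
    }


def min_runs(stdev, margin, z=1.96):
    """Smallest run count whose interval half-width is at most `margin`.

    Solves z * stdev / sqrt(n) <= margin for n (same normal approximation).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    return max(2, math.ceil((z * stdev / margin) ** 2))


if __name__ == "__main__":
    # Made-up scores for illustration only, not real benchmark results.
    example_scores = [70.0, 71.0, 69.0, 70.0, 71.0, 69.0, 70.0, 90.0]
    print(summarize_runs(example_scores, stability_threshold=1.0))
    print(min_runs(stdev=2.0, margin=1.0))
```

A single extreme run inflates the standard deviation, so it may be worth checking whether the instability flag still fires after the outlier runs are examined separately.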

Details

Category

Analysis

Use Cases

Stability measurement, Score reliability, Benchmark validation

Works Best With

claude-opus-4.5, gpt-5.2, gemini-2.0-flash
