Evaluate {{model}} accuracy on {{dataset}} using self-consistency.

Run evaluation 5 independent times with different random seeds:
- Sample different test subsets
- Vary evaluation order
- Record accuracy for each run

Calculate mean accuracy with confidence interval. Flag if variance exceeds {{variance_threshold}}. Report most reliable estimate with uncertainty bounds.
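
The procedure this prompt describes can be sketched in Python. This is a minimal illustration, not part of the template: the function name `evaluate_repeated`, the `predict` callable, the 80% subset size, the default threshold, and the use of a 95% Student-t interval are all assumptions chosen for the example.

```python
import random
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (runs - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def evaluate_repeated(predict, examples, n_runs=5, subset_size=None,
                      variance_threshold=0.001, base_seed=0):
    """Score `predict` on several seeded random subsets and summarise accuracy.

    `examples` is a list of (input, expected_label) pairs; `predict` is a
    hypothetical stand-in for a call to the model under evaluation.
    """
    if n_runs - 1 not in T_CRIT_95:
        raise ValueError("n_runs must be between 2 and 10 for this sketch")
    if subset_size is None:
        subset_size = max(1, int(0.8 * len(examples)))  # assumed default: 80%

    accuracies = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # a different seed per run
        # sample() draws without replacement and returns items in random
        # order, so both the subset and the evaluation order vary per run.
        subset = rng.sample(examples, subset_size)
        correct = sum(predict(x) == y for x, y in subset)
        accuracies.append(correct / len(subset))

    mean = statistics.mean(accuracies)
    variance = statistics.variance(accuracies)  # sample variance (n - 1)
    half_width = T_CRIT_95[n_runs - 1] * (variance / n_runs) ** 0.5
    return {
        "accuracies": accuracies,
        "mean": mean,
        "variance": variance,
        "ci_low": max(0.0, mean - half_width),
        "ci_high": min(1.0, mean + half_width),
        "flagged": variance > variance_threshold,
    }


if __name__ == "__main__":
    # Toy data: the label is the parity of the input; the "model" always says 0.
    data = [(i, i % 2) for i in range(100)]
    report = evaluate_repeated(lambda x: 0, data)
    print(report)
```

Because the subsets are drawn from the same pool, the runs overlap and are not strictly independent, so the interval is best read as an approximate indication of run-to-run spread rather than an exact bound.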
Details
Category: Analysis
Use Cases: Accuracy validation, Consistency checking, Reliability testing
Works Best With: claude-opus-4.5, gpt-5.2, gemini-2.0-flash