Evaluate {{model}} reasoning on {{problem_set}} requiring {{reasoning_depth}} steps. Analyze logical consistency, intermediate step accuracy, and conclusion validity. Score overall reasoning quality and identify common failure modes.
42 copies0 forks
Details
Category
AnalysisUse Cases
Logic testingReasoning assessmentCapability evaluation
Works Best With
claude-opus-4.5gpt-5.2gemini-2.0-flash
Created Shared