To get the baseline model scores on the ARC-AGI-Pub leaderboard, we're using the same baseline prompt we used to test GPT-4o. When we test and report results on pure models like o1, our intention is to measure base model performance as closely as possible, without layering on any optimization. Others may discover better ways to prompt CoT-style models in the future, and we are h