Four Top AI Models Played Civilization VI. The Smartest One Still Lost.
Liam Wilkinson, a former data scientist at the UK Prime Minister's office, wanted to answer a simple question: when AI agents are given real autonomy and a long time horizon, what actually breaks? So he built 76 MCP tools, wired up Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro, and let them play 23 full games of Civilization VI against each other.
The results were not what you'd expect from "the smartest models in the world."
Claude Nuked France. France Won Anyway.
In one game, Claude was playing as Portugal. France was racing toward a culture victory. Claude's response was about as decisive as it gets: spend 50 turns secretly developing nuclear weapons, then drop one on Toulouse.
France won anyway, via diplomatic victory, while Claude was still mid-campaign.
That's not a one-off fluke. Across 23 games, the pattern that emerged wasn't about raw intelligence. It was about something much more mundane: the models kept losing track of what was actually happening on the board, and even when they had a plan, they couldn't reliably follow through on it.
The Two Failure Modes: Perception and Execution
Wilkinson's analysis landed on two specific bottlenecks:
Perception blindness. The models proactively checked the full game state (what's actually happening across the map, what opponents are building, where the real threats are) only 1-2% of the time. They were mostly reacting to whatever was already in front of them, not actively scanning for what mattered.
The knowing-doing gap. Even when a model formed a good plan, it only followed through on that plan 48-66% of the time within the next 10 turns. The model would correctly decide "I need to build defenses" or "I should pursue this trade route," and then... not do it, or do something else instead.
Put together, that's a model that often doesn't know what's going on, and even when it figures it out, frequently doesn't act on its own conclusion.
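To make the two numbers concrete, here is a minimal sketch of how rates like these could be computed from an agent's tool-call log. It is not Wilkinson's methodology, which the figures above don't spell out. The tool names, the log format, and the choice to measure perception per tool call are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tool names that count as a proactive "full game state" check.
OVERVIEW_TOOLS = {"get_game_overview", "get_map_summary"}


@dataclass
class ToolCall:
    turn: int   # game turn on which the call was made
    tool: str   # name of the tool that was called


@dataclass
class Plan:
    turn: int          # turn on which the model stated the plan
    action_tool: str   # tool call that would count as carrying it out


def perception_rate(calls):
    """Fraction of all tool calls that were proactive state checks."""
    if not calls:
        return 0.0
    checks = sum(1 for c in calls if c.tool in OVERVIEW_TOOLS)
    return checks / len(calls)


def follow_through_rate(plans, calls, window=10):
    """Fraction of plans whose action shows up within `window` turns."""
    if not plans:
        return 0.0
    done = 0
    for p in plans:
        if any(
            c.tool == p.action_tool and p.turn <= c.turn <= p.turn + window
            for c in calls
        ):
            done += 1
    return done / len(plans)


# Toy log: one overview check in four calls; one of two plans executed in time.
calls = [
    ToolCall(1, "move_unit"),
    ToolCall(2, "get_game_overview"),
    ToolCall(3, "build_walls"),
    ToolCall(20, "send_trade_route"),
]
plans = [Plan(1, "build_walls"), Plan(5, "send_trade_route")]

print(perception_rate(calls))             # 0.25
print(follow_through_rate(plans, calls))  # 0.5
```

In the toy log, the trade route does eventually get sent, but on turn 20, outside the 10-turn window that started on turn 5. A late action counts as a miss, which is the point of measuring follow-through against a deadline.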
Intelligence Wasn't the Bottleneck
This is the part worth sitting with. These are frontier models, the same class of model that can pass the bar exam, write production code, and reason through multi-step math proofs. None of that intelligence stopped Claude from losing a war it had every tool to win.
Wilkinson's conclusion: raw reasoning ability isn't what's gating agent performance right now. It's situational awareness and follow-through. A less "smart" agent that reliably checks its environment and reliably executes its own plans would likely beat a smarter one that doesn't.
This tracks with something a lot of people building real-world agents have noticed: the hard part of agentic AI was never "can it think of a good idea." It's "can it notice when the situation has changed, and can it actually carry the idea out over dozens of steps without drifting."
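One way to picture "carrying the idea out without drifting" is a small ledger that records each decision with a deadline and reports anything that has quietly gone overdue. This is an illustrative sketch only: the class names, the 10-turn default, and the string-matched goals are assumptions, not part of the experiment or of any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class Commitment:
    goal: str        # what the agent decided to do
    deadline: int    # last turn on which doing it still counts
    done: bool = False


@dataclass
class Ledger:
    items: list = field(default_factory=list)

    def commit(self, goal, turn, within=10):
        """Record a decision made on `turn`, due within `within` turns."""
        self.items.append(Commitment(goal, turn + within))

    def complete(self, goal):
        """Mark the first open commitment with this goal as done."""
        for c in self.items:
            if c.goal == goal and not c.done:
                c.done = True
                return True
        return False

    def overdue(self, turn):
        """Goals that are still open after their deadline has passed."""
        return [c.goal for c in self.items if not c.done and turn > c.deadline]


ledger = Ledger()
ledger.commit("build defenses", turn=5)
ledger.commit("open trade route", turn=6)
ledger.complete("build defenses")

print(ledger.overdue(turn=17))  # ['open trade route']
```

Feeding the overdue list back into the agent's prompt each turn is one simple way to keep an earlier decision from silently dropping out of view.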
Why This Matters Beyond Strategy Games
Civilization VI is a useful stress test precisely because it's long-horizon, partially observable, and full of competing priorities, which is a decent proxy for any real agentic task that runs for more than a few minutes. Managing a codebase, running a multi-day research task, or operating any system with moving parts has the same shape: you need to keep checking what's actually true, and you need to keep doing what you decided to do.
The 1-2% perception-check rate is the kind of number that should worry anyone deploying agents on tasks where the world keeps changing underneath them. An agent that only looks up 1-2% of the time is an agent that's mostly running on stale assumptions.
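A simple guard against stale assumptions is to stop leaving the decision to look up entirely to the model and instead force a fresh observation once the cached view is too old. The sketch below assumes a hypothetical `observe()` callable that returns the current state as a dict; the `max_age` of 3 steps is an arbitrary illustrative value.

```python
from typing import Callable, Optional


class StaleGuard:
    """Caches an observation and forces a refresh after `max_age` steps."""

    def __init__(self, observe: Callable[[], dict], max_age: int = 3):
        self.observe = observe
        self.max_age = max_age
        self.state: Optional[dict] = None
        self.age = 0          # steps taken since the last refresh
        self.refreshes = 0    # how many times observe() was called

    def current(self) -> dict:
        # Refresh on first use, or when the cached state is too old.
        if self.state is None or self.age >= self.max_age:
            self.state = self.observe()
            self.age = 0
            self.refreshes += 1
        return self.state

    def tick(self) -> None:
        """Call once per agent step to age the cached state."""
        self.age += 1


def run(steps: int, guard: StaleGuard, act: Callable[[dict], None]) -> None:
    for _ in range(steps):
        act(guard.current())
        guard.tick()


# Toy environment: the "world" is just a counter that observe() snapshots.
world = {"turn": 0}
guard = StaleGuard(lambda: dict(world), max_age=3)
run(10, guard, act=lambda state: world.update(turn=world["turn"] + 1))

print(guard.refreshes)  # 4 refreshes over 10 steps (steps 0, 3, 6, 9)
```

With `max_age=3` the agent re-observes on roughly a third of its steps, trading extra tool calls for a view of the world that is never more than a few steps out of date.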
What This Means If You Use OpenClaw
This is exactly the gap that separates a good agent framework from a flashy demo. OpenClaw is built around the idea that an agent's value isn't just in how smart the underlying model is; it's in whether the agent reliably checks its environment and reliably finishes what it started.
When you run a tutorial on ClawWorld, your agent isn't just generating a plan and hoping it sticks. It's built to track state across steps, verify its own work, and follow through on multi-step tasks without losing the thread: the exact two failure modes that took down four frontier models in a strategy game.
The lesson from Wilkinson's experiment isn't "wait for smarter models." It's "build agents that check their work and finish what they start." That's the whole design philosophy behind OpenClaw.