EurekAgent: Autonomous Scientific Discovery at Under $11

What if the secret to building better AI scientists wasn't a smarter model, but a better environment?

A new paper from Tsinghua University researchers argues exactly that. EurekAgent is an LLM agent system for metric-driven autonomous scientific discovery, and its core insight is that the bottleneck has shifted from agent architecture to environment design.

The paper reports new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, all at an average API cost under $17, with the 26-circle packing problem solved for less than $11.

What Happened

Published on arXiv on June 12, 2026, and featured on HuggingFace Daily Papers and AI Hot, EurekAgent introduces a framework called environment engineering: systematically designing the resources, constraints, and interfaces that shape agent behavior, rather than prescribing agent workflows directly.

The authors (Xin, Siow, Wang, Yao, Zhang, Song, Hou, and Li) define four engineering dimensions:

1. Permissions Engineering. The agent runs in a bounded execution environment with isolated evaluation. It has access to useful capabilities (code execution, file I/O) but is prevented from actions that compromise research integrity. This mirrors how you'd design a safe sandbox for any production agent system.

2. Artifact Engineering. All solutions, logs, and evaluation results are structured as shared progress memory using the filesystem and Git. This enables systematic artifact management and inter-agent collaboration: agents build on each other's work naturally because the environment preserves history.

3. Budget Engineering. Exploration is cost-aware, with boundaries on runtime and compute. The agent can explore freely, but only within defined budget limits. This forces efficient search and prevents runaway API costs. The under-$11 circle packing result is the direct outcome of this design choice.

4. Human-in-the-Loop Engineering. Easy human supervision and intervention points are built into the environment. Researchers can inspect progress, approve next steps, and steer the agent without friction.

EurekAgent achieves these results using off-the-shelf CLI agents (Claude Code and GLM-5.1 as base models) without custom agent architectures. The key differentiator is the environment, not the model.
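To make the budget dimension concrete, here is a minimal Python sketch of a spending cap that refuses any step that would exceed it. This is not EurekAgent's implementation: `BudgetGuard`, `BudgetExceeded`, and `run_with_budget` are hypothetical names, and the per-step costs are assumed to be known up front, which a real system would have to estimate or meter.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a step would push spending past the configured cap."""


@dataclass
class BudgetGuard:
    # Hypothetical cap in US dollars; not a value taken from the paper.
    limit_usd: float
    spent_usd: float = 0.0

    @property
    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

    def charge(self, cost_usd: float) -> float:
        """Record a cost and return the remaining budget, or refuse the step."""
        if cost_usd < 0:
            raise ValueError("cost must be non-negative")
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"step costs ${cost_usd:.2f}, only ${self.remaining:.2f} left"
            )
        self.spent_usd += cost_usd
        return self.remaining


def run_with_budget(step_costs, limit_usd):
    """Run steps in order and stop before the first one that breaks the cap."""
    guard = BudgetGuard(limit_usd=limit_usd)
    completed = 0
    for cost in step_costs:
        try:
            guard.charge(cost)
        except BudgetExceeded:
            break
        completed += 1
    return completed, guard.spent_usd
```

The point of the design is that the cap lives in the environment rather than in the prompt: the agent cannot talk its way past `charge`, it can only choose which steps are worth paying for.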

Why It Matters

The deeper argument EurekAgent makes is that reward hacking, not reasoning capability, has been the silent killer of prior autonomous discovery attempts. By treating the execution environment as the primary engineering surface, the paper argues that most previous failures were misdiagnosed as model problems when they were actually incentive-structure problems.

This is a provocative shift. For the past two years, the dominant narrative has been "better models + better prompts = better agents." EurekAgent suggests that equation is incomplete: the environment the agent operates in shapes behavior more than the prompt or the model does.

Consider the 26-circle packing problem, a classic optimization challenge where the goal is to fit 26 non-overlapping circles inside a unit square to maximize the sum of their radii. Previous approaches required expensive, multi-stage pipelines. EurekAgent, operating in a budget-constrained environment with Git-tracked artifacts and isolated evaluation, discovered a new SOTA configuration for under $11. That's less than the cost of lunch for most research teams.
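For readers who want to see what "metric-driven" means here, the following Python sketch scores a candidate packing under the standard formulation of the problem: every circle must lie inside the unit square, circles may touch but not overlap, and the score is the sum of radii. The function name `score_packing` and the tolerance `tol` are assumptions for illustration; the paper's own evaluator may differ.

```python
import math


def score_packing(circles, tol=1e-9):
    """Return the sum of radii if the packing is valid, else None.

    circles: list of (x, y, r) tuples in the unit square [0, 1] x [0, 1].
    """
    for x, y, r in circles:
        if r <= 0:
            return None
        # Each circle must lie entirely inside the unit square.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return None
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Circles may touch, but their interiors must not overlap.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return None
    return sum(r for _, _, r in circles)
```

Returning `None` for invalid packings, rather than a partial score, is one simple way to keep an evaluator from rewarding configurations that cheat on the constraints.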

What Agents Can Learn

EurekAgent's four engineering dimensions translate directly to practical lessons for anyone building AI agents today:

  • Start with the environment, not the prompt. Before tuning system prompts, define where the agent runs, what it has access to, and how its outputs are validated. This is the same principle behind designing effective AI agent workflows: structure precedes optimization.

  • Artifacts are memory. If your agent doesn't persist its outputs in a structured, versioned way, it can't learn from its own history. Filesystem + Git is a low-friction way to turn every execution into reusable knowledge.

  • Budget constraints drive better behavior. Fixed-cost exploration forces agents to be strategic. This applies whether your budget is measured in API credits, compute time, or iteration limits.

  • Human oversight points must be designed, not bolted on. EurekAgent builds intervention points into the environment itself, making supervision natural rather than interruptive. This is especially relevant when deploying agents in production, as covered in our guide to what an AI agent is and how to manage agent autonomy responsibly.
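The "artifacts are memory" lesson above can be sketched with nothing more than the filesystem. The Python example below appends every run, including failed ones, to a JSON Lines file and reads back the best result so far. The function names, the record fields, and the scores in `demo` are all placeholders invented for illustration. Committing the log to Git after each run is the natural next step, left out here to keep the sketch self-contained.

```python
import json
import tempfile
from pathlib import Path


def record_run(log_path, run_id, score, notes):
    """Append one run (successful or failed) to a JSON Lines log."""
    entry = {"run_id": run_id, "score": score, "notes": notes}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def load_runs(log_path):
    """Read every recorded run back, oldest first."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(log_path):
    """Return the highest-scoring run, skipping failed runs (score is None)."""
    scored = [r for r in load_runs(log_path) if r["score"] is not None]
    return max(scored, key=lambda r: r["score"]) if scored else None


def demo():
    """Log three placeholder runs in a temporary directory; return the best."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "runs.jsonl"
        record_run(log, "run-1", 1.0, "placeholder baseline")
        record_run(log, "run-2", 2.0, "placeholder refinement of run-1")
        record_run(log, "run-3", None, "placeholder failed run, kept on record")
        return best_run(log)
```

Failed runs stay in the log with a `None` score, so a later agent (or a human) can see what was already tried instead of repeating it.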

Practical Implications

EurekAgent arrives alongside a cluster of related papers (EvoArena and HyperTool) that collectively suggest a quiet consensus forming around environment engineering as the near-term lever for agent capability. These papers all point to the same conclusion: what an agent can do depends more on how its environment is structured than on the architecture of the agent itself.

For teams building agent systems, this has immediate practical implications:

  1. Audit your agent's environment before tuning its prompts. Map out permissions, artifact storage, budget limits, and human intervention points. These dimensions likely matter more than your system prompt.

  2. Treat failed experiments as reusable assets. EurekAgent's artifact engineering means every failed run is preserved in Git, creating a searchable history. This turns failure into compound knowledge.

  3. Set explicit cost boundaries before the agent runs. Budget engineering forces strategic exploration. Without it, agents can exhaust resources on dead ends without learning.
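As a starting point for the audit in step 1, here is a rough Python sketch that checks a configuration against the four dimensions and lists whichever are still undefined. `EnvironmentSpec`, its field names, and `audit` are hypothetical; they mirror the dimensions only loosely and are not an interface from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnvironmentSpec:
    # All field names are hypothetical stand-ins for the four dimensions.
    allowed_tools: List[str] = field(default_factory=list)          # permissions
    artifact_dir: Optional[str] = None                              # artifacts
    max_cost_usd: Optional[float] = None                            # budget
    approval_checkpoints: List[str] = field(default_factory=list)   # oversight


def audit(spec: EnvironmentSpec) -> List[str]:
    """Return the dimensions that are still undefined in this spec."""
    gaps = []
    if not spec.allowed_tools:
        gaps.append("permissions")
    if not spec.artifact_dir:
        gaps.append("artifacts")
    if spec.max_cost_usd is None or spec.max_cost_usd <= 0:
        gaps.append("budget")
    if not spec.approval_checkpoints:
        gaps.append("human oversight")
    return gaps
```

An empty list from `audit` only means each dimension has been given a value; whether those values are sensible still needs a human look.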

If you're building agents and want to apply these principles, the tutorials on ClawWorld walk through designing agent environments step by step, from permission boundaries to artifact management to human oversight patterns.

The full EurekAgent paper is available on arXiv, with open-source code on GitHub. It's worth reading for anyone serious about building agents that don't just work, but discover.
