How to build an AI agent for science
By Ryan Kosai
Potato is building an environment for biological sciences. The environment includes tools to search and review scientific literature, build wet lab protocols, create and run bioinformatics scripts, and drive robots for full lab automation. If we succeed, millions of AI scientists running in a high-fidelity environment will accelerate scientific discovery on an unfathomable scale.
We've also created an AI agent (Tater) that is capable of autonomously performing experimentation and analysis using the environment. While Tater is currently a leading scientific AI agent, we believe that all agents will likely be commoditized by progressively smarter frontier models from OpenAI, Anthropic, Meta, and xAI. In the long run, the primary defensibility of AI-powered science will come from engineering accurate and detailed “world models” that those agents can interact with. With this in mind, we are publishing the details of Tater's implementation, which not only lets us learn from how others are building their agents, but also gives our users transparency into how we use AI for scientific reasoning.
The Environment
Our original vision for Potato was not to build an AI agent, but to provide an active “world model” for a tool-wielding, scientifically inclined artificial general intelligence (AGI). This meant creating prompt-based wrappers around infrastructure at the intersection of open and proprietary tools, high-quality literature, and public and private user datasets. We continue to pursue this strategy, and we have spent the last two years building the datasets and tools necessary for a scientific environment.
AGI has not yet arrived. So, as an early bridge, we've produced a lightweight agent ourselves, capable of using our AI-ready tools. Here's how we've done it, using the accumulated best practices for modern agents.
The Loop
At the core of an agent is a loop comprising a thinking phase and a tool-calling phase. The thinking phase keeps track of knowledge and identifies what tool to call next; the tool-calling phase engages the appropriate tool to do the actual work.
To reduce the number of tools considered at any one time, we modeled our approach on the OpenAI Agents SDK, which supports the idea of handoffs between distinct agent capabilities. This is critical because it introduces a natural hierarchical organization that helps reduce the agent's context and shorten the developer prompt.
We extend the native capabilities of the Agents SDK with support for asynchronous, long-running calls. Our agent polls until tasks are complete, handles the response, and then enters the next thinking phase.
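In simplified Python, the loop looks something like the sketch below. The names (`llm.think`, `tools[...].start`, and so on) are illustrative placeholders rather than our production interfaces; the point is the alternation between thinking, asynchronous tool execution, and the next thinking phase.

```python
import time

def run_agent(task, tools, llm, max_steps=50):
    """Minimal agent loop: think, dispatch a tool, poll for async results, repeat.

    `llm` and `tools` are hypothetical stand-ins used only to illustrate the
    control flow; they are not our production interfaces.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Thinking phase: review history, decide on the next tool (or finish).
        decision = llm.think(history=history, tools=tools)
        if decision.is_final:
            return decision.answer
        # Tool-calling phase: kick off a potentially long-running job.
        job = tools[decision.tool_name].start(decision.arguments)
        while not job.done():          # polling phase for asynchronous work
            time.sleep(5)
        # Only a compact result (plus a reference ID) re-enters the context.
        history.append({
            "role": "tool",
            "name": decision.tool_name,
            "content": job.summary(),
            "ref": job.reference_id,
        })
    raise RuntimeError("Step budget exhausted before the task completed")
```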
Tool Context
One key aspect of our architecture is that the tools themselves are AI-powered, and each tool exposes a prompt format to indicate the inputs and desired result. This is particularly important because each tool run involves substantial documentation, internal instructions, and API calls that would otherwise fill the agent's context. Instead, we return just the necessary results, with a reference ID for later access to the tool's internal reasoning and artifacts.
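As a rough illustration (the field names here are made up for this post, not our actual schema), the envelope a tool hands back might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class ToolResult:
    """Illustrative envelope returned by an AI-powered tool.

    Only `summary` enters the agent's context; the full documentation,
    internal reasoning, and raw outputs stay behind `reference_id`.
    """
    reference_id: str                 # handle for retrieving full details later
    summary: str                      # the few sentences the agent actually sees
    artifact_keys: list[str] = field(default_factory=list)

# e.g. a docking tool returns a compact result instead of its full trace
result = ToolResult(
    reference_id="run-01H9XK",
    summary="Protein-ligand docking complete; ranked poses stored as artifacts.",
    artifact_keys=["poses", "scores", "parameters"],
)
```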
In silico simulation, for instance, requires detailed module documentation but also benefits heavily from literature-drawn scientific context. This context is crucial for accurately setting parameters, such as in protein-ligand docking, where environmental conditions directly impact affinity scoring.
Likewise, evaluating literature involves search and reranking, scanning documents for relevant facts, and finding corroboration or divergence between documents. All of these are token-heavy activities, and benefit from the externalizing effects of a tool-oriented architecture.
Storing the artifacts in a structured format additionally provides a medium-term “memory” for the agent, which allows it to retrieve specific details by key, without automatically consuming valuable context space.
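A toy version of that keyed memory is sketched below: an in-memory dictionary standing in for our structured artifact store, shown only to illustrate retrieval by reference ID and key rather than by re-reading the full history.

```python
class ArtifactStore:
    """Toy medium-term memory: artifacts filed under (reference_id, key).

    A hypothetical stand-in for a persistent, structured store, used here to
    illustrate on-demand retrieval without consuming context space.
    """
    def __init__(self):
        self._store = {}

    def put(self, reference_id, key, value):
        self._store.setdefault(reference_id, {})[key] = value

    def get(self, reference_id, key):
        # Fetched only when a later thinking phase needs this specific detail.
        return self._store[reference_id][key]

store = ArtifactStore()
store.put("run-01H9XK", "parameters", {"exhaustiveness": 8, "ph": 7.4})
# Later, the agent pulls just this detail instead of the whole tool trace:
docking_parameters = store.get("run-01H9XK", "parameters")
```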
Rewriting History
Even with a dedicated tool context, information accumulates rather quickly. To further simplify the context window for each thinking phase, we rewrite the entire history into concise XML blocks representing previous tool executions. This has the benefit of compressing both the tool request and response into a single chunk of information, eliminating content duplication. Internal testing has also shown the XML structure to be more amenable to handling by the latest models than the native JSON formats used for calling the API.
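The sketch below shows the general shape of that rewrite (the tag and field names are illustrative): each tool request and its response are collapsed into one XML block before the next thinking phase sees the history.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml_block(step):
    """Collapse one tool request/response pair into a single XML chunk.

    `step` is a hypothetical dict holding the tool name, a condensed request,
    a condensed result, and the reference ID for the full artifacts.
    """
    return (
        f"<tool_call name={quoteattr(step['tool'])} ref={quoteattr(step['ref'])}>\n"
        f"  <request>{escape(step['request'])}</request>\n"
        f"  <result>{escape(step['result'])}</result>\n"
        f"</tool_call>"
    )

def rewrite_history(steps):
    # The next thinking phase sees these compact blocks instead of the raw
    # JSON request/response pairs, eliminating duplicated content.
    return "\n".join(to_xml_block(step) for step in steps)
```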
We also compress the historical context, using the compression method published by Factory AI, and have found it to be very effective and tunable across context window sizes.
Branching the Timeline
One major challenge to building a robust agent in the current era is that frontier model outputs still exhibit both high variability and sensitivity to input conditions. Practically speaking, this means that a non-trivial number of decisions made by the model are wrong or suboptimal, and these errors compound when working over a long time horizon.
To mitigate this issue in the short term, we allow the user-in-the-loop to branch the timeline after any thinking phase. During a branching event, the historical record is maintained, but the model is prompted to reevaluate its decision with additional user suggestions and context. This can be performed post hoc: even if the agent has completed a run, a user can revisit the timeline and create a branch at any point along it.
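Conceptually, a branch is just a new run that shares a prefix of the original history. A minimal sketch (with illustrative names) follows: everything up to the chosen thinking phase is preserved, the user's suggestion is appended, and the agent re-decides from there.

```python
from copy import deepcopy

def branch_timeline(history, branch_point, user_suggestion):
    """Start a new timeline that diverges after a chosen thinking phase.

    `history` is the ordered list of thinking/tool steps and `branch_point`
    is the index of the thinking phase to revisit. The original record is
    left untouched; the branch re-runs the decision with added user context.
    """
    branch = deepcopy(history[: branch_point + 1])
    branch.append({
        "role": "user",
        "content": f"Reconsider your last decision. Additional context: {user_suggestion}",
    })
    return branch  # fed back into the thinking phase of a fresh run

# Post hoc: even after a completed run, a user can branch from any earlier step.
# new_history = branch_timeline(finished_run, branch_point=7,
#                               user_suggestion="try the orthogonal assay first")
```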
In the long term, a more sustainable solution is to allow automated recursive backtracking, so the agent can decide against its current trajectory and try again from an earlier iteration. However, this approach requires very robust instantaneous scoring toward the final outcome, a significant data science challenge that we continue to pursue.
Scoring the Results
Once the agent has ended its session, the system automatically considers the full history and artifacts to produce a complete summary report. This report provides the scientific context, experiments run, and a summary of the evaluated data.
The system also applies an LLM-as-judge numerical scoring rubric with categories such as completion, correctness, and efficiency. These scores form an early framework for measuring how well the agent utilized the environment, as well as how well it explains, justifies, and improves its reasoning. Depending on how the future unfolds, this may unlock meta-learning for optimized execution pathways, such as continuous improvement by reinforcement learning.
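A simplified sketch of that rubric and judging step is below; the categories mirror the ones above, while the `llm.judge` call is a placeholder rather than our actual prompt.

```python
from dataclasses import dataclass

RUBRIC = {
    "completion":  "Did the agent finish the stated task?",
    "correctness": "Are the scientific claims and analyses sound?",
    "efficiency":  "Did it reach the result without wasted tool calls?",
}

@dataclass
class CategoryScore:
    category: str
    score: int       # e.g. 1-5, as judged by the LLM
    rationale: str   # the judge's short justification, kept for later review

def score_run(llm, report, rubric=RUBRIC):
    """Ask an LLM judge to score the run report against each rubric category.

    `llm.judge` is a hypothetical call returning (score, rationale); these
    per-category scores are what could later seed reinforcement learning.
    """
    return [
        CategoryScore(category, *llm.judge(report=report, question=question))
        for category, question in rubric.items()
    ]
```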
The AI-powered Future
As suggested earlier, we expect all of this functionality to be subsumed by frontier models from the major LLM providers. While we've built an AI agent resting on existing frontier models, we continue to advance and build the environment for an AI-powered future — a high-fidelity world model for scientific research.
As AGI approaches, we see millions of agents from a variety of providers evaluating literature, running in silico experiments, and operating fully autonomous labs — enabling new knowledge production at unimaginable speed. The upside of progressively faster discovery loops has no historical precedent: even a 10x improvement in the rate of discovery compresses a century of progress into a decade.
Thank you for coming to my TED talk.