A Triphasic Architecture for Coding Agents

Spoiler alert: This post might seem like it's just defining agents, but it's really my argument for why we need to shift focus. We need to move away from trying to build one-size-fits-all generic agents and towards creating domain-specific toolchains.

2025 is the year of the agent, as we often hear, and more specifically the coding agent. Below are my thoughts on agents and where I see them heading. I try to outline why I advocate for a triphasic architecture and what I think could supercharge agents, especially for complex software development tasks. My thesis is that, for any significant task, an agent needs distinct capabilities, and thus distinct tools, for three core phases: Information Gathering, Implementation/Action, and Verification/Validation. To turn that into an agent, you wrap it in a loop, add some more generic tools, some (automatic) initial context selection, and context distillation for longer tasks.

A diagram showing a triphasic agent architecture
Figure 1: A simple agent architecture.

Phase 1: Information Gathering - What Makes it an "Agent" Task?

For me, a key distinction between simple LLM tasks and true 'agent' tasks lies in the requirement for information. Does the task inherently require gathering external information or context that isn't provided upfront? If not, the LLM primarily operates on the provided input, transforming information and intent already present.

A diagram showing the difference between a simple LLM task and an agent task
Figure 2: The difference between a simple LLM task and an agent task.

Asking an LLM to write a basic HTML/JS/CSS weather app is complex, but all the necessary intent and information is in the prompt (and in the model's weights, which let it interpret that intent). The LLM operates on the input. Contrast that with: Integrate a new notification feature into our existing React codebase. This is harder. Why? Because the agent needs to:

  • Understand the existing codebase structure.
  • Identify relevant components, services, and state management patterns.
  • Figure out where the new code should live.
  • Understand existing styling conventions or component libraries.
  • Check for potential conflicts or side effects.

This requires active exploration and information gathering beyond the initial prompt. It's the difference between writing a standalone creative blog post versus compiling a detailed research report on a specific scientific topic. One relies heavily on internal generation, the other demands external data acquisition.

An ideal agent performs this information gathering. Just having this capability, however, only gets you partway there. You've essentially built a sophisticated "deep (re)search system," but it's not yet acting or solving the core problem autonomously.

The Human Context Problem

Most human work, and what we value most, cannot be created in an instant. We build upon foundations: existing structures and knowledge. We rarely start a task from scratch; we build on what already exists in the world we know and see around us. For an agent to be practical and useful, it needs to provide value within these existing systems.

The only way to do so is to integrate into the world we know. We often fail to provide all the necessary information, partly because we have internalized so much of it. Maybe most importantly: if we actually had to dig up and collect all that information, it would take us longer than just doing the task ourselves. I mean gathering everything about the codebase: the dependencies, the documentation, the project structure, the existing code and tests, and so on. This is the knowledge built up by spending time in the codebase, building things and solving problems. Programmers working in a codebase have internalized far more of it than they even realize. To provide this information to an agent, we would have to make it explicit, and that is a lot of work. The agent needs to be able to gather it for itself, or we need to find a way to do it for the agent, because manually providing all of it is not feasible and we as humans are not good at it.

This is a fundamental problem to solve in the quest for efficient agents, and (semantic-embedding-based) retrieval-augmented generation (RAG) is not going to get us there.

A diagram showing a problem that would be hard to solve with RAG
Figure 3: An example of a problem that would be hard to solve with RAG.

My assertion that RAG alone won't suffice, especially for complex codebases, stems from its fundamental reliance on semantic similarity, often overlooking crucial structural context. Embedding models, typically trained on general text, struggle to capture the deep semantic nuances, architectural purpose, and specific abstraction layers inherent in large-scale software.

More critically, this semantic focus often fails to trace essential structural dependencies. For instance, if your task requires understanding function X, standard RAG might retrieve documentation or code semantically related to X's purpose. However, if X calls a vital but generically named utility function Y (e.g., process_item), RAG is unlikely to retrieve Y when its content lacks direct semantic overlap with the primary task, even though understanding Y is structurally essential to understanding X. This isn't just about getting noisy results; it's about potentially missing indispensable pieces of the puzzle entirely.

This highlights the challenge: RAG might return superficially plausible context while failing to provide a complete picture, and the agent often lacks a reliable way to verify this completeness. In contrast, leveraging the code's explicit structure (navigating its call graph or hierarchy using specialized tools) offers a more deterministic and verifiable path to gathering the necessary context, especially multiple levels deep.
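
To make that contrast concrete, here is a minimal sketch (in Haskell, since that is where I plan to experiment) of structure-driven context gathering. It assumes a hypothetical, pre-extracted call graph as a simple map from a function to its callees; none of the names are from a real tool.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Symbol    = String
type CallGraph = Map.Map Symbol [Symbol]

-- Everything structurally required to understand `root`, up to `depth` levels deep.
-- A generically named helper like process_item is picked up because it is *called*,
-- regardless of whether its text resembles the task description.
requiredContext :: CallGraph -> Int -> Symbol -> Set.Set Symbol
requiredContext graph depth root = go depth (Set.singleton root) [root]
  where
    go 0 seen _        = seen
    go _ seen []       = seen
    go d seen frontier =
      let callees = concatMap (\s -> Map.findWithDefault [] s graph) frontier
          new     = filter (`Set.notMember` seen) callees
      in  go (d - 1) (Set.union seen (Set.fromList new)) new

-- ghci> requiredContext (Map.fromList [("X", ["process_item"]), ("process_item", ["log"])]) 2 "X"
-- fromList ["X","log","process_item"]
```

Unlike a similarity search, this traversal is deterministic: the agent can verify that it has everything reachable within a given depth, rather than hoping the relevant pieces scored well.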

I do believe multi-modality is also part of the puzzle here but I will save that for another post.

The Context Crunch: Distilling Information

Okay, so the agent needs to gather information. Great. But here's the immediate roadblock: context windows, the limited amount of text an LLM can process at one time. LLMs can only pay attention to so much information at once. Real-world codebases are massive. Documentation can be sprawling. Add in dependencies, and the potential context explodes.

Simply finding relevant files isn't enough. We need effective context distillation (or maybe "crystallization" is a better word?). This isn't just summarization; it's about extracting the essence relevant to the specific task at hand, stripping away the noise and fluff. This seems absolutely crucial for any realistic agent system today.
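
As a tiny illustration of the selection half of this, here is a sketch (assuming a hypothetical per-chunk relevance score and a rough token count) of greedily packing the most relevant chunks into a fixed budget. Real distillation would also rewrite and condense the chunks, not merely pick them.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- A context chunk with a hypothetical relevance score and a rough token count.
data Chunk = Chunk
  { chunkText :: String
  , relevance :: Double
  , tokens    :: Int
  }

-- Greedy selection: keep the most relevant chunks that still fit in the budget.
distill :: Int -> [Chunk] -> [Chunk]
distill budget chunks = go budget (sortBy (comparing (Down . relevance)) chunks)
  where
    go _ [] = []
    go remaining (c : cs)
      | tokens c <= remaining = c : go (remaining - tokens c) cs
      | otherwise             = go remaining cs
```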

Could this problem disappear? Maybe. If we ever get truly massive, coherent context windows where we can dump entire multi-million-line codebases, project docs, dependency graphs, and API specs without performance hits or the model getting lost... well, things change. We also need to consider security: how do we prevent 'malicious context injection' via comments in dependencies if the model sees everything?

But until that future arrives (if ever), efficiently selecting and refining context, and doing as much of this autonomously as possible, is paramount.

For codebases, this might take the form of a hierarchical graph:

  • Level 0 (Overview): Files, Class names, Function signatures
  • Level 1 (Implementation): Full function bodies, with edges representing calls to other functions/classes.
  • Traversal: An agent could navigate this graph, requesting deeper levels (e.g. 'show implementation of function Y called by function X') only when needed.

A very concrete, minimal textual representation of the graph could be something like FileA.hs -> ClassB -> methodC() -> calls: methodD(), ClassE.methodF(). This kind of structure allows targeted context retrieval: a codebase-specific tool where context granularity is controlled. There are some projects out there that (I suspect) do similar things, but I don't think to the extent I imagine would be most useful, probably because this will need to be language-specific and maybe even software-architecture-specific. I don't know whether generic tools will quickly become better at this than what is possible with a bespoke approach.
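
A sketch of what such a hierarchical representation might look like as data, with a Level 0 renderer that produces exactly this kind of one-line-per-method overview and a Level 1 lookup for a single method body. All names and fields are illustrative, not a real tool's schema:

```haskell
import Data.List (intercalate)

data File   = File   { filePath :: String, classes :: [Class] }
data Class  = Class  { className :: String, methods :: [Method] }
data Method = Method
  { methodName :: String
  , calls      :: [String]  -- qualified names of callees
  , body       :: String    -- Level 1 detail, fetched only on request
  }

-- Level 0: the skeleton, cheap enough to keep in context.
renderOverview :: File -> String
renderOverview f = unlines
  [ filePath f ++ " -> " ++ className c ++ " -> " ++ methodName m ++ "()"
      ++ " -> calls: " ++ intercalate ", " (calls m)
  | c <- classes f
  , m <- methods c
  ]

-- Level 1: the full body of a single method, requested only when needed.
renderImplementation :: File -> String -> Maybe String
renderImplementation f name =
  case [body m | c <- classes f, m <- methods c, methodName m == name] of
    (b : _) -> Just b
    []      -> Nothing
```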

Phase 2: Acting on Information - Beyond Text Editing

Gathering and understanding context is necessary, but not sufficient. An agent needs to act. This is where the output of the model starts having real-world side effects, something functional programmers will recognize from thinking about purity.

Action is obviously needed during information gathering too: the agent uses tools (like file readers, search functions, API callers) to get the data. The efficiency here matters hugely. Better tools mean faster, more relevant context gathering, less context pollution, and fewer expensive LLM calls. This is another reason I believe we'll see a shift from generic tools (like read file, run command) towards more specialized ones for specific information-gathering goals in coding (e.g., find all callers of this function, get relevant React component props, summarize recent changes in this directory).
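
As a sketch, such a specialized information-gathering toolkit might be exposed to the agent as a small record of focused queries rather than raw file access (all names here are hypothetical):

```haskell
-- Parametrized over the agent's effect monad `m`, so the same surface can be
-- backed by a real code index, a language server, or a test stub.
data InfoTools m = InfoTools
  { findCallers       :: String   -> m [String]        -- call sites of a function
  , componentProps    :: String   -> m [String]        -- props of a React component
  , summarizeChanges  :: FilePath -> m String          -- condensed recent diff for a directory
  , dependencyVersion :: String   -> m (Maybe String)  -- version of a dependency, if present
  }
```

Each query is narrow by construction, so the agent gets back a focused answer instead of a wall of file contents it then has to carry around in context.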

But let's talk about the implementation action, actually changing the codebase. Right now, a lot of agent work seems focused on using text-editing primitives: insert line, delete line, replace block. Is this the most effective way forward? I suspect not.

I think we'll move towards higher-order tools that allow agents to express changes more abstractly and robustly. Think beyond basic regex find-and-replace. What could this look like?

  • AST Manipulation Tools: Modify the code's underlying structure (the Abstract Syntax Tree, or AST, which represents the code's grammatical structure) directly (e.g., rename variable 'foo' to 'bar' in this scope, extract these lines into a new function called 'baz'). This is much safer than text patching.
  • Refactoring Primitives: Expose common IDE refactorings as tools (e.g., inline function, change function signature, introduce parameter object).
  • Framework-Specific Tools: Tools aware of framework conventions (e.g., register this new component with the router, add a field to the database schema and generate migrations, inject this service dependency).
  • API/Schema Tools: Apply changes based on schemas (e.g., update API client based on this new OpenAPI spec, generate boilerplate from this GraphQL schema change).

Your main implementing agent should not be concerned with making sure every single call site of a renamed variable gets edited. Such details are context pollution. It needs to focus on architecting durable solutions.

A diagram showing the difference between simple text editing and domain-specific tools
Figure 4: The difference between simple text editing and domain specific tools.
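
To make the contrast concrete, here is a sketch of the two vocabularies an agent could be given. The names are invented for illustration; the point is that a single higher-order operation carries the intent once and leaves the mechanical fan-out to a deterministic engine (a compiler API or language server, say):

```haskell
-- The text-editing vocabulary: the agent must track every location itself.
data TextEdit
  = InsertLine  FilePath Int String
  | DeleteLine  FilePath Int
  | ReplaceSpan FilePath Int Int String

-- The higher-order vocabulary: one value expresses the whole intent, and the
-- engine is responsible for finding and updating every affected call site.
data RefactorOp
  = RenameSymbol   FilePath String String String  -- file, scope, old name, new name
  | ExtractMethod  FilePath Int Int String        -- file, start line, end line, new name
  | InlineFunction FilePath String                -- file, function to inline
```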

I've also been intrigued by the idea of giving the agent a Domain Specific Language (DSL) for the tools it can use. Instead of parsing structured JSON output from LLMs or using the tool-calling APIs of the different providers, the LLM could get a DSL spec with function signatures that have clear side effects, and output a small program in this DSL, which we then execute. For example, instead of complex JSON for a refactor, the LLM might output DSL code like: Refactor.RenameVariable(filePath: 'userAuth.js', scope: 'loginUser', oldName: 'pwd', newName: 'passwordHash'); Refactor.ExtractMethod(filePath: 'userAuth.js', startLine: 42, endLine: 55, newMethodName: 'validateCredentials'). This seems incredibly promising not just for tool-call efficiency but also for expressing complex actions, actions that could take orders of magnitude more calls in other tool paradigms. I'm planning to experiment with this and might write about it later!

Right now my mind is on developing a system like this in Haskell, leveraging its type system, its tight control over side effects, and its parsing ergonomics to evaluate such a DSL with the necessary guardrails. But I will need to develop a lightweight agent framework for Haskell first.
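
Here is a rough sketch of the shape I have in mind, assuming a hypothetical Refactor-style DSL like the example above: actions are plain values, so they can be inspected and checked against guardrails (an allow-list of files, a dry run) before any side effect happens. None of this is a real implementation.

```haskell
import Control.Monad (forM_, unless)

-- Actions are plain data: the model emits them, we parse them, and only a
-- vetted interpreter turns them into side effects.
data Action
  = RenameVariable { filePath :: FilePath, scope :: String, oldName :: String, newName :: String }
  | ExtractMethod  { filePath :: FilePath, startLine :: Int, endLine :: Int, newMethodName :: String }
  deriving Show

-- Guardrail: the agent may only touch files on an allow-list.
allowed :: [FilePath] -> Action -> Bool
allowed allowList act = filePath act `elem` allowList

-- Interpreter stub: a dry run that only prints what would happen. A real
-- system would hand each action to an AST/refactoring engine here.
run :: [FilePath] -> [Action] -> IO ()
run allowList actions = forM_ actions $ \act -> do
  unless (allowed allowList act) $
    fail ("Refusing action outside allow-list: " ++ show act)
  putStrLn ("Would execute: " ++ show act)

-- Example, mirroring the DSL snippet above:
--   run ["userAuth.js"]
--       [ RenameVariable "userAuth.js" "loginUser" "pwd" "passwordHash"
--       , ExtractMethod  "userAuth.js" 42 55 "validateCredentials" ]
```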

Phase 3: Verification and Iteration - Closing the Loop

Okay, the agent gathered info, figured out the context, and made a change. Are we done? That depends on the task, but for high-performing agent systems built on current model capabilities, we do not want to stop here. A crucial part of any robust agent system is the ability to verify the change and iterate based on feedback.

This step highlights why not all tasks are equally suited to agents right now. If trying or testing solutions is expensive or dangerous, and there's no easy way of verifying or validating success short of deploying it (think complex biological experiments, manufacturing, or controlling physical hardware, were it not for the software test environments available), then we're essentially forced to trust the AI to one-shot the solution. That's a big ask. Humans in these roles often have years of education and hands-on trial-and-error experience.

This is where coding agents have a potential advantage, if we give them the right tools. We have existing mechanisms:

  • Build tools (compilers, bundlers)
  • Linters and formatters
  • Automated tests (unit, integration, end-to-end)

These form the basis of the third essential toolkit: Verification Tools. But again, raw tools might not be enough. Handing an LLM a million-token build log or a massive Jest test failure output is often counterproductive, due to context limits and the model's limited ability to sift signal from noise. We need to make these tools more LLM-friendly.

A diagram showing a script that an agent could run to verify the changes
Figure 5: A script that an agent could run to verify the changes.

I think we need to explore tiered or hierarchical approaches, similar to how a human programmer debugs:

  1. Did it compile/build? (Basic check)
  2. Did linters/formatters pass? (Code style/quality check)
  3. Did unit tests pass? (Focused functionality check)
  4. Did integration tests pass? (Interaction check)
  5. Can we summarize the failures concisely? (Instead of raw logs)
  6. Can we pinpoint the likely cause based on the changes made and the tests failed?

But most importantly, we need not just this series of checks but a tiered approach within each check. Did it build? No? What is the top-level error? Where did the error actually bubble up? Is there anything wrong in that piece of code? ...
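
A sketch of what such a tiered runner could look like, with hypothetical npm commands standing in for the real build, lint, and test steps; only a trimmed summary of the first failing tier is handed back to the model instead of the raw log:

```haskell
import System.Exit    (ExitCode (..))
import System.Process (readProcessWithExitCode)

data Tier = Tier { tierName :: String, cmd :: String, args :: [String] }

tiers :: [Tier]
tiers =
  [ Tier "build" "npm" ["run", "build"]
  , Tier "lint"  "npm" ["run", "lint"]
  , Tier "unit"  "npm" ["test", "--", "--silent"]
  ]

-- Nothing means every tier passed; otherwise return a short, LLM-sized summary
-- of the first failure instead of the full log.
verify :: IO (Maybe String)
verify = go tiers
  where
    go []         = pure Nothing
    go (t : rest) = do
      (code, out, err) <- readProcessWithExitCode (cmd t) (args t) ""
      case code of
        ExitSuccess -> go rest
        _           -> pure (Just (summary t out err))

    summary t out err =
      "Tier '" ++ tierName t ++ "' failed.\n"
        ++ unlines (take 20 (lines (err ++ out)))  -- crude distillation of the log
```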

Other domains will face similar challenges in verifying outcomes and making feedback loops effective and efficient for an AI to process. A lot of domains have software suites for verification and validation, but making those results extractable from the software and LLM-interpretable will be the challenge.

A diagram showing the triphasic architecture
Figure 6: The triphasic architecture of an agent as defined in this post.

Tying it Together: The Three Toolkits

So, my core argument is this: instead of chasing a single, generic "agent," we should focus on building suites of specialized tools tailored to specific tasks or domains. For any given complex task (like modifying a codebase), a capable agent needs access to integrated toolkits for:

  1. Information Gathering: Efficiently exploring and finding relevant context (code structure, docs, dependencies).
  2. Implementation/Action: Making changes reliably and expressing intent clearly, likely via higher-order tools or DSLs, not just text patching.
  3. Verification/Validation: Checking the results, interpreting feedback (build logs, test results) effectively, and enabling iteration.

Conclusion

Building truly useful coding agents is less about creating one god-like AI and more about engineering the right ecosystem of specialized tools. The challenges are significant (context limits, tool design, verification interpretation), but the potential payoff is huge. By focusing on these three phases (Gather, Act, Verify) and developing tailored toolkits for each, I think we can move beyond the hype and start building agents that genuinely augment our capabilities as developers. I'm excited to experiment more in this space, particularly with DSLs for agent actions. Lots to explore!