Essay

Why I built my own agent framework before Claude Max

May 9, 2026 · ~8 min read

September 2024. No production agent frameworks existed. GPT-4o was the frontier model. Here's what that forced me to learn — and why it mattered later.

September 2024. I was at MakenaAI — an AI / mobile / property tech company — as the sole AI engineer on a team that needed internal tooling built from scratch. The AI landscape looked almost nothing like it does now: GPT-4o was the frontier model, Claude 3.5 Sonnet had just shipped, and what we now call “agent frameworks” were either academic experiments or vaporware.

I needed something that worked.

The task was automating parts of a document-heavy research and operations workflow — repetitive, structured at the core, judgment-requiring at the edges. The obvious move was to pick up LangChain or AutoGPT and bolt it onto the codebase. I tried both. They were either too rigid for the task or too poorly documented to trust in production. So I started writing my own.

What you learn when there’s no framework to hide behind

The first thing you realize is that an “agent” is just a loop with a tool palette. The model generates a response; if it contains a tool call, you execute the tool and feed the result back; you repeat until the model stops calling tools. That’s it. The intelligence is in the model. Your job as the engineer is to constrain the loop correctly.

The second thing you realize is that the hard problems aren’t in the loop — they’re in the edges. What happens when the model calls a tool with malformed arguments? What happens when a tool call times out? What happens when the model gets stuck in a cycle? A framework that doesn’t have answers to these questions ships you bugs, not features.

I built the loop from scratch. Python, no dependencies except the Anthropic SDK. Tool dispatch via a registry pattern — each tool registered with a schema and a handler. Error handling at every layer. Retry logic with exponential backoff. Context window management (truncation, not sliding — MakenaAI’s use case was document-heavy and truncation was more predictable than sliding). Logging that made debugging possible.

The model routing problem

The second major thing I built was model routing. Not every task needs the smartest model. GPT-4o was good at judgment calls. GPT-3.5 was fast and cheap and fine for formatting tasks. I wired in OpenRouter early and built a task-type classifier that routed calls to the appropriate tier.

This sounds obvious in retrospect. It wasn’t obvious in September 2024 when every framework assumed you were using one model for everything. The cost difference between routing correctly and routing naively was roughly 10x for high-volume tasks.

Why this mattered for everything that came after

When Claude Max shipped and I moved into the 18-day burst period that produced 1,005 merged tickets, I wasn’t starting from zero. I understood exactly what the frameworks underneath me were doing because I’d built the same things by hand. When a hook didn’t fire the way I expected, I knew how to debug it. When a ticket executor produced malformed output, I knew which layer to look at.

The engineers who learned agents by picking up a framework first are going to spend months debugging problems they don’t have the vocabulary to name. The engineers who understand the loop — the tool dispatch, the error handling, the context management — can work at a different level.

September 2024 forced me to build that vocabulary. I’m glad it did.


Questions about the architecture? Reach out at sptraynor2001@gmail.com.