lauren@terminal:~/blog$

$ cat why-yes-i-ab-tested-ai-agent-frameworks.md

Why Yes, I A/B Tested AI Agent Frameworks

I was debugging a container patch on a Tuesday when I realized how far I've come on my product journey.

At work, I'm Head of Product at an open-source infrastructure and AI company. I spend my days thinking about how organizations deploy and manage agents at scale, as well as how end users want their AI agents to work for them. Phrases like "agent lifecycle management" and "secure containerized execution" are regulars in my daily vocabulary.

At home, I had a Mac Mini running an AI agent that kept breaking. And yes, you can do this without a Mac Mini; I just happened to have one long before the tech bros lost their minds.

The home setup started because I needed to understand agents from the inside. Not from a product strategy doc. Not from a vendor demo. I needed to know what it actually felt like to build one, run one, maintain one, and watch it fail at 2am because a rate limit changed. You can't make good product decisions about something you've only read about. (This is the same thing I tell people about improv. Reading about it and doing it are different activities that share a vocabulary.)

I wanted a personal AI assistant that was always on, actually useful, and didn't require me to open a browser and start a new chat every time I had a thought. Something I could message from Telegram or Slack. Something that knew my meeting schedule, could dig through my transcripts, and could run tasks overnight while I slept.

Not a demo. Not a weekend project I'd abandon. A real thing I'd use every day.

The first version ran Agent Zero, an open-source agent framework that had been around long enough to feel mature. Docker containers, file-based storage, MCP integrations. I wired it up to AnythingLLM for search across my meeting transcripts, a Neo4j knowledge graph for relationship mapping, and my own Slack for posting results. It had scheduled tasks that ran every night. Story extraction from my meeting transcripts, research digests, a weekly intelligence roundup.

Agent Zero was genuinely capable. The MCP integration pattern was solid. Multi-model support meant I could run Anthropic Sonnet for real conversations, OpenRouter for scheduled tasks, Gemini for utility work. The Telegram and Slack bridges meant I could reach it from anywhere.

But keeping it running required the kind of maintenance that makes you question your life choices. Three separate container patches just to get scheduled tasks working reliably. History accumulated with each task run, bloating context until the agent got confused about what it was supposed to be doing. Rate limiting at Anthropic's Tier 1 (30K tokens per minute) was tight enough that two simultaneous conversations would choke. Running Sonnet for everything was expensive. And file access inside containers was unreliable for mounted directories, which meant the tasks that needed to read my meeting transcripts would randomly fail to find them.

I was spending more time maintaining the agent than the agent was saving me. That's a product failure, even when the product is your own side project.

In March, NanoClaw announced their partnership with Docker to run AI agents safely. It was the first time I felt comfortable jumping in on the claw phase. The flexibility angle also caught my attention. Understanding NanoClaw's architecture would directly inform the product decisions I was making at my day job.

I could have just switched. Rip out Agent Zero, install NanoClaw, move on. But the product manager in me couldn't do it. I had assumptions about which framework was better. I did not have data. (Sound familiar?)

So I ran an A/B test.

running both frameworks side by side

Both agents ran the same scheduled tasks against the same transcripts on the same schedule. Story extraction at 10pm weekdays. Research digest at 10:30pm. Weekly roundup on Sundays. Same models where possible so I was testing the framework, not the AI. I tracked cost per interaction, setup complexity, ongoing maintenance, tool access, crash recovery, and file I/O reliability.
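The tracking doesn't need heavy tooling. Here's a minimal sketch of the kind of per-run log that makes a comparison like this possible — the framework names, task names, and fields are illustrative, not my actual harness:

```python
import csv
import time
from contextlib import contextmanager
from pathlib import Path

LOG = Path("ab_test_runs.csv")
FIELDS = ["framework", "task", "ok", "seconds", "cost_usd"]

@contextmanager
def tracked_run(framework, task, cost_usd=0.0):
    """Time one scheduled-task run and append the outcome to a CSV."""
    start = time.monotonic()
    outcome = {"framework": framework, "task": task, "ok": True,
               "seconds": 0.0, "cost_usd": cost_usd}
    try:
        yield outcome
    except Exception:
        outcome["ok"] = False
        raise
    finally:
        outcome["seconds"] = round(time.monotonic() - start, 2)
        new_file = not LOG.exists()
        with LOG.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if new_file:
                writer.writeheader()
            writer.writerow(outcome)

# Usage: wrap each framework's nightly task in the tracker.
with tracked_run("agent-zero", "story-extraction", cost_usd=0.42) as run:
    pass  # invoke the framework's scheduled task here
```

A week of rows in a CSV is enough to answer "which one is cheaper and which one keeps failing" without any dashboard.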

This is the part where I tell you NanoClaw won. It did. But the interesting result wasn't the winner. It was what the test revealed about the architecture itself.

That was the real insight. Not "NanoClaw is better than Agent Zero." The insight was that I'd been jamming two very different workloads into the same container and wondering why things kept breaking.

Interactive chat (me messaging the agent from Telegram, asking it to look something up, having a conversation) worked great in a container. The container spins up, does its thing, spins down. Clean isolation. Good security model.

File-heavy scheduled tasks (reading dozens of meeting transcripts, parsing them, extracting stories, writing digests) were a nightmare in containers. Mounted directories were unreliable. Context bloated across runs. Every failure was silent until I went looking.

So I split them. NanoClaw handles interactive chat through Slack and Telegram, plus one premium scheduled task that needs its sub-agent swarm capability (a weekly production brief for something creative I'm working on). Everything file-heavy moved to standalone Python scripts running directly on the Mac Mini host via cron. Story extraction, link research, weekly digests, content mining, intelligence briefs, cost tracking, work briefings. Eight separate scripts, each doing one thing well, each running on the host where file access just works.
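The standalone scripts really are as plain as that sounds. A sketch of the pattern — paths and the extraction step are placeholders, not my actual scripts — driven by a crontab line like `0 22 * * 1-5 python3 extract_stories.py`:

```python
"""Nightly story extraction, run by cron on the host (no container)."""
import datetime
from pathlib import Path

TRANSCRIPTS = Path("transcripts")   # placeholder; a real host directory
OUT = Path("digests")

def transcripts_for(day):
    """On the host, file access just works: plain globbing, no mounted volumes."""
    return sorted(TRANSCRIPTS.glob(f"{day:%Y-%m-%d}*.txt"))

def run():
    day = datetime.date.today()
    OUT.mkdir(exist_ok=True)
    digest = OUT / f"{day:%Y-%m-%d}-stories.md"
    lines = [f"# Stories for {day:%Y-%m-%d}", ""]
    for path in transcripts_for(day):
        text = path.read_text()
        # Placeholder for the actual extraction (an LLM call in practice)
        lines.append(f"- {path.stem}: {len(text.split())} words")
    digest.write_text("\n".join(lines) + "\n")
    return digest

if __name__ == "__main__":
    run()
```

One script, one job, one cron line. When it fails, cron mails you the traceback instead of failing silently inside a container.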

The cost reduction was around 80%. Gemini Flash's free tier handles most of the scheduled work. Expensive models are reserved for the tasks where quality actually matters. Bifrost, an AI gateway I set up on the same machine, routes requests to the cheapest appropriate provider automatically.
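The routing logic is conceptually just a lookup table. A toy version of the three-pool idea — the tier names and model identifiers are my illustrative stand-ins, not Bifrost's actual configuration syntax:

```python
# Route each workload class to the cheapest model that can handle it.
# Tier names and model identifiers are illustrative, not a real config.
ROUTES = {
    "chat":      "anthropic/claude-sonnet",  # quality matters, pay for it
    "scheduled": "google/gemini-flash",      # free tier covers nightly runs
    "utility":   "google/gemini-flash",      # summaries, titles, cleanup
}

def pick_model(task_kind, fallback="google/gemini-flash"):
    """Return the configured model for a workload, defaulting to the free tier."""
    return ROUTES.get(task_kind, fallback)
```

The point isn't the code; it's the default. Anything you haven't explicitly decided deserves an expensive model falls through to the free tier.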

Zero scheduled task failures after the migration. Not because NanoClaw is magic. Because the architecture finally matched the workload.

Agent Zero wasn't a failure. The MCP integration pattern I built for it became the blueprint for NanoClaw's tooling. The self-improving memory concepts evolved into persistent memory files and a skills library. The three-pool rate-limiting strategy (expensive model for chat, medium for tasks, free for utility) informed how I configured Bifrost's provider routing.

The first version of anything teaches you what the second version needs to be. That's true for personal projects and it's true for products.

Then, as seems to be the pattern, Claude Code dropped yet another feature. This time it was Channels. A way to control a Claude Code session from Telegram via an MCP plugin. Runs on your subscription, not API credits. Zero container complexity. Full access to all local MCP tools natively.

So off I've gone to set up another A/B test. Channels versus NanoClaw for interactive Telegram chat. It's running right now. The first session died after about 12 hours (auth token expiry), which is the kind of thing you only learn by testing.

I could get precious about this. I just finished migrating everything to NanoClaw. The architecture diagram looks clean for the first time in months. The last thing I want is to rip it up again.

But if you're building with AI right now, your setup has a shelf life measured in weeks. The discipline isn't picking the right tool. It's being willing to test, measure, and switch when something better shows up. Treat your AI tooling the way you'd treat any product decision. Hypothesize, test, measure, decide. The best architecture is the one that's easy to change.

I'll let you know how Channels goes. Or I'll be writing about the next thing that replaced it. That's sort of the point.

$ _

© 2026 Lauren Out Loud. All rights reserved.
