The hard part of building agents was never getting them to do things. It was understanding what they did when you weren't looking.
The Summary
- Raindrop AI launched Workshop, an open-source tool that lets developers debug and evaluate AI agents entirely on their local machines, capturing every token, tool call, and decision in a single SQL database file.
- The key shift: observability without external servers. Everything runs at localhost:5899, eliminating the privacy issues of sending agent traces to third-party platforms.
- Workshop's "self-healing eval loop" lets coding agents like Claude Code read an agent's failure traces, write tests against them, and fix the broken code autonomously.
The Signal
The agent economy has a debugging problem. When your agent fails at 3 AM, you don't get a stack trace. You get a vague error message and 47 API calls to reconstruct. Workshop attacks this from first principles: stream everything to a local .db file the moment it happens, then make it queryable.
This matters because agent development right now feels like flying blind. Unlike traditional software where you set breakpoints and step through code, agents make probabilistic decisions across dozens of API calls. When something breaks, you're left guessing which tool call in the chain went sideways.
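To make the "stream it to a local file, then query it" idea concrete, here is a minimal Python sketch. It assumes the .db file is SQLite-compatible and invents its own schema; the `events` table, its columns, and the helper names are hypothetical and are not Workshop's actual layout or API.

```python
import json
import sqlite3
import time

# Hypothetical schema: the source does not describe Workshop's real tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id  TEXT NOT NULL,
    ts      REAL NOT NULL,
    kind    TEXT NOT NULL,   -- e.g. 'llm_call', 'tool_call', 'error'
    name    TEXT,
    payload TEXT NOT NULL    -- JSON-encoded details
)
"""

def open_trace_db(path=":memory:"):
    """Open (or create) a trace database; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def record_event(conn, run_id, kind, name, payload):
    """Write one event and commit immediately so it is queryable right away."""
    conn.execute(
        "INSERT INTO events (run_id, ts, kind, name, payload) VALUES (?, ?, ?, ?, ?)",
        (run_id, time.time(), kind, name, json.dumps(payload)),
    )
    conn.commit()

def first_failure(conn, run_id):
    """Return (id, name, payload) for the earliest error in a run, or None."""
    row = conn.execute(
        "SELECT id, name, payload FROM events "
        "WHERE run_id = ? AND kind = 'error' ORDER BY id LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], row[1], json.loads(row[2])

# Simulated agent run: four steps, the third one fails.
conn = open_trace_db()
record_event(conn, "run-1", "llm_call", "plan", {"tokens": 212})
record_event(conn, "run-1", "tool_call", "search_records", {"query": "patient 42"})
record_event(conn, "run-1", "error", "search_records", {"message": "timeout"})
record_event(conn, "run-1", "tool_call", "summarize", {"ok": True})
print(first_failure(conn, "run-1"))
```

The point of the sketch is the shape of the workflow: once every step is a row, "which tool call went sideways" becomes a one-line query instead of a reconstruction exercise.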
"The hard part isn't building agents. It's understanding what they actually did."
Raindrop's approach addresses three problems at once:
- Privacy: No agent traces leave your machine. For enterprise teams building proprietary agents, this removes the legal headache of cloud observability.
- Speed: Real-time streaming versus polling means you see failures as they happen, not after batch processing.
- Storage: A single .db file that "takes up relatively little memory" according to Raindrop CTO Ben Hylak, a former Apple and SpaceX engineer.
The self-healing eval loop is where this gets interesting. Instead of developers manually debugging agent failures, Workshop captures the full trajectory, then lets coding agents like Claude Code read their own traces, write evaluations against the codebase, and fix the broken logic. It's agents debugging agents.
Here's the example Raindrop provides: a veterinary assistant agent fails to ask necessary follow-up questions. Workshop captures the full conversation trajectory. Claude Code reads the trace, writes a specific eval, identifies the logic error in the prompt or code, and patches it. The next time the agent runs, it asks the right questions.
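As a rough illustration of what "a specific eval" for the veterinary example might look like, here is a small Python sketch. It is not Raindrop's or Claude Code's actual output: the `Turn` type, the symptom keyword list, and the "reply must contain a question" rule are all assumptions chosen to keep the check simple.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # 'user' or 'assistant'
    text: str

# Hypothetical trigger words; a real eval would likely be far less crude.
SYMPTOM_WORDS = ("vomiting", "limping", "not eating", "lethargic")

def mentions_symptom(text):
    """True if the message contains any of the assumed symptom keywords."""
    lowered = text.lower()
    return any(word in lowered for word in SYMPTOM_WORDS)

def eval_asks_follow_up(trajectory):
    """Pass only if every symptom report is answered with a question.

    For each user turn that mentions a symptom, the next assistant turn
    must exist and contain a question mark.
    """
    for i, turn in enumerate(trajectory):
        if turn.role == "user" and mentions_symptom(turn.text):
            reply = next(
                (t for t in trajectory[i + 1:] if t.role == "assistant"), None
            )
            if reply is None or "?" not in reply.text:
                return False
    return True

# A captured failure: the assistant answers without asking anything.
failing = [
    Turn("user", "My dog has been vomiting since last night."),
    Turn("assistant", "It is probably an upset stomach."),
]
# The behavior the patched agent should show.
passing = [
    Turn("user", "My dog has been vomiting since last night."),
    Turn("assistant", "How many times has he vomited, and is he still drinking water?"),
]

print(eval_asks_follow_up(failing), eval_asks_follow_up(passing))
```

An eval like this turns a one-off failure trace into a regression check: the patch is only accepted once the previously failing trajectory passes.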
This is the infrastructure layer for agentic systems starting to mature. For the past year, developers have been building agents with consumer LLM interfaces and cobbled-together logging. Workshop gives them tooling built for what agents are: non-deterministic, multi-step, tool-using systems that need observability at the token level.
The Implication
Watch for two things. First, whether this becomes table stakes for agent development frameworks. If Workshop proves the model works, expect similar tools from Anthropic, OpenAI, and the agent platform companies. Second, whether the self-healing eval loop actually holds up in production. Agents debugging agents sounds elegant, but it only works if the underlying models are good enough at reading their own mistakes. That's still an open question.
If you're building agents, install this and see what your system is actually doing. The .db file is portable, so you can share failure traces with your team without routing them through a third-party platform. There is a one-line install for macOS, Linux, and Windows, and the GitHub repo uses the Bun runtime if you want to build from source.