Andrej Karpathy just open-sourced the research lab that fires you in your sleep.
The Summary
- Karpathy released autoresearch, a system where AI agents autonomously modify LLM training code, run experiments on a single GPU overnight, and keep improvements while you're offline
- The agents edit the actual training code (train.py), not just hyperparameters: they're rewriting how the model learns, not just tweaking knobs
- You program in Markdown files that set up your "research org," not Python files. The code becomes the employee, not the tool.
The Signal
This is what autonomous research actually looks like when you strip away the enterprise software wrapper. Karpathy's autoresearch runs on consumer hardware. One GPU. Five-minute training cycles. The agent wakes up, tweaks the training code for Karpathy's nanochat implementation, runs a quick experiment, checks whether the loss improved, and either commits or reverts. Then it does it again. All night. You get a log in the morning.
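The commit-or-revert cycle described above is, in shape, a greedy hill climb over versions of the code. Here is a minimal Python sketch of that shape, under loud assumptions: `propose_edit` and `evaluate` are hypothetical stand-ins for the agent's edit and the five-minute training run, and the "code" is a toy one-line string rather than a real train.py. It illustrates the loop, not how the autoresearch repo is actually implemented.

```python
import random
from typing import Callable, Dict, List, Tuple


def keep_or_revert(
    initial_code: str,
    propose_edit: Callable[[str], str],   # hypothetical: the agent's edit step
    evaluate: Callable[[str], float],     # hypothetical: short training run -> loss
    n_experiments: int,
) -> Tuple[str, float, List[Dict]]:
    """Greedy loop: keep an edit only if it lowers the loss, otherwise revert."""
    best_code = initial_code
    best_loss = evaluate(best_code)
    log: List[Dict] = []
    for step in range(n_experiments):
        candidate = propose_edit(best_code)
        loss = evaluate(candidate)
        kept = loss < best_loss
        if kept:
            # "Commit": the candidate becomes the new baseline.
            best_code, best_loss = candidate, loss
        # Otherwise "revert" by simply discarding the candidate.
        log.append({"step": step, "loss": loss, "kept": kept})
    return best_code, best_loss, log


# Toy stand-ins so the sketch runs without a GPU: the "code" is one line
# holding a learning rate, and the "loss" is lowest at lr = 0.1.
rng = random.Random(0)


def toy_edit(code: str) -> str:
    lr = float(code.split("=")[1])
    return f"lr = {max(1e-4, lr + rng.uniform(-0.05, 0.05)):.4f}"


def toy_loss(code: str) -> float:
    lr = float(code.split("=")[1])
    return (lr - 0.1) ** 2


best_code, best_loss, log = keep_or_revert("lr = 0.5", toy_edit, toy_loss, 100)
print(best_code, best_loss, sum(entry["kept"] for entry in log))
```

Ties count as reverts here, which is a choice of this sketch. A real version would swap the string for a working tree under version control and the two callables for an agent call and an actual training run.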
The framing matters here. Karpathy wrote the README as if it's March 2026, looking back at "how it all began" before research became "entirely the domain of autonomous swarms." It's tongue-in-cheek, but the underlying point is dead serious: the transition from human-driven to agent-driven research isn't some distant future. It's a GitHub repo you can clone right now.
"You're not touching Python files like you normally would as a researcher. You are programming the program.md Markdown files that provide context to the AI agents."
The technical setup is intentionally minimal. The repo has three core files. One handles data prep and evaluation (fixed, the agent never touches it). One is the training code (train.py, which the agent modifies). One is program.md, a Markdown file where you describe what the research org should optimize for. That's it. No complex orchestration layer. No API to a foundation model service. Just an agent with write access to the code that trains the model it's trying to improve.
This is different from hyperparameter search or AutoML. Those systems operate within a fixed architecture. They tune learning rates, batch sizes, model dimensions. Autoresearch gives the agent permission to rewrite the training loop itself. Change how gradients flow. Swap out optimizers. Rethink the loss function. The search space isn't a grid of numbers. It's the entire codebase.
What makes this work on a single GPU is the short iteration cycle. Five minutes per experiment. Most academic training runs take hours or days, so each one has to be chosen carefully, and a human picking the next experiment is the only practical option. But if you can test an idea in five minutes, you can test roughly 100 ideas overnight. The constraint isn't compute anymore. It's how fast you can generate hypotheses and parse results. Humans are slow at both. Agents are not.
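The overnight figure is simple arithmetic, sketched below. The eight-hour window is an assumption (the claim above only says "overnight"), and the helper name `experiments_per_night` is made up for illustration. It also ignores any time the agent spends editing code between runs.

```python
def experiments_per_night(hours_offline: float, minutes_per_experiment: float) -> int:
    """Number of back-to-back experiments that fit in the window (overhead ignored)."""
    return int(hours_offline * 60 // minutes_per_experiment)


# Assumed eight-hour night.
print(experiments_per_night(8, 5))        # 96 five-minute runs
print(experiments_per_night(8, 60))       # 8 hour-long runs
print(experiments_per_night(8, 24 * 60))  # 0 day-long runs finish
```

At five minutes per run, the count lands just under 100. At an hour per run it drops to single digits, which is why long runs push the choice of each experiment back onto a person.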
Here's the economic angle: research labor has long been a bottleneck in AI progress. You need smart people who understand the math, can code, and have the intuition to know what to try next. That intuition comes from running lots of experiments and building mental models of what works. Autoresearch compresses that loop. The agent builds intuition at machine speed.
The program.md abstraction is the real innovation. Instead of writing code, you write a charter for your research organization. What's the goal. What constraints matter. What you've tried before. The agent reads it, internalizes it, and starts executing. As Karpathy describes it in his tweets, you're not doing research. You're programming a research org. The code isn't the artifact anymore. The org structure is.
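To make the three ingredients named above concrete (goal, constraints, what you've tried), here is a small Python sketch that renders them into a Markdown charter. Everything in it is hypothetical: the `Charter` class, the section headings, and the sample entries are illustrative assumptions, not the actual layout of Karpathy's program.md. In practice you would write the Markdown by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Charter:
    """Hypothetical structure; the real program.md format may differ."""
    goal: str
    constraints: List[str] = field(default_factory=list)
    prior_attempts: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        def bullets(items: List[str]) -> str:
            # Render a bullet list, with a placeholder when the list is empty.
            return "\n".join(f"- {item}" for item in items) or "- (none yet)"

        return (
            "# Goal\n"
            f"{self.goal}\n\n"
            "# Constraints\n"
            f"{bullets(self.constraints)}\n\n"
            "# Tried before\n"
            f"{bullets(self.prior_attempts)}\n"
        )


# Sample entries are made up for illustration.
charter = Charter(
    goal="Lower validation loss within the fixed five-minute training budget.",
    constraints=["Only edit train.py", "Leave data prep and evaluation untouched"],
    prior_attempts=["Raising the learning rate: no improvement"],
)
print(charter.to_markdown())
```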
The Implication
If you're a researcher, this is your job description changing in real time. The question isn't whether agents can run experiments for you. They can. The question is whether you can articulate what experiments are worth running better than the agent can figure it out by brute force. Start thinking about how you'd write your program.md.
If you're building AI tools, the interface here matters. Karpathy didn't build a GUI or a chat interface. He used Markdown. That's not an accident. Markdown is structured enough for agents to parse reliably and flexible enough for humans to write naturally. It's a format for instructions that scales.
Watch the forks. This repo will spawn variations: multi-GPU versions, longer training runs, agents that coordinate with each other, agents that read papers and incorporate new techniques. The current version is a proof of concept. The next version might be your competition.