While the AI world's been busy building cloud empires, someone just turned every Mac into a frontier-class inference machine that beats Ollama by 4x.
The Summary
- Rapid-MLX delivers local AI inference on Apple Silicon at 4.2x Ollama's speed, with 0.08s cached time-to-first-token and drop-in OpenAI API compatibility
- A 16GB MacBook Air runs Qwen3.5-4B at 160 tokens/sec; 96GB Mac Studios handle frontier-class Qwen3.5-122B at 57 tok/s
- Works immediately with Cursor, Claude Code, Aider, PydanticAI, and LangChain — no code changes, just point at localhost
The Signal
The local inference race just got interesting. Rapid-MLX is an open-source inference engine optimized for Apple's Metal architecture that runs AI models directly on Mac hardware with speeds that make cloud APIs look expensive and Ollama look slow. The benchmark that matters: 4.2x faster than Ollama on the same hardware, with 100% tool calling accuracy and 17 different tool parsers built in.
The timing matters because Apple Silicon's unified memory architecture has been sitting there like a loaded gun nobody's properly fired. Rapid-MLX finally does. A base M3 MacBook Air with 16GB RAM runs Qwen3.5-4B at 160 tokens per second. That's not "good enough for a laptop" speed — that's faster than most people read. Push up to a Mac Studio with 96GB and you're running Qwen3.5-122B, a frontier-class model, at 57 tokens/sec locally. No API bills. No latency. No data leaving your machine.
"A 16GB MacBook Air runs local AI faster than most people read, with zero cloud dependency."
The developer tooling integration is where this gets practical:
- Drop-in OpenAI API replacement — change one URL, keep your entire stack
- Native support for Cursor, Claude Code, and Aider without configuration
- Works with PydanticAI and LangChain frameworks out of the box
- 17 tool parsers for function calling that actually work
Here's why the tool calling piece matters. Most local inference engines treat function calling as a "we're working on it" feature. Rapid-MLX ships with 100% tool calling accuracy and 17 parsers because the developer clearly understands that AI agents without reliable tool use are just expensive chatbots. Every coding assistant, every automation framework, every agent platform assumes your LLM can reliably call functions. This delivers that locally.
The model support tells you who this is really for. They're not pushing Llama variants for casual chat. The default recommendations are Qwen3.5 and Qwen3.6 models — Chinese-origin models that consistently punch above their weight class for code and reasoning. The new Nemotron-Nano 30B at 141 tok/s on 32GB Macs is positioned as the fastest 30B with full tool support. Day-0 support for DeepSeek V4 Flash, a frontier mixture-of-experts model with 1M context window. This is infrastructure for people building real agent systems.
The Implication
If you're building anything with agents and you own a Mac, this just changed your economics. Every API call you're making to OpenAI, Anthropic, or Google for local development work is now optional. The "build in the cloud, test locally" workflow flips to "build and test locally, route to cloud only when you need it." Watch for this pattern to spread. The inference layer is commodifying faster than anyone expected, and the next bottleneck is orchestration, memory, and multi-agent coordination — not raw speed.
For companies betting on agent infrastructure: this is what your competitive moat is not. Any decent engineer can now run frontier-class models locally at production speed. Your value is in the layer above — the reasoning loops, the memory systems, the toolchains that make agents actually useful.