Most AI agents fail the moment you ask them to connect a sales number to a customer review.
The Summary
- Databricks research shows multi-step agents beat single-turn RAG by 20%+ on hybrid data tasks that require joining structured databases with unstructured documents
- The performance gap is architectural, not a model-quality problem: even when single-turn systems use stronger base models, they still lose by 21%
- Tests ran across nine enterprise knowledge tasks using Stanford's STaRK benchmark and Databricks' KARLBench framework
The Signal
Enterprise AI hits a wall the moment you ask it to do what humans do naturally: connect numbers to narrative. Why are sales declining in the Midwest? That question requires pulling revenue data from SQL tables, customer feedback from support tickets, and market analysis from PDFs. Single-turn RAG systems break on exactly this class of question.
Databricks put numbers to that failure mode. Their research team tested multi-step agentic architectures against state-of-the-art single-turn RAG across nine enterprise knowledge tasks. The multi-step approach won by 20% or more on Stanford's STaRK benchmark suite, with consistent gains across Databricks' own evaluation framework.
"RAG works, but it doesn't scale. If you want to understand why you have declining sales, you have to help the agent see the tables and look at the sales data."
Here's the kicker: When Databricks tested a stronger base model in the single-turn system, it still lost by 21%. This isn't about throwing more parameters at the problem. It's about architecture. Single-turn systems make one retrieval call, generate one answer, done. Multi-step agents reason across multiple data sources, backtrack when initial queries return nothing useful, and refine their approach mid-task.
The research builds on Databricks' earlier work on instructed retrievers, which improved unstructured data retrieval using metadata-aware queries. This latest work adds relational tables and SQL warehouses into the same reasoning loop. The agent doesn't just search documents. It queries databases, correlates results, and synthesizes answers that span both.
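To make "queries databases, correlates results, and synthesizes answers" concrete, here is a minimal sketch of that pattern in Python. It is not Databricks' implementation. The `revenue` table, the `TICKETS` list, and the function names are all hypothetical, with an in-memory SQLite database standing in for a SQL warehouse and a keyword filter standing in for document retrieval.

```python
import sqlite3

# Hypothetical support-ticket snippets standing in for an unstructured document store.
TICKETS = [
    "Midwest customers report late deliveries since April",
    "West coast customers praise the new checkout flow",
]


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory table of made-up revenue figures (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO revenue VALUES (?, ?, ?)",
        [
            ("Midwest", "Q1", 120),
            ("Midwest", "Q2", 95),
            ("West", "Q1", 100),
            ("West", "Q2", 110),
        ],
    )
    return conn


def worst_region(conn: sqlite3.Connection) -> tuple:
    """Structured step: find the region with the largest Q1-to-Q2 revenue drop."""
    row = conn.execute(
        """
        SELECT q1.region, q2.amount - q1.amount AS delta
        FROM revenue q1
        JOIN revenue q2
          ON q1.region = q2.region
         AND q1.quarter = 'Q1'
         AND q2.quarter = 'Q2'
        ORDER BY delta ASC
        LIMIT 1
        """
    ).fetchone()
    return row[0], row[1]


def explain_decline(conn: sqlite3.Connection) -> dict:
    """Unstructured step: use the SQL result to pick out related tickets."""
    region, delta = worst_region(conn)
    evidence = [t for t in TICKETS if region.lower() in t.lower()]
    return {"region": region, "delta": delta, "evidence": evidence}
```

The point of the sketch is the ordering: the SQL result decides what to look for in the documents, which a single retrieval pass over text alone cannot do.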
Key differences between the approaches:
- Single-turn RAG: one query, one retrieval pass, one answer generation
- Multi-step agents: iterative queries, cross-source correlation, backtracking on dead ends
- Performance gap holds even when single-turn systems use stronger LLMs
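The control-flow difference in the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the benchmarked systems: `DOCS`, `retrieve`, `single_turn`, and `multi_step` are hypothetical names, and a naive keyword match stands in for a real retriever. In a real agent, the reformulated queries would come from the model rather than a fixed list.

```python
from typing import List, Optional

# Hypothetical document snippets; a real system would query a search index.
DOCS = [
    "Midwest customers report late deliveries since April",
    "Pricing page is confusing for new users",
]


def retrieve(query: str) -> List[str]:
    """Naive keyword retrieval: keep documents containing every query word."""
    words = query.lower().split()
    return [d for d in DOCS if all(w in d.lower() for w in words)]


def single_turn(query: str) -> Optional[str]:
    """One query, one retrieval pass, one answer. No retry on an empty result."""
    hits = retrieve(query)
    return hits[0] if hits else None


def multi_step(queries: List[str], max_steps: int = 3) -> Optional[str]:
    """Try reformulated queries in order, moving on when one hits a dead end."""
    for query in queries[:max_steps]:
        hits = retrieve(query)
        if hits:
            return hits[0]
    return None
```

With these toy documents, `single_turn("midwest sales declining")` finds nothing and stops, while `multi_step(["midwest sales declining", "midwest deliveries"])` falls through to the second query and returns the late-deliveries ticket.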
The benchmark tasks weren't toy problems. They tested real enterprise scenarios where answers live in multiple systems. Citation counts alongside academic papers. Sales figures alongside customer sentiment. Inventory levels alongside supplier communications. These are the questions knowledge workers ask every day and the questions current RAG systems consistently fail to answer.
The Implication
If you're building agents for enterprise knowledge work, single-turn RAG is a dead end for anything beyond simple document search. The architecture can't handle the hybrid reasoning enterprise questions actually require. Multi-step agent frameworks add complexity (more API calls, higher latency, harder debugging), but the performance gap is too wide to ignore.
Watch for agent platforms to split into two tiers. Simple RAG for straightforward lookup tasks. Multi-step agentic systems for anything that requires correlating structured and unstructured data. The companies that figure out how to make the multi-step approach reliable at scale will own enterprise AI deployment for the next three years.