Amazon just bet nine figures that the future of search isn't text-based at all.

The Summary

  • Twelve Labs raised $100 million from Amazon, NEA, and Naver to build AI that searches and analyzes video at scale
  • Amazon's strategic investment signals a shift from keyword-based retrieval to multimodal understanding as the next search paradigm
  • Video is the last unstructured data frontier: trillions of hours sit unsearchable, making this a classic Web4 agent play

The Signal

Twelve Labs is building foundation models specifically for video understanding, which means making the semantic content of any video as searchable as text. Not just transcribing audio. Actual visual understanding: what's in frame, what actions are happening, what context surrounds them. The company's API lets developers query video with natural language and get back timestamped results based on what's actually shown, not just what's said.
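To make "natural language in, timestamped results out" concrete, here is a minimal sketch of that interaction. It is not Twelve Labs' API: the `Segment` type, the `search` function, and the `min_score` threshold are hypothetical names invented for illustration. Simple word overlap against hand-written segment descriptions stands in for what a real video model would do with learned representations of the footage itself.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A slice of video plus a description (hypothetical, for illustration)."""
    start_s: float
    end_s: float
    description: str  # stand-in for what a video model would infer from the frames


def tokenize(text):
    """Lowercase the text and strip basic punctuation, returning a set of words."""
    return {w.strip(".,?!\"'").lower() for w in text.split()} - {""}


def search(segments, query, min_score=0.5):
    """Return (start_s, end_s, score) for segments matching the query.

    Score is the fraction of query words found in the segment description.
    This word-overlap scoring is an assumption of the sketch, not how a
    production video model ranks results.
    """
    q = tokenize(query)
    hits = []
    for seg in segments:
        overlap = len(q & tokenize(seg.description))
        score = overlap / len(q) if q else 0.0
        if score >= min_score:
            hits.append((score, seg))
    # Best score first; ties broken by earliest timestamp.
    hits.sort(key=lambda h: (-h[0], h[1].start_s))
    return [(seg.start_s, seg.end_s, round(score, 2)) for score, seg in hits]


# Made-up sample data for the sketch.
footage = [
    Segment(0.0, 12.0, "a forklift moves pallets across the warehouse floor"),
    Segment(12.0, 20.0, "a worker picks up a package from the conveyor"),
    Segment(20.0, 31.0, "empty aisle, no movement"),
    Segment(95.0, 104.0, "a worker picks up a package and scans it"),
]

results = search(footage, "worker picks up package")
print(results)
```

The shape of the return value is the point: the caller asks a question in plain language and gets back time ranges to jump to, rather than a list of whole files.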

Amazon doesn't write checks this size for marginal improvements. They're positioning for a world where every security camera feed, training video, customer service call, warehouse operation, and product demo becomes queryable in real time. That's an agent-native interface. Today you need humans to watch footage. Tomorrow agents watch everything, flag anomalies, extract insights, and route decisions.

The timing matters. Text-based search peaked years ago. Image search is a commodity. But video is commonly estimated to account for roughly 80% of internet traffic, and almost none of it is truly searchable at the semantic level. YouTube's search relies on titles, descriptions, and transcripts. Ring cameras create terabytes of footage that gets watched only after something goes wrong. Corporate training libraries are graveyards of unlabeled content.

Twelve Labs is attacking that gap with multimodal models trained on video from the ground up. The technical unlock is processing video as a unified sequence of visual, audio, and temporal information rather than splitting it into frames and audio tracks. That architecture lets the models understand narrative flow, spatial relationships, and complex actions that single-frame analysis misses.

Key competitive dynamics:

  • Google has YouTube but treats video as a text search problem with visual aids
  • OpenAI and Anthropic focused on text first and bolted on image understanding later
  • Twelve Labs is video-native, which matters for handling motion, context, and time-series data

The Amazon involvement is the tell. They need this for AWS customers building surveillance, retail analytics, and content moderation systems. They need it for their own logistics operations. And they need it because whoever owns video understanding owns the next generation of human-computer interaction. Your agents can't operate in the physical world if they can't understand what they're seeing in motion.

The Implication

If you're building agents that need to understand the physical world through cameras, or if you're sitting on archives of video content that could train models or serve customers, this is your category to watch. The companies that crack video search first will build the pipes that every Web4 application flows through.

For everyone else, the shift is simpler: the internet stops being a text database with video attachments. It becomes a video database that you query like you talk. The search bar doesn't go away. It just starts accepting questions like "show me every time someone picked up a package in this warehouse yesterday" instead of keywords.

Sources

Bloomberg Tech