English captured AI's first wave by accident—web scraping favored the dominant internet language—but the economics are finally cracking open to let everyone else in.

The Summary

  • Egyptian developer Assem Sabry built Horus, an Arabic-focused LLM trained on open-source datasets using cloud GPUs, pulling 800+ downloads in its first week
  • The AI language gap isn't about minority languages—those "minorities" are actually the global majority, locked out by training data economics
  • Open-source models and tightening token limits from Big Tech are creating space for local AI that actually understands culture, not just vocabulary

The Signal

The AI industry has a colonial problem it never meant to create. When you train models by scraping the internet, you get the internet's demographics. English dominates online content. Chinese comes second because of population scale and government investment. Everything else gets the scraps. The result, per 2023 research from the Center for Democracy & Technology, is that billions of people trying to use AI in their native language are working with models that smooth over nuance, miss context, and fundamentally don't understand their world.

This wasn't malice. It was economics. Training costs money. Big money. And if you're OpenAI or Anthropic burning millions per training run, you optimize for the largest addressable market. That's English speakers with money to spend on subscriptions. You might throw in Spanish or Mandarin support if the ROI pencils out. But Wolof? Tagalog? Arabic that sounds the way Egyptians actually talk, not formal Modern Standard Arabic? Not worth the compute.

"Two years ago, AI wasn't as good as now, and the LLMs weren't open-source. Now we can really build our AI models from scratch."

The game changed when Meta released Llama and the open-source floodgates opened. Suddenly you didn't need $100 million and a data center. You needed hustle, cloud credits, and a dataset that actually represented your language. Sabry built Horus using Google Colab GPUs and open datasets. Not cheap, but achievable. The model understands Egyptian dialect, cultural references, local context. Things GPT-4 will never be tuned for, because Egypt isn't a top-10 revenue market.
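
To make that concrete, here's a minimal sketch of the recipe that paragraph describes: an open base model, an open dialect corpus, and LoRA fine-tuning so the job fits on a single rented GPU. This is not Sabry's actual pipeline (those details aren't public here), and the model and dataset identifiers are placeholders.

```python
# Sketch of the "open weights + rented GPU" recipe, not Horus's real pipeline.
# BASE_MODEL and DATASET are placeholders; swap in the open checkpoint and
# dialect corpus you actually have access to.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-3.2-1B"          # placeholder: any small open causal LM
DATASET = "your-org/egyptian-arabic-corpus"     # placeholder: corpus with a "text" column

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

# bf16 assumes an Ampere-class GPU (Colab's paid L4/A100 tiers).
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Low-rank adapters: a few million trainable parameters instead of billions,
# which is what makes a single Colab-class GPU enough.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_data = load_dataset(DATASET, split="train").map(
    tokenize, batched=True, remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dialect-finetune",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # simulate a bigger batch on one GPU
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=train_data,
    # Causal-LM collator copies input_ids into labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Hugging Face Hub handles distribution; push the adapter and it's downloadable.
model.push_to_hub("your-org/your-dialect-model")  # placeholder repo id
```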

What's driving this isn't altruism. It's necessity and opportunity colliding:

  • Data scarcity is becoming a Big Tech problem. OpenAI and others are running out of quality English training data. They're tightening token limits, raising prices, trying to squeeze more value from existing models.
  • Smaller players can now compete on culture. You don't need the best general model. You need the best Egyptian model, or Kenyan model, or Indonesian model.
  • Local LLMs serve local needs better. A developer in Cairo building customer service tools doesn't need ChatGPT's ability to write Shakespearean sonnets. They need something that understands how actual Egyptians communicate.

The infrastructure is finally there. Cloud compute is commoditized. Open-source models are good enough to fine-tune. Hugging Face gives you distribution. What was impossible two years ago is now just hard work.

But the obstacles are still real:

  • Quality datasets for non-English languages are still scarce
  • GPU access is cheaper but not cheap
  • Most developers in these markets are working solo or in tiny teams
  • There's no venture funding ecosystem for "Egyptian LLMs"

The Implication

Watch this space for the next 18 months. If local LLMs start outperforming Big Tech models in specific markets, the entire AI stack fractures. We won't have one global AI layer. We'll have dozens of regional models, each deeply embedded in local culture and commerce. That's both more resilient and more complex.

For developers: if you're building agents or AI products for non-English markets, betting on OpenAI alone is increasingly risky. The models that will win locally are being built locally. Find them, partner with them, or build your own.
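
On the consumption side, adopting a regional model can be a small change: point at a different Hub repo instead of routing everything through one US-hosted API. A hypothetical sketch, assuming a community dialect model exists for your market (the repo id below is a placeholder, not a real model):

```python
# Hypothetical sketch: answering a customer query with a community dialect
# model pulled from the Hugging Face Hub. The repo id is a placeholder.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="your-org/egyptian-arabic-chat",  # placeholder regional model
    device_map="auto",
)

# "A customer asks: where's your nearest branch in Nasr City?" -- Egyptian
# dialect, the kind of phrasing an MSA-only model tends to fumble.
prompt = "عميل بيسأل: فين أقرب فرع ليكم في مدينة نصر؟"
reply = generate(prompt, max_new_tokens=120, do_sample=True, temperature=0.7)
print(reply[0]["generated_text"])
```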

Sources

Fast Company Tech