Six months ago, we published our first paper on fine-tuning LLMs for non-English languages. Little did we know we would hit a nerve that would send ripples through the AI community. Conference talks, meetups, panel discussions, research groups – suddenly everyone wanted to talk about the elephant in the room: the myth of truly multilingual AI.
From talks at Google HQ to lectures in Qatar and conferences in Armenia and Georgia, we've spent a lot of time discussing this mythical creature: the multilingual LLM.

And it turns out this mythical creature is exactly that – a myth. Despite the marketing hype (and big tech claiming to support 120+ languages), truly multilingual AI simply doesn't exist yet, at least not in any way that would satisfy a native speaker of any language other than English.
As Ukrainians on the founding team, we had front-row seats to this linguistic challenge. Ever seen an AI confidently switch from Ukrainian to Russian or from Arabic to English mid-sentence like a confused exchange student? We have. Way too often. It would be funny if it weren't so problematic.
We've documented these failures extensively, but the core issue kept coming back to one thing: tokenization, the fundamental component of any LLM that transforms text into tokens, the units the model processes. The efficiency and accuracy of tokenization directly affect the model's performance, shaping how well it understands and generates text.
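To see the imbalance for yourself, here is a minimal sketch (assuming the Hugging Face transformers library is installed; the gpt2 tokenizer is just an illustrative English-centric choice, not the one from our paper) that compares how many tokens the same short sentence costs in English versus Ukrainian. On an English-centric vocabulary, Cyrillic text largely falls back to byte-level pieces, so the Ukrainian sentence ends up far longer in tokens than in words, and every extra token is extra compute and cost.

```python
# Rough illustration of tokenizer imbalance between English and Ukrainian.
# Assumes the `transformers` library; "gpt2" is an illustrative choice only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "English": "The cat sat on the mat and looked out the window.",
    "Ukrainian": "Кіт сидів на килимку і дивився у вікно.",
}

for language, text in samples.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    # "Fertility" = tokens per word; higher means more compute per sentence.
    print(f"{language}: {len(tokens)} tokens for {len(words)} words "
          f"(fertility {len(tokens) / len(words):.2f})")
```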

Today, we are presenting the culmination of 6 months of collaboration and research across multiple organizations, academic institutions, and volunteers.
Our paper "From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages" introduces a transformative approach to developing bilingual AI models that are both cost-effective and culturally authentic.

The research achieves two critical objectives that have long eluded the industry:
- Reducing computational costs for non-English language processing by up to 3 times
- Maintaining high performance in both English and target languages without compromise

Our methodology represents a fundamental shift in how bilingual AI models are developed. By intelligently extending the tokenizer vocabulary and optimizing embedding initialization, we've created a scalable solution that can be applied to any language pair.
What makes our approach revolutionary is its resource efficiency. While others are throwing massive computing power at the problem, we've developed an elegant solution that requires a fraction of the training tokens and compute that conventional approaches demand.
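To make the idea of vocabulary extension with careful embedding initialization more tangible, here is a minimal sketch. It is not the exact recipe from our paper: it assumes a Hugging Face-style model and tokenizer (gpt2 is used purely as an illustrative base, and the added words are a tiny made-up list), and it initializes each newly added token's embedding as the mean of the embeddings of the subword pieces that token previously split into, a common heuristic for avoiding cold random initialization.

```python
# Sketch of vocabulary extension with mean-subword embedding initialization
# (an illustration of the general technique, not the paper's exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative base model, not the one from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# New whole-word tokens for the target language (tiny illustrative list).
new_tokens = ["вікно", "килимок", "дивитися"]

# Record how the *old* tokenizer splits each new word before we add it.
old_splits = {t: tokenizer(t, add_special_tokens=False)["input_ids"]
              for t in new_tokens}

# Extend the vocabulary and grow the embedding matrix to match.
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new row as the mean of its old subword embeddings
# instead of leaving it randomly initialized.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(token)
        embeddings[new_id] = embeddings[old_splits[token]].mean(dim=0)

print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```

From there, the extended model would still need continued training on target-language data so it learns to use the new tokens; the point of careful initialization is to make that adaptation cheaper.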
Key achievements include:
- 90% reduction in generation of non-existent words in the target languages
- Near-elimination of inappropriate code-switching
- Significant improvements in grammar accuracy
- Preservation of English language capabilities

Market Impact and Commercial Applications
The implications for businesses are immediate and far-reaching. Organizations can now:
- Deploy truly bilingual AI solutions at significantly lower operational costs
- Serve international markets with culturally authentic AI interactions
- Maintain brand consistency across languages while preserving local nuance
- Scale their AI solutions globally without proportional infrastructure costs
This research isn't just about technical advancement – it's about creating real business value. Companies have been struggling with the cost and complexity of serving multiple language markets. Our approach offers a clear path forward that is both technically superior and commercially viable.
Acknowledgments
We're not just solving today's problems – we're building the foundation for truly inclusive AI. Our vision is to democratize access to advanced AI capabilities, ensuring that language is never a barrier to participating in the digital economy. None of this work would have been possible without our strategic partners, to whom we extend our deepest gratitude:
- Observea's provision of their 16x Tesla H100 cluster proved instrumental in accelerating our research timeline.
- NVIDIA's support through their DGX Workstation equipped with 4x Tesla V100 enabled critical model evaluations and testing phases.
- HotAisle's generous access to their 8x AMD MI300x node was crucial for our training experiments.
- We're especially thankful to cloud partners AWS, Google Cloud Platform, and TPU Research Cloud, whose infrastructure and credits facilitated extensive model training and testing.
- A.I. Hero's early-stage support with 8x A100 access was vital for our initial experiments.

Special recognition goes to our academic collaborators at the Doha Graduate Studies University and the Arab Center for Research and Policy Studies, whose expertise in Arabic language and cultural nuances significantly enhanced our research outcomes.
Our keynote presentation from the DataFest conference in September 2024 is also available now.