Grab Builds Custom Vision AI Model After Finding Major LLMs Struggle with Southeast Asian Languages

2 Sources

Singapore's superapp giant Grab developed its own lightweight vision language model after discovering that major proprietary and open-source AI models perform poorly on Southeast Asian languages, highlighting broader accessibility challenges for non-English AI applications.

Grab's AI Language Challenge

Singapore-based superapp giant Grab has revealed that major artificial intelligence models struggle significantly with Southeast Asian languages, prompting the company to develop its own custom vision language model. The revelation highlights broader challenges facing AI accessibility for non-English speaking populations worldwide [1].

Grab, which dominates ride-sharing, food delivery, and financial services across eight Southeast Asian countries, requires accurate document processing for compliance tasks including know-your-customer checks. The company processes ID cards, driver's licenses, and registration certificates, some of which are written in non-Latin scripts [1].

Source: The Register


Proprietary Models Fall Short

According to Grab's engineering team, while "powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA languages, produced errors, hallucinations, and had high latency." Open-source vision LLMs proved "more efficient but not accurate enough for production" [1].

The company's experience reflects a broader industry challenge. Research has consistently shown that AI models developed by both Western and Chinese companies struggle with minority languages, even within their own linguistic regions. A recent study found that Chinese AI models perform as poorly on Chinese minority languages as Western models do [2].

Building a Custom Solution

Faced with these limitations, Grab decided to build its own Vision LLM. The team started by evaluating existing models and selected Alibaba Cloud's Qwen2-VL 2B as its foundation. It extracted Southeast Asian language content from Common Crawl and created "an in-house synthetic data pipeline to generate text images by rendering SEA text contents in various fonts, backgrounds and augmentations" [1].
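
To make the idea of a synthetic text-image pipeline concrete, here is a minimal sketch of one planning step: pairing each corpus line with a randomly sampled font, background, and set of augmentations. Everything named here (the option pools, `RenderSpec`, `make_render_specs`) is a hypothetical illustration, not Grab's pipeline, whose details the source does not describe. Actual rasterization would need an image library and is left out.

```python
import random
from dataclasses import dataclass

# Hypothetical option pools. Grab's real fonts, backgrounds, and
# augmentations are not described in the source.
FONTS = ["NotoSansThai-Regular", "NotoSans-Regular", "NotoSerif-Regular"]
BACKGROUNDS = ["plain_white", "paper_texture", "id_card_pattern"]
AUGMENTATIONS = ["gaussian_blur", "rotation", "jpeg_noise", "low_contrast"]


@dataclass(frozen=True)
class RenderSpec:
    """One planned synthetic sample: the text doubles as the OCR label."""
    text: str
    font: str
    background: str
    augmentations: tuple


def make_render_specs(lines, per_line, seed=0):
    """Sample `per_line` rendering variants for every text line."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    specs = []
    for text in lines:
        for _ in range(per_line):
            n_aug = rng.randint(0, 2)  # apply zero to two augmentations
            specs.append(RenderSpec(
                text=text,
                font=rng.choice(FONTS),
                background=rng.choice(BACKGROUNDS),
                augmentations=tuple(rng.sample(AUGMENTATIONS, n_aug)),
            ))
    return specs


# Example corpus lines in Indonesian, Thai, and Vietnamese.
corpus = ["Kartu Tanda Penduduk", "บัตรประจำตัวประชาชน", "Giấy phép lái xe"]
specs = make_render_specs(corpus, per_line=4)
for spec in specs[:3]:
    print(spec)
```

Because the ground-truth text is known before rendering, every generated image comes with a label for free, which is what makes this kind of data cheap to produce at scale.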

Initial experiments using Low-Rank Adaptation (LoRA) fine-tuning showed promise for Latin-script documents, achieving particularly high accuracy on Indonesian documents. However, Thai and Vietnamese documents remained challenging, as did documents with unstructured layouts and dense text [1].
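
For readers unfamiliar with LoRA, the sketch below shows the core idea in plain Python: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, so the layer effectively applies W + (alpha / r) * B @ A. The dimensions, the `alpha` value, and the helper names are illustrative assumptions, not Grab's configuration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA layer applies."""
    r = len(a)            # rank: A has shape r x d_in
    scale = alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_out, d_in, r):
    """Number of trainable values in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


rng = random.Random(0)
d_out, d_in, r, alpha = 6, 4, 2, 4.0  # toy sizes, chosen for illustration

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trained
B = [[0.0] * r for _ in range(d_out)]  # trained; zero init, so no change yet

W_eff = lora_effective_weight(W, A, B, alpha)
print(W_eff == W)  # the adapter starts as a no-op because B is all zeros

# For a hypothetical 1024 x 1024 layer at rank 8:
print(lora_trainable_params(1024, 1024, 8), "trainable vs", 1024 * 1024)
```

The small trainable footprint is what makes LoRA cheap, and it is also a plausible reason it adapts less thoroughly than updating every parameter, which fits the limits Grab reports for Thai and Vietnamese documents.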

Full-Parameter Training Breakthrough

Grab's team discovered that existing vision LLMs "lack visual text in SEA languages during vision encoder and joint training." This insight led them to perform full-parameter fine-tuning, first training vision components using synthetic OCR datasets for Bahasa Indonesia, Thai, Vietnamese, and English [1].

The intensive training process "pushed the limits of GPUs," ultimately leading Grab to build a lightweight Vision LLM with approximately 1 billion parameters from scratch. The resulting model outperformed existing OCR tools, Qwen2, ChatGPT, and Google's Gemini on Grab's specific tasks [1].

Industry-Wide Language Accessibility Crisis

Grab's experience illuminates a significant challenge facing the AI industry. The lack of adequate datasets for training models on diverse languages creates barriers to accessibility and functionality. Major AI companies are investing heavily to address this gap: Google partnered with IIT Bombay on Indic language speech models, Meta reportedly pays contractors $55 per hour for Hindi language training work, and OpenAI announced a $500,000 research collaboration with IIT Madras [2].

Source: Gadgets 360


Collecting data for prominent Asian languages is expensive but feasible. Minority languages face even greater challenges: they may never achieve adequate representation in major AI models, creating persistent accessibility limitations [2].

Future Development Plans

Grab plans to expand its AI capabilities, developing "Chain of Thought-based OCR and Key Information Extraction models to strengthen generalisation capabilities." The company also intends to extend its document processing technology to Myanmar, Cambodia, and other markets [1].

TheOutpost.ai

© 2025 Triveous Technologies Private Limited