2 Sources
[1]
LLMs are lousy at reading Asian languages, finds Grab
Superapp company that chased Uber away built its own model to do the job right

Proprietary large language models are bad at interpreting Asian languages, according to Singaporean super-app company Grab, which has built its own model instead.

Grab's superapp offers ride-sharing, food delivery, shopping, and even some financial services. The company is so prominent and dominant in some Asian countries that Uber sold itself to Grab and took a stake in the Singaporean company rather than compete directly. Today, Grab is a major player in Singapore, Malaysia, Indonesia, the Philippines, Vietnam, Thailand, Cambodia, and Myanmar, all of which use scripts that employ alphabets other than the Latin script used by English.

In a Tuesday post on its Engineering blog, four Grab staffers explained that the company needs to accurately extract information from ID cards, driver's licenses, and registration certificates for compliance chores like know-your-customer checks. Grab tried Optical Character Recognition (OCR) systems, but its chosen tech "struggled with the variety of document templates it had to process."

It's 2025, so the org investigated whether large language models could solve its problem. "While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding [South East Asian] SEA languages, produced errors, hallucinations, and had high latency," the post reveals. "On the other hand, open-sourced Vision LLMs were more efficient but not accurate enough for production."

The company decided building its own Vision LLM - a model that vectorizes images so a large language model can extract text - was its best option.
"We evaluated a range of LLMs capable of performing OCR and Key Information Extraction (KIE)," the post states, and chose Alibaba Cloud's Qwen2-VL 2B as its base.

To build its model, Grab extracted SEA language content from the Common Crawl, an open collection of data scraped from the web, then built what the authors describe as "an in-house synthetic data pipeline to generate text images by rendering SEA text contents in various fonts, backgrounds and augmentations."

The team next tried to fine-tune a Vision LLM using Qwen2VL and Low-Rank Adaptation (LoRA), a technique they found "efficient because it allows lightweight updates to the model's parameters, minimizing the need for extensive computational resources."

"We trained the model on our curated document data, which included various document templates in multiple languages. The performance was promising for documents with Latin scripts. Our experiment of LoRA fine-tuned Qwen2VL-2B achieved high field-level of accuracy for Indonesian documents."

Thai and Vietnamese remained hard to recognize, as did documents with unstructured layouts and small, dense text. Further experiments showed that existing vision LLMs "lack visual text in SEA languages during vision encoder and joint training." Grab's team therefore decided to perform full-parameter fine-tuning of its model.

"We first trained the vision components of the model using synthetic OCR datasets that we created for Bahasa Indonesia, Thai, Vietnamese, and English. This helps the model to learn the unique visual patterns of SEA scripts," the team wrote. Next came full-parameter fine-tuning to refine all components of the model with task-specific document data.

Grab rated the resulting model a success but admitted the fine-tuning process "pushed the limits of GPUs." "To optimize resources used and to create a model perfectly tailored to our needs, we decided to build a lightweight Vision LLM (~1B parameters) from scratch."
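As a rough illustration of the kind of synthetic data pipeline the post describes, the sketch below renders label text onto randomized backgrounds and applies small rotation and blur augmentations. It uses Pillow; the function names, augmentation choices, and parameters are assumptions for illustration, not Grab's actual pipeline, and a production version would need fonts that cover Thai and Vietnamese glyphs rather than Pillow's default font.

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def render_text_image(text, size=(320, 48), bg=(255, 255, 255), fg=(0, 0, 0),
                      rotate_deg=0.0, blur_radius=0.0, font=None):
    """Render a text string onto a background, then apply simple augmentations."""
    img = Image.new("RGB", size, bg)
    draw = ImageDraw.Draw(img)
    # NOTE: the default font only covers basic Latin; a real SEA pipeline
    # must load TrueType fonts with Thai/Vietnamese glyph coverage.
    draw.text((8, 12), text, fill=fg, font=font or ImageFont.load_default())
    if rotate_deg:
        img = img.rotate(rotate_deg, expand=False, fillcolor=bg)
    if blur_radius:
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))
    return img

def synth_batch(texts, n_variants=3, seed=0):
    """Produce (image, ground-truth label) pairs with randomized augmentations."""
    rng = random.Random(seed)
    samples = []
    for text in texts:
        for _ in range(n_variants):
            img = render_text_image(
                text,
                bg=tuple(rng.randint(200, 255) for _ in range(3)),
                rotate_deg=rng.uniform(-3.0, 3.0),
                blur_radius=rng.uniform(0.0, 0.8),
            )
            samples.append((img, text))
    return samples
```

Because each image carries its source string as the label, a pipeline like this yields unlimited OCR training pairs without manual annotation, which is the point of rendering rather than scraping images.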
Grab's post explains the process it used to create its model, and the results - performance better than OCR tools, Qwen2, ChatGPT, and Google's Gemini. The company concluded that "strategic training with high-quality data enables smaller, specialized models to achieve remarkable efficiency and effectiveness." Grab now plans more of its own models. "We're developing Chain of Thought-based OCR and Key Information Extraction (KIE) models to strengthen generalisation capabilities and tackle even more diverse document scenarios," the post states, and will also extend its advanced document processing tech "to Myanmar, Cambodia, and beyond." Grab's experience aligns with predictions this Vulture often hears about the future of AI in the enterprise, namely that many organizations will develop their own models to handle specialized tasks that general-purpose models weren't built to address. ®
[2]
AI Models are Bad at Understanding Asian Languages, Says Singapore's Grab
Grab used both online and synthetic datasets to train the model

Grab, the Singapore-based superapp company, highlighted on Monday that it was forced to develop an in-house artificial intelligence (AI) model for internal use. It is a lightweight vision large language model (LLM) that can scan documents and extract information from them. The company said the decision to develop the model was made because both proprietary and open-source models were poor at understanding Southeast Asian languages. The company's statement has raised fresh concerns around the accessibility of frontier models from Google, OpenAI, and Anthropic.

AI Models' Struggle With Non-English Languages

In a blog post detailing the architecture and training process of its in-house vision model, Grab highlighted the shortcomings it experienced when it tried to outsource the technology. "While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA languages, produced errors, hallucinations, and had high latency. On the other hand, open-sourced Vision LLMs were more efficient but not accurate enough for production," the post mentioned.

AI models' struggle with non-English languages is not a new finding. For years, researchers have pointed it out, and AI players have tried to fix the issue. However, despite gaining basic competence in popular foreign languages such as Hindi, Japanese, Spanish (Latin America and Spain), and Chinese, the models have yet to grasp the lexicon well enough to differentiate between nuances. So, they might be useful in general conversations, but for enterprise or research needs, their applicability falls short. For instance, a paper published earlier this year found that AI models developed by Chinese companies perform as poorly on Chinese minority languages as Western models do. And the issue persists both in proprietary models from Google, OpenAI, Meta, and Anthropic and in open-source models.
The reason behind this struggle is the lack of readily available, adequate datasets to train models on these languages. This is one of the reasons major AI companies are partnering with Indian companies and institutions to collect more Indic-language datasets. In July, Google teamed up with IIT Bombay to develop Indic-language AI speech models. Meta is reportedly paying contractors $55 an hour to train its models in Hindi, and OpenAI has announced a research collaboration with IIT Madras, backed by $500,000 from the ChatGPT maker. While collecting data this way is expensive, it is still possible to eventually build large enough datasets in prominent Asian and other languages. Minority languages, however, such as India's non-scheduled languages, will remain difficult for these models to gain competence in. And unless the models can learn these languages, their accessibility and functionality will always be limited.
Singapore's superapp giant Grab developed its own lightweight vision language model after discovering that major proprietary and open-source AI models perform poorly on Southeast Asian languages, highlighting broader accessibility challenges for non-English AI applications.
Singapore-based superapp giant Grab has revealed that major artificial intelligence models struggle significantly with Southeast Asian languages, prompting the company to develop its own custom vision language model. The revelation highlights broader challenges facing AI accessibility for non-English speaking populations worldwide [1].

Grab, which dominates ride-sharing, food delivery, and financial services across eight Southeast Asian countries, requires accurate document processing for compliance tasks including know-your-customer checks. The company processes ID cards, driver's licenses, and registration certificates written in scripts that don't use Latin alphabets [1].
Source: The Register
According to Grab's engineering team, while "powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA languages, produced errors, hallucinations, and had high latency." Open-source vision LLMs proved "more efficient but not accurate enough for production" [1].

The company's experience reflects a broader industry challenge. Research has consistently shown that AI models developed by both Western and Chinese companies struggle with minority languages, even within their own linguistic regions. A recent study found that Chinese AI models perform as poorly on Chinese minority languages as Western models do [2].

Faced with these limitations, Grab decided to build its own Vision LLM. The team started by evaluating existing models and selected Alibaba Cloud's Qwen2-VL 2B as their foundation. They extracted Southeast Asian language content from Common Crawl and created "an in-house synthetic data pipeline to generate text images by rendering SEA text contents in various fonts, backgrounds and augmentations" [1].
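Pulling Southeast Asian text out of a web-scale crawl requires some form of language or script filtering. Grab's post doesn't describe its filtering method, so the sketch below is only one plausible approach: a crude script detector based on Unicode block ranges, with illustrative thresholds.

```python
def script_profile(text):
    """Rough per-script character counts, using Unicode block ranges.
    The ranges are a simplification: Thai is U+0E00-U+0E7F, and Vietnamese
    diacritics mostly fall in Latin Extended Additional (U+1EA0-U+1EFF)
    plus a few Latin Extended-A/B code points."""
    counts = {"thai": 0, "viet_diacritic": 0, "latin": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0E00 <= cp <= 0x0E7F:
            counts["thai"] += 1
        elif 0x1EA0 <= cp <= 0x1EFF or 0x0102 <= cp <= 0x01B0:
            counts["viet_diacritic"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif not ch.isspace():
            counts["other"] += 1
    return counts

def looks_thai(text, threshold=0.5):
    """Keep a crawl record if at least `threshold` of its non-space
    characters are in the Thai block."""
    counts = script_profile(text)
    total = sum(counts.values()) or 1
    return counts["thai"] / total >= threshold
```

A production crawler would more likely use a trained language identifier (fastText-style) rather than raw block counting, but the block heuristic shows why script-based filtering is cheap to run at crawl scale.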
Initial experiments using Low-Rank Adaptation (LoRA) fine-tuning showed promise for Latin-script documents, particularly achieving high accuracy on Indonesian documents. However, Thai and Vietnamese remained challenging, as did documents with unstructured layouts and dense text [1].
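LoRA is cheap because the pretrained weight matrix stays frozen and only a pair of small low-rank factors is trained. A minimal NumPy sketch of the arithmetic, with illustrative dimensions (not tied to Qwen2-VL's actual layer sizes):

```python
import numpy as np

def lora_delta(A, B, alpha):
    """Low-rank weight update: dW = (alpha / r) * B @ A, where r is the rank."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init
                                           # so dW = 0 at the start of training
W_adapted = W + lora_delta(A, B, alpha=16)

full_params = W.size            # what full fine-tuning would update
lora_params = A.size + B.size   # what LoRA actually updates
```

With these sizes LoRA trains 8,192 parameters against 262,144 for the full matrix, about 3%, which is the "lightweight updates" trade-off the blog mentions; the flip side, as Grab found, is that a low-rank update cannot teach the vision encoder script features it never learned in pretraining.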
Grab's team discovered that existing vision LLMs "lack visual text in SEA languages during vision encoder and joint training." This insight led them to perform full-parameter fine-tuning, first training vision components using synthetic OCR datasets for Bahasa Indonesia, Thai, Vietnamese, and English [1].
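The two-stage schedule can be thought of as a freeze plan over the model's components: train the vision side first so it learns SEA script shapes, then unfreeze everything. A toy sketch, with hypothetical module names and parameter counts (Grab has not published its architecture breakdown):

```python
# Hypothetical module breakdown for a small vision LLM; the names and
# sizes below are illustrative assumptions, not Grab's actual model.
PARAM_COUNTS = {
    "vision_encoder":   300_000_000,
    "vision_projector":  20_000_000,
    "language_model":   680_000_000,
}

def trainable_params(stage):
    """Stage 1: only the vision modules train on synthetic OCR data while
    the language model stays frozen. Stage 2: full-parameter fine-tuning
    unfreezes every module for task-specific document data."""
    if stage == 1:
        frozen = {"language_model"}
    elif stage == 2:
        frozen = set()
    else:
        raise ValueError(f"unknown stage: {stage}")
    return sum(n for name, n in PARAM_COUNTS.items() if name not in frozen)
```

Stage 2 is where the GPU cost concentrates: every parameter needs gradients and optimizer state, which is consistent with the post's remark that the process "pushed the limits of GPUs."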
The intensive training process "pushed the limits of GPUs," ultimately leading Grab to build a lightweight Vision LLM with approximately 1 billion parameters from scratch. The resulting model outperformed existing OCR tools, Qwen2, ChatGPT, and Google's Gemini on Grab's specific tasks [1].
Grab's experience illuminates a significant challenge facing the AI industry. The lack of adequate datasets for training models on diverse languages creates barriers to accessibility and functionality. Major AI companies are investing heavily to address this gap: Google partnered with IIT Bombay for Indic-language speech models, Meta reportedly pays $55 per hour for Hindi-language training contractors, and OpenAI announced a $500,000 research collaboration with IIT Madras [2].
Source: Gadgets 360
However, while collecting data for prominent Asian languages remains expensive but feasible, minority languages face even greater challenges. These languages may never achieve adequate representation in major AI models, creating persistent accessibility limitations [2].

Grab plans to expand its AI capabilities, developing "Chain of Thought-based OCR and Key Information Extraction models to strengthen generalisation capabilities." The company also intends to extend its document processing technology to Myanmar, Cambodia, and other markets [1].
Summarized by Navi