Curated by THEOUTPOST
On Wed, 5 Mar, 12:05 AM UTC
3 Sources
[1]
Cohere Releases Aya Vision Models That Can Analyse Images
* Cohere's Aya Vision models can generate output in 23 languages
* Aya Vision is available in 8B and 32B parameter sizes
* The models are said to outperform Meta's Llama-3.2 90B Vision

Cohere For AI, the firm's open research division, released new state-of-the-art (SOTA) vision models on Tuesday. Dubbed Aya Vision, the artificial intelligence (AI) models are available in two parameter sizes. The company's latest frontier models address the inconsistent performance of existing large language models (LLMs) across different languages, especially on multimodal tasks. Aya Vision models can generate outputs in 23 languages and can perform both text-based and image-based tasks; however, they cannot generate images. Cohere has made the AI models available on open-source repositories as well as via WhatsApp.

In a blog post, the AI firm detailed the new vision models. Aya Vision is available in 8B and 32B parameter sizes. These models can generate text, translate text and images across 23 languages, analyse images and answer queries about them, and caption images. Both models can be accessed via Cohere's Hugging Face page and on Kaggle. Additionally, general users can try out the models via a dedicated WhatsApp chat account. The company says the Aya Vision models are useful when people come across images or artworks they would like to learn more about.

Based on the company's internal testing, the Aya Vision 8B model outperforms the Qwen2.5-VL 7B, Gemini Flash 1.5 8B, and Llama 3.2 11B Vision models on the AyaVisionBench and m-WildVision benchmarks. Notably, the AyaVisionBench benchmark was also developed by Cohere, and its details have been shared in the public domain. As for the Aya Vision 32B model, the company claims it outperforms the Llama 3.2 90B Vision and Qwen2-VL 72B models on the same benchmarks.

To achieve frontier performance, Cohere said it developed several algorithmic innovations. The Aya Vision models were trained on synthetic annotations, multilingual data was scaled up through translation and rephrasing, and multiple multimodal models were merged in separate steps. The developers observed that each step significantly improved performance.

Notably, developers can access the open weights of the Aya Vision models from Kaggle and Hugging Face; however, the models are released under a Creative Commons Attribution Non-Commercial 4.0 license, which allows academic and research use but prohibits commercial use cases.
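For readers who want to try the open weights, here is a minimal sketch of how one might load and prompt the 8B model with the Hugging Face transformers library. The repository ID, image URL, prompt, and generation settings are illustrative assumptions rather than details from the articles above, and a recent transformers release with Aya Vision support is assumed.

```python
# Minimal sketch: loading an Aya Vision checkpoint from Hugging Face and asking
# a question about an image. Repository ID, image URL, and generation settings
# are illustrative assumptions.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed repository ID

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# A chat-style request mixing an image with a multilingual text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/artwork.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this artwork in Spanish."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The same pattern should apply to the 32B checkpoint, which simply requires more GPU memory.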
[2]
Cohere claims its new Aya Vision AI model is best-in-class | TechCrunch
Cohere for AI, AI startup Cohere's nonprofit research lab, this week released a multimodal "open" AI model, Aya Vision, which the lab claims is best-in-class. Aya Vision can perform tasks like writing image captions, answering questions about photos, translating text, and generating summaries in 23 major languages. Cohere, which is also making Aya Vision available for free through WhatsApp, called it "a significant step towards making technical breakthroughs accessible to researchers worldwide."

"While AI has made significant progress, there is still a big gap in how well models perform across different languages -- one that becomes even more noticeable in multimodal tasks that involve both text and images," Cohere wrote in a blog post. "Aya Vision aims to explicitly help close that gap."

Aya Vision comes in a couple of flavors: Aya Vision 32B and Aya Vision 8B. The more sophisticated of the two, Aya Vision 32B, sets a "new frontier," Cohere said, outperforming models more than twice its size, including Meta's Llama-3.2 90B Vision, on certain visual understanding benchmarks. Meanwhile, Aya Vision 8B scores better on some evaluations than models 10x its size, according to Cohere. Both models are available from AI dev platform Hugging Face under a Creative Commons 4.0 license with Cohere's acceptable use addendum. They can't be used for commercial applications.

Cohere said that Aya Vision was trained using a "diverse pool" of English datasets, which the lab translated and used to create synthetic annotations. Annotations, also known as tags or labels, help models understand and interpret data during the training process. For example, annotations used to train an image recognition model might take the form of markings around objects or captions referring to each person, place, or object depicted in an image.

Cohere's use of synthetic annotations -- that is, annotations generated by AI -- is on trend. Despite its potential downsides, rivals including OpenAI are increasingly leveraging synthetic data to train models as the well of real-world data dries up. Research firm Gartner estimates that 60% of the data used for AI and analytics projects last year was synthetically created. According to Cohere, training Aya Vision on synthetic annotations enabled the lab to use fewer resources while achieving competitive performance.

"This showcases our critical focus on efficiency and [doing] more using less compute," Cohere wrote in its blog. "This also enables greater support for the research community, who often have more limited access to compute resources."

Together with Aya Vision, Cohere also released a new benchmark suite, AyaVisionBench, designed to probe a model's skills in "vision-language" tasks like identifying differences between two images and converting screenshots to code.

The AI industry is in the midst of what some have called an "evaluation crisis," a consequence of the popularization of benchmarks that give aggregate scores that correlate poorly with proficiency on the tasks most AI users care about. Cohere asserts that AyaVisionBench is a step toward rectifying this, providing a "broad and challenging" framework for assessing a model's cross-lingual and multimodal understanding. With any luck, that's indeed the case.

"[T]he dataset serves as a robust benchmark for evaluating vision-language models in multilingual and real-world settings," Cohere researchers wrote in a post on Hugging Face. "We make this evaluation set available to the research community to push forward multilingual multimodal evaluations."
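To make the idea of synthetic, translated annotations concrete, here is a small hypothetical example of what one multilingual training record built this way might look like. The field names, caption text, and pipeline steps are invented for illustration; Cohere has not published this exact format.

```python
# Hypothetical illustration of a synthetic multimodal training record.
# The schema and values are invented for explanation; they do not describe
# Cohere's actual data format or pipeline.

# Step 1 (assumed): a "teacher" vision-language model captions an English image,
# producing a synthetic annotation instead of a human-written label.
synthetic_caption_en = "A street vendor arranging baskets of fresh mangoes at a market stall."

# Step 2 (assumed): the synthetic caption is translated and rephrased to scale
# the data across languages, as the articles describe.
translations = {
    "es": "Un vendedor ambulante acomoda cestas de mangos frescos en un puesto de mercado.",
    "fr": "Un vendeur ambulant dispose des paniers de mangues fraîches sur un étal de marché.",
}

# Step 3: the image, the synthetic annotation, and its translations are stored
# as a single supervised training record.
training_record = {
    "image_path": "images/market_0001.jpg",
    "annotation_en": synthetic_caption_en,
    "annotations_translated": translations,
    "source": "synthetic",  # flag distinguishing AI-generated labels from human ones
}

print(training_record["annotations_translated"]["es"])
```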
[3]
Cohere's first vision model Aya Vision is here with broad, multilingual understanding and open weights -- but there's a catch
Canadian AI startup Cohere launched in 2019 specifically targeting the enterprise, but independent research has shown it has so far struggled to gain much market share among third-party developers compared to rival proprietary U.S. model providers such as OpenAI and Anthropic, not to mention the rise of Chinese open source competitor DeepSeek.

Yet Cohere continues to bolster its offerings: today, its non-profit research division Cohere For AI announced the release of its first vision model, Aya Vision, a new open-weight multimodal AI model that integrates language and vision capabilities and boasts the differentiator of supporting inputs in 23 different languages spoken by what Cohere says in an official blog post is "half the world's population," giving it appeal to a wide global audience.

Aya Vision is designed to enhance AI's ability to interpret images, generate text, and translate visual content into natural language, making multilingual AI more accessible and effective. This would be especially helpful for enterprises and organizations operating in multiple markets around the world with different language preferences.

It's available now on Cohere's website and on AI code communities Hugging Face and Kaggle under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, allowing researchers and developers to freely use, modify, and share the model for non-commercial purposes as long as proper attribution is given. In addition, Aya Vision is available through WhatsApp, allowing users to interact with the model directly in a familiar environment. This limits its use for enterprises and as an engine for paid apps or moneymaking workflows, unfortunately.

It comes in 8-billion and 32-billion parameter versions (parameters refer to the number of internal settings in an AI model, including its weights and biases, with more usually denoting a more powerful and performant model).

Supports 23 languages and counting

Even though leading AI models from rivals can understand text across multiple languages, extending this capability to vision-based tasks is a challenge. Aya Vision addresses this by allowing users to generate image captions, answer visual questions, translate images, and perform text-based language tasks across its 23 supported languages.

In its blog post, Cohere showed how Aya Vision can analyze imagery and text on product packaging and provide translations or explanations. It can also identify and describe art styles from different cultures, helping users learn about objects and traditions through AI-powered visual understanding.

Aya Vision's capabilities have broad implications across multiple fields:

* Language Learning and Education: Users can translate and describe images in multiple languages, making educational content more accessible.
* Cultural Preservation: The model can generate detailed descriptions of art, landmarks, and historical artifacts, supporting cultural documentation in underrepresented languages.
* Accessibility Tools: Vision-based AI can assist visually impaired users by providing detailed image descriptions in their native language.
* Global Communication: Real-time multimodal translation enables organizations and individuals to communicate across languages more effectively.
Strong performance and high efficiency across leading benchmarks

One of Aya Vision's standout features is its efficiency and performance relative to model size. Despite being significantly smaller than some leading multimodal models, Aya Vision has outperformed much larger alternatives in several key benchmarks.

* Aya Vision 8B outperforms Llama 90B, which is 11 times larger.
* Aya Vision 32B outperforms Qwen 72B, Llama 90B, and Molmo 72B, all of which are at least twice as large.
* Benchmarking results on AyaVisionBench and m-WildVision show Aya Vision 8B achieving win rates of up to 79%, and Aya Vision 32B reaching 72% win rates in multilingual image understanding tasks.

A visual comparison of efficiency vs. performance highlights Aya Vision's advantage. As shown in the efficiency vs. performance trade-off graph, Aya Vision 8B and 32B demonstrate best-in-class performance relative to their parameter size, outperforming much larger models while maintaining computational efficiency.

The tech innovations powering Aya Vision

Cohere For AI attributes Aya Vision's performance gains to several key innovations:

* Synthetic Annotations: The model leverages synthetic data generation to enhance training on multimodal tasks.
* Multilingual Data Scaling: By translating and rephrasing data across languages, the model gains a broader understanding of multilingual contexts.
* Multimodal Model Merging: Advanced techniques combine insights from both vision and language models, improving overall performance (a simplified sketch of checkpoint merging appears at the end of this section).

These advancements allow Aya Vision to process images and text with greater accuracy while maintaining strong multilingual capabilities. The step-by-step performance improvement chart showcases how incremental innovations, including supervised fine-tuning (SFT), model merging, and scaling, contributed to Aya Vision's high win rates.

Implications for enterprise decision makers

Despite ostensibly catering to the enterprise, businesses may have a hard time making much use of Aya Vision given its restrictive non-commercial licensing terms. Nonetheless, CEOs, CTOs, IT leaders, and AI researchers may use the models to explore AI-driven multilingual and multimodal capabilities within their organizations -- particularly in research, prototyping, and benchmarking. Enterprises can still use it for internal research and development, evaluating multilingual AI performance, and experimenting with multimodal applications.

CTOs and AI teams will find Aya Vision valuable as a highly efficient, open-weight model that outperforms much larger alternatives while requiring fewer computational resources. This makes it a useful tool for benchmarking against proprietary models, exploring potential AI-driven solutions, and testing multilingual multimodal interactions before committing to a commercial deployment strategy.

For data scientists and AI researchers, Aya Vision is much more useful. Its open-weight nature and rigorous benchmarks provide a transparent foundation for studying model behavior, fine-tuning in non-commercial settings, and contributing to open AI advancements. Whether used for internal research, academic collaborations, or AI ethics evaluations, Aya Vision serves as a cutting-edge resource for enterprises looking to stay at the forefront of multilingual and multimodal AI -- without the constraints of proprietary, closed-source models.
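The "multimodal model merging" innovation referenced above is, in its simplest form, a weighted average of the parameters of separately trained checkpoints. The snippet below is a deliberately simplified sketch of linear weight merging; the 50/50 blend and the idea that this mirrors Cohere's exact procedure are assumptions, since the company has only described the technique at a high level.

```python
# Simplified sketch of merging two model checkpoints by linear weight averaging.
# Cohere describes "multimodal model merging" only at a high level; the 50/50
# blend and the overall procedure here are illustrative assumptions.
import torch

def merge_checkpoints(state_dict_a, state_dict_b, alpha=0.5):
    """Blend two state dicts with the same architecture: alpha*A + (1-alpha)*B."""
    merged = {}
    for name, weight_a in state_dict_a.items():
        weight_b = state_dict_b[name]
        merged[name] = alpha * weight_a + (1.0 - alpha) * weight_b
    return merged

# Example with toy tensors standing in for real model weights.
ckpt_language = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.zeros(4)}
ckpt_vision   = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.ones(4)}

merged_weights = merge_checkpoints(ckpt_language, ckpt_vision, alpha=0.5)
print(merged_weights["layer.bias"])  # tensor of 0.5s: the element-wise average
```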
Open source research and collaboration

Aya Vision is part of Aya, a broader initiative by Cohere focused on making AI and related tech more multilingual. Since its inception in February 2024, the Aya initiative has engaged a global research community of over 3,000 independent researchers across 119 countries, working together to improve language AI models.

To further its commitment to open science, Cohere has released the open weights for both Aya Vision 8B and 32B on Kaggle and Hugging Face, ensuring researchers worldwide can access and experiment with the models. In addition, Cohere For AI has introduced AyaVisionBench, a new multilingual vision evaluation set designed to provide a rigorous assessment framework for multimodal AI. The availability of Aya Vision as an open-weight model marks an important step in making multilingual AI research more inclusive and accessible.

Aya Vision builds on the success of Aya Expanse, another LLM family from Cohere For AI focused on multilingual AI. By expanding its focus to multimodal AI, Cohere For AI is positioning Aya Vision as a key tool for researchers, developers, and businesses looking to integrate multilingual AI into their workflows.

As the Aya initiative continues to evolve, Cohere For AI has also announced plans to launch a new collaborative research effort in the coming weeks. Researchers and developers interested in contributing to multilingual AI advancements can join the open science community or apply for research grants.

For now, Aya Vision's release represents a significant leap in multilingual multimodal AI, offering a high-performance, open-weight solution that challenges the dominance of larger, closed-source models. By making these advancements available to the broader research community, Cohere For AI continues to push the boundaries of what is possible in AI-driven multilingual communication.
Cohere's non-profit research division has released Aya Vision, a state-of-the-art open-weight AI model capable of analyzing images and generating text in 23 languages, which the company says outperforms much larger models while using fewer computational resources.
Cohere For AI, the non-profit research division of Canadian AI startup Cohere, has unveiled Aya Vision, an open-weight AI model that combines advanced image analysis capabilities with multilingual text generation. This release marks a significant step forward in making sophisticated AI technology accessible to researchers worldwide [1].
Aya Vision is available in two sizes: 8B and 32B parameters. The model boasts an impressive array of functionalities:

* Generating image captions and answering questions about photos
* Translating and summarizing text across its supported languages
* Analyzing images, including text on product packaging, artworks, and screenshots
* Performing standard text-based language tasks alongside multimodal ones
Notably, Aya Vision supports languages spoken by approximately half of the world's population, making it a versatile tool for global applications [2].
Despite its relatively smaller size, Aya Vision has demonstrated remarkable performance:

* Aya Vision 8B outperforms Qwen2.5-VL 7B, Gemini Flash 1.5 8B, and Llama 3.2 11B Vision on the AyaVisionBench and m-WildVision benchmarks
* Aya Vision 32B outperforms Llama 3.2 90B Vision, Qwen2-VL 72B, and Molmo 72B, models at least twice its size
* Aya Vision 8B achieves win rates of up to 79%, and Aya Vision 32B up to 72%, on multilingual image understanding tasks
Cohere claims that Aya Vision sets a "new frontier" in efficiency, achieving competitive results with fewer computational resources [3].
The impressive performance of Aya Vision is attributed to several key innovations:

* Synthetic annotations: AI-generated labels used to train the model on multimodal tasks
* Multilingual data scaling: translating and rephrasing English datasets to broaden language coverage
* Multimodal model merging: combining separately trained vision and language models into a single model
These techniques have allowed Cohere to achieve high win rates in benchmarks like AyaVisionBench and m-WildVision [3].
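For context, the "win rate" figures reported on these benchmarks come from pairwise comparisons, where a judge decides which of two models produced the better answer for each prompt. The snippet below is a minimal sketch of how such a rate can be computed from head-to-head results; the sample data and the tie-handling convention are assumptions, not details published by Cohere.

```python
# Minimal sketch of computing a pairwise win rate from head-to-head judgments.
# The outcomes below are made-up sample data; Cohere's evaluation details
# (judge model, tie handling, prompt set) are not specified in the articles.
from collections import Counter

# One entry per benchmark prompt: did the candidate model win, lose, or tie
# against the reference model according to the judge?
outcomes = ["win", "win", "loss", "tie", "win", "loss", "win", "win", "tie", "win"]

counts = Counter(outcomes)
# Common convention (assumed here): ties count as half a win.
win_rate = (counts["win"] + 0.5 * counts["tie"]) / len(outcomes)

print(f"Win rate: {win_rate:.0%}")  # 70% for this toy sample
```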
Aya Vision is available through multiple channels:

* Open weights on Hugging Face and Kaggle
* Cohere's own website
* A dedicated WhatsApp chat, where general users can interact with the model for free
The model is released under a Creative Commons Attribution Non-Commercial 4.0 license, allowing free use for academic and research purposes but prohibiting commercial applications [1].
The multilingual and multimodal capabilities of Aya Vision open up numerous possibilities:

* Language learning and education, through translated image descriptions
* Cultural preservation, via detailed descriptions of art, landmarks, and artifacts in underrepresented languages
* Accessibility tools that describe images to visually impaired users in their native language
* Real-time multimodal translation for global communication
However, the non-commercial license may limit its adoption in enterprise settings [3].
While Cohere has positioned itself as an enterprise-focused AI company, its market share among third-party developers has been limited compared to rivals like OpenAI and Anthropic. The release of Aya Vision, with its open weights and impressive capabilities, could potentially boost Cohere's standing in the AI research community [3].