Curated by THEOUTPOST
On Fri, 6 Dec, 4:02 PM UTC
2 Sources
[1]
Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer
These open-weight models can be fine-tuned across more than 30 transfer tasks, improving state-of-the-art results in fields such as molecular structure recognition, optical music score transcription, and table structure analysis.

Google has announced PaliGemma 2, a family of vision-language models (VLMs) based on the Gemma 2 architecture that builds on its predecessor with broader task applicability. The release includes three model sizes (3B, 10B, and 28B parameters) and three resolutions (224px², 448px², and 896px²), designed to optimise transfer learning across diverse domains. According to Google, the models were trained in three stages using Cloud TPU infrastructure to handle multimodal datasets spanning captioning, optical character recognition (OCR), and radiography report generation.

In their paper, the researchers explain, "We observed that increasing the image resolution and model size significantly impacts transfer performance, especially for document and visual-text recognition tasks." The models achieved state-of-the-art accuracy on datasets such as HierText for OCR and GrandStaff for music score transcription.

The fine-tuning capabilities of PaliGemma 2 allow it to address applications beyond traditional benchmarks. The researchers noted that while increasing compute resources yields better results for most tasks, certain specialised applications benefit more from either higher resolution or larger model size, depending on task complexity.

PaliGemma 2 also emphasises accessibility, with models designed to operate in low-precision formats for on-device inference. The researchers highlight, "Quantization of models for CPU-only environments retains nearly equivalent quality, making it suitable for broader deployments."
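The release grid described above (three parameter sizes crossed with three resolutions, nine variants in all) can be sketched as a small helper. The checkpoint naming pattern below is an assumption modelled on Hugging Face Hub conventions, not something stated in the article:

```python
# Sketch of the PaliGemma 2 release grid: three parameter sizes crossed with
# three square input resolutions, giving nine pretrained variants.
# The "google/paligemma2-<size>-pt-<res>" naming is an assumed Hub convention.

SIZES = ["3b", "10b", "28b"]      # parameter counts from the announcement
RESOLUTIONS = [224, 448, 896]     # square input resolution in pixels

def checkpoint_name(size: str, resolution: int) -> str:
    """Return the (assumed) Hub id for a pretrained PaliGemma 2 variant."""
    if size not in SIZES or resolution not in RESOLUTIONS:
        raise ValueError("unknown PaliGemma 2 variant")
    return f"google/paligemma2-{size}-pt-{resolution}"

# All nine released combinations:
variants = [checkpoint_name(s, r) for s in SIZES for r in RESOLUTIONS]
```

Per the paper, which variant to pick depends on the task: document and visual-text recognition benefit most from higher resolution, while other tasks gain more from a larger parameter count.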
Google DeepMind has introduced Genie 2, a large-scale foundation world model capable of generating diverse playable 3D environments. Genie 2 transforms a single image into interactive virtual worlds that can be explored by humans or AI using standard keyboard and mouse controls, facilitating the development of embodied AI agents. Additionally, Google DeepMind has launched GenCast, an AI model that enhances weather predictions by providing faster and more accurate forecasts up to 15 days in advance, while also addressing uncertainties and risks. Google has also unveiled its experimental AI model, Gemini-Exp-1121, positioned as a competitor to OpenAI's GPT-4o. The company is gearing up to release Google Gemini 2, which is expected to compete with OpenAI's forthcoming model, o1.
[2]
Google Open Sources PaliGemma 2 AI Model That Can 'See' Visual Inputs
Google says PaliGemma 2 can describe actions and emotions in an image.

Google introduced the successor to its PaliGemma artificial intelligence (AI) vision-language model on Thursday. Dubbed PaliGemma 2, the family of AI models improves upon the capabilities of the older generation. The Mountain View-based tech giant said the vision-language model can see, understand, and interact with visual input such as images and other visual assets. It is built on the Gemma 2 small language models (SLMs), which were released in August. Notably, the tech giant claimed that the model can analyse emotions in uploaded images.

In a blog post, the tech giant detailed the new PaliGemma 2 AI model. While Google has several vision-language models, PaliGemma was the first such model in the Gemma family. Vision models differ from typical large language models (LLMs) in that they have additional encoders that analyse visual content and convert it into a data form the language model can process. In this way, vision models can technically "see" and understand the external world. One benefit of a smaller vision model is that it can be used in a large number of applications, as smaller models are optimised for speed and accuracy. With PaliGemma 2 being open-sourced, developers can build its capabilities into their apps.

PaliGemma 2 comes in three parameter sizes of 3 billion, 10 billion, and 28 billion, and in 224px², 448px², and 896px² resolutions. Because of this, the tech giant claims it is easy to optimise the AI model's performance for a wide range of tasks. Google says it generates detailed, contextually relevant captions for images: it can not only identify objects but also describe actions, emotions, and the overall narrative of a scene. Google highlighted that the model can be used for chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation. The company has also published a paper on the pre-print server arXiv.
Developers and AI enthusiasts can download the PaliGemma 2 model and its code from Hugging Face and Kaggle. The AI model supports frameworks such as Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
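As an illustration of the Hugging Face Transformers support mentioned above, a minimal captioning sketch might look like the following. The checkpoint id and the "caption en" prompt convention are assumptions carried over from the original PaliGemma's model cards, not details given in the article:

```python
# Minimal captioning sketch using Hugging Face Transformers, one of the
# frameworks listed above. Checkpoint id and prompt format are assumptions;
# verify them against the PaliGemma 2 model card before use.

MODEL_ID = "google/paligemma2-3b-pt-224"  # assumed Hub id for the smallest variant

def caption_image(image_path: str, prompt: str = "caption en") -> str:
    """Generate a caption for a local image (downloads ~3B weights on first call)."""
    # Heavy imports are kept inside the function so the module stays cheap to import.
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
    from PIL import Image

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)

    # Decode only the newly generated tokens, skipping the echoed prompt.
    generated = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)

# Example (hypothetical local file): caption_image("photo.jpg")
```

The same checkpoints can also be loaded through Keras, PyTorch, JAX, or Gemma.cpp, per the article; only the Transformers path is sketched here.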
Google has introduced PaliGemma 2, an advanced family of vision-language AI models built on the Gemma 2 architecture. These open-source models offer improved capabilities in visual understanding and task transfer across various domains.
Google has unveiled PaliGemma 2, a new family of vision-language models (VLMs) that represents a significant advancement in artificial intelligence technology. Built upon the Gemma 2 architecture, these models are designed to enhance visual understanding and task transfer capabilities across diverse domains [1][2].
PaliGemma 2 comes in three model sizes (3B, 10B, and 28B parameters) and three resolutions (224px², 448px², and 896px²), offering flexibility for various applications. This structure allows for optimization across a wide range of tasks, from basic image recognition to complex visual analysis [1].
The models demonstrate impressive capabilities in:
- generating detailed, contextually relevant image captions
- describing actions, emotions, and the overall narrative of a scene
- reading and reasoning over text in images (OCR)
Google's researchers employed a three-stage training process using Cloud TPU infrastructure, focusing on multimodal datasets that span:
- image captioning
- optical character recognition (OCR)
- radiography report generation
This comprehensive training has resulted in state-of-the-art performance on various benchmarks, including:
- HierText for OCR
- GrandStaff for music score transcription
PaliGemma 2's versatility extends to numerous specialized fields:
- molecular structure and chemical formula recognition
- optical music score transcription
- table structure analysis
- chest X-ray report generation
Researchers noted that while increased computational resources generally improve results, certain tasks benefit more from either higher resolution or larger model size, depending on their complexity [1].
A key feature of PaliGemma 2 is its emphasis on accessibility:
- open weights available for download and fine-tuning
- low-precision (quantized) formats that retain nearly equivalent quality
- support for on-device and CPU-only inference
The release of PaliGemma 2 is part of Google's broader efforts in AI development:
- Genie 2, a foundation world model that generates diverse playable 3D environments
- GenCast, an AI weather model delivering faster, more accurate forecasts up to 15 days in advance
- Gemini-Exp-1121, an experimental model positioned as a competitor to OpenAI's GPT-4o
These developments underscore Google's commitment to advancing AI technology across multiple domains, with PaliGemma 2 representing a significant step forward in vision-language models.
Reference
[1] Analytics India Magazine: Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer
[2] Google Open Sources PaliGemma 2 AI Model That Can 'See' Visual Inputs