2 Sources
[1]
Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer
These open-weight models facilitate fine-tuning across more than 30 transfer tasks, improving state-of-the-art results in fields such as molecular structure recognition, optical music score transcription, and table structure analysis.

Google has announced the launch of PaliGemma 2, a family of vision-language models (VLMs) based on the Gemma 2 architecture, building on its predecessor with broader task applicability. The upgrade includes three model sizes (3B, 10B, and 28B) and three resolutions (224px², 448px², and 896px²), designed to optimise transfer learning across diverse domains. According to Google, the models were trained in three stages using Cloud TPU infrastructure to handle multimodal datasets spanning captioning, optical character recognition (OCR), and radiography report generation.

In their paper, the researchers explain, "We observed that increasing the image resolution and model size significantly impacts transfer performance, especially for document and visual-text recognition tasks." The models achieved state-of-the-art accuracy on datasets such as HierText for OCR and GrandStaff for music score transcription. The fine-tuning capabilities of PaliGemma 2 allow it to address applications beyond traditional benchmarks. The researchers noted that while increasing compute resources yields better results for most tasks, certain specialised applications benefit more from either higher resolution or larger model size, depending on task complexity.

PaliGemma 2 also emphasises accessibility, with models designed to operate in low-precision formats for on-device inference. The researchers highlight, "Quantization of models for CPU-only environments retains nearly equivalent quality, making it suitable for broader deployments."

Separately, Google DeepMind has introduced Genie 2, a large-scale foundation world model capable of generating diverse playable 3D environments. Genie 2 transforms a single image into interactive virtual worlds that can be explored by humans or AI using standard keyboard and mouse controls, facilitating the development of embodied AI agents. Google DeepMind has also launched GenCast, an AI model that improves weather predictions by providing faster and more accurate forecasts up to 15 days in advance while also addressing uncertainties and risks. In addition, Google has unveiled its experimental AI model, Gemini-Exp-1121, positioned as a competitor to OpenAI's GPT-4o, and is gearing up to release Gemini 2, which is expected to compete with OpenAI's forthcoming model, o1.
[2]
Google Open Sources PaliGemma 2 AI Model That Can 'See' Visual Inputs
Google says PaliGemma 2 can describe actions and emotions in an image.

Google introduced the successor to its PaliGemma artificial intelligence (AI) vision-language model on Thursday. Dubbed PaliGemma 2, the family of AI models improves upon the capabilities of the older generation. The Mountain View-based tech giant said the vision-language model can see, understand, and interact with visual input such as images and other visual assets. It is built on the Gemma 2 small language models (SLMs), which were released in August. Interestingly, the tech giant claimed that the model can analyse emotions in uploaded images.

In a blog post, the tech giant detailed the new PaliGemma 2 AI model. While Google has several vision-language models, PaliGemma was the first such model in the Gemma family. Vision models differ from typical large language models (LLMs) in that they have additional encoders that analyse visual content and convert it into a familiar data form. This way, vision models can technically "see" and understand the external world. One benefit of a smaller vision model is that it can be used for a large number of applications, as smaller models are optimised for speed and accuracy. With PaliGemma 2 being open-sourced, developers can build its capabilities into apps.

PaliGemma 2 comes in three parameter sizes of 3 billion, 10 billion, and 28 billion, and is available in 224px², 448px², and 896px² resolutions. Due to this, the tech giant claims it is easy to optimise the AI model's performance for a wide range of tasks. Google says it generates detailed, contextually relevant captions for images: it can not only identify objects but also describe actions, emotions, and the overall narrative of a scene. Google highlighted that the tool can be used for chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation.

The company has also published a paper on the arXiv preprint server. Developers and AI enthusiasts can download the PaliGemma 2 model and its code from Hugging Face and Kaggle. The AI model supports frameworks such as Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
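Because the article notes support for Hugging Face Transformers, a minimal captioning sketch in Python is shown below. It assumes the PaliGemmaForConditionalGeneration and PaliGemmaProcessor classes from Transformers and a hypothetical checkpoint id (google/paligemma2-3b-pt-224); the exact repository names, prompt format, and preprocessing for the released models may differ.

```python
# Minimal captioning sketch with Hugging Face Transformers.
# The checkpoint id and prompt format are assumptions, not confirmed by the article.
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-pt-224"   # hypothetical 3B / 224px checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")            # any local image
prompt = "<image>caption en"               # assumed captioning prompt for a pretrained checkpoint

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Drop the prompt tokens so only the newly generated caption is decoded.
caption_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(caption_ids, skip_special_tokens=True))
```

Per the article, the larger sizes and higher resolutions are separate variants, so only the checkpoint id would change for the 10B/28B or 448px²/896px² models.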
Google has introduced PaliGemma 2, an advanced family of vision-language AI models built on the Gemma 2 architecture. These open-source models offer improved capabilities in visual understanding and task transfer across various domains.
Google has unveiled PaliGemma 2, a new family of vision-language models (VLMs) that represents a significant advancement in artificial intelligence technology. Built upon the Gemma 2 architecture, these models are designed to enhance visual understanding and task transfer capabilities across diverse domains [1][2].
PaliGemma 2 comes in three model sizes (3B, 10B, and 28B parameters) and three resolutions (224px², 448px², and 896px²), offering flexibility for various applications. This structure allows for optimization across a wide range of tasks, from basic image recognition to complex visual analysis [1].
The models demonstrate impressive capabilities in:
- Generating detailed, contextually relevant image captions that describe not only objects but also actions, emotions, and the overall narrative of a scene [2]
- Optical character recognition (OCR) and other document and visual-text recognition tasks [1]
- Spatial reasoning over visual inputs [2]
Google's researchers employed a three-stage training process using Cloud TPU infrastructure, focusing on multimodal datasets that span:
- Image captioning
- Optical character recognition (OCR)
- Radiography (chest X-ray) report generation [1]
This comprehensive training has resulted in state-of-the-art performance on various benchmarks, including:
- HierText for OCR
- GrandStaff for music score transcription [1]
PaliGemma 2's versatility extends to numerous specialized fields:
- Chemical formula and molecular structure recognition
- Optical music score transcription
- Table structure analysis
- Chest X-ray report generation [1][2]
Researchers noted that while increased computational resources generally improve results, certain tasks benefit more from either higher resolution or larger model size, depending on their complexity [1].
A key feature of PaliGemma 2 is its emphasis on accessibility:
- The models are open-weight, with checkpoints and code downloadable from Hugging Face and Kaggle [2]
- They are designed to operate in low-precision formats for on-device inference, and the researchers report that quantized models for CPU-only environments retain nearly equivalent quality [1] (a minimal quantization sketch follows this list)
- Supported frameworks include Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp [2]
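The article does not describe the specific low-precision recipe evaluated in the paper. As an illustrative stand-in only, the sketch below applies PyTorch's generic post-training dynamic int8 quantization to the model's Linear layers, reusing the hypothetical checkpoint id from the earlier example; it is not the authors' method.

```python
# Illustrative CPU-side quantization sketch using generic PyTorch dynamic quantization,
# NOT the specific low-precision scheme reported for PaliGemma 2.
import torch
from transformers import PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"   # hypothetical checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

# Replace nn.Linear modules with dynamically quantized int8 versions for CPU inference.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model can then be used with the same processor/generate workflow as before,
# trading a small amount of accuracy for a much lower memory footprint on CPU.
```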
The release of PaliGemma 2 is part of Google's broader efforts in AI development:
- Genie 2, a Google DeepMind foundation world model that turns a single image into playable 3D environments explorable by humans or AI with standard keyboard and mouse controls, supporting work on embodied agents [1]
- GenCast, an AI weather model delivering faster and more accurate forecasts up to 15 days ahead while accounting for uncertainties and risks [1]
- Gemini-Exp-1121, an experimental model positioned as a competitor to OpenAI's GPT-4o, ahead of the expected release of Gemini 2 [1]
These developments underscore Google's commitment to advancing AI technology across multiple domains, with PaliGemma 2 representing a significant step forward in vision-language models.
Summarized by Navi
[1] Analytics India Magazine | Google Unveils PaliGemma 2 Vision-Language Models for Advanced Task Transfer
[2] Google Open Sources PaliGemma 2 AI Model That Can 'See' Visual Inputs