BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Mistral AI Releases Pixtral Large: a Multimodal Model for Advanced Image and Text Analysis

Mistral AI Releases Pixtral Large: a Multimodal Model for Advanced Image and Text Analysis

Mistral AI released Pixtral Large, a 124-billion-parameter multimodal model designed for advanced image and text processing with a 1-billion-parameter vision encoder. Built on Mistral Large 2, it achieves leading performance on benchmarks like MathVista and DocVQA, excelling in tasks that require reasoning across text and visual data.

Pixtral Large has demonstrated significant performance improvements on multiple benchmarks. It achieved 69.4% on MathVista, a dataset assessing mathematical reasoning using visual data, surpassing all previous models. In complex document and chart comprehension evaluations, the model outperformed GPT-4o and Gemini-1.5 Pro on DocVQA and ChartQA, solidifying its capabilities in structured visual reasoning tasks. Additionally, on MM-MT-Bench, which reflects real-world use cases for multimodal models, Pixtral Large outperformed Claude-3.5 Sonnet, Gemini-1.5 Pro, and GPT-4o.

Source: Mistral AI Blog

The release has garnered positive reactions from the AI community. Nagesh Nama, a CEO at xLM, shared

The release of Mistral AI's Pixtral Large is a good piece of news for the AI community. Open-sourcing such a massive multimodal model will undoubtedly encourage innovation and collaboration among researchers and smaller companies. The fact that it can handle both text and images together and is easy to fine-tune for specific needs is a significant advantage. It will be exciting to see how this model is utilized and what breakthroughs it will bring to the field of AI. Kudos to Mistral AI for taking this bold step towards open-source AI.

Naveed Sarwar, a CEO at TechloSet Solutions, added:

By making it open-source, Mistral is empowering researchers, start-ups, and innovators to fine-tune and tailor the model to their needs, unlocking the massive potential for new applications.

To evaluate Pixtral Large’s architecture, it combines Mistral Large 2’s text backbone with a vision encoder and extended multimodal capabilities. This integration ensures high performance on tasks requiring advanced reasoning across visual and textual domains while preserving the robustness of text-only processing. For instance, the vision encoder works alongside the text model, enabling seamless multimodal interactions.

Pixtral Large supports document interpretation, chart analysis, and natural image understanding, providing tools for sectors requiring advanced image-text integration. While Pixtral Large is not designed for Optical Character Recognition (OCR), Mistral AI has indicated that enhancing OCR capabilities is a priority for future developments.

Pixtral Large is available under the Mistral Research License (MRL) for academic and non-commercial use, and a separate commercial license is offered for enterprise deployment. Users can access the model via the pixtral-large-latest API or download it for self-hosted implementations on HuggingFace.

About the Author

Rate this Article

Adoption
Style

BT