First Published on: 2025/02/13
GitHub: https://github.com/Fancy-MLLM/R1-Onevision
Dataset: https://huggingface.co/datasets/Fancy-MLLM/R1-Onevision
Benchmark: https://huggingface.co/datasets/Fancy-MLLM/R1-Onevision-Bench
Model: https://huggingface.co/Fancy-MLLM/R1-Onevision-7B
Space: https://huggingface.co/spaces/Fancy-MLLM/R1-Onevision
Papers with Code: https://paperswithcode.com/dataset/r1-onevision
Authors: Yi Yang*, Xiaoxuan He*, Hongkun Pan*, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Minfeng Zhu, Bo Zhang
We are entering an exciting era on the path toward Artificial General Intelligence (AGI), where higher-order reasoning capabilities are becoming a reality. While Large Language Models (LLMs) have demonstrated impressive Chain-of-Thought (CoT) reasoning, building multimodal models capable of comparably complex reasoning remains a significant challenge.
We release R1-Onevision, a multimodal reasoning model designed to bridge the gap between multimodal capabilities and deep reasoning abilities. With its robust multimodal reasoning capabilities, we envision R1-Onevision as a powerful tool in areas such as mathematics, science, deep image understanding, and logical reasoning. R1-Onevision achieves state-of-the-art performance, surpassing models such as GPT-4o, GPT-4V, and Qwen-VL on multiple challenging multimodal benchmarks.
In this report, we discuss how R1-Onevision represents a major leap forward in multimodal reasoning. This advance is largely driven by a pipeline for constructing large-scale, high-quality visual reasoning data. In particular, we propose a formal language-driven visual reasoning process to enhance the model's reasoning ability over images. This integration of visual reasoning and formal language enables the model to interpret and reason about images in a structured, precise manner that aligns with formal logic. In addition, we apply Rule-Based Reinforcement Learning (RL) after CoT training to strengthen the model's reasoning and output reliability, using explicit constraints to enforce structured reasoning and answer validity, all built on a model supervised fine-tuned (SFT) on the R1-Onevision Dataset.
Due to the lack of available benchmarks for multimodal reasoning, we have developed a new benchmark, R1-Onevision-Bench, which consists of in-the-wild reasoning tasks spanning logic, math, physics, and chemistry problems drawn from real-world scenarios.
To bridge the gap between visual and textual reasoning and to optimize structured reasoning, we employ two core methodologies in the development of the R1-Onevision multimodal reasoning model.
The first key component is the R1-Onevision Dataset construction, where we bridge the gap between visual and textual reasoning through the use of formal language. By combining dense captioning techniques, a language reasoning model, and a role-playing approach, we created a dataset that enhances model performance on diverse reasoning tasks, spanning natural scenes, mathematical problems, and logical reasoning.
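To make this pipeline concrete, here is a minimal sketch of how such a data construction flow could be orchestrated. The helper functions, their placeholder bodies, and the three-step decomposition shown are illustrative assumptions, not the exact R1-Onevision implementation; in practice each placeholder would call a real captioning, reasoning, or refinement model.

```python
# Hypothetical sketch of a formal language-driven data construction pipeline.
# All helpers below are placeholder stubs standing in for real model calls.

from dataclasses import dataclass

@dataclass
class ReasoningSample:
    image_path: str
    question: str
    formal_description: str  # formal-language rendering of the image
    chain_of_thought: str    # step-by-step reasoning trace
    answer: str

def dense_caption(image_path: str) -> str:
    # Placeholder: call a captioning/OCR/VLM pipeline that renders the image
    # as formal language (objects, text, coordinates, equations).
    return f"[formal description of {image_path}]"

def reason(formal_description: str, question: str) -> str:
    # Placeholder: prompt a strong text-only reasoning model with the formal
    # description plus the question to obtain a chain of thought.
    return f"Step-by-step reasoning over: {formal_description} / {question}"

def role_play_refine(cot: str) -> tuple[str, str]:
    # Placeholder: rewrite the trace as if the model had seen the image
    # directly, extract the final answer, and filter inconsistent traces.
    return cot, "final answer"

def build_sample(image_path: str, question: str) -> ReasoningSample:
    desc = dense_caption(image_path)      # 1) image -> formal language
    cot = reason(desc, question)          # 2) text-only CoT over the description
    cot, answer = role_play_refine(cot)   # 3) role-play rewrite + filtering
    return ReasoningSample(image_path, question, desc, cot, answer)
```

The design choice this sketch illustrates is the key idea of the section: once the image is expressed in formal language, a text-only reasoning model can produce the chain of thought, and the role-play step then restores a first-person visual perspective to the trace.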
The second core methodology is the Rule-Based Reinforcement Learning (RL) framework, which refines the reasoning process by incorporating explicit rules to enforce accuracy and structure. Starting from a model supervised fine-tuned (SFT) on the R1-Onevision Dataset, we apply rule-based RL to produce more reliable outputs, with a strong emphasis on structured reasoning, logical deduction, and response integrity enforced through accuracy and formatting checks. This approach combines the strengths of reinforcement learning with rule-based constraints to push the boundaries of multimodal reasoning and ensure coherent, contextually accurate answers.
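As a concrete illustration of what "accuracy and formatting checks" can look like, below is a minimal sketch of a rule-based reward function. It assumes a DeepSeek-R1-style `<think>...</think><answer>...</answer>` output template and exact-match answer scoring with equal weights; the actual template, matching rules, and weights used for R1-Onevision are assumptions here.

```python
import re

# Assumed output template: <think>reasoning</think> <answer>final answer</answer>
THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Reward 1.0 only if the output follows the <think>/<answer> template."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 if the extracted answer exactly matches the reference."""
    m = THINK_ANSWER.match(completion.strip())
    if m is None:
        return 0.0
    predicted = m.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # Equal weighting is an assumption; real training may weight or
    # schedule the format and accuracy terms differently.
    return format_reward(completion) + accuracy_reward(completion, gold_answer)

# Usage example:
out = "<think>2 + 3 equals 5.</think> <answer>5</answer>"
print(total_reward(out, "5"))  # 2.0: well-formatted and correct
```

Because both terms are computed by deterministic rules rather than a learned reward model, the signal is cheap to evaluate and hard for the policy to game, which is the main appeal of rule-based RL in this setting.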
The R1-Onevision dataset is not just a collection of data; it is the culmination of a meticulous process designed to bridge the gap between visual and textual reasoning by constructing formal-language descriptions of images. This dataset empowers models to perform deep, structured reasoning across diverse domains, from natural scenes to complex mathematical problems.