<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="images/logo.svg" width="60%" alt="DeepSeek LLM" />
</div>
<hr>

<div align="center">
<h1>Thinking with Visual Primitives</h1>
</div>

<div align="center">

  <a href="https://www.deepseek.com/" target="_blank">
    <img alt="Homepage" src="images/badge.svg" />
  </a>
  <a href="https://huggingface.co/deepseek-ai" target="_blank">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" />
  </a>

</div>

<div align="center">

  <a href="LICENSE-CODE">
    <img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53">
  </a>
  <a href="LICENSE-MODEL">
    <img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53">
  </a>
</div>

<p align="center">
  <a href="#2-license"><b>📜 License</b></a> |
  <a href="#3-citation"><b>📖 Citation</b></a> <br>
  <!-- 📄 Paper Link (<a href=""><b>Thinking with Visual Primitives</b></a> | -->
</p>

## News

**2026.04.30**: We have released the [technical report](./Thinking_with_Visual_Primitives.pdf) detailing our approach. In the near future, we plan to make our in-house benchmarks and a subset of our cold-start data publicly available. The model weights will be integrated into our foundation model and released later.

## 1. Introduction

While recent Multimodal Large Language Models (MLLMs) have made strides in bridging the *"Perception Gap"* (e.g., through high-resolution cropping or thinking with images), they still struggle with complex structural reasoning. We identify this bottleneck as the **Reference Gap**: natural language is simply too ambiguous to point precisely at dense spatial layouts, often leading to logical collapse and hallucinations in the thinking process.

This project introduces a paradigm shift. Instead of just "seeing clearer," our model learns to **"point while it reasons."** By interleaving spatial markers (points and bounding boxes) directly into the reasoning trajectory as *minimal units of thought*, we anchor abstract linguistic concepts to concrete physical coordinates.

<table align="center">
  <tr>
    <td align="center" valign="top">
      <img src="./images/coffee.gif" style="height:250px; width:auto; max-width:none;" /><br>
      <b>Grounded Task Reasoning</b>
    </td>
    <td align="center" valign="top">
      <img src="./images/maze.gif" style="height:250px; width:auto; max-width:none;" /><br>
      <b>Topological Reasoning</b>
    </td>
  </tr>
</table>
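
The exact marker syntax used in the reasoning trajectory is not specified in this README, so the minimal Python sketch below assumes a hypothetical inline tag format (`<point x=..., y=...>` and `<box x1=..., y1=..., x2=..., y2=...>`) purely for illustration. It shows how spatial primitives interleaved with text can be parsed back out of a trace; the tag names, attribute layout, and the `extract_primitives` helper are all assumptions, not the model's actual tokenization.

```python
import re

# Hypothetical trace format: the README does not specify the real marker
# syntax, so this sketch assumes inline tags interleaved with the text.
TRACE = (
    "The mug sits on the left shelf <point x=0.21, y=0.43>, "
    "next to the kettle <box x1=0.30, y1=0.35, x2=0.48, y2=0.60>. "
    "Counting cups from left to right: one <point x=0.15, y=0.70>, "
    "two <point x=0.38, y=0.71>, three <point x=0.62, y=0.69>."
)

POINT = re.compile(r"<point x=([\d.]+), y=([\d.]+)>")
BOX = re.compile(r"<box x1=([\d.]+), y1=([\d.]+), x2=([\d.]+), y2=([\d.]+)>")

def extract_primitives(trace: str):
    """Collect the spatial anchors that ground each step of the trace."""
    points = [tuple(map(float, m)) for m in POINT.findall(trace)]
    boxes = [tuple(map(float, m)) for m in BOX.findall(trace)]
    return points, boxes

points, boxes = extract_primitives(TRACE)
print(f"points={len(points)}, boxes={len(boxes)}")  # points=4, boxes=1
```

In a real pipeline the recovered coordinates would be normalized to the image frame and checked or rendered against the input image rather than merely counted.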

### Key Highlights

* **Point-to-Reason Synergy:** Mimicking human cognitive behavior (like using a finger to count objects or trace a maze), our framework elevates visual primitives to minimal units of thought, effectively closing the Reference Gap in complex structural reasoning.
* **Extreme Visual Token Efficiency:** Built upon the architecture of DeepSeek-V4-Flash, we compress the KV cache of every 4 visual tokens into a single entry, drastically reducing image token consumption while maintaining cognitive depth (see the sketch after this list).
* **Frontier-Competitive Performance:** Despite a compact model scale and a significantly lower image-token budget, our model matches frontier models such as **GPT-5.4, Claude-Sonnet-4.6, and Gemini-3-Flash** across challenging counting and spatial reasoning benchmarks. (Note that the reported scores cover only the subset of evaluation dimensions directly relevant to this paper's research focus, and are therefore not indicative of the models' overall capabilities.)
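
To make the 4:1 visual KV compression above concrete, here is a minimal sketch that mean-pools every four visual-token cache entries into one. Mean pooling, the `(tokens, heads, head_dim)` layout, and the `compress_visual_kv` name are illustrative assumptions; the actual compression operator used by the model is not described in this README.

```python
import torch

def compress_visual_kv(kv: torch.Tensor, group: int = 4) -> torch.Tensor:
    """Pool every `group` visual-token KV entries into a single cache entry.

    NOTE: mean pooling is an illustrative stand-in, not the model's
    actual mechanism, which this README does not specify.
    """
    n, heads, dim = kv.shape
    assert n % group == 0, "visual token count must be divisible by the group size"
    # (n, heads, dim) -> (n/group, group, heads, dim) -> average over the group axis
    return kv.view(n // group, group, heads, dim).mean(dim=1)

kv_cache = torch.randn(1024, 8, 64)        # 1024 visual tokens, 8 heads, dim 64
compressed = compress_visual_kv(kv_cache)  # -> torch.Size([256, 8, 64])
print(compressed.shape)
```

Whatever the real operator, the net effect stated above is the same: the cache grows at a quarter of the visual token count, so long grounded traces stay cheap.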

<div align="center">
<img alt="image" src="images/teaser.png" style="width:90%;">
</div>

## 2. License

This code repository is licensed under [the MIT License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE).

## 3. Citation

```bibtex
@article{lu2026think,
  title={Thinking with Visual Primitives},
  author={Lu, Ruijie and Ma, Yiyang and Chen, Xiaokang and Luo, Lingxiao and Wu, Zhiyu and Pan, Zizheng and Liu, Xingchao and Lin, Yutong and Li, Hao and Liu, Wen and Hao, Zhewen and Gao, Xi and Nie, Shaoheng and Wei, Yixuan and Xie, Zhenda and Chen, Ting and Zeng, Gang},
  year={2026}
}
```

## 4. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).