   1  # Qwen2-VL
   2  
   3  
   4  <p align="center">
   5      <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2VL_logo.png" width="400"/>
   6  <p>
   7  
   8  <p align="center">
        🤗 <a href="https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="https://qwenlm.github.io/blog/qwen2-vl/">Blog</a> &nbsp;&nbsp;| &nbsp;&nbsp; 📑 <a href="https://arxiv.org/pdf/2409.12191">Paper</a>&nbsp;&nbsp;
  10  <br>
🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen2-VL">Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp;&nbsp; | &nbsp;&nbsp;🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>&nbsp;&nbsp; | &nbsp;&nbsp;<a href="https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api"> 📑 API</a>&nbsp;&nbsp; | &nbsp;&nbsp;🖥️ <a href="https://gallery.pai-ml.com/#/preview/deepLearning/cv/qwen2-vl">PAI-DSW</a>&nbsp;&nbsp;
  12  </p>
  13  
  14  
  15  
  16  
  17  ## Introduction
  18  
After a year of relentless effort, today we are thrilled to release **Qwen2-VL**! Qwen2-VL is the latest addition to the vision-language models in the Qwen family.
  20  
  21  #### Key Enhancements:
  22  
* **SoTA understanding of images of various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
  24  
* **Understanding videos of 20min+**: with its online streaming capabilities, Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialogue, content creation, etc.
  26  
* **Agent that can operate your mobile phones, robots, etc.**: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.
  28  
  29  * **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
  30  
  31  #### Model Architecture Updates:
  32  
  33  * **Naive Dynamic Resolution**: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
  34  <p align="center">
  35      <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl_framework.jpg" width="80%"/>
  36  <p>
  37  
  38  * **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
  39  
  40  <p align="center">
  41      <img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="80%"/>
  42  <p>
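
To make the M-ROPE idea more concrete, below is a deliberately simplified illustration of how decomposed position ids could be laid out for a text-image-text sequence. This is a toy sketch only: the helper `toy_mrope_position_ids` is hypothetical, and the exact indexing and offset rules of the released implementation may differ.

```python
# Illustrative sketch of the M-ROPE idea (toy example, not the released implementation).
def toy_mrope_position_ids(n_text_before: int, grid_h: int, grid_w: int, n_text_after: int):
    """Return (temporal, height, width) position ids for a text + image + text sequence."""
    ids = []
    pos = 0
    # Text tokens: all three components share the same 1D position (behaves like plain RoPE).
    for _ in range(n_text_before):
        ids.append((pos, pos, pos))
        pos += 1
    # Image tokens: a constant temporal id, while height/width ids follow the token's row/column.
    t = pos
    for h in range(grid_h):
        for w in range(grid_w):
            ids.append((t, t + h, t + w))
    # Text after the image continues after the largest id used by the visual block (assumed here).
    pos = t + max(grid_h, grid_w)
    for _ in range(n_text_after):
        ids.append((pos, pos, pos))
        pos += 1
    return ids


print(toy_mrope_position_ids(n_text_before=2, grid_h=2, grid_w=3, n_text_after=1))
```

In this picture, text tokens behave exactly like standard 1D RoPE, while visual tokens receive genuinely 2D (image) or 3D (video) positional information.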
  43  
  44  
  45  We have open-sourced Qwen2-VL models, including Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license, as well as Qwen2-VL-72B under the Qwen license. These models are now integrated with Hugging Face Transformers, vLLM, and other third-party frameworks. We hope you enjoy using them!
  46  
  47  
  48  ## News
* 2024.09.19: The instruction-tuned [Qwen2-VL-72B model](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) and its quantized versions [[AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ), [GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4), [GPTQ-Int8](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8)] are now available. We have also released the [Qwen2-VL paper](https://arxiv.org/pdf/2409.12191) simultaneously.
* 2024.08.30: We have released the [Qwen2-VL series](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d). The 2B and 7B models are now available, and the open-source 72B model is coming soon. For more details, please check our [blog](https://qwenlm.github.io/blog/qwen2-vl/)!
  51  
  52  
  53  ## Performance
  54  ### Image Benchmarks
  55  
| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Claude-3.5 Sonnet | GPT-4o | **Qwen2-VL-72B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct))</sup> |**Qwen2-VL-7B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct))</sup> |**Qwen2-VL-2B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct))</sup> |
  57  | :--- | :---: | :---: | :---: | :---: |:---: |:---: |
  58  | MMMU<sub>val</sub>  | 58.3 | 68.3 | **69.1** | 64.5 | 54.1|41.1
  59  | MMMU-Pro | 46.9 | 51.5 | **51.9** | 46.2 | 43.5 | 37.6 
  60  | DocVQA<sub>test</sub>  | 94.1 | 95.2 | 92.8 | **96.5** | 94.5| 90.1
  61  | InfoVQA<sub>test</sub>  | 82.0 | - | - | **84.5** | 76.5|65.5
  62  | ChartQA<sub>test</sub>  | 88.4 | **90.8** | 85.7 | 88.3 |83.0| 73.5
  63  | TextVQA<sub>val</sub>  | 84.4 | - | - | **85.5** |84.3|79.7
  64  | OCRBench | 852 | 788 | 736 | **877** |845| 794
  65  | MTVQA | 17.3 | 25.7 | 27.8 | **30.9** |25.6| 18.1
  66  | VCR<sub>en easy</sub>  | 84.67 | 63.85 | 91.55 | **91.93** | 89.70| 81.45
  67  | VCR<sub>zh easy</sub>  | 22.09 | 1.0| 14.87 | **65.37** | 59.94| 46.16
  68  | RealWorldQA | 72.2 | 60.1 | 75.4 | **77.8** | 70.1| 62.9
  69  | MME<sub>sum</sub>   | 2414.7 | 1920.0 | 2328.7 | **2482.7** | 2326.8 | 1872.0
  70  | MMBench-EN<sub>test</sub>  | **86.5** | 79.7 | 83.4 | **86.5** | 83.0 | 74.9
  71  | MMBench-CN<sub>test</sub>  | 86.3 | 80.7 | 82.1 | **86.6** | 80.5| 73.5
  72  | MMBench-V1.1<sub>test</sub>  | 85.5 | 78.5 | 82.2 | **85.9** |80.7| 72.2
  73  | MMT-Bench<sub>test</sub> | 63.4 | - | 65.5 | **71.7** |63.7| 54.5
  74  | MMStar | 67.1 | 62.2 | 63.9 | **68.3** |60.7| 48.0
  75  | MMVet<sub>GPT-4-Turbo</sub>  | 65.7 | 66.0 | 69.1 | **74.0** |62.0| 49.5
  76  | HallBench<sub>avg</sub>  | 55.2 | 49.9 | 55.0 | **58.1** | 50.6 | 41.7
  77  | MathVista<sub>testmini</sub>  | 67.5 | 67.7 | 63.8 | **70.5** |58.2| 43.0
  78  | MathVision  | 16.97 | - | **30.4** | 25.9 | 16.3| 12.4
  79  
  80  ### Video Benchmarks
  81  
| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Gemini 1.5-Pro | GPT-4o | **Qwen2-VL-72B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct))</sup> |**Qwen2-VL-7B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct))</sup> |**Qwen2-VL-2B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct))</sup> |
  83  | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
  84  | MVBench | 69.6 | - | - | **73.6** | 67.0| 63.2 
  85  | PerceptionTest<sub>test</sub> |  66.9 | - | - | **68.0** | 62.3 |53.9
  86  | EgoSchema<sub>test</sub>  | 62.0 | 63.2 | 72.2 | **77.9** | 66.7 |54.9
  87  | Video-MME<br><sub>(wo/w subs)</sub>  | 66.3/69.6  | **75.0**/**81.3** | 71.9/77.2 | 71.2/77.8 | 63.3/69.0 |55.6/60.4
  88  
  89  ### Agent Benchmarks
  90  |     |Benchmark | Metric | Previous SoTA | GPT-4o | **Qwen2-VL-72B** |
  91  | :-- | :-- | :--: | :--: | :--: | :--: |
  92  |   General  | FnCall<sup>[1]</sup> | TM | - | 90.2 | **93.1** |
  93  |     |  | EM | - | 50.0 | **53.2** |
  94  |   Game  | Number Line | SR | 89.4<sup>[2]</sup> | 91.5 | **100.0** |
  95  |     | BlackJack | SR | 40.2<sup>[2]</sup> | 34.5 | **42.6** |
  96  |     | EZPoint | SR | 50.0<sup>[2]</sup> | 85.5 | **100.0** |
  97  |     | Point24 | SR | 2.6<sup>[2]</sup> | 3.0 | **4.5** |
  98  | Android | AITZ  | TM | 83.0<sup>[3]</sup> | 70.0 | **89.6** |
  99  |     |  | EM | 47.7<sup>[3]</sup> | 35.3 | **72.1** |
 100  | AI2THOR | ALFRED<sub>valid-unseen</sub> | SR | 67.7<sup>[4]</sup> | - | **67.8** |
 101  |     |  | GC | 75.3<sup>[4]</sup> | - | **75.8** | 
 102  |  VLN   | R2R<sub>valid-unseen</sub>  | SR | **79.0** | 43.7<sup>[5]</sup> | 51.7 | 
 103  |     | REVERIE<sub>valid-unseen</sub> | SR | **61.0** | 31.6<sup>[5]</sup> | 31.0 | 
 104  
 105  SR, GC, TM and EM are short for success rate, goal-condition success, type match and exact match. ALFRED is supported by SAM<sup>[6]</sup>.
 106  1. Self-Curated Function Call Benchmark by Qwen Team
 107  2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
 108  3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
 109  4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
 110  5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
 111  6. Segment Anything.
 112  
 113  ### Multilingual Benchmarks
 114  
 115  <table style="width:75%; text-align:center;">
 116      <tr>
 117          <th>Models</th>
 118          <td>AR </td>
 119          <td>DE </td>
 120          <td>FR </td>
 121          <td>IT </td>
 122          <td>JA </td>
 123          <td>KO </td>
 124          <td>RU </td>
 125          <td>TH </td>
 126          <td>VI </td>
 127          <td>AVG</td>
 128      </tr>
 129      <tr>
 130          <th align="left">Qwen2-VL-72B</th>
 131          <td>20.7 </td>
 132          <td>36.5 </td>
 133          <td>44.1 </td>
 134          <td>42.8 </td>
 135          <td>21.6 </td>
 136          <td>37.4 </td>
 137          <td>15.6 </td>
 138          <td>17.7 </td>
 139          <td>41.6 </td>
 140          <td><b>30.9</b></td>
 141      </tr>
 142      <tr>
 143          <th align="left">GPT-4o</th>
 144          <td>20.2 </td>
 145          <td>34.2 </td>
 146          <td>41.2 </td>
 147          <td>32.7 </td>
 148          <td>20.0 </td>
 149          <td>33.9 </td>
 150          <td>11.5 </td>
 151          <td>22.5 </td>
 152          <td>34.2 </td>
 153          <td>27.8</td>
 154      </tr>
 155          <tr>
 156          <th align="left">Claude3 Opus</th>
 157          <td>15.1 </td>
 158          <td>33.4 </td>
 159          <td>40.6 </td>
 160          <td>34.4 </td>
 161          <td>19.4 </td>
 162          <td>27.2 </td>
 163          <td>13.0 </td>
 164          <td>19.5 </td>
 165          <td>29.1 </td>
 166          <td>25.7 </td>
 167      </tr>
 168      <tr>
 169          <th align="left">Gemini Ultra</th>
 170          <td>14.7 </td>
 171          <td>32.3 </td>
 172          <td>40.0 </td>
 173          <td>31.8 </td>
 174          <td>12.3 </td>
 175          <td>17.2 </td>
 176          <td>11.8 </td>
 177          <td>20.3 </td>
 178          <td>28.6 </td>
 179          <td>23.2</td>
 180      </tr>
 181  </table>
 182  
These results are evaluated on the [MTVQA](https://github.com/bytedance/MTVQA/tree/main) benchmark.
 184  
 185  ## Quickstart
 186  
Below, we provide simple examples to show how to use Qwen2-VL with 🤖 ModelScope and 🤗 Transformers.
 188  
The code for Qwen2-VL is available in the latest Hugging Face Transformers, and we advise you to build from source with the following command:
```bash
 191  pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
 192  ```
Otherwise, you might encounter the following error:
 194  ```
 195  KeyError: 'qwen2_vl'
 196  ```
 197  
- ⚠️**NOTE**: The current latest version of `transformers` has [a bug](https://github.com/huggingface/transformers/issues/33401) when loading the Qwen2-VL config, so you need to install the specific version of transformers as shown above.
 199  
 200  We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
 201  
 202  ```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
 204  pip install qwen-vl-utils[decord]
 205  ```
 206  
If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to have decord used when loading videos.
 208  
### Using 🤗 Transformers to Chat
 210  
Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
 212  
 213  ```python
 214  from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
 215  from qwen_vl_utils import process_vision_info
 216  
 217  # default: Load the model on the available device(s)
 218  model = Qwen2VLForConditionalGeneration.from_pretrained(
 219      "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
 220  )
 221  
 222  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
 223  # model = Qwen2VLForConditionalGeneration.from_pretrained(
 224  #     "Qwen/Qwen2-VL-7B-Instruct",
 225  #     torch_dtype=torch.bfloat16,
 226  #     attn_implementation="flash_attention_2",
 227  #     device_map="auto",
 228  # )
 229  
# default processor
 231  processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
 232  
 233  # The default range for the number of visual tokens per image in the model is 4-16384.
 234  # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
 235  # min_pixels = 256*28*28
 236  # max_pixels = 1280*28*28
 237  # processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
 238  
 239  messages = [
 240      {
 241          "role": "user",
 242          "content": [
 243              {
 244                  "type": "image",
 245                  "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
 246              },
 247              {"type": "text", "text": "Describe this image."},
 248          ],
 249      }
 250  ]
 251  
 252  # Preparation for inference
 253  text = processor.apply_chat_template(
 254      messages, tokenize=False, add_generation_prompt=True
 255  )
 256  image_inputs, video_inputs = process_vision_info(messages)
 257  inputs = processor(
 258      text=[text],
 259      images=image_inputs,
 260      videos=video_inputs,
 261      padding=True,
 262      return_tensors="pt",
 263  )
 264  inputs = inputs.to("cuda")
 265  
 266  # Inference: Generation of the output
 267  generated_ids = model.generate(**inputs, max_new_tokens=128)
 268  generated_ids_trimmed = [
 269      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 270  ]
 271  output_text = processor.batch_decode(
 272      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 273  )
 274  print(output_text)
 275  ```
 276  <details>
 277  <summary>Multi image inference</summary>
 278  
 279  ```python
 280  # Messages containing multiple images and a text query
 281  messages = [
 282      {
 283          "role": "user",
 284          "content": [
 285              {"type": "image", "image": "file:///path/to/image1.jpg"},
 286              {"type": "image", "image": "file:///path/to/image2.jpg"},
 287              {"type": "text", "text": "Identify the similarities between these images."},
 288          ],
 289      }
 290  ]
 291  
 292  # Preparation for inference
 293  text = processor.apply_chat_template(
 294      messages, tokenize=False, add_generation_prompt=True
 295  )
 296  image_inputs, video_inputs = process_vision_info(messages)
 297  inputs = processor(
 298      text=[text],
 299      images=image_inputs,
 300      videos=video_inputs,
 301      padding=True,
 302      return_tensors="pt",
 303  )
 304  inputs = inputs.to("cuda")
 305  
 306  # Inference
 307  generated_ids = model.generate(**inputs, max_new_tokens=128)
 308  generated_ids_trimmed = [
 309      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 310  ]
 311  output_text = processor.batch_decode(
 312      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 313  )
 314  print(output_text)
 315  ```
 316  </details>
 317  
 318  <details>
 319  <summary>Video inference</summary>
 320  
 321  ```python
# Messages containing an image list as a video and a text query
 323  messages = [
 324      {
 325          "role": "user",
 326          "content": [
 327              {
 328                  "type": "video",
 329                  "video": [
 330                      "file:///path/to/frame1.jpg",
 331                      "file:///path/to/frame2.jpg",
 332                      "file:///path/to/frame3.jpg",
 333                      "file:///path/to/frame4.jpg",
 334                  ],
 335              },
 336              {"type": "text", "text": "Describe this video."},
 337          ],
 338      }
 339  ]
 340  
 341  # Messages containing a local video path and a text query
 342  messages = [
 343      {
 344          "role": "user",
 345          "content": [
 346              {
 347                  "type": "video",
 348                  "video": "file:///path/to/video1.mp4",
 349                  "max_pixels": 360 * 420,
 350                  "fps": 1.0,
 351              },
 352              {"type": "text", "text": "Describe this video."},
 353          ],
 354      }
 355  ]
 356  
 357  # Messages containing a video url and a text query
 358  messages = [
 359      {
 360          "role": "user",
 361          "content": [
 362              {
 363                  "type": "video",
 364                  "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
 365              },
 366              {"type": "text", "text": "Describe this video."},
 367          ],
 368      }
 369  ]
 370  
 371  # Preparation for inference
 372  text = processor.apply_chat_template(
 373      messages, tokenize=False, add_generation_prompt=True
 374  )
 375  image_inputs, video_inputs = process_vision_info(messages)
 376  inputs = processor(
 377      text=[text],
 378      images=image_inputs,
 379      videos=video_inputs,
 380      padding=True,
 381      return_tensors="pt",
 382  )
 383  inputs = inputs.to("cuda")
 384  
 385  # Inference
 386  generated_ids = model.generate(**inputs, max_new_tokens=128)
 387  generated_ids_trimmed = [
 388      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 389  ]
 390  output_text = processor.batch_decode(
 391      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 392  )
 393  print(output_text)
 394  ```
 395  
Video URL compatibility largely depends on the version of the third-party library. The details are in the table below. Change the backend via `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
 397  
 398  | Backend     | HTTP | HTTPS |
 399  |-------------|------|-------|
| torchvision >= 0.19.0 | ✅  | ✅   |
| torchvision < 0.19.0  | ❌  | ❌   |
| decord      | ✅  | ❌   |
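
For example, if decord cannot open an HTTPS URL in your environment, a minimal sketch like the one below forces the torchvision backend. This assumes the environment variable is read when `qwen_vl_utils` loads the video; you can equivalently export it in your shell before running the script.

```python
import os

# Select the video backend before qwen_vl_utils processes the video
# (equivalently: export FORCE_QWENVL_VIDEO_READER=torchvision in your shell).
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
image_inputs, video_inputs = process_vision_info(messages)
```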
 403  </details>
 404  
 405  <details>
 406  <summary>Batch inference</summary>
 407  
 408  ```python
 409  # Sample messages for batch inference
 410  messages1 = [
 411      {
 412          "role": "user",
 413          "content": [
 414              {"type": "image", "image": "file:///path/to/image1.jpg"},
 415              {"type": "image", "image": "file:///path/to/image2.jpg"},
 416              {"type": "text", "text": "What are the common elements in these pictures?"},
 417          ],
 418      }
 419  ]
 420  messages2 = [
 421      {"role": "system", "content": "You are a helpful assistant."},
 422      {"role": "user", "content": "Who are you?"},
 423  ]
 424  # Combine messages for batch processing
 425  messages = [messages1, messages2]
 426  
 427  # Preparation for batch inference
 428  texts = [
 429      processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
 430      for msg in messages
 431  ]
 432  image_inputs, video_inputs = process_vision_info(messages)
 433  inputs = processor(
 434      text=texts,
 435      images=image_inputs,
 436      videos=video_inputs,
 437      padding=True,
 438      return_tensors="pt",
 439  )
 440  inputs = inputs.to("cuda")
 441  
 442  # Batch Inference
 443  generated_ids = model.generate(**inputs, max_new_tokens=128)
 444  generated_ids_trimmed = [
 445      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 446  ]
 447  output_texts = processor.batch_decode(
 448      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 449  )
 450  print(output_texts)
 451  ```
 452  </details>
 453  
### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you with issues when downloading checkpoints.
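
For example, here is a minimal sketch (assuming the ModelScope model id `qwen/Qwen2-VL-7B-Instruct`, as used in the links above) that downloads a checkpoint with `snapshot_download` and then loads it from the local path:

```python
from modelscope import snapshot_download
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Download the checkpoint from ModelScope and get the local directory.
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# Load the model and processor from the local path as usual.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
```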
 456  
 457  ### More Usage Tips
 458  
 459  For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
 460  
 461  ```python
 462  # You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
 463  ## Local file path
 464  messages = [
 465      {
 466          "role": "user",
 467          "content": [
 468              {"type": "image", "image": "file:///path/to/your/image.jpg"},
 469              {"type": "text", "text": "Describe this image."},
 470          ],
 471      }
 472  ]
 473  ## Image URL
 474  messages = [
 475      {
 476          "role": "user",
 477          "content": [
 478              {"type": "image", "image": "http://path/to/your/image.jpg"},
 479              {"type": "text", "text": "Describe this image."},
 480          ],
 481      }
 482  ]
 483  ## Base64 encoded image
 484  messages = [
 485      {
 486          "role": "user",
 487          "content": [
 488              {"type": "image", "image": "data:image;base64,/9j/..."},
 489              {"type": "text", "text": "Describe this image."},
 490          ],
 491      }
 492  ]
 493  ```
 494  #### Image Resolution for performance boost
 495  
 496  The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
 497  
 498  ```python
 499  min_pixels = 256 * 28 * 28
 500  max_pixels = 1280 * 28 * 28
 501  processor = AutoProcessor.from_pretrained(
 502      "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
 503  )
 504  ```
 505  
Besides, we provide two methods for fine-grained control over the image size input to the model:
 507  
 508  1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
 509  
 510  2. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
 511  
 512  ```python
 513  # resized_height and resized_width
 514  messages = [
 515      {
 516          "role": "user",
 517          "content": [
 518              {
 519                  "type": "image",
 520                  "image": "file:///path/to/your/image.jpg",
 521                  "resized_height": 280,
 522                  "resized_width": 420,
 523              },
 524              {"type": "text", "text": "Describe this image."},
 525          ],
 526      }
 527  ]
 528  # min_pixels and max_pixels
 529  messages = [
 530      {
 531          "role": "user",
 532          "content": [
 533              {
 534                  "type": "image",
 535                  "image": "file:///path/to/your/image.jpg",
 536                  "min_pixels": 50176,
 537                  "max_pixels": 50176,
 538              },
 539              {"type": "text", "text": "Describe this image."},
 540          ],
 541      }
 542  ]
 543  ```
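
As a rough rule of thumb for choosing these values, the pixel counts above are simply a token budget multiplied by 28 * 28 (e.g. 256 * 28 * 28 = 200704 pixels for roughly 256 visual tokens, and 50176 = 64 * 28 * 28 for roughly 64 tokens). The sketch below, with the hypothetical helper `approx_visual_tokens`, is only an approximation under that assumption; the processor's exact resizing and rounding logic may differ slightly.

```python
import math

PATCH = 28  # approximate pixels covered by one visual token (after the 2x2 patch merge)

def approx_visual_tokens(height: int, width: int,
                         min_pixels: int = 4 * PATCH * PATCH,
                         max_pixels: int = 16384 * PATCH * PATCH) -> int:
    """Estimate how many visual tokens an image will use under a pixel budget."""
    pixels = min(max(height * width, min_pixels), max_pixels)
    # The image is rescaled to fit the budget while keeping its aspect ratio,
    # then split into roughly 28x28 patches.
    scale = math.sqrt(pixels / (height * width))
    h, w = round(height * scale / PATCH), round(width * scale / PATCH)
    return max(h, 1) * max(w, 1)

# A 1080p image capped at about 1280 tokens (max_pixels = 1280 * 28 * 28):
print(approx_visual_tokens(1080, 1920, max_pixels=1280 * PATCH * PATCH))
```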
 544  
 545  #### Add ids for Multiple Image Inputs
 546  By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:
 547  <details>
 548  <summary>Add vision ids</summary>
 549  
 550  ```python
 551  conversation = [
 552      {
 553          "role": "user",
 554          "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
 555      },
 556      {
 557          "role": "assistant",
 558          "content": "I'm doing well, thank you for asking. How can I assist you today?",
 559      },
 560      {
 561          "role": "user",
 562          "content": [
 563              {"type": "text", "text": "Can you describe these images and video?"},
 564              {"type": "image"},
 565              {"type": "image"},
 566              {"type": "video"},
 567              {"type": "text", "text": "These are from my vacation."},
 568          ],
 569      },
 570      {
 571          "role": "assistant",
 572          "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
 573      },
 574      {
 575          "role": "user",
 576          "content": "It was a trip to the mountains. Can you see the details in the images and video?",
 577      },
 578  ]
 579  
 580  # default:
 581  prompt_without_id = processor.apply_chat_template(
 582      conversation, add_generation_prompt=True
 583  )
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
 585  
 586  
 587  # add ids
 588  prompt_with_id = processor.apply_chat_template(
 589      conversation, add_generation_prompt=True, add_vision_id=True
 590  )
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
 592  ```
 593  </details>
 594  
 595  #### Flash-Attention 2 to speed up generation
 596  
 597  First, make sure to install the latest version of Flash Attention 2:
 598  
 599  ```bash
 600  pip install -U flash-attn --no-build-isolation
 601  ```
 602  
Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
 604  
 605  To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
 606  
 607  ```python
import torch
from transformers import Qwen2VLForConditionalGeneration
 609  
 610  model = Qwen2VLForConditionalGeneration.from_pretrained(
 611      "Qwen/Qwen2-VL-7B-Instruct", 
 612      torch_dtype=torch.bfloat16, 
 613      attn_implementation="flash_attention_2",
 614  )
 615  ```
 616  
 617  
 618  ### Try Qwen2-VL-72B with API!
 619  
To explore Qwen2-VL-72B, our most capable multimodal model, we encourage you to try our cutting-edge API service. Let's start the exciting journey right now!
 621  
 622  #### Installation
 623  ```bash
 624  pip install dashscope
 625  ```
 626  
 627  #### Examples
 628  ```python
 629  import dashscope
 630  
 631  
 632  dashscope.api_key = "your_api_key"
 633  
 634  messages = [{
 635      'role': 'user',
 636      'content': [
 637          {
 638              'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
 639          },
 640          {
 641              'text': 'What are in the image?'
 642          },
 643      ]
 644  }]
# The model name 'qwen-vl-max-0809' corresponds to 'Qwen2-VL-72B'.
 646  response = dashscope.MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
 647  print(response)
 648  ```
 649  
 650  For more usage, please refer to the tutorial at [aliyun](https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api).
 651  
 652  ## Quantization
 653  
For quantized models, we offer two types of quantization: AWQ and GPTQ ([🤗](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) [🤖](https://modelscope.cn/organization/qwen)).
 655  
 656  ### AWQ
 657  One of our recommendations is the usage of [AWQ](https://arxiv.org/abs/2306.00978) with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). AWQ refers to Activation-aware Weight Quantization, a hardware-friendly approach for LLM low-bit weight-only quantization. AutoAWQ is an easy-to-use package for 4-bit quantized models.
 658  #### Usage of AWQ Quantized Models with Transformers
Transformers now officially supports AutoAWQ, which means that you can use the quantized models directly with Transformers. The following simple code snippet shows how to run the quantized `Qwen2-VL-7B-Instruct-AWQ`:
 660  ```python
 661  from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
 662  from qwen_vl_utils import process_vision_info
 663  
 664  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
 665  # model = Qwen2VLForConditionalGeneration.from_pretrained(
 666  #     "Qwen/Qwen2-VL-7B-Instruct-AWQ",
 667  #     torch_dtype="auto",
 668  #     attn_implementation="flash_attention_2",
 669  #     device_map="auto",
 670  # )
 671  
 672  # default: Load the model on the available device(s)
 673  model = Qwen2VLForConditionalGeneration.from_pretrained(
 674      "Qwen/Qwen2-VL-7B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
 675  )
 676  
 677  # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
 678  min_pixels = 256 * 28 * 28
 679  max_pixels = 1280 * 28 * 28
 680  processor = AutoProcessor.from_pretrained(
 681      "Qwen/Qwen2-VL-7B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
 682  )
 683  
 684  messages = [
 685      {
 686          "role": "user",
 687          "content": [
 688              {
 689                  "type": "image",
 690                  "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
 691              },
 692              {"type": "text", "text": "Describe this image."},
 693          ],
 694      }
 695  ]
 696  
 697  # Preparation for inference
 698  text = processor.apply_chat_template(
 699      messages, tokenize=False, add_generation_prompt=True
 700  )
 701  image_inputs, video_inputs = process_vision_info(messages)
 702  inputs = processor(
 703      text=[text],
 704      images=image_inputs,
 705      videos=video_inputs,
 706      padding=True,
 707      return_tensors="pt",
 708  )
 709  
 710  # Inference: Generation of the output
 711  generated_ids = model.generate(**inputs, max_new_tokens=128)
 712  generated_ids_trimmed = [
 713      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 714  ]
 715  output_text = processor.batch_decode(
 716      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 717  )
 718  print(output_text)
 719  ```
 720  #### Quantize Your Own Model with AutoAWQ
If you want to quantize your own model with AWQ, we advise you to use AutoAWQ. We suggest installing the forked version of the package from source:
 722  
 723  
 724  ```bash
 725  git clone https://github.com/kq-chen/AutoAWQ.git
 726  cd AutoAWQ
 727  pip install numpy gekko pandas
 728  pip install -e .
 729  ```
 730  
 731  Suppose you have finetuned a model based on `Qwen2-VL-7B`. To build your own AWQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:
 732  
 733  ```python
 734  from transformers import Qwen2VLProcessor
 735  from awq.models.qwen2vl import Qwen2VLAWQForConditionalGeneration
 736  
 737  # Specify paths and hyperparameters for quantization
 738  model_path = "your_model_path"
 739  quant_path = "your_quantized_model_path"
 740  quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
 741  
 742  # Load your processor and model with AutoAWQ
 743  processor = Qwen2VLProcessor.from_pretrained(model_path)
 744  # We recommend enabling flash_attention_2 for better acceleration and memory saving
 745  # model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
 746  #     model_path, model_type="qwen2_vl", use_cache=False, attn_implementation="flash_attention_2"
 747  # )
 748  model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
 749      model_path, model_type="qwen2_vl", use_cache=False
 750  )
 751  ```
Then you need to prepare your data for calibration. Simply put the samples into a list, each of which is a typical chat message as shown below. You can specify `text` and `image` entries in the `content` field, for example:
 753  ```python
 754  dataset = [
 755      # message 0
 756      [
 757          {"role": "system", "content": "You are a helpful assistant."},
 758          {"role": "user", "content": "Tell me who you are."},
 759          {"role": "assistant", "content": "I am a large language model named Qwen..."},
 760      ],
 761      # message 1
 762      [
 763          {
 764              "role": "user",
 765              "content": [
 766                  {"type": "image", "image": "file:///path/to/your/image.jpg"},
 767                  {"type": "text", "text": "Output all text in the image"},
 768              ],
 769          },
 770          {"role": "assistant", "content": "The text in the image is balabala..."},
 771      ],
 772      # other messages...
 773      ...,
 774  ]
 775  ```
Here, we use a caption dataset **only for demonstration**. You should replace it with your own SFT dataset.
 777  
 778  ```python
 779  def prepare_dataset(n_sample: int = 8) -> list[list[dict]]:
 780      from datasets import load_dataset
 781  
 782      dataset = load_dataset(
 783          "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
 784      )
 785      return [
 786          [
 787              {
 788                  "role": "user",
 789                  "content": [
 790                      {"type": "image", "image": sample["url"]},
 791                      {"type": "text", "text": "generate a caption for this image"},
 792                  ],
 793              },
 794              {"role": "assistant", "content": sample["caption"]},
 795          ]
 796          for sample in dataset
 797      ]
 798  
 799  
 800  dataset = prepare_dataset()
 801  ```
 802  
 803  Then process the dataset into tensors:
 804  ```python
 805  from qwen_vl_utils import process_vision_info
 806  
 807  text = processor.apply_chat_template(
 808      dataset, tokenize=False, add_generation_prompt=True
 809  )
 810  image_inputs, video_inputs = process_vision_info(dataset)
 811  inputs = processor(
 812      text=text,
 813      images=image_inputs,
 814      videos=video_inputs,
 815      padding=True,
 816      return_tensors="pt",
 817  )
 818  ```
 819  
 820  Then just run the calibration process by one line of code:
 821  ```python
 822  model.quantize(calib_data=inputs, quant_config=quant_config)
 823  ```
 824  Finally, save the quantized model:
 825  ```python
 826  model.model.config.use_cache = model.model.generation_config.use_cache = True
 827  model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
 828  processor.save_pretrained(quant_path)
 829  ```
 830  Then you can obtain your own AWQ quantized model for deployment. Enjoy!
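
As a quick sanity check (a sketch only; `your_quantized_model_path` is the `quant_path` used above), the saved checkpoint can be loaded back with Transformers just like the official AWQ checkpoints shown earlier:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the freshly quantized checkpoint the same way as the official AWQ models.
quant_path = "your_quantized_model_path"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    quant_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(quant_path)
```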
 831  ### GPTQ
 832  #### Usage of GPTQ Models with Transformers
Transformers now officially supports AutoGPTQ, which means that you can use the quantized models directly with Transformers. The following simple code snippet shows how to run the quantized `Qwen2-VL-7B-Instruct-GPTQ-Int4`:
 834  ```python
 835  from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
 836  from qwen_vl_utils import process_vision_info
 837  
 838  # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
 839  # model = Qwen2VLForConditionalGeneration.from_pretrained(
 840  #     "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
 841  #     torch_dtype=torch.bfloat16,
 842  #     attn_implementation="flash_attention_2",
 843  #     device_map="auto",
 844  # )
 845  
 846  # default: Load the model on the available device(s)
 847  model = Qwen2VLForConditionalGeneration.from_pretrained(
 848      "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
 849  )
 850  
 851  # The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
 852  min_pixels = 256 * 28 * 28
 853  max_pixels = 1280 * 28 * 28
 854  processor = AutoProcessor.from_pretrained(
 855      "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels
 856  )
 857  
 858  messages = [
 859      {
 860          "role": "user",
 861          "content": [
 862              {
 863                  "type": "image",
 864                  "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
 865              },
 866              {"type": "text", "text": "Describe this image."},
 867          ],
 868      }
 869  ]
 870  
 871  # Preparation for inference
 872  text = processor.apply_chat_template(
 873      messages, tokenize=False, add_generation_prompt=True
 874  )
 875  image_inputs, video_inputs = process_vision_info(messages)
 876  inputs = processor(
 877      text=[text],
 878      images=image_inputs,
 879      videos=video_inputs,
 880      padding=True,
 881      return_tensors="pt",
 882  )
 883  
 884  # Inference: Generation of the output
 885  generated_ids = model.generate(**inputs, max_new_tokens=128)
 886  generated_ids_trimmed = [
 887      out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
 888  ]
 889  output_text = processor.batch_decode(
 890      generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 891  )
 892  print(output_text)
 893  ```
 894  #### Quantize Your Own Model with AutoGPTQ
If you want to quantize your own model with GPTQ, we advise you to use AutoGPTQ. We suggest installing the forked version of the package from source:
 896  
 897  ```bash
 898  git clone https://github.com/kq-chen/AutoGPTQ.git
 899  cd AutoGPTQ
 900  pip install numpy gekko pandas
 901  pip install -vvv --no-build-isolation -e .
 902  ```
 903  Suppose you have finetuned a model based on `Qwen2-VL-7B`. To build your own GPTQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:
 904  ```python
 905  from transformers import Qwen2VLProcessor
 906  from auto_gptq import BaseQuantizeConfig
 907  from auto_gptq.modeling import Qwen2VLGPTQForConditionalGeneration
 908  
 909  # Specify paths and hyperparameters for quantization
 910  model_path = "your_model_path"
 911  quant_path = "your_quantized_model_path"
 912  quantize_config = BaseQuantizeConfig(
 913      bits=8,  # 4 or 8
 914      group_size=128,
 915      damp_percent=0.1,
    desc_act=False,  # setting to False can significantly speed up inference, but the perplexity may be slightly worse
 917      static_groups=False,
 918      sym=True,
 919      true_sequential=True,
 920  )
 921  # Load your processor and model with AutoGPTQ
 922  processor = Qwen2VLProcessor.from_pretrained(model_path)
 923  # We recommend enabling flash_attention_2 for better acceleration and memory saving
 924  # model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config, attn_implementation="flash_attention_2")
 925  model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config)
 926  ```
Then you need to prepare your data for calibration. Simply put the samples into a list, each of which is a typical chat message as shown below. You can specify `text` and `image` entries in the `content` field, for example:
 928  ```python
 929  dataset = [
 930      # message 0
 931      [
 932          {"role": "system", "content": "You are a helpful assistant."},
 933          {"role": "user", "content": "Tell me who you are."},
 934          {"role": "assistant", "content": "I am a large language model named Qwen..."},
 935      ],
 936      # message 1
 937      [
 938          {
 939              "role": "user",
 940              "content": [
 941                  {"type": "image", "image": "file:///path/to/your/image.jpg"},
 942                  {"type": "text", "text": "Output all text in the image"},
 943              ],
 944          },
 945          {"role": "assistant", "content": "The text in the image is balabala..."},
 946      ],
 947      # other messages...
 948      ...,
 949  ]
 950  ```
Here, we use a caption dataset **only for demonstration**. You should replace it with your own SFT dataset.
 952  ```python
 953  def prepare_dataset(n_sample: int = 20) -> list[list[dict]]:
 954      from datasets import load_dataset
 955  
 956      dataset = load_dataset(
 957          "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
 958      )
 959      return [
 960          [
 961              {
 962                  "role": "user",
 963                  "content": [
 964                      {"type": "image", "image": sample["url"]},
 965                      {"type": "text", "text": "generate a caption for this image"},
 966                  ],
 967              },
 968              {"role": "assistant", "content": sample["caption"]},
 969          ]
 970          for sample in dataset
 971      ]
 972  
 973  
 974  dataset = prepare_dataset()
 975  ```
 976  
 977  Then process the dataset into tensors:
 978  ```python
 979  from qwen_vl_utils import process_vision_info
 980  
 981  
 982  def batched(iterable, n: int):
    # batched('ABCDEFG', 3) → ABC DEF G
 984      assert n >= 1, "batch size must be at least one"
 985      from itertools import islice
 986  
 987      iterator = iter(iterable)
 988      while batch := tuple(islice(iterator, n)):
 989          yield batch
 990  
 991  
 992  batch_size = 1
 993  calib_data = []
 994  for batch in batched(dataset, batch_size):
 995      text = processor.apply_chat_template(
 996          batch, tokenize=False, add_generation_prompt=True
 997      )
 998      image_inputs, video_inputs = process_vision_info(batch)
 999      inputs = processor(
1000          text=text,
1001          images=image_inputs,
1002          videos=video_inputs,
1003          padding=True,
1004          return_tensors="pt",
1005      )
1006      calib_data.append(inputs)
1007  ```
1008  Then just run the calibration process by one line of code:
1009  ```python
model.quantize(calib_data, cache_examples_on_gpu=False)
1011  ```
1012  Finally, save the quantized model:
1013  ```python
1014  model.save_quantized(quant_path, use_safetensors=True)
1015  processor.save_pretrained(quant_path)
1016  ```
1017  Then you can obtain your own GPTQ quantized model for deployment. Enjoy!
1018  ### Benchmark
1019  #### Performance of Quantized Models
1020  This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2-VL series. Specifically, we report:
1021  
1022  - MMMU_VAL (Accuracy)
1023  - DocVQA_VAL (Accuracy)
1024  - MMBench_DEV_EN (Accuracy)
1025  - MathVista_MINI (Accuracy)
1026  
1027  We use [VLMEvalkit](https://github.com/open-compass/VLMEvalKit) to evaluate all models.
1028  
1029  | Model Size | Quantization | MMMU | DocVQA | MMBench | MathVista  |
1030  | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct)) | 65.44 | 95.79 | 86.94 | 70.19 |
|  | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8)) | 64.56 | 95.84 | 87.03 | 68.90 |
|  | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)) | 64.00 | 95.70 | 86.68 | 69.20 |
|  | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-AWQ)) | 64.22 | 95.72 | 86.43 | 68.40 |
| Qwen2-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct)) | 53.77 | 93.89 | 81.78 | 58.20 |
|  | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8)) | 53.00 | 93.94 | 82.38 | 57.90 |
|  | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)) | 52.55 | 93.16 | 81.27 | 60.30 |
|  | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-AWQ)) | 53.66 | 93.10 | 81.61 | 56.80 |
| Qwen2-VL-2B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct)) | 41.88 | 88.34 | 72.07 | 44.40 |
|  | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8)) | 41.55 | 88.28 | 71.99 | 44.60 |
|  | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)) | 39.22 | 87.21 | 70.87 | 41.69 |
|  | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-AWQ)) | 41.33 | 86.96 | 71.64 | 39.90 |
1043  
1044  
1045  
1046  
1047  
1048  
1049  #### Speed Benchmark
This section reports the speed performance of the bf16 models and the quantized models (including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) as well as the memory footprint (GB) under different context lengths.
1051  
1052  The environment of the evaluation with huggingface transformers is:
1053  
1054  - NVIDIA A100 80GB
1055  - CUDA 11.8
1056  - Pytorch 2.2.1+cu118
1057  - Flash Attention 2.6.1
1058  - Transformers 4.38.2
1059  - AutoGPTQ 0.6.0+cu118
1060  - AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)
1061  
1062  Note:
1063  
- We use a batch size of 1 and the smallest possible number of GPUs for the evaluation.
1065  - We test the speed and memory of generating 2048 tokens with the input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.
1066  - 72B (transformers)
1067  
1068  | Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
1069  | --- | --- | --- | --- | --- | --- |
1070  | Qwen2-VL-72B-Instruct | 1 | BF16 | 2 | 8.90 | 138.74 |
1071  |  |  | GPTQ-Int8 | 2 | 9.53 | 75.173 |
1072  |  |  | GPTQ-Int4 | 1 | 11.04 | 42.46 |
1073  |  |  | AWQ | 1 | 12.00 | 41.98 |
1074  |  | 6144 | BF16 | 2 | 6.53 | 148.66 |
1075  |  |  | GPTQ-Int8 | 2 | 6.97 | 85.09 |
1076  |  |  | GPTQ-Int4 | 1 | 7.62 | 49.05 |
1077  |  |  | AWQ | 1 | 8.33 | 48.58 |
1078  |  | 14336 | BF16 | 3 | 4.39 | 165.92 |
1079  |  |  | GPTQ-Int8 | 2 | 5.04 | 99.31 |
1080  |  |  | GPTQ-Int4 | 1 | 5.39 | 58.76 |
1081  |  |  | AWQ | 1 | 5.72 | 58.29 |
1082  |  | 30720 | BF16 | 4 | 2.93 | 204.33 |
1083  |  |  | GPTQ-Int8 | 2 | 3.16 | 127.77 |
1084  |  |  | GPTQ-Int4 | 2 | 3.27 | 85.13 |
1085  |  |  | AWQ | 2 | 3.39 | 94.65 |
1086  
1087  - 7B (transformers)
1088  
1089  | Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
1090  | --- | --- | --- | --- | --- | --- |
1091  | Qwen2-VL-7B-Instruct | 1 | BF16 | 1 | 39.02 | 16.07 |
1092  |  |  | GPTQ-Int8 | 1 | 31.60 | 10.11 |
1093  |  |  | GPTQ-Int4 | 1 | 42.76 | 7.20 |
1094  |  |  | AWQ | 1 | 32.08 | 7.07 |
1095  |  | 6144 | BF16 | 1 | 38.75 | 21.56 |
1096  |  |  | GPTQ-Int8 | 1 | 31.31 | 15.61 |
1097  |  |  | GPTQ-Int4 | 1 | 39.75 | 12.69 |
1098  |  |  | AWQ | 1 | 32.66 | 12.56 |
1099  |  | 14336 | BF16 | 1 | 30.65 | 29.07 |
1100  |  |  | GPTQ-Int8 | 1 | 27.96 | 23.11 |
1101  |  |  | GPTQ-Int4 | 1 | 29.72 | 20.20 |
1102  |  |  | AWQ | 1 | 31.42 | 20.07 |
1103  |  | 30720 | BF16 | 1 | 19.53 | 44.08 |
1104  |  |  | GPTQ-Int8 | 1 | 18.37 | 38.13 |
1105  |  |  | GPTQ-Int4 | 1 | 19.15 | 35.22 |
1106  |  |  | AWQ | 1 | 19.95 | 35.08 |
1107  
1108  
1109  - 2B (transformers)
1110  
1111  | Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) |
1112  | --- | --- | --- | --- | --- | --- |
1113  | Qwen2-VL-2B-Instruct | 1 | BF16 | 1 | 35.29 | 4.68 |
1114  |  |  | GPTQ-Int8 | 1 | 28.59 | 3.55 |
1115  |  |  | GPTQ-Int4 | 1 | 39.76 | 2.91 |
1116  |  |  | AWQ | 1 | 29.89 | 2.88 |
1117  |  | 6144 | BF16 | 1 | 36.58 | 10.01 |
1118  |  |  | GPTQ-Int8 | 1 | 29.53  | 8.87 |
1119  |  |  | GPTQ-Int4 | 1 | 39.27 | 8.21 |
1120  |  |  | AWQ | 1 | 33.42 | 8.18 |
1121  |  | 14336 | BF16 | 1 | 36.31 | 17.20 |
1122  |  |  | GPTQ-Int8 | 1 | 31.03 | 16.07 |
1123  |  |  | GPTQ-Int4 | 1 | 39.89 | 15.40 |
1124  |  |  | AWQ | 1 | 32.28 | 15.40 |
1125  |  | 30720 | BF16 | 1 | 32.53 | 31.64 |
1126  |  |  | GPTQ-Int8 | 1 | 27.76 | 30.51 |
1127  |  |  | GPTQ-Int4 | 1 | 30.73 | 29.84 |
1128  |  |  | AWQ | 1 | 31.55 | 29.84 |
1129  
1130  
1131  
1132  ## Deployment
1133  
1134  We recommend using vLLM for fast Qwen2-VL deployment and inference. You need to use `vllm>=0.6.1` to enable Qwen2-VL support. You can also use our [official docker image](#-docker).
1135  
1136  ### Installation
1137  ```bash
1138  pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
1139  pip install accelerate
1140  pip install qwen-vl-utils
1141  # Change to your CUDA version
1142  CUDA_VERSION=cu121
1143  pip install 'vllm==0.6.1' --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
1144  
1145  ```
1146  ### Start an OpenAI API Service
1147  
1148  Run the command below to start an OpenAI-compatible API service:
1149  
1150  ```bash
1151  python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model Qwen/Qwen2-VL-7B-Instruct
1152  ```
1153  
1154  Then you can use the chat API as below (via curl or Python API):
1155  
1156  ```bash
1157  curl http://localhost:8000/v1/chat/completions \
1158      -H "Content-Type: application/json" \
1159      -d '{
1160      "model": "Qwen2-VL-7B-Instruct",
1161      "messages": [
1162      {"role": "system", "content": "You are a helpful assistant."},
1163      {"role": "user", "content": [
1164          {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
1166      ]}
1167      ]
1168      }'
1169  ```
1170  
1171  ```python
1172  from openai import OpenAI
1173  
1174  # Set OpenAI's API key and API base to use vLLM's API server.
1175  openai_api_key = "EMPTY"
1176  openai_api_base = "http://localhost:8000/v1"
1177  
1178  client = OpenAI(
1179      api_key=openai_api_key,
1180      base_url=openai_api_base,
1181  )
1182  
1183  chat_response = client.chat.completions.create(
1184      model="Qwen2-VL-7B-Instruct",
1185      messages=[
1186          {"role": "system", "content": "You are a helpful assistant."},
1187          {
1188              "role": "user",
1189              "content": [
1190                  {
1191                      "type": "image_url",
1192                      "image_url": {
1193                          "url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"
1194                      },
1195                  },
                {"type": "text", "text": "What is the text in the illustration?"},
1197              ],
1198          },
1199      ],
1200  )
1201  print("Chat response:", chat_response)
1202  ```
1203  
1204  You can also upload base64-encoded local images (see [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
1205  ```python
1206  import base64
1207  from openai import OpenAI
1208  # Set OpenAI's API key and API base to use vLLM's API server.
1209  openai_api_key = "EMPTY"
1210  openai_api_base = "http://localhost:8000/v1"
1211  client = OpenAI(
1212      api_key=openai_api_key,
1213      base_url=openai_api_base,
1214  )
1215  image_path = "/path/to/local/image.png"
1216  with open(image_path, "rb") as f:
1217      encoded_image = base64.b64encode(f.read())
1218  encoded_image_text = encoded_image.decode("utf-8")
1219  base64_qwen = f"data:image;base64,{encoded_image_text}"
1220  chat_response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
1222      messages=[
1223          {"role": "system", "content": "You are a helpful assistant."},
1224          {
1225              "role": "user",
1226              "content": [
1227                  {
1228                      "type": "image_url",
1229                      "image_url": {
1230                          "url": base64_qwen
1231                      },
1232                  },
                {"type": "text", "text": "What is the text in the illustration?"},
1234              ],
1235          },
1236      ],
1237  )
1238  print("Chat response:", chat_response)
1239  ```
1240  
1241  ### Notes
1242  
- ⚠️**NOTE**: Now `vllm.entrypoints.openai.api_server` does not support setting `min_pixels` and `max_pixels` in messages (we are working hard on supporting this feature). If you want to limit the resolution, you can set them in the model's `preprocessor_config.json`:
1244  
1245  ```json
1246  {
1247    "min_pixels": 50176,
1248    "max_pixels": 1003520,
1249    ...
1250  }
1251  ```
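For intuition, these pixel budgets are easiest to read as multiples of the 28×28-pixel unit covered by one visual token. The snippet below is only an informal sanity check, assuming that unit size; it is not part of the configuration format:

```python
# Rough sanity check for the pixel budgets above (assumes Qwen2-VL's
# 28x28-pixel unit per visual token after 2x2 patch merging).
TOKEN_PIXELS = 28 * 28

min_pixels = 64 * TOKEN_PIXELS     # 50176   -> roughly 64 visual tokens at minimum
max_pixels = 1280 * TOKEN_PIXELS   # 1003520 -> roughly 1280 visual tokens at maximum

print(min_pixels, max_pixels)  # 50176 1003520
```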
1252  
1253  - โš ๏ธ**NOTE**: Now `vllm.entrypoints.openai.api_server` does not support video input yet. We are actively developing on it.
1254  - โš ๏ธ**NOTE**: If you want to pass multiple images in a single prompt, you need to pass `--limit-mm-per-prompt image=<N>` argument (`N` is max number of images in each prompt) when launching `vllm.entrypoints.openai.api_server`.
1255  ### Inference Locally
1256  
You can also use vLLM to run Qwen2-VL inference locally:
1258  
1259  ```python
1260  from transformers import AutoProcessor
1261  from vllm import LLM, SamplingParams
1262  from qwen_vl_utils import process_vision_info
1263  
1264  MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"
1265  
1266  llm = LLM(
1267      model=MODEL_PATH,
1268      limit_mm_per_prompt={"image": 10, "video": 10},
1269  )
1270  
1271  sampling_params = SamplingParams(
1272      temperature=0.1,
1273      top_p=0.001,
1274      repetition_penalty=1.05,
1275      max_tokens=256,
1276      stop_token_ids=[],
1277  )
1278  
1279  messages = [
1280      {"role": "system", "content": "You are a helpful assistant."},
1281      {
1282          "role": "user",
1283          "content": [
1284              {
1285                  "type": "image",
1286                  "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
1287                  "min_pixels": 224 * 224,
1288                  "max_pixels": 1280 * 28 * 28,
1289              },
1290              {"type": "text", "text": "What is the text in the illustrate?"},
1291          ],
1292      },
1293  ]
# For video input, you can pass the following values instead:
1295  # "type": "video",
1296  # "video": "<video URL>",
1297  
1298  processor = AutoProcessor.from_pretrained(MODEL_PATH)
1299  prompt = processor.apply_chat_template(
1300      messages,
1301      tokenize=False,
1302      add_generation_prompt=True,
1303  )
1304  image_inputs, video_inputs = process_vision_info(messages)
1305  
1306  mm_data = {}
1307  if image_inputs is not None:
1308      mm_data["image"] = image_inputs
1309  if video_inputs is not None:
1310      mm_data["video"] = video_inputs
1311  
1312  llm_inputs = {
1313      "prompt": prompt,
1314      "multi_modal_data": mm_data,
1315  }
1316  
1317  outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
1318  generated_text = outputs[0].outputs[0].text
1319  
1320  print(generated_text)
1321  ```
1322  
1323  
1324  ## Training
1325  #### LLaMA-Factory
1326  
Here we provide a script for supervised fine-tuning (SFT) of Qwen2-VL with [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). The script has the following features:
1330  
-  Support for multi-image input;

-  Support for single-GPU and multi-GPU training;

-  Support for full-parameter tuning and LoRA.
1336  
1337  In the following, we introduce more details about the usage of the
1338  script.
1339  
1340  #### Installation
1341  
1342  Before you start, make sure you have installed the following packages:
1343  
1. Follow the instructions of [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and build the environment.
1347  2. Install these packages (Optional):
1348  
```bash
1350  pip install deepspeed
1351  pip install flash-attn --no-build-isolation
1352  ```
1353  
3. If you want to use [FlashAttention-2](https://github.com/Dao-AILab/flash-attention), make sure your CUDA version is 11.6 or above; a quick way to check is shown below.
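As an optional check, the following minimal sketch uses PyTorch to report the CUDA version your build was compiled against (note that FlashAttention-2 additionally requires an Ampere-or-newer GPU):

```python
import torch

# CUDA toolkit version this PyTorch build was compiled against.
print("CUDA version:", torch.version.cuda)

# FlashAttention-2 needs an Ampere-or-newer GPU (compute capability >= 8.0).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("Compute capability:", f"{major}.{minor}")
```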
1357  
1358  #### Data Preparation
1359  
LLaMA-Factory provides several training datasets in the `data` folder, which you can use directly. If you are using a custom dataset, prepare it as follows.
1363  
1. Organize your data in a **JSON** file and put it in the `data` folder. LLaMA-Factory supports multimodal datasets in the `sharegpt` format.
1367  
-  A dataset in the `sharegpt` format should follow the structure below:
1369  
1370  ```json
1371  [
1372    {
1373      "messages": [
1374        {
1375          "content": "<image>Who are they?",
1376          "role": "user"
1377        },
1378        {
1379          "content": "They're Kane and Gretzka from Bayern Munich.",
1380          "role": "assistant"
1381        },
1382        {
1383          "content": "What are they doing?<image>",
1384          "role": "user"
1385        },
1386        {
1387          "content": "They are celebrating on the soccer field.",
1388          "role": "assistant"
1389        }
1390      ],
1391      "images": [
1392        "mllm_demo_data/1.jpg",
1393        "mllm_demo_data/1.jpg"
1394      ]
  }
1396  ]
1397  ```
1398  
2. Provide your dataset definition in `data/dataset_info.json` in the following format.
1401  
-  For a dataset in the `sharegpt` format, the columns in `dataset_info.json` should be as follows (a small validation sketch follows the example):
1404  
1405  ```json
1406     "dataset_name": {
1407         "file_name": "dataset_name.json",
1408         "formatting": "sharegpt",
1409         "columns": {
1410            "messages": "messages",
1411            "images": "images"
1412          },
1413        "tags": {
1414           "role_tag": "role",
1415           "content_tag": "content",
1416           "user_tag": "user",
1417           "assistant_tag": "assistant"
1418          }
1419     }
1420  ```
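To catch formatting mistakes early, a small validation script along these lines can check that each sample's `<image>` placeholders match its `images` list. This is a hypothetical helper, assuming your custom dataset is stored at `data/dataset_name.json`:

```python
import json

# Hypothetical path to your custom sharegpt-format dataset.
DATASET_PATH = "data/dataset_name.json"

with open(DATASET_PATH, "r", encoding="utf-8") as f:
    samples = json.load(f)

for i, sample in enumerate(samples):
    # Count <image> placeholders across all conversation turns.
    n_placeholders = sum(msg["content"].count("<image>") for msg in sample["messages"])
    n_images = len(sample.get("images", []))
    if n_placeholders != n_images:
        print(f"sample {i}: {n_placeholders} <image> tags but {n_images} image paths")
print("check finished")
```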
1421  
1422  #### Training
1423  
LoRA SFT examples:
```bash
1426  llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml
1427  llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml
1428  ```
1429  
1430  LoRA DPO/ORPO/SimPO examples: (using [RLHF-V Dataset](https://huggingface.co/datasets/llamafactory/RLHF-V))
```bash
1432  llamafactory-cli train examples/train_lora/qwen2vl_lora_dpo.yaml
1433  ```
1434  
1435  Full SFT examples:
```bash
1437  llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
1438  ```
1439  
1440  Inference examples:
```bash
1442  llamafactory-cli webchat examples/inference/qwen2_vl.yaml
1443  llamafactory-cli api examples/inference/qwen2_vl.yaml
1444  ```
1445  
1446  Execute the following training command:
1447  
1448  ```bash
1449  DISTRIBUTED_ARGS="
1450      --nproc_per_node $NPROC_PER_NODE \
1451      --nnodes $NNODES \
1452      --node_rank $NODE_RANK \
1453      --master_addr $MASTER_ADDR \
1454      --master_port $MASTER_PORT
1455      "
1456  
1457  torchrun $DISTRIBUTED_ARGS src/train.py \
1458      --deepspeed $DS_CONFIG_PATH \
1459      --stage sft \
1460      --do_train \
1461      --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
1462      --dataset mllm_demo \
1463      --template qwen2_vl \
1464      --finetuning_type lora \
1465      --output_dir $OUTPUT_PATH \
1466      --overwrite_cache \
1467      --overwrite_output_dir \
1468      --warmup_steps 100 \
1469      --weight_decay 0.1 \
1470      --per_device_train_batch_size 2 \
1471      --gradient_accumulation_steps 4 \
1472      --ddp_timeout 9000 \
1473      --learning_rate 5e-6 \
1474      --lr_scheduler_type cosine \
1475      --logging_steps 1 \
1476      --cutoff_len 4096 \
1477      --save_steps 1000 \
1478      --plot_loss \
1479      --num_train_epochs 3 \
1480      --bf16 
1481  ```
1482  
and enjoy the training process. To change the training setup, modify the arguments in the training command to adjust the hyperparameters. One argument to note is `cutoff_len`, which is the maximum sequence length of a training sample; control this parameter to avoid OOM errors.
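If you are unsure what value to use, a rough way to gauge sample lengths is to tokenize the chat-formatted text with the model's tokenizer. This is only a minimal sketch: image placeholders expand into many additional visual tokens at training time, so the count below is just a lower bound:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A sample in the same chat structure used above; <image> stands in for one image.
messages = [
    {"role": "user", "content": "<image>Who are they?"},
    {"role": "assistant", "content": "They're Kane and Gretzka from Bayern Munich."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
n_tokens = len(tokenizer(text)["input_ids"])
print(f"text tokens: {n_tokens}")  # compare against your cutoff_len budget
```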
1488  
1489  ## Function Calling
1490  
Qwen2-VL supports Function Calling (also known as Tool Calling or Tool Use). For details on how to use this capability, please refer to the Qwen-Agent project: see [the function calling example](https://github.com/QwenLM/Qwen-Agent/blob/main/examples/qwen2vl_function_calling.py) and [the agent example](https://github.com/QwenLM/Qwen-Agent/blob/main/examples/qwen2vl_assistant_tooluse.py).
1492  ### Simple Use Case
1493  ```python
1494  # pip install qwen_agent
1495  from typing import List, Union
1496  from datetime import datetime
1497  from qwen_agent.agents import FnCallAgent
1498  from qwen_agent.gui import WebUI
1499  from qwen_agent.tools.base import BaseToolWithFileAccess, register_tool
1500  
1501  @register_tool("get_date")
1502  class GetDate(BaseToolWithFileAccess):
1503      description = "call this tool to get the current date"
1504      parameters = [
1505          {
1506              "name": "lang",
1507              "type": "string",
1508              "description": "one of ['en', 'zh'], default is en",
1509              "required": False
1510          },
1511      ]
1512  
1513      def call(self, params: Union[str, dict], files: List[str] = None, **kwargs) -> str:
1514          super().call(params=params, files=files)
1515          params = self._verify_json_format_args(params)
        lang = "zh" if "zh" in params.get("lang", "en") else "en"  # "lang" is optional, so default to English
1517          now = datetime.now()
1518          result = now.strftime("%Y-%m-%d %H:%M:%S") + "\n"
1519          weekday = now.weekday()
1520          if lang == "zh":
            days_chinese = ["一", "二", "三", "四", "五", "六", "日"]
            result += "今天是星期" + days_chinese[weekday]
1523          else:
1524              days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
1525              result += "Today is " + days[weekday]
1526          return result
1527  
1528  
1529  def init_agent_service():
1530      llm_cfg_vl = {
1531          # Using Qwen2-VL deployed at any openai-compatible service such as vLLM:
1532          "model_type": "qwenvl_oai",
1533          "model": "Qwen/Qwen2-VL-7B-Instruct",
1534          "model_server": "http://localhost:8000/v1",  # api_base
        "api_key": "EMPTY",
1536      }
1537      tools = [
1538          "get_date",
1539          "code_interpreter",
1540      ]  # code_interpreter is a built-in tool in Qwen-Agent
1541      bot = FnCallAgent(
1542          llm=llm_cfg_vl,
1543          name="Qwen2-VL",
1544          description="function calling",
1545          function_list=tools,
1546      )
1547      return bot
1548  
1549  def app_gui():
1550      # Define the agent
1551      bot = init_agent_service()
1552      WebUI(bot).run()
1553  
1554  # Launch gradio app
1555  app_gui()
1556  ```
1557  
1558  
1559  ## Demo
1560  ### Web UI Example
1561  
1562  In this section, we provide instructions for users to build a web-based user interface (UI) demo. This UI demo allows users to interact with a predefined model or application through a web browser. Follow the steps below to get started.
1563  
1564  #### Installation
1565  
1566  Before you begin, ensure that you have the required dependencies installed on your system. You can install them by running the following command:
1567  
1568  ```bash
1569  pip install -r requirements_web_demo.txt
1570  ```
1571  
1572  #### Running the Demo with FlashAttention-2
1573  
1574  Once the required packages are installed, you can launch the web demo using the following command. This command will start a web server and provide you with a link to access the UI in your web browser.
1575  
1576  **Recommended**: For enhanced performance and efficiency, especially in multi-image and video processing scenarios, we strongly recommend using [FlashAttention-2](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 provides significant improvements in memory usage and speed, making it ideal for handling large-scale models and data processing.
1577  
1578  To enable FlashAttention-2, use the following command:
1579  
1580  ```bash
1581  python web_demo_mm.py --flash-attn2
1582  ```
1583  
1584  This will load the model with FlashAttention-2 enabled.
1585  
1586  **Default Usage**: If you prefer to run the demo without FlashAttention-2 or if you do not specify the `--flash-attn2` option, the demo will load the model using the standard attention implementation:
1587  
1588  ```bash
1589  python web_demo_mm.py
1590  ```
1591  
After running the command, you'll see a link generated in the terminal similar to this:
1593  
1594  ```
1595  Running on local: http://127.0.0.1:7860/
1596  ```
1597  
1598  Copy this link and paste it into your browser to access the web UI, where you can interact with the model by inputting text, uploading images, or using any other provided functionalities.
1599  
1600  ##### Running the Streaming Video Chat Demo
An experimental streaming video chat demo is also available in the `web_demo_streaming` directory.
1602  
1603  To run the streaming video chat demo, use the following command:
1604  
1605  ```bash
1606  cd web_demo_streaming/
1607  python app.py --flash-attn2
1608  ```
1609  
1610  If you prefer to run the demo without FlashAttention-2, use the following command:
1611  ```bash
1612  cd web_demo_streaming/
1613  python app.py
1614  ```
1615  
This demo supports webcam and screen capture as video input sources. To support screen-capture input, we use a code snippet from the following Hugging Face Space: [gstaff/gradio-screen-recorder](https://huggingface.co/spaces/gstaff/gradio-screen-recorder/tree/main).
1617  
1618  #### Selecting Different Models (Qwen2-VL Series Only)
1619  
1620  The demo is configured by default to use the `Qwen/Qwen2-VL-7B-Instruct` model, which is part of the Qwen2-VL series and is well-suited for various vision-language tasks. However, if you want to use a different model within the Qwen2-VL series, you can simply update the `DEFAULT_CKPT_PATH` variable in the script:
1621  
1622  1. **Locate the `DEFAULT_CKPT_PATH` Variable**:
1623     Inside `web_demo_mm.py`, find the `DEFAULT_CKPT_PATH` variable that defines the model checkpoint path. It should look like this:
1624  
1625     ```python
1626     DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
1627     ```
1628  
1629  2. **Replace with a Different Qwen2-VL Model Path**:
1630     Modify `DEFAULT_CKPT_PATH` to point to another checkpoint path within the Qwen2-VL series. For example:
1631  
1632     ```python
1633     DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-2B-Instruct'  # Example for a different model in the series
1634     ```
1635  
1636  3. **Save and Re-run**:
1637     After modifying the path, save the script and then re-run the demo using the instructions provided in the `Running the Demo` section above.
1638  
1639  **Note:** This `DEFAULT_CKPT_PATH` only supports models from the Qwen2-VL series. If you're using a model outside of the Qwen2-VL series, additional changes to the codebase may be necessary.
1640  
1641  
1642  #### Customization
1643  
1644  Further customization of the web demo, including UI layout, interactions, and additional functionalities like handling specialized input, can be done by modifying the `web_demo_mm.py` script. This flexibility allows you to tailor the web interface to better fit specific tasks or workflows.
1645  
1646  
1647  ## Limitations
1648  
While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
1650  
1651  1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
1652  2. Data timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
1653  3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instructions: When faced with intricate multi-step instructions, the model's understanding and execution capabilities need improvement.
1655  5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
1656  6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
1657  
1658  These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
1659  
1660  
1661  ## ๐Ÿณ Docker
1662  
To simplify the deployment process, we provide Docker images with pre-built environments: [qwenllm/qwenvl](https://hub.docker.com/r/qwenllm/qwenvl). You only need to install the driver and download the model files to launch the demos.
1664  
1665  ```bash
1666  docker run --gpus all --ipc=host --network=host --rm --name qwen2 -it qwenllm/qwenvl:2-cu121 bash
1667  ```
1668  
1669  ## Citation
1670  
If you find our paper and code useful for your research, please consider giving us a star :star: and a citation :pencil: :)
1672  
1673  
1674  
1675  
1676  ```BibTeX
1677  @article{Qwen2VL,
1678    title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
1679    author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
1680    journal={arXiv preprint arXiv:2409.12191},
1681    year={2024}
1682  }
1683  
1684  @article{Qwen-VL,
1685    title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
1686    author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
1687    journal={arXiv preprint arXiv:2308.12966},
1688    year={2023}
1689  }
1690  ```
1691  
1692  <br>