# Qwen2-VL

<p align="center">
    <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2VL_logo.png" width="400"/>
</p>

<p align="center">
    🤗 <a href="https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d">Hugging Face</a> &nbsp;|&nbsp; 🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a> &nbsp;|&nbsp; 📑 <a href="https://qwenlm.github.io/blog/qwen2-vl/">Blog</a> &nbsp;|&nbsp; 📑 <a href="https://arxiv.org/pdf/2409.12191">Paper</a>
<br>
    🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen2-VL">Demo</a> &nbsp;|&nbsp; 💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a> &nbsp;|&nbsp; 🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a> &nbsp;|&nbsp; 📑 <a href="https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api">API</a> &nbsp;|&nbsp; 🖥️ <a href="https://gallery.pai-ml.com/#/preview/deepLearning/cv/qwen2-vl">PAI-DSW</a>
</p>

## Introduction

After a year's relentless efforts, today we are thrilled to release **Qwen2-VL**! Qwen2-VL is the latest version of the vision-language models in the Qwen model family.

#### Key Enhancements:

* **SoTA understanding of images of various resolutions & ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

* **Understanding videos of 20 min+**: with its online streaming capabilities, Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialog, content creation, etc.

* **Agent that can operate your mobile phones, robots, etc.**: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on the visual environment and text instructions.

* **Multilingual Support**: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

#### Model Architecture Updates:

* **Naive Dynamic Resolution**: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.

<p align="center">
    <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/qwen2_vl_framework.jpg" width="80%"/>
</p>

* **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

<p align="center">
    <img src="http://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen2-VL/mrope.png" width="80%"/>
</p>

We have open-sourced the Qwen2-VL models, including Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license, as well as Qwen2-VL-72B under the Qwen license. These models are now integrated with Hugging Face Transformers, vLLM, and other third-party frameworks. We hope you enjoy using them!

## News
* 2024.09.19: The instruction-tuned [Qwen2-VL-72B model](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) and its quantized versions [[AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ), [GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4), [GPTQ-Int8](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8)] are now available.
We have also released the [Qwen2-VL paper](https://arxiv.org/pdf/2409.12191) simultaneously.
* 2024.08.30: We have released the [Qwen2-VL series](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d). The 2B and 7B models are now available, and the open-source 72B model is coming soon. For more details, please check our [blog](https://qwenlm.github.io/blog/qwen2-vl/)!

## Performance
### Image Benchmarks

| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Claude-3.5 Sonnet | GPT-4o | **Qwen2-VL-72B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct))</sup> | **Qwen2-VL-7B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct))</sup> | **Qwen2-VL-2B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct))</sup> |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 58.3 | 68.3 | **69.1** | 64.5 | 54.1 | 41.1 |
| MMMU-Pro | 46.9 | 51.5 | **51.9** | 46.2 | 43.5 | 37.6 |
| DocVQA<sub>test</sub> | 94.1 | 95.2 | 92.8 | **96.5** | 94.5 | 90.1 |
| InfoVQA<sub>test</sub> | 82.0 | - | - | **84.5** | 76.5 | 65.5 |
| ChartQA<sub>test</sub> | 88.4 | **90.8** | 85.7 | 88.3 | 83.0 | 73.5 |
| TextVQA<sub>val</sub> | 84.4 | - | - | **85.5** | 84.3 | 79.7 |
| OCRBench | 852 | 788 | 736 | **877** | 845 | 794 |
| MTVQA | 17.3 | 25.7 | 27.8 | **30.9** | 25.6 | 18.1 |
| VCR<sub>en easy</sub> | 84.67 | 63.85 | 91.55 | **91.93** | 89.70 | 81.45 |
| VCR<sub>zh easy</sub> | 22.09 | 1.0 | 14.87 | **65.37** | 59.94 | 46.16 |
| RealWorldQA | 72.2 | 60.1 | 75.4 | **77.8** | 70.1 | 62.9 |
| MME<sub>sum</sub> | 2414.7 | 1920.0 | 2328.7 | **2482.7** | 2326.8 | 1872.0 |
| MMBench-EN<sub>test</sub> | **86.5** | 79.7 | 83.4 | **86.5** | 83.0 | 74.9 |
| MMBench-CN<sub>test</sub> | 86.3 | 80.7 | 82.1 | **86.6** | 80.5 | 73.5 |
| MMBench-V1.1<sub>test</sub> | 85.5 | 78.5 | 82.2 | **85.9** | 80.7 | 72.2 |
| MMT-Bench<sub>test</sub> | 63.4 | - | 65.5 | **71.7** | 63.7 | 54.5 |
| MMStar | 67.1 | 62.2 | 63.9 | **68.3** | 60.7 | 48.0 |
| MMVet<sub>GPT-4-Turbo</sub> | 65.7 | 66.0 | 69.1 | **74.0** | 62.0 | 49.5 |
| HallBench<sub>avg</sub> | 55.2 | 49.9 | 55.0 | **58.1** | 50.6 | 41.7 |
| MathVista<sub>testmini</sub> | 67.5 | 67.7 | 63.8 | **70.5** | 58.2 | 43.0 |
| MathVision | 16.97 | - | **30.4** | 25.9 | 16.3 | 12.4 |

### Video Benchmarks

| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Gemini 1.5-Pro | GPT-4o | **Qwen2-VL-72B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct))</sup> | **Qwen2-VL-7B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct))</sup> | **Qwen2-VL-2B**<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct))</sup> |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| MVBench | 69.6 | - | - | **73.6** | 67.0 | 63.2 |
| PerceptionTest<sub>test</sub> | 66.9 | - | - | **68.0** | 62.3 | 53.9 |
| EgoSchema<sub>test</sub> | 62.0 | 63.2 | 72.2 | **77.9** | 66.7 | 54.9 |
| Video-MME<br><sub>(wo/w subs)</sub> | 66.3/69.6 | **75.0**/**81.3** | 71.9/77.2 | 71.2/77.8 | 63.3/69.0 | 55.6/60.4 |

### Agent Benchmarks

| | Benchmark | Metric | Previous SoTA | GPT-4o | **Qwen2-VL-72B** |
| :-- | :-- | :--: | :--: | :--: | :--: |
| General | FnCall<sup>[1]</sup> | TM | - | 90.2 | **93.1** |
| | | EM | - | 50.0 | **53.2** |
| Game | Number Line | SR | 89.4<sup>[2]</sup> | 91.5 | **100.0** |
| | BlackJack | SR | 40.2<sup>[2]</sup> | 34.5 | **42.6** |
| | EZPoint | SR | 50.0<sup>[2]</sup> | 85.5 | **100.0** |
| | Point24 | SR | 2.6<sup>[2]</sup> | 3.0 | **4.5** |
| Android | AITZ | TM | 83.0<sup>[3]</sup> | 70.0 | **89.6** |
| | | EM | 47.7<sup>[3]</sup> | 35.3 | **72.1** |
| AI2THOR | ALFRED<sub>valid-unseen</sub> | SR | 67.7<sup>[4]</sup> | - | **67.8** |
| | | GC | 75.3<sup>[4]</sup> | - | **75.8** |
| VLN | R2R<sub>valid-unseen</sub> | SR | **79.0** | 43.7<sup>[5]</sup> | 51.7 |
| | REVERIE<sub>valid-unseen</sub> | SR | **61.0** | 31.6<sup>[5]</sup> | 31.0 |

SR, GC, TM, and EM are short for success rate, goal-condition success, type match, and exact match. ALFRED is supported by SAM<sup>[6]</sup>.
1. Self-Curated Function Call Benchmark by Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything.

### Multilingual Benchmarks

<table style="width:75%; text-align:center;">
  <tr>
    <th>Models</th>
    <td>AR</td>
    <td>DE</td>
    <td>FR</td>
    <td>IT</td>
    <td>JA</td>
    <td>KO</td>
    <td>RU</td>
    <td>TH</td>
    <td>VI</td>
    <td>AVG</td>
  </tr>
  <tr>
    <th align="left">Qwen2-VL-72B</th>
    <td>20.7</td>
    <td>36.5</td>
    <td>44.1</td>
    <td>42.8</td>
    <td>21.6</td>
    <td>37.4</td>
    <td>15.6</td>
    <td>17.7</td>
    <td>41.6</td>
    <td><b>30.9</b></td>
  </tr>
  <tr>
    <th align="left">GPT-4o</th>
    <td>20.2</td>
    <td>34.2</td>
    <td>41.2</td>
    <td>32.7</td>
    <td>20.0</td>
    <td>33.9</td>
    <td>11.5</td>
    <td>22.5</td>
    <td>34.2</td>
    <td>27.8</td>
  </tr>
  <tr>
    <th align="left">Claude3 Opus</th>
    <td>15.1</td>
    <td>33.4</td>
    <td>40.6</td>
    <td>34.4</td>
    <td>19.4</td>
    <td>27.2</td>
    <td>13.0</td>
    <td>19.5</td>
    <td>29.1</td>
    <td>25.7</td>
  </tr>
  <tr>
    <th align="left">Gemini Ultra</th>
    <td>14.7</td>
    <td>32.3</td>
    <td>40.0</td>
    <td>31.8</td>
    <td>12.3</td>
    <td>17.2</td>
    <td>11.8</td>
    <td>20.3</td>
    <td>28.6</td>
    <td>23.2</td>
  </tr>
</table>

These results are evaluated on the [MTVQA](https://github.com/bytedance/MTVQA/tree/main) benchmark.

## Quickstart

Below, we provide simple examples to show how to use Qwen2-VL with 🤖 ModelScope and 🤗 Transformers.

The code of Qwen2-VL has been merged into the latest Hugging Face Transformers, and we advise you to build from source with the following command:
```
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
```
Otherwise, you might encounter the following error:
```
KeyError: 'qwen2_vl'
```

- ⚠️**NOTE**: The current latest release of `transformers` has [a bug](https://github.com/huggingface/transformers/issues/33401) when loading the Qwen2-VL config, so you need to install the specific version of transformers as above.
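As an optional sanity check (just a suggestion, not part of the original instructions), the import below should succeed on the pinned build and fail on older releases that do not yet ship the `qwen2_vl` architecture:

```bash
# Should print "Qwen2VLForConditionalGeneration" if the pinned transformers build is installed correctly;
# on an older transformers release this import fails instead.
python -c "from transformers import Qwen2VLForConditionalGeneration; print(Qwen2VLForConditionalGeneration.__name__)"
```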
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. It supports base64, URLs, and interleaved images and videos. You can install it with the following command:

```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]
```

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which falls back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to have decord used when loading videos.

### Using 🤗 Transformers to Chat

Here we show a code snippet demonstrating how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
<details>
<summary>Multi image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>

<details>
<summary>Video inference</summary>

```python
# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision < 0.19.0 | ❌ | ❌ |
| decord | ✅ | ❌ |
</details>

<details>
<summary>Batch inference</summary>

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
</details>

### 🤖 ModelScope
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
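For instance, a minimal sketch of downloading a checkpoint from ModelScope might look like the following (assuming the `modelscope` package is installed; the model ID below matches the ModelScope links above, so adjust it to the model you need):

```python
# Download a Qwen2-VL checkpoint from ModelScope to a local directory.
from modelscope import snapshot_download

model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")
# Pass this local path to from_pretrained(...) in place of the Hub model name.
print(model_dir)
```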
### More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

```python
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```
#### Image Resolution for performance boost

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```

Besides, we provide two methods for fine-grained control over the image size input to the model:

1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

2. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.

```python
# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

#### Add ids for Multiple Image Inputs
By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:
<details>
<summary>Add vision ids</summary>

```python
conversation = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."},
        ],
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?",
    },
]

# default:
prompt_without_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'


# add ids
prompt_with_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True, add_vision_id=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
```
</details>

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, your hardware should be compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```


### Try Qwen2-VL-72B with API!

To explore Qwen2-VL-72B, a more fascinating multimodal model, we encourage you to test our cutting-edge API service. Let's start the exciting journey right now!

#### Installation
```bash
pip install dashscope
```

#### Examples
```python
import dashscope


dashscope.api_key = "your_api_key"

messages = [{
    'role': 'user',
    'content': [
        {
            'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
        },
        {
            'text': 'What are in the image?'
        },
    ]
}]
# The model name 'qwen-vl-max-0809' is the identity of 'Qwen2-VL-72B'.
response = dashscope.MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
print(response)
```

For more usage, please refer to the tutorial at [aliyun](https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api).

## Quantization

For quantized models, we offer two types of quantization: AWQ and GPTQ ([🤗](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) [🤖](https://modelscope.cn/organization/qwen)).

### AWQ
One of our recommendations is the usage of [AWQ](https://arxiv.org/abs/2306.00978) with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). AWQ refers to Activation-aware Weight Quantization, a hardware-friendly approach for low-bit weight-only quantization of LLMs. AutoAWQ is an easy-to-use package for 4-bit quantized models.
#### Usage of AWQ Quantized Models with Transformers
Transformers now officially supports AutoAWQ, which means that you can use the quantized model directly with Transformers. The following is a very simple code snippet showing how to run the quantized `Qwen2-VL-7B-Instruct-AWQ`:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-AWQ",
#     torch_dtype="auto",
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
#### Quantize Your Own Model with AutoAWQ
If you want to quantize your own model to AWQ, we advise you to use AutoAWQ.
It is suggested to install the forked version of the package from source:

```bash
git clone https://github.com/kq-chen/AutoAWQ.git
cd AutoAWQ
pip install numpy gekko pandas
pip install -e .
```

Suppose you have finetuned a model based on `Qwen2-VL-7B`. To build your own AWQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:

```python
from transformers import Qwen2VLProcessor
from awq.models.qwen2vl import Qwen2VLAWQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load your processor and model with AutoAWQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
#     model_path, model_type="qwen2_vl", use_cache=False, attn_implementation="flash_attention_2"
# )
model = Qwen2VLAWQForConditionalGeneration.from_pretrained(
    model_path, model_type="qwen2_vl", use_cache=False
)
```
Then you need to prepare your data for calibration. All you need to do is put the samples into a list, each of which is a typical chat message as shown below. You can specify `text` and `image` in the `content` field. For example:
```python
dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]
```
Here, we use a caption dataset **only for demonstration**. You should replace it with your own SFT dataset.
```python
def prepare_dataset(n_sample: int = 8) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()
```

Then process the dataset into tensors:
```python
from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(
    dataset, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(dataset)
inputs = processor(
    text=text,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
```

Then just run the calibration process with one line of code:
```python
model.quantize(calib_data=inputs, quant_config=quant_config)
```
Finally, save the quantized model:
```python
model.model.config.use_cache = model.model.generation_config.use_cache = True
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
processor.save_pretrained(quant_path)
```
Then you can obtain your own AWQ quantized model for deployment. Enjoy!
### GPTQ
#### Usage of GPTQ Models with Transformers
Transformers now officially supports AutoGPTQ, which means that you can use the quantized model directly with Transformers. The following is a very simple code snippet showing how to run the quantized `Qwen2-VL-7B-Instruct-GPTQ-Int4`:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", torch_dtype="auto", device_map="auto"
)

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4", min_pixels=min_pixels, max_pixels=max_pixels
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
#### Quantize Your Own Model with AutoGPTQ
If you want to quantize your own model to GPTQ, we advise you to use AutoGPTQ. It is suggested to install the forked version of the package from source:

```bash
git clone https://github.com/kq-chen/AutoGPTQ.git
cd AutoGPTQ
pip install numpy gekko pandas
pip install -vvv --no-build-isolation -e .
```
Suppose you have finetuned a model based on `Qwen2-VL-7B`. To build your own GPTQ quantized model, you need to use the training data for calibration. Below, we provide a simple demonstration for you to run:
```python
from transformers import Qwen2VLProcessor
from auto_gptq import BaseQuantizeConfig
from auto_gptq.modeling import Qwen2VLGPTQForConditionalGeneration

# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
    bits=8,  # 4 or 8
    group_size=128,
    damp_percent=0.1,
    desc_act=False,  # setting to False significantly speeds up inference, at a slight cost in perplexity
    static_groups=False,
    sym=True,
    true_sequential=True,
)
# Load your processor and model with AutoGPTQ
processor = Qwen2VLProcessor.from_pretrained(model_path)
# We recommend enabling flash_attention_2 for better acceleration and memory saving
# model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config, attn_implementation="flash_attention_2")
model = Qwen2VLGPTQForConditionalGeneration.from_pretrained(model_path, quantize_config)
```
Then you need to prepare your data for calibration. All you need to do is put the samples into a list, each of which is a typical chat message as shown below.
You can specify `text` and `image` in the `content` field. For example:
```python
dataset = [
    # message 0
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # message 1
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "file:///path/to/your/image.jpg"},
                {"type": "text", "text": "Output all text in the image"},
            ],
        },
        {"role": "assistant", "content": "The text in the image is balabala..."},
    ],
    # other messages...
    ...,
]
```
Here, we use a caption dataset **only for demonstration**. You should replace it with your own SFT dataset.
```python
def prepare_dataset(n_sample: int = 20) -> list[list[dict]]:
    from datasets import load_dataset

    dataset = load_dataset(
        "laion/220k-GPT4Vision-captions-from-LIVIS", split=f"train[:{n_sample}]"
    )
    return [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["url"]},
                    {"type": "text", "text": "generate a caption for this image"},
                ],
            },
            {"role": "assistant", "content": sample["caption"]},
        ]
        for sample in dataset
    ]


dataset = prepare_dataset()
```

Then process the dataset into tensors:
```python
from qwen_vl_utils import process_vision_info


def batched(iterable, n: int):
    # batched('ABCDEFG', 3) --> ABC DEF G
    assert n >= 1, "batch size must be at least one"
    from itertools import islice

    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


batch_size = 1
calib_data = []
for batch in batched(dataset, batch_size):
    text = processor.apply_chat_template(
        batch, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(batch)
    inputs = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    calib_data.append(inputs)
```
Then just run the calibration process with one line of code:
```python
model.quantize(calib_data, cache_examples_on_gpu=False)
```
Finally, save the quantized model:
```python
model.save_quantized(quant_path, use_safetensors=True)
processor.save_pretrained(quant_path)
```
Then you can obtain your own GPTQ quantized model for deployment. Enjoy!
### Benchmark
#### Performance of Quantized Models
This section reports the generation performance of the quantized models (GPTQ and AWQ) of the Qwen2-VL series. Specifically, we report:

- MMMU_VAL (Accuracy)
- DocVQA_VAL (Accuracy)
- MMBench_DEV_EN (Accuracy)
- MathVista_MINI (Accuracy)

We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate all models.
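For reference, a typical VLMEvalKit run looks roughly like the sketch below. The `--model` and `--data` identifiers shown here are assumptions on our part, so please check the names registered in VLMEvalKit's documentation and config before running:

```bash
# Hypothetical example of evaluating a Qwen2-VL checkpoint with VLMEvalKit;
# verify the --model and --data names against VLMEvalKit's supported lists first.
git clone https://github.com/open-compass/VLMEvalKit.git && cd VLMEvalKit
pip install -e .
python run.py --data MMBench_DEV_EN MathVista_MINI --model Qwen2-VL-7B-Instruct --verbose
```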
| Model Size | Quantization | MMMU | DocVQA | MMBench | MathVista |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct))</sup> | 65.44 | 95.79 | 86.94 | 70.19 |
| | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-GPTQ-Int8))</sup> | 64.56 | 95.84 | 87.03 | 68.90 |
| | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4))</sup> | 64.00 | 95.70 | 86.68 | 69.20 |
| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-72B-Instruct-AWQ))</sup> | 64.22 | 95.72 | 86.43 | 68.40 |
| Qwen2-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct))</sup> | 53.77 | 93.89 | 81.78 | 58.20 |
| | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-GPTQ-Int8))</sup> | 53.00 | 93.94 | 82.38 | 57.90 |
| | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4))</sup> | 52.55 | 93.16 | 81.27 | 60.30 |
| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct-AWQ))</sup> | 53.66 | 93.10 | 81.61 | 56.80 |
| Qwen2-VL-2B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct))</sup> | 41.88 | 88.34 | 72.07 | 44.40 |
| | GPTQ-Int8<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-GPTQ-Int8))</sup> | 41.55 | 88.28 | 71.99 | 44.60 |
| | GPTQ-Int4<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4))</sup> | 39.22 | 87.21 | 70.87 | 41.69 |
| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ) [🤖](https://modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct-AWQ))</sup> | 41.33 | 86.96 | 71.64 | 39.90 |

#### Speed Benchmark
This section reports the speed performance of the bf16 and quantized models (GPTQ-Int4, GPTQ-Int8, and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) as well as the memory footprint (GB) under different context lengths.

The environment of the evaluation with Hugging Face Transformers is:

- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.2.1+cu118
- Flash Attention 2.6.1
- Transformers 4.38.2
- AutoGPTQ 0.6.0+cu118
- AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

Note:

- We use a batch size of 1 and as few GPUs as possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.
- 72B (transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-72B-Instruct | 1 | BF16 | 2 | 8.90 | 138.74 |
| | | GPTQ-Int8 | 2 | 9.53 | 75.173 |
| | | GPTQ-Int4 | 1 | 11.04 | 42.46 |
| | | AWQ | 1 | 12.00 | 41.98 |
| | 6144 | BF16 | 2 | 6.53 | 148.66 |
| | | GPTQ-Int8 | 2 | 6.97 | 85.09 |
| | | GPTQ-Int4 | 1 | 7.62 | 49.05 |
| | | AWQ | 1 | 8.33 | 48.58 |
| | 14336 | BF16 | 3 | 4.39 | 165.92 |
| | | GPTQ-Int8 | 2 | 5.04 | 99.31 |
| | | GPTQ-Int4 | 1 | 5.39 | 58.76 |
| | | AWQ | 1 | 5.72 | 58.29 |
| | 30720 | BF16 | 4 | 2.93 | 204.33 |
| | | GPTQ-Int8 | 2 | 3.16 | 127.77 |
| | | GPTQ-Int4 | 2 | 3.27 | 85.13 |
| | | AWQ | 2 | 3.39 | 94.65 |

- 7B (transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B-Instruct | 1 | BF16 | 1 | 39.02 | 16.07 |
| | | GPTQ-Int8 | 1 | 31.60 | 10.11 |
| | | GPTQ-Int4 | 1 | 42.76 | 7.20 |
| | | AWQ | 1 | 32.08 | 7.07 |
| | 6144 | BF16 | 1 | 38.75 | 21.56 |
| | | GPTQ-Int8 | 1 | 31.31 | 15.61 |
| | | GPTQ-Int4 | 1 | 39.75 | 12.69 |
| | | AWQ | 1 | 32.66 | 12.56 |
| | 14336 | BF16 | 1 | 30.65 | 29.07 |
| | | GPTQ-Int8 | 1 | 27.96 | 23.11 |
| | | GPTQ-Int4 | 1 | 29.72 | 20.20 |
| | | AWQ | 1 | 31.42 | 20.07 |
| | 30720 | BF16 | 1 | 19.53 | 44.08 |
| | | GPTQ-Int8 | 1 | 18.37 | 38.13 |
| | | GPTQ-Int4 | 1 | 19.15 | 35.22 |
| | | AWQ | 1 | 19.95 | 35.08 |

- 2B (transformers)

| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-2B-Instruct | 1 | BF16 | 1 | 35.29 | 4.68 |
| | | GPTQ-Int8 | 1 | 28.59 | 3.55 |
| | | GPTQ-Int4 | 1 | 39.76 | 2.91 |
| | | AWQ | 1 | 29.89 | 2.88 |
| | 6144 | BF16 | 1 | 36.58 | 10.01 |
| | | GPTQ-Int8 | 1 | 29.53 | 8.87 |
| | | GPTQ-Int4 | 1 | 39.27 | 8.21 |
| | | AWQ | 1 | 33.42 | 8.18 |
| | 14336 | BF16 | 1 | 36.31 | 17.20 |
| | | GPTQ-Int8 | 1 | 31.03 | 16.07 |
| | | GPTQ-Int4 | 1 | 39.89 | 15.40 |
| | | AWQ | 1 | 32.28 | 15.40 |
| | 30720 | BF16 | 1 | 32.53 | 31.64 |
| | | GPTQ-Int8 | 1 | 27.76 | 30.51 |
| | | GPTQ-Int4 | 1 | 30.73 | 29.84 |
| | | AWQ | 1 | 31.55 | 29.84 |


## Deployment

We recommend using vLLM for fast Qwen2-VL deployment and inference. You need to use `vllm>=0.6.1` to enable Qwen2-VL support. You can also use our [official docker image](#-docker).
### Installation
```bash
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
# Change to your CUDA version
CUDA_VERSION=cu121
pip install 'vllm==0.6.1' --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
```
### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model Qwen/Qwen2-VL-7B-Instruct
```

Then you can use the chat API as below (via curl or the Python API):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen2-VL-7B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'
```

```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"
                    },
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```

You can also upload base64-encoded local images (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"
chat_response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```

### Notes

- ⚠️**NOTE**: `vllm.entrypoints.openai.api_server` currently does not support setting `min_pixels` and `max_pixels` in messages (we are working hard on supporting this feature).
If you want to limit the resolution, you can set `min_pixels` and `max_pixels` in the model's `preprocessor_config.json`:

```json
{
    "min_pixels": 50176,
    "max_pixels": 1003520,
    ...
}
```

- ⚠️**NOTE**: `vllm.entrypoints.openai.api_server` does not support video input yet. We are actively developing it.
- ⚠️**NOTE**: If you want to pass multiple images in a single prompt, you need to pass the `--limit-mm-per-prompt image=<N>` argument (`N` is the max number of images per prompt) when launching `vllm.entrypoints.openai.api_server`.

### Inference Locally

You can also use vLLM to run Qwen2-VL inference locally:

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "What is the text in the illustration?"},
        ],
    },
]
# For video input, you can pass the following values instead:
# "type": "video",
# "video": "<video URL>",

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)
```


## Training
#### LLaMA-Factory

Here we provide a script for supervised finetuning of Qwen2-VL with
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory). This
script for supervised finetuning (SFT) has the following features:

- Supports multi-image inputs;

- Supports single-GPU and multi-GPU training;

- Supports full-parameter tuning and LoRA.

In the following, we introduce more details about the usage of the script.

#### Installation

Before you start, make sure you have installed the following packages:

1. Follow the instructions of
   [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and build
   the environment.
2. Install these packages (optional):

   ```
   pip install deepspeed
   pip install flash-attn --no-build-isolation
   ```

3. If you want to use
   [FlashAttention-2](https://github.com/Dao-AILab/flash-attention),
   make sure your CUDA version is 11.6 or above.

#### Data Preparation

LLaMA-Factory provides several training datasets in the `data` folder, which you can use directly.
If you are using a custom dataset, please prepare your dataset as follows.

1. Organize your data in a **json** file and put it in the `data`
   folder. LLaMA-Factory supports multimodal datasets in the `sharegpt`
   format.

   - A dataset in `sharegpt` format should follow the format below:

```json
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "They're Kane and Gretzka from Bayern Munich.",
        "role": "assistant"
      },
      {
        "content": "What are they doing?<image>",
        "role": "user"
      },
      {
        "content": "They are celebrating on the soccer field.",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg",
      "mllm_demo_data/1.jpg"
    ]
  },
]
```

2. Provide your dataset definition in `data/dataset_info.json` in the
   following format.

   - For a `sharegpt` format dataset, the columns in `dataset_info.json`
     should be:

```json
"dataset_name": {
  "file_name": "dataset_name.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
```

#### Training

LoRA SFT examples:
```
llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml
llamafactory-cli export examples/merge_lora/qwen2vl_lora_sft.yaml
```

LoRA DPO/ORPO/SimPO examples (using the [RLHF-V Dataset](https://huggingface.co/datasets/llamafactory/RLHF-V)):
```
llamafactory-cli train examples/train_lora/qwen2vl_lora_dpo.yaml
```

Full SFT examples:
```
llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
```

Inference examples:
```
llamafactory-cli webchat examples/inference/qwen2_vl.yaml
llamafactory-cli api examples/inference/qwen2_vl.yaml
```

Execute the following training command:

```bash
DISTRIBUTED_ARGS="
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
  "

torchrun $DISTRIBUTED_ARGS src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --stage sft \
    --do_train \
    --model_name_or_path Qwen/Qwen2-VL-7B-Instruct \
    --dataset mllm_demo \
    --template qwen2_vl \
    --finetuning_type lora \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --bf16
```

and enjoy the training process. To make changes to your training, you can modify the arguments in the training command to adjust the hyperparameters. One argument to note is `cutoff_len`, which is the maximum length of the training data. Control this parameter to avoid an OOM error.

## Function Calling

Qwen2-VL supports Function Calling (aka. Tool Calling or Tool Use).
## Function Calling

Qwen2-VL supports Function Calling (also known as Tool Calling or Tool Use). For details on how to use this capability, please refer to the Qwen-Agent project for [the function calling example](https://github.com/QwenLM/Qwen-Agent/blob/main/examples/qwen2vl_function_calling.py) and [the agent example](https://github.com/QwenLM/Qwen-Agent/blob/main/examples/qwen2vl_assistant_tooluse.py).

### Simple Use Case

```python
# pip install qwen_agent
from typing import List, Union
from datetime import datetime
from qwen_agent.agents import FnCallAgent
from qwen_agent.gui import WebUI
from qwen_agent.tools.base import BaseToolWithFileAccess, register_tool


@register_tool("get_date")
class GetDate(BaseToolWithFileAccess):
    description = "call this tool to get the current date"
    parameters = [
        {
            "name": "lang",
            "type": "string",
            "description": "one of ['en', 'zh'], default is en",
            "required": False
        },
    ]

    def call(self, params: Union[str, dict], files: List[str] = None, **kwargs) -> str:
        super().call(params=params, files=files)
        params = self._verify_json_format_args(params)

        # "lang" is optional; default to English if it is not provided
        lang = "zh" if "zh" in params.get("lang", "en") else "en"
        now = datetime.now()
        result = now.strftime("%Y-%m-%d %H:%M:%S") + "\n"

        weekday = now.weekday()
        if lang == "zh":
            days_chinese = ["一", "二", "三", "四", "五", "六", "日"]
            result += "今天是星期" + days_chinese[weekday]
        else:
            days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
            result += "Today is " + days[weekday]

        return result


def init_agent_service():
    llm_cfg_vl = {
        # Using Qwen2-VL deployed at any OpenAI-compatible service, such as vLLM:
        "model_type": "qwenvl_oai",
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "model_server": "http://localhost:8000/v1",  # api_base
        "api_key": "EMPTY",
    }
    tools = [
        "get_date",
        "code_interpreter",
    ]  # code_interpreter is a built-in tool in Qwen-Agent
    bot = FnCallAgent(
        llm=llm_cfg_vl,
        name="Qwen2-VL",
        description="function calling",
        function_list=tools,
    )
    return bot


def app_gui():
    # Define the agent
    bot = init_agent_service()
    WebUI(bot).run()


# Launch the Gradio app
app_gui()
```


## Demo
### Web UI Example

In this section, we provide instructions for building a web-based user interface (UI) demo. This UI demo allows you to interact with a predefined model or application through a web browser. Follow the steps below to get started.

#### Installation

Before you begin, ensure that you have the required dependencies installed on your system. You can install them by running the following command:

```bash
pip install -r requirements_web_demo.txt
```

#### Running the Demo with FlashAttention-2

Once the required packages are installed, you can launch the web demo using the following command. This command starts a web server and provides a link to access the UI in your web browser.

**Recommended**: For enhanced performance and efficiency, especially in multi-image and video processing scenarios, we strongly recommend using [FlashAttention-2](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 provides significant improvements in memory usage and speed, making it ideal for handling large-scale models and data processing.
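If FlashAttention-2 is not already installed in the demo environment, it can usually be added the same way as in the training setup above, provided your CUDA version meets the requirement noted earlier:

```bash
pip install flash-attn --no-build-isolation
```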
To enable FlashAttention-2, use the following command:

```bash
python web_demo_mm.py --flash-attn2
```

This will load the model with FlashAttention-2 enabled.

**Default Usage**: If you prefer to run the demo without FlashAttention-2, or if you do not specify the `--flash-attn2` option, the demo will load the model using the standard attention implementation:

```bash
python web_demo_mm.py
```

After running the command, you'll see a link generated in the terminal similar to this:

```
Running on local: http://127.0.0.1:7860/
```

Copy this link and paste it into your browser to access the web UI, where you can interact with the model by inputting text, uploading images, or using any other provided functionality.

##### Running the Streaming Video Chat Demo

An experimental streaming video chat demo is also available in the `web_demo_streaming` directory.

To run the streaming video chat demo, use the following command:

```bash
cd web_demo_streaming/
python app.py --flash-attn2
```

If you prefer to run the demo without FlashAttention-2, use the following command:

```bash
cd web_demo_streaming/
python app.py
```

This demo supports webcam and screen capture as its video input sources. To support screen-capture input, we use a code snippet from the following Hugging Face Space: [gstaff/gradio-screen-recorder](https://huggingface.co/spaces/gstaff/gradio-screen-recorder/tree/main).

#### Selecting Different Models (Qwen2-VL Series Only)

The demo is configured by default to use the `Qwen/Qwen2-VL-7B-Instruct` model, which is part of the Qwen2-VL series and is well suited for various vision-language tasks. However, if you want to use a different model within the Qwen2-VL series, simply update the `DEFAULT_CKPT_PATH` variable in the script:

1. **Locate the `DEFAULT_CKPT_PATH` variable**:
   Inside `web_demo_mm.py`, find the `DEFAULT_CKPT_PATH` variable that defines the model checkpoint path. It should look like this:

   ```python
   DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
   ```

2. **Replace it with a different Qwen2-VL model path**:
   Modify `DEFAULT_CKPT_PATH` to point to another checkpoint within the Qwen2-VL series. For example:

   ```python
   DEFAULT_CKPT_PATH = 'Qwen/Qwen2-VL-2B-Instruct'  # Example of a different model in the series
   ```

3. **Save and re-run**:
   After modifying the path, save the script and re-run the demo using the instructions provided in the `Running the Demo` section above.

**Note:** `DEFAULT_CKPT_PATH` only supports models from the Qwen2-VL series. If you're using a model outside the Qwen2-VL series, additional changes to the codebase may be necessary.


#### Customization

Further customization of the web demo, including the UI layout, interactions, and additional functionality such as handling specialized input, can be done by modifying the `web_demo_mm.py` script. This flexibility allows you to tailor the web interface to better fit specific tasks or workflows.


## Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
2. Data Timeliness: Our image dataset is **updated until June 2023**, and information subsequent to this date may not be covered.
3. Constraints on Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, and it may fail to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instructions: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high and needs further improvement.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.


## 🐳 Docker

To simplify the deployment process, we provide Docker images with pre-built environments: [qwenllm/qwenvl](https://hub.docker.com/r/qwenllm/qwenvl). You only need to install the driver and download the model files to launch the demos.

```bash
docker run --gpus all --ipc=host --network=host --rm --name qwen2 -it qwenllm/qwenvl:2-cu121 bash
```

## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

<br>