## Setup

1. **Create a Conda Environment**
   Use the following commands to create and activate a new environment for evaluation:
   ```bash
   # Name of the benchmark directory to set up
   benchmark="eval_plus"
   conda create -n "eval_${benchmark}_env" python=3.9
   conda activate "eval_${benchmark}_env"
   cd "${benchmark}"
   pip install -r requirements.txt
   ```
   Please set up all evaluation environments in `test.sh`.

2. **Install Dependencies**
   After activating the environment for each benchmark, install the required dependencies by running:
   ```bash
   pip install -r requirements.txt
   ```

## Evaluation

1. **Model Path Modification**
   Before running the evaluation script, make sure the model paths are set correctly. Adjust them to match your local environment or cloud storage setup.

2. **Run Evaluation**
   Once the environment is ready and the model paths are configured, run the evaluation suite with the following script:
   ```bash
   EVAL_SCRIPT="./evaluate.sh"
   MODEL_DIR="/path/to/Qwen2.5-coder-Instruct/"
   OUTPUT_DIR="/path/to/results/"
   TP=2
   bash "${EVAL_SCRIPT}" "${MODEL_DIR}" "${OUTPUT_DIR}" "${TP}"
   ```

## Quantization Evaluation Results

### Python

| Model                                      |  HE  | HE+  | MBPP | MBPP+ | BCB-inst-full | BCB-inst-hard | LCB (2407-2411) |
|--------------------------------------------|:----:|:----:|:----:|:-----:|:-------------:|:-------------:|:---------------:|
| **Qwen2.5-Coder-32B-Instruct**             | 92.7 | 87.2 | 90.2 | 75.1  | 49.6          | 27.0          | 31.4            |
| **Qwen2.5-Coder-32B-AWQ**                  | 92.1 | 84.1 | 87.8 | 75.1  | 48.9          | 27.0          | 31.7            |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int8**   | 92.1 | 85.4 | 90.5 | 76.5  | 48.6          | 26.4          | 30.7            |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int4**   | 89.6 | 83.5 | 87.0 | 75.9  | 49.7          | 27.0          | 30.3            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q8_0**   | 90.9 | 86.0 | 89.4 | 76.2  | 47.7          | 23.6          | 31.1            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q6_K**   | 90.2 | 86.0 | 89.7 | 76.2  | 48.3          | 25.7          | 31.6            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_K_M** | 90.9 | 85.4 | 89.2 | 75.7  | 48.8          | 25.7          | 31.4            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_0**   | 90.9 | 86.0 | 88.9 | 74.9  | 48.4          | 23.6          | 30.7            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_K_M** | 89.6 | 85.4 | 89.4 | 75.9  | 48.9          | 24.3          | 29.9            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_0**   | 89.6 | 85.4 | 90.2 | 77.8  | 48.2          | 25.7          | 32.6            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q3_K_M** | 91.5 | 86.6 | 90.7 | 76.7  | 48.3          | 23.6          | 32.0            |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q2_K**   | 92.7 | 84.8 | 87.3 | 74.3  | 47.6          | 23.0          | 28.5            |

### Multiple Programming Languages

| Model                                      | Python | Java | C++  |  C#  |  TS  |  JS  | PHP  | Bash | Avg. |
|--------------------------------------------|:------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| **Qwen2.5-Coder-32B-Instruct**             | 92.7   | 80.4 | 79.5 | 82.9 | 86.8 | 85.7 | 78.9 | 48.1 | 79.4 |
| **Qwen2.5-Coder-32B-AWQ**                  | 90.2   | 83.5 | 80.7 | 82.3 | 85.5 | 85.1 | 79.5 | 49.4 | 79.5 |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int8**   | 90.9   | 84.2 | 81.4 | 80.4 | 85.5 | 87.6 | 79.5 | 49.4 | 79.8 |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int4**   | 92.1   | 83.5 | 82.6 | 80.4 | 84.3 | 86.3 | 78.3 | 50.0 | 79.7 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q8_0**   | 90.2   | 84.8 | 82.0 | 81.0 | 85.5 | 87.6 | 80.1 | 49.4 | 80.1 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q6_K**   | 90.9   | 83.5 | 82.0 | 81.6 | 84.9 | 87.0 | 80.7 | 48.7 | 79.9 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_K_M** | 90.2   | 83.5 | 82.6 | 81.0 | 85.5 | 87.0 | 79.5 | 48.7 | 79.8 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_0**   | 90.2   | 84.8 | 82.0 | 81.6 | 85.5 | 87.0 | 80.1 | 48.1 | 79.9 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_K_M** | 90.2   | 84.8 | 81.4 | 82.3 | 85.5 | 86.3 | 80.1 | 50.6 | 80.2 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_0**   | 88.4   | 82.9 | 80.1 | 81.0 | 86.8 | 85.7 | 78.3 | 48.1 | 78.9 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q3_K_M** | 90.9   | 84.2 | 85.1 | 82.3 | 84.9 | 87.0 | 80.1 | 49.4 | 80.5 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q2_K**   | 90.2   | 81.0 | 82.6 | 81.6 | 83.6 | 84.5 | 80.1 | 48.7 | 79.1 |

### Code Editing & Code Reasoning & SQL

| Model                                      | Aider (whole) | Aider (diff) | CRUXEval-Input-CoT | CRUXEval-Output-CoT | Spider | Bird |
|--------------------------------------------|:-------------:|:------------:|:------------------:|:-------------------:|:------:|:----:|
| **Qwen2.5-Coder-32B-Instruct**             | 73.7          | 71.4         | 75.2               | 83.4                | 85.1   | 58.4 |
| **Qwen2.5-Coder-32B-AWQ**                  | 73.7          | 67.7         | 75.1               | 83.1                | 83.6   | 57.3 |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int8**   | 74.4          | 73.7         | 75.8               | 83.6                | 84.8   | 58.1 |
| **Qwen2.5-Coder-32B-Instruct-GPTQ-Int4**   | 72.2          | 67.7         | 75.8               | 83.5                | 85.0   | 57.6 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q8_0**   | 72.9          | 69.9         | 80.5               | 83.8                | 84.5   | 57.9 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q6_K**   | 72.9          | 73.7         | 78.1               | 83.5                | 84.7   | 58.1 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_K_M** | 74.4          | 69.9         | 78.4               | 84.6                | 85.3   | 57.7 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q5_0**   | 71.4          | 72.2         | 80.6               | 83.2                | 84.9   | 57.4 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_K_M** | 75.2          | 69.2         | 79.0               | 83.5                | 84.5   | 57.5 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q4_0**   | 74.4          | 71.4         | 78.5               | 84.0                | 84.7   | 57.2 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q3_K_M** | 72.9          | 68.4         | 78.8               | 83.9                | 84.4   | 57.4 |
| **Qwen2.5-Coder-32B-Instruct-GGUF-Q2_K**   | 69.9          | 61.7         | 75.5               | 81.1                | 83.4   | 56.1 |
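
When reading these tables, it can help to look at each quantized score as a signed difference from the unquantized baseline row. The snippet below is a minimal sketch of that arithmetic, assuming `awk` is available; `delta` is a hypothetical helper written for illustration and is not part of this repository. The two sample calls use Aider (diff) scores copied from the table above.

```bash
#!/usr/bin/env bash
# delta BASELINE SCORE
# Hypothetical helper: prints SCORE - BASELINE as a signed value with one decimal.
delta() {
  awk -v base="$1" -v score="$2" 'BEGIN { printf "%+.1f\n", score - base }'
}

# Aider (diff): baseline Qwen2.5-Coder-32B-Instruct scores 71.4
delta 71.4 73.7   # GPTQ-Int8 -> +2.3
delta 71.4 61.7   # GGUF-Q2_K -> -9.7
```

A positive value means the quantized variant scored higher than the baseline on that benchmark, and a negative value means it scored lower.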