# llama.cpp

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg?branch=master&event=schedule)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
[![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)

[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)

Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++

> [!IMPORTANT]
> [2024 Jun 12] Binaries have been renamed with a `llama-` prefix. `main` is now `llama-cli`, `server` is `llama-server`, etc. (https://github.com/ggerganov/llama.cpp/pull/7809)

### Recent API changes

- [2024 Apr 21] `llama_token_to_piece` can now optionally render special tokens https://github.com/ggerganov/llama.cpp/pull/6807
- [2024 Apr 4] State and session file functions reorganized under `llama_state_*` https://github.com/ggerganov/llama.cpp/pull/6341
- [2024 Mar 26] Logits and embeddings API updated for compactness https://github.com/ggerganov/llama.cpp/pull/6122
- [2024 Mar 13] Add `llama_synchronize()` + `llama_context_params.n_ubatch` https://github.com/ggerganov/llama.cpp/pull/6017
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_seq_max()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849

### Hot topics

- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`; please use `convert-hf-to-gguf.py` instead** https://github.com/ggerganov/llama.cpp/pull/7430
- Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021
- BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920
- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387
- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
- Fix major bug in Metal batched inference https://github.com/ggerganov/llama.cpp/pull/6225
- Multi-GPU pipeline parallelism support https://github.com/ggerganov/llama.cpp/pull/6017
- Looking for contributions to add Deepseek support: https://github.com/ggerganov/llama.cpp/issues/5981
- Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328

----

<details>
  <summary>Table of Contents</summary>
  <ol>
    <li>
      <a href="#description">Description</a>
    </li>
    <li>
      <a href="#usage">Usage</a>
      <ul>
        <li><a href="#get-the-code">Get the Code</a></li>
        <li><a href="#build">Build</a></li>
        <li><a href="#blas-build">BLAS Build</a></li>
        <li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
        <li><a href="#run-the-quantized-model">Run the quantized model</a></li>
        <li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
        <li><a href="#quantization">Quantization</a></li>
        <li><a href="#interactive-mode">Interactive mode</a></li>
        <li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
        <li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
        <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
        <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
        <li><a href="#android">Android</a></li>
        <li><a href="#docker">Docker</a></li>
      </ul>
    </li>
    <li><a href="#contributing">Contributing</a></li>
    <li><a href="#coding-guidelines">Coding guidelines</a></li>
    <li><a href="#docs">Docs</a></li>
  </ol>
</details>

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
variety of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
improved significantly thanks to many contributions. It is the main playground for developing new features for the
[ggml](https://github.com/ggerganov/ggml) library.

**Supported platforms:**

- [X] macOS
- [X] Linux
- [X] Windows (via CMake)
- [X] Docker
- [X] FreeBSD

**Supported models:**

Typically finetunes of the base models below are supported as well.

- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
- [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
- [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [X] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)

(instructions for supporting more models: [HOWTO-add-model.md](./docs/HOWTO-add-model.md))

**Multimodal models:**

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)

**HTTP server**

[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.

[simplechat](./examples/server/public_simplechat) is a simple chat client that can be used from a local web browser to chat with the model exposed by the above web server (use `--path` to point to simplechat).

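For a quick test, the sketch below starts the server with a placeholder model path and queries its OpenAI-compatible chat endpoint (assuming the default listen address `http://localhost:8080`):

```bash
# start the server with a local gguf model
./llama-server -m ./models/7B/ggml-model-q4_0.gguf

# in another terminal, send an OpenAI-style chat completion request
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",   "content": "Write a haiku about llamas."}
          ]
        }'
```
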
**Bindings:**

- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)

**UI:**

Unless otherwise noted, these projects are open-source with permissive licensing:

- [iohub/collama](https://github.com/iohub/coLLaMA)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [nat/openplayground](https://github.com/nat/openplayground)
- [Faraday](https://faraday.dev/) (proprietary)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [Layla](https://play.google.com/store/apps/details?id=com.laylalite) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
- [ollama/ollama](https://github.com/ollama/ollama)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [RAGNA Desktop](https://ragna.app/) (proprietary)
- [RecurseChat](https://recurse.chat/) (proprietary)
- [semperai/amica](https://github.com/semperai/amica)
- [withcatai/catai](https://github.com/withcatai/catai)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Msty](https://msty.app) (proprietary)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0 or later)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [MindMac](https://mindmac.app) (proprietary)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [AIKit](https://github.com/sozercan/aikit) (MIT)

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

**Tools:**

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML

---

Here is a typical run using LLaMA v2 13B on M2 Ultra:

```
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MB

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms
```

And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:

https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4

## Usage

Here are the end-to-end binary build and model conversion steps for most supported models.

### Get the Code

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

### Build

In order to build llama.cpp you have three different options.

- Using `make`:
  - On Linux or MacOS:

      ```bash
      make
      ```

  - On Windows:

    1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
    2. Extract `w64devkit` on your PC.
    3. Run `w64devkit.exe`.
    4. Use the `cd` command to reach the `llama.cpp` folder.
    5. From here you can run:
        ```bash
        make
        ```

  - Notes:
    - For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `make -j 8` will run 8 jobs in parallel.
    - For faster repeated compilation, install [ccache](https://ccache.dev/).
    - For debug builds, run `make LLAMA_DEBUG=1`

- Using `CMake`:

  ```bash
  cmake -B build
  cmake --build build --config Release
  ```

  **Notes**:

    - For faster compilation, add the `-j` argument to run multiple jobs in parallel. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel.
    - For faster repeated compilation, install [ccache](https://ccache.dev/).
    - For debug builds, there are two cases:

      1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):

      ```bash
      cmake -B build -DCMAKE_BUILD_TYPE=Debug
      cmake --build build
      ```

      2. Multi-config generators (`-G` param set to Visual Studio, Xcode...):

      ```bash
      cmake -B build -G "Xcode"
      cmake --build build --config Debug
      ```

-   Using `gmake` (FreeBSD):

    1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
    2. Add your user to the **video** group
    3. Install compilation dependencies.

        ```bash
        sudo pkg install gmake automake autoconf pkgconf llvm15 openblas

        gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
        ```

### Homebrew

On Mac and Linux, the Homebrew package manager can be used via
```
brew install llama.cpp
```
The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668

### Nix

On Mac and Linux, the Nix package manager can be used via
```
nix profile install nixpkgs#llama-cpp
```
for flake-enabled installs.

Or
```
nix-env --file '<nixpkgs>' --install --attr llama-cpp
```
for non-flake-enabled installs.

This expression is automatically updated within the [nixpkgs repo](https://github.com/NixOS/nixpkgs/blob/nixos-24.05/pkgs/by-name/ll/llama-cpp/package.nix#L164).

#### Flox

On Mac and Linux, Flox can be used to install llama.cpp within a Flox environment via
```
flox install llama-cpp
```
Flox follows the nixpkgs build of llama.cpp.

### Metal Build

On macOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
To disable the Metal build at compile time use the `LLAMA_NO_METAL=1` flag or the `LLAMA_METAL=OFF` cmake option.

When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers|-ngl 0` command-line
argument.

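For example (a sketch with a placeholder model path), the following forces a Metal-enabled build to run entirely on the CPU:

```bash
# keep all layers on the CPU, even though the binary was built with Metal support
./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 64 -ngl 0
```
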
### BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing when using batch sizes higher than 32 (the default is 512). CPU-only BLAS implementations don't affect normal generation performance, while GPU-backed BLAS implementations (e.g. cuBLAS, hipBLAS) may improve it. There are currently several different BLAS implementations available for build and use:

- #### Accelerate Framework:

  This is only available on Mac and it's enabled by default. You can just build using the normal instructions.

- #### OpenBLAS:

  This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.

  - Using `make`:
    - On Linux:
      ```bash
      make LLAMA_OPENBLAS=1
      ```

    - On Windows:

      1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
      2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
      3. Extract `w64devkit` on your PC.
      4. From the OpenBLAS zip that you just downloaded, copy `libopenblas.a` (located inside the `lib` folder) into `w64devkit\x86_64-w64-mingw32\lib`.
      5. From the same OpenBLAS zip, copy the contents of the `include` folder into `w64devkit\x86_64-w64-mingw32\include`.
      6. Run `w64devkit.exe`.
      7. Use the `cd` command to reach the `llama.cpp` folder.
      8. From here you can run:

          ```bash
          make LLAMA_OPENBLAS=1
          ```

  - Using `CMake` on Linux:

      ```bash
      cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
      cmake --build build --config Release
      ```

- #### BLIS

  Check [BLIS.md](docs/BLIS.md) for more information.

- #### SYCL
  SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

  The SYCL build of llama.cpp is used to **support Intel GPUs** (Data Center Max series, Flex series, Arc series, built-in GPUs and iGPUs).

  For detailed info, please refer to [llama.cpp for SYCL](README-sycl.md).

- #### Intel oneMKL
  Building through the oneAPI compilers will make the avx_vnni instruction set available for Intel processors that do not support avx512 and avx512_vnni. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./README-sycl.md).

  - Using manual oneAPI installation:
    By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you have already sourced the Intel environment script and pass `-DLLAMA_BLAS=ON` to cmake, the MKL version of BLAS will automatically be selected. Otherwise please install oneAPI and follow the steps below:
      ```bash
      source /opt/intel/oneapi/setvars.sh # You can skip this step if you are in the oneapi-basekit docker image; it is only required for manual installation
      cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_NATIVE=ON
      cmake --build build --config Release
      ```

  - Using oneAPI docker image:
    If you do not want to source the environment vars and install oneAPI manually, you can also build the code using the Intel docker container: [oneAPI-basekit](https://hub.docker.com/r/intel/oneapi-basekit). Then, you can use the commands given above.

  Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.

- #### CUDA

  This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).

  For Jetson users: if you have a Jetson Orin, you can try this: [Official Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an older model (Nano/TX2), some additional operations are needed before compiling.

  - Using `make`:
    ```bash
    make LLAMA_CUDA=1
    ```
  - Using `CMake`:

    ```bash
    cmake -B build -DLLAMA_CUDA=ON
    cmake --build build --config Release
    ```

  The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:

  | Option                         | Legal values           | Default | Description |
  |--------------------------------|------------------------|---------|-------------|
  | LLAMA_CUDA_FORCE_DMMV          | Boolean                | false   | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
  | LLAMA_CUDA_DMMV_X              | Positive integer >= 32 | 32      | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
  | LLAMA_CUDA_MMV_Y               | Positive integer       | 1       | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
  | LLAMA_CUDA_FORCE_MMQ           | Boolean                | false   | Force the use of dequantization + matrix multiplication kernels instead of leveraging Math libraries. |
  | LLAMA_CUDA_F16                 | Boolean                | false   | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
  | LLAMA_CUDA_KQUANTS_ITER        | 1 or 2                 | 2       | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
  | LLAMA_CUDA_PEER_MAX_BATCH_SIZE | Positive integer       | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
  | LLAMA_CUDA_FA_ALL_QUANTS       | Boolean                | false   | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |

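  GPU selection can be combined with run-time offloading via `-ngl`. A minimal sketch (the model path is a placeholder and `-ngl 99` is simply a large value that offloads all layers):

  ```bash
  # run on the second GPU only, offloading all layers
  CUDA_VISIBLE_DEVICES=1 ./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 64 -ngl 99
  ```
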
- #### hipBLAS

  This provides BLAS acceleration on HIP-supported AMD GPUs.
  Make sure to have ROCm installed.
  You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).

  - Using `make`:
    ```bash
    make LLAMA_HIPBLAS=1
    ```
  - Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
    ```bash
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
        cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
        && cmake --build build --config Release -- -j 16
    ```
    On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DLLAMA_HIP_UMA=ON`.
    However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).

    Note that if you get the following error:
    ```
    clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
    ```
    Try searching for a directory under `HIP_PATH` that contains the file
    `oclc_abi_version_400.bc`. Then, add the following to the start of the
    command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something
    like:
    ```bash
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
    HIP_DEVICE_LIB_PATH=<directory-you-just-found> \
        cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
        && cmake --build build -- -j 16
    ```

  - Using `make` (example for target gfx1030, build with 16 CPU threads):
    ```bash
    make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030
    ```

  - Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
    ```bash
    set PATH=%HIP_PATH%\bin;%PATH%
    cmake -S . -B build -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
    cmake --build build
    ```
    Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100`, which corresponds to the Radeon RX 7900 XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors).
    Find your GPU version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.


  The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
  If your GPU is not officially supported, you can set the environment variable `HSA_OVERRIDE_GFX_VERSION` to a similar supported GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3 (see the example after the table below).
  The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):

  | Option                  | Legal values           | Default | Description |
  |-------------------------|------------------------|---------|-------------|
  | LLAMA_CUDA_DMMV_X       | Positive integer >= 32 | 32      | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
  | LLAMA_CUDA_MMV_Y        | Positive integer       | 1       | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
  | LLAMA_CUDA_KQUANTS_ITER | 1 or 2                 | 2       | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |

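  For example, to run on a GPU that is not officially supported, the override can be set directly on the command line (a sketch with a placeholder model path):

  ```bash
  # treat an RDNA2 GPU such as gfx1035 as gfx1030 and offload all layers
  HSA_OVERRIDE_GFX_VERSION=10.3.0 ./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 64 -ngl 99
  ```
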
- #### Vulkan

  **With docker**:

  You don't need to install the Vulkan SDK; it will be installed inside the container.

  ```sh
  # Build the image
  docker build -t llama-cpp-vulkan -f .devops/llama-cli-vulkan.Dockerfile .

  # Then, use it:
  docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
  ```

  **Without docker**:

  First, make sure you have installed the [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html).

  For example, on Ubuntu 22.04 (jammy), use the commands below:

  ```bash
  wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
  wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
  apt update -y
  apt-get install -y vulkan-sdk
  # To verify the installation, use the command below:
  vulkaninfo
  ```

  Alternatively, your package manager might be able to provide the appropriate libraries.
  For example, for Ubuntu 22.04 you can install `libvulkan-dev` instead.
  For Fedora 40, you can install the `vulkan-devel`, `glslc` and `glslang` packages.

  Then, build llama.cpp using the cmake commands below:

  ```bash
  cmake -B build -DLLAMA_VULKAN=1
  cmake --build build --config Release
  # Test the output binary (with "-ngl 33" to offload all layers to GPU)
  ./bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4

  # You should see ggml_vulkan detect your GPU in the output. For example:
  # ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
  ```

### Prepare and Quantize

> [!NOTE]
> You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.

To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.

Note: `convert.py` has been moved to `examples/convert-legacy-llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
It does not support LLaMA 3; for LLaMA 3 you can use `convert-hf-to-gguf.py` with the model downloaded from Hugging Face.

```bash
# obtain the official LLaMA model weights and place them in ./models
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [Optional] for models using BPE tokenizers
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [Optional] for PyTorch .bin models like Mistral-7B
ls ./models
<folder containing weights and tokenizer json>

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml FP16 format
python3 convert-hf-to-gguf.py models/mymodel/

# quantize the model to 4-bits (using Q4_K_M method)
./llama-quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# update the gguf filetype to current version if older version is now unsupported
./llama-quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
```

### Run the quantized model

```bash
# start inference on a gguf model
./llama-cli -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
```

When running the larger models, make sure you have enough disk space to store all the intermediate files.

### Running on Windows with prebuilt binaries

You will find prebuilt Windows binaries on the release page.

Simply download and extract the latest zip package of choice, e.g. `llama-b1380-bin-win-avx2-x64.zip`.

Open a terminal/cmd window in the unzipped folder, place a pre-converted `.gguf` model file there, and test out the main example like so:

```
.\llama-cli -m llama-2-7b.Q4_0.gguf -n 128
```

### Memory/Disk Requirements

As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

| Model | Original size | Quantized size (Q4_0) |
|------:|--------------:|----------------------:|
|    7B |         13 GB |                3.9 GB |
|   13B |         24 GB |                7.8 GB |
|   30B |         60 GB |               19.5 GB |
|   65B |        120 GB |               38.5 GB |

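As a rough rule of thumb (ignoring the model's metadata overhead), the quantized file size is approximately `number of parameters × bits per weight / 8`. For example, a 7B model at the ~4.5 bits/weight of Q4_0 works out to roughly 7·10⁹ × 4.5 / 8 ≈ 3.9 GB, matching the table above.
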
### Quantization

Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

*(outdated)*

| Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
|    7B | perplexity   | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
|    7B | file size    |  13.0G |   3.5G |   3.9G |   4.3G |   4.7G |   6.7G |
|    7B | ms/tok @ 4th |    127 |     55 |     54 |     76 |     83 |     72 |
|    7B | ms/tok @ 8th |    122 |     43 |     45 |     52 |     56 |     67 |
|    7B | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |
|   13B | perplexity   | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
|   13B | file size    |  25.0G |   6.8G |   7.6G |   8.3G |   9.1G |    13G |
|   13B | ms/tok @ 4th |      - |    103 |    105 |    148 |    160 |    131 |
|   13B | ms/tok @ 8th |      - |     73 |     82 |     98 |    105 |    128 |
|   13B | bits/weight  |   16.0 |    4.5 |    5.0 |    5.5 |    6.0 |    8.5 |

- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
- recent k-quants improvements and new i-quants
  - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
  - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
  - [#4773 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4773)
  - [#4856 - 2-bit i-quants (inference)](https://github.com/ggerganov/llama.cpp/pull/4856)
  - [#4861 - importance matrix](https://github.com/ggerganov/llama.cpp/pull/4861)
  - [#4872 - MoE models](https://github.com/ggerganov/llama.cpp/pull/4872)
  - [#4897 - 2-bit quantization](https://github.com/ggerganov/llama.cpp/pull/4897)
  - [#4930 - imatrix for all k-quants](https://github.com/ggerganov/llama.cpp/pull/4930)
  - [#4951 - imatrix on the GPU](https://github.com/ggerganov/llama.cpp/pull/4957)
  - [#4969 - imatrix for legacy quants](https://github.com/ggerganov/llama.cpp/pull/4969)
  - [#4996 - k-quants tuning](https://github.com/ggerganov/llama.cpp/pull/4996)
  - [#5060 - Q3_K_XS](https://github.com/ggerganov/llama.cpp/pull/5060)
  - [#5196 - 3-bit i-quants](https://github.com/ggerganov/llama.cpp/pull/5196)
  - [quantization tuning](https://github.com/ggerganov/llama.cpp/pull/5320), [another one](https://github.com/ggerganov/llama.cpp/pull/5334), and [another one](https://github.com/ggerganov/llama.cpp/pull/5361)

### Perplexity (measuring model quality)

You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).

The perplexity measurements in the table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.
The time per token is measured on a MacBook M1 Pro with 32GB RAM, using 4 and 8 threads.

#### How to run

1. Download/extract: https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
2. Run `./llama-perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
3. Output:
```
perplexity : calculating perplexity over 655 chunks
24.43 seconds per pass - ETA 4.45 hours
[1]4.5970,[2]5.1807,[3]6.0382,...
```
And after 4.45 hours, you will have the final perplexity.

### Interactive mode

If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, combined with `-r "Alice:"`.

Here is an example of a few-shot interaction, invoked with the command

```bash
# default arguments using a 7B model
./examples/chat.sh

# advanced chat with a 13B model
./examples/chat-13B.sh

# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```

Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program.

![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)

### Persistent Interaction

The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all`. The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh`. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.

```bash
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh

# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
```
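
The `--prompt-cache` flag can also be used directly with `llama-cli`, without the helper script. A minimal sketch (placeholder model path): the first run evaluates the prompt and saves its state to the cache file, and later runs with the same prompt reuse that state and start generating sooner.

```bash
# first run: evaluate the prompt and save its state to context.bin
./llama-cli -m ./models/7B/ggml-model-q4_0.gguf --prompt-cache context.bin -f prompts/chat-with-bob.txt -n 64

# later runs with the same prompt and cache file skip most of the prompt processing
./llama-cli -m ./models/7B/ggml-model-q4_0.gguf --prompt-cache context.bin -f prompts/chat-with-bob.txt -n 64
```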

### Constrained output with grammars

`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:

```bash
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
```

The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).

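A grammar can also be passed inline via `--grammar`. The sketch below (placeholder model path) restricts the model to answering either "yes" or "no":

```bash
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -p 'Is the sky blue? Answer:' --grammar 'root ::= "yes" | "no"'
```
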
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community; please file any issues or feature requests on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.

### Obtaining and using the Facebook LLaMA 2 model

- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
- Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke), including:
  - [LLaMA 2 7B base](https://huggingface.co/TheBloke/Llama-2-7B-GGUF)
  - [LLaMA 2 13B base](https://huggingface.co/TheBloke/Llama-2-13B-GGUF)
  - [LLaMA 2 70B base](https://huggingface.co/TheBloke/Llama-2-70B-GGUF)
  - [LLaMA 2 7B chat](https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF)
  - [LLaMA 2 13B chat](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF)
  - [LLaMA 2 70B chat](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF)

### Seminal papers and background on the models

If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
- LLaMA:
    - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
    - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- GPT-3
    - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- GPT-3.5 / InstructGPT / ChatGPT:
    - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
    - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

### Android

#### Build on Android using Termux
[Termux](https://github.com/termux/termux-app#installation) is one way to run `llama.cpp` on an Android device (no root required).
```
apt update && apt upgrade -y
apt install git make cmake
```

It's recommended to move your model inside the `~/` directory for best performance:
```
cd storage/downloads
mv model.gguf ~/
```

[Get the code](https://github.com/ggerganov/llama.cpp#get-the-code) & [follow the Linux build instructions](https://github.com/ggerganov/llama.cpp#build) to build `llama.cpp`.

#### Building the Project using Android NDK
Obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake.

Execute the following commands on your computer to avoid downloading the NDK to your mobile device. Alternatively, you can also do this in Termux:
```
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
```

Install [termux](https://github.com/termux/termux-app#installation) on your device and run `termux-setup-storage` to get access to your SD card (if you are on Android 11+, run the command twice).

Finally, copy these built `llama` binaries and the model file to your device storage. Because the file permissions in the Android sdcard cannot be changed, you can copy the executable files to the `/data/data/com.termux/files/home/bin` path, and then execute the following commands in Termux to make them executable:

(This assumes that you have pushed the built executable files to the `/sdcard/llama.cpp/bin` path using `adb push`.)
```
$ cp -r /sdcard/llama.cpp/bin /data/data/com.termux/files/home/
$ cd /data/data/com.termux/files/home/bin
$ chmod +x ./*
```

Download the model [llama-2-7b-chat.Q4_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_K_M.gguf), push it to `/sdcard/llama.cpp/`, then move it to `/data/data/com.termux/files/home/model/`:

```
$ mv /sdcard/llama.cpp/llama-2-7b-chat.Q4_K_M.gguf /data/data/com.termux/files/home/model/
```

Now, you can start chatting:
```
$ cd /data/data/com.termux/files/home/bin
$ ./llama-cli -m ../model/llama-2-7b-chat.Q4_K_M.gguf -n 128 -cml
```

Here's a demo of an interactive session running on a Pixel 5 phone:

https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4

### Docker

#### Prerequisites
* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (e.g. /llama/models)

#### Images
We have three Docker images available for this project:

1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and quantize them to 4-bit. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)

Additionally, the following images are available, similar to the above:

- `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)

The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library), you'll need to build the images locally for now.

#### Usage

The easiest way to download the models, convert them to ggml and optimize them is with the `--all-in-one` command, which is included in the full Docker image.

Replace `/path/to/models` below with the actual path where you downloaded the models.

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
```

On completion, you are ready to play!

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a light image:

```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a server image:

```bash
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
```

### Docker With CUDA

Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.

#### Building Locally

```bash
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .
```

You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.

The defaults are:

- `CUDA_VERSION` set to `11.7.1`
- `CUDA_DOCKER_ARCH` set to `all`

The resulting images are essentially the same as the non-CUDA images:

1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and quantize them to 4-bit.
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.

#### Usage

After building locally, usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.

```bash
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```

### Contributing

- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues and PRs is very appreciated!
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
- A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)

### Coding guidelines

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy-looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean up any trailing whitespace, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$

![matmul](media/matmul.png)

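As a small worked example of the convention above: if $A$ is an $m \times k$ matrix and $B$ is an $n \times k$ matrix (both sharing the dimension $k$), then `ggml_mul_mat(ctx, A, B)` yields $C = B A^T$ with shape $n \times m$.
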
### Docs

- [main (cli)](./examples/main/README.md)
- [server](./examples/server/README.md)
- [jeopardy](./examples/jeopardy/README.md)
- [BLIS](./docs/BLIS.md)
- [Performance troubleshooting](./docs/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
- [GBNF grammars](./grammars/README.md)