# AI Model Cheatsheet

## 🏗️ 1. Prerequisites & Installation

On the M3 Max (Arch ARM), we optimize for **Unified Memory** and **AMX (Apple Matrix Extension)** because GPU acceleration (Metal) for the M3 series is still experimental on Linux as of early 2026.

### System Setup

```bash
# Update and install core tools
sudo pacman -Syu
sudo pacman -S base-devel git aria2 cmake python python-pip

# Install the standalone Hugging Face CLI (Recommended)
curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="$HOME/.huggingface/bin:$PATH"

# Install Ollama (Official Script)
curl -fsSL https://ollama.com/install.sh | sh
```

---

## 📁 2. Understanding Model Formats

Based on the [Hugging Face Guide](https://huggingface.co/blog/ngxson/common-ai-model-formats), choosing the right format is critical for M3 Max performance.

| Format | Best For | Hardware Support (M3 Max) |
| --- | --- | --- |
| **GGUF** | **Local CPU Inference** | ✅ Best (Native support for AMX/mmap) |
| **Safetensors** | Training & GPU Workflows | ✅ Good (Used by MLX framework) |
| **PyTorch** | Research & Prototyping | 🟡 Partial (Performance varies on ARM) |
| **ONNX** | Browser & Edge Apps | ✅ Good (Via ONNX Runtime) |

---

## 📥 3. The Download Guide

### Method A: The "Speed Demon" (`aria2c`)

Use this for the fastest raw download speeds by splitting the file into 16 parallel streams.

```bash
# Example for Qwen 2.5 Coder (32B Quantized GGUF)
aria2c --max-download-limit=0 -x 16 -s 16 -k 1M -c \
  "https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf?download=true" \
  -d ~/.hf-cache \
  -o qwen-32b-coder.gguf
```

### Method B: The "Smart Manager" (`hf` tool)

The official CLI manages your cache and verifies file integrity automatically.

```bash
# Download a specific GGUF file using the official tool
hf download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/.hf-cache
```

---

## 🦙 4. Ollama Integration & Modelfiles

Ollama can serve models in different formats by using a `Modelfile`. Here are examples for each major format:

### Example 1: Importing a GGUF (The M3 Max "Standard")

Since GGUF is a single-file format containing both weights and metadata, the Modelfile is simple.

```dockerfile
# Modelfile-GGUF
FROM /home/user/.hf-cache/qwen-32b-coder.gguf
PARAMETER num_thread 12
PARAMETER num_ctx 32768
SYSTEM "You are a coding assistant running on an M3 Max."
```

### Example 2: Using Safetensors (MLX/Modern)

Safetensors are preferred for safety (no code execution) and lazy loading.

```dockerfile
# Modelfile-Safetensors
# Note: Ollama usually handles conversion, but you can point to the directory
FROM ./model-directory-containing-safetensors
PARAMETER temperature 0.7
```
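If you would rather do the conversion yourself (so you end up with a single reusable GGUF, as in Example 1), llama.cpp ships a converter script. The sketch below makes a few assumptions: you have a recent llama.cpp checkout, the script is still named `convert_hf_to_gguf.py` (its flags can change between releases), and `~/models/my-safetensors-model` is a placeholder for a downloaded Safetensors model directory.

```bash
# One-time setup: clone llama.cpp and install the converter's Python deps
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert a Hugging Face Safetensors directory into a single GGUF file.
# --outtype f16 keeps half-precision weights; quantize afterwards if needed.
python convert_hf_to_gguf.py ~/models/my-safetensors-model \
  --outfile ~/.hf-cache/my-model-f16.gguf \
  --outtype f16
```

The resulting `.gguf` file can then be referenced with `FROM` in a Modelfile exactly like Example 1.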
### Example 3: Legacy PyTorch (.bin / .pt)

Use this if you are working with older research models that haven't been converted yet.

```dockerfile
# Modelfile-PyTorch
FROM ./pytorch_model.bin
# Requires manual setting of template since .bin lacks GGUF metadata
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
```

**To create the model in Ollama:**

```bash
ollama create my-custom-model -f Modelfile
ollama run my-custom-model
```

---

## 🚀 5. Hardware Performance Tips

* **Memory Bandwidth:** The M3 Max has up to 400 GB/s of memory bandwidth. GGUF models take full advantage of this by using `mmap`, which maps the file directly into memory.
* **Quantization Picks:**
  * **Q4_K_M:** Best balance for 32B models.
  * **IQ4_XS:** Slightly smaller than Q4_K_M at comparable quality; useful for squeezing a larger model into less RAM.
  * **Q8_0:** Near-lossless; use it for models up to ~14B if you have 64GB+ RAM.
* **Parallelism:** Always set your threads to match your performance cores (`-t 12` for the M3 Max); a full example invocation is sketched in the appendix below.

---

*Reference: [Common AI Model Formats - Hugging Face](https://huggingface.co/blog/ngxson/common-ai-model-formats)*
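## 🧪 Appendix: Example llama.cpp Run

Putting the Section 5 tips into practice, here is a minimal sketch of a llama.cpp invocation using the GGUF downloaded in Section 3. It assumes you have built llama.cpp locally; the binary name (`llama-cli`) and its defaults vary between releases, so treat the flags as a starting point rather than a definitive recipe.

```bash
# -m : the GGUF downloaded in Section 3 (memory-mapped by default, so pages are
#      shared with the OS file cache instead of copied)
# -t : thread count matched to the M3 Max's 12 performance cores
# -c : context window in tokens; raise it if you have unified memory to spare
# -p : the prompt to run
./llama-cli \
  -m ~/.hf-cache/qwen-32b-coder.gguf \
  -t 12 \
  -c 8192 \
  -p "Write a function that reverses a string."
```

If you serve the same model through Ollama instead, the equivalent knobs are the `num_thread` and `num_ctx` parameters shown in the GGUF Modelfile in Section 4.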