# AI Model Cheatsheet

## 🏗️ 1. Prerequisites & Installation

On the M3 Max (Arch Linux ARM), we optimize for **Unified Memory** and **AMX (Apple Matrix Extension)**, because GPU acceleration for the M3 series is still experimental on Linux as of early 2026 (Metal itself is macOS-only, so inference here runs on the CPU).

### System Setup

```bash
# Update and install core tools
sudo pacman -Syu
sudo pacman -S base-devel git aria2 cmake python python-pip

# Install the standalone Hugging Face CLI (Recommended)
curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="$HOME/.huggingface/bin:$PATH"

# Install Ollama (Official Script)
curl -fsSL https://ollama.com/install.sh | sh
```
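
A quick sanity check that everything landed on your `PATH` (a minimal sketch; it only assumes the installers used their default locations):

```bash
# Confirm the binaries are reachable before moving on
command -v aria2c ollama hf || echo "something is missing from PATH"
ollama --version
aria2c --version | head -n 1
```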

---

## 📁 2. Understanding Model Formats

Based on the [Hugging Face Guide](https://huggingface.co/blog/ngxson/common-ai-model-formats), choosing the right format is critical for M3 Max performance.

| Format | Best For | Hardware Support (M3 Max) |
| --- | --- | --- |
| **GGUF** | **Local CPU Inference** | ✅ Best (Native support for AMX/mmap) |
| **Safetensors** | Training & GPU Workflows | ✅ Good (Used by MLX framework) |
| **PyTorch** | Research & Prototyping | 🟡 Partial (Performance varies on ARM) |
| **ONNX** | Browser & Edge Apps | ✅ Good (Via ONNX Runtime) |

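If you are not sure what a downloaded file actually is, the container formats are easy to tell apart from their first bytes (a quick sketch; `model.safetensors` and `pytorch_model.bin` are placeholder names for whatever you downloaded):

```bash
# GGUF files start with the four ASCII bytes "GGUF"
head -c 4 ~/.hf-cache/qwen-32b-coder.gguf ; echo

# Safetensors files start with an 8-byte little-endian header length,
# followed by a JSON index describing every tensor
tail -c +9 model.safetensors | head -c 100 ; echo

# Modern PyTorch .bin/.pt checkpoints are zip archives, so they start with "PK"
head -c 2 pytorch_model.bin ; echo
```
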
---

## 📥 3. The Download Guide

### Method A: The "Speed Demon" (`aria2c`)

Use this for the fastest raw download speeds by splitting the file into 16 parallel streams.

```bash
# Example for Qwen 2.5 Coder (32B Quantized GGUF)
aria2c --max-download-limit=0 -x 16 -s 16 -k 1M -c \
  "https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf?download=true" \
  -d ~/.hf-cache \
  -o qwen-32b-coder.gguf
```
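
`aria2c` resumes broken transfers (`-c`) but does not verify the result against the upstream checksum, so it is worth comparing the hash by hand (the expected value is whatever the model's "Files" page on Hugging Face lists; the line below only shows where it would go, not the real hash):

```bash
# Compare against the SHA256 shown on the Hugging Face file listing
sha256sum ~/.hf-cache/qwen-32b-coder.gguf
# -> <expected-sha256-from-the-files-page>  qwen-32b-coder.gguf
```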

### Method B: The "Smart Manager" (`hf` tool)

The official CLI manages your cache and verifies file integrity automatically.

```bash
# Download a specific GGUF file using the official tool
hf download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/.hf-cache
```
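
To close the speed gap with Method A, the Hub client can optionally use the `hf_transfer` backend (a sketch, assuming `hf_transfer` is installed and that the standalone CLI honours the same `HF_HUB_ENABLE_HF_TRANSFER` variable as the `huggingface_hub` Python library):

```bash
# Optional Rust-based parallel downloader for the Hugging Face Hub
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

hf download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf" \
  --local-dir ~/.hf-cache
```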

---

## 🦙 4. Ollama Integration & Modelfiles

Ollama can serve models in different formats by using a `Modelfile`. Here are examples for each major format:

### Example 1: Importing a GGUF (The M3 Max "Standard")

Since GGUF is a single-file format containing both weights and metadata, the Modelfile is simple.

```dockerfile
# Modelfile-GGUF
FROM /home/user/.hf-cache/qwen-32b-coder.gguf
PARAMETER num_thread 12
PARAMETER num_ctx 32768
SYSTEM "You are a coding assistant running on an M3 Max."
```

### Example 2: Using Safetensors (MLX/Modern)

Safetensors are preferred for safety (no code execution) and lazy loading.

```dockerfile
# Modelfile-Safetensors
# Note: Ollama usually handles conversion, but you can point to the directory
FROM ./model-directory-containing-safetensors
PARAMETER temperature 0.7
```
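
For the directory import to work, Ollama expects a Hugging Face-style model folder, i.e. the weight shards plus their config and tokenizer files (a sketch of the typical layout; exact file names and shard counts vary by model):

```bash
# Typical contents of a safetensors model directory
ls ./model-directory-containing-safetensors
# config.json
# generation_config.json
# model-00001-of-00002.safetensors
# model-00002-of-00002.safetensors
# model.safetensors.index.json
# tokenizer.json
# tokenizer_config.json
```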

### Example 3: Legacy PyTorch (.bin / .pt)

Use this if you are working with older research models that haven't been converted yet. Support for raw PyTorch checkpoints varies between Ollama releases, so if the import fails, convert the weights to GGUF first (see the sketch after the Modelfile).

```dockerfile
# Modelfile-PyTorch
FROM ./pytorch_model.bin
# Requires manual setting of template since .bin lacks GGUF metadata
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
```
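
If the raw checkpoint will not import, the usual fallback is converting it to GGUF with llama.cpp's converter and pointing the Modelfile at the result (a sketch, assuming a Hugging Face-style model directory and a local llama.cpp checkout):

```bash
# Convert a Hugging Face / PyTorch model directory to GGUF (f16)
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

python llama.cpp/convert_hf_to_gguf.py ./my-model-directory \
  --outfile my-model-f16.gguf --outtype f16
```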

**To create the model in Ollama:**

```bash
ollama create my-custom-model -f Modelfile
ollama run my-custom-model
```
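
To confirm the Modelfile settings actually made it into the build, `ollama show` prints back the stored template and parameters (a quick check; the model name is whatever you passed to `ollama create`):

```bash
# Inspect the template and parameters baked into the model
ollama show my-custom-model --modelfile
ollama show my-custom-model --parameters
```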

---

## 🚀 5. Hardware Performance Tips

* **Memory Bandwidth:** The M3 Max has up to 400 GB/s of memory bandwidth. GGUF models take full advantage of this via `mmap`, which maps the file directly into memory so weights are paged in on demand instead of copied.
* **Quantization Picks:**
  * **Q4_K_M:** Best balance for 32B models.
  * **IQ4_XS:** Slightly smaller than Q4_K_M at comparable quality; a good fit for 8B models.
  * **Q8_0:** Use this for 14B models if you have 64GB+ RAM for near-perfect accuracy.
* **Parallelism:** Always set your thread count to match your performance cores (`-t 12` on the M3 Max); see the sketch after this list.
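
Putting the tips together for a llama.cpp run (a sketch; the flags are from llama.cpp's `llama-cli`, and `-t 12` assumes the 12 performance cores of the 16-core M3 Max):

```bash
# Run the quantized model pinned to the performance cores,
# with mmap left on (the default) so weights are paged in lazily
./llama-cli -m ~/.hf-cache/qwen-32b-coder.gguf \
  -t 12 \
  -c 32768 \
  --mlock \
  -p "Write a FizzBuzz in Rust."
```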

---

*Reference: [Common AI Model Formats - Hugging Face](https://huggingface.co/blog/ngxson/common-ai-model-formats)*