# 🎙️ Speaker Verification and Voiceprint Recognition

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub stars](https://img.shields.io/github/stars/zhangzijie-pro/Speaker-Verification.svg?style=social)](https://github.com/zhangzijie-pro/Speaker-Verification/stargazers)
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20%26%20Dataset-yellow.svg)](https://huggingface.co/zzj-pro)
![Python](https://img.shields.io/badge/Python-3.9%2B-blue?logo=python)
![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-ee4c2c?logo=pytorch)
![Task](https://img.shields.io/badge/Task-Speaker%20Verification-green)

<div align="center">
  <a href="Readme_ch.md">中文文档 (Chinese)</a> •
  <a href="https://github.com/zhangzijie-pro/Speaker-Verification">GitHub</a> •
  <a href="https://huggingface.co/zzj-pro">Hugging Face</a>
</div>

> A practical speaker verification system based on **ECAPA-TDNN + AAM-Softmax**, trained and evaluated on **CN-Celeb**.

---

## ✨ Features

- **SOTA Backbone**: ECAPA-TDNN (Res2Net + SE + Attentive Statistics Pooling)
- **Strong Discriminative Loss**: AAM-Softmax with angular margin
- **Balanced Sampling**: PK batch sampler (speaker-balanced)
- **Robust Evaluation**: EER, score distribution, t-SNE, Recall@K
- **Stable Inference**: Multi-crop averaging for reliable embeddings
- **Low Memory Design**: Optimized for ~6GB GPU (AMP + gradient clipping)

---

## 📂 Project Structure

```
Speaker-Verification/
│
├── processed/              # Preprocessed features & metadata
│   ├── preprocess_cnceleb2_train.py
│   └── cn_celeb2/          # Preprocessing outputs
│       ├── fbank_pt/       # Saved fbank features (*.pt)
│       ├── train_fbank_list.txt
│       ├── val_meta.jsonl  # Validation metadata (speaker, feature path)
│       └── spk2id.json
│
├── configs/
│   ├── train.yaml
│   └── train_config.py     # Training hyperparameters
│
├── demos/
│   └── real_time.py        # Real-time microphone capture and verification demo
│
├── data/
│   ├── dataset.py          # Train / validation datasets
│   └── pk_sampler.py       # PK batch sampler (speaker-balanced)
│
├── speaker_verification/
│   ├── checkpointing.py
│   ├── inference.py
│   ├── head/
│   │   └── aamsoftmax.py   # AAM-Softmax loss
│   ├── models/
│   │   └── epaca.py        # ECAPA-TDNN backbone
│   └── audio/
│       └── features.py     # Feature extraction
│
├── utils/
│   ├── meters.py           # Accuracy / average meters
│   ├── seed.py             # Reproducibility (random seeds)
│   ├── plot.py             # Training curves
│   ├── export.py           # Export to ONNX / MNN; split backbone and head
│   └── path_utils.py       # Path handling helpers
│
├── outputs/                # Training outputs (checkpoints, curves)
├── outputs_eval/           # Verification results (EER, ROC, DET, t-SNE)
│
├── train.py                # Main training script
├── finetune.py             # Fine-tuning script
├── verify_pairs.py         # Pairwise speaker verification
├── compare_two_wavs.py     # Compare two audio files
│
├── README.md
├── README_ch.md
└── LICENSE
```

---

## 🚀 Quick Start

### 1. Installation

```bash
git clone https://github.com/zhangzijie-pro/Speaker-Verification.git
cd Speaker-Verification
pip install -r requirements.txt
```

### 2. Data Preprocessing

```bash
python processed/preprocess_cnceleb2_train.py
```

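The script above produces the `processed/cn_celeb2/` outputs listed in the project structure. If you want to inspect or regenerate a single feature file, the core operation is extracting an 80-dim log Mel-filterbank at 16 kHz and saving it as a `.pt` tensor; a minimal sketch with `torchaudio` (illustrative only, not the repository's exact code):

```python
# Minimal sketch of one preprocessing step: 80-dim log Mel-filterbank at 16 kHz,
# saved as a .pt tensor. Paths and parameters here are illustrative assumptions.
import torch
import torchaudio

wav, sr = torchaudio.load("example.wav")                  # any utterance
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)  # resample to 16 kHz

fbank = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, sample_frequency=16000          # (frames, 80) log-fbank
)
torch.save(fbank, "example_fbank.pt")                     # analogous to the files in fbank_pt/
```
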
### 3. Training

```bash
# Train with default config
python train.py

# Override parameters via command line
python train.py train.epochs=100 train.lr=5e-4 train.emb_dim=256
```

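The Features section notes that training fits in roughly 6 GB of GPU memory by combining automatic mixed precision with gradient clipping. For readers unfamiliar with that pattern, here is a rough sketch of such a training step; `model`, `aam_head`, `optimizer`, `train_loader`, and the clipping value are illustrative assumptions, not the repository's exact code:

```python
# Illustrative AMP + gradient-clipping training step (not the repository's exact code).
import torch

scaler = torch.cuda.amp.GradScaler()

for fbank, spk_id in train_loader:                 # batches produced by the PK sampler
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # mixed-precision forward pass
        emb = model(fbank.cuda())                  # ECAPA-TDNN embeddings
        loss = aam_head(emb, spk_id.cuda())        # AAM-Softmax loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # assumed clip value
    scaler.step(optimizer)
    scaler.update()
```
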
---

## 📈 Evaluation (Speaker Verification)

### Run full evaluation

```bash
python verify_pairs.py \
    --val_meta processed/cn_celeb2/val_meta.jsonl \
    --ckpt outputs/best.pt \
    --out_dir outputs_eval
```

**Outputs**:
- `roc.png`, `det.png`, `score_hist.png`
- `tsne.png` (speaker clustering)
- `metrics.txt` (EER, Recall@K, etc.)

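The EER reported in `metrics.txt` is the operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of computing it from cosine scores and same/different-speaker labels, assuming `scikit-learn` is available (not the repository's exact code):

```python
# Illustrative EER computation from verification scores (not the repository's exact code).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """scores: cosine similarities; labels: 1 = same speaker, 0 = different speaker."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))          # point where FAR is closest to FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: perfectly separated scores give an EER of 0.0
# compute_eer(np.array([0.9, 0.7, 0.3, 0.1]), np.array([1, 1, 0, 0]))
```
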
---

## 🎯 Single Audio Comparison (Most Common Use Case)

```bash
python compare_two_wavs.py \
    --wav1 test1.wav \
    --wav2 test2.wav \
    --ckpt outputs/export/model.onnx   # Supports ONNX
```

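Internally this comparison amounts to extracting one L2-normalized embedding per file and taking the cosine similarity. A small `onnxruntime` sketch of that step; the input layout, tensor names, and the `fbank1` / `fbank2` feature arrays are assumptions, not the repository's exact interface:

```python
# Illustrative ONNX comparison (not the repository's exact code). `fbank1` and `fbank2`
# are assumed to be float32 (frames, 80) feature arrays for the two utterances.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("outputs/export/model.onnx")
input_name = sess.get_inputs()[0].name

def embed(fbank):
    x = fbank[None, ...].astype(np.float32)        # add batch dimension (layout is an assumption)
    emb = sess.run(None, {input_name: x})[0][0]
    return emb / np.linalg.norm(emb)               # L2-normalize, as described in Model Overview

score = float(np.dot(embed(fbank1), embed(fbank2)))  # cosine similarity in [-1, 1]
print(f"cosine similarity: {score:.3f}")
```
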
---

## 🛠️ Model Export (Deployment)

```bash
# One-click export to ONNX + MNN
python utils/export.py \
    --ckpt outputs/best.pt \
    --out_dir outputs/deploy \
    --onnx --mnn
```

**Supported deployment targets**:
- **ONNX Runtime** (Python / C++)
- **MNN** (mobile / edge)
- **TensorRT** (high-performance server)

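For reference, the ONNX half of an export like this usually reduces to a single `torch.onnx.export` call with a dynamic time axis. A rough sketch under assumed input layout and tensor names (not the repository's exact code):

```python
# Illustrative ONNX export of the embedding backbone (not the repository's exact code).
# `model` is assumed to be the trained ECAPA-TDNN with the AAM-Softmax head removed.
import torch

model.eval()
dummy = torch.randn(1, 200, 80)                    # (batch, frames, mel) -- layout is an assumption
torch.onnx.export(
    model,
    dummy,
    "outputs/deploy/model.onnx",
    input_names=["fbank"],
    output_names=["embedding"],
    dynamic_axes={"fbank": {0: "batch", 1: "frames"}},
    opset_version=17,
)
```
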
---

## 🧠 Model Overview

### Backbone

- **ECAPA-TDNN**
  - Res2Net-style temporal convolutions
  - Squeeze-and-Excitation (SE)
  - Attentive Statistics Pooling
- Embedding dimension: **192 / 256**

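Attentive Statistics Pooling is the step that turns a variable-length frame sequence into one fixed-size utterance vector: a small attention network weights each frame, and the weighted mean and standard deviation are concatenated. A condensed PyTorch sketch of the idea (illustrative, not the repository's exact layer):

```python
# Condensed sketch of attentive statistics pooling (illustrative, not the repository's layer).
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):                                   # x: (batch, channels, frames)
        w = torch.softmax(self.attention(x), dim=2)         # per-frame attention weights
        mean = torch.sum(w * x, dim=2)                      # weighted mean
        var = torch.sum(w * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-6))               # weighted standard deviation
        return torch.cat([mean, std], dim=1)                # (batch, 2 * channels)
```
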
### Loss

- **AAM-Softmax (Additive Angular Margin Softmax)**
  - Encourages large inter-speaker margins
  - Used only during training

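AAM-Softmax scores each embedding against per-speaker weight vectors by cosine, adds an angular margin `m` to the target speaker, and scales by `s` before cross-entropy. A compact sketch of that forward pass; the `s` and `m` values are assumptions, and `head/aamsoftmax.py` holds the actual implementation:

```python
# Compact AAM-Softmax forward pass (illustrative; see head/aamsoftmax.py for the real one).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim: int, num_speakers: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # Add the angular margin m only to the target class, then scale by s.
        logits = torch.where(target, torch.cos(theta + self.m), cosine) * self.s
        return F.cross_entropy(logits, labels)
```
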
### Embedding

- L2-normalized speaker embeddings
- Cosine similarity for verification

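At inference time, the multi-crop averaging mentioned under Features embeds several fixed-length crops of an utterance and averages the normalized results before cosine scoring. A small sketch of that idea; `model`, the crop placement, and the default lengths are assumptions borrowed from the recommended configuration below:

```python
# Illustrative multi-crop embedding averaging (not the repository's exact code).
import torch
import torch.nn.functional as F

def utterance_embedding(model, fbank, crop_frames=400, num_crops=6):
    """fbank: (frames, 80) tensor; returns one L2-normalized embedding."""
    frames = fbank.size(0)
    crops = []
    for i in range(num_crops):
        if frames <= crop_frames:
            crop = fbank                              # short utterance: use it whole
        else:
            start = i * (frames - crop_frames) // max(num_crops - 1, 1)
            crop = fbank[start:start + crop_frames]   # evenly spaced crops
        crops.append(crop.unsqueeze(0))
    with torch.no_grad():
        embs = torch.cat([F.normalize(model(c), dim=-1) for c in crops], dim=0)
    return F.normalize(embs.mean(dim=0), dim=-1)      # average, then re-normalize
```
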
---

## 📊 Dataset

- **CN-Celeb**
  - ~1,000 speakers
  - Highly diverse recording conditions
- Split:
  - `train`: speaker-disjoint
  - `val`: speaker-disjoint
- Features:
  - 80-dim log Mel-filterbank
  - 16 kHz sampling rate

---

## 📌 Recommended Configuration (6GB GPU)

```yaml
# configs/train.yaml
emb_dim: 256
channels: 512
lr: 1e-3
epochs: 80
crop_frames: 200          # Training crop length (frames)
crop_frames_val: 400      # Validation crop length (frames)
num_crops: 6              # Crops per utterance for multi-crop averaging
p: 32                     # Speakers per batch (PK sampler)
k: 4                      # Utterances per speaker (PK sampler)
```

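With `p: 32` and `k: 4`, every mini-batch holds 32 speakers with 4 utterances each, i.e. 128 samples. A bare-bones sketch of how such a speaker-balanced PK sampler can be built; `data/pk_sampler.py` contains the actual implementation:

```python
# Bare-bones PK batch sampler sketch (illustrative; see data/pk_sampler.py for the real one).
import random
from collections import defaultdict

def pk_batches(utt_speakers, p=32, k=4):
    """utt_speakers: list of speaker ids, one per utterance index.
    Yields batches (lists of utterance indices) with p speakers and k utterances each."""
    by_spk = defaultdict(list)
    for idx, spk in enumerate(utt_speakers):
        by_spk[spk].append(idx)
    speakers = [s for s, utts in by_spk.items() if len(utts) >= k]
    random.shuffle(speakers)
    for i in range(0, len(speakers) - p + 1, p):
        batch = []
        for spk in speakers[i:i + p]:
            batch.extend(random.sample(by_spk[spk], k))   # k utterances per speaker
        yield batch                                       # p * k indices
```
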
---

## 🔮 Future Improvements

- [x] Hydra configuration
- [x] Parallel preprocessing
- [x] ONNX / MNN export
- [ ] Noise / RIR augmentation

---

## 📜 License

This project is released under the **Apache License 2.0**.  
The CN-Celeb dataset follows its original license and usage terms.

---

## 🙋 Notes

This repository is intended for:

- Learning speaker verification systems

It is **not** an off-the-shelf commercial system.