# 🎙️ Speaker Verification and Voiceprint Recognition

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![GitHub stars](https://img.shields.io/github/stars/zhangzijie-pro/Speaker-Verification.svg?style=social)](https://github.com/zhangzijie-pro/Speaker-Verification/stargazers)
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20%26%20Dataset-yellow.svg)](https://huggingface.co/zzj-pro)
![Python](https://img.shields.io/badge/Python-3.9%2B-blue?logo=python)
![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-ee4c2c?logo=pytorch)
![Task](https://img.shields.io/badge/Task-Speaker%20Verification-green)

<div align="center">
  <a href="Readme_ch.md">中文文档 (Chinese)</a> •
  <a href="https://github.com/zhangzijie-pro/Speaker-Verification">GitHub</a> •
  <a href="https://huggingface.co/zzj-pro">Hugging Face</a>
</div>

> A practical speaker verification system based on **ECAPA-TDNN + AAM-Softmax**, trained and evaluated on **CN-Celeb**.

---

## ✨ Features

- **SOTA Backbone**: ECAPA-TDNN (Res2Net + SE + Attentive Statistics Pooling)
- **Strong Discriminative Loss**: AAM-Softmax with angular margin
- **Balanced Sampling**: PK batch sampler (speaker-balanced)
- **Robust Evaluation**: EER, score distribution, t-SNE, Recall@K
- **Stable Inference**: Multi-crop averaging for reliable embeddings
- **Low Memory Design**: Optimized for ~6GB GPU (AMP + gradient clipping)

---

## 📂 Project Structure

```
Speaker-Verification/
│
├── processed/              # Preprocessed features & metadata
│   ├── preprocess_cnceleb2_train.py
│   └── cn_celeb2/          # Preprocessing outputs
│       ├── fbank_pt/       # Saved fbank features (*.pt)
│       ├── train_fbank_list.txt
│       ├── val_meta.jsonl  # Validation metadata (speaker, feature path)
│       └── spk2id.json
│
├── configs/
│   ├── train.yaml
│   └── train_config.py     # Training hyperparameters
│
├── demos/
│   └── real_time.py        # Real-time microphone capture and verification demo
│
├── data/
│   ├── dataset.py          # Train / validation datasets
│   └── pk_sampler.py       # PK batch sampler (speaker-balanced)
│
├── speaker_verification/
│   ├── checkpointing.py
│   ├── inference.py
│   ├── head/
│   │   └── aamsoftmax.py   # AAM-Softmax loss
│   ├── models/
│   │   └── epaca.py        # ECAPA-TDNN backbone
│   └── audio/
│       └── features.py     # Feature extraction
│
├── utils/
│   ├── meters.py           # Accuracy / average meters
│   ├── seed.py             # Reproducibility (random seeds)
│   ├── plot.py             # Training curves
│   ├── export.py           # Export to ONNX / MNN; split backbone and head
│   └── path_utils.py       # Path handling helpers
│
├── outputs/                # Training outputs (checkpoints, curves)
├── outputs_eval/           # Verification results (EER, ROC, DET, t-SNE)
│
├── train.py                # Main training script
├── finetune.py             # Fine-tuning script
├── verify_pairs.py         # Pairwise speaker verification
├── compare_two_wavs.py     # Compare two audio files
│
├── README.md
├── README_ch.md
└── LICENSE
```

---

## 🚀 Quick Start

### 1. Installation

```bash
git clone https://github.com/zhangzijie-pro/Speaker-Verification.git
cd Speaker-Verification
pip install -r requirements.txt
```

### 2. Data Preprocessing

```bash
python processed/preprocess_cnceleb2_train.py
```

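The script above produces the `processed/cn_celeb2/` outputs listed in the project structure. If you want to inspect or regenerate a single feature file, the core operation is extracting an 80-dim log Mel-filterbank at 16 kHz and saving it as a `.pt` tensor; a minimal sketch with `torchaudio` (illustrative only, not the repository's exact code):

```python
# Minimal sketch of one preprocessing step: 80-dim log Mel-filterbank at 16 kHz,
# saved as a .pt tensor. Paths and parameters here are illustrative assumptions.
import torch
import torchaudio

wav, sr = torchaudio.load("example.wav")                  # any utterance
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)  # resample to 16 kHz

fbank = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, sample_frequency=16000          # (frames, 80) log-fbank
)
torch.save(fbank, "example_fbank.pt")                     # analogous to the files in fbank_pt/
```
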
### 3. Training

```bash
# Train with default config
python train.py

# Override parameters via command line
python train.py train.epochs=100 train.lr=5e-4 train.emb_dim=256
```

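The Features section notes that training fits in roughly 6 GB of GPU memory by combining automatic mixed precision with gradient clipping. For readers unfamiliar with that pattern, here is a rough sketch of such a training step; `model`, `aam_head`, `optimizer`, `train_loader`, and the clipping value are illustrative assumptions, not the repository's exact code:

```python
# Illustrative AMP + gradient-clipping training step (not the repository's exact code).
import torch

scaler = torch.cuda.amp.GradScaler()

for fbank, spk_id in train_loader:                 # batches produced by the PK sampler
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                # mixed-precision forward pass
        emb = model(fbank.cuda())                  # ECAPA-TDNN embeddings
        loss = aam_head(emb, spk_id.cuda())        # AAM-Softmax loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # assumed clip value
    scaler.step(optimizer)
    scaler.update()
```
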
---

## 📈 Evaluation (Speaker Verification)

### Run full evaluation

```bash
python verify_pairs.py \
    --val_meta processed/cn_celeb2/val_meta.jsonl \
    --ckpt outputs/best.pt \
    --out_dir outputs_eval
```

**Outputs**:
- `roc.png`, `det.png`, `score_hist.png`
- `tsne.png` (speaker clustering)
- `metrics.txt` (EER, Recall@K, etc.)

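The EER reported in `metrics.txt` is the operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of computing it from cosine scores and same/different-speaker labels, assuming `scikit-learn` is available (not the repository's exact code):

```python
# Illustrative EER computation from verification scores (not the repository's exact code).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """scores: cosine similarities; labels: 1 = same speaker, 0 = different speaker."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))          # point where FAR is closest to FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example: perfectly separated scores give an EER of 0.0
# compute_eer(np.array([0.9, 0.7, 0.3, 0.1]), np.array([1, 1, 0, 0]))
```
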
---

## 🎯 Single Audio Comparison (Most Common Use Case)

```bash
python compare_two_wavs.py \
    --wav1 test1.wav \
    --wav2 test2.wav \
    --ckpt outputs/export/model.onnx   # Supports ONNX
```

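Internally this comparison amounts to extracting one L2-normalized embedding per file and taking the cosine similarity. A small `onnxruntime` sketch of that step; the input layout, tensor names, and the `fbank1` / `fbank2` feature arrays are assumptions, not the repository's exact interface:

```python
# Illustrative ONNX comparison (not the repository's exact code). `fbank1` and `fbank2`
# are assumed to be float32 (frames, 80) feature arrays for the two utterances.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("outputs/export/model.onnx")
input_name = sess.get_inputs()[0].name

def embed(fbank):
    x = fbank[None, ...].astype(np.float32)        # add batch dimension (layout is an assumption)
    emb = sess.run(None, {input_name: x})[0][0]
    return emb / np.linalg.norm(emb)               # L2-normalize, as described in Model Overview

score = float(np.dot(embed(fbank1), embed(fbank2)))  # cosine similarity in [-1, 1]
print(f"cosine similarity: {score:.3f}")
```
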
---

## 🛠️ Model Export (Deployment)

```bash
# One-click export to ONNX + MNN
python utils/export.py \
    --ckpt outputs/best.pt \
    --out_dir outputs/deploy \
    --onnx --mnn
```

**Supported deployment targets**:
- **ONNX Runtime** (Python / C++)
- **MNN** (mobile / edge)
- **TensorRT** (high-performance server)

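For reference, the ONNX half of an export like this usually reduces to a single `torch.onnx.export` call with a dynamic time axis. A rough sketch under assumed input layout and tensor names (not the repository's exact code):

```python
# Illustrative ONNX export of the embedding backbone (not the repository's exact code).
# `model` is assumed to be the trained ECAPA-TDNN with the AAM-Softmax head removed.
import torch

model.eval()
dummy = torch.randn(1, 200, 80)                    # (batch, frames, mel) -- layout is an assumption
torch.onnx.export(
    model,
    dummy,
    "outputs/deploy/model.onnx",
    input_names=["fbank"],
    output_names=["embedding"],
    dynamic_axes={"fbank": {0: "batch", 1: "frames"}},
    opset_version=17,
)
```
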
---

## 🧠 Model Overview

### Backbone

- **ECAPA-TDNN**
  - Res2Net-style temporal convolutions
  - Squeeze-and-Excitation (SE)
  - Attentive Statistics Pooling
- Embedding dimension: **192 / 256**

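Attentive Statistics Pooling is the step that turns a variable-length frame sequence into one fixed-size utterance vector: a small attention network weights each frame, and the weighted mean and standard deviation are concatenated. A condensed PyTorch sketch of the idea (illustrative, not the repository's exact layer):

```python
# Condensed sketch of attentive statistics pooling (illustrative, not the repository's layer).
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x):                                   # x: (batch, channels, frames)
        w = torch.softmax(self.attention(x), dim=2)         # per-frame attention weights
        mean = torch.sum(w * x, dim=2)                      # weighted mean
        var = torch.sum(w * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-6))               # weighted standard deviation
        return torch.cat([mean, std], dim=1)                # (batch, 2 * channels)
```
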
### Loss

- **AAM-Softmax (Additive Angular Margin Softmax)**
  - Encourages large inter-speaker margins
  - Used only during training

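AAM-Softmax scores each embedding against per-speaker weight vectors by cosine, adds an angular margin `m` to the target speaker, and scales by `s` before cross-entropy. A compact sketch of that forward pass; the `s` and `m` values are assumptions, and `head/aamsoftmax.py` holds the actual implementation:

```python
# Compact AAM-Softmax forward pass (illustrative; see head/aamsoftmax.py for the real one).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim: int, num_speakers: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # Add the angular margin m only to the target class, then scale by s.
        logits = torch.where(target, torch.cos(theta + self.m), cosine) * self.s
        return F.cross_entropy(logits, labels)
```
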
### Embedding

- L2-normalized speaker embeddings
- Cosine similarity for verification

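At inference time, the multi-crop averaging mentioned under Features embeds several fixed-length crops of an utterance and averages the normalized results before cosine scoring. A small sketch of that idea; `model`, the crop placement, and the default lengths are assumptions borrowed from the recommended configuration below:

```python
# Illustrative multi-crop embedding averaging (not the repository's exact code).
import torch
import torch.nn.functional as F

def utterance_embedding(model, fbank, crop_frames=400, num_crops=6):
    """fbank: (frames, 80) tensor; returns one L2-normalized embedding."""
    frames = fbank.size(0)
    crops = []
    for i in range(num_crops):
        if frames <= crop_frames:
            crop = fbank                              # short utterance: use it whole
        else:
            start = i * (frames - crop_frames) // max(num_crops - 1, 1)
            crop = fbank[start:start + crop_frames]   # evenly spaced crops
        crops.append(crop.unsqueeze(0))
    with torch.no_grad():
        embs = torch.cat([F.normalize(model(c), dim=-1) for c in crops], dim=0)
    return F.normalize(embs.mean(dim=0), dim=-1)      # average, then re-normalize
```
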
---

## 📊 Dataset

- **CN-Celeb**
  - ~1,000 speakers
  - Highly diverse recording conditions
- Split:
  - `train`: speaker-disjoint
  - `val`: speaker-disjoint
- Features:
  - 80-dim log Mel-filterbank
  - 16 kHz sampling rate

---

## 📌 Recommended Configuration (6GB GPU)

```yaml
# configs/train.yaml
emb_dim: 256
channels: 512
lr: 1e-3
epochs: 80
crop_frames: 200          # Training crop length (frames)
crop_frames_val: 400      # Validation crop length (frames)
num_crops: 6              # Crops per utterance for multi-crop averaging
p: 32                     # Speakers per batch (PK sampler)
k: 4                      # Utterances per speaker (PK sampler)
```

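With `p: 32` and `k: 4`, every mini-batch holds 32 speakers with 4 utterances each, i.e. 128 samples. A bare-bones sketch of how such a speaker-balanced PK sampler can be built; `data/pk_sampler.py` contains the actual implementation:

```python
# Bare-bones PK batch sampler sketch (illustrative; see data/pk_sampler.py for the real one).
import random
from collections import defaultdict

def pk_batches(utt_speakers, p=32, k=4):
    """utt_speakers: list of speaker ids, one per utterance index.
    Yields batches (lists of utterance indices) with p speakers and k utterances each."""
    by_spk = defaultdict(list)
    for idx, spk in enumerate(utt_speakers):
        by_spk[spk].append(idx)
    speakers = [s for s, utts in by_spk.items() if len(utts) >= k]
    random.shuffle(speakers)
    for i in range(0, len(speakers) - p + 1, p):
        batch = []
        for spk in speakers[i:i + p]:
            batch.extend(random.sample(by_spk[spk], k))   # k utterances per speaker
        yield batch                                       # p * k indices
```
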
---

## 🔮 Future Improvements

- [x] Hydra configuration
- [x] Parallel preprocessing
- [x] ONNX / MNN export
- [ ] Noise / RIR augmentation

---

## 📜 License

This project is released under the **Apache License 2.0**.  
The CN-Celeb dataset follows its original license and usage terms.

---

## 🙋 Notes

This repository is intended for:

- Learning speaker verification systems

It is **not** an off-the-shelf commercial system.