# Deduplication Guide

## Overview

Deduplication is a storage optimization technique that identifies and eliminates duplicate data, storing only one copy of each unique data chunk. The KeepSync deduplication system applies this technique so that files can be stored and transferred efficiently, with redundant data written or sent only once.

## How It Works

The deduplication system works as follows (a minimal end-to-end sketch follows the list):

1. **Chunking**: Files are divided into variable-sized chunks using a content-defined chunking algorithm.
2. **Hashing**: Each chunk is hashed with a cryptographic hash function (SHA-256) to produce a unique identifier.
3. **Storage**: Chunks are stored in a content-addressable storage system, where the hash serves as the key.
4. **Manifests**: A file manifest is created that references the stored chunks, allowing the file to be reconstructed efficiently.
5. **Compression**: Optionally, chunks can be compressed to further reduce storage requirements.

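The sketch below strings these five steps together in plain Go. It is purely illustrative: it uses fixed-size chunks instead of content-defined ones for brevity, stores chunks in an in-memory map, and the `Manifest` type is a hypothetical stand-in rather than the actual KeepSync type.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Manifest is a hypothetical stand-in: an ordered list of chunk hashes
// from which the original data can be rebuilt.
type Manifest struct {
	ChunkHashes []string
}

// store is a toy content-addressable store: hex SHA-256 hash -> chunk bytes.
var store = map[string][]byte{}

// dedupe splits data into fixed-size chunks (the real system uses
// content-defined, variable-sized chunks), hashes each chunk, and writes
// a chunk to the store only if it is not already present.
func dedupe(data []byte, chunkSize int) Manifest {
	var m Manifest
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		sum := sha256.Sum256(chunk)
		key := hex.EncodeToString(sum[:])
		if _, ok := store[key]; !ok {
			store[key] = append([]byte(nil), chunk...) // new chunk: store once
		}
		m.ChunkHashes = append(m.ChunkHashes, key)
	}
	return m
}

// restore rebuilds the original data by looking up each chunk in order.
func restore(m Manifest) []byte {
	var buf bytes.Buffer
	for _, key := range m.ChunkHashes {
		buf.Write(store[key])
	}
	return buf.Bytes()
}

func main() {
	data := []byte("abcdabcdabcdxyz") // repeated content deduplicates well
	m := dedupe(data, 4)
	fmt.Println("unique chunks stored:", len(store))                  // 2
	fmt.Println("restored correctly:", bytes.Equal(restore(m), data)) // true
}
```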
## Key Features

### Content-Defined Chunking

The deduplication system uses content-defined chunking, which places chunk boundaries based on the file's content rather than at fixed offsets. Because boundaries are derived from the data itself, a change to a file only affects the chunks that contain the change, maximizing deduplication efficiency.

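Content-defined chunking can be sketched with a simple rolling hash: slide a small window over the data and declare a chunk boundary whenever the hash of the window matches a bit mask, subject to minimum and maximum chunk sizes. The window size, mask, and size bounds below are illustrative assumptions; the actual algorithm and parameters KeepSync uses are not specified in this guide.

```go
package main

import "fmt"

// cdcSplit cuts data into variable-sized chunks. A boundary is declared when
// an additive rolling hash over the last `window` bytes matches `mask`,
// subject to min/max chunk sizes. All parameters are illustrative.
func cdcSplit(data []byte, minSize, maxSize, window int, mask uint32) [][]byte {
	var chunks [][]byte
	start := 0
	var hash uint32
	for i := range data {
		hash += uint32(data[i])
		if i-start >= window {
			hash -= uint32(data[i-window]) // slide the window forward
		}
		size := i - start + 1
		if (size >= minSize && hash&mask == 0) || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			hash = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:]) // trailing chunk
	}
	return chunks
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog, again and again and again")
	for _, c := range cdcSplit(data, 8, 32, 4, 0x7) {
		fmt.Printf("%2d bytes: %q\n", len(c), c)
	}
}
```

Because a boundary depends only on the bytes near it, inserting data early in a file shifts content without disturbing most later boundaries, so most chunks keep their hashes and deduplicate against the previous version.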
### Content-Addressable Storage

Chunks are stored in a content-addressable storage system, where the hash of the chunk serves as its identifier. Because identical content always produces the same hash, identical chunks are stored only once, regardless of their source.

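A minimal on-disk content-addressable store might look like the sketch below, where each chunk is written to a file named after its hex-encoded SHA-256 hash. The flat one-file-per-hash layout and the `ContentStore` type are assumptions for illustration only; the actual KeepSync store may shard directories, compress chunks, or hold additional metadata.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// ContentStore writes each chunk to a file named after its SHA-256 hash,
// so storing the same content twice is a no-op.
type ContentStore struct{ Dir string }

// Put stores a chunk and returns its content address (hex SHA-256).
func (s ContentStore) Put(chunk []byte) (string, error) {
	sum := sha256.Sum256(chunk)
	key := hex.EncodeToString(sum[:])
	path := filepath.Join(s.Dir, key)
	if _, err := os.Stat(path); err == nil {
		return key, nil // already present: deduplicated
	}
	return key, os.WriteFile(path, chunk, 0o644)
}

// Get retrieves a chunk by its content address.
func (s ContentStore) Get(key string) ([]byte, error) {
	return os.ReadFile(filepath.Join(s.Dir, key))
}

func main() {
	dir, _ := os.MkdirTemp("", "castore")
	defer os.RemoveAll(dir)
	store := ContentStore{Dir: dir}
	k1, _ := store.Put([]byte("hello"))
	k2, _ := store.Put([]byte("hello")) // identical content, same key, stored once
	fmt.Println("same address:", k1 == k2)
}
```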
### Compression

The system supports optional compression of chunks to further reduce storage requirements. Compression is applied selectively: a chunk is stored in compressed form only when that form is smaller than the original.

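The selective policy amounts to "compress, then keep whichever form is smaller". The helper below uses gzip purely to illustrate the idea; which codec KeepSync actually applies is not stated in this guide.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// maybeCompress returns the compressed chunk and true when compression
// actually saves space, otherwise the original chunk and false.
func maybeCompress(chunk []byte) ([]byte, bool) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(chunk); err != nil {
		return chunk, false
	}
	if err := zw.Close(); err != nil {
		return chunk, false
	}
	if buf.Len() >= len(chunk) {
		return chunk, false // compression did not help: store the raw chunk
	}
	return buf.Bytes(), true
}

func main() {
	redundant := bytes.Repeat([]byte("abcd"), 1024)
	stored, compressed := maybeCompress(redundant)
	fmt.Printf("original=%d stored=%d compressed=%v\n", len(redundant), len(stored), compressed)
}
```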
### Verification

The system can verify chunks after storage to ensure data integrity. This provides confidence that stored data can be retrieved correctly later.

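With content addressing, verification is conceptually just re-reading a stored chunk and checking that it still hashes to the address it was stored under. The standalone sketch below assumes the hypothetical one-file-per-hash layout from the content store example above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// verifyChunk re-reads a stored chunk and checks that its contents still
// hash to the content address (file name) it was stored under.
func verifyChunk(storeDir, key string) (bool, error) {
	chunk, err := os.ReadFile(filepath.Join(storeDir, key))
	if err != nil {
		return false, err
	}
	sum := sha256.Sum256(chunk)
	return hex.EncodeToString(sum[:]) == key, nil
}

func main() {
	dir, _ := os.MkdirTemp("", "verify")
	defer os.RemoveAll(dir)

	chunk := []byte("payload")
	sum := sha256.Sum256(chunk)
	key := hex.EncodeToString(sum[:])
	_ = os.WriteFile(filepath.Join(dir, key), chunk, 0o644)

	ok, err := verifyChunk(dir, key)
	fmt.Println("chunk intact:", ok, err)
}
```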
### Parallel Processing

The deduplication system supports parallel processing to improve performance on multi-core systems. This feature is particularly beneficial for large files and high-volume operations.

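One common way to exploit multiple cores is a bounded worker pool for the hashing stage: a fixed number of goroutines pull chunk indices from a channel and hash them concurrently. This is a generic sketch of the pattern, not KeepSync's internal scheduler; the worker count of 4 simply mirrors the `WorkerCount` option shown later.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// hashChunks hashes each chunk using `workers` goroutines and returns the
// hashes in the original chunk order.
func hashChunks(chunks [][]byte, workers int) []string {
	hashes := make([]string, len(chunks))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				sum := sha256.Sum256(chunks[i])
				hashes[i] = hex.EncodeToString(sum[:]) // each worker writes a distinct index
			}
		}()
	}

	for i := range chunks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return hashes
}

func main() {
	chunks := [][]byte{[]byte("alpha"), []byte("beta"), []byte("alpha")}
	for i, h := range hashChunks(chunks, 4) {
		fmt.Printf("chunk %d: %s...\n", i, h[:12])
	}
}
```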
### Metrics Collection

The system collects detailed metrics about deduplication operations, including the following (see the sketch after this list):
- Number of chunks processed
- Number of new vs. existing chunks
- Original vs. stored data sizes
- Deduplication ratio
- Processing time

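These metrics might be gathered into a record like the one below, with the deduplication ratio computed as the percentage of submitted bytes that did not need to be stored. The struct and its field names are illustrative assumptions, not the actual KeepSync metrics API.

```go
package main

import (
	"fmt"
	"time"
)

// DeduplicationMetrics is an illustrative record of the metrics listed above.
type DeduplicationMetrics struct {
	ChunksProcessed int
	NewChunks       int
	ExistingChunks  int
	OriginalSize    int64 // bytes submitted for deduplication
	StoredSize      int64 // bytes actually written to the content store
	Duration        time.Duration
}

// Ratio reports the percentage of space saved by deduplication.
func (m DeduplicationMetrics) Ratio() float64 {
	if m.OriginalSize == 0 {
		return 0
	}
	saved := m.OriginalSize - m.StoredSize
	return 100 * float64(saved) / float64(m.OriginalSize)
}

func main() {
	m := DeduplicationMetrics{
		ChunksProcessed: 1000,
		NewChunks:       250,
		ExistingChunks:  750,
		OriginalSize:    400 << 20, // 400 MB submitted
		StoredSize:      110 << 20, // 110 MB written
		Duration:        3 * time.Second,
	}
	fmt.Printf("deduplication ratio: %.2f%% in %v\n", m.Ratio(), m.Duration)
}
```

In this illustrative example, 290 of the 400 MB submitted were already present in the store, giving a ratio of 72.50%.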
## Usage

### Basic Usage

```go
// Create a deduplicator (error handling is omitted for brevity)
options := services.DefaultDeduplicationOptions()
deduplicator, err := services.NewDeduplicator(options)

// Deduplicate a file, producing a manifest that references its chunks
manifest, err := deduplicator.DeduplicateFile(ctx, filePath)

// Restore the file from the manifest
err = deduplicator.RestoreFile(ctx, manifest, outputPath)
```

### Working with Data in Memory

```go
// Deduplicate data
manifest, err := deduplicator.DeduplicateData(ctx, data)

// Restore data from a manifest
restoredData, err := deduplicator.RestoreData(ctx, manifest)
```

### Content Store Statistics

```go
// Get statistics about the content store
stats := deduplicator.GetContentStoreStats()

// formatSize is assumed to be an application-defined helper that renders
// byte counts in human-readable units.
fmt.Printf("Total chunks: %d\n", stats.TotalChunks)
fmt.Printf("Total size: %s\n", formatSize(stats.TotalSize))
fmt.Printf("Stored size: %s\n", formatSize(stats.StoredSize))
fmt.Printf("Space saved: %s\n", formatSize(stats.SpaceSaved))
fmt.Printf("Deduplication ratio: %.2f%%\n", stats.DeduplicationRatio)
```

## Configuration Options

The deduplication system provides several configuration options:

### Content Store Directory

```go
options.ContentStoreDir = "/path/to/content/store"
```

The content store directory is where deduplicated chunks are stored. It should be on a reliable storage device with sufficient free space.

### Chunk Size Limits

```go
options.MinChunkSize = 4 * 1024         // 4 KB
options.MaxChunkSize = 4 * 1024 * 1024  // 4 MB
```

These options control the minimum and maximum chunk sizes. Smaller chunks find more redundancy but increase overhead, while larger chunks reduce overhead at the cost of deduplication efficiency.

### Compression

```go
options.CompressChunks = true
```

When enabled, chunks are compressed before storage if compression results in a smaller size.

### Verification

```go
options.VerifyChunks = true
```

When enabled, chunks are verified after storage to ensure data integrity.

### Parallel Processing

```go
options.ParallelProcessing = true
options.WorkerCount = 4
```

These options enable parallel processing and set the number of workers for improved performance on multi-core systems.

### Metrics Collection

```go
options.MetricsEnabled = true
options.ProviderType = "deduplicator"
```

These options control metrics collection for performance monitoring and analysis.

## Performance Considerations

### Storage Efficiency

The deduplication ratio (the percentage of space saved) depends on several factors:
- Data redundancy: Higher redundancy leads to better deduplication
- Chunk size: Smaller chunks can identify more redundancy but increase overhead
- Compression: Compression can further reduce storage requirements

### Memory Usage

The deduplication system uses buffer pooling to minimize memory allocations. However, processing very large files may require significant memory. Consider the following guidelines:
- For files under 100 MB, default settings work well
- For files between 100 MB and 1 GB, consider increasing the worker count
- For files over 1 GB, consider processing in smaller batches

### CPU Usage

The deduplication system can be CPU-intensive, especially with compression enabled. Consider the following guidelines:
- For maximum performance, enable parallel processing
- For CPU-constrained environments, disable compression or reduce the worker count
- For batch processing, consider scheduling runs during off-peak hours

## Integration with Other Systems

### Differential Chunking

The deduplication system works well with differential chunking, providing efficient storage and transfer of files with small changes:

```go
// Create a differential chunker with deduplication
// (chunkerOptions is assumed to be configured elsewhere)
diffChunker := services.NewDifferentialChunker(chunkerOptions)

// Deduplicate the result of a differential update
manifest, err := deduplicator.DeduplicateFile(ctx, updatedFilePath)
```

### Versioning

The deduplication system supports efficient storage of file versions by sharing chunks between versions:

```go
// Deduplicate multiple versions of a file; unchanged chunks are shared
manifest1, err := deduplicator.DeduplicateFile(ctx, version1Path)
manifest2, err := deduplicator.DeduplicateFile(ctx, version2Path)

// Store the manifests with version metadata
```

### Backup Systems

The deduplication system is well suited to backup systems, providing efficient storage of backup data:

```go
// Deduplicate backup data
manifest, err := deduplicator.DeduplicateFile(ctx, backupFilePath)

// Store the manifest with backup metadata
```

## Conclusion

The deduplication system provides significant storage and transfer efficiency improvements, especially for data with high redundancy. By identifying and eliminating duplicate data, it reduces storage requirements and improves the performance of file operations.