deduplication-guide.md
# Deduplication Guide

## Overview

Deduplication is a storage optimization technique that identifies duplicate data and stores only one copy of each unique chunk. The KeepSync deduplication system applies this technique to provide efficient storage and transfer of files.

## How It Works

The deduplication system works as follows:

1. **Chunking**: Files are divided into variable-sized chunks using content-defined chunking algorithms.
2. **Hashing**: Each chunk is hashed using a cryptographic hash function (SHA-256) to create a unique identifier.
3. **Storage**: Chunks are stored in a content-addressable storage system, where the hash serves as the key.
4. **Manifests**: File manifests are created that reference the stored chunks, allowing for efficient reconstruction.
5. **Compression**: Optionally, chunks can be compressed to further reduce storage requirements.

## Key Features

### Content-Defined Chunking

The deduplication system uses content-defined chunking to divide files into chunks based on their content rather than fixed sizes. This approach ensures that changes to a file only affect the chunks that contain the changes, maximizing deduplication efficiency.

### Content-Addressable Storage

Chunks are stored in a content-addressable storage system, where the hash of the chunk serves as its identifier. This approach ensures that identical chunks are only stored once, regardless of their source.

### Compression

The system supports optional compression of chunks to further reduce storage requirements. Compression is applied selectively, only when it results in smaller chunk sizes.

### Verification

The system can verify chunks after storage to ensure data integrity. This feature provides confidence that stored data can be correctly retrieved.
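KeepSync's internal chunking algorithm is not shown here, but the idea behind content-defined chunking can be sketched with a toy rolling hash: a chunk boundary is declared wherever the low bits of the hash are all zero, subject to minimum and maximum size limits. The `chunkBoundaries` function below is a hypothetical illustration, not the library's API; production systems typically use Gear or Rabin fingerprinting instead of this simplified hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkBoundaries splits data at content-defined boundaries: a boundary is
// declared whenever the low bits of a simple rolling hash are zero, as long
// as the chunk has reached minSize; maxSize forces a boundary regardless.
// This is an illustrative sketch, not KeepSync's actual algorithm.
func chunkBoundaries(data []byte, minSize, maxSize int, mask uint32) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint32
	for i, b := range data {
		h = (h << 1) + uint32(b) // toy rolling hash
		size := i - start + 1
		if (size >= minSize && h&mask == 0) || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) { // final partial chunk
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := []byte("the quick brown fox jumps over the lazy dog")
	for _, c := range chunkBoundaries(data, 4, 16, 0x7) {
		sum := sha256.Sum256(c) // each chunk keyed by its SHA-256 hash
		fmt.Printf("%s  %d bytes\n", hex.EncodeToString(sum[:8]), len(c))
	}
}
```

Because boundaries depend on content rather than offsets, inserting bytes near the start of a file shifts only the chunks around the edit; later boundaries re-synchronize and those chunks hash identically, which is what makes deduplication effective across file versions.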
### Parallel Processing

The deduplication system supports parallel processing to improve performance on multi-core systems. This feature is particularly beneficial for large files and high-volume operations.

### Metrics Collection

The system collects detailed metrics about deduplication operations, including:

- Number of chunks processed
- Number of new vs. existing chunks
- Original vs. stored data sizes
- Deduplication ratio
- Processing time

## Usage

### Basic Usage

```go
// Create a deduplicator
options := services.DefaultDeduplicationOptions()
deduplicator, err := services.NewDeduplicator(options)

// Deduplicate a file
manifest, err := deduplicator.DeduplicateFile(ctx, filePath)

// Restore a file from a manifest
err = deduplicator.RestoreFile(ctx, manifest, outputPath)
```

### Working with Data in Memory

```go
// Deduplicate data
manifest, err := deduplicator.DeduplicateData(ctx, data)

// Restore data from a manifest
restoredData, err := deduplicator.RestoreData(ctx, manifest)
```

### Content Store Statistics

```go
// Get statistics about the content store
stats := deduplicator.GetContentStoreStats()
fmt.Printf("Total chunks: %d\n", stats.TotalChunks)
fmt.Printf("Total size: %s\n", formatSize(stats.TotalSize))
fmt.Printf("Stored size: %s\n", formatSize(stats.StoredSize))
fmt.Printf("Space saved: %s\n", formatSize(stats.SpaceSaved))
fmt.Printf("Deduplication ratio: %.2f%%\n", stats.DeduplicationRatio)
```

## Configuration Options

The deduplication system provides several configuration options:

### Content Store Directory

```go
options.ContentStoreDir = "/path/to/content/store"
```

The content store directory is where deduplicated chunks are stored. This directory should be on a reliable storage device with sufficient space.
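The guide does not specify KeepSync's on-disk layout, but content-addressable stores conventionally shard chunks into subdirectories by hash prefix so no single directory accumulates millions of entries. The `chunkPath` helper below is a hypothetical sketch of that convention, not the library's actual layout:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// chunkPath maps a chunk's SHA-256 hash to a sharded path inside the
// content store directory, e.g. <store>/2c/f2/2cf24d... Sharding on the
// first two byte pairs of the hex digest spreads chunks evenly across
// 65,536 subdirectories. This layout is an assumption for illustration.
func chunkPath(storeDir string, chunk []byte) string {
	sum := sha256.Sum256(chunk)
	h := hex.EncodeToString(sum[:])
	return filepath.Join(storeDir, h[:2], h[2:4], h)
}

func main() {
	fmt.Println(chunkPath("/path/to/content/store", []byte("hello")))
}
```

Because the path is derived entirely from the chunk's content, writing the same chunk twice targets the same file, which is how the store guarantees each unique chunk is stored only once.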
### Chunk Size Limits

```go
options.MinChunkSize = 4 * 1024        // 4 KB
options.MaxChunkSize = 4 * 1024 * 1024 // 4 MB
```

These options control the minimum and maximum chunk sizes. Smaller chunks provide better deduplication but increase overhead, while larger chunks reduce overhead but may decrease deduplication efficiency.

### Compression

```go
options.CompressChunks = true
```

When enabled, chunks are compressed before storage if compression results in smaller sizes.

### Verification

```go
options.VerifyChunks = true
```

When enabled, chunks are verified after storage to ensure data integrity.

### Parallel Processing

```go
options.ParallelProcessing = true
options.WorkerCount = 4
```

These options control parallel processing for improved performance on multi-core systems.

### Metrics Collection

```go
options.MetricsEnabled = true
options.ProviderType = "deduplicator"
```

These options control metrics collection for performance monitoring and analysis.

## Performance Considerations

### Storage Efficiency

The deduplication ratio (the percentage of space saved) depends on several factors:

- Data redundancy: Higher redundancy leads to better deduplication
- Chunk size: Smaller chunks can identify more redundancy but increase overhead
- Compression: Compression can further reduce storage requirements

### Memory Usage

The deduplication system uses buffer pooling to minimize memory allocations. However, processing very large files may require significant memory.
Consider the following guidelines:

- For files under 100 MB, default settings work well
- For files between 100 MB and 1 GB, consider increasing the worker count
- For files over 1 GB, consider processing in smaller batches

### CPU Usage

The deduplication system can be CPU-intensive, especially with compression enabled. Consider the following guidelines:

- For maximum performance, enable parallel processing
- For CPU-constrained environments, disable compression or reduce the worker count
- For batch processing, consider scheduling during off-peak hours

## Integration with Other Systems

### Differential Chunking

The deduplication system works well with differential chunking, providing efficient storage and transfer of files with small changes:

```go
// Create a differential chunker with deduplication
diffChunker := services.NewDifferentialChunker(chunkerOptions)

// Deduplicate the result of a differential update
manifest, err := deduplicator.DeduplicateFile(ctx, updatedFilePath)
```

### Versioning

The deduplication system supports efficient storage of file versions by sharing chunks between versions:

```go
// Deduplicate multiple versions of a file
manifest1, err := deduplicator.DeduplicateFile(ctx, version1Path)
manifest2, err := deduplicator.DeduplicateFile(ctx, version2Path)

// Store the manifests with version metadata
```

### Backup Systems

The deduplication system is ideal for backup systems, providing efficient storage of backup data:

```go
// Deduplicate backup data
manifest, err := deduplicator.DeduplicateFile(ctx, backupFilePath)

// Store the manifest with backup metadata
```

## Conclusion

The deduplication system provides significant storage and transfer efficiency improvements, especially for data with high redundancy.
By identifying and eliminating duplicate data, it reduces storage requirements and improves performance for file operations.