# DriftKit Embedding Module

## Overview

The `driftkit-embedding` module provides a unified abstraction layer for text embedding services, supporting multiple providers including OpenAI, Cohere, and local BERT models. It offers a consistent API for generating vector representations of text while maintaining flexibility for different embedding backends.

## Spring Boot Initialization

In a Spring Boot application, the module is configured automatically once you provide embedding configuration:

```java
@SpringBootApplication
// No additional annotations needed - auto-configuration handles everything
public class YourApplication {
    public static void main(String[] args) {
        SpringApplication.run(YourApplication.class, args);
    }
}
```

Configuration in `application.yml`:

```yaml
driftkit:
  embedding:
    name: "openai" # or "cohere", "local-bert"
    config:
      apiKey: "${OPENAI_API_KEY}"
      modelName: "text-embedding-ada-002"
      dimension: "1536"
```

The module provides:
- **Auto-configuration class**: `EmbeddingAutoConfiguration`
- **Conditional activation**: Only when `driftkit.embedding.name` is configured
- **Bean creation**: Automatically creates `EmbeddingModel` bean from configuration

## Architecture

### Module Structure

```
driftkit-embedding/
├── driftkit-embedding-core/                 # Core embedding functionality
├── driftkit-embedding-spring-ai/            # Spring AI integration
├── driftkit-embedding-spring-boot-starter/  # Spring Boot auto-configuration
└── pom.xml                                  # Parent module configuration
```

### Key Dependencies

- **DJL API** - Deep Java Library for AI model inference
- **HuggingFace Tokenizers** - Text tokenization for local models
- **ONNX Runtime** - Efficient model inference for local BERT models
- **OpenFeign** - HTTP client for external API integrations
- **DriftKit Common** - Shared domain objects and utilities

## Core Abstractions

### EmbeddingModel Interface

The central abstraction for all embedding providers. It provides provider identification through `supportsName()`, model access for local models, configuration handling, and core embedding methods with default implementations for single and batch text processing.

**Key Features:**
- **Provider Identification** - Each implementation declares support via `supportsName()`
- **Flexible Architecture** - Default implementations handle common patterns
- **Token Counting** - Built-in support for usage estimation
- **Batch Processing** - Efficient handling of multiple text segments

### EmbeddingFactory

Factory pattern implementation for dynamic model loading using Java ServiceLoader. It discovers available embedding providers at runtime and initializes them with configuration.

**Usage Example:**
```java
Map<String, String> config = Map.of(
    "apiKey", "your-api-key",
    "model", "text-embedding-ada-002"
);

EmbeddingModel model = EmbeddingFactory.fromName("openai", config);
```

## Domain Objects

### Embedding

Vector representation wrapper with factory methods for different input types (`double[]`, `float[]`, `List<Float>`). Provides utility methods including normalization, dimension retrieval, and vector format conversion.

### TextSegment

Text container with optional metadata. Provides factory methods for creating segments with or without metadata.

### Metadata

Type-safe metadata management system supporting String, UUID, Integer, Long, Float, and Double types.
Provides type-safe getters and a fluent API for adding metadata.

**Usage Example:**
```java
Metadata metadata = new Metadata()
    .put("source", "document.pdf")
    .put("page", 1)
    .put("confidence", 0.95f);

TextSegment segment = TextSegment.from("Your text content", metadata);
```

## Provider Implementations

### Spring AI Integration

The `driftkit-embedding-spring-ai` module integrates with Spring AI's embedding capabilities, allowing you to use any Spring AI embedding provider through DriftKit's unified interface.

#### Key Features

- **Universal Adapter** - Use any Spring AI `EmbeddingModel` with DriftKit
- **Auto-Configuration** - Spring Boot starter for zero-config setup
- **Provider Agnostic** - Works with all Spring AI embedding providers
- **Type-Safe** - Maintains type safety across the integration

#### Configuration

```java
// Configure any Spring AI embedding model
@Bean
public org.springframework.ai.embedding.EmbeddingModel springAiEmbeddingModel() {
    return new OpenAiEmbeddingModel(openAiApi, options);
    // Or any other Spring AI embedding model: Azure OpenAI, Ollama, Vertex AI, etc.
}

// The adapter will automatically be created via auto-configuration
@Autowired
private ai.driftkit.embedding.core.service.EmbeddingModel embeddingModel;
```

#### Spring Boot Auto-Configuration

Add the starter for automatic configuration:

```xml
<dependency>
    <groupId>ai.driftkit</groupId>
    <artifactId>driftkit-embedding-spring-ai-starter</artifactId>
    <version>${driftkit.version}</version>
</dependency>
```

Configuration properties:

```yaml
driftkit:
  embedding:
    spring-ai:
      enabled: true             # Enable Spring AI integration
      model-name: "my-model"    # Name for the adapter
      auto-create-adapter: true # Auto-create DriftKit adapter
```

#### Supported Spring AI Providers

The adapter works with all Spring AI embedding providers, including:
- **OpenAI** - OpenAI embeddings (text-embedding-3-small, text-embedding-3-large)
- **Azure OpenAI** - Microsoft's hosted OpenAI service
- **Ollama** - Local embedding models
- **Vertex AI** - Google's AI platform
- **Bedrock** - AWS AI services
- **PostgresML** - Database-native embeddings
- **Any custom Spring AI EmbeddingModel implementation**

For detailed configuration of specific Spring AI providers, refer to the [Spring AI documentation](https://docs.spring.io/spring-ai/reference/).
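The "universal adapter" idea described above can be illustrated without either library on the classpath. The sketch below is not DriftKit's adapter: `ExternalEmbedder` and `UnifiedEmbedder` are hypothetical stand-ins for a third-party embedding client and for the unified interface, and `adapt` is an invented helper. It only shows the shape of the pattern: wrap the external model, answer `supportsName()` with the configured adapter name, and delegate batch calls.

```java
import java.util.List;

public final class AdapterSketch {

    /** Hypothetical stand-in for a third-party embedding client. */
    public interface ExternalEmbedder {
        List<float[]> embed(List<String> texts);
    }

    /** Hypothetical stand-in for the unified interface (not the real EmbeddingModel). */
    public interface UnifiedEmbedder {
        boolean supportsName(String name);
        List<float[]> embedAll(List<String> segments);
    }

    /** Wraps an external client so callers only see the unified interface. */
    public static UnifiedEmbedder adapt(String adapterName, ExternalEmbedder delegate) {
        return new UnifiedEmbedder() {
            @Override
            public boolean supportsName(String name) {
                return adapterName.equals(name);
            }

            @Override
            public List<float[]> embedAll(List<String> segments) {
                List<float[]> vectors = delegate.embed(segments);
                // Guard against a delegate that drops or merges inputs
                if (vectors.size() != segments.size()) {
                    throw new IllegalStateException(
                        "Expected " + segments.size() + " vectors, got " + vectors.size());
                }
                return vectors;
            }
        };
    }

    private AdapterSketch() {
    }
}
```

The size check is a design choice for the sketch: an adapter is a natural place to verify that the wrapped provider returned exactly one vector per input segment.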

### OpenAI Integration

#### OpenAIEmbeddingModel

```java
@NoArgsConstructor
public class OpenAIEmbeddingModel implements EmbeddingModel {
    private EmbeddingOpenAIApiClient apiClient;
    private String modelName;

    @Override
    public boolean supportsName(String name) {
        return "openai".equals(name);
    }

    @Override
    public void configure(EmbeddingServiceConfig config) {
        this.modelName = config.getConfig().get(EtlConfig.MODEL_NAME);
        this.apiClient = Feign.builder()
            .encoder(new JacksonEncoder())
            .decoder(new JacksonDecoder())
            .requestInterceptor(new OpenAIAuthInterceptor(config.get(EtlConfig.API_KEY)))
            .target(EmbeddingOpenAIApiClient.class,
                config.get(EtlConfig.HOST, "https://api.openai.com"));
    }

    @Override
    public Response<List<Embedding>> embedAll(List<TextSegment> segments) {
        List<String> texts = segments.stream()
            .map(TextSegment::text)
            .collect(Collectors.toList());

        EmbeddingRequest request = new EmbeddingRequest(modelName, texts);
        EmbeddingResponse response = apiClient.getEmbeddings(request);

        List<Embedding> embeddings = response.getData().stream()
            .map(data -> Embedding.from(data.getEmbedding()))
            .collect(Collectors.toList());

        Usage usage = response.getUsage();
        TokenUsage tokenUsage = new TokenUsage(
            usage.getPromptTokens(),
            usage.getTotalTokens() - usage.getPromptTokens()
        );

        return Response.from(embeddings, tokenUsage);
    }

    @Override
    public int estimateTokenCount(String text) {
        return text.length() / 4; // Approximate estimation
    }

    @Override
    public AIOnnxBertBiEncoder model() {
        throw new UnsupportedOperationException("OpenAI models don't expose ONNX encoder");
    }
}
```

#### API Client Definition

```java
public interface EmbeddingOpenAIApiClient {
    @RequestLine("POST /v1/embeddings")
    @Headers("Content-Type: application/json")
    EmbeddingResponse getEmbeddings(EmbeddingRequest request);
}

public class OpenAIAuthInterceptor implements RequestInterceptor {
    private final String apiKey;

    public OpenAIAuthInterceptor(String apiKey) {
        this.apiKey = apiKey;
    }

    @Override
    public void apply(RequestTemplate template) {
        template.header("Authorization", "Bearer " + apiKey);
        template.header("Content-Type", "application/json");
    }
}
```

**Configuration Example:**
```yaml
driftkit:
  embeddingServices:
    - name: "openai-embeddings"
      type: "openai"
      config:
        apiKey: "${OPENAI_API_KEY}"
        model: "text-embedding-ada-002"
        host: "https://api.openai.com"
```

### Cohere Integration

#### CohereEmbeddingModel

```java
public class CohereEmbeddingModel implements EmbeddingModel {
    // Not final: the client is created in configure(), which is omitted here for brevity
    private CohereApiClient apiClient;

    @Override
    public boolean supportsName(String name) {
        return "cohere".equals(name);
    }

    @Override
    public Response<List<Embedding>> embedAll(List<TextSegment> segments) {
        List<String> texts = segments.stream()
            .map(TextSegment::text)
            .collect(Collectors.toList());

        CohereEmbeddingRequest request = CohereEmbeddingRequest.builder()
            .texts(texts)
            .model("embed-english-v2.0")
            .build();

        CohereEmbeddingResponse response = apiClient.getEmbeddings(request);

        List<Embedding> embeddings = new ArrayList<>();
        for (List<Double> embeddingValues : response.getEmbeddings()) {
            double[] embeddingArray = embeddingValues.stream()
                .mapToDouble(Double::doubleValue)
                .toArray();
            embeddings.add(Embedding.from(embeddingArray));
        }

        return Response.from(embeddings, new TokenUsage(0)); // Cohere doesn't provide token usage
    }

    @Override
    public int estimateTokenCount(String text) {
        return text.length() / 5; // Approximate estimation
    }

    @Override
    public AIOnnxBertBiEncoder model() {
        throw new UnsupportedOperationException("Cohere models don't expose ONNX encoder");
    }
}
```

### Local BERT Models

#### AIOnnxBertBiEncoder

ONNX-based BERT encoder with the following features:
- **Token Management**: Handles texts up to 510 tokens (512 - 2 special tokens)
- **Text Partitioning**: Automatically splits long texts at sentence boundaries
- **Embedding Combination**: Uses weighted averaging based on token count
- **Pooling Modes**: Supports CLS and MEAN pooling strategies
- **L2 Normalization**: Normalizes final embeddings for consistent similarity calculations
- **HuggingFace Integration**: Uses HuggingFace tokenizers for accurate token counting

#### BertGenericEmbeddingModel

Implementation of `EmbeddingModel` for local BERT models. It configures an `AIOnnxBertBiEncoder` with the specified model and tokenizer paths, using MEAN pooling mode by default.

**Configuration Example:**
```yaml
driftkit:
  embeddingServices:
    - name: "local-bert"
      type: "local"
      config:
        modelPath: "/path/to/bert-base-uncased.onnx"
        tokenizerPath: "/path/to/tokenizer"
```

## Usage Patterns

### Basic Embedding

```java
@Service
public class EmbeddingService {

    @Value("${openai.api.key}")
    private String openaiApiKey;

    public Embedding embedText(String text) throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", openaiApiKey,
            "model", "text-embedding-ada-002"
        );

        EmbeddingModel model = EmbeddingFactory.fromName("openai", config);
        TextSegment segment = TextSegment.from(text);

        Response<Embedding> response = model.embed(segment);
        return response.content();
    }
}
```

### Batch Processing

```java
@Service
public class BatchEmbeddingService {

    @Value("${openai.api.key}")
    private String openaiApiKey;

    public List<Embedding> embedTexts(List<String> texts) throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", openaiApiKey,
            "model", "text-embedding-ada-002"
        );

        EmbeddingModel model = EmbeddingFactory.fromName("openai", config);

        List<TextSegment> segments = texts.stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        Response<List<Embedding>> response = model.embedAll(segments);
        return response.content();
    }
}
```

### Document Processing with Metadata

```java
@Service
public class DocumentEmbeddingService {

    public List<Embedding> processDocument(String documentContent, String source) throws Exception {
        // Split document into chunks
        DocumentSplitter splitter = DocumentSplitter.builder()
            .maxChunkSize(512)
            .overlapSize(50)
            .build();

        List<String> chunks = splitter.split(documentContent);

        // Create segments with metadata
        List<TextSegment> segments = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            Metadata metadata = new Metadata()
                .put("source", source)
                .put("chunk_index", i)
                .put("total_chunks", chunks.size());

            segments.add(TextSegment.from(chunks.get(i), metadata));
        }

        // Generate embeddings; getOpenAIConfig() is a helper (not shown) returning the provider settings
        EmbeddingModel model = EmbeddingFactory.fromName("openai", getOpenAIConfig());
        Response<List<Embedding>> response = model.embedAll(segments);

        return response.content();
    }
}
```

### Local Model Usage

```java
@Service
public class LocalEmbeddingService {

    public Embedding embedWithLocalModel(String text) throws Exception {
        Map<String, String> config = Map.of(
            "modelPath", "/models/bert-base-uncased.onnx",
            "tokenizerPath", "/models/tokenizer"
        );

        EmbeddingModel model = EmbeddingFactory.fromName("local", config);
        TextSegment segment = TextSegment.from(text);

        Response<Embedding> response = model.embed(segment);
        Embedding embedding = response.content();

        // Optional: normalize the embedding
        embedding.normalize();

        return embedding;
    }
}
```

### Similarity Search

```java
@Service
public class SimilaritySearchService {

    public double calculateCosineSimilarity(Embedding embedding1, Embedding embedding2) {
        float[] vector1 = embedding1.vector();
        float[] vector2 = embedding2.vector();

        if (vector1.length != vector2.length) {
            throw new IllegalArgumentException("Embeddings must have the same dimension");
        }

        double dotProduct = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;

        for (int i = 0; i < vector1.length; i++) {
            dotProduct += vector1[i] * vector2[i];
            norm1 += vector1[i] * vector1[i];
            norm2 += vector2[i] * vector2[i];
        }

        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public List<ScoredEmbedding> findSimilar(Embedding queryEmbedding,
                                             List<Embedding> candidates,
                                             int topK) {
        return candidates.stream()
            .map(candidate -> new ScoredEmbedding(
                candidate,
                calculateCosineSimilarity(queryEmbedding, candidate)
            ))
            .sorted((a, b) -> Double.compare(b.score, a.score))
            .limit(topK)
            .collect(Collectors.toList());
    }

    @Data
    @AllArgsConstructor
    public static class ScoredEmbedding {
        private final Embedding embedding;
        private final double score;
    }
}
```

## Configuration and Best Practices

### Provider Selection

```java
@Configuration
public class EmbeddingConfiguration {

    @Value("${driftkit.embedding.provider:openai}")
    private String defaultProvider;

    @Value("${openai.api.key}")
    private String openaiApiKey;

    // Property name assumed for illustration; empty default so non-Cohere setups still start
    @Value("${cohere.api.key:}")
    private String cohereApiKey;

    @Bean
    public EmbeddingModel primaryEmbeddingModel() throws Exception {
        Map<String, String> config = switch (defaultProvider) {
            case "openai" -> Map.of(
                "apiKey", openaiApiKey,
                "model", "text-embedding-ada-002"
            );
            case "cohere" -> Map.of(
                "apiKey", cohereApiKey,
                "model", "embed-english-v2.0"
            );
            case "local" -> Map.of(
                "modelPath", "/models/bert-base-uncased.onnx",
                "tokenizerPath", "/models/tokenizer"
            );
            default -> throw new IllegalArgumentException("Unknown provider: " + defaultProvider);
        };

        return EmbeddingFactory.fromName(defaultProvider, config);
    }
}
```

### Error Handling

```java
@Slf4j
@Service
@RequiredArgsConstructor
public class RobustEmbeddingService {

    private final EmbeddingModel primaryModel;
    private final EmbeddingModel fallbackModel;

    public Embedding embedWithFallback(String text) {
        try {
            TextSegment segment = TextSegment.from(text);
            Response<Embedding> response = primaryModel.embed(segment);
            return response.content();
        } catch (Exception e) {
            log.warn("Primary embedding model failed, using fallback", e);
            try {
                TextSegment segment = TextSegment.from(text);
                Response<Embedding> response = fallbackModel.embed(segment);
                return response.content();
            } catch (Exception fallbackException) {
                log.error("Both primary and fallback embedding models failed", fallbackException);
                throw new EmbeddingException("Failed to generate embedding", fallbackException);
            }
        }
    }
}
```

### Performance Optimization

```java
@Service
public class OptimizedEmbeddingService {

    private final EmbeddingModel model;
    private final Cache<String, Embedding> embeddingCache;

    public OptimizedEmbeddingService(EmbeddingModel model) {
        this.model = model;
        this.embeddingCache = Caffeine.newBuilder()
            .maximumSize(10000)
            .expireAfterWrite(1, TimeUnit.HOURS)
            .build();
    }

    public Embedding embedWithCaching(String text) throws Exception {
        return embeddingCache.get(text, key -> {
            try {
                TextSegment segment = TextSegment.from(key);
                Response<Embedding> response = model.embed(segment);
                return response.content();
            } catch (Exception e) {
                throw new RuntimeException("Failed to generate embedding", e);
            }
        });
    }

    public List<Embedding> embedBatch(List<String> texts, int batchSize) throws Exception {
        List<Embedding> results = new ArrayList<>();

        for (int i = 0; i < texts.size(); i += batchSize) {
            int endIndex = Math.min(i + batchSize, texts.size());
            List<String> batch = texts.subList(i, endIndex);

            List<TextSegment> segments = batch.stream()
                .map(TextSegment::from)
                .collect(Collectors.toList());

            Response<List<Embedding>> response = model.embedAll(segments);
            results.addAll(response.content());

            // Rate limiting for API providers
            if (endIndex < texts.size()) {
                Thread.sleep(100); // 100ms delay between batches
            }
        }

        return results;
    }
}
```

## Testing

### Unit Tests

```java
@ExtendWith(MockitoExtension.class)
class EmbeddingModelTest {

    // Shared by both tests
    private static final Map<String, String> CONFIG = Map.of(
        "apiKey", "test-key",
        "model", "text-embedding-ada-002"
    );

    @Test
    void shouldCreateEmbeddingFromText() throws Exception {
        EmbeddingModel model = EmbeddingFactory.fromName("openai", CONFIG);
        TextSegment segment = TextSegment.from("test text");

        Response<Embedding> response = model.embed(segment);

        assertThat(response).isNotNull();
        assertThat(response.content()).isNotNull();
        assertThat(response.content().dimension()).isGreaterThan(0);
    }

    @Test
    void shouldHandleBatchEmbedding() throws Exception {
        List<TextSegment> segments = List.of(
            TextSegment.from("first text"),
            TextSegment.from("second text"),
            TextSegment.from("third text")
        );

        EmbeddingModel model = EmbeddingFactory.fromName("openai", CONFIG);
        Response<List<Embedding>> response = model.embedAll(segments);

        assertThat(response.content()).hasSize(3);
        assertThat(response.tokenUsage().inputTokenCount()).isGreaterThan(0);
    }
}
```

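The tests above resolve a provider through `EmbeddingFactory`, so they appear to need network access and a valid key. For assertions that should run offline, one option is a deterministic stand-in. The sketch below is hypothetical and not part of DriftKit: `FakeEmbedder` and its byte-bucketing scheme are invented for illustration, and wrapping it in an `EmbeddingModel` implementation is left out because the interface's full signature is not shown in this README.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical test double: maps text to a repeatable unit-length vector. */
public final class FakeEmbedder {

    private final int dimension;

    public FakeEmbedder(int dimension) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        this.dimension = dimension;
    }

    /** Buckets the UTF-8 bytes of the text, then L2-normalizes the counts. */
    public float[] embed(String text) {
        float[] vector = new float[dimension];
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            // floorMod keeps the bucket non-negative even for negative byte values
            int bucket = Math.floorMod(31 * i + bytes[i], dimension);
            vector[bucket] += 1.0f;
        }

        double norm = 0.0;
        for (float v : vector) {
            norm += v * v;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            return vector; // empty text yields the zero vector
        }
        for (int i = 0; i < dimension; i++) {
            vector[i] /= (float) norm;
        }
        return vector;
    }
}
```

The vectors carry no semantic meaning; they are only useful for checking dimensions, determinism, and the plumbing around an embedding call.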
### Integration Tests

```java
@SpringBootTest
@TestPropertySource(properties = {
    "driftkit.embedding.provider=openai",
    "openai.api.key=test-key"
})
class EmbeddingIntegrationTest {

    @Autowired
    private EmbeddingModel embeddingModel;

    @Test
    void shouldIntegrateWithSpringBoot() throws Exception {
        TextSegment segment = TextSegment.from("integration test");
        Response<Embedding> response = embeddingModel.embed(segment);

        assertThat(response).isNotNull();
        assertThat(response.content().dimension()).isEqualTo(1536); // OpenAI ada-002 dimension
    }
}
```

## Extension Points

### Adding New Providers

1. Implement the `EmbeddingModel` interface
2. Add provider identification logic in `supportsName()`
3. Implement configuration handling in `configure()`
4. Add the implementation to `META-INF/services/ai.driftkit.embedding.core.service.EmbeddingModel`

```java
public class CustomEmbeddingModel implements EmbeddingModel {

    @Override
    public boolean supportsName(String name) {
        return "custom".equals(name);
    }

    @Override
    public void configure(EmbeddingServiceConfig config) {
        // Custom configuration logic
    }

    @Override
    public Response<List<Embedding>> embedAll(List<TextSegment> segments) {
        // Custom embedding logic (placeholder: replace with a real Response)
        return null;
    }

    @Override
    public AIOnnxBertBiEncoder model() {
        throw new UnsupportedOperationException("Custom models don't expose ONNX encoder");
    }
}
```

### Custom Metadata Types

Extend the `Metadata` class to support additional data types:

```java
public class ExtendedMetadata extends Metadata {
    private static final Set<Class<?>> EXTENDED_TYPES = Set.of(
        LocalDateTime.class, BigDecimal.class, CustomType.class
    );

    @Override
    public <T> Metadata put(String key, T value) {
        // Types outside the extended set go through the standard checks
        if (value != null && !EXTENDED_TYPES.contains(value.getClass())) {
            return super.put(key, value);
        }
        // Handle extended types
        return this;
    }
}
```

## Demo Examples

### 1. Simple Document Search

This example demonstrates basic document search using embeddings.

```java
@Service
public class SimpleDocumentSearch {

    private final EmbeddingModel embeddingModel;
    private final Map<String, SearchDocument> documents = new HashMap<>();

    public SimpleDocumentSearch() throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", System.getenv("OPENAI_API_KEY"),
            "model", "text-embedding-ada-002"
        );
        this.embeddingModel = EmbeddingFactory.fromName("openai", config);
    }

    public void addDocument(String id, String title, String content) throws Exception {
        String fullText = title + " " + content;
        TextSegment segment = TextSegment.from(fullText);
        Response<Embedding> response = embeddingModel.embed(segment);

        SearchDocument doc = new SearchDocument(id, title, content, response.content());
        documents.put(id, doc);
    }

    public List<SearchResult> search(String query, int limit) throws Exception {
        TextSegment querySegment = TextSegment.from(query);
        Response<Embedding> queryResponse = embeddingModel.embed(querySegment);
        Embedding queryEmbedding = queryResponse.content();

        return documents.values().stream()
            .map(doc -> new SearchResult(doc, calculateSimilarity(queryEmbedding, doc.getEmbedding())))
            .sorted((a, b) -> Double.compare(b.getScore(), a.getScore()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    // Cosine similarity: dot product divided by the product of the vector norms
    private double calculateSimilarity(Embedding a, Embedding b) {
        float[] vectorA = a.vector();
        float[] vectorB = b.vector();

        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;

        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }

        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    @Data
    @AllArgsConstructor
    public static class SearchDocument {
        private String id;
        private String title;
        private String content;
        private Embedding embedding;
    }

    @Data
    @AllArgsConstructor
    public static class SearchResult {
        private SearchDocument document;
        private double score;
    }
}
```

### 2. Content Recommendation

This example shows simple content recommendations based on user preferences.

```java
@Service
public class SimpleContentRecommendation {

    private final EmbeddingModel embeddingModel;
    private final Map<String, ContentItem> content = new HashMap<>();
    private final Map<String, Embedding> userPreferences = new HashMap<>();

    public SimpleContentRecommendation() throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", System.getenv("OPENAI_API_KEY"),
            "model", "text-embedding-ada-002"
        );
        this.embeddingModel = EmbeddingFactory.fromName("openai", config);
    }

    public void addContent(String contentId, String title, String description) throws Exception {
        String text = title + " " + description;
        TextSegment segment = TextSegment.from(text);
        Response<Embedding> response = embeddingModel.embed(segment);

        ContentItem item = new ContentItem(contentId, title, description, response.content());
        content.put(contentId, item);
    }

    public void updateUserPreferences(String userId, String preferencesText) throws Exception {
        TextSegment segment = TextSegment.from(preferencesText);
        Response<Embedding> response = embeddingModel.embed(segment);
        userPreferences.put(userId, response.content());
    }

    public List<RecommendationResult> getRecommendations(String userId, int count) {
        Embedding userEmbedding = userPreferences.get(userId);
        if (userEmbedding == null) {
            return Collections.emptyList();
        }

        return content.values().stream()
            .map(item -> new RecommendationResult(item, calculateSimilarity(userEmbedding, item.getEmbedding())))
            .sorted((a, b) -> Double.compare(b.getScore(), a.getScore()))
            .limit(count)
            .collect(Collectors.toList());
    }

    // Plain dot product: equals cosine similarity only for unit-length vectors
    // (OpenAI embeddings are normalized; normalize first for other providers)
    private double calculateSimilarity(Embedding a, Embedding b) {
        float[] vectorA = a.vector();
        float[] vectorB = b.vector();

        double dotProduct = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
        }
        return dotProduct;
    }

    @Data
    @AllArgsConstructor
    public static class ContentItem {
        private String id;
        private String title;
        private String description;
        private Embedding embedding;
    }

    @Data
    @AllArgsConstructor
    public static class RecommendationResult {
        private ContentItem content;
        private double score;
    }
}
```

### 3. Document Similarity

This example demonstrates finding similar documents.

```java
@Service
public class DocumentSimilarity {

    private final EmbeddingModel embeddingModel;
    private final Map<String, DocumentItem> documents = new HashMap<>();

    public DocumentSimilarity() throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", System.getenv("OPENAI_API_KEY"),
            "model", "text-embedding-ada-002"
        );
        this.embeddingModel = EmbeddingFactory.fromName("openai", config);
    }

    public void addDocument(String id, String content, String category) throws Exception {
        TextSegment segment = TextSegment.from(content);
        Response<Embedding> response = embeddingModel.embed(segment);

        DocumentItem doc = new DocumentItem(id, content, category, response.content());
        documents.put(id, doc);
    }

    public List<SimilarDocument> findSimilar(String documentId, int count) {
        DocumentItem targetDoc = documents.get(documentId);
        if (targetDoc == null) {
            return Collections.emptyList();
        }

        return documents.values().stream()
            .filter(doc -> !doc.getId().equals(documentId))
            .map(doc -> new SimilarDocument(doc, calculateSimilarity(targetDoc.getEmbedding(), doc.getEmbedding())))
            .sorted((a, b) -> Double.compare(b.getSimilarity(), a.getSimilarity()))
            .limit(count)
            .collect(Collectors.toList());
    }

    public Map<String, List<DocumentItem>> groupByCategory() {
        return documents.values().stream()
            .collect(Collectors.groupingBy(DocumentItem::getCategory));
    }

    // Plain dot product: equals cosine similarity only for unit-length vectors
    private double calculateSimilarity(Embedding a, Embedding b) {
        float[] vectorA = a.vector();
        float[] vectorB = b.vector();

        double sum = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            sum += vectorA[i] * vectorB[i];
        }
        return sum;
    }

    @Data
    @AllArgsConstructor
    public static class DocumentItem {
        private String id;
        private String content;
        private String category;
        private Embedding embedding;
    }

    @Data
    @AllArgsConstructor
    public static class SimilarDocument {
        private DocumentItem document;
        private double similarity;
    }
}
```

### 4. Content Tagging

This example shows automatic content tagging using predefined tag embeddings.

```java
@Service
public class SimpleContentTagging {

    private final EmbeddingModel embeddingModel;
    private final Map<String, TagItem> tags = new HashMap<>();

    public SimpleContentTagging() throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", System.getenv("OPENAI_API_KEY"),
            "model", "text-embedding-ada-002"
        );
        this.embeddingModel = EmbeddingFactory.fromName("openai", config);
        initializeTags();
    }

    private void initializeTags() throws Exception {
        addTag("technology", "Technology related content including programming, AI, and software");
        addTag("business", "Business topics including marketing, finance, and strategy");
        addTag("education", "Educational content, tutorials, and learning materials");
        addTag("health", "Health and wellness topics");
        addTag("sports", "Sports, fitness, and athletic activities");
    }

    public void addTag(String name, String description) throws Exception {
        TextSegment segment = TextSegment.from(description);
        Response<Embedding> response = embeddingModel.embed(segment);

        TagItem tag = new TagItem(name, description, response.content());
        tags.put(name, tag);
    }

    public List<TagResult> suggestTags(String content, int maxTags) throws Exception {
        TextSegment segment = TextSegment.from(content);
        Response<Embedding> response = embeddingModel.embed(segment);
        Embedding contentEmbedding = response.content();

        return tags.values().stream()
            .map(tag -> new TagResult(tag.getName(), calculateSimilarity(contentEmbedding, tag.getEmbedding())))
            .filter(result -> result.getScore() > 0.5)
            .sorted((a, b) -> Double.compare(b.getScore(), a.getScore()))
            .limit(maxTags)
            .collect(Collectors.toList());
    }

    // Plain dot product: equals cosine similarity only for unit-length vectors
    private double calculateSimilarity(Embedding a, Embedding b) {
        float[] vectorA = a.vector();
        float[] vectorB = b.vector();

        double score = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            score += vectorA[i] * vectorB[i];
        }
        return score;
    }

    @Data
    @AllArgsConstructor
    public static class TagItem {
        private String name;
        private String description;
        private Embedding embedding;
    }

    @Data
    @AllArgsConstructor
    public static class TagResult {
        private String name;
        private double score;
    }
}
```

### 5. Multilingual Content Matching

This example demonstrates matching content across different languages.

```java
@Service
public class MultilingualContentMatcher {

    private final EmbeddingModel embeddingModel;
    // Keyed by content id; each entry records its own language
    private final Map<String, ContentEntry> contentByLanguage = new HashMap<>();

    public MultilingualContentMatcher() throws Exception {
        Map<String, String> config = Map.of(
            "apiKey", System.getenv("OPENAI_API_KEY"),
            "model", "text-embedding-ada-002"
        );
        this.embeddingModel = EmbeddingFactory.fromName("openai", config);
    }

    public void addContent(String id, String content, String language) throws Exception {
        TextSegment segment = TextSegment.from(content);
        Response<Embedding> response = embeddingModel.embed(segment);

        ContentEntry entry = new ContentEntry(id, content, language, response.content());
        contentByLanguage.put(id, entry);
    }

    public List<MatchResult> findSimilarAcrossLanguages(String contentId, String excludeLanguage) {
        ContentEntry sourceContent = contentByLanguage.get(contentId);
        if (sourceContent == null) {
            return Collections.emptyList();
        }

        return contentByLanguage.values().stream()
            .filter(entry -> !entry.getId().equals(contentId))
            .filter(entry -> !entry.getLanguage().equals(excludeLanguage))
            .map(entry -> new MatchResult(entry, calculateSimilarity(sourceContent.getEmbedding(), entry.getEmbedding())))
            .filter(result -> result.getScore() > 0.7)
            .sorted((a, b) -> Double.compare(b.getScore(), a.getScore()))
            .collect(Collectors.toList());
    }

    public Map<String, Long> getLanguageStats() {
        return contentByLanguage.values().stream()
            .collect(Collectors.groupingBy(ContentEntry::getLanguage, Collectors.counting()));
    }

    // Plain dot product: equals cosine similarity only for unit-length vectors
    private double calculateSimilarity(Embedding a, Embedding b) {
        float[] vectorA = a.vector();
        float[] vectorB = b.vector();

        double similarity = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            similarity += vectorA[i] * vectorB[i];
        }
        return similarity;
    }

    @Data
    @AllArgsConstructor
    public static class ContentEntry {
        private String id;
        private String content;
        private String language;
        private Embedding embedding;
    }

    @Data
    @AllArgsConstructor
    public static class MatchResult {
        private ContentEntry content;
        private double score;
    }
}
```

This README covers the main components, usage patterns, and extension points of the `driftkit-embedding` module, which offers a consistent API for working with text embeddings across multiple providers.