# Scoring

Enable scoring support via the `scoring` parameter.

This scoring instance can serve two purposes, depending on the settings.

One use case is building sparse/keyword indexes. This occurs when the `terms` parameter is set to `True`.

The other use case is word vector term weighting. This feature has been available since the initial version but is less commonly used now.

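As a rough illustration of the two use cases, the scoring settings might look like the dictionaries below. This is a minimal sketch: the choice of `bm25` is arbitrary, the variable names are hypothetical, and only the `scoring` section of a configuration is shown.

```python
# Sparse/keyword index: `terms` enables term frequency sparse arrays.
keyword_scoring = {"method": "bm25", "terms": True}

# Word vector term weighting: `terms` is left unset, so the scoring instance
# only supplies term weights.
weighting_scoring = {"method": "bm25"}

# Either dictionary would be supplied as the `scoring` parameter.
config = {"scoring": keyword_scoring}
```
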
The following covers the available options.

## method
```yaml
method: bm25|tfidf|sif|pgtext|sparse|custom
```

Sets the scoring method. Add custom scoring by setting this parameter to the fully resolvable class string.

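A fully resolvable class string is a dotted path that ends in a class name. The sketch below shows one common way such a string can be resolved in Python. The `resolve` helper is hypothetical and is not the library's own loader, and `mypackage.scoring.MyScoring` is a made-up name used only for illustration.

```python
import importlib

def resolve(path):
    """Import the object named by a dotted 'package.module.ClassName' string."""
    module, _, name = path.rpartition(".")
    return getattr(importlib.import_module(module), name)

# Demonstrated with a standard library class. A custom scoring class would be
# referenced the same way, e.g. "mypackage.scoring.MyScoring" (hypothetical).
cls = resolve("collections.OrderedDict")
```
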
### pgtext
```yaml
schema: database schema to store keyword index - defaults to being
        determined by the database
```

Additional settings for Postgres full-text keyword indexes.

### sparse
```yaml
path: sparse vector model path
vectormethod: vector embeddings method
vectornormalize: enable vector embeddings normalization (boolean)
gpu: boolean|int|string|device
normalize: enable score normalization (boolean|float|string|dict)
batch: Sets the transform batch size
encodebatch: Sets the encode batch size
vectors: additional model init args
encodeargs: additional encode() args
backend: ivfsparse|pgsparse
```

Sparse vector scoring options. The sparse scoring instance combines a sparse vector model with a sparse approximate nearest neighbor (ANN) index. This method supports both vector normalization and score normalization.

Vector normalization normalizes all vectors to have a magnitude of 1. As a result, all generated scores fall between 0 and 1.

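The effect of vector normalization can be sketched in a few lines. The example assumes sparse vectors are dictionaries mapping a term id to a non-negative weight. That representation, and the `normalize` and `dot` helpers, are illustrative choices rather than the library's internals.

```python
import math

def normalize(vector):
    """Scale a sparse vector (term id -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {k: w / norm for k, w in vector.items()} if norm else dict(vector)

def dot(a, b):
    """Dot product over the term ids both vectors share."""
    return sum(w * b[k] for k, w in a.items() if k in b)

query = normalize({1: 2.0, 7: 1.0})
document = normalize({1: 1.0, 3: 4.0, 7: 0.5})

# With unit-length, non-negative vectors the dot product stays within [0, 1].
score = dot(query, document)
```

A vector scored against itself yields 1, the upper bound, while vectors with no shared terms score 0.
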
Score normalization scales the output between 0 and 1. This setting supports:

- `True` for default scale normalization
- a `float` to normalize using that value as the scale factor
- `"bayes"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

#### ivfsparse
```yaml
ivfsparse:
    sample: percent of data to use for model training (0.0 - 1.0)
    nfeatures: top n features to use for model training (int)
    nlist: desired number of clusters (int)
    nprobe: search probe setting (int)
    minpoints: minimum number of points for a cluster (int)
```

Inverted file (IVF) index with flat vector file storage and sparse array support.

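To give a feel for what `nlist` and `nprobe` control, here is a toy inverted file search over dictionary-based sparse vectors. It is a simplified sketch, not the library's implementation: the centroids are fixed by hand instead of being trained from a data sample, and every function name is hypothetical.

```python
def dot(a, b):
    """Dot product over the term ids both sparse vectors share."""
    return sum(w * b[k] for k, w in a.items() if k in b)

def nearest(vector, centroids, n):
    """Ids of the n centroids with the highest dot product to vector."""
    order = sorted(range(len(centroids)), key=lambda i: dot(vector, centroids[i]), reverse=True)
    return order[:n]

def build(vectors, centroids):
    """Assign each vector id to the list of its closest centroid."""
    lists = [[] for _ in centroids]
    for uid, vector in enumerate(vectors):
        lists[nearest(vector, centroids, 1)[0]].append(uid)
    return lists

def search(query, vectors, centroids, lists, nprobe, limit):
    """Score only the vectors stored in the nprobe closest clusters."""
    candidates = [uid for c in nearest(query, centroids, nprobe) for uid in lists[c]]
    scored = sorted(((uid, dot(query, vectors[uid])) for uid in candidates), key=lambda x: x[1], reverse=True)
    return scored[:limit]

vectors = [{0: 1.0}, {0: 0.9, 1: 0.1}, {2: 1.0}, {2: 0.8, 3: 0.2}]
centroids = [{0: 1.0}, {2: 1.0}]  # two clusters (nlist = 2), fixed rather than trained
lists = build(vectors, centroids)
results = search({0: 1.0}, vectors, centroids, lists, nprobe=1, limit=2)
```

Raising `nprobe` scans more clusters per query, which generally trades speed for recall.
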
#### pgsparse

Sparse ANN backed by Postgres. Supports the same options as the [pgvector](../ann/#pgvector) ANN.

## terms
```yaml
terms: boolean|dict
```

Enables term frequency sparse arrays for a scoring instance. This is the backend for sparse keyword indexes.

Supports a `dict` with the parameters `cachelimit` and `cutoff`.

`cachelimit` is the maximum amount of resident memory in bytes to use during indexing before flushing to disk. This parameter is an `int`.

`cutoff` is used during search to determine what constitutes a common term. This parameter is a `float`, e.g. 0.1 for a cutoff of 10%.

When `terms` is set to `True`, default parameters are used for `cachelimit` and `cutoff`. Normally, these defaults are sufficient.

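A `dict` form of `terms` might look like the following. The values are illustrative only and are not the library defaults. The `is_common` helper is a hypothetical reading of `cutoff`, assuming the percentage is measured against the total number of indexed documents.

```python
# Illustrative values only; these are not the library defaults.
scoring = {
    "method": "bm25",
    "terms": {
        "cachelimit": 100_000_000,  # flush to disk after roughly 100 MB
        "cutoff": 0.1,              # 10% threshold for common terms
    },
}

def is_common(docfreq, total, cutoff):
    """Assumed interpretation: a term is common when the number of documents
    containing it exceeds cutoff * total documents."""
    return docfreq > cutoff * total

cutoff = scoring["terms"]["cutoff"]
```

Under that assumption, a term found in 150 of 1,000 documents would be treated as common at a cutoff of 0.1, while a term found in 50 would not.
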
## normalize
```yaml
normalize: boolean|str|dict
```

Enables normalized scoring (ranging from 0 to 1). This setting supports:

- `True` for standard score normalization
- `"bayes"` | `"bb25"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

When standard normalization is enabled, statistics from the index are used to calculate normalized scores.
When Bayesian/BB25 normalization is enabled, it uses positive-score candidates, dynamic `beta=median(scores)`, adaptive
`alpha_eff=alpha/std(scores)` and a sigmoid transform (likelihood-only variant with flat prior) to map scores to `[0, 1]`.

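The Bayesian transform described above can be sketched as follows. This is an approximation written from the description, not the library's code. The use of the population standard deviation, the fallback when the deviation is zero, and mapping non-positive scores to 0 are all assumptions, and `bayes_normalize` is a hypothetical name.

```python
import math
import statistics

def bayes_normalize(scores, alpha=1.0, beta=None):
    """Map raw scores to [0, 1] with a sigmoid centered on beta."""
    positive = [s for s in scores if s > 0]
    if not positive:
        return [0.0 for _ in scores]

    # Dynamic midpoint: median of the positive candidate scores unless supplied
    beta = statistics.median(positive) if beta is None else beta

    # Adaptive slope: alpha divided by the standard deviation of the scores
    std = statistics.pstdev(positive)
    alpha_eff = alpha / std if std > 0 else alpha

    # Sigmoid transform; non-positive scores are assumed to map to 0
    return [1.0 / (1.0 + math.exp(-alpha_eff * (s - beta))) if s > 0 else 0.0 for s in scores]

normalized = bayes_normalize([12.4, 8.1, 3.3, 0.0])
```

With a dynamic `beta`, the median candidate lands at exactly 0.5, and dividing `alpha` by the standard deviation keeps the spread of the outputs similar regardless of the raw score scale.
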
Bayesian normalization references:

- [https://github.com/instructkr/bb25](https://github.com/instructkr/bb25)
- [https://github.com/cognica-io/bayesian-bm25](https://github.com/cognica-io/bayesian-bm25)

## tokenizer
```yaml
tokenizer: dict
```

Set tokenization rules. Passes these arguments to the underlying [Tokenization pipeline](../../../pipeline/data/tokenizer#txtai.pipeline.Tokenizer.__init__).
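
As an illustration, a scoring configuration with tokenizer settings might look like the sketch below. The argument names `lowercase`, `alphanum` and `stopwords` are assumptions about what the Tokenizer accepts and should be checked against the linked pipeline documentation.

```python
scoring = {
    "method": "bm25",
    "terms": True,
    # Assumed Tokenizer arguments; confirm the names in the pipeline docs
    "tokenizer": {"lowercase": True, "alphanum": True, "stopwords": True},
}
```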