# Scoring

Enable scoring support via the `scoring` parameter.

This scoring instance can serve two purposes, depending on the settings.

One use case is building sparse/keyword indexes. This occurs when the `terms` parameter is set to `True`.

The other use case is with word vector term weighting. This feature has been available since the initial version but isn't quite as common anymore.

The following covers the available options.

## method
```yaml
method: bm25|tfidf|sif|pgtext|sparse|custom
```

Sets the scoring method. Add custom scoring via setting this parameter to the fully resolvable class string.

### pgtext
```yaml
schema: database schema to store keyword index - defaults to being
        determined by the database
```

Additional settings for Postgres full-text keyword indexes.

### sparse
```yaml
path: sparse vector model path
vectormethod: vector embeddings method
vectornormalize: enable vector embeddings normalization (boolean)
gpu: boolean|int|string|device
normalize: enable score normalization (boolean|float|string|dict)
batch: Sets the transform batch size
encodebatch: Sets the encode batch size
vectors: additional model init args
encodeargs: additional encode() args
backend: ivfsparse|pgsparse
```

Sparse vector scoring options. The sparse scoring instance combines a sparse vector model with a sparse approximate nearest neighbor index (ANN). This method supports both vector normalization and score normalization.

Vector normalization normalizes all vectors to have a magnitude of 1. By extension, all generated scores will be 0 to 1.

Score normalization scales the output between 0 and 1. This setting supports:

- `True` for default scale normalization
- `float` normalize using this as the scale factor
- `"bayes"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

#### ivfsparse
```yaml
ivfsparse:
    sample: percent of data to use for model training (0.0 - 1.0)
    nfeatures: top n features to use for model training (int)
    nlist: desired number of clusters (int)
    nprobe: search probe setting (int)
    minpoints: minimum number of points for a cluster (int)
```

Inverted file (IVF) index with flat vector file storage and sparse array support.

#### pgsparse

Sparse ANN backed by Postgres. Supports the same options as the [pgvector](../ann/#pgvector) ANN.

## terms
```yaml
terms: boolean|dict
```

Enables term frequency sparse arrays for a scoring instance. This is the backend for sparse keyword indexes.

Supports a `dict` with the parameters `cachelimit` and `cutoff`.

`cachelimit` is the maximum amount of resident memory in bytes to use during indexing before flushing to disk. This parameter is an `int`.

`cutoff` is used during search to determine what constitutes a common term. This parameter is a `float`, e.g. 0.1 for a cutoff of 10%.

When `terms` is set to `True`, default parameters are used for the `cachelimit` and `cutoff`. Normally, these defaults are sufficient.

## normalize
```yaml
normalize: boolean|str|dict
```

Enables normalized scoring (ranging from 0 to 1). This setting supports:

- `True` for standard score normalization
- `"bayes"` | `"bb25"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

When standard normalization is enabled, statistics from the index are used to calculate normalized scores.
When Bayesian/BB25 normalization is enabled, it uses positive-score candidates, dynamic `beta=median(scores)`, adaptive `alpha_eff=alpha/std(scores)` and a sigmoid transform (likelihood-only variant with flat prior) to map scores to `[0, 1]`.

Bayesian normalization references:

- [https://github.com/instructkr/bb25](https://github.com/instructkr/bb25)
- [https://github.com/cognica-io/bayesian-bm25](https://github.com/cognica-io/bayesian-bm25)

## tokenizer
```yaml
tokenizer: dict
```

Set tokenization rules. Passes these arguments to the underlying [Tokenization pipeline](../../../pipeline/data/tokenizer#txtai.pipeline.Tokenizer.__init__).
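
As a rough illustration of how the options on this page fit together, the sketch below builds a scoring configuration as a plain Python dictionary. All values are hypothetical placeholders rather than recommended settings. The `lowercase` and `stopwords` keys under `tokenizer` are assumed Tokenizer arguments and should be checked against the Tokenizer pipeline documentation.

```python
import json

# Hypothetical scoring configuration; every value here is illustrative only.
scoring = {
    # One of the methods listed under `method`
    "method": "bm25",
    # Dict form of `terms`, enabling a sparse keyword index
    "terms": {
        # Resident memory in bytes to use during indexing before flushing to disk
        "cachelimit": 250_000_000,
        # Cutoff of 10% for deciding what counts as a common term during search
        "cutoff": 0.1,
    },
    # Standard score normalization (scores range from 0 to 1)
    "normalize": True,
    # Arguments passed to the Tokenizer pipeline (assumed argument names)
    "tokenizer": {
        "lowercase": True,
        "stopwords": True,
    },
}

# Print the configuration to inspect the final structure
print(json.dumps(scoring, indent=2))
```

A dictionary like this would be supplied as the value of the `scoring` parameter. Setting `terms` to `True` instead of a dictionary falls back to the default `cachelimit` and `cutoff`.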