# Scoring

Enable scoring support via the `scoring` parameter.

This scoring instance can serve two purposes, depending on the settings.

One use case is building sparse/keyword indexes. This occurs when the `terms` parameter is set to `True`.

The other use case is term weighting for word vector models. This feature has been available since the initial release but is less commonly used today.

The following covers the available options.

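For example, a minimal configuration sketch that builds a BM25 keyword index (the parameters are covered below):

```yaml
scoring:
  method: bm25
  terms: true
```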
## method
```yaml
method: bm25|tfidf|sif|pgtext|sparse|custom
```

Sets the scoring method. Add a custom scoring method by setting this parameter to the fully resolvable class string.

### pgtext
```yaml
schema: database schema to store keyword index - defaults to being
        determined by the database
```

Additional settings for Postgres full-text keyword indexes.

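For example, a configuration sketch that stores the keyword index in a specific schema (the schema name is illustrative):

```yaml
scoring:
  method: pgtext
  schema: txtai
```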
### sparse
```yaml
path: sparse vector model path
vectormethod: vector embeddings method
vectornormalize: enable vector embeddings normalization (boolean)
gpu: boolean|int|string|device
normalize: enable score normalization (boolean|float|string|dict)
batch: sets the transform batch size
encodebatch: sets the encode batch size
vectors: additional model init args
encodeargs: additional encode() args
backend: ivfsparse|pgsparse
```

Sparse vector scoring options. The sparse scoring instance combines a sparse vector model with a sparse approximate nearest neighbor (ANN) index. This method supports both vector normalization and score normalization.

Vector normalization normalizes all vectors to have a magnitude of 1. By extension, all generated scores will be between 0 and 1.

Score normalization scales the output between 0 and 1. This setting supports:

- `True` for default scale normalization
- `float` normalize using this as the scale factor
- `"bayes"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

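For example, a sparse scoring configuration sketch combining vector normalization and score normalization (the model path is illustrative - substitute any sparse vector model):

```yaml
scoring:
  method: sparse
  path: opensearch-project/opensearch-neural-sparse-encoding-doc-v2-mini
  vectornormalize: true
  normalize: true
  backend: ivfsparse
```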
#### ivfsparse
```yaml
ivfsparse:
  sample: percent of data to use for model training (0.0 - 1.0)
  nfeatures: top n features to use for model training (int)
  nlist: desired number of clusters (int)
  nprobe: search probe setting (int)
  minpoints: minimum number of points for a cluster (int)
```

Inverted file (IVF) index with flat vector file storage and sparse array support.

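For example, a configuration sketch tuning the IVF index (values are illustrative; a higher `nprobe` trades search speed for recall):

```yaml
scoring:
  method: sparse
  backend: ivfsparse
  ivfsparse:
    sample: 0.25
    nlist: 64
    nprobe: 8
```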
#### pgsparse

Sparse ANN backed by Postgres. Supports the same options as the [pgvector](../ann/#pgvector) ANN.

## terms
```yaml
terms: boolean|dict
```

Enables term frequency sparse arrays for a scoring instance. This is the backend for sparse keyword indexes.

Supports a `dict` with the parameters `cachelimit` and `cutoff`.

`cachelimit` is the maximum amount of resident memory in bytes to use during indexing before flushing to disk. This parameter is an `int`.

`cutoff` is used during search to determine what constitutes a common term. This parameter is a `float`, e.g. 0.1 for a cutoff of 10%.

When `terms` is set to `True`, default parameters are used for `cachelimit` and `cutoff`. Normally, these defaults are sufficient.

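For example, a configuration sketch that overrides both defaults (values are illustrative):

```yaml
scoring:
  method: bm25
  terms:
    cachelimit: 268435456
    cutoff: 0.1
```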
## normalize
```yaml
normalize: boolean|str|dict
```

Enables normalized scoring (ranging from 0 to 1). This setting supports:

- `True` for standard score normalization
- `"bayes"` | `"bb25"` for Bayesian normalization using dynamic candidate score statistics
- `{method: "bayes", alpha: 1.0, beta: null}` for Bayesian normalization with optional custom parameters

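For example, a configuration sketch enabling Bayesian normalization with a custom `alpha` (value is illustrative):

```yaml
scoring:
  method: bm25
  terms: true
  normalize:
    method: bayes
    alpha: 1.0
```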
When standard normalization is enabled, statistics from the index are used to calculate normalized scores.
When Bayesian/BB25 normalization is enabled, it uses positive-score candidates, dynamic `beta=median(scores)`, adaptive
`alpha_eff=alpha/std(scores)` and a sigmoid transform (likelihood-only variant with flat prior) to map scores to `[0, 1]`.

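The Bayesian transform described above can be sketched in a few lines of Python. This is not the library's implementation, just an illustration of the `beta=median`, `alpha_eff=alpha/std` and sigmoid steps:

```python
import math
import statistics

def bayesnormalize(scores, alpha=1.0):
    # Sketch of the Bayesian/BB25-style normalization described above
    # Keep only positive-score candidates
    positive = [s for s in scores if s > 0]

    # Dynamic parameters from candidate score statistics
    beta = statistics.median(positive)
    alphaeff = alpha / (statistics.stdev(positive) or 1.0)

    # Sigmoid transform maps each score to [0, 1]
    return [1.0 / (1.0 + math.exp(-alphaeff * (s - beta))) for s in positive]

print(bayesnormalize([1.0, 2.0, 5.0, 9.0]))
```

Scores below the median map under 0.5 and scores above it map over 0.5, with relative ordering preserved.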
Bayesian normalization references:

- [https://github.com/instructkr/bb25](https://github.com/instructkr/bb25)
- [https://github.com/cognica-io/bayesian-bm25](https://github.com/cognica-io/bayesian-bm25)

## tokenizer
```yaml
tokenizer: dict
```

Set tokenization rules. Passes these arguments to the underlying [Tokenization pipeline](../../../pipeline/data/tokenizer#txtai.pipeline.Tokenizer.__init__).
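For example, a configuration sketch passing tokenizer arguments (the parameter names shown are assumptions - check the linked Tokenizer reference for the supported arguments):

```yaml
scoring:
  method: bm25
  terms: true
  tokenizer:
    lowercase: true
    stopwords: true
```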