# API Reference

This document provides details on the main functions and classes available in the `fast-seqfunc` package.

## Core Functions

### `train_model`

```python
from fast_seqfunc import train_model

model_info = train_model(
    train_data,
    val_data=None,
    test_data=None,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric=None,
    **kwargs
)
```

Trains a sequence-function model using PyCaret.

**Parameters**:

- `train_data`: DataFrame or path to CSV file with training data.
- `val_data`: Optional validation data (currently unused; reserved for future use).
- `test_data`: Optional test data for final evaluation.
- `sequence_col`: Name of the column containing sequences.
- `target_col`: Name of the column containing target values.
- `embedding_method`: Method to use for embedding sequences. Currently only "one-hot" is supported.
- `model_type`: Type of modeling problem ("regression" or "classification").
- `optimization_metric`: Metric to optimize during model selection (e.g., "r2", "accuracy", "f1").
- `**kwargs`: Additional arguments passed to PyCaret setup.

**Returns**:

- Dictionary containing the trained model and related metadata.
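Because `train_data` may be a path to a CSV file, a small file with the default `sequence` and `function` columns is enough to get started. The sketch below writes such a file with Python's standard `csv` module. The sequences, target values, helper name `write_training_csv`, and file name `train.csv` are made up for illustration, and the final `train_model` call is left as a comment because it needs `fast-seqfunc` installed.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative records only: (sequence, measured function value).
records = [
    ("ACGTAC", 0.12),
    ("ACGTTC", 0.57),
    ("TCGTAC", 0.91),
]


def write_training_csv(path, rows, sequence_col="sequence", target_col="function"):
    """Hypothetical helper: write rows to a CSV whose header matches the train_model defaults."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([sequence_col, target_col])
        writer.writerows(rows)
    return path


csv_path = write_training_csv(Path(tempfile.mkdtemp()) / "train.csv", records)

# With fast-seqfunc installed, the file could then be passed in directly:
# model_info = train_model(csv_path, sequence_col="sequence", target_col="function")
```

If the columns are named differently, the same names would be passed as `sequence_col` and `target_col`.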

### `predict`

```python
from fast_seqfunc import predict

predictions = predict(
    model_info,
    sequences,
    sequence_col="sequence"
)
```

Generates predictions for new sequences using a trained model.

**Parameters**:

- `model_info`: Dictionary from `train_model` containing model and related information.
- `sequences`: Sequences to predict (list, Series, or DataFrame).
- `sequence_col`: Column name in DataFrame containing sequences.

**Returns**:

- Array of predictions.

### `save_model`

```python
from fast_seqfunc import save_model

save_model(model_info, path)
```

Saves the model to disk.

**Parameters**:

- `model_info`: Dictionary containing model and related information.
- `path`: Path to save the model.

**Returns**:

- None

### `load_model`

```python
from fast_seqfunc import load_model

model_info = load_model(path)
```

Loads a trained model from disk.

**Parameters**:

- `path`: Path to saved model file.

**Returns**:

- Dictionary containing the model and related information.

## Embedder Classes

### `OneHotEmbedder`

```python
from fast_seqfunc.embedders import OneHotEmbedder

embedder = OneHotEmbedder(sequence_type="auto")
embeddings = embedder.fit_transform(sequences)
```

One-hot encoding for protein or nucleotide sequences.

**Parameters**:

- `sequence_type`: Type of sequences to encode ("protein", "dna", "rna", or "auto").

**Methods**:

- `fit(sequences)`: Determine alphabet and set up the embedder.
- `transform(sequences)`: Transform sequences to one-hot encodings.
- `fit_transform(sequences)`: Fit and transform in one step.
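To make the `fit`/`transform` split concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not the package's implementation: the class name `SketchOneHotEmbedder`, inferring the alphabet from the fitted sequences, and the flattened position-by-position output layout are all assumptions made for illustration. It also makes no attempt to pad sequences of different lengths.

```python
class SketchOneHotEmbedder:
    """Hypothetical stand-in that illustrates one-hot encoding of sequences."""

    def fit(self, sequences):
        # Determine the alphabet from the characters seen during fitting.
        self.alphabet = sorted(set("".join(sequences)))
        self.index = {char: i for i, char in enumerate(self.alphabet)}
        return self

    def transform(self, sequences):
        # Each position becomes a block of len(alphabet) zeros with a single 1.
        # A character not seen during fit raises KeyError in this sketch.
        encoded = []
        for seq in sequences:
            row = []
            for char in seq:
                block = [0] * len(self.alphabet)
                block[self.index[char]] = 1
                row.extend(block)
            encoded.append(row)
        return encoded

    def fit_transform(self, sequences):
        # Fit on the sequences, then encode the same sequences.
        return self.fit(sequences).transform(sequences)


embeddings = SketchOneHotEmbedder().fit_transform(["ACGT", "TTGA"])
```

In this example each row of `embeddings` has 16 entries: four positions times a four-letter alphabet.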

## Helper Functions

### `get_embedder`

```python
from fast_seqfunc.embedders import get_embedder

embedder = get_embedder(method="one-hot")
```

Get an embedder instance based on method name.

**Parameters**:

- `method`: Embedding method (currently only "one-hot" is supported).

**Returns**:

- Configured embedder instance.
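Looking up an embedder by name is commonly done with a small registry. The sketch below shows that general pattern only. The names `EMBEDDERS`, `PlaceholderEmbedder`, and `get_embedder_sketch`, and the choice to raise `ValueError` for an unknown method, are assumptions; this reference does not say how `get_embedder` is implemented or how it reports unsupported methods.

```python
class PlaceholderEmbedder:
    """Hypothetical embedder used only to keep the sketch self-contained."""


# Hypothetical registry mapping method names to embedder classes.
EMBEDDERS = {"one-hot": PlaceholderEmbedder}


def get_embedder_sketch(method="one-hot", **kwargs):
    """Return an embedder instance for `method`, or raise for unknown names."""
    try:
        embedder_cls = EMBEDDERS[method]
    except KeyError:
        supported = ", ".join(sorted(EMBEDDERS))
        raise ValueError(
            f"Unsupported embedding method {method!r}; supported: {supported}"
        ) from None
    return embedder_cls(**kwargs)


embedder = get_embedder_sketch("one-hot")
```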

### `evaluate_model`

```python
from fast_seqfunc.core import evaluate_model

metrics = evaluate_model(
    model,
    X_test,
    y_test,
    embedder,
    model_type,
    embed_cols
)
```

Evaluates model performance on test data.

**Parameters**:

- `model`: Trained model.
- `X_test`: Test sequences.
- `y_test`: True target values.
- `embedder`: Embedder used to transform sequences.
- `model_type`: Type of model ("regression" or "classification").
- `embed_cols`: Column names for embedded features.

**Returns**:

- Dictionary of performance metrics.
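This reference does not list which metrics the returned dictionary contains. As an illustration only, the sketch below computes a few common ones in plain Python, choosing by `model_type`. The function name `compute_metrics` and the keys `r2`, `rmse`, `mae`, and `accuracy` are assumptions, not the package's documented output.

```python
import math


def compute_metrics(y_true, y_pred, model_type="regression"):
    """Hypothetical helper returning a dictionary of performance metrics."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    if model_type == "classification":
        # Fraction of predictions that match the true labels exactly.
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}

    if model_type != "regression":
        raise ValueError(f"Unknown model_type: {model_type!r}")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        # R^2 is undefined when all targets are identical, so report NaN.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


metrics = compute_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```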