# API Reference

This document provides details on the main functions and classes available in the `fast-seqfunc` package.

## Core Functions

### `train_model`

```python
from fast_seqfunc import train_model

model_info = train_model(
    train_data,
    val_data=None,
    test_data=None,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric=None,
    **kwargs
)
```

Trains a sequence-function model using PyCaret.

**Parameters**:

- `train_data`: DataFrame or path to CSV file with training data.
- `val_data`: Optional validation data (not directly used, reserved for future).
- `test_data`: Optional test data for final evaluation.
- `sequence_col`: Column name containing sequences.
- `target_col`: Column name containing target values.
- `embedding_method`: Method to use for embedding sequences. Currently only "one-hot" is supported.
- `model_type`: Type of modeling problem ("regression" or "classification").
- `optimization_metric`: Metric to optimize during model selection (e.g., "r2", "accuracy", "f1").
- `**kwargs`: Additional arguments passed to PyCaret setup.

**Returns**:

- Dictionary containing the trained model and related metadata.

### `predict`

```python
from fast_seqfunc import predict

predictions = predict(
    model_info,
    sequences,
    sequence_col="sequence"
)
```

Generates predictions for new sequences using a trained model.

**Parameters**:

- `model_info`: Dictionary from `train_model` containing model and related information.
- `sequences`: Sequences to predict (list, Series, or DataFrame).
- `sequence_col`: Column name in DataFrame containing sequences.

**Returns**:

- Array of predictions.
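
Both `train_model` and `predict` embed sequences before the model sees them. As a rough illustration of what `embedding_method="one-hot"` means, the sketch below encodes a sequence by hand in plain Python. It is not the package's implementation: the `one_hot_encode` helper and the fixed `"ACGT"` alphabet are assumptions made for this example, and the flattened output layout may differ from what `fast-seqfunc` produces.

```python
# Illustrative only: a hand-rolled one-hot encoder, not fast-seqfunc's code.
def one_hot_encode(sequence, alphabet):
    """Return a flat list of 0/1 values, one block of len(alphabet) per position."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence:
        block = [0] * len(alphabet)
        block[index[symbol]] = 1  # raises KeyError for symbols outside the alphabet
        encoded.extend(block)
    return encoded


dna_alphabet = "ACGT"  # assumed alphabet for this example
vector = one_hot_encode("GATTACA", dna_alphabet)

print(len(vector))  # 7 positions x 4 symbols = 28 features
print(vector[:4])   # first position "G" -> [0, 0, 1, 0]
```

Each position contributes one block with a single `1`, so sequences of equal length map to feature vectors of equal length, which is the shape a tabular model expects.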

### `save_model`

```python
from fast_seqfunc import save_model

save_model(model_info, path)
```

Saves the model to disk.

**Parameters**:

- `model_info`: Dictionary containing model and related information.
- `path`: Path to save the model.

**Returns**:

- None

### `load_model`

```python
from fast_seqfunc import load_model

model_info = load_model(path)
```

Loads a trained model from disk.

**Parameters**:

- `path`: Path to saved model file.

**Returns**:

- Dictionary containing the model and related information.

## Embedder Classes

### `OneHotEmbedder`

```python
from fast_seqfunc.embedders import OneHotEmbedder

embedder = OneHotEmbedder(sequence_type="auto")
embeddings = embedder.fit_transform(sequences)
```

One-hot encoding for protein or nucleotide sequences.

**Parameters**:

- `sequence_type`: Type of sequences to encode ("protein", "dna", "rna", or "auto").

**Methods**:

- `fit(sequences)`: Determine alphabet and set up the embedder.
- `transform(sequences)`: Transform sequences to one-hot encodings.
- `fit_transform(sequences)`: Fit and transform in one step.

## Helper Functions

### `get_embedder`

```python
from fast_seqfunc.embedders import get_embedder

embedder = get_embedder(method="one-hot")
```

Get an embedder instance based on method name.

**Parameters**:

- `method`: Embedding method (currently only "one-hot" is supported).

**Returns**:

- Configured embedder instance.

### `evaluate_model`

```python
from fast_seqfunc.core import evaluate_model

metrics = evaluate_model(
    model,
    X_test,
    y_test,
    embedder,
    model_type,
    embed_cols
)
```

Evaluate model performance on test data.

**Parameters**:

- `model`: Trained model.
- `X_test`: Test sequences.
- `y_test`: True target values.
- `embedder`: Embedder to transform sequences.
- `model_type`: Type of model (regression or classification).
- `embed_cols`: Column names for embedded features.

**Returns**:

- Dictionary of performance metrics.
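
This reference does not list which metrics the returned dictionary contains. To give a sense of what a regression metrics dictionary might look like, here is a minimal standalone sketch. The `regression_metrics` helper and the keys `"r2"`, `"rmse"`, and `"mae"` are assumptions for illustration; the actual keys returned by `evaluate_model` may differ.

```python
import math


# Illustrative only: a hypothetical helper, not part of fast-seqfunc.
def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE for paired true and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    return {
        # R^2 is undefined when all true values are identical
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)  # r2 is close to 0.98 for this toy data
```

For classification problems the analogous dictionary would hold metrics such as accuracy or F1, matching the `optimization_metric` examples given for `train_model`.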