regression_tutorial.md
# Sequence Regression with Fast-SeqFunc

This tutorial demonstrates how to use `fast-seqfunc` for regression problems, where you want to predict continuous values from biological sequences.

## Overview

In sequence regression, we want to learn to predict a continuous value (e.g., binding affinity, enzyme efficiency, or protein stability) from a biological sequence (DNA, RNA, or protein). This tutorial will walk you through:

1. Setting up your environment
2. Preparing sequence-function data
3. Training a regression model
4. Evaluating model performance
5. Making predictions on new sequences
6. Visualizing results

## Prerequisites

- Python 3.11 or higher
- The following packages:

```bash
pip install fast-seqfunc pandas numpy matplotlib seaborn scikit-learn loguru
```

## Setup

First, let's import all the necessary packages:

```python
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from loguru import logger

from fast_seqfunc import train_model, predict, save_model, load_model
```

## Working with Sequence-Function Data

Sequence-function data typically consists of biological sequences paired with measurements of a functional property.
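As a concrete illustration, consider the GC-content task used throughout this tutorial: each DNA sequence is paired with the fraction of G and C bases it contains. A minimal pure-Python sketch of that sequence-to-target mapping:

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in a DNA sequence that are G or C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)

# Each (sequence, measurement) pair is one row of sequence-function data
pairs = [(seq, gc_content(seq)) for seq in ["ATGC", "GGGG", "ATAT"]]
print(pairs)  # [('ATGC', 0.5), ('GGGG', 1.0), ('ATAT', 0.0)]
```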
For this tutorial, we'll create synthetic data:

```python
from fast_seqfunc import generate_dataset_by_task

# Generate a GC-content dataset as an example
# (the target function is simply the GC content of each DNA sequence)
data = generate_dataset_by_task(
    task="gc_content",
    count=1000,       # Number of sequences to generate
    length=50,        # Sequence length
    noise_level=0.1,  # Add some noise to make the task more realistic
)

# Examine the data
print(data.head())
print(f"Data shape: {data.shape}")
print(f"Target distribution: min={data['function'].min():.3f}, "
      f"max={data['function'].max():.3f}, "
      f"mean={data['function'].mean():.3f}")
```

### Preparing Your Own Data

If you have your own data, it should be structured as a DataFrame with at least two columns:

- A column containing the sequences (e.g., "sequence")
- A column containing the target values (e.g., "function")

For example:

```python
# Load your own data
# data = pd.read_csv("your_sequence_function_data.csv")
```

## Splitting Data for Training and Testing

It's important to evaluate your model on data it hasn't seen during training:

```python
# Split into train and test sets (80/20 split)
# (for real data, shuffle first, e.g. data = data.sample(frac=1, random_state=42))
train_size = int(0.8 * len(data))
train_data = data[:train_size].copy()
test_data = data[train_size:].copy()

logger.info(f"Data split: {len(train_data)} train, {len(test_data)} test samples")

# Create an output directory for results
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
```

## Training a Regression Model

Now we can train a regression model using `fast-seqfunc`:

```python
# Train a regression model
logger.info("Training regression model...")
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",     # Column containing sequences
    target_col="function",       # Column containing target values
    embedding_method="one-hot",  # Method to convert sequences to numerical features
    model_type="regression",     # Specify a regression task
    optimization_metric="r2",    # Metric to optimize (r2, rmse, mae)
)

# Display test results
if model_info.get("test_results"):
    logger.info("Test metrics from training:")
    for metric, value in model_info["test_results"].items():
        logger.info(f"  {metric}: {value:.4f}")

# Save the model for later use
model_path = output_dir / "regression_model.pkl"
save_model(model_info, model_path)
logger.info(f"Model saved to {model_path}")
```

### Understanding Embedding Methods

The `embedding_method` parameter determines how sequences are converted to numerical features:

- `"one-hot"`: Each position in the sequence is encoded as a one-hot vector indicating which amino acid or nucleotide is present.

Future versions of `fast-seqfunc` will include more advanced embedding methods such as ESM2 for proteins and CARP for nucleic acids.
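To make the one-hot idea concrete, here is a minimal NumPy sketch (illustrative only, not `fast-seqfunc`'s internal implementation) that encodes a sequence into a length-by-alphabet matrix:

```python
import numpy as np

def one_hot_encode(seq: str, alphabet: str = "ACGT") -> np.ndarray:
    """Encode a sequence as a (len(seq), len(alphabet)) one-hot matrix."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, char in enumerate(seq.upper()):
        encoded[pos, index[char]] = 1.0
    return encoded

matrix = one_hot_encode("ACGT")
print(matrix)  # Each row has exactly one 1, marking the base at that position
```

In practice the per-position vectors are flattened into a single feature vector per sequence before being handed to a tabular model, which is why fixed-length sequences are the simplest case.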
## Making Predictions

After training, you can use your model to make predictions on new sequences:

```python
# Generate some new data for prediction
new_data = generate_dataset_by_task(
    task="gc_content",
    count=200,
    length=50,
)

# Make predictions
predictions = predict(model_info, new_data["sequence"])

# Create a results DataFrame
results_df = new_data.copy()
results_df["predicted"] = predictions
results_df.to_csv(output_dir / "regression_predictions.csv", index=False)

print(results_df.head())
```

## Evaluating Regression Performance

Let's evaluate our model more thoroughly:

```python
# Predict on the held-out test set
true_values = test_data["function"]
predicted_values = predict(model_info, test_data["sequence"])

# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
rmse = np.sqrt(mse)
r2 = r2_score(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)

# Print metrics
print("Test Set Performance:")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  R²:   {r2:.4f}")
print(f"  MAE:  {mae:.4f}")
```

## Visualizing Results

Visualizations help in understanding model performance:

```python
# Create a predicted vs. actual scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(true_values, predicted_values, alpha=0.6)
plt.plot([true_values.min(), true_values.max()],
         [true_values.min(), true_values.max()],
         'r--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_scatter_plot.png", dpi=300)

# Plot residuals
residuals = true_values - predicted_values
plt.figure(figsize=(10, 8))
plt.scatter(predicted_values, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals.png", dpi=300)

# Plot the distribution of residuals
plt.figure(figsize=(10, 8))
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Distribution of Residuals")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals_distribution.png", dpi=300)

logger.info("Visualizations saved to output directory")
```

## Working with Different Sequence Types

`fast-seqfunc` automatically detects and handles different types of biological sequences:

- DNA (containing A, C, G, T)
- RNA (containing A, C, G, U)
- Proteins (containing amino acid letters)

```python
# Example with protein sequences
from fast_seqfunc import generate_random_sequences

# Generate random protein sequences
protein_sequences = generate_random_sequences(
    length=30,
    count=100,
    alphabet="ACDEFGHIKLMNPQRSTVWY",  # Protein alphabet
    fixed_length=True,
)

# Create a dummy target (e.g., the fraction of hydrophobic residues)
hydrophobic = "AVILMFYW"
function_values = [
    sum(seq.count(aa) for aa in hydrophobic) / len(seq)
    for seq in protein_sequences
]

# Create the dataset
protein_data = pd.DataFrame({
    "sequence": protein_sequences,
    "function": function_values,
})

# Now you could train a model on this protein data using the same workflow
```
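How might automatic detection work? One simple heuristic, shown here as an illustrative sketch only (this is not `fast-seqfunc`'s actual logic), is to compare a sequence's character set against each alphabet in turn:

```python
def guess_sequence_type(seq: str) -> str:
    """Guess the sequence type from its character set (illustrative heuristic)."""
    chars = set(seq.upper())
    if chars <= set("ACGT"):
        return "dna"
    if chars <= set("ACGU"):
        return "rna"
    if chars <= set("ACDEFGHIKLMNPQRSTVWY"):
        return "protein"
    return "unknown"

print(guess_sequence_type("ATGCATGC"))  # dna
print(guess_sequence_type("AUGGCU"))    # rna
print(guess_sequence_type("MKLVNQW"))   # protein
```

Note the inherent ambiguity: a protein composed only of A, C, G, and T residues would be misclassified as DNA by this heuristic, which is one reason a real library may also let you specify the sequence type explicitly.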
## Advanced Model Training Options

`fast-seqfunc` uses PyCaret behind the scenes, which allows for further customization:

```python
# Example with more options
advanced_model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric="r2",
    # Additional PyCaret setup options:
    n_jobs=-1,                 # Use all available CPU cores
    fold=5,                    # 5-fold cross-validation
    normalize=True,            # Normalize features
    polynomial_features=True,  # Generate polynomial features
    feature_selection=True,    # Perform feature selection
)
```

## Loading and Reusing Models

You can load saved models for reuse:

```python
# Load a previously saved model
loaded_model_info = load_model(model_path)

# Use the loaded model for predictions
new_predictions = predict(loaded_model_info, new_data["sequence"])

# Verify that the predictions match those from the original model
print(np.allclose(predictions, new_predictions))  # Should print True
```

## Conclusion

You've now learned how to:

1. Prepare sequence-function data
2. Train a regression model using `fast-seqfunc`
3. Make predictions on new sequences
4. Evaluate model performance
5. Visualize results

For more advanced features and applications, check out the [API reference](../api_reference.md) and [additional tutorials](./classification_tutorial.md).

## Next Steps

- Try different regression tasks (e.g., "motif_position", "interaction")
- Experiment with different model parameters
- Apply these techniques to your own sequence-function data