regression_tutorial.md
# Sequence Regression with Fast-SeqFunc

This tutorial demonstrates how to use `fast-seqfunc` for regression problems, where you want to predict continuous values from biological sequences.

## Overview

In sequence regression, we want to learn to predict a continuous value (e.g., binding affinity, enzyme efficiency, or protein stability) from a biological sequence (DNA, RNA, or protein). This tutorial will walk you through:

1. Setting up your environment
2. Preparing sequence-function data
3. Training a regression model
4. Evaluating model performance
5. Making predictions on new sequences
6. Visualizing results

## Prerequisites

- Python 3.11 or higher
- The following packages:

```bash
pip install fast-seqfunc pandas numpy matplotlib seaborn scikit-learn loguru
```

## Setup

First, let's import all the necessary packages:

```python
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from loguru import logger

from fast_seqfunc import train_model, predict, save_model, load_model
```

## Working with Sequence-Function Data

Sequence-function data typically consists of biological sequences paired with measurements of a functional property.
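As a concrete illustration, consider the GC-content task used throughout this tutorial: each DNA sequence is paired with the fraction of G and C bases it contains. A minimal pure-Python sketch of that sequence-to-target mapping:

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in a DNA sequence that are G or C."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)

# Each (sequence, measurement) pair is one row of sequence-function data
pairs = [(seq, gc_content(seq)) for seq in ["ATGC", "GGGG", "ATAT"]]
print(pairs)  # [('ATGC', 0.5), ('GGGG', 1.0), ('ATAT', 0.0)]
```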
For this tutorial, we'll create synthetic data:

```python
from fast_seqfunc import generate_dataset_by_task

# Generate a GC-content dataset as an example
# (the target function is simply the GC content of each DNA sequence)
data = generate_dataset_by_task(
    task="gc_content",
    count=1000,       # Number of sequences to generate
    length=50,        # Sequence length
    noise_level=0.1,  # Add some noise to make the task more realistic
)

# Examine the data
print(data.head())
print(f"Data shape: {data.shape}")
print(f"Target distribution: min={data['function'].min():.3f}, "
      f"max={data['function'].max():.3f}, "
      f"mean={data['function'].mean():.3f}")
```

### Preparing Your Own Data

If you have your own data, it should be structured as a DataFrame with at least two columns:

- A column containing the sequences (e.g., "sequence")
- A column containing the target values (e.g., "function")

For example:

```python
# Load your own data
# data = pd.read_csv("your_sequence_function_data.csv")
```

## Splitting Data for Training and Testing

It's important to evaluate your model on data it hasn't seen during training:

```python
# Split into train and test sets (80/20 split)
# (for real data, shuffle first, e.g. data = data.sample(frac=1, random_state=42))
train_size = int(0.8 * len(data))
train_data = data[:train_size].copy()
test_data = data[train_size:].copy()

logger.info(f"Data split: {len(train_data)} train, {len(test_data)} test samples")

# Create an output directory for results
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
```

## Training a Regression Model

Now we can train a regression model using `fast-seqfunc`:

```python
# Train a regression model
logger.info("Training regression model...")
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",     # Column containing sequences
    target_col="function",       # Column containing target values
    embedding_method="one-hot",  # Method to convert sequences to numerical features
    model_type="regression",     # Specify a regression task
    optimization_metric="r2",    # Metric to optimize (r2, rmse, mae)
)

# Display test results
if model_info.get("test_results"):
    logger.info("Test metrics from training:")
    for metric, value in model_info["test_results"].items():
        logger.info(f"  {metric}: {value:.4f}")

# Save the model for later use
model_path = output_dir / "regression_model.pkl"
save_model(model_info, model_path)
logger.info(f"Model saved to {model_path}")
```

### Understanding Embedding Methods

The `embedding_method` parameter determines how sequences are converted to numerical features:

- `"one-hot"`: Each position in the sequence is encoded as a one-hot vector indicating which amino acid or nucleotide is present.

Future versions of `fast-seqfunc` will include more advanced embedding methods such as ESM2 for proteins and CARP for nucleic acids.
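To make the one-hot idea concrete, here is a minimal NumPy sketch (illustrative only, not `fast-seqfunc`'s internal implementation) that encodes a sequence into a length-by-alphabet matrix:

```python
import numpy as np

def one_hot_encode(seq: str, alphabet: str = "ACGT") -> np.ndarray:
    """Encode a sequence as a (len(seq), len(alphabet)) one-hot matrix."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoded = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, char in enumerate(seq.upper()):
        encoded[pos, index[char]] = 1.0
    return encoded

matrix = one_hot_encode("ACGT")
print(matrix)  # Each row has exactly one 1, marking the base at that position
```

In practice the per-position vectors are flattened into a single feature vector per sequence before being handed to a tabular model, which is why fixed-length sequences are the simplest case.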
## Making Predictions

After training, you can use your model to make predictions on new sequences:

```python
# Generate some new data for prediction
new_data = generate_dataset_by_task(
    task="gc_content",
    count=200,
    length=50,
)

# Make predictions
predictions = predict(model_info, new_data["sequence"])

# Create a results DataFrame
results_df = new_data.copy()
results_df["predicted"] = predictions
results_df.to_csv(output_dir / "regression_predictions.csv", index=False)

print(results_df.head())
```

## Evaluating Regression Performance

Let's evaluate our model more thoroughly:

```python
# Predict on the held-out test set
true_values = test_data["function"]
predicted_values = predict(model_info, test_data["sequence"])

# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
rmse = np.sqrt(mse)
r2 = r2_score(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)

# Print metrics
print("Test Set Performance:")
print(f"  MSE:  {mse:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  R²:   {r2:.4f}")
print(f"  MAE:  {mae:.4f}")
```

## Visualizing Results

Visualizations help in understanding model performance:

```python
# Create a predicted vs. actual scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(true_values, predicted_values, alpha=0.6)
plt.plot([true_values.min(), true_values.max()],
         [true_values.min(), true_values.max()],
         'r--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_scatter_plot.png", dpi=300)

# Plot residuals
residuals = true_values - predicted_values
plt.figure(figsize=(10, 8))
plt.scatter(predicted_values, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals.png", dpi=300)

# Plot the distribution of residuals
plt.figure(figsize=(10, 8))
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Distribution of Residuals")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals_distribution.png", dpi=300)

logger.info("Visualizations saved to output directory")
```

## Working with Different Sequence Types

`fast-seqfunc` automatically detects and handles different types of biological sequences:

- DNA (containing A, C, G, T)
- RNA (containing A, C, G, U)
- Proteins (containing amino acid letters)

```python
# Example with protein sequences
from fast_seqfunc import generate_random_sequences

# Generate random protein sequences
protein_sequences = generate_random_sequences(
    length=30,
    count=100,
    alphabet="ACDEFGHIKLMNPQRSTVWY",  # Protein alphabet
    fixed_length=True,
)

# Create a dummy target (e.g., the fraction of hydrophobic residues)
hydrophobic = "AVILMFYW"
function_values = [
    sum(seq.count(aa) for aa in hydrophobic) / len(seq)
    for seq in protein_sequences
]

# Create the dataset
protein_data = pd.DataFrame({
    "sequence": protein_sequences,
    "function": function_values,
})

# Now you could train a model on this protein data using the same workflow
```
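How might automatic detection work? One simple heuristic, shown here as an illustrative sketch only (this is not `fast-seqfunc`'s actual logic), is to compare a sequence's character set against each alphabet in turn:

```python
def guess_sequence_type(seq: str) -> str:
    """Guess the sequence type from its character set (illustrative heuristic)."""
    chars = set(seq.upper())
    if chars <= set("ACGT"):
        return "dna"
    if chars <= set("ACGU"):
        return "rna"
    if chars <= set("ACDEFGHIKLMNPQRSTVWY"):
        return "protein"
    return "unknown"

print(guess_sequence_type("ATGCATGC"))  # dna
print(guess_sequence_type("AUGGCU"))    # rna
print(guess_sequence_type("MKLVNQW"))   # protein
```

Note the inherent ambiguity: a protein composed only of A, C, G, and T residues would be misclassified as DNA by this heuristic, which is one reason a real library may also let you specify the sequence type explicitly.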
## Advanced Model Training Options

`fast-seqfunc` uses PyCaret behind the scenes, which allows for further customization:

```python
# Example with more options
advanced_model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric="r2",
    # Additional PyCaret setup options:
    n_jobs=-1,                 # Use all available CPU cores
    fold=5,                    # 5-fold cross-validation
    normalize=True,            # Normalize features
    polynomial_features=True,  # Generate polynomial features
    feature_selection=True,    # Perform feature selection
)
```

## Loading and Reusing Models

You can load saved models for reuse:

```python
# Load a previously saved model
loaded_model_info = load_model(model_path)

# Use the loaded model for predictions
new_predictions = predict(loaded_model_info, new_data["sequence"])

# Verify that the predictions match those from the original model
print(np.allclose(predictions, new_predictions))  # Should print True
```

## Conclusion

You've now learned how to:

1. Prepare sequence-function data
2. Train a regression model using `fast-seqfunc`
3. Make predictions on new sequences
4. Evaluate model performance
5. Visualize results

For more advanced features and applications, check out the [API reference](../api_reference.md) and [additional tutorials](./classification_tutorial.md).

## Next Steps

- Try different regression tasks (e.g., "motif_position", "interaction")
- Experiment with different model parameters
- Apply these techniques to your own sequence-function data