---
title: ML Dataset Tracking
description: Track datasets used in ML experiments for reproducibility, lineage, and evaluation.
---

import { APILink } from "@site/src/components/APILink";
import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";

# MLflow Dataset Tracking

The `mlflow.data` module is a comprehensive solution for dataset management throughout the ML model development workflow. It enables you to track, version, and manage datasets used in training, validation, and evaluation, providing complete lineage from raw data to model predictions.

## Why Dataset Tracking Matters

Dataset tracking is essential for reproducible machine learning and provides several key benefits:

- **Data Lineage**: Track the complete journey from raw data sources to model inputs
- **Reproducibility**: Ensure experiments can be reproduced with identical datasets
- **Version Control**: Manage different versions of datasets as they evolve
- **Collaboration**: Share datasets and their metadata across teams
- **Evaluation Integration**: Seamlessly integrate with MLflow's evaluation capabilities
- **Production Monitoring**: Track datasets used in production inference and evaluation

## Core Components

MLflow's dataset tracking revolves around two main abstractions:

### Dataset

The `Dataset` abstraction is a metadata tracking object that holds comprehensive information about a logged dataset. The information stored within a `Dataset` object includes:

**Core Properties:**

- **Name**: Descriptive identifier for the dataset (defaults to "dataset" if not specified)
- **Digest**: Unique hash/fingerprint for dataset identification (automatically computed)
- **Source**: DatasetSource containing lineage information to the original data location
- **Schema**: Optional dataset schema (implementation-specific, e.g., MLflow Schema)
- **Profile**: Optional summary statistics (implementation-specific, e.g., row count, column stats)

**Supported Dataset Types** (see the construction sketch after this list):

- <APILink fn="mlflow.data.pandas_dataset.PandasDataset">`PandasDataset`</APILink> - For Pandas DataFrames
- <APILink fn="mlflow.data.spark_dataset.SparkDataset">`SparkDataset`</APILink> - For Apache Spark DataFrames
- <APILink fn="mlflow.data.numpy_dataset.NumpyDataset">`NumpyDataset`</APILink> - For NumPy arrays
- <APILink fn="mlflow.data.polars_dataset.PolarsDataset">`PolarsDataset`</APILink> - For Polars DataFrames
- <APILink fn="mlflow.data.huggingface_dataset.HuggingFaceDataset">`HuggingFaceDataset`</APILink> - For Hugging Face datasets
- <APILink fn="mlflow.data.tensorflow_dataset.TensorFlowDataset">`TensorFlowDataset`</APILink> - For TensorFlow datasets
- <APILink fn="mlflow.data.meta_dataset.MetaDataset">`MetaDataset`</APILink> - For metadata-only datasets (no actual data storage)

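Each of these types has a corresponding `from_*` constructor in `mlflow.data`. As a minimal sketch (the synthetic arrays here are purely illustrative), `from_numpy` builds a `NumpyDataset` in the same way `from_pandas` builds a `PandasDataset` in the examples later on this page:

```python
import mlflow
import numpy as np

# Purely illustrative synthetic features and labels
features = np.random.rand(100, 4)
labels = np.random.randint(0, 2, size=100)

# from_numpy accepts a separate targets array; source and name are optional
numpy_dataset = mlflow.data.from_numpy(
    features, targets=labels, source="synthetic", name="synthetic-numpy"
)
print(numpy_dataset.digest)  # digest is computed automatically
```
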
**Special Dataset Types:**

- `EvaluationDataset` - Internal dataset type used specifically with `mlflow.models.evaluate()` for model evaluation workflows (see the sketch below)

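You rarely construct an `EvaluationDataset` yourself. Passing a plain DataFrame to `mlflow.models.evaluate()` creates one internally, as in this minimal sketch (`model_uri` and `eval_df`, a DataFrame with a `label` column, are assumed to exist):

```python
import mlflow

# eval_df is converted to an EvaluationDataset behind the scenes
result = mlflow.models.evaluate(
    model=model_uri,  # e.g. a "runs:/<run_id>/model" URI
    data=eval_df,
    targets="label",
    model_type="classifier",
)
```
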
### DatasetSource

The `DatasetSource` component provides linked lineage to the original source of the data, whether it's a file URL, S3 bucket, database table, or any other data source. This ensures you can always trace back to where your data originated.

The `DatasetSource` can be retrieved using the <APILink fn="mlflow.data.get_source" /> API, which accepts instances of `Dataset`, `DatasetEntity`, or `DatasetInput`.

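As a quick sketch (assuming `dataset` was created or retrieved as shown later on this page, and that its source supports downloading):

```python
import mlflow

# Works with Dataset, DatasetEntity, or DatasetInput instances
source = mlflow.data.get_source(dataset)
local_path = source.load()  # materializes the data locally when the source supports it
```
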
## Quick Start: Basic Dataset Tracking

<Tabs>
  <TabItem value="simple-example" label="Simple Example" default>

Here's how to get started with basic dataset tracking:

```python
import mlflow.data
import pandas as pd

# Load your data
dataset_source_url = (
    "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
)
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create a Dataset object
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine-quality-white", targets="quality"
)

# Log the dataset to an MLflow run
with mlflow.start_run():
    mlflow.log_input(dataset, context="training")

    # Your training code here
    # model = train_model(raw_data)
    # mlflow.sklearn.log_model(model, "model")
```

  </TabItem>
  <TabItem value="metadata-only" label="Metadata-Only Datasets">

For cases where you only want to log dataset metadata without the actual data:

```python
import mlflow.data
from mlflow.data.meta_dataset import MetaDataset
from mlflow.data.http_dataset_source import HTTPDatasetSource
from mlflow.types import Schema, ColSpec, DataType

# Create a metadata-only dataset for a remote data source
source = HTTPDatasetSource(url="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz")

# Option 1: Simple metadata dataset
meta_dataset = MetaDataset(source=source, name="imdb-sentiment-dataset")

# Option 2: With schema information
schema = Schema([
    ColSpec(type=DataType.string, name="text"),
    ColSpec(type=DataType.integer, name="label"),
])

meta_dataset_with_schema = MetaDataset(
    source=source, name="imdb-sentiment-dataset-with-schema", schema=schema
)

with mlflow.start_run():
    # Log metadata-only dataset (no actual data stored)
    mlflow.log_input(meta_dataset_with_schema, context="external_data")

    # The dataset reference and schema are logged, but not the data itself
    print(f"Logged dataset: {meta_dataset_with_schema.name}")
    print(f"Data source: {meta_dataset_with_schema.source}")
```

**Use Cases for MetaDataset:**

- Reference datasets hosted on external servers or cloud storage
- Large datasets where you only want to track metadata and lineage
- Datasets with restricted access where the actual data cannot be stored
- Public datasets available via URLs that don't need to be duplicated

  </TabItem>
  <TabItem value="with-splits" label="With Data Splits">

Track training, validation, and test splits separately:

```python
import mlflow.data
import pandas as pd
from sklearn.model_selection import train_test_split

# Load and split your data
data = pd.read_csv("your_dataset.csv")
X = data.drop("target", axis=1)
y = data["target"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Create dataset objects for each split
train_data = pd.concat([X_train, y_train], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

train_dataset = mlflow.data.from_pandas(
    train_data, source="your_dataset.csv", name="wine-quality-train", targets="target"
)
val_dataset = mlflow.data.from_pandas(
    val_data, source="your_dataset.csv", name="wine-quality-val", targets="target"
)
test_dataset = mlflow.data.from_pandas(
    test_data, source="your_dataset.csv", name="wine-quality-test", targets="target"
)

with mlflow.start_run():
    # Log all dataset splits
    mlflow.log_input(train_dataset, context="training")
    mlflow.log_input(val_dataset, context="validation")
    mlflow.log_input(test_dataset, context="testing")
```

  </TabItem>
  <TabItem value="with-predictions" label="With Predictions">

Track datasets that include model predictions for evaluation:

```python
import mlflow.data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train a model (X_train, y_train, X_test, y_test come from a split
# like the one in the previous tab)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Generate predictions
predictions = model.predict(X_test)
prediction_probs = model.predict_proba(X_test)[:, 1]

# Create evaluation dataset with predictions
eval_data = X_test.copy()
eval_data["target"] = y_test
eval_data["prediction"] = predictions
eval_data["prediction_proba"] = prediction_probs

# Create dataset with predictions specified
eval_dataset = mlflow.data.from_pandas(
    eval_data,
    source="your_dataset.csv",
    name="wine-quality-evaluation",
    targets="target",
    predictions="prediction",
)

with mlflow.start_run():
    mlflow.log_input(eval_dataset, context="evaluation")

    # This dataset can now be used directly with mlflow.models.evaluate()
    result = mlflow.models.evaluate(data=eval_dataset, model_type="classifier")
```

  </TabItem>
</Tabs>

## Dataset Information and Metadata

When you create a dataset, MLflow automatically captures rich metadata:

```python
# Access dataset metadata
print(f"Dataset name: {dataset.name}")  # Defaults to "dataset" if not specified
print(f"Dataset digest: {dataset.digest}")  # Unique hash identifier (computed automatically)
print(f"Dataset source: {dataset.source}")  # DatasetSource object
print(f"Dataset profile: {dataset.profile}")  # Optional: implementation-specific statistics
print(f"Dataset schema: {dataset.schema}")  # Optional: implementation-specific schema
```

Example output:

```
Dataset name: wine-quality-white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [
    {"type": "double", "name": "fixed acidity"},
    {"type": "double", "name": "volatile acidity"},
    ...
    {"type": "long", "name": "quality"}
]}
Dataset source: <DatasetSource object>
```

:::note Dataset Properties
The `profile` and `schema` properties are implementation-specific and may vary depending on the dataset type (PandasDataset, SparkDataset, etc.). Some dataset types may return `None` for these properties.
:::

## Dataset Sources and Lineage

<Tabs>
  <TabItem value="various-sources" label="Various Data Sources" default>

MLflow supports datasets from various sources:

```python
# From local file
local_dataset = mlflow.data.from_pandas(df, source="/path/to/local/file.csv", name="local-data")

# From cloud storage
s3_dataset = mlflow.data.from_pandas(df, source="s3://bucket/data.parquet", name="s3-data")

# From database
db_dataset = mlflow.data.from_pandas(df, source="postgresql://user:pass@host/db", name="db-data")

# From URL
url_dataset = mlflow.data.from_pandas(df, source="https://example.com/data.csv", name="web-data")
```

  </TabItem>
  <TabItem value="retrieving-sources" label="Retrieving Data Sources">

You can retrieve and reload data from logged datasets:

```python
# After logging a dataset, retrieve it later
with mlflow.start_run() as run:
    mlflow.log_input(dataset, context="training")

# Retrieve the run and dataset
logged_run = mlflow.get_run(run.info.run_id)
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

# Get the data source and reload data
dataset_source = mlflow.data.get_source(logged_dataset)
local_path = dataset_source.load()  # Downloads to a local temp file

# Reload the data
reloaded_data = pd.read_csv(local_path, delimiter=";")
print(f"Reloaded {len(reloaded_data)} rows from {local_path}")
```

  </TabItem>
  <TabItem value="delta-tables" label="Delta Tables">

Special support for Delta Lake tables:

```python
# For Delta tables (requires a Spark session with Delta Lake support)
delta_dataset = mlflow.data.from_spark(
    spark_df, table_name="my_catalog.my_schema.my_table", name="delta-table-data"
)

# Can also pin a specific table version
versioned_delta_dataset = mlflow.data.from_spark(
    spark_df, table_name="my_catalog.my_schema.my_table", version="1", name="delta-table-v1"
)
```

  </TabItem>
</Tabs>

## Dataset Tracking in MLflow UI

When you log datasets to MLflow runs, they appear in the MLflow UI with comprehensive metadata. You can view dataset information, schema, and lineage directly in the interface.

![Dataset in MLflow UI](/images/tracking/dataset-mlflow-ui.png)

The UI displays:

- Dataset name and digest
- Schema information with column types
- Profile statistics (row counts, etc.)
- Source lineage information
- Context in which the dataset was used

## Integration with MLflow Evaluate

One of the most powerful features of MLflow datasets is their seamless integration with MLflow's evaluation capabilities. MLflow automatically converts various data types to `EvaluationDataset` objects internally when using `mlflow.models.evaluate()`.

:::info EvaluationDataset
MLflow uses an internal `EvaluationDataset` class when working with `mlflow.models.evaluate()`. This dataset type is automatically created from your input data and provides optimized hashing and metadata tracking specifically for evaluation workflows.
:::

<Tabs>
  <TabItem value="basic-evaluation" label="Basic Evaluation" default>

Use datasets directly with MLflow evaluate:

```python
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Prepare data and train model
data = pd.read_csv("classification_data.csv")
X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)

# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test

eval_dataset = mlflow.data.from_pandas(eval_data, targets="target", name="evaluation-set")

with mlflow.start_run():
    # Log model
    mlflow.sklearn.log_model(model, name="model", input_example=X_test)

    # Evaluate using the dataset
    result = mlflow.models.evaluate(
        model="runs:/{}/model".format(mlflow.active_run().info.run_id),
        data=eval_dataset,
        model_type="classifier",
    )

    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
```

  </TabItem>
  <TabItem value="static-predictions" label="Static Predictions">

Evaluate pre-computed predictions without re-running the model:

```python
# Load previously computed predictions
batch_predictions = pd.read_parquet("batch_predictions.parquet")

# Create dataset with existing predictions
prediction_dataset = mlflow.data.from_pandas(
    batch_predictions,
    source="batch_predictions.parquet",
    targets="true_label",
    predictions="model_prediction",
    name="batch-evaluation",
)

with mlflow.start_run():
    # Evaluate static predictions (no model needed!)
    result = mlflow.models.evaluate(data=prediction_dataset, model_type="classifier")

    # Dataset is automatically logged to the run
    print("Evaluation completed on static predictions")
```

  </TabItem>
  <TabItem value="comparative-evaluation" label="Comparative Evaluation">

Compare multiple models or datasets:

```python
def compare_model_performance(model_uri, datasets_dict):
    """Compare model performance across multiple evaluation datasets."""

    results = {}

    with mlflow.start_run(run_name="Model_Comparison"):
        for dataset_name, dataset in datasets_dict.items():
            with mlflow.start_run(run_name=f"Eval_{dataset_name}", nested=True):
                # Targets are taken from each dataset's own `targets` column,
                # so no top-level `targets` argument is needed here
                result = mlflow.models.evaluate(
                    model=model_uri,
                    data=dataset,
                    model_type="classifier",
                )
                results[dataset_name] = result.metrics

                # Log dataset metadata
                mlflow.log_params({"dataset_name": dataset_name, "dataset_size": len(dataset.df)})

    return results


# Usage
evaluation_datasets = {
    "validation": validation_dataset,
    "test": test_dataset,
    "holdout": holdout_dataset,
}

comparison_results = compare_model_performance(model_uri, evaluation_datasets)
```

  </TabItem>
</Tabs>

## MLflow Evaluate Integration Example

Here's a complete example showing how datasets integrate with MLflow's evaluation capabilities:

<div className="center-div" style={{ width: "80%" }}>
  ![Dataset Evaluation in MLflow UI](/images/tracking/dataset-evaluate.png)
</div>

The evaluation run shows how the dataset, model, metrics, and evaluation artifacts (like confusion matrices) are all logged together, providing a complete view of the evaluation process.

## Advanced Dataset Management

<Tabs>
  <TabItem value="versioning" label="Dataset Versioning" default>

Track dataset versions as they evolve:

```python
def create_versioned_dataset(data, version, base_name="customer-data"):
    """Create a versioned dataset with metadata."""

    dataset = mlflow.data.from_pandas(
        data,
        source=f"data_pipeline_v{version}",
        name=f"{base_name}-v{version}",
        targets="target",
    )

    with mlflow.start_run(run_name=f"Dataset_Version_{version}"):
        mlflow.log_input(dataset, context="versioning")

        # Log version metadata
        mlflow.log_params({
            "dataset_version": version,
            "data_size": len(data),
            "features_count": len(data.columns) - 1,
            "target_distribution": data["target"].value_counts().to_dict(),
        })

        # Log data quality metrics
        mlflow.log_metrics({
            "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100,
            "duplicate_rows": data.duplicated().sum(),
            "target_balance": data["target"].std(),
        })

    return dataset


# Create multiple versions
v1_dataset = create_versioned_dataset(data_v1, "1.0")
v2_dataset = create_versioned_dataset(data_v2, "2.0")
v3_dataset = create_versioned_dataset(data_v3, "3.0")
```

  </TabItem>
  <TabItem value="quality-monitoring" label="Data Quality Monitoring">

Monitor data quality and drift over time:

```python
def monitor_dataset_quality(dataset, reference_dataset=None):
    """Monitor dataset quality and compare against reference if provided."""

    data = dataset.df if hasattr(dataset, "df") else dataset

    quality_metrics = {
        "total_rows": len(data),
        "total_columns": len(data.columns),
        "missing_values_total": data.isnull().sum().sum(),
        "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100,
        "duplicate_rows": data.duplicated().sum(),
        "duplicate_rows_pct": (data.duplicated().sum() / len(data)) * 100,
    }

    # Numeric column statistics
    numeric_cols = data.select_dtypes(include=["number"]).columns
    for col in numeric_cols:
        quality_metrics.update({
            f"{col}_mean": data[col].mean(),
            f"{col}_std": data[col].std(),
            f"{col}_missing_pct": (data[col].isnull().sum() / len(data)) * 100,
        })

    with mlflow.start_run(run_name="Data_Quality_Check"):
        mlflow.log_input(dataset, context="quality_monitoring")
        mlflow.log_metrics(quality_metrics)

        # Compare with reference dataset if provided
        if reference_dataset is not None:
            ref_data = (
                reference_dataset.df if hasattr(reference_dataset, "df") else reference_dataset
            )

            # Basic drift detection
            drift_metrics = {}
            for col in numeric_cols:
                if col in ref_data.columns:
                    mean_diff = abs(data[col].mean() - ref_data[col].mean())
                    std_diff = abs(data[col].std() - ref_data[col].std())
                    drift_metrics.update({
                        f"{col}_mean_drift": mean_diff,
                        f"{col}_std_drift": std_diff,
                    })

            mlflow.log_metrics(drift_metrics)

    return quality_metrics


# Usage
quality_report = monitor_dataset_quality(current_dataset, reference_dataset)
```

  </TabItem>
  <TabItem value="automated-tracking" label="Automated Tracking">

Set up automated dataset tracking in your ML pipelines:

```python
class DatasetTracker:
    """Automated dataset tracking for ML pipelines."""

    def __init__(self, experiment_name="Dataset_Tracking"):
        mlflow.set_experiment(experiment_name)
        self.tracked_datasets = {}

    def track_dataset(self, data, stage, source=None, name=None, **metadata):
        """Track a dataset at a specific pipeline stage."""

        dataset_name = name or f"{stage}_dataset"

        dataset = mlflow.data.from_pandas(
            data, source=source or f"pipeline_stage_{stage}", name=dataset_name
        )

        with mlflow.start_run(run_name=f"Pipeline_{stage}"):
            mlflow.log_input(dataset, context=stage)

            # Log stage metadata
            mlflow.log_params({"pipeline_stage": stage, "dataset_name": dataset_name, **metadata})

            # Automatic quality metrics
            quality_metrics = {
                "rows": len(data),
                "columns": len(data.columns),
                "missing_pct": (data.isnull().sum().sum() / data.size) * 100,
            }
            mlflow.log_metrics(quality_metrics)

        self.tracked_datasets[stage] = dataset
        return dataset

    def compare_stages(self, stage1, stage2):
        """Compare datasets between pipeline stages."""

        if stage1 not in self.tracked_datasets or stage2 not in self.tracked_datasets:
            raise ValueError("Both stages must be tracked first")

        ds1 = self.tracked_datasets[stage1]
        ds2 = self.tracked_datasets[stage2]

        # Implementation of comparison logic
        with mlflow.start_run(run_name=f"Compare_{stage1}_vs_{stage2}"):
            comparison_metrics = {
                "row_diff": len(ds2.df) - len(ds1.df),
                "column_diff": len(ds2.df.columns) - len(ds1.df.columns),
            }
            mlflow.log_metrics(comparison_metrics)


# Usage in a pipeline
tracker = DatasetTracker()

# Track at each stage
raw_dataset = tracker.track_dataset(raw_data, "raw", source="raw_data.csv")
cleaned_dataset = tracker.track_dataset(cleaned_data, "cleaned", source="cleaned_data.csv")
features_dataset = tracker.track_dataset(feature_data, "features", source="feature_engineering")

# Compare stages
tracker.compare_stages("raw", "cleaned")
tracker.compare_stages("cleaned", "features")
```

  </TabItem>
</Tabs>

## Production Use Cases

<Tabs>
  <TabItem value="batch-monitoring" label="Batch Prediction Monitoring" default>

Monitor datasets used in production batch prediction:

```python
def monitor_batch_predictions(batch_data, model_version, date):
    """Monitor production batch prediction datasets."""

    # Create dataset for batch predictions
    batch_dataset = mlflow.data.from_pandas(
        batch_data,
        source=f"production_batch_{date}",
        name=f"batch_predictions_{date}",
        targets="true_label" if "true_label" in batch_data.columns else None,
        predictions="prediction" if "prediction" in batch_data.columns else None,
    )

    with mlflow.start_run(run_name=f"Batch_Monitor_{date}"):
        mlflow.log_input(batch_dataset, context="production_batch")

        # Log production metadata
        mlflow.log_params({
            "batch_date": date,
            "model_version": model_version,
            "batch_size": len(batch_data),
            "has_ground_truth": "true_label" in batch_data.columns,
        })

        # Monitor prediction distribution
        if "prediction" in batch_data.columns:
            pred_metrics = {
                "prediction_mean": batch_data["prediction"].mean(),
                "prediction_std": batch_data["prediction"].std(),
                "unique_predictions": batch_data["prediction"].nunique(),
            }
            mlflow.log_metrics(pred_metrics)

        # Evaluate if ground truth is available
        if all(col in batch_data.columns for col in ["prediction", "true_label"]):
            result = mlflow.models.evaluate(data=batch_dataset, model_type="classifier")
            print(f"Batch accuracy: {result.metrics.get('accuracy_score', 'N/A')}")

    return batch_dataset


# Usage
batch_dataset = monitor_batch_predictions(daily_batch_data, "v2.1", "2024-01-15")
```

  </TabItem>
  <TabItem value="ab-testing" label="A/B Testing Datasets">

Track datasets used in A/B testing scenarios:

```python
def track_ab_test_data(control_data, treatment_data, test_name, test_date):
    """Track datasets for A/B testing experiments."""

    # Create datasets for each variant
    control_dataset = mlflow.data.from_pandas(
        control_data,
        source=f"ab_test_{test_name}_control",
        name=f"{test_name}_control_{test_date}",
        targets="conversion" if "conversion" in control_data.columns else None,
    )

    treatment_dataset = mlflow.data.from_pandas(
        treatment_data,
        source=f"ab_test_{test_name}_treatment",
        name=f"{test_name}_treatment_{test_date}",
        targets="conversion" if "conversion" in treatment_data.columns else None,
    )

    with mlflow.start_run(run_name=f"AB_Test_{test_name}_{test_date}"):
        # Log both datasets
        mlflow.log_input(control_dataset, context="ab_test_control")
        mlflow.log_input(treatment_dataset, context="ab_test_treatment")

        # Log test parameters
        mlflow.log_params({
            "test_name": test_name,
            "test_date": test_date,
            "control_size": len(control_data),
            "treatment_size": len(treatment_data),
            "total_size": len(control_data) + len(treatment_data),
        })

        # Calculate and log comparison metrics
        if "conversion" in control_data.columns and "conversion" in treatment_data.columns:
            control_rate = control_data["conversion"].mean()
            treatment_rate = treatment_data["conversion"].mean()
            lift = (treatment_rate - control_rate) / control_rate * 100

            mlflow.log_metrics({
                "control_conversion_rate": control_rate,
                "treatment_conversion_rate": treatment_rate,
                "lift_percentage": lift,
            })

    return control_dataset, treatment_dataset


# Usage
control_ds, treatment_ds = track_ab_test_data(
    control_group_data, treatment_group_data, "new_recommendation_model", "2024-01-15"
)
```

  </TabItem>
</Tabs>

## Best Practices

When working with MLflow datasets, follow these best practices:

**Data Quality**: Always validate data quality before logging datasets. Check for missing values, duplicates, and data types.

**Naming Conventions**: Use consistent, descriptive names for datasets that include version information and context.

**Source Documentation**: Always specify meaningful source URLs or identifiers that allow you to trace back to the original data.

**Context Specification**: Use clear context labels when logging datasets (e.g., "training", "validation", "evaluation", "production").

**Metadata Logging**: Include relevant metadata about data collection, preprocessing steps, and data characteristics.

**Version Control**: Track dataset versions explicitly, especially when data preprocessing or collection methods change.

**Digest Computation**: Dataset digests are computed differently for different dataset types (see the sketch after this list):

- **Standard datasets**: Based on data content and structure
- **MetaDataset**: Based on metadata only (name, source, schema) - no actual data hashing
- **EvaluationDataset**: Optimized hashing using sample rows for large datasets

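A minimal sketch of the practical consequence, assuming `df` is a pandas DataFrame and that the `PandasDataset` digest depends only on the data content:

```python
import mlflow

ds_a = mlflow.data.from_pandas(df, name="snapshot-a")
ds_b = mlflow.data.from_pandas(df.copy(), name="snapshot-b")

# Identical content yields an identical digest, so runs can be joined on it
assert ds_a.digest == ds_b.digest
```
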
**Source Flexibility**: DatasetSource supports various source types including HTTP URLs, file paths, database connections, and cloud storage locations.

**Evaluation Integration**: Design datasets with evaluation in mind by clearly specifying target and prediction columns.

## Key Benefits

MLflow dataset tracking provides several key advantages for ML teams:

**Reproducibility**: Ensure experiments can be reproduced with identical datasets, even as data sources evolve.

**Lineage Tracking**: Maintain complete data lineage from source to model predictions, enabling better debugging and compliance.

**Collaboration**: Share datasets and their metadata across team members with consistent interfaces.

**Evaluation Integration**: Seamlessly integrate with MLflow's evaluation capabilities for comprehensive model assessment.

**Production Monitoring**: Track datasets used in production systems for performance monitoring and data drift detection.

**Quality Assurance**: Automatically capture data quality metrics and monitor changes over time.

Whether you're tracking training datasets, managing evaluation data, or monitoring production batch predictions, MLflow's dataset tracking capabilities provide the foundation for reliable, reproducible machine learning workflows.