index.mdx
1 --- 2 title: ML Dataset Tracking 3 description: Track datasets used in ML experiments for reproducibility, lineage, and evaluation. 4 --- 5 6 import { APILink } from "@site/src/components/APILink"; 7 import Tabs from "@theme/Tabs"; 8 import TabItem from "@theme/TabItem"; 9 10 # MLflow Dataset Tracking 11 12 The `mlflow.data` module is a comprehensive solution for dataset management throughout the ML model development workflow. It enables you to track, version, and manage datasets used in training, validation, and evaluation, providing complete lineage from raw data to model predictions. 13 14 ## Why Dataset Tracking Matters 15 16 Dataset tracking is essential for reproducible machine learning and provides several key benefits: 17 18 - **Data Lineage**: Track the complete journey from raw data sources to model inputs 19 - **Reproducibility**: Ensure experiments can be reproduced with identical datasets 20 - **Version Control**: Manage different versions of datasets as they evolve 21 - **Collaboration**: Share datasets and their metadata across teams 22 - **Evaluation Integration**: Seamlessly integrate with MLflow's evaluation capabilities 23 - **Production Monitoring**: Track datasets used in production inference and evaluation 24 25 ## Core Components 26 27 MLflow's dataset tracking revolves around two main abstractions: 28 29 ### Dataset 30 31 The `Dataset` abstraction is a metadata tracking object that holds comprehensive information about a logged dataset. The information stored within a `Dataset` object includes: 32 33 **Core Properties:** 34 35 - **Name**: Descriptive identifier for the dataset (defaults to "dataset" if not specified) 36 - **Digest**: Unique hash/fingerprint for dataset identification (automatically computed) 37 - **Source**: DatasetSource containing lineage information to the original data location 38 - **Schema**: Optional dataset schema (implementation-specific, e.g., MLflow Schema) 39 - **Profile**: Optional summary statistics (implementation-specific, e.g., row count, column stats) 40 41 **Supported Dataset Types:** 42 43 - <APILink fn="mlflow.data.pandas_dataset.PandasDataset">`PandasDataset`</APILink> - For Pandas DataFrames 44 - <APILink fn="mlflow.data.spark_dataset.SparkDataset">`SparkDataset`</APILink> - For Apache Spark DataFrames 45 - <APILink fn="mlflow.data.numpy_dataset.NumpyDataset">`NumpyDataset`</APILink> - For NumPy arrays 46 - <APILink fn="mlflow.data.polars_dataset.PolarsDataset">`PolarsDataset`</APILink> - For Polars DataFrames 47 - <APILink fn="mlflow.data.huggingface_dataset.HuggingFaceDataset">`HuggingFaceDataset`</APILink> - For Hugging Face datasets 48 - <APILink fn="mlflow.data.tensorflow_dataset.TensorFlowDataset">`TensorFlowDataset`</APILink> - For TensorFlow datasets 49 - <APILink fn="mlflow.data.meta_dataset.MetaDataset">`MetaDataset`</APILink> - For metadata-only datasets (no actual data storage) 50 51 **Special Dataset Types:** 52 53 - `EvaluationDataset` - Internal dataset type used specifically with `mlflow.models.evaluate()` for model evaluation workflows 54 55 ### DatasetSource 56 57 The `DatasetSource` component provides linked lineage to the original source of the data, whether it's a file URL, S3 bucket, database table, or any other data source. This ensures you can always trace back to where your data originated. 58 59 The `DatasetSource` can be retrieved using the <APILink fn="mlflow.data.get_source" /> API, which accepts instances of `Dataset`, `DatasetEntity`, or `DatasetInput`. 60 61 ## Quick Start: Basic Dataset Tracking 62 63 <Tabs> 64 <TabItem value="simple-example" label="Simple Example" default> 65 66 Here's how to get started with basic dataset tracking: 67 68 ```python 69 import mlflow.data 70 import pandas as pd 71 72 # Load your data 73 dataset_source_url = ( 74 "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv" 75 ) 76 raw_data = pd.read_csv(dataset_source_url, delimiter=";") 77 78 # Create a Dataset object 79 dataset = mlflow.data.from_pandas( 80 raw_data, source=dataset_source_url, name="wine-quality-white", targets="quality" 81 ) 82 83 # Log the dataset to an MLflow run 84 with mlflow.start_run(): 85 mlflow.log_input(dataset, context="training") 86 87 # Your training code here 88 # model = train_model(raw_data) 89 # mlflow.sklearn.log_model(model, "model") 90 ``` 91 92 </TabItem> 93 <TabItem value="metadata-only" label="Metadata-Only Datasets"> 94 95 For cases where you only want to log dataset metadata without the actual data: 96 97 ```python 98 import mlflow.data 99 from mlflow.data.meta_dataset import MetaDataset 100 from mlflow.data.http_dataset_source import HTTPDatasetSource 101 from mlflow.types import Schema, ColSpec, DataType 102 103 # Create a metadata-only dataset for a remote data source 104 source = HTTPDatasetSource(url="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz") 105 106 # Option 1: Simple metadata dataset 107 meta_dataset = MetaDataset(source=source, name="imdb-sentiment-dataset") 108 109 # Option 2: With schema information 110 schema = Schema([ 111 ColSpec(type=DataType.string, name="text"), 112 ColSpec(type=DataType.integer, name="label"), 113 ]) 114 115 meta_dataset_with_schema = MetaDataset( 116 source=source, name="imdb-sentiment-dataset-with-schema", schema=schema 117 ) 118 119 with mlflow.start_run(): 120 # Log metadata-only dataset (no actual data stored) 121 mlflow.log_input(meta_dataset_with_schema, context="external_data") 122 123 # The dataset reference and schema are logged, but not the data itself 124 print(f"Logged dataset: {meta_dataset_with_schema.name}") 125 print(f"Data source: {meta_dataset_with_schema.source}") 126 ``` 127 128 **Use Cases for MetaDataset:** 129 Reference datasets hosted on external servers or cloud storage, large datasets where you only want to track metadata and lineage, datasets with restricted access where actual data cannot be stored, and public datasets available via URLs that don't need to be duplicated. 130 131 </TabItem> 132 <TabItem value="with-splits" label="With Data Splits"> 133 134 Track training, validation, and test splits separately: 135 136 ```python 137 import mlflow.data 138 import pandas as pd 139 from sklearn.model_selection import train_test_split 140 141 # Load and split your data 142 data = pd.read_csv("your_dataset.csv") 143 X = data.drop("target", axis=1) 144 y = data["target"] 145 146 X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42) 147 X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) 148 149 # Create dataset objects for each split 150 train_data = pd.concat([X_train, y_train], axis=1) 151 val_data = pd.concat([X_val, y_val], axis=1) 152 test_data = pd.concat([X_test, y_test], axis=1) 153 154 train_dataset = mlflow.data.from_pandas( 155 train_data, source="your_dataset.csv", name="wine-quality-train", targets="target" 156 ) 157 val_dataset = mlflow.data.from_pandas( 158 val_data, source="your_dataset.csv", name="wine-quality-val", targets="target" 159 ) 160 test_dataset = mlflow.data.from_pandas( 161 test_data, source="your_dataset.csv", name="wine-quality-test", targets="target" 162 ) 163 164 with mlflow.start_run(): 165 # Log all dataset splits 166 mlflow.log_input(train_dataset, context="training") 167 mlflow.log_input(val_dataset, context="validation") 168 mlflow.log_input(test_dataset, context="testing") 169 ``` 170 171 </TabItem> 172 <TabItem value="with-predictions" label="With Predictions"> 173 174 Track datasets that include model predictions for evaluation: 175 176 ```python 177 import mlflow.data 178 import pandas as pd 179 from sklearn.ensemble import RandomForestClassifier 180 181 # Train a model 182 model = RandomForestClassifier() 183 model.fit(X_train, y_train) 184 185 # Generate predictions 186 predictions = model.predict(X_test) 187 prediction_probs = model.predict_proba(X_test)[:, 1] 188 189 # Create evaluation dataset with predictions 190 eval_data = X_test.copy() 191 eval_data["target"] = y_test 192 eval_data["prediction"] = predictions 193 eval_data["prediction_proba"] = prediction_probs 194 195 # Create dataset with predictions specified 196 eval_dataset = mlflow.data.from_pandas( 197 eval_data, 198 source="your_dataset.csv", 199 name="wine-quality-evaluation", 200 targets="target", 201 predictions="prediction", 202 ) 203 204 with mlflow.start_run(): 205 mlflow.log_input(eval_dataset, context="evaluation") 206 207 # This dataset can now be used directly with mlflow.models.evaluate() 208 result = mlflow.models.evaluate(data=eval_dataset, model_type="classifier") 209 ``` 210 211 </TabItem> 212 </Tabs> 213 214 ## Dataset Information and Metadata 215 216 When you create a dataset, MLflow automatically captures rich metadata: 217 218 ```python 219 # Access dataset metadata 220 print(f"Dataset name: {dataset.name}") # Defaults to "dataset" if not specified 221 print(f"Dataset digest: {dataset.digest}") # Unique hash identifier (computed automatically) 222 print(f"Dataset source: {dataset.source}") # DatasetSource object 223 print(f"Dataset profile: {dataset.profile}") # Optional: implementation-specific statistics 224 print(f"Dataset schema: {dataset.schema}") # Optional: implementation-specific schema 225 ``` 226 227 Example output: 228 229 ``` 230 Dataset name: wine-quality-white 231 Dataset digest: 2a1e42c4 232 Dataset profile: {"num_rows": 4898, "num_elements": 58776} 233 Dataset schema: {"mlflow_colspec": [ 234 {"type": "double", "name": "fixed acidity"}, 235 {"type": "double", "name": "volatile acidity"}, 236 ... 237 {"type": "long", "name": "quality"} 238 ]} 239 Dataset source: <DatasetSource object> 240 ``` 241 242 :::note Dataset Properties 243 The `profile` and `schema` properties are implementation-specific and may vary depending on the dataset type (PandasDataset, SparkDataset, etc.). Some dataset types may return `None` for these properties. 244 ::: 245 246 ## Dataset Sources and Lineage 247 248 <Tabs> 249 <TabItem value="various-sources" label="Various Data Sources" default> 250 251 MLflow supports datasets from various sources: 252 253 ```python 254 # From local file 255 local_dataset = mlflow.data.from_pandas(df, source="/path/to/local/file.csv", name="local-data") 256 257 # From cloud storage 258 s3_dataset = mlflow.data.from_pandas(df, source="s3://bucket/data.parquet", name="s3-data") 259 260 # From database 261 db_dataset = mlflow.data.from_pandas(df, source="postgresql://user:pass@host/db", name="db-data") 262 263 # From URL 264 url_dataset = mlflow.data.from_pandas(df, source="https://example.com/data.csv", name="web-data") 265 ``` 266 267 </TabItem> 268 <TabItem value="retrieving-sources" label="Retrieving Data Sources"> 269 270 You can retrieve and reload data from logged datasets: 271 272 ```python 273 # After logging a dataset, retrieve it later 274 with mlflow.start_run() as run: 275 mlflow.log_input(dataset, context="training") 276 277 # Retrieve the run and dataset 278 logged_run = mlflow.get_run(run.info.run_id) 279 logged_dataset = logged_run.inputs.dataset_inputs[0].dataset 280 281 # Get the data source and reload data 282 dataset_source = mlflow.data.get_source(logged_dataset) 283 local_path = dataset_source.load() # Downloads to local temp file 284 285 # Reload the data 286 reloaded_data = pd.read_csv(local_path, delimiter=";") 287 print(f"Reloaded {len(reloaded_data)} rows from {local_path}") 288 ``` 289 290 </TabItem> 291 <TabItem value="delta-tables" label="Delta Tables"> 292 293 Special support for Delta Lake tables: 294 295 ```python 296 # For Delta tables (requires delta-lake package) 297 delta_dataset = mlflow.data.from_spark( 298 spark_df, source="delta://path/to/delta/table", name="delta-table-data" 299 ) 300 301 # Can also specify version 302 versioned_delta_dataset = mlflow.data.from_spark( 303 spark_df, source="delta://path/to/delta/table@v1", name="delta-table-v1" 304 ) 305 ``` 306 307 </TabItem> 308 </Tabs> 309 310 ## Dataset Tracking in MLflow UI 311 312 When you log datasets to MLflow runs, they appear in the MLflow UI with comprehensive metadata. You can view dataset information, schema, and lineage directly in the interface. 313 314  315 316 The UI displays: 317 318 - Dataset name and digest 319 - Schema information with column types 320 - Profile statistics (row counts, etc.) 321 - Source lineage information 322 - Context in which the dataset was used 323 324 ## Integration with MLflow Evaluate 325 326 One of the most powerful features of MLflow datasets is their seamless integration with MLflow's evaluation capabilities. MLflow automatically converts various data types to `EvaluationDataset` objects internally when using `mlflow.models.evaluate()`. 327 328 :::info EvaluationDataset 329 MLflow uses an internal `EvaluationDataset` class when working with `mlflow.models.evaluate()`. This dataset type is automatically created from your input data and provides optimized hashing and metadata tracking specifically for evaluation workflows. 330 ::: 331 332 <Tabs> 333 <TabItem value="basic-evaluation" label="Basic Evaluation" default> 334 335 Use datasets directly with MLflow evaluate: 336 337 ```python 338 import mlflow 339 from sklearn.ensemble import RandomForestClassifier 340 from sklearn.model_selection import train_test_split 341 342 # Prepare data and train model 343 data = pd.read_csv("classification_data.csv") 344 X = data.drop("target", axis=1) 345 y = data["target"] 346 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 347 348 model = RandomForestClassifier() 349 model.fit(X_train, y_train) 350 351 # Create evaluation dataset 352 eval_data = X_test.copy() 353 eval_data["target"] = y_test 354 355 eval_dataset = mlflow.data.from_pandas(eval_data, targets="target", name="evaluation-set") 356 357 with mlflow.start_run(): 358 # Log model 359 mlflow.sklearn.log_model(model, name="model", input_example=X_test) 360 361 # Evaluate using the dataset 362 result = mlflow.models.evaluate( 363 model="runs:/{}/model".format(mlflow.active_run().info.run_id), 364 data=eval_dataset, 365 model_type="classifier", 366 ) 367 368 print(f"Accuracy: {result.metrics['accuracy_score']:.3f}") 369 ``` 370 371 </TabItem> 372 <TabItem value="static-predictions" label="Static Predictions"> 373 374 Evaluate pre-computed predictions without re-running the model: 375 376 ```python 377 # Load previously computed predictions 378 batch_predictions = pd.read_parquet("batch_predictions.parquet") 379 380 # Create dataset with existing predictions 381 prediction_dataset = mlflow.data.from_pandas( 382 batch_predictions, 383 source="batch_predictions.parquet", 384 targets="true_label", 385 predictions="model_prediction", 386 name="batch-evaluation", 387 ) 388 389 with mlflow.start_run(): 390 # Evaluate static predictions (no model needed!) 391 result = mlflow.models.evaluate(data=prediction_dataset, model_type="classifier") 392 393 # Dataset is automatically logged to the run 394 print("Evaluation completed on static predictions") 395 ``` 396 397 </TabItem> 398 <TabItem value="comparative-evaluation" label="Comparative Evaluation"> 399 400 Compare multiple models or datasets: 401 402 ```python 403 def compare_model_performance(model_uri, datasets_dict): 404 """Compare model performance across multiple evaluation datasets.""" 405 406 results = {} 407 408 with mlflow.start_run(run_name="Model_Comparison"): 409 for dataset_name, dataset in datasets_dict.items(): 410 with mlflow.start_run(run_name=f"Eval_{dataset_name}", nested=True): 411 result = mlflow.models.evaluate( 412 model=model_uri, 413 data=dataset, 414 targets="target", 415 model_type="classifier", 416 ) 417 results[dataset_name] = result.metrics 418 419 # Log dataset metadata 420 mlflow.log_params({"dataset_name": dataset_name, "dataset_size": len(dataset.df)}) 421 422 return results 423 424 425 # Usage 426 evaluation_datasets = { 427 "validation": validation_dataset, 428 "test": test_dataset, 429 "holdout": holdout_dataset, 430 } 431 432 comparison_results = compare_model_performance(model_uri, evaluation_datasets) 433 ``` 434 435 </TabItem> 436 </Tabs> 437 438 ## MLflow Evaluate Integration Example 439 440 Here's a complete example showing how datasets integrate with MLflow's evaluation capabilities: 441 442 <div className="center-div" style={{ width: "80%" }}> 443  444 </div> 445 446 The evaluation run shows how the dataset, model, metrics, and evaluation artifacts (like confusion matrices) are all logged together, providing a complete view of the evaluation process. 447 448 ## Advanced Dataset Management 449 450 <Tabs> 451 <TabItem value="versioning" label="Dataset Versioning" default> 452 453 Track dataset versions as they evolve: 454 455 ```python 456 def create_versioned_dataset(data, version, base_name="customer-data"): 457 """Create a versioned dataset with metadata.""" 458 459 dataset = mlflow.data.from_pandas( 460 data, 461 source=f"data_pipeline_v{version}", 462 name=f"{base_name}-v{version}", 463 targets="target", 464 ) 465 466 with mlflow.start_run(run_name=f"Dataset_Version_{version}"): 467 mlflow.log_input(dataset, context="versioning") 468 469 # Log version metadata 470 mlflow.log_params({ 471 "dataset_version": version, 472 "data_size": len(data), 473 "features_count": len(data.columns) - 1, 474 "target_distribution": data["target"].value_counts().to_dict(), 475 }) 476 477 # Log data quality metrics 478 mlflow.log_metrics({ 479 "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100, 480 "duplicate_rows": data.duplicated().sum(), 481 "target_balance": data["target"].std(), 482 }) 483 484 return dataset 485 486 487 # Create multiple versions 488 v1_dataset = create_versioned_dataset(data_v1, "1.0") 489 v2_dataset = create_versioned_dataset(data_v2, "2.0") 490 v3_dataset = create_versioned_dataset(data_v3, "3.0") 491 ``` 492 493 </TabItem> 494 <TabItem value="quality-monitoring" label="Data Quality Monitoring"> 495 496 Monitor data quality and drift over time: 497 498 ```python 499 def monitor_dataset_quality(dataset, reference_dataset=None): 500 """Monitor dataset quality and compare against reference if provided.""" 501 502 data = dataset.df if hasattr(dataset, "df") else dataset 503 504 quality_metrics = { 505 "total_rows": len(data), 506 "total_columns": len(data.columns), 507 "missing_values_total": data.isnull().sum().sum(), 508 "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100, 509 "duplicate_rows": data.duplicated().sum(), 510 "duplicate_rows_pct": (data.duplicated().sum() / len(data)) * 100, 511 } 512 513 # Numeric column statistics 514 numeric_cols = data.select_dtypes(include=["number"]).columns 515 for col in numeric_cols: 516 quality_metrics.update({ 517 f"{col}_mean": data[col].mean(), 518 f"{col}_std": data[col].std(), 519 f"{col}_missing_pct": (data[col].isnull().sum() / len(data)) * 100, 520 }) 521 522 with mlflow.start_run(run_name="Data_Quality_Check"): 523 mlflow.log_input(dataset, context="quality_monitoring") 524 mlflow.log_metrics(quality_metrics) 525 526 # Compare with reference dataset if provided 527 if reference_dataset is not None: 528 ref_data = ( 529 reference_dataset.df if hasattr(reference_dataset, "df") else reference_dataset 530 ) 531 532 # Basic drift detection 533 drift_metrics = {} 534 for col in numeric_cols: 535 if col in ref_data.columns: 536 mean_diff = abs(data[col].mean() - ref_data[col].mean()) 537 std_diff = abs(data[col].std() - ref_data[col].std()) 538 drift_metrics.update({ 539 f"{col}_mean_drift": mean_diff, 540 f"{col}_std_drift": std_diff, 541 }) 542 543 mlflow.log_metrics(drift_metrics) 544 545 return quality_metrics 546 547 548 # Usage 549 quality_report = monitor_dataset_quality(current_dataset, reference_dataset) 550 ``` 551 552 </TabItem> 553 <TabItem value="automated-tracking" label="Automated Tracking"> 554 555 Set up automated dataset tracking in your ML pipelines: 556 557 ```python 558 class DatasetTracker: 559 """Automated dataset tracking for ML pipelines.""" 560 561 def __init__(self, experiment_name="Dataset_Tracking"): 562 mlflow.set_experiment(experiment_name) 563 self.tracked_datasets = {} 564 565 def track_dataset(self, data, stage, source=None, name=None, **metadata): 566 """Track a dataset at a specific pipeline stage.""" 567 568 dataset_name = name or f"{stage}_dataset" 569 570 dataset = mlflow.data.from_pandas( 571 data, source=source or f"pipeline_stage_{stage}", name=dataset_name 572 ) 573 574 with mlflow.start_run(run_name=f"Pipeline_{stage}"): 575 mlflow.log_input(dataset, context=stage) 576 577 # Log stage metadata 578 mlflow.log_params({"pipeline_stage": stage, "dataset_name": dataset_name, **metadata}) 579 580 # Automatic quality metrics 581 quality_metrics = { 582 "rows": len(data), 583 "columns": len(data.columns), 584 "missing_pct": (data.isnull().sum().sum() / data.size) * 100, 585 } 586 mlflow.log_metrics(quality_metrics) 587 588 self.tracked_datasets[stage] = dataset 589 return dataset 590 591 def compare_stages(self, stage1, stage2): 592 """Compare datasets between pipeline stages.""" 593 594 if stage1 not in self.tracked_datasets or stage2 not in self.tracked_datasets: 595 raise ValueError("Both stages must be tracked first") 596 597 ds1 = self.tracked_datasets[stage1] 598 ds2 = self.tracked_datasets[stage2] 599 600 # Implementation of comparison logic 601 with mlflow.start_run(run_name=f"Compare_{stage1}_vs_{stage2}"): 602 comparison_metrics = { 603 "row_diff": len(ds2.df) - len(ds1.df), 604 "column_diff": len(ds2.df.columns) - len(ds1.df.columns), 605 } 606 mlflow.log_metrics(comparison_metrics) 607 608 609 # Usage in a pipeline 610 tracker = DatasetTracker() 611 612 # Track at each stage 613 raw_dataset = tracker.track_dataset(raw_data, "raw", source="raw_data.csv") 614 cleaned_dataset = tracker.track_dataset(cleaned_data, "cleaned", source="cleaned_data.csv") 615 features_dataset = tracker.track_dataset(feature_data, "features", source="feature_engineering") 616 617 # Compare stages 618 tracker.compare_stages("raw", "cleaned") 619 tracker.compare_stages("cleaned", "features") 620 ``` 621 622 </TabItem> 623 </Tabs> 624 625 ## Production Use Cases 626 627 <Tabs> 628 <TabItem value="batch-monitoring" label="Batch Prediction Monitoring" default> 629 630 Monitor datasets used in production batch prediction: 631 632 ```python 633 def monitor_batch_predictions(batch_data, model_version, date): 634 """Monitor production batch prediction datasets.""" 635 636 # Create dataset for batch predictions 637 batch_dataset = mlflow.data.from_pandas( 638 batch_data, 639 source=f"production_batch_{date}", 640 name=f"batch_predictions_{date}", 641 targets="true_label" if "true_label" in batch_data.columns else None, 642 predictions="prediction" if "prediction" in batch_data.columns else None, 643 ) 644 645 with mlflow.start_run(run_name=f"Batch_Monitor_{date}"): 646 mlflow.log_input(batch_dataset, context="production_batch") 647 648 # Log production metadata 649 mlflow.log_params({ 650 "batch_date": date, 651 "model_version": model_version, 652 "batch_size": len(batch_data), 653 "has_ground_truth": "true_label" in batch_data.columns, 654 }) 655 656 # Monitor prediction distribution 657 if "prediction" in batch_data.columns: 658 pred_metrics = { 659 "prediction_mean": batch_data["prediction"].mean(), 660 "prediction_std": batch_data["prediction"].std(), 661 "unique_predictions": batch_data["prediction"].nunique(), 662 } 663 mlflow.log_metrics(pred_metrics) 664 665 # Evaluate if ground truth is available 666 if all(col in batch_data.columns for col in ["prediction", "true_label"]): 667 result = mlflow.models.evaluate(data=batch_dataset, model_type="classifier") 668 print(f"Batch accuracy: {result.metrics.get('accuracy_score', 'N/A')}") 669 670 return batch_dataset 671 672 673 # Usage 674 batch_dataset = monitor_batch_predictions(daily_batch_data, "v2.1", "2024-01-15") 675 ``` 676 677 </TabItem> 678 <TabItem value="ab-testing" label="A/B Testing Datasets"> 679 680 Track datasets used in A/B testing scenarios: 681 682 ```python 683 def track_ab_test_data(control_data, treatment_data, test_name, test_date): 684 """Track datasets for A/B testing experiments.""" 685 686 # Create datasets for each variant 687 control_dataset = mlflow.data.from_pandas( 688 control_data, 689 source=f"ab_test_{test_name}_control", 690 name=f"{test_name}_control_{test_date}", 691 targets="conversion" if "conversion" in control_data.columns else None, 692 ) 693 694 treatment_dataset = mlflow.data.from_pandas( 695 treatment_data, 696 source=f"ab_test_{test_name}_treatment", 697 name=f"{test_name}_treatment_{test_date}", 698 targets="conversion" if "conversion" in treatment_data.columns else None, 699 ) 700 701 with mlflow.start_run(run_name=f"AB_Test_{test_name}_{test_date}"): 702 # Log both datasets 703 mlflow.log_input(control_dataset, context="ab_test_control") 704 mlflow.log_input(treatment_dataset, context="ab_test_treatment") 705 706 # Log test parameters 707 mlflow.log_params({ 708 "test_name": test_name, 709 "test_date": test_date, 710 "control_size": len(control_data), 711 "treatment_size": len(treatment_data), 712 "total_size": len(control_data) + len(treatment_data), 713 }) 714 715 # Calculate and log comparison metrics 716 if "conversion" in control_data.columns and "conversion" in treatment_data.columns: 717 control_rate = control_data["conversion"].mean() 718 treatment_rate = treatment_data["conversion"].mean() 719 lift = (treatment_rate - control_rate) / control_rate * 100 720 721 mlflow.log_metrics({ 722 "control_conversion_rate": control_rate, 723 "treatment_conversion_rate": treatment_rate, 724 "lift_percentage": lift, 725 }) 726 727 return control_dataset, treatment_dataset 728 729 730 # Usage 731 control_ds, treatment_ds = track_ab_test_data( 732 control_group_data, treatment_group_data, "new_recommendation_model", "2024-01-15" 733 ) 734 ``` 735 736 </TabItem> 737 </Tabs> 738 739 ## Best Practices 740 741 When working with MLflow datasets, follow these best practices: 742 743 **Data Quality**: Always validate data quality before logging datasets. Check for missing values, duplicates, and data types. 744 745 **Naming Conventions**: Use consistent, descriptive names for datasets that include version information and context. 746 747 **Source Documentation**: Always specify meaningful source URLs or identifiers that allow you to trace back to the original data. 748 749 **Context Specification**: Use clear context labels when logging datasets (e.g., "training", "validation", "evaluation", "production"). 750 751 **Metadata Logging**: Include relevant metadata about data collection, preprocessing steps, and data characteristics. 752 753 **Version Control**: Track dataset versions explicitly, especially when data preprocessing or collection methods change. 754 755 **Digest Computation**: Dataset digests are computed differently for different dataset types: 756 757 - **Standard datasets**: Based on data content and structure 758 - **MetaDataset**: Based on metadata only (name, source, schema) - no actual data hashing 759 - **EvaluationDataset**: Optimized hashing using sample rows for large datasets 760 761 **Source Flexibility**: DatasetSource supports various source types including HTTP URLs, file paths, database connections, and cloud storage locations. 762 763 **Evaluation Integration**: Design datasets with evaluation in mind by clearly specifying target and prediction columns. 764 765 ## Key Benefits 766 767 MLflow dataset tracking provides several key advantages for ML teams: 768 769 **Reproducibility**: Ensure experiments can be reproduced with identical datasets, even as data sources evolve. 770 771 **Lineage Tracking**: Maintain complete data lineage from source to model predictions, enabling better debugging and compliance. 772 773 **Collaboration**: Share datasets and their metadata across team members with consistent interfaces. 774 775 **Evaluation Integration**: Seamlessly integrate with MLflow's evaluation capabilities for comprehensive model assessment. 776 777 **Production Monitoring**: Track datasets used in production systems for performance monitoring and data drift detection. 778 779 **Quality Assurance**: Automatically capture data quality metrics and monitor changes over time. 780 781 Whether you're tracking training datasets, managing evaluation data, or monitoring production batch predictions, MLflow's dataset tracking capabilities provide the foundation for reliable, reproducible machine learning workflows.