Tutorial.ipynb
1 { 2 "cells": [ 3 { 4 "cell_type": "markdown", 5 "id": "0082b221-e28f-4d0b-8377-f0cc328dff91", 6 "metadata": {}, 7 "source": [ 8 "# Building Advanced RAG with MLflow and LlamaIndex Workflow\n", 9 "\n", 10 "Augmenting LLMs with various data sources is a strong strategy for building LLM applications. However, as the system grows more complex, it becomes challenging to run quick development cycles for improvement.\n", 11 "\n", 12 "LlamaIndex Workflow is a great framework for building such compound systems, and combining it with MLflow brings efficiency and robustness to the development cycle, with debugging, experiment tracking, and evaluation for continuous improvement.\n", 13 "\n", 14 "In this notebook, we will go through the journey of building a sophisticated chatbot with LlamaIndex Workflow and MLflow. By the end of this tutorial you will have:\n", 15 "\n", 16 "- Set up a **LlamaIndex Workflow** with multiple retrieval strategies, including vector search, BM25, and web search.\n", 17 "- Logged the workflow and tracked its parameters in an **MLflow Experiment**.\n", 18 "- Evaluated the workflow's performance using **MLflow Evaluate**, with metrics such as latency and answer correctness.\n", 19 "- Explored how to use **MLflow UI** to compare model performance across multiple configurations.\n", 20 "- Gained insights into workflow execution with **MLflow Tracing** to identify quality issues and optimize retriever strategies." 21 ] 22 }, 23 { 24 "cell_type": "markdown", 25 "id": "b93af38f", 26 "metadata": {}, 27 "source": [ 28 "## Strategy: Hybrid Approach Using Multiple Retrieval Methods\n", 29 "\n", 30 "\n", 31 "Retrieval-Augmented Generation (RAG) is a powerful framework, but the retrieval step can often become a bottleneck, because embedding-based retrieval may not always capture the most relevant context. While many techniques exist to improve retrieval quality, no single solution works universally. 
Therefore, an effective strategy is to combine multiple retrieval approaches.\n", 32 "\n", 33 "The concept we will explore here is to run several retrieval methods in parallel: (1) standard vector search, (2) keyword-based search (BM25), and (3) web search. The retrieved contexts are then merged, with irrelevant data filtered out to enhance the overall quality.\n", 34 "\n", 35 "\n", 36 "\n", 37 "How do we bring this concept to life? Let's dive in and build this hybrid RAG using LlamaIndex Workflow and MLflow." 38 ] 39 }, 40 { 41 "cell_type": "markdown", 42 "id": "10434134", 43 "metadata": {}, 44 "source": [ 45 "## What is LlamaIndex Workflow?\n", 46 "\n", 47 "[LlamaIndex Workflow](https://docs.llamaindex.ai/en/stable/module_guides/workflow/) is an event-driven orchestration framework for designing dynamic AI applications. It is built around three core concepts:\n", 48 "\n", 49 "* `Steps` are units of execution, representing distinct actions in the workflow.\n", 50 "\n", 51 "* `Events` trigger these steps, acting as signals that control the workflow's flow.\n", 52 "\n", 53 "* `Workflow` connects the two as a Python class. Each step is implemented as a method of the workflow class and declares its input and output events.\n", 54 "\n", 55 "This simple yet powerful abstraction allows you to break down complex tasks into manageable steps, enabling greater flexibility and scalability. Because the design is event-driven, it is easy to build parallel and asynchronous execution flows, which makes workflows with long-running tasks more efficient and helps them scale in production.\n" 56 ] 57 }, 58 { 59 "cell_type": "markdown", 60 "id": "82e5ada4-db4c-4dfa-bb3e-34cb4e82a2da", 61 "metadata": {}, 62 "source": [ 63 "## 1. Set Up\n", 64 "\n", 65 "Follow the instructions in `README.md` to set up the environment if you haven't already." 66 ] 67 }, 68 { 69 "cell_type": "markdown", 70 "id": "ec266b4a-b679-4cbb-b309-db25dc627f00", 71 "metadata": {}, 72 "source": [ 73 "## 2. 
Start an MLflow Experiment\n", 74 "\n", 75 "An **MLflow Experiment** is where you track all aspects of model development, including model definitions, configurations, parameters, dependency versions, and more. Let's start by creating a new MLflow experiment called \"LlamaIndex Workflow RAG\":" 76 ] 77 }, 78 { 79 "cell_type": "code", 80 "execution_count": null, 81 "id": "fe067408", 82 "metadata": {}, 83 "outputs": [], 84 "source": [ 85 "import mlflow\n", 86 "\n", 87 "mlflow.set_experiment(\"LlamaIndex Workflow RAG\")" 88 ] 89 }, 90 { 91 "cell_type": "markdown", 92 "id": "eff2d35e-2061-4307-af75-4fec5b85ede6", 93 "metadata": {}, 94 "source": [ 95 "## 3. Choose your LLM and Embeddings\n", 96 "\n", 97 "\n", 98 "Now, assign your preferred LLM and embedding model to LlamaIndex's `Settings` object. These models are used by all LlamaIndex components.\n", 99 "\n", 100 "💡 *MLflow will automatically log the `Settings` configuration into your MLflow Experiment when logging models, ensuring reproducibility and reducing the risk of discrepancies between environments.*\n" 101 ] 102 }, 103 { 104 "cell_type": "markdown", 105 "id": "59ab2584-cca2-4215-aac1-7d8502c43323", 106 "metadata": {}, 107 "source": [ 108 "### Option 1: OpenAI (default)\n", 109 "\n", 110 "By default, LlamaIndex uses OpenAI APIs for the LLM and the embedding model. You can use the defaults (`gpt-3.5-turbo` and `text-embedding-ada-002` as of Oct 2024), but we recommend switching to newer, more efficient models for better results at lower cost." 
111 ] 112 }, 113 { 114 "cell_type": "code", 115 "execution_count": null, 116 "id": "0adf61f1-15cd-4c7b-bc35-f987a3ba12df", 117 "metadata": {}, 118 "outputs": [], 119 "source": [ 120 "import getpass\n", 121 "import os\n", 122 "\n", 123 "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter your OpenAI API key\")" 124 ] 125 }, 126 { 127 "cell_type": "code", 128 "execution_count": null, 129 "id": "da5b534f-035e-4021-8a8b-e5fdf64ca2e9", 130 "metadata": {}, 131 "outputs": [], 132 "source": [ 133 "from llama_index.core import Settings\n", 134 "from llama_index.embeddings.openai import OpenAIEmbedding\n", 135 "from llama_index.llms.openai import OpenAI\n", 136 "\n", 137 "# By default, LlamaIndex uses OpenAI APIs for the LLM and the embedding model. You can use the\n", 138 "# defaults (`gpt-3.5-turbo` and `text-embedding-ada-002` as of Oct 2024), but we recommend\n", 139 "# newer, more efficient models for better results at lower cost.\n", 140 "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\")\n", 141 "Settings.llm = OpenAI(model=\"gpt-4o-mini\")" 142 ] 143 }, 144 { 145 "cell_type": "markdown", 146 "id": "40717988-458a-4b3b-b9f2-ce46ff9299e3", 147 "metadata": {}, 148 "source": [ 149 "### Option 2: Other Hosted Models\n", 150 "\n", 151 "To use other hosted LLMs:\n", 152 "\n", 153 "1. Install the integration package for the model provider of your choice.\n", 154 "2. Set the required environment variables as specified in the integration documentation.\n", 155 "3. 
Instantiate the LLM and embedding model and assign them to the global `Settings` object.\n", 156 "\n", 157 "The following cells show an example using Databricks-hosted models (Llama 3.1 70B Instruct).\n", 158 "\n" 159 ] 160 }, 161 { 162 "cell_type": "code", 163 "execution_count": null, 164 "id": "3ea86c74-faac-44d3-9e93-33b22b9e2691", 165 "metadata": {}, 166 "outputs": [], 167 "source": [ 168 "%pip install llama-index-llms-databricks llama-index-embeddings-databricks -qU" 169 ] 170 }, 171 { 172 "cell_type": "code", 173 "execution_count": null, 174 "id": "9c5c04d7-d692-4353-8457-2aaac660845a", 175 "metadata": {}, 176 "outputs": [], 177 "source": [ 178 "import getpass\n", 179 "import os\n", 180 "\n", 181 "os.environ[\"DATABRICKS_SERVING_ENDPOINT\"] = \"https://YOUR_DATABRICKS_HOST/serving-endpoints/\"\n", 182 "os.environ[\"DATABRICKS_TOKEN\"] = getpass.getpass(\"Enter Databricks API Key\")" 183 ] 184 }, 185 { 186 "cell_type": "code", 187 "execution_count": null, 188 "id": "7a83e674-903e-41c1-8480-7d95cfef0c9b", 189 "metadata": {}, 190 "outputs": [], 191 "source": [ 192 "from llama_index.core import Settings\n", 193 "from llama_index.embeddings.databricks import DatabricksEmbedding\n", 194 "from llama_index.llms.databricks import Databricks\n", 195 "\n", 196 "Settings.embed_model = DatabricksEmbedding(model=\"databricks-gte-large-en\")\n", 197 "Settings.llm = Databricks(model=\"databricks-meta-llama-3-1-70b-instruct\")" 198 ] 199 }, 200 { 201 "cell_type": "markdown", 202 "id": "e9b1fbb8-e3e3-46a9-9fbb-2244cc818d03", 203 "metadata": {}, 204 "source": [ 205 "### Option 3: Local Models\n", 206 "\n", 207 "LlamaIndex also supports locally hosted LLMs. Please refer to the [Starter Tutorial (Local Models)](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/) for how to set them up." 208 ] 209 }, 210 { 211 "cell_type": "markdown", 212 "id": "221e1c89-f046-4fb4-a7f3-c2460c2013a7", 213 "metadata": {}, 214 "source": [ 215 "## 4. 
Set Up Web Search API\n", 216 "\n", 217 "Later in this notebook, we will add web search capability to the QA bot. Tavily AI provides a search API\n", 218 "optimized for LLM applications and natively integrated with LlamaIndex. Visit [their website](https://tavily.com/) to\n", 219 "get an API key for free-tier use, or use a different search engine integrated with LlamaIndex, such as [GoogleSearchToolSpec](https://docs.llamaindex.ai/en/stable/api_reference/tools/google/#llama_index.tools.google.GoogleSearchToolSpec)." 220 ] 221 }, 222 { 223 "cell_type": "code", 224 "execution_count": null, 225 "id": "09ed9c06-0501-4e80-91a3-3cc38402a1c3", 226 "metadata": {}, 227 "outputs": [], 228 "source": [ 229 "import getpass\n", 230 "import os\n", 231 "\n", 232 "os.environ[\"TAVILY_AI_API_KEY\"] = getpass.getpass(\"Enter your Tavily AI API key\")" 233 ] 234 }, 235 { 236 "cell_type": "markdown", 237 "id": "34c673bf", 238 "metadata": {}, 239 "source": [ 240 "## 5. Set Up Document Indices for Retrieval\n", 241 "\n", 242 "The next step is to build a document index for retrieval from the MLflow documentation. The `urls.txt` file in the `data` directory contains a list of MLflow documentation pages. These pages can be loaded as document objects using the web page reader utility.\n" 243 ] 244 }, 245 { 246 "cell_type": "code", 247 "execution_count": null, 248 "id": "e7442b40", 249 "metadata": {}, 250 "outputs": [], 251 "source": [ 252 "from llama_index.readers.web import SimpleWebPageReader\n", 253 "\n", 254 "with open(\"data/urls.txt\") as file:\n", 255 "    urls = [line.strip() for line in file if line.strip()]\n", 256 "\n", 257 "documents = SimpleWebPageReader(html_to_text=True).load_data(urls)" 258 ] 259 }, 260 { 261 "cell_type": "markdown", 262 "id": "48419c9f-5d3e-4ba4-a1b5-34b26d01c3e0", 263 "metadata": {}, 264 "source": [ 265 "### Vector Index\n", 266 "Next, ingest these documents into a vector database. 
In this tutorial, we'll use the [Qdrant](https://qdrant.tech/) vector store, which is free if self-hosted. If Docker is installed on your machine, you can start the Qdrant database by running the official Docker container:\n", 267 "\n", 268 "\n", 269 "```shell\n", 270 "$ docker pull qdrant/qdrant\n", 271 "$ docker run -p 6333:6333 -p 6334:6334 \\\n", 272 " -v $(pwd)/.qdrant_storage:/qdrant/storage:z \\\n", 273 " qdrant/qdrant\n", 274 "```\n", 275 "\n", 276 "Once the container is running, you can create an index object that connects to the Qdrant database:" 277 ] 278 }, 279 { 280 "cell_type": "code", 281 "execution_count": null, 282 "id": "f4dbaff9", 283 "metadata": {}, 284 "outputs": [], 285 "source": [ 286 "import qdrant_client\n", 287 "from llama_index.vector_stores.qdrant import QdrantVectorStore\n", 288 "\n", 289 "client = qdrant_client.QdrantClient(host=\"localhost\", port=6333)\n", 290 "vector_store = QdrantVectorStore(client=client, collection_name=\"mlflow_doc\")\n", 291 "\n", 292 "from llama_index.core import StorageContext, VectorStoreIndex\n", 293 "\n", 294 "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n", 295 "index = VectorStoreIndex.from_documents(documents=documents, storage_context=storage_context)" 296 ] 297 }, 298 { 299 "cell_type": "markdown", 300 "id": "9fc7411d", 301 "metadata": {}, 302 "source": [ 303 "Of course, you can use your preferred vector store here. LlamaIndex supports a variety of vector databases, such as [FAISS](https://docs.llamaindex.ai/en/stable/examples/vector_stores/FaissIndexDemo/), [Chroma](https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo/), and [Databricks Vector Search](https://docs.llamaindex.ai/en/stable/examples/vector_stores/DatabricksVectorSearchDemo/). If you choose an alternative, follow the relevant LlamaIndex documentation and update the `workflow/workflow.py` file accordingly." 
304 ] 305 }, 306 { 307 "cell_type": "markdown", 308 "id": "db489188-9998-4b81-b0af-6aba4b989470", 309 "metadata": {}, 310 "source": [ 311 "### Keyword-based Retrieval\n", 312 "\n", 313 "In addition to evaluating the vector search retrieval, we will assess the keyword-based retriever (BM25) later.\n", 314 "Let's set up local document storage to enable BM25 retrieval in the workflow." 315 ] 316 }, 317 { 318 "cell_type": "code", 319 "execution_count": null, 320 "id": "1d6b8b1f-bc61-4152-9e04-b93b0d34bc6d", 321 "metadata": {}, 322 "outputs": [], 323 "source": [ 324 "from llama_index.core.node_parser import SentenceSplitter\n", 325 "from llama_index.retrievers.bm25 import BM25Retriever\n", 326 "\n", 327 "splitter = SentenceSplitter(chunk_size=512)\n", 328 "nodes = splitter.get_nodes_from_documents(documents)\n", 329 "bm25_retriever = BM25Retriever.from_defaults(nodes=nodes)\n", 330 "bm25_retriever.persist(\".bm25_retriever\")" 331 ] 332 }, 333 { 334 "cell_type": "markdown", 335 "id": "d2d061ff-1abc-4ce3-88f8-8c7d4fe19d63", 336 "metadata": {}, 337 "source": [ 338 "## 6. Define a Workflow\n", 339 "\n", 340 "Now that the environment and data sources are ready, we can build the workflow and experiment with it. The complete workflow code is defined in the `workflow` directory. Let's explore some key components of the implementation.\n", 341 "\n", 342 "### Events\n", 343 "\n", 344 "The `workflow/events.py` file defines all the events used within the workflow. These are simple Pydantic models that carry information between workflow steps. For example, the `VectorSearchRetrieveEvent` triggers the vector search step by passing the user's query.\n", 345 "\n", 346 "```python\n", 347 "class VectorSearchRetrieveEvent(Event):\n", 348 " \"\"\"Event for triggering VectorStore index retrieval step.\"\"\"\n", 349 " query: str\n", 350 "```\n", 351 "\n", 352 "### Prompts\n", 353 "\n", 354 "Throughout the workflow execution, we call LLMs multiple times. 
The prompt templates for these LLM calls are defined in the `workflow/prompts.py` file.\n", 355 "\n", 356 "\n", 357 "### Workflow Class\n", 358 "\n", 359 "The main workflow class is defined in `workflow/workflow.py`. Let's break down how it works.\n", 360 "\n", 361 "The constructor accepts a `retrievers` argument, which specifies the retrieval methods to be used in the workflow. For instance, if `[\"vector_search\", \"bm25\"]` is passed, the workflow performs vector search and keyword-based search, skipping web search.\n", 362 "\n", 363 "💡 Choosing retrievers dynamically lets us experiment with different retrieval strategies without duplicating nearly identical model code.\n", 364 "\n", 365 "```python\n", 366 "class HybridRAGWorkflow(Workflow):\n", 367 "\n", 368 "    VALID_RETRIEVERS = {\"vector_search\", \"bm25\", \"web_search\"}\n", 369 "\n", 370 "    def __init__(self, retrievers=None, **kwargs):\n", 371 "        super().__init__(**kwargs)\n", 372 "        self.llm = Settings.llm\n", 373 "        self.retrievers = retrievers or []\n", 374 "\n", 375 "        if invalid_retrievers := set(self.retrievers) - self.VALID_RETRIEVERS:\n", 376 "            raise ValueError(f\"Invalid retrievers specified: {invalid_retrievers}\")\n", 377 "\n", 378 "        self._use_vs_retriever = \"vector_search\" in self.retrievers\n", 379 "        self._use_bm25_retriever = \"bm25\" in self.retrievers\n", 380 "        self._use_web_search = \"web_search\" in self.retrievers\n", 381 "\n", 382 "        if self._use_vs_retriever:\n", 383 "            qd_client = qdrant_client.QdrantClient(host=_QDRANT_HOST, port=_QDRANT_PORT)\n", 384 "            vector_store = QdrantVectorStore(client=qd_client, collection_name=_QDRANT_COLLECTION_NAME)\n", 385 "            index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n", 386 "            self.vs_retriever = index.as_retriever()\n", 387 "\n", 388 "        if self._use_bm25_retriever:\n", 389 "            self.bm25_retriever = BM25Retriever.from_persist_dir(_BM25_PERSIST_DIR)\n", 390 "\n", 391 "        if self._use_web_search:\n", 392 "            self.tavily_tool = 
TavilyToolSpec(api_key=os.environ.get(\"TAVILY_AI_API_KEY\"))\n", 393 "```\n", 394 "\n", 395 "The workflow begins by executing a step that takes the `StartEvent` as input, which is the `route_retrieval` step in this case. This step inspects the `retrievers` parameter and triggers the necessary retrieval steps. By using the `send_event()` method of the context object, multiple events can be dispatched in parallel from this single step.\n", 396 "\n", 397 "\n", 398 "```python\n", 399 "        # If no retriever is specified, proceed directly to the final query step with an empty context\n", 400 "        if len(self.retrievers) == 0:\n", 401 "            return QueryEvent(context=\"\")\n", 402 "\n", 403 "        # Trigger the retrieval steps based on the configuration\n", 404 "        if self._use_vs_retriever:\n", 405 "            ctx.send_event(VectorSearchRetrieveEvent(query=query))\n", 406 "        if self._use_bm25_retriever:\n", 407 "            ctx.send_event(BM25RetrieveEvent(query=query))\n", 408 "        if self._use_web_search:\n", 409 "            ctx.send_event(TransformQueryEvent(query=query))\n", 410 "```\n", 411 "\n", 412 "The retrieval steps are straightforward. However, the web search step is more advanced as it includes an additional step to transform the user's question into a search-friendly query using an LLM.\n", 413 "\n", 414 "The results from all the retrieval steps are aggregated in the `gather_retrieval_results` step. Here, the `ctx.collect_events()` method is used to poll for the results of the asynchronously executed steps.\n", 415 "\n", 416 "```python\n", 417 "        results = ctx.collect_events(ev, [RetrievalResultEvent] * len(self.retrievers))\n", 418 "```\n", 419 "\n", 420 "Passing all results from multiple retrievers often leads to a large context with unrelated or duplicate content. To address this, we need to filter and select the most relevant results. While a score-based approach is common, web search results do not come with similarity scores. Therefore, we use an LLM to sort and filter out irrelevant results. 
The rerank step achieves this by leveraging the built-in reranker integration with [RankGPT](https://github.com/sunnweiwei/RankGPT).\n", 421 "\n", 422 "```python\n", 423 "        reranker = RankGPTRerank(llm=self.llm, top_n=5)\n", 424 "        reranked_nodes = reranker.postprocess_nodes(ev.nodes, query_str=query)\n", 425 "        reranked_context = \"\\n\".join(node.text for node in reranked_nodes)\n", 426 "```\n", 427 "\n", 428 "Finally, the reranked context is passed to the LLM along with the user query to generate the final answer. The result is returned as a `StopEvent` with the `result` key.\n", 429 "\n", 430 "```python\n", 431 "    @step\n", 432 "    async def query_result(self, ctx: Context, ev: QueryEvent) -> StopEvent:\n", 433 "        \"\"\"Get result with relevant text.\"\"\"\n", 434 "        query = await ctx.get(\"query\")\n", 435 "\n", 436 "        prompt = FINAL_QUERY_TEMPLATE.format(context=ev.context, query=query)\n", 437 "        response = self.llm.complete(prompt).text\n", 438 "        return StopEvent(result=response)\n", 439 "```\n" 440 ] 441 }, 442 { 443 "cell_type": "markdown", 444 "id": "37b8f614-cc00-4602-b283-0564ce6d1f14", 445 "metadata": {}, 446 "source": [ 447 "Now, let's instantiate the workflow and run it.\n" 448 ] 449 }, 450 { 451 "cell_type": "code", 452 "execution_count": null, 453 "id": "c8c91200", 454 "metadata": {}, 455 "outputs": [], 456 "source": [ 457 "# Workflow with VS + BM25 retrieval\n", 458 "from workflow.workflow import HybridRAGWorkflow\n", 459 "\n", 460 "workflow = HybridRAGWorkflow(retrievers=[\"vector_search\", \"bm25\"], timeout=60)\n", 461 "response = await workflow.run(query=\"Why use MLflow with LlamaIndex?\")\n", 462 "print(response)" 463 ] 464 }, 465 { 466 "cell_type": "markdown", 467 "id": "bef4f09d-98a7-4298-b44d-6ed8eb0d212c", 468 "metadata": {}, 469 "source": [ 470 "## 7. Log the Workflow in an MLflow Experiment\n", 471 "\n", 472 "Now we want to run the workflow with different retrieval strategies and evaluate their performance. 
However, before running the evaluation, we'll log the model in MLflow to track both the model and its performance within an **MLflow Experiment**.\n", 473 "\n", 474 "For the LlamaIndex Workflow, we use the new [Models from Code](https://mlflow.org/docs/latest/models.html#models-from-code) method, which logs models as standalone Python scripts. This approach avoids the risks and instability associated with serialization methods like pickle, relying instead on code as the single source of truth for the model definition. When combined with MLflow's environment-freezing capability, it provides a reliable way to persist models. For more details, refer to the [MLflow documentation](https://mlflow.org/docs/latest/models.html#models-from-code).\n", 475 "\n", 476 "💡 In the `workflow` directory, there's a `model.py` script that imports the `HybridRAGWorkflow` and instantiates it with dynamic configurations passed via the `model_config` parameter during logging. This design allows you to track models with different configurations without duplicating the model definition.\n", 477 "\n", 478 "We'll start an MLflow Run for each configuration and log the model script `model.py` using the [mlflow.llama_index.log_model()](https://mlflow.org/docs/latest/python_api/mlflow.llama_index.html#mlflow.llama_index.log_model) API.\n" 479 ] 480 }, 481 { 482 "cell_type": "code", 483 "execution_count": null, 484 "id": "a164450c-f81c-4916-803c-5b5588642bd4", 485 "metadata": {}, 486 "outputs": [], 487 "source": [ 488 "# The configurations we will evaluate. For demonstration purposes we don't evaluate every\n", 489 "# combination, but you can add as many as you want.\n", 490 "run_name_to_retrievers = {\n", 491 "    # 1. No retrievers (prior knowledge in LLM).\n", 492 "    \"none\": [],\n", 493 "    # 2. Vector search retrieval only.\n", 494 "    \"vs\": [\"vector_search\"],\n", 495 "    # 3. 
Vector search and keyword search (BM25)\n", 496 "    \"vs + bm25\": [\"vector_search\", \"bm25\"],\n", 497 "    # 4. All retrieval methods including web search.\n", 498 "    \"vs + bm25 + web\": [\"vector_search\", \"bm25\", \"web_search\"],\n", 499 "}\n", 500 "\n", 501 "# Create an MLflow Run and log the model for each configuration.\n", 502 "models = []\n", 503 "for run_name, retrievers in run_name_to_retrievers.items():\n", 504 "    with mlflow.start_run(run_name=run_name):\n", 505 "        model_info = mlflow.llama_index.log_model(\n", 506 "            # Specify the model Python script.\n", 507 "            llama_index_model=\"workflow/model.py\",\n", 508 "            # Specify retrievers to use.\n", 509 "            model_config={\"retrievers\": retrievers},\n", 510 "            # Define dependency files to save along with the model\n", 511 "            code_paths=[\"workflow\"],\n", 512 "            # Subdirectory to save artifacts (not important)\n", 513 "            name=\"model\",\n", 514 "        )\n", 515 "        models.append(model_info)" 516 ] 517 }, 518 { 519 "cell_type": "markdown", 520 "id": "2ac6e827-1d8b-4611-9053-f1ff70497364", 521 "metadata": {}, 522 "source": [ 523 "Now open the MLflow UI again. This time it should show four MLflow Runs recorded with different `retrievers` parameter values. Click a Run name and navigate to the \"Artifacts\" tab to see that MLflow records the model along with various metadata, such as dependency versions and settings.\n" 524 ] 525 }, 526 { 527 "cell_type": "markdown", 528 "id": "74ca55e2-d4b7-45b5-96d8-60a247ddb69f", 529 "metadata": {}, 530 "source": [ 531 "## 8. Enable MLflow Tracing\n", 532 "\n", 533 "Before running the evaluation, there's one final step: enabling **MLflow Tracing**. We'll look at this feature, and at why we enable it at this point, later on. For now, you can enable it with a single line of code. MLflow will automatically trace every LlamaIndex execution." 
534 ] 535 }, 536 { 537 "cell_type": "code", 538 "execution_count": null, 539 "id": "f4be3720-2562-422c-b1c0-44e1b19961d7", 540 "metadata": {}, 541 "outputs": [], 542 "source": [ 543 "mlflow.llama_index.autolog()" 544 ] 545 }, 546 { 547 "cell_type": "markdown", 548 "id": "e4110c71", 549 "metadata": {}, 550 "source": [ 551 "## 9. Evaluate the Workflow with Different Retriever Strategies\n", 552 "\n", 553 "The example repository includes a sample evaluation dataset, `mlflow_qa_dataset.csv`, containing 30 question-answer pairs related to MLflow." 554 ] 555 }, 556 { 557 "cell_type": "code", 558 "execution_count": null, 559 "id": "9f97661e", 560 "metadata": {}, 561 "outputs": [], 562 "source": [ 563 "import pandas as pd\n", 564 "\n", 565 "eval_df = pd.read_csv(\"data/mlflow_qa_dataset.csv\")\n", 566 "display(eval_df.head(3))" 567 ] 568 }, 569 { 570 "cell_type": "markdown", 571 "id": "14703e34", 572 "metadata": {}, 573 "source": [ 574 "To evaluate the workflow, use the [mlflow.evaluate()](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate) API, which requires (1) your dataset, (2) the logged model, and (3) the metrics you want to compute." 
575 ] 576 }, 577 { 578 "cell_type": "code", 579 "execution_count": null, 580 "id": "47ff88f3", 581 "metadata": {}, 582 "outputs": [], 583 "source": [ 584 "from mlflow.metrics import latency\n", 585 "from mlflow.metrics.genai import answer_correctness\n", 586 "\n", 587 "for model_info in models:\n", 588 "    with mlflow.start_run(run_id=model_info.run_id):\n", 589 "        result = mlflow.evaluate(\n", 590 "            # Pass the URI of the model logged above.\n", 591 "            model=model_info.model_uri,\n", 592 "            data=eval_df,\n", 593 "            # Specify the column for ground truth answers.\n", 594 "            targets=\"ground_truth\",\n", 595 "            # Define the metrics to compute.\n", 596 "            extra_metrics=[\n", 597 "                latency(),\n", 598 "                answer_correctness(\"openai:/gpt-4o-mini\"),\n", 599 "            ],\n", 600 "            # The answer_correctness metric requires an \"inputs\" column in the\n", 601 "            # dataset. Ours is named \"query\", so we specify the mapping in the\n", 602 "            # `evaluator_config` parameter.\n", 603 "            evaluator_config={\"col_mapping\": {\"inputs\": \"query\"}},\n", 604 "        )" 605 ] 606 }, 607 { 608 "cell_type": "markdown", 609 "id": "811568f0-811e-4b01-b1ab-bebf2c0a0707", 610 "metadata": {}, 611 "source": [ 612 "In this example, we evaluate the model with two metrics:\n", 613 "\n", 614 "1. **Latency**: Measures the time taken to execute a workflow for a single query.\n", 615 "2. **Answer Correctness**: Evaluates the accuracy of answers based on the ground truth, scored by the OpenAI GPT-4o mini model (`gpt-4o-mini`) on a 1–5 scale.\n", 616 "\n", 617 "These metrics are for demonstration purposes only; you can add additional metrics like toxicity or faithfulness, or even create your own.\n", 618 "\n", 619 "The evaluation process will take a few minutes. Once completed, you can view the results in the MLflow UI. 
Open the Experiment page and click on the chart icon 📈 above the Run list.\n", 620 "\n", 621 "*💡 The evaluation results may differ depending on the model setup and some randomness.*\n", 622 "\n", 623 "\n", 624 "\n", 625 "The first row shows bar charts for the answer correctness metrics, while the second row displays latency results. The best-performing combination is \"Vector Search + BM25\". Interestingly, adding web search not only increases latency significantly but also decreases answer correctness.\n", 626 "\n", 627 "Why does this happen? It appears some answers from the web-search-enabled model are off-topic. For example, in response to a question about starting Model Registry, the web-search model provides an unrelated answer about model deployment, while the \"vs + bm25\" model offers a correct response.\n", 628 "\n", 629 "\n", 630 "\n", 631 "Where did this incorrect answer come from? This seems to be a retriever issue, as we only changed the retrieval strategy. However, the final result alone does not show what each retriever returned. MLflow Tracing lets us look at what is happening behind the scenes." 632 ] 633 }, 634 { 635 "cell_type": "markdown", 636 "id": "e388dd7c-5479-4d5f-838b-73cf33d073f1", 637 "metadata": {}, 638 "source": [ 639 "## 10. Inspecting Quality Issues with MLflow Tracing\n", 640 "\n", 641 "[MLflow Tracing](https://mlflow.org/docs/latest/llms/tracing/index.html) is a new feature that brings observability to LLM applications. It integrates seamlessly with LlamaIndex, recording all inputs, outputs, and metadata about intermediate steps during workflow execution. Since we called `mlflow.llama_index.autolog()` before running the evaluation, every LlamaIndex operation since then has been traced and recorded in the MLflow Experiment.\n", 642 "\n", 643 "To inspect the trace for a specific question from the evaluation, navigate to the \"Traces\" tab on the experiment page. 
Look for the row with the particular question in the request column and the run name \"vs + bm25 + web.\" Clicking the request ID link opens the Trace UI, where you can view detailed information about each step in the execution, including inputs, outputs, metadata, and latency.\n", 644 "\n", 645 "\n", 646 "\n", 647 "In this case, we identified the issue by examining the reranker step. The web search retriever returned irrelevant context related to model serving, and the reranker incorrectly ranked it as the most relevant. With this insight, we can determine potential improvements, such as refining the reranker to better understand MLflow topics, improving web search precision, or even removing the web search retriever altogether." 648 ] 649 }, 650 { 651 "cell_type": "markdown", 652 "id": "010ebbd4-f394-425b-97fc-a5709e70233c", 653 "metadata": {}, 654 "source": [ 655 "## Conclusion\n", 656 "\n", 657 "In this notebook, we explored how the combination of LlamaIndex and MLflow can elevate the development of Retrieval-Augmented Generation (RAG) workflows, bringing together powerful model management and observability capabilities. 
By integrating multiple retrieval strategies—such as vector search, BM25, and web search—we demonstrated how flexible retrieval can enhance the performance of LLM-driven applications.\n", 658 "\n", 659 "- **Experiment Tracking** allowed us to organize and log different workflow configurations, ensuring reproducibility and enabling us to track model performance across multiple runs.\n", 660 "- **MLflow Evaluate** enabled us to easily log and evaluate the workflow with different retriever strategies, using key metrics like latency and answer correctness to compare performance.\n", 661 "- **MLflow UI** gave us a clear visualization of how various retrieval strategies impacted both accuracy and latency, helping us identify the most effective configurations.\n", 662 "- **MLflow Tracing**, integrated with LlamaIndex, provided detailed observability into each step of the workflow for diagnosing quality issues, such as incorrect reranking of search results.\n", 663 "\n", 664 "With these tools, you have a complete framework for building, logging, and optimizing RAG workflows. As LLM technology continues to evolve, the ability to track, evaluate, and fine-tune every aspect of model performance will be essential. 
We highly encourage you to experiment further and see how these tools can be tailored to your own applications.\n", 665 "\n", 666 "To continue learning, explore the following resources:\n", 667 "\n", 668 "* Learn more about the [MLflow LlamaIndex integration](https://mlflow.org/docs/latest/llms/llama-index/index.html).\n", 669 "* Discover additional MLflow LLM features at [LLMs in MLflow](https://mlflow.org/docs/latest/llms/index.html).\n", 670 "* Deploy your workflow to a serving endpoint with [MLflow Deployment](https://mlflow.org/docs/latest/deployment/index.html).\n", 671 "* Check out more [Workflow examples](https://docs.llamaindex.ai/en/stable/module_guides/workflow/#examples) from LlamaIndex.\n" 672 ] 673 } 674 ], 675 "metadata": { 676 "kernelspec": { 677 "display_name": "llama", 678 "language": "python", 679 "name": "llama" 680 }, 681 "language_info": { 682 "codemirror_mode": { 683 "name": "ipython", 684 "version": 3 685 }, 686 "file_extension": ".py", 687 "mimetype": "text/x-python", 688 "name": "python", 689 "nbconvert_exporter": "python", 690 "pygments_lexer": "ipython3", 691 "version": "3.11.7" 692 } 693 }, 694 "nbformat": 4, 695 "nbformat_minor": 5 696 }