47_Building_an_efficient_sparse_keyword_index_in_Python.ipynb
1 { 2 "nbformat": 4, 3 "nbformat_minor": 0, 4 "metadata": { 5 "colab": { 6 "provenance": [] 7 }, 8 "kernelspec": { 9 "name": "python3", 10 "display_name": "Python 3" 11 }, 12 "language_info": { 13 "name": "python" 14 } 15 }, 16 "cells": [ 17 { 18 "cell_type": "markdown", 19 "source": [ 20 "# Building an efficient sparse keyword index in Python\n", 21 "\n", 22 "Semantic search is a new category of search built on recent advances in Natural Language Processing (NLP). Traditional search systems use keywords to find data. Semantic search has an understanding of natural language and identifies results that have the same meaning, not necessarily the same keywords.\n", 23 "\n", 24 "While semantic search adds amazing capabilities, sparse keyword indexes can still add value. There may be cases where finding an exact match is important or we just want a fast index to quickly do an initial scan of a dataset.\n", 25 "\n", 26 "Unfortunately, there aren't a ton of great options for a local Python-based keyword index library. Most of the options available don't scale and/or are highly inefficient, designed only for simple situations.\n", 27 "\n", 28 "Given that Python is an interpreted language, it often gets a bad rap from a performance standpoint. In some cases, it's justified as Python can be memory hungry and has a global interpreter lock (GIL) that forces single thread execution. But it is possible to build performant Python on par with other languages.\n", 29 "\n", 30 "This notebook will explore how to build an efficient sparse keyword index in Python and compare the results with other approaches." 31 ], 32 "metadata": { 33 "id": "v4J3FxbUn9CT" 34 } 35 }, 36 { 37 "cell_type": "markdown", 38 "source": [ 39 "# Install dependencies\n", 40 "\n", 41 "Install `txtai` and all dependencies." 
42 ], 43 "metadata": { 44 "id": "W70a-UjTdDiA" 45 } 46 }, 47 { 48 "cell_type": "code", 49 "source": [ 50 "%%capture\n", 51 "!pip install txtai pytrec_eval rank-bm25 elasticsearch==7.10.1\n", 52 "!pip uninstall -y tensorflow" 53 ], 54 "metadata": { 55 "id": "nfgwb14J4LO2" 56 }, 57 "execution_count": null, 58 "outputs": [] 59 }, 60 { 61 "cell_type": "markdown", 62 "source": [ 63 "# Introducing the problem\n", 64 "\n", 65 "At a high level, keyword indexes work by tokenizing text into lists of tokens per document. These tokens are aggregated into frequencies per document and stored in term frequency sparse arrays.\n", 66 "\n", 67 "The term frequency arrays are sparse given that they only store a frequency when the token exists in a document. For example, if a token exists in 1 of 1000 documents, the sparse array only has a single entry. A dense array stores 1000 entries all with zeros except for one.\n", 68 "\n", 69 "One simple approach to store a term frequency sparse array in Python would be having a dictionary of `{id: frequency}` per token. The problem with this approach is that Python has significant object overhead.\n", 70 "\n", 71 "Let's inspect the size used for a single number." 72 ], 73 "metadata": { 74 "id": "vF3hlZGkqMlh" 75 } 76 }, 77 { 78 "cell_type": "code", 79 "source": [ 80 "import sys\n", 81 "\n", 82 "a = 100\n", 83 "sys.getsizeof(a)" 84 ], 85 "metadata": { 86 "colab": { 87 "base_uri": "https://localhost:8080/" 88 }, 89 "id": "nGX8FMTcqn2c", 90 "outputId": "8580d98b-1901-49a5-d81c-8a2827257d3c" 91 }, 92 "execution_count": null, 93 "outputs": [ 94 { 95 "output_type": "execute_result", 96 "data": { 97 "text/plain": [ 98 "28" 99 ] 100 }, 101 "metadata": {}, 102 "execution_count": 2 103 } 104 ] 105 }, 106 { 107 "cell_type": "markdown", 108 "source": [ 109 "28 bytes for a single integer. Compared to a native int/long which is 4 or 8 bytes, this is quite wasteful. Imagine having thousands of `id: frequency` mappings. 
Memory usage will grow fast.\n", 110 "\n", 111 "Let's demonstrate. The code below runs a self-contained Python process that creates a list of 10 million numbers.\n", 112 "\n", 113 "Running as a separate process helps calculate more accurate memory usage stats." 114 ], 115 "metadata": { 116 "id": "99ixudd1q7jB" 117 } 118 }, 119 { 120 "cell_type": "code", 121 "execution_count": null, 122 "metadata": { 123 "colab": { 124 "base_uri": "https://localhost:8080/" 125 }, 126 "id": "e_cBVXU-jYDQ", 127 "outputId": "ba6e00a3-6acb-4f2e-c985-719eb0d1d36e" 128 }, 129 "outputs": [ 130 { 131 "output_type": "stream", 132 "name": "stdout", 133 "text": [ 134 "Writing arrays.py\n" 135 ] 136 } 137 ], 138 "source": [ 139 "%%writefile arrays.py\n", 140 "import psutil\n", 141 "\n", 142 "results = []\n", 143 "for x in range(int(1e7)):\n", 144 " results.append(x)\n", 145 "\n", 146 "print(f\"MEMORY USAGE = {psutil.Process().memory_info().rss / (1024 * 1024)} MB\")" 147 ] 148 }, 149 { 150 "cell_type": "code", 151 "source": [ 152 "!python arrays.py" 153 ], 154 "metadata": { 155 "colab": { 156 "base_uri": "https://localhost:8080/" 157 }, 158 "id": "IJaAjPbGnGya", 159 "outputId": "adab040c-b7de-4d2f-83dc-5ec0f867a993" 160 }, 161 "execution_count": null, 162 "outputs": [ 163 { 164 "output_type": "stream", 165 "name": "stdout", 166 "text": [ 167 "MEMORY USAGE = 394.640625 MB\n" 168 ] 169 } 170 ] 171 }, 172 { 173 "cell_type": "markdown", 174 "source": [ 175 "Approximately 395 MB of memory is used for this list, or about 40 bytes per number. That is high for 10 million small integers: on a 64-bit build, each element costs an integer object plus an 8-byte pointer stored in the list." 176 ], 177 "metadata": { 178 "id": "bRrR8eOlvBRD" 179 } 180 }, 181 { 182 "cell_type": "markdown", 183 "source": [ 184 "# Efficient numeric arrays in Python\n", 185 "\n", 186 "Fortunately, Python has a module for building [efficient arrays of numeric values](https://docs.python.org/3/library/array.html). This module builds arrays whose elements all share a single native type.\n", 187 "\n", 188 "Let's try doing that with a `long long` type, which takes 8 bytes." 
189 ], 190 "metadata": { 191 "id": "C1fA62VGwLya" 192 } 193 }, 194 { 195 "cell_type": "code", 196 "source": [ 197 "%%writefile arrays.py\n", 198 "from array import array\n", 199 "\n", 200 "import psutil\n", 201 "\n", 202 "results = array(\"q\")\n", 203 "for x in range(int(1e7)):\n", 204 " results.append(x)\n", 205 "\n", 206 "print(f\"MEMORY USAGE = {psutil.Process().memory_info().rss / (1024 * 1024)} MB\")" 207 ], 208 "metadata": { 209 "colab": { 210 "base_uri": "https://localhost:8080/" 211 }, 212 "id": "bqtUmEj4kjOQ", 213 "outputId": "736c3780-fed3-458c-be38-49feb42f416b" 214 }, 215 "execution_count": null, 216 "outputs": [ 217 { 218 "output_type": "stream", 219 "name": "stdout", 220 "text": [ 221 "Overwriting arrays.py\n" 222 ] 223 } 224 ] 225 }, 226 { 227 "cell_type": "code", 228 "source": [ 229 "!python arrays.py" 230 ], 231 "metadata": { 232 "colab": { 233 "base_uri": "https://localhost:8080/" 234 }, 235 "id": "5lgHsgQqnI5q", 236 "outputId": "726c596b-9896-4e8d-baa7-04cb8892bdea" 237 }, 238 "execution_count": null, 239 "outputs": [ 240 { 241 "output_type": "stream", 242 "name": "stdout", 243 "text": [ 244 "MEMORY USAGE = 88.54296875 MB\n" 245 ] 246 } 247 ] 248 }, 249 { 250 "cell_type": "markdown", 251 "source": [ 252 "As we can see, memory usage went from 395 MB to 89 MB. That's roughly a 4x reduction, which is in line with the earlier calculation of 28 bytes/number vs 8 bytes/number." 253 ], 254 "metadata": { 255 "id": "HS_uKPRhv2mV" 256 } 257 }, 258 { 259 "cell_type": "markdown", 260 "source": [ 261 "# Efficient processing of numeric data\n", 262 "\n", 263 "Large computations in pure Python can also be painfully slow. Luckily, there is a robust landscape of options for numeric processing. The most popular framework is [NumPy](https://github.com/numpy/numpy). There is also [PyTorch](https://github.com/pytorch/pytorch) and other GPU-based tensor processing frameworks.\n", 264 "\n", 265 "The simple example below demonstrates this by sorting the same data in pure Python and in NumPy." 
266 ], 267 "metadata": { 268 "id": "vGEjWEaGwX9s" 269 } 270 }, 271 { 272 "cell_type": "code", 273 "source": [ 274 "import random\n", 275 "import time\n", 276 "\n", 277 "data = [random.randint(1, 500) for x in range(1000000)]\n", 278 "\n", 279 "start = time.time()\n", 280 "sorted(data, reverse=True)\n", 281 "print(time.time() - start)" 282 ], 283 "metadata": { 284 "colab": { 285 "base_uri": "https://localhost:8080/" 286 }, 287 "id": "Ggln2VOSw5tY", 288 "outputId": "6bcbf8b2-01b1-41a3-bcad-f73f7220ca17" 289 }, 290 "execution_count": null, 291 "outputs": [ 292 { 293 "output_type": "stream", 294 "name": "stdout", 295 "text": [ 296 "0.33922290802001953\n" 297 ] 298 } 299 ] 300 }, 301 { 302 "cell_type": "code", 303 "source": [ 304 "import numpy as np\n", 305 "\n", 306 "data = np.array(data)\n", 307 "\n", 308 "start = time.time()\n", 309 "np.sort(data)[::-1]\n", 310 "print(time.time() - start)" 311 ], 312 "metadata": { 313 "colab": { 314 "base_uri": "https://localhost:8080/" 315 }, 316 "id": "IV-FhI4UxkV2", 317 "outputId": "ac3d33d1-c8e3-4ae0-f6fa-fd619e81dd5a" 318 }, 319 "execution_count": null, 320 "outputs": [ 321 { 322 "output_type": "stream", 323 "name": "stdout", 324 "text": [ 325 "0.10296249389648438\n" 326 ] 327 } 328 ] 329 }, 330 { 331 "cell_type": "markdown", 332 "source": [ 333 "As we can see, sorting in NumPy is about 3x faster in this run (0.10 seconds vs 0.34 seconds). The absolute difference might not seem like a lot, but it adds up when run in bulk." 334 ], 335 "metadata": { 336 "id": "sejHtgn1zBjk" 337 } 338 }, 339 { 340 "cell_type": "markdown", 341 "source": [ 342 "# Sparse keyword indexes in txtai\n", 343 "\n", 344 "Now that we've discussed the key performance concepts, let's talk about how to apply this to building sparse keyword indexes.\n", 345 "\n", 346 "Going back to the original approach for a term frequency sparse array, we see that using the Python `array` module is more efficient. In txtai, this method is used to build term frequency arrays for each token. 
This results in near native speed and memory usage.\n", 347 "\n", 348 "The search method uses a number of NumPy methods to efficiently calculate query term matches. Each query is tokenized and the term frequency arrays for those tokens are retrieved to calculate query scores. These NumPy methods are all written in C and often drop the GIL. So once again, this gives near native speed and the ability to use multithreading.\n", 349 "\n", 350 "Read the [full implementation on GitHub](https://github.com/neuml/txtai/blob/master/src/python/txtai/scoring/terms.py) to learn more.\n" 351 ], 352 "metadata": { 353 "id": "gPeTqCflzP5B" 354 } 355 }, 356 { 357 "cell_type": "markdown", 358 "source": [ 359 "# Evaluating performance\n", 360 "\n", 361 "First, a review of the landscape. As noted in the introduction, there aren't a ton of good options. [Apache Lucene](https://github.com/apache/lucene) is by far the best traditional search index from a speed, performance and functionality standpoint. It's the base for Elasticsearch/OpenSearch and many other projects. But it requires Java.\n", 362 "\n", 363 "Here are the options we'll explore.\n", 364 "\n", 365 "- [Rank-BM25](https://github.com/dorianbrown/rank_bm25) project, the top result when searching for `python bm25`.\n", 366 "\n", 367 "- [SQLite FTS5](https://www.sqlite.org/fts5.html) extension. This extension builds a sparse keyword index right in SQLite.\n", 368 "\n", 369 "We'll use the BEIR dataset. We'll also use a [benchmarks script](https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py) from the txtai project. This benchmarks script has methods to work with the BEIR dataset.\n", 370 "\n", 371 "A couple of important caveats about the benchmarks script.\n", 372 "\n", 373 "- For the SQLite FTS implementation, each token is joined together with an `OR` clause. SQLite FTS [implicitly joins clauses together](https://www.sqlite.org/fts5.html) with `AND` clauses by default. 
By contrast, [Lucene's default operator](https://lucene.apache.org/core/9_7_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boolean_operators) is an `OR`.\n", 374 "- The Elasticsearch implementation uses 7.x as it's simpler to instantiate in a notebook.\n", 375 "- All methods except Elasticsearch use txtai's [unicode tokenizer](https://github.com/neuml/txtai/blob/master/src/python/txtai/pipeline/data/tokenizer.py) to tokenize text for consistency." 376 ], 377 "metadata": { 378 "id": "rKCRLFNh39hV" 379 } 380 }, 381 { 382 "cell_type": "code", 383 "source": [ 384 "%%capture\n", 385 "import os\n", 386 "\n", 387 "# Get benchmarks script\n", 388 "os.system(\"wget https://raw.githubusercontent.com/neuml/txtai/master/examples/benchmarks.py\")\n", 389 "\n", 390 "# Create output directory\n", 391 "os.makedirs(\"beir\", exist_ok=True)\n", 392 "\n", 393 "# Download subset of BEIR datasets\n", 394 "datasets = [\"trec-covid\", \"nfcorpus\", \"webis-touche2020\", \"scidocs\", \"scifact\"]\n", 395 "for dataset in datasets:\n", 396 "  url = f\"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip\"\n", 397 "  os.system(f\"wget {url}\")\n", 398 "  os.system(f\"mv {dataset}.zip beir\")\n", 399 "  os.system(f\"unzip -d beir beir/{dataset}.zip\")\n", 400 "\n", 401 "# Remove existing benchmark data\n", 402 "if os.path.exists(\"benchmarks.json\"):\n", 403 "  os.remove(\"benchmarks.json\")" 404 ], 405 "metadata": { 406 "id": "IGKzkKWB60pg" 407 }, 408 "execution_count": null, 409 "outputs": [] 410 }, 411 { 412 "cell_type": "markdown", 413 "source": [ 414 "Now let's run the benchmarks." 
415 ], 416 "metadata": { 417 "id": "SEH7Og8LiWRd" 418 } 419 }, 420 { 421 "cell_type": "code", 422 "source": [ 423 "# Remove existing benchmark data\n", 424 "if os.path.exists(\"benchmarks.json\"):\n", 425 "  os.remove(\"benchmarks.json\")\n", 426 "\n", 427 "# Runs benchmark evaluation\n", 428 "def evaluate(method):\n", 429 "  for dataset in datasets:\n", 430 "    command = f\"python benchmarks.py beir {dataset} {method}\"\n", 431 "    print(command)\n", 432 "    os.system(command)\n", 433 "\n", 434 "# Calculate benchmarks\n", 435 "for method in [\"bm25\", \"rank\", \"sqlite\"]:\n", 436 "  evaluate(method)" 437 ], 438 "metadata": { 439 "colab": { 440 "base_uri": "https://localhost:8080/" 441 }, 442 "id": "Hfpok07_5N1m", 443 "outputId": "190d6821-7ff2-4c25-d8ef-5c0ec0b6b04f" 444 }, 445 "execution_count": null, 446 "outputs": [ 447 { 448 "output_type": "stream", 449 "name": "stdout", 450 "text": [ 451 "python benchmarks.py beir trec-covid bm25\n", 452 "python benchmarks.py beir nfcorpus bm25\n", 453 "python benchmarks.py beir webis-touche2020 bm25\n", 454 "python benchmarks.py beir scidocs bm25\n", 455 "python benchmarks.py beir scifact bm25\n", 456 "python benchmarks.py beir trec-covid rank\n", 457 "python benchmarks.py beir nfcorpus rank\n", 458 "python benchmarks.py beir webis-touche2020 rank\n", 459 "python benchmarks.py beir scidocs rank\n", 460 "python benchmarks.py beir scifact rank\n", 461 "python benchmarks.py beir trec-covid sqlite\n", 462 "python benchmarks.py beir nfcorpus sqlite\n", 463 "python benchmarks.py beir webis-touche2020 sqlite\n", 464 "python benchmarks.py beir scidocs sqlite\n", 465 "python benchmarks.py beir scifact sqlite\n" 466 ] 467 } 468 ] 469 }, 470 { 471 "cell_type": "code", 472 "source": [ 473 "import json\n", 474 "import pandas as pd\n", 475 "\n", 476 "def benchmarks():\n", 477 "  # Read JSON lines data directly from the file path\n", 481 "  df = pd.read_json(\"benchmarks.json\", lines=True).sort_values(by=[\"source\", \"search\"])\n", 482 "  return df[[\"source\", \"method\", \"index\", \"memory\", \"search\", \"ndcg_cut_10\", \"map_cut_10\", \"recall_10\", \"P_10\"]].reset_index(drop=True)\n", 483 "\n", 484 "# Load benchmarks dataframe\n", 485 "df = benchmarks()" 486 ], 487 "metadata": { 488 "id": "cpmNpwag73DW" 489 }, 490 "execution_count": null, 491 "outputs": [] 492 }, 493 { 494 "cell_type": "code", 495 "source": [ 496 "df[df.source == \"trec-covid\"].reset_index(drop=True)" 497 ], 498 "metadata": { 499 "colab": { 500 "base_uri": "https://localhost:8080/", 501 "height": 143 502 }, 503 "id": "ln4oUAfgLas4", 504 "outputId": "3e3249df-600a-444a-c46e-b839d215ef83" 505 }, 506 "execution_count": null, 507 "outputs": [ 508 { 509 "output_type": "execute_result", 510 "data": { 511 "text/plain": [ 512 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 513 "0 trec-covid bm25 101.96 997 0.28 0.58119 0.01247 \n", 514 "1 trec-covid sqlite 60.16 880 23.09 0.56778 0.01190 \n", 515 "2 trec-covid rank 61.75 3245 75.49 0.57773 0.01210 \n", 516 "\n", 517 " recall_10 P_10 \n", 518 "0 0.01545 0.618 \n", 519 "1 0.01519 0.610 \n", 520 "2 0.01550 0.632 " 521 ], 522 "text/html": [ 523 "\n", 524 "\n", 525 " <div id=\"df-9b67f724-621c-4549-af71-f8852d48ea33\">\n", 526 " <div class=\"colab-df-container\">\n", 527 " <div>\n", 528 "<style scoped>\n", 529 " .dataframe tbody tr th:only-of-type {\n", 530 " vertical-align: middle;\n", 531 " }\n", 532 "\n", 533 " .dataframe tbody tr th {\n", 534 " vertical-align: top;\n", 535 " }\n", 536 "\n", 537 " .dataframe thead th {\n", 538 " text-align: right;\n", 539 " }\n", 540 "</style>\n", 541 "<table border=\"1\" class=\"dataframe\">\n", 542 " <thead>\n", 543 " <tr style=\"text-align: right;\">\n", 544 " <th></th>\n", 545 " <th>source</th>\n", 546 " <th>method</th>\n", 547 " <th>index</th>\n", 548 " <th>memory</th>\n", 549 " <th>search</th>\n", 550 " <th>ndcg_cut_10</th>\n", 551 " <th>map_cut_10</th>\n", 552 " 
<th>recall_10</th>\n", 553 " <th>P_10</th>\n", 554 " </tr>\n", 555 " </thead>\n", 556 " <tbody>\n", 557 " <tr>\n", 558 " <th>0</th>\n", 559 " <td>trec-covid</td>\n", 560 " <td>bm25</td>\n", 561 " <td>101.96</td>\n", 562 " <td>997</td>\n", 563 " <td>0.28</td>\n", 564 " <td>0.58119</td>\n", 565 " <td>0.01247</td>\n", 566 " <td>0.01545</td>\n", 567 " <td>0.618</td>\n", 568 " </tr>\n", 569 " <tr>\n", 570 " <th>1</th>\n", 571 " <td>trec-covid</td>\n", 572 " <td>sqlite</td>\n", 573 " <td>60.16</td>\n", 574 " <td>880</td>\n", 575 " <td>23.09</td>\n", 576 " <td>0.56778</td>\n", 577 " <td>0.01190</td>\n", 578 " <td>0.01519</td>\n", 579 " <td>0.610</td>\n", 580 " </tr>\n", 581 " <tr>\n", 582 " <th>2</th>\n", 583 " <td>trec-covid</td>\n", 584 " <td>rank</td>\n", 585 " <td>61.75</td>\n", 586 " <td>3245</td>\n", 587 " <td>75.49</td>\n", 588 " <td>0.57773</td>\n", 589 " <td>0.01210</td>\n", 590 " <td>0.01550</td>\n", 591 " <td>0.632</td>\n", 592 " </tr>\n", 593 " </tbody>\n", 594 "</table>\n", 595 "</div>\n", 737 " </div>\n", 738 " </div>\n" 739 ] 740 }, 741 "metadata": {}, 742 "execution_count": 12 743 } 744 ] 745 }, 746 { 747 "cell_type": "code", 748 "source": [ 749 "df[df.source == \"nfcorpus\"].reset_index(drop=True)" 750 ], 751 "metadata": { 752 "colab": { 753 "base_uri": "https://localhost:8080/", 754 "height": 143 755 }, 756 "id": "bSx6dXhLM66g", 757 "outputId": "504f47de-e2ca-4837-a158-4f0f0704f08b" 758 }, 759 "execution_count": null, 760 "outputs": [ 761 { 762 "output_type": "execute_result", 763 "data": { 764 "text/plain": [ 765 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 766 "0 nfcorpus bm25 2.64 648 1.08 0.30639 0.11728 \n", 767 "1 nfcorpus sqlite 1.50 630 12.73 0.30695 0.11785 \n", 768 "2 nfcorpus rank 2.75 700 23.78 0.30692 0.11711 \n", 769 "\n", 770 " recall_10 P_10 \n", 771 "0 0.14891 0.21734 \n", 772 "1 0.14871 0.21641 \n", 773 "2 0.15320 0.21889 " 774 ], 775 "text/html": [ 776 "\n", 777 "\n", 778 " <div id=\"df-883cc6c3-4108-48a2-890d-3a27115a8a34\">\n", 779 " <div class=\"colab-df-container\">\n", 780 " <div>\n", 781 "<style scoped>\n", 782 " .dataframe tbody tr th:only-of-type {\n", 783 " vertical-align: middle;\n", 784 " }\n", 785 "\n", 786 " .dataframe tbody tr th {\n", 787 " vertical-align: top;\n", 788 " }\n", 789 "\n", 790 " .dataframe thead th {\n", 791 " text-align: right;\n", 792 " }\n", 793 "</style>\n", 794 "<table border=\"1\" class=\"dataframe\">\n", 795 " <thead>\n", 796 " <tr style=\"text-align: 
right;\">\n", 797 " <th></th>\n", 798 " <th>source</th>\n", 799 " <th>method</th>\n", 800 " <th>index</th>\n", 801 " <th>memory</th>\n", 802 " <th>search</th>\n", 803 " <th>ndcg_cut_10</th>\n", 804 " <th>map_cut_10</th>\n", 805 " <th>recall_10</th>\n", 806 " <th>P_10</th>\n", 807 " </tr>\n", 808 " </thead>\n", 809 " <tbody>\n", 810 " <tr>\n", 811 " <th>0</th>\n", 812 " <td>nfcorpus</td>\n", 813 " <td>bm25</td>\n", 814 " <td>2.64</td>\n", 815 " <td>648</td>\n", 816 " <td>1.08</td>\n", 817 " <td>0.30639</td>\n", 818 " <td>0.11728</td>\n", 819 " <td>0.14891</td>\n", 820 " <td>0.21734</td>\n", 821 " </tr>\n", 822 " <tr>\n", 823 " <th>1</th>\n", 824 " <td>nfcorpus</td>\n", 825 " <td>sqlite</td>\n", 826 " <td>1.50</td>\n", 827 " <td>630</td>\n", 828 " <td>12.73</td>\n", 829 " <td>0.30695</td>\n", 830 " <td>0.11785</td>\n", 831 " <td>0.14871</td>\n", 832 " <td>0.21641</td>\n", 833 " </tr>\n", 834 " <tr>\n", 835 " <th>2</th>\n", 836 " <td>nfcorpus</td>\n", 837 " <td>rank</td>\n", 838 " <td>2.75</td>\n", 839 " <td>700</td>\n", 840 " <td>23.78</td>\n", 841 " <td>0.30692</td>\n", 842 " <td>0.11711</td>\n", 843 " <td>0.15320</td>\n", 844 " <td>0.21889</td>\n", 845 " </tr>\n", 846 " </tbody>\n", 847 "</table>\n", 848 "</div>\n", 990 " </div>\n", 991 " </div>\n" 992 ] 993 }, 994 "metadata": {}, 995 "execution_count": 13 996 } 997 ] 998 }, 999 { 1000 "cell_type": "code", 1001 "source": [ 1002 "df[df.source == \"webis-touche2020\"].reset_index(drop=True)" 1003 ], 1004 "metadata": { 1005 "colab": { 1006 "base_uri": "https://localhost:8080/", 1007 "height": 143 1008 }, 1009 "id": "W-hAhuYHNK_6", 1010 "outputId": "6c28d842-f572-454b-86f3-a19783640757" 1011 }, 1012 "execution_count": null, 1013 "outputs": [ 1014 { 1015 "output_type": "execute_result", 1016 "data": { 1017 "text/plain": [ 1018 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 1019 "0 webis-touche2020 bm25 374.66 1137 0.37 0.36920 0.14588 \n", 1020 "1 webis-touche2020 sqlite 220.46 1416 34.61 0.37194 0.14812 \n", 1021 "2 webis-touche2020 rank 224.07 10347 81.22 0.39861 0.16492 \n", 1022 "\n", 1023 " recall_10 P_10 \n", 1024 "0 0.22736 0.34694 \n", 1025 "1 0.22890 0.35102 \n", 1026 "2 0.23770 0.36122 " 1027 ], 1028 "text/html": [ 1029 "\n", 1030 "\n", 1031 " <div id=\"df-70f16254-6a73-49a4-959c-a8bcd1892c8c\">\n", 
1032 " <div class=\"colab-df-container\">\n", 1033 " <div>\n", 1034 "<style scoped>\n", 1035 " .dataframe tbody tr th:only-of-type {\n", 1036 " vertical-align: middle;\n", 1037 " }\n", 1038 "\n", 1039 " .dataframe tbody tr th {\n", 1040 " vertical-align: top;\n", 1041 " }\n", 1042 "\n", 1043 " .dataframe thead th {\n", 1044 " text-align: right;\n", 1045 " }\n", 1046 "</style>\n", 1047 "<table border=\"1\" class=\"dataframe\">\n", 1048 " <thead>\n", 1049 " <tr style=\"text-align: right;\">\n", 1050 " <th></th>\n", 1051 " <th>source</th>\n", 1052 " <th>method</th>\n", 1053 " <th>index</th>\n", 1054 " <th>memory</th>\n", 1055 " <th>search</th>\n", 1056 " <th>ndcg_cut_10</th>\n", 1057 " <th>map_cut_10</th>\n", 1058 " <th>recall_10</th>\n", 1059 " <th>P_10</th>\n", 1060 " </tr>\n", 1061 " </thead>\n", 1062 " <tbody>\n", 1063 " <tr>\n", 1064 " <th>0</th>\n", 1065 " <td>webis-touche2020</td>\n", 1066 " <td>bm25</td>\n", 1067 " <td>374.66</td>\n", 1068 " <td>1137</td>\n", 1069 " <td>0.37</td>\n", 1070 " <td>0.36920</td>\n", 1071 " <td>0.14588</td>\n", 1072 " <td>0.22736</td>\n", 1073 " <td>0.34694</td>\n", 1074 " </tr>\n", 1075 " <tr>\n", 1076 " <th>1</th>\n", 1077 " <td>webis-touche2020</td>\n", 1078 " <td>sqlite</td>\n", 1079 " <td>220.46</td>\n", 1080 " <td>1416</td>\n", 1081 " <td>34.61</td>\n", 1082 " <td>0.37194</td>\n", 1083 " <td>0.14812</td>\n", 1084 " <td>0.22890</td>\n", 1085 " <td>0.35102</td>\n", 1086 " </tr>\n", 1087 " <tr>\n", 1088 " <th>2</th>\n", 1089 " <td>webis-touche2020</td>\n", 1090 " <td>rank</td>\n", 1091 " <td>224.07</td>\n", 1092 " <td>10347</td>\n", 1093 " <td>81.22</td>\n", 1094 " <td>0.39861</td>\n", 1095 " <td>0.16492</td>\n", 1096 " <td>0.23770</td>\n", 1097 " <td>0.36122</td>\n", 1098 " </tr>\n", 1099 " </tbody>\n", 1100 "</table>\n", 1101 "</div>\n",
1243 " </div>\n", 1244 " </div>\n" 1245 ] 1246 }, 1247 "metadata": {}, 1248 "execution_count": 14 1249 } 1250 ] 1251 }, 1252 { 1253 "cell_type": "code", 1254 "source": [ 1255 "df[df.source == \"scidocs\"].reset_index(drop=True)" 1256 ], 1257 "metadata": { 1258 "colab": { 1259 "base_uri": "https://localhost:8080/", 1260 "height": 143 1261 }, 1262 "id": "ln7p-b9XNPmO", 1263 "outputId": "26a53b7f-a047-4062-b7a9-45d0b27dbef9" 1264 }, 1265 "execution_count": null, 1266 "outputs": [ 1267 { 1268 "output_type": "execute_result", 1269 "data": { 1270 "text/plain": [ 1271 " source method index memory search ndcg_cut_10 map_cut_10 recall_10 \\\n", 1272 "0 scidocs bm25 17.95 717 1.64 0.15063 0.08756 0.15637 \n", 1273 "1 scidocs sqlite 17.85 670 56.64 0.15156 0.08822 0.15717 \n", 1274 "2 scidocs rank 13.11 1056 162.99 0.14932 0.08670 0.15408 \n", 1275 "\n", 1276 " P_10 \n", 1277 "0 0.0772 \n", 1278 "1 0.0776 \n", 1279 "2 0.0761 " 1280 ], 1281 "text/html": [ 1282 "\n", 1283 "\n", 1284 " <div id=\"df-76355dc9-de17-423d-b8e9-cd47d4dd5f93\">\n", 1285 " <div class=\"colab-df-container\">\n", 1286 " <div>\n", 1287 "<style scoped>\n", 1288 " .dataframe tbody tr th:only-of-type {\n", 1289 " vertical-align: middle;\n", 1290 " }\n", 1291 "\n", 1292 " .dataframe tbody tr th {\n", 1293 " vertical-align: top;\n", 1294 " }\n", 1295 "\n", 1296 " .dataframe thead th {\n", 1297 " text-align: right;\n", 1298 " }\n", 1299 "</style>\n", 1300 "<table border=\"1\" 
class=\"dataframe\">\n", 1301 " <thead>\n", 1302 " <tr style=\"text-align: right;\">\n", 1303 " <th></th>\n", 1304 " <th>source</th>\n", 1305 " <th>method</th>\n", 1306 " <th>index</th>\n", 1307 " <th>memory</th>\n", 1308 " <th>search</th>\n", 1309 " <th>ndcg_cut_10</th>\n", 1310 " <th>map_cut_10</th>\n", 1311 " <th>recall_10</th>\n", 1312 " <th>P_10</th>\n", 1313 " </tr>\n", 1314 " </thead>\n", 1315 " <tbody>\n", 1316 " <tr>\n", 1317 " <th>0</th>\n", 1318 " <td>scidocs</td>\n", 1319 " <td>bm25</td>\n", 1320 " <td>17.95</td>\n", 1321 " <td>717</td>\n", 1322 " <td>1.64</td>\n", 1323 " <td>0.15063</td>\n", 1324 " <td>0.08756</td>\n", 1325 " <td>0.15637</td>\n", 1326 " <td>0.0772</td>\n", 1327 " </tr>\n", 1328 " <tr>\n", 1329 " <th>1</th>\n", 1330 " <td>scidocs</td>\n", 1331 " <td>sqlite</td>\n", 1332 " <td>17.85</td>\n", 1333 " <td>670</td>\n", 1334 " <td>56.64</td>\n", 1335 " <td>0.15156</td>\n", 1336 " <td>0.08822</td>\n", 1337 " <td>0.15717</td>\n", 1338 " <td>0.0776</td>\n", 1339 " </tr>\n", 1340 " <tr>\n", 1341 " <th>2</th>\n", 1342 " <td>scidocs</td>\n", 1343 " <td>rank</td>\n", 1344 " <td>13.11</td>\n", 1345 " <td>1056</td>\n", 1346 " <td>162.99</td>\n", 1347 " <td>0.14932</td>\n", 1348 " <td>0.08670</td>\n", 1349 " <td>0.15408</td>\n", 1350 " <td>0.0761</td>\n", 1351 " </tr>\n", 1352 " </tbody>\n", 1353 "</table>\n", 1354 "</div>\n",
1496 " </div>\n", 1497 " </div>\n" 1498 ] 1499 }, 1500 "metadata": {}, 1501 "execution_count": 15 1502 } 1503 ] 1504 }, 1505 { 1506 "cell_type": "code", 1507 "source": [ 1508 "df[df.source == \"scifact\"].reset_index(drop=True)" 1509 ], 1510 "metadata": { 1511 "colab": { 1512 "base_uri": "https://localhost:8080/", 1513 "height": 143 1514 }, 1515 "id": "CsHEwmV0NTjm", 1516 "outputId": "591030c6-57fb-4f06-c133-9ab0fef3646e" 1517 }, 1518 "execution_count": null, 1519 "outputs": [ 1520 { 1521 "output_type": "execute_result", 1522 "data": { 1523 "text/plain": [ 1524 " source method index memory search ndcg_cut_10 map_cut_10 recall_10 \\\n", 1525 "0 scifact bm25 5.51 653 1.07 0.66324 0.61764 0.78761 \n", 1526 "1 scifact sqlite 1.85 631 20.28 0.66630 0.61966 0.79494 \n", 1527 "2 scifact rank 1.85 724 42.22 0.65618 0.61204 0.77400 \n", 1528 "\n", 1529 " P_10 \n", 1530 "0 0.087 \n", 1531 "1 0.088 \n", 1532 "2 0.085 " 1533 ], 1534 "text/html": [ 1535 "\n", 1536 "\n", 1537 " <div 
id=\"df-2245cf05-3fb4-4f83-9fc3-b0e265f5c080\">\n", 1538 " <div class=\"colab-df-container\">\n", 1539 " <div>\n", 1540 "<style scoped>\n", 1541 " .dataframe tbody tr th:only-of-type {\n", 1542 " vertical-align: middle;\n", 1543 " }\n", 1544 "\n", 1545 " .dataframe tbody tr th {\n", 1546 " vertical-align: top;\n", 1547 " }\n", 1548 "\n", 1549 " .dataframe thead th {\n", 1550 " text-align: right;\n", 1551 " }\n", 1552 "</style>\n", 1553 "<table border=\"1\" class=\"dataframe\">\n", 1554 " <thead>\n", 1555 " <tr style=\"text-align: right;\">\n", 1556 " <th></th>\n", 1557 " <th>source</th>\n", 1558 " <th>method</th>\n", 1559 " <th>index</th>\n", 1560 " <th>memory</th>\n", 1561 " <th>search</th>\n", 1562 " <th>ndcg_cut_10</th>\n", 1563 " <th>map_cut_10</th>\n", 1564 " <th>recall_10</th>\n", 1565 " <th>P_10</th>\n", 1566 " </tr>\n", 1567 " </thead>\n", 1568 " <tbody>\n", 1569 " <tr>\n", 1570 " <th>0</th>\n", 1571 " <td>scifact</td>\n", 1572 " <td>bm25</td>\n", 1573 " <td>5.51</td>\n", 1574 " <td>653</td>\n", 1575 " <td>1.07</td>\n", 1576 " <td>0.66324</td>\n", 1577 " <td>0.61764</td>\n", 1578 " <td>0.78761</td>\n", 1579 " <td>0.087</td>\n", 1580 " </tr>\n", 1581 " <tr>\n", 1582 " <th>1</th>\n", 1583 " <td>scifact</td>\n", 1584 " <td>sqlite</td>\n", 1585 " <td>1.85</td>\n", 1586 " <td>631</td>\n", 1587 " <td>20.28</td>\n", 1588 " <td>0.66630</td>\n", 1589 " <td>0.61966</td>\n", 1590 " <td>0.79494</td>\n", 1591 " <td>0.088</td>\n", 1592 " </tr>\n", 1593 " <tr>\n", 1594 " <th>2</th>\n", 1595 " <td>scifact</td>\n", 1596 " <td>rank</td>\n", 1597 " <td>1.85</td>\n", 1598 " <td>724</td>\n", 1599 " <td>42.22</td>\n", 1600 " <td>0.65618</td>\n", 1601 " <td>0.61204</td>\n", 1602 " <td>0.77400</td>\n", 1603 " <td>0.085</td>\n", 1604 " </tr>\n", 1605 " </tbody>\n", 1606 "</table>\n", 1607 "</div>\n",
1749 " </div>\n", 1750 " </div>\n" 1751 ] 1752 }, 1753 "metadata": {}, 1754 "execution_count": 16 1755 } 1756 ] 1757 }, 1758 { 1759 "cell_type": "markdown", 1760 "source": [ 1761 "The tables above show the metrics per source and method.\n", 1762 "\n", 1763 "The table headers list the `source (dataset)`, `index method`, `index time(s)`, `memory usage(MB)`, `search time(s)` and `NDCG@10`/`MAP@10`/`RECALL@10`/`P@10` accuracy metrics. The tables are sorted by `search time`.\n", 1764 "\n", 1765 "As we can see, txtai's implementation (`bm25`) has the fastest search times across the board, but it is generally slower when it comes to index time. The accuracy metrics vary slightly but are about the same across methods for each source.\n", 1766 "\n", 1767 "Memory usage stands out. SQLite and txtai both have around the same usage per source. Rank-BM25 memory usage can get out of hand fast. For example, `webis-touche2020`, which is only ~400K records, uses `10 GB` of memory compared to roughly `1.1 GB` to `1.4 GB` for the other implementations." 1768 ], 1769 "metadata": { 1770 "id": "tU1eFDZUh0NQ" 1771 } 1772 }, 1773 { 1774 "cell_type": "markdown", 1775 "source": [ 1776 "# Compare with Elasticsearch\n", 1777 "\n", 1778 "Now that we've reviewed methods to build keyword indexes in Python, let's see how txtai's sparse keyword index compares to Elasticsearch.\n", 1779 "\n", 1780 "We'll spin up an inline instance and run the same evaluations." 
1781 ], 1782 "metadata": { 1783 "id": "_9tn39MN0LV9" 1784 } 1785 }, 1786 { 1787 "cell_type": "code", 1788 "source": [ 1789 "%%capture\n", 1790 "# Download and extract elasticsearch\n", 1791 "os.system(\"wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\")\n", 1792 "os.system(\"tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\")\n", 1793 "os.system(\"chown -R daemon:daemon elasticsearch-7.10.1\")" 1794 ], 1795 "metadata": { 1796 "id": "GZu0nj_R_NqB" 1797 }, 1798 "execution_count": null, 1799 "outputs": [] 1800 }, 1801 { 1802 "cell_type": "code", 1803 "source": [ 1804 "from subprocess import Popen, PIPE, STDOUT\n", 1805 "\n", 1806 "# Start and wait for server\n", 1807 "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n", 1808 "!sleep 30" 1809 ], 1810 "metadata": { 1811 "id": "SsQsr-my_Poy" 1812 }, 1813 "execution_count": null, 1814 "outputs": [] 1815 }, 1816 { 1817 "cell_type": "code", 1818 "source": [ 1819 "# Add benchmark evaluations for Elasticsearch\n", 1820 "evaluate(\"es\")\n", 1821 "\n", 1822 "# Reload benchmarks dataframe\n", 1823 "df = benchmarks()" 1824 ], 1825 "metadata": { 1826 "colab": { 1827 "base_uri": "https://localhost:8080/" 1828 }, 1829 "id": "QSnpA2sjA5X0", 1830 "outputId": "d4805ee9-1e3c-4fec-8f73-9b30edd4854d" 1831 }, 1832 "execution_count": null, 1833 "outputs": [ 1834 { 1835 "output_type": "stream", 1836 "name": "stdout", 1837 "text": [ 1838 "python benchmarks.py beir trec-covid es\n", 1839 "python benchmarks.py beir nfcorpus es\n", 1840 "python benchmarks.py beir webis-touche2020 es\n", 1841 "python benchmarks.py beir scidocs es\n", 1842 "python benchmarks.py beir scifact es\n" 1843 ] 1844 } 1845 ] 1846 }, 1847 { 1848 "cell_type": "code", 1849 "source": [ 1850 "df[df.source == \"trec-covid\"].reset_index(drop=True)" 1851 ], 1852 "metadata": { 1853 "colab": { 1854 "base_uri": "https://localhost:8080/", 1855 "height": 175 
1856 }, 1857 "id": "zAZolShYaXyf", 1858 "outputId": "1a338a45-799c-483c-83a4-037b8e8c1780" 1859 }, 1860 "execution_count": null, 1861 "outputs": [ 1862 { 1863 "output_type": "execute_result", 1864 "data": { 1865 "text/plain": [ 1866 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 1867 "0 trec-covid bm25 101.96 997 0.28 0.58119 0.01247 \n", 1868 "1 trec-covid es 71.24 636 2.09 0.59215 0.01261 \n", 1869 "2 trec-covid sqlite 60.16 880 23.09 0.56778 0.01190 \n", 1870 "3 trec-covid rank 61.75 3245 75.49 0.57773 0.01210 \n", 1871 "\n", 1872 " recall_10 P_10 \n", 1873 "0 0.01545 0.618 \n", 1874 "1 0.01590 0.636 \n", 1875 "2 0.01519 0.610 \n", 1876 "3 0.01550 0.632 " 1877 ], 1878 "text/html": [ 1879 "\n", 1880 "\n", 1881 " <div id=\"df-acdc8290-f77c-46eb-8419-91f9ac9ea511\">\n", 1882 " <div class=\"colab-df-container\">\n", 1883 " <div>\n", 1884 "<style scoped>\n", 1885 " .dataframe tbody tr th:only-of-type {\n", 1886 " vertical-align: middle;\n", 1887 " }\n", 1888 "\n", 1889 " .dataframe tbody tr th {\n", 1890 " vertical-align: top;\n", 1891 " }\n", 1892 "\n", 1893 " .dataframe thead th {\n", 1894 " text-align: right;\n", 1895 " }\n", 1896 "</style>\n", 1897 "<table border=\"1\" class=\"dataframe\">\n", 1898 " <thead>\n", 1899 " <tr style=\"text-align: right;\">\n", 1900 " <th></th>\n", 1901 " <th>source</th>\n", 1902 " <th>method</th>\n", 1903 " <th>index</th>\n", 1904 " <th>memory</th>\n", 1905 " <th>search</th>\n", 1906 " <th>ndcg_cut_10</th>\n", 1907 " <th>map_cut_10</th>\n", 1908 " <th>recall_10</th>\n", 1909 " <th>P_10</th>\n", 1910 " </tr>\n", 1911 " </thead>\n", 1912 " <tbody>\n", 1913 " <tr>\n", 1914 " <th>0</th>\n", 1915 " <td>trec-covid</td>\n", 1916 " <td>bm25</td>\n", 1917 " <td>101.96</td>\n", 1918 " <td>997</td>\n", 1919 " <td>0.28</td>\n", 1920 " <td>0.58119</td>\n", 1921 " <td>0.01247</td>\n", 1922 " <td>0.01545</td>\n", 1923 " <td>0.618</td>\n", 1924 " </tr>\n", 1925 " <tr>\n", 1926 " <th>1</th>\n", 1927 " <td>trec-covid</td>\n", 1928 " 
<td>es</td>\n", 1929 " <td>71.24</td>\n", 1930 " <td>636</td>\n", 1931 " <td>2.09</td>\n", 1932 " <td>0.59215</td>\n", 1933 " <td>0.01261</td>\n", 1934 " <td>0.01590</td>\n", 1935 " <td>0.636</td>\n", 1936 " </tr>\n", 1937 " <tr>\n", 1938 " <th>2</th>\n", 1939 " <td>trec-covid</td>\n", 1940 " <td>sqlite</td>\n", 1941 " <td>60.16</td>\n", 1942 " <td>880</td>\n", 1943 " <td>23.09</td>\n", 1944 " <td>0.56778</td>\n", 1945 " <td>0.01190</td>\n", 1946 " <td>0.01519</td>\n", 1947 " <td>0.610</td>\n", 1948 " </tr>\n", 1949 " <tr>\n", 1950 " <th>3</th>\n", 1951 " <td>trec-covid</td>\n", 1952 " <td>rank</td>\n", 1953 " <td>61.75</td>\n", 1954 " <td>3245</td>\n", 1955 " <td>75.49</td>\n", 1956 " <td>0.57773</td>\n", 1957 " <td>0.01210</td>\n", 1958 " <td>0.01550</td>\n", 1959 " <td>0.632</td>\n", 1960 " </tr>\n", 1961 " </tbody>\n", 1962 "</table>\n", 1963 "</div>\n",
2105 " </div>\n", 2106 " </div>\n" 2107 ] 2108 }, 2109 "metadata": {}, 2110 "execution_count": 20 2111 } 2112 ] 2113 }, 2114 { 2115 "cell_type": "code", 2116 "source": [ 2117 "df[df.source == \"nfcorpus\"].reset_index(drop=True)" 2118 ], 2119 "metadata": { 2120 "colab": { 2121 "base_uri": "https://localhost:8080/", 2122 "height": 175 2123 }, 2124 "id": "3kKe6A6CbKbp", 2125 "outputId": "5c8ca9cb-0d59-4110-eacf-547bba8f1445" 2126 }, 2127 "execution_count": null, 2128 "outputs": [ 2129 { 2130 "output_type": "execute_result", 2131 "data": { 2132 "text/plain": [ 2133 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 2134 "0 nfcorpus bm25 2.64 648 1.08 0.30639 0.11728 \n", 2135 "1 nfcorpus es 3.95 627 11.47 0.30676 0.11761 \n", 2136 "2 nfcorpus sqlite 1.50 630 12.73 0.30695 0.11785 \n", 2137 "3 nfcorpus rank 2.75 700 23.78 0.30692 0.11711 \n", 2138 "\n", 2139 " recall_10 P_10 \n", 2140 "0 0.14891 0.21734 \n", 2141 "1 0.14894 0.21610 \n", 2142 "2 0.14871 0.21641 \n", 2143 "3 0.15320 0.21889 " 2144 ], 2145 "text/html": [ 2146 "\n", 2147 "\n", 2148 " <div id=\"df-9a99a174-56a2-478f-91c0-31a3c54948cd\">\n", 2149 " <div class=\"colab-df-container\">\n", 2150 " <div>\n", 2151 "<style scoped>\n", 2152 " .dataframe tbody tr th:only-of-type {\n", 2153 " vertical-align: middle;\n", 2154 " }\n", 2155 "\n", 2156 " .dataframe tbody tr th {\n", 2157 " vertical-align: top;\n", 2158 " }\n", 2159 "\n", 2160 " .dataframe thead th {\n", 2161 " 
text-align: right;\n", 2162 " }\n", 2163 "</style>\n", 2164 "<table border=\"1\" class=\"dataframe\">\n", 2165 " <thead>\n", 2166 " <tr style=\"text-align: right;\">\n", 2167 " <th></th>\n", 2168 " <th>source</th>\n", 2169 " <th>method</th>\n", 2170 " <th>index</th>\n", 2171 " <th>memory</th>\n", 2172 " <th>search</th>\n", 2173 " <th>ndcg_cut_10</th>\n", 2174 " <th>map_cut_10</th>\n", 2175 " <th>recall_10</th>\n", 2176 " <th>P_10</th>\n", 2177 " </tr>\n", 2178 " </thead>\n", 2179 " <tbody>\n", 2180 " <tr>\n", 2181 " <th>0</th>\n", 2182 " <td>nfcorpus</td>\n", 2183 " <td>bm25</td>\n", 2184 " <td>2.64</td>\n", 2185 " <td>648</td>\n", 2186 " <td>1.08</td>\n", 2187 " <td>0.30639</td>\n", 2188 " <td>0.11728</td>\n", 2189 " <td>0.14891</td>\n", 2190 " <td>0.21734</td>\n", 2191 " </tr>\n", 2192 " <tr>\n", 2193 " <th>1</th>\n", 2194 " <td>nfcorpus</td>\n", 2195 " <td>es</td>\n", 2196 " <td>3.95</td>\n", 2197 " <td>627</td>\n", 2198 " <td>11.47</td>\n", 2199 " <td>0.30676</td>\n", 2200 " <td>0.11761</td>\n", 2201 " <td>0.14894</td>\n", 2202 " <td>0.21610</td>\n", 2203 " </tr>\n", 2204 " <tr>\n", 2205 " <th>2</th>\n", 2206 " <td>nfcorpus</td>\n", 2207 " <td>sqlite</td>\n", 2208 " <td>1.50</td>\n", 2209 " <td>630</td>\n", 2210 " <td>12.73</td>\n", 2211 " <td>0.30695</td>\n", 2212 " <td>0.11785</td>\n", 2213 " <td>0.14871</td>\n", 2214 " <td>0.21641</td>\n", 2215 " </tr>\n", 2216 " <tr>\n", 2217 " <th>3</th>\n", 2218 " <td>nfcorpus</td>\n", 2219 " <td>rank</td>\n", 2220 " <td>2.75</td>\n", 2221 " <td>700</td>\n", 2222 " <td>23.78</td>\n", 2223 " <td>0.30692</td>\n", 2224 " <td>0.11711</td>\n", 2225 " <td>0.15320</td>\n", 2226 " <td>0.21889</td>\n", 2227 " </tr>\n", 2228 " </tbody>\n", 2229 "</table>\n", 2230 "</div>\n",
2372 " </div>\n", 2373 " </div>\n" 2374 ] 2375 }, 2376 "metadata": {}, 2377 "execution_count": 21 2378 } 2379 ] 2380 }, 2381 { 2382 "cell_type": "code", 2383 "source": [ 2384 "df[df.source == \"webis-touche2020\"].reset_index(drop=True)" 2385 ], 2386 "metadata": { 2387 "colab": { 2388 "base_uri": "https://localhost:8080/", 2389 "height": 175 2390 }, 2391 "id": "wKCYo54hbVUC", 2392 "outputId": "2314f23c-1ed6-4f77-db4d-c579f163878b" 2393 }, 2394 "execution_count": null, 2395 "outputs": [ 2396 { 2397 "output_type": "execute_result", 2398 "data": { 2399 "text/plain": [ 2400 " source method index memory search ndcg_cut_10 map_cut_10 \\\n", 2401 "0 webis-touche2020 bm25 374.66
1137 0.37 0.36920 0.14588 \n", 2402 "1 webis-touche2020 es 168.28 629 0.62 0.37519 0.14819 \n", 2403 "2 webis-touche2020 sqlite 220.46 1416 34.61 0.37194 0.14812 \n", 2404 "3 webis-touche2020 rank 224.07 10347 81.22 0.39861 0.16492 \n", 2405 "\n", 2406 " recall_10 P_10 \n", 2407 "0 0.22736 0.34694 \n", 2408 "1 0.22889 0.35102 \n", 2409 "2 0.22890 0.35102 \n", 2410 "3 0.23770 0.36122 " 2411 ], 2412 "text/html": [ 2413 "\n", 2414 "\n", 2415 " <div id=\"df-2724b0da-6952-4174-a45b-5c955cba470e\">\n", 2416 " <div class=\"colab-df-container\">\n", 2417 " <div>\n", 2418 "<style scoped>\n", 2419 " .dataframe tbody tr th:only-of-type {\n", 2420 " vertical-align: middle;\n", 2421 " }\n", 2422 "\n", 2423 " .dataframe tbody tr th {\n", 2424 " vertical-align: top;\n", 2425 " }\n", 2426 "\n", 2427 " .dataframe thead th {\n", 2428 " text-align: right;\n", 2429 " }\n", 2430 "</style>\n", 2431 "<table border=\"1\" class=\"dataframe\">\n", 2432 " <thead>\n", 2433 " <tr style=\"text-align: right;\">\n", 2434 " <th></th>\n", 2435 " <th>source</th>\n", 2436 " <th>method</th>\n", 2437 " <th>index</th>\n", 2438 " <th>memory</th>\n", 2439 " <th>search</th>\n", 2440 " <th>ndcg_cut_10</th>\n", 2441 " <th>map_cut_10</th>\n", 2442 " <th>recall_10</th>\n", 2443 " <th>P_10</th>\n", 2444 " </tr>\n", 2445 " </thead>\n", 2446 " <tbody>\n", 2447 " <tr>\n", 2448 " <th>0</th>\n", 2449 " <td>webis-touche2020</td>\n", 2450 " <td>bm25</td>\n", 2451 " <td>374.66</td>\n", 2452 " <td>1137</td>\n", 2453 " <td>0.37</td>\n", 2454 " <td>0.36920</td>\n", 2455 " <td>0.14588</td>\n", 2456 " <td>0.22736</td>\n", 2457 " <td>0.34694</td>\n", 2458 " </tr>\n", 2459 " <tr>\n", 2460 " <th>1</th>\n", 2461 " <td>webis-touche2020</td>\n", 2462 " <td>es</td>\n", 2463 " <td>168.28</td>\n", 2464 " <td>629</td>\n", 2465 " <td>0.62</td>\n", 2466 " <td>0.37519</td>\n", 2467 " <td>0.14819</td>\n", 2468 " <td>0.22889</td>\n", 2469 " <td>0.35102</td>\n", 2470 " </tr>\n", 2471 " <tr>\n", 2472 " <th>2</th>\n", 2473 " 
<td>webis-touche2020</td>\n", 2474 " <td>sqlite</td>\n", 2475 " <td>220.46</td>\n", 2476 " <td>1416</td>\n", 2477 " <td>34.61</td>\n", 2478 " <td>0.37194</td>\n", 2479 " <td>0.14812</td>\n", 2480 " <td>0.22890</td>\n", 2481 " <td>0.35102</td>\n", 2482 " </tr>\n", 2483 " <tr>\n", 2484 " <th>3</th>\n", 2485 " <td>webis-touche2020</td>\n", 2486 " <td>rank</td>\n", 2487 " <td>224.07</td>\n", 2488 " <td>10347</td>\n", 2489 " <td>81.22</td>\n", 2490 " <td>0.39861</td>\n", 2491 " <td>0.16492</td>\n", 2492 " <td>0.23770</td>\n", 2493 " <td>0.36122</td>\n", 2494 " </tr>\n", 2495 " </tbody>\n", 2496 "</table>\n", 2497 "</div>\n",
2639 " </div>\n", 2640 " </div>\n" 2641 ] 2642 }, 2643 "metadata": {}, 2644 "execution_count": 22 2645 } 2646 ] 2647 }, 2648 { 2649 "cell_type": "code", 2650 "source": [ 2651 "df[df.source == \"scidocs\"].reset_index(drop=True)" 2652 ], 2653 "metadata": { 2654 "colab": { 2655 "base_uri": "https://localhost:8080/", 2656 "height": 175 2657 }, 2658 "id": "yt5j8wF1bNka", 2659 "outputId": "5f2d0506-3578-47e4-abde-1cb6736fc1d5" 2660 }, 2661 "execution_count": null, 2662 "outputs": [ 2663 { 2664 "output_type": "execute_result", 2665 "data": { 2666 "text/plain": [ 2667 " source method index memory search ndcg_cut_10 map_cut_10 recall_10 \\\n", 2668 "0 scidocs bm25 17.95 717 1.64 0.15063 0.08756 0.15637 \n", 2669 "1 scidocs es 11.07 632 10.25 0.14924 0.08671 0.15497 \n", 2670 "2 scidocs sqlite 17.85 670 56.64 0.15156 0.08822 0.15717 \n", 2671 "3 scidocs rank 13.11 1056 162.99 0.14932 0.08670 0.15408 \n", 2672 "\n", 2673 " P_10 \n", 2674 "0 0.0772 \n", 2675 "1 0.0765 \n", 2676 "2 0.0776 \n", 2677 "3 0.0761 " 2678 ], 2679 "text/html": [ 2680 "\n", 2681 "\n", 2682 " <div id=\"df-b92c0549-7ae3-45a5-a6ae-5b8c676c8797\">\n", 2683 " <div class=\"colab-df-container\">\n", 2684 " <div>\n", 2685 "<style scoped>\n", 2686 " .dataframe tbody tr th:only-of-type {\n", 2687 " vertical-align: middle;\n", 2688 " }\n", 2689 "\n", 2690 " .dataframe tbody tr th {\n", 2691 " vertical-align: top;\n", 2692 " }\n", 2693 "\n", 2694 " .dataframe thead th {\n", 2695 "
text-align: right;\n", 2696 " }\n", 2697 "</style>\n", 2698 "<table border=\"1\" class=\"dataframe\">\n", 2699 " <thead>\n", 2700 " <tr style=\"text-align: right;\">\n", 2701 " <th></th>\n", 2702 " <th>source</th>\n", 2703 " <th>method</th>\n", 2704 " <th>index</th>\n", 2705 " <th>memory</th>\n", 2706 " <th>search</th>\n", 2707 " <th>ndcg_cut_10</th>\n", 2708 " <th>map_cut_10</th>\n", 2709 " <th>recall_10</th>\n", 2710 " <th>P_10</th>\n", 2711 " </tr>\n", 2712 " </thead>\n", 2713 " <tbody>\n", 2714 " <tr>\n", 2715 " <th>0</th>\n", 2716 " <td>scidocs</td>\n", 2717 " <td>bm25</td>\n", 2718 " <td>17.95</td>\n", 2719 " <td>717</td>\n", 2720 " <td>1.64</td>\n", 2721 " <td>0.15063</td>\n", 2722 " <td>0.08756</td>\n", 2723 " <td>0.15637</td>\n", 2724 " <td>0.0772</td>\n", 2725 " </tr>\n", 2726 " <tr>\n", 2727 " <th>1</th>\n", 2728 " <td>scidocs</td>\n", 2729 " <td>es</td>\n", 2730 " <td>11.07</td>\n", 2731 " <td>632</td>\n", 2732 " <td>10.25</td>\n", 2733 " <td>0.14924</td>\n", 2734 " <td>0.08671</td>\n", 2735 " <td>0.15497</td>\n", 2736 " <td>0.0765</td>\n", 2737 " </tr>\n", 2738 " <tr>\n", 2739 " <th>2</th>\n", 2740 " <td>scidocs</td>\n", 2741 " <td>sqlite</td>\n", 2742 " <td>17.85</td>\n", 2743 " <td>670</td>\n", 2744 " <td>56.64</td>\n", 2745 " <td>0.15156</td>\n", 2746 " <td>0.08822</td>\n", 2747 " <td>0.15717</td>\n", 2748 " <td>0.0776</td>\n", 2749 " </tr>\n", 2750 " <tr>\n", 2751 " <th>3</th>\n", 2752 " <td>scidocs</td>\n", 2753 " <td>rank</td>\n", 2754 " <td>13.11</td>\n", 2755 " <td>1056</td>\n", 2756 " <td>162.99</td>\n", 2757 " <td>0.14932</td>\n", 2758 " <td>0.08670</td>\n", 2759 " <td>0.15408</td>\n", 2760 " <td>0.0761</td>\n", 2761 " </tr>\n", 2762 " </tbody>\n", 2763 "</table>\n", 2764 "</div>\n",
2906 " </div>\n", 2907 " </div>\n" 2908 ] 2909 }, 2910 "metadata": {}, 2911 "execution_count": 23 2912 } 2913 ] 2914 }, 2915 { 2916 "cell_type": "code", 2917 "source": [ 2918 "df[df.source == \"scifact\"].reset_index(drop=True)" 2919 ], 2920 "metadata": { 2921 "colab": { 2922 "base_uri": "https://localhost:8080/", 2923 "height": 175 2924 }, 2925 "id": "7o3RVNt2bQFZ", 2926 "outputId": "6723c801-b485-45b3-e619-ca5ddcf8e7c3" 2927 }, 2928 "execution_count": null, 2929 "outputs": [ 2930 { 2931 "output_type": "execute_result", 2932 "data": { 2933 "text/plain": [ 2934 " source method index memory search ndcg_cut_10 map_cut_10 recall_10 \\\n", 2935 "0 scifact bm25 5.51
0.66324 0.61764 0.78761 \n", 2936 "1 scifact es 2.90 625 9.62 0.66058 0.61518 0.78428 \n", 2937 "2 scifact sqlite 1.85 631 20.28 0.66630 0.61966 0.79494 \n", 2938 "3 scifact rank 1.85 724 42.22 0.65618 0.61204 0.77400 \n", 2939 "\n", 2940 " P_10 \n", 2941 "0 0.08700 \n", 2942 "1 0.08667 \n", 2943 "2 0.08800 \n", 2944 "3 0.08500 " 2945 ], 2946 "text/html": [ 2947 "\n", 2948 "\n", 2949 " <div id=\"df-bdc5481f-7a1e-4570-8c0a-e7693ee76de0\">\n", 2950 " <div class=\"colab-df-container\">\n", 2951 " <div>\n", 2952 "<style scoped>\n", 2953 " .dataframe tbody tr th:only-of-type {\n", 2954 " vertical-align: middle;\n", 2955 " }\n", 2956 "\n", 2957 " .dataframe tbody tr th {\n", 2958 " vertical-align: top;\n", 2959 " }\n", 2960 "\n", 2961 " .dataframe thead th {\n", 2962 " text-align: right;\n", 2963 " }\n", 2964 "</style>\n", 2965 "<table border=\"1\" class=\"dataframe\">\n", 2966 " <thead>\n", 2967 " <tr style=\"text-align: right;\">\n", 2968 " <th></th>\n", 2969 " <th>source</th>\n", 2970 " <th>method</th>\n", 2971 " <th>index</th>\n", 2972 " <th>memory</th>\n", 2973 " <th>search</th>\n", 2974 " <th>ndcg_cut_10</th>\n", 2975 " <th>map_cut_10</th>\n", 2976 " <th>recall_10</th>\n", 2977 " <th>P_10</th>\n", 2978 " </tr>\n", 2979 " </thead>\n", 2980 " <tbody>\n", 2981 " <tr>\n", 2982 " <th>0</th>\n", 2983 " <td>scifact</td>\n", 2984 " <td>bm25</td>\n", 2985 " <td>5.51</td>\n", 2986 " <td>653</td>\n", 2987 " <td>1.07</td>\n", 2988 " <td>0.66324</td>\n", 2989 " <td>0.61764</td>\n", 2990 " <td>0.78761</td>\n", 2991 " <td>0.08700</td>\n", 2992 " </tr>\n", 2993 " <tr>\n", 2994 " <th>1</th>\n", 2995 " <td>scifact</td>\n", 2996 " <td>es</td>\n", 2997 " <td>2.90</td>\n", 2998 " <td>625</td>\n", 2999 " <td>9.62</td>\n", 3000 " <td>0.66058</td>\n", 3001 " <td>0.61518</td>\n", 3002 " <td>0.78428</td>\n", 3003 " <td>0.08667</td>\n", 3004 " </tr>\n", 3005 " <tr>\n", 3006 " <th>2</th>\n", 3007 " <td>scifact</td>\n", 3008 " <td>sqlite</td>\n", 3009 " <td>1.85</td>\n", 3010 " 
<td>631</td>\n", 3011 " <td>20.28</td>\n", 3012 " <td>0.66630</td>\n", 3013 " <td>0.61966</td>\n", 3014 " <td>0.79494</td>\n", 3015 " <td>0.08800</td>\n", 3016 " </tr>\n", 3017 " <tr>\n", 3018 " <th>3</th>\n", 3019 " <td>scifact</td>\n", 3020 " <td>rank</td>\n", 3021 " <td>1.85</td>\n", 3022 " <td>724</td>\n", 3023 " <td>42.22</td>\n", 3024 " <td>0.65618</td>\n", 3025 " <td>0.61204</td>\n", 3026 " <td>0.77400</td>\n", 3027 " <td>0.08500</td>\n", 3028 " </tr>\n", 3029 " </tbody>\n", 3030 "</table>\n", 3031 "</div>\n", 3032 " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bdc5481f-7a1e-4570-8c0a-e7693ee76de0')\"\n", 3033 " title=\"Convert this dataframe to an interactive table.\"\n", 3034 " style=\"display:none;\">\n", 3035 "\n", 3036 " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", 3037 " width=\"24px\">\n", 3038 " <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n", 3039 " <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n", 3040 " </svg>\n", 3041 " </button>\n", 3042 "\n", 3043 "\n", 3044 "\n", 3045 " <div id=\"df-bc649d91-bc68-40de-88b6-1a1cd41fadc2\">\n", 3046 " <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-bc649d91-bc68-40de-88b6-1a1cd41fadc2')\"\n", 3047 " title=\"Suggest charts.\"\n", 3048 " style=\"display:none;\">\n", 3049 "\n", 3050 "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n", 3051 " width=\"24px\">\n", 3052 " <g>\n", 3053 " <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 
17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n", 3054 " </g>\n", 3055 "</svg>\n", 3056 " </button>\n", 3057 " </div>\n", 3058 "\n", 3059 "<style>\n", 3060 " .colab-df-quickchart {\n", 3061 " background-color: #E8F0FE;\n", 3062 " border: none;\n", 3063 " border-radius: 50%;\n", 3064 " cursor: pointer;\n", 3065 " display: none;\n", 3066 " fill: #1967D2;\n", 3067 " height: 32px;\n", 3068 " padding: 0 0 0 0;\n", 3069 " width: 32px;\n", 3070 " }\n", 3071 "\n", 3072 " .colab-df-quickchart:hover {\n", 3073 " background-color: #E2EBFA;\n", 3074 " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", 3075 " fill: #174EA6;\n", 3076 " }\n", 3077 "\n", 3078 " [theme=dark] .colab-df-quickchart {\n", 3079 " background-color: #3B4455;\n", 3080 " fill: #D2E3FC;\n", 3081 " }\n", 3082 "\n", 3083 " [theme=dark] .colab-df-quickchart:hover {\n", 3084 " background-color: #434B5C;\n", 3085 " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", 3086 " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", 3087 " fill: #FFFFFF;\n", 3088 " }\n", 3089 "</style>\n", 3090 "\n", 3091 " <script>\n", 3092 " async function quickchart(key) {\n", 3093 " const containerElement = document.querySelector('#' + key);\n", 3094 " const charts = await google.colab.kernel.invokeFunction(\n", 3095 " 'suggestCharts', [key], {});\n", 3096 " }\n", 3097 " </script>\n", 3098 "\n", 3099 "\n", 3100 " <script>\n", 3101 "\n", 3102 "function displayQuickchartButton(domScope) {\n", 3103 " let quickchartButtonEl =\n", 3104 " domScope.querySelector('#df-bc649d91-bc68-40de-88b6-1a1cd41fadc2 button.colab-df-quickchart');\n", 3105 " quickchartButtonEl.style.display =\n", 3106 " google.colab.kernel.accessAllowed ? 
'block' : 'none';\n", 3107 "}\n", 3108 "\n", 3109 " displayQuickchartButton(document);\n", 3110 " </script>\n", 3111 " <style>\n", 3112 " .colab-df-container {\n", 3113 " display:flex;\n", 3114 " flex-wrap:wrap;\n", 3115 " gap: 12px;\n", 3116 " }\n", 3117 "\n", 3118 " .colab-df-convert {\n", 3119 " background-color: #E8F0FE;\n", 3120 " border: none;\n", 3121 " border-radius: 50%;\n", 3122 " cursor: pointer;\n", 3123 " display: none;\n", 3124 " fill: #1967D2;\n", 3125 " height: 32px;\n", 3126 " padding: 0 0 0 0;\n", 3127 " width: 32px;\n", 3128 " }\n", 3129 "\n", 3130 " .colab-df-convert:hover {\n", 3131 " background-color: #E2EBFA;\n", 3132 " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n", 3133 " fill: #174EA6;\n", 3134 " }\n", 3135 "\n", 3136 " [theme=dark] .colab-df-convert {\n", 3137 " background-color: #3B4455;\n", 3138 " fill: #D2E3FC;\n", 3139 " }\n", 3140 "\n", 3141 " [theme=dark] .colab-df-convert:hover {\n", 3142 " background-color: #434B5C;\n", 3143 " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n", 3144 " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n", 3145 " fill: #FFFFFF;\n", 3146 " }\n", 3147 " </style>\n", 3148 "\n", 3149 " <script>\n", 3150 " const buttonEl =\n", 3151 " document.querySelector('#df-bdc5481f-7a1e-4570-8c0a-e7693ee76de0 button.colab-df-convert');\n", 3152 " buttonEl.style.display =\n", 3153 " google.colab.kernel.accessAllowed ? 'block' : 'none';\n", 3154 "\n", 3155 " async function convertToInteractive(key) {\n", 3156 " const element = document.querySelector('#df-bdc5481f-7a1e-4570-8c0a-e7693ee76de0');\n", 3157 " const dataTable =\n", 3158 " await google.colab.kernel.invokeFunction('convertToInteractive',\n", 3159 " [key], {});\n", 3160 " if (!dataTable) return;\n", 3161 "\n", 3162 " const docLinkHtml = 'Like what you see? 
Visit the ' +\n", 3163 " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n", 3164 " + ' to learn more about interactive tables.';\n", 3165 " element.innerHTML = '';\n", 3166 " dataTable['output_type'] = 'display_data';\n", 3167 " await google.colab.output.renderOutput(dataTable, element);\n", 3168 " const docLink = document.createElement('div');\n", 3169 " docLink.innerHTML = docLinkHtml;\n", 3170 " element.appendChild(docLink);\n", 3171 " }\n", 3172 " </script>\n", 3173 " </div>\n", 3174 " </div>\n" 3175 ] 3176 }, 3177 "metadata": {}, 3178 "execution_count": 24 3179 } 3180 ] 3181 }, 3182 { 3183 "cell_type": "markdown", 3184 "source": [ 3185 "Once again, txtai's implementation compares well with Elasticsearch. The accuracy metrics vary slightly but are all about the same.\n", 3186 "\n", 3187 "It's important to note that in internal testing with solid-state storage, Elasticsearch and txtai run at about the same speed. The slower Elasticsearch times shown here are a product of running in a Google Colab environment." 3188 ], 3189 "metadata": { 3190 "id": "1INPBYQ2lf22" 3191 } 3192 }, 3193 { 3194 "cell_type": "markdown", 3195 "source": [ 3196 "# Wrapping up\n", 3197 "\n", 3198 "This notebook showed how to build an efficient sparse keyword index in Python. The benchmarks show that txtai provides a strong implementation from both an accuracy and a speed standpoint, on par with Apache Lucene.\n", 3199 "\n", 3200 "This keyword index can be used as a standalone index in Python or in combination with dense vector indexes to form a `hybrid` index." 3201 ], 3202 "metadata": { 3203 "id": "f41NSYWc0dsy" 3204 } 3205 } 3206 ] 3207 }