02_Build_an_Embeddings_index_with_Hugging_Face_Datasets.ipynb
  1  {
  2    "nbformat": 4,
  3    "nbformat_minor": 0,
  4    "metadata": {
  5      "colab": {
  6        "provenance": []
  7      },
  8      "kernelspec": {
  9        "name": "python3",
 10        "display_name": "Python 3"
 11      },
 12      "accelerator": "GPU"
 13    },
 14    "cells": [
 15      {
 16        "cell_type": "markdown",
 17        "metadata": {
 18          "id": "LjmhJ4ad9kBL"
 19        },
 20        "source": [
 21          "# Build an Embeddings index with Hugging Face Datasets\n",
 22          "\n",
 23          "This notebook shows how txtai can index and search with Hugging Face's [Datasets](https://github.com/huggingface/datasets) library. Datasets provides access to a large and growing list of publicly available datasets, along with functionality to select, transform and filter the data stored in each one.\n",
 24          "\n",
 25          "In this example, txtai will be used to index and query a dataset.\n",
 26          "\n",
 27          "**Make sure to select a GPU runtime when running this notebook**"
 28        ]
 29      },
 30      {
 31        "cell_type": "markdown",
 32        "metadata": {
 33          "id": "8tLWvo9v-Q0u"
 34        },
 35        "source": [
 36          "# Install dependencies\n",
 37          "\n",
 38          "Install `txtai` and all dependencies. Also install `datasets`."
 39        ]
 40      },
 41      {
 42        "cell_type": "code",
 43        "metadata": {
 44          "id": "Fa5BCjMFqVKE"
 45        },
 46        "source": [
 47          "%%capture\n",
 48          "!pip install git+https://github.com/neuml/txtai\n",
 49          "!pip install datasets"
 50        ],
 51        "execution_count": null,
 52        "outputs": []
 53      },
 54      {
 55        "cell_type": "markdown",
 56        "metadata": {
 57          "id": "hOdEv8MH-e5h"
 58        },
 59        "source": [
 60          "# Load dataset and build a txtai index\n",
 61          "\n",
 62          "In this example, we'll load the `ag_news` dataset, which is a collection of news article headlines and short descriptions. Loading it takes only a single line of code!\n",
 63          "\n",
 64          "Next, txtai will index the first 10,000 rows of the dataset. A sentence similarity model is used to compute sentence embeddings. sentence-transformers has a number of [pre-trained models](https://huggingface.co/models?pipeline_tag=sentence-similarity) that can be swapped in.\n",
 65          "\n",
 66          "In addition to the embeddings index, we'll also create a Similarity instance to re-rank search hits for relevance."
 67        ]
 68      },
 69      {
 70        "cell_type": "code",
 71        "metadata": {
 72          "id": "3hYRk9JnsM0J"
 73        },
 74        "source": [
 75          "%%capture\n",
 76          "from datasets import load_dataset\n",
 77          "\n",
 78          "from txtai.embeddings import Embeddings\n",
 79          "from txtai.pipeline import Similarity\n",
 80          "\n",
 81          "def stream(dataset, field, limit):\n",
 82          "  index = 0\n",
 83          "  for row in dataset:\n",
 84          "    yield (index, row[field], None)\n",
 85          "    index += 1\n",
 86          "\n",
 87          "    if index >= limit:\n",
 88          "      break\n",
 89          "\n",
 90          "def search(query):\n",
 91          "  return [(result[\"score\"], result[\"text\"]) for result in embeddings.search(query, limit=50)]\n",
 92          "\n",
 93          "def ranksearch(query):\n",
 94          "  results = [text for _, text in search(query)]\n",
 95          "  return [(score, results[x]) for x, score in similarity(query, results)]\n",
 96          "\n",
 97          "# Load HF dataset\n",
 98          "dataset = load_dataset(\"ag_news\", split=\"train\")\n",
 99          "\n",
100          "# Create embeddings model, backed by sentence-transformers & transformers, enable content storage\n",
101          "embeddings = Embeddings({\"path\": \"sentence-transformers/paraphrase-MiniLM-L3-v2\", \"content\": True})\n",
102          "embeddings.index(stream(dataset, \"text\", 10000))\n",
103          "\n",
104          "# Create similarity instance for re-ranking\n",
105          "similarity = Similarity(\"valhalla/distilbart-mnli-12-3\")"
106        ],
107        "execution_count": null,
108        "outputs": []
109      },
110      {
111        "cell_type": "markdown",
112        "metadata": {
113          "id": "LBhHcX6eFmGI"
114        },
115        "source": [
116          "# Search the dataset\n",
117          "\n",
118          "Now that an index is ready, let's search the data! The following section runs a series of queries and shows the results. Like basic search engines, txtai finds token matches. But the real power of txtai is finding semantically similar results.\n",
119          "\n",
120          "sentence-transformers has a great overview of [information retrieval](https://www.sbert.net/examples/applications/information-retrieval/README.html) that is well worth a read."
121        ]
122      },
123      {
124        "cell_type": "code",
125        "metadata": {
126          "id": "YVmbiY92vxEO",
127          "colab": {
128            "base_uri": "https://localhost:8080/",
129            "height": 1000
130          },
131          "outputId": "85f5e0ad-14ba-4642-aed6-13c14a710d68"
132        },
133        "source": [
134          "from IPython.display import display, HTML\n",
135          "\n",
136          "def table(query, rows):\n",
137          "    html = \"\"\"\n",
138          "    <style type='text/css'>\n",
139          "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
140          "    table {\n",
141          "      border-collapse: collapse;\n",
142          "      width: 900px;\n",
143          "    }\n",
144          "    th, td {\n",
145          "        border: 1px solid #9e9e9e;\n",
146          "        padding: 10px;\n",
147          "        font: 15px Oswald;\n",
148          "    }\n",
149          "    </style>\n",
150          "    \"\"\"\n",
151          "\n",
152          "    html += \"<h3>%s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>\" % (query)\n",
153          "    for score, text in rows:\n",
154          "        html += \"<tr><td>%.4f</td><td>%s</td></tr>\" % (score, text)\n",
155          "    html += \"</table>\"\n",
156          "\n",
157          "    display(HTML(html))\n",
158          "\n",
159          "for query in [\"Positive Apple reports\", \"Negative Apple reports\", \"Best planets to explore for life\", \"LA Dodgers good news\", \"LA Dodgers bad news\"]:\n",
160          "  table(query, ranksearch(query)[:2])\n"
161        ],
162        "execution_count": null,
163        "outputs": [
164          {
165            "output_type": "display_data",
166            "data": {
167              "text/html": [
168                "\n",
169                "    <style type='text/css'>\n",
170                "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
171                "    table {\n",
172                "      border-collapse: collapse;\n",
173                "      width: 900px;\n",
174                "    }\n",
175                "    th, td {\n",
176                "        border: 1px solid #9e9e9e;\n",
177                "        padding: 10px;\n",
178                "        font: 15px Oswald;\n",
179                "    }\n",
180                "    </style>\n",
181                "    <h3>Positive Apple reports</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9941</td><td>Apple's iPod a Huge Hit in Japan The iPod is proving a colossal hit on the Japanese electronics and entertainment giant's home ground. The tiny white machine is catching on as a fashion statement and turning into a cultural icon here, much the same way it won a fanatical following in the United States.</td></tr><tr><td>0.9886</td><td>Apple tops US consumer satisfaction Recent data published by the American Customer Satisfaction Index (ACSI) shows Apple leading the consumer computer industry with the the highest customer satisfaction.</td></tr></table>"
182              ],
183              "text/plain": [
184                "<IPython.core.display.HTML object>"
185              ]
186            },
187            "metadata": {}
188          },
189          {
190            "output_type": "display_data",
191            "data": {
192              "text/html": [
193                "\n",
194                "    <style type='text/css'>\n",
195                "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
196                "    table {\n",
197                "      border-collapse: collapse;\n",
198                "      width: 900px;\n",
199                "    }\n",
200                "    th, td {\n",
201                "        border: 1px solid #9e9e9e;\n",
202                "        padding: 10px;\n",
203                "        font: 15px Oswald;\n",
204                "    }\n",
205                "    </style>\n",
206                "    <h3>Negative Apple reports</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9847</td><td>Apple Recalls 28,000 Faulty Batteries Sold with 15-inch PowerBook Apple has had to recall up to 28,000 notebook batteries that were sold for use with their 15-inch PowerBook. Apple reports that faulty batteries sold between January 2004 and August 2004 can overheat and pose a fire hazard.</td></tr><tr><td>0.9795</td><td>Apple Announces Voluntary Recall of Powerbook Batteries Apple, in cooperation with the US Consumer Product Safety Commission (CPSC), announced Thursday a voluntary recall of 15 quot; Aluminum PowerBook batteries. The batteries being recalled could potentially overheat, though no injuries relating ...</td></tr></table>"
207              ],
208              "text/plain": [
209                "<IPython.core.display.HTML object>"
210              ]
211            },
212            "metadata": {}
213          },
214          {
215            "output_type": "display_data",
216            "data": {
217              "text/html": [
218                "\n",
219                "    <style type='text/css'>\n",
220                "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
221                "    table {\n",
222                "      border-collapse: collapse;\n",
223                "      width: 900px;\n",
224                "    }\n",
225                "    th, td {\n",
226                "        border: 1px solid #9e9e9e;\n",
227                "        padding: 10px;\n",
228                "        font: 15px Oswald;\n",
229                "    }\n",
230                "    </style>\n",
231                "    <h3>Best planets to explore for life</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9110</td><td>Tiny 'David' Telescope Finds 'Goliath' Planet A newfound planet detected by a small, 4-inch-diameter telescope demonstrates that we are at the cusp of a new age of planet discovery. Soon, new worlds may be located at an accelerating pace, bringing the detection of the first Earth-sized world one step closer.</td></tr><tr><td>0.8838</td><td>Venus: Inhabited World? by Harry Bortman    In part 1 of this interview with Astrobiology Magazine editor Henry Bortman, planetary scientist David Grinspoon explained how Venus evolved from a wet planet similar to Earth to the scorching hot, dried-out furnace of today. In part 2, Grinspoon discusses the possibility of life on Venus...</td></tr></table>"
232              ],
233              "text/plain": [
234                "<IPython.core.display.HTML object>"
235              ]
236            },
237            "metadata": {}
238          },
239          {
240            "output_type": "display_data",
241            "data": {
242              "text/html": [
243                "\n",
244                "    <style type='text/css'>\n",
245                "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
246                "    table {\n",
247                "      border-collapse: collapse;\n",
248                "      width: 900px;\n",
249                "    }\n",
250                "    th, td {\n",
251                "        border: 1px solid #9e9e9e;\n",
252                "        padding: 10px;\n",
253                "        font: 15px Oswald;\n",
254                "    }\n",
255                "    </style>\n",
256                "    <h3>LA Dodgers good news</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9961</td><td>Dodgers 7, Braves 4 Los Angeles, Ca. -- Shawn Green belted a grand slam and a solo homer as Los Angeles beat Mike Hampton and the Atlanta Braves 7-to-4 Saturday afternoon.</td></tr><tr><td>0.9928</td><td>MLB: Los Angeles 7, Atlanta 4 Shawn Green hit two home runs Saturday, including a grand slam, to lead the Los Angeles Dodgers to a 7-4 victory over the Atlanta Braves.</td></tr></table>"
257              ],
258              "text/plain": [
259                "<IPython.core.display.HTML object>"
260              ]
261            },
262            "metadata": {}
263          },
264          {
265            "output_type": "display_data",
266            "data": {
267              "text/html": [
268                "\n",
269                "    <style type='text/css'>\n",
270                "    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');\n",
271                "    table {\n",
272                "      border-collapse: collapse;\n",
273                "      width: 900px;\n",
274                "    }\n",
275                "    th, td {\n",
276                "        border: 1px solid #9e9e9e;\n",
277                "        padding: 10px;\n",
278                "        font: 15px Oswald;\n",
279                "    }\n",
280                "    </style>\n",
281                "    <h3>LA Dodgers bad news</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead><tr><td>0.9880</td><td>Expos Keep Dodgers at Bay With 8-7 Win (AP) AP - Giovanni Carrara walked Juan Rivera with the bases loaded and two outs in the ninth inning Monday night, spoiling Los Angeles' six-run comeback and handing the Montreal Expos an 8-7 victory over the Dodgers.</td></tr><tr><td>0.9671</td><td>Gagne blows his 2d save Pinch-hitter Lenny Harris delivered a three-run double off Eric Gagne with two outs in the ninth, rallying the Florida Marlins past the Dodgers, 6-4, last night in Los Angeles.</td></tr></table>"
282              ],
283              "text/plain": [
284                "<IPython.core.display.HTML object>"
285              ]
286            },
287            "metadata": {}
288          }
289        ]
290      }
291    ]
292  }