06_Extractive_QA_with_Elasticsearch.ipynb
  1  {
  2    "nbformat": 4,
  3    "nbformat_minor": 0,
  4    "metadata": {
  5      "colab": {
  6        "provenance": []
  7      },
  8      "kernelspec": {
  9        "name": "python3",
 10        "display_name": "Python 3"
 11      }
 12    },
 13    "cells": [
 14      {
 15        "cell_type": "markdown",
 16        "metadata": {
 17          "id": "zzZbP0LM6m5z"
 18        },
 19        "source": [
 20          "# Extractive QA with Elasticsearch\n",
 21          "\n",
 22          "txtai is datastore agnostic: the library analyzes sets of text. The following example shows how extractive question-answering can be added on top of an Elasticsearch system."
 23        ]
 24      },
 25      {
 26        "cell_type": "markdown",
 27        "metadata": {
 28          "id": "xk7t5Jcd6reO"
 29        },
 30        "source": [
 31          "# Install dependencies\n",
 32          "\n",
 33          "Install `txtai` and `Elasticsearch`."
 34        ]
 35      },
 36      {
 37        "cell_type": "code",
 38        "metadata": {
 39          "id": "0y1UA4-q-YdA"
 40        },
 41        "source": [
 42          "%%capture\n",
 43          "\n",
 44          "# Install txtai and elasticsearch python client\n",
 45          "!pip install git+https://github.com/neuml/txtai elasticsearch\n",
 46          "\n",
 47          "# Download and extract elasticsearch\n",
 48          "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
 49          "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
 50          "!chown -R daemon:daemon elasticsearch-7.10.1"
 51        ],
 52        "execution_count": null,
 53        "outputs": []
 54      },
 55      {
 56        "cell_type": "markdown",
 57        "metadata": {
 58          "id": "nKWz-C5gCJy8"
 59        },
 60        "source": [
 61          "Start an instance of Elasticsearch directly within this notebook."
 62        ]
 63      },
 64      {
 65        "cell_type": "code",
 66        "metadata": {
 67          "id": "3ZfJeWbM6wmj"
 68        },
 69        "source": [
 70          "import os\n",
 71          "from subprocess import Popen, PIPE, STDOUT\n",
 72          "\n",
 73          "# If issues are encountered with this section, ES can be manually started as follows:\n",
 74          "# ./elasticsearch-7.10.1/bin/elasticsearch\n",
 75          "\n",
 76          "# Start and wait for server\n",
 77          "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n",
 78          "!sleep 30"
 79        ],
 80        "execution_count": null,
 81        "outputs": []
 82      },
 83      {
 84        "cell_type": "markdown",
 85        "metadata": {
 86          "id": "TWEn4w68-D1y"
 87        },
 88        "source": [
 89          "# Download data\n",
 90          "\n",
 91          "This example works off a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n",
 92          "\n",
 93          "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook."
 94        ]
 95      },
 96      {
 97        "cell_type": "code",
 98        "metadata": {
 99          "id": "8tVrIqSq-KBa"
100        },
101        "source": [
102          "%%capture\n",
103          "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n",
104          "!gunzip tests.gz\n",
105          "!mv tests articles.sqlite"
106        ],
107        "execution_count": null,
108        "outputs": []
109      },
110      {
111        "cell_type": "markdown",
112        "metadata": {
113          "id": "hSWFzkCn61tM"
114        },
115        "source": [
116          "# Load data into Elasticsearch\n",
117          "\n",
118          "The following block copies rows from SQLite to Elasticsearch."
119        ]
120      },
121      {
122        "cell_type": "code",
123        "metadata": {
124          "id": "So-OBvUT61QD",
125          "colab": {
126            "base_uri": "https://localhost:8080/"
127          },
128          "outputId": "9647b8f8-8471-41bf-ccfa-a75306665638"
129        },
130        "source": [
131          "import sqlite3\n",
132          "\n",
133          "import regex as re\n",
134          "\n",
135          "from elasticsearch import Elasticsearch, helpers\n",
136          "\n",
137          "# Connect to ES instance\n",
138          "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n",
139          "\n",
140          "# Connection to database file\n",
141          "db = sqlite3.connect(\"articles.sqlite\")\n",
142          "cur = db.cursor()\n",
143          "\n",
144          "# Elasticsearch bulk buffer\n",
145          "buffer = []\n",
146          "rows = 0\n",
147          "\n",
148          "# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.\n",
149          "cur.execute(\"SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null\")\n",
150          "for row in cur:\n",
151          "  # Build dict of name-value pairs for fields\n",
152          "  article = dict(zip((\"id\", \"article\", \"title\", \"published\", \"reference\", \"name\", \"text\"), row))\n",
153          "  name = article[\"name\"]\n",
154          "\n",
155          "  # Only process certain document sections\n",
156          "  if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
157          "    # Bulk action fields\n",
158          "    article[\"_id\"] = article[\"id\"]\n",
159          "    article[\"_index\"] = \"articles\"\n",
160          "\n",
161          "    # Buffer article\n",
162          "    buffer.append(article)\n",
163          "\n",
164          "    # Increment number of articles processed\n",
165          "    rows += 1\n",
166          "\n",
167          "    # Bulk load every 1000 records\n",
168          "    if rows % 1000 == 0:\n",
169          "      helpers.bulk(es, buffer)\n",
170          "      buffer = []\n",
171          "\n",
172          "      print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n",
173          "\n",
174          "if buffer:\n",
175          "  helpers.bulk(es, buffer)\n",
176          "\n",
177          "print(\"Total articles inserted: {}\".format(rows))\n"
178        ],
179        "execution_count": null,
180        "outputs": [
181          {
182            "output_type": "stream",
183            "name": "stdout",
184            "text": [
185              "Total articles inserted: 21499\n"
186            ]
187          }
188        ]
189      },
190      {
191        "cell_type": "markdown",
192        "metadata": {
193          "id": "X5RO-VNwzMAo"
194        },
195        "source": [
196          "# Query data\n",
197          "\n",
198          "The following cell queries Elasticsearch for the terms \"risk factors\". It finds the top 5 matches and returns the document associated with each match.\n",
199          "\n"
200        ]
201      },
202      {
203        "cell_type": "code",
204        "metadata": {
205          "id": "ucd9mwSfFTMm",
206          "colab": {
207            "base_uri": "https://localhost:8080/",
208            "height": 348
209          },
210          "outputId": "b21d6aff-6abe-48f5-9914-7b7fb8472adb"
211        },
212        "source": [
213          "import pandas as pd\n",
214          "\n",
215          "from IPython.display import display, HTML\n",
216          "\n",
217          "pd.set_option(\"display.max_colwidth\", None)\n",
218          "\n",
219          "query = {\n",
220          "    \"_source\": [\"article\", \"title\", \"published\", \"reference\", \"text\"],\n",
221          "    \"size\": 5,\n",
222          "    \"query\": {\n",
223          "        \"query_string\": {\"query\": \"risk factors\"}\n",
224          "    }\n",
225          "}\n",
226          "\n",
227          "results = []\n",
228          "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
229          "  source = result[\"_source\"]\n",
230          "  results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]))\n",
231          "\n",
232          "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n",
233          "\n",
234          "display(HTML(df.to_html(index=False)))"
235        ],
236        "execution_count": null,
237        "outputs": [
238          {
239            "output_type": "display_data",
240            "data": {
241              "text/html": [
242                "<table border=\"1\" class=\"dataframe\">\n",
243                "  <thead>\n",
244                "    <tr style=\"text-align: right;\">\n",
245                "      <th>Title</th>\n",
246                "      <th>Published</th>\n",
247                "      <th>Reference</th>\n",
248                "      <th>Match</th>\n",
249                "    </tr>\n",
250                "  </thead>\n",
251                "  <tbody>\n",
252                "    <tr>\n",
253                "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
254                "      <td>2020-04-24 00:00:00</td>\n",
255                "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
256                "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
257                "    </tr>\n",
258                "    <tr>\n",
259                "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
260                "      <td>2020-04-27 00:00:00</td>\n",
261                "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
262                "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
263                "    </tr>\n",
264                "    <tr>\n",
265                "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
266                "      <td>2020-07-23 00:00:00</td>\n",
267                "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
268                "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
269                "    </tr>\n",
270                "    <tr>\n",
271                "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
272                "      <td>2020-03-15 00:00:00</td>\n",
273                "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
274                "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
275                "    </tr>\n",
276                "    <tr>\n",
277                "      <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n",
278                "      <td>2020-05-11 00:00:00</td>\n",
279                "      <td>http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1</td>\n",
280                "      <td>In addition, many risk factors for covid-19 documented in the literature are highly correlated and it is not clear which may be independently related to risk.</td>\n",
281                "    </tr>\n",
282                "  </tbody>\n",
283                "</table>"
284              ],
285              "text/plain": [
286                "<IPython.core.display.HTML object>"
287              ]
288            },
289            "metadata": {}
290          }
291        ]
292      },
293      {
294        "cell_type": "markdown",
295        "metadata": {
296          "id": "ylxOKji1-9_K"
297        },
298        "source": [
299          "# Derive columns with Extractive QA\n",
300          "\n",
301          "The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. Each answer is added as a derived column for that article."
302        ]
303      },
304      {
305        "cell_type": "code",
306        "metadata": {
307          "id": "mwBTrCkcOM_H"
308        },
309        "source": [
310          "%%capture\n",
311          "from txtai.embeddings import Embeddings\n",
312          "from txtai.pipeline import Extractor\n",
313          "\n",
314          "# Create embeddings model, backed by sentence-transformers & transformers\n",
315          "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n",
316          "\n",
317          "# Create extractor instance using qa model designed for the CORD-19 dataset\n",
318          "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")"
319        ],
320        "execution_count": null,
321        "outputs": []
322      },
323      {
324        "cell_type": "code",
325        "metadata": {
326          "id": "Yv75Lh-cOpL9",
327          "colab": {
328            "base_uri": "https://localhost:8080/",
329            "height": 400
330          },
331          "outputId": "adee88e1-02bf-4a20-febb-6d2c170a63f9"
332        },
333        "source": [
334          "document = {\n",
335          "    \"_source\": [\"id\", \"name\", \"text\"],\n",
336          "    \"size\": 1000,\n",
337          "    \"query\": {\n",
338          "        \"term\": {\"article\": None}\n",
339          "    },\n",
340          "    \"sort\" : [\"id\"]\n",
341          "}\n",
342          "\n",
343          "def sections(article):\n",
344          "  rows = []\n",
345          "\n",
346          "  # Build a fresh query per call so the shared template is not mutated\n",
347          "  search = dict(document, query={\"term\": {\"article\": article}})\n",
348          "\n",
349          "  for result in es.search(index=\"articles\", body=search)[\"hits\"][\"hits\"]:\n",
350          "    source = result[\"_source\"]\n",
351          "    name, text = source[\"name\"], source[\"text\"]\n",
352          "\n",
353          "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
354          "      rows.append(text)\n",
355          "  \n",
356          "  return rows\n",
357          "\n",
358          "results = []\n",
359          "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
360          "  source = result[\"_source\"]\n",
361          "\n",
362          "  # Use QA extractor to derive additional columns\n",
363          "  answers = extractor([(\"Risk factors\", \"risk factor\", \"What are names of risk factors?\", False),\n",
364          "                       (\"Locations\", \"city country state\", \"What are names of locations?\", False)], sections(source[\"article\"]))\n",
365          "\n",
366          "  results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]) + tuple([answer[1] for answer in answers]))\n",
367          "\n",
368          "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n",
369          "\n",
370          "display(HTML(df.to_html(index=False)))"
371        ],
372        "execution_count": null,
373        "outputs": [
374          {
375            "output_type": "display_data",
376            "data": {
377              "text/html": [
378                "<table border=\"1\" class=\"dataframe\">\n",
379                "  <thead>\n",
380                "    <tr style=\"text-align: right;\">\n",
381                "      <th>Title</th>\n",
382                "      <th>Published</th>\n",
383                "      <th>Reference</th>\n",
384                "      <th>Match</th>\n",
385                "      <th>Risk Factors</th>\n",
386                "      <th>Locations</th>\n",
387                "    </tr>\n",
388                "  </thead>\n",
389                "  <tbody>\n",
390                "    <tr>\n",
391                "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
392                "      <td>2020-05-21 00:00:00</td>\n",
393                "      <td>https://doi.org/10.1002/cpt.1910</td>\n",
394                "      <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n",
395                "      <td>sex, obesity, genetic factors and mechanical factors</td>\n",
396                "      <td>None</td>\n",
397                "    </tr>\n",
398                "    <tr>\n",
399                "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
400                "      <td>2020-04-24 00:00:00</td>\n",
401                "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
402                "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
403                "      <td>None</td>\n",
404                "      <td>Abbott, Abbott Park, Illinois</td>\n",
405                "    </tr>\n",
406                "    <tr>\n",
407                "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
408                "      <td>2020-04-27 00:00:00</td>\n",
409                "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
410                "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
411                "      <td>None</td>\n",
412                "      <td>None</td>\n",
413                "    </tr>\n",
414                "    <tr>\n",
415                "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
416                "      <td>2020-07-23 00:00:00</td>\n",
417                "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
418                "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
419                "      <td>Frailty and multimorbidity</td>\n",
420                "      <td>comorbidity groupings and the corresponding health conditions</td>\n",
421                "    </tr>\n",
422                "    <tr>\n",
423                "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
424                "      <td>2020-03-15 00:00:00</td>\n",
425                "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
426                "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
427                "      <td>Mandatory contact tracing and quarantine</td>\n",
428                "      <td>cities, provinces, and countries</td>\n",
429                "    </tr>\n",
430                "  </tbody>\n",
431                "</table>"
432              ],
433              "text/plain": [
434                "<IPython.core.display.HTML object>"
435              ]
436            },
437            "metadata": {}
438          }
439        ]
440      }
441    ]
442  }