06_Extractive_QA_with_Elasticsearch.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zzZbP0LM6m5z"
      },
      "source": [
        "# Extractive QA with Elasticsearch\n",
        "\n",
        "txtai is datastore agnostic: the library analyzes sets of text regardless of where they are stored. The following example shows how extractive question-answering can be added on top of an Elasticsearch system."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xk7t5Jcd6reO"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and `Elasticsearch`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0y1UA4-q-YdA"
      },
      "source": [
        "%%capture\n",
        "\n",
        "# Install txtai and elasticsearch python client\n",
        "!pip install git+https://github.com/neuml/txtai elasticsearch\n",
        "\n",
        "# Download and extract elasticsearch\n",
        "!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz\n",
        "!chown -R daemon:daemon elasticsearch-7.10.1"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nKWz-C5gCJy8"
      },
      "source": [
        "Start an instance of Elasticsearch directly within this notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3ZfJeWbM6wmj"
      },
      "source": [
        "import os\n",
        "from subprocess import Popen, PIPE, STDOUT\n",
        "\n",
        "# If issues are encountered with this section, ES can be manually started as follows:\n",
        "# ./elasticsearch-7.10.1/bin/elasticsearch\n",
        "\n",
        "# Start and wait for server\n",
        "server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))\n",
        "!sleep 30"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TWEn4w68-D1y"
      },
      "source": [
        "# Download data\n",
        "\n",
        "This example works with a subset of the [CORD-19](https://www.semanticscholar.org/cord19) dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.\n",
        "\n",
        "The following download is a SQLite database generated from a [Kaggle notebook](https://www.kaggle.com/davidmezzetti/cord-19-slim/output). More information on this data format can be found in the [CORD-19 Analysis](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8tVrIqSq-KBa"
      },
      "source": [
        "%%capture\n",
        "!wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz\n",
        "!gunzip tests.gz\n",
        "!mv tests articles.sqlite"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hSWFzkCn61tM"
      },
      "source": [
        "# Load data into Elasticsearch\n",
        "\n",
        "The following block copies rows from SQLite to Elasticsearch."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "So-OBvUT61QD",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "9647b8f8-8471-41bf-ccfa-a75306665638"
      },
      "source": [
        "import sqlite3\n",
        "\n",
        "import regex as re\n",
        "\n",
        "from elasticsearch import Elasticsearch, helpers\n",
        "\n",
        "# Connect to ES instance\n",
        "es = Elasticsearch(hosts=[\"http://localhost:9200\"], timeout=60, retry_on_timeout=True)\n",
        "\n",
        "# Connection to database file\n",
        "db = sqlite3.connect(\"articles.sqlite\")\n",
        "cur = db.cursor()\n",
        "\n",
        "# Elasticsearch bulk buffer\n",
        "buffer = []\n",
        "rows = 0\n",
        "\n",
        "# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.\n",
        "cur.execute(\"SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null\")\n",
        "for row in cur:\n",
        "    # Build dict of name-value pairs for fields\n",
        "    article = dict(zip((\"id\", \"article\", \"title\", \"published\", \"reference\", \"name\", \"text\"), row))\n",
        "    name = article[\"name\"]\n",
        "\n",
        "    # Only process certain document sections\n",
        "    if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "        # Bulk action fields\n",
        "        article[\"_id\"] = article[\"id\"]\n",
        "        article[\"_index\"] = \"articles\"\n",
        "\n",
        "        # Buffer article\n",
        "        buffer.append(article)\n",
        "\n",
        "        # Increment number of articles processed\n",
        "        rows += 1\n",
        "\n",
        "        # Bulk load every 1000 records\n",
        "        if rows % 1000 == 0:\n",
        "            helpers.bulk(es, buffer)\n",
        "            buffer = []\n",
        "\n",
        "            print(\"Inserted {} articles\".format(rows), end=\"\\r\")\n",
        "\n",
        "if buffer:\n",
        "    helpers.bulk(es, buffer)\n",
        "\n",
        "print(\"Total articles inserted: {}\".format(rows))\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Total articles inserted: 21499\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X5RO-VNwzMAo"
      },
      "source": [
        "# Query data\n",
        "\n",
        "The following runs a query against Elasticsearch for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ucd9mwSfFTMm",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 348
        },
        "outputId": "b21d6aff-6abe-48f5-9914-7b7fb8472adb"
      },
      "source": [
        "import pandas as pd\n",
        "\n",
        "from IPython.display import display, HTML\n",
        "\n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "\n",
        "query = {\n",
        "    \"_source\": [\"article\", \"title\", \"published\", \"reference\", \"text\"],\n",
        "    \"size\": 5,\n",
        "    \"query\": {\n",
        "        \"query_string\": {\"query\": \"risk factors\"}\n",
        "    }\n",
        "}\n",
        "\n",
        "results = []\n",
        "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
        "    source = result[\"_source\"]\n",
        "    results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]))\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\"])\n",
        "\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
              "      <td>2020-04-24 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
              "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
              "      <td>2020-04-27 00:00:00</td>\n",
              "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
              "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
              "      <td>2020-07-23 00:00:00</td>\n",
              "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
              "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n",
              "      <td>2020-05-11 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?rss=1</td>\n",
              "      <td>In addition, many risk factors for covid-19 documented in the literature are highly correlated and it is not clear which may be independently related to risk.</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ylxOKji1-9_K"
      },
      "source": [
        "# Derive columns with Extractive QA\n",
        "\n",
        "The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. The answers are added as a derived column per article."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mwBTrCkcOM_H"
      },
      "source": [
        "%%capture\n",
        "from txtai.embeddings import Embeddings\n",
        "from txtai.pipeline import Extractor\n",
        "\n",
        "# Create embeddings model, backed by sentence-transformers & transformers\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\"})\n",
        "\n",
        "# Create extractor instance using qa model designed for the CORD-19 dataset\n",
        "extractor = Extractor(embeddings, \"NeuML/bert-small-cord19qa\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Yv75Lh-cOpL9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 400
        },
        "outputId": "adee88e1-02bf-4a20-febb-6d2c170a63f9"
      },
      "source": [
        "import copy\n",
        "\n",
        "# Query template used to retrieve all sections of a single article\n",
        "document = {\n",
        "    \"_source\": [\"id\", \"name\", \"text\"],\n",
        "    \"size\": 1000,\n",
        "    \"query\": {\n",
        "        \"term\": {\"article\": None}\n",
        "    },\n",
        "    \"sort\" : [\"id\"]\n",
        "}\n",
        "\n",
        "def sections(article):\n",
        "    rows = []\n",
        "\n",
        "    # Deep copy so the nested query dict in the shared template is not mutated\n",
        "    search = copy.deepcopy(document)\n",
        "    search[\"query\"][\"term\"][\"article\"] = article\n",
        "\n",
        "    for result in es.search(index=\"articles\", body=search)[\"hits\"][\"hits\"]:\n",
        "        source = result[\"_source\"]\n",
        "        name, text = source[\"name\"], source[\"text\"]\n",
        "\n",
        "        if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
        "            rows.append(text)\n",
        "\n",
        "    return rows\n",
        "\n",
        "results = []\n",
        "for result in es.search(index=\"articles\", body=query)[\"hits\"][\"hits\"]:\n",
        "    source = result[\"_source\"]\n",
        "\n",
        "    # Use QA extractor to derive additional columns\n",
        "    answers = extractor([(\"Risk factors\", \"risk factor\", \"What are names of risk factors?\", False),\n",
        "                         (\"Locations\", \"city country state\", \"What are names of locations?\", False)], sections(source[\"article\"]))\n",
        "\n",
        "    results.append((source[\"title\"], source[\"published\"], source[\"reference\"], source[\"text\"]) + tuple([answer[1] for answer in answers]))\n",
        "\n",
        "df = pd.DataFrame(results, columns=[\"Title\", \"Published\", \"Reference\", \"Match\", \"Risk Factors\", \"Locations\"])\n",
        "\n",
        "display(HTML(df.to_html(index=False)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Reference</th>\n",
              "      <th>Match</th>\n",
              "      <th>Risk Factors</th>\n",
              "      <th>Locations</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>2020-05-21 00:00:00</td>\n",
              "      <td>https://doi.org/10.1002/cpt.1910</td>\n",
              "      <td>Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .</td>\n",
              "      <td>sex, obesity, genetic factors and mechanical factors</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
              "      <td>2020-04-24 00:00:00</td>\n",
              "      <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
              "      <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
              "      <td>None</td>\n",
              "      <td>Abbott, Abbott Park, Illinois</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
              "      <td>2020-04-27 00:00:00</td>\n",
              "      <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
              "      <td>Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .</td>\n",
              "      <td>None</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
              "      <td>2020-07-23 00:00:00</td>\n",
              "      <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
              "      <td>BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.</td>\n",
              "      <td>Frailty and multimorbidity</td>\n",
              "      <td>comorbidity groupings and the corresponding health conditions</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>COVID-19: what has been learned and to be learned about the novel coronavirus disease</td>\n",
              "      <td>2020-03-15 00:00:00</td>\n",
              "      <td>https://doi.org/10.7150/ijbs.45134</td>\n",
              "      <td>• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.</td>\n",
              "      <td>Mandatory contact tracing and quarantine</td>\n",
              "      <td>cities, provinces, and countries</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {}
        }
      ]
    }
  ]
}
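
A note on the per-article search built in the last code cell: `dict.copy()` is shallow, so a copy of a query template still shares its nested `query` dict with the original, and assigning the article id through the copy also changes the template. The following is a minimal standalone sketch of that behavior and of one way around it with `copy.deepcopy`. The names `TEMPLATE` and `build_search` and the article ids are illustrative assumptions, not part of the notebook or of any library API, and the sketch does not contact Elasticsearch.

```python
import copy

# Hypothetical template mirroring the shape of the notebook's per-article query
TEMPLATE = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {"term": {"article": None}},
    "sort": ["id"],
}


def build_search(article):
    """Return a new search body for one article, leaving TEMPLATE untouched."""
    search = copy.deepcopy(TEMPLATE)
    search["query"]["term"]["article"] = article
    return search


# A shallow copy only duplicates the top-level dict; nested dicts are shared
shallow = TEMPLATE.copy()
shares_nested = shallow["query"] is TEMPLATE["query"]

# Deep copies are independent of each other and of the template
first = build_search("abc123")   # made-up article ids for illustration
second = build_search("def456")

print(shares_nested)
print(first["query"]["term"], second["query"]["term"], TEMPLATE["query"]["term"])
```

Building each search body from a deep copy keeps the template reusable, which matters if the same template is ever read elsewhere or used from more than one place at once.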