15_Distributed_embeddings_cluster.ipynb
  1  {
  2    "nbformat": 4,
  3    "nbformat_minor": 0,
  4    "metadata": {
  5      "colab": {
  6        "provenance": []
  7      },
  8      "kernelspec": {
  9        "name": "python3",
 10        "display_name": "Python 3"
 11      }
 12    },
 13    "cells": [
 14      {
 15        "cell_type": "markdown",
 16        "metadata": {
 17          "id": "4Pjmz-RORV8E"
 18        },
 19        "source": [
 20          "# Distributed embeddings cluster\n",
 21          "\n",
 22          "The txtai API is a web service backed by [FastAPI](https://fastapi.tiangolo.com/) that exposes all txtai functionality. The API can also cluster multiple embeddings indices into a single logical index, which scales an index horizontally across multiple nodes.\n",
 23          "\n",
 24          "This notebook installs the txtai API and shows an example of building an embeddings cluster."
 25        ]
 26      },
 27      {
 28        "cell_type": "markdown",
 29        "metadata": {
 30          "id": "Dk31rbYjSTYm"
 31        },
 32        "source": [
 33          "# Install dependencies\n",
 34          "\n",
 35          "Install `txtai` and all dependencies. Since this notebook uses the API, we also need to install the `api` extras package."
 36        ]
 37      },
 38      {
 39        "cell_type": "code",
 40        "metadata": {
 41          "id": "XMQuuun2R06J"
 42        },
 43        "source": [
 44          "%%capture\n",
 45          "!pip install git+https://github.com/neuml/txtai#egg=txtai[api]"
 46        ],
 47        "execution_count": null,
 48        "outputs": []
 49      },
 50      {
 51        "cell_type": "markdown",
 52        "metadata": {
 53          "id": "PNPJ95cdTKSS"
 54        },
 55        "source": [
 56          "# Start distributed embeddings cluster\n",
 57          "\n",
 58          "First we'll start multiple API instances that will serve as embeddings index shards. Each shard stores a subset of the indexed data and these shards work in tandem to form a single logical index.\n",
 59          "\n",
 60          "Then we'll start the main API instance, which clusters the shards together into a single logical index.\n",
 61          "\n",
 62          "The API instances are all started in the background.\n"
 63        ]
 64      },
 65      {
 66        "cell_type": "code",
 67        "metadata": {
 68          "id": "USb4JXZHxqTA"
 69        },
 70        "source": [
 71          "import os\n",
 72          "os.chdir(\"/content\")"
 73        ],
 74        "execution_count": null,
 75        "outputs": []
 76      },
 77      {
 78        "cell_type": "code",
 79        "metadata": {
 80          "id": "nTDwXOUeTH2-",
 81          "colab": {
 82            "base_uri": "https://localhost:8080/"
 83          },
 84          "outputId": "dee26849-39ae-4390-8bba-76bf9025fa61"
 85        },
 86        "source": [
 87          "%%writefile index.yml\n",
 88          "writable: true\n",
 89          "\n",
 90          "# Embeddings settings\n",
 91          "embeddings:\n",
 92          "    path: sentence-transformers/nli-mpnet-base-v2\n",
 93          "    content: true"
 94        ],
 95        "execution_count": null,
 96        "outputs": [
 97          {
 98            "output_type": "stream",
 99            "name": "stdout",
100            "text": [
101              "Writing index.yml\n"
102            ]
103          }
104        ]
105      },
106      {
107        "cell_type": "code",
108        "metadata": {
109          "colab": {
110            "base_uri": "https://localhost:8080/"
111          },
112          "id": "iCdBh-JgfyBl",
113          "outputId": "0066e314-7461-47c7-ca3b-15204911783e"
114        },
115        "source": [
116          "%%writefile cluster.yml\n",
117          "# Embeddings cluster\n",
118          "cluster:\n",
119          "    shards:\n",
120          "        - http://127.0.0.1:8001\n",
121          "        - http://127.0.0.1:8002"
122        ],
123        "execution_count": null,
124        "outputs": [
125          {
126            "output_type": "stream",
127            "name": "stdout",
128            "text": [
129              "Writing cluster.yml\n"
130            ]
131          }
132        ]
133      },
134      {
135        "cell_type": "code",
136        "metadata": {
137          "id": "nGITHxUyRzyp"
138        },
139        "source": [
140          "# Start embeddings shards\n",
141          "!CONFIG=index.yml nohup uvicorn --port 8001 \"txtai.api:app\" &> shard-1.log &\n",
142          "!CONFIG=index.yml nohup uvicorn --port 8002 \"txtai.api:app\" &> shard-2.log &\n",
143          "\n",
144          "# Start main instance\n",
145          "!CONFIG=cluster.yml nohup uvicorn --port 8000 \"txtai.api:app\" &> main.log &\n",
146          "\n",
147          "# Give all three instances time to start before sending requests\n",
148          "!sleep 90"
149        ],
150        "execution_count": null,
151        "outputs": []
152      },
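         {
           "cell_type": "markdown",
           "metadata": {},
           "source": [
             "The fixed `sleep 90` above gives each instance time to start, but an instance may be ready sooner or later than that. As a minimal sketch of an alternative, the hypothetical helpers below poll each port until it accepts TCP connections. `wait_for_port` and `wait_for_cluster` are illustrative names, not part of txtai, and the sketch assumes an open port means the instance is ready to serve requests.\n",
             "\n",
             "```python\n",
             "import socket\n",
             "import time\n",
             "\n",
             "\n",
             "def wait_for_port(host, port, timeout=90.0, interval=1.0):\n",
             "    # Hypothetical helper: True once host:port accepts TCP connections, False on timeout\n",
             "    deadline = time.monotonic() + timeout\n",
             "    while time.monotonic() < deadline:\n",
             "        try:\n",
             "            # A successful connect means a process is listening on the port\n",
             "            with socket.create_connection((host, port), timeout=interval):\n",
             "                return True\n",
             "        except OSError:\n",
             "            time.sleep(interval)\n",
             "    return False\n",
             "\n",
             "\n",
             "def wait_for_cluster(ports=(8001, 8002, 8000), host=\"127.0.0.1\", timeout=90.0):\n",
             "    # Assumption: an open port means the instance is ready to serve requests\n",
             "    return {port: wait_for_port(host, port, timeout) for port in ports}\n",
             "```\n",
             "\n",
             "Calling `wait_for_cluster()` in place of the fixed sleep would return a dict mapping each port to whether it came up within the timeout."
           ]
         },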
153      {
154        "cell_type": "markdown",
155        "metadata": {
156          "id": "lxkbVng3giWP"
157        },
158        "source": [
159          "# Python\n",
160          "\n",
161          "Let's first try the cluster out directly in Python. The code below connects to the two shards as a single cluster, indexes sample data and runs a search against it."
162        ]
163      },
164      {
165        "cell_type": "code",
166        "metadata": {
167          "colab": {
168            "base_uri": "https://localhost:8080/"
169          },
170          "id": "36HGAokoglfg",
171          "outputId": "368ae013-2afc-4a1b-d7df-c429183637d7"
172        },
173        "source": [
174          "%%writefile run.py\n",
175          "from txtai.api import Cluster\n",
176          "\n",
177          "cluster = Cluster({\"shards\": [\"http://127.0.0.1:8001\", \"http://127.0.0.1:8002\"]})\n",
178          "\n",
179          "data = [\n",
180          "    \"US tops 5 million confirmed virus cases\",\n",
181          "    \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
182          "    \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
183          "    \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
184          "    \"Maine man wins $1M from $25 lottery ticket\",\n",
185          "    \"Make huge profits without work, earn up to $100,000 a day\",\n",
186          "]\n",
187          "\n",
188          "# Index data\n",
189          "cluster.add([{\"id\": x, \"text\": row} for x, row in enumerate(data)])\n",
190          "cluster.index()\n",
191          "\n",
192          "# Test search\n",
193          "result = cluster.search(\"feel good story\", 1)[0]\n",
194          "print(\"Query: feel good story\\nResult:\", result[\"text\"])"
195        ],
196        "execution_count": null,
197        "outputs": [
198          {
199            "output_type": "stream",
200            "name": "stdout",
201            "text": [
202              "Writing run.py\n"
203            ]
204          }
205        ]
206      },
207      {
208        "cell_type": "code",
209        "metadata": {
210          "colab": {
211            "base_uri": "https://localhost:8080/"
212          },
213          "id": "6dQOzcfEs2Pk",
214          "outputId": "a667594a-b778-4e4e-a75c-72e7982b7fbe"
215        },
216        "source": [
217          "!python run.py"
218        ],
219        "execution_count": null,
220        "outputs": [
221          {
222            "output_type": "stream",
223            "name": "stdout",
224            "text": [
225              "Query: feel good story\n",
226              "Result: Maine man wins $1M from $25 lottery ticket\n"
227            ]
228          }
229        ]
230      },
231      {
232        "cell_type": "markdown",
233        "metadata": {
234          "id": "NHvBFZeSd9AG"
235        },
236        "source": [
237          "# JavaScript\n",
238          "\n",
239          "Next, let's run the same queries through the API using JavaScript. The txtai JavaScript client can be installed with npm.\n",
240          "\n",
241          "```bash\n",
242          "npm install txtai\n",
243          "```\n",
244          "\n",
245          "For this example, we'll clone the txtai.js project to import the example build configuration."
246        ]
247      },
248      {
249        "cell_type": "code",
250        "metadata": {
251          "id": "b52knObEdcCr"
252        },
253        "source": [
254          "%%capture\n",
255          "!git clone https://github.com/neuml/txtai.js"
256        ],
257        "execution_count": null,
258        "outputs": []
259      },
260      {
261        "cell_type": "markdown",
262        "metadata": {
263          "id": "rUGS0t-JMsS9"
264        },
265        "source": [
266          "## Run cluster.js\n",
267          "\n",
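268          "The following script is a JavaScript version of the search logic above. It only runs queries and relies on the data the Python script already indexed, so the `data` array is unused."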
269        ]
270      },
271      {
272        "cell_type": "code",
273        "metadata": {
274          "colab": {
275            "base_uri": "https://localhost:8080/"
276          },
277          "id": "bPQ40_xRyFmA",
278          "outputId": "b86a12c4-f2c7-427b-bd28-edba354c6713"
279        },
280        "source": [
281          "%%writefile txtai.js/examples/node/src/cluster.js\n",
282          "import {Embeddings} from \"txtai\";\n",
283          "import {sprintf} from \"sprintf-js\";\n",
284          "\n",
285          "const run = async () => {\n",
286          "    try {\n",
287          "        let embeddings = new Embeddings(process.argv[2]);\n",
288          "\n",
289          "        let data  = [\"US tops 5 million confirmed virus cases\",\n",
290          "                     \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
291          "                     \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
292          "                     \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
293          "                     \"Maine man wins $1M from $25 lottery ticket\",\n",
294          "                     \"Make huge profits without work, earn up to $100,000 a day\"];\n",
295          "\n",
296          "        console.log();\n",
297          "        console.log(\"Querying an Embeddings cluster\");\n",
298          "        console.log(sprintf(\"%-20s %s\", \"Query\", \"Best Match\"));\n",
299          "        console.log(\"-\".repeat(50));\n",
300          "\n",
301          "        for (let query of [\"feel good story\", \"climate change\", \"public health story\", \"war\", \"wildlife\", \"asia\", \"lucky\", \"dishonest junk\"]) {\n",
302          "            let results = await embeddings.search(query, 1);\n",
303          "            if (results && results.length > 0) {\n",
304          "              let result = results[0].text;\n",
305          "              console.log(sprintf(\"%-20s %s\", query, result));\n",
306          "            }\n",
307          "        }\n",
308          "    }\n",
309          "    catch (e) {\n",
310          "        console.trace(e);\n",
311          "    }\n",
312          "};\n",
313          "\n",
314          "run();"
315        ],
316        "execution_count": null,
317        "outputs": [
318          {
319            "output_type": "stream",
320            "name": "stdout",
321            "text": [
322              "Writing txtai.js/examples/node/src/cluster.js\n"
323            ]
324          }
325        ]
326      },
327      {
328        "cell_type": "markdown",
329        "metadata": {
330          "id": "nTBs11j-GtD-"
331        },
332        "source": [
333          "## Build and run cluster.js\n",
334          "\n",
335          "\n",
336          "Install the example's dependencies and build the script with npm."
337        ]
338      },
339      {
340        "cell_type": "code",
341        "metadata": {
342          "id": "kC5Oub6wa1nK"
343        },
344        "source": [
345          "%%capture\n",
346          "os.chdir(\"txtai.js/examples/node\")\n",
347          "!npm install\n",
348          "!npm run build"
349        ],
350        "execution_count": null,
351        "outputs": []
352      },
353      {
354        "cell_type": "markdown",
355        "metadata": {
356          "id": "Xr5IlvqH8W77"
357        },
358        "source": [
359          "Next, let's run the code against the main cluster URL."
360        ]
361      },
362      {
363        "cell_type": "code",
364        "metadata": {
365          "colab": {
366            "base_uri": "https://localhost:8080/"
367          },
368          "id": "ckOHNqyaeL-B",
369          "outputId": "9c243fac-2316-4b8e-b044-6de529a8f3e8"
370        },
371        "source": [
372          "!node dist/cluster.js http://127.0.0.1:8000"
373        ],
374        "execution_count": null,
375        "outputs": [
376          {
377            "output_type": "stream",
378            "name": "stdout",
379            "text": [
380              "\n",
381              "Querying an Embeddings cluster\n",
382              "Query                Best Match\n",
383              "--------------------------------------------------\n",
384              "feel good story      Maine man wins $1M from $25 lottery ticket\n",
385              "climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
386              "public health story  US tops 5 million confirmed virus cases\n",
387              "war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
388              "wildlife             The National Park Service warns against sacrificing slower friends in a bear attack\n",
389              "asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
390              "lucky                Maine man wins $1M from $25 lottery ticket\n",
391              "dishonest junk       Make huge profits without work, earn up to $100,000 a day\n"
392            ]
393          }
394        ]
395      },
396      {
397        "cell_type": "markdown",
398        "metadata": {
399          "id": "1yukBIMYG5OE"
400        },
401        "source": [
402          "The JavaScript program returns the same match for \"feel good story\" as the Python code above. Each query runs against both shards in the cluster and the results are aggregated together.\n",
403          "\n",
404          "The same queries can also be run against each shard individually to see what each one returns on its own."
405        ]
406      },
407      {
408        "cell_type": "code",
409        "metadata": {
410          "colab": {
411            "base_uri": "https://localhost:8080/"
412          },
413          "id": "73rZCo4O4IQR",
414          "outputId": "9f2cb119-7a21-41d9-fdbf-4410af246934"
415        },
416        "source": [
417          "!node dist/cluster.js http://127.0.0.1:8001"
418        ],
419        "execution_count": null,
420        "outputs": [
421          {
422            "output_type": "stream",
423            "name": "stdout",
424            "text": [
425              "\n",
426              "Querying an Embeddings cluster\n",
427              "Query                Best Match\n",
428              "--------------------------------------------------\n",
429              "feel good story      Maine man wins $1M from $25 lottery ticket\n",
430              "climate change       Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
431              "public health story  US tops 5 million confirmed virus cases\n",
432              "war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
433              "wildlife             Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
434              "asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate\n",
435              "lucky                Maine man wins $1M from $25 lottery ticket\n"
436            ]
437          }
438        ]
439      },
440      {
441        "cell_type": "code",
442        "metadata": {
443          "colab": {
444            "base_uri": "https://localhost:8080/"
445          },
446          "id": "ZeVBLJyr4Knr",
447          "outputId": "b75691a4-25bf-43dc-8878-f9792a4430b8"
448        },
449        "source": [
450          "!node dist/cluster.js http://127.0.0.1:8002"
451        ],
452        "execution_count": null,
453        "outputs": [
454          {
455            "output_type": "stream",
456            "name": "stdout",
457            "text": [
458              "\n",
459              "Querying an Embeddings cluster\n",
460              "Query                Best Match\n",
461              "--------------------------------------------------\n",
462              "feel good story      Make huge profits without work, earn up to $100,000 a day\n",
463              "climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\n",
464              "public health story  The National Park Service warns against sacrificing slower friends in a bear attack\n",
465              "war                  The National Park Service warns against sacrificing slower friends in a bear attack\n",
466              "wildlife             The National Park Service warns against sacrificing slower friends in a bear attack\n",
467              "asia                 The National Park Service warns against sacrificing slower friends in a bear attack\n",
468              "lucky                The National Park Service warns against sacrificing slower friends in a bear attack\n",
469              "dishonest junk       Make huge profits without work, earn up to $100,000 a day\n"
470            ]
471          }
472        ]
473      },
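         {
           "cell_type": "markdown",
           "metadata": {},
           "source": [
             "The per-shard outputs above suggest how a clustered result is formed: each shard proposes its own best matches and the aggregated result keeps the highest scoring ones. The sketch below illustrates that merge step. It is not txtai's implementation, `merge_results` is a hypothetical name, and the scores are made up for illustration. The ids refer to positions in the sample data.\n",
             "\n",
             "```python\n",
             "import heapq\n",
             "\n",
             "\n",
             "def merge_results(shard_results, limit):\n",
             "    # Flatten the per-shard (id, score) lists, then keep the best `limit` hits\n",
             "    combined = [hit for results in shard_results for hit in results]\n",
             "    return heapq.nlargest(limit, combined, key=lambda hit: hit[1])\n",
             "\n",
             "\n",
             "# Hypothetical (id, score) results for one query, one list per shard\n",
             "shard_1 = [(4, 0.08), (0, 0.03)]\n",
             "shard_2 = [(5, 0.05), (1, 0.01)]\n",
             "\n",
             "print(merge_results([shard_1, shard_2], 1))\n",
             "```\n",
             "\n",
             "With these made-up scores, the top hit from the first shard outranks the top hit from the second, so only the first shard's match survives the merge."
           ]
         },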
474      {
475        "cell_type": "markdown",
476        "metadata": {
477          "id": "J2I_4hmZ8uXs"
478        },
479        "source": [
480          "Note the differences: each shard only returns matches from the subset of records it stores. The cell below runs a count against the full cluster and each shard to show the number of records in each."
481        ]
482      },
483      {
484        "cell_type": "code",
485        "metadata": {
486          "colab": {
487            "base_uri": "https://localhost:8080/"
488          },
489          "id": "BKm27yna4MWr",
490          "outputId": "bfc60af7-1b2b-451f-b10e-e2f8cf6f14fa"
491        },
492        "source": [
493          "!curl http://127.0.0.1:8000/count\n",
494          "!printf \"\\n\"\n",
495          "!curl http://127.0.0.1:8001/count\n",
496          "!printf \"\\n\"\n",
497          "!curl http://127.0.0.1:8002/count"
498        ],
499        "execution_count": null,
500        "outputs": [
501          {
502            "output_type": "stream",
503            "name": "stdout",
504            "text": [
505              "6\n",
506              "3\n",
507              "3"
508            ]
509          }
510        ]
511      },
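         {
           "cell_type": "markdown",
           "metadata": {},
           "source": [
             "The counts and per-shard results above are consistent with records being routed to shards by integer id modulo the number of shards: the records with ids 0, 2 and 4 appear on the first shard and those with ids 1, 3 and 5 on the second. The sketch below illustrates that routing rule. It is an assumption inferred from the output, not code taken from txtai, and `shard_for` is a hypothetical name.\n",
             "\n",
             "```python\n",
             "def shard_for(uid, shard_count):\n",
             "    # Assumed routing rule: integer id modulo the number of shards\n",
             "    return uid % shard_count\n",
             "\n",
             "\n",
             "data = [\n",
             "    \"US tops 5 million confirmed virus cases\",\n",
             "    \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
             "    \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
             "    \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
             "    \"Maine man wins $1M from $25 lottery ticket\",\n",
             "    \"Make huge profits without work, earn up to $100,000 a day\",\n",
             "]\n",
             "\n",
             "# Group the records the way the two shards appear to hold them\n",
             "shards = {0: [], 1: []}\n",
             "for uid, text in enumerate(data):\n",
             "    shards[shard_for(uid, len(shards))].append(text)\n",
             "\n",
             "for shard, rows in shards.items():\n",
             "    print(shard, len(rows), rows)\n",
             "```"
           ]
         },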
512      {
513        "cell_type": "markdown",
514        "metadata": {
515          "id": "6rKj-I0djRQj"
516        },
517        "source": [
518          "This notebook showed how a distributed embeddings cluster can be created with txtai. This example can be further scaled out on Kubernetes with StatefulSets, which will be covered in a future tutorial."
519        ]
520      }
521    ]
522  }