64_Embeddings_index_format_for_open_data_access.ipynb
1 { 2 "nbformat": 4, 3 "nbformat_minor": 0, 4 "metadata": { 5 "colab": { 6 "provenance": [] 7 }, 8 "kernelspec": { 9 "name": "python3", 10 "display_name": "Python 3" 11 }, 12 "language_info": { 13 "name": "python" 14 }, 15 "gpuClass": "standard", 16 "accelerator": "GPU" 17 }, 18 "cells": [ 19 { 20 "cell_type": "markdown", 21 "source": [ 22 "# Embeddings index format for open data access\n", 23 "\n", 24 "The main programming language for txtai is Python. A key tenet is that the underlying data in an embeddings index is accessible without txtai.\n", 25 "\n", 26 "This notebook will demonstrate this through a series of examples.\n" 27 ], 28 "metadata": { 29 "id": "-xU9P9iSR-Cy" 30 } 31 }, 32 { 33 "cell_type": "markdown", 34 "source": [ 35 "# Install dependencies\n", 36 "\n", 37 "Install `txtai` and all dependencies." 38 ], 39 "metadata": { 40 "id": "shlUi2kKS7KT" 41 } 42 }, 43 { 44 "cell_type": "code", 45 "execution_count": 1, 46 "metadata": { 47 "id": "xEvX9vCpn4E0" 48 }, 49 "outputs": [], 50 "source": [ 51 "%%capture\n", 52 "!pip install git+https://github.com/neuml/txtai#egg=txtai[graph] datasets sqlite-vec" 53 ] 54 }, 55 { 56 "cell_type": "markdown", 57 "source": [ 58 "# Load dataset\n", 59 "\n", 60 "This example will use the `awesome-chatgpt-prompts` dataset." 61 ], 62 "metadata": { 63 "id": "408IyXzKFSiG" 64 } 65 }, 66 { 67 "cell_type": "code", 68 "source": [ 69 "from datasets import load_dataset\n", 70 "\n", 71 "dataset = load_dataset(\"fka/awesome-chatgpt-prompts\", split=\"train\")" 72 ], 73 "metadata": { 74 "id": "IQ_ns6YvFRm1" 75 }, 76 "execution_count": null, 77 "outputs": [] 78 }, 79 { 80 "cell_type": "markdown", 81 "source": [ 82 "# Build an Embeddings index\n", 83 "\n", 84 "Let's first build an embeddings index using txtai."
85 ], 86 "metadata": { 87 "id": "AtEdP7Utw3mk" 88 } 89 }, 90 { 91 "cell_type": "code", 92 "source": [ 93 "from txtai import Embeddings\n", 94 "\n", 95 "embeddings = Embeddings()\n", 96 "embeddings.index((x[\"act\"], x[\"prompt\"]) for x in dataset)\n", 97 "embeddings.save(\"txtai-index\")" 98 ], 99 "metadata": { 100 "id": "DPWrubv5oOn7" 101 }, 102 "execution_count": null, 103 "outputs": [] 104 }, 105 { 106 "cell_type": "markdown", 107 "source": [ 108 "Let's take a look at the index that was created" 109 ], 110 "metadata": { 111 "id": "0zrjOl6JtPI2" 112 } 113 }, 114 { 115 "cell_type": "code", 116 "source": [ 117 "!ls -l txtai-index\n", 118 "!echo\n", 119 "!file txtai-index/*" 120 ], 121 "metadata": { 122 "colab": { 123 "base_uri": "https://localhost:8080/" 124 }, 125 "id": "i7PN9itKtIyx", 126 "outputId": "ac904afb-83a8-4c5f-cf62-f5340225b8ac" 127 }, 128 "execution_count": 4, 129 "outputs": [ 130 { 131 "output_type": "stream", 132 "name": "stdout", 133 "text": [ 134 "total 268\n", 135 "-rw-r--r-- 1 root root 342 Sep 6 15:21 config.json\n", 136 "-rw-r--r-- 1 root root 262570 Sep 6 15:21 embeddings\n", 137 "-rw-r--r-- 1 root root 2988 Sep 6 15:21 ids\n", 138 "\n", 139 "txtai-index/config.json: JSON data\n", 140 "txtai-index/embeddings: data\n", 141 "txtai-index/ids: data\n" 142 ] 143 } 144 ] 145 }, 146 { 147 "cell_type": "markdown", 148 "source": [ 149 "The txtai embeddings index format is documented [here](https://neuml.github.io/txtai/embeddings/format/). Looking at the files above, we have configuration, embeddings data and ids storage. Ids storage is only used when content is disabled.\n", 150 "\n", 151 "Let's inspect each file." 
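,
"\n",
"A note on how these files fit together: the Faiss index returns integer ids, and the ids file is an ordered list where position `i` holds the original document id for Faiss id `i`. A minimal sketch of that lookup with toy stand-in data (an assumption for illustration, not the real index):\n",
"\n",
"```python\n",
"# Toy stand-ins for msgpack.unpack(f) and a Faiss search result\n",
"ids = [\"Linux Terminal\", \"JavaScript Console\", \"Excel Sheet\"]\n",
"positions = [2, 0]\n",
"\n",
"print([ids[p] for p in positions])  # ['Excel Sheet', 'Linux Terminal']\n",
"```\n"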
152 ], 153 "metadata": { 154 "id": "hx8H0dpXtX5b" 155 } 156 }, 157 { 158 "cell_type": "code", 159 "source": [ 160 "import json\n", 161 "\n", 162 "with open(\"txtai-index/config.json\") as f:\n", 163 " print(json.dumps(json.load(f), sort_keys=True, indent=2))" 164 ], 165 "metadata": { 166 "colab": { 167 "base_uri": "https://localhost:8080/" 168 }, 169 "id": "h0yrQ8rqtnzh", 170 "outputId": "3379ab18-1805-4ef1-b946-1a61d18522db" 171 }, 172 "execution_count": 5, 173 "outputs": [ 174 { 175 "output_type": "stream", 176 "name": "stdout", 177 "text": [ 178 "{\n", 179 " \"backend\": \"faiss\",\n", 180 " \"build\": {\n", 181 " \"create\": \"2024-09-06T15:21:11Z\",\n", 182 " \"python\": \"3.10.12\",\n", 183 " \"settings\": {\n", 184 " \"components\": \"IDMap,Flat\"\n", 185 " },\n", 186 " \"system\": \"Linux (x86_64)\",\n", 187 " \"txtai\": \"7.5.0\"\n", 188 " },\n", 189 " \"dimensions\": 384,\n", 190 " \"offset\": 170,\n", 191 " \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n", 192 " \"update\": \"2024-09-06T15:21:11Z\"\n", 193 "}\n" 194 ] 195 } 196 ] 197 }, 198 { 199 "cell_type": "code", 200 "source": [ 201 "import faiss\n", 202 "\n", 203 "index = faiss.read_index(\"txtai-index/embeddings\")\n", 204 "print(f\"Total records {index.ntotal}\")" 205 ], 206 "metadata": { 207 "colab": { 208 "base_uri": "https://localhost:8080/" 209 }, 210 "id": "-aqiqfqeuM5p", 211 "outputId": "4b7856f0-ec8d-4960-a419-e88a14d988a8" 212 }, 213 "execution_count": 6, 214 "outputs": [ 215 { 216 "output_type": "stream", 217 "name": "stdout", 218 "text": [ 219 "Total records 170\n" 220 ] 221 } 222 ] 223 }, 224 { 225 "cell_type": "code", 226 "source": [ 227 "import msgpack\n", 228 "\n", 229 "with open(\"txtai-index/ids\", \"rb\") as f:\n", 230 " print(msgpack.unpack(f)[5:10])" 231 ], 232 "metadata": { 233 "colab": { 234 "base_uri": "https://localhost:8080/" 235 }, 236 "id": "I0ffDyJsunrW", 237 "outputId": "d8c5072d-d181-46e1-fc1a-cc713a78e69f" 238 }, 239 "execution_count": 18, 240 "outputs": [ 241 
{ 242 "output_type": "stream", 243 "name": "stdout", 244 "text": [ 245 "['JavaScript Console', 'Excel Sheet', 'English Pronunciation Helper', 'Spoken English Teacher and Improver', 'Travel Guide']\n" 246 ] 247 } 248 ] 249 }, 250 { 251 "cell_type": "markdown", 252 "source": [ 253 "Each file can be read without txtai. [JSON](https://www.json.org/json-en.html), [MessagePack](https://msgpack.org/index.html) and [Faiss](https://github.com/facebookresearch/faiss) all have libraries in multiple programming languages." 254 ], 255 "metadata": { 256 "id": "0e3dpAFuvP-_" 257 } 258 }, 259 { 260 "cell_type": "markdown", 261 "source": [ 262 "# Embeddings index with SQLite\n", 263 "\n", 264 "In the next example, we'll use SQLite to store content and vectors courtesy of the [sqlite-vec](https://github.com/asg017/sqlite-vec) library." 265 ], 266 "metadata": { 267 "id": "UUwk13mzwUTS" 268 } 269 }, 270 { 271 "cell_type": "code", 272 "source": [ 273 "from txtai import Embeddings\n", 274 "\n", 275 "embeddings = Embeddings(content=True, backend=\"sqlite\")\n", 276 "embeddings.index((x[\"act\"], x[\"prompt\"]) for x in dataset)\n", 277 "embeddings.save(\"txtai-sqlite\")" 278 ], 279 "metadata": { 280 "id": "Z80FWhuNwj14" 281 }, 282 "execution_count": 8, 283 "outputs": [] 284 }, 285 { 286 "cell_type": "markdown", 287 "source": [ 288 "Let's once again explore the generated index files." 
289 ], 290 "metadata": { 291 "id": "Kxcm42rixAH0" 292 } 293 }, 294 { 295 "cell_type": "code", 296 "source": [ 297 "!ls -l txtai-sqlite\n", 298 "!echo\n", 299 "!file txtai-sqlite/*" 300 ], 301 "metadata": { 302 "colab": { 303 "base_uri": "https://localhost:8080/" 304 }, 305 "id": "PEHT6LqHw_lw", 306 "outputId": "81d707dd-9760-40cd-e0b2-be43c26ba2d1" 307 }, 308 "execution_count": 9, 309 "outputs": [ 310 { 311 "output_type": "stream", 312 "name": "stdout", 313 "text": [ 314 "total 1696\n", 315 "-rw-r--r-- 1 root root 384 Sep 6 15:21 config.json\n", 316 "-rw-r--r-- 1 root root 126976 Sep 6 15:21 documents\n", 317 "-rw-r--r-- 1 root root 1605632 Sep 6 15:21 embeddings\n", 318 "\n", 319 "txtai-sqlite/config.json: JSON data\n", 320 "txtai-sqlite/documents: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1\n", 321 "txtai-sqlite/embeddings: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1\n" 322 ] 323 } 324 ] 325 }, 326 { 327 "cell_type": "markdown", 328 "source": [ 329 "This time note how there is a documents file with content stored in SQLite and a separate SQLite file for embeddings. Let's test it out." 330 ], 331 "metadata": { 332 "id": "UXiKSG0JxLPo" 333 } 334 }, 335 { 336 "cell_type": "code", 337 "source": [ 338 "embeddings.search(\"teacher\")" 339 ], 340 "metadata": { 341 "colab": { 342 "base_uri": "https://localhost:8080/" 343 }, 344 "id": "VYMTXtUpxSLd", 345 "outputId": "6259f3c2-6a76-434d-d40f-f11cb2e63c80" 346 }, 347 "execution_count": 10, 348 "outputs": [ 349 { 350 "output_type": "execute_result", 351 "data": { 352 "text/plain": [ 353 "[{'id': 'Math Teacher',\n", 354 " 'text': 'I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. 
This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is \"I need help understanding how probability works.\"',\n", 355 " 'score': 0.3421396017074585},\n", 356 " {'id': 'Educational Content Creator',\n", 357 " 'text': 'I want you to act as an educational content creator. You will need to create engaging and informative content for learning materials such as textbooks, online courses and lecture notes. My first suggestion request is \"I need help developing a lesson plan on renewable energy sources for high school students.\"',\n", 358 " 'score': 0.3267676830291748},\n", 359 " {'id': 'Philosophy Teacher',\n", 360 " 'text': 'I want you to act as a philosophy teacher. I will provide some topics related to the study of philosophy, and it will be your job to explain these concepts in an easy-to-understand manner. This could include providing examples, posing questions or breaking down complex ideas into smaller pieces that are easier to comprehend. My first request is \"I need help understanding how different philosophical theories can be applied in everyday life.\"',\n", 361 " 'score': 0.30780404806137085}]" 362 ] 363 }, 364 "metadata": {}, 365 "execution_count": 10 366 } 367 ] 368 }, 369 { 370 "cell_type": "markdown", 371 "source": [ 372 "The top N results are returned as expected. Let's again inspect the files."
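,
"\n",
"As an aside, sqlite-vec stores float32 vectors as raw little-endian BLOBs (4 bytes per element), so even the vectors are readable with nothing but the Python standard library. A toy sketch with a 4-dimensional stand-in vector (the real index uses 384 dimensions):\n",
"\n",
"```python\n",
"import struct\n",
"\n",
"# Pack a toy vector the way sqlite-vec stores it: raw little-endian float32\n",
"blob = struct.pack(\"<4f\", 0.1, 0.2, 0.3, 0.4)\n",
"\n",
"# Decode: 4 bytes per element\n",
"vector = struct.unpack(f\"<{len(blob) // 4}f\", blob)\n",
"print(len(vector))  # 4\n",
"```\n"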
373 ], 374 "metadata": { 375 "id": "R0zsQx37xjWz" 376 } 377 }, 378 { 379 "cell_type": "code", 380 "source": [ 381 "import json\n", 382 "\n", 383 "with open(\"txtai-sqlite/config.json\") as f:\n", 384 " print(json.dumps(json.load(f), sort_keys=True, indent=2))" 385 ], 386 "metadata": { 387 "colab": { 388 "base_uri": "https://localhost:8080/" 389 }, 390 "id": "jDvH6dHrxqwf", 391 "outputId": "dcce1647-0488-4f3e-dba0-667a932bad27" 392 }, 393 "execution_count": 11, 394 "outputs": [ 395 { 396 "output_type": "stream", 397 "name": "stdout", 398 "text": [ 399 "{\n", 400 " \"backend\": \"sqlite\",\n", 401 " \"build\": {\n", 402 " \"create\": \"2024-09-06T15:21:13Z\",\n", 403 " \"python\": \"3.10.12\",\n", 404 " \"settings\": {\n", 405 " \"sqlite\": \"3.37.2\",\n", 406 " \"sqlite-vec\": \"v0.1.1\"\n", 407 " },\n", 408 " \"system\": \"Linux (x86_64)\",\n", 409 " \"txtai\": \"7.5.0\"\n", 410 " },\n", 411 " \"content\": true,\n", 412 " \"dimensions\": 384,\n", 413 " \"offset\": 170,\n", 414 " \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n", 415 " \"update\": \"2024-09-06T15:21:13Z\"\n", 416 "}\n" 417 ] 418 } 419 ] 420 }, 421 { 422 "cell_type": "code", 423 "source": [ 424 "import sqlite3, sqlite_vec\n", 425 "\n", 426 "db = sqlite3.connect(\"txtai-sqlite/documents\")\n", 427 "print(db.execute(\"SELECT COUNT(*) FROM sections\").fetchone()[0])\n", 428 "\n", 429 "db = sqlite3.connect(\"txtai-sqlite/embeddings\")\n", 430 "db.enable_load_extension(True)\n", 431 "sqlite_vec.load(db)\n", 432 "print(db.execute(\"SELECT COUNT(*) FROM vectors\").fetchone()[0])" 433 ], 434 "metadata": { 435 "colab": { 436 "base_uri": "https://localhost:8080/" 437 }, 438 "id": "AsalQ-uUxxO0", 439 "outputId": "2a0e303e-e10e-424c-a952-e473d86cd2db" 440 }, 441 "execution_count": 12, 442 "outputs": [ 443 { 444 "output_type": "stream", 445 "name": "stdout", 446 "text": [ 447 "170\n", 448 "170\n" 449 ] 450 } 451 ] 452 }, 453 { 454 "cell_type": "markdown", 455 "source": [ 456 "As in the previous example, 
each file can be read without txtai. [JSON](https://www.json.org/json-en.html), [SQLite](https://www.sqlite.org/) and [sqlite-vec](https://github.com/asg017/sqlite-vec) all have libraries in multiple programming languages." 457 ], 458 "metadata": { 459 "id": "Fo7XEUNny6s5" 460 } 461 }, 462 { 463 "cell_type": "markdown", 464 "source": [ 465 "# Graph storage\n", 466 "\n", 467 "Starting with txtai 7.4, graphs are stored using MessagePack. The indexed file has a list of nodes and edges that can easily be imported." 468 ], 469 "metadata": { 470 "id": "Ipu08WhL0j6p" 471 } 472 }, 473 { 474 "cell_type": "code", 475 "source": [ 476 "from txtai import Embeddings\n", 477 "\n", 478 "embeddings = Embeddings(content=True, backend=\"sqlite\", graph={\"approximate\": False})\n", 479 "embeddings.index((x[\"act\"], x[\"prompt\"]) for x in dataset)\n", 480 "embeddings.save(\"txtai-graph\")" 481 ], 482 "metadata": { 483 "id": "cmFiae6j0wHz" 484 }, 485 "execution_count": 13, 486 "outputs": [] 487 }, 488 { 489 "cell_type": "code", 490 "source": [ 491 "!ls -l txtai-graph\n", 492 "!echo\n", 493 "!file txtai-graph/*" 494 ], 495 "metadata": { 496 "colab": { 497 "base_uri": "https://localhost:8080/" 498 }, 499 "id": "hnv82EAB059Q", 500 "outputId": "454b263b-8733-4997-eb6e-7d585cfa32c6" 501 }, 502 "execution_count": 14, 503 "outputs": [ 504 { 505 "output_type": "stream", 506 "name": "stdout", 507 "text": [ 508 "total 1816\n", 509 "-rw-r--r-- 1 root root 454 Sep 6 15:21 config.json\n", 510 "-rw-r--r-- 1 root root 126976 Sep 6 15:21 documents\n", 511 "-rw-r--r-- 1 root root 1605632 Sep 6 15:21 embeddings\n", 512 "-rw-r--r-- 1 root root 119970 Sep 6 15:21 graph\n", 513 "\n", 514 "txtai-graph/config.json: JSON data\n", 515 "txtai-graph/documents: SQLite 3.x database, last written using SQLite version 3037002, file counter 1, database pages 31, cookie 0x1, schema 4, UTF-8, version-valid-for 1\n", 516 "txtai-graph/embeddings: SQLite 3.x database, last written using SQLite version 3037002, file 
counter 1, database pages 392, cookie 0x1, schema 4, UTF-8, version-valid-for 1\n", 517 "txtai-graph/graph: data\n" 518 ] 519 } 520 ] 521 }, 522 { 523 "cell_type": "code", 524 "source": [ 525 "import msgpack\n", 526 "\n", 527 "with open(\"txtai-graph/graph\", \"rb\") as f:\n", 528 " data = msgpack.unpack(f)\n", 529 " print(data.keys())\n", 530 "\n", 531 " for key in data:\n", 532 " if data[key]:\n", 533 " print(key, data[key][100])" 534 ], 535 "metadata": { 536 "colab": { 537 "base_uri": "https://localhost:8080/" 538 }, 539 "id": "Bx3Tzt3l08R3", 540 "outputId": "662d157c-6912-4626-db64-d959c2719331" 541 }, 542 "execution_count": 15, 543 "outputs": [ 544 { 545 "output_type": "stream", 546 "name": "stdout", 547 "text": [ 548 "dict_keys(['nodes', 'edges', 'categories', 'topics'])\n", 549 "nodes [100, {'id': 'Ascii Artist', 'text': 'I want you to act as an ascii artist. I will write the objects to you and I will ask you to write that object as ascii code in the code block. Write only ascii code. Do not explain about the object you wrote. I will say the objects in double quotes. My first object is \"cat\"'}]\n", 550 "edges [5, 100, {'weight': 0.39010339975357056}]\n" 551 ] 552 } 553 ] 554 }, 555 { 556 "cell_type": "markdown", 557 "source": [ 558 "# Wrapping up\n", 559 "\n", 560 "This notebook gave an overview of the txtai embeddings index file format and how it supports open data access.\n", 561 "\n", 562 "While txtai can be used as an all-in-one embeddings database, it can also be used for only one part of the stack such as data ingestion. For example, it can be used to populate a Postgres or SQLite database for downstream use. The options are there." 563 ], 564 "metadata": { 565 "id": "y7N-YZlR5S-0" 566 } 567 } 568 ] 569 }