/ src / content / posts / japanese-fts-using-sqlite.mdx
japanese-fts-using-sqlite.mdx
  1  ---
  2  title: Japanese Full-Text Search in SQLite
  3  date: 2022-05-19
  4  description: A simple example of how you can make Japanese full-text-search in SQLite.
  5  tags:
  6      - japanese
  7      - sqlite
  8      - csharp
  9  ---
 10  
 11  # Introduction
 12  
 13  Recently I have been working on an android app that uses SQLite. The app has a feature where you can find an entry of a Japanese word with some details attached using a keyword that you typed in the search bar.
 14  
 15  When building this feature, I found almost zero results on how to do it. There are some full-text search tutorials for Latin words, but I couldn't find any for Japanese. It needs a special tokenizer called ICU from [the ICU project](http://site.icu-project.org/). Since it's not enabled by default in SQLite on my Linux machine or the one bundled with `Microsoft.Data.Sqlite`, I had to download the source code and compile it with some build flags to make it work. This is where the tricky part comes in. I spent a lot of time trying to figure out how to do it, but because I don't really familiar with C, I missed things that would be obvious to someone with a fair amount of C experience.
 16  
 17  Hopefully, by the end of this post, you'll know how to do the exact same thing without spending hours of your life trying to figure out how to do it.
 18  
 19  This article assumes that you're somewhat familiar with SQLite and .NET Core so I will not go into much detail about the basics. I will also be using Linux so some parts of the compilation step will be Linux specific.
 20  
 21  # What is Full-Text Search
 22  
 23  Full-text Search is a technique that allows you to search for a text in a database by providing only parts of them. For example, if I have a text that says "Never gonna give you up" and I type "give" in the search bar, I can find a row containing that text despite the fact that the word `give` is only in the middle of the text.
 24  
 25  This is different from an exact search that usually looks like this:
 26  
 27  ```sql
 28  SELECT text FROM rickroll WHERE text = 'give';
 29  ```
 30  
 31  Using that query, you will only find a row that contains `give`, nothing more, nothing less.
 32  
 33  Full-text search is similar to a wildcard search where you can use `%` to search for a text that contains a part of it. For example, if I type `%give%` in the search bar, I can find a row that contains `give` in it.
 34  
 35  ```sql
 36  SELECT text FROM rickroll WHERE text LIKE '%give%';
 37  ```
 38  
 39  If wildcard works then why bother with a full-text search? Well, they have some differences. Not only that wildcard is abysmally slow on a large dataset, they're also less accurate. Wildcard only searches for a pattern while full-text-search will search based on the tokenised text. full-text search is also **very** fast compared to wildcard search because it already has an index of our data.
 40  
 41  With wildcard, for example, if you have a pattern `LIKE '%run%'`, not only that it will return any rows that have the word `run`, but it will also return any rows that have the word `running`, `crunch`, `prune`, etc. This is because wildcard search is based on the pattern, not the tokenised text. With full-text search, it will only return rows that have the word `run` and also `running` depending on the tokenizer that you use.
 42  
 43  # Why SQLite
 44  
 45  > Well, you don't really have that much of an option on android, do you?
 46  
 47  I found this tweet the other day...
 48  
 49  <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Also SQLite3’s Full Text Search combined with triggers is absolutely mind blowing.<br /><br />Indexed 50GB of data in 10 lines of SQL and it queries under 20ms…<br /><br />This technology exists literally running on our toasters and we still throw Postgres and Elasticsearch at trivial problems.</p>&mdash; Sebastian Szturo (@SebastianSzturo) <a href="https://twitter.com/SebastianSzturo/status/1515297367335247877?ref_src=twsrc%5Etfw">April 16, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
 50  
 51  ...which inspired me to use the full-text search feature of SQLite. Initially, I was just thinking of a key-value pair database like a dictionary, but I realized that SQLite is a good fit for this because I can do so much more with it, including full-text search.
 52  
 53  Although, the default SQLite tokenizer, which is "simple" and "porter", doesn't support Japanese. The one that works with CJK characters is the ICU tokenizer from the [ICU Project](https://icu-project.org/) as I mentioned earlier.
 54  
 55  # Simple Example
 56  
 57  ## Setting up the project
 58  
 59  We're going to make a simple console app using .NET Core and SQLite to store the data. The app will have a database that contains a list of Japanese sentences and we can search through them using a prompt.
 60  
 61  First, we need to set up the project. I will use the `dotnet` CLI for the entire process. We can run this simple command to get started.
 62  
 63  ```bash
 64  dotnet new console -o ProjectName
 65  ```
 66  
 67  That should give us a starting point to work with.
 68  
 69  ## Disabling implicit using statements
 70  
 71  Here's a thing that I like to do when I start a new project. Open up the `.csproj` file and remove the `<ImplicitUsings>` line, our `.csproj` file should end up like this:
 72  
 73  ```xml
 74  <Project Sdk="Microsoft.NET.Sdk">
 75  
 76    <PropertyGroup>
 77      <OutputType>Exe</OutputType>
 78      <TargetFramework>net6.0</TargetFramework>
 79      <Nullable>enable</Nullable>
 80    </PropertyGroup>
 81  
 82  </Project>
 83  ```
 84  
 85  We will get a warning from the compiler because our current `Program.cs` file doesn't have any `using` statements. Let's just ignore that for now.
 86  
 87  The reason why I like to do this is that I don't like the implicit `using` that was introduced in .NET 6. I prefer to be explicit with what I use.
 88  
 89  ## Building SQLite with ICU support
 90  
 91  To build SQLite with ICU support, we need to download the source code of SQLite and compile it with certain compilation flags. You can download the source code of SQLite from [their download page](https://www.sqlite.org/download.html). We want to extract it to a folder inside our project so it'll be easier to find. You can name it whatever you want, but I'll put it inside a directory and call it `./External/sqlite/`.
 92  
 93  We can then compile it using the following commands.
 94  
 95  ```bash
 96  # configure the build
 97  CFLAGS="-O3 -DSQLITE_ENABLE_ICU" CPPFLAGS=`icu-config --cppflags` LDFLAGS=`icu-config --ldflags` ./configure
 98  
 99  # build the binary
100  make
101  ```
102  
103  The `-DSQLITE_ENABLE_ICU` is the flag that enables ICU support. You can see the full list of flags from [this page](https://www.sqlite.org/compile.html). We use `icu-config --ldflags` to get the linker flags for ICU.
104  
105  After compiling that, our directory should have the compiled binary called `sqlite3`. That's not what we want for our application, though. We want to use the `sqlite3` as a library, not a standalone executable. This is where it got me the first time, I didn't know there was a directory called `.libs` inside the `sqlite` directory because it was hidden. That's where the libraries are stored. I waste quite some time trying to make the binding in .NET works only to realise that I'm using the wrong binary. There should be an `sqlite3.so` file inside the `.libs` directory. This is what we're going to be using for our application.
106  
107  Since we put the `sqlite` directory inside our project, there will be a small issue. The `sqlite` directory contains a file called `Replace.cs`. This will cause the compiler to freak out because it's trying to analyse a file called `Replace.cs`. We can ignore this file by adding the following block to the end of our `.csproj` file just before the closing tag `</Project>`.
108  
109  ```xml
110  <ItemGroup>
111    <!-- adjust the path to suit your needs -->
112    <Compile Remove="./External/sqlite/Replace.cs" />
113  </ItemGroup>
114  ```
115  
116  Of course, if you don't put the `sqlite` directory inside your project, you don't need to do this.
117  
118  ## Preparing the database
119  
120  We're going to build a simple console app that lets us search for an example sentence containing a word that we typed in. To do that, we will need a table with three columns, `Id`, `EnglishText`, and `JapaneseText`. The `Id` column will be the primary key and will be auto-incremented. The `EnglishText` column will contain the English version of the text and the `JapaneseText` column will be the Japanese version.
121  
122  In our project directory, we will create a `Data` directory where our database will be stored. After doing that, we will run the `sqlite3` CLI to create the database. Make sure you use the `sqlite3` CLI that we just compiled because the one that comes by default might not be compiled with ICU support. We can use our `sqlite3` binary by typing the relative path instead of just `sqlite3`, in my case that would look like this:
123  
124  ```bash
125  ./External/sqlite/sqlite3 ./Data/dictionary.db
126  ```
127  
128  It will put us in a prompt where we can create the table. I write this in multiline form so it would be easier to read, but we want to make it a single line when we're typing it in the prompt to avoid weird things happening.
129  
130  ```sql
131  CREATE TABLE Dictionary (
132    Id INTEGER PRIMARY KEY AUTOINCREMENT,
133    EnglishText TEXT,
134    JapaneseText TEXT
135  );
136  ```
137  
138  In order to use the full-text search feature of SQLite, we're going to need a virtual table. It's a different type of table, but how you create the table is quite similar.
139  
140  ```sql
141  CREATE VIRTUAL TABLE DictionaryFTS
142  USING fts4(
143    JapaneseText,
144    content='Dictionary',
145    tokenize=icu ja_JP
146  );
147  ```
148  
149  The reason why I'm using `FTS4` and not `FTS3` or `FTS5` is because `FTS3` is slower and `FTS5` has several improvements but it's not available on android. For more detail on this, see [the documentation for FTS3/4](https://www.sqlite.org/fts3.html) and [FTS5](https://www.sqlite.org/fts5.html).
150  
151  The `tokenize` parameter is the tokenizer that we're going to use. We specify `icu` tokenizer and we use `ja_JP` as the locale because we're working with the Japanese language.
152  
153  Since we're going to use the data from the regular `Dictionary` table by specifying the `content` column, we need to create a trigger that will update the `DictionaryFTS` table whenever the `Dictionary` table is updated. We can add a trigger using this SQL statement:
154  
155  ```sql
156  CREATE TRIGGER Dictionary_Update_FTS
157  AFTER INSERT ON Dictionary
158  BEGIN
159    INSERT INTO DictionaryFTS(rowid, JapaneseText)
160    VALUES (new.rowid, new.JapaneseText);
161  END;
162  ```
163  
164  This way, whenever we insert a new row in the `Dictionary` table, the `DictionaryFTS` table will be updated as well, which is exactly what we want. If you don't add this trigger you can still query the `DictionaryFTS` table but you won't be able to use the full-text search functionality.
165  
166  Now, let's add a few data to the `Dictionary` table. We can do that by running this statement:
167  
168  ```sql
169  INSERT INTO Dictionary (EnglishText, JapaneseText)
170  VALUES ("I woke up in the morning", "朝に起きた"),
171         ("I was singing that song", "その歌を歌っていた"),
172         ("The food was good", "食べ物はうまかった"),
173         ("What time is it now?", "今何時?"),
174         ("This is Japanese", "これは日本語です"),
175         ("That car went too fast", "あの車が速すぎた"),
176         ("It smells like spring", "春の匂いがする"),
177         ("It's very cold outside", "外には寒いよ"),
178         ("Today is Saturday", "今日は土曜日です"),
179         ("Please tell me", "教えてくれ"),
180         ("Sorry I don't understand Japanese", "日本語分からなくてごめん"),
181         ("This is not an apple", "これはりんごじゃないよ"),
182         ("This book is mine", "この本は私のものです"),
183         ("I usually sleep around this time", "こんな時ごろに寝る"),
184         ("Your Japanese is good", "日本語上手ですね"),
185         ("I like cats", "私は猫が好き"),
186         ("My cat is white", "私の猫が白です");
187  ```
188  
189  <small>pardon my japanese :p</small>
190  
191  You can, of course, add a lot more data, but these should be enough for our purpose.
192  
193  We can test if the full-text search works by typing this statement:
194  
195  ```sql
196  SELECT * FROM DictionaryFTS WHERE JapaneseText MATCH 'です';
197  ```
198  
199  You should see any rows that have the word `です` in it. In this example the result should be these rows:
200  
201  ```
202  これは日本語です
203  今日は土曜日です
204  この本は私のものです
205  日本語上手ですね
206  ```
207  
208  If you're having difficulties with typing Japanese characters in SQLite prompt, you can type all of the commands in a file and then do this
209  
210  ```bash
211  # where `file.sql` is a file that contains SQLite statements
212  ./External/sqlite/sqlite3 ./Data/dictionary.db < file.sql
213  
214  # you can also do it directly by passing a string to the `sqlite3` command
215  ./External/sqlite/sqlite3 ./Data/dictionary.db "SELECT * FROM table;"
216  ```
217  
218  ## Installing SQLite client
219  
220  Since we're going to work on a simple app, we don't really need a full-fledged package like Entity Framework. We're going to use `Microsoft.Data.Sqlite` package which is a lightweight ADO.NET provider for SQLite, but since we want to use our own version of sqlite3 that we've compiled earlier, we want `Microsoft.Data.Sqlite.Core` instead.
221  
222  Make sure that you **DON'T** have `Microsoft.Data.Sqlite` installed in your project, otherwise, this wouldn't work. The `Microsoft.Data.Sqlite` package bundles its own sqlite to provide a wider range of compatibility. I also spent quite some time on this step because I didn't realise that. I missed a part of the documentation where it says that `Microsoft.Data.Sqlite` is a package that bundles sqlite.
223  
224  The package that we're using is built on top of `SQLitePCLRaw` package, which is a thin wrapper around sqlite. According to [the documentation](https://docs.microsoft.com/en-us/dotnet/standard/data/sqlite/custom-versions), we will need other packages to be installed as well if we want to use a custom version of sqlite.
225  
226  ```bash
227  dotnet add package Microsoft.Data.Sqlite.Core
228  dotnet add package SQLitePCLRaw.core
229  dotnet add package SQLitePCLRaw.provider.dynamic_cdecl
230  ```
231  
232  We should be done with the preparation.
233  
234  ## Using a custom SQlite versionReplace.cs
235  
236  Before using any of the `Microsoft.Data.Sqlite` functionality, we need to register our custom sqlite version.
237  First, we need to add a native library adapter class that implements `IGetFunctionPointer` from `SQLitePCL` namespace. This class will be used to get the pointer to the native sqlite library. I will put this inside a file called `NativeLibraryAdapter.cs` at the root of our project.
238  
239  ```csharp
240  using System;
241  using System.Runtime.InteropServices;
242  using SQLitePCL;
243  
244  class NativeLibraryAdapter : IGetFunctionPointer
245  {
246      readonly IntPtr _library;
247  
248      public NativeLibraryAdapter(string name)
249          => _library = NativeLibrary.Load(name);
250  
251      public IntPtr GetFunctionPointer(string name)
252          => NativeLibrary.TryGetExport(_library, name, out var address)
253              ? address
254              : IntPtr.Zero;
255  }
256  ```
257  
258  Next, we want to register our custom sqlite version. We can do that by adding a few lines of code to our `Program.cs` file.
259  
260  ```csharp
261  using System.IO;
262  using SQLitePCL;
263  
264  // adjust this path to where your sqlite3.so is stored
265  string sqliteLibPath = Path.GetFullPath("./External/sqlite/.libs/libsqlite3.so");
266  
267  SQLite3Provider_dynamic_cdecl.Setup("sqlite3", new NativeLibraryAdapter(sqliteLibPath));
268  SQLitePCL.raw.SetProvider(new SQLite3Provider_dynamic_cdecl());
269  ```
270  
271  Make sure that we do this right at the beginning before doing anything with the database connection.
272  
273  ## Connecting to the database
274  
275  We should try to connect to the database. To do that, we can instantiate the `SqliteConnection` class and provide it with the path to the database. We can add this code to our `Program.cs` file. Bare in mind that you should put all `using` statements at the top of the file.
276  
277  ```csharp
278  using Microsoft.Data.Sqlite;
279  
280  string databasePath = Path.GetFullPath("./Data/dictionary.db");
281  using var connection = new SqliteConnection($"Data Source={databasePath}");
282  
283  // open the database connection
284  connection.Open();
285  ```
286  
287  We would want to use the keyword `using` to make sure that the connection is disposed when we're done with it.
288  
289  ## Main logic of the app
290  
291  What I want to build for this simple example is a simple search console app where the user is going to be given a prompt that ask for a word and then the app will search the database for the word and display the results.
292  I won't be going into the details of how to do this because the main focus is the full-text search functionality.
293  
294  ```csharp
295  // close the database connection when we quit the app using Ctrl+C
296  Console.CancelKeyPress += delegate
297  {
298      connection.Close();
299      Environment.Exit(0);
300  };
301  
302  // ask the user for a word to look up until they quit the app
303  while (true)
304  {
305      Console.Write("== Enter a word (ctrl+c to cancel): ");
306      var word = Console.ReadLine();
307  
308      // ask again if the word is empty
309      if (string.IsNullOrEmpty(word))
310      {
311          Console.WriteLine("Word can't be empty!");
312          continue;
313      }
314  
315      // create a command to search the database and bind the user input
316      // to the `@word` parameter
317      var command = connection.CreateCommand();
318      command.CommandText = $@"
319      SELECT
320        Dictionary.EnglishText,
321        Dictionary.JapaneseText
322      FROM Dictionary
323      WHERE Dictionary.JapaneseText IN (
324          SELECT DictionaryFTS.JapaneseText FROM DictionaryFTS
325          WHERE DictionaryFTS MATCH @word
326      );";
327      command.Parameters.Add(new SqliteParameter("@word", word));
328  
329      using var reader = command.ExecuteReader();
330  
331      // prints a message when no result was found
332      if (!reader.HasRows)
333      {
334          Console.WriteLine("No result was found!");
335          continue;
336      }
337  
338      // print the results in a list
339      while (reader.Read())
340      {
341          Console.WriteLine($"- {reader["EnglishText"]} => {reader["JapaneseText"]}");
342      }
343  }
344  ```
345  
346  When you run the app by typing `dotnet run`, you should see something like this:
347  
348  <div style="width:100%;height:0px;position:relative;padding-bottom:46.256%;">
349  	<iframe
350  		title="demo video"
351  		src="https://streamable.com/e/9gyjj3"
352  		frameborder="0"
353  		width="100%"
354  		height="100%"
355  		allowfullscreen
356  		style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;"
357  	></iframe>
358  </div>
359  
360  As you can see, when I search for `です`, I get every rows that has `です` in the `JapaneseText` field.
361  
362  ## Highlighting the matched word
363  
364  We can highlight the matched word in the result by using the `snippet` function. To do that, we need to change our query a bit.
365  
366  ```sql
367  SELECT
368      Dictionary.EnglishText,
369      JapaneseTextSnippet
370  FROM Dictionary
371  JOIN (
372      SELECT
373          snippet(DictionaryFTS, '[', ']') AS JapaneseTextSnippet,
374          rowid
375      FROM DictionaryFTS
376      WHERE DictionaryFTS MATCH @word
377  ) AS ResultFTS
378      ON Dictionary.Id = ResultFTS.rowid;
379  ```
380  
381  I'm not sure about the performance of this query compared to using `WHERE IN`. If you have any suggestions, please let me know :)
382  
383  This query will now select the `JapaneseTextSnippet` column that comes from the `snippet` function. The matched word should be surrounded by `[` and `]`. If you're using the result for the web, you can wrap it with an html tag instead of square brackets, but since we're making a console app, we don't have the luxury of that and have to rely on ANSI escape sequence. You can see [this gist](https://gist.github.com/fnky/458719343aabd01cfb17a3a4f7296797) for the ANSI escape sequence reference.
384  
385  The reason why I don't put the escape sequence inside the `snippet` function is because I had weird things happening when I tried to do that. So, I decided to wrap it in a bracket and then replace the bracket with the escape sequence using C# like this.
386  
387  ```csharp
388  // print the results in a list
389  while (reader.Read())
390  {
391    var englishText = reader.GetString(0);
392    var japaneseTextSnippet = reader.GetString(1).Replace("[", "").Replace("]", "");
393    Console.WriteLine($"- {englishText} => {japaneseTextSnippet}");
394  }
395  ```
396  
397  The end result should now look like this:
398  
399  ![result with highlight](https://i.ibb.co/ym9QcW3/image-2022-05-20-16-11-29.png)
400  
401  # References
402  
403  Here are some useful resources that helped me that might also be useful to you:
404  
405  -   [SQLite Virtual Table](https://www.sqlite.org/vtab.html)
406  -   [SQLite FTS3 Module](https://www.sqlite.org/fts3.html)
407  -   [kemsereylam.com: FTS with SQLite](https://kimsereylam.com/sqlite/2020/03/06/full-text-search-with-sqlite.html)
408  -   [abdus.dev: Quick FTS using SQLite](https://abdus.dev/posts/quick-full-text-search-using-sqlite/)
409  
410  If you want the full code, you can find the repository at [elianiva/foo-dictionary](https://github.com/elianiva/foo-dictionary)
411  
412  # Conclusion
413  
414  Doing a full-text search is actually quite easy. Initially, I thought full-text search requires a fancy technology like Elasticsearch, Algolia, or something like that, but I found that I can do it with SQLite.
415  Most of the pain I had to go through came from trying to make the SQLite binding work.
416  
417  I hope this article helped you to implement full-text search for the Japanese language in your SQLite database. If you have any questions or suggestions, feel free to leave a comment :)
418  
419  Have a nice day!