japanese-fts-using-sqlite.mdx
1 --- 2 title: Japanese Full-Text Search in SQLite 3 date: 2022-05-19 4 description: A simple example of how you can make Japanese full-text-search in SQLite. 5 tags: 6 - japanese 7 - sqlite 8 - csharp 9 --- 10 11 # Introduction 12 13 Recently I have been working on an android app that uses SQLite. The app has a feature where you can find an entry of a Japanese word with some details attached using a keyword that you typed in the search bar. 14 15 When building this feature, I found almost zero results on how to do it. There are some full-text search tutorials for Latin words, but I couldn't find any for Japanese. It needs a special tokenizer called ICU from [the ICU project](http://site.icu-project.org/). Since it's not enabled by default in SQLite on my Linux machine or the one bundled with `Microsoft.Data.Sqlite`, I had to download the source code and compile it with some build flags to make it work. This is where the tricky part comes in. I spent a lot of time trying to figure out how to do it, but because I don't really familiar with C, I missed things that would be obvious to someone with a fair amount of C experience. 16 17 Hopefully, by the end of this post, you'll know how to do the exact same thing without spending hours of your life trying to figure out how to do it. 18 19 This article assumes that you're somewhat familiar with SQLite and .NET Core so I will not go into much detail about the basics. I will also be using Linux so some parts of the compilation step will be Linux specific. 20 21 # What is Full-Text Search 22 23 Full-text Search is a technique that allows you to search for a text in a database by providing only parts of them. For example, if I have a text that says "Never gonna give you up" and I type "give" in the search bar, I can find a row containing that text despite the fact that the word `give` is only in the middle of the text. 24 25 This is different from an exact search that usually looks like this: 26 27 ```sql 28 SELECT text FROM rickroll WHERE text = 'give'; 29 ``` 30 31 Using that query, you will only find a row that contains `give`, nothing more, nothing less. 32 33 Full-text search is similar to a wildcard search where you can use `%` to search for a text that contains a part of it. For example, if I type `%give%` in the search bar, I can find a row that contains `give` in it. 34 35 ```sql 36 SELECT text FROM rickroll WHERE text LIKE '%give%'; 37 ``` 38 39 If wildcard works then why bother with a full-text search? Well, they have some differences. Not only that wildcard is abysmally slow on a large dataset, they're also less accurate. Wildcard only searches for a pattern while full-text-search will search based on the tokenised text. full-text search is also **very** fast compared to wildcard search because it already has an index of our data. 40 41 With wildcard, for example, if you have a pattern `LIKE '%run%'`, not only that it will return any rows that have the word `run`, but it will also return any rows that have the word `running`, `crunch`, `prune`, etc. This is because wildcard search is based on the pattern, not the tokenised text. With full-text search, it will only return rows that have the word `run` and also `running` depending on the tokenizer that you use. 42 43 # Why SQLite 44 45 > Well, you don't really have that much of an option on android, do you? 46 47 I found this tweet the other day... 48 49 <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Also SQLite3’s Full Text Search combined with triggers is absolutely mind blowing.<br /><br />Indexed 50GB of data in 10 lines of SQL and it queries under 20ms…<br /><br />This technology exists literally running on our toasters and we still throw Postgres and Elasticsearch at trivial problems.</p>— Sebastian Szturo (@SebastianSzturo) <a href="https://twitter.com/SebastianSzturo/status/1515297367335247877?ref_src=twsrc%5Etfw">April 16, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 50 51 ...which inspired me to use the full-text search feature of SQLite. Initially, I was just thinking of a key-value pair database like a dictionary, but I realized that SQLite is a good fit for this because I can do so much more with it, including full-text search. 52 53 Although, the default SQLite tokenizer, which is "simple" and "porter", doesn't support Japanese. The one that works with CJK characters is the ICU tokenizer from the [ICU Project](https://icu-project.org/) as I mentioned earlier. 54 55 # Simple Example 56 57 ## Setting up the project 58 59 We're going to make a simple console app using .NET Core and SQLite to store the data. The app will have a database that contains a list of Japanese sentences and we can search through them using a prompt. 60 61 First, we need to set up the project. I will use the `dotnet` CLI for the entire process. We can run this simple command to get started. 62 63 ```bash 64 dotnet new console -o ProjectName 65 ``` 66 67 That should give us a starting point to work with. 68 69 ## Disabling implicit using statements 70 71 Here's a thing that I like to do when I start a new project. Open up the `.csproj` file and remove the `<ImplicitUsings>` line, our `.csproj` file should end up like this: 72 73 ```xml 74 <Project Sdk="Microsoft.NET.Sdk"> 75 76 <PropertyGroup> 77 <OutputType>Exe</OutputType> 78 <TargetFramework>net6.0</TargetFramework> 79 <Nullable>enable</Nullable> 80 </PropertyGroup> 81 82 </Project> 83 ``` 84 85 We will get a warning from the compiler because our current `Program.cs` file doesn't have any `using` statements. Let's just ignore that for now. 86 87 The reason why I like to do this is that I don't like the implicit `using` that was introduced in .NET 6. I prefer to be explicit with what I use. 88 89 ## Building SQLite with ICU support 90 91 To build SQLite with ICU support, we need to download the source code of SQLite and compile it with certain compilation flags. You can download the source code of SQLite from [their download page](https://www.sqlite.org/download.html). We want to extract it to a folder inside our project so it'll be easier to find. You can name it whatever you want, but I'll put it inside a directory and call it `./External/sqlite/`. 92 93 We can then compile it using the following commands. 94 95 ```bash 96 # configure the build 97 CFLAGS="-O3 -DSQLITE_ENABLE_ICU" CPPFLAGS=`icu-config --cppflags` LDFLAGS=`icu-config --ldflags` ./configure 98 99 # build the binary 100 make 101 ``` 102 103 The `-DSQLITE_ENABLE_ICU` is the flag that enables ICU support. You can see the full list of flags from [this page](https://www.sqlite.org/compile.html). We use `icu-config --ldflags` to get the linker flags for ICU. 104 105 After compiling that, our directory should have the compiled binary called `sqlite3`. That's not what we want for our application, though. We want to use the `sqlite3` as a library, not a standalone executable. This is where it got me the first time, I didn't know there was a directory called `.libs` inside the `sqlite` directory because it was hidden. That's where the libraries are stored. I waste quite some time trying to make the binding in .NET works only to realise that I'm using the wrong binary. There should be an `sqlite3.so` file inside the `.libs` directory. This is what we're going to be using for our application. 106 107 Since we put the `sqlite` directory inside our project, there will be a small issue. The `sqlite` directory contains a file called `Replace.cs`. This will cause the compiler to freak out because it's trying to analyse a file called `Replace.cs`. We can ignore this file by adding the following block to the end of our `.csproj` file just before the closing tag `</Project>`. 108 109 ```xml 110 <ItemGroup> 111 <!-- adjust the path to suit your needs --> 112 <Compile Remove="./External/sqlite/Replace.cs" /> 113 </ItemGroup> 114 ``` 115 116 Of course, if you don't put the `sqlite` directory inside your project, you don't need to do this. 117 118 ## Preparing the database 119 120 We're going to build a simple console app that lets us search for an example sentence containing a word that we typed in. To do that, we will need a table with three columns, `Id`, `EnglishText`, and `JapaneseText`. The `Id` column will be the primary key and will be auto-incremented. The `EnglishText` column will contain the English version of the text and the `JapaneseText` column will be the Japanese version. 121 122 In our project directory, we will create a `Data` directory where our database will be stored. After doing that, we will run the `sqlite3` CLI to create the database. Make sure you use the `sqlite3` CLI that we just compiled because the one that comes by default might not be compiled with ICU support. We can use our `sqlite3` binary by typing the relative path instead of just `sqlite3`, in my case that would look like this: 123 124 ```bash 125 ./External/sqlite/sqlite3 ./Data/dictionary.db 126 ``` 127 128 It will put us in a prompt where we can create the table. I write this in multiline form so it would be easier to read, but we want to make it a single line when we're typing it in the prompt to avoid weird things happening. 129 130 ```sql 131 CREATE TABLE Dictionary ( 132 Id INTEGER PRIMARY KEY AUTOINCREMENT, 133 EnglishText TEXT, 134 JapaneseText TEXT 135 ); 136 ``` 137 138 In order to use the full-text search feature of SQLite, we're going to need a virtual table. It's a different type of table, but how you create the table is quite similar. 139 140 ```sql 141 CREATE VIRTUAL TABLE DictionaryFTS 142 USING fts4( 143 JapaneseText, 144 content='Dictionary', 145 tokenize=icu ja_JP 146 ); 147 ``` 148 149 The reason why I'm using `FTS4` and not `FTS3` or `FTS5` is because `FTS3` is slower and `FTS5` has several improvements but it's not available on android. For more detail on this, see [the documentation for FTS3/4](https://www.sqlite.org/fts3.html) and [FTS5](https://www.sqlite.org/fts5.html). 150 151 The `tokenize` parameter is the tokenizer that we're going to use. We specify `icu` tokenizer and we use `ja_JP` as the locale because we're working with the Japanese language. 152 153 Since we're going to use the data from the regular `Dictionary` table by specifying the `content` column, we need to create a trigger that will update the `DictionaryFTS` table whenever the `Dictionary` table is updated. We can add a trigger using this SQL statement: 154 155 ```sql 156 CREATE TRIGGER Dictionary_Update_FTS 157 AFTER INSERT ON Dictionary 158 BEGIN 159 INSERT INTO DictionaryFTS(rowid, JapaneseText) 160 VALUES (new.rowid, new.JapaneseText); 161 END; 162 ``` 163 164 This way, whenever we insert a new row in the `Dictionary` table, the `DictionaryFTS` table will be updated as well, which is exactly what we want. If you don't add this trigger you can still query the `DictionaryFTS` table but you won't be able to use the full-text search functionality. 165 166 Now, let's add a few data to the `Dictionary` table. We can do that by running this statement: 167 168 ```sql 169 INSERT INTO Dictionary (EnglishText, JapaneseText) 170 VALUES ("I woke up in the morning", "朝に起きた"), 171 ("I was singing that song", "その歌を歌っていた"), 172 ("The food was good", "食べ物はうまかった"), 173 ("What time is it now?", "今何時?"), 174 ("This is Japanese", "これは日本語です"), 175 ("That car went too fast", "あの車が速すぎた"), 176 ("It smells like spring", "春の匂いがする"), 177 ("It's very cold outside", "外には寒いよ"), 178 ("Today is Saturday", "今日は土曜日です"), 179 ("Please tell me", "教えてくれ"), 180 ("Sorry I don't understand Japanese", "日本語分からなくてごめん"), 181 ("This is not an apple", "これはりんごじゃないよ"), 182 ("This book is mine", "この本は私のものです"), 183 ("I usually sleep around this time", "こんな時ごろに寝る"), 184 ("Your Japanese is good", "日本語上手ですね"), 185 ("I like cats", "私は猫が好き"), 186 ("My cat is white", "私の猫が白です"); 187 ``` 188 189 <small>pardon my japanese :p</small> 190 191 You can, of course, add a lot more data, but these should be enough for our purpose. 192 193 We can test if the full-text search works by typing this statement: 194 195 ```sql 196 SELECT * FROM DictionaryFTS WHERE JapaneseText MATCH 'です'; 197 ``` 198 199 You should see any rows that have the word `です` in it. In this example the result should be these rows: 200 201 ``` 202 これは日本語です 203 今日は土曜日です 204 この本は私のものです 205 日本語上手ですね 206 ``` 207 208 If you're having difficulties with typing Japanese characters in SQLite prompt, you can type all of the commands in a file and then do this 209 210 ```bash 211 # where `file.sql` is a file that contains SQLite statements 212 ./External/sqlite/sqlite3 ./Data/dictionary.db < file.sql 213 214 # you can also do it directly by passing a string to the `sqlite3` command 215 ./External/sqlite/sqlite3 ./Data/dictionary.db "SELECT * FROM table;" 216 ``` 217 218 ## Installing SQLite client 219 220 Since we're going to work on a simple app, we don't really need a full-fledged package like Entity Framework. We're going to use `Microsoft.Data.Sqlite` package which is a lightweight ADO.NET provider for SQLite, but since we want to use our own version of sqlite3 that we've compiled earlier, we want `Microsoft.Data.Sqlite.Core` instead. 221 222 Make sure that you **DON'T** have `Microsoft.Data.Sqlite` installed in your project, otherwise, this wouldn't work. The `Microsoft.Data.Sqlite` package bundles its own sqlite to provide a wider range of compatibility. I also spent quite some time on this step because I didn't realise that. I missed a part of the documentation where it says that `Microsoft.Data.Sqlite` is a package that bundles sqlite. 223 224 The package that we're using is built on top of `SQLitePCLRaw` package, which is a thin wrapper around sqlite. According to [the documentation](https://docs.microsoft.com/en-us/dotnet/standard/data/sqlite/custom-versions), we will need other packages to be installed as well if we want to use a custom version of sqlite. 225 226 ```bash 227 dotnet add package Microsoft.Data.Sqlite.Core 228 dotnet add package SQLitePCLRaw.core 229 dotnet add package SQLitePCLRaw.provider.dynamic_cdecl 230 ``` 231 232 We should be done with the preparation. 233 234 ## Using a custom SQlite versionReplace.cs 235 236 Before using any of the `Microsoft.Data.Sqlite` functionality, we need to register our custom sqlite version. 237 First, we need to add a native library adapter class that implements `IGetFunctionPointer` from `SQLitePCL` namespace. This class will be used to get the pointer to the native sqlite library. I will put this inside a file called `NativeLibraryAdapter.cs` at the root of our project. 238 239 ```csharp 240 using System; 241 using System.Runtime.InteropServices; 242 using SQLitePCL; 243 244 class NativeLibraryAdapter : IGetFunctionPointer 245 { 246 readonly IntPtr _library; 247 248 public NativeLibraryAdapter(string name) 249 => _library = NativeLibrary.Load(name); 250 251 public IntPtr GetFunctionPointer(string name) 252 => NativeLibrary.TryGetExport(_library, name, out var address) 253 ? address 254 : IntPtr.Zero; 255 } 256 ``` 257 258 Next, we want to register our custom sqlite version. We can do that by adding a few lines of code to our `Program.cs` file. 259 260 ```csharp 261 using System.IO; 262 using SQLitePCL; 263 264 // adjust this path to where your sqlite3.so is stored 265 string sqliteLibPath = Path.GetFullPath("./External/sqlite/.libs/libsqlite3.so"); 266 267 SQLite3Provider_dynamic_cdecl.Setup("sqlite3", new NativeLibraryAdapter(sqliteLibPath)); 268 SQLitePCL.raw.SetProvider(new SQLite3Provider_dynamic_cdecl()); 269 ``` 270 271 Make sure that we do this right at the beginning before doing anything with the database connection. 272 273 ## Connecting to the database 274 275 We should try to connect to the database. To do that, we can instantiate the `SqliteConnection` class and provide it with the path to the database. We can add this code to our `Program.cs` file. Bare in mind that you should put all `using` statements at the top of the file. 276 277 ```csharp 278 using Microsoft.Data.Sqlite; 279 280 string databasePath = Path.GetFullPath("./Data/dictionary.db"); 281 using var connection = new SqliteConnection($"Data Source={databasePath}"); 282 283 // open the database connection 284 connection.Open(); 285 ``` 286 287 We would want to use the keyword `using` to make sure that the connection is disposed when we're done with it. 288 289 ## Main logic of the app 290 291 What I want to build for this simple example is a simple search console app where the user is going to be given a prompt that ask for a word and then the app will search the database for the word and display the results. 292 I won't be going into the details of how to do this because the main focus is the full-text search functionality. 293 294 ```csharp 295 // close the database connection when we quit the app using Ctrl+C 296 Console.CancelKeyPress += delegate 297 { 298 connection.Close(); 299 Environment.Exit(0); 300 }; 301 302 // ask the user for a word to look up until they quit the app 303 while (true) 304 { 305 Console.Write("== Enter a word (ctrl+c to cancel): "); 306 var word = Console.ReadLine(); 307 308 // ask again if the word is empty 309 if (string.IsNullOrEmpty(word)) 310 { 311 Console.WriteLine("Word can't be empty!"); 312 continue; 313 } 314 315 // create a command to search the database and bind the user input 316 // to the `@word` parameter 317 var command = connection.CreateCommand(); 318 command.CommandText = $@" 319 SELECT 320 Dictionary.EnglishText, 321 Dictionary.JapaneseText 322 FROM Dictionary 323 WHERE Dictionary.JapaneseText IN ( 324 SELECT DictionaryFTS.JapaneseText FROM DictionaryFTS 325 WHERE DictionaryFTS MATCH @word 326 );"; 327 command.Parameters.Add(new SqliteParameter("@word", word)); 328 329 using var reader = command.ExecuteReader(); 330 331 // prints a message when no result was found 332 if (!reader.HasRows) 333 { 334 Console.WriteLine("No result was found!"); 335 continue; 336 } 337 338 // print the results in a list 339 while (reader.Read()) 340 { 341 Console.WriteLine($"- {reader["EnglishText"]} => {reader["JapaneseText"]}"); 342 } 343 } 344 ``` 345 346 When you run the app by typing `dotnet run`, you should see something like this: 347 348 <div style="width:100%;height:0px;position:relative;padding-bottom:46.256%;"> 349 <iframe 350 title="demo video" 351 src="https://streamable.com/e/9gyjj3" 352 frameborder="0" 353 width="100%" 354 height="100%" 355 allowfullscreen 356 style="width:100%;height:100%;position:absolute;left:0px;top:0px;overflow:hidden;" 357 ></iframe> 358 </div> 359 360 As you can see, when I search for `です`, I get every rows that has `です` in the `JapaneseText` field. 361 362 ## Highlighting the matched word 363 364 We can highlight the matched word in the result by using the `snippet` function. To do that, we need to change our query a bit. 365 366 ```sql 367 SELECT 368 Dictionary.EnglishText, 369 JapaneseTextSnippet 370 FROM Dictionary 371 JOIN ( 372 SELECT 373 snippet(DictionaryFTS, '[', ']') AS JapaneseTextSnippet, 374 rowid 375 FROM DictionaryFTS 376 WHERE DictionaryFTS MATCH @word 377 ) AS ResultFTS 378 ON Dictionary.Id = ResultFTS.rowid; 379 ``` 380 381 I'm not sure about the performance of this query compared to using `WHERE IN`. If you have any suggestions, please let me know :) 382 383 This query will now select the `JapaneseTextSnippet` column that comes from the `snippet` function. The matched word should be surrounded by `[` and `]`. If you're using the result for the web, you can wrap it with an html tag instead of square brackets, but since we're making a console app, we don't have the luxury of that and have to rely on ANSI escape sequence. You can see [this gist](https://gist.github.com/fnky/458719343aabd01cfb17a3a4f7296797) for the ANSI escape sequence reference. 384 385 The reason why I don't put the escape sequence inside the `snippet` function is because I had weird things happening when I tried to do that. So, I decided to wrap it in a bracket and then replace the bracket with the escape sequence using C# like this. 386 387 ```csharp 388 // print the results in a list 389 while (reader.Read()) 390 { 391 var englishText = reader.GetString(0); 392 var japaneseTextSnippet = reader.GetString(1).Replace("[", "[1;31m").Replace("]", "[0m"); 393 Console.WriteLine($"- {englishText} => {japaneseTextSnippet}"); 394 } 395 ``` 396 397 The end result should now look like this: 398 399  400 401 # References 402 403 Here are some useful resources that helped me that might also be useful to you: 404 405 - [SQLite Virtual Table](https://www.sqlite.org/vtab.html) 406 - [SQLite FTS3 Module](https://www.sqlite.org/fts3.html) 407 - [kemsereylam.com: FTS with SQLite](https://kimsereylam.com/sqlite/2020/03/06/full-text-search-with-sqlite.html) 408 - [abdus.dev: Quick FTS using SQLite](https://abdus.dev/posts/quick-full-text-search-using-sqlite/) 409 410 If you want the full code, you can find the repository at [elianiva/foo-dictionary](https://github.com/elianiva/foo-dictionary) 411 412 # Conclusion 413 414 Doing a full-text search is actually quite easy. Initially, I thought full-text search requires a fancy technology like Elasticsearch, Algolia, or something like that, but I found that I can do it with SQLite. 415 Most of the pain I had to go through came from trying to make the SQLite binding work. 416 417 I hope this article helped you to implement full-text search for the Japanese language in your SQLite database. If you have any questions or suggestions, feel free to leave a comment :) 418 419 Have a nice day!