---
title: Models are not data structures
subtitle: You knew this already
date: 2023-01-16
tags:
  - architecture
  - database
author: Cam
---

When building an app that manipulates data (e.g. a web server) we typically
interact with that data in three ways:
* As raw data in memory or in a database
* By manipulating it through specific behaviours in code
* By sending it to other applications via external APIs

If your instinct tells you that "this is the job for a model", then your instinct
is wrong. But I know you know better than that. The three separate tasks should be
seen as a hint that three separate pieces are required:
* The **structure** is the base data representation
* The **model** is for managing behaviours in code
* The **serialization** is how we send it between applications

If you've worked on any decently sized project, you have almost certainly
been given an ORM, and your framework lets you define (what they call) "models"
which provide access to the ORM's methods, upon which you build your app.
Mistake.

Asked what ORM stands for, is your answer "Object Relational Mapping"? Notably,
the M does not stand for model, because an ORM does not (should not?) have a concept
of "model". It's simply a __mapping__ of __objects__ to a __relational__ database.
Providing convenient methods for sending data to the database, *an ORM is
a serializer*. Meanwhile, the objects you've created are an implementation of a
data structure.

This is all fine as it is until your framework convinces you that this data structure
is called a "model" and you should implement behaviours on it. Now we've merged
all three layers into a single messy lump, and your app spirals out of control
after that. This is the part we must avoid.
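For contrast, here is a minimal sketch of the three pieces kept apart. Everything in it (`Note`, `NoteModel`, `NoteDto`, `to_dto`, `to_json`) is a hypothetical name made up for illustration and not tied to any particular framework or ORM.

```python
import json
from dataclasses import asdict, dataclass


# Structure: the base data representation, and nothing more.
@dataclass(frozen=True)
class Note:
    id: int
    author_id: int
    body: str


# Model: behaviours in code, wrapping a structure.
class NoteModel:
    def __init__(self, note: Note):
        self._note = note

    def preview(self, length: int = 20) -> str:
        """Return the body, shortened for use in listings."""
        body = self._note.body
        return body if len(body) <= length else body[:length] + "..."

    def is_written_by(self, user_id: int) -> bool:
        return self._note.author_id == user_id


# Serialization: the shape we send to other applications.
@dataclass(frozen=True)
class NoteDto:
    preview: str


def to_dto(model: NoteModel) -> NoteDto:
    # Explicit conversion, going through the model's API only.
    return NoteDto(preview=model.preview())


def to_json(dto: NoteDto) -> str:
    return json.dumps(asdict(dto))
```

The point of the sketch is only that each piece can change without dragging the other two along; the rest of this article looks at each one in turn.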

### The Business Object

Before getting too deep into the code side, it's worth identifying what a
"business object" might refer to: this is the thing that your non-developer
boss knows about.

If you're making a Facebook-like social network, a post is a likely business
object. To the boss, posts have text content and images, users react and comment
on posts, and those comments in turn have text and images, and further comments
and reactions.

If you're making an ecommerce shop, you have products. Products have names
and descriptions, each product may have a few variants, each variant may
have a different price in different countries, and there are reviews on each
variant as well. Each variant also has inventory which may exist in multiple
locations, and needs to increase and decrease as people make and cancel orders.

The business object is very high-level and, as you might have noticed already,
*has little bearing on the structure of the code*. Code will need post, comment,
and reaction to each be distinct entities, as will product, variant, price,
review, inventory levels, and orders.

We aim to keep this bearing of business object on code structure as small as
possible.

### The Structure

Probably the most core piece of this whole puzzle is the structure. A good
data structure makes everything easy: easy to implement behaviours on
the model and easy to serialize the data (often it can be done purely
structurally).

This is also the most often overlooked layer because it's boring and
feels "low level", a scary place for those of us who are used to just
building web apps. Rather than designing features from the data-level
up, we go from model-level down.

In some cases, the model-first approach doesn't go too wrong.
Setting up users and login, for example, is pretty straightforward:

```python
class User(Structure):
    id: int
    username: str
    email: str
    password_hash: str
```

Meanwhile, if we wanted to represent posts and comments, we might consider
the needs of our model and design this structure:

```python
class Post(Structure):
    author: User
    content: str
    image_url: str | None
    reactions: list[Reaction]
    comments: list[Comment]


class Comment(Structure):
    author: User
    content: str
    image_url: str | None
    reactions: list[Reaction]
    comments: list[Comment]


class Reaction(Structure):
    user: User
    react_emoji: str
```

Sadly, this structure is lacking in a few ways. Firstly, there is
no way to consider a post without also considering its comments, author, and
reactions as well.

Secondly, and more noticeably, the nested shape does not map well
to a (relational) database. That should be the hint. In
this case, we can attempt to do the usual obvious transformation,
but quickly notice that things are not as they seem: comments can
be on posts or other comments, so we have two potential foreign
keys. Alternatively, we could support comments only to a specific
depth, but of course that's not acceptable, so we go with something
like this:

```python
class Post(Structure):
    id: int
    author_id: int
    content: str
    image_url: str | None


class Comment(Structure):
    id: int  # target of the comment_id foreign keys below
    post_id: int | None
    comment_id: int | None
    author_id: int
    content: str
    image_url: str | None


class Reaction(Structure):
    post_id: int | None
    comment_id: int | None
    user_id: int
    react_emoji: str
```

Just one look at that should be giving you the heebie-jeebies: there's
so much room for invalid data! What if I set both `post_id` and `comment_id`?
Or neither?
Sure, the app *right now* doesn't do that, but there's nothing
stopping an uninformed admin from manually inserting such invalid data, or
a little bug coming up that introduces hundreds of unattached comments.
Sure, we could put some constraints in at the database level to make this work,
but those are hard to notice as a developer without access to the database, and
are prone to being forgotten when, say, we start allowing reactions to have
comments as well and a third foreign key appears.

Instead, if we had come at this problem from the data-side first (think,
database side first), we might have arrived at something like this:

```python
class Item(Structure):
    id: int


class Reply(Structure):
    item_id: int
    reply_item_id: int


class Post(Structure):
    item_id: int
    author_id: int
    content: str
    image_url: str | None


class Comment(Structure):
    item_id: int
    author_id: int
    content: str
    image_url: str | None


class Reaction(Structure):
    item_id: int
    user_id: int
    react_emoji: str
```

This is starting to look less and less like the natural shape that we might
have come up with for this business object, but is actually a better structure
for it. It solves the problems we ran into above:
* We can look up posts, comments, and even reactions independently of the other parts
* We can comment on and react to anything
* It fits nicely into a relational database
* We can build that natural shape on top of this structure as a model

It actually also allows a few things we maybe didn't intend, such as a post as
a reply to another post.
Maybe this doesn't make much sense in the Facebook-style
social network we were envisioning, but this is a feature that a more Twitter-like
platform allows, so it doesn't hurt to be ready for the day that the boss decides
that's a good idea.

Databases are surprisingly powerful at what they do. Very rarely, if ever, have
I seen the bottleneck in my application be "too much data in the database". They
are very capable of storing that data efficiently, and also of searching through
it quickly. Don't optimize your database for "easiest queries to think of" or
"storing less data". It will always pay off to pick "simplest, most basic,
normalized, reliable structure" and to write a good model around that data to
simulate the easier queries at code level.

### The Model

With a solid structure, it's often not the most convenient thing to have to work
with that structure directly all over the code. It is best to consider the structure
an "implementation detail" of the model, and let the model be a friendly API to
this data.
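As a small illustration of that idea, the sketch below keeps flat `Reply` and `Comment` records, shaped like the structures from the previous section, and puts one friendly function on top of them. The in-memory lists and the `comments_on` helper are assumptions made for this example; they stand in for real tables and queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    item_id: int        # the item being replied to
    reply_item_id: int  # the item that is the reply


@dataclass(frozen=True)
class Comment:
    item_id: int
    author_id: int
    content: str


# Hypothetical in-memory stand-ins for the replies and comments tables.
REPLIES = [Reply(1, 2), Reply(1, 3), Reply(2, 4)]
COMMENTS = [
    Comment(2, 10, "First!"),
    Comment(3, 11, "Nice post"),
    Comment(4, 10, "Replying to myself"),
]


def comments_on(item_id: int) -> list[Comment]:
    """Return the comments that reply directly to the given item."""
    reply_ids = {r.reply_item_id for r in REPLIES if r.item_id == item_id}
    return [c for c in COMMENTS if c.item_id in reply_ids]
```

A caller asking for `comments_on(1)` never needs to know that a replies table sits in between; that detail stays behind the function.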

Continuing with the social network example, a decent interface for our model is
likely more similar to the original structure we had come up with:

```python
from __future__ import annotations  # allow forward references in annotations


class PostModel(Model):
    @staticmethod
    def instance(post: Post) -> PostModel: pass
    def author(self) -> UserModel: pass
    def content(self) -> str: pass
    def image_url(self) -> str | None: pass
    def reactions(self) -> list[ReactionModel]: pass
    def comments(self) -> list[CommentModel]: pass


class CommentModel(Model):
    @staticmethod
    def instance(comment: Comment) -> CommentModel: pass
    def author(self) -> UserModel: pass
    def content(self) -> str: pass
    def image_url(self) -> str | None: pass
    def reactions(self) -> list[ReactionModel]: pass
    def comments(self) -> list[CommentModel]: pass


class ReactionModel(Model):
    @staticmethod
    def instance(reaction: Reaction) -> ReactionModel: pass
    def user(self) -> UserModel: pass
    def react_emoji(self) -> str: pass
```

A few things to notice here:
* No model is defined for `Item` or `Reply`; those aren't actually all that important to interact with explicitly.
* No fields are defined; we only need methods in our API. Fields are an implementation detail.
* The `instance` constructor takes the structure, implying that these models wrap structures in some way.

Then we go on to add a few more methods to these models to implement all of the
basic actions we might want to take on those models. Without a method for adding
a post as a reply to another post, we are safe from accidentally filling that
data into our database, despite the data structure technically supporting it.

```python
class PostModel(Model):
    # ...
    @staticmethod
    def ref(id: int) -> PostModel: pass
    def add_comment(self, comment: Comment): pass
    def add_reaction(self, reaction: Reaction): pass


class CommentModel(Model):
    # ...
    @staticmethod
    def ref(id: int) -> CommentModel: pass
    def add_comment(self, comment: Comment): pass
    def add_reaction(self, reaction: Reaction): pass


class ReactionModel(Model):
    # ...
    @staticmethod
    def ref(id: int) -> ReactionModel: pass
```

While `add_comment` and `add_reaction` are likely obvious,
of particular interest might be the `ref` method. Though it
takes just an ID, `ref` does not actually need to load any
data from the database; it only returns a model that represents
a "reference" to the post, comment, or reaction in question.
Since all the interactions are implemented as methods, they
can quietly load data only when required.

```python
from __future__ import annotations  # allow forward references in annotations


class CommentModel(Model):
    def __init__(self, *, id=None, comment=None):
        self._id = id
        self._comment = comment

    def _prepare(self):
        # Load the underlying structure lazily, the first time it is needed.
        if self._comment is None and self._id is not None:
            self._comment = Comment.load_from_db(self._id)

    @staticmethod
    def instance(comment: Comment) -> CommentModel:
        return CommentModel(comment=comment)

    @staticmethod
    def ref(id: int) -> CommentModel:
        return CommentModel(id=id)

    def author(self) -> UserModel:
        self._prepare()
        return UserModel.ref(self._comment.author_id)
```

In fact, even when created using `instance` and passed a
whole entry, the methods are quietly loading data and touching
the item and reply tables in the background:
* `post.comments()` will execute some query that looks up related items through the reply table, and then comments from their Item ID.
* `post.add_comment()` needs to insert a record into the item table first, then the comment itself, and finally a reply as well.

While these things might have been performed in a single simple query
with a more direct model-to-structure mapping, three queries
is really not all that bad. Remember that databases are *designed*
for this stuff.
At the end of the day, 3 inserts is still a
constant number (O(1)) of inserts for this task, so realistically
it makes very little noticeable difference.

Plus, if you really think about it, these operations *can* still be a
single query each:

```sql
-- Find all comments on post:
SELECT comments.*
FROM replies
INNER JOIN comments ON comments.item_id = replies.reply_item_id
WHERE replies.item_id = ?;

-- Add a comment (on post or comment)
-- Parameters, in order: author_id, content, image_url, parent item_id
WITH
  i AS (INSERT INTO items (id) VALUES (DEFAULT) RETURNING *),
  c AS (
    INSERT INTO comments (item_id, author_id, content, image_url)
    SELECT i.id, ?, ?, ?
    FROM i
  )
INSERT INTO replies (item_id, reply_item_id)
SELECT ?, i.id
FROM i;
```

### The Serialization

The serialization is an alternative representation for a piece of data
which is designed to be able to exist on its own, outside of the context
of a running program.

Depending on how this data is being used, the design of the serialization
may be different. For example, if the data is being sent across the network
to another client application, we may want to optimize the data to reduce its
size in bytes, to reduce bandwidth usage. Alternatively, if the data is being
stored into version control, such as Git, we may want to write it in a text
format that lends itself well to computing line-based diffs.

The most common serialization is to simply convert to JSON structurally, and
in all honesty, this actually makes the most practical sense in most cases.
However, note that since the application code typically interacts with a *model*,
it isn't working with actual structural data, only an interface to some unknown
underlying structure. In the example above, the model might actually just be an
ID and not a whole object at any given time.
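To see why that matters, here is a small sketch of what a purely structural dump does to a lazy reference. The `CommentModel` below is a cut-down, hypothetical version of the one above, with the database lookup replaced by a placeholder string; `dump_structurally` and `dump_explicitly` are names invented for this example.

```python
import json


class CommentModel:
    """Hypothetical lazy model: holds only an ID until data is needed."""

    def __init__(self, *, id=None, content=None):
        self._id = id
        self._content = content

    @staticmethod
    def ref(id: int) -> "CommentModel":
        return CommentModel(id=id)

    def content(self) -> str:
        if self._content is None:
            # Placeholder standing in for a real database lookup.
            self._content = f"content of comment {self._id}"
        return self._content


def dump_structurally(model: CommentModel) -> str:
    # Naive: serialize whatever fields the object happens to hold right now.
    return json.dumps(vars(model), sort_keys=True)


def dump_explicitly(model: CommentModel) -> str:
    # Explicit: go through the model's API, which loads data as required.
    return json.dumps({"content": model.content()})


comment = CommentModel.ref(5)
print(dump_structurally(comment))  # {"_content": null, "_id": 5}
print(dump_explicitly(comment))    # {"content": "content of comment 5"}
```

The structural dump leaks private field names and an empty payload, while the explicit one produces what the client actually wanted.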

In general, we don't want to couple our serialization format with the shape
of the application model, nor with the underlying structure of our data. Both
of those are implementation details of the server. Instead, taking the time
to explicitly define a serialization format and writing out functions to convert
the data into those formats typically leads to a clearer and more stable API.

Some people call this a Data Transfer Object (DTO). Not a beautiful name by any
means, but it does get the job done.

```python
class PostDto(Serialization):
    author: UserDto
    content: str
    image_url: str | None
    reactions: list[ReactionDto]
    comments: list[CommentDto]


class CommentDto(Serialization):
    author: UserDto
    content: str
    image_url: str | None
    reactions: list[ReactionDto]
    comments: list[CommentDto]


class ReactionDto(Serialization):
    user: UserDto
    react_emoji: str
```

Worth noticing is that a serialization is used on both sides of any application,
both for sending to, say, a client over HTTP, and also for sending to the database.
The database's serialization just often happens to be exactly the same as the
structure, as a typical web server application owns the schema of the database so
can use its structure safely.

Meanwhile, to a client-side application such as a website, the serialization is used when
receiving from or sending to the server. In that case, since the client application is not
in control of the data format, a separate structure is likely recommended (even if that
structure is basically the same as the serialization) so as not to couple the application
with the API it is using.
If that API ever changes, being able to change the serialization
and then only adjusting the mapping layer of serialization to structure will be a lot easier
than having to update the entire website at every place that changed data was used.

There are some less conventional uses of databases in which even the structure and the
database tables may not correspond, such as when aggregation is performed at the database
level in order to construct the structure. One example of this might be when using a database
table as a ledger, and the entries are added together to produce the available inventory
at a particular point in time. In those cases, it can be useful to define a serialization
of a single ledger entry, a structure that represents the aggregated quantity at a particular
time, and a model that allows updating that structure while writing entries to the database
via the single entry serialization. In any case, that is not a pattern that lends itself
well to the one-model-to-rule-them-all approach that your off-the-shelf framework is likely
to lead you towards by default.
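
A rough sketch of that ledger arrangement might look like the following. All of the names (`LedgerEntry`, `InventoryLevel`, `InventoryModel`, `adjust`, `level`) are hypothetical, and a plain Python list stands in for the ledger table; in a real application the sum would more likely be computed by the database.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LedgerEntry:
    """Serialization: one row as written to the ledger table."""
    variant_id: int
    delta: int
    recorded_at: datetime


@dataclass(frozen=True)
class InventoryLevel:
    """Structure: the aggregated quantity at a point in time."""
    variant_id: int
    quantity: int
    as_of: datetime


class InventoryModel:
    """Model: reads aggregates, writes individual entries."""

    def __init__(self, variant_id: int, ledger: list[LedgerEntry]):
        self._variant_id = variant_id
        self._ledger = ledger  # stand-in for the database table

    def level(self, as_of: datetime) -> InventoryLevel:
        # Add together every entry for this variant up to the given time.
        quantity = sum(
            entry.delta
            for entry in self._ledger
            if entry.variant_id == self._variant_id
            and entry.recorded_at <= as_of
        )
        return InventoryLevel(self._variant_id, quantity, as_of)

    def adjust(self, delta: int, at: datetime) -> None:
        # Never update a total in place; append a new entry instead.
        self._ledger.append(LedgerEntry(self._variant_id, delta, at))
```

Here the table rows and the structure deliberately have different shapes, and only the model knows how one turns into the other.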