article.svx
  1  ---
  2  title: Models are not data structures
  3  subtitle: You knew this already
  4  date: 2023-01-16
  5  tags:
  6  - architecture
  7  - database
  8  author: Cam
  9  ---
 10  
 11  When building an app that manipulates data (e.g. a web server) we typically
 12  interact with that data in three ways:
 13  *   As raw data in memory or database
 14  *   Manipulating via specific behaviours through code
 15  *   Sending it to other applications via external APIs
 16  
 17  If your instinct tells you that "this is the job for a model", then your instinct
 18  is wrong. But I know you know better than that. The three seaprate tasks should be
 19  seen as a hint that three separate pieces are required:
 20  *   The **structure** is the base data representation
 21  *   The **model** is for managing behaviours in code
 22  *   The **serialization** is how we send it between applications
 23  
 24  If you've worked on any decently sized project, you have almost certainly
 25  been given an ORM, and your framework lets you define (what they call) "models"
 26  which provide access to the ORMs methods, upon which you build your app.
 27  Mistake.
 28  
 29  Asked what ORM stands for, is your answer "Object Relational Mapping"? Notably,
 30  the M does not stand for model, because an ORM does not (should not?) have a concept
 31  of "model". It's simply a __mapping__ of __objects__ to a __relational__ database.
 32  Providing convenient methods for sending data to the database, *an ORM is
 33  a serializer*. Meanwhile, the objects you've created are an implementation of a
 34  data structure.
 35  
 36  This is all fine as it is until your framework convinces that you this data structure
 37  is called a "model" and you should implement behaviours on it. Now we've merged
 38  all three layers into a single messy lump, and your app spirals out of control
 39  after that. This is the part we must avoid.
 40  
 41  ### The Business Object
 42  
 43  Before getting too deep into the code side, it's worth identifying what a
 44  "business object" might refer to: this is the thing that your non-developer
 45  boss knows about.
 46  
 47  If you're making a Facebook-like social network, a post is a likely business
 48  object. To the boss, posts have text content and images, users react and comment
 49  on posts, and those comments in turn have text and images, and further comments
 50  and reactions.
 51  
 52  If you're making an ecommerce shop, you have products. Products have names
 53  and descriptions, each product may have a few variants, each variant may
 54  have a different price in different countries, and there are reviews on each
 55  variant as well. Each variant also has inventory which may exist in multiple
 56  locations, and needs to increase and decrease as people make and cancel orders.
 57  
 58  The business object is very high-level and, as you might have noticed already,
 59  *has little bearing on the structure of the code*. Code will need post, reaction,
 60  like, and share to each be distinct entities, as will product, variant, price,
 61  review, inventory levels, and orders.
 62  
 63  We aim to keep this bearing of business object on code structure as small as
 64  possible.
 65  
 66  ### The Structure
 67  
 68  Probably the most core piece of this whole puzzle is the structure. A good
 69  data structure makes everything easy; easy to implement behaviours on
 70  the model and easy to serialize the data (often it can be done purely
 71  structurally).
 72  
 73  This is also the most often overlooked layer because it's boring and
 74  feels "low level", a scary place for those of us who are used to just
 75  building web apps. Rather than designing features from the data-level
 76  up, we go from model-level down.
 77  
 78  In some cases, the model first approach doesn't go too wrong. Setting up
 79  users and login for example is pretty straight forward:
 80  
 81  ```python
 82  class User(Structure):
 83      id: int
 84      username: str
 85      email: str
 86      password_hash: str
 87  ```
 88  
 89  Meanwhile, if we wanted to represent posts and comments, we might consider
 90  the needs of our model and design this structure:
 91  
 92  ```python
 93  class Post(Structure):
 94      author: User
 95      content: str
 96      image_url: str | None
 97      reactions: list[Reaction]
 98      comments: list[Comment]
 99  
100  
101  class Comment(Structure):
102      author: User
103      content: str
104      image_url: str | None
105      reactions: list[Reaction]
106      comments: list[Comment]
107  
108  
109  class Reaction(Structure):
110      user: User
111      react_emoji: str
112  ```
113  
114  Sadly, this structure is lacking in a few ways. Firstly, there is
115  no way to consider a post without also considering its comments, author, and
116  reactions as well.
117  
118  Secondly, and more noticeably, the nested shape does not map well
119  to a (relational) database. That should be the hint. In
120  this case, we can attempt to do the usual obvious transformation,
121  but quickly notice that things are not as they seem: comments can
122  be on posts or other comments, so we have two potential foreign
123  keys. Alternatively, we could support comments only to a specific
124  depth, but of course that's not acceptable, so we go with something
125  like this:
126  
127  ```python
128  class Post(Structure):
129      id: int
130      author_id: int
131      content: str
132      image_url: str | None
133  
134  
135  class Comment(Structure):
136      post_id: int | None
137      comment_id: int | None
138      author_id: int
139      content: str
140      image_url: str | None
141  
142  
143  class Reaction(Structure):
144      post_id: int | None
145      comment_id: int | None
146      user_id: int
147      react_emoji: str
148  ```
149  
150  Just one look at that should be giving you the heebie-jeebies: there's
151  so much room for invalid data! What if I set both `post_id` and `comment_id`?
152  Or neither? Sure, the app *right now* doesn't do that, but there's nothing
153  stopping an uninformed admin from manually inserting such invalid data, or
154  a little bug coming up that introduces hundreds of unattached comments.
155  Sure we could put some constraints in at the database level to make this work,
156  but those are hard to notice as a developer without access to the database, and
157  are prone to being forgotten when, say, we start allowing reactions to have
158  comments as well and a third foreign key appears.
159  
160  Instead, if we had come at this problem from the data-side first (think,
161  database side first), we might have arrived at something like this:
162  
163  ```python
164  class Item(Structure):
165      id: int
166  
167  
168  class Reply(Structure):
169      item_id: int
170      reply_item_id: int
171  
172  
173  class Post(Structure):
174      item_id: int
175      author_id: int
176      content: str
177      image_url: str | None
178  
179  
180  class Comment(Structure):
181      item_id: int
182      author_id: int
183      content: str
184      image_url: str | None
185  
186  
187  class Reaction(Structure):
188      item_id: int
189      user_id: int
190      react_emoji: str
191  ```
192  
193  This is starting to look less and less like the natural shape that we might
194  have come up with for this business object, but is actually a better structure
195  for it. It solves all of the problems we might possibly have wanted to solve:
196  *   We can look up posts, comments and even reactions independently from other parts
197  *   We can comment on and react to anything
198  *   It fits nicely into a relational database
199  *   We can build that natural shape on top of this structure as a model
200  
201  It actually also allows a few things we maybe didn't intend, such as a post as
202  a reply to another post. Maybe this doesn't make much sense in the Facebook style
203  social network we were envisioning, but this is a feature that a more Twitter-like
204  platform allows, so it doesn't hurt to be ready for the day that the boss decides
205  that's a good idea.
206  
207  Databases are surprisingly powerful at what they do. Very rarely, if ever, have
208  I seen the bottleneck in my application be "too much data in the database". They
209  are very capable of storing that data efficiently, and also at searching through
210  it quickly. Don't optimize your database for "easiest queries to think of" or
211  "storing less data". It will always pay off to pick "simplest, most basic,
212  normalized, reliable structure" and to write a good model around that data to
213  simulate the easier queries at code level.
214  
215  ### The Model
216  
217  With a solid structure, it's often not the most convenient thing to have to work
218  with that structure directly all over the code. It is best to consider the structure
219  an "implementation detail" of the model, and let the model be a friendly API to
220  this data.
221  
222  Continuing with the social network example, a decent interface for our model is
223  likely more similar to the original structure we had come up with:
224  
225  ```python
226  class PostModel(Model):
227      def instance(post: Post): pass
228      def author(self) -> User: pass
229      def content(self) -> str: pass
230      def image_url(self) -> str | None: pass
231      def reactions(self) -> list[ReactionModel]: pass
232      def comments(self) -> list[CommentModel]: pass
233  
234  
235  class CommentModel(Model):
236      def instance(comment: Comment): pass
237      def author(self) -> UserModel: pass
238      def content(self) -> str: pass
239      def image_url(self) -> str | None: pass
240      def reactions(self) -> list[ReactionModel]: pass
241      def comments(self) -> list[CommentModel]: pass
242  
243  
244  class ReactionModel(Model):
245      def instance(reaction: Reaction): pass
246      def user(self) -> User: pass
247      def react_emoji(self) -> str: pass
248  ```
249  
250  A few things to notice here:
251  *   No model is defined for `Item` or `Reply`, those aren't actually all that important to interact with explicitly.
252  *   No fields are defined, we only need methods in our API. Fields are implementation detail.
253  *   The constructor takes the structure, implying that these models wrap structures in some way.
254  
255  Then we go on to add a few more methods to these models to implement all of the
256  basic actions we might want to take on those models. Without a method for adding
257  a post as a reply to another post, we are safe from accidentally filling that
258  data into our database, despite the data model technically supporting it.
259  
260  ```python
261  class PostModel(Model):
262      # ...
263      def ref(id: int) -> PostModel: pass
264      def add_comment(self, comment: Comment): pass
265      def add_reaction(self, reaction: Reaction): pass
266  
267  
268  class CommentModel(Model):
269      # ...
270      def ref(id: int) -> CommentModel: pass
271      def add_comment(self, comment: Comment): pass
272      def add_reaction(self, reaction: Reaction): pass
273  
274  
275  class ReactionModel(Model):
276      # ...
277      def ref(id: int) -> ReactionModel: pass
278  ```
279  
280  While `add_comment` and `add_reaction` are likely obvious,
281  of particular interest might be the `ref` method. Though it
282  takes just an ID, `ref` does not actually need to load any
283  data from the database; only return a model that represents
284  a "reference" to the post, comment, or reaction in question.
285  Since all the interactions are implemented as methods, they
286  can quietly load data only when required.
287  
288  ```python
289  class CommentModel(Model):
290      def __init__(self, *, id=None, comment=None):
291          self._id = id
292          self._comment = comment
293  
294      def _prepare(self):
295          if self._comment is None and self._id is not None:
296              self._comment = Comment.load_from_db(self._id)
297  
298      def instance(comment: Comment) -> CommentModel:
299          return CommentModel(comment=comment)
300  
301      def ref(id: int) -> CommentModel:
302          return CommentModel(id=id)
303  
304      def author(self) -> UserModel:
305          self._prepare()
306          return UserModel.ref(self._comment.author_id)
307  ```
308  
309  In fact, even when created using the constructor, passed a
310  whole entry, the methods are quietly loading data and touching
311  the item and reply tables in the background:
312  *   `post.comments()` will execute some query that looks up related items through the reply table, and then comments from their Item ID.
313  *   `post.add_comment()` needs to insert the a record into the item table first, then the comment itself, and finally a reply as well.
314  
315  While these things may have been performed in a single simple query
316  with a more direct model-to-structure mapping, three queries
317  is really not all that bad. Remember that databases are *designed*
318  for this stuff. At the end of the day, 3 inserts is still a
319  constant number (O(1)) of inserts for this task, so realistically
320  it makes very little noticeable difference.
321  
322  Plus, if you really think about it, these operations *can* still be a
323  single query each:
324  
325  ```sql
326  -- Find all comments on post:
327  SELECT comments.*
328    FROM replies
329    LEFT OUTER JOIN comments ON comments.item_id = replies.reply_item_id
330    WHERE replies.item_id = ?
331    WHERE comments.id IS NOT NULL
332  
333  -- Add a comment (on post or comment)
334  WITH
335    i AS (INSERT INTO items (id) VALUES (DEFAULT) RETURNING *),
336    c AS (
337      INSERT INTO comments (id, item_id, author_id, content, image_url)
338        SELECT i.id, ?, ?, ?
339        FROM i
340    )
341  INSERT INTO replies (item_id, reply_item_id)
342    SELECT 1, i.id
343    FROM i
344  ```
345  
346  ### The Serialization
347  
348  The serialization is an alternative representation for a piece of data
349  which is designed to be able to exist on its own, outside of the context
350  of a running program.
351  
352  Depending on how this data is being used, the design of the serialization
353  may be different. For example, if the data is being sent across the network
354  to another client application, we may want to optimize the data to reduce its
355  size in bytes, to reduce bandwidth usage. Alternatively, if the data is being
356  stored into version control, such as Git, we may want to write it in a text
357  format that lends well to computing line-based diffs.
358  
359  The most common serialization is to simply convert to JSON structurally, and
360  in all honesty, this actually makes the most practical sense in most cases.
361  However, note that since the application code typically interacts with a *model*,
362  it isn't working with an actual structural data, only an interface to some unknown
363  underlying structure. In the example above, the model might actually just be an
364  ID and not a whole object at any given time.
365  
366  In general, we don't want to couple our serialization format with the shape
367  of the application model, nor with the underlying structure of our data. Both
368  of those are implementation details of the server. Instead, taking the time
369  to explicitly define a serialization format and writing out functions to convert
370  the data into those formats typically leads to a more clear and stable API.
371  
372  Some people call this a Data Transfer Object (DTO). Not a beautiful name by any
373  means, but it does get the job done.
374  
375  ```python
376  class PostDto(Serialization):
377      author: UserDto
378      content: str
379      image_url: str | None
380      reactions: list[ReactionDto]
381      comments: list[CommentDto]
382  
383  
384  class CommentDto(Serialization):
385      author: UserDto
386      content: str
387      image_url: str | None
388      reactions: list[ReactionDto]
389      comments: list[CommentDto]
390  
391  
392  class ReactionDto(Serialization):
393      user: UserDto
394      react_emoji: str
395  ```
396  
397  Worth noticing is that a serialization is used on both sides of any application,
398  both for sending to, say, a client over HTTP, and also for sending to the database.
399  The database's serialization just often happens to be exactly the same as the
400  structure, as a typical web server application owns the schema of the database so
401  can use its structure safely.
402  
403  Meanwhile, to a client side application such as a website, the serialization is used when
404  receiving from or sending to the server. In that case, since the client application is not
405  in control of the data format, a separate structure is likely recommended (even if that
406  structure is basically the same as the serialization) so as not to couple the application
407  with the API it is using. If that API ever changes, being able to change the serialization
408  and then only adjusting the mapping layer of serialization to structure will be a lot easier
409  than having to update the entire website at every that changed data was used.
410  
411  There are some less conventional uses of databases in which even the structure and the
412  database tables may not correspond, such as when aggregation is performed at the database
413  level in order to construct the structure. One example of this might be when using a database
414  table as a ledger, and the entries are added together to produce the available inventory
415  at a particular point in time. In those cases, it can be useful to define a serialization
416  of a single ledger entry, a structure that represents the aggregated quantity at a particular
417  time, and a model that allows updating that structure while writing entries to the database
418  via the single entry serialization. In any case, that is not a pattern that lends itself
419  well to the one-model-to-rule-them-all approach that your off the shelf framework is likely
420  to lead to towards by default.