# `Sirocco.core.graph_items.Store` Design

# Initial version [2024-11-11]

## Understanding the intended usage

In the current YAML format, all data nodes are specified in the same way, whether they are

- determined before we start (available)
- generated once (generated)
- generated periodically in a cycle (generated)

The only distinction is that available and generated data nodes are declared in separate sections:

```yaml
data:
  available:
    - grid_file:
        type: file
        src: $PWD/examples/files/data/grid
    - obs_data:
        type: file
        src: $PWD/examples/files/data/obs_data
    # ...
  generated:
    - extpar_file:
        type: file
        src: output
    - icon_input:
        type: file
        src: output
    - icon_restart:
        type: file
        format: ncdf
        src: restart
    # ...
```

Both kinds are eventually turned into the same data structure (`core.Data`) with an optional `.date` attribute, which is `None` except in the cyclically generated case.

When building the "unrolled" dependency graph, we likewise cannot tell whether we are building a recurring or a "one-off" task node, except from its context:

```yaml
cycles:
  - bimonthly_tasks:
      start_date: *root_start_date
      end_date: *root_end_date
      period: P2M
      tasks:
        - icon:  # recurring task is in a cycle with start & end dates
            inputs:
              - icon_restart:
                  lag: -P2M
            outputs: [icon_output, icon_restart]
  - lastly:
      tasks:
        - cleanup:  # one-off task looks the same but is in a cycle without start & end dates
            depends:
              - icon:
                  date: 2026-05-01T00:00
```

Again there is only one data structure for tasks, whether they are recurring or not, and data and tasks are stored side by side, as nodes in the same graph.

A further constraint on the design at the moment is that we want to distinguish three different cases when accessing a data point / task:

- one-off (access with `None` as date): return the node if stored, raise `KeyError` if not
- recurring (access with a valid date):
    - return the node if one is stored for that date
    - return `None` if the date is earlier than the earliest or later than the latest stored node
    - raise `ValueError` if the date is within the stored range but no node is stored for it

**[SEE UPDATE]** To this end the `TimeSeries` data structure was introduced, which takes care of storing all the data points by date for recurring nodes.

```python
from datetime import datetime

icon_output = TimeSeries()
icon_output[datetime.fromisoformat("2024-01-01")] = Data.from_config(...)
icon_output[datetime.fromisoformat("2025-01-01")] = Data.from_config(...)

icon_output.start_date  # is now 2024-01-01
icon_output.end_date  # is now 2025-01-01
icon_output[datetime.fromisoformat("2024-01-01")]  # returns the first entry
icon_output[datetime.fromisoformat("2026-01-01")]  # out of range: returns None and logs a warning
icon_output[datetime.fromisoformat("2024-06-01")]  # in range but not stored: raises ValueError
```
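
To make the access rules concrete, here is a minimal sketch of a container with the behaviour described above. It is not the actual `TimeSeries` implementation: the class name `TimeSeriesSketch`, the internal `_points` dict and the handling of an empty series are assumptions made for illustration, and the warning on out-of-range access is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class TimeSeriesSketch:
    """Hypothetical stand-in for `TimeSeries`; not the actual Sirocco code."""

    _points: dict[datetime, Any] = field(default_factory=dict)

    @property
    def start_date(self) -> Optional[datetime]:
        # Earliest stored date, or None while the series is empty.
        return min(self._points, default=None)

    @property
    def end_date(self) -> Optional[datetime]:
        # Latest stored date, or None while the series is empty.
        return max(self._points, default=None)

    def __setitem__(self, date: datetime, node: Any) -> None:
        self._points[date] = node

    def __getitem__(self, date: datetime) -> Any:
        if date in self._points:
            return self._points[date]
        # Assumption: an empty series is treated like an out-of-range access.
        if not self._points or date < self.start_date or date > self.end_date:
            return None
        # In range, but nothing stored for this exact date.
        raise ValueError(f"no node stored for {date.isoformat()}")
```

With entries stored for 2024-01-01 and 2025-01-01, this sketch returns the stored node for either date, returns `None` for 2026-01-01 and raises `ValueError` for 2024-06-01.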

This means that the logic deciding whether we are storing a one-off data point / task or a recurring one (in which case we initialize a `TimeSeries` for it) has to go somewhere. The choices are:

- At creation of the "unrolled" nodes (this is currently done in nested for loops, and additional branches would increase the complexity of that code even more)
    - pro: no custom container needed
    - con: either very complex or requires twice as many containers to keep recurring and one-off nodes apart

```python
data: dict[str, node | TimeSeries]
for ...:
    for ...:
        for ...:
            ...
            if date_or_none_from_context:
                # recurring: create the series on first use, then store by date
                data.setdefault(name, TimeSeries())[date_or_none_from_context] = Data.from_config(...)
            else:
                data[name] = Data.from_config(...)  # this might be a different container to simplify access logic
...
# repeat the same thing later for tasks
# and on access later
for name, item in data.items():
    if isinstance(item, TimeSeries):
        if not access_date:
            raise ...  # we must access with a date
        else:
            data_point = item[access_date]
    else:
        data_point = item
    ...
...

# The above assumes that we are looping over unrolled nodes and again do not know
# whether they are recurring. If they are stored separately, this is simpler but
# needs twice as many loops.

for name, data_point in one_off_data.items():
    ...

for name, data_series in recurring_data.items():
    ...
```

- At insertion into the container we use (the current choice with `Store`)
    - pro:
        - reduces the number of containers
        - reduces the complexity of code interfacing with `core.WorkFlow`
    - con: additional container class to maintain (however, it does *not* need to conform to standard container interfaces)

```python
data = Store()
for ...:
    for ...:
        for ...:
            data[name, date_or_none_from_context] = Data.from_config(...)

...
# on access later
for name in data:
    data_point = data[name, access_date_or_none]
    ...

# or simply
for data_point in data.values():
    name = data_point.name
    date_or_none = data_point.date
    ...
```
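
As an illustration of where the one-off versus recurring decision ends up with this option, the following sketch dispatches on the date at insertion time. It is only a rough approximation of the idea, not the actual `Store`: the name `StoreSketch`, its two internal dicts and the rejection of mixed usage are assumptions, and a plain `dict` stands in for `TimeSeries`, so the range checks are omitted.

```python
from datetime import datetime
from typing import Any, Iterator, Optional


class StoreSketch:
    """Hypothetical stand-in for `Store`; not the actual Sirocco code."""

    def __init__(self) -> None:
        self._one_off: dict[str, Any] = {}
        # A plain dict per name stands in for `TimeSeries` here.
        self._recurring: dict[str, dict[datetime, Any]] = {}

    def __setitem__(self, key: tuple[str, Optional[datetime]], node: Any) -> None:
        name, date = key
        if date is None:
            # Assumption: a name cannot be both one-off and recurring.
            if name in self._recurring:
                raise ValueError(f"{name!r} is already stored as a recurring node")
            self._one_off[name] = node
        else:
            if name in self._one_off:
                raise ValueError(f"{name!r} is already stored as a one-off node")
            self._recurring.setdefault(name, {})[date] = node

    def __getitem__(self, key: tuple[str, Optional[datetime]]) -> Any:
        name, date = key
        if date is None:
            return self._one_off[name]  # KeyError if not stored
        return self._recurring[name][date]

    def values(self) -> Iterator[Any]:
        # Flat iteration over every stored node, one-off nodes first.
        yield from self._one_off.values()
        for series in self._recurring.values():
            yield from series.values()
```

The calling code only ever writes `store[name, date_or_none]`, so the nested loops that build the graph need no extra branch.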

If we were not using `TimeSeries`, this would open up the following additional option:

- Store in a flat mapping with (name, date) as the key instead of T
    - pro: can use a dict
    - con:
        - (unless the above constraint is dropped) the logic in `TimeSeries` would have to be reimplemented and would be more complex:
            - either the mapping would be custom and do the same job as `TimeSeries`, except for multiple recurring data nodes
            - or the functionality would have to be implemented external to a standard mapping and would have to do even more checking
        - If not hosted in the `Workflow` class directly, cumbersome logic would have to be reproduced each time we need to access the nodes, for example when generating the `WorkGraph` or the visualization graph. If hosted in `WorkFlow`, this is not less maintenance than the `Store` and `TimeSeries` classes, but it is less clean.
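
To give a feel for the extra checking the flat-mapping option implies, here is a sketch of an external helper that reproduces the access rules on top of a plain `dict`. The function `lookup` and the `Key` alias are hypothetical names introduced only for this illustration.

```python
from datetime import datetime
from typing import Any, Optional

Key = tuple[str, Optional[datetime]]


def lookup(nodes: dict[Key, Any], name: str, date: Optional[datetime]) -> Any:
    """Hypothetical helper applying the access rules to a flat (name, date) dict."""
    if date is None:
        return nodes[name, None]  # one-off: KeyError if not stored
    if (name, date) in nodes:
        return nodes[name, date]
    # The flat layout forces a scan over all keys to recover the date range of one name.
    dates = [d for (n, d) in nodes if n == name and d is not None]
    if not dates:
        raise KeyError(name)
    if date < min(dates) or date > max(dates):
        return None  # outside the stored range
    raise ValueError(f"no node stored for {name!r} at {date.isoformat()}")
```

Every caller would have to go through such a helper (or repeat its logic), which is the maintenance cost weighed against `Store` and `TimeSeries` above.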

## Temporary Conclusion

All in all, we (Rico, Matthieu) think `Store` is a good enough design for now, as the maintenance burden is low, given that `Sirocco` is more of an app and less of a library. Therefore `Store` should not be expected to support any `Mapping` functionality beyond what we need inside `Sirocco` itself.

## Further developments potentially affecting this design

We will at some point introduce parameterized tasks (and thus parameterized data as well). This will add other dimensions to the `Store`.

## Reasons to change

- If we find ourselves implementing more and more standard container functionality on `Store`, it is time to reconsider.
- If the code for turning the IR into a WorkGraph suffers from additional complexity due to the design of `Store`, then the design needs to be changed.
- Potentially, changes to the config schema might necessitate or unlock a different design here.

# UPDATE [2025-01-29]

## The `Array` class

Since the introduction of parameterized tasks, the `Store` and `TimeSeries` design has evolved.

The `TimeSeries` class became the more generic `Array` class. This makes the date one of the `Array` dimensions, along with potential parameters. Objects stored in an `Array` are accessed by their `coordinates`, a `dict` mapping each dimension name to its value. `Store` is now a container for `Array` objects.

The following two changes also simplified the code and made it cleaner:

- Accessing nodes without any date or parameter is simpler, as `Array` allows empty coordinates (`{}`), so these cases no longer need to be captured and treated specially in `Store`.
- The need for special handling of nodes with dates out of range (just ignoring them, which is rather dirty) also disappeared with the introduction of the `when` keyword in the config format.
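
The coordinate-based access can be pictured with the following sketch. It is a simplified illustration under assumptions, not the actual `Array` class: the name `ArraySketch`, the way dimensions are fixed on first insertion, and the example dimension name `member` are all hypothetical.

```python
from datetime import datetime
from typing import Any, Optional


class ArraySketch:
    """Hypothetical stand-in for `Array`; not the actual Sirocco code."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._dims: Optional[tuple[str, ...]] = None  # fixed by the first insertion
        self._items: dict[tuple[Any, ...], Any] = {}

    def _key(self, coordinates: dict[str, Any]) -> tuple[Any, ...]:
        # Order the values by dimension name so that dict ordering does not matter.
        return tuple(coordinates[dim] for dim in sorted(coordinates))

    def __setitem__(self, coordinates: dict[str, Any], node: Any) -> None:
        dims = tuple(sorted(coordinates))
        if self._dims is None:
            self._dims = dims
        elif dims != self._dims:
            raise KeyError(f"expected dimensions {self._dims}, got {dims}")
        self._items[self._key(coordinates)] = node

    def __getitem__(self, coordinates: dict[str, Any]) -> Any:
        if tuple(sorted(coordinates)) != self._dims:
            raise KeyError(f"expected dimensions {self._dims}")
        return self._items[self._key(coordinates)]


# A node with a date and a (hypothetical) parameter dimension:
restart = ArraySketch("icon_restart")
restart[{"date": datetime(2024, 1, 1), "member": 0}] = "restart for member 0"

# A node without date or parameters uses empty coordinates, with no special case:
grid = ArraySketch("grid_file")
grid[{}] = "grid"
```

In this picture a one-off node is simply an array with zero dimensions, which is why the special-casing in `Store` is no longer needed.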