Store_and_Array_Design.md
# `Sirocco.core.graph_items.Store` Design

# Initial version [2024-11-11]

## Understanding the intended usage

In the current YAML format we specify all data nodes in the same way, whether they are

- determined before we start (available)
- generated once (generated)
- generated periodically in a cycle (generated)

although available and generated data nodes are in separate sections:

```yaml
data:
  available:
    - grid_file:
        type: file
        src: $PWD/examples/files/data/grid
    - obs_data:
        type: file
        src: $PWD/examples/files/data/obs_data
    ...
  generated:
    - extpar_file:
        type: file
        src: output
    - icon_input:
        type: file
        src: output
    - icon_restart:
        type: file
        format: ncdf
        src: restart
    ...
```

They are eventually turned into the same data structure (`core.Data`) with an optional `.date` attribute, which is `None` except in the cyclically generated case.

When building the "unrolled" dependency graph, we also don't know whether we are building a recurring or a "one-off" task node, except from the context:

```yaml
cycles:
  - bimonthly_tasks:
      start_date: *root_start_date
      end_date: *root_end_date
      period: P2M
      tasks:
        - icon: # recurring task is in a cycle with start & end dates
            inputs:
              - icon_restart:
                  lag: -P2M
            outputs: [icon_output, icon_restart]
  - lastly:
      tasks:
        - cleanup: # one-off task looks the same but is in a cycle without start & end dates
            depends:
              - icon:
                  date: 2026-05-01T00:00
```

There is again only one data structure for tasks, whether they are recurring or not, and data and tasks are stored side by side, as nodes in the same graph.
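As a rough illustration of the point above, here is a minimal sketch of how one node type can cover both the one-off and the recurring case. `SketchNode` and `unroll` are hypothetical names invented for this example, not Sirocco's `core.Data` or its actual unrolling code, and the cycle dates are passed in explicitly rather than derived from `start_date`, `end_date` and `period`.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SketchNode:
    """Hypothetical node: the same class serves one-off and recurring nodes."""

    name: str
    date: datetime | None = None  # None for one-off nodes


def unroll(name: str, cycle_dates: list[datetime] | None = None) -> list[SketchNode]:
    """Create one node per cycle date, or a single undated node outside a dated cycle."""
    if cycle_dates is None:
        return [SketchNode(name)]
    return [SketchNode(name, date) for date in cycle_dates]


# A task in a dated cycle yields one node per date ...
icon_nodes = unroll("icon", [datetime(2026, 1, 1), datetime(2026, 3, 1)])
# ... while a task in an undated cycle yields a single node without a date.
cleanup_nodes = unroll("cleanup")
```

Nothing on the node itself except `date` tells the two cases apart, which is why the decision has to be made from context somewhere else.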

A further constraint for the design at the moment is that we want to distinguish three different cases when accessing a data point / task:

- one-off (access with `None` as date): return the node if stored, `KeyError` if not
- recurring (access with a valid date):
  - return the node if there is one stored for the date
  - return `None` if the date is earlier / later than the earliest / latest stored node
  - `ValueError` if the date is in the right range but there is no node stored for it

**[SEE UPDATE]** To this end the `TimeSeries` data structure was introduced, which takes care of storing all the data points by date for recurring nodes.

```python
from datetime import datetime

icon_output = TimeSeries()
icon_output[datetime.fromisoformat("2024-01-01")] = Data.from_config(...)
icon_output[datetime.fromisoformat("2025-01-01")] = Data.from_config(...)

icon_output.start_date  # is now 2024-01-01
icon_output.end_date  # is now 2025-01-01
icon_output[datetime.fromisoformat("2024-01-01")]  # will return the first entry
icon_output[datetime.fromisoformat("2026-01-01")]  # will return None and log a warning
icon_output[datetime.fromisoformat("2024-06-01")]  # will raise a ValueError
```

This means the logic that decides whether we are storing a one-off data point / task or a recurring one (in which case we initialize a `TimeSeries` for it) has to go somewhere. The choices are:

- At creation of the "unrolled" nodes (this is currently done in nested for loops, and branches would increase the complexity of that code even more)
  - pro: no custom container needed
  - con: either very complex or requires twice as many containers to keep recurring and one-off nodes apart

```python
data: dict[str, node | TimeSeries]
for ...:
    for ...:
        for ...:
            ...
            if date_or_none_from_context:
                data[name][date_from_config_or_none] = Data.from_config(...)
            else:
                data[name] = Data.from_config(...)  # this might be a different container to simplify access logic
            ...
# repeat the same thing later for tasks
# and on access later
for name, item in data.items():
    if isinstance(item, TimeSeries):
        if not access_date:
            raise ...  # we must access with a date
        else:
            data_point = item[access_date]
    else:
        data_point = item
    ...
...

# under the assumption that we are looping over unrolled nodes and again do not know whether they are recurring.
# If they are stored separately, this would be simpler but need twice as many loops.

for name, data_point in one_off_data.items():
    ...

for name, data_series in recurring_data.items():
    ...
```

- At insertion into the container we use (the current choice with `Store`)
  - pro:
    - reduces the number of containers
    - reduces the complexity of code interfacing with `core.WorkFlow`
  - con: additional container class to maintain (however, it does *not* need to conform to standard container interfaces)

```python
data = Store()
for ...:
    for ...:
        for ...:
            data[name, date_or_none_from_context] = Data.from_config(...)

...
# on access later
for name in data:
    data_point = data[name, access_date_or_none]
    ...

# or simply
for data_point in data.values():
    name = data_point.name
    date_or_none = data_point.date
    ...
```

If we were not using `TimeSeries`, this would open up the following additional option:

- Store in a flat mapping with (name, date) as the key instead of T
  - pro: can use a dict
  - con:
    - (unless the above constraint is dropped): the logic in `TimeSeries` would have to be implemented external to the mapping and would be more complex.
      - either the mapping would be custom and do the same job as `TimeSeries`, except for multiple recurring data nodes
      - or the functionality would have to be implemented external to a standard mapping and would have to do even more checking
    - If not hosted in the `Workflow` class directly, cumbersome logic would have to be reproduced each time we need to access the nodes, for example when generating the `WorkGraph` or the visualization graph. If hosted in `WorkFlow`, this is no less maintenance than the `Store` and `TimeSeries` classes, but less clean.

## Temporary Conclusion

All in all, we (Rico, Matthieu) think `Store` is a good enough design for now, as the maintenance burden is low, given that `Sirocco` is more of an app and less of a library. Therefore `Store` should not be expected to support any `Mapping` functionality beyond what we need inside `Sirocco` itself.

## Further developments potentially affecting this design

We will at some point introduce parameterized tasks (and thus parameterized data as well). This will add other dimensions to the `Store`.

## Reasons to change

- If we find ourselves implementing more and more standard container functionality on `Store`, it is time to reconsider.
- If the code for turning the IR into a `WorkGraph` suffers from additional complexity due to the design of `Store`, then the design needs to be changed.
- Changes to the config schema might necessitate or unlock a different design here.

# UPDATE [2025-01-29]

## The `Array` class

Since the introduction of parameterized tasks, the `Store` and `TimeSeries` design has evolved.

The `TimeSeries` class became the more generic `Array` class. This makes the date one of the `Array` dimensions, along with potential parameters. Objects stored in an `Array` are accessed with their `coordinates`, a `dict` mapping each dimension name to its value. `Store` is now a container for `Array` objects.

The following two changes also simplified the code and made it cleaner:

- Accessing nodes without any date or parameter is simpler, as `Array` allows for empty coordinates (`{}`), so these cases are no longer captured and treated in a special way in `Store`.
- The need for special handling of nodes with dates out of range (just ignoring them, which is rather dirty) also disappeared with the introduction of the `when` keyword in the config format.
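
The coordinate-based access described above can be sketched roughly as follows. This is an illustrative sketch only: `SketchArray`, `SketchStore` and their methods are hypothetical names, and the actual `Array` and `Store` classes in Sirocco may differ in interface and behaviour. The sketch assumes that the dimensions of an array are fixed by the first object stored in it.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any


class SketchArray:
    """Hypothetical stand-in for `Array`: objects addressed by coordinates."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._dims: tuple[str, ...] | None = None  # fixed by the first insertion
        self._items: dict[tuple[Any, ...], Any] = {}

    def __setitem__(self, coordinates: dict[str, Any], value: Any) -> None:
        dims = tuple(sorted(coordinates))
        if self._dims is None:
            self._dims = dims
        elif dims != self._dims:
            raise KeyError(f"{self.name}: expected dimensions {self._dims}, got {dims}")
        key = tuple(coordinates[dim] for dim in dims)
        if key in self._items:
            raise KeyError(f"{self.name}: {coordinates} is already stored")
        self._items[key] = value

    def __getitem__(self, coordinates: dict[str, Any]) -> Any:
        dims = tuple(sorted(coordinates))
        if dims != self._dims:
            raise KeyError(f"{self.name}: expected dimensions {self._dims}, got {dims}")
        return self._items[tuple(coordinates[dim] for dim in dims)]


class SketchStore:
    """Hypothetical stand-in for `Store`: a container of arrays keyed by name."""

    def __init__(self) -> None:
        self._arrays: dict[str, SketchArray] = {}

    def add(self, name: str, coordinates: dict[str, Any], value: Any) -> None:
        self._arrays.setdefault(name, SketchArray(name))[coordinates] = value

    def __getitem__(self, key: tuple[str, dict[str, Any]]) -> Any:
        name, coordinates = key
        return self._arrays[name][coordinates]


store = SketchStore()
store.add("grid_file", {}, "grid")  # one-off node: empty coordinates
store.add("icon_output", {"date": datetime(2024, 1, 1)}, "out-2024")
store.add("icon_output", {"date": datetime(2025, 1, 1)}, "out-2025")
```

In this sketch a one-off node is simply an array with zero dimensions, so `store["grid_file", {}]` and `store["icon_output", {"date": datetime(2025, 1, 1)}]` go through the same code path and the store needs no special case for undated nodes.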