---
title: "Serializing Pipelines"
id: serialization
slug: "/serialization"
description: "Save your pipelines into a custom format and explore the serialization options."
---

# Serializing Pipelines

Save your pipelines into a custom format and explore the serialization options.

Serialization means converting a pipeline to a format that you can save on your disk and load later.

Haystack supports the YAML format for pipeline serialization.

## Converting a Pipeline to YAML

Use the `dumps()` method to convert a `Pipeline` object to YAML:

```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

# Prints:
#
# components: {}
# connections: []
# max_runs_per_component: 100
# metadata: {}
```

You can also use the `dump()` method to save the YAML representation of a pipeline to a file:

```python
with open("/content/test.yml", "w") as file:
    pipe.dump(file)
```

## Converting a Pipeline Back to Python

You can convert a YAML pipeline back into Python. Use the `loads()` method to convert a string representation of a pipeline (`str`, `bytes`, or `bytearray`), or the `load()` method to convert a pipeline represented in a file-like object, into the corresponding Python object.

Both loading methods support callbacks that let you modify components during the deserialization process.
Here is an example script:

```python
from typing import Any, Dict, Type

from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks

# This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_runs_per_component: 100
metadata: {}
"""


def component_pre_init_callback(
    component_name: str,
    component_cls: Type,
    init_params: Dict[str, Any],
):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__
        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")


pipe = Pipeline.loads(
    pipeline_yaml,
    callbacks=DeserializationCallbacks(component_pre_init_callback),
)
```

## Default Serialization Behavior

The serialization system uses `default_to_dict` and `default_from_dict` to handle many object types automatically. You typically do **not** need to implement custom `to_dict`/`from_dict` for:

- **Secrets**: serialized and deserialized automatically so that sensitive values aren't stored in plain text.
- **ComponentDevice**: device configuration is detected and restored automatically.
- **Objects with their own `to_dict`/`from_dict`**: any init parameter whose type defines `to_dict()` is serialized by calling it; any dict in `init_parameters` with a `type` key pointing to a class with `from_dict()` is deserialized automatically.

To serialize or deserialize a single component, you can use `component_to_dict` and `component_from_dict` from `haystack.core.serialization`. They use the default behavior above as a fallback when the component doesn't define custom `to_dict`/`from_dict`:

```python
from haystack import component
from haystack.core.serialization import component_from_dict, component_to_dict


@component
class Greeter:
    def __init__(self, message: str = "Hello"):
        self.message = message

    @component.output_types(greeting=str)
    def run(self, name: str):
        return {"greeting": f"{self.message}, {name}!"}


# Serialize a component instance to a dictionary
greeter = Greeter(message="Hi")
data = component_to_dict(greeter, "my_greeter")

# Deserialize back to a component instance
restored = component_from_dict(Greeter, data, "my_greeter")
assert restored.message == greeter.message
```

:::caution[Init parameters must be stored as instance attributes]

Default serialization only works when there is a **1:1 mapping** between init parameter names and instance attributes. For every argument in `__init__`, the component must assign it to an attribute with the same name: if you have `def __init__(self, prompt: str)`, the class must set `self.prompt = prompt`. Otherwise, the serialization logic can't find the value to serialize and either raises an error or falls back to the default value if the parameter has one.
:::

## Performing Custom Serialization

Pipelines and components in Haystack can serialize simple components, including custom ones, out of the box.
Code like this just works:

```python
from haystack import component


@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return word * self.times
```

On the other hand, this code doesn't work if the final format is JSON, as the `set` type is not JSON-serializable:

```python
from haystack import component


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)
```

In such cases, you can provide your own implementation of `to_dict` and `from_dict` for the component:

```python
from haystack import component, default_from_dict, default_to_dict


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON.
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list stored in the dict representation back into a set.
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```

## Saving a Pipeline to a Custom Format

Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.
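As a quick illustration of what such a class involves, here is a stdlib-only JSON sketch. The class name `JsonMarshaller` is illustrative, not part of Haystack; the two methods are all a marshaller needs, whatever the target format:

```python
import json
from typing import Any, Dict, Union


class JsonMarshaller:
    """Illustrative marshaller converting between a dict and JSON text."""

    def marshal(self, dict_: Dict[str, Any]) -> str:
        # Convert the pipeline dictionary to a JSON string.
        return json.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        # Convert JSON text (or bytes) back to a dictionary.
        return dict(json.loads(data_))
```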
A `Marshaller` is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the `Marshaller` [protocol](https://github.com/deepset-ai/haystack/blob/main/haystack/marshal/protocol.py), providing the methods `marshal` and `unmarshal`.

This is the code for a custom TOML marshaller that relies on the `rtoml` library:

```python
# This code requires `pip install rtoml`
from typing import Any, Dict, Union

import rtoml


class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```

You can then pass a Marshaller instance to the `dump`, `dumps`, `load`, and `loads` methods:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())
# Returns:
# 'max_runs_per_component = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```

## Additional References

:notebook: Tutorial: [Serializing LLM Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)