serialization.mdx
---
title: "Serializing Pipelines"
id: serialization
slug: "/serialization"
description: "Save your pipelines into a custom format and explore the serialization options."
---

# Serializing Pipelines

Save your pipelines into a custom format and explore the serialization options.

Serialization means converting a pipeline to a format that you can save on your disk and load later.

:::info[Serialization formats]

Haystack 2.0 only supports YAML format at this time. We will be rolling out more formats gradually.
:::

## Converting a Pipeline to YAML

Use the `dumps()` method to convert a `Pipeline` object to YAML:

```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

## Prints:
##
## components: {}
## connections: []
## max_runs_per_component: 100
## metadata: {}
```

You can also use the `dump()` method to save the YAML representation of a pipeline to a file:

```python
with open("/content/test.yml", "w") as file:
    pipe.dump(file)
```

## Converting a Pipeline Back to Python

You can convert a YAML pipeline back into Python. Use the `loads()` method to convert a string representation of a pipeline (`str`, `bytes`, or `bytearray`), or the `load()` method to convert a pipeline stored in a file-like object, back into the corresponding Python object.

Both loading methods support callbacks that let you modify components during the deserialization process.

Here is an example script:

```python
from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks
from typing import Type, Dict, Any

## This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_runs_per_component: 100
metadata: {}
"""


def component_pre_init_callback(
    component_name: str,
    component_cls: Type,
    init_params: Dict[str, Any],
):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__
        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")


pipe = Pipeline.loads(
    pipeline_yaml,
    callbacks=DeserializationCallbacks(component_pre_init_callback),
)
```
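If you saved a pipeline to a file with `dump()`, the counterpart `load()` reads it back from a file-like object. Here is a minimal sketch that reuses the `/content/test.yml` path from the earlier example; the optional `callbacks` argument works the same way as with `loads()`:

```python
from haystack import Pipeline

# Read the pipeline back from the file written earlier with `dump()`.
with open("/content/test.yml", "r") as file:
    pipe = Pipeline.load(file)

print(pipe)
```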
## Performing Custom Serialization

Pipelines and components in Haystack can serialize simple components, including custom ones, out of the box. Code like this just works:

```python
from haystack import component


@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return word * self.times
```

On the other hand, this code doesn't work if the final format is JSON, as the `set` type is not JSON-serializable:

```python
from haystack import component


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)
```

In such cases, you can provide your own `to_dict` and `from_dict` implementations for the component:

```python
from haystack import component, default_from_dict, default_to_dict


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON.
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list back into a set before initializing the component.
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```

## Saving a Pipeline to a Custom Format

Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.

A `Marshaller` is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the `Marshaller` [protocol](https://github.com/deepset-ai/haystack/blob/main/haystack/marshal/protocol.py), providing the methods `marshal` and `unmarshal`.

This is the code for a custom TOML marshaller that relies on the `rtoml` library:

```python
## This code requires `pip install rtoml`
from typing import Dict, Any, Union
import rtoml


class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```

You can then pass a Marshaller instance to the methods `dump`, `dumps`, `load`, and `loads`:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())
## Returns:
## 'max_runs_per_component = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```

The same marshaller instance can also be passed when loading a pipeline back; a short round-trip sketch is included at the end of this page.

## Additional References

:notebook: Tutorial: [Serializing LLM Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
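Here is the round-trip sketch referenced above. It assumes that `loads()` accepts the marshaller right after the serialized data, mirroring how `dumps()` is called in the previous example:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

marshaller = TomlMarshaller()

# Serialize an empty pipeline to TOML, then load it back with the same marshaller.
pipe = Pipeline()
toml_data = pipe.dumps(marshaller)
restored_pipe = Pipeline.loads(toml_data, marshaller)

print(restored_pipe)
```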