---
title: "Serializing Pipelines"
id: serialization
slug: "/serialization"
description: "Save your pipelines into a custom format and explore the serialization options."
---

# Serializing Pipelines

Save your pipelines into a custom format and explore the serialization options.

Serialization means converting a pipeline to a format that you can save on your disk and load later.

Haystack supports the YAML format for pipeline serialization.

## Converting a Pipeline to YAML

Use the `dumps()` method to convert a `Pipeline` object to YAML:

```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

# Prints:
#
# components: {}
# connections: []
# max_runs_per_component: 100
# metadata: {}
```

You can also use the `dump()` method to save the YAML representation of a pipeline to a file:

```python
with open("/content/test.yml", "w") as file:
    pipe.dump(file)
```

## Converting a Pipeline Back to Python

You can convert a YAML pipeline back into Python. Use the `loads()` method to convert a string representation of a pipeline (`str`, `bytes`, or `bytearray`), or the `load()` method to convert a pipeline represented in a file-like object, into the corresponding Python object.

Both loading methods support callbacks that let you modify components during the deserialization process.
Here is an example script:

```python
from typing import Any, Dict, Type

from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks

# This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_runs_per_component: 100
metadata: {}
"""


def component_pre_init_callback(
    component_name: str,
    component_cls: Type,
    init_params: Dict[str, Any],
):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__
        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")


pipe = Pipeline.loads(
    pipeline_yaml,
    callbacks=DeserializationCallbacks(component_pre_init_callback),
)
```

## Default Serialization Behavior

The serialization system uses `default_to_dict` and `default_from_dict` to handle many object types automatically. You typically do **not** need to implement custom `to_dict`/`from_dict` for:

- **Secrets**: serialized and deserialized automatically so that sensitive values aren't stored in plain text.
- **ComponentDevice**: device configuration is detected and restored automatically.
- **Objects with their own `to_dict`/`from_dict`**: any init parameter whose type defines `to_dict()` is serialized by calling it; any dict in `init_parameters` with a `type` key pointing to a class with `from_dict()` is deserialized automatically.

To serialize or deserialize a single component, you can use `component_to_dict` and `component_from_dict` from `haystack.core.serialization`. They use the default behavior above as a fallback when the component doesn't define custom `to_dict`/`from_dict`:

```python
from haystack import component
from haystack.core.serialization import component_from_dict, component_to_dict


@component
class Greeter:
    def __init__(self, message: str = "Hello"):
        self.message = message

    @component.output_types(greeting=str)
    def run(self, name: str):
        return {"greeting": f"{self.message}, {name}!"}


# Serialize a component instance to a dictionary
greeter = Greeter(message="Hi")
data = component_to_dict(greeter, "my_greeter")

# Deserialize back to a component instance
restored = component_from_dict(Greeter, data, "my_greeter")
assert restored.message == greeter.message
```

:::caution[Init parameters must be stored as instance attributes]

Default serialization only works when there is a **1:1 mapping** between init parameter names and instance attributes. For every argument in `__init__`, the component must assign it to an attribute with the same name: if you have `def __init__(self, prompt: str)`, the class must set `self.prompt = prompt`. Otherwise, the serialization logic can't find the value to serialize and either raises an error or falls back to the default value if the parameter has one.
:::

## Performing Custom Serialization

Pipelines and components in Haystack can serialize simple components, including custom ones, out of the box.
Code like this just works:

```python
from haystack import component


@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return word * self.times
```

On the other hand, this code doesn't work if the final format is JSON, as the `set` type is not JSON-serializable:

```python
from haystack import component


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)
```

In such cases, you can provide your own implementation of `to_dict` and `from_dict` for the component:

```python
from haystack import component, default_from_dict, default_to_dict


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON.
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list stored in the dict representation back into a set.
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```

## Saving a Pipeline to a Custom Format

Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.
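As a quick illustration of what such a class involves, here is a stdlib-only JSON sketch. The class name `JsonMarshaller` is illustrative, not part of Haystack; the two methods are all a marshaller needs, whatever the target format:

```python
import json
from typing import Any, Dict, Union


class JsonMarshaller:
    """Illustrative marshaller converting between a dict and JSON text."""

    def marshal(self, dict_: Dict[str, Any]) -> str:
        # Convert the pipeline dictionary to a JSON string.
        return json.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        # Convert JSON text (or bytes) back to a dictionary.
        return dict(json.loads(data_))
```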
A `Marshaller` is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the `Marshaller` [protocol](https://github.com/deepset-ai/haystack/blob/main/haystack/marshal/protocol.py), providing the methods `marshal` and `unmarshal`.

This is the code for a custom TOML marshaller that relies on the `rtoml` library:

```python
# This code requires `pip install rtoml`
from typing import Any, Dict, Union

import rtoml


class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```

You can then pass a Marshaller instance to the `dump`, `dumps`, `load`, and `loads` methods:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())
# Returns:
# 'max_runs_per_component = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```

## Additional References

:notebook: Tutorial: [Serializing LLM Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)