serialization.mdx
---
title: "Serializing Pipelines"
id: serialization
slug: "/serialization"
description: "Save your pipelines into a custom format and explore the serialization options."
---

# Serializing Pipelines

Save your pipelines into a custom format and explore the serialization options.

Serialization means converting a pipeline to a format that you can save on your disk and load later.

:::info[Serialization formats]

Haystack 2.0 only supports YAML format at this time. We will be rolling out more formats gradually.
:::

## Converting a Pipeline to YAML

Use the `dumps()` method to convert a `Pipeline` object to YAML:

```python
from haystack import Pipeline

pipe = Pipeline()
print(pipe.dumps())

## Prints:
##
## components: {}
## connections: []
## max_runs_per_component: 100
## metadata: {}
```

You can also use the `dump()` method to save the YAML representation of a pipeline to a file:

```python
with open("/content/test.yml", "w") as file:
    pipe.dump(file)
```

## Converting a Pipeline Back to Python

You can convert a YAML pipeline back into Python. Use the `loads()` method to convert a string representation of a pipeline (`str`, `bytes`, or `bytearray`), or the `load()` method to convert a pipeline stored in a file-like object, back into the corresponding Python object.

Both loading methods support callbacks that let you modify components during the deserialization process.

Here is an example script:

```python
from haystack import Pipeline
from haystack.core.serialization import DeserializationCallbacks
from typing import Type, Dict, Any

## This is the YAML you want to convert to Python:
pipeline_yaml = """
components:
  cleaner:
    init_parameters:
      remove_empty_lines: true
      remove_extra_whitespaces: true
      remove_regex: null
      remove_repeated_substrings: false
      remove_substrings: null
    type: haystack.components.preprocessors.document_cleaner.DocumentCleaner
  converter:
    init_parameters:
      encoding: utf-8
    type: haystack.components.converters.txt.TextFileToDocument
connections:
- receiver: cleaner.documents
  sender: converter.documents
max_runs_per_component: 100
metadata: {}
"""


def component_pre_init_callback(
    component_name: str,
    component_cls: Type,
    init_params: Dict[str, Any],
):
    # This function gets called every time a component is deserialized.
    if component_name == "cleaner":
        assert "DocumentCleaner" in component_cls.__name__
        # Modify the init parameters. The modified parameters are passed to
        # the init method of the component during deserialization.
        init_params["remove_empty_lines"] = False
        print("Modified 'remove_empty_lines' to False in 'cleaner' component")
    else:
        print(f"Not modifying component {component_name} of class {component_cls}")


pipe = Pipeline.loads(
    pipeline_yaml,
    callbacks=DeserializationCallbacks(component_pre_init_callback),
)
```
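If you saved a pipeline to a file with `dump()`, the counterpart `load()` reads it back from a file-like object. Here is a minimal sketch that reuses the `/content/test.yml` path from the earlier example; the optional `callbacks` argument works the same way as with `loads()`:

```python
from haystack import Pipeline

# Read the pipeline back from the file written earlier with `dump()`.
with open("/content/test.yml", "r") as file:
    pipe = Pipeline.load(file)

print(pipe)
```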
## Performing Custom Serialization

Pipelines and components in Haystack can serialize simple components, including custom ones, out of the box. Code like this just works:

```python
from haystack import component


@component
class RepeatWordComponent:
    def __init__(self, times: int):
        self.times = times

    @component.output_types(result=str)
    def run(self, word: str):
        return word * self.times
```

On the other hand, this code doesn't work if the final format is JSON, as the `set` type is not JSON-serializable:

```python
from haystack import component


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)
```

In such cases, you can provide your own `to_dict` and `from_dict` implementations for the component:

```python
from haystack import component, default_from_dict, default_to_dict


@component
class SetIntersector:
    def __init__(self, intersect_with: set):
        self.intersect_with = intersect_with

    @component.output_types(result=set)
    def run(self, data: set):
        return data.intersection(self.intersect_with)

    def to_dict(self):
        # Convert the set into a list for the dict representation,
        # so it can be converted to JSON.
        return default_to_dict(self, intersect_with=list(self.intersect_with))

    @classmethod
    def from_dict(cls, data):
        # Convert the list back into a set before initializing the component.
        data["init_parameters"]["intersect_with"] = set(data["init_parameters"]["intersect_with"])
        return default_from_dict(cls, data)
```

## Saving a Pipeline to a Custom Format

Once a pipeline is available in its dictionary format, the last step of serialization is to convert that dictionary into a format you can store or send over the wire. Haystack supports YAML out of the box, but if you need a different format, you can write a custom Marshaller.

A `Marshaller` is a Python class responsible for converting text to a dictionary and a dictionary to text according to a certain format. Marshallers must respect the `Marshaller` [protocol](https://github.com/deepset-ai/haystack/blob/main/haystack/marshal/protocol.py), providing the methods `marshal` and `unmarshal`.

This is the code for a custom TOML marshaller that relies on the `rtoml` library:

```python
## This code requires `pip install rtoml`
from typing import Dict, Any, Union
import rtoml


class TomlMarshaller:
    def marshal(self, dict_: Dict[str, Any]) -> str:
        return rtoml.dumps(dict_)

    def unmarshal(self, data_: Union[str, bytes]) -> Dict[str, Any]:
        return dict(rtoml.loads(data_))
```

You can then pass a Marshaller instance to the methods `dump`, `dumps`, `load`, and `loads`:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

pipe = Pipeline()
pipe.dumps(TomlMarshaller())
## Returns:
## 'max_runs_per_component = 100\nconnections = []\n\n[metadata]\n\n[components]\n'
```

The same marshaller instance can also be passed when loading a pipeline back; a short round-trip sketch is included at the end of this page.

## Additional References

:notebook: Tutorial: [Serializing LLM Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
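Here is the round-trip sketch referenced above. It assumes that `loads()` accepts the marshaller right after the serialized data, mirroring how `dumps()` is called in the previous example:

```python
from haystack import Pipeline
from my_custom_marshallers import TomlMarshaller

marshaller = TomlMarshaller()

# Serialize an empty pipeline to TOML, then load it back with the same marshaller.
pipe = Pipeline()
toml_data = pipe.dumps(marshaller)
restored_pipe = Pipeline.loads(toml_data, marshaller)

print(restored_pipe)
```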