audio_api.md
---
title: "Audio"
id: audio-api
description: "Transcribes audio files."
slug: "/audio-api"
---

## whisper_local

### LocalWhisperTranscriber

Transcribes audio files using OpenAI's Whisper model on your local machine.

For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
[GitHub repository](https://github.com/openai/whisper).

### Usage example

```python
from haystack.components.audio import LocalWhisperTranscriber

whisper = LocalWhisperTranscriber(model="small")
transcription = whisper.run(sources=["test/test_files/audio/answer.wav"])
```

#### __init__

```python
__init__(
    model: WhisperLocalModel = "large",
    device: ComponentDevice | None = None,
    whisper_params: dict[str, Any] | None = None,
) -> None
```

Creates an instance of the LocalWhisperTranscriber component.

**Parameters:**

- **model** (<code>WhisperLocalModel</code>) – The name of the model to use. Set to one of the following models:
  "tiny", "base", "small", "medium", "large" (default).
  For details on the models and their modifications, see the
  [Whisper documentation](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages).
- **device** (<code>ComponentDevice | None</code>) – The device for loading the model. If `None`, automatically selects the default device.
- **whisper_params** (<code>dict\[str, Any\] | None</code>) – Additional parameters to pass to the Whisper model.
  For the supported audio formats, languages, and other parameters, see the
  [Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
  [GitHub repository](https://github.com/openai/whisper).

#### warm_up

```python
warm_up() -> None
```

Loads the model in memory.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LocalWhisperTranscriber
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>LocalWhisperTranscriber</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    whisper_params: dict[str, Any] | None = None,
) -> dict[str, Any]
```

Transcribes a list of audio files into a list of documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of paths or binary streams to transcribe.
- **whisper_params** (<code>dict\[str, Any\] | None</code>) – For the supported audio formats, languages, and other parameters, see the
  [Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
  [GitHub repository](https://github.com/openai/whisper).

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: A list of documents where each document is a transcribed audio file. The content of
    the document is the transcription text, and the document's metadata contains the values returned by
    the Whisper model, such as the alignment data and the path to the audio file used
    for the transcription.

#### transcribe

```python
transcribe(
    sources: list[str | Path | ByteStream], **kwargs: Any
) -> list[Document]
```

Transcribes the audio files into a list of Documents, one for each input file.

For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
[GitHub repository](https://github.com/openai/whisper).

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of paths or binary streams to transcribe.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents, one for each file.
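Putting the methods above together, here is a minimal sketch of a local transcription flow. It is illustrative only: it assumes the `openai-whisper` dependency is installed and that the audio path from the usage example exists, and the `language` entry in `whisper_params` is just one example of a parameter forwarded to the local Whisper model.

```python
from haystack.components.audio import LocalWhisperTranscriber

# Minimal sketch: load a local Whisper model and transcribe one file.
whisper = LocalWhisperTranscriber(model="small")
whisper.warm_up()  # loads the model into memory before the first run

result = whisper.run(
    sources=["test/test_files/audio/answer.wav"],  # path taken from the usage example above
    whisper_params={"language": "en"},             # assumed example of a Whisper parameter
)

for doc in result["documents"]:
    print(doc.content)  # the transcription text
    print(doc.meta)     # Whisper output such as alignment data and the source file path
```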
## whisper_remote

### RemoteWhisperTranscriber

Transcribes audio files using OpenAI's Whisper API.

The component requires an OpenAI API key. See the
[OpenAI documentation](https://platform.openai.com/docs/api-reference/authentication) for more details.
For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text).

### Usage example

```python
from haystack.components.audio import RemoteWhisperTranscriber

whisper = RemoteWhisperTranscriber(model="whisper-1")
transcription = whisper.run(sources=["test/test_files/audio/answer.wav"])
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "whisper-1",
    api_base_url: str | None = None,
    organization: str | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    **kwargs: Any
) -> None
```

Creates an instance of the RemoteWhisperTranscriber component.

**Parameters:**

- **api_key** (<code>Secret</code>) – OpenAI API key.
  You can set it with the `OPENAI_API_KEY` environment variable or pass it with this parameter
  during initialization.
- **model** (<code>str</code>) – Name of the model to use. Currently accepts only `whisper-1`.
- **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's documentation on
  [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization).
- **api_base_url** (<code>str | None</code>) – An optional URL to use as the API base. For details, see the
  OpenAI [documentation](https://platform.openai.com/docs/api-reference/audio).
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **kwargs** (<code>Any</code>) – Other optional parameters for the model. These are sent directly to the OpenAI
  endpoint. See the OpenAI [documentation](https://platform.openai.com/docs/api-reference/audio) for more details.
  Some of the supported parameters, illustrated in the sketch after this list, are:
  - `language`: The language of the input audio.
    Provide the input language in ISO-639-1 format
    to improve transcription accuracy and latency.
  - `prompt`: An optional text to guide the model's
    style or continue a previous audio segment.
    The prompt should match the audio language.
  - `response_format`: The format of the transcript
    output. This component only supports `json`.
  - `temperature`: The sampling temperature, between 0
    and 1. Higher values like 0.8 make the output more
    random, while lower values like 0.2 make it more
    focused and deterministic. If set to 0, the model
    uses log probability to automatically increase the
    temperature until certain thresholds are hit.
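A hedged sketch of passing these keyword arguments at initialization is shown below. It assumes the `OPENAI_API_KEY` environment variable is set; the specific `language`, `prompt`, and `temperature` values are placeholders, and the audio path is taken from the usage example above.

```python
from haystack.components.audio import RemoteWhisperTranscriber
from haystack.utils import Secret

# Sketch: forward optional Whisper parameters to the OpenAI endpoint.
# Assumes OPENAI_API_KEY is set; the parameter values are placeholders.
whisper = RemoteWhisperTranscriber(
    api_key=Secret.from_env_var("OPENAI_API_KEY"),
    model="whisper-1",
    language="en",           # ISO-639-1 language hint
    prompt="Haystack demo",  # optional text to guide the model's style
    temperature=0.0,
)

result = whisper.run(sources=["test/test_files/audio/answer.wav"])
print(result["documents"][0].content)
```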
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RemoteWhisperTranscriber
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RemoteWhisperTranscriber</code> – The deserialized component.

#### run

```python
run(sources: list[str | Path | ByteStream]) -> dict[str, Any]
```

Transcribes the list of audio files into a list of documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or `ByteStream` objects containing the audio files to transcribe.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: A list of documents, one document for each file.
    The content of each document is the transcribed text.
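To round out the `to_dict` / `from_dict` reference above, here is a small serialization round-trip sketch. It is illustrative only: it assumes `OPENAI_API_KEY` is available in the environment, since an API key configured with `Secret.from_env_var` is serialized as an environment-variable reference rather than as the raw key and must be resolvable again when the component is rebuilt.

```python
from haystack.components.audio import RemoteWhisperTranscriber

# Sketch of a serialization round trip; assumes OPENAI_API_KEY is set in the environment.
whisper = RemoteWhisperTranscriber(model="whisper-1")

data = whisper.to_dict()                             # dict[str, Any] with the init parameters
restored = RemoteWhisperTranscriber.from_dict(data)  # rebuilds an equivalent component

assert isinstance(restored, RemoteWhisperTranscriber)
```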