---
title: "Audio"
id: audio-api
description: "Transcribes audio files."
slug: "/audio-api"
---

## whisper_local

### LocalWhisperTranscriber

Transcribes audio files using OpenAI's Whisper model on your local machine.

For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
[GitHub repository](https://github.com/openai/whisper).

### Usage example

```python
from haystack.components.audio import LocalWhisperTranscriber

whisper = LocalWhisperTranscriber(model="small")
transcription = whisper.run(sources=["test/test_files/audio/answer.wav"])
```

#### __init__

```python
__init__(
    model: WhisperLocalModel = "large",
    device: ComponentDevice | None = None,
    whisper_params: dict[str, Any] | None = None,
)
```

Creates an instance of the LocalWhisperTranscriber component.

**Parameters:**

- **model** (<code>WhisperLocalModel</code>) – The name of the model to use. Set to one of the following models:
  "tiny", "base", "small", "medium", "large" (default).
  For details on the models and their modifications, see the
  [Whisper documentation](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages).
- **device** (<code>ComponentDevice | None</code>) – The device for loading the model. If `None`, automatically selects the default device.
- **whisper_params** (<code>dict\[str, Any\] | None</code>) – Additional parameters for the Whisper model. For the supported
  audio formats, languages, and other parameters, see the
  [Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text).

#### warm_up

```python
warm_up() -> None
```

Loads the model into memory.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LocalWhisperTranscriber
```

Deserializes the component from a dictionary.
**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>LocalWhisperTranscriber</code> – The deserialized component.

#### run

```python
run(
    sources: list[str | Path | ByteStream],
    whisper_params: dict[str, Any] | None = None,
)
```

Transcribes a list of audio files into a list of documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of paths or binary streams to transcribe.
- **whisper_params** (<code>dict\[str, Any\] | None</code>) – Additional parameters for the Whisper model.
  For the supported audio formats, languages, and other parameters, see the
  [Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
  [GitHub repository](https://github.com/openai/whisper).

**Returns:**

- A dictionary with the following keys:
  - `documents`: A list of documents where each document is a transcribed audio file. The content of
    the document is the transcription text, and the document's metadata contains the values returned by
    the Whisper model, such as the alignment data and the path to the audio file used
    for the transcription.

#### transcribe

```python
transcribe(
    sources: list[str | Path | ByteStream],
    **kwargs: Any
) -> list[Document]
```

Transcribes the audio files into a list of Documents, one for each input file.

For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text) and the official Whisper
[GitHub repository](https://github.com/openai/whisper).

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of paths or binary streams to transcribe.
**Returns:**

- <code>list\[Document\]</code> – A list of Documents, one for each file.

## whisper_remote

### RemoteWhisperTranscriber

Transcribes audio files using OpenAI's Whisper API.

The component requires an OpenAI API key; see the
[OpenAI documentation](https://platform.openai.com/docs/api-reference/authentication) for more details.
For the supported audio formats, languages, and other parameters, see the
[Whisper API documentation](https://platform.openai.com/docs/guides/speech-to-text).

### Usage example

```python
from haystack.components.audio import RemoteWhisperTranscriber

whisper = RemoteWhisperTranscriber(model="whisper-1")
transcription = whisper.run(sources=["test/test_files/audio/answer.wav"])
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "whisper-1",
    api_base_url: str | None = None,
    organization: str | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    **kwargs: Any
)
```

Creates an instance of the RemoteWhisperTranscriber component.

**Parameters:**

- **api_key** (<code>Secret</code>) – OpenAI API key.
  You can set it with the `OPENAI_API_KEY` environment variable or pass it with this parameter
  during initialization.
- **model** (<code>str</code>) – Name of the model to use. Currently accepts only `whisper-1`.
- **api_base_url** (<code>str | None</code>) – An optional URL to use as the API base. For details, see the
  OpenAI [documentation](https://platform.openai.com/docs/api-reference/audio).
- **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's documentation on
  [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization).
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **kwargs** – Other optional parameters for the model. These are sent directly to the OpenAI
  endpoint. See the OpenAI [documentation](https://platform.openai.com/docs/api-reference/audio) for more details.
  Some of the supported parameters are:
  - `language`: The language of the input audio.
    Provide the input language in ISO-639-1 format
    to improve transcription accuracy and latency.
  - `prompt`: An optional text to guide the model's
    style or continue a previous audio segment.
    The prompt should match the audio language.
  - `response_format`: The format of the transcript
    output. This component only supports `json`.
  - `temperature`: The sampling temperature, between 0
    and 1. Higher values like 0.8 make the output more
    random, while lower values like 0.2 make it more
    focused and deterministic. If set to 0, the model
    uses log probability to automatically increase the
    temperature until certain thresholds are hit.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> RemoteWhisperTranscriber
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>RemoteWhisperTranscriber</code> – The deserialized component.
#### run

```python
run(sources: list[str | Path | ByteStream])
```

Transcribes the list of audio files into a list of documents.

**Parameters:**

- **sources** (<code>list\[str | Path | ByteStream\]</code>) – A list of file paths or `ByteStream` objects containing the audio files to transcribe.

**Returns:**

- A dictionary with the following keys:
  - `documents`: A list of documents, one document for each file.
    The content of each document is the transcribed text.