Azure Speech-to-Text (STT) and Text-to-Speech (TTS)

Published: 09-04 Editor: Microsoft

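By obtaining services through an official Microsoft partner, enterprise users can use Microsoft Speech TTS, ChatGPT, and other services in a compliant and stable way, meet domestic invoicing requirements, and resolve issues such as unstable connections, slow responses, and low concurrency quotas.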

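How can latency in Azure Speech-to-Text (STT) and Text-to-Speech (TTS) be reduced?

Latency in speech recognition and synthesis can be a major obstacle to building seamless, efficient applications.

Reducing latency not only improves the user experience but also raises the overall performance of real-time applications. This article covers strategies for reducing latency in general transcription, real-time transcription, file transcription, and speech synthesis.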

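1. Network latency: move speech resources closer to the application

One of the main contributors to speech recognition latency is network latency. To mitigate it, keep the distance between the application and the speech recognition resource as short as possible. Some tips:

Speech containers: these let you run models on-premises or at the edge, so audio data does not have to travel to the cloud, which reduces network latency.

Choose a nearby cloud region: pick a cloud service provider whose data centers are located close to your users. This can significantly reduce network latency.

Use embedded speech: a compact model designed for on-device scenarios where internet connectivity is limited or unavailable, which largely removes network latency. It can come with a slight drop in accuracy, however. For the best accuracy, consider a hybrid approach: use Azure AI Speech through the cloud when a network connection is available and switch to embedded speech when it is not. This gives high-quality, accurate speech processing along with a reliable fallback.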
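The hybrid approach can be sketched as a small selection step that runs before each session. This is a minimal sketch, not part of the Speech SDK: the function names are hypothetical, and the host name is an assumed regional endpoint that you would replace with the one for your own resource.

```python
import socket
from typing import Callable


def cloud_reachable(host: str = "centralindia.stt.speech.microsoft.com",
                    port: int = 443,
                    timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to the (assumed) endpoint opens in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def choose_backend(is_online: Callable[[], bool] = cloud_reachable) -> str:
    """Pick "cloud" when the network check passes, otherwise "embedded"."""
    return "cloud" if is_online() else "embedded"
```

The caller would then build either a cloud-backed or an embedded recognizer depending on the returned label. Passing the check in as a parameter keeps the decision easy to test and lets you swap in a cheaper signal, such as the device's own connectivity state.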

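2. Real-time transcription

Real-time transcription must process audio input immediately to give instant feedback. Here are some recommendations for keeping its latency low:

2.1 Use real-time streaming

Instead of recording the entire audio and then processing it, use real-time streaming to send audio data to the speech recognition service in small chunks. Processing can then start right away, which reduces overall latency.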

    import os

    import azure.cognitiveservices.speech as speechsdk


    def speech_recognize_continuous_async_from_microphone():
        """Performs continuous speech recognition asynchronously with input from the microphone."""
        speech_config = speechsdk.SpeechConfig(
            subscription=os.getenv("SUBSCRIPTION_KEY"), region="centralIndia")
        speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

        done = False

        def recognized_cb(evt: speechsdk.SpeechRecognitionEventArgs):
            print('RECOGNIZED: {}'.format(evt.result.text))

        def stop_cb(evt: speechsdk.SessionEventArgs):
            """Callback that signals to stop continuous recognition."""
            print('CLOSING on {}'.format(evt))
            nonlocal done
            done = True

        # Connect callbacks to the events fired by the speech recognizer
        speech_recognizer.recognized.connect(recognized_cb)
        speech_recognizer.session_stopped.connect(stop_cb)

        # Other tasks can be performed on this thread while recognition starts...
        result_future = speech_recognizer.start_continuous_recognition_async()
        result_future.get()  # wait for the future, so we know engine initialization is done
        print('Continuous Recognition is now running, say something.')

        while not done:
            print('type "stop" then enter when done')
            stop = input()
            if stop.lower() == "stop":
                print('Stopping async recognition.')
                speech_recognizer.stop_continuous_recognition_async()
                break

        print("recognition stopped, main thread can exit now.")


    speech_recognize_continuous_async_from_microphone()

The Azure Speech SDK also provides a way to stream audio into the recognizer as an alternative to microphone or file input. Choose between PushAudioInputStream and PullAudioInputStream depending on your needs.
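To make the "small chunks" idea concrete, the sketch below cuts a raw audio source into fixed-duration blocks and hands each block to a sink. It assumes 16 kHz, 16-bit, mono PCM input, and every function name here is a hypothetical helper rather than an SDK call.

```python
import io
from typing import BinaryIO, Callable, Iterator

# Assumed input format: 16 kHz, 16-bit, mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def chunk_size_bytes(chunk_ms: int) -> int:
    """Number of bytes that hold chunk_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_audio_chunks(source: BinaryIO, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive blocks of raw audio until the source is exhausted."""
    size = chunk_size_bytes(chunk_ms)
    while True:
        block = source.read(size)
        if not block:
            break
        yield block


def feed(source: BinaryIO, write: Callable[[bytes], None], chunk_ms: int = 100) -> int:
    """Send every block to the sink and return how many blocks were sent."""
    sent = 0
    for block in iter_audio_chunks(source, chunk_ms):
        write(block)
        sent += 1
    return sent
```

With the Speech SDK, `write` could be the `write` method of a `speechsdk.audio.PushAudioInputStream`, followed by a call to its `close()` once the source is drained. A push stream fits when your code already receives audio from elsewhere (a socket, a telephony bridge); a pull stream fits when the SDK should ask your code for data.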

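2.2 Define the default language

If the default language is known, define it when transcription starts. This avoids the extra processing time needed to detect the input language.

If the default language is unknown, use "SpeechServiceConnection_LanguageIdMode" to detect the language at the start of transcription, and specify the list of expected languages to reduce processing time.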

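    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    # Option A: the language is known, so set it as the default
    speech_config.speech_recognition_language = "en-US"
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # OR

    # Option B: the language is unknown, so detect it once at the start from a short candidate list
    speech_config.set_property(
        property_id=speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode, value="AtStart")
    auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
        languages=["en-US", "gu-IN", "bn-IN", "mr-IN"])
    speech_recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config,
        auto_detect_source_language_config=auto_detect_source_language_config)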

2.3 Use asynchronous methods

Use the asynchronous methods: start_continuous_recognition_async instead of start_continuous_recognition, and stop_continuous_recognition_async instead of stop_continuous_recognition. These methods are non-blocking and reduce latency.

    speech_recognizer.start_continuous_recognition_async()
    # Perform other tasks
    speech_recognizer.stop_continuous_recognition_async()

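2.4 Use fast transcription

Fast transcription transcribes audio much faster than real-time streaming transcription and suits scenarios that need a transcript right away, such as call center analytics, meeting summarization, and dubbing. It can transcribe 30 minutes of audio in under a minute. Note, however, that it is in public preview and supports only a few locales.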

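3. File transcription

With file transcription, processing large audio files can introduce significant latency. Here are some strategies to reduce it:

3.1 Split the audio into small chunks

Split the audio file into small chunks and transcribe the chunks in parallel. This speeds up processing and cuts total transcription time. One caveat: depending on the chunking strategy, transcription quality may drop slightly. If the transcription layer is followed by an LLM layer for insight analysis, post-processing, and so on, the LLM should largely compensate for that drop.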

    import concurrent.futures

    from pydub import AudioSegment


    def transcribe_chunk(chunk):
        # Transcription logic for each chunk
        pass


    audio = AudioSegment.from_file("large_audio_file.wav")
    chunk_length_ms = 10000  # 10 seconds
    chunks = [audio[i:i + chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(transcribe_chunk, chunk) for chunk in chunks]
        # Read results in submission order so the transcript follows the audio order
        results = [f.result() for f in futures]
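Hard cuts at fixed offsets can split a word in two, which is one way the chunking strategy affects quality. One possible mitigation, shown here only as an assumption and not as something the article prescribes, is to let neighboring chunks overlap slightly. The helper name and the 500 ms default are hypothetical.

```python
from typing import List, Tuple


def chunk_bounds(total_ms: int, chunk_ms: int = 10_000,
                 overlap_ms: int = 500) -> List[Tuple[int, int]]:
    """Return (start, end) offsets in milliseconds for overlapping chunks."""
    if chunk_ms <= 0 or not 0 <= overlap_ms < chunk_ms:
        raise ValueError("need chunk_ms > 0 and 0 <= overlap_ms < chunk_ms")
    bounds = []
    start = 0
    step = chunk_ms - overlap_ms  # each chunk starts this far after the previous one
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        if end == total_ms:
            break  # the last chunk already reaches the end of the audio
        start += step
    return bounds
```

With pydub, each pair can be applied as a slice, for example `audio[start:end]`. The overlapping region is transcribed twice, so the merge step has to drop the duplicated words at each seam.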

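3.2 Increase the audio speed

Before sending an audio file for transcription, increase its playback speed. This reduces the time needed to process the whole file with little impact on transcription accuracy.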

    from pydub import AudioSegment


    def increase_audio_speed(filename, output_filename="modified_audio_file.wav", speed_change_factor=1.7):
        # Load your audio file (change to your file format)
        audio = AudioSegment.from_file(filename)
        # Reinterpret the samples at a higher frame rate: a factor above 1 speeds playback up,
        # a factor below 1 slows it down (this also shifts the pitch)
        new_audio = audio._spawn(audio.raw_data, overrides={'frame_rate': int(audio.frame_rate * speed_change_factor)})
        # Resample back to the original frame rate
        new_audio = new_audio.set_frame_rate(audio.frame_rate)
        # Export the modified audio (change to your desired format)
        new_audio.export(output_filename, format="wav")

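3.3 Compress the input audio

Compress the input audio before sending it for transcription. A smaller file transfers faster and makes better use of bandwidth and storage.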

    from pydub import AudioSegment

    input_audio = 'gujrati_tts.wav'
    output_audio = 'compressed_audio.mp3'

    try:
        # Load the audio file
        audio = AudioSegment.from_file(input_audio)
        # Export the audio file with a lower bitrate to compress it
        audio.export(output_audio, format="mp3", bitrate="64k")
        print(f"Compressed audio saved as {output_audio}")
    except Exception as e:
        print(f"An error occurred: {e}")
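A rough estimate shows how much transfer time compression can save. The sketch below is back-of-the-envelope arithmetic only; the 256 kbps figure assumes 16 kHz, 16-bit, mono PCM input, the 10 Mbps uplink is an arbitrary example, and both helper names are hypothetical.

```python
def audio_size_bytes(duration_s: float, bitrate_kbps: float) -> int:
    """Approximate size of an audio stream from its duration and bitrate."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def upload_time_s(size_bytes: int, uplink_mbps: float) -> float:
    """Ideal transfer time over a link of the given speed, ignoring overhead."""
    return size_bytes * 8 / (uplink_mbps * 1_000_000)


# 30 minutes of audio: uncompressed PCM versus 64 kbps MP3.
pcm_size = audio_size_bytes(30 * 60, 256)
mp3_size = audio_size_bytes(30 * 60, 64)
print(pcm_size, mp3_size)
print(upload_time_s(pcm_size, 10), upload_time_s(mp3_size, 10))
```

Under these assumptions the compressed file is a quarter of the size, so the upload takes about a quarter of the time. Real transfers add protocol overhead and encoding time, so treat the numbers as an upper bound on the benefit.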

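4. Speech synthesis

Latency in speech synthesis can be a bottleneck, especially in real-time applications. Here are some recommendations to reduce it:

4.1 Use asynchronous methods

Do not use speak_text_async for speech synthesis, because it blocks until the entire audio has been processed; use start_speaking_text_async instead. That method starts streaming the audio output as soon as the first audio chunk is received, which significantly reduces latency.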
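The gain from streaming output shows up as the gap between the time to the first audio chunk and the time to the complete audio. The helper below is a hypothetical measurement sketch in plain Python: it accepts any iterable of audio chunks, so it makes no assumption about how the chunks are produced.

```python
import time
from typing import Iterable, Tuple


def measure_stream_latency(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (seconds to first chunk, seconds to last chunk, total bytes)."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_s is None:
            # Playback could begin here, long before the stream is complete.
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_chunk_s is None:
        first_chunk_s = total_s  # empty stream: nothing ever arrived
    return first_chunk_s, total_s, total_bytes
```

With the Speech SDK, the iterable could be a generator that reads from a `speechsdk.AudioDataStream` built on the result of `start_speaking_text_async`. Comparing the first value against the second for typical inputs gives a direct measure of what streaming output buys in your own environment.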

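4.2 Text streaming

Text streaming lets the TTS system start processing and generating speech as soon as the first part of the text arrives, instead of waiting for the complete text. This reduces the initial delay before speech output begins, making it a good fit for interactive applications, live events, and responsive AI-driven conversations.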

    # gpt_client and speech_synthesizer are assumed to be initialized elsewhere

    # tts sentence end mark
    tts_sentence_end = [".", "!", "?", ";", "。", "！", "？", "；", "\n"]

    completion = gpt_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": ...},  # prompt text not shown
        ],
        stream=True,
    )

    collected_messages = []
    last_tts_request = None

    for chunk in completion:
        if len(chunk.choices) > 0:
            chunk_text = chunk.choices[0].delta.content
            if chunk_text:
                collected_messages.append(chunk_text)
                if chunk_text in tts_sentence_end:
                    # join the received message together to build a sentence
                    text = "".join(collected_messages).strip()
                    last_tts_request = speech_synthesizer.start_speaking_text_async(text).get()
                    collected_messages.clear()
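The sentence check in the snippet above only fires when a streamed chunk is exactly one punctuation mark. A slightly more tolerant variant, sketched here as an assumption about how one might harden it, scans every character so that a chunk such as "lo. How" still closes a sentence. The function name is hypothetical.

```python
from typing import Iterable, Iterator

# Same end-of-sentence marks as in the snippet above.
SENTENCE_END = (".", "!", "?", ";", "。", "！", "？", "；", "\n")


def iter_sentences(text_chunks: Iterable[str]) -> Iterator[str]:
    """Regroup streamed text fragments into sentences ready for synthesis."""
    buffer = ""
    for chunk in text_chunks:
        for ch in chunk:
            buffer += ch
            if ch in SENTENCE_END:
                sentence = buffer.strip()
                if sentence:
                    yield sentence
                buffer = ""
    tail = buffer.strip()
    if tail:
        yield tail  # flush text that never reached an end mark


for sentence in iter_sentences(["Hel", "lo. How", " are you", "? Bye"]):
    print(sentence)
```

Each yielded sentence can be passed to `start_speaking_text_async` as soon as it is complete. Splitting on every period is deliberately naive: it also breaks inside decimals and abbreviations, so a production version would need a smarter boundary rule.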

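4.3 Optimize the audio output format

Payload size affects latency. Using a compressed audio format saves network bandwidth, which is crucial when the network is unstable or bandwidth is limited. Uncompressed PCM is heavy: Riff24Khz16BitMonoPcm has a bitrate of 384 kbps, and Riff48Khz16BitMonoPcm doubles that to 768 kbps. When a PCM output format such as Riff48Khz16BitMonoPcm is selected, the Speech SDK automatically uses a compressed format for transmission, which reduces latency.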
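The bitrate of an uncompressed PCM format follows directly from its name: sample rate times bits per sample times channel count. The sketch below works through that arithmetic; the helper names are hypothetical, and the 48 kbps comparison point is just an example of a compressed bitrate.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16,
                     channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def payload_ratio(pcm_kbps: float, compressed_kbps: float) -> float:
    """How many times smaller the compressed payload is."""
    return pcm_kbps / compressed_kbps


print(pcm_bitrate_kbps(24_000))   # 24 kHz, 16-bit, mono
print(pcm_bitrate_kbps(48_000))   # 48 kHz, 16-bit, mono
print(payload_ratio(pcm_bitrate_kbps(24_000), 48))
```

By this arithmetic, 24 kHz 16-bit mono PCM comes to 384 kbps and 48 kHz to 768 kbps, so a 48 kbps compressed stream carries the same speech in one eighth or one sixteenth of the bytes.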

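By following these strategies, you can significantly reduce latency in STT and TTS applications and deliver a smoother, more efficient user experience. Applying these techniques helps keep your application responsive and performant even in real-time scenarios.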
