What is the ASR configuration and what configuration should I use?

Choosing the Right ASR Processing Time for Your Workflow in Deep Live Hub

What Is ASR (Automatic Speech Recognition)?

Automatic Speech Recognition (ASR) is the process of converting spoken words into text. In Deep Live Hub, the ASR configuration refers to the processing time allocated to the AI model that performs speech-to-text transcription.


ASR Processing Time Explained:

Just like humans, the AI needs time to process speech before converting it to text. In general, the longer the processing time, the more accurate the transcription, because the model can draw on more surrounding context. For example:

  • Punctuation and capitalization choices can depend on the context. The AI may decide to capitalize a word if it recognizes the start of a sentence, or to add punctuation marks based on sentence structure.
  • A longer processing time allows the AI to gather more context, leading to fewer errors in transcription.

However, longer processing times also introduce latency between the input (speech) and the output (transcription). For real-time applications, this delay can make it harder for the audience to follow the subtitles, so you need to balance accuracy against speed depending on your use case.
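
To make this trade-off concrete, here is a minimal sketch of the buffering idea in plain Python. This is not the Deep Live Hub API: `recognize`, `transcribe_stream`, and `window_s` are illustrative names, and each chunk is assumed to represent one second of audio.

```python
def recognize(buffered_chunks):
    # Stand-in for the speech-to-text model. With a longer buffer, a real
    # model sees more context and can fix casing and punctuation.
    return " ".join(buffered_chunks)

def transcribe_stream(chunks, window_s=5):
    """Emit one transcript segment per window_s seconds of audio.
    Each chunk stands for one second of speech, so the first subtitle
    appears only after window_s seconds of latency."""
    buffer = []
    for second, chunk in enumerate(chunks, start=1):
        buffer.append(chunk)
        if second % window_s == 0:  # window full: transcribe and flush
            yield recognize(buffer)
            buffer = []
    if buffer:  # trailing partial window
        yield recognize(buffer)
```

Doubling `window_s` gives the stand-in recognizer twice as much context per segment, but the first subtitle also arrives twice as late.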


Choosing the Right ASR Processing Time:

In Deep Live Hub, you can set the ASR processing time between 3 seconds and 20 seconds. Here are some general recommendations (a configuration sketch follows the list):

  • 3-5 seconds: Suitable for most languages and scenarios where real-time interaction is critical. Provides a good balance between speed and accuracy.
  • 10-15 seconds: Recommended for languages like German that tend to have longer sentences, or in contexts like scientific discussions, where sentence structure is more complex. This setting provides higher accuracy but adds more delay.
  • 20 seconds: If real-time speed is not a concern, choose the longest processing time for the highest transcription quality.
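
As a rough illustration of how these recommendations might be encoded, here is a hypothetical preset table in Python. The names (`ASR_PRESETS`, `asr_processing_time_s`) are invented for this sketch and are not the actual Deep Live Hub configuration keys; only the 3-20 second range comes from the product.

```python
# Hypothetical presets covering the 3-20 second range described above.
ASR_PRESETS = {
    "live_interaction": 4,   # 3-5 s: real-time events where speed is critical
    "complex_language": 12,  # 10-15 s: long sentences, scientific vocabulary
    "max_quality": 20,       # 20 s: accuracy over latency
}

def asr_config(use_case):
    # "asr_processing_time_s" is an invented key for this sketch.
    return {"asr_processing_time_s": ASR_PRESETS[use_case]}
```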

Partials Mode:

For HLS or HLS-realtime text streams, you can use partials mode for a better balance between speed and accuracy. Here’s how it works:

  • In partials mode, the ASR sends initial results as soon as they are available, often with sub-second delay.
  • As the sentence progresses and more context is gathered, the ASR may send corrections within the set ASR processing time.
  • This means you can get fast results while still allowing the system to make corrections as the speech continues.

For example, if you're using the Aiconix Live Viewer, partials mode can provide near-instant transcript results, which are then corrected and improved in real time, ensuring fast updates while maintaining quality.
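
The sketch below shows how a client might consume such a stream. It assumes each message carries a segment id, the current text, and a final flag; the actual Deep Live Hub message format may differ.

```python
def render_subtitles(events):
    """Toy consumer of a partials-style stream. Each event is assumed to
    be a dict with a segment "id", the current "text", and an "is_final"
    flag; the actual Deep Live Hub message format may differ."""
    lines = {}
    for event in events:
        lines[event["id"]] = event["text"]  # a correction overwrites the partial
        label = "final" if event["is_final"] else "partial"
        print(f"{label}: {event['text']}")
    return lines

# A partial arrives with sub-second delay; corrected versions follow
# within the configured ASR processing time.
render_subtitles([
    {"id": 1, "text": "hello world its", "is_final": False},
    {"id": 1, "text": "Hello world, it's", "is_final": False},
    {"id": 1, "text": "Hello world, it's time to start.", "is_final": True},
])
```

Because partials share their segment id with the final version, the viewer can simply overwrite the displayed line whenever a correction arrives.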

What Is Partials Mode?

Partials Mode sends transcript results in real time, as soon as the AI detects speech. Corrections follow later, within the set processing time, so you get fast updates whose accuracy improves as more context arrives.

ASR Configuration and Translations:

In Deep Live Hub, transcription and translation are two separate processes. The translation model runs only after transcription is complete:

  • First, the AI creates a transcript in the original language.
  • Once a text chunk is fully transcribed, it is sent to the translation model.

This means that better transcription quality leads to better translations. The length and quality of each text chunk directly affect the accuracy of the translation.

Longer ASR processing times therefore tend to produce more accurate translations, since the AI has more context to work with. Note, however, that partials mode affects only transcripts, not translated subtitles, since translations are based on completed text chunks rather than partial results.
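
Here is a minimal sketch of this two-stage pipeline with a stand-in translation function. `translate_chunk` and `subtitle_pipeline` are illustrative names, not Deep Live Hub calls.

```python
def translate_chunk(text, target_lang):
    # Stand-in for the translation model, which actually runs inside
    # Deep Live Hub; tagging the text is enough for illustration.
    return f"[{target_lang}] {text}"

def subtitle_pipeline(final_chunks, target_lang="fr"):
    """Translation starts only once a chunk is fully transcribed, so
    partial ASR results never reach this stage. That is why partials
    mode speeds up transcripts but not translated subtitles."""
    for chunk in final_chunks:  # each chunk is a completed transcript segment
        yield translate_chunk(chunk, target_lang)

print(list(subtitle_pipeline(["Hello world, it's time to start."])))
```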