
What is the Speech Recognition Module?

Transcribing and translating media files as simply as possible.

Module Description

The Speech Recognition module transcribes spoken language into text (speech-to-text). It can automatically detect the spoken language, recognize named entities and custom entities from a Dictionary, and translate the transcript.

Customized Speech Recognition
To use Speech Recognition with your own words and entities, upload a dictionary in the Dictionaries section.


How does it work?

  1. Select the Media File: Choose the media file you want to analyze.
  2. Activate the Speech Recognition Module: In the left column, select the "Speech Recognition" module.
  3. Define the Model & Parameters: Choose the model for analysis from the available options, set the parameters, and click the yellow "Add Module" button.
  4. Start the Analysis: You can either add more modules or begin the analysis immediately by clicking "Start Analysis".

What Parameters are available?

The following parameters can be configured for the Speech Recognition Module:
  • Language (Dropdown):
    The expected language can be set here. With the "Auto" setting, the system can also detect the language automatically when the input language is unknown.

Note:
For best transcription results, you should pre-select the language; the "Auto" setting may result in a higher transcription error rate when used on media assets with multiple languages or many foreign words.

  • Formatting Paragraph (Checkbox):
    This formats the transcript into paragraphs instead of individual sentences, making it easier to read and process further. For captioning purposes, use single sentences without paragraph formatting.
  • Translation Language (Dropdown):
    Translation of the transcript into a specific language.

Available Languages for Translation:

  • Arabic
  • Azerbaijani
  • Catalan
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hebrew
  • Hindi
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Persian
  • Polish
  • Portuguese
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Turkish
  • Ukrainian
  • Enable Voice Activation Detection (Checkbox):
    Enables Voice Activation Detection (VAD), which automatically filters out background noise up to a certain level, such as background music or sound effects. This improves transcription quality for difficult media content and unclear speech.
  • Threshold for the activation of voice (0.01 – 0.9):
    The sensitivity level up to which VAD is active and filters out noise; the louder the noise, the higher the level should be set.
  • Named Entity Recognition (Checkbox):
    Activate to visually highlight entities (names, places, professions, etc.) in the transcript.
  • Dictionary (Dropdown):
    A Dictionary is a text-based collection of words and phrases. It offers the possibility of training the transcription with specific entities (names, technical terms, places, etc.). Using Dictionaries increases the quality of the transcription, especially for specialist texts. You can use our preset dictionaries or add your own in the Dictionaries section.

    We have several default dictionaries built into the Speech Recognition, but you can also add your own. You can use more than one dictionary in an analysis job, just click "Add another dictionary" and all words of the applied dictionaries will be included in the transcription analysis.

    These default dictionaries are currently available in addition to custom dictionaries:
    • Animal Names
      A broad list of animals, for accurate transcription of scientific animal names.
    • European Football Clubs
      Accurately transcribes the names of 550+ European football teams.
    • IAB Content Taxonomy 3.0 (Not applicable for Speech Recognition)
      This dictionary is for object and scene recognition only.
    • GARM Brand Safety (Not applicable for Speech Recognition)
      This dictionary is for object and scene recognition only.

Own Dictionaries:
Read more about creating custom dictionaries here.
Different custom dictionary types provide different behavior when used in the Speech Recognition. Note that the type is defined by the format of the uploaded file.

  • Simple dictionary
    Simple dictionaries are used to provide word information for correct spelling. They are defined by a UTF-8 encoded text file (.txt), where each line represents a single word in the dictionary. Empty lines in the file are ignored.
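As a sketch, a simple dictionary file can be prepared and checked programmatically before upload. The helper names, file name, and word list below are illustrative, not part of the product; only the format (UTF-8, one word per line, empty lines ignored) comes from the documentation:

```python
import os
import tempfile

def write_simple_dictionary(path, words):
    """Write one entry per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))

def read_simple_dictionary(path):
    """Read entries back; empty lines are ignored, as in the module."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical word list for demonstration.
path = os.path.join(tempfile.mkdtemp(), "my_terms.txt")
write_simple_dictionary(path, ["Axolotl", "", "Quokka"])  # empty line on purpose
entries = read_simple_dictionary(path)
```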

  • Mapping Dictionary
    The purpose of mapping dictionaries is to provide word substitution during inference. Each mapping dictionary is defined by a UTF-8 encoded CSV file (.csv) that must contain a header and two columns without missing values: source and target. Each time a word from the source column is predicted by the Speech Recognition, it is replaced by the word from the target column.
    Example of use: Explicit words can be censored in the subtitles with *******.
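As a sketch, the CSV layout described above can be built and applied with Python's standard `csv` module. The words, mapping, and variable names below are made up for illustration; only the `source,target` header and the substitution rule come from the documentation:

```python
import csv
import io

# Illustrative mapping (made-up entries): each predicted source word
# is replaced by its target word during inference.
rows = [("badword", "*******"), ("NYC", "New York City")]

# Build the CSV content with the required header and no missing values.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["source", "target"])   # required header row
writer.writerows(rows)
csv_text = buf.getvalue()

# Apply the substitution the way the module describes:
mapping = dict(rows)
transcript = ["We", "landed", "in", "NYC"]
replaced = [mapping.get(word, word) for word in transcript]
```

Saving `csv_text` to a UTF-8 `.csv` file yields a mapping dictionary in the format described above.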


Displaying the Results:

Under “Speech Recognition” you will find the finished transcript on the right side as it was output by the AI without corrections.
If you notice errors in the transcript or need to make corrections, you can open the transcript in the Transcript Editor.


Search Field

Located in the top bar, the search field includes filter settings for refining your results.
  • Text field: Enter a word to find it in the transcript or translation below.
  • Sorting: Results can be sorted chronologically (default) or by Named Entity Recognition. You can toggle between ascending and descending order.

After adjusting filters, click "Apply" to apply them. Active filters appear in a black box beneath the search field and can be cleared by clicking the X symbol.


Module Section

On the right side of the player, you’ll see a section with detailed results for each module used in the analysis. Clicking on the module name opens a dropdown with specific parameters, useful for troubleshooting or viewing metadata.

 If you notice errors in the transcript or need to make corrections, you can open the transcript in the Editor. To do this, you will find two icons below the Module section:

  • Transcript Editor (Text Icon)
    Creates a new Transcript out of the Speech Recognition result and opens the Transcript Editor.
  • Translation Editor (World Icon)
    Creates a new Translation out of the Speech Recognition result and opens the Translation Editor.

Transcript Editor:
The Transcript Editor provides easy-to-use tools and an easy-to-understand interface for editing, reviewing, and finalizing a transcript. Learn more here.


Result Cards

Results are displayed as paragraphs in chronological order. Each card provides key information, such as:

    • Timecode of the result: The time at which this section of the transcription occurs in the file.

    • Language: Either the automatically identified or manually selected language. 

    • Transcription: The sentences the Speech Recognition transcribed.

Artifacts

At the bottom of the transcript you will also find the artifacts of the analysis job, e.g. text files (.docx) or subtitle files (.srt), for direct download. These files do not include any manual corrections. For best results, use the Transcript Editor to finalize the transcript, which offers various export options.

Filetypes:

  • .srt file (SubRip Subtitle File):
    This format is commonly used for subtitles in video files. It contains the transcription text along with timestamps for when each line should appear and disappear on screen.
    Usage example: Adding subtitles to a YouTube video or a movie file to improve accessibility.
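As a minimal sketch of the SubRip layout, the snippet below renders one subtitle entry: an index line, a `start --> end` line with comma-separated milliseconds, and the text. The helper names, timings, and sentence are made up for illustration:

```python
def srt_timestamp(seconds):
    """Format seconds as SRT's HH:MM:SS,mmm (comma before the milliseconds)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_entry(index, start, end, text):
    """Render one numbered subtitle entry."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

entry = srt_entry(1, 0.0, 2.5, "Welcome to the show.")
# entry:
# 1
# 00:00:00,000 --> 00:00:02,500
# Welcome to the show.
```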

  • .docx file (Microsoft Word Document):
    A .docx file is a standard format for text documents, often used for creating written transcripts that are easy to read and edit. Unlike .srt, it doesn't include timing information but focuses purely on the transcription text.
    Usage example: Sharing a clean, formatted transcription of an interview or meeting minutes.