What is the Advanced Speech Recognition Module?
A new level of protocol, transcript and translation delivery.
New Function:
This function was just released and is a further development of our Speech Recognition.
It is a Composite AI function that links transcription and translation with the Speaker Identification module of the Deep Media Analyzer.
Note:
As this is a completely new development, we are spending a lot of time testing and debugging. However, there may still be some issues that we have not found during our internal and external testing. We encourage you to report any bugs or errors using our support form. Thank you very much!
Module Description
The Advanced Speech Recognition module transcribes spoken language into text (speech-to-text). It can automatically detect the spoken language and the speakers, recognize named entities and custom entities from a dictionary, and offers translation of the transcript.
Customization of the Advanced Speech Recognition:
To use the Advanced Speech Recognition with your own words and entities, you need to upload a dictionary in the Dictionaries section. This can be a mapping dictionary to replace words or a simple dictionary to correct spellings and names.
How does it work?
- Select the Media File: Choose the media file you want to analyze.
- Activate the Advanced Speech Recognition Module: In the left column, select the "Advanced Speech Recognition" module.
- Define the Model & Parameters: Choose the model for analysis from the available options, set the parameters, and click the yellow "Add Module" button.
- Start the Analysis: You can either add more modules or begin the analysis immediately by clicking "Start Analysis".
What Parameters are available?
- Model (Dropdown)
Select from pre-trained models or your own custom-trained speaker identification models.
Currently we feature only one pre-trained model:
- Celebrities
Several personalities, including the world's most famous people and a large number of German politicians and athletes.
Customized Speaker Identification:
To create a custom speaker identification model, you need to access the training function within the Deep Model Customizer.
- Min. Similarity (Slider):
Adjust the minimum similarity score for identifying speakers. A lower value returns more results, while a higher value improves accuracy (see the sketch below).
- Cluster Unknown Identities (Checkbox):
Group unrecognized speakers together as "unknown" without assigning individual IDs.
- Numbering of Labels for Unknown Identities (Checkbox):
Automatically number and label all unknown speakers for easier reference.
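To illustrate the trade-off behind the Min. Similarity slider, here is a minimal, hypothetical Python sketch. It assumes speakers are matched by cosine similarity of voice embeddings; the platform's actual scoring method is not documented here, and all names and values are purely illustrative.

```python
import numpy as np

def identify_speaker(query, known_speakers, min_similarity=0.7):
    """Return the best-matching known speaker, or "unknown" if no
    reference voice reaches the minimum similarity threshold."""
    best_name, best_score = None, min_similarity
    for name, reference in known_speakers.items():
        # Cosine similarity between query and reference embedding.
        score = np.dot(query, reference) / (
            np.linalg.norm(query) * np.linalg.norm(reference)
        )
        if score >= best_score:
            best_name, best_score = name, score
    # A lower threshold returns more (possibly wrong) matches; a higher
    # one returns fewer but more accurate ones. Unmatched voices can
    # then be clustered or numbered, depending on the checkboxes above.
    return best_name if best_name else "unknown"
```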
Speaker Index:
The Speaker Index offers the easiest way to manage unknown voices. Each speaker is automatically assigned a unique ID, allowing you to rename it instantly. Normally, in the Deep Model Customizer, you would need to upload training material for each person. However, with the Speaker Index, every voice becomes recognizable right away without the need for extra training data.
- Language (Dropdown):
The expected language can be set here. If the input language is unknown, the system can also detect it automatically with the "Auto" setting.
Note:
For best transcription results, you should pre-select the language; the "Auto" setting may result in a higher transcription error rate when used on media assets with multiple languages or many foreign words.
- Formatting Paragraph (Checkbox):
This allows the transcript to be formatted in paragraphs instead of individual sentences, making it easier to read for further processing. For captioning purposes, use single sentences without paragraph formatting.
- Translation Language (Dropdown):
Translation of the transcript into a specific language.
Available Languages for Translation:
- Arabic
- Azerbaijani
- Catalan
- Chinese
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Japanese
- Korean
- Persian
- Polish
- Portuguese
- Russian
- Slovak
- Spanish
- Swedish
- Turkish
- Ukrainian
A Dictionary represents a text-based dictionary of words and phrases. It offers the possibility of training the transcription with specific entities (names, technical terms, places, etc.). Using Dictionaries increases the quality of the transcription, especially for specialist texts. You can use our preset dictionaries or add your own in the Dictionaries section.
We have several default dictionaries built into the Speech Recognition, but you can also add your own. You can use more than one dictionary in an analysis job: just click "Add another dictionary", and all words of the applied dictionaries will be included in the transcription analysis.
These default dictionaries are currently available in addition to custom dictionaries:
- Animal Names
A broad list of animals, for accurate transcription of scientific animal names.
- European Football Clubs
Accurately transcribes the names of 550+ European football teams.
- IAB Content Taxonomy 3.0 (Not applicable for Speech Recognition)
This dictionary is for object and scene recognition only.
- GARM Brand Safety (Not applicable for Speech Recognition)
This dictionary is for object and scene recognition only.
Own Dictionaries:
Read more about creating custom dictionaries here.
Different custom dictionary types provide different behavior when used in the Speech Recognition. Note that the type is defined by the format of the uploaded file.
- Simple dictionary
Simple dictionaries are used to provide word information for correct spelling. They are defined by a UTF-8 encoded text file (.txt), where each line represents a single word in the dictionary. Empty lines in the file are ignored.
- Mapping Dictionary
The purpose of mapping dictionaries is to provide word substitution behavior during inference. Each mapping dictionary is defined by a UTF-8 encoded CSV file (.csv) that must contain a header row and two columns without missing values: source and target. Each time a word from the source column is predicted by the Speech Recognition, it is replaced by the word provided in the target column.
Example of Use: Explicit words can be censored in the subtitles with *******.
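For illustration, a simple dictionary file (.txt) and a mapping dictionary file (.csv) might look like this; the entries are invented examples:

```
Ornithorhynchus
Bundestag
Zwischenahn
```

```
source,target
badword,*******
colour,color
```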
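Conceptually, the substitution behaves like a per-word lookup over the transcript. Here is a minimal Python sketch of that behavior, assuming a mapping file like the one above; the actual replacement is performed by the platform, and the file name is hypothetical:

```python
import csv

def apply_mapping_dictionary(words, csv_path="mapping.csv"):
    """Replace each word that appears in the source column with its
    target counterpart; all other words pass through unchanged."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        mapping = {row["source"]: row["target"] for row in csv.DictReader(f)}
    return [mapping.get(word, word) for word in words]

# Example: with the mapping above, "badword" is censored in the output.
print(apply_mapping_dictionary(["no", "badword", "here"]))
```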
Displaying the Results:
Under "Advanced Speech Recognition" you will find the finished transcript on the right side, exactly as output by the AI, without corrections.
If you notice errors in the transcript or need to make corrections, you can open the transcript in the Transcript Editor.
Search Field
Located in the top bar, the search field includes filter settings for refining your results.
- Text field: Enter a word to find it in the transcript or translation below.
- Sorting: Results can be sorted Chronologically (Standard) or by Named Entity Recognition. You can toggle between ascending and descending order.
After adjusting filters, click "Apply" to apply them. Active filters appear in a black box beneath the search field and can be cleared by clicking the X symbol.
Module Section
On the right side of the player, you’ll see a section with detailed results for each module used in the analysis. Clicking on the module name opens a dropdown with specific parameters, useful for troubleshooting or viewing metadata.
If you notice errors in the transcript or need to make corrections, you can open the transcript in the Editor. To do this, you will find two icons below the Module section:
- Transcript Editor (Text Icon)
Creates a new Transcript from the Speech Recognition result and opens the Transcript Editor.
- Translation Editor (World Icon)
Creates a new Translation from the Speech Recognition result and opens the Translation Editor.
Transcript Editor:
The Transcript Editor provides easy-to-use tools and an easy-to-understand interface for editing, reviewing, and finalizing a transcript. Learn more here.
Result Cards
Results are displayed as paragraphs in chronological order. Each card provides key information, such as:
- Timecode of the result: The time at which this section of the transcription occurs in the file.
- Speaker: The name or number of the speaker, with the possibility of renaming it and opening the speaker in the Speaker Index overview.
- Language: Either the automatically identified or manually selected language.
- Transcription: The sentences the Advanced Speech Recognition transcribed.
Artifacts
At the bottom of the transcript you will also find the artifacts of the analysis job, e.g. text files (.docx) or subtitle files (.srt), for direct download. These files do not include any manual corrections. For best results, use the Transcript Editor and finalize the transcript with its various export options.
Filetypes:
- .srt file (SubRip Subtitle File):
This format is commonly used for subtitles in video files. It contains the transcription text along with timestamps for when each line should appear and disappear on screen.
Usage example: Adding subtitles to a YouTube video or a movie file to improve accessibility.
- WebVTT (Web Video Text Tracks):
A World Wide Web Consortium (W3C) standard for displaying timed text in connection with the HTML5 <track> element.
- .docx file (Microsoft Word Document):
A .docx file is a standard format for text documents, often used for creating written transcripts that are easy to read and edit. Unlike .srt, it doesn't include timing information but focuses purely on the transcription text.
Usage example: Sharing a clean, formatted transcription for an interview or meeting minutes.
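For orientation, short invented excerpts of the two subtitle formats look like this. Note that SubRip numbers its cues and uses a comma before the milliseconds, while WebVTT uses a dot and starts with a WEBVTT header:

```
1
00:00:01,000 --> 00:00:04,200
Welcome, everyone, to today's panel.

2
00:00:04,500 --> 00:00:07,800
Let's start with our first speaker.
```

```
WEBVTT

00:00:01.000 --> 00:00:04.200
Welcome, everyone, to today's panel.

00:00:04.500 --> 00:00:07.800
Let's start with our first speaker.
```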