
What is the Visual Understanding Module?

Prompt-based deeper analysis of images and video.

Module Description

The Visual Understanding Module performs visual-language comprehension on images and videos. It can answer visual questions, describe scenes, extract structured data, and optionally segment videos into shots or fixed time segments and analyze each segment independently.

Typical use cases include scene understanding, content moderation, accessibility descriptions, sports and event analysis, brand detection, and structured metadata extraction.

Visual Language Comprehension
The ability to interpret and understand information conveyed through visual elements such as symbols, images, colors, and layouts. It involves recognizing patterns, decoding cultural or contextual meanings, and connecting visuals with emotions or concepts. Audio information is not taken into account.


How does it work?

  1. Select the Media File: Choose the media file you want to analyze.
  2. Activate the Visual Understanding Module: In the left column, select the "Visual Understanding" module, enter a prompt and click the yellow "Add Module" button.
  3. Start the Analysis: You can either add more modules or begin the analysis immediately by clicking "Start Analysis".

What Parameters are available?

  • Prompt (Free Text)
    This algorithm requires an additional prompt in order to perform the analysis.
    You can enter any prompt; the longer the expected result, the more time the analysis will take.

EXAMPLES:
Scene Description:
"Describe the actions happening in this video scene."
"What objects and people are present in this clip?"
Content Summarization:
"Summarize the key events in this 30-second video."
"Provide a high-level overview of this sports match."
Emotion and Tone Analysis:
"What is the emotional tone of this scene?"
"Are the characters in the video happy, sad, or angry?"
Highlights Extraction:
"Identify the most exciting moments in this soccer match."
"Find key scenes with dialogue in this video."
Visual Elements Detection:
"Identify all appearances of company logos and name the companies."
"Which name tags are visible in this video?"
Audience Engagement Insights:
"What visual elements are most frequently focused on?"
"Analyze the facial expressions of viewers in this focus group video."

Prompt Library & Backlog:
In this initial release, a prompt library and the option to save used prompts are not yet available. These features will be introduced in a future update.

  • Model (Dropdown)
    You can choose from a variety of available models to select the best one for your specific task. The list of models is updated regularly depending on customer needs. If you require a specific model, please feel free to contact our support team.
    • Qwen 2.5 VL 7B instruct:
      Best choice for complex tasks requiring higher accuracy and reasoning. Strong at detailed visual understanding and handling longer or more sophisticated queries. Recommended if you need high-quality results over efficiency and speed.
      Especially recommended when using structured output.
    • Qwen 2.5 VL 3B instruct:
      A balanced option between performance and speed. Delivers reliable results for most visual understanding tasks while being lighter and faster than the 7B model. Suitable for general-purpose use cases where efficiency and cost matter, but you still want solid accuracy for simple tasks.
    • SmolVLM:
      Optimized for lightweight, real-time or resource-constrained scenarios. Delivers quick responses with lower compute requirements, making it ideal for near-realtime needs or simple visual tasks.
    • Qwen 3 VL 2B instruct NEW
      Lightweight and fast. Good for simple tagging, quick descriptions, or high-throughput scenarios.

    • Qwen 3 VL 2B thinking NEW
      Optimized for deeper reasoning despite its smaller size. Useful when logical deductions matter more than raw detail.

    • Qwen 3 VL 4B instruct NEW
      A strong mid-size model with improved understanding over 2B variants.

    • Qwen 3 VL 4B thinking NEW
      Enhanced reasoning and consistency for analytical tasks.

    • Qwen 3 VL 8B instruct NEW
      High-quality visual understanding for demanding use cases.

    • Qwen 3 VL 8B thinking NEW
      Best choice for complex reasoning, structured extraction, and multi-step visual analysis.

    • More to follow (e.g. Teuken 7B)
  • Temperature (0.0–1.0):
    The temperature controls how creative or deterministic the model’s responses are.
    • A low temperature (e.g., 0.1–0.3) makes the model more focused and predictable. It will stick closely to the most likely answers — ideal for factual, structured outputs.
    • A high temperature (e.g., 0.7–1.0) makes the model more creative and diverse in its responses. This allows for more varied or exploratory descriptions but may also include less relevant details. 

EXAMPLE:

  • Low temperature: Technical image tagging, accessibility ALT text, compliance use cases

  • High temperature: Creative descriptions, storyboarding, conceptual exploration
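The effect of temperature can be sketched with a small softmax-sampling function. This is a generic illustration of how temperature shapes a model's token choices, not the module's actual implementation; the function name and logits are hypothetical:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample an index from logits scaled by temperature.

    Low temperature sharpens the distribution (almost always picks the
    most likely option); high temperature flattens it (more diverse picks).
    """
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]    # softmax
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At a temperature near 0.1 the top option is chosen almost every time, which is why low temperatures suit factual tagging; near 1.0 the lower-ranked options are sampled noticeably more often.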

  • Shot Segmentation
    • Enable Shot Detection (checkbox)

      When enabled, the module automatically splits the video into individual shots and applies your prompt to each shot separately.

      If disabled, the prompt is applied to the entire video as one segment.

    • Shot Detection Method (Drop-down)

      You can choose how shots are detected:

      • Content (default)
        Detects cuts using pixel changes in HSV color space. A solid general-purpose method.
      • Adaptive
        Uses a rolling average of frame differences. Often better for fast motion or action-heavy videos.

      • Threshold
        Detects brightness changes. Useful for fade-ins and fade-outs.

      • Histogram
        Compares brightness histograms between frames. More robust to flashes, lighting changes, and noise.

      • Hash
        Uses perceptual image hashing. Very robust against compression artifacts, watermarks, logos, and encoding differences.
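As a rough illustration of the content method, here is a minimal sketch of cut detection via frame-to-frame HSV differences. This is an assumption about the general technique, not the module's actual code; the function name and threshold value are hypothetical:

```python
import numpy as np

def detect_cuts_hsv(frames, threshold=30.0):
    """Flag a cut wherever the mean absolute per-pixel difference between
    consecutive HSV frames exceeds the threshold.

    `frames` is a list of (H, W, 3) uint8 arrays already converted to HSV.
    Returns the indices of frames that start a new shot.
    """
    cuts = []
    for i in range(1, len(frames)):
        # int16 avoids uint8 wrap-around when subtracting
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16))
        if diff.mean() > threshold:
            cuts.append(i)
    return cuts
```

The other methods differ mainly in the statistic compared between frames (rolling averages, brightness, histograms, or perceptual hashes) rather than in this overall loop.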

    • Fixed-Length Segmentation (Alternative)

      Instead of detecting shots, you can also split a video into fixed time segments
      (e.g. every 5 or 10 seconds).

      • Each segment is analyzed independently

      • Cannot be combined with shot detection

      • Useful for:

        • Long videos

        • Monitoring or surveillance

        • Uniform sampling use cases
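Fixed-length segmentation itself is straightforward; a minimal sketch (the helper name is hypothetical, the behavior mirrors the description above):

```python
def fixed_length_segments(duration, segment_length):
    """Split a video of `duration` seconds into (start, end) windows of
    `segment_length` seconds. The final segment may be shorter."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_length, duration)
        segments.append((start, end))
        start = end
    return segments
```

Each resulting window would then be analyzed independently with your prompt, which is why this mode cannot be combined with shot detection.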

    • Structured Output (Code Window)
      Structured output means that, in addition to the prompt for the VLM to solve, you provide a JSON structure (as text) that the model uses to format its answer.
      To create a working JSON structure without coding knowledge, you can use an LLM to generate ideas and the finished JSON code. Include the model you want to use (e.g. Qwen 2.5 VL 7B) in the LLM prompt, and it will reply with a custom structure suited to your use case.

      To make this task easier, here are two examples:

Structured Output Example 1: Predefined Tags
To give the VLM strict boundaries to choose from, you can define a category, e.g. 'Season:', and then provide a prompt within the JSON structure for this specific task, e.g. 'What season is shown? Choose one:', followed by the tags you want the VLM to draw from, e.g. ["Summer", "Winter", "Fall", "Spring", "Neutral"].

The full example would be:

{
  "season": "What season is shown? Choose one: ['Summer', 'Winter', 'Fall', 'Spring', 'Neutral']"
}


Another example of predefined tags, in this case shot-specific camera metadata:

{
  "camera_type": "Describe the camera type or viewpoint. Choose one from: ['Static', 'Handheld', 'POV', 'Drone', 'CCTV', 'Dashcam', 'Bodycam', 'Crane', 'Tracking Shot', 'Studio', 'Other']. If the camera angle is identifiable, also include one from: ['Low angle', 'High angle', 'Eye level', 'Over-the-shoulder', 'Wide shot', 'Close-up', 'Medium shot']. Return both in an unordered list if applicable."
}

Structured Output Example 2: Free Text/Tags
To make the model provide answers with a less rigid structure, define the boundaries within which it should respond and let it answer in free form. This is recommended for text descriptions or for defining matching content tags. In most of these cases, the temperature setting will also significantly affect the quality of the answers. As above, define a category (e.g. 'scene_description') and provide the prompt as its value; the model will replace it with a free-form answer.

The full example for a scene description would be:

{
  "scene_description": "Describe the scene in one complete sentence suitable for accessibility purposes. Include key actions, visible people, setting, objects, atmosphere, and inferred context. Avoid mentioning the viewer or the act of filming."
}


Here is another example of using the VLM to assign content tags:

{
  "content_tags": "Generate 10 concise, social-media-friendly tags that capture the scene’s essence. Focus on people, actions, objects, setting, mood, and cultural or topical context. Prefer hashtags or phrases commonly used online that boost discoverability. Include readable text if visible and meaningful. Avoid redundancy and overly generic terms; highlight specific themes, emotions, or trends behind the scene."
}

Important:
If you want to use structured output, make sure your prompt includes the task that the VLM should solve using the given structure.

Example Prompt:
"You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format."

The JSON structure must also be provided.

Of course, all types of structured output can be combined and adapted as required.
To provide more guidance, we will be adding a prompt library, as well as creating a dedicated category in our Discord community.
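If you post-process results programmatically, it can help to check that the model's reply is valid JSON and fills every key of the structure you supplied. A minimal sketch using only the standard library (the function name is hypothetical; how the module delivers the raw reply is an assumption):

```python
import json

def parse_structured_reply(reply_text, schema):
    """Parse a model reply that should be JSON and check it against the
    schema you sent.

    Returns (data, missing_keys); data is None if the reply is not valid
    JSON, and missing_keys lists schema keys absent from the reply.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None, list(schema)
    missing = [key for key in schema if key not in data]
    return data, missing
```

A reply that fails this check is often a sign the prompt did not clearly instruct the model to answer in the given structure, or that the temperature is set too high for structured extraction.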


Displaying the Results:

Module Section:

On the right side of the player, you’ll see a section with detailed results for each module used in the analysis. Clicking on the module name opens a dropdown with specific parameters, useful for troubleshooting or viewing metadata.


Results:

The results are shown in the sidebar, along with the original prompt displayed as a text field.


New Function
This is a new addition to our modules. If you find bugs or potential improvements, please feel free to contact us via the support form.