
What is the Visual Understanding Module?

Prompt-based deeper analysis of images and video.

Module Description

The module performs visual language comprehension tasks such as answering visual questions, understanding scenes, and making advanced deductions.

Visual Language Comprehension
The ability to interpret and understand information conveyed through visual elements such as symbols, images, colors, and layouts. It involves recognizing patterns, decoding cultural or contextual meanings, and connecting visuals with emotions or concepts. Audio information is not taken into account.


How does it work?

  1. Select the Media File: Choose the media file you want to analyze.
  2. Activate the Visual Understanding Module: In the left column, select the "Visual Understanding" module, enter a prompt and click the yellow "Add Module" button.
  3. Start the Analysis: You can either add more modules or begin the analysis immediately by clicking "Start Analysis".

What Parameters are available?

  • Prompt (Free Text)
    This algorithm needs an additional prompt in order to perform the analysis.
    You can enter any prompt. Note that the longer the expected result, the longer the analysis will take.

EXAMPLES:
Scene Description:
"Describe the actions happening in this video scene."
"What objects and people are present in this clip?"
Content Summarization:
"Summarize the key events in this 30-second video."
"Provide a high-level overview of this sports match."
Emotion and Tone Analysis:
"What is the emotional tone of this scene?"
"Are the characters in the video happy, sad, or angry?"
Highlights Extraction:
"Identify the most exciting moments in this soccer match."
"Find key scenes with dialogue in this video."
Visual Elements Detection:
"Identify all appearances of company logos and name the companies they belong to."
"Which name tags are visible in this video?"
Audience Engagement Insights:
"What visual elements are most frequently focused on?"
"Analyze the facial expressions of viewers in this focus group video."

Prompt Library & Backlog:
In this initial release, a prompt library and the option to save used prompts are not yet available. These features will be introduced in a future update.

  • Model (Dropdown)
    You can choose from a variety of available models to select the best one for your specific task. The list of models is updated regularly depending on customer needs. If you require a specific model, please feel free to contact our support team.
    • Qwen 2.5 VL 7B instruct:
      Best choice for complex tasks requiring higher accuracy and reasoning. Strong at detailed visual understanding and handling longer or more sophisticated queries. Recommended if you prioritize high-quality results over efficiency and speed.
      Especially recommended when using structured output.
    • Qwen 2.5 VL 3B instruct:
      A balanced option between performance and speed. Delivers reliable results for most visual understanding tasks while being lighter and faster than the 7B model. Suitable for general-purpose use cases where efficiency and cost matter, but you still want solid accuracy for simple tasks.
    • SmolVLM:
      Optimized for lightweight, real-time or resource-constrained scenarios. Delivers quick responses with lower compute requirements, making it ideal for near-realtime needs or simple visual tasks.
    • More models to follow (e.g. Teuken 7B)
  • Temperature (0.0–1.0):
    The temperature controls how creative or deterministic the model’s responses are.
    • A low temperature (e.g., 0.1–0.3) makes the model more focused and predictable. It will stick closely to the most likely answers — ideal for factual, structured outputs.
    • A high temperature (e.g., 0.7–1.0) makes the model more creative and diverse in its responses. This allows for more varied or exploratory descriptions but may also include less relevant details (a short illustrative sketch follows the examples below).

EXAMPLE:

  • Low temperature: Technical image tagging, accessibility ALT text, compliance use cases

  • High temperature: Creative descriptions, storyboarding, conceptual exploration
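
Temperature is applied inside the model, so there is nothing to implement yourself. Purely as an illustration of the underlying mechanic, here is a minimal Python sketch of temperature-scaled sampling over hypothetical next-token scores (the tokens and scores are invented for the example and are not produced by the module):

import math
import random

def sample_with_temperature(token_scores, temperature):
    # Scale raw scores by temperature: low values sharpen the distribution,
    # high values flatten it.
    scaled = {tok: score / temperature for tok, score in token_scores.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = [weights[tok] / total for tok in weights]
    return random.choices(list(weights), weights=probs, k=1)[0]

# Hypothetical next-token scores while describing a scene (not real model output)
scores = {"sunset": 2.0, "evening": 1.5, "dramatic": 0.5}
print(sample_with_temperature(scores, 0.1))  # almost always "sunset"
print(sample_with_temperature(scores, 1.0))  # noticeably more varied

Running the snippet repeatedly shows that a low temperature almost always returns the top-scoring token, while a high temperature mixes in the alternatives.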

  • Enable Shot Detection (Checkbox)

    With Shot Detection enabled, our shot boundary detection module will attempt to separate each shot in the video into a segment and apply the prompt to each individual segment. This will give you shot descriptions with timecodes, for example. If disabled, the prompt will be applied to the entire video.
  • Shot Detection Threshold (0–100):

    The threshold defines how different two adjacent frames must be before a shot break is triggered: a break occurs when the difference between the frames exceeds the threshold value. A higher number means a stricter threshold, resulting in fewer shots, while a lower number means a more lenient threshold, resulting in more shots (see the sketch below).
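
To build intuition for how the threshold behaves, here is a simplified, hypothetical Python sketch of threshold-based shot boundary detection using the mean absolute pixel difference between adjacent frames. It is not the module's actual algorithm, and the frames, difference metric, and threshold scale are placeholders:

import numpy as np

def detect_shot_boundaries(frames, threshold=50.0):
    """Return frame indices where a new shot is assumed to start.

    frames: grayscale frames as 2-D numpy arrays of equal shape.
    A cut is assumed when the mean absolute difference between two adjacent
    frames exceeds the threshold: a higher threshold is stricter and yields
    fewer shots, a lower threshold yields more shots.
    """
    boundaries = [0]  # the first shot starts at frame 0
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:
            boundaries.append(i)
    return boundaries

# Toy example: three dark frames followed by three bright frames -> one cut at index 3
frames = [np.full((4, 4), 10)] * 3 + [np.full((4, 4), 200)] * 3
print(detect_shot_boundaries(frames, threshold=50))  # [0, 3]

In the toy example, the jump from dark to bright frames exceeds the threshold, so a single cut is detected at frame 3.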

  • Structured Output (Code Window)
    Structured output means that, in addition to the prompt the VLM should solve, you provide a JSON structure (as plain text) that the VLM uses to format its answer.
    To provide a working JSON structure without coding knowledge, use an LLM to generate ideas and the finished JSON code. Include the model you want to use (e.g. Qwen 2.5 7B) in the LLM prompt, and it will reply with a custom structure ideal for your use case.

    To make this task easier, here are two examples:

Structured Output Example 1: Predefined Tags
To give the VLM strict boundaries to choose from, you can define a category, e.g. 'Season:', and then provide a prompt within the JSON structure for this specific task, e.g. 'What season is shown? Choose one:', followed by the tags you want the VLM to draw from, e.g. ["Summer", "Winter", "Fall", "Spring", "Neutral"].

The full example would be:

{
"season": "What season is shown? Choose one: ['Summer', 'Winter', 'Fall', 'Spring', 'Neutral']"
}


Another example for predefined tags, in this case shot-specific camera metadata:

{
  "camera_type": "Describe the camera type or viewpoint. Choose one from: ['Static', 'Handheld', 'POV', 'Drone', 'CCTV', 'Dashcam', 'Bodycam', 'Crane', 'Tracking Shot', 'Studio', 'Other']. If the camera angle is identifiable, also include one from: ['Low angle', 'High angle', 'Eye level', 'Over-the-shoulder', 'Wide shot', 'Close-up', 'Medium shot']. Return both in an unordered list if applicable."
}

Structured Output Example 2: Free Text/Tags
To make the model provide answers with a less rigid structure, define the boundaries within which it should respond and let it answer freely. This is recommended for text descriptions or for defining matching content tags. In most of these cases, the temperature setting will also significantly affect the quality of the answers. As above, define a category (e.g. 'scene_description') and add a prompt for this category, but leave the answer open rather than listing predefined tags. The model will then fill the field with a free-form answer.

The full example for scene description would be:

{
"scene_description": "Describe the scene in one complete sentence suitable for accessibility purposes. Include key actions, visible people, setting, objects, atmosphere, and inferred context. Avoid mentioning the viewer or the act of filming."
}


Here is another example of using the VLM to assign content tags:

{
 "content_tags": "Generate 10 concise, social-media-friendly tags that capture the scene’s essence. Focus on people, actions, objects, setting, mood, and cultural or topical context. Prefer hashtags or phrases commonly used online that boost discoverability. Include readable text if visible and meaningful. Avoid redundancy and overly generic terms; highlight specific themes, emotions, or trends behind the scene."
}

Important:
If you want to use structured output, make sure your prompt includes the task that the VLM should answer using the given structure.

Example Prompt:
"You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format."

The JSON structure must also be provided.

Of course, all types of structured output can be combined and adapted as required.
To provide more guidance, we will be adding a prompt library, as well as creating a dedicated category in our Discord community.
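
If you consume the structured results outside the platform, a short script can verify that the model's answer actually follows your structure. The Python sketch below is only an illustration under assumed field names (it combines the 'season' and 'scene_description' examples above) and is not part of the module itself:

import json

# Assumed field names, combining the predefined-tag and free-text examples above
ALLOWED_SEASONS = ["Summer", "Winter", "Fall", "Spring", "Neutral"]
EXPECTED_FIELDS = ["season", "scene_description"]

def validate_answer(raw_json):
    """Parse the VLM's structured answer and run basic sanity checks."""
    answer = json.loads(raw_json)
    for field in EXPECTED_FIELDS:
        if field not in answer:
            raise ValueError("Missing field: " + field)
    if answer["season"] not in ALLOWED_SEASONS:
        raise ValueError("Unexpected season tag: " + str(answer["season"]))
    return answer

# Hypothetical answer returned by the module
raw = '{"season": "Winter", "scene_description": "A skier descends a snowy slope at dusk."}'
print(validate_answer(raw))

The same idea extends to any combined structure: list the expected fields, and for predefined-tag fields check that the returned value is one of the allowed tags.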


Displaying the Results:

Module Section:

On the right side of the player, you’ll see a section with detailed results for each module used in the analysis. Clicking on the module name opens a dropdown with specific parameters, useful for troubleshooting or viewing metadata.


Results:

The results are shown in the sidebar, along with the original prompt displayed as a text field.


New Function
This module is a new addition. If you find potential improvements or bugs, please feel free to contact us via the support form.