
What is the Visual Understanding Module?

Prompt-based deeper analysis of images and video.

Module Description

The Visual Understanding Module performs visual-language comprehension on images and videos. It can answer visual questions, describe scenes, extract structured data, and optionally segment videos into shots or fixed time segments and analyze each segment independently.

Typical use cases include scene understanding, content moderation, accessibility descriptions, sports and event analysis, brand detection, and structured metadata extraction.

Visual Language Comprehension
The ability to interpret and understand information conveyed through visual elements such as symbols, images, colors, and layouts. It involves recognizing patterns, decoding cultural or contextual meanings, and connecting visuals with emotions or concepts. Audio information is not taken into account.


How does it work?

  1. Select the Media File: Choose the media file you want to analyze.
  2. Activate the Visual Understanding Module: In the left column, select the "Visual Understanding" module, enter a prompt and click the yellow "Add Module" button.
  3. Start the Analysis: You can either add more modules or begin the analysis immediately by clicking "Start Analysis".

What Parameters are available?

  • Prompt (Free Text)
    This algorithm requires an additional prompt in order to perform the analysis.
    You can enter any prompt; the longer the expected result, the more time the analysis will take.

EXAMPLES:
Scene Description:
"Describe the actions happening in this video scene."
"What objects and people are present in this clip?"
Content Summarization:
"Summarize the key events in this 30-second video."
"Provide a high-level overview of this sports match."
Emotion and Tone Analysis:
"What is the emotional tone of this scene?"
"Are the characters in the video happy, sad, or angry?"
Highlights Extraction:
"Identify the most exciting moments in this soccer match."
"Find key scenes with dialogue in this video."
Visual Elements Detection:
"Identify all appearances of company logos and name the companies."
"Which name tags are visible in this video?"
Audience Engagement Insights:
"What visual elements are most frequently focused on?"
"Analyze the facial expressions of viewers in this focus group video."

Prompt Library & Backlog:
In this initial release, a prompt library and the option to save used prompts are not yet available. These features will be introduced in a future update.

  • Model (Dropdown)
    You can choose from a variety of available models to select the best one for your specific task. The list of models is updated regularly depending on customer needs. If you require a specific model, please feel free to contact our support team.
    • Qwen 2.5 VL 7B instruct:
      Best choice for complex tasks requiring higher accuracy and reasoning. Strong at detailed visual understanding and handling longer or more sophisticated queries. Recommended if you need high-quality results over efficiency and speed.
      Especially recommended when using structured output.
    • Qwen 2.5 VL 3B instruct:
      A balanced option between performance and speed. Delivers reliable results for most visual understanding tasks while being lighter and faster than the 7B model. Suitable for general-purpose use cases where efficiency and cost matter, but you still want solid accuracy for simple tasks.
    • SmolVLM:
      Optimized for lightweight, real-time or resource-constrained scenarios. Delivers quick responses with lower compute requirements, making it ideal for near-realtime needs or simple visual tasks.
    • Qwen 3 VL 2B instruct NEW
      Lightweight and fast. Good for simple tagging, quick descriptions, or high-throughput scenarios.

    • Qwen 3 VL 2B thinking NEW
      Optimized for deeper reasoning despite its smaller size. Useful when logical deductions matter more than raw detail.

    • Qwen 3 VL 4B instruct NEW
      A strong mid-size model with improved understanding over 2B variants.

    • Qwen 3 VL 4B thinking NEW
      Enhanced reasoning and consistency for analytical tasks.

    • Qwen 3 VL 8B instruct NEW
      High-quality visual understanding for demanding use cases.

    • Qwen 3 VL 8B thinking NEW
      Best choice for complex reasoning, structured extraction, and multi-step visual analysis.

    • More to follow (e.g. Teuken 7B)
  • Temperature (0.0–1.0):
    The temperature controls how creative or deterministic the model’s responses are.
    • A low temperature (e.g., 0.1–0.3) makes the model more focused and predictable. It will stick closely to the most likely answers — ideal for factual, structured outputs.
    • A high temperature (e.g., 0.7–1.0) makes the model more creative and diverse in its responses. This allows for more varied or exploratory descriptions but may also include less relevant details. 

EXAMPLE:

  • Low temperature: Technical image tagging, accessibility ALT text, compliance use cases

  • High temperature: Creative descriptions, storyboarding, conceptual exploration
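The effect of temperature can be sketched with a small softmax-sampling function. This is a generic illustration of how temperature shapes a model's token choices, not the module's actual implementation; the function name and logits are hypothetical:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample an index from logits scaled by temperature.

    Low temperature sharpens the distribution (almost always picks the
    most likely option); high temperature flattens it (more diverse picks).
    """
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]    # softmax
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

At a temperature near 0.1 the top option is chosen almost every time, which is why low temperatures suit factual tagging; near 1.0 the lower-ranked options are sampled noticeably more often.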

  • Shot Segmentation
    • Enable Shot Detection (checkbox)

      When enabled, the module automatically splits the video into individual shots and applies your prompt to each shot separately.

      If disabled, the prompt is applied to the entire video as one segment.

    • Shot Detection Method (Drop-down)

      You can choose how shots are detected:

      • Content (default)
        Detects cuts using pixel changes in HSV color space. A solid general-purpose method.
      • Adaptive
        Uses a rolling average of frame differences. Often better for fast motion or action-heavy videos.

      • Threshold
        Detects brightness changes. Useful for fade-ins and fade-outs.

      • Histogram
        Compares brightness histograms between frames. More robust to flashes, lighting changes, and noise.

      • Hash
        Uses perceptual image hashing. Very robust against compression artifacts, watermarks, logos, and encoding differences.
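As a rough illustration of the content method, here is a minimal sketch of cut detection via frame-to-frame HSV differences. This is an assumption about the general technique, not the module's actual code; the function name and threshold value are hypothetical:

```python
import numpy as np

def detect_cuts_hsv(frames, threshold=30.0):
    """Flag a cut wherever the mean absolute per-pixel difference between
    consecutive HSV frames exceeds the threshold.

    `frames` is a list of (H, W, 3) uint8 arrays already converted to HSV.
    Returns the indices of frames that start a new shot.
    """
    cuts = []
    for i in range(1, len(frames)):
        # int16 avoids uint8 wrap-around when subtracting
        diff = np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16))
        if diff.mean() > threshold:
            cuts.append(i)
    return cuts
```

The other methods differ mainly in the statistic compared between frames (rolling averages, brightness, histograms, or perceptual hashes) rather than in this overall loop.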

    • Fixed-Length Segmentation (Alternative)

      Instead of detecting shots, you can also split a video into fixed time segments
      (e.g. every 5 or 10 seconds).

      • Each segment is analyzed independently

      • Cannot be combined with shot detection

      • Useful for:

        • Long videos

        • Monitoring or surveillance

        • Uniform sampling use cases
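Fixed-length segmentation itself is straightforward; a minimal sketch (the helper name is hypothetical, the behavior mirrors the description above):

```python
def fixed_length_segments(duration, segment_length):
    """Split a video of `duration` seconds into (start, end) windows of
    `segment_length` seconds. The final segment may be shorter."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_length, duration)
        segments.append((start, end))
        start = end
    return segments
```

Each resulting window would then be analyzed independently with your prompt, which is why this mode cannot be combined with shot detection.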

    • Structured Output (Code Window)
      Structured output means that, in addition to the prompt for the VLM to solve, you provide a JSON structure (as text) that the model uses to format its answer.
      To create a working JSON structure without coding knowledge, you can use an LLM to generate ideas and the finished JSON code. Include the model you want to use (e.g. Qwen 2.5 VL 7B) in the LLM prompt, and it will reply with a custom structure suited to your use case.

      To make this task easier, here are two examples:

Structured Output Example 1: Predefined Tags
To give the VLM strict boundaries to choose from, you can define a category, e.g. 'Season:', and then provide a prompt within the JSON structure for this specific task, e.g. 'What season is shown? Choose one:', followed by the tags you want the VLM to draw from, e.g. ["Summer", "Winter", "Fall", "Spring", "Neutral"].

The full example would be:

{
  "season": "What season is shown? Choose one: ['Summer', 'Winter', 'Fall', 'Spring', 'Neutral']"
}


Another example of predefined tags, in this case shot-specific camera metadata:

{
  "camera_type": "Describe the camera type or viewpoint. Choose one from: ['Static', 'Handheld', 'POV', 'Drone', 'CCTV', 'Dashcam', 'Bodycam', 'Crane', 'Tracking Shot', 'Studio', 'Other']. If the camera angle is identifiable, also include one from: ['Low angle', 'High angle', 'Eye level', 'Over-the-shoulder', 'Wide shot', 'Close-up', 'Medium shot']. Return both in an unordered list if applicable."
}

Structured Output Example 2: Free Text/Tags
To make the model provide answers with a less rigid structure, define the boundaries within which it should respond and let it answer in free form. This is recommended for text descriptions or for defining matching content tags. In most of these cases, the temperature setting will also significantly affect the quality of the answers. As above, define a category (e.g. 'scene_description') and provide the prompt as its value; the model will replace it with a free-form answer.

The full example for a scene description would be:

{
  "scene_description": "Describe the scene in one complete sentence suitable for accessibility purposes. Include key actions, visible people, setting, objects, atmosphere, and inferred context. Avoid mentioning the viewer or the act of filming."
}


Here is another example of using the VLM to assign content tags:

{
  "content_tags": "Generate 10 concise, social-media-friendly tags that capture the scene’s essence. Focus on people, actions, objects, setting, mood, and cultural or topical context. Prefer hashtags or phrases commonly used online that boost discoverability. Include readable text if visible and meaningful. Avoid redundancy and overly generic terms; highlight specific themes, emotions, or trends behind the scene."
}

Important:
If you want to use structured output, make sure your prompt includes the task that the VLM should solve using the given structure.

Example Prompt:
"You are a scene analysis assistant. Analyze the given image and return a JSON object describing the scene. Populate each field precisely according to the given instructions in the JSON schema below. Use exact categories, avoid vague language, and ensure each entry follows the required format."

The JSON structure must also be provided.

Of course, all types of structured output can be combined and adapted as required.
To provide more guidance, we will be adding a prompt library, as well as creating a dedicated category in our Discord community.
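If you post-process results programmatically, it can help to check that the model's reply is valid JSON and fills every key of the structure you supplied. A minimal sketch using only the standard library (the function name is hypothetical; how the module delivers the raw reply is an assumption):

```python
import json

def parse_structured_reply(reply_text, schema):
    """Parse a model reply that should be JSON and check it against the
    schema you sent.

    Returns (data, missing_keys); data is None if the reply is not valid
    JSON, and missing_keys lists schema keys absent from the reply.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None, list(schema)
    missing = [key for key in schema if key not in data]
    return data, missing
```

A reply that fails this check is often a sign the prompt did not clearly instruct the model to answer in the given structure, or that the temperature is set too high for structured extraction.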


Displaying the Results:

Module Section:

On the right side of the player, you’ll see a section with detailed results for each module used in the analysis. Clicking on the module name opens a dropdown with specific parameters, useful for troubleshooting or viewing metadata.


Results:

The results are shown in the sidebar, along with the original prompt displayed as a text field.


New Function
This is a new addition to our modules. If you find bugs or potential improvements, please feel free to contact us via the support form.