ORena SAVE FOCUS Challenge — SEGMENT Track

Foreign Object Contextual Understanding in Surgery

Highlighted announcements

The second data batch (LapChole-FOCUS VQA) has now been released. It contains 170 laparoscopic cholecystectomy videos and is available on Hugging Face! 👉 Hugging Face

Short-video surgical VQA for foreign object understanding

This is the SEGMENT Track of the ORena SAVE FOCUS Challenge. The track evaluates whether vision-language models can answer clinically relevant questions from short laparoscopic video segments of up to 5 minutes, focusing on local temporal context, foreign object interactions, short-term tracking, and action understanding.

The broader ORena SAVE FOCUS Challenge benchmarks vision-language models on clinically grounded visual question answering for foreign object understanding in minimally invasive surgery. The goal is to advance AI methods that can support intraoperative quality assurance and patient safety.

The SEGMENT Track bridges single-image surgical scene understanding and long-context procedural reasoning. It tests whether models can integrate events across time within a bounded video context before participants move to the more demanding PROCEDURE Track.

Start here

Register for the challenge

Join the central forum

Download the first data batch

Download the second data batch

Use the orena-focus package

Why this challenge matters

Clinical relevance

In minimally invasive surgery, foreign objects such as sponges, needles, clips, drains, specimen bags, and similar objects may be introduced into the abdominal cavity during a procedure. Retained foreign objects after major operations are rare but clinically relevant adverse events associated with patient harm [Badiee et al., 2025].

Technical challenge

Foreign object understanding in short videos requires models to combine visual recognition with local temporal reasoning. Long-video benchmarks have shown the importance of evaluating models beyond isolated images by requiring reasoning over extended visual context [Wu et al., 2024].

Benchmark at a glance

Task type Surgical video question answering	Input Video segment up to 5 min, meta data (type of procedure, timestamps) + question	Output Short text answer	Focus Short-term foreign object understanding
SEGMENT time budget 15 seconds per question	SEGMENT hardware 80GB VRAM GPU	Prize pool $60k+ across tracks	Submission Docker container

The three ORena SAVE FOCUS tracks

FRAME
Track

Single-image understanding

Answer clinically relevant questions from one laparoscopic image. This track evaluates visual perception, foreign object identification, counting, attributes, and spatial localization.

SEGMENT
Track

Short-video understanding

Answer questions from short video segments of up to 5 minutes. This track evaluates local temporal reasoning, short-term tracking, and action understanding.

You are here.

PROCEDURE
Track

Long-context understanding

Answer questions over long video contexts up to full procedures. This track evaluates long-horizon memory, persistent object tracking, aggregation over time, and retrieval-status reasoning.

SEGMENT Track

The SEGMENT Track evaluates a model’s ability to answer clinically relevant questions from a short laparoscopic video segment of up to 5 minutes. The task targets surgical video understanding skills such as:

foreign object identification and identity matching over short video segments
local temporal reasoning within a bounded video context
foreign object counting
recognition of insertion, manipulation, or removal events
complex reasoning in a short video segment context

The input consists of a video segment and a question. The submitted algorithm must return a text answer. All methods must be fully automated.

Algorithm input

Video segment up to 5 min, meta data (type of procedure, timestamp) + question

Exact input format will follow the official submission template repository.

Algorithm output

Short text answer

Exact answer formatting and validation details will follow the official submission template repository.

Data and scientific background

The first released data batch, HeiCo-FOCUS, is based on Heidelberg colorectal surgery videos and provides clinically grounded VQA pairs for foreign object understanding. The dataset covers five capability categories: object recognition and identity matching, temporal grounding, aggregation, event and procedural understanding, and complex reasoning.

The SEGMENT Track builds on prior work in surgical visual question answering, where models answer clinically relevant questions from surgical scenes [Seenivasan et al., 2022].

The SEGMENT Track also connects to the broader development of long-video understanding benchmarks, which evaluate whether multimodal models can reason beyond isolated frames and short static contexts [Fu et al., 2025].

For the SEGMENT Track, the focus is on the short-video part of this benchmark. This provides a controlled setting for evaluating whether models can use local temporal context to reason about foreign objects, their interactions, and brief action sequences before moving to long-context reasoning in the PROCEDURE Track.

First data batch
HeiCo-FOCUS VQA

Number of videos
30

Expert involvement
Clinical and technical experts

Motivation
Foreign object safety and short term context understanding

Figure 1: Overview of the HeiCo-FOCUS benchmark, showing a) the clinical motivation and b) providing an overview of the first batch dataset.

Second data batch
LapChole-FOCUS VQA

Number of videos
170

Expert involvement
Clinical and technical experts

Motivation
Foreign object safety

Submission and evaluation

Submissions must be made through the challenge website.
Algorithms are submitted as Docker containers.
Containers must run without internet access.
Inference is limited to a single GPU.
The SEGMENT Track time budget is 15 seconds per question on an 80GB VRAM GPU.
During pre-evaluation, each team may submit up to 10 times, subject to possible adjustment depending on compute constraints.
Only teams that beat the baselines on the leaderboard proceed to the final test stage.
Teams must submit a method description with sufficient technical detail for interpretation of the results.

Prizes and recognition

$60k+ prize pool

A prize pool of at least $60k has been secured across the ORena SAVE FOCUS Challenge tracks. The SEGMENT Track is planned to receive approximately 40% of the total prize money.

Publication opportunity

Teams that beat the baselines may be invited as co-authors on the planned challenge publication, subject to the official rules and submission requirements.

Resources

Registration	Register for the ORena SAVE FOCUS Challenge
Central forum	ORena SAVE FOCUS Forum
First data batch	HeiCo-FOCUS VQA on Hugging Face
Second data batch	LapChole-FOCUS VQA on Hugging Face
Python package	orena-focus GitHub repository
Submission template	Will be released soon.

Webinar recording

The ORena SAVE FOCUS webinar recording is available here after May 28th: