Evaluation

This page describes how submissions to the SEGMENT Track are evaluated and ranked. The evaluation philosophy follows the Metrics Reloaded recommendations [Maier-Hein et al., 2024], balancing interpretability of the ranking against methodological precision of the detailed performance analyses. The same evaluation protocol applies to the FRAME, SEGMENT, and PROCEDURE Tracks, with track-specific differences noted where relevant.

At a glance. Each test case is a Visual Question Answering (VQA) instance. Submissions are scored with Accuracy, aggregated into capability- and robustness-stratified buckets, and combined into a final ranking using the Copeland method. Ties within the top three places are resolved by bootstrap-based head-to-head win rates.

Why this evaluation design

Interpretability

A single, transparent primary metric, Accuracy, makes leaderboard positions easy to read and verify, while a richer stratified analysis is reported alongside for nuance.

Fairness across categories

Equal-weight aggregation across capabilities and robustness buckets prevents large question categories from dominating the ranking.

Robustness first

Out-of-distribution (OOD) questions, including unseen procedure types and unseen question formulations, are weighted equally to in-distribution (ID) questions to reward generalization.

Theoretical guarantees

Because no aggregation scheme satisfies every desirable property simultaneously (Arrow's impossibility theorem), we adopt the Copeland method, which has favorable properties relative to alternatives such as Borda counts [Rofin et al., 2023].


Primary metric: Accuracy

The primary metric used for ranking is Accuracy, defined as the proportion of correctly answered VQA cases. We chose Accuracy because it directly operationalizes the assessment goal of correctness, is the de facto standard for VQA benchmarking, and yields results that are comparable across categorical and open-ended question types.

The well-known disadvantages of Accuracy, including sensitivity to class prevalence and threshold choice, are mitigated by the stratified data splits and hierarchical aggregation described below, rather than by replacing the metric.

Closed-ended questions

Closed-ended questions are scored by exact or tolerance-aware match against the reference answer after format-specific verification and parsing. Each question declares its answer format during dataset generation. A response that does not pass the verification step of its format is counted as incorrect.

The full set of formats is implemented in the focus.data.formats module of the orena-focus Python package and includes:

Format | Accepted response | Comparison rule
binary | yes / no (case-insensitive) | Exact match on parsed boolean
number | Non-negative integer | Exact match on parsed integer
percentage | Non-negative number, optional % suffix | Tolerance-aware match on parsed float
fo_class | A registered foreign-object class name or none | Case-insensitive match on canonical class name
time | hh:mm:ss timestamp | Tolerance-aware match
multiple_choice / open_ended / matching | One of a predefined option set, or free-form text up to 300 characters | LLM-as-a-judge
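To make the verify-then-compare flow concrete, here is a minimal sketch for the binary and number formats. It is not the focus.data.formats implementation: the function names are hypothetical, and the choice to strip surrounding whitespace before parsing is an assumption.

```python
from typing import Optional


def parse_binary(response: str) -> Optional[bool]:
    """Return True/False for a case-insensitive yes/no, or None if invalid."""
    text = response.strip().lower()  # assumption: surrounding whitespace is ignored
    if text == "yes":
        return True
    if text == "no":
        return False
    return None


def parse_number(response: str) -> Optional[int]:
    """Return a non-negative integer, or None if the response is not one."""
    text = response.strip()
    # Only plain ASCII digits pass, so signs, decimals, and words are rejected.
    if text.isascii() and text.isdigit():
        return int(text)
    return None


def score_closed_ended(fmt: str, response: str, reference: str) -> bool:
    """Exact match on the parsed value; unparseable responses are incorrect."""
    parsers = {"binary": parse_binary, "number": parse_number}
    parsed = parsers[fmt](response)
    if parsed is None:
        return False  # failed verification counts as an incorrect answer
    return parsed == parsers[fmt](reference)
```

For example, `score_closed_ended("binary", "Yes", "yes")` is correct, while `score_closed_ended("number", "3.0", "3")` fails verification and is scored as incorrect.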

Open-ended questions: LLM-as-a-judge

For questions whose responses cannot be checked by exact match, namely formats multiple_choice, open_ended, and matching, semantic correctness is assessed using an LLM-as-a-judge protocol. A sample judge implementation is provided in focus.evaluation.judges.

Up to three independent judge LLMs evaluate every open-ended response; the final verdict is determined by majority vote. The voting routine short-circuits as soon as an absolute majority is reached, which keeps inference cost low without changing the outcome. This paradigm has become common practice in modern VLM benchmarks and tolerates clinically meaningful linguistic variability; internal pilot experiments showed only minor disagreement across state-of-the-art judge LLMs.
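The short-circuiting majority vote can be sketched as follows. This is an illustration only, not the code in focus.evaluation.judges: the Judge callable signature and the function name are hypothetical, and treating an unresolved vote (possible only with an even or empty judge panel) as incorrect is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (question, response, reference) -> verdict.
Judge = Callable[[str, str, str], bool]


def majority_verdict(
    judges: Sequence[Judge], question: str, response: str, reference: str
) -> bool:
    """Query judges in order and stop once an absolute majority is reached."""
    needed = len(judges) // 2 + 1  # votes required for an absolute majority
    yes = no = 0
    for judge in judges:
        if judge(question, response, reference):
            yes += 1
        else:
            no += 1
        if yes >= needed:
            return True  # remaining judges cannot change the outcome
        if no >= needed:
            return False
    # Assumption: with no absolute majority, the response counts as incorrect.
    return False
```

With three judges, the third is only queried when the first two disagree, which is why the short-circuit reduces cost without changing the verdict.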

To prevent tuning towards specific judges, the exact set of judge models used for the official evaluation will not be disclosed until after the challenge concludes. The evaluation code is released openly, but the judge identity will remain redacted.

Anti-gaming policy. Any attempt to manipulate the LLM-as-a-judge through adversarial prompting, jailbreaking, or other techniques aimed at unfairly influencing evaluation will result in immediate disqualification of the team.

Tolerance-aware accuracy

For questions where the reference annotation has inherent uncertainty, most notably temporal references in the time format and numeric estimates in the percentage format, a prediction is counted as correct if it falls within a predefined tolerance window. The window is configured per question via the format's threshold_seconds or threshold_pp parameter. Tolerance thresholds were derived from inter-rater variability and clinical input, similar to the application-dependent accuracy formulation of [Dergachyova et al., 2016].
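A minimal sketch of tolerance-aware matching, under stated assumptions: the parameter names threshold_seconds and threshold_pp come from the text above, but the function names, the parsing rules, and the inclusive window boundary are illustrative choices rather than the package's behavior.

```python
import re
from typing import Optional

# Assumption: hours use one or two digits; minutes and seconds are 00-59.
_TIME_RE = re.compile(r"(\d{1,2}):([0-5]\d):([0-5]\d)")


def parse_time_seconds(response: str) -> Optional[int]:
    """Convert an hh:mm:ss timestamp to seconds, or return None if malformed."""
    match = _TIME_RE.fullmatch(response.strip())
    if match is None:
        return None
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def parse_percentage(response: str) -> Optional[float]:
    """Parse a non-negative number with an optional % suffix."""
    text = response.strip()
    if text.endswith("%"):
        text = text[:-1].strip()
    if re.fullmatch(r"\d+(\.\d+)?", text) is None:
        return None
    return float(text)


def time_is_correct(response: str, reference: str, threshold_seconds: float) -> bool:
    """Correct if within threshold_seconds of the reference (inclusive)."""
    pred, ref = parse_time_seconds(response), parse_time_seconds(reference)
    if pred is None or ref is None:
        return False
    return abs(pred - ref) <= threshold_seconds


def percentage_is_correct(response: str, reference: str, threshold_pp: float) -> bool:
    """Correct if within threshold_pp percentage points of the reference."""
    pred, ref = parse_percentage(response), parse_percentage(reference)
    if pred is None or ref is None:
        return False
    return abs(pred - ref) <= threshold_pp
```

For instance, with threshold_seconds=5 a prediction of 00:01:30 matches a reference of 00:01:33 but not 00:01:40.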

Missing submissions

Any VQA case for which a submission does not produce a response, including timeouts exceeding the per-question time budget, is treated as incorrect.


Aggregation: stratified buckets

Each test case carries two pieces of meta-information used for aggregation:

1. Robustness level

In-distribution (ID) or out-of-distribution (OOD) with respect to procedure type and question formulation. OOD cases come from procedure types not represented in the training set and from question phrasings not seen during training.

2. Primary capability

Each case is mapped to exactly one primary capability based on its question intent, using the FOCUS taxonomy.

Capability mapping

Capability | Question intent | Example
Object recognition and instance matching | Which object? In which state? Where? | Which type of foreign object is visible in the lower right of the frame?
Temporal grounding | When? How long? | When in the segment is the sponge first introduced?
Aggregation | How many? | How many distinct needles appear in this segment?
Event and procedural understanding | Which action? | Is the clip being applied or removed in this segment?
Complex reasoning | Why? What happens if? | Why is the surgeon repositioning the specimen bag?

For every model, mean Accuracy is computed within each capability × robustness bucket. This yields up to 2 × 5 = 10 bucket scores per model for the SEGMENT and PROCEDURE Tracks. The FRAME Track uses a reduced taxonomy, focused on object recognition and aggregation, and therefore has fewer buckets.
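The bucket computation can be sketched as below. The input layout (one tuple of capability, robustness level, and correctness per case) and the function name are assumptions made for illustration; a missing response is represented as None and counted as incorrect, in line with the rule for missing submissions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Bucket = Tuple[str, str]  # (capability, robustness level)


def bucket_accuracies(
    cases: Iterable[Tuple[str, str, Optional[bool]]]
) -> Dict[Bucket, float]:
    """Mean Accuracy per capability x robustness bucket for one model."""
    totals: Dict[Bucket, int] = defaultdict(int)
    hits: Dict[Bucket, int] = defaultdict(int)
    for capability, robustness, correct in cases:
        bucket = (capability, robustness)
        totals[bucket] += 1
        # None (no response or timeout) is scored like a wrong answer.
        hits[bucket] += 1 if correct else 0
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


example = bucket_accuracies([
    ("aggregation", "ID", True),
    ("aggregation", "ID", False),
    ("aggregation", "OOD", None),
    ("temporal_grounding", "OOD", True),
])
```

Only buckets that contain at least one case appear in the result, which is why a model has up to, rather than exactly, ten bucket scores.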


Ranking procedure

The ranking is computed in three steps.

Step 1 — Per-bucket ranking with significance adjustment

Within each bucket, models are ranked by mean Accuracy. Pairwise significance tests using cluster-aware bootstrapping are then used to collapse ranks when performance differences are not significant. Two models with statistically indistinguishable bucket-level Accuracy receive the same rank, so that irrelevant differences within the noise floor of the test set do not propagate into the final ranking.
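One possible reading of this step is sketched below. The text does not specify the test statistic or the collapse rule, so everything here is an assumption: a paired percentile bootstrap that resamples whole videos, a two-sided level of alpha = 0.05, and a rank defined as one plus the number of models that are significantly better. All function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def cluster_bootstrap_diffs(a: Sequence[int], b: Sequence[int],
                            clusters: Sequence[str], n_boot: int,
                            rng: random.Random) -> List[float]:
    """Bootstrap the mean difference a - b by resampling whole clusters."""
    by_cluster: Dict[str, List[int]] = defaultdict(list)
    for index, cluster in enumerate(clusters):
        by_cluster[cluster].append(index)
    ids = list(by_cluster)
    diffs = []
    for _ in range(n_boot):
        sampled = [rng.choice(ids) for _ in ids]  # videos, with replacement
        idx = [i for cluster in sampled for i in by_cluster[cluster]]
        diffs.append(sum(a[i] - b[i] for i in idx) / len(idx))
    return diffs


def significantly_better(a: Sequence[int], b: Sequence[int],
                         clusters: Sequence[str], n_boot: int = 1000,
                         alpha: float = 0.05, seed: int = 0) -> bool:
    """True if the lower percentile bound of the difference a - b exceeds 0."""
    rng = random.Random(seed)
    diffs = sorted(cluster_bootstrap_diffs(a, b, clusters, n_boot, rng))
    lower = diffs[int((alpha / 2) * n_boot)]
    return lower > 0


def significance_adjusted_ranks(scores: Dict[str, Sequence[int]],
                                clusters: Sequence[str],
                                **kwargs) -> Dict[str, int]:
    """Assumed rule: rank = 1 + number of models significantly better."""
    return {
        model: 1 + sum(
            significantly_better(scores[other], scores[model], clusters, **kwargs)
            for other in scores if other != model
        )
        for model in scores
    }
```

Here scores maps each model to its per-case 0/1 results in a fixed case order, and clusters gives the source video of each case. Under this rule, two models whose difference is not significant in either direction can share a rank.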

Step 2 — Aggregating buckets via the Copeland method

The per-bucket rankings are combined into a single overall ranking using the Copeland method [Rofin et al., 2023]:

  • For every pair of models A and B, count the number of buckets in which A is ranked strictly higher than B, and the number in which B is ranked strictly higher than A.
  • Model A dominates model B if A is ranked higher more often than B is.
  • Each model's Copeland score is the number of models it dominates minus the number of models that dominate it.
  • Higher Copeland scores are better; models are ordered by descending Copeland score.

This approach was chosen over linear schemes such as the Borda rule (which is equivalent to averaging ranks) because it is more robust to irrelevant alternatives. In other words, adding or removing a weak submission does not arbitrarily perturb the ordering at the top.
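The Copeland scoring rules above translate into a few lines of code. The sketch assumes that every model has a rank in every bucket and that rank 1 is best; the input layout and function name are illustrative, not the official ranking code.

```python
from typing import Dict, Hashable


def copeland_scores(bucket_ranks: Dict[Hashable, Dict[str, int]]) -> Dict[str, int]:
    """bucket_ranks[bucket][model] is the rank in that bucket (1 = best)."""
    models = sorted({m for ranks in bucket_ranks.values() for m in ranks})
    scores = {model: 0 for model in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            # Count buckets where one model is ranked strictly higher.
            a_wins = sum(1 for r in bucket_ranks.values() if r[a] < r[b])
            b_wins = sum(1 for r in bucket_ranks.values() if r[b] < r[a])
            if a_wins > b_wins:  # a dominates b
                scores[a] += 1
                scores[b] -= 1
            elif b_wins > a_wins:  # b dominates a
                scores[b] += 1
                scores[a] -= 1
    return scores


example_ranks = {
    "bucket_1": {"A": 1, "B": 2, "C": 3},
    "bucket_2": {"A": 1, "B": 1, "C": 2},
    "bucket_3": {"A": 2, "B": 1, "C": 3},
}
example_scores = copeland_scores(example_ranks)
final_order = sorted(example_scores, key=example_scores.get, reverse=True)
```

In this example A and B each win one bucket against the other, so neither dominates; both dominate C. A and B therefore tie on Copeland score, which is the situation the tie-breaking step addresses.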

Step 3 — Tie-breaking via bootstrap win rate

Ties within the top three positions of the Copeland ranking are resolved by directly comparing the tied models:

  • Within every bucket, draw K bootstrap samples with replacement of the case-level scores, respecting the clustering of cases within source videos.
  • For each bootstrap sample, identify the tied model with the highest Accuracy in that bucket. That model is credited with one win.
  • The win rate of a model in a bucket is the fraction of bootstrap samples in which it wins. The model's overall win rate is the mean win rate across all buckets.
  • Tied models are ordered by descending overall win rate.
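The tie-breaking procedure can be sketched as follows. The data layout and function names are hypothetical, and one detail is an assumption because the text does not cover it: when several tied models share the highest Accuracy in a bootstrap sample, the win is split equally among them.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

CaseScores = Dict[str, Sequence[int]]  # model -> per-case 0/1 scores


def bucket_win_rates(scores: CaseScores, clusters: Sequence[str],
                     k: int, rng: random.Random) -> Dict[str, float]:
    """Win rate per model in one bucket, resampling whole source videos."""
    by_cluster: Dict[str, List[int]] = defaultdict(list)
    for index, cluster in enumerate(clusters):
        by_cluster[cluster].append(index)
    ids = list(by_cluster)
    wins = {model: 0.0 for model in scores}
    for _ in range(k):
        sampled = [rng.choice(ids) for _ in ids]  # videos, with replacement
        idx = [i for cluster in sampled for i in by_cluster[cluster]]
        acc = {m: sum(s[i] for i in idx) / len(idx) for m, s in scores.items()}
        best = max(acc.values())
        leaders = [m for m, value in acc.items() if value == best]
        for model in leaders:
            wins[model] += 1.0 / len(leaders)  # assumption: split shared wins
    return {model: count / k for model, count in wins.items()}


def overall_win_rates(buckets: Dict[str, Tuple[CaseScores, Sequence[str]]],
                      k: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Mean of the per-bucket win rates; higher is better."""
    rng = random.Random(seed)
    per_bucket = [bucket_win_rates(s, c, k, rng) for s, c in buckets.values()]
    models = list(per_bucket[0])
    return {m: sum(r[m] for r in per_bucket) / len(per_bucket) for m in models}
```

Each entry of buckets holds the case-level scores of the tied models only, together with the source video of each case. The tied models are then sorted by descending overall win rate.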

Outperforming the baselines

Two baseline submissions are provided:

1. Frontier closed-source VLM

A state-of-the-art frontier model, for example GPT-class or Gemini-class, applied zero-shot and selected by best validation performance.

2. Fine-tuned open-source VLM

A strong open-source VLM fine-tuned on the challenge training data by the organizers.

During the pre-evaluation phase, a team is considered to outperform a baseline if its mean Accuracy across all buckets is higher than the baseline's. Only teams that beat both baselines during pre-evaluation are admitted to the final test phase. Prizes during the final phase require beating both baselines on the final ranking.
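As a small illustration of the pre-evaluation gate, the sketch below interprets "mean Accuracy across all buckets" as the unweighted mean of the bucket scores, which is an assumption consistent with the equal-weight aggregation described earlier. The function names and dictionary layout are hypothetical.

```python
from statistics import mean
from typing import Dict


def mean_bucket_accuracy(bucket_scores: Dict[str, float]) -> float:
    """Unweighted mean over bucket-level Accuracy values."""
    return mean(bucket_scores.values())


def admitted_to_final(team: Dict[str, float],
                      baselines: Dict[str, Dict[str, float]]) -> bool:
    """True only if the team is strictly above every baseline."""
    team_mean = mean_bucket_accuracy(team)
    return all(team_mean > mean_bucket_accuracy(b) for b in baselines.values())
```

A team that merely equals a baseline's mean Accuracy does not pass, since the rule requires a strictly higher value.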

The full Copeland-based ranking, not just mean Accuracy, is applied in the final test phase.


Open-source evaluation code

The evaluation code for the pre-evaluation phase is released publicly as part of the orena-focus Python package, so that participants can reproduce per-case scoring locally on the released training data. The aggregation and ranking code for the final phase will be released soon. The specific judge LLMs used for open-ended scoring remain undisclosed during the challenge to prevent over-fitting to a particular judge.