VBVR-EvalKit
The official evaluation toolkit for Very Big Video Reasoning (VBVR). Unified inference and evaluation across 37 video generation models, including commercial APIs (Luma, Veo, Kling, Sora, Runway) and open-source models (LTX-Video, LTX-2, HunyuanVideo, SVD, WAN, CogVideoX, and more).
Quick Start
You can evaluate video models using VBVR-EvalKit:
```shell
# Install
git clone https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
python -m venv venv && source venv/bin/activate
pip install -e .

# Set up a model
bash setup/install_model.sh --model svd --validate

# Inference
python examples/generate_videos.py --questions-dir setup/test_assets/ --output-dir ./outputs --model svd

# Evaluation (VBVR-Bench)
python examples/score_videos.py --inference-dir ./outputs
```
VBVR-Bench: 100+ rule-based evaluators with deterministic 0–1 scores and no API calls.
VBVR-Bench
The official test dataset for standardized and rigorous evaluation of video-based visual reasoning models, designed for use with the VBVR-EvalKit evaluation framework.
Current Options
You can load and use VBVR-Bench-Data with HuggingFace datasets:
```shell
# Install datasets library
pip install datasets
```

```python
from datasets import load_dataset

# Load VBVR-Bench-Data
bench_data = load_dataset("Video-Reason/VBVR-Bench-Data")

# Access test split
test_data = bench_data["test"]
print(f"Test samples: {len(test_data)}")
```
The dataset contains 500 test samples, organized by domain (In-Domain_50, Out-of-Domain_50) and by task generator, enabling reproducible evaluation of video reasoning capabilities.
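Once loaded, samples can be grouped by split for per-domain analysis. The sketch below is illustrative only: the field names (`domain`, `task_generator`) and toy records are assumptions, so check the actual schema of VBVR-Bench-Data after loading.

```python
from collections import defaultdict

def group_by_domain(samples):
    """Group benchmark samples by their domain label.

    The "domain" field name is hypothetical; inspect the real
    VBVR-Bench-Data schema before relying on it.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["domain"]].append(sample)
    return dict(groups)

# Toy stand-in records mirroring the two documented splits.
samples = [
    {"domain": "In-Domain_50", "task_generator": "maze"},
    {"domain": "Out-of-Domain_50", "task_generator": "stack"},
    {"domain": "In-Domain_50", "task_generator": "rotate"},
]
groups = group_by_domain(samples)
print({k: len(v) for k, v in groups.items()})
```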
Leaderboard
Model performance rankings on the VBVR benchmarks.
VBVR-Bench Evaluation Framework
VBVR-Bench provides a systematic, reproducible, and explainable evaluation framework for video reasoning models. Our evaluation employs a fully rule-based scoring system that ensures deterministic and interpretable assessment. The benchmark consists of 100 diverse tasks organized into two splits: 50 in-domain tasks (testing generalization within seen task categories) and 50 out-of-domain tasks (testing transfer to novel task structures). Each task includes 5 test samples, totaling 500 test cases across the full benchmark.
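As a quick sanity check, the stated composition can be tallied directly; the totals below follow from the two 50-task splits and 5 samples per task, matching the 500 released test samples.

```python
# Benchmark composition as stated: two 50-task splits, 5 samples per task.
splits = {"in_domain": 50, "out_of_domain": 50}
samples_per_task = 5

total_tasks = sum(splits.values())          # 50 + 50 = 100 tasks
total_samples = total_tasks * samples_per_task  # 100 * 5 = 500 test cases
print(total_tasks, total_samples)
```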
Five Cognitive Capabilities
Model performance is evaluated across five foundational cognitive faculties: Perception (extraction of structured representations from sensory input), Transformation (manipulation and synthesis of mental representations), Spatiality (representation of places and geometric relationships), Abstraction (distillation of generalizable knowledge from experiences), and Knowledge (propositional truth statements, both learned and intrinsic). Each task is categorized under one of these capabilities, enabling granular analysis of model strengths and weaknesses.
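Since each task maps to exactly one capability, per-capability scores can be obtained by averaging task scores within each bucket. This is a minimal sketch of that aggregation, not the toolkit's actual reporting code; the input format (`task id -> (capability, score)`) is an assumption.

```python
from collections import defaultdict

CAPABILITIES = {"Perception", "Transformation", "Spatiality", "Abstraction", "Knowledge"}

def capability_breakdown(task_scores):
    """Average per-task 0-1 scores within each cognitive capability.

    `task_scores` maps task id -> (capability, score); the exact result
    format produced by VBVR-EvalKit may differ.
    """
    buckets = defaultdict(list)
    for capability, score in task_scores.values():
        assert capability in CAPABILITIES, f"unknown capability: {capability}"
        buckets[capability].append(score)
    return {cap: sum(s) / len(s) for cap, s in buckets.items()}

# Hypothetical task scores for illustration.
scores = {
    "maze_solving": ("Spatiality", 0.8),
    "object_count": ("Perception", 0.6),
    "mirror_view": ("Spatiality", 0.4),
}
print(capability_breakdown(scores))
```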
Evaluation Dimensions
Each task is scored across multiple weighted dimensions including spatial accuracy, trajectory correctness, temporal consistency, and logical validity. The rule-based framework provides granular verifiability, allowing precise measurement even at the pixel or object-property level. Our human preference alignment experiments demonstrate strong agreement between automated scores and human judgments, with a Spearman's correlation coefficient of ρ > 0.9.
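The weighted-dimension scoring described above can be sketched as a simple weighted sum. The dimension names and weights below are illustrative assumptions; each VBVR-Bench task defines its own rule-based dimensions and weighting.

```python
def weighted_task_score(dimension_scores, weights):
    """Combine per-dimension 0-1 scores into a single 0-1 task score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[dim] * score for dim, score in dimension_scores.items())

# Hypothetical per-dimension scores and weights for one task.
dims = {
    "spatial_accuracy": 0.9,
    "trajectory_correctness": 0.7,
    "temporal_consistency": 1.0,
    "logical_validity": 0.5,
}
weights = {
    "spatial_accuracy": 0.3,
    "trajectory_correctness": 0.3,
    "temporal_consistency": 0.2,
    "logical_validity": 0.2,
}
print(weighted_task_score(dims, weights))  # 0.27 + 0.21 + 0.2 + 0.1 = 0.78
```

Keeping the combination a plain weighted sum is what makes the final score deterministic and easy to audit: each dimension's contribution can be traced back to a specific rule.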
Current Benchmark Results
VBVR-Bench has evaluated 8 state-of-the-art models, including 4 open-source models (CogVideoX1.5, HunyuanVideo, Wan2.2, LTX-2) and 4 proprietary models (Veo 3.1, Sora 2, Kling 2.6, Runway Gen-4). The current best-performing model, VBVR-Wan2.2, achieves an overall score of 0.685, representing an 84.6% relative improvement over its base model. However, a considerable gap to human performance (0.974) remains, highlighting persistent challenges in long-horizon temporal reasoning and robust symbolic manipulation.
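As a back-of-envelope check on the reported numbers: an 84.6% relative improvement at an overall score of 0.685 implies a base-model score of roughly 0.685 / 1.846 ≈ 0.371. The base score itself is not stated here, so the value below is inferred, not quoted.

```python
def relative_improvement(finetuned, base):
    """Relative improvement of a finetuned score over its base score."""
    return (finetuned - base) / base

# Inferred base score consistent with the reported 84.6% improvement.
implied_base = 0.685 / 1.846
print(round(relative_improvement(0.685, implied_base), 3))  # ≈ 0.846
```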
Submit Your Models
Researchers and developers will soon be able to submit their own models for evaluation on our leaderboard. Stay tuned for submission guidelines and API access.
Hidden Test Sets
To ensure fair and robust evaluation, 50 tasks are reserved as a hidden set for future leaderboard evaluation. These tasks are not publicly available, preventing overfitting and ensuring that rankings reflect genuine generalization capabilities rather than task-specific memorization.