VBVR-EvalKit
The official evaluation toolkit for Very Big Video Reasoning (VBVR). Unified inference and evaluation across 37 video generation models, including commercial APIs (Luma, Veo, Kling, Sora, Runway) and open-source models (LTX-Video, LTX-2, HunyuanVideo, SVD, WAN, CogVideoX, and more).
Quick Start
You can evaluate video models using VBVR-EvalKit:
```shell
# Install
git clone https://github.com/Video-Reason/VBVR-EvalKit.git && cd VBVR-EvalKit
python -m venv venv && source venv/bin/activate
pip install -e .

# Set up a model
bash setup/install_model.sh --model svd --validate

# Inference
python examples/generate_videos.py --questions-dir setup/test_assets/ --output-dir ./outputs --model svd

# Evaluation (VBVR-Bench)
python examples/score_videos.py --inference-dir ./outputs
```
VBVR-Bench: 100+ rule-based evaluators with deterministic 0–1 scores and no API calls.
VBVR-Bench
The official test dataset for standardized and rigorous evaluation of video-based visual reasoning models, designed for use with the VBVR-EvalKit evaluation framework.
Current Options
You can load and use VBVR-Bench-Data with HuggingFace datasets:
```shell
# Install datasets library
pip install datasets
```

```python
from datasets import load_dataset

# Load VBVR-Bench-Data
bench_data = load_dataset("Video-Reason/VBVR-Bench-Data")

# Access test split
test_data = bench_data["test"]
print(f"Test samples: {len(test_data)}")
```
The dataset contains 500 test samples, organized by domain (In-Domain_50, Out-of-Domain_50) and by task generator, enabling reproducible evaluation of video reasoning capabilities.
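Once loaded, samples can be grouped by split for per-domain analysis. The sketch below is illustrative only: the field names (`domain`, `task_generator`) and toy records are assumptions, so check the actual schema of VBVR-Bench-Data after loading.

```python
from collections import defaultdict

def group_by_domain(samples):
    """Group benchmark samples by their domain label.

    The "domain" field name is hypothetical; inspect the real
    VBVR-Bench-Data schema before relying on it.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["domain"]].append(sample)
    return dict(groups)

# Toy stand-in records mirroring the two documented splits.
samples = [
    {"domain": "In-Domain_50", "task_generator": "maze"},
    {"domain": "Out-of-Domain_50", "task_generator": "stack"},
    {"domain": "In-Domain_50", "task_generator": "rotate"},
]
groups = group_by_domain(samples)
print({k: len(v) for k, v in groups.items()})
```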
Leaderboard
Model performance rankings on the VBVR benchmarks.
VBVR-Bench Evaluation Framework
VBVR-Bench provides a systematic, reproducible, and explainable evaluation framework for video reasoning models. Our evaluation employs a fully rule-based scoring system that ensures deterministic and interpretable assessment. The benchmark consists of 100 diverse tasks organized into two splits: 50 in-domain tasks (testing generalization within seen task categories) and 50 out-of-domain tasks (testing transfer to novel task structures). Each task includes 5 test samples, totaling 500 test cases across the full benchmark.
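As a quick sanity check, the stated composition can be tallied directly; the totals below follow from the two 50-task splits and 5 samples per task, matching the 500 released test samples.

```python
# Benchmark composition as stated: two 50-task splits, 5 samples per task.
splits = {"in_domain": 50, "out_of_domain": 50}
samples_per_task = 5

total_tasks = sum(splits.values())          # 50 + 50 = 100 tasks
total_samples = total_tasks * samples_per_task  # 100 * 5 = 500 test cases
print(total_tasks, total_samples)
```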
Five Cognitive Capabilities
Model performance is evaluated across five foundational cognitive faculties: Perception (extraction of structured representations from sensory input), Transformation (manipulation and synthesis of mental representations), Spatiality (representation of places and geometric relationships), Abstraction (distillation of generalizable knowledge from experiences), and Knowledge (propositional truth statements, both learned and intrinsic). Each task is categorized under one of these capabilities, enabling granular analysis of model strengths and weaknesses.
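Since each task maps to exactly one capability, per-capability scores can be obtained by averaging task scores within each bucket. This is a minimal sketch of that aggregation, not the toolkit's actual reporting code; the input format (`task id -> (capability, score)`) is an assumption.

```python
from collections import defaultdict

CAPABILITIES = {"Perception", "Transformation", "Spatiality", "Abstraction", "Knowledge"}

def capability_breakdown(task_scores):
    """Average per-task 0-1 scores within each cognitive capability.

    `task_scores` maps task id -> (capability, score); the exact result
    format produced by VBVR-EvalKit may differ.
    """
    buckets = defaultdict(list)
    for capability, score in task_scores.values():
        assert capability in CAPABILITIES, f"unknown capability: {capability}"
        buckets[capability].append(score)
    return {cap: sum(s) / len(s) for cap, s in buckets.items()}

# Hypothetical task scores for illustration.
scores = {
    "maze_solving": ("Spatiality", 0.8),
    "object_count": ("Perception", 0.6),
    "mirror_view": ("Spatiality", 0.4),
}
print(capability_breakdown(scores))
```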
Evaluation Dimensions
Each task is scored across multiple weighted dimensions including spatial accuracy, trajectory correctness, temporal consistency, and logical validity. The rule-based framework provides granular verifiability, allowing precise measurement even at the pixel or object-property level. Our human preference alignment experiments demonstrate strong agreement between automated scores and human judgments, with a Spearman's correlation coefficient of ρ > 0.9.
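The weighted-dimension scoring described above can be sketched as a simple weighted sum. The dimension names and weights below are illustrative assumptions; each VBVR-Bench task defines its own rule-based dimensions and weighting.

```python
def weighted_task_score(dimension_scores, weights):
    """Combine per-dimension 0-1 scores into a single 0-1 task score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[dim] * score for dim, score in dimension_scores.items())

# Hypothetical per-dimension scores and weights for one task.
dims = {
    "spatial_accuracy": 0.9,
    "trajectory_correctness": 0.7,
    "temporal_consistency": 1.0,
    "logical_validity": 0.5,
}
weights = {
    "spatial_accuracy": 0.3,
    "trajectory_correctness": 0.3,
    "temporal_consistency": 0.2,
    "logical_validity": 0.2,
}
print(weighted_task_score(dims, weights))  # 0.27 + 0.21 + 0.2 + 0.1 = 0.78
```

Keeping the combination a plain weighted sum is what makes the final score deterministic and easy to audit: each dimension's contribution can be traced back to a specific rule.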
Current Benchmark Results
VBVR-Bench has evaluated 8 state-of-the-art models, including 4 open-source models (CogVideoX1.5, HunyuanVideo, Wan2.2, LTX-2) and 4 proprietary models (Veo 3.1, Sora 2, Kling 2.6, Runway Gen-4). The current best-performing model, VBVR-Wan2.2, achieves an overall score of 0.685, representing an 84.6% relative improvement over its base model. However, a considerable gap to human performance (0.974) remains, highlighting persistent challenges in long-horizon temporal reasoning and robust symbolic manipulation.
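As a back-of-envelope check on the reported numbers: an 84.6% relative improvement at an overall score of 0.685 implies a base-model score of roughly 0.685 / 1.846 ≈ 0.371. The base score itself is not stated here, so the value below is inferred, not quoted.

```python
def relative_improvement(finetuned, base):
    """Relative improvement of a finetuned score over its base score."""
    return (finetuned - base) / base

# Inferred base score consistent with the reported 84.6% improvement.
implied_base = 0.685 / 1.846
print(round(relative_improvement(0.685, implied_base), 3))  # ≈ 0.846
```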
Submit Your Models
Researchers and developers will soon be able to submit their own models for evaluation on our leaderboard. Stay tuned for submission guidelines and API access.
Hidden Test Sets
To ensure fair and robust evaluation, 50 tasks are reserved as a hidden set for future leaderboard evaluation. These tasks are not publicly available, preventing overfitting and ensuring that rankings reflect genuine generalization capabilities rather than task-specific memorization.