HumanVBench is a benchmark designed to evaluate the human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs). Traditional benchmarks focus on object and action recognition, often neglecting the nuances of human emotions, behaviors, and speech-visual alignment. HumanVBench addresses these gaps with 16 tasks that probe both inner emotions and outer manifestations, spanning static and dynamic, basic and complex, and single-modal and cross-modal aspects. It uses automated pipelines for video annotation and QA generation, minimizing dependence on human annotation. Evaluating 22 state-of-the-art video MLLMs, HumanVBench reveals persistent limitations in cross-modal and emotion perception, highlighting the need for further refinement. The benchmark is open-sourced to facilitate advances in video MLLMs.
Overview: Large-scale, open-source benchmark and dataset (HumanVBench) for evaluating state-of-the-art video Multimodal Large Language Models.
Primary use: Evaluating video MLLMs, in AI research and video analysis.
Distinguishing features: Human-centric evaluation and cross-modal tasks.
Weaknesses revealed: Emotion perception and cross-modal understanding.
Target users: Researchers, research institutions, and the broader research community.
Scope: Focuses on specific human-centric tasks; intended as a research tool for research environments.
Compatibility: Compatible with and integrates into video MLLM pipelines; evaluation cost scales with model size.
Requirements: Standard computing resources; runs on Linux, Windows, and macOS.
Interface: Command-line.
Support: Community support.
License: Open-source.
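The evaluation loop for a multiple-choice video QA benchmark of this kind can be sketched as below. This is an illustrative sketch, not the official HumanVBench harness: the JSON schema, the field names (`task`, `video`, `question`, `options`, `answer`), and the `model_answer` callable are assumptions for the example.

```python
import json
from collections import defaultdict

def evaluate(benchmark_path, model_answer):
    """Score a model on multiple-choice video QA items, grouped per task.

    Assumptions (not from the official release):
    - `benchmark_path` points to a JSON list of items, each holding
      `task`, `video`, `question`, `options`, and the gold `answer`.
    - `model_answer(video, question, options)` returns the chosen
      option label, e.g. "A".
    """
    with open(benchmark_path) as f:
        items = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = model_answer(item["video"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred == item["answer"]:
            correct[item["task"]] += 1

    # Per-task accuracy, so that e.g. emotion-perception tasks can be
    # compared against cross-modal tasks for the same model.
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task rather than one aggregate score matches how such benchmarks expose task-specific weak spots, such as the cross-modal and emotion-perception gaps noted above.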