We introduce AoT-PsyPhyBENCH, a psychophysically grounded benchmark that tests whether vision–language models (VLMs) can judge the arrow of time in natural videos (forward vs. backward), using the same video stimuli and human performance baselines as a prior human psychophysics study. Across open-weight and proprietary VLMs, most models perform near chance and fall well short of human accuracy, especially on physically irreversible processes (e.g., free fall, diffusion) and causal manual actions (e.g., dividing or adding material). These results indicate a gap in temporal and causal inductive biases despite the models' strong visual–semantic capabilities.
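Concretely, each clip poses a binary forward/backward judgment. Below is a minimal sketch of such a two-alternative forced-choice evaluation loop; the prompt wording and the `query_vlm` client are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a two-alternative forced-choice (2AFC) arrow-of-time
# evaluation loop. `query_vlm` is a hypothetical stand-in for whatever
# client sends a video plus a text prompt to the model under test; the
# prompt wording here is illustrative, not the paper's exact wording.

PROMPT = (
    "This video is played either in its original direction or in reverse. "
    "Which way does time flow? Answer with exactly one word: "
    "'forward' or 'backward'."
)

def parse_answer(reply: str) -> str | None:
    """Map a free-form model reply onto the two allowed labels."""
    text = reply.strip().lower()
    if "forward" in text and "backward" not in text:
        return "forward"
    if "backward" in text and "forward" not in text:
        return "backward"
    return None  # unparseable or refused answers are kept as None

def evaluate(samples, query_vlm):
    """samples: iterable of (video_path, gold_label) pairs.
    Returns a list of (gold, predicted) label pairs."""
    records = []
    for video_path, gold in samples:
        reply = query_vlm(video_path, PROMPT)  # hypothetical client call
        records.append((gold, parse_answer(reply)))
    return records
```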
| Category | Description + Example | Reversal easy for humans? | Human F1 (Fwd/Bwd) | # samples | Included in AoT-PsyPhyBENCH? |
|---|---|---|---|---|---|
| (1) Proceed | Forward locomotion of people, animals, or vehicles | ✅ | 86.5 / 82.5 | 82 | Yes |
| (2) Fall | Free-fall / ballistic motion under gravity | ✅ | 86.9 / 82.8 | 84 | Yes |
| (3) Diffusion | Centrifugal diffusion or small-particle explosions | ✅ | 84.6 / 78.7 | 56 | Yes |
| (4) Division | Division of material by hand or tool | ✅ | 86.0 / 80.6 | 37 | Yes |
| (5) Put | Addition / construction of material by hand | ✅ | 84.1 / 77.4 | 67 | Yes |
| (6) Reciprocal | Reciprocating (cyclic) motion | ❌ | 71.6 / 38.5 | 148 | No |
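The Human F1 (Fwd/Bwd) columns report F1 with each playback direction treated in turn as the positive class. A self-contained sketch of that scoring, plus overall accuracy, follows; this is the standard binary F1 definition, which is one natural reading of the Fwd/Bwd columns, though the paper's exact aggregation is not restated here.

```python
def f1_for_label(records, positive):
    """F1 treating `positive` ('forward' or 'backward') as the positive
    class. records: list of (gold, predicted) pairs; unparseable
    predictions (None) count as false negatives for the gold class."""
    tp = sum(1 for g, p in records if g == positive and p == positive)
    fp = sum(1 for g, p in records if g != positive and p == positive)
    fn = sum(1 for g, p in records if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy(records):
    """Fraction of clips whose direction was judged correctly."""
    return sum(1 for g, p in records if g == p) / len(records)

# Example with dummy predictions:
recs = [("forward", "forward"), ("backward", "forward"),
        ("backward", "backward"), ("forward", None)]
print(f1_for_label(recs, "forward"))   # F1 Forward  -> 0.5
print(f1_for_label(recs, "backward"))  # F1 Backward -> 0.667
print(accuracy(recs))                  # Acc.        -> 0.5
```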
| Rank | Family | Model | Reasoning / Setting | F1 Forward | F1 Backward | Acc. |
|---|---|---|---|---|---|---|
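A typical way to produce an entry for this leaderboard is to load the benchmark, run the evaluation loop sketched earlier, and score with the F1/accuracy helpers above. A hedged loading sketch, assuming Hugging Face Hub hosting; the dataset identifier and the `video` / `label` column names are illustrative assumptions, not confirmed by this card.

```python
# Hedged sketch: assumes the benchmark is distributed via the Hugging Face
# Hub. The dataset ID and the "video" / "label" field names are illustrative
# assumptions; consult the actual repository for the real identifiers.
from datasets import load_dataset

ds = load_dataset("user/AoT-PsyPhyBENCH", split="test")  # hypothetical ID
for example in ds.select(range(3)):
    print(example["video"], example["label"])
```

From there, the (video, label) pairs can be fed through `evaluate()` and scored with `f1_for_label()` and `accuracy()` as defined above.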
@misc{matta2025waydoestimeflow,
title={Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models},
author={Shiho Matta and Lis Kanashiro Pereira and Peitao Han and Fei Cheng and Shigeru Kitazawa},
year={2025},
eprint={2510.26241},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.26241},
}