=Overview

ViLP is the dataset we use to probe the visual language priors of VLMs: it consists of Question-Image-Answer (QIA) triplets deliberately constructed to deviate from the training data distribution. It contains 300 carefully designed questions, each paired with three distinct answers: one Prior Answer and two Test Answers, for a total of 900 QIA triplets. The question context alone directly suggests the Prior Answer; in contrast, the two Test Answers are crafted to challenge these priors by requiring both textual and visual cues for accurate reasoning.
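
To make the QIA structure concrete, here is a minimal Python sketch of one way a ViLP question and its three triplets could be represented. The class and field names (`ViLPQuestion`, `QIATriplet`, `answer_type`, and so on) are illustrative assumptions, not the dataset's actual schema, and the placeholder strings stand in for real question, image, and answer content.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QIATriplet:
    """One Question-Image-Answer triplet (illustrative schema, not the official one)."""
    question: str
    image_path: str
    answer: str
    answer_type: str  # "prior" or "test" (assumed labels)

@dataclass
class ViLPQuestion:
    """One ViLP question with its three paired answers/images (assumed structure)."""
    question: str
    triplets: List[QIATriplet] = field(default_factory=list)

# One question yields three QIA triplets (1 prior + 2 test),
# so 300 questions -> 900 triplets, matching the dataset description.
q_text = "<question whose context alone suggests the prior answer>"
q = ViLPQuestion(
    question=q_text,
    triplets=[
        QIATriplet(q_text, "prior_image.png", "<prior answer>", "prior"),
        QIATriplet(q_text, "test_image_1.png", "<test answer 1>", "test"),
        QIATriplet(q_text, "test_image_2.png", "<test answer 2>", "test"),
    ],
)
assert len(q.triplets) == 3
```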

To reduce reliance on the visual language priors acquired during training, we propose a new pipeline and objective called ImageDPO, a self-improving approach that enhances VLM visual reasoning performance by increasing the model's reliance on visual inputs.
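
As a rough sketch of what a DPO-style objective over images could look like, the snippet below contrasts the same question-answer pair conditioned on an intact image versus a degraded one, so the model is pushed to prefer the answer when the visual evidence actually supports it. This is our own hedged illustration: the pairing of intact versus corrupted images, the function name `image_dpo_loss`, and the `beta` value are assumptions for exposition, not the exact ImageDPO formulation.

```python
import torch
import torch.nn.functional as F

def image_dpo_loss(
    logp_policy_good: torch.Tensor,  # log pi_theta(answer | question, intact image)
    logp_policy_bad: torch.Tensor,   # log pi_theta(answer | question, degraded image)
    logp_ref_good: torch.Tensor,     # log pi_ref(answer | question, intact image)
    logp_ref_bad: torch.Tensor,      # log pi_ref(answer | question, degraded image)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO-style loss where the chosen/rejected pair differs in the image,
    not the text response (a sketch of the idea, not the exact objective)."""
    # Implicit reward margins relative to a frozen reference model.
    margin_good = logp_policy_good - logp_ref_good
    margin_bad = logp_policy_bad - logp_ref_bad
    # Encourage a higher answer likelihood when conditioned on the intact
    # image than on the degraded one.
    return -F.logsigmoid(beta * (margin_good - margin_bad)).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs.
logp_policy_good = torch.randn(4, requires_grad=True)
logp_policy_bad = torch.randn(4, requires_grad=True)
logp_ref_good = torch.randn(4)
logp_ref_bad = torch.randn(4)
loss = image_dpo_loss(logp_policy_good, logp_policy_bad, logp_ref_good, logp_ref_bad)
loss.backward()
print(float(loss))
```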