=Resources
- Our ViLP dataset is hosted on [Huggingface]; a minimal loading sketch follows this list.
- Our ViLP evaluation code is released on [Github].
- Our ImageDPO finetuned models are released as [LLaVA-v1.5-13b-ImageDPO] and [LLaVA-v1.5-7b-ImageDPO].
- Our ImageDPO finetuning pipeline code (data synthesis & finetuning) is released on [Github].
- Our ImageDPO training data is released on [Huggingface].
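If you prefer to pull the dataset programmatically, the sketch below uses the Hugging Face `datasets` library. The repository id shown is a placeholder, not the actual id; substitute the one from the [Huggingface] link above.

```python
from datasets import load_dataset

# "YOUR_ORG/ViLP" is a placeholder repository id; replace it with the
# actual id from the [Huggingface] link above.
ds = load_dataset("YOUR_ORG/ViLP")

print(ds)                      # lists the available splits and their columns
first_split = next(iter(ds))   # name of the first split
print(ds[first_split][0])      # peek at one QIA record
```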
=Gallery
=Overview
ViLP is the dataset we use to probe the visual language priors of VLMs by constructing Question-Image-Answer (QIA) triplets that deliberately deviate from the training data distribution. It contains 300 carefully designed questions, each paired with three distinct answers: one Prior Answer and two Test Answers, for a total of 900 QIA triplets. The question context alone leads directly to the Prior Answer, whereas the two Test Answers are crafted to challenge these priors by requiring both textual and visual cues for accurate reasoning.
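The triplet structure described above can be summarized with a simple record type. The sketch below is illustrative only: the field names and the `answer_type` label are not the dataset's actual schema, and `expand_question` is a hypothetical helper, not code from our release.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QIATriplet:
    """One Question-Image-Answer triplet in ViLP (illustrative fields)."""
    question: str      # question text, shared by all three triplets of a question
    image: str         # path or URL of the image paired with this answer
    answer: str        # the Prior Answer or one of the two Test Answers
    answer_type: str   # "prior" or "test" -- an illustrative label, not the real schema

def expand_question(question: str,
                    prior: Tuple[str, str],
                    tests: List[Tuple[str, str]]) -> List[QIATriplet]:
    """Expand one ViLP question (one Prior Answer plus two Test Answers,
    each with its own image) into three QIA triplets."""
    prior_answer, prior_image = prior
    triplets = [QIATriplet(question, prior_image, prior_answer, "prior")]
    triplets += [QIATriplet(question, img, ans, "test") for ans, img in tests]
    return triplets

# 300 questions x 3 answers per question = 900 QIA triplets in total.
```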

