=Resources
- Our ViLP dataset is hosted on [Huggingface]; a minimal loading sketch follows this list.
- Our ViLP evaluation code is released on [Github].
- Our ImageDPO finetuned models are released as [LLaVA-v1.5-13b-ImageDPO] and [LLaVA-v1.5-7b-ImageDPO].
- Our ImageDPO finetuning pipeline code (data synthesis & finetuning) is released on [Github].
- Our ImageDPO training data is released on [Huggingface].
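If you prefer to pull the dataset programmatically, the sketch below uses the Hugging Face `datasets` library. The repository id shown is a placeholder, not the actual id; substitute the one from the [Huggingface] link above.

```python
from datasets import load_dataset

# "YOUR_ORG/ViLP" is a placeholder repository id; replace it with the
# actual id from the [Huggingface] link above.
ds = load_dataset("YOUR_ORG/ViLP")

print(ds)                      # lists the available splits and their columns
first_split = next(iter(ds))   # name of the first split
print(ds[first_split][0])      # peek at one QIA record
```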
=Gallery
=Overview
ViLP is the dataset we use to probe the visual language priors of VLMs by constructing Question-Image-Answer (QIA) triplets that deliberately deviate from the training data distribution. It contains 300 carefully designed questions, each paired with three distinct answers: one Prior Answer and two Test Answers, for a total of 900 QIA triplets. The question context alone leads directly to the Prior Answer, whereas the two Test Answers are crafted to challenge these priors by requiring both textual and visual cues for accurate reasoning.
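The triplet structure described above can be summarized with a simple record type. The sketch below is illustrative only: the field names and the `answer_type` label are not the dataset's actual schema, and `expand_question` is a hypothetical helper, not code from our release.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QIATriplet:
    """One Question-Image-Answer triplet in ViLP (illustrative fields)."""
    question: str      # question text, shared by all three triplets of a question
    image: str         # path or URL of the image paired with this answer
    answer: str        # the Prior Answer or one of the two Test Answers
    answer_type: str   # "prior" or "test" -- an illustrative label, not the real schema

def expand_question(question: str,
                    prior: Tuple[str, str],
                    tests: List[Tuple[str, str]]) -> List[QIATriplet]:
    """Expand one ViLP question (one Prior Answer plus two Test Answers,
    each with its own image) into three QIA triplets."""
    prior_answer, prior_image = prior
    triplets = [QIATriplet(question, prior_image, prior_answer, "prior")]
    triplets += [QIATriplet(question, img, ans, "test") for ans, img in tests]
    return triplets

# 300 questions x 3 answers per question = 900 QIA triplets in total.
```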

