Kostyumov, V. V.
Nutfullin, B. M.
Pilipenko, O. G.
Article History
Received: 6 February 2025
Revised: 7 March 2025
Accepted: 7 April 2025
First Online: 17 October 2025
CONFLICT OF INTEREST
The authors of this work declare that they have no conflicts of interest.
LIMITATIONS
In this study, we consider only multiple-choice visual question answering (VQA) tasks, which cover only a subset of the benchmarks used to evaluate multimodal models. Furthermore, we explore only the standard query generation strategy for VQA from LLaVA. Since most of the VLMs considered do not support multi-image input, we do not experiment with using demonstrations in queries; however, doing so could reveal new patterns in VLM uncertainty. An important direction for future research on multimodal models is to study their uncertainty in other vision-language tasks, such as open-ended VQA, image captioning, and visual grounding.