Kostyumov, V. V.
Nutfullin, B. M.
Pilipenko, O. G.
Article History
Received: 6 February 2025
Revised: 7 March 2025
Accepted: 7 April 2025
First Online: 17 October 2025
CONFLICT OF INTEREST
The authors of this work declare that they have no conflicts of interest.
LIMITATIONS
In this study, we consider only multiple-choice visual question answering (VQA) tasks, which cover only a subset of the benchmarks used to evaluate multimodal models. Furthermore, we explore only the standard query generation strategy for VQA from LLaVA. Since most of the VLMs considered do not support multi-image input, we do not experiment with using demonstrations in queries; however, doing so could reveal new patterns in VLM uncertainty. An important direction for future research on multimodal models is to study their uncertainty in other vision-language tasks, such as open-ended VQA, image captioning, and visual grounding.