We propose BenchLMM to investigate the cross-style capability of Large Multimodal Models (LMMs), including their proficiency in analyzing images with diverse artistic styles, processing images acquired from non-RGB cameras, and interpreting images that require specific application knowledge. In this section, we elaborate on how we build a benchmark that encompasses these diverse styles.
Most existing works evaluate LMMs predominantly on images in the 'Photo' style, leaving a gap in our understanding of their performance across diverse artistic styles. We extend the evaluation scope by examining LMMs on various artistic styles beyond the common 'Photo' style. The results, detailed in the table, reveal a notable decline in LMMs' effectiveness when processing these artistic styles. This trend suggests that LMMs overfit to the 'Photo' style and have limited adaptability to varied artistic styles, a capability that humans typically possess. Interestingly, GPT-4V, despite being a strong commercial model, exhibits similar limitations when handling diverse styles.
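To make the style-wise comparison concrete, the minimal Python sketch below scores a model's answers separately for each style folder and returns per-style accuracy. The query_lmm wrapper, the annotations.json layout, and the exact-match scoring are illustrative assumptions for this sketch, not the benchmark's official evaluation code.

from pathlib import Path
from collections import defaultdict
import json

def query_lmm(image_path: Path, question: str) -> str:
    # Placeholder hook (hypothetical): forward the image and question to the
    # LMM under test and return its textual answer.
    raise NotImplementedError

def evaluate_styles(root: Path) -> dict[str, float]:
    # Assumed layout: root/<style>/annotations.json holds a list of
    # {"image": ..., "question": ..., "answer": ...} entries.
    correct, total = defaultdict(int), defaultdict(int)
    for style_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        items = json.loads((style_dir / "annotations.json").read_text())
        for item in items:
            pred = query_lmm(style_dir / item["image"], item["question"])
            correct[style_dir.name] += int(pred.strip().lower() == item["answer"].strip().lower())
            total[style_dir.name] += 1
    # Per-style accuracy: comparing 'photo' against artistic styles exposes
    # the style-overfitting gap discussed above.
    return {style: correct[style] / total[style] for style in total}

Comparing the returned accuracies (e.g., 'photo' versus 'cartoon' or 'painting') would surface the kind of per-style performance drop reported in the table.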
@misc{cai2023benchlmm,
  title={BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models},
  author={Rizhao Cai and Zirui Song and Dayan Guan and Zhenhao Chen and Xing Luo and Chenyu Yi and Alex Kot},
  year={2023},
  eprint={2312.02896},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
This research is supported in part by the Rapid-Rich Object Search (ROSE) Lab of Nanyang Technological University and the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University sponsored by a donation from the Ng Teng Fong Charitable Foundation). We are deeply grateful to Yaohang Li from the University of Technology Sydney for his invaluable assistance in conducting the experiments, and to Jingpu Yang, Helin Wang, Zihui Cui, Yushan Jiang, Fengxian Ji, and Yuxiao Hang from NLULab@NEUQ (Northeastern University at Qinhuangdao, China) for their meticulous efforts in annotating the dataset. We also thank Prof. Miao Fang (PI of NLULab@NEUQ) for his supervision and insightful suggestions during discussions on this project. This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.