VisualWebBench

How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?


Carnegie Mellon University    The Chinese University of Hong Kong    Peking University
MBZUAI    Allen Institute for AI

*Equal Contribution
†Correspondence to: jpliu@link.cuhk.edu.hk, xyue2@andrew.cmu.edu

We introduce VisualWebBench, a multimodal benchmark designed to assess the understanding and grounding capabilities of MLLMs in web scenarios. VisualWebBench consists of seven tasks and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14 open-source MLLMs as well as Gemini Pro, Claude 3, and GPT-4V(ision) on VisualWebBench, revealing significant challenges and performance gaps. Further analysis highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs. We believe VisualWebBench will serve as a valuable resource for the research community and contribute to the creation of more powerful and versatile MLLMs for web-related applications.

Update

2024/10/18: We introduce 🤗 MultiUI, 7.3M general multimodal instructions synthesized from web UIs using text-based LLMs, enhancing both UI-related and Doc/OCR/chart understanding tasks.

Overview

We introduce VisualWebBench, a comprehensive multimodal benchmark designed to assess the capabilities of MLLMs in the web domain. Inspired by how humans interact with web browsers, VisualWebBench consists of seven tasks that map to core abilities required for web tasks: captioning, webpage QA, heading OCR, element OCR, element grounding, action prediction, and action grounding, as detailed in the figure. The benchmark comprises 1.5K instances, all uniformly formulated in the QA style, making it easy to evaluate and compare the performance of different MLLMs.
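Because every instance is formulated in the QA style, a typical workflow is to load a task split and inspect an instance. The sketch below assumes the benchmark is hosted on the Hugging Face Hub and loadable with the `datasets` library; the dataset ID, config name, and field names are placeholders rather than identifiers confirmed by the release.

```python
from datasets import load_dataset

# Dataset ID and config name are assumptions based on the benchmark description;
# substitute the identifiers from the official release.
bench = load_dataset("visualwebbench/VisualWebBench", "webqa", split="test")

example = bench[0]
# Each instance pairs a webpage screenshot with a QA-style prompt;
# the field names printed below are assumed, not taken from the release.
print(example.keys())
print(example.get("question"))
```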

The proposed VisualWebBench possesses the following features:
  • Comprehensiveness: VisualWebBench spans 139 websites with 1.5K samples, covering 12 domains (e.g., travel, sports, hobby, lifestyle, animals, and science) and 87 sub-domains.
  • Multi-granularity: VisualWebBench assesses MLLMs at three levels: website-level, element-level, and action-level.
  • Multi-task: VisualWebBench encompasses seven tasks designed to evaluate the understanding, OCR, grounding, and reasoning capabilities of MLLMs.
  • High quality: Quality is ensured through careful human verification and curation efforts.

Experimental Results

We evaluate 14 open-source general MLLMs on VisualWebBench. By default, for each model family, we use the largest available checkpoint. We consider three scales of LLaVA (7B, 13B, and 34B) for model scaling analysis. Several strong closed-source MLLMs, Gemini Pro, the Claude series, and GPT-4V(ision), are also included for evaluation. In addition, we evaluate two GUI agent MLLMs, CogAgent and SeeClick, on VisualWebBench.
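Since all tasks share the QA formulation, a minimal evaluation harness reduces to prompting a model with the screenshot and question and scoring the reply. The sketch below is only illustrative: `query_mllm` is a hypothetical adapter around whichever model API is under test, the field names are assumed, and the benchmark's actual scoring is task-specific (e.g., text-similarity metrics for captioning/OCR, accuracy for multiple-choice tasks).

```python
from typing import Callable, Iterable

def exact_match_accuracy(dataset: Iterable[dict],
                         query_mllm: Callable[[object, str], str]) -> float:
    """Score a model on QA-style instances with simple exact match.

    query_mllm(image, prompt) -> str is a hypothetical wrapper around the
    MLLM being evaluated (GPT-4V, Claude, an open-source model, ...).
    Field names ("image", "question", "answer") are assumptions about the schema.
    """
    total, correct = 0, 0
    for ex in dataset:
        prediction = query_mllm(ex["image"], ex["question"])
        correct += int(prediction.strip().lower() == str(ex["answer"]).strip().lower())
        total += 1
    return correct / max(total, 1)
```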

We highlight the following findings:
  • Challenging Nature of Web Tasks: Even the most powerful MLLMs, GPT-4V and Claude Sonnet, achieve average scores of only 64.6 and 65.8, respectively, leaving ample room for improvement.
  • Disparity between Open-source and Proprietary MLLMs: GPT-4V and Claude outperform open-source MLLMs, including GUI agent MLLMs, by a large margin, highlighting a discernible gap in the capabilities of current open-source MLLMs compared to proprietary ones.
  • Relatively strong correlation with general understanding benchmarks like MMMU but weak correlation with web agent benchmarks like Mind2Web: MLLMs' abilities on web agent tasks, such as Mind2Web, do not correlate strongly with their performance on VisualWebBench, highlighting the importance of web understanding benchmarks like VisualWebBench.
  • Importance of Image Resolution: The limited image resolution handling capabilities of most open-source MLLMs restrict their utility in web scenarios, where rich text and elements are prevalent.
  • Weak Grounding Ability: Grounding ability, a crucial skill for developing MLLM-based web applications like autonomous web agents, is a weakness for most MLLMs; a sketch of how grounding is commonly scored follows this list.
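To make the grounding finding concrete, one common convention for scoring element grounding is to check whether a model's predicted click point falls inside the target element's bounding box. The sketch below illustrates that generic convention under assumed normalized coordinates; it is not VisualWebBench's official scoring protocol.

```python
def point_in_box(point: tuple[float, float],
                 box: tuple[float, float, float, float]) -> bool:
    """Return True if a predicted (x, y) click lands inside (x1, y1, x2, y2).

    Assumes the point and box share the same (e.g., normalized [0, 1])
    coordinate frame; this mirrors a common grounding metric, not the
    benchmark's exact protocol.
    """
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


# Example: a prediction near the center of a button-sized box counts as grounded.
assert point_in_box((0.52, 0.31), (0.45, 0.28, 0.60, 0.34))
```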

Case Study

BibTeX

@misc{liu2024visualwebbench,
      title={VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?}, 
      author={Junpeng Liu and Yifan Song and Bill Yuchen Lin and Wai Lam and Graham Neubig and Yuanzhi Li and Xiang Yue},
      year={2024},
      eprint={2404.05955},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}