FaceXBench: Evaluating Multimodal LLMs on Face Understanding

Johns Hopkins University

Motivation

Figure 1. Left: Failure cases of leading MLLMs, such as LLaVA-OV, Qwen2-VL, GPT-4o and GeminiPro1.5, on basic questions related to face understanding. Right: Performance comparison of top models across the 14 tasks included in the benchmark.

MLLMs are increasingly deployed as central processors in various advanced applications, including virtual-reality headsets, embodied AI, driving safety, authentication, human-computer interaction, and sports analysis. In these applications, accurate face understanding is crucial, as face images appear frequently and must be interpreted correctly for appropriate responses. However, the face understanding capabilities of existing MLLMs are limited; they often fail to answer basic questions such as "What is the expression of the person in this image?" or "Which of the following regions is not present in the face image?" (see Figure 1, left). These shortcomings indicate significant room for improvement.

Contributions

The key contributions of our work are as follows:

  • Introducing FaceXBench: A comprehensive benchmark for evaluating MLLMs' face understanding across 14 tasks in 6 key categories. It includes 5,000 VQA questions derived from 25 public datasets and a newly developed dataset, FaceXAPI.
  • Extensive Evaluation: We evaluate 26 open-source MLLMs and two proprietary models, GPT-4o and GeminiPro1.5, which achieve accuracies of 50.24% and 54.40%, respectively, highlighting the significant challenge posed by FaceXBench and the substantial room for improvement.
  • Analysis and Discussion: We provide a detailed analysis of MLLMs' performance across various aspects of face understanding, identifying areas where current MLLMs fall short. Additionally, we suggest potential research directions that could enhance MLLMs' face understanding.

FaceXBench

Figure 2. FaceXBench examples cover a total of 14 tasks, addressing various aspects of face understanding. Each question may include a single image or multiple images. Every question has four options, with only one correct answer. The options are designed so that the model must analyze the image(s) carefully before selecting an answer.
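Since every question is a four-way multiple-choice item with a single correct answer, accuracy can be computed by extracting the option letter from a model's free-form reply and comparing it to the answer key. The following is a minimal sketch of such scoring; the item fields, function names, and letter-extraction heuristic are illustrative assumptions, not the official FaceXBench evaluation code.

import re
from typing import Iterable, Optional

def extract_choice(response: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) out of a free-form model reply."""
    match = re.search(r"\b([ABCD])\b", response.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(predictions: Iterable[str], answers: Iterable[str]) -> float:
    """Fraction of questions where the extracted letter matches the ground-truth key."""
    pairs = list(zip(predictions, answers))
    correct = sum(extract_choice(pred) == gold for pred, gold in pairs)
    return correct / len(pairs) if pairs else 0.0

if __name__ == "__main__":
    # Toy example: two hypothetical model replies and their answer keys.
    preds = ["The correct answer is B.", "A) Surprise"]
    golds = ["B", "C"]
    print(f"Accuracy: {mcq_accuracy(preds, golds):.2%}")  # -> 50.00%

In practice, responses that contain no recognizable option letter would simply be counted as incorrect under this scheme.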

Benchmark Samples

Results

Table 1. Results of different models on FaceXBench. We group the open-source models into three categories based on parameter count: (a) open-source MLLMs (<4B parameters), (b) open-source MLLMs (4B-13B parameters), and (c) open-source MLLMs (>13B parameters). Additionally, we evaluate (d) proprietary models. The best model in each category is highlighted in bold.

BibTeX

Coming soon!
Acknowledgement: The website template is taken from Nerfies