Figure 1. Left: Failure cases of leading MLLMs, such as LLaVA-OV, Qwen2-VL, GPT-4o, and Gemini Pro 1.5, on basic face-understanding questions. Right: Performance comparison of the top models across the 14 tasks included in the benchmark.
MLLMs are increasingly deployed as the central processing component in a range of advanced applications, including virtual-reality headsets, embodied AI, driving safety, authentication, human-computer interaction, and sports analysis. In these applications, face images appear frequently, and accurate face understanding is crucial for producing appropriate responses. However, the face understanding capabilities of existing MLLMs are limited: they often fail to answer basic questions such as "What is the expression of the person in this image?" or "Which of the following regions is not present in the face image?" (see Figure 1, left). These shortcomings indicate significant room for improvement.
The key contributions of our work are as follows:
Figure 2. FaceXBench examples cover a total of 14 tasks addressing various aspects of face understanding. Each question may include a single image or multiple images. Every question has four options, with only one correct answer. The options are designed so that the model must analyze the image(s) carefully before selecting an answer.
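To make the question format concrete, below is a minimal sketch of how a single benchmark item could be represented in code. The class name, field names, and example values are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FaceXBenchItem:
    """Hypothetical representation of one FaceXBench question (illustrative only)."""
    images: List[str]   # paths to one or more face images referenced by the question
    question: str       # the question text
    options: List[str]  # exactly four answer choices
    answer: str         # letter of the single correct option, e.g. "A"
    task: str           # one of the 14 face-understanding tasks

# Example item, with a made-up file name and task label.
example = FaceXBenchItem(
    images=["face_001.jpg"],
    question="What is the expression of the person in this image?",
    options=["A. Happy", "B. Sad", "C. Angry", "D. Neutral"],
    answer="A",
    task="facial expression recognition",
)
```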
Distribution of questions in FaceXBench across different categories and sub-categories.
FaceXBench key statistics.
FaceXBench source dataset distribution.
Table 1. Results of different models on FaceXBench. We group the open-source models into three categories based on parameter count: (a) open-source MLLMs (<4B parameters), (b) open-source MLLMs (4B-13B parameters), and (c) open-source MLLMs (>13B parameters). Additionally, we evaluate (d) proprietary models. The best model in each category is highlighted in bold.
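As a reference for the grouping used in Table 1, the sketch below buckets models by parameter count into the three open-source size categories; the model names and sizes are placeholders, not results from the benchmark.

```python
def size_category(params_in_billions: float) -> str:
    """Map a parameter count to the Table 1 open-source size category."""
    if params_in_billions < 4:
        return "Open-source MLLMs (<4B parameters)"
    if params_in_billions <= 13:
        return "Open-source MLLMs (4B-13B parameters)"
    return "Open-source MLLMs (>13B parameters)"

# Placeholder model names and sizes, purely for illustration.
for name, size_b in {"Tiny-MLLM": 2.0, "Mid-MLLM": 7.0, "Large-MLLM": 34.0}.items():
    print(f"{name}: {size_category(size_b)}")
```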
Coming Soon ... !!!