FaceMoE: Mixture of Experts for Low-Resolution Face Recognition

Johns Hopkins University

Abstract

Low-resolution face recognition remains challenging due to severe degradations in probe images, domain differences between high-resolution gallery and low-resolution probe data, and catastrophic forgetting during low-resolution adaptation. FaceMoE introduces a transformer with Mixture-of-Experts feed-forward blocks and a top-k router that dynamically activates specialized experts for different semantic facial regions. This resolution-aware sparse routing improves feature extraction under degradation while preserving pretrained knowledge and scaling capacity efficiently.

Across eleven datasets (high-quality, mixed-quality, and low-resolution benchmarks), FaceMoE outperforms prior state-of-the-art methods, including strong gains on BRIAR Protocol 3.1, IJB-S, and TinyFace.

Motivation and Contributions

Figure: FaceMoE motivation overview

The figure highlights three challenges in low-resolution face recognition that motivate FaceMoE: (1) severe degradation in probe frames yields weak and unstable identity cues, which makes reliable feature aggregation difficult; (2) a strong domain gap separates high-resolution gallery images from low-resolution probe images, because models rely on different facial cues at different resolutions; and (3) naive fine-tuning on low-resolution data can trigger catastrophic forgetting and reduce pretrained performance on high-quality data.

  • FaceMoE introduces sparse FFN experts in transformer blocks to improve feature extraction for degraded low-resolution probes.
  • A top-k router assigns tokens to specialized experts, enabling resolution-aware routing across facial regions.
  • Modular sparse activation supports adaptation to low-resolution datasets while reducing catastrophic forgetting.
  • Together, these design choices yield strong gains in low-resolution recognition with minimal performance drop on high-quality and mixed-quality benchmarks.

FaceMoE Architecture

Figure: FaceMoE architecture

FaceMoE replaces the standard transformer FFN with multiple expert MLPs and a learnable top-k router. Each token is routed to a sparse subset of experts, producing adaptive feature transformations that capture different facial regions and frequency characteristics. In practice, the model achieves an effective tradeoff at N = 3 experts and k = 2 active experts per token.
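The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the FaceMoE implementation: the class name `SparseMoEFFN`, the ReLU two-layer experts, the random initialization, and the choice to renormalize gate weights with a softmax over only the selected top-k logits are all assumptions made for the example. Only N = 3 experts and k = 2 active experts per token come from the text.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix; stands in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class SparseMoEFFN:
    """Hypothetical sparse MoE feed-forward layer with a top-k router."""

    def __init__(self, dim, hidden, num_experts=3, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one linear score per expert for each token.
        self.router = rand_matrix(num_experts, dim, rng)
        # Each expert is a two-layer MLP (assumed ReLU activation).
        self.experts = [
            (rand_matrix(hidden, dim, rng), rand_matrix(dim, hidden, rng))
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return the indices of the top-k experts and their gate weights."""
        logits = matvec(self.router, token)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        # Assumption: gates are a softmax over the selected logits only.
        gates = softmax([logits[i] for i in top])
        return top, gates

    def forward(self, token):
        """Gate-weighted sum of the outputs of the selected experts."""
        top, gates = self.route(token)
        out = [0.0] * len(token)
        for idx, gate in zip(top, gates):
            W1, W2 = self.experts[idx]
            hidden = [max(0.0, v) for v in matvec(W1, token)]
            expert_out = matvec(W2, hidden)
            out = [o + gate * v for o, v in zip(out, expert_out)]
        return out


layer = SparseMoEFFN(dim=4, hidden=8)        # N = 3, k = 2 by default
token = [0.5, -1.0, 0.25, 2.0]               # one toy patch embedding
selected, gates = layer.route(token)
output = layer.forward(token)
```

Only the two selected experts run for a given token, so per-token compute stays close to that of a dense FFN pair while the layer holds the parameters of all three experts.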

Training uses a composite objective: a face recognition loss (CosFace) plus a router z-loss and a load-balancing loss, which stabilize routing and help avoid expert collapse.
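A rough sketch of such a composite objective follows. The source names the three terms but not their exact forms or weights, so everything specific here is an assumption: the CosFace scale `s` and margin `m` defaults, the z-loss written as the mean squared log-sum-exp of router logits, the load-balancing term written as the number of experts times the dot product of dispatch fractions and mean router probabilities, and the hypothetical weights `z_weight` and `balance_weight`.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def cosface_loss(cosines, label, s=64.0, m=0.35):
    """CosFace for one sample: subtract margin m from the target cosine,
    scale all cosines by s, then apply softmax cross-entropy."""
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cosines)]
    return logsumexp(logits) - logits[label]


def router_z_loss(router_logits):
    """Mean squared log-sum-exp of router logits; penalizes large logits."""
    return sum(logsumexp(l) ** 2 for l in router_logits) / len(router_logits)


def load_balancing_loss(router_logits, top_k=2):
    """num_experts * sum_i(dispatch_fraction_i * mean_router_prob_i)."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in router_logits:
        lse = logsumexp(logits)
        order = sorted(range(n_experts), key=lambda i: logits[i], reverse=True)
        for i in order[:top_k]:
            counts[i] += 1
        for i, x in enumerate(logits):
            mean_prob[i] += math.exp(x - lse) / n_tokens
    frac = [c / (n_tokens * top_k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


def total_loss(cosines, label, router_logits, z_weight=1e-3, balance_weight=1e-2):
    """Composite objective with hypothetical weighting coefficients."""
    return (
        cosface_loss(cosines, label)
        + z_weight * router_z_loss(router_logits)
        + balance_weight * load_balancing_loss(router_logits)
    )


# Toy inputs: cosine similarities to 3 identities, router logits for 4 tokens.
cosines = [0.9, 0.1, 0.0]
uniform_router = [[0.0, 0.0, 0.0]] * 4
skewed_router = [[5.0, 0.0, 0.0]] * 4   # router strongly favors expert 0
loss_value = total_loss(cosines, 0, skewed_router)
```

In this formulation the load-balancing term is smallest when router probabilities are spread evenly across experts and grows when the router concentrates probability on experts that already receive most tokens, which is the collapse the text refers to.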

Quantitative Results

Figure: BRIAR and IJB-S comparisons

Comparison against prior methods on BRIAR Protocol 3.1 and IJB-S demonstrates strong low-resolution recognition performance and robust surveillance-domain generalization.

Qualitative Analysis

Figure: FaceMoE expert activation maps

Expert activation maps show semantically meaningful specialization and improved routing behavior after low-resolution fine-tuning, supporting robust identity extraction from degraded faces.

BibTeX

Coming soon ...