[CLS] is Not Enough:
Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

1Yunnan Normal University, Kunming, China
2Kunming University of Science and Technology, Kunming, China
ICML 2026
[Figure: PIAA framework overview]

Overview of PIAA: Instead of relying on a single global [CLS] token, our framework formulates prediction as Patch-level Inference followed by Adaptive Aggregation. It narrows the vision-language modality gap without training and adaptively fuses patch-level scores for final multi-label output.

Abstract

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns.

To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision–language modality gap.

We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is training-free: it requires no gradient updates or parameter fine-tuning. Experiments show that our method improves over representative baselines with minimal extra computation, with an mAP gain of more than 6% on the challenging NUS-WIDE benchmark.

Methodology

1. Patch-based Visual Classifier Learning (PVCL)

To bridge the vision-language modality gap without backpropagation, PVCL learns a visual classifier directly from the patch embeddings. It uses entropy-driven patch selection to isolate discriminative patch representations, then models them with Gaussian Discriminant Analysis (GDA) to obtain robust, visually aligned decision boundaries.
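The two PVCL steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the function names `select_confident_patches` and `fit_gda`, the parameters `temperature`, `keep_ratio`, and `shrinkage`, the use of zero-shot text similarity to pseudo-label patches, and the choice of a single covariance matrix shared across classes (which makes the GDA decision rule linear).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_confident_patches(patch_feats, text_feats, temperature=0.01, keep_ratio=0.2):
    """Keep the lowest-entropy patches and pseudo-label them (assumed selection rule)."""
    probs = softmax(patch_feats @ text_feats.T / temperature)   # (N, C)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)      # (N,)
    n_keep = max(1, int(keep_ratio * len(patch_feats)))
    keep = np.argsort(entropy)[:n_keep]                         # most confident patches
    return patch_feats[keep], probs[keep].argmax(axis=1)

def fit_gda(feats, labels, num_classes, shrinkage=1e-2):
    """Shared-covariance GDA; returns linear weights W (C, D) and biases b (C,)."""
    dim = feats.shape[1]
    means = np.zeros((num_classes, dim))
    priors = np.full(num_classes, 1e-12)        # tiny prior for classes with no patches
    centered = np.zeros_like(feats)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            means[k] = feats[mask].mean(axis=0)
            priors[k] = mask.mean()
            centered[mask] = feats[mask] - means[k]
    # Pooled within-class covariance, regularised so it is always invertible.
    cov = centered.T @ centered / len(feats) + shrinkage * np.eye(dim)
    precision = np.linalg.inv(cov)
    W = means @ precision                                        # rows are Sigma^-1 mu_k
    b = -0.5 * np.einsum("kd,kd->k", W, means) + np.log(priors)
    return W, b

# Toy data: 3 hypothetical classes, 8-dimensional unit-norm embeddings.
rng = np.random.default_rng(0)
text_feats = np.eye(3, 8)
labels_true = rng.integers(0, 3, size=300)
patch_feats = text_feats[labels_true] + 0.1 * rng.normal(size=(300, 8))
patch_feats /= np.linalg.norm(patch_feats, axis=1, keepdims=True)

sel_feats, pseudo = select_confident_patches(patch_feats, text_feats)
W, b = fit_gda(sel_feats, pseudo, num_classes=3)
patch_scores = patch_feats @ W.T + b            # (N, C) patch-level class scores
accuracy = (patch_scores.argmax(axis=1) == labels_true).mean()
```

No gradients are involved: the classifier is obtained in closed form from class means and a covariance estimate, which is consistent with the training-free claim. With per-class covariances the rule would become quadratic; the shared-covariance variant is used here only to keep the sketch short.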

2. Prediction Adaptive Aggregation (PAA)

PAA max-pools the patch-level responses for each class, so that small foreground objects are captured without discarding spatial evidence, and then merges these localized patch predictions with the global [CLS] prediction, which serves as an anchor.
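The pooling-and-fusion idea can be sketched as follows. This is a simplified illustration, not the paper's aggregation rule: the function name `aggregate_predictions`, the `top_k` parameter, and in particular the fixed fusion weight `alpha` are hypothetical stand-ins, since the summary above does not specify how the adaptive weighting is computed.

```python
import numpy as np

def aggregate_predictions(patch_scores, cls_scores, top_k=1, alpha=0.5):
    """Fuse patch-level and global scores.

    patch_scores: (N, C) scores for N patches and C classes.
    cls_scores:   (C,) scores from the global [CLS] token.
    alpha:        hypothetical fixed fusion weight (an assumption).
    """
    top_k = min(top_k, patch_scores.shape[0])
    # Sort each class column and average its top_k strongest responses;
    # top_k=1 is plain max-pooling over patches.
    pooled = np.sort(patch_scores, axis=0)[-top_k:].mean(axis=0)
    return alpha * pooled + (1.0 - alpha) * cls_scores

# Toy example: class 1 is weak globally but strong in a single patch.
patch_scores = np.array([[0.1, 0.9, 0.2],
                         [0.8, 0.1, 0.1],
                         [0.2, 0.2, 0.1]])
cls_scores = np.array([0.6, 0.2, 0.1])
fused = aggregate_predictions(patch_scores, cls_scores)
```

In the toy example, the second class scores only 0.2 from the global token but 0.9 in one patch, so its fused score rises to 0.55. This is the kind of small-object case that a single global representation tends to miss.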

Experimental Results

Citation

@misc{wang2026clsenoughmultilabelrecognition,
      title={[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation}, 
      author={Akang Wang and Xili Deng and Zhanxuan Hu and Yi Zhao and Yonghang Tai and Huafeng Li},
      year={2026},
      eprint={2605.25821},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.25821}, 
}