Overview of PIAA: Instead of relying on a single global [CLS] token, our framework formulates prediction as Patch-level Inference followed by Adaptive Aggregation. It narrows the vision-language modality gap without training and adaptively fuses patch-level scores for final multi-label output.
Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by
aligning images with textual concepts, yet they often underperform on multi-label
recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual
representation, is insufficient to faithfully encode diverse targets with varying
scales, contexts, and co-occurrence patterns.
To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision–language modality gap.
We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is training-free: it requires no gradient updates and no parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, including an mAP gain of more than 6% over representative baselines on the challenging NUS-WIDE benchmark.
To bridge the vision-language modality gap without backpropagation, PVCL builds a visual classifier directly from patch embeddings. Entropy-driven patch selection isolates the most discriminative patches, which are then modeled with Gaussian Discriminant Analysis (GDA) to obtain robust, visually aligned decision boundaries.
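The two steps described above, entropy-based patch selection followed by a GDA fit, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, `keep_ratio`, `temperature`, and `shrinkage` are hypothetical, pseudo-labels are assumed to come from the arg-max of zero-shot patch-to-text similarities, and GDA is assumed to use a single covariance matrix shared by all classes, which gives a closed-form linear classifier with no gradient updates.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_confident_patches(patch_emb, text_emb, keep_ratio=0.2, temperature=0.01):
    """Keep the patches whose zero-shot class distribution has the lowest entropy.

    patch_emb: (N, D) L2-normalized patch embeddings.
    text_emb:  (C, D) L2-normalized class text embeddings.
    Returns the kept embeddings and their pseudo-labels (an assumption of
    this sketch: arg-max of the zero-shot distribution).
    """
    probs = softmax(patch_emb @ text_emb.T / temperature)      # (N, C)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)     # (N,)
    n_keep = max(1, int(keep_ratio * len(patch_emb)))
    keep = np.argsort(entropy)[:n_keep]                        # lowest entropy first
    return patch_emb[keep], probs[keep].argmax(axis=1)


def fit_gda(features, pseudo_labels, num_classes, shrinkage=1e-2):
    """Closed-form GDA with a shared covariance; returns linear weights and biases.

    Classes without any selected patch keep a zero mean here; a real
    system would need a fallback for them.
    """
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            means[c] = features[mask].mean(axis=0)
    # Pooled within-class covariance, regularized so that it is invertible.
    centered = features - means[pseudo_labels]
    cov = centered.T @ centered / len(features) + shrinkage * np.eye(dim)
    precision = np.linalg.inv(cov)
    weights = means @ precision                                # (C, D)
    bias = -0.5 * np.einsum("cd,cd->c", weights, means)        # uniform class prior
    return weights, bias
```

Under these assumptions, class logits for any patch embedding `x` are `x @ weights.T + bias`, so the fitted classifier scores patches in the visual feature space rather than against text embeddings.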
PAA max-pools the patch-level responses for each class, so that small foreground objects are captured without discarding spatial evidence, and then fuses these localized patch scores with the global [CLS] prediction, which serves as an anchor.
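A rough sketch of this kind of aggregation is shown below. It is an illustration only: the text above does not specify how the fusion weight is adapted, so this version uses a fixed mixing weight `alpha`, and `aggregate_scores`, `top_k`, and `alpha` are hypothetical names. Setting `top_k=1` corresponds to plain max pooling over patches.

```python
import numpy as np


def aggregate_scores(patch_scores, cls_scores, top_k=1, alpha=0.5):
    """Fuse per-class patch scores with global [CLS] scores.

    patch_scores: (N, C) score of every patch for every class.
    cls_scores:   (C,)   score of the global [CLS] token for every class.
    top_k:        number of highest patch responses averaged per class
                  (top_k=1 is max pooling).
    alpha:        fixed weight of the local evidence; a simplification,
                  since the adaptive rule is not given in the text.
    """
    patch_scores = np.asarray(patch_scores, dtype=float)
    cls_scores = np.asarray(cls_scores, dtype=float)
    k = min(top_k, patch_scores.shape[0])
    # Sort each class column in ascending order and keep the k largest rows.
    top = np.sort(patch_scores, axis=0)[-k:]
    local = top.mean(axis=0)                                   # (C,)
    return alpha * local + (1.0 - alpha) * cls_scores


# Three patches, two classes: each class fires strongly in one patch only.
patches = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.1]]
cls = [0.5, 0.1]
fused = aggregate_scores(patches, cls, top_k=1, alpha=0.5)
```

In this toy input the second class has a weak global score (0.1) but a strong response in a single patch (0.9), so the fused score is pulled up by the local evidence, which is the behavior the module is meant to provide for small objects.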
@misc{wang2026clsenoughmultilabelrecognition,
title={[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation},
author={Akang Wang and Xili Deng and Zhanxuan Hu and Yi Zhao and Yonghang Tai and Huafeng Li},
year={2026},
eprint={2605.25821},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.25821},
}