[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

Overview of PIAA: Instead of relying on a single global [CLS] token, our framework formulates prediction as Patch-level Inference followed by Adaptive Aggregation. It narrows the vision-language modality gap without training and adaptively fuses patch-level scores for final multi-label output.

Abstract

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns.

To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision–language modality gap.

We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines.

Methodology

1. Patch-based Visual Classifier Learning (PVCL)

To bridge the vision-language modality gap without backpropagation, PVCL learns a visual classifier directly from the patch embeddings. Entropy-driven patch selection isolates the discriminative patches, which are then modeled with Gaussian Discriminant Analysis (GDA) to obtain robust decision boundaries aligned with the visual feature space.
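The two steps named above, entropy-based patch selection and a GDA classifier, can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `keep_ratio` and `ridge` parameters, the use of zero-shot argmax predictions as pseudo-labels, and the shared-covariance form of GDA are all hypothetical choices that the text does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_confident_patches(patch_logits, keep_ratio=0.1):
    """Return indices of the lowest-entropy (most confident) patches.

    patch_logits: (N, C) patch-to-class logits, e.g. zero-shot text similarities.
    keep_ratio is a hypothetical hyperparameter, not taken from the paper.
    """
    probs = softmax(patch_logits, axis=-1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    n_keep = max(1, int(round(keep_ratio * len(entropy))))
    return np.argsort(entropy)[:n_keep]

def fit_gda(features, labels, num_classes, ridge=1e-3):
    """Closed-form GDA with a shared covariance (assumed variant).

    Returns linear weights W of shape (C, D) and biases b of shape (C,),
    so that class scores are features @ W.T + b. No gradient steps are used.
    """
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    counts = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        counts[c] = mask.sum()
        if counts[c] > 0:  # classes without samples keep a zero mean
            means[c] = features[mask].mean(axis=0)
    centered = features - means[labels]
    # The ridge term keeps the covariance invertible when samples are few.
    cov = centered.T @ centered / len(features) + ridge * np.eye(dim)
    precision = np.linalg.inv(cov)
    weights = means @ precision
    priors = (counts + 1.0) / (counts.sum() + num_classes)  # Laplace smoothing
    biases = -0.5 * np.einsum("cd,cd->c", weights, means) + np.log(priors)
    return weights, biases
```

In this sketch, one would call `select_confident_patches` on the zero-shot patch logits, take `argmax` over classes as pseudo-labels for the selected patches, pass the corresponding patch features to `fit_gda`, and then score any patch with `patch_features @ W.T + b`.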

2. Prediction Adaptive Aggregation (PAA)

PAA max-pools the highest patch-level responses so that small foreground objects are captured without losing spatial evidence, then merges these localized patch scores with the global [CLS] prediction, which serves as an anchor.
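A minimal sketch of this kind of aggregation is given below. It is not the paper's exact rule: the function name, the `top_k` pooling size, and the fixed fusion weight `alpha` are hypothetical stand-ins, since the text does not specify how the adaptive weighting is computed.

```python
import numpy as np

def aggregate_scores(patch_scores, cls_scores, top_k=3, alpha=0.5):
    """Fuse patch-level and global scores into image-level class scores.

    patch_scores: (N, C) per-patch class scores.
    cls_scores:   (C,) class scores from the global [CLS] token.
    top_k, alpha: hypothetical hyperparameters (assumptions of this sketch).
    """
    patch_scores = np.asarray(patch_scores, dtype=float)
    cls_scores = np.asarray(cls_scores, dtype=float)
    k = min(top_k, patch_scores.shape[0])
    # Sort each class column over patches and keep the k largest responses;
    # with k = 1 this is plain max pooling.
    top = np.sort(patch_scores, axis=0)[-k:]
    local = top.mean(axis=0)
    # A fixed convex combination stands in for the adaptive fusion.
    return alpha * local + (1.0 - alpha) * cls_scores
```

With `top_k=1`, a class that fires strongly in a single patch keeps its full patch score, whereas averaging over all patches would dilute it. This is the behavior that matters for small objects.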

Experimental Results

Quantitative Performance

Method	Frozen	VOC12	VOC07	COCO	NUS
Unsupervised
NaiveAN	✗	85.5	86.5	65.1	40.8
ROLE	✗	82.6	84.6	67.1	43.2
CDUL	✗	88.6	89.0	69.2	44.0
CCD	✗	90.1	91.0	70.3	44.5
Training-free
CLIP	✓	84.9	85.4	61.7	44.4
TagCLIP	✓	90.8	91.2	70.0	38.7
SPARC	✓	-	88.7	68.0	47.5
PIAA (Ours)	✓	92.2	92.5	73.2	50.6

PIAA achieves the best result among the compared training-free and unsupervised methods on all four multi-label benchmarks.

Orthogonal Improvements

Method	VOC12	VOC07	COCO	NUS	AVG
CLIP	78.3	78.6	49.2	34.8	60.2
+ PIAA	89.2 (+10.9)	89.7 (+11.1)	68.8 (+19.6)	47.9 (+13.1)	73.9 (+13.7)
SCLIP	84.7	85.7	63.1	37.7	67.8
+ PIAA	91.4 (+6.7)	91.7 (+6.0)	73.0 (+9.9)	49.2 (+11.5)	76.3 (+8.5)
ITACLIP	86.2	86.5	67.7	36.2	69.2
+ PIAA	92.2 (+6.0)	92.3 (+5.8)	74.6 (+6.9)	49.1 (+12.9)	77.1 (+7.9)
SC-CLIP	88.8	89.1	68.8	43.3	72.5
+ PIAA	92.2 (+3.4)	92.5 (+3.4)	73.2 (+4.4)	50.6 (+7.3)	77.1 (+4.6)

PIAA provides significant and orthogonal improvements to all evaluated front-ends, with average gains ranging from +4.6 mAP points on SC-CLIP to +13.7 mAP points on the standard CLIP baseline.

Ablation Study of PIAA Components

PVCL	PAA	VOC12	VOC07	COCO	NUS
		88.8	89.1	68.8	43.3
	✓	89.6	90.4	70.3	45.3
✓		91.3	91.7	69.9	45.7
✓	✓	92.2	92.5	73.2	50.6

Performance is evaluated with the SC-CLIP front-end. Each component improves over the baseline on its own, and combining PVCL and PAA yields the best results on all four datasets.

Efficiency Analysis (Time Cost)

Dataset	Learning: CCD (min)	Learning: Ours (min)	Learning speedup	Inference: TagCLIP (min)	Inference: Ours (min)	Inference speedup
VOC12	43.2	0.2	216.0×	5.8	0.2	29.0×
VOC07	45.6	0.2	228.0×	5.4	0.2	27.0×
COCO	512.4	1.6	320.3×	66.1	1.1	60.1×
NUS	991.8	2.5	396.7×	83.8	1.6	52.4×
Average	398.3	1.1	362.1×	40.3	0.8	50.4×

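Comparison of classifier acquisition (learning) time and inference time. Averaged over the four datasets, PIAA obtains its classifier about 362× faster than CCD and runs inference about 50× faster than TagCLIP.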
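The speedup in the Average row appears to be the ratio of the two averaged times (398.3 / 1.1), not the mean of the per-dataset speedups, which would be about 290×. The short check below reproduces both readings from the table values. The variable names are illustrative, and the interpretation of the Average row is an inference from the numbers rather than something the page states.

```python
# Learning times in minutes, copied from the efficiency table.
ccd_learning = {"VOC12": 43.2, "VOC07": 45.6, "COCO": 512.4, "NUS": 991.8}
ours_learning = {"VOC12": 0.2, "VOC07": 0.2, "COCO": 1.6, "NUS": 2.5}

# Per-dataset speedup: CCD time divided by our time.
per_dataset = {name: ccd_learning[name] / ours_learning[name] for name in ccd_learning}

# Reading 1: mean of the per-dataset speedups.
mean_of_ratios = sum(per_dataset.values()) / len(per_dataset)

# Reading 2: ratio of the averaged times, using the rounded values
# printed in the "Average" row of the table.
reported_avg_ccd, reported_avg_ours = 398.3, 1.1
ratio_of_averages = reported_avg_ccd / reported_avg_ours

print(f"mean of per-dataset speedups: {mean_of_ratios:.1f}x")
print(f"ratio of averaged times:      {ratio_of_averages:.1f}x")
```

The ratio of averages is dominated by the larger datasets (COCO and NUS), which is why it is higher than the mean of the individual speedups.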

Citation

@inproceedings{wang2026,
  title={[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation},
  author={Wang, Akang and Deng, Xili and Hu, Zhanxuan and Zhao, Yi and Tai, Yonghang and Li, Huafeng},
  booktitle={Proceedings of the International Conference on Machine Learning},
  year={2026}
}