Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

Alexey Kravets Da Chen Vinay P. Namboodiri

Abstract

CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, methods show a significant drop in performance (-55% on average across 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. An improved few-shot classification technique is proposed that consistently obtains state-of-the-art performance over 13 other recent baseline methods in a comprehensive analysis with 5880 experiments - varying the datasets, the number of few-shot examples, the unlearning settings, and the seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method.

Overview of the problem

Problem Definition

  • Existing few-shot learning methods achieve strong performance when evaluated on classes that are identical or highly similar to those seen during CLIP's pretraining. However, this represents an idealized evaluation scenario (partially transductive setting) that does not reflect real-world deployment conditions, where we want to know the performance on truly novel classes (inductive setting).

  • Since CLIP training data is not open-sourced, we cannot definitively determine which classes the model has never encountered. This creates a critical need for evaluation methodologies that can reliably benchmark few-shot performance on genuinely novel categories. Our work addresses this gap by developing a pipeline to emulate true unseen-class scenarios.

  • We introduce a novel pipeline that uses unlearning to obtain inductive benchmarks. This allows us to evaluate CLIP-based few-shot learning methods inductively on any dataset.

[Figure: inductive pipeline]

Comparison between the previous evaluation pipeline (Top: partially transductive) and the proposed pipeline (Bottom: inductive). Top: the tested methods (M1-3) apply the pretrained CLIP model, for which the target classes are likely to have been seen during training. Bottom: the proposed inductive pipeline provides the tested methods with an updated CLIP model from which the target class information has been unlearned. The target dataset, unlearning module, and few-shot methods are all replaceable.
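In code, the proposed inductive evaluation amounts to unlearning the target classes from CLIP before any few-shot adaptation takes place. Below is a minimal sketch of this loop: clip.load is the OpenAI CLIP API (backbone choice is illustrative), while unlearn, sample_support, fit, and evaluate are hypothetical placeholders for the replaceable modules described above, not the released code.

# Minimal sketch of the inductive evaluation pipeline (illustrative names only).
import clip  # OpenAI CLIP package

def inductive_benchmark(few_shot_methods, target_datasets, shots=(1, 2, 4, 8, 16), seeds=(0, 1, 2)):
    results = {}
    for dataset in target_datasets:
        # 1) Start from the pretrained CLIP model.
        model, preprocess = clip.load("ViT-B/16")
        # 2) Remove knowledge of the target classes (the unlearning module is replaceable).
        unlearned = unlearn(model, forget_classes=dataset.class_names)          # hypothetical
        # 3) Adapt and evaluate every few-shot method on the unlearned model.
        for method in few_shot_methods:
            for k in shots:
                for seed in seeds:
                    support = dataset.sample_support(k_shots=k, seed=seed)      # hypothetical
                    adapted = method.fit(unlearned, support, preprocess)        # hypothetical
                    results[(dataset.name, method.name, k, seed)] = adapted.evaluate(dataset.test_split)
    return results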


Inductive vs. Transductive Performance

  • Performance in the inductive setting drops across all few-shot methods examined.

  • The performance range on the X-axis (inductive) is 3%-23%, while on the Y-axis (partially transductive) it is 50%-77%.

[Figure: inductive vs. partially transductive performance]

Can Unlearning be a Proxy for the Inductive Setting?

We validate whether we are truly testing the CLIP model in an inductive manner by comparing a CLIP model trained from scratch on ImageNet with a certain subset of classes excluded against a model in which the same subset has been unlearned. We compare the two settings based on (1) the performance of few-shot learning methods and (2) Uniform Manifold Approximation and Projection visualizations of the visual features. The results for the unlearned and held-out models are very similar in both comparisons.
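Concretely, the check can be phrased as running the same few-shot evaluations on both models and comparing the resulting accuracies. A minimal sketch, assuming hypothetical fit/evaluate interfaces for the few-shot methods (not the released code):

# Sketch of the proxy validation: compare a CLIP model retrained from scratch on
# ImageNet with a class subset held out (the oracle) against a fully trained CLIP
# model with the same subset unlearned. All method interfaces are hypothetical.
def compare_heldout_vs_unlearned(heldout_clip, unlearned_clip, few_shot_methods, support, query):
    rows = []
    for method in few_shot_methods:
        acc_heldout = method.fit(heldout_clip, support).evaluate(query)      # hypothetical API
        acc_unlearned = method.fit(unlearned_clip, support).evaluate(query)  # hypothetical API
        rows.append((method.name, acc_heldout, acc_unlearned, abs(acc_heldout - acc_unlearned)))
    # Small per-method gaps indicate that unlearning is a good proxy for the true inductive setting.
    return rows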

Few-shot Learning Methods Performance

[Figure: few-shot learning performance]

Aggregated results across different few-shot learning methods are similar in both settings.

Uniform Manifold Approximation and Projection visualization

[Figure: UMAP visualization]

(1) Highlighted classes from the excluded subset in the "No subset" and "Unlearned" settings are sparser and more overlapping compared to the "full" setting. (2) The unlearned subset in both "No subset" and "Unlearned" overlaps more with other classes compared to the "full" setting.
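As a rough illustration of how such a plot can be produced, the sketch below embeds CLIP visual features with UMAP and highlights the excluded/unlearned classes. It assumes the umap-learn and matplotlib packages; the exact settings used for the figure may differ.

# Minimal sketch: UMAP projection of CLIP visual features with the excluded/unlearned
# subset highlighted. Assumes umap-learn (pip install umap-learn) and matplotlib.
import numpy as np
import torch
import umap
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_visual_features(clip_model, images, labels, highlight_classes):
    feats = clip_model.encode_image(images)                      # CLIP image encoder
    feats = (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(feats)

    labels = np.asarray(labels)
    mask = np.isin(labels, list(highlight_classes))              # excluded/unlearned classes
    plt.scatter(coords[~mask, 0], coords[~mask, 1], s=3, c="lightgray", label="other classes")
    plt.scatter(coords[mask, 0], coords[mask, 1], s=3, c="tab:red", label="excluded/unlearned subset")
    plt.legend()
    plt.show()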

Proposed Baseline

[Figure: SEPRES]

Our proposed method, Self-Enhanced Prompt Tuning with Residual Textual Features (SEPRES), outperforms other few-shot methods by a large margin in the inductive setting while remaining strong in the partially transductive one. See the paper for more details.

[Figures: SEPRES results compared to other few-shot methods in the inductive and partially transductive settings]
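The exact SEPRES formulation is given in the paper. Purely as an assumption-laden illustration of the two ingredients named in the title, tuned prompts and a residual on the textual features, a generic head on top of a frozen CLIP model might look as follows; this is not the authors' implementation.

# Generic illustration only, NOT the actual SEPRES method (see the paper for that):
# a frozen CLIP model whose class text features (e.g. obtained from tuned prompts)
# are adjusted by a learnable residual before computing the classification logits.
import torch
import torch.nn as nn

class ResidualTextHead(nn.Module):
    def __init__(self, clip_model, class_text_features):
        super().__init__()
        self.clip = clip_model.eval()
        for p in self.clip.parameters():                 # keep CLIP frozen
            p.requires_grad_(False)
        self.register_buffer("text_base", class_text_features)               # [C, D] frozen text features
        self.residual = nn.Parameter(torch.zeros_like(class_text_features))  # learnable residual

    def forward(self, images):
        img = self.clip.encode_image(images)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = self.text_base + self.residual             # residual-adjusted textual features
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return self.clip.logit_scale.exp() * img @ txt.t()   # class logits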

Ablations

[Figure: ablation results]

All the components of our pipeline are analyzed. We experiment with: (a) 4 different unlearning settings, in which we lose general knowledge of the CLIP model by less than 3%, 25%, 50%, and 90% while unlearning a particular selected dataset; (b) 14 different few-shot classification methods, including the proposed SEPRES method; (c) 7 different forget datasets, each with its corresponding set of retain datasets, validated on the 5 validation datasets; (d) 5 different few-shot settings (1, 2, 4, 8, and 16 shots), with 3 seeds for each setting. As more knowledge is lost, accuracy across all methods goes down, but the proposed SEPRES method tends to be more robust than the other methods.
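The factor counts in (a)-(d) multiply out to the 5880 runs mentioned in the abstract (4 x 14 x 7 x 5 x 3). A small sketch of the resulting experiment grid, with placeholder names and a hypothetical run_experiment:

# Sketch of the ablation grid; placeholder names, hypothetical runner.
from itertools import product

unlearning_settings = ["<3%", "25%", "50%", "90%"]           # general-knowledge loss levels
methods = [f"baseline_{i}" for i in range(13)] + ["SEPRES"]  # 13 baselines + SEPRES
forget_datasets = [f"forget_dataset_{i}" for i in range(7)]  # each with its retain/validation sets
shots = [1, 2, 4, 8, 16]
seeds = [0, 1, 2]

grid = list(product(unlearning_settings, methods, forget_datasets, shots, seeds))
assert len(grid) == 5880                                     # 4 * 14 * 7 * 5 * 3

for setting, method, dataset, k, seed in grid:
    pass  # run_experiment(setting, method, dataset, k, seed)   # hypothetical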

BibTeX

@InProceedings{kravets2025rethinkingfsl,
author    = {Kravets, Alexey and Chen, Da and P. Namboodiri, Vinay},
title     = {{Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting}},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year      = {2025}
}

© 2025 - Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting

Project page adapted from here.