[Submitted on 8 Oct 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking, by Mitchell Keren Taraday and 2 other authors
Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
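The compression step the abstract describes, reducing many precomputed vision tokens to a small set via a lightweight attention-based adapter, can be sketched as cross-attention pooling: a few learned query vectors attend over the offline vision tokens. This is a minimal illustrative sketch, not the paper's actual adapter; all names, shapes, and the single-head formulation are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_vision_tokens(vision_tokens, queries, Wk, Wv):
    """Cross-attention pooling (hypothetical sketch of the adapter):
    k learned queries attend over N precomputed vision tokens,
    yielding k compressed tokens for the online joint encoder."""
    K = vision_tokens @ Wk                      # (N, d) keys
    V = vision_tokens @ Wv                      # (N, d) values
    d = queries.shape[-1]
    attn = softmax(queries @ K.T / np.sqrt(d))  # (k, N) attention weights
    return attn @ V                             # (k, d) compressed tokens

rng = np.random.default_rng(0)
N, d, k = 196, 64, 8  # e.g. 196 offline tokens compressed to 8
vision_tokens = rng.standard_normal((N, d))    # precomputed offline
queries = rng.standard_normal((k, d))          # learned query vectors
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

compressed = compress_vision_tokens(vision_tokens, queries, Wk, Wv)
print(compressed.shape)  # (8, 64)
```

Only the `k` compressed tokens (plus the text) need to be processed online, which is what keeps per-image storage and joint-encoder compute small.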
Submission history
From: Chaim Baskin
[v1] Wed, 8 Oct 2025 09:46:09 UTC (2,661 KB)
[v2] Sun, 22 Feb 2026 12:55:23 UTC (5,301 KB)