ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Dataemia


By Negar Foroutan and 3 other authors

Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages — often limited to single-domain data, such as the Bible — continue to perform poorly. To resolve these issues of imbalance and bias, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance for high-resource languages.
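The SCL objective the abstract refers to is, in its standard form (Khosla et al., 2020), a loss that pulls embeddings of same-class samples together and pushes different-class ones apart; applied to LID, samples of the same language drawn from different domains become positives, which encourages domain-invariant representations. Below is a minimal NumPy sketch of that generic supervised contrastive loss — an illustration of the technique, not the paper's exact implementation; the function name and temperature value are assumptions.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (Khosla et al., 2020).

    features: (N, D) array of embeddings (L2-normalized internally).
    labels:   (N,) array of class ids (here: language ids).
    Returns the mean negative log-probability of positives.
    """
    # Normalize so the dot product is cosine similarity.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature  # (N, N) scaled similarities

    n = len(labels)
    logits_mask = ~np.eye(n, dtype=bool)  # exclude self-similarity
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask

    # Log-softmax over all other samples, with max-subtraction for stability.
    sim_max = sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Average log-probability over each anchor's positives.
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0  # anchors with at least one positive
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

Intuitively, the loss is near zero when same-language embeddings coincide and other-language embeddings are far away, and grows as positives drift apart — so minimizing it collapses domain-specific variation within each language class.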

Submission history

From: Negar Foroutan
[v1] Wed, 18 Jun 2025 09:35:33 UTC (9,317 KB)
[v2] Mon, 9 Mar 2026 20:16:21 UTC (9,311 KB)


