ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Dataemia


By Negar Foroutan and 3 other authors

Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages — often limited to single-domain data, such as the Bible — continue to perform poorly. To resolve these issues of imbalance and bias, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance for high-resource languages.
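The SCL objective the abstract refers to is, in its standard form (Khosla et al., 2020), a loss that pulls embeddings of same-class samples together and pushes different-class ones apart; applied to LID, samples of the same language drawn from different domains become positives, which encourages domain-invariant representations. Below is a minimal NumPy sketch of that generic supervised contrastive loss — an illustration of the technique, not the paper's exact implementation; the function name and temperature value are assumptions.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (Khosla et al., 2020).

    features: (N, D) array of embeddings (L2-normalized internally).
    labels:   (N,) array of class ids (here: language ids).
    Returns the mean negative log-probability of positives.
    """
    # Normalize so the dot product is cosine similarity.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature  # (N, N) scaled similarities

    n = len(labels)
    logits_mask = ~np.eye(n, dtype=bool)  # exclude self-similarity
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask

    # Log-softmax over all other samples, with max-subtraction for stability.
    sim_max = sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Average log-probability over each anchor's positives.
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0  # anchors with at least one positive
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

Intuitively, the loss is near zero when same-language embeddings coincide and other-language embeddings are far away, and grows as positives drift apart — so minimizing it collapses domain-specific variation within each language class.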

Submission history

From: Negar Foroutan
[v1] Wed, 18 Jun 2025 09:35:33 UTC (9,317 KB)
[v2] Mon, 9 Mar 2026 20:16:21 UTC (9,311 KB)


