LexC-Gen: Generating Data for
Extremely Low-Resource Languages
with Large Language Models
and Bilingual Lexicons

1Department of Computer Science, Brown University
2Data Science Institute, Brown University
Animated overview of the LexC-Gen method.

Abstract

Data scarcity in low-resource languages can be addressed with word-to-word translation of labeled task data from high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization.

We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates it into low-resource languages with the same bilingual lexicons via word-to-word translation. Across 17 extremely low-resource languages, data generated by LexC-Gen is competitive with expert-translated gold data, and it yields average improvements of 5.6 and 8.9 points over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks, respectively. We show that conditioning on bilingual lexicons is the key component of LexC-Gen.

LexC-Gen is also practical: it needs only a single GPU to generate data at scale. It works well with open-access LLMs, and its cost is one-fifth that of GPT-4-based multilingual data generation.


LexC-Gen

Motivation: Data-Lexicon Mismatch

For low-resource languages that lack labeled data, we can use bilingual lexicons to translate existing labeled task data from high-resource languages into low-resource languages via word-to-word translation.
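To make this concrete, here is a minimal Python sketch of lexicon-based word-to-word translation. The lexicon entries and the helper name are illustrative assumptions, not the paper's code; note how words missing from the lexicon simply stay in English.

    # Minimal sketch of word-to-word translation with a bilingual lexicon.
    # The English->Acehnese entries below are illustrative only.

    def word_to_word_translate(text: str, lexicon: dict[str, str]) -> str:
        # Words missing from the lexicon stay in the source language,
        # which is why low lexical overlap hurts translation coverage.
        return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

    lexicon = {"i": "lon", "love": "galak", "this": "nyoe", "food": "peunajoh"}
    print(word_to_word_translate("I love this food", lexicon))        # lon galak nyoe peunajoh
    print(word_to_word_translate("I love this restaurant", lexicon))  # "restaurant" stays in English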

We observe that the words in existing task data often have low lexical overlap with the words in task-agnostic bilingual lexicons.

Therefore, we propose generating lexicon-compatible task data (Figure (a)) for translation into low-resource languages. This increases the number of words translated (Figure (b), left) and maximizes the utilization of the semantic information in bilingual lexicons (Figure (b), right).

Figure: (a) generating lexicon-compatible task data; (b) translation coverage (left) and lexicon utilization (right).
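The two quantities in Figure (b) can be sketched as follows, under a simplified whitespace-tokenization assumption (the paper's exact computation may differ):

    # Translation coverage: fraction of unique task-data words that the
    # lexicon can translate. Lexicon utilization: fraction of lexicon
    # entries that ever appear in the task data.

    def coverage_stats(sentences: list[str], lexicon_words: set[str]):
        task_vocab = {tok for s in sentences for tok in s.lower().split()}
        translated = task_vocab & lexicon_words
        return (len(translated) / len(task_vocab),       # translation coverage
                len(translated) / len(lexicon_words))    # lexicon utilization

    sentences = ["I love this food", "The movie was boring"]
    lexicon_words = {"i", "love", "this", "food", "happy", "sad"}
    print(coverage_stats(sentences, lexicon_words))  # (0.5, 0.666...)

LexC-Gen raises both quantities by generating task data from the lexicon's own words rather than translating fixed existing data.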

Lexicon-Conditioned Generation (LexC-Gen)

LexC-Gen method figure.

TL;DR: We prompt LLMs to generate task data using English words from bilingual lexicons (i.e., lexicon-conditioned generation). Then we filter out poor-quality samples before translating them with the bilingual lexicons.

Given a bilingual lexicon and the set of classes for a classification task:
(1) We randomly sample a class label and a set of words from the bilingual lexicon, for as many instances as we want to generate.
(2) We use these label-words pairs to build prompts for a CTG-trained (controlled-text-generation-trained) LLM, which generates the task data in English.
(3) We then train a task classifier on existing task data and use it to filter the generated data for input-label consistency.
(4) After filtering, we apply word-to-word translation with the bilingual lexicon, following prior work.
The result is synthetic task data in the target low-resource language, which we use to finetune the task classifier (see the sketch below).
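A minimal end-to-end sketch of the four steps for sentiment analysis. Here llm_generate and task_classifier are hypothetical stand-ins for the CTG-trained LLM and the filter classifier, and the prompt format is a simplified assumption:

    import random

    LABELS = ["positive", "negative", "neutral"]

    def word_to_word_translate(text, lexicon):
        # Step 4 helper: word-to-word translation with the bilingual lexicon.
        return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

    def lexcgen(lexicon, llm_generate, task_classifier, n_instances):
        data = []
        english_words = list(lexicon.keys())
        for _ in range(n_instances):
            # Step 1: randomly sample a class label and a set of lexicon words.
            label = random.choice(LABELS)
            words = random.sample(english_words, k=min(5, len(english_words)))
            # Step 2: prompt the CTG-trained LLM to generate English task data.
            prompt = (f"Write one {label} sentence for a sentiment analysis "
                      f"task using these words: {', '.join(words)}.")
            text = llm_generate(prompt)
            # Step 3: keep only samples whose predicted label matches.
            if task_classifier(text) != label:
                continue
            # Step 4: translate word for word into the low-resource language.
            data.append((word_to_word_translate(text, lexicon), label))
        return data

    # Toy usage with dummy callables standing in for real models:
    dummy_llm = lambda prompt: "i love this food"
    dummy_clf = lambda text: "positive" if "love" in text else "negative"
    print(lexcgen({"i": "lon", "love": "galak", "this": "nyoe", "food": "peunajoh"},
                  dummy_llm, dummy_clf, n_instances=3))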


Examples of Generated Data

Sentiment Analysis

Generated text samples for the sentiment analysis task in Acehnese by LexC-Gen. English words that remain untranslated are underlined.

Sentiment analysis samples.

Topic Classification

Generated text samples for the topic classification task in Twi by LexC-Gen. English words that remain untranslated are underlined.

Topic classification samples.

Results

Finetuning classifiers on LexC-Gen-generated data outperforms all baselines and is competitive with finetuning on gold translations curated by native speakers or professional translators.

Sentiment Analysis

We evaluated LexC-Gen on the NusaX dataset, covering 7 low-resource local Indonesian languages.

Sentiment analysis result.

Topic Classification

We evaluated LexC-Gen on the SIB-200 dataset, on the 10 worst-performing extremely low-resource languages.

Topic classification result.

Scaling up synthetic task data is not enough: the data generation must use high-resource-language (English) words from bilingual lexicons.

Ablation of Lexicon-Conditioning

We found that ablating lexicon-conditioning from LexC-Gen (i.e., the Gen and Gen w/o filter baselines) results in a significant drop in sentiment analysis accuracy.

Ablation of lexicon-conditioning result.
Quality control reduces the size of task training data and boosts performance at the same time.

Ablation of Quality Control

LexC-Gen ensures high-quality synthetic data by filtering out generated samples whose input text and class labels are mismatched.
This quality-control filter not only reduces the training data to one-third of its original size but also boosts the performance of the task classifier to be competitive with gold translations.
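A minimal sketch of this input-label consistency filter, using a simple scikit-learn pipeline as a stand-in for the actual filter model (the real classifier architecture and training data are not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Train the filter on existing labeled English task data (toy examples).
    gold_texts = ["i love this movie", "this food is terrible",
                  "what a wonderful day", "the service was awful"]
    gold_labels = ["positive", "negative", "positive", "negative"]
    filter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    filter_clf.fit(gold_texts, gold_labels)

    # Keep only generated samples whose predicted label agrees with the
    # label they were conditioned on; mismatched samples are discarded.
    generated = [("i love this wonderful food", "positive"),
                 ("the movie was awful and terrible", "positive")]  # 2nd is mismatched
    kept = [(text, label) for text, label in generated
            if filter_clf.predict([text])[0] == label]
    print(kept)  # expect the mismatched sample to be dropped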

Ablation of quality control result.

Practicality of LexC-Gen

LexC-Gen costs less than $100 to generate data with open-access 7B LLMs.

LexC-Gen needs only a single V100 GPU to generate data at scale in under 36 hours. This is 20% of the cost of generating data with GPT-4 (Whitehouse et al., 2023), while matching its task performance.

LexC-Gen also works with open-access LLMs such as BLOOMZ-7B1. This allows permissive use of the generated data for research in low-resource languages.

Figure: cost of data generation with LexC-Gen versus GPT-4.

Takeaways

1. We can use available linguistic resources such as bilingual lexicons to generate task data for low-resource languages.

2. Synthetic data can be competitive with expert-translated gold training data.

3. We don't need closed, frontier LLMs such as GPT-4 to generate data. Open-access 7B LLMs such as BLOOMZ are sufficient.

BibTeX

@misc{yong2024lexcgen,
      title={LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons}, 
      author={Zheng-Xin Yong and Cristina Menghini and Stephen H. Bach},
      year={2024},
      eprint={2402.14086},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}