LexC-Gen: Generating Data for
Extremely Low-Resource Languages
with Large Language Models
and Bilingual Lexicons

1Department of Computer Science, Brown University
2Data Science Institute, Brown University
Animated overview of the LexC-Gen method.

Abstract

Data scarcity in low-resource languages can be addressed with word-to-word translation of labeled task data from high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization.

We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates it into low-resource languages with the same bilingual lexicons via word-to-word translation. Across 17 extremely low-resource languages, data generated by LexC-Gen is competitive with expert-translated gold data, and it yields average improvements of 5.6 and 8.9 points over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks, respectively. We show that conditioning on bilingual lexicons is the key component of LexC-Gen.

LexC-Gen is also practical: it needs only a single GPU to generate data at scale. It works well with open-access LLMs, and its cost is one-fifth that of GPT-4-based multilingual data generation.


LexC-Gen

Motivation: Data-Lexicon Mismatch

For low-resource languages that lack labeled data, we can use bilingual lexicons to translate existing labeled task data from high-resource languages into low-resource languages via word-to-word translation.
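To make this concrete, here is a minimal Python sketch of lexicon-based word-to-word translation. The lexicon entries and the helper name are illustrative assumptions, not the paper's code; note how words missing from the lexicon simply stay in English.

    # Minimal sketch of word-to-word translation with a bilingual lexicon.
    # The English->Acehnese entries below are illustrative only.

    def word_to_word_translate(text: str, lexicon: dict[str, str]) -> str:
        # Words missing from the lexicon stay in the source language,
        # which is why low lexical overlap hurts translation coverage.
        return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

    lexicon = {"i": "lon", "love": "galak", "this": "nyoe", "food": "peunajoh"}
    print(word_to_word_translate("I love this food", lexicon))        # lon galak nyoe peunajoh
    print(word_to_word_translate("I love this restaurant", lexicon))  # "restaurant" stays in English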

We observe that the words in existing task data often have low lexical overlap with the words in task-agnostic bilingual lexicons.

Therefore, we propose generating lexicon-compatible task data (Figure (a)) for translation into low-resource languages. This increases the number of words translated (Figure (b), left) and maximizes the utilization of the semantic information in bilingual lexicons (Figure (b), right).

Figure: (a) generating lexicon-compatible task data; (b) translation coverage (left) and lexicon utilization (right).
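The two quantities in Figure (b) can be sketched as follows, under a simplified whitespace-tokenization assumption (the paper's exact computation may differ):

    # Translation coverage: fraction of unique task-data words that the
    # lexicon can translate. Lexicon utilization: fraction of lexicon
    # entries that ever appear in the task data.

    def coverage_stats(sentences: list[str], lexicon_words: set[str]):
        task_vocab = {tok for s in sentences for tok in s.lower().split()}
        translated = task_vocab & lexicon_words
        return (len(translated) / len(task_vocab),       # translation coverage
                len(translated) / len(lexicon_words))    # lexicon utilization

    sentences = ["I love this food", "The movie was boring"]
    lexicon_words = {"i", "love", "this", "food", "happy", "sad"}
    print(coverage_stats(sentences, lexicon_words))  # (0.5, 0.666...)

LexC-Gen raises both quantities by generating task data from the lexicon's own words rather than translating fixed existing data.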

Lexicon-Conditioned Generation (LexC-Gen)

LexC-Gen method figure.

TL;DR: We prompt LLMs to generate task data using English words from bilingual lexicons (i.e., lexicon-conditioned generation). Then we filter out poor-quality samples before translating them with the bilingual lexicons.

Given a bilingual lexicon and the set of classes for a classification task:
(1) We randomly sample a class label and a set of words from the bilingual lexicon, for as many instances as we want to generate.
(2) We use these label-words pairs to build prompts for a CTG-trained (controlled-text-generation-trained) LLM, which generates the task data in English.
(3) We then train a task classifier on existing task data and use it to filter the generated data for input-label consistency.
(4) After filtering, we apply word-to-word translation with the bilingual lexicon, following prior work.
The result is synthetic task data in the target low-resource language, which we use to finetune the task classifier (see the sketch below).
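A minimal end-to-end sketch of the four steps for sentiment analysis. Here llm_generate and task_classifier are hypothetical stand-ins for the CTG-trained LLM and the filter classifier, and the prompt format is a simplified assumption:

    import random

    LABELS = ["positive", "negative", "neutral"]

    def word_to_word_translate(text, lexicon):
        # Step 4 helper: word-to-word translation with the bilingual lexicon.
        return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

    def lexcgen(lexicon, llm_generate, task_classifier, n_instances):
        data = []
        english_words = list(lexicon.keys())
        for _ in range(n_instances):
            # Step 1: randomly sample a class label and a set of lexicon words.
            label = random.choice(LABELS)
            words = random.sample(english_words, k=min(5, len(english_words)))
            # Step 2: prompt the CTG-trained LLM to generate English task data.
            prompt = (f"Write one {label} sentence for a sentiment analysis "
                      f"task using these words: {', '.join(words)}.")
            text = llm_generate(prompt)
            # Step 3: keep only samples whose predicted label matches.
            if task_classifier(text) != label:
                continue
            # Step 4: translate word for word into the low-resource language.
            data.append((word_to_word_translate(text, lexicon), label))
        return data

    # Toy usage with dummy callables standing in for real models:
    dummy_llm = lambda prompt: "i love this food"
    dummy_clf = lambda text: "positive" if "love" in text else "negative"
    print(lexcgen({"i": "lon", "love": "galak", "this": "nyoe", "food": "peunajoh"},
                  dummy_llm, dummy_clf, n_instances=3))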


Examples of Generated Data

Sentiment Analysis

Generated text samples for the sentiment analysis task in Acehnese by LexC-Gen. English words that remain untranslated are underlined.

Sentiment analysis samples.

Topic Classification

Generated text samples for the topic classification task in Twi by LexC-Gen. English words that remain untranslated are underlined.

Topic classification samples.

Results

Finetuning classifiers on LexC-Gen-generated data outperforms all baselines and is competitive with finetuning on gold translations curated by native speakers or professional translators.

Sentiment Analysis

We evaluated LexC-Gen on the NusaX dataset, covering 7 low-resource local Indonesian languages.

Sentiment analysis result.

Topic Classification

We evaluated LexC-Gen on the SIB-200 dataset, on the 10 worst-performing extremely low-resource languages.

Topic classification result.

Scaling up synthetic task data is not enough: the data generation must use high-resource-language (English) words from bilingual lexicons.

Ablation of Lexicon-Conditioning

We found that ablating lexicon-conditioning from LexC-Gen (i.e., the Gen and Gen w/o filter baselines) results in a significant drop in sentiment analysis accuracy.

Ablation of lexicon-conditioning result.
Quality control reduces the size of task training data and boosts performance at the same time.

Ablation of Quality Control

LexC-Gen ensures high-quality synthetic data by filtering out generated samples whose input text and class labels are mismatched.
This quality-control filter not only reduces the training data to one-third of its original size but also boosts the performance of the task classifier to be competitive with gold translations.
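A minimal sketch of this input-label consistency filter, using a simple scikit-learn pipeline as a stand-in for the actual filter model (the real classifier architecture and training data are not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Train the filter on existing labeled English task data (toy examples).
    gold_texts = ["i love this movie", "this food is terrible",
                  "what a wonderful day", "the service was awful"]
    gold_labels = ["positive", "negative", "positive", "negative"]
    filter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    filter_clf.fit(gold_texts, gold_labels)

    # Keep only generated samples whose predicted label agrees with the
    # label they were conditioned on; mismatched samples are discarded.
    generated = [("i love this wonderful food", "positive"),
                 ("the movie was awful and terrible", "positive")]  # 2nd is mismatched
    kept = [(text, label) for text, label in generated
            if filter_clf.predict([text])[0] == label]
    print(kept)  # expect the mismatched sample to be dropped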

Ablation of quality control result.

Practicality of LexC-Gen

LexC-Gen costs less than $100 to generate data with open-access 7B LLMs.

LexC-Gen needs only a single V100 GPU to generate data at scale in under 36 hours. This is 20% of the cost of generating data with GPT-4 (Whitehouse et al., 2023), while matching its task performance.

LexC-Gen also works with open-access LLMs such as BLOOMZ-7B1. This allows permissive use of the generated data for research in low-resource languages.

Figure: cost of data generation with LexC-Gen versus GPT-4.

Takeaways

1. We can use available linguistic resources such as bilingual lexicons to generate task data for low-resource languages.

2. Synthetic data can be competitive with expert-translated gold training data.

3. We don't need closed, frontier LLMs such as GPT-4 to generate data. Open-access 7B LLMs such as BLOOMZ are sufficient.

BibTeX

@misc{yong2024lexcgen,
      title={LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons}, 
      author={Zheng-Xin Yong and Cristina Menghini and Stephen H. Bach},
      year={2024},
      eprint={2402.14086},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}