MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER

MulDA

This repository contains the source code and data used in our paper “MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER”, accepted at ACL-IJCNLP 2021.

Data

The data generated using our labeled sequence translation method can be found in the “data” directory.

Labeled Sequence Translation

cd code/translate; python translate.py

lstm-lm: multilingual LSTM language model

  • train lstm-lm on linearized sequences

    cd code/lstm-lm;
    
    python train.py \
    --train_file PATH/TO/train.linearized.txt \
    --valid_file PATH/TO/dev.linearized.txt \
    --model_file PATH/TO/model.pt \
    --emb_dim 300 \
    --rnn_size 512 \
    --gpuid 0 
    
  • generate linearized sequences

    cd code/lstm-lm;
    
    python generate.py \
    --model_file PATH/TO/model.pt \
    --out_file PATH/TO/out.txt \
    --num_sentences 10000 \
    --temperature 1.0 \
    --seed 3435 \
    --max_sent_length 32 \
    --gpuid 0
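
The `--temperature` and `--seed` flags above control how sequences are sampled. The code of `generate.py` is not reproduced in this README, so the following is only a minimal sketch of what temperature-scaled sampling with a fixed seed usually means, not the script's actual implementation. The function name `sample_with_temperature` and the example logits are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one index from unnormalized scores (hypothetical helper).

    Dividing the logits by the temperature before the softmax sharpens the
    distribution when temperature < 1 and flattens it when temperature > 1.
    A temperature of 1.0 leaves the model distribution unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Two generators created with the same seed draw the same sequence of samples.
example_logits = [0.0, 5.0, 1.0]  # made-up scores for a three-word vocabulary
rng_a = random.Random(3435)
rng_b = random.Random(3435)
draws_a = [sample_with_temperature(example_logits, 1.0, rng_a) for _ in range(5)]
draws_b = [sample_with_temperature(example_logits, 1.0, rng_b) for _ in range(5)]
print(draws_a == draws_b)  # True
```

Under this reading, lowering the temperature below 1.0 would make the generated sequences more conservative, and raising it would make them more diverse at the cost of more noise.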
    

tools: tools for data processing

  • tools/preprocess.py: sequence linearization
  • tools/line2cols.py: convert linearized sequence back to two-column format
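
The exact format produced by `tools/preprocess.py` is not documented in this README. As a rough illustration of the two conversions, the sketch below assumes that linearization places each non-`O` tag immediately before its token and drops `O` tags. The function names `linearize` and `delinearize`, the BIO tag set, and the example sentence are all hypothetical and are not taken from the repository's scripts.

```python
from typing import List, Set, Tuple


def linearize(tokens: List[str], tags: List[str]) -> str:
    """Turn a two-column (token, tag) sentence into one line of text.

    Assumed convention: every non-O tag is inserted right before its token.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    pieces = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            pieces.append(tag)
        pieces.append(token)
    return " ".join(pieces)


def delinearize(line: str, tag_set: Set[str]) -> List[Tuple[str, str]]:
    """Recover (token, tag) pairs from a linearized line.

    Any item found in tag_set is treated as the tag of the next token, so a
    real token that happens to look like a tag would be misread.
    """
    pairs = []
    pending = "O"
    for item in line.split():
        if item in tag_set:
            pending = item
        else:
            pairs.append((item, pending))
            pending = "O"
    return pairs


tokens = ["Jamie", "Valentine", "was", "born", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
line = linearize(tokens, tags)
print(line)  # B-PER Jamie I-PER Valentine was born in B-LOC London

# Write the recovered pairs in two-column form, one token per line.
for token, tag in delinearize(line, {"B-PER", "I-PER", "B-LOC", "I-LOC"}):
    print(f"{token}\t{tag}")
```

With this convention the two functions are inverses of each other as long as no token is identical to a tag string.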

Requirements

  • code/lstm-lm/requirements.txt

Citation

Please cite our paper if you find the resources in this repository useful.

@inproceedings{liu-etal-2021-mulda,
    title = "MulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER",
    author = "Liu, Linlin  and
      Ding, Bosheng  and
      Bing, Lidong  and
      Joty, Shafiq  and
      Si, Luo  and
      Miao, Chunyan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL'21)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}