A Unified Linear-Time Framework for Sentence-Level Discourse Parsing

This repository contains the source code of our paper “A Unified Linear-Time Framework for Sentence-Level Discourse Parsing” in ACL 2019.

Getting Started

These instructions will help you run our unified discourse parser on the RST dataset.

Prerequisites

* PyTorch 0.4 or higher
* Python 3
* AllenNLP

Dataset

We train and evaluate the model on the standard RST Discourse Treebank (RST-DT) corpus.

* Segmenter: we use all 7673 sentences for training and 991 sentences for testing.
* Parser: we extract sentence-level DTs from a document-level DT by finding the subtrees that span the respective sentences. This gives 7321 sentence-level DTs for training, 951 for testing, and 1114 for measuring human agreement.
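The subtree extraction described for the parser data can be illustrated with a minimal sketch. It assumes a document-level DT is given as a list of internal-node spans `(first_edu, last_edu)` and that each sentence is given as its EDU range; the function name, the span representation, and the toy data are hypothetical and are not taken from this repository.

```python
def sentence_level_trees(internal_nodes, sentences):
    """Keep each sentence whose EDU range is exactly one internal node of the
    document-level tree, together with the internal nodes inside that range."""
    nodes = set(internal_nodes)
    trees = {}
    for first, last in sentences:
        if (first, last) in nodes:
            # The sentence-level DT is the part of the tree inside the sentence.
            trees[(first, last)] = sorted(
                (s, e) for s, e in nodes if first <= s and e <= last
            )
    return trees


# Toy document with 7 EDUs (hypothetical data, 1-based inclusive spans).
doc_nodes = [(1, 7), (1, 4), (5, 7), (1, 2), (3, 4), (5, 6)]
doc_sentences = [(1, 4), (5, 5), (6, 7)]

trees = sentence_level_trees(doc_nodes, doc_sentences)
print(trees)
```

In this toy case only the sentence covering EDUs 1 to 4 corresponds to a subtree; the other two sentences have no internal node that spans exactly their EDUs, so no sentence-level DT is produced for them.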

Data format

Example

  • Sentence: (Input sentences should be tokenized first. ‘[]’ denotes the EDU boundary tokens.)

    • Although the [report,] which was [released] before the stock market [opened,] didn’t trigger the 190.58 point drop in the Dow Jones Industrial [Average,] analysts [said] it did play a role in the market’s [decline.]
  • EDU_Breaks: (The 0-based indexes of the EDU boundary tokens, including the last token of the sentence.)

    • [2, 5, 10, 22, 24, 33]
  • Gold Discourse Tree structure: (The parser output uses the same format.)

    • (1:Satellite=Contrast:4,5:Nucleus=span:6) (1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) (5:Satellite=Attribution:5,6:Nucleus=span:6) (1:Satellite=span:1,2:Nucleus=Elaboration:3) (2:Nucleus=span:2,3:Satellite=Temporal:3)
  • Parsing_Label: (This should follow top-down, depth-first order. For example, there are 6 EDUs in this case. At the first decoding step, the parser predicts the 4th EDU (index 3) as the break position, so that two new spans (EDU1-EDU4 and EDU5-EDU6) are generated.)

    • [3, 2, 0, 1, 4]
  • Relation_Label: (Our model has 39 relations in all. Each time two new spans are created, the classifier predicts the relation label between them.)

    • [3, 17, 5, 30, 21]
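The example fields above can be cross-checked with a short script. This is a minimal sketch, not code from this repository: it assumes tokens are separated by whitespace, that EDU_Breaks are 0-based token indexes, and that Parsing_Label lists 0-based split points in top-down, depth-first order (left span first). The helper names `edu_breaks`, `parse_tree`, and `parsing_label` are hypothetical.

```python
import re

SENTENCE = ("Although the [report,] which was [released] before the stock "
            "market [opened,] didn't trigger the 190.58 point drop in the "
            "Dow Jones Industrial [Average,] analysts [said] it did play a "
            "role in the market's [decline.]")

GOLD = ("(1:Satellite=Contrast:4,5:Nucleus=span:6) "
        "(1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) "
        "(5:Satellite=Attribution:5,6:Nucleus=span:6) "
        "(1:Satellite=span:1,2:Nucleus=Elaboration:3) "
        "(2:Nucleus=span:2,3:Satellite=Temporal:3)")

# One tree node: (start:Nuclearity=Relation:end,start:Nuclearity=Relation:end)
NODE = re.compile(
    r"\((\d+):(\w+)=([\w-]+):(\d+),(\d+):(\w+)=([\w-]+):(\d+)\)")


def edu_breaks(marked_sentence):
    """Return the 0-based indexes of the whitespace tokens wrapped in [...]."""
    tokens = marked_sentence.split()
    return [i for i, tok in enumerate(tokens)
            if tok.startswith("[") and tok.endswith("]")]


def parse_tree(text):
    """Map each 1-based EDU span (first, last) to the last EDU of its left child."""
    splits = {}
    for m in NODE.finditer(text):
        left_start, left_end, right_end = int(m[1]), int(m[4]), int(m[8])
        splits[(left_start, right_end)] = left_end
    return splits


def parsing_label(splits, n_edus):
    """Collect 0-based split points in top-down, depth-first order."""
    order, stack = [], [(1, n_edus)]
    while stack:
        start, end = stack.pop()
        if start == end:          # a single EDU cannot be split further
            continue
        cut = splits[(start, end)]
        order.append(cut - 1)     # convert the 1-based EDU number to an index
        stack.append((cut + 1, end))
        stack.append((start, cut))  # pushed last, so the left span is visited first
    return order


print(edu_breaks(SENTENCE))                 # [2, 5, 10, 22, 24, 33]
print(parsing_label(parse_tree(GOLD), 6))   # [3, 2, 0, 1, 4]
```

Both printed lists match the EDU_Breaks and Parsing_Label values of the example. Relation_Label is not reproduced here because the mapping from relation names to label ids is not given in this section.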

  • For training, you will need to prepare decoder_input_index as the decoder input, and the corresponding parent_index and sibling_index as the partial tree information.

    • decoder_input_index: [ 0 , 0 , 0 , 1 , 4 ] to take the first EDU as the representation of the text span to be parsed, or [ 5 , 3 , 2 , 2 , 5 ] to take the last EDU as the representation of the text span to be parsed.
    • parent_index: [ 0 , 5 , 4 , 3 , 6 ]
    • sibling_index: [ 99 , 99 , 99 , 0 , 4 ] - ‘99’ denotes empty siblings.
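The two decoder_input_index variants can be recovered from Parsing_Label by replaying the top-down, depth-first decoding. The following is a minimal sketch under that assumption; `decoding_spans` is a hypothetical helper, not a function from this repository. It does not derive parent_index or sibling_index, since their indexing convention is not spelled out in this section.

```python
def decoding_spans(parsing_label, n_edus):
    """Replay top-down, depth-first decoding and return the EDU span
    (0-based, inclusive) that is split at each decoding step."""
    labels = iter(parsing_label)
    spans, stack = [], [(0, n_edus - 1)]
    while stack:
        start, end = stack.pop()
        if start == end:          # single EDU, nothing left to split
            continue
        cut = next(labels)        # predicted break position for this span
        spans.append((start, end))
        stack.append((cut + 1, end))
        stack.append((start, cut))  # pushed last, so the left span is parsed first
    return spans


spans = decoding_spans([3, 2, 0, 1, 4], 6)
first_edu = [start for start, _ in spans]
last_edu = [end for _, end in spans]

print(spans)      # [(0, 5), (0, 3), (0, 2), (1, 2), (4, 5)]
print(first_edu)  # [0, 0, 0, 1, 4]
print(last_edu)   # [5, 3, 2, 2, 5]
```

The `first_edu` and `last_edu` lists agree with the two decoder_input_index options listed above.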

How To Run

  • Parser:

    cd Parser/
    python train.py
    

    You can also set other arguments; please refer to main.py. By default, the parser uses the same parameters as described in the paper.

  • Segmenter:
    To be released.

Citation

Please cite our paper if you find the resources in this repository useful.

@inproceedings{Xiang19,
	Address     = {Florence, Italy},
	Author      = {Xiang Lin and Shafiq Joty and Prathyusha Jwalapuram and M Saiful Bari},
	Booktitle   = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
	Numpages    = {9},
	Publisher   = {ACL},
	Series      = {ACL '19},
	pages       = {xx--xx},
	Title       = {{A Unified Linear-Time Framework for Sentence-Level Discourse Parsing}},
	Year        = {2019},
	url         = {https://arxiv.org/abs/1905.05682}
}