# A Unified Linear-Time Framework for Sentence-Level Discourse Parsing

This repository contains the source code of our paper “A Unified Linear-Time Framework for Sentence-Level Discourse Parsing” in ACL 2019.

## Getting Started

These instructions will help you run our unified discourse parser on the RST dataset.

### Prerequisites

* PyTorch 0.4 or higher
* Python 3
* AllenNLP


### Dataset

We train and evaluate the model on the standard RST Discourse Treebank (RST-DT) corpus.

* Segmenter: we use all 7673 sentences for training and 991 sentences for testing.
* Parser: we extract sentence-level discourse trees (DTs) from a document-level DT by finding the subtrees that span the respective sentences. This gives 7321 sentence-level DTs for training, 951 for testing, and 1114 for measuring human agreement.

### Data format

#### Example

• Sentence: (Input sentences should be tokenized first. ‘[]’ marks the EDU boundary tokens.)

• Although the [report,] which has [released] before the stock market [opened,] didn’t trigger the 190.58 point drop in the Dow Jones Industrial [Average,] analysts [said] it did play a role in the market’s [decline.]
• EDU_Breaks: (The zero-based indexes of the EDU boundary words, including the last word of the sentence.)

• [2, 5, 10, 22, 24, 33]
• Gold Discourse Tree structure: (The parser's output uses the same format.)

• (1:Satellite=Contrast:4,5:Nucleus=span:6) (1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) (5:Satellite=Attribution:5,6:Nucleus=span:6) (1:Satellite=span:1,2:Nucleus=Elaboration:3) (2:Nucleus=span:2,3:Satellite=Temporal:3)
• Parsing_Label: (The split positions, listed in top-down, depth-first order. For example, there are 6 EDUs in this case. At the first decoding step, the parser predicts the 4th EDU (index 3) as the break position, which generates two new spans: EDU1-EDU4 and EDU5-EDU6.)

• [3, 2, 0, 1, 4]
• Relation_Label: (Our model uses 39 relations in total. Each time two new splits are created, the classifier predicts the relation label between them.)

• [3, 17, 5, 30, 21]
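
EDU_Breaks and Parsing_Label can be derived mechanically from the bracketed sentence and the gold tree shown above. The following is a minimal sketch, not code from this repository: the helper names `edu_breaks` and `parsing_labels` are hypothetical, and it assumes whitespace tokenization, zero-based indexes, and the tree string format used in this example.

```python
import re

SENTENCE = ("Although the [report,] which has [released] before the stock "
            "market [opened,] didn’t trigger the 190.58 point drop in the Dow "
            "Jones Industrial [Average,] analysts [said] it did play a role in "
            "the market’s [decline.]")

TREE = ("(1:Satellite=Contrast:4,5:Nucleus=span:6) "
        "(1:Nucleus=Same-Unit:3,4:Nucleus=Same-Unit:4) "
        "(5:Satellite=Attribution:5,6:Nucleus=span:6) "
        "(1:Satellite=span:1,2:Nucleus=Elaboration:3) "
        "(2:Nucleus=span:2,3:Satellite=Temporal:3)")

# Matches one node, e.g. "(1:Satellite=Contrast:4,5:Nucleus=span:6)".
# The four groups are the 1-based EDU bounds of the left and right child.
NODE = re.compile(r"\((\d+):[^:,]+:(\d+),(\d+):[^:,]+:(\d+)\)")


def edu_breaks(bracketed_sentence):
    """Return the zero-based indexes of the tokens wrapped in []."""
    tokens = bracketed_sentence.split()
    return [i for i, tok in enumerate(tokens)
            if tok.startswith("[") and tok.endswith("]")]


def parsing_labels(tree):
    """Return the split positions in top-down, depth-first order."""
    # Map each span (first EDU, last EDU) to the last EDU of its left child.
    splits = {}
    for match in NODE.finditer(tree):
        left_start, left_end, _right_start, right_end = map(int, match.groups())
        splits[(left_start, right_end)] = left_end
    if not splits:
        return []

    root = (min(start for start, _ in splits), max(end for _, end in splits))
    labels, stack = [], [root]
    while stack:
        start, end = stack.pop()
        if start == end:          # a single EDU cannot be split further
            continue
        left_end = splits[(start, end)]
        labels.append(left_end - 1)           # convert to a zero-based index
        stack.append((left_end + 1, end))     # right child is visited later
        stack.append((start, left_end))       # left child is visited next
    return labels


print(edu_breaks(SENTENCE))   # [2, 5, 10, 22, 24, 33]
print(parsing_labels(TREE))   # [3, 2, 0, 1, 4]
```

The sketch ignores the nuclearity and relation names, so it does not produce Relation_Label; mapping relation names to the 39 label ids depends on the label table used by the model.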

• For training, you also need to prepare decoder_input_index as the decoder input, together with the corresponding parent_index and sibling_index as the partial tree information.

• decoder_input_index: [0, 0, 0, 1, 4] to use the first EDU as the representation of the text span to be parsed, or [5, 3, 2, 2, 5] to use the last EDU as the representation of the text span to be parsed.
• parent_index: [0, 5, 4, 3, 6]
• sibling_index: [99, 99, 99, 0, 4] (‘99’ denotes empty siblings.)
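
Both variants of decoder_input_index follow from replaying the Parsing_Label splits in top-down, depth-first order. The sketch below illustrates this under stated assumptions: the function name `decoder_inputs` and its `use_first_edu` flag are hypothetical and not part of this repository, and EDU indexes are zero-based. It does not cover parent_index or sibling_index, whose construction is not described here.

```python
def decoder_inputs(parsing_label, num_edus, use_first_edu=True):
    """Replay the splits and record one EDU index per decoding step.

    At each step the span on top of the stack is the one being parsed;
    its first (or last) EDU is used as the decoder input for that step.
    """
    stack = [(0, num_edus - 1)]   # start with the span covering all EDUs
    inputs = []
    for split in parsing_label:
        start, end = stack.pop()
        inputs.append(start if use_first_edu else end)
        # Only spans with more than one EDU need further splitting.
        if split + 1 < end:
            stack.append((split + 1, end))   # right span, parsed later
        if start < split:
            stack.append((start, split))     # left span, parsed next
    return inputs


parsing_label = [3, 2, 0, 1, 4]
print(decoder_inputs(parsing_label, 6))                        # [0, 0, 0, 1, 4]
print(decoder_inputs(parsing_label, 6, use_first_edu=False))   # [5, 3, 2, 2, 5]
```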

## How To Run

• Parser:

    cd Parser/
    python train.py


You can also set other arguments; please refer to main.py. By default, the parser uses the same parameters as described in the paper.

• Segmenter:
To be released.

## Citation

Please cite our paper if you find the resources in this repository useful.

    @inproceedings{Xiang19,
    Address     = {Florence, Italy},
    Author      = {Xiang Lin and Shafiq Joty and Prathyusha Jwalapuram and M Saiful Bari},
    Booktitle   = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    Numpages    = {9},
    Publisher   = {ACL},
    Series      = {ACL '19},
    pages       = {xx--xx},
    Title       = {{A Unified Linear-Time Framework for Sentence-Level Discourse Parsing}},
    Year        = {2019},
    url         = {https://arxiv.org/abs/1905.05682}
    }