Commit 1a0b8d53 authored by chloe: up slides

]
---
.left-column[
## Discourse parsing
### - Corpora
##### RST DT
###### Segmentation
]
.right-column[
### Discourse parsing: first step, segmenting
* Segment a document into EDUs
* mostly clauses and sentences, but a bit more fine-grained in the RST DT
* see the large set of rules in the [tagging manual](https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf), e.g.:
.small[
* Includes both speech acts and other cognitive acts:
|The company says| |it will shut down its plant.|
* But if the complement is a to-infinitival, do not segment:
|The company wants to shut down its plant.|
* But segment an infinitival clause marking a purpose relation (though not all
of them, that would be too easy...):
|A grand jury has been investigating
whether officials (...) conspired **to** cover up their accounting| |**to**
evade federal income taxes.|
]
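A toy sketch of the two rules above (nothing like the manual's full rule set; the verb list and heuristics are invented for the example): split after an attribution verb with a finite complement, but keep a to-infinitival complement attached.

```python
# Invented verb list and heuristics, for illustration only.
VERBS = {"says", "said", "believes", "wants"}

def segment(sentence):
    tokens = sentence.split()
    edus, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Split after the verb unless its complement is a to-infinitival.
        if tok in VERBS and nxt is not None and nxt != "to":
            edus.append(" ".join(current))
            current = []
    if current:
        edus.append(" ".join(current))
    return edus

print(segment("The company says it will shut down its plant."))
# two EDUs: the attribution and its complement
print(segment("The company wants to shut down its plant."))
# one EDU: the to-infinitival complement is not segmented
```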
]
---
.left-column[
## Discourse parsing
### - Corpora
##### RST DT
###### Segmentation
]
.right-column[
### Discourse segmentation
Most existing systems use lexical features, POS tags and syntactic information + gold
sentence segmentation (not such an easy task!)
* RST DT: [Xuan Bach et al.]: F1 91.0% (automatic parse) / 93.7% (gold parse)
* English instructional corpus: [Joty et al, 2015] F1 80.9%
### ToNy, winner of the last shared task :)
See the results [here](https://sites.google.com/view/disrpt2019/shared-task?authuser=0),
and the paper [here](https://hal.archives-ouvertes.fr/hal-02374091/file/21_Paper.pdf)
[Muller et al, 2019]
* Using contextual embeddings alone allows close to state-of-the-art results
* ELMo better than BERT on English, but not multilingual
* Results with BERT multilingual, average over the languages:
* F1 90.11% if sentence boundaries given
* F1 86.38% otherwise
* Problem with cross-domain learning:
* Training on GUM and testing on RST-DT: drop from 96% to 66%
* Training on RST-DT and testing on GUM: drop from 93% to 73%
* Note: [GUM corpus](http://corpling.uis.georgetown.edu/gum/) is composed of
documents from several domains.
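A minimal sketch of how such segmenters frame the problem: one binary label per token (1 = token starts a new EDU), scored with F1 on the positive class. The labels below are made up, not DISRPT data.

```python
# Toy gold/predicted boundary labels, invented for the example.
def boundary_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [1, 0, 0, 1, 0, 0, 0, 1, 0]   # EDU-initial tokens
pred = [1, 0, 0, 1, 0, 1, 0, 0, 0]   # hypothetical system output
print(round(boundary_f1(gold, pred), 2))  # 0.67
```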
]
---
.left-column[
## Discourse parsing
### - Corpora
##### RST DT
###### Segmentation
###### Parsing
]
.right-column[
### Discourse parsing: second step, building the tree
* Attachment: which EDUs are linked together
* Labeling: with which relation / sense
* Recursive process: the pair of discourse units has to be linked to another
segment, and so on, until full coverage
* + RST bonus: label each segment as nucleus or satellite
Parsers are inspired by syntactic parsing:
* Transition based, shift-reduce, CKY parsing (constituency or dependency)
* Main problems:
* Efficiency: trees are often far deeper than in syntax
* Representation: we need to encode spans of text instead of just words
* Relations are semantic, harder to identify than syntactic ones
* Lack of data: corpora are small, 385 documents in the RST DT, meaning
385 trees / instances for our system
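The transition-based approach can be sketched as follows (illustrative only: a real parser scores each action with a classifier instead of taking a gold action sequence):

```python
# Minimal shift-reduce construction of a binary discourse tree over EDUs.
def parse(edus, actions):
    """actions: 'shift' or ('reduce', relation, nuclearity) tuples."""
    buffer = list(edus)
    stack = []
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        else:  # reduce: combine the two topmost spans into one subtree
            _, relation, nuclearity = action
            right = stack.pop()
            left = stack.pop()
            stack.append((relation, nuclearity, left, right))
    assert len(stack) == 1 and not buffer, "actions must yield one tree"
    return stack[0]

edus = ["e1", "e2", "e3"]
actions = ["shift", "shift", ("reduce", "elaboration", "NS"),
           "shift", ("reduce", "attribution", "SN")]
print(parse(edus, actions))
# ('attribution', 'SN', ('elaboration', 'NS', 'e1', 'e2'), 'e3')
```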
]
---
.left-column[
## Discourse parsing
### - Corpora
##### RST DT
###### Segmentation
###### Parsing
]
.right-column[
### Representing discourse units and their combination [Ji and Eisenstein, 2014](https://www.aclweb.org/anthology/P14-1002.pdf)
* Idea: jointly learn the task and the word representation (as low dimensional
vector)
* Test 3 options for transforming the original features, taking into account
relationships between adjacent EDUs
### Overcoming the lack of data by splitting the task [Wang et al, 2017](https://www.aclweb.org/anthology/P17-2029.pdf)
* Idea: not enough data for structure + nuclearity + relation.
* First: build a parser that identifies the naked structure + nuclearity
* Then: relations, 3 classifiers (within/across sentence, across paragraphs)
[Morey et al, 2017](https://hal.archives-ouvertes.fr/hal-01650251/document): evaluation problem, the scores in [Ji and Eisenstein, 2014] were
not computed using the right evaluation metric: F1=57.8% (and not 61.6%)
<img src="images/rst-parsing.png" width="40%"/>
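The span-based scoring at issue can be sketched like this: trees are compared as sets of labeled spans, and F1 is computed over them (the spans below are toy examples, not RST DT data).

```python
# Toy labeled spans: ((first_edu, last_edu), relation).
def spans_f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    p = tp / len(pred)
    r = tp / len(gold)
    return 2 * p * r / (p + r)

gold = {((1, 2), "elaboration"), ((1, 3), "attribution")}
pred = {((1, 2), "elaboration"), ((2, 3), "attribution")}
print(spans_f1(gold, pred))  # 0.5
```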
]
---
.left-column[
## Discourse parsing
### - Corpora
##### RST DT
###### Segmentation
###### Parsing
]
.right-column[
### What about other languages? [Braud et al., 2017](https://arxiv.org/pdf/1701.02946.pdf)
* Cross-lingual experiments:
* Train only on data for other languages
* Train on data for other languages but optimize the hyper-parameters on data
for the target language
* Transfer is very hard!
* Monolingual experiments: large drop in performance for languages other than
English, i.e. smaller corpora
<img src="images/rst-cross.png" width="100%"/>
]