+++
date = "2017-03-15T22:22:20+01:00"
title = "parsing"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
Description = ""
+++

# Dependency parsing

**Grew-parse-FR** is a natural language parser for French.
It is composed of a GRS (Graph Rewriting System) which can be used with the Grew software to produce dependency syntax structures from POS-tagged data.
With a POS-tagger (**Talismane** is recommended), it provides a full parser with sentences as input and dependency structures as output.
The parsing GRS is described in an [IWPT 2015 publication](https://hal.inria.fr/hal-01188694).

# How to parse a sentence?

We consider the sentence:

- *"La souris a été mangée par le chat."* [*"The mouse was eaten by the cat."*].

The parsing is done in three steps:

1. POS-tagging with **Talismane**
2. Converting the **Talismane** output into a format usable by **Grew** (with a **sed** script)
3. Building the dependency syntax structure by applying the Graph Rewriting System

# Prerequisites

 * **Talismane**:
   * Download the following three files from the [Talismane GitHub releases page](https://github.com/joliciel-informatique/talismane/releases): `talismane-distribution-5.2.0-bin.zip`, `frenchLanguagePack-5.2.0.zip` and `talismane-fr-5.2.0.conf`.
   * Unzip `talismane-distribution-5.2.0-bin.zip` only; leave `frenchLanguagePack-5.2.0.zip` zipped.
 * **Grew**: see the [Installation page](../installation)
 * **POStoSSQ**: get it with the command: `git clone https://gitlab.inria.fr/grew/POStoSSQ.git`
 * **tal2grew.sed**: download the sed script [`tal2grew.sed`](/parsing/tal2grew.sed)
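
Before running the steps, it can help to check that everything listed above is in place. The following is a minimal sketch, not part of Grew or Talismane: the helper names `missing_tools` and `missing_files` are hypothetical, and the file list assumes that the configuration file, the sed script, the `POStoSSQ` clone and `talismane-core-5.2.0.jar` all sit in the current directory, which may not match your layout.

```shell
# Sketch only: missing_tools and missing_files are hypothetical helper names.

# Print each command name that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print each file name that does not exist (relative to the current directory).
missing_files() {
  for f in "$@"; do
    [ -e "$f" ] || echo "$f"
  done
}

# Assumed layout: everything is reachable from the current directory.
missing_tools java sed grew
missing_files talismane-core-5.2.0.jar talismane-fr-5.2.0.conf \
  tal2grew.sed POStoSSQ/grs/surf_synt_main.grs
```

Each name printed is a tool or file that still has to be installed or downloaded; no output means the checks passed.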


# More info on the parsing process

## Step 0: Get the text to parse

Put the input text in the file `data.txt`:

`echo "La souris a été mangée par le chat." > data.txt`


## Step 1: POS-tagging
The parsing system **POStoSSQ** expects POS-tagged input.
An easy way to produce POS-tagged French input is to use [Talismane](http://redac.univ-tlse2.fr/applications/talismane.html).

Call **Talismane** for tokenisation and POS-tagging with the command:

```
java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-5.2.0.jar \
  --analyse \
  --endModule=posTagger \
  --sessionId=fr \
  --encoding=UTF8 \
  --inFile=data.txt \
  --outFile=data.tal
```

This should produce the file [`data.tal`](/parsing/data.tal):

{{< input file="static/parsing/data.tal" >}}

## Step 2: Convert output
Apply the sed script:

`sed -f tal2grew.sed data.tal > data.pos.conll`

This produces the file [`data.pos.conll`](/parsing/data.pos.conll):

{{< input file="static/parsing/data.pos.conll" >}}

## Step 3: Parsing with the GRS

With the file [`data.pos.conll`](/parsing/data.pos.conll) described above, the following command produces the CoNLL representation of the parsed sentence:

`grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll -o data.surf.conll`

The output file is [`data.surf.conll`](/parsing/data.surf.conll):

{{< input file="static/parsing/data.surf.conll" >}}

which encodes the syntactic structure:

![Dependency structure](/parsing/data.surf.svg)

It is also possible to run a GTK interface in which you can explore the step-by-step rewriting of the input sentence:

`grew gui -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll`
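
Steps 1 to 3 can also be chained in a single shell function. The following is a minimal sketch under stated assumptions, not an official script: the names `run`, `run_to`, `parse_fr` and the `DRY_RUN` variable are hypothetical, the input file name is assumed to end in `.txt`, and all files are assumed to be in the current directory. The three commands themselves are the ones given in the steps above.

```shell
# Sketch only: run, run_to, parse_fr and DRY_RUN are hypothetical names,
# not part of Grew or Talismane.

# Print a command, then run it unless DRY_RUN=1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Same as run, but redirect standard output to the file given first.
run_to() {
  out=$1; shift
  echo "+ $* > $out"
  [ "${DRY_RUN:-0}" = "1" ] || "$@" > "$out"
}

# parse_fr data.txt  ->  data.tal, data.pos.conll, data.surf.conll
parse_fr() {
  base=${1%.txt}
  # Step 1: tokenisation and POS-tagging with Talismane
  run java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf \
      -jar talismane-core-5.2.0.jar --analyse --endModule=posTagger \
      --sessionId=fr --encoding=UTF8 \
      --inFile="$1" --outFile="$base.tal" || return 1
  # Step 2: convert the Talismane output for Grew
  run_to "$base.pos.conll" sed -f tal2grew.sed "$base.tal" || return 1
  # Step 3: apply the GRS
  run grew transform -grs POStoSSQ/grs/surf_synt_main.grs \
      -i "$base.pos.conll" -o "$base.surf.conll"
}
```

Calling `parse_fr data.txt` runs the three commands in order and stops at the first failure. Setting `DRY_RUN=1` beforehand only prints the commands, which is a cheap way to check file names before launching Talismane.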

# In case of trouble

## Conversion of Talismane output
**Talismane** outputs features with a disjunction of values in case of ambiguity.
These disjunctions cannot be handled by the current parsing system.
The sed script [`tal2grew.sed`](/parsing/tal2grew.sed) rewrites or removes the disjunctions we have discovered so far, but this may not be exhaustive.

If there is an error in the **Grew** output, you may have to adapt Step 3.1 in the sed file (please inform [us](mailto:Bruno.Guillaume@inria.fr) if this is the case, so that we can update the sed file for other users!).

## Use MElt instead of Talismane

If you cannot get **Talismane** to work, **MElt** is an alternative POS-tagger.
See [Dependency parsing with MElt](../parsing_melt) if you want to use [MElt](https://gforge.inria.fr/frs/?group_id=481).