Commit 5e5cce7f authored by Bruno Guillaume

talismane tagger

parent f61f7551
......@@ -5,79 +5,80 @@ title = "installation"
# Grew installation
The installation proceeds in two steps: first, installation of the native library, and second, installation of the Python library.
**Grew** is implemented with the [Ocaml](http://ocaml.org) language.
**Grew** is implemented with the **[Ocaml](http://ocaml.org)** language.
It is easy to install on Linux or Mac OS X (installation on Windows should be possible, but this is untested).
A Python binding is also available.
:warning: If you run into trouble using the instructions of this page, feel free to [open an issue on GitLab](https://gitlab.inria.fr/grew/grew_doc/issues) or to [contact the developer](mailto:Bruno.Guillaume@inria.fr?subject=Install%20of%20Grew).
## Linux
### Step 1: native library
* `apt-get install opam m4 aspcud` # Prerequisite
* `opam init -a -y --comp 4.06.0` # Download and install Ocaml (4.06.0)
* ```eval `opam config env` ``` # Make Ocaml ready to be used now
* `opam remote add grew "http://opam.grew.fr"` # Add the grew OPAM repository
* `opam install grew grewpy` # Install Grew
## Step 1: Install prerequisite
To verify your installation
### Linux
```bash
apt install wget m4 unzip librsvg2-bin curl bubblewrap
```
* Try the command `grew version`
* In case of trouble, make sure that your PATH contains `~/.opam/4.06.0/bin` and try again
* If trouble persists, please [file an issue](https://gitlab.inria.fr/grew/grew_doc/issues)
### Mac OS X
* Install **[XCode](https://developer.apple.com/xcode/)**
* Install the package manager **[MacPorts](http://www.macports.org/)**
### Step 2: The Python library
:warning: **[Brew](https://brew.sh/)** is an alternative only if you do not plan to use the GUI (the package `webkit-gtk` required by the GUI is not available through **Brew**).
With Python 3, use the following command:
`pip install grew`
* `sudo port install aspcud`
Note: depending on your local installation, you may have to use `pip3` or `pip3.5`.
## Step 2: Install opam
**opam** is a package manager for **Ocaml**.
**Grew** requires **opam** version **2.0.0** or higher.
### Linux
The `apt` package manager does not currently (February 2019) provide `opam` version 2.
You should be able to install version **2.0.3** with the following commands:
## Mac OS X
* `wget -q https://github.com/ocaml/opam/releases/download/2.0.3/opam-2.0.3-x86_64-linux`
* `sudo mv opam-2.0.3-x86_64-linux /usr/local/bin/opam`
* `sudo chmod a+x /usr/local/bin/opam`
### Step 1: native library
* Install [XCode](https://developer.apple.com/xcode/)
* Install the package manager [MacPorts](http://www.macports.org/) (:warning: [Brew](https://brew.sh/) is an alternative only if you do not plan to use the GUI. The package `webkit-gtk` required by the GUI is not available through Brew).
* `sudo port install opam aspcud` # Install opam and aspcud
* `opam init -a -y --comp 4.06.0` # Download and install Ocaml (4.06.0)
* ```eval `opam config env` ``` # Make Ocaml ready to be used now
* `opam remote add grew "http://opam.grew.fr"` # Add the grew OPAM repository
* `opam install grew grewpy` # Install Grew
For more information, please consult the [**opam** installation page](https://opam.ocaml.org/doc/Install.html).
To verify your installation
### Mac OS X
* Try the command `grew version`
* In case of trouble, make sure that your PATH contains `~/.opam/4.06.0/bin` and try again
* If trouble persists, please [file an issue](https://gitlab.inria.fr/grew/grew_doc/issues)
**MacPorts** provides **opam** version 2 by default.
### Step 2: The Python library
* `sudo port install opam`
With Python 3, use the following command:
`pip install grew`
## Step 3: Setup opam
Note: depending on your local installation, you may have to use something like `pip3` or `pip3.5`.
Run `opam init` and follow the instructions.
Note that it takes some time to download and build the `ocaml` compiler.
NB: some users have reported that the command `opam init --disable-sandboxing` may avoid errors given by `opam init`.
# Upgrade
Check that `ocaml` is installed with `ocamlc -v`.
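Before moving on, it can help to confirm that the tools are reachable from the shell. The following is a minimal sketch and not part of the Grew instructions; `missing_commands` is a hypothetical helper name.

```bash
# Hypothetical helper: print each of the given command names that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Prints nothing when both tools are found.
missing_commands opam ocamlc
```

If the output is not empty, go back to step 2 or step 3 before installing Grew.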
### On Linux
When grew is already installed, you can upgrade to the latest version with:
## Step 4: Install the Grew software
* `apt-get update && apt-get upgrade` # upgrade prerequisites
* `opam update && opam upgrade` # upgrade OCaml part
* `pip install grew --upgrade` # upgrade Python part
```bash
opam remote add grew "http://opam.grew.fr"
opam install grew grewpy
```
### On Mac OSX
When grew is already installed, you can upgrade to the latest version with:
To verify your installation:
* Try the command `grew version`
* In case of trouble, make sure that your PATH contains `~/.opam/default/bin` and try again
* If trouble persists, please [file an issue](https://gitlab.inria.fr/grew/grew_doc/issues)
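The PATH check mentioned above can be sketched as follows. This assumes the default opam switch lives in `~/.opam/default`; `path_contains` is a hypothetical helper, and `opam env` is the opam 2 command that prints the environment settings.

```bash
# Hypothetical helper: succeed if directory $1 is an entry of the
# colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a hint on stderr when the default opam switch is not on PATH.
if ! path_contains "$HOME/.opam/default/bin" "$PATH"; then
    echo "Hint: run 'eval \$(opam env)' and try 'grew version' again" >&2
fi
```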
## Step 5: The Python library
With Python 3, use the following command:
`pip install grew`
Note: depending on your local installation, you may have to use `pip3` or `pip3.5`.
* `sudo port sync && sudo port upgrade` # MacPorts upgrade
* `opam update && opam upgrade` # upgrade OCaml part
* `pip install grew --upgrade` # upgrade Python part
# Other available installations
* A Gtk user interface is available, see [here](../gtk).
* A Gtk user interface is available, see [here](../install_gtk).
* A docker file with the Python library ready to be used is available [here](../docker).
+++
date = "2018-04-25"
title = "Gtk installation"
+++
date = "2019-02-19T18:02:29+01:00"
title = "install_gtk"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
Description = ""
A GTK interface is available (on Linux and Mac OS X, untested on Windows) separately.
+++
# Installation of the GTK interface
We suppose that the basic version ([see install page](../install)) is already installed.
We suppose that the basic version ([see installation page](../install)) is already installed.
## Linux
* Install GUI interface
......
......@@ -9,9 +9,11 @@ Description = ""
# Dependency parsing
NB: the previous version of this page (with **MElt** tagger) is available [here](../parsing_melt)
**Grew-parse-FR** is a natural language parser for French.
It is composed of a GRS (Graph Rewriting System) which can be used with the Grew software to produce dependency syntax structures from POS-tagged data.
With a POS-tagger (**MElt** is recommended), it provides a full parser with sentences as input and dependency structures as output.
With a POS-tagger (**Talismane** is recommended), it provides a full parser with sentences as input and dependency structures as output.
The parsing GRS is described in an [IWPT 2015 publication](https://hal.inria.fr/hal-01188694).
# How to parse a sentence?
......@@ -20,61 +22,87 @@ We consider the sentence:
- *"La souris a été mangée par le chat."* [*"The mouse was eaten by the cat."*].
The parsing is done in two steps:
The parsing is done in three steps:
1. POS-tagging with **MElt**
2. Building the dependency syntax structure by applying a Graph Rewriting System
1. POS-tagging with **Talismane**
2. Converting the **Talismane** output into a format usable by **Grew** (with a **sed** script)
3. Building the dependency syntax structure by applying a Graph Rewriting System
# Prerequisite
* **MElt**: see [MElt download page](https://gforge.inria.fr/frs/?group_id=481)
* **Talismane**:
* Download the following 3 files from the [Talismane GitHub page](https://github.com/joliciel-informatique/talismane/releases): `talismane-distribution-5.2.0-bin.zip`, `frenchLanguagePack-5.2.0.zip` and `talismane-fr-5.2.0.conf`.
* Unzip `talismane-distribution-5.2.0-bin.zip` (and not the other zip file).
* **Grew**: see [Installation page](../installation)
* **POStoSSQ**: get it with the command: `git clone https://gitlab.inria.fr/grew/POStoSSQ.git`
* Download the sed script [`tal2grew.sed`](/parsing/tal2grew.sed)
# More info on the parsing process
## POS-tagging
## Step 0: Get the text to parse
Put the input text in the file `data.txt`
`echo "La souris a été mangée par le chat." > data.txt`
## Step 1: POS-tagging
The parsing system **POStoSSQ** expects POS-tagged input.
One easy way to produce a POS-tagged French sentence is to use [MElt](https://gforge.inria.fr/frs/?group_id=481).
It should be possible to use another tagger, but this may require mapping a few categories to adapt the output of the tagger.
One easy way to produce a POS-tagged French sentence is to use [Talismane](http://redac.univ-tlse2.fr/applications/talismane.html).
It is possible to use another tagger (see [Dependency parsing with MElt](../parsing_melt) if you want to use [MElt](https://gforge.inria.fr/frs/?group_id=481)).
Call **Talismane** for tokenisation and POS-tagging with the command:
```
java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-5.2.0.jar \
--analyse \
--endModule=posTagger \
--sessionId=fr \
--encoding=UTF8 \
--inFile=data.txt \
--outFile=data.tal
```
This should produce the file [`data.tal`](/parsing/data.tal):
{{< input file="static/parsing/data.tal" >}}
With **MElt**, the options `-L` and `-T` are used to tokenize the input sentence and to lemmatize the output.
For instance, the following command:
## Step 2: Convert output
Apply the sed script:
`echo "La souris a été mangée par le chat." | MElt -L -T > test.melt`
`sed -f tal2grew.sed data.tal > data.pos.conll`
produces the file [`test.melt`](/parsing/test.melt):
This produces the file [`data.pos.conll`](/parsing/data.pos.conll):
{{< input file="static/parsing/test.melt" >}}
{{< input file="static/parsing/data.pos.conll" >}}
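To get a feel for what the conversion does, a single rule from `tal2grew.sed` can be tried on its own. The sketch below uses a made-up feature string rather than real Talismane output, and `rewrite_present` is a hypothetical name.

```bash
# Apply one rule copied from tal2grew.sed to standard input:
# the Talismane tense code "P" becomes an explicit mood and tense.
rewrite_present() {
    sed -e 's/t=P/m=ind|t=pst/'
}

# Hypothetical feature string, for illustration only.
printf '%s\n' 'n=s|p=3|t=P' | rewrite_present
```

This prints `n=s|p=3|m=ind|t=pst`: the `|` is a literal character in a basic sed regular expression and in the replacement, so the rule inserts one extra feature.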
## Parsing with the GRS
## Step 3: Parsing with the GRS
With the file [`test.melt`](/parsing/test.melt) described above, the following command produces the CoNLL code of the parsed sentence:
With the file [`data.pos.conll`](/parsing/data.pos.conll) described above, the following command produces the CoNLL code of the parsed sentence:
`grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i test.melt -o test.surf.conll`
`grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll -o data.surf.conll`
The output file is [`test.surf.conll`](/parsing/test.surf.conll):
The output file is [`data.surf.conll`](/parsing/data.surf.conll):
{{< input file="static/parsing/test.surf.conll" >}}
{{< input file="static/parsing/data.surf.conll" >}}
which encodes the syntactic structure:
![Dependency structure](/parsing/test.surf.svg)
![Dependency structure](/parsing/data.surf.svg)
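Steps 1 to 3 can be chained in a small shell function. This is a sketch under the same assumptions as the commands above (the Talismane jar and configuration file, `tal2grew.sed` and the `POStoSSQ` clone are all in the current directory); `parse_file` is a hypothetical name.

```bash
# Sketch: run steps 1 to 3 on one text file, e.g. data.txt.
parse_file() {
    input="$1"
    base="${input%.txt}"

    # Step 1: tokenisation and POS-tagging with Talismane
    java -Xmx1G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-5.2.0.jar \
        --analyse \
        --endModule=posTagger \
        --sessionId=fr \
        --encoding=UTF8 \
        --inFile="$input" \
        --outFile="$base.tal" || return 1

    # Step 2: convert the Talismane output for Grew
    sed -f tal2grew.sed "$base.tal" > "$base.pos.conll" || return 1

    # Step 3: apply the parsing GRS
    grew transform -grs POStoSSQ/grs/surf_synt_main.grs \
        -i "$base.pos.conll" -o "$base.surf.conll"
}

# Usage: parse_file data.txt
```

The function stops at the first failing step, so a missing intermediate file points to the step that went wrong.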
It is also possible to run a GTK interface in which you can explore the step-by-step rewriting of the input sentence:
`grew gui -grs POStoSSQ/grs/surf_synt_main.grs -i test.melt`
`grew gui -grs POStoSSQ/grs/surf_synt_main.grs -i data.pos.conll`
## Parsing a set of sentences
No explicit linking with a sentence tokenizer is provided.
We will suppose here that the input file is already split into sentences (one per line).
# In case of trouble
Suppose that the file [`tdm80_ch01.txt`](/parsing/tdm80_ch01.txt) contains the following data:
**Talismane** outputs features with a disjunction of values in case of ambiguity.
These disjunctions cannot be handled by the current parsing system.
The sed script [`tal2grew.sed`](/parsing/tal2grew.sed) rewrites or removes the disjunctions we have discovered so far, but this may not be exhaustive.
{{< input file="static/parsing/tdm80_ch01.txt" >}}
If there is an error in the **Grew** output, you may have to adapt Step 3.1 in the sed file (if so, please inform [us](mailto:Bruno.Guillaume@inria.fr) and we will update the sed file for other users!).
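One way to spot disjunctions that the sed script did not handle is to search the converted file for feature values that still contain a comma. This is a rough sketch, not part of the Grew tooling: `find_disjunctions` is a hypothetical helper, and it relies on `grep -o`, which GNU and BSD grep both support.

```bash
# Hypothetical helper: list the distinct "name=value" features in file $1
# whose value still contains a comma, i.e. an unresolved disjunction.
find_disjunctions() {
    grep -o '[a-z]*=[^|[:space:]]*,[^|[:space:]]*' "$1" | sort -u
}

# Usage: find_disjunctions data.pos.conll
```

Each reported value is a candidate for a new substitution rule in Step 3.1 of the sed file.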
The parsing can be done with the same two-step process:
1. POS-tagging with melt: `cat tdm80_ch01.txt | MElt -L -T > tdm80_ch01.melt`
2. Building the dependency syntax structure: `grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i tdm80_ch01.melt -o tdm80_ch01.conll`
+++
date = "2017-03-15T22:22:20+01:00"
title = "parsing"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
Description = ""
+++
# Dependency parsing with MElt
**Grew-parse-FR** is a natural language parser for French.
It is composed of a GRS (Graph Rewriting System) which can be used with the Grew software to produce dependency syntax structures from POS-tagged data.
With a POS-tagger (**MElt** is recommended), it provides a full parser with sentences as input and dependency structures as output.
The parsing GRS is described in an [IWPT 2015 publication](https://hal.inria.fr/hal-01188694).
# How to parse a sentence?
We consider the sentence:
- *"La souris a été mangée par le chat."* [*"The mouse was eaten by the cat."*].
The parsing is done in two steps:
1. POS-tagging with **MElt**
2. Building the dependency syntax structure by applying a Graph Rewriting System
# Prerequisite
* **MElt**: see [MElt download page](https://gforge.inria.fr/frs/?group_id=481)
* **Grew**: see [Installation page](../installation)
* **POStoSSQ**: get it with the command: `git clone https://gitlab.inria.fr/grew/POStoSSQ.git`
# More info on the parsing process
## POS-tagging
The parsing system **POStoSSQ** expects POS-tagged input.
One easy way to produce a POS-tagged French sentence is to use [MElt](https://gforge.inria.fr/frs/?group_id=481).
It should be possible to use another tagger, but this may require mapping a few categories to adapt the output of the tagger.
With **MElt**, the options `-L` and `-T` are used to tokenize the input sentence and to lemmatize the output.
For instance, the following command:
`echo "La souris a été mangée par le chat." | MElt -L -T > test.melt`
produces the file [`test.melt`](/parsing/test.melt):
{{< input file="static/parsing/test.melt" >}}
## Parsing with the GRS
With the file [`test.melt`](/parsing/test.melt) described above, the following command produces the CoNLL code of the parsed sentence:
`grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i test.melt -o test.surf.conll`
The output file is [`test.surf.conll`](/parsing/test.surf.conll):
{{< input file="static/parsing/test.surf.conll" >}}
which encodes the syntactic structure:
![Dependency structure](/parsing/test.surf.svg)
It is also possible to run a GTK interface in which you can explore the step-by-step rewriting of the input sentence:
`grew gui -grs POStoSSQ/grs/surf_synt_main.grs -i test.melt`
## Parsing a set of sentences
No explicit linking with a sentence tokenizer is provided.
We will suppose here that the input file is already split into sentences (one per line).
Suppose that the file [`tdm80_ch01.txt`](/parsing/tdm80_ch01.txt) contains the following data:
{{< input file="static/parsing/tdm80_ch01.txt" >}}
The parsing can be done with the same two-step process:
1. POS-tagging with melt: `cat tdm80_ch01.txt | MElt -L -T > tdm80_ch01.melt`
2. Building the dependency syntax structure: `grew transform -grs POStoSSQ/grs/surf_synt_main.grs -i tdm80_ch01.melt -o tdm80_ch01.conll`
+++
date = "2019-02-19T17:56:47+01:00"
title = "upgrade"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
Description = ""
+++
# Upgrading to a new version
## Make sure that your opam is in version 2
The latest version of **grew_doc** requires **opam** version **2.0.0** or higher.
You can check your version with the command `opam --version`.
If it is not version 2, reinstall **opam** version 2 by following steps 2 and 3 on the [Installation page](../installation).
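The version check can be scripted. The sketch below only inspects the major number of a version string such as `2.0.3`; `opam_is_v2` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if the version string $1 has a major number >= 2.
opam_is_v2() {
    version="$1"
    major="${version%%.*}"   # text before the first dot
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    [ "$major" -ge 2 ]
}

# Usage, assuming opam is on PATH:
#   opam_is_v2 "$(opam --version)" || echo "Please reinstall opam in version 2"
```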
## Update prerequisite
### Linux
```bash
apt-get update && apt-get upgrade
```
### On Mac OSX
```bash
sudo port sync && sudo port upgrade
```
## Update the Grew software
```bash
opam update
opam upgrade
```
## Update the Python binding
```bash
pip install grew --upgrade
```
......@@ -13,9 +13,11 @@
<li><a href="http://parse.grew.fr">Grew-parse (Online parsing)</a></li>
<li class="section">Use Grew</li>
<li><a href="/install/">Install</a></li>
<li><a href="/install/">Install Grew CLI & Python</a></li>
<li><a href="/tuto/">Run Python library</a></li>
<li><a href="/run/">Run command line program</a></li>
<li><a href="/run/">Run command line interface</a></li>
<li><a href="/install_gtk/">Install Gtk-based GUI</a></li>
<li><a href="/upgrade/">Upgrade</a></li>
<li class="section">Available GRS</li>
<li><a href="/parsing/">Dependency parsing</a></li>
......
......@@ -10,6 +10,8 @@ li {
.section {
border-top: 2px solid #B5CFDA;
margin-top: 10pt;
padding-top: 10pt;
margin-left: 0pt;
font-size: 20pt;
}
......
# This file contains sed commands to transform talismane output to a CoNLL format suitable for parsing with Grew
# See http://grew.fr/parsing/ for more information
# --------------------------------------
# Step 1: add columns to have 10 columns
s/ $/ _ _ _ _/
# --------------------------------------
# Step 2: replace column 4 by "_"
s/ADJ ADJ/_ ADJ/
s/ADJWH ADJWH/_ ADJWH/
s/ADV ADV/_ ADV/
s/ADVWH ADVWH/_ ADVWH/
s/CC CC/_ CC/
s/CLO CLO/_ CLO/
s/CLR CLR/_ CLR/
s/CLS CLS/_ CLS/
s/CS CS/_ CS/
s/DET DET/_ DET/
s/DETWH DETWH/_ DETWH/
s/ET ET/_ ET/
s/I I/_ I/
s/NC NC/_ NC/
s/NPP NPP/_ NPP/
s/P P/_ P/
s/P+D P+D/_ P+D/
s/P+PRO P+PRO/_ P+PRO/
s/PONCT PONCT/_ PONCT/
s/PREF PREF/_ PREF/
s/PRO PRO/_ PRO/
s/PROREL PROREL/_ PROREL/
s/PROWH PROWH/_ PROWH/
s/V V/_ V/
s/VIMP VIMP/_ VIMP/
s/VINF VINF/_ VINF/
s/VPP VPP/_ VPP/
s/VPR VPR/_ VPR/
s/VS VS/_ VS/
# --------------------------------------
# Step 3: Change features
# Step 3.1: Talismane uses the comma to deal with ambiguities. These values are rewritten or removed
# NOTE: this list may not be exhaustive and should be extended if needed
s/t=P,S/t=pst/
s/t=J,P/m=ind/
s/g=f,m//
s/n=p,s//
s/p=1,3//
s/p=1,2//
# Step 3.2: Removing features may leave an invalid feature syntax; the following lines fix this.
s/ |/ /
s/| / /
s/|||/|/
s/||/|/
# Step 3.3: possessives
s/poss=s/s=poss/
s/poss=p/s=poss/
# Step 3.4: verbal features
s/t=C/m=ind|t=cond/
s/t=F/m=ind|t=fut/
s/t=G/m=part|t=pst/
s/t=I/m=ind|t=impft/
s/t=J/m=ind|t=past/
s/t=K/m=part|t=past/
s/t=P/m=ind|t=pst/
s/t=T/m=subj|t=past/
s/t=S/m=subj|t=pst/
s/t=W/m=inf/
s/t=Y/m=imp|t=pst/