Commit 0c799ae0 authored by Bruno Guillaume

add page for the “Corpus stat” tool

parent a103e7b7
+++
Description = ""
date = "2019-10-23T16:34:39+02:00"
title = "corpus_stat"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
+++
The tool `grew_daemon` was initially built to be used as the daemon to answer requests in **Grew-match**.
But it can also be used as a command line tool to compute statistics on sets of corpora.
# Install the `grew_daemon` tool
Follow the general instructions for [Grew installation](../install) and then install the `grew_daemon` tool with:
`opam install grew_daemon`
# Describe the set of corpora on which you want to compute statistics
A JSON file is used to describe the set.
Each corpus is described by an identifier `id` and a `directory` where the files of the corpus are stored.
For instance, the following file `en_fr_zh.json` describes 3 corpora from UD 2.4 (of course, directories should be modified to match your local installation).
```
{ "corpora": [
{ "id": "UD_English-EWT@2.4",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.4/UD_English-EWT/"
},
{ "id": "UD_French-Sequoia@2.4",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.4/UD_French-Sequoia/"
},
{ "id": "UD_Chinese-GSD@2.4",
"directory": "/Users/guillaum/resources/ud-treebanks-v2.4/UD_Chinese-GSD/"
} ]
}
```
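When many treebanks are involved, the description file can be generated rather than written by hand. The following is a minimal sketch, assuming that every treebank lives in a subdirectory of a single root directory and that identifiers follow the `name@version` convention used above; the function name `build_corpus_set` and the path `/path/to/ud-treebanks-v2.4` are hypothetical.

```python
import json
from pathlib import Path


def build_corpus_set(ud_root, treebanks, version):
    """Return a dict with the same shape as the JSON file shown above."""
    root = Path(ud_root)
    return {
        "corpora": [
            # One entry per treebank: an "id" and the "directory" holding its files.
            {"id": f"{name}@{version}", "directory": f"{root / name}/"}
            for name in treebanks
        ]
    }


if __name__ == "__main__":
    # Hypothetical local path; adjust it to your own installation.
    description = build_corpus_set(
        "/path/to/ud-treebanks-v2.4",
        ["UD_English-EWT", "UD_French-Sequoia", "UD_Chinese-GSD"],
        "2.4",
    )
    # Print the JSON; redirect the output to en_fr_zh.json to reuse it below.
    print(json.dumps(description, indent=2))
```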
# Compile the corpora in order to speed up the next step
```
grew_daemon marshal en_fr_zh.json
```
Note that, for each corpus in `en_fr_zh.json`, this produces a new file `id.marshal` stored in the corpus directory.
# Compute statistics
It is possible to compute the number of occurrences of several patterns at the same time.
With the two files:
* `ADJ_NOUN.pat` containing: `pattern { A[upos=ADJ]; N[upos=NOUN]; N -[amod]-> A; A << N }`
* `NOUN_ADJ.pat` containing: `pattern { A[upos=ADJ]; N[upos=NOUN]; N -[amod]-> A; N << A }`
The command below computes the corresponding stats:
```
grew_daemon grep --patterns "ADJ_NOUN.pat NOUN_ADJ.pat" en_fr_zh.json
```
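To call this command from a script, the same arguments can be passed through Python's `subprocess` module. This is only a sketch: the helper names `grep_command` and `run_grep` are hypothetical, and it assumes that `grew_daemon` is on the `PATH` and writes its result to standard output.

```python
import subprocess


def grep_command(pattern_files, corpus_set):
    """Build the argument list for the `grew_daemon grep` command shown above."""
    # The pattern file names are passed as ONE space-separated argument,
    # mirroring the quoted string in the shell command.
    return ["grew_daemon", "grep", "--patterns", " ".join(pattern_files), corpus_set]


def run_grep(pattern_files, corpus_set):
    """Run the command and return its standard output (assumed to be the TSV text)."""
    result = subprocess.run(
        grep_command(pattern_files, corpus_set),
        check=True,           # raise an error if grew_daemon exits with a failure code
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example call (requires grew_daemon and the files from this page):
# tsv_text = run_grep(["ADJ_NOUN.pat", "NOUN_ADJ.pat"], "en_fr_zh.json")
```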
The output is given as TSV data:
```
Corpus # sentences ADJ_NOUN NOUN_ADJ
UD_English-EWT 16622 9838 162
UD_French-Sequoia 3099 891 2777
UD_Chinese-GSD 4997 1481 0
```
which corresponds to the table:
| Corpus | # sentences | ADJ_NOUN | NOUN_ADJ |
|------------|-------------|----------|----|
| UD_English-EWT | 16622 | 9838 | 162 |
| UD_French-Sequoia | 3099 | 891 | 2777 |
| UD_Chinese-GSD | 4997 | 1481 | 0 |
We can then observe that in the 3 corpora in use:
* in English, there is a strong preference for prepositional adjectives
* in French, there is a weak preference for postpositional adjectives
* in Chinese, there is a **very** strong preference for prepositional adjectives
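These preferences can be quantified from the TSV data. The sketch below assumes that the columns are tab-separated and named as in the table above; the function name `order_shares` is hypothetical. It computes, for each corpus, the share of `ADJ_NOUN` matches among the matches of both patterns.

```python
import csv
import io


def order_shares(tsv_text):
    """Map each corpus to the fraction of ADJ_NOUN matches among both patterns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    shares = {}
    for row in reader:
        adj_noun = int(row["ADJ_NOUN"])
        noun_adj = int(row["NOUN_ADJ"])
        total = adj_noun + noun_adj
        # None marks a corpus in which neither pattern was found.
        shares[row["Corpus"]] = adj_noun / total if total else None
    return shares


if __name__ == "__main__":
    # The figures from the table above, with explicit tab separators.
    sample = (
        "Corpus\t# sentences\tADJ_NOUN\tNOUN_ADJ\n"
        "UD_English-EWT\t16622\t9838\t162\n"
        "UD_French-Sequoia\t3099\t891\t2777\n"
        "UD_Chinese-GSD\t4997\t1481\t0\n"
    )
    for corpus, share in order_shares(sample).items():
        print(f"{corpus}: {share:.1%} ADJ_NOUN")
```

With the figures above, the adjective precedes the noun in about 98% of the matches in English, about 24% in French and 100% in Chinese.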
----
## Remarks
* Pattern syntax can be learned [here](/pattern/) or with the online [**Grew-match**](http://match.grew.fr) tool, first with the [tutorial](http://match.grew.fr?tutorial=yes) and then with snippets given on the right of the text area.
* If some data are changed in the corpora, the compilation step must be run again.
* The command `grew_daemon clean en_fr_zh.json` can be used to remove marshal files (results of compilation).
* Some patterns may take some time to be searched in corpora.
\ No newline at end of file
+++
Tags = ["Development","golang"]
Description = ""
menu = "main"
Categories = ["Development","GoLang"]
date = "2019-06-01T20:54:20+02:00"
title = "flat"
+++
# Transformation of single-headed structure into a chained structure
:information_source: You can download files used in this page:
......@@ -7,7 +16,7 @@
There are two basic ways to represent *flat* structures:
1. a single-headed structure: for instance the graph `SH6` below on the left for a 6 words flat structure
1. a chained stucture: for instance the graph `C6` below on the right for the same 6 words flat structure
1. a chained structure: for instance the graph `C6` below on the right for the same 6 words flat structure
| SH6 | C6 |
|:---:|:---:|
......@@ -42,7 +51,7 @@ Output 120 normal forms! For instance:
![C6_120_example](/examples/flat/img/C6_120_example.svg)
Our rule is not strict enough. We have to put more restriction in the pattern part.
If we require that `N1`and `N2`are two consecutive words, the rule is:
If we require that `N1` and `N2` are two consecutive words, the rule is:
```grew
rule sh2c_2 {
......@@ -170,6 +179,3 @@ rule c2sh_strict {
At each step, we ensure that the node `H` of the pattern is matched to the word `head` of the graph.