Commit d91fbdce authored by Bruno Guillaume

update doc for v 1.2

parent b88110e6
+++
date = "2018-06-05T11:16:30+02:00"
title = "features"
title = "graphs"
menu = "main"
Categories = ["Development","GoLang"]
Tags = ["Development","golang"]
......@@ -8,24 +8,37 @@ Description = ""
+++
# CoNLL files
# Graphs definition
The graphs we consider in Grew are defined, as usual in mathematics, by two sets:
* A set **N** of nodes
* A set **E** of edges
A node is described by an identifier (needed to refer to the node in edge definitions) and a feature structure (basically a finite list of pairs (*feature_name*, *feature_value*)).
An edge is described by two nodes (called the *source* and the *target* of the edge) and by an edge label.
Until version 1.1, edge labels were atomic strings.
Since version 1.2, labels are encoded as feature structures (mainly to ease the writing of rules with complex labels like `nsubj:pass`, where we would like to be able to talk about the subparts `nsubj` and `pass` independently).
However, backward compatibility is ensured: users do not need to manipulate subparts of labels and can consider that labels are atomic (including labels like `nsubj:pass`).
See [below](#complex-edge-labels) for more details on complex edge labels.
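
The definitions above can be sketched in a few lines of Python. This is only an illustration, not Grew's actual implementation: the names `Node`, `Edge`, `label_to_features` and `features_to_label` are hypothetical, and the numbered feature names follow the `"x:y" <=> "1=x, 2=y"` correspondence given in the 1.2 release notes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node: an identifier plus a feature structure (hypothetical model)."""
    identifier: str
    features: dict = field(default_factory=dict)


@dataclass
class Edge:
    """An edge: source and target node identifiers plus a label."""
    source: str
    target: str
    label: dict  # the label seen as a feature structure


def label_to_features(label: str) -> dict:
    """View an atomic label such as 'nsubj:pass' as numbered features."""
    return {str(i): part for i, part in enumerate(label.split(":"), start=1)}


def features_to_label(features: dict) -> str:
    """Rebuild the atomic string from the numbered features."""
    return ":".join(features[key] for key in sorted(features, key=int))


subject = Node("N1", {"form": "mic", "upos": "NOUN"})
edge = Edge("N0", subject.identifier, label_to_features("nsubj:pass"))
print(edge.label)
print(features_to_label(edge.label))
```

With such a view, a rule can test the first subpart (`nsubj`) without caring about the second one, while a user who ignores subparts still sees the usual string `nsubj:pass`.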
# Graph input formats
To describe a graph in practice, *Grew* offers several input formats: a native `gr` format, the `conll` format (and a few derived formats), and the `amr` format.
## CoNLL format
The most common way to store dependency structures is the CoNLL format.
Several extension were proposed and we describe here the one which is used by **Grew**, kwown as [CoNLL-U](http://universaldependencies.org/format.html) format defined in the Unverisal Dependency project.
Several extensions were proposed and we describe here the one which is used by **Grew**, known as the [CoNLL-U](http://universaldependencies.org/format.html) format, defined in the Universal Dependencies project.
For each sentence, some metadata are given in lines beginning by `#` followed by one line per lexical unit.
These lines contain 10 fields, separated by tabulations.
For a sentence, some metadata are given in lines beginning with `#`.
The rest of the lines describe the tokens of the structure.
Token lines contain 10 fields, separated by tabulations.
Here is an example of CoNLL-U data taken form the corpus `UD_English-PUD` (version 2.1).
The file [`n01118003.conllu`](/graph/n01118003.conllu) is an example of CoNLL-U data taken from the corpus `UD_English-PUD` (version 2.3).
{{< input file="static/graph/n01118003.conllu" >}}
```
# sent_id = n01118003
# text = Drop the mic.
1 Drop drop VERB VB VerbForm=Inf 0 root _ _
2 the the DET DT Definite=Def|PronType=Art 3 det _ _
3 mic mic NOUN NN Number=Sing 1 obj _ SpaceAfter=No
4 . . PUNCT . _ 1 punct _ _
```
We explain here how **Grew** deals with the 10 fields of CoNLL files:
......@@ -34,7 +47,7 @@ In Grew, it is available as the feature `position` (most of the times it not use
2. **FORM**. The phonological form of the LU.
In Grew, the value of this field is available through a feature named `form`
(for backward compatibility, the keyword `phon` can also be used instead of `form`).
3. **LEMMA**. The lemma of the LU. In Grew, this correponds to the feature `lemma`.
3. **LEMMA**. The lemma of the LU. In Grew, this corresponds to the feature `lemma`.
4. **UPOS**. The field `upos` (for backward compatibility, `cat` can also be used to refer to this field).
5. **XPOS**. The field `xpos` (for backward compatibility, `pos` can also be used to refer to this field).
6. **FEATS**. List of morphological features.
......@@ -43,9 +56,30 @@ In Grew, the value of this field is available through a feature named `form`
9. **DEPS**. Enhanced dependency graph in the form of a list of head-deprel pairs. In Grew, the relations are available with the prefix `E:`.
10. **MISC**. Any other annotation. In Grew, annotations of this field are accessible with the prefix `_MISC_`.
## Note about backward compatibility
Note that the same format is very often used to describe dependency syntax corpora.
In these cases, a set of sentences is described in the same file using the same convention as above, with a blank line as separator between sentences.
It is also required that the `sent_id` metadata is unique for each sentence in the file.
In practice, it may be useful to deal explicitly with the `root` relation (for instance, if some rewriting rule is designed to change the root of the structure).
To allow this, when reading the CoNLL-U format, **Grew** also creates a node at position `0` and links it with the `root` relation to the linguistic `root` node of the sentence.
The example above then produces the 5-node graph below:
![Dependency structure](/graph/n01118003.svg)
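
As a rough illustration of the reading step described above, here is a minimal Python sketch that turns the example sentence into nodes and edges, including the extra node at position `0`. It is not Grew's parser: the function `parse_conllu_sentence` is hypothetical, the column names are the standard CoNLL-U ones, and storing each FEATS pair directly as a node feature is an assumption. Multiword tokens, empty nodes and the DEPS and MISC fields are ignored.

```python
# Standard CoNLL-U column names, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_sentence(text):
    """Return (metadata, nodes, edges) for one sentence (hypothetical helper)."""
    metadata = {}
    nodes = {"0": {}}  # extra node at position 0, source of the root relation
    edges = []         # triples (head id, relation, dependent id)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Metadata lines look like "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError("expected 10 tab-separated fields: " + line)
        token = dict(zip(FIELDS, columns))
        features = {name: token[name] for name in ("form", "lemma", "upos", "xpos")}
        if token["feats"] != "_":
            # Assumption: each FEATS pair becomes a feature of the node.
            for pair in token["feats"].split("|"):
                name, _, value = pair.partition("=")
                features[name] = value
        nodes[token["id"]] = features
        edges.append((token["head"], token["deprel"], token["id"]))
    return metadata, nodes, edges


SAMPLE = "\n".join([
    "# sent_id = n01118003",
    "# text = Drop the mic.",
    "\t".join(["1", "Drop", "drop", "VERB", "VB", "VerbForm=Inf", "0", "root", "_", "_"]),
    "\t".join(["2", "the", "the", "DET", "DT", "Definite=Def|PronType=Art", "3", "det", "_", "_"]),
    "\t".join(["3", "mic", "mic", "NOUN", "NN", "Number=Sing", "1", "obj", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "1", "punct", "_", "_"]),
])

metadata, nodes, edges = parse_conllu_sentence(SAMPLE)
print(len(nodes))  # 4 tokens plus the node at position 0
print(edges[0])    # the root relation from node 0
```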
### Note about backward compatibility
In older versions of Grew (before the definition of the CoNLL-U format), the fields 2, 4 and 5 were accessible with the names `phon`, `cat` and `pos` respectively.
To ensure backward compatibility and a uniform handling of these fields, the 3 names `phon`, `cat` and `pos` are replaced at parsing time by `form`, `upos` and `xpos`.
As a consequence, it is impossible to use both `phon` and `form` in the same system.
We highly recommend to use only the `form` feature in this setting.
Of course, the same observation applies to `cat` and `upos` (`upos` should be used) and to `pos` and `xpos` (`xpos` should be chosen).
\ No newline at end of file
We highly recommend using only the `form` feature in this setting. Of course, the same observation applies to `cat` and `upos` (`upos` should be preferred) and to `pos` and `xpos` (`xpos` should be chosen).
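
The renaming described in this note can be pictured with a small Python sketch. The table `LEGACY_NAMES` and the function `normalize_feature_name` are hypothetical names used for illustration only; they do not come from Grew's code.

```python
# Legacy feature names and the names they are replaced by at parsing time.
LEGACY_NAMES = {"phon": "form", "cat": "upos", "pos": "xpos"}


def normalize_feature_name(name: str) -> str:
    """Map a legacy name to its current name; other names are unchanged."""
    return LEGACY_NAMES.get(name, name)


print([normalize_feature_name(n) for n in ("phon", "cat", "pos", "lemma")])
```

Because both spellings end up as the same feature, a system that uses `phon` and `form` together cannot distinguish them, which is why only one of the two should be used.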
## The native format
:TODO:
## The AMR format
:TODO:
# Complex edge labels
:TODO:
......@@ -13,7 +13,7 @@ Categories = ["Development","GoLang"]
**Grew** is a Graph Rewriting tool dedicated to applications in Natural Language Processing (NLP). It can manipulate many kinds of linguistic representations. It has been used on POS-tagged sequences, surface dependency syntax, deep dependency syntax and semantic representations (AMR, DMRS), but it can be used to represent any graph-based structure.
## News
**2018/09/10:** New release of version **1.0**. See [What's new](/whats/) for changes
**2019/03/26:** New release of version **1.2**. See [What's new](/whats/) for changes
**April 2018:** Publication of the book [*Application of Graph Rewriting to Natural Language Processing*](https://www.wiley.com/en-fr/Application+of+Graph+Rewriting+to+Natural+Language+Processing-p-9781119522348).
Chapter 1 is [available from the publisher's website](https://media.wiley.com/product_data/excerpt/66/17863009/1786300966-587.pdf).
......
......@@ -9,4 +9,48 @@ title = "pattern"
# Pattern syntax
One way to learn the syntax of patterns in grew is to follow the tutorial part of the [Grew-match](http://match.grew.fr) tool.
\ No newline at end of file
A pattern is defined through 3 different parts that are all optional.
* at most one positive clause introduced by the keyword `pattern`, which describes a positive pattern that must be found in the graph.
* any number of negative clauses introduced by the keyword `without`; each clause filters out a subpart of the matchings previously selected.
* :warning: New from version 1.2: at most one global clause introduced by the keyword `global`, which filters out a subpart of graphs.
The global matching process is as follows:
* It takes a graph and a pattern as input.
* It outputs a set of matchings; a matching being a function from nodes and edges defined in the positive clause to nodes and edges of the host graph.
* If the graph does not satisfy one of the global constraints, the output is empty.
* Otherwise, the set M is initialized as the set of matchings which satisfy the positive pattern.
* For each negative clause, matchings which satisfy the negative pattern are removed from M.
* Output M.
Note that if there is more than one negative clause, they are all interpreted independently.
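
The steps above can be sketched in Python as follows. This is a schematic view under simplifying assumptions, not Grew's matching engine: the function `match_pattern` is hypothetical, a graph is reduced to a list of `(source, label, target)` triples, and the three parts of a pattern are passed as plain Python functions.

```python
def match_pattern(graph, global_constraints, positive, negatives):
    """Schematic matching process (hypothetical helper).

    global_constraints: functions graph -> bool
    positive: function graph -> list of matchings (dicts)
    negatives: functions (graph, matching) -> bool, True when the
               negative clause is satisfied by the matching
    """
    # A graph that violates one global constraint yields no matching.
    if not all(constraint(graph) for constraint in global_constraints):
        return []
    # M starts as the set of matchings of the positive clause.
    matchings = list(positive(graph))
    # Each negative clause independently removes the matchings it satisfies.
    for negative in negatives:
        matchings = [m for m in matchings if not negative(graph, m)]
    return matchings


graph = [("1", "obj", "3"), ("3", "det", "2"), ("1", "obj", "5")]


def obj_edges(g):
    # Positive clause: N -[obj]-> M
    return [{"N": s, "M": t} for (s, label, t) in g if label == "obj"]


def has_det(g, m):
    # Negative clause: M has a det dependent
    return any(s == m["M"] and label == "det" for (s, label, t) in g)


print(match_pattern(graph, [], obj_edges, [has_det]))
print(match_pattern(graph, [lambda g: False], obj_edges, [has_det]))
```

In this toy example, the first call keeps only the `obj` edge whose target has no `det` dependent, and the second call returns an empty list because the global constraint fails.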
One way to learn the syntax of patterns in grew is to follow the tutorial part of the [Grew-match](http://match.grew.fr) tool.
## Positive pattern
## Negative pattern
## Global pattern
Global patterns were introduced in version 1.2 to let the user express constraints about the whole graph.
Currently, constraints may be expressed with a fixed list of keywords.
We plan to add more constraints in the near future. Please drop us a [feature request](https://gitlab.inria.fr/grew/grew/issues) if you would like to suggest one.
We describe below 4 of the constraints available in version 1.2.
For each one, its negation is available by replacing the `is_` prefix with the `is_not_` prefix.
* `is_cyclic`: the graph satisfies this constraint if and only if it contains a cycle.
A cycle is a list of nodes `N1`, `N2`, ..., `N(k-1)`, `Nk` such that there are edges `N1 -> N2`, `N2 -> N3`, ..., `N(k-1) -> Nk`, `Nk -> N1`.
In graph theory, a directed graph without cycles is called a Directed Acyclic Graph (DAG).
* `is_forest`: the graph satisfies this constraint if and only if it is acyclic and there is no pair of edges with the same target.
In other words, a graph is a forest if and only if it is acyclic and each node has at most one incoming edge.
* `is_tree`: a graph is a tree if it is a forest and if it has exactly one root.
* `is_projective`: the usual notion of projectivity defined on trees is generalized by saying that a structure is projective if there are no 4-tuples (`A`, `B`, `C`, `D`) of ordered nodes (i.e. `A << B`, `B << C` and `C << D`) such that `A` and `C` are linked and `B` and `D` are linked (two nodes are linked when there is at least one edge between the two, whatever the orientation).
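
The four definitions above can be checked with a short Python sketch. It only illustrates the definitions and is not the code used in Grew. Two assumptions are made: nodes are integers whose numeric order stands for the `<<` order, and a root is a node without incoming edge. Edges are given as `(source, target)` pairs.

```python
from collections import Counter


def is_cyclic(nodes, edges):
    """True if the graph contains a cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for target in successors[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    # Nodes that can never be removed lie on a cycle or below one.
    return removed < len(nodes)


def is_forest(nodes, edges):
    """Acyclic, and each node has at most one incoming edge."""
    incoming = Counter(target for _, target in edges)
    return not is_cyclic(nodes, edges) and all(c <= 1 for c in incoming.values())


def is_tree(nodes, edges):
    """A forest with exactly one root (assumed: node without incoming edge)."""
    targets = {target for _, target in edges}
    roots = [n for n in nodes if n not in targets]
    return is_forest(nodes, edges) and len(roots) == 1


def is_projective(nodes, edges):
    """No A << B << C << D with A, C linked and B, D linked."""
    links = {tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]}
    return not any(a < b < c < d for (a, c) in links for (b, d) in links)


# The 5-node graph of "Drop the mic.", with the extra node 0.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 3), (3, 2), (1, 4)]
print(is_cyclic(nodes, edges), is_forest(nodes, edges),
      is_tree(nodes, edges), is_projective(nodes, edges))
```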
......@@ -18,7 +18,17 @@ More detailled informations in files `CHANGES.md` for each sub-project:
---
# [**last release**] Version 1.1 on November 23, 2018
# [**last release**] Version 1.2 on March 26, 2019
* Edge labels can be viewed as feature structures: "x:y" <=> "1=x, 2=y"
* Add `global` section in pattern (is_projective, is_cyclic, is_tree, is_forest)
* Add `?get_url` parameter to `Graph.to_dot` (AMR handling in Grew-match)
* Add a notion of pivot node in pattern for Grew-match export
* Add `Libgrew.set_track_rules` function
---
# Version 1.1 on November 23, 2018
* More general definition of pattern edges (String are available everywhere)
* Update to new MWE types (with projection information)
......@@ -33,7 +43,7 @@ More detailled informations in files `CHANGES.md` for each sub-project:
---
# Version 0.48 on June 19, 2018
* remove `conll_fields` mechanism (names of conll fields 2, 4 and 5 are `form`, `upos`, `xpos`). See [here](../features#note-about-backward-compatibility) for more information.
* remove `conll_fields` mechanism (names of conll fields 2, 4 and 5 are `form`, `upos`, `xpos`). See [here](../graph#note-about-backward-compatibility) for more information.
---
......
......@@ -32,4 +32,7 @@
<!-- Icons -->
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="/apple-touch-icon-144-precomposed.png">
<link rel="shortcut icon" href="/favicon.png">
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
</head>
......@@ -26,7 +26,7 @@
<!-- <li><a href="/todo">Other GRS</a></li> -->
<li class="section">Documentation</li>
<li><a href="/features/">CoNLL files</a></li>
<li><a href="/graph/">Graphs</a></li>
<li><a href="/pattern/">Pattern syntax</a></li>
<li><a href="/commands/">Command syntax</a></li>
<li><a href="/rule/">Rule syntax</a></li>
......