Commit 4892dcd2 authored by mcandito

Proposal for a README-distrib to include in the release, and addition of the constituency formats

git-svn-id: svn+ssh://scm.gforge.inria.fr/svnroot/deep-sequoia@60 4f834e7d-5b19-456f-8924-a42755b34a2b
parent 77ad8293
-------------------------------------------------------
Deep Sequoia corpus v7.0
-------------------------------------------------------
November 2015
The corpus contains French sentences from Europarl, the Est Republicain newspaper,
French Wikipedia and the European Medicines Agency, with the following manual annotations :
- parts-of-speech and morphological features
- grammatical compound words (merged as one token)
- surface syntax (dependencies and constituents)
- deep syntax (dependencies only)
1. Licence
2. History of the corpus
3. References
4. Content and formats
5. Appendix
** contact : sequoia@inria.fr
** web site : deep-sequoia.inria.fr
** annotation guide : http://passage.inria.fr/deepwiki/node/19 (in French)
------------------------------------------------------
1. Licence
------------------------------------------------------
The corpus is freely available under the LGPL-LR licence
(Lesser General Public License For Linguistic Resources),
cf. http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html
------------------------------------------------------
2. History of the corpus
------------------------------------------------------
The Sequoia corpus was first manually annotated for parts of speech and phrase structure, then automatically converted to surface syntactic dependency trees
(Candito and Seddah, 2012a).
The phrase-structure annotation follows mainly the French Treebank guidelines
( http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php ),
modified in the context of conversion to dependencies:
- prepositions that dominate an infinitival VP do project a PP
- any sentence introduced by a complementizer (CS tag) is grouped into a Sint constituent
A further step of manual annotation was carried out to correct the governor of extracted elements:
in cases of long-distance dependency, the automatic conversion from constituents to dependencies picks
the wrong governor for the extracted element.
These governors were manually corrected, which introduced a few non-projective links
(Candito and Seddah, 2012b).
Then a collaboration started in 2013 between the Alpage and Sémagramme teams,
to obtain DEEP SYNTACTIC DEPENDENCIES on top of surface dependencies.
The main characteristics of the deep syntactic annotation scheme are:
(i) making explicit the subjects of non-finite verbs and of adjectives
(ii) neutralization of diathesis alternations
(iii) distribution of dependents over coordinated governors
This led to a first release of the Deep Sequoia corpus (v 1.0)
(Candito et al., 2014; Perrier et al., 2014).
Annotating the corpus for deep syntax sometimes led to corrections of surface dependencies.
A further systematic search for inconsistencies was then carried out,
using the Grew system (http://wikilligramme.loria.fr/doku.php/grew:grew).
This led to the current release (7.0).
(Note: the release number 7.0 was chosen so that the surface and the deep syntactic annotations of the corpus share the same version number.)
The Deep Sequoia corpus and the surface Sequoia corpus contain the same 3099 sentences.
Note that the original surface corpus (versions prior to 6.0) contained 101 more sentences, which turned out to be duplicates and were
subsequently removed (from the EMEA-test part of the corpus).
See the appendix for the ids of the removed sentences.
------------------------------------------------------
3. References
------------------------------------------------------
** Deep syntactic annotation :
- Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Éric de la Clergerie. (2014) Deep Syntax Annotation of the Sequoia French Treebank. Proc. of LREC 2014, Reykjavik, Iceland.
- Guy Perrier, Marie Candito, Bruno Guillaume, Corentin Ribeyre, Karën Fort and Djamé Seddah. (2014) Un schéma d’annotation en dépendances syntaxiques profondes pour le français. Proc. of TALN 2014, Marseille, France.
** Original paper (surface syntactic annotation) :
- Marie Candito and Djamé Seddah. (2012a) Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical. Proc. of TALN 2012, Grenoble, France.
- Marie Candito and Djamé Seddah. (2012b) Effectively long-distance dependencies in French : annotation and parsing evaluation. Proc. of TLT'11, 2012, Lisbon, Portugal.
------------------------------------------------------
4. Content and formats
------------------------------------------------------
The corpus contains 3099 sentences.
Number of sentences for each sub-domain :
----------------------------------------
561 sentences   Europarl          file= Europar.550+fct.mrg
529 sentences   EstRepublicain    file= annodis.er+fct.mrg
996 sentences   French Wikipedia  file= frwiki_50.1000+fct.mrg
574 sentences   EMEA (dev)        file= emea-fr-dev+fct.mrg
544 sentences   EMEA (test)       file= emea-fr-test+fct.mrg
(among the EMEA (test) sentences, 101 duplicates were removed in version 6.0 of the surface corpus and version 1.0 of the deep corpus)
Tokenization :
--------------
The current tokenization contains some multi-word expressions.
More precisely, only grammatical (aka functional) multi-word expressions are recognized as such and treated as one token (components separated by an underscore, as in "parce_que").
Other multi-word expressions (e.g. nominal ones) are not represented (i.e. "pomme de terre" appears as three tokens).
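To work with the individual words of a merged compound, the token can be split on the underscore. The following is a minimal sketch: the function name is illustrative, and it assumes that an underscore inside a token only ever marks a compound boundary, which this README does not state explicitly.

```python
def split_compound(token):
    """Return the components of a merged grammatical compound.

    Assumption: "_" occurs inside a token only as a component separator.
    A token made of underscores only is returned unchanged.
    """
    parts = [part for part in token.split("_") if part]
    return parts or [token]


if __name__ == "__main__":
    for tok in ["parce_que", "pomme", "de", "terre"]:
        print(tok, split_compound(tok))
```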
The release contains :
******************
sequoia.surf.conll
******************
CoNLL format for surface syntax, augmented with "comment" lines (starting with a #).
A few arcs are non-projective.
This arises in some cases of extraction (which can be traced by searching for the "fctpath" attribute).
The first token of each sentence carries a sentid attribute giving the sentence id.
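Reading such a file could look like the sketch below. It relies on several assumptions that are not confirmed here: columns are tab-separated, sentences are separated by blank lines (the usual CoNLL conventions), and the sentid attribute sits in the features column (column 6) as a "sentid=..." item among "|"-separated features. The sample sentence and its id are invented for illustration.

```python
def read_conll(lines):
    """Yield (comments, tokens) for each sentence.

    comments: the lines starting with "#"
    tokens:   one list of column values per token line
    Assumption: tab-separated columns, blank line between sentences.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            tokens.append(line.split("\t"))
    if tokens:
        yield comments, tokens


def sentence_id(tokens, feats_col=5):
    """Return the sentid carried by the first token, or None.

    Assumption: the attribute is stored as "sentid=<id>" in the
    features column (0-based index 5), features separated by "|".
    """
    for feat in tokens[0][feats_col].split("|"):
        name, _, value = feat.partition("=")
        if name == "sentid":
            return value
    return None


if __name__ == "__main__":
    # Invented two-token sentence, for illustration only.
    sample = [
        "# an illustrative comment line",
        "\t".join(["1", "Il", "il", "CL", "CLS", "s=suj|sentid=demo_00001", "2", "suj", "_", "_"]),
        "\t".join(["2", "dort", "dormir", "V", "V", "m=ind", "0", "root", "_", "_"]),
        "",
    ]
    for comments, tokens in read_conll(sample):
        print(sentence_id(tokens), len(tokens), comments)
```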
******************
sequoia.deep.conll
******************
Extended CoNLL format for deep syntax:
The CoNLL format still contains one token per line, but a given token may have several governors:
this is represented by column 7 containing several governor ids, and column 8 containing the corresponding labels.
For instance :
TODO HERE
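A token with several governors could be decoded along the following lines. This is only a sketch under stated assumptions: the "|" separator, the positional pairing of the n-th governor id with the n-th label, and the label values used in the demonstration are all hypothetical and should be checked against the actual file.

```python
def split_governors(ids_field, labels_field, sep="|"):
    """Pair each governor id with its dependency label.

    Assumptions (not confirmed by this README): several values in one
    field are separated by `sep`, and ids and labels are aligned by
    position.
    """
    ids = ids_field.split(sep)
    labels = labels_field.split(sep)
    if len(ids) != len(labels):
        raise ValueError(
            "mismatched governor ids and labels: %r vs %r" % (ids_field, labels_field)
        )
    return [(int(gov), label) for gov, label in zip(ids, labels)]


if __name__ == "__main__":
    # Hypothetical field values, for illustration only.
    print(split_governors("2|5", "suj|obj"))
    print(split_governors("0", "root"))
```

Raising an error on a length mismatch, rather than silently truncating, makes it easier to notice a wrong guess about the separator.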
********************
sequoia.surf.const
********************
Constituency trees, in bracketed format.
NB: before deep syntactic annotation of the sentences,
the dependency trees were obtained by automatic conversion from the constituency trees.
But when the sentences were annotated with deep dependency syntax,
some "surface" annotation was corrected.
HENCE: the isomorphism between the surface dependency trees and the constituency trees no longer holds.
(Constituency trees are isomorphic to surface dependencies up to version 5.2)
Grammatical functions for dependents of verbs are appended to the non-terminal symbol, after a "-"
(for example, NP-SUJ stands for a subject NP).
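A bracketed tree can be read with a small recursive parser, and the function then separated from the category at the first "-". The sketch below assumes that each tree is wrapped in an unlabeled outer pair of parentheses and that word forms never contain a raw parenthesis; the sample sentence is invented for illustration.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (word forms) are plain strings.
    Assumption: word forms contain no raw "(" or ")".
    """
    tokens = re.findall(r"[()]|[^\s()]+", text)
    if not tokens or tokens[0] != "(":
        raise ValueError("a tree must start with '('")
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]  # e.g. "NP-SUJ"; the outer wrapper has no label
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word form
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse_node()


def split_label(label):
    """Split "NP-SUJ" into ("NP", "SUJ"); a bare "NP" gives ("NP", None)."""
    category, _, function = label.partition("-")
    return category, function or None


def leaves(node):
    """Return the word forms of a parsed tree, left to right."""
    words = []
    for child in node[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


if __name__ == "__main__":
    # Invented sentence, for illustration only.
    tree = parse_bracketed("( (SENT (NP-SUJ (DET Le) (NC conteur)) (VN (V dort)) (PONCT .)))")
    sent = tree[1][0]
    print(sent[0], [split_label(child[0]) for child in sent[1]])
    print(leaves(tree))
```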
********************
sequoia.predeep.const
********************
A few manual deep syntactic annotations were added to the bracketed format.
NB: this is only a very small part of the phenomena handled in the deep dependency format.
With respect to the FTB bracketed format, some grammatical functions have been specialized:
- P_OBJ has been split into P_OBJ.O and P_OBJ.AGT
- MOD has been split into MOD, MOD.APP, MOD.INC, MOD.CLEFT
- non-referential "il" :
(CLS-SUJ##_@@void=y il) is used for a non-referential "il" entering an impersonal alternation
(meaning that the verb alternates with a diathesis in which the subject is referential: Il arrive trois personnes <=> Trois personnes arrivent)
(CLS-SUJ##_@@void=y,intrinsimp=y Il) is used for a non-referential "il" without syntactic alternation
(e.g. "il faut 3 personnes" does not alternate with a construction of the same verb with a referential subject: "*3 personnes faut")
- causative constructions
In a causative verb complex, the canonical grammatical functions are marked.
E.g.:
( (SENT (NP-SUJ##ARGC (DET Le) (NC conteur)) (VN (V a) (VPP fait) (VINF@@diat=causi jouer)) (NP-OBJ##SUJ (DET les) (NC enfants)) (PONCT .)))
The final subject is the canonical "argc" (causer argument), and the final object (les enfants) is the canonical subject of the verb "jouer".
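The node labels shown above appear to follow the pattern CATEGORY-FINALFUNCTION##CANONICALFUNCTION@@feature=value,feature=value, with every part after the category optional. That pattern is inferred from the examples in this section only, not from a format specification, so the following decoder is a sketch under that assumption; the function name and the returned dictionary keys are illustrative.

```python
def decode_predeep_label(label):
    """Decode a predeep node label into its parts.

    Assumed layout (inferred from the README examples):
        CAT[-FINAL][##CANONICAL][@@feat=val[,feat=val...]]
    The canonical function is returned as written, so the "_" used
    for non-referential "il" is kept as "_".
    """
    head, _, feature_text = label.partition("@@")
    head, _, canonical = head.partition("##")
    category, _, final = head.partition("-")
    features = {}
    if feature_text:
        for item in feature_text.split(","):
            name, _, value = item.partition("=")
            features[name] = value
    return {
        "category": category,
        "final": final or None,
        "canonical": canonical or None,
        "features": features,
    }


if __name__ == "__main__":
    for lab in ["NP-SUJ##ARGC", "NP-OBJ##SUJ", "VINF@@diat=causi",
                "CLS-SUJ##_@@void=y,intrinsimp=y"]:
        print(lab, decode_predeep_label(lab))
```

Splitting on "@@" first, then "##", then the first "-" keeps function names that contain dots or underscores (such as P_OBJ.AGT) intact.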
----------------------------------
5. Appendix
----------------------------------
** Data split (TALN 2012 experiments)
The "neutral" domain is made of EstRepublicain + Europarl + FrWiki,
and the split into dev and test sets is the following :
head -265 annodis.er+fct.mrg >> sequoia-neutre-dev+fct.mrg
head -280 Europar.550+fct.mrg >> sequoia-neutre-dev+fct.mrg
head -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-dev+fct.mrg
tail -264 annodis.er+fct.mrg >> sequoia-neutre-test+fct.mrg
tail -281 Europar.550+fct.mrg >> sequoia-neutre-test+fct.mrg
tail -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-test+fct.mrg
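The same split can be reproduced in Python, with a check that the head and tail sizes together cover each file exactly. This sketch assumes, like the head/tail commands, that each .mrg file holds one sentence per line; the function and variable names are illustrative.

```python
# (dev size, test size) per file, copied from the commands above.
NEUTRAL_SPLIT = {
    "annodis.er+fct.mrg": (265, 264),
    "Europar.550+fct.mrg": (280, 281),
    "frwiki_50.1000+fct.mrg": (498, 498),
}


def head_tail_split(sentences, n_dev, n_test):
    """Return (first n_dev items, last n_test items) of `sentences`.

    Raises ValueError if the two parts would overlap or leave a gap,
    i.e. if n_dev + n_test differs from the number of sentences.
    """
    if n_dev + n_test != len(sentences):
        raise ValueError(
            "sizes %d + %d do not match %d sentences" % (n_dev, n_test, len(sentences))
        )
    return sentences[:n_dev], sentences[len(sentences) - n_test:]


def build_neutral_split(read_lines):
    """Build the dev and test lists.

    `read_lines` is any callable mapping a file name to its list of
    lines (assumption: one sentence per line).
    """
    dev, test = [], []
    for name, (n_dev, n_test) in NEUTRAL_SPLIT.items():
        head, tail = head_tail_split(read_lines(name), n_dev, n_test)
        dev.extend(head)
        test.extend(tail)
    return dev, test
```

Unlike the ">>" redirections, which append to existing files when run a second time, this version rebuilds both lists from scratch on every call.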
** Appendix : duplicate sentences removed in version 6.0
< emea-fr-test_00301
< emea-fr-test_00302
< emea-fr-test_00303
< emea-fr-test_00304
< emea-fr-test_00305
< emea-fr-test_00306
< emea-fr-test_00307
< emea-fr-test_00308
< emea-fr-test_00309
< emea-fr-test_00310
< emea-fr-test_00311
< emea-fr-test_00312
< emea-fr-test_00313
< emea-fr-test_00314
< emea-fr-test_00315
< emea-fr-test_00316
< emea-fr-test_00317
< emea-fr-test_00318
< emea-fr-test_00319
< emea-fr-test_00320
< emea-fr-test_00321
< emea-fr-test_00322
< emea-fr-test_00323
< emea-fr-test_00324
< emea-fr-test_00325
< emea-fr-test_00326
< emea-fr-test_00327
< emea-fr-test_00328
< emea-fr-test_00329
< emea-fr-test_00330
< emea-fr-test_00331
< emea-fr-test_00332
< emea-fr-test_00333
< emea-fr-test_00334
< emea-fr-test_00335
< emea-fr-test_00336
< emea-fr-test_00337
< emea-fr-test_00338
< emea-fr-test_00339
< emea-fr-test_00340
< emea-fr-test_00341
< emea-fr-test_00342
< emea-fr-test_00343
< emea-fr-test_00344
< emea-fr-test_00345
< emea-fr-test_00346
< emea-fr-test_00347
< emea-fr-test_00348
< emea-fr-test_00349
< emea-fr-test_00350
< emea-fr-test_00351
< emea-fr-test_00352
< emea-fr-test_00353
< emea-fr-test_00354
< emea-fr-test_00355
< emea-fr-test_00356
< emea-fr-test_00357
< emea-fr-test_00358
< emea-fr-test_00359
< emea-fr-test_00360
< emea-fr-test_00361
< emea-fr-test_00362
< emea-fr-test_00363
< emea-fr-test_00364
< emea-fr-test_00365
< emea-fr-test_00366
< emea-fr-test_00367
< emea-fr-test_00368
< emea-fr-test_00369
< emea-fr-test_00370
< emea-fr-test_00371
< emea-fr-test_00372
< emea-fr-test_00373
< emea-fr-test_00374
< emea-fr-test_00375
< emea-fr-test_00376
< emea-fr-test_00377
< emea-fr-test_00378
< emea-fr-test_00379
< emea-fr-test_00380
< emea-fr-test_00381
< emea-fr-test_00382
< emea-fr-test_00383
< emea-fr-test_00384
< emea-fr-test_00385
< emea-fr-test_00386
< emea-fr-test_00387
< emea-fr-test_00388
< emea-fr-test_00389
< emea-fr-test_00390
< emea-fr-test_00391
< emea-fr-test_00392
< emea-fr-test_00393
< emea-fr-test_00394
< emea-fr-test_00395
< emea-fr-test_00396
< emea-fr-test_00397
< emea-fr-test_00398
< emea-fr-test_00399
< emea-fr-test_00400