-------------------------------------------------------
Deep Sequoia corpus v8.0
-------------------------------------------------------
March 2017

The corpus contains French sentences from Europarl, the Est Républicain newspaper,
French Wikipedia and the European Medicines Agency, with the following manual annotations:

- parts-of-speech and morphological features
- grammatical compound words (merged as one token)
- surface syntax (dependencies and constituents)
- deep syntax (dependencies only)


1. Licence
2. History of the corpus
3. References
4. Content of the release, and formats
5. Appendix

** contact : deep-sequoia@inria.fr

** web site : deep-sequoia.inria.fr

** annotation guide : http://passage.inria.fr/deepwiki/node/19 (in French)

------------------------------------------------------
1. Licence
------------------------------------------------------
The corpus is freely available under the free licence LGPL-LR
(Lesser General Public License For Linguistic Resources)
 cf. http://deep-sequoia.inria.fr/lgpl-lr/

------------------------------------------------------
2. History of the corpus
------------------------------------------------------

The Sequoia corpus was first manually annotated for parts of speech and phrase structure, and then automatically converted to surface syntactic dependency trees
(Candito and Seddah, 2012a).
The phrase-structure annotation mainly follows the French Treebank guidelines
(http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php),
modified in the context of conversion to dependencies:
   - prepositions that dominate an infinitival VP project a PP
   - any sentence introduced by a complementizer (CS tag) is grouped into a Sint constituent


A further step of manual annotation was carried out to correct the governor of extracted elements:
in cases of long-distance dependency, the automatic conversion from constituents to dependencies picks
the wrong governor for the extracted element.
These governors were manually corrected, leading to a few non-projective links
(Candito and Seddah, 2012b).

Then a collaboration started in 2013 between the Alpage and Sémagramme teams,
to obtain DEEP SYNTACTIC DEPENDENCIES on top of surface dependencies.
The main characteristics of the deep syntactic annotation scheme are:
(i) explicit representation of the subjects of non-finite verbs and of adjectives
(ii) neutralization of diathesis alternations
(iii) distribution of dependents over coordinated governors

This led to a first release of the Deep Sequoia corpus (v 1.0)
(Candito et al., 2014; Perrier et al., 2014).
Annotating the corpus for deep syntax sometimes led to corrections of surface dependencies.

A further step of systematic search for inconsistencies was carried out,
using the Grew system (http://grew.loria.fr).
This led to the current release (7.0).
(Note: the current release number (7.0) was chosen so that the surface and the deep syntactic annotations of the corpus have the same version number.)

The Deep Sequoia corpus and the surface Sequoia corpus contain the same 3099 sentences.
Note, however, that the original surface corpus (versions prior to 6.0) contained 101 more sentences, which turned out to be duplicates
and were subsequently removed (from the EMEA-test part of the corpus).
See the appendix for the ids of the removed sentences.

------------------------------------------------------
3. References
------------------------------------------------------

** Deep syntactic annotation:

- Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Éric de la Clergerie. (2014) Deep Syntax Annotation of the Sequoia French Treebank. Proc. of LREC 2014, Reykjavik, Iceland.

- Guy Perrier, Marie Candito, Bruno Guillaume, Corentin Ribeyre, Karën Fort and Djamé Seddah. (2014) Un schéma d’annotation en dépendances syntaxiques profondes pour le français. Proc. of TALN 2014, Marseille, France.


** Original paper (surface syntactic annotation):

Candito M. and Seddah D., 2012a : "Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical", Actes de TALN'2012, Grenoble, France

Candito M. and Seddah D., 2012b : "Effectively long-distance dependencies in French : annotation and parsing evaluation", Proceedings of TLT'11, 2012, Lisbon, Portugal

------------------------------------------------------
4. Content and formats
------------------------------------------------------

The corpus contains 3099 sentences.

Number of sentences for each sub-domain :
----------------------------------------

561 sentences   Europarl           file= Europar.550+fct.mrg
529 sentences   EstRepublicain     file= annodis.er+fct.mrg
996 sentences   French Wikipedia   file= frwiki_50.1000+fct.mrg
574 sentences   EMEA (dev)         file= emea-fr-dev+fct.mrg
544 sentences   EMEA (test)        file= emea-fr-test+fct.mrg

Note: among the 544 EMEA (test) sentences, 101 were duplicates and were removed in version 6.0 of the surface corpus and version 1.0 of the deep corpus.

Tokenization :
--------------
The current tokenization contains some multi-word expressions.
More precisely, only grammatical (aka functional) multi-word expressions were recognized as such, and they are treated as several tokens linked by the new relation dep_cpd.
Other multi-word expressions (e.g. nominal ones) are not represented (i.e. "pomme de terre" appears as three tokens).

The release contains :

******************
sequoia.surf.conll
******************
CoNLL format for surface syntax, augmented with "comments" lines (starting with a #)
A few arcs are non projective.
This arises in some cases of extraction (which can be traced by searching for the "fctpath" attribute).

The first token of each sentence contains a sentid attribute for the sentence id.

For instance:
<------------------------------------ annodis.er_00449 [surf] ----------------------------------->
# sentence-text: Une prochaine réunion est prévue mardi
1	Une	un	D	DET	g=f|n=s|s=ind|sentid=annodis.er_00449	3	det	_	_
2	prochaine	prochain	A	ADJ	g=f|n=s|s=qual	3	mod	_	_
3	réunion	réunion	N	NC	g=f|n=s|s=c	5	suj	_	_
4	est	être	V	V	m=ind|n=s|p=3|t=pst	5	aux.pass	_	_
5	prévue	prévoir	V	VPP	g=f|m=part|n=s|t=past	0	root	_	_
6	mardi	mardi	N	NC	g=m|n=s|s=c	5	mod	_	_
<------------------------------------------------------------------------------------------------>
Graphical representation of the 3 formats is available at: http://talc2.loria.fr/deep-sequoia/sequoia-7.0/html/annodis.er_00449.html
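
Reading this format can be sketched in Python as follows. This is only a minimal illustration, not a tool shipped with the corpus: the function names (read_conll, parse_feats, sentence_id) are hypothetical, and the sketch assumes that sentences are separated by blank lines and that each token line has 10 tab-separated columns, as in the example above.

```python
# Sample sentence copied from the example above (columns are tab-separated).
SAMPLE = (
    "# sentence-text: Une prochaine réunion est prévue mardi\n"
    "1\tUne\tun\tD\tDET\tg=f|n=s|s=ind|sentid=annodis.er_00449\t3\tdet\t_\t_\n"
    "2\tprochaine\tprochain\tA\tADJ\tg=f|n=s|s=qual\t3\tmod\t_\t_\n"
    "3\tréunion\tréunion\tN\tNC\tg=f|n=s|s=c\t5\tsuj\t_\t_\n"
    "4\test\têtre\tV\tV\tm=ind|n=s|p=3|t=pst\t5\taux.pass\t_\t_\n"
    "5\tprévue\tprévoir\tV\tVPP\tg=f|m=part|n=s|t=past\t0\troot\t_\t_\n"
    "6\tmardi\tmardi\tN\tNC\tg=m|n=s|s=c\t5\tmod\t_\t_\n"
)


def read_conll(lines):
    """Yield each sentence as (comments, tokens).

    Assumption: sentences are separated by blank lines. Comment lines
    start with '#'; every other non-blank line is one token, split on tabs.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            tokens.append(line.split("\t"))
    if tokens:  # last sentence, if the input does not end with a blank line
        yield comments, tokens


def parse_feats(feats):
    """Turn 'g=f|n=s' into {'g': 'f', 'n': 's'}; '_' gives an empty dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def sentence_id(tokens):
    """Return the sentid feature carried by the first token, or None."""
    return parse_feats(tokens[0][5]).get("sentid")


for comments, tokens in read_conll(SAMPLE.splitlines()):
    print(sentence_id(tokens), len(tokens))
```

On a real file, the same generator can be fed an open file object, e.g. read_conll(open("sequoia.surf.conll", encoding="utf-8")), assuming the file is UTF-8 encoded.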

******************
sequoia.deep.conll
******************
Extended CoNLL format for deep syntax:
The CoNLL format still contains one token per line, but a given token may have several governors:
this is represented by column 7 containing several governor ids, and column 8 containing several labels (separated by "|" in both columns).
The number of governor ids in column 7 is the same as the number of labels in column 8, and they are interpreted in parallel:
the first governor in column 7 corresponds to the first label in column 8, etc.

For instance, the noun "réunion" in the example below is both the (canonical) object of token 5 (prévue),
and the subject of token 2 (prochaine).

Furthermore, the semantically void tokens are kept in the deep format for readability (and parsability):
strictly speaking, they do not belong to the deep representation, and they appear with a "0" in column 7, "void" in column 8,
and a void=y feature (e.g. token 4 below).

<------------------------------------ annodis.er_00449 [deep] ----------------------------------->
# sentence-text: Une prochaine réunion est prévue mardi
1	Une	un	D	DET	g=f|n=s|s=ind|sentid=annodis.er_00449	3	det	_	_
2	prochaine	prochain	A	ADJ	g=f|n=s|s=qual	3	mod	_	_
3	réunion	réunion	N	NC	g=f|n=s|s=c	5|2	obj|suj	_	_
4	est	être	V	V	dl=être|m=ind|n=s|p=3|t=pst|void=y	0	void	_	_
5	prévue	prévoir	V	VPP	diat=passif|dl=prévoir|dm=ind|g=f|m=part|n=s|t=past	0	root	_	_
6	mardi	mardi	N	NC	g=m|n=s|s=c	5	mod	_	_
<------------------------------------------------------------------------------------------------>
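
The parallel reading of the two columns can be sketched in Python as follows. This is a minimal illustration only: the function name deep_governors is hypothetical, and the sketch relies on what the example above shows, namely that column 7 holds the governor ids ("5|2") and column 8 the labels ("obj|suj"), both separated by "|".

```python
def deep_governors(head_col, label_col):
    """Pair governor ids (column 7) with labels (column 8), position by position.

    Returns a list of (governor_id, label) tuples. A semantically void token
    ("0" in column 7, "void" in column 8) yields an empty list, since it has
    no governor in the deep representation.
    """
    if label_col == "void":
        return []
    heads = head_col.split("|")
    labels = label_col.split("|")
    if len(heads) != len(labels):
        raise ValueError(f"{len(heads)} governors but {len(labels)} labels")
    return [(int(head), label) for head, label in zip(heads, labels)]


# Token 3 (réunion), token 4 (est) and token 5 (prévue) of the example above.
print(deep_governors("5|2", "obj|suj"))
print(deep_governors("0", "void"))
print(deep_governors("0", "root"))
```

Returning an empty list for void tokens is a design choice of this sketch: it lets a consumer of the deep graph ignore such tokens without testing for the void=y feature.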

***************************
sequoia.deep_and_surf.conll
***************************
A compact representation of the two previous formats:
- relations subject to diathesis alternations contain a double label (ex: "suj:obj" stands for final subject and canonical object)
- relations that are only in the deep representation are prefixed by "D:" (ex: "D:suj:suj")
- relations that are only in the surf representation are prefixed by "S:" (ex: "S:aux.pass")
- a special feature "void=y" is used for tokens removed in the deep syntax (ex: token 4 below)

The surf format is obtained by removing relations prefixed by "D:" and by taking the first part of double labels.
The deep format is obtained by removing relations prefixed by "S:" and by taking the second part of double labels.

For instance:
<------------------------------- annodis.er_00449 [deep_and_surf] ------------------------------->
# sentence-text: Une prochaine réunion est prévue mardi
1	Une	un	D	DET	g=f|n=s|s=ind|sentid=annodis.er_00449	3	det	_	_
2	prochaine	prochain	A	ADJ	g=f|n=s|s=qual	3	mod	_	_
3	réunion	réunion	N	NC	g=f|n=s|s=c	5|2	suj:obj|D:suj:suj	_	_
4	est	être	V	V	dl=être|m=ind|n=s|p=3|t=pst|void=y	5	S:aux.pass	_	_
5	prévue	prévoir	V	VPP	diat=passif|dl=prévoir|dm=ind|g=f|m=part|n=s|t=past	0	root	_	_
6	mardi	mardi	N	NC	g=m|n=s|s=c	5	mod	_	_
<------------------------------------------------------------------------------------------------>
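
The two projection rules stated above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the conversion script used to build the release: the function name project is hypothetical, only columns 7 and 8 are handled (the features in column 6 also differ between the formats and are left untouched here), and a token left without any governor is assumed to be written "0" / "void", as token 4 is in the deep example.

```python
def project(head_col, label_col, view):
    """Derive the (column 7, column 8) pair of the 'surf' or 'deep' view."""
    if view not in ("surf", "deep"):
        raise ValueError("view must be 'surf' or 'deep'")
    # Prefix of relations to drop, and prefix to strip from kept relations.
    drop, keep = ("D:", "S:") if view == "surf" else ("S:", "D:")
    # Double labels read "final:canonical": surf takes the first part,
    # deep takes the second part.
    part = 0 if view == "surf" else 1
    heads, labels = [], []
    for head, label in zip(head_col.split("|"), label_col.split("|")):
        if label.startswith(drop):
            continue  # relation that exists only in the other view
        if label.startswith(keep):
            label = label[len(keep):]
        pieces = label.split(":")
        labels.append(pieces[part] if len(pieces) == 2 else label)
        heads.append(head)
    if not heads:
        # Assumption: a token with no remaining governor is marked as void.
        return "0", "void"
    return "|".join(heads), "|".join(labels)


# Token 3 (réunion) and token 4 (est) of the example above.
print(project("5|2", "suj:obj|D:suj:suj", "surf"))
print(project("5|2", "suj:obj|D:suj:suj", "deep"))
print(project("5", "S:aux.pass", "surf"))
print(project("5", "S:aux.pass", "deep"))
```

Applied to tokens 3 and 4, the sketch gives back the column 7 and column 8 values shown in the surf and deep examples for this sentence.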

Note: previous versions contained constituent-based annotations of the data, but these annotations are no longer maintained.
Please see version 7.0 for the last version of these annotations.

----------------------------------
5. Appendix
----------------------------------

** Data split (TALN 2012 experiments)

The "neutral" domain is made of EstRepublicain + Europarl + FrWiki,
and the split into dev and test sets is the following :

head -n 265 annodis.er+fct.mrg >> sequoia-neutre-dev+fct.mrg
head -n 280 Europar.550+fct.mrg >> sequoia-neutre-dev+fct.mrg
head -n 498 frwiki_50.1000+fct.mrg >> sequoia-neutre-dev+fct.mrg

tail -n 264 annodis.er+fct.mrg >> sequoia-neutre-test+fct.mrg
tail -n 281 Europar.550+fct.mrg >> sequoia-neutre-test+fct.mrg
tail -n 498 frwiki_50.1000+fct.mrg >> sequoia-neutre-test+fct.mrg

** Appendix : duplicate sentences removed in version 6.0

< emea-fr-test_00301
< emea-fr-test_00302
< emea-fr-test_00303
< emea-fr-test_00304
< emea-fr-test_00305
< emea-fr-test_00306
< emea-fr-test_00307
< emea-fr-test_00308
< emea-fr-test_00309
< emea-fr-test_00310
< emea-fr-test_00311
< emea-fr-test_00312
< emea-fr-test_00313
< emea-fr-test_00314
< emea-fr-test_00315
< emea-fr-test_00316
< emea-fr-test_00317
< emea-fr-test_00318
< emea-fr-test_00319
< emea-fr-test_00320
< emea-fr-test_00321
< emea-fr-test_00322
< emea-fr-test_00323
< emea-fr-test_00324
< emea-fr-test_00325
< emea-fr-test_00326
< emea-fr-test_00327
< emea-fr-test_00328
< emea-fr-test_00329
< emea-fr-test_00330
< emea-fr-test_00331
< emea-fr-test_00332
< emea-fr-test_00333
< emea-fr-test_00334
< emea-fr-test_00335
< emea-fr-test_00336
< emea-fr-test_00337
< emea-fr-test_00338
< emea-fr-test_00339
< emea-fr-test_00340
< emea-fr-test_00341
< emea-fr-test_00342
< emea-fr-test_00343
< emea-fr-test_00344
< emea-fr-test_00345
< emea-fr-test_00346
< emea-fr-test_00347
< emea-fr-test_00348
< emea-fr-test_00349
< emea-fr-test_00350
< emea-fr-test_00351
< emea-fr-test_00352
< emea-fr-test_00353
< emea-fr-test_00354
< emea-fr-test_00355
< emea-fr-test_00356
< emea-fr-test_00357
< emea-fr-test_00358
< emea-fr-test_00359
< emea-fr-test_00360
< emea-fr-test_00361
< emea-fr-test_00362
< emea-fr-test_00363
< emea-fr-test_00364
< emea-fr-test_00365
< emea-fr-test_00366
< emea-fr-test_00367
< emea-fr-test_00368
< emea-fr-test_00369
< emea-fr-test_00370
< emea-fr-test_00371
< emea-fr-test_00372
< emea-fr-test_00373
< emea-fr-test_00374
< emea-fr-test_00375
< emea-fr-test_00376
< emea-fr-test_00377
< emea-fr-test_00378
< emea-fr-test_00379
< emea-fr-test_00380
< emea-fr-test_00381
< emea-fr-test_00382
< emea-fr-test_00383
< emea-fr-test_00384
< emea-fr-test_00385
< emea-fr-test_00386
< emea-fr-test_00387
< emea-fr-test_00388
< emea-fr-test_00389
< emea-fr-test_00390
< emea-fr-test_00391
< emea-fr-test_00392
< emea-fr-test_00393
< emea-fr-test_00394
< emea-fr-test_00395
< emea-fr-test_00396
< emea-fr-test_00397
< emea-fr-test_00398
< emea-fr-test_00399
< emea-fr-test_00400