Introduction
============

MElt is a Python implementation of the MaxEnt Markov Model part-of-speech
tagger described in the following three papers (should you cite only one of
them, please cite the LRE paper):

P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a
morphosyntactic lexicon for state-of-the-art POS tagging with less human
effort. In Proc. of PACLIC 23, Hong Kong, China.

P. Denis and B. Sagot. 2010. Exploitation d'une ressource lexicale pour la
construction d'un étiqueteur morphosyntaxique état-de-l'art du français. In
Proc. of TALN 2010, Montreal, Canada.

P. Denis and B. Sagot. 2012. Coupling an annotated corpus and a lexicon for
state-of-the-art POS tagging. Language Resources and Evaluation 46 (4),
721-736.

MElt is released under an LGPL licence (see LICENCE).

Installation
============

You first need to download the current release of the MElt tagger from the
INRIA forge.

You can then install MElt by running the usual sequence of commands:

  $> aclocal
  $> autoconf
  $> automake -a
  $> ./configure
  $> make
  $> make install

The default location for the MElt executable is /usr/local/bin, which
requires root permissions when installing (i.e., you need to run
'sudo make install' instead of 'make install', which in turn requires you to
be in the sudoers list). If you want to install MElt in a different
directory, run './configure --prefix=/your/specific/path' instead of
'./configure'. In either case, make sure that the binary installation path
(/usr/local/bin or /your/specific/path/bin) is part of your PATH variable.

To run MElt successfully, you need to install NumPy along with a fairly
recent version of Python (we have tried 2.5, 2.6 and 2.7).

Running the tagger
==================

The MElt tagger is run as follows:

  $> cat <file> | MElt <options>

If your corpus is already tokenized and segmented into sentences (one
sentence per line), following the French Treebank conventions, you should
use MElt with no options.

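The pipeline above can also be driven from a script. The following is a
minimal sketch, not part of MElt itself: it assumes Python 3, assumes the
MElt executable is on your PATH, and treats it purely as an external command
fed through standard input. The helper names (to_melt_input, run_tagger) and
the command parameter are hypothetical, introduced only for illustration.

```python
import subprocess


def to_melt_input(sentences):
    """Join pre-tokenized sentences into one-sentence-per-line text."""
    return "".join(sentence.strip() + "\n" for sentence in sentences)


def run_tagger(sentences, command=("MElt",)):
    """Pipe the sentences to the tagger on stdin and return its stdout.

    `command` defaults to calling MElt with no options (assumption: the
    executable is on PATH). Another command can be passed instead.
    """
    # UTF-8 is the only encoding this README says MElt supports.
    data = to_melt_input(sentences).encode("utf-8")
    result = subprocess.run(
        list(command), input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8")
```

Under these assumptions, run_tagger(["Le chat dort ."]) is meant to mirror
'cat <file> | MElt' on a file holding that single line.
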
If your corpus is raw (no tokenization, no segmentation into sentences), you
can activate MElt's embedded lightweight tokenizer with the '-t' option.
This tokenizer is a small subset of the SxPipe pre-parsing processing chain
(see Sagot & Boullier 2005, 2008). It can be invoked independently of MElt:

  $> cat <file> | sxpipe-light

Note that MElt currently supports only UTF-8 encoding.

Tagset
======

The current tagset used by MElt is as follows (Crabbé & Candito, 2008):

  ADJ      adjective
  ADJWH    interrogative adjective
  ADV      adverb
  ADVWH    interrogative adverb
  CC       coordinating conjunction
  CLO      object clitic pronoun
  CLR      reflexive clitic pronoun
  CLS      subject clitic pronoun
  CS       subordinating conjunction
  DET      determiner
  DETWH    interrogative determiner
  ET       foreign word
  I        interjection
  NC       common noun
  NPP      proper noun
  P        preposition
  P+D      preposition+determiner amalgam
  P+PRO    preposition+pronoun amalgam
  PONCT    punctuation mark
  PREF     prefix
  PRO      full pronoun
  PROREL   relative pronoun
  PROWH    interrogative pronoun
  V        indicative or conditional verb form
  VIMP     imperative verb form
  VINF     infinitive verb form
  VPP      past participle
  VPR      present participle
  VS       subjunctive verb form

When using normalization options, other tags may appear:

- when using -n, Y means "non-last token of a multi-token unit" and X means
  "multiword/multitag token"
- when using -N, Y means "non-last token of a multi-token unit", and
  multiword/multitag tokens are annotated with tags of the form
  T1+T2+...+Tn (ex.: chépa/CLS+V+ADV)
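
To make the tag conventions above concrete, here is a minimal Python 3 sketch
that reads tagged text and separates tokens from tags. It assumes the output
consists of whitespace-separated token/TAG pairs, as the chépa/CLS+V+ADV
example suggests. That layout and the helper names (parse_tagged_line,
tag_components) are assumptions made for illustration, not a documented MElt
API.

```python
# Tags listed in the tagset above; note that P+D and P+PRO contain a "+".
BASE_TAGS = frozenset(
    "ADJ ADJWH ADV ADVWH CC CLO CLR CLS CS DET DETWH ET I NC NPP P P+D "
    "P+PRO PONCT PREF PRO PROREL PROWH V VIMP VINF VPP VPR VS".split()
)
# Extra tags that may appear with the -n / -N normalization options.
NORMALIZATION_TAGS = frozenset({"X", "Y"})


def parse_tagged_line(line):
    """Split one tagged sentence into (token, tag) pairs.

    Assumes whitespace-separated token/TAG items. The split is made on
    the LAST slash, so a token that itself contains "/" stays intact.
    """
    pairs = []
    for item in line.split():
        token, sep, tag = item.rpartition("/")
        if not sep or not token or not tag:
            raise ValueError("not a token/TAG pair: %r" % item)
        pairs.append((token, tag))
    return pairs


def tag_components(tag):
    """Return the list of tags that make up a (possibly compound) tag.

    Known single tags (including P+D and P+PRO) are returned as-is;
    anything else is treated as a T1+T2+...+Tn compound from -N mode.
    """
    if tag in BASE_TAGS or tag in NORMALIZATION_TAGS:
        return [tag]
    return tag.split("+")
```

With this sketch, parse_tagged_line("chépa/CLS+V+ADV") yields the pair
("chépa", "CLS+V+ADV"), and tag_components("CLS+V+ADV") yields CLS, V and
ADV. One known limitation: a compound that itself includes P+D or P+PRO
would be split at every "+", since the notation does not distinguish the
two uses.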

Benoît Sagot