Introduction
============

MElt is a Python implementation of the MaxEnt Markov Model
part-of-speech tagger described in the following three papers (if you
cite only one of them, please cite the LRE paper, Denis and Sagot 2012):

P. Denis and B. Sagot. 2009. Coupling an annotated corpus and a
morphosyntactic lexicon for state-of-the-art POS tagging with less
human effort. In Proc. of PACLIC 23, Hong Kong, China.

P. Denis and B. Sagot. 2010. Exploitation d'une ressource lexicale
pour la construction d'un étiqueteur morphosyntaxique état-de-l'art du
français. In Proc. of TALN 2010, Montreal, Canada.

P. Denis and B. Sagot. 2012. Coupling an annotated corpus and a
lexicon for state-of-the-art POS tagging. Language Resources and
Evaluation 46 (4), 721-736.

MElt is released under the LGPL licence (see LICENCE).


Installation
============

You first need to download the current release of the MElt tagger from
the INRIA forge:

http://gforge.inria.fr/projects/lingwb/

You can then install MElt by running the usual sequence of commands:

$> aclocal
$> autoconf
$> automake -a
$> ./configure
$> make
$> make install 

The default location for the MElt executable is /usr/local/bin, which
requires root permissions when installing (i.e., you need to run
'sudo make install' instead of 'make install', which in turn requires
you to be in the sudoers list).

If you want to install MElt in a different directory, you need to run
'./configure --prefix=/your/specific/path' instead of
'./configure'.

In any case, make sure that the binary installation path
(/usr/local/bin or /your/specific/path/bin) is part of your PATH
variable.
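
A quick way to check the PATH requirement is sketched below. It is a
stand-alone helper, not part of MElt: the function names are
hypothetical, and it is written for Python 3, independently of the
interpreter MElt itself runs under.

```python
import os
import shutil


def find_executable(name="MElt"):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


def path_contains(directory):
    """Tell whether `directory` is one of the entries of the PATH variable."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    wanted = os.path.abspath(directory)
    return wanted in [os.path.abspath(entry) for entry in entries if entry]


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("MElt was not found on PATH")
    else:
        print("MElt found at " + location)
```

If MElt is not found, adding the installation directory to PATH in your
shell start-up file is the usual remedy.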

To run MElt, you need NumPy and a compatible version of Python: we
have tested versions 2.5, 2.6 and 2.7.


Running the tagger
==================

The MElt tagger is run as follows:

$> cat <file> | MElt <options>

If your corpus is already tokenized and segmented in sentences (one
sentence per line), following the French Treebank
conventions (http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php),
you should use MElt with no options.

If your corpus is raw (no tokenization, no segmentation in sentences),
you can activate MElt's embedded lightweight tokenizer by using the '-t'
option. This tokenizer is a small subset of the SxPipe pre-parsing
processing chain (see Sagot & Boullier 2005, 2008). It can be invoked
independently of MElt (cat <file> | sxpipe-light).
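
The invocation above can also be scripted. The following is a minimal
sketch in Python 3, assuming that MElt is on PATH and writes its tagged
output to standard output (the pipe-based usage above confirms that it
reads standard input; the output side is an assumption). The helper
names build_command and tag_text are hypothetical, and only the '-t'
option described above is used.

```python
import subprocess


def build_command(tokenize=False, executable="MElt"):
    """Build the argument list: no options for pre-tokenized input,
    '-t' to activate the embedded tokenizer on raw text."""
    command = [executable]
    if tokenize:
        command.append("-t")
    return command


def tag_text(text, tokenize=False):
    """Send `text` to MElt on standard input and return what it prints.

    Assumes MElt is installed and on PATH; input and output are
    treated as UTF-8.
    """
    result = subprocess.run(
        build_command(tokenize),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if MElt exits with an error
    )
    return result.stdout.decode("utf-8")
```

For example, tag_text(raw_text, tokenize=True) corresponds to running
'cat <file> | MElt -t' on the same text.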

Also, note that MElt currently supports only the UTF-8 encoding.
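
If a corpus is stored in another encoding, it has to be converted
before tagging. The sketch below (Python 3, hypothetical helper names)
shows one way to do so; the Latin-1 default for the source encoding is
only an assumption and must be adapted to the actual corpus.

```python
def is_utf8(data):
    """Return True if the byte string `data` is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data, source_encoding="latin-1"):
    """Return `data` re-encoded as UTF-8.

    Bytes that are already valid UTF-8 are returned unchanged;
    otherwise they are decoded with `source_encoding` (an assumption
    about the corpus) and encoded again as UTF-8.
    """
    if is_utf8(data):
        return data
    return data.decode(source_encoding).encode("utf-8")
```

Checking for valid UTF-8 first avoids re-encoding a file that is
already in the right encoding, which would corrupt accented characters.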


Tagset
======

The current tagset used by MElt is as follows (Crabbé & Candito,
2008):

ADJ 	   adjective
ADJWH	   interrogative adjective
ADV	   adverb
ADVWH	   interrogative adverb
CC	   coordinating conjunction
CLO	   object clitic pronoun
CLR	   reflexive clitic pronoun
CLS	   subject clitic pronoun
CS	   subordinating conjunction
DET	   determiner
DETWH	   interrogative determiner
ET	   foreign word
I	   interjection
NC	   common noun
NPP	   proper noun
P	   preposition
P+D	   preposition+determiner amalgam
P+PRO	   preposition+pronoun amalgam
PONCT	   punctuation mark
PREF	   prefix
PRO	   full pronoun
PROREL	   relative pronoun
PROWH	   interrogative pronoun
V	   indicative or conditional verb form
VIMP	   imperative verb form
VINF	   infinitive verb form
VPP	   past participle
VPR	   present participle
VS	   subjunctive verb form

When using normalization options, other tags may appear:
- when using -n, Y means "non-last token of a multi-token unit" and X means "multiword/multitag token"
- when using -N, Y means "non-last token of a multi-token unit" and multiword/multitag tokens are annotated with tags of the form T1+T2+...+Tn (e.g., chépa/CLS+V+ADV)
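
The token/TAG notation in the example above suggests how tagged output
can be post-processed. The sketch below is a Python 3 illustration with
hypothetical helper names; it assumes that each output line holds
whitespace-separated token/TAG pairs, which this document does not
state explicitly.

```python
# Tags listed in the tagset above; P+D and P+PRO contain a '+' but are
# single tags, not multitag annotations.
BASE_TAGS = {
    "ADJ", "ADJWH", "ADV", "ADVWH", "CC", "CLO", "CLR", "CLS", "CS",
    "DET", "DETWH", "ET", "I", "NC", "NPP", "P", "P+D", "P+PRO",
    "PONCT", "PREF", "PRO", "PROREL", "PROWH", "V", "VIMP", "VINF",
    "VPP", "VPR", "VS",
}


def parse_tagged_line(line):
    """Split a line of token/TAG items into (token, tag) pairs.

    The split is made on the last '/' of each item, so that tokens
    which themselves contain a slash keep it.
    """
    pairs = []
    for item in line.split():
        token, separator, tag = item.rpartition("/")
        if not separator:
            raise ValueError("no tag found in item: " + item)
        pairs.append((token, tag))
    return pairs


def is_multitag(tag):
    """Return True for T1+T2+...+Tn tags as produced with -N."""
    return "+" in tag and tag not in BASE_TAGS
```

Checking against the base tagset keeps the amalgam tags P+D and P+PRO
from being mistaken for multitag annotations.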