<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">

<title>Sanskrit Linguistic Resources</title>
<meta name="author" content="Gérard Huet">
<meta property="dc:datecopyrighted" content="2018">
<meta property="dc:rightsholder" content="Gérard Huet">
<meta name ="keywords" content="india,dictionary,indology,sanskrit,lexicography,linguistics,indo-european,dictionnaire,sanscrit,panini,indology,linguistics">
<meta name ="date" content="2018-01-08">
<meta name="classification" content="computational linguistics, sanskrit, morphology, lexicography, indology">
<meta name="description" content="This page is for downloading Sanskrit resources.">
<link rel="shortcut icon" href="IMAGES/favicon.ico"/>
<link rel="stylesheet" type="text/css" href="DICO/style.css" media="screen,tv"/>
</head>

<body class="pink_back"> <!-- Pale_rose -->
<table class="body">

<table border="0pt" cellpadding="0" cellspacing="15pt" width="100%">
<tr><td>

<h1 class=b1>Sanskrit linguistic resources</h1>

<br>
<img src="IMAGES/Panini2.jpg" alt="Panini"/>
<br>

<div class="latin12">

<h2 class=b2>Sanskrit Morphology</h2>

<h3 class=b3>Background</h3>

This page documents XML data banks of Sanskrit forms given with their morphological
taggings. They are produced mechanically by the declension and conjugation
engines of the Sanskrit Heritage Platform, processing the Sanskrit lexicon
underlying the Sanskrit Heritage Dictionary. The version used for this
generation is available <a href="Heritage.pdf">here</a> as a PDF document.
<p>

These databanks are regularly updated. They are available for download
as a public git archive on the Sanskrit Heritage development site:
"https://gitlab.inria.fr/huet/Heritage_Resources".

<h3 class=b3>Databanks description</h3>

We provide here inflected forms and morphemes derived from the root forms
defined in the
<a href="DICO/index.html">Sanskrit Heritage Dictionary</a>. These forms are
presented as lemmas linking each form to its stem entry by possible morpho-phonetic
operations. We limit ourselves to classical Sanskrit, and do not cover precative,
subjunctive, injunctive and conditional forms of the verbs.

At present, we provide two transliteration schemes, respectively
WX, used by the
<a href="http://sanskrit.uohyd.ernet.in/">Department of Sanskrit Studies at
University of Hyderabad</a>
and SLP1, used by the
<a href="http://sanskritlibrary.org/">Sanskrit Library</a>.

The respective data banks are listed in directories WX and SL.
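<p>
As an illustration of how the two schemes relate, here is a minimal Python sketch that converts a WX string to SLP1. The table and the function name <code>wx_to_slp1</code> are not part of the distributed resources. The table lists only the letters on which the two schemes differ, as we understand their published definitions, and it should be checked against those definitions before serious use.

```python
# Partial WX -> SLP1 table (an assumption based on the published definitions
# of the two schemes): only letters that differ are listed, rare ones omitted.
WX_TO_SLP1 = str.maketrans({
    "q": "f",  # vocalic r
    "Q": "F",  # long vocalic r
    "L": "x",  # vocalic l
    "f": "N",  # velar nasal
    "F": "Y",  # palatal nasal
    "t": "w", "T": "W", "d": "q", "D": "Q", "N": "R",  # retroflex stops and nasal
    "w": "t", "W": "T", "x": "d", "X": "D",            # dental stops
    "R": "z",  # retroflex sibilant
})


def wx_to_slp1(word: str) -> str:
    """Convert a WX-encoded word to SLP1, one character at a time."""
    return word.translate(WX_TO_SLP1)


print(wx_to_slp1("kqRNa"))     # kfzRa
print(wx_to_slp1("saMskqwa"))  # saMskfta
```

Letters that are absent from the table, such as the simple vowels, the labials and the anusvāra <code>M</code>, are written identically in both schemes and pass through unchanged.
<p>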

The morphological lemmas are distributed in 6 files in
XML format, conformant to a common DTD.
The nominal declensions of nouns, adjectives and numerals
are covered in "T_nouns.xml" (where T is respectively WX or SL).
Those of pronouns are covered in "T_pronouns.xml".
The conjugated forms of roots in the present, imperfect, imperative, optative,
perfect, aorist
and future tenses, as well as passives of the present system,
for the primary conjugation and for some secondary conjugations
(causative, intensive, desiderative) are covered in "T_roots.xml".
Additional declensions of derived participial forms are given in "T_parts.xml".
Absolutives, infinitives and other undeclinable words and particles
are listed in "T_adverbs.xml". "T_final.xml" also gives further
generative morphemes. All six files conform to the DTD "T_morph.dtd".
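<p>
To make the file layout concrete, here is a minimal Python sketch that builds the six file names for a given scheme and tallies the element names found in one file. It deliberately assumes nothing about the element names declared in the DTD. The helper names <code>databank_files</code> and <code>tag_counts</code>, and the tiny sample document, are hypothetical and serve only as illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# The six kinds of morphological files described in the text.
KINDS = ("nouns", "pronouns", "roots", "parts", "adverbs", "final")


def databank_files(scheme):
    """Return the six XML file names for a scheme prefix, 'WX' or 'SL'."""
    if scheme not in ("WX", "SL"):
        raise ValueError("scheme must be 'WX' or 'SL'")
    return [f"{scheme}_{kind}.xml" for kind in KINDS]


def tag_counts(source):
    """Count element names in an XML stream without assuming the DTD."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's contents once it has been counted
    return counts


# Hypothetical stand-in document; the real element names come from the DTD.
sample = io.BytesIO(b"<forms><f form='a'/><f form='b'/></forms>")
print(databank_files("WX"))
print(tag_counts(sample))
```

Streaming with <code>iterparse</code> is chosen here because the data banks may be large. To inspect a real file, pass its path, or an open binary file, to <code>tag_counts</code> instead of the sample.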
<p>
Finally, the text file "X_preverbs.txt" lists common
preverb sequences, given with their sandhi analysis.

<h3 class=b2>Intellectual Property</h3>

All these linguistic data banks are copyright Gérard Huet 1994-2018.
They are derived from the Sanskrit Heritage Dictionary
version #VERSION dated #DATE.
<p>
Use of these linguistic resources is granted according to the
Lesser General Public Licence for Linguistic Resources.
Copies of this licence, in PDF as well as HTML, are provided at
the Heritage_Resources distribution site in its XML subdirectory.
Thank you for referencing the origin of this data if you use it in your own work.

<h2 class=b2>Methodology</h2>

We deal here with a mixture of derivational and inflexional morphology.
For instance, from the roots we generate verbal and participial stems, and from
these stems we generate in turn inflected forms: conjugated forms from the
verbal stems, and declined forms from the participial stems. But at present
we do not generate mechanically primary nominal stems from roots,
nor secondary nominal stems from primary ones, because of overgeneration.
The nominal stems, as well as the undeclinable forms, are taken from the
lexicon, which also lists some frequent participles.
<p>
This organization gives a particular role to our morphological data bases.
The <i>basic</i> morphological categories correspond to lexical phases,
which are atomic letters in the defining grammar of a Sanskrit <i>word</i>.
The forms listed in these data bases act as morphemes of this high-level
morphological definition, which is recursive, since compounding may be
iterated, as well as preverb formation, to a certain extent.
But this recursion power is limited, in the sense that the grammar of a word
is a regular one (type 3 in the Chomsky hierarchy), and its recognizer is
a finite automaton, whose states are precisely the lexical categories indexing
the basic data bases. This definition of word implements correctly the geometry
of constructions such as absolutives (which fall in two distinct categories,
the preverb form and the root form) and periphrastic phrases (periphrastic
futures with substantives, and periphrastic perfects as prefixes of finite
perfect forms of the auxiliary roots <i>as</i>, <i>bhū</i> and
<i>kṛ</i>, which are duplicated in a specific auxiliary lexicon).
Here is a simplified diagram of the current state space of our lexer.

<div class="center">
<img src="IMAGES/lexer17.jpg" alt="Lexer automaton">
</div>
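<p>
The idea that a word is recognized by a finite automaton whose states follow the lexical categories can be sketched in a few lines of Python. This is a toy model written for this note, not the actual state space of the lexer: the category names, the four states and the transition table are all hypothetical simplifications.

```python
# Toy automaton over lexical categories (all names are hypothetical).
START, AFTER_PREVERB, IN_COMPOUND, ACCEPT = (
    "start", "after_preverb", "in_compound", "accept")

TRANSITIONS = {
    # A preverb sequence must be followed by a root form or by the
    # absolutive form reserved for preverbed roots.
    (START, "preverb"): AFTER_PREVERB,
    (AFTER_PREVERB, "root_form"): ACCEPT,
    (AFTER_PREVERB, "absolutive_with_preverb"): ACCEPT,
    # Bare roots take the other absolutive category.
    (START, "root_form"): ACCEPT,
    (START, "absolutive_of_root"): ACCEPT,
    # Compounding may be iterated: any number of stems, then a declined form.
    (START, "compound_stem"): IN_COMPOUND,
    (IN_COMPOUND, "compound_stem"): IN_COMPOUND,
    (IN_COMPOUND, "noun_form"): ACCEPT,
    (START, "noun_form"): ACCEPT,
    (START, "indeclinable"): ACCEPT,
}


def accepts(categories):
    """Return True if the sequence of categories spells a well-formed word."""
    state = START
    for category in categories:
        state = TRANSITIONS.get((state, category))
        if state is None:  # no transition: the sequence is rejected
            return False
    return state == ACCEPT


print(accepts(["compound_stem", "compound_stem", "noun_form"]))  # True
print(accepts(["preverb", "absolutive_of_root"]))                # False
```

The second call is rejected because the two absolutive categories are kept distinct, which is how such a definition can enforce the geometry of absolutives described above.
<p>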

The automaton in the diagram above is also the top-level view of our Sanskrit Tagger, which
implements Sanskrit analysis from <i>devanāgarī</i> text.
The technical description of this method, together with its correctness
justification, has been presented in various scientific journals and conferences,
and the corresponding articles are also available freely on my
<a href="http://pauillac.inria.fr/~huet/bib.html">
<strong>publications page</strong></a>
(papers [78], [87], [88], [94], [95], [105], [106] and [110]
are especially relevant).
This material will not be repeated here. Let us just explain a few difficulties
of the large-scale implementation of this Sanskrit analyser.
<p>
As usual in a non-deterministic search algorithm (here all the possible parsings
of a sentence as a sandhied stream of forms), we have two pitfalls, silence and noise.
Silence (lack of recall) means incompleteness. Some legal Sanskrit sentences
may fail to be recognized.
Typically, some root word may be missing from the base lexicon,
or some Vedic form may use some construction rare in the later language,
like precative or subjunctive.
Compounding gives rise to two complications, the raising of new cases by
<i>bahuvrīhi</i> compounding,
and the formation of <i>avyayībhāva</i> compounds. Some of these
constructions are treated incompletely.
<p>
The opposite of silence is noise (lack of precision), that is, overgeneration.
We deal with overgeneration
in the syntactico-semantic layer of our tagger, which filters out combinations of
tags inconsistent with semantic role assignments.
We shall not discuss this technology
further in this note on morphology, and refer the interested reader to our
<a href="DICO/reader.html"><strong>Sanskrit reader
demonstration page</strong></a> and its <a href="manual.html">
<strong>Reference manual</strong></a>.
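<p>
Silence and noise can both be observed on a toy version of the search problem. The Python sketch below enumerates every way of cutting a string into forms taken from a lexicon. It is a strong simplification written for this note: it ignores sandhi entirely, the function name <code>segmentations</code> is hypothetical, and the lexicon is made of arbitrary letters rather than Sanskrit forms.

```python
def segmentations(text, lexicon):
    """Return every way to split text into a sequence of lexicon forms."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this prefix.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


toy_lexicon = {"a", "ab", "b", "c"}  # hypothetical forms, not Sanskrit

# Noise: two competing analyses for a single input.
print(segmentations("abc", toy_lexicon))  # [['a', 'b', 'c'], ['ab', 'c']]

# Silence: "d" is missing from the lexicon, so no analysis is found.
print(segmentations("abd", toy_lexicon))  # []
```

Enlarging the lexicon reduces silence but tends to multiply competing analyses, which is why a separate filtering stage is needed to control noise.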
<p>
We remark that the respective data bases can be queried online through our
<a href="http://sanskrit.inria.fr/DICO/index.html#stemmer"><strong>stemmer
interface</strong></a>. But note that verbal forms prefixed by preverbs
are analysed by the tagger as non-atomic words, and only root forms and
their secondary conjugations are recognized by the stemmer.

<h2 class=b2>Help</h2>

Questions concerning these resources should be addressed to
<a href="mailto:Gerard.Huet@inria.fr">Gérard Huet</a>.
All suggestions for improvements will be gratefully considered.
</td></tr>
</table>
</div>

<table class="pad60"> <!--padding for bandeau -->
<tr><td></td></tr></table>
<div class="enpied">
<table class="bandeau"><tr><td>
<a href="http://ocaml.org">
<img src="IMAGES/icon_ocaml.png" alt="Objective Caml" height="50"></a>
</td><td>
<table class="center">
<tr><td>
<a href="index.html"><strong>Top</strong></a> |
<a href="DICO/index.en.html"><strong>Index</strong></a> |
<a href="DICO/index.en.html#stemmer"><strong>Stemmer</strong></a> |
<a href="DICO/grammar.en.html"><strong>Grammar</strong></a> |
<a href="DICO/sandhi.en.html"><strong>Sandhi</strong></a> |
<a href="DICO/reader.en.html"><strong>Reader</strong></a> |
<a href="DICO/corpus.en.html"><strong>Corpus</strong></a> |
<a href="faq.en.html"><strong>Help</strong></a> |
<a href="portal.en.html"><strong>Portal</strong></a>
</td></tr>
<tr><td>© Gérard Huet 1994-2018</td></tr>
</table></td><td>
<a href="http://www.inria.fr/">
<img src="IMAGES/logo_inria.png" alt="Logo Inria" height="50"></a>
<br></td></tr></table></div>
</body>
</html>