<!doctype html>
<meta charset="utf-8">

<title>The Sanskrit Heritage Site</title>
<meta name="author" content="Gérard Huet">
<meta property="dc:datecopyrighted" content="2018">
<meta property="dc:rightsholder" content="Gérard Huet">
<meta name="keywords" content="india,dictionary,indology,sanskrit,lexicography,linguistics,indo-european,dictionnaire,sanscrit,panini">
<meta name="description" content="This site provides tools for Sanskrit processing: dictionary search, morphology generation and analysis, segmentation, tagging and parsing.">

<link rel="shortcut icon" href="IMAGES/favicon.ico">
<link rel="apple-touch-icon" href="IMAGES/touch-icon-iphone-60x60.png">
<link rel="apple-touch-icon" sizes="60x60" href="IMAGES/touch-icon-ipad-76x76.png">
<link rel="apple-touch-icon" sizes="114x114" href="IMAGES/touch-icon-iphone-retina-120x120.png">
<link rel="apple-touch-icon" sizes="144x144" href="IMAGES/touch-icon-ipad-retina-152x152.png">

<link rel="stylesheet" type="text/css" href="DICO/style.css" media="screen,tv">

<body class="pink_back"> <!-- Pale_rose -->
<table class="body">

<h1 class="title">The Sanskrit Heritage Site
<a href="IMAGES/Yantra.jpg">
<img src="IMAGES/smallyantra.gif" alt="Shri Yantra"></a></h1>

<h3 class="c3">Version #VERSION [#DATE] (#LANG)<br>
#CAPTION <!-- (since 01/09/2003) --></h3>

Welcome to the Sanskrit Heritage site.
It provides various services for the computational treatment of Sanskrit.

The first service is dictionary access. The dictionary is a hypertext structure
giving access to the Sanskrit lexicon, annotated with grammatical information.
There are currently two versions of the dictionary.
The first one is the original Heritage Sanskrit-French dictionary, which
serves as morphology generator, and is thus fully equipped with grammatical
tools. Furthermore, it offers rich encyclopedic content about Indian culture.
You may also download a printable PDF version of this dictionary, as
explained below. A fully hypertext version in the
<a href="goldendict.html">Goldendict</a> format is also available.
The second lexicon is a digital version of the Monier-Williams Sanskrit-English
dictionary, a much more complete lexicon for the Sanskrit language.
It derives from Thomas Malten's digitization of the Monier-Williams
at K&ouml;ln University, turned into an XML databank by Jim Funderburk,
and finally adapted to the HTML Heritage look and feel by Pawan Goyal.
The Sanskrit Heritage dictionary is thus mirrored in the Monier-Williams, which
allows compatibility of the grammatical tools.
The choice of the dictionary is set to a default by the configuration of the
server site. But each dictionary is accessible separately by its search page,
respectively <a href="DICO/">
<strong><span class="red">Sanskrit Heritage</span></strong></a> and
<a href="DICO/index.en.html"><strong><span class="red">Monier-Williams</span></strong></a>.
This site offers a number of linguistic services for the Sanskrit language, such
as a <a href="DICO/reader.html">Sanskrit Reader</a> that parses Sanskrit
text under various formats into Sanskrit banks of tagged hypertext.
Various phonological and morphological tools are also provided.
Please visit the <a href="manual.html">Reference manual</a> for learning how
to use the various facilities.

<h2 class="b2"> Sanskrit Heritage dictionary in book form </h2>

You may download the Heritage dictionary as a pdf document from
<a href="Heritage.pdf">PDF</a>.
This document is readable with Acrobat Reader,
a document viewer from Adobe that is freely available on the Internet.
Since the document is rather large, you have to allow for some delay
in loading its 5 MB. This is an ongoing effort: lexical acquisition
implies quick obsolescence of this document, which grows along with versions.

The Sanskrit Heritage dictionary is also available in an ebook format,
usable with the Babylon, Stardict or Goldendict software.
Please visit the <a href="goldendict.html">Golden Sanskrit Heritage</a> page.

<h2 class="b2"> Multilingual hyper-text dictionary </h2>

<h3 class="b3"> Interactive browsing </h3>

The dictionary may be accessed through an indexing engine:
 <a href="DICO/">
<strong><span class="red">Sanskrit Heritage</span></strong></a> or
through its mirror <a href="DICO/index.en.html">
<strong><span class="red">Monier-Williams</span></strong></a> subset.
Your browser must be HTML5 compliant, and for proper viewing
of Sanskrit text you must have installed on your system OpenType fonts
for roman transliteration with diacritics, and for devan&#257;gar&#299;.
A Unicode-compliant font for devan&#257;gar&#299; with proper ligatures
is Apple's Devanagari MT for Macintosh OS X stations. For Windows users,
installation of the 'Arial Unicode MS' font is advised for proper rendering.
You may have to fiddle with the controls of your browser, so that the font
declarations from the dictionary pages get precedence over the standard
selection, and thus encoding is specified as Unicode compliant (UTF-8 encoding).
Note that many words are given with their etymology as hypertext links. You
may thus navigate from a word to its morphological components, down to its roots.
Also, the gender declarations of
the main entries are mouse-sensitive, and give you direct access to the
relevant declension table. Similarly, the present class mark of the verbal roots
gives access to the conjugation schemes. Also for verb entries, preverbs
lead you to the correspondingly prefixed derived verbs.
All these grammatical tools, originally developed for the Heritage dictionary,
are being progressively extended to the Monier-Williams dictionary.
Thus our HTML Monier-Williams offers similar declension and conjugation
facilities.

<h3 class="b3"> Sanskrit made easy </h3>

If you want to search for a Sanskrit word
without knowing its exact transliteration, go to section "Sanskrit made easy"
of the index page, which allows you to search for words without knowing
precise diacritics usage.
For instance, search for Vishnou, Siva, or the grammarian Panini. This
interface is limited for the moment to the Sanskrit Heritage dictionary.

<h2 class="b2"> Sanskrit Grammarian
<img src="IMAGES/panini.jpg" alt="Panini"></h2>
This interface gives the declension tables for Sanskrit substantives.
Try out this
<a href="DICO/grammar.html">declension engine</a> by submitting Sanskrit stems
with intended gender. The same transliteration conventions as for the
dictionary index apply. For instance, submit "deva" with gender Mas,
or (assuming Velthuis transliteration) "devii" with gender Fem,
or "brahman" with gender Neu. The fourth
button, labeled "Any", may be used for the words which take their
gender from the context, such as deictic personal pronouns ("aham", "tvad"),
or numeral words such as "dva", "tri", etc.
A conjugation engine for roots is also available. It handles
the full present system: present indicative, imperfect, imperative and
optative, as well as the passive present system, the perfect, the aorist
and the future.
Participial stems, absolutives and infinitives are listed as well.
Some secondary conjugations (causative, intensive,
desiderative) are also generated, for the full present and future systems.
Try out this <a href="DICO/grammar.html#roots">conjugation engine</a>
with data such as "bhuu" 1, "as" 2, "m.rj" 2, "han" 2, "haa" 3, "hu" 3,
"daa" 4, "su" 5, "p.r" 6, "yuj" 7, "k.r" 8, "j~naa" 9, "cur" 10, "namas" 11.
In order to get the secondary conjugations of a root, enter code 0.
You may cascade by generating declensions of the generated participial stems.
A word of caution is called for here. The only safe way to get correct
inflected forms is to enter the stem and its morphological parameters
consistently with their specification in the Heritage dictionary. This is
especially true of roots, since they appear with various names according to
Sanskrit grammars. For instance, root hū is called hū,
hvā or hve according to various grammarians. Another problem
is homophony. When two items have the same phonetic realization, their
respective lexemes are disambiguated
by an integer index, which is specific to the lexicon. Thus there are
three roots named mā in the Sanskrit Heritage dictionary. They are
addressed respectively (in Velthuis transliteration) as maa#1, maa#3 and maa#4.
If you ask for the conjugated forms of maa in present classes 2 or 3, the
system will guess you mean maa#1 (to measure). But if you mean maa#3
(to mow) or maa#4 (to exchange) you have to enter explicitly their disambiguated
stems maa#3 or maa#4. Entering an arbitrary stem and arbitrary morphology
parameters may yield random results or error messages.

<h2 class="b2"> Lemmatizer </h2>
Conversely, a
<a href="DICO/index.html#stemmer">lemmatizer</a>
attempts to tag inflected words.
Try for instance (in Velthuis format)
"devaat", "jagmivaan", "a.s.tau" (selecting Noun)
or "apibat", "akaar.siit", "dudoha", "vaahyate" (selecting Verb).
This lemmatizer knows about inflected forms of derived stems in some
secondary derivations.
For instance, "darzayi.syati" is found as conjugated form:
{ ca. fut. ac. sg. 3 }[dṛś_1],
"dariid.rzyate" yields { int. pr. md. sg. 3 }[dṛś_1],
"did.rk.sate" yields { des. pr. md. sg. 3 }[dṛś_1]
and "" yields { des. pft. md. sg. 3 | des. pft. md. sg. 1 }[bhaj].
Please note the multitag notation of this ambiguous form.
Other lexical categories are available, such as Part for participles.
For instance, "bhikṣitavyānām" (selecting IAST transliteration and the Part
lexical category) yields { g. pl. f. | g. pl. n. | g. pl. m. }[bhikṣitavya { des. pfp. [3] }[bhaj]].
The various grammatical abbreviations used in these lemmas are available
<a href="abrevs.pdf">here</a>.

N.B. Do not attempt to lemmatize verbal forms with preverbs: this will
not work, since the lemmatizer only knows how to invert root forms. Lemmatizing
more complex forms is possible through the Sanskrit Reader interface,
as we shall see below.

<h2 class="b2"> Morphology </h2>

A dictionary of inflected forms of Sanskrit words is provided
in XML form under various transliteration schemes.
Please visit the <a href="xml.html">Sanskrit linguistic resources page</a>.
This resource may now be downloaded as a git repository, using command:<br/>
<span class="Green">
git clone

<a id="reader"></a>
<h2 class="b2"> Sanskrit Reader </h2>

The main tool provided by this site is a
<span class="Green">Sanskrit Reader</span> that allows machine-assisted
analysis of Sanskrit sentences, that is segmentation
(including sandhi viccheda), morphological tagging, and several parsers.
Please consult the <a href="manual.html">Reference manual</a> for learning how
to use these tools.

Try our interactive <a href="DICO/reader.html">Sanskrit Reader</a>.
It is able to segment simple sentences.
Try for instance to segment "tryambaka.myajaamahesugandhi.mpu.s.tivardhanam"
(we assume Velthuis transliteration here).
Then push the "Tagging" button and get the fully tagged sentence.
You will see two segmentations, one with an identified compound form
"tri-ambakam", the second with a compounded segment "tryambakam".
Note that each segment is indicated with a lemma giving its stem
and the set of morphological parameters that may generate the segment form
from its stem. The stem is hyperlinked to the dictionary of choice.

Note also that segments are separated by phonological information
in the form of a sandhi rule, showing how the original
sentence is recovered by successive sandhi application. For instance, solution 1
explains the compound "tryambakam" as the sandhi of segments "tri" and
"ambakam" by rule ‹<span class="Magenta">i</span><span class="Green">|</span><span class="Magenta">a</span><span class="Blue"> → </span><span class="Red">ya</span>›.
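As a rough illustration of how a rule such as ‹i|a → ya› glues two segments
back together, here is a minimal Python sketch. It is not the code used by
this site (the platform itself is written in Objective Caml); the function name
apply_sandhi_rule and the encoding of a rule as a triple are assumptions made
only for this example.

```python
def apply_sandhi_rule(left, right, rule):
    """Join two segments with one sandhi rule.

    `rule` is a hypothetical triple (left_end, right_start, result):
    the end of the left segment and the start of the right segment
    are replaced together by `result`.
    """
    left_end, right_start, result = rule
    if not left.endswith(left_end) or not right.startswith(right_start):
        raise ValueError("rule does not apply to these segments")
    # Drop the matched end of `left` and start of `right`, insert `result`.
    return left[:len(left) - len(left_end)] + result + right[len(right_start):]


# The rule i|a -> ya quoted above, in Velthuis transliteration.
RULE_I_A = ("i", "a", "ya")

joined = apply_sandhi_rule("tri", "ambakam", RULE_I_A)
print(joined)  # prints tryambakam
```

Segmentation runs this relation backwards: given "tryambakam", it looks for
segments and a rule whose application yields the observed form.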
The reader may be helped by inserting blanks in the input at word junction.
For instance, the above mantra may be entered as
"tryambaka.m yajaamahe sugandhi.m pu.s.tivardhanam".
But compounds should stay in one piece.
Spaces are also needed for hiatus, in sentences such as:
"tacchrutvaasa~njaya uvaaca".
Many options are provided in the menu of the Reader page. For instance,
clicking on the Unsandhied button we may present text in
<i>padapāṭha</i> form, where each chunk is in terminal sandhi form.
For instance "tryambakam yajaamahe sugandhim pu.s.tivardhanam".
Two strengths of the Reader are provided. The Simplified mode, offered as a
default, does not recognize vocatives. The Complete mode is more powerful,
using the full range of participles of verbs, privative compounds, etc.
It may however return so many solutions that listing all solutions is
impractical, and other facilities must be used.
The grammar used to recognize sentences is described
by a local automaton state transition graph, the
<a href="IMAGES/lexer17.jpg">Lexer automaton</a>.
This is actually a simplification of the segmenter automaton control.
A simpler one, close to the Simplified mode of the reader, is
<a href="IMAGES/lexer10.jpg">Simplified automaton</a>.
A fuller one, close to the Complete mode of the reader, is
<a href="IMAGES/lexer40.jpg">Complete automaton</a>.
The color codes of these diagrams explain the output conventions of the tags.
In these diagrams, transparent nodes are non-generative, and colored nodes
correspond to the lexical categories recognized by the lemmatizer. The
category Auxi is the subset of Verb consisting of conjugated forms of
roots "k.r", "as" and "bhuu" used as auxiliaries in periphrastic constructions.
Pv denotes sequences of preverbs.

<h2 class="b2"> Sanskrit Parser </h2>


Two parsers are currently in use with the Sanskrit Heritage Platform.
One is a shallow parser, available using the "Parsing" button, which
appears when there are not too many remaining solutions.
It is naive, but may be of use for beginners. For instance, try
"", checking the "Parsing" button.
It returns a unique solution among the 8 possible segmentations.
Each solution returned with the parser is marked with a green check sign,
which may be pressed to get the semantic analysis of the sentence in
terms of roles (<i>kāraka</i>).
The parser recognizes sentences. It may be made to recognize nominal phrases,
provided one presses the "Optional topic" button with the intended gender.
You may for instance analyze the compound:
as a masculine nominal.
Alternatively, one can ask to recognize this form as a single word, by pressing
"Word" rather than the default "Sentence" text category.
When breaking the text with spaces, the Word mode allows one to recognize
texts given in <i>padapāṭha</i> fashion.
It is also possible to recognize sequences of chunks in final sandhi form
separated by spaces, where sandhi will be assumed to be undone between the
chunks, by specifying the "Unsandhied" mode in the reader interface.


Another dependency parser is under development at the University of Hyderabad;
it may be accessed from the Heritage segmenter, seen as a plugin.
Further documentation on these facilities is available in the
<a href="manual.html">Reference manual</a>.

<h2 class="b2"> Sanskrit Tagger </h2>

The semantic analysis may still be ambiguous, since a given segment may be
decorated by several morphological categories. All interpretations are
presented under the role matrix, sorted by increasing penalty. Check for
your favorite interpretation in this list, and select it by clicking on its
green heart symbol. The system will return the corresponding unambiguously
tagged sentence, as a page which you may save on your own station. Iterating
this process allows you to progressively tag a Sanskrit text with the Sanskrit
reader assistance.
Alternatively, you may select the ambiguous morphology choices, each being
provided with a selection button. Selections are chosen by default at the
first choice, but you may override this default and choose manually e.g.
the genders of nominals. When your choice is finalized, just click on the
"Submit" button and you will get the corresponding deterministically tagged
sentence. This tool is useful for semi-manual corpus annotation.

<h2 class="b2"> Summary mode <span class="Red"></span> </h2>

Now that you are more familiar with using the various modes of the Reader
on small Sanskrit sentences, it is time to try to analyse more complex
sentences. Obviously the listing of all solutions is out of the question with
long sentences in Complete mode. A new visual interface is offered for
semi-automatic segmentation. This new Summary mode is actually now proposed as a
default in the Reader page.
Try for instance
"satya.mbruuyaatpriya.mbruuyaannabruuyaatsatyamapriya.mpriya.mcanaan.rtambruuyaade.sadharma.h sanaatana.h".
The display presents a summary of the union of all solutions, as a chart of
segments aligned on their respective input contribution.
You see at the right end the segment <i>sanātanas</i> proposed first,
on top of a forest of smaller word combinations.
Click on the green check sign below it. The check sign becomes blue,
and the forest of irrelevant combinations vanishes.
Do the same under the satyam segments, then under apriyam, all segments
presented as top candidates. Now choose the particle ca (and thus na),
and the blue pronominal segment eṣa.
Now only one choice remains, between brūyāt and brūyām.
Clicking on the first one will finish the job.
Indeed only one solution remains, as indicated by the "Unique Solution" link.
Clicking on its check sign, you are now viewing the same output as given by the
Reader in Tagging mode, but constrained to use only segments checked in the Summary.

<h2 class="b2"> Other Sanskrit Resources </h2>

We have an ongoing cooperation with the Department of Sanskrit Studies
of the University of Hyderabad and the Computer Science Department of
the Indian Institute of Technology at Kharagpur
on computational linguistics for Sanskrit.
A joint research team has been formed, cooperating with scholars
from the Sanskrit Library. This team is actively developing cooperating
multi-platform Web services.
In October 2007 we organized the First International Sanskrit Computational
Linguistics Symposium. Please visit
the <a href= "Symposium/">Symposium Site</a>.
This was followed by the Second Symposium in May 2008 at
<a href="">Brown University</a>,
by a third one in January 2009 at
<a href="">Hyderabad University</a>,
a fourth one in December 2010 at
<a href="">JNU</a>,
and a fifth one in January 2013 at
<a href="">IIT Bombay</a>.
A workshop on <a href="">Bridging
the gap between Sanskrit computational linguistics tools
and management of Sanskrit digital libraries</a> was organized in December 2016
at Banaras Hindu University, on the occasion of the ICON 2016 conference.

The computational tools for Sanskrit developed at the University of Hyderabad
are available here as a <a href= "~anusaaraka/">Mirror Site</a>.

<div class="center">
<img src="IMAGES/yinyang.gif" alt="Yinyang">
</div>
<h2 class="b2"><img src="IMAGES/JoeCaml.png" alt="Cool Joe Caml">
The Zen Library</h2>

This site reflects an ongoing project of Sanskrit processing
on a comprehensive software platform.
The project is based on a structured lexicographic database, compiled from
the Sanskrit Heritage dictionary, and on
the Zen computational linguistics toolkit. This toolkit is a library
of programs implemented in the 
<a href="">Objective Caml</a>
programming language. The Zen library and its documentation are available
as free software under the GNU Lesser General Public License (LGPL) from the
<a href="">Zen site</a>.
<!-- Forum closed
Please visit the <a href="">Zen Forum</a> for
announcements and discussions concerning the ZEN toolkit. -->

<h2 class="b2"><img src="IMAGES/ganesh.jpg" alt="Ganesh">
The Sanskrit Portal</h2>

Please visit our <a href="portal.html">Sanskrit Portal</a>
to find links to other Sanskrit resources.
If you are reading this from a mirror site, don't forget to regularly update
this server with the development Git site
"". -->
<h2 class="b2"><img src="IMAGES/om1.jpg" alt="Om">
Artwork credits</h2>

<span class="green">Orissan artwork at this site courtesy of Shauraj Rath.
© Screenex, Bhubaneshwar, Ekamra, Orissa. All rights reserved.
<span class="green">Wallpaper om images courtesy of
<a href=""></a>.
<span class="green">Ganesh wallpaper courtesy of
<a href="">François Patte</a>.
<span class="green">Shri Yantra design ©
<a href="IMAGES/Yantra.jpg">Gérard Huet</a> 1990.<br>
</span> -->
</table> <!-- body -->

<table class="pad60"> <!--padding for bandeau -->
<div class="enpied">
<table class="bandeau"><tr><td>
<a href="">
<img src="IMAGES/icon_ocaml.png" alt="Objective Caml" height="50"></a>
<table class="center">
<a href="index.html"><strong>Top</strong></a> |
<a href="DICO/index.#LANG.html"><strong>Index</strong></a> |
<a href="DICO/index.#LANG.html#stemmer"><strong>Stemmer</strong></a> |
<a href="DICO/grammar.#LANG.html"><strong>Grammar</strong></a> |
<a href="DICO/sandhi.#LANG.html"><strong>Sandhi</strong></a> |
<a href="DICO/reader.#LANG.html"><strong>Reader</strong></a> |
<a href="DICO/corpus.#LANG.html"><b>Corpus</b></a> |
<a href="faq.#LANG.html"><strong>Help</strong></a> |
<a href="portal.#LANG.html"><strong>Portal</strong></a>
</td></tr><tr><td>© Gérard Huet 1994-2018</td></tr></table></td><td>
<a href="">
<img src="IMAGES/logo_inria.png" alt="Logo Inria" height="50"></a>