......@@ -245,11 +245,7 @@ In particular, it is the rendering engine that compiles the
ligatures into actual glyphs displayed on your screen.
Thus this site may be seen more or less correctly according to your local
configuration. Do not blame me if you get garbage on Explorer or Netscape.
I personally use Safari on MacOSX, and the rendering is generally good,
although there are still bugs in the displaying of complex ligatures
like for instance
<i>tacchrutv&#257;</i>, rendered incorrectly as <i>taccharutv&#257;</i>,
in spite of numerous anomaly reports to Apple.
I personally use Safari on MacOSX, and the rendering is generally good.
I occasionally check on Firefox and Chrome for interoperability, and I try to
ensure that my Web pages are compliant with the W3C specification for HTML5.
I give some indications about proper fonts to load on the entry page of the site.
......
......@@ -39,7 +39,7 @@ The site started as a set of tools to exploit a digital version of the
Sanskrit Heritage Dictionary, which had been developed as a personal
independent project by Gérard Huet since 1996 as a
Sanskrit-French dictionary intended as a small encyclopedia of Indian culture.
These tools use the finite-state methods implemented in the ZEN
These tools use the finite-state methods implemented in the Zen
Objective Caml library to provide efficient lexicon representation,
morphology generation, and segmentation by sandhi recognition.
This technology was published in 2005 as
......@@ -50,7 +50,7 @@ published recently as
<a href="http://jlm.ipipan.waw.pl/index.php/JLM/article/view/108/140">Design and
analysis of a lean interface for Sanskrit corpus annotation</a>.
<p>
Written on November 28th 2018, for Sanskrit Engine Version 3.10.
Written on December 21st 2018, for Sanskrit Engine Version 3.11.
<h2 class="b2" id="tour">First approach to using the Sanskrit Heritage engine</h2>
......@@ -264,7 +264,7 @@ The Sanskrit Heritage Dictionary is the latest edition of a Sanskrit
to French Dictionary
"Dictionnaire Français de l'Héritage Sanskrit" compiled by
Gérard Huet since 1994. This dictionary is freely available
as a 945 pages <a href="Heritage.pdf">book</a> under the pdf format,
as a 952 pages <a href="Heritage.pdf">book</a> under the pdf format,
easily readable with Acrobat Reader, a free Adobe product.
This dictionary is still under development, and is
automatically updated along with the site,
......@@ -520,9 +520,9 @@ Let us illustrate this shallow parsing facility on a much-discussed
ambiguous sentence going back to Patañjali.
Go back to the Reader interface, and enter
<i>zvetodhaavati</i>, using Summary Mode. You see a display of the
26 segmentation solutions. You are also offered a green check sign labeled
16 segmentation solutions. You are also offered a green check sign labeled
Filtered Solutions. Click on it.
You see one particular solution, labeled 14, formed with blue <i>śvetaḥ</i>
You see one particular solution, labeled 9, formed with blue <i>śvetaḥ</i>
in the nominative, followed by red <i>dhāvati</i>, a verbal form in the present.
Actually form <i>dhāvati</i> is marked as ambiguous, since it may result from
root <i>dhāv_1</i> (running) or from root <i>dhāv_2</i> (cleaning).
......@@ -541,13 +541,13 @@ marking the absence of the object to a transitive verb.
In this example, the machine has succeeded in focusing on a correct solution
automatically, among many interpretations.
If we come back to the initial selection, it indeed tells
"1 solution kept among 26",
but actually lists also 5 other plausible additional solutions. Indeed, among
them, Solution 23 gives another correct decomposition <i>śvā+itaḥ+dhāvati</i>
"1 solution kept among 16",
but actually lists also 3 other plausible additional solutions. Indeed, among
them, Solution 13 gives another correct decomposition <i>śvā+itaḥ+dhāvati</i>
"The dog is running towards here". Here too, the tool analyses <i>dhāv_1</i>
as fitting the grammatical constraints. It has penalty 0 as well, but
was just disfavored over the first interpretation because it has 3 segments
rather than 2, exhibiting a "shortest length bias" heuristic.
rather than 2, exhibiting a "shortest number of words bias" heuristic.
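The ranking just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: the `Solution` type, its field names and the `rank` function are hypothetical. Only the two solution labels, their segmentations and their penalty of 0 come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """Hypothetical record for one segmentation solution."""
    index: int        # label shown by the interface
    segments: tuple   # the words (padas) of this segmentation
    penalty: int      # 0 means no grammatical constraint is violated


def rank(solutions):
    """Order by penalty first, then by number of words (fewest first)."""
    return sorted(solutions, key=lambda s: (s.penalty, len(s.segments)))


candidates = [
    Solution(13, ("śvā", "itaḥ", "dhāvati"), 0),
    Solution(9, ("śvetaḥ", "dhāvati"), 0),
]

best = rank(candidates)[0]
print(best.index, "+".join(best.segments))
```

With equal penalties, the tie is broken by the word count alone, which is why the two-word reading is listed before the three-word one.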
<p>
This shallow parser cannot be used on large input sentences, since its output
......@@ -557,7 +557,7 @@ segmentation candidates is below a threshold set by default to 100.
This is in contrast with the situation with the graphical interface, which
is fast and robust. Thus entering the following verse from Kālidāsa, we obtain
very quickly a display factorizing an astronomical number
(24873394117017600) of solutions:
(37310091175526400) of solutions:
<i>yaa tapovize.saparizafkitasya sukumaaramprahara.nam mahendrasyapratyaadeza.h ruupagarvitaayaa.h zriya.h ala.mkaara.h svargasyasaana.h priyasakhyurvazii kuberabhavanaat pratinivartamaanaasamaapattid.r.s.tena kezinaadaanavenacitralekhaadvitiiyaa bandigraaha.mg.rhiitaa</i>.
<h3 class="b3" id="parser3">Lexical categories</h3>
......@@ -626,10 +626,10 @@ There exists yet another variety of compound, the so-called <i>avyayībhāva</i>
of segments, first the preposition <i>nis</i>, colored lavender, and then the
stem <i>makṣikā</i>, turned into an invariable form <i>makṣikam</i>,
colored magenta. We remark that the segment <i>makṣikam</i> is not accepted
as stand-alone input. Please also note with this last example
that an unrecognized chunk of input yields a grey rectangle.
as stand-alone input. Please also note with this last example that an
unrecognized chunk of input yields a grey rectangle with undefined morphology.
<p>
Verbal compounds exist, such as the periphrastic perfect construction,
used for secondary conjugations and denominative verbs. It builds
a special stem in <i>-ām</i>, suffixed by a perfect form of
......@@ -637,17 +637,17 @@ one of the auxiliaries <i>kṛ</i>, <i>as</i> and <i>bhū</i>.
Try for instance <i>āmantrayāṃcakre</i>. You see the periphrastic form
displayed as two segments, an orange <i>āmantrayām</i>, and the red
<i>cakre</i> of the perfect of root <i>kṛ</i>: "he/I summoned". The orange
and red segments are mutually linked, selecting one selects automatically
and red segments are mutually linked, so that selecting one automatically selects
the other.
<p>
Another periphrastic construction is the inchoative "cvi" verbal compound.
Its left part is a special substantival stem in <i>ī</i> or <i>ū</i>,
and its right part a finite verb form of one of the auxiliaries,
like <i>kadarthīkaroti</i> or <i>mṛdūbhavati</i>.
like <i>kadarthīkaroti</i> (he despises) or <i>mṛdūbhavati</i> (it softens).
It in turn gives rise to primary derivatives (<i>kṛdanta</i>) like
<i>khilībhūtaḥ</i>. Here too the left part is orange, and the right part is
either red for verbal forms or blue for participial forms.
<i>khilībhūtaḥ</i> (abandoned). Here too the left part is orange,
and the right part is either red for verbal forms or blue for participial forms.
<p>
This concludes the main grammatical paradigms implemented by our machinery.
......@@ -669,10 +669,11 @@ one is a form of <i>māna_2</i> ("measure"). Although the two segments have
the same color, being both <i>subanta</i> nominal forms, they do not obey
the same combinatorics, since a participle (<i>kṛdanta</i>) stem like
<i>māna_1</i> is liable to be prefixed by the preverb particles
(<i>upasarga</i>) allowed for root <i>man</i>.
(<i>upasarga</i>) allowed for root <i>man</i>. Check this with input
<i>pramānam</i>.
<p>
Another interesting exemple is <i>virodhitayā</i>. The two blue segments look
Another interesting example is <i>virodhitayā</i>. The two blue segments look
alike, and they are both instrumental singular forms of the feminine stem
<i>virodhitā</i>. But one is the past participle of the causative of
verb <i>vi-rudh</i>, the other is an abstract <i>taddhitānta</i> noun,
......@@ -779,8 +780,8 @@ blank space is mandatory. The others may be removed. They are just
help for the segmenter, in indicating pada boundaries. Of course, if you remove
them, the number of potential solutions may increase, since the system
will attempt analyses not respecting these word boundaries. The third space
above
is mandatory, and actually gives rise to two distinct segmentations, one with
above is mandatory even in devanāgarī, as a genuine hiatus.
It actually gives rise to two distinct segmentations, one with
the form <i>odanaḥ</i>, the other with the form <i>odane</i>.
<p>
Note that in the system's rendering, the mandatory space is indicated by
......@@ -824,7 +825,21 @@ Madhva's interpretation (with <i>abhāvaḥ</i>) has to be made explicit as
Finally, the system does not currently support degemination of stems,
such as modern renditions of <i>tattva</i> as <i>tatva</i>
or <i>vārttā</i> as <i>vārtā</i>; only a few common stems such as
<i>chatra</i>, <i>chātra</i> and <i>patra</i> are recognized.
<i>chatra</i>, <i>chātra</i> and <i>patra</i> are recognized.
<p>
A special warning must be given concerning vocatives. Because vocative forms
of the common substantives ending in <i>a</i> are indistinguishable
from their bare stem, usable in compound formation, we require that vocatives
be chunk-final, i.e. followed by a space in the input. Thus
<i>rāma aśvampaśya</i> may not be written <i>rāmāśvampaśya</i>: in this second
input, vocative form <i>rāma</i> is not recognized. This poses a problem
only in the cases where the extra space would be interpreted as non-trivial
sandhi, like for instance in <i>rāma odanampacatu</i>, or in
<i>śatakrato vivardhasva</i>. In such cases, ending the vocative
with an exclamation mark <i>!</i> will allow the proper vocative recognition,
like <i>rāma!odanampacatu</i> and <i>śatakrato!vivardhasva</i>. More
generally, this exclamation mark may be used for explicit padapāṭha.
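The chunking convention above can be illustrated with a small sketch. It is not the engine's tokenizer, which is not shown here: the function name `split_chunks` and its return shape are assumptions made for the example. It only shows that both a blank and an exclamation mark end a chunk, and it records which chunks were closed by <i>!</i>.

```python
import re


def split_chunks(text):
    """Return (chunk, ended_by_bang) pairs for an input line.

    A chunk is a maximal run of characters other than blank and '!'.
    The flag is True when the chunk is immediately followed by '!'.
    """
    pairs = []
    for match in re.finditer(r"([^ !]+)(!?)", text):
        pairs.append((match.group(1), match.group(2) == "!"))
    return pairs


for line in ("rāma aśvampaśya", "rāma!odanampacatu", "śatakrato!vivardhasva"):
    print(split_chunks(line))
```

In each of the three inputs the vocative ends up as a chunk of its own, which is the condition stated above for its recognition.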
<h3 class="b3" id="zloka_input">Entering full verses (<i>śloka</i>).</h3>
......