Developing the Sanskrit Authoring System - VYASA
With a view to nurturing and preserving the richness of Sanskrit,
C-DAC has taken the initiative of developing a Sanskrit
Authoring System under its ‘Heritage Group’ activities.
Shri Ramanujan has played a central role in this project
and summarizes below the work being done.
____________________________________________________________________
INTRODUCTION
Sanskrit has been a parent of many modern Indian languages. The name
Sanskrita suggests an ‘adorned’, ‘elaborate’, or ‘perfected’
form of speech. The beauty of Sanskrit lies in the fact
that it has an extremely well-structured and unambiguous grammar.
This is a great advantage: it helps in determining and representing
precisely the syntactic structure and the meaning of a sentence.
In ancient literature, the entire grammar of Sanskrit has
been laid down as a finite and exhaustive set of rules in
Panini’s Astadhyayi. These rules were valid for Sanskrit
centuries ago and still hold today, eliminating any
scope for evolutionary changes in grammar or usage.
In no other language can such a set of rules, which are
rather mathematical in nature, be found or formulated.
The Sanskrit grammar rules are well-structured amongst themselves.
There are sets of meta-rules, which, with the help of various
conditions, can be used in linguistic processing. It is
this Sastraic (Sastra = science) nature of Sanskrit
that makes it such an alluring language for Natural Language
Processing (NLP) research.
C-DAC’s Indian Heritage Group, based at its Centre in Bangalore,
has been engaged in activities related to the processing of Sanskrit
and Vedic texts since 1990. To consolidate the work done so far,
the Sanskrit Authoring System (SAS) project was launched
with the sponsorship of the Department of Information Technology.
The project seeks to cover work on the ISCII standard; GIST
applications in data processing; building a Natural Language
Understanding (NLU) system for Sanskrit called DESIKA; computational
rendering of Paninian grammar; a computational text of the Rgveda
Samhita; and preparing the ‘Sakala Shastra Sutra Kosha’, a compendium
of all the original treatises of the Ancient Indian Sciences.
Benefits from Information Technology
It is a common experience, when one reads books or articles
in Indian languages, that help in retrieving the desired
analytical information, such as indices of different types,
sources of quotations and their full forms, is often lacking.
With Sanskrit texts, the need to access such information is
even greater, since many scriptural sources get quoted.
To help Sanskritists in creative scholarly pursuits such as
research and teaching, such a system needs content, tools for
processing, and schemes for tagging, hyper-linking and references.
All of these are possible thanks to developments in information
technology.
Let us look at some of these as provided for in our system.
The tools include an editor for multi-script use; tools for
morphological, syntactic and semantic analysis; tools for searching,
indexing and sorting; lexicon updating; lexical tagging; extraction
and indexing of quotations in commentaries and explanations; a
transliteration facility; word-split programs for sandhi and samasa;
poetry analysis (textual, metric and statistical); and statistical
tools and resources such as concordances, thesauri and electronic
dictionaries.
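To make the idea of a word-split program for sandhi concrete, here is a minimal sketch in Python. It is not the system's own algorithm: it assumes a tiny hypothetical lexicon in IAST transliteration and undoes only three well-known vowel mergers (a + a = ā, a + i = e, a + u = o), accepting a split when both halves are found in the lexicon.

```python
import unicodedata

# Reverse vowel-sandhi table: merged vowel -> possible (left-final, right-initial) pairs.
# Only three common rules are covered; real sandhi has many more (assumption: toy subset).
REVERSE_SANDHI = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],
    "o": [("a", "u"), ("a", "ū"), ("ā", "u"), ("ā", "ū")],
}

# Hypothetical toy lexicon of word forms in IAST transliteration.
LEXICON = {"ca", "iti", "na", "asti", "sūrya", "udaya"}


def split_sandhi(word, lexicon=LEXICON):
    """Return every (left, right) split of `word` whose parts are both in the lexicon."""
    word = unicodedata.normalize("NFC", word)  # one code point per accented vowel
    splits = []
    for i, ch in enumerate(word):
        # Case 1: plain juxtaposition, no sound change at the junction.
        if i > 0 and word[:i] in lexicon and word[i:] in lexicon:
            splits.append((word[:i], word[i:]))
        # Case 2: undo a vowel merger at position i.
        for left_end, right_start in REVERSE_SANDHI.get(ch, []):
            left = word[:i] + left_end
            right = right_start + word[i + 1:]
            if left in lexicon and right in lexicon:
                splits.append((left, right))
    return splits
```

With this toy lexicon, `split_sandhi("ceti")` returns `[("ca", "iti")]`. A realistic splitter would need the full rule set, recursion for more than two components, and a proper morphological lexicon.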
Digital content from the reference compendium mentioned above,
lexicons like Amarakosha, Paninian grammar rules, word analyses,
derivations, quotes from the Veda-s (scriptures), epics like the
Ramayana and Mahabharata, the Puranas, and Shastraic texts in Sutra
form are all part of the system.
The DESIKA parser provides all the grammatically valid
identifications for a word in isolation. Its syntactic analysis
maps Vibhakti-s (case endings) to Karaka-s (case relations), and
its semantic analysis confirms or disambiguates the senses of each
word, including ascertaining ontological compatibility.
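The mapping of Vibhakti-s to Karaka-s can be pictured with a small lookup table. The Python sketch below is only an illustration under stated assumptions: it uses the default textbook associations between the seven case endings and their roles, plain ASCII spellings, and a hypothetical function name; it does not reproduce DESIKA's actual rules, which must deal with many exceptions.

```python
# Default textbook associations between vibhaktis (case endings) and roles.
# Assumption: simplified ASCII spellings; exceptions are not modelled.
VIBHAKTI_TO_KARAKA = {
    "prathama": ["karta", "karma"],   # karma in passive constructions
    "dvitiya": ["karma"],
    "tritiya": ["karana", "karta"],   # karta in passive constructions
    "chaturthi": ["sampradana"],
    "panchami": ["apadana"],
    "shashthi": ["sambandha"],        # genitive relation, not a karaka proper
    "saptami": ["adhikarana"],
}


def candidate_roles(tagged_words):
    """Map (word, vibhakti) pairs to (word, candidate roles) pairs."""
    return [
        (word, VIBHAKTI_TO_KARAKA.get(vibhakti, []))
        for word, vibhakti in tagged_words
    ]


# "ramah vanam gacchati" (Rama goes to the forest); the verb carries no vibhakti.
sentence = [("ramah", "prathama"), ("vanam", "dvitiya")]
roles = candidate_roles(sentence)
```

A table like this only proposes candidates. Choosing among them is the job of the semantic stage, which checks each candidate against the verb and the senses of the words.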
Graphical output options, query processing, and updating of
lexicons for morphological and semantic processing are provided,
along with tools for linguistic analysis such as tagging,
lemmatising and statistical studies.
A scheme to extract quotations from texts and commentaries and
to locate their sources in our knowledge base is provided
as a tool. Abbreviations of sources can be updated or modified,
and search, retrieval and hyperlinking use a standard form of
source specification.
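As a rough illustration of locating the source of a quotation, the following Python sketch normalises the quoted words and looks for them as a whole-word run in a small knowledge base, then expands the source abbreviation into a standard citation form. The data layout, the abbreviation table and the function names are assumptions made for this example, and the two Gita passages are given in a simplified romanization.

```python
import re

# Hypothetical abbreviation table; entries can be added or changed freely.
ABBREVIATIONS = {"BhG": "Bhagavad Gita"}

# Toy knowledge base: (source abbreviation, reference, passage text).
KNOWLEDGE_BASE = [
    ("BhG", "2.47", "karmany evadhikaras te ma phaleshu kadachana"),
    ("BhG", "4.7", "yada yada hi dharmasya glanir bhavati bharata"),
]


def normalize(text):
    """Lower-case, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def locate_sources(quotation, knowledge_base=KNOWLEDGE_BASE):
    """Return standard-form citations for passages containing the quotation."""
    needle = normalize(quotation)
    if not needle:
        return []
    hits = []
    for abbrev, ref, passage in knowledge_base:
        # Pad with spaces so that only whole-word runs match.
        if f" {needle} " in f" {normalize(passage)} ":
            hits.append(f"{ABBREVIATIONS.get(abbrev, abbrev)} {ref}")
    return hits
```

For instance, `locate_sources("Yada yada hi dharmasya,")` returns `["Bhagavad Gita 4.7"]`. Exact matching is the simplest possible choice; quotations that differ from the source through sandhi or variant readings would need fuzzier matching.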
Indexing, sorting and concordance tools for multi-lingual inputs
are also provided. Tamil sorting order is also handled, both
in isolation and in multi-lingual files.
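A concordance lists every occurrence of each word together with a little surrounding context. The sketch below shows one simple way to build such an index in Python; the function name and the `window` parameter are choices made for this example, and the two sample lines are the opening of the Gita in a simplified romanization. Whitespace splitting works for text in any script, but the final ordering here is plain code-point order, so a language-specific order such as the Tamil one would need its own sort key.

```python
from collections import defaultdict


def build_concordance(lines, window=2):
    """Map each word to a list of (line number, context) entries."""
    concordance = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        words = line.split()
        for i, word in enumerate(words):
            # Context = up to `window` words on each side of the keyword.
            context = " ".join(words[max(0, i - window): i + window + 1])
            concordance[word].append((line_no, context))
    # Sorted by code point; replace with a collation key for a given script.
    return dict(sorted(concordance.items()))


sample = [
    "dharmakshetre kurukshetre samaveta yuyutsavah",
    "mamakah pandavash caiva kim akurvata sanjaya",
]
index = build_concordance(sample, window=1)
```

Here `index["kim"]` is `[(2, "caiva kim akurvata")]`.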
The other features include:
- Semantic lexical update, which presents the current ontology
for view, revision and update.
- New instances under existing categories, and new categories,
can be added incrementally.
- The ontology with the revisions applied is displayed before
they are confirmed. Multiple memberships at different layers and
categories are allowed. The lexical/morphological suffixes and
rules accepted for characterising sentence type can also be
updated (existing types are listed for view).
- Sloka analysis, taking different anvaya type specifications.
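The incremental ontology update described in the list above (adding categories and instances, allowing multiple memberships, and viewing the revised ontology before accepting it) could be modelled along the following lines. This Python sketch is hypothetical: the class, its method names and the category labels are invented for illustration and say nothing about how the system stores its ontology.

```python
import copy


class Ontology:
    """Toy ontology: categories with parent links and member words."""

    def __init__(self):
        # category name -> {"parents": set of names, "instances": set of words}
        self.categories = {}

    def add_category(self, name, parents=()):
        for parent in parents:
            if parent not in self.categories:
                raise KeyError(f"unknown parent category: {parent}")
        entry = self.categories.setdefault(
            name, {"parents": set(), "instances": set()}
        )
        entry["parents"].update(parents)

    def add_instance(self, word, category):
        if category not in self.categories:
            raise KeyError(f"unknown category: {category}")
        self.categories[category]["instances"].add(word)

    def categories_of(self, word):
        """A word may belong to several categories at once."""
        return sorted(
            name for name, entry in self.categories.items()
            if word in entry["instances"]
        )

    def preview(self, revisions):
        """Return a copy with the revisions applied; the original is untouched."""
        draft = copy.deepcopy(self)
        for method, args in revisions:
            getattr(draft, method)(*args)
        return draft


onto = Ontology()
onto.add_category("animate")
onto.add_category("bird", parents=["animate"])
onto.add_instance("hamsa", "bird")

# Stage two revisions and inspect the result before accepting it.
draft = onto.preview([
    ("add_category", ("ascetic", ["animate"])),
    ("add_instance", ("hamsa", "ascetic")),
])
```

After this, `draft.categories_of("hamsa")` is `["ascetic", "bird"]` while `onto.categories_of("hamsa")` is still `["bird"]`. Accepting the revision would simply mean replacing `onto` with `draft`.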
With regard to Shabda-bodha, the diagrammatic representation
according to any shastra can be studied. Newer types can be
added based on paradigm specification. Case-relationships and
case-marker mappings are provided exhaustively, with examples,
to help syntactic analysis and the creation of appropriate new
categories.
The system has been tried out on several actual texts of various
types and complexities.

Some of the objectives realised as ‘proofs of concept’ are:
- OCX-based applications using the GIST-SDK for all vidyasthana-s,
and Linux-based work along with ITRANS/JTRANS conversion.
- A web server set up to give access to our ‘on-line readers’,
including free font downloads.
- Vedic (accented) inputs are also handled: the RgVeda Samhita is
completely indexed for words, mantras, rishi, chandas, devata,
viniyoga, anukramani and bhashya, with search on either the
Mandala-anuvaka-sukta-rk or the ashtaka-adhyaya-varga system.
- An online Gita Reader based on Java servlets on the Red Hat
Linux 5.02 platform: searchable, analysable and multi-script,
with English translation. This is accessible at -
http://202.141.63.222/mgita
- A CD-ROM of digital content, with modules of a knowledge base
of all Ancient Indian Sciences: the 14 Vidyasthana-s, and the
Sankhya, Yoga and Alankara systems.
Old Tamil texts belonging to the Sangam era; the Nalayira Divya
Prabandham of the Alwars; and the Desika Prabandham of Sri Vedanta
Desika.
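The RgVeda index mentioned among the objectives above supports two traditional reference systems for the same mantras. One way to picture this is a set of dictionaries built over one list of records, as in the Python sketch below. The record layout and class names are assumptions for illustration only; the two sample records are the first two mantras of the Samhita in a simplified, unaccented romanization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mantra:
    mandala: int
    anuvaka: int
    sukta: int
    rk: int
    ashtaka: int
    adhyaya: int
    varga: int
    rishi: str
    devata: str
    chandas: str
    text: str


class MantraIndex:
    """Look up the same mantras by either reference system, or by word."""

    def __init__(self, mantras):
        self.by_mandala = {}   # (mandala, sukta, rk) -> Mantra
        self.by_ashtaka = {}   # (ashtaka, adhyaya, varga) -> [Mantra, ...]
        self.by_word = {}      # word -> [Mantra, ...]
        for m in mantras:
            self.by_mandala[(m.mandala, m.sukta, m.rk)] = m
            self.by_ashtaka.setdefault(
                (m.ashtaka, m.adhyaya, m.varga), []
            ).append(m)
            for word in m.text.split():
                self.by_word.setdefault(word, []).append(m)


SAMPLE = [
    Mantra(1, 1, 1, 1, 1, 1, 1, "Madhuchchhandas", "Agni", "Gayatri",
           "agnim ile purohitam yajnasya devam rtvijam"),
    Mantra(1, 1, 1, 2, 1, 1, 1, "Madhuchchhandas", "Agni", "Gayatri",
           "agnih purvebhir rsibhir idyo nutanair uta"),
]
index = MantraIndex(SAMPLE)
```

Both `index.by_mandala[(1, 1, 2)]` and the second entry of `index.by_ashtaka[(1, 1, 1)]` then refer to the same record. Further dictionaries keyed on rishi, devata or chandas could be added in the same way.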
Future scope
- Modules in the Shastraic primers series, to eventually cover
all fourteen Vidyasthana-s completely.
- Knowledge tools, resources (contents) and structures to be
devised integrally, so that they also map to speech input/output
standards and systems.
- Phonetic standard development, interfacing TTS (synthesis)
to knowledge bases, and phonetic search/retrieval schemes and
tools.
- Integrating Vedic audio recordings with our textual knowledge
base, and making multi-media interactive, communicative,
educational software to teach regional languages, beginning
with Sanskrit, in schools.
Conclusion
This effort is expected to provide valuable computational
help to Sanskritists, and thereby to make the endeavors of
Sanskrit students and scholars more rewarding and enjoyable.
Veda Varidhi P Ramanujan has both traditional learning of the
Vedas and shastras in the gurukula system and a modern education,
with a degree in electrical engineering and a PG degree in
Computer Science. He had over a decade of professional design
experience in Aeronautics prior to joining C-DAC in 1990.