Developing the Sanskrit Authoring System - VYASA

With a view to nurturing and preserving the richness of Sanskrit, C-DAC has taken the initiative of developing a Sanskrit Authoring System under its ‘Heritage Group’ activities. Shri Ramanujan has played a central role in this project, and summarizes below the work being done.
____________________________________________________________________

INTRODUCTION

Sanskrit has been a parent of many modern Indian languages. The name Sanskrita suggests an ‘adorned’, ‘elaborate’, or ‘perfected’ form of speech. The beauty of Sanskrit lies in the fact that it has an extremely well-structured and unambiguous grammar. This is a great advantage and helps in determining and representing precisely the syntactic and semantic meaning of a sentence.

In ancient literature, the entire grammar of Sanskrit has been laid down as a finite and exhaustive set of rules in Panini’s Astadhyayi. These rules were valid for Sanskrit centuries ago and still hold today, eliminating any scope for evolutionary changes in grammar or usage. In no other language can such a set of rules, which are rather mathematical in nature, be found or formulated.

The Sanskrit grammar rules are well-structured amongst themselves. There are sets of meta-rules, which, with the help of various conditions, can be used in linguistic processing. It is due to this Sastraic (Sastra = science) nature that Sanskrit is such an alluring language for Natural Language Processing (NLP) research.

C-DAC’s Indian Heritage Group, based at its Centre in Bangalore, has been engaged in activities related to the processing of Sanskrit and Vedic texts since 1990. To consolidate the work done so far, the Sanskrit Authoring System (SAS) project was launched with the sponsorship of the Department of Information Technology.

The project seeks to cover work on the ISCII standard, GIST applications in data processing, building a Natural Language Understanding (NLU) system for Sanskrit called DESIKA, a computational rendering of Paninian grammar and a computational text of the Rgveda Samhita, and preparing the ‘Sakala Shastra Sutra Kosha’, a compendium of all the original treatises of the Ancient Indian Sciences.

Benefits from Information Technology

It is a common experience when one reads books or articles in Indian languages that help in retrieving the desired analytical information, such as indices of different types, sources of quotations and their full forms, is often lacking. With Sanskrit texts, the need to access such information is even greater, since many scriptural sources get quoted.

To help Sanskritists in their creative scholarly pursuits such as research and academic work, such a system needs content, tools for processing, and schemes for tagging, hyper-linking and referencing. These are all possible thanks to developments in information technology.

Let us look at some of these as provided for in our system.

The tools provided include an editor for multi-script use; tools for morphological, syntactic and semantic analysis; tools for searching, indexing and sorting; lexical updating and lexical tagging; extraction and indexing of quotations in commentaries and explanations; a transliteration facility; word-split programs for sandhi and samasa; poetry analysis (textual, metric and statistical); and statistical tools and resources such as concordances, thesauri and electronic dictionaries.
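To make the transliteration facility in this list concrete, here is a minimal sketch in Python of converting ITRANS-style Roman input into IAST diacritics by greedy longest-match lookup. The table covers only part of the ITRANS scheme, and the names ITRANS_TO_IAST and itrans_to_iast are hypothetical; this is an illustration of the general technique, not the conversion code used in the system.

```python
# Partial ITRANS -> IAST table (an illustrative subset, not the full scheme).
ITRANS_TO_IAST = {
    "aa": "ā", "A": "ā", "ii": "ī", "I": "ī", "uu": "ū", "U": "ū",
    "RRi": "ṛ", "RRI": "ṝ", "M": "ṃ", "H": "ḥ",
    "~N": "ṅ", "~n": "ñ",
    "Ch": "ch", "ch": "c",
    "Th": "ṭh", "T": "ṭ", "Dh": "ḍh", "D": "ḍ", "N": "ṇ",
    "shh": "ṣ", "Sh": "ṣ", "sh": "ś",
}
MAX_TOKEN = max(len(token) for token in ITRANS_TO_IAST)


def itrans_to_iast(text):
    """Convert ITRANS-style text to IAST, trying the longest token first.

    Characters with no table entry (for example k, t, a) are copied as-is,
    since they are written the same way in both schemes.
    """
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_TOKEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in ITRANS_TO_IAST:
                out.append(ITRANS_TO_IAST[chunk])
                i += len(chunk)
                break
        else:
            # No token starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


for word in ("saMskRRita", "giitaa", "shaastra"):
    print(word, "->", itrans_to_iast(word))
```

Trying the longest token first matters because several ITRANS tokens are prefixes of others (for example "sh" and "shh"); a shortest-first scan would split them incorrectly.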

Digital content from the reference compendium mentioned above, lexicons like Amarakosha, Paninian grammar rules, word analyses, derivations, quotes from the Veda-s (scriptures), epics like the Ramayana and Mahabharata, the Puranas, and Shastraic texts in Sutra form are all a part of the system.

The DESIKA parser works at three levels. Morphological analysis provides all the grammatically valid identifications for a word in isolation. Syntactic analysis maps Vibhakti-s (case markers) to Karaka-s (case relationships). Semantic analysis confirms or disambiguates the senses of each word, including ascertaining ontological compatibility.
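The syntactic step can be pictured as a table lookup from case markers to candidate case relationships. The sketch below, in Python, uses simplified textbook defaults; the table contents, the function names and the sample sentence are assumptions for illustration only and do not reproduce DESIKA's actual rules, which would also take the verb and context into account.

```python
# Simplified textbook defaults, NOT the rule set used by DESIKA.
# Keys are vibhakti numbers 1-7; values are candidate karaka labels.
ACTIVE_VOICE = {
    1: ["karta"],        # prathama (nominative): agent
    2: ["karma"],        # dvitiya (accusative): object
    3: ["karana"],       # tritiya (instrumental): instrument
    4: ["sampradana"],   # chaturthi (dative): recipient
    5: ["apadana"],      # panchami (ablative): point of separation
    6: [],               # shashthi (genitive): a relation, not a karaka
    7: ["adhikarana"],   # saptami (locative): locus
}

# In the passive construction the object takes the nominative and the agent
# the instrumental, so the third case yields two candidates.
PASSIVE_VOICE = {**ACTIVE_VOICE, 1: ["karma"], 3: ["karta", "karana"]}


def candidate_karakas(vibhakti, voice="active"):
    """Return the candidate karaka labels for one vibhakti number."""
    table = PASSIVE_VOICE if voice == "passive" else ACTIVE_VOICE
    return list(table[vibhakti])


def annotate(tagged_words, voice="active"):
    """Attach candidate karakas to (word, vibhakti) pairs."""
    return [(word, candidate_karakas(v, voice)) for word, v in tagged_words]


# "Rama kills Ravana with an arrow". The (word, vibhakti) pairs are assumed
# to come from an earlier morphological analysis step.
sentence = [("ramah", 1), ("banena", 3), ("ravanam", 2)]
for word, roles in annotate(sentence):
    print(word, roles)
```

Where the lookup returns more than one candidate, as with the instrumental in a passive sentence, a later semantic step would have to choose among them.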

A graphical output option, query processing, and updating of lexicons for morphological and semantic processing are provided, along with tools for linguistic analysis such as tagging, lemmatising and statistical studies.

A scheme to extract quotations from texts and commentaries and to locate their sources in our knowledge base is provided as a tool.

Abbreviations of sources can be updated and modified, with search/retrieval and hyperlinking based on a standard form of source specification.

Indexing, sorting and concordance tools for multi-lingual inputs are also provided. Tamil sorting order is also handled, both in isolation and in multi-lingual files.
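One simple way to picture source location is a normalized substring search over the knowledge base. The Python sketch below is an assumption about how such a lookup could work, not a description of the system's scheme; the function names are hypothetical, and the two knowledge-base entries are toy samples in simplified Roman spelling.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace, so that small
    typographic differences do not block a match."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


# Toy knowledge base: source reference -> text (simplified Roman spelling).
KNOWLEDGE_BASE = {
    "Bhagavad Gita 2.47": "karmany evadhikaras te ma phaleshu kadacana",
    "Bhagavad Gita 4.7": "yada yada hi dharmasya glanir bhavati bharata",
}


def find_sources(quotation, knowledge_base=KNOWLEDGE_BASE):
    """Return the references whose text contains the quotation."""
    needle = normalize(quotation)
    if not needle:
        return []
    return [ref for ref, text in knowledge_base.items()
            if needle in normalize(text)]


print(find_sources("Yada yada hi dharmasya,"))
```

A plain substring test can match across word boundaries and cannot cope with sandhi variants, so a real tool would need word-level and grammar-aware matching.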
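As a rough illustration of what a concordance tool produces, the Python sketch below records every position at which each word occurs. The function name and the sample lines (the opening verse of the Gita in simplified Roman spelling) are assumptions for the example. Note that it sorts by plain code-point order, which is only a stand-in: a proper Sanskrit or Tamil collation would need a script-specific sort key.

```python
from collections import defaultdict


def build_concordance(lines):
    """Map each word to the (line number, word position) pairs where it occurs.

    Numbering starts at 1. Entries are returned in code-point order, which is
    a placeholder for a real script-aware collation.
    """
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for position, word in enumerate(line.split(), start=1):
            index[word].append((line_no, position))
    return dict(sorted(index.items()))


sample = [
    "dharmakshetre kurukshetre samaveta yuyutsavah",
    "mamakah pandavas caiva kim akurvata sanjaya",
]
concordance = build_concordance(sample)
for word, places in concordance.items():
    print(word, places)
```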

The other features include

  • Semantic lexical update provides the current ontology for view/revision/update
  • New instances under existing categories and new categories can be added incrementally
  • Ontology with the revisions applied is displayed before okaying
  • Multiple memberships at different layers and categories are allowed
  • Lexical/morphological suffixes and rules acceptable for characterising sentence type are also updatable (existing types are listed for view)
  • Sloka analysis taking different anvaya type specifications

With regard to Shabda-bodha, the diagrammatic representation according to any shastra can be studied.

Newer types can be updated based on paradigm specification.

Case-relationships and case-marker mappings are provided exhaustively, with examples, to help in syntactic analysis and in creating new appropriate categories.

Tried out on several actual texts of various types and complexities like:

Some of the objectives realised as ‘proof-of-concepts’ are -

OCX-based applications using GIST-SDK for all vidyasthana-s, and Linux-based work along with ITRANS/JTRANS conversion.

Web-server set up to help access our ‘on-line readers’ including free font downloads

Vedic (accented) inputs also handled - RgVeda Samhita completely indexed for words, mantras, rishi, chandas, devata, viniyoga, anukramani, bhashya, and search on Mandala-anuvaka-sukta-rk OR ashtaka-adhyaya-varga systems.
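Supporting both reference systems amounts to keeping more than one key for the same verse. The Python sketch below is a hypothetical data layout, not the system's actual index: the class and method names are invented, the anuvaka level and most of the listed index fields are omitted for brevity, and the two records are sample entries for the first two verses of the opening hymn.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rk:
    mandala: int
    sukta: int
    rk: int
    ashtaka: int
    adhyaya: int
    varga: int
    rishi: str
    devata: str
    chandas: str


class RkIndex:
    """Index verses under both reference systems and by attribute."""

    def __init__(self, records):
        self._by_mandala = {}
        self._by_ashtaka = defaultdict(list)
        self._by_field = defaultdict(list)
        for r in records:
            self._by_mandala[(r.mandala, r.sukta, r.rk)] = r
            self._by_ashtaka[(r.ashtaka, r.adhyaya, r.varga)].append(r)
            for field in ("rishi", "devata", "chandas"):
                self._by_field[(field, getattr(r, field))].append(r)

    def by_mandala(self, mandala, sukta, rk):
        """Return the single verse at this reference, or None."""
        return self._by_mandala.get((mandala, sukta, rk))

    def by_ashtaka(self, ashtaka, adhyaya, varga):
        """Return all verses grouped in this varga."""
        return list(self._by_ashtaka.get((ashtaka, adhyaya, varga), []))

    def by_field(self, field, value):
        """Return all verses whose rishi, devata or chandas equals value."""
        return list(self._by_field.get((field, value), []))


# Sample records only: the first two verses of the opening hymn.
records = [
    Rk(1, 1, 1, 1, 1, 1, "Madhucchandas Vaishvamitra", "Agni", "Gayatri"),
    Rk(1, 1, 2, 1, 1, 1, "Madhucchandas Vaishvamitra", "Agni", "Gayatri"),
]
index = RkIndex(records)
print(index.by_mandala(1, 1, 2))
print(len(index.by_ashtaka(1, 1, 1)))
```

Storing each verse once and pointing to it from several dictionaries keeps the two numbering systems consistent with each other.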

Online Gita Reader in multiple scripts, based on Java servlets on the Red Hat Linux 5.02 platform: searchable, analysable, multi-script, with English translation.

This is accessible at -

http://202.141.63.222/mgita

The CD-ROM of digital content has modules of a knowledge base of all Ancient Indian Sciences, the 14 Vidyasthana-s, and the Sankhya, Yoga and Alankara systems.

Old Tamil belonging to Sangam era

Nalayira Divya Prabandham of Alwars; and Desika Prabandham of Sri Vedanta Desika

Future scope

Modules in the Shastraic primers series to eventually cover all the fourteen Vidyasthana-s completely

Knowledge Tools, Resources (contents) and Structures to be devised integrally to map to Speech input/output standards and systems also

Phonetic standard development, interfacing TTS (synthesis) to the knowledge bases, and phonetic search/retrieval schemes and tools

Integrating Vedic audio recording with our textual knowledge base and making multi-media interactive, communicative, educational software to teach regional languages beginning with Sanskrit in schools

Conclusion

This effort is expected to provide valuable computational help to Sanskritists and thereby to contribute to making the endeavors of Sanskrit students and scholars more rewarding and enjoyable.

Veda Varidhi P Ramanujan has both a traditional grounding in the Vedas and shastras in the gurukula system and a modern education, with a degree in electrical engineering and a PG degree in Computer Science. He had over a decade of professional design experience in aeronautics prior to joining C-DAC in 1990.
