The Centre for Linguistic Science and Technology is an attempt to study, archive and preserve the indigenous resources available in Indian languages, and to create a knowledge base of them, with a specific emphasis on north-east India. With this in view, people from four departments have collaborated to create this resource facility.
The center will work in the area of language. Its primary functions will include the preservation and archiving of minority languages spoken in India, speech and written-text analysis of these languages, and technology development in Indian languages. As part of the goal of archiving and preserving local languages, the center will aim to create speech and text databases for the languages spoken in the North East, specifically the minority languages.
The center will be responsible for disseminating the information and technology it develops at both the local and the global level. At the local level, the center will focus on empowering the local communities in the NE region by making the center’s knowledge and experience available to them. At the same time, the center will act as a nodal center that involves and mentors other institutions in the area in research and development, especially in language analysis and related technology development.
At the global level, the center envisions itself as a storehouse of information on the languages of the NE region. As most of the languages spoken in this area belong to the Tibeto-Burman family, information about the NE languages will help global understanding of the languages of this family. The speech and text databases created by the center will surely be of global interest. Hence, the center envisions disseminating information to the global community through the Internet.
The center aims at collaboration among different streams, namely computer science, linguistics, optical character recognition, handwriting recognition, speech technology and typography.
In the domain of computer science, the focus of the center will be natural language processing and the development of language resources for the major languages of the NE region. The resources will include synset dictionaries, POS-tagged data, parallel corpora and transliteration data. In addition, the center will work on creating parallel resources in some major Indian languages. The center will focus on a crowd-based model for creating language resources; apart from that, data will also be sourced from the web. This part of the project aims at creating tools for NE languages such as Morphological Analyzers/Synthesizers, an Automatic POS Tagger, an NE Transliteration and Unknown Word Transliteration Module, and a Translation Memory Tool for human editing.
In the domain of linguistics, the center will focus on the languages spoken in the NE region and explore them through analysis and experiments. The linguists in the center will aim at building an archive of speech and text resources that will aid work in phonology, phonetics, syntactic processing and technology development, and that will help linguists, speech technologists and NLP specialists understand these languages properly for further research and development. The center aims at creating speech corpora of the languages spoken in North-East India, primarily because not much work has been done on these languages. As mentioned before, most of the languages of the NE area are Tibeto-Burman languages, and many others are Austro-Asiatic languages. This sets them apart from the Indo-European languages. Hence, from both a typological and a technology-development perspective, these languages pose a completely different set of challenges, which this center will try to address. The center will also archive grammars, linguistic texts and similar materials on the languages of the NE region and make them available to the local and global communities.
From the perspective of experimental linguistics, the center may investigate the integration of visual and linguistic information in spoken language comprehension. It also aims at investigating language processing in the bilingual and multilingual population of the region. Linguists in the project will also examine, among other questions, the role of script, or the lack of it, in non-linguistic processing.