No. | Demo systems | References
1 | Text-to-speech synthesis systems for the Manipuri language (Male Voice HTS system) | TTS Project Site
2 | Transliteration systems for the Manipuri language written in phonetic text | TTS Project Site
3 | Automatic syllabification for the Manipuri language | Dataset
4 | Manipuri OCR System | --
5 | Meetei/Meitei Mayek Text Processing Tool | --
Datasets and code bases for the following studies are available on request*

Word polarity detection using syllable features

  • Language: Manipuri

Sentence-level language identification

  • Source: Facebook, Languages: Assamese, Bengali, Karbi, Boro, Hindi and English
  • Source: YouTube, Languages: Assamese, Bengali, Hindi and English

Word-level language identification

  • Source: Facebook, Languages: Assamese, Bengali, Hindi and English

Chart-type classification

Jennil Thiyam, Sanasam Ranbir Singh, and Prabin K. Bora. 2021. Chart classification: an empirical comparative study of different learning models. In Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP '21). Association for Computing Machinery, New York, NY, USA, Article 32, 1–9.

This module contains the following:

  1. A chart dataset of 28 classes.
  2. Supporting code for the following tasks:
     • Chart-type classification using traditional classification methods.
     • Chart-type classification using deep learning-based methods.
     • Chart-type classification using two attention mechanisms.
     • Visualization of features (using GradCAM).

Chart-type studies (Attention and Triplet loss-based function)

Thiyam, J., Singh, S.R. & Bora, P.K. Effect of attention and triplet loss on chart classification: a study on noisy charts and confusing chart pairs. J Intell Inf Syst (2022).

This module contains the supporting code for the following tasks:

  • Euclidean distance calculation of confusing chart class pairs.
  • Finding hard triplets from the confusing chart class pairs.
  • Integration of attention mechanisms into the Xception network.
  • Triplet loss training.
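
The distance, triplet-selection and loss steps listed above can be illustrated with a minimal sketch. It is not the module's code: the function names, the toy 2-D embeddings, the margin value and the rule that any triplet with non-zero loss counts as "hard" are all assumptions made for illustration (the published study may use a stricter definition and learned Xception features).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.dist(a, b)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hard_triplets(class_a, class_b, margin=1.0):
    """Collect (anchor, positive, negative) triplets that violate the margin.

    Anchors and positives are drawn from class_a, negatives from class_b.
    Treating every margin-violating triplet as "hard" is an assumption.
    """
    triplets = []
    for i, anchor in enumerate(class_a):
        for j, positive in enumerate(class_a):
            if i == j:
                continue  # an anchor cannot be its own positive
            for negative in class_b:
                if triplet_loss(anchor, positive, negative, margin) > 0.0:
                    triplets.append((anchor, positive, negative))
    return triplets


# Toy 2-D embeddings for two hypothetical confusing chart classes.
class_a = [(0.0, 0.0), (0.0, 1.0)]
class_b = [(0.0, 0.5), (5.0, 5.0)]
hard = hard_triplets(class_a, class_b, margin=0.2)
```

In this toy setup only the negative at (0.0, 0.5) lies close enough to the anchors to violate the margin, so the distant point at (5.0, 5.0) contributes no triplets.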

Manipuri OCR

This module consists of the following:
  • Tesseract-based Manipuri OCR.
  • A fine-tuning procedure for the existing OCR models provided by Tesseract.
  • A Python script to perform semi-supervised training for populating text corpora.
  • OCR evaluation tool (provided by Tesseract).
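
As a rough illustration of what an OCR evaluation measures, the sketch below computes a character error rate (CER) from the edit distance between a reference transcription and the OCR output. This is not the Tesseract evaluation tool mentioned above; the function names are hypothetical, and the metric shown is one common choice rather than necessarily the one used in this module.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance divided by the reference length, counted in code points."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `character_error_rate("abcd", "abxd")` gives 0.25, since one of four reference characters is substituted. Python strings are compared code point by code point here, so combining marks count as separate characters.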

Document segmentation

This module consists of a dataset and Python scripts for document region segmentation, covering the following:
  • A sample dataset for document segmentation.
  • Building of two segmentation models.
  • Building of two classification models.
  • Document region (textual, equational, and graphic) segmentation.

IndiSentiment140

This module consists of the following:
  • IndiSentiment140 is a set of parallel corpora generated by translating Sentiment140 into 22 Indian languages supported by Google Translate.
  • The 22 languages considered are Assamese (as), Bengali (bn), Bhojpuri (bho), Dogri (doi), Gujarati (gu), Hindi (hi), Kannada (kn), Konkani (gom), Maithili (mai), Malayalam (ml), Marathi (mr), Meiteilon (Manipuri) (mni), Mizo (lus), Nepali (ne), Odia (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Sinhala (si), Tamil (ta), Telugu (te), and Urdu (ur).
  • A separate dataset is created for each language listed above by translating the Sentiment140 dataset into that language.
  • The original sentiment labels are retained after translation into each language.
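
The label-preserving translation described above can be sketched as follows. This is an illustrative outline only: `translate_corpus` and `fake_translate` are hypothetical names, and the stand-in translator merely tags the text with the language code so the example runs offline; the actual corpora were produced with Google Translate.

```python
def translate_corpus(rows, translate, target_lang):
    """Translate the text of each (label, text) row and keep its label unchanged.

    `translate` is any callable taking (text, target_lang) and returning a string.
    """
    return [(label, translate(text, target_lang)) for label, text in rows]


def fake_translate(text, target_lang):
    """Stand-in for a real translation service (assumption, for illustration only)."""
    return f"[{target_lang}] {text}"


# Sentiment140 encodes polarity as 0 (negative) and 4 (positive).
english_rows = [(0, "I missed the bus again"), (4, "What a lovely morning")]
assamese_rows = translate_corpus(english_rows, fake_translate, "as")
```

Because rows are translated one-to-one and in order, each translated sentence stays aligned with its English source and its sentiment label, which is what makes the corpora parallel.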
*Requests for the datasets can be mailed to the following email ID: osi.iitg@gmail.com