Machine translation of English-Manipuri made possible
Source: Chronicle News Service
Imphal, October 12 2021:
As the world advances in the field of technology, automated translation from one language to another has reached an advanced state, with researchers and engineers collecting huge databases and designing machine translation systems.
Joining this effort, Manipuri researcher Rudali Huidrom has made a breakthrough in the field by building a database of the resources required to automate translation of the Manipuri language into English or other languages and vice versa.
Daughter of Huidrom Nandakumar and RK Priyashini, Rudali has completed a Master of Engineering in Information, Production and System Engineering from Japan's renowned Waseda University.
She is now heading to DCU, Dublin, in Ireland for a PhD.
Speaking to The People's Chronicle, Rudali said that she worked under the supervision of Professor Yves Lepage at the EBMT/NLP laboratory, Waseda University, Japan, on her thesis "Machine translation for a less resourced language: Manipuri (Meiteilon)". During this work she set a milestone by creating the EM Corpus (short for Emalon Manipuri Corpus), the first comparable text-to-text corpus built for the Manipuri-English language pair, from sentences crawled and collected from a local daily of the state from August 2020 to 2021.
The sentences were collected by a robot (a web crawler), but a separate algorithm was needed to encode the Bengali script so that the robot could process it.
Since the collected Manipuri text is monolingual, a programme was written to match each Manipuri sentence with an English sentence, and building the database was a big and time-consuming effort, she said.
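The report does not describe how the crawler handled the Bengali script. As a rough illustration only, the Python sketch below shows one common way to separate Bengali-script text from English text during collection, by checking how many letters fall in the Unicode Bengali block (U+0980 to U+09FF). The function names and the 0.5 threshold are assumptions made for this example, not details of Rudali's system.

```python
def bengali_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in the Unicode Bengali block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09FF")
    return bengali / len(letters)


def detect_script(text: str, threshold: float = 0.5) -> str:
    """Label a sentence 'bengali' or 'other'; the threshold is an arbitrary assumption."""
    return "bengali" if bengali_ratio(text) >= threshold else "other"


if __name__ == "__main__":
    for sentence in ["মণিপুর", "Imphal", "মণিপুর Imphal"]:
        print(sentence, "->", detect_script(sentence))
```

A filter of this kind would let a crawler route each collected sentence to the Manipuri or the English side of the corpus before any matching is attempted.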
Rudali created monolingual data comprising 1,034,715 Manipuri sentences and 846,796 English sentences in version 1, and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2.
Together, these make up the comparable corpus in the two languages.
To create parallel data, 124,975 aligned Manipuri-English sentence pairs were extracted from version 2 of the comparable data.
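The report does not say how the aligned pairs were extracted. Purely to illustrate the idea of mining parallel sentences from comparable data, the Python sketch below pairs each source sentence with the target sentence that shares the most dictionary translations and keeps the pair only if the overlap passes a threshold. The lexicon, the placeholder tokens "m1" to "m5" (standing in for Manipuri words), the function names and the threshold are all hypothetical.

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Share of source tokens whose lexicon translation occurs in the target sentence."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok) in tgt)
    return hits / len(src_tokens)


def extract_parallel(src_sents, tgt_sents, lexicon, threshold=0.5):
    """Keep, for each source sentence, its best-scoring target if the score >= threshold."""
    pairs = []
    for src in src_sents:
        src_tokens = src.lower().split()
        best, best_score = None, 0.0
        for tgt in tgt_sents:
            score = overlap_score(src_tokens, tgt.lower().split(), lexicon)
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs


# Placeholder data: "m1".."m5" are stand-ins, not real Manipuri words.
LEXICON = {"m1": "rain", "m2": "today", "m3": "heavy"}
SOURCE = ["m1 m2 m3", "m4 m5"]
TARGET = ["heavy rain today", "schools reopen"]

if __name__ == "__main__":
    for src, tgt, score in extract_parallel(SOURCE, TARGET, LEXICON):
        print(f"{src!r} <-> {tgt!r} (score {score:.2f})")
```

Because only pairs that pass the threshold are kept, a parallel set mined this way is typically much smaller than the comparable data it comes from, which is consistent with the figures quoted above.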
Her EM-ALBERT is the first ALBERT model available for the Manipuri language; it was trained on the 1,034,715 Manipuri sentences in version 1 of the EM Corpus.
EM-FT is a FastText word embedding for the Manipuri language, built on 1,880,035 Manipuri sentences, she said.
The researcher further said that these resources are now available to all, free of cost, under the CC-BY-NC-4.0 licence in the catalogue of ELRA, the European Language Resources Association, one of the oldest and most prominent associations in the world promoting language resources and evaluation for the Human Language Technology sector in all their forms and uses.
Rudali was very optimistic about what this size of data, now available for 'Emalon' Meeteilon, makes possible, and said, "We are already a step closer in the advancement of the technological applications of AI in NLP (Natural Language Processing) related to machine translation like Google Translate, summarisation, chatbots among others."