IIT-Madras’ lab AI4Bharat launches IndicVoices dataset covering 22 languages

IIT Madras’ research lab AI4Bharat on March 6 launched IndicVoices, an open-source dataset of natural speech covering 22 Indian languages.

 

The mission of this dataset was to collect spontaneous speech in Indian languages, AI4Bharat said in a blog. IndicVoices is funded by Bhashini, which is backed by the Ministry of Electronics and Information Technology, Ekstep Foundation, and Nilekani Philanthropies.

Using IndicVoices, AI4Bharat aims to build IndicASR, the first Automatic Speech Recognition (ASR) model to support all 22 languages listed in the Eighth Schedule of the Constitution of India. ASR models, as the name implies, transcribe spoken language into text, which can then be used to carry out various functions.

The dataset contains a total of 7,348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16,237 speakers covering 145 Indian districts and 22 languages, AI4Bharat said in a release.

The cost of collecting the dataset could be approximately Rs 30 crore, said an expert who did not wish to be named.

AI4Bharat said that of these 7,348 hours, 1,639 hours have already been transcribed, with a median of 73 hours per language. AI4Bharat has also shared an open-source blueprint for data collection, which includes standardised protocols, centralised tools, a repository of questions, prompts and conversation scenarios, quality control mechanisms, comprehensive transcription guidelines, and transcription tools.

“We hope that this open source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world,” AI4Bharat said in the blog.

Bhashini, an artificial intelligence (AI) language project by the government, has provided $5-6 million in funding to AI4Bharat and instructed it to collect data. The open-source data collected on behalf of Bhashini will then be used by the government-backed organisation.

Bhashini has funded over 70 research institutes, including IIT-Bombay, IIT-Mandi and the Indian Institute of Science, Bengaluru, apart from IIT-Madras’ AI4Bharat.

“This (datasets) will lead us to 22 language models and further lead us to use cases which we are building up,” Amitabh Nag, Chief Executive Officer, Bhashini, told Moneycontrol.

Bhashini aims to build a National Public Digital Platform for languages to develop services and products for citizens by leveraging the power of AI and other emerging technologies. It also aims to increase the content available in Indian languages in domains of public interest, particularly governance and policy.

Tanuj Bhojwani, head of PeoplePlusAI, said that further innovation could be done on top of the existing open-source datasets, as the barrier of collecting such high-cost datasets has been eliminated.
