India's First IT Company Launches 'Indian Version of ChatGPT', Supporting English and 40 Indian Dialects to Boost Indian Language Computing

baoshi.rao

Indian IT company Tech Mahindra recently announced 'Project Indus', an open-source foundational language model for Indian languages. The project could become the company's most significant initiative to date. Today's large language models, such as OpenAI's GPT, have multilingual capabilities, but their reliance on English datasets limits how well they understand and generate content in Indian languages.

Image Caption: AI-generated image, licensed by Midjourney

Tech Mahindra CEO Gurnani stated that this model will be the largest Indian language model, potentially serving 25% of the global population. Tech Mahindra has not disclosed the project's cost or expected release date, but the goal is to first build a 7-billion-parameter language model.

The model will initially support 40 different Hindi dialects, with more languages and dialects to be added gradually. The company noted that although some Indian language models, such as Bhashini and AI4Bharat, already exist, a foundational model still needs to be developed. The model's interface may include voice and text features, but a ChatGPT-like chat interface has not yet been considered.

Tech Mahindra's primary goal is to first create a language model for text continuation and then add conversational capabilities. Once the model's performance and the quality of its dialect generation are confirmed, the company will release it as open source.

An Indian language model can prioritize cultural sensitivity, ensuring generated content respects local customs and norms. It can also democratize AI, serving the country's broader non-English-speaking population.

However, collecting data across different languages and dialects remains the biggest challenge facing Tech Mahindra. To address this, the company is seeking contributions from speakers of different dialects to help build the dataset. It has set up a web portal to collect language contributions from people in India.