Scientists develop translation and information-retrieval system for uncommon languages

Researchers receive $16.7 million grant to automatically translate and summarize “low-resource” language documents into English.


A team of scientists from the Information Sciences Institute (ISI) at USC Viterbi has received a $16.7 million grant from the Intelligence Advanced Research Projects Activity (IARPA) to develop a translation and information-retrieval system that can quickly translate obscure languages.

Principal investigator and ISI research team leader Scott Miller, ISI computer scientist Jonathan May, and ISI research lead Elizabeth Boschee are leading a team of around 30 researchers, including academics from the University of Massachusetts, Northeastern University, MIT, RPI, and the University of Notre Dame. Serving as senior advisors are Prem Natarajan, ISI’s Michael Keston executive director and research professor of computer science, and Kevin Knight, ISI research director and Dean’s professor of computer science.

The ISI team’s project is called SARAL, which stands for Summarization and domain-Adaptive Retrieval (saral is also a Hindi word whose translations include “straightforward” and “smart”). The team includes experts in machine translation, speech recognition, morphology, information retrieval, representation, and summarization.

Miller said, “The overall objective is to provide a Google-like capability, except the queries are in English but the retrieved documents are in a low-resource foreign language.”

“The aim is to retrieve relevant foreign-language documents and to provide English summaries explaining how each document is relevant to the English query.” 

The scientists will begin the project by gathering documents in the test languages, including speech, online documents, and video clips, which have previously been translated into English.

They will then develop algorithms to analyze the language patterns, such as sentence structure (the position of subject, verb and object, for instance) and morphology, the structure of words and their relationship to other words in the same language.

The system will be designed to respond to domain-specific queries, for example, environmental protection in the “government and politics” domain or primary education in the “lifestyle” domain, and will produce a summarized response of around 100 words describing how the result is relevant to the search.

May said, “Since we don’t have a lot of written data in these languages, we have to do more with less. Ideally, we would use about 300 million words to train a machine translation system—and in this case, we have around 800,000 words. There are about 100,000 words per novel, so we have only eight novels’ worth of words to work from.”

“You can think of the summary as something like CliffsNotes, but with the added feature that it is indexed to the precise part you want to write your essay about.”

Natarajan said, “IARPA’s MATERIAL program is the first organized attempt at synthesizing recent advances in machine translation, speech recognition, cross-lingual retrieval and summarization into a powerful new capability that allows users to accurately access all relevant information, across languages and modalities. We are tremendously grateful for the opportunity to contribute to this nationally important effort.”
