Machine-learning system tackles speech and object recognition

Model learns to pick out objects within an image, using spoken descriptions.

MIT computer scientists have developed a system that learns to identify objects within an image based on a spoken description of that image. Given an image and an audio caption, the model highlights, in real time, the relevant regions of the image being described.

The model works by learning words directly from recorded speech clips and objects from raw images, and associating them with one another.

Speech-recognition systems such as Siri and Google Voice require transcriptions of many thousands of hours of speech recordings. Using these data, the systems learn to map speech signals to specific words. Such an approach becomes especially problematic when, say, new terms enter our vocabulary and the system must be retrained.

David Harwath, a researcher in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), said, “We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to. We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”

Scientists demonstrated their model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background. The model learned to associate which pixels in the image corresponded with the words ‘girl’, ‘blonde hair’, ‘blue eyes’, ‘blue dress’, ‘white lighthouse’, and ‘red roof’. When an audio caption was narrated, the model then highlighted each of those objects in the image as they were described.

This work builds on an earlier model developed by the researchers that correlates speech with groups of thematically related images. In the earlier research, they put images of scenes from a classification database on the crowdsourcing platform Mechanical Turk. They then had people describe the images as if they were narrating to a child, for about 10 seconds. They compiled more than 200,000 pairs of images and audio captions, in hundreds of different categories, such as beaches, shopping malls, city streets, and bedrooms.

After that, they designed a model consisting of two convolutional neural networks (CNNs). One processes images, while the other processes spectrograms. An additional top layer combines the outputs of the two networks and maps speech patterns to the image data.
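A minimal sketch of that two-branch layout, written in PyTorch, might look like the following. The layer sizes, embedding dimension, and class names (ImageBranch, AudioBranch) are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a two-branch model: one CNN over images, one over spectrograms.
# All hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """CNN that maps an RGB image to a grid of embedding vectors."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, images):            # images: (B, 3, H, W)
        return self.conv(images)          # (B, D, H', W') spatial grid of embeddings

class AudioBranch(nn.Module):
    """CNN that maps a spectrogram to a sequence of embedding vectors."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(n_mels, 5), stride=(1, 2), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(64, embed_dim, kernel_size=(1, 5), stride=(1, 2), padding=(0, 2)),
        )

    def forward(self, spectrograms):      # spectrograms: (B, 1, n_mels, T)
        feats = self.conv(spectrograms)   # (B, D, 1, T')
        return feats.squeeze(2)           # (B, D, T') temporal sequence of embeddings
```

The key design point the article describes is that neither branch ever sees text: the image branch outputs a grid of visual embeddings and the audio branch outputs a sequence of speech embeddings, and only the layer on top relates the two.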

The researchers would, for instance, feed the model caption A and image A, which is correct. Then, they would feed it a random caption B with image A, which is an incorrect pairing. After comparing thousands of wrong captions with image A, the model learns the speech signals corresponding with image A, and associates those signals with words in the captions. As described in a 2016 study, the model learned, for instance, to pick out the signal corresponding to the word “water,” and to retrieve images with bodies of water.
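One common way to encode this matched-versus-mismatched training signal is a margin ranking loss over a batch, in which each image's true caption must score higher than random impostor captions, and vice versa. The sketch below illustrates that general technique under those assumptions; it is not necessarily the exact loss used in the papers.

```python
# Hedged sketch of contrastive training on correct vs. incorrect pairings.
import torch
import torch.nn.functional as F

def ranking_loss(image_emb, audio_emb, margin=1.0):
    """image_emb, audio_emb: (B, D) pooled embeddings of paired images and captions.
    Matched pairs sit on the diagonal of the score matrix; every off-diagonal
    entry is an incorrect pairing drawn from the same batch."""
    scores = image_emb @ audio_emb.t()               # (B, B) similarity of every pairing
    pos = scores.diag().unsqueeze(1)                 # scores of the correct pairs
    cost_audio = F.relu(margin + scores - pos)       # wrong caption, right image
    cost_image = F.relu(margin + scores - pos.t())   # wrong image, right caption
    off_diag = 1.0 - torch.eye(scores.size(0), device=scores.device)
    return ((cost_audio + cost_image) * off_diag).mean()
```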

In the new paper, the researchers modified the model to associate specific words with specific patches of pixels. They trained the model on the same database, but with a new total of 400,000 image-caption pairs. They held out 1,000 random pairs for testing.

In training, the model is similarly given correct and incorrect images and captions. This time, however, the image-recognizing CNN divides the image into a grid of cells consisting of patches of pixels, and the audio-analyzing CNN divides the spectrogram into segments of, say, one second, to capture a word or two.

Given a correct image-and-caption pair, the model matches the first cell of the grid to the first segment of audio, then matches that same cell with the second segment of audio, and so on through every grid cell and audio segment. For each cell and audio segment, it provides a similarity score, depending on how closely the signal corresponds to the object.
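In code, that cell-by-segment scoring can be pictured as a three-dimensional similarity volume built from the two branches' outputs. The sketch below assumes dot-product similarity and one plausible pooling choice (the best-matching image cell for each audio segment, averaged over time) to collapse the volume into a single score for the pair; the authors' exact pooling may differ.

```python
# Hedged sketch of the cell-by-segment similarity volume ("matchmap").
import torch

def matchmap(image_feats, audio_feats):
    """image_feats: (D, H, W) grid of image-cell embeddings.
    audio_feats: (D, T) sequence of audio-segment embeddings.
    Returns an (H, W, T) volume of dot-product similarities."""
    return torch.einsum('dhw,dt->hwt', image_feats, audio_feats)

def pair_score(mm):
    """Collapse a matchmap into one similarity score for the image-caption pair:
    for each audio segment, take the best-matching image cell, then average over time."""
    return mm.flatten(0, 1).max(dim=0).values.mean()
```

At test time, highlighting the image cells with the highest similarity to the audio segment currently being spoken is what produces the highlighted regions described above.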

Dr. Harwath said, “The challenge is that, during training, the model doesn’t have access to any true alignment information between the speech and the image. The biggest contribution of the paper is demonstrating that these cross-modal alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don’t.”

The authors dub this automatically learned association between a spoken caption’s waveform and the image pixels a ‘matchmap’. After training on thousands of image-caption pairs, the network narrows those alignments down to the specific words that represent specific objects in that matchmap.

Harwath said, “It’s kind of like the Big Bang, where the matter was really dispersed, but then coalesced into planets and stars. Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects.”

The CSAIL co-authors are graduate student Adria Recasens; visiting student Didac Suris; former researcher Galen Chuang; Antonio Torralba, a professor of electrical engineering and computer science who also heads the MIT-IBM Watson AI Lab; and Senior Research Scientist James Glass, who leads the Spoken Language Systems Group at CSAIL.
