How Neural Networks Think

General-purpose technique sheds light on inner workings of neural nets trained to process language.


Artificial intelligence has become a core piece of scientific infrastructure for research and learning, and its potential is vast. Much of its recent progress is driven by machine-learning algorithms called neural networks, which now power a wide range of AI tasks.

Understanding how neural networks work can help researchers improve their performance and transfer their insights to other applications. To that end, MIT scientists have developed a technique for divining the computations of particular neural networks.

The general-purpose technique makes sense of neural networks trained for natural-language-processing tasks. It applies to any system that takes text as input and produces strings of symbols as output, and it works by systematically varying the inputs and examining the effects on the outputs.

Because it treats the system under study as a black box, the technique works with any text-processing system, and it can even identify idiosyncrasies in the work of human translators.

The technique is analogous to methods used to analyze neural networks trained for computer-vision tasks such as object recognition. Software that systematically perturbs, or varies, different parts of an image and resubmits the image to an object recognizer can identify which image features lead to which classifications. But adapting that approach to natural language processing isn't straightforward.
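The vision-side idea can be sketched in a few lines. This is an illustrative toy, not the researchers' code: the "classifier" below is a made-up stand-in that labels an image by the brightness of one region, and occlusion is the perturbation.

```python
# Occlusion-style perturbation for explaining an image classifier.
# toy_classifier and the 4x4 "image" are hypothetical stand-ins for a
# real network and photo; the explanation logic is the point.

def toy_classifier(image):
    """Pretend classifier: says 'dog' if the top-left quadrant is bright."""
    top_left = [image[r][c] for r in range(2) for c in range(2)]
    return "dog" if sum(top_left) / len(top_left) > 0.5 else "cat"

def explain_by_occlusion(image, classifier, patch=2):
    """Zero out each patch in turn and record which ones flip the label."""
    baseline = classifier(image)
    important = []
    for r0 in range(0, len(image), patch):
        for c0 in range(0, len(image[0]), patch):
            perturbed = [row[:] for row in image]
            for r in range(r0, min(r0 + patch, len(image))):
                for c in range(c0, min(c0 + patch, len(image[0]))):
                    perturbed[r][c] = 0.0
            if classifier(perturbed) != baseline:
                important.append((r0, c0))  # occluding this patch changed the label
    return baseline, important

image = [[0.9, 0.8, 0.1, 0.2],
         [0.7, 0.9, 0.0, 0.1],
         [0.1, 0.2, 0.3, 0.1],
         [0.0, 0.1, 0.2, 0.0]]
label, patches = explain_by_occlusion(image, toy_classifier)
```

Here only the top-left patch flips the label when occluded, so it is flagged as the feature driving the classification. The difficulty the article describes is that sentences have no obvious analogue of "zero out a patch."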

Professor Tommi Jaakkola said, “What does it even mean to perturb a sentence semantically? I can’t just do a simple randomization. And what you are predicting is now a more complex object, like a sentence, so what does it mean to give an explanation?”

To generate test sentences, the scientists used a black-box neural net of their own. They trained a network to both compress and decompress natural sentences: an encoder creates a compact intermediate digital representation of a sentence, and a decoder re-expands it. The encoder and decoder are trained simultaneously, so that the decoder's output matches the encoder's input.
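A minimal sketch of the encoder/decoder interface, under heavy simplification: the real system is a neural sequence-to-sequence autoencoder, whereas this toy version just uses a vocabulary lookup. It only illustrates the round-trip property that decoding the compact representation reproduces the input.

```python
# Toy "autoencoder": encode a sentence into a compact representation
# (a tuple of vocabulary indices) and decode it back. The vocabulary
# and sentence are illustrative assumptions, not the authors' data.

VOCAB = ["she", "gasped", "squealed", "in", "surprise"]
INDEX = {word: i for i, word in enumerate(VOCAB)}

def encode(sentence):
    """Compress a sentence into an intermediate digital representation."""
    return tuple(INDEX[w] for w in sentence.lower().split())

def decode(code):
    """Re-expand the compact representation back into a sentence."""
    return " ".join(VOCAB[i] for i in code)

code = encode("She gasped in surprise")
restored = decode(code)
```

In the real system this round-trip consistency is a training objective rather than a built-in guarantee, which is why encoder and decoder are evaluated together.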

The neural network is inherently probabilistic. An object-recognition network fed an image of a small dog, for instance, might conclude that the image has a 70 percent probability of depicting a dog and a 25 percent probability of depicting a cat. Similarly, the newly developed network offers alternatives for each word in a decoded sentence, along with the probability that each alternative is correct.
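Those per-word probabilities typically come from a softmax over raw decoder scores. The sketch below assumes made-up scores for one decoded position; only the softmax computation itself is standard.

```python
# Turn raw decoder scores for candidate words into probabilities via
# softmax. The scores for the second word of "She gasped in surprise"
# are invented for illustration.
import math

def softmax(scores):
    """Exponentiate scores and normalize them so they sum to 1."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

scores = {"gasped": 2.0, "squealed": 1.2, "shrieked": 0.4}
probs = softmax(scores)
best = max(probs, key=probs.get)
```

Each candidate now has a probability of being the correct word at that position, with the probabilities summing to one.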

Because the decoder chooses words so as to maximize decoding accuracy, its output probabilities define a cluster of semantically related sentences. For instance, if the encoded sentence is "She gasped in surprise," the system might assign a high probability to the alternative "She squealed in surprise."

Thus, for any sentence, the system can generate a list of closely related sentences.
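Generating that list can be sketched as single-word substitutions drawn from the per-position probabilities. The alternative words and their probabilities below are invented; the thresholding logic is a simplifying assumption, not the authors' exact procedure.

```python
# Build a cluster of semantically related sentences by substituting, at
# each position, any alternative word whose (made-up) probability clears
# a threshold.

def related_sentences(sentence, alternatives, threshold=0.2):
    """Yield nearby sentences via one-word substitutions."""
    words = sentence.split()
    cluster = []
    for i, word in enumerate(words):
        for alt, p in alternatives.get(word, {}).items():
            if alt != word and p >= threshold:
                cluster.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return cluster

# Hypothetical per-word alternatives with probabilities.
alternatives = {
    "gasped": {"gasped": 0.6, "squealed": 0.3, "shrieked": 0.05},
    "surprise": {"surprise": 0.7, "shock": 0.25},
}
cluster = related_sentences("She gasped in surprise", alternatives)
```

Low-probability alternatives ("shrieked" at 0.05 here) are excluded, so the cluster stays semantically close to the original sentence.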

The scientists applied their technique to three types of natural-language-processing systems: a system that inferred words' pronunciation; a set of translators, two automated and one human; and a simple computer dialogue system.

As expected, the analysis demonstrated strong dependencies between individual words in the input and output sequences. It even identified gender biases in the translations.

For instance, the nongendered English word “dancer” has two gendered translations in French, “danseur” and “danseuse.” The system translated the sentence “The dancer is charming” using the feminine: “la danseuse est charmante.”
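The dependency analysis can be sketched as follows: swap one input word at a time for a related word, re-run the black-box translator, and record which output positions change. The dictionary-based "translator" below is a made-up stand-in that, like the system in the article, defaults to the feminine form.

```python
# Perturbation-based dependency analysis for a black-box translator.
# toy_translate is a hypothetical stand-in; the real systems were
# neural and human translators.

def toy_translate(sentence):
    """Stand-in English-to-French translator with a gendered default:
    'dancer' always becomes the feminine 'danseuse'."""
    table = {"the": "la", "dancer": "danseuse", "singer": "chanteuse",
             "is": "est", "charming": "charmante"}
    return [table[w] for w in sentence.lower().split()]

def dependencies(sentence, related, translate):
    """Map each input position to the output positions it influences."""
    words = sentence.lower().split()
    base = translate(sentence)
    deps = {}
    for i, word in enumerate(words):
        for alt in related.get(word, []):
            out = translate(" ".join(words[:i] + [alt] + words[i + 1:]))
            changed = [j for j, (a, b) in enumerate(zip(base, out)) if a != b]
            deps.setdefault(i, set()).update(changed)
    return deps

related = {"dancer": ["singer"]}  # perturbations drawn from the sentence cluster
deps = dependencies("The dancer is charming", related, toy_translate)
```

Here perturbing input position 1 ("dancer") changes only output position 1, exposing a tight one-to-one dependency; a real run over many perturbed sentences would also reveal that the output stays feminine regardless of context, the kind of bias the article describes.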

Alvarez-Melis explained, “The other experiment we do is in flawed systems. If you have a black-box model that is not doing a good job, can you first use this kind of approach to identify the problems? A motivating application of this kind of interpretability is to fix systems, to improve systems, by understanding what they’re getting wrong and why.”
