Neural networks are used for classification tasks across domains, but determining whether they are consistent for arbitrary data distributions has been a long-standing open problem in machine learning.
MIT researchers discovered that neural networks can be designed to be “optimal,” meaning that they minimize the probability of misclassifying inputs, such as borrowers or patients, when given a large amount of labeled training data. These networks must be built with a specific architecture to be optimal.
The researchers discovered that the building blocks required for an optimal neural network are different from those developers use in practice. According to the researchers, the optimal building blocks derived from the new analysis are unconventional and have never been considered.
They describe these optimal building blocks, known as activation functions, and demonstrate how they can be used to design neural networks that achieve better performance on any dataset. This research could help developers select the appropriate activation function, allowing them to build neural networks that classify data more accurately in various application areas.
Caroline Uhler, co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and its Institute for Data, Systems, and Society (IDSS), said, “While these are new activation functions that have never been used before, they are simple functions that someone could actually implement for a particular problem. This work really shows the importance of having theoretical proof. If you go after a principled understanding of these models, that can actually lead you to new activation functions that you would otherwise never have thought of.”
A neural network is a machine-learning model inspired by the human brain. Researchers train a neural network to complete a task by showing it millions of examples from a dataset.
For example, an image encoded as numbers is presented to a network trained to classify images as either dogs or cats. Layer by layer, the network performs a series of complex multiplication operations until the result is just one number. If that number is positive, the network classifies the image as a dog; if it is negative, it classifies the image as a cat.
By applying a transformation to the output of one layer before data is sent to the next layer, activation functions help the network learn complex patterns in the input data. When building a neural network, researchers choose one activation function to use. They also decide on the network’s width and depth.
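The layer-by-layer process described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names, the hand-picked weights, and the choice of ReLU (a commonly used activation function) are all assumptions made for the example.

```python
def relu(x):
    """A commonly used activation function: max(0, x)."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """One layer: a weighted sum per unit, then an optional activation."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def classify(pixels, layers, activation=relu):
    """Pass the input through every layer and read the sign of the final number."""
    h = pixels
    for i, (weights, biases) in enumerate(layers):
        is_last = i == len(layers) - 1
        # The activation is applied between layers, not to the final score.
        h = dense(h, weights, biases, None if is_last else activation)
    score = h[0]
    return "dog" if score > 0 else "cat"

# Hypothetical weights for a tiny network of width 2 and depth 2.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer (2 units)
    ([[1.0, -1.0]], [0.0]),                    # output layer (1 number)
]

print(classify([0.9, 0.1], layers))
print(classify([0.1, 0.9], layers))
```

Swapping `relu` for another function changes the `activation` argument only; the width is the number of rows in each weight matrix, and the depth is the number of entries in `layers`.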
Radhakrishnan said, “It turns out that if you take the standard activation functions that people use in practice and keep increasing the depth of the network, it gives you really terrible performance. We show that if you design with different activation functions, as you get more data, your network will get better and better.”
Adityanarayanan Radhakrishnan and his colleagues investigated a case where a neural network is infinitely deep and wide and has been trained to perform classification tasks. They discovered that this type of network can learn to classify inputs in only three ways. One method classifies an input based on the majority label in the training data: if there are more dogs than cats, it decides that every new input is a dog. Another method assigns a new input the label of the training data point that most resembles it.
The third method classifies a new input using a weighted average of all similar training data points. According to their analysis, this is the only method of the three that results in optimal performance.
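To make the three classification methods concrete, here is a minimal Python sketch on a toy one-dimensional dataset. The data, the function names, and in particular the inverse-distance weighting in `weighted_average` are assumptions chosen for illustration; the article does not give the weighting the researchers derived.

```python
import math
from collections import Counter

# Toy training set: (point, label), with +1 = dog and -1 = cat.
train = [
    ((0.0,), 1), ((0.1,), 1), ((0.2,), 1),  # three dogs
    ((1.0,), -1), ((1.2,), -1),             # two cats
]

def majority_vote(train, x):
    """Ignore x and return the most common training label."""
    labels = [label for _, label in train]
    return Counter(labels).most_common(1)[0][0]

def nearest_neighbor(train, x):
    """Return the label of the single closest training point."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], x))
    return label

def weighted_average(train, x, power=2.0):
    """Average all labels, giving nearby points far more weight.

    The weight 1 / distance**power is an illustrative assumption.
    """
    total = 0.0
    for point, label in train:
        d = math.dist(point, x)
        if d == 0.0:
            return label  # x coincides with a training point
        total += label / d ** power
    return 1 if total > 0 else -1

query = (0.9,)  # a new input that sits close to the cats
print(majority_vote(train, query))
print(nearest_neighbor(train, query))
print(weighted_average(train, query))
```

For this query, the majority vote still answers "dog" because dogs outnumber cats in the training set, while the other two methods follow the nearby cat examples.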
He says, “That was one of the most surprising things: no matter what you choose for an activation function, it is just going to be one of these three classifiers. We have formulas that will tell you explicitly which of these three it is going to be. It is a very clean picture.”
The researchers discovered a set of activation functions that always lead the network to use the best of these classification methods.
They tested this theory on classification benchmarking tasks and found that it improved performance in many cases. According to Radhakrishnan, neural network builders could use their formulas to select an activation function that improves classification performance.
In the future, the researchers hope to apply what they’ve learned to analyze situations with limited data and networks that aren’t infinitely wide or deep. They also want to use this analysis in cases where the data does not have labels.
He says, “In deep learning, we want to build theoretically grounded models to reliably deploy them in some mission-critical setting. This is a promising approach at getting toward something like that, building architectures in a theoretically grounded way that translates into better results in practice.”
The National Science Foundation, the Office of Naval Research, the MIT-IBM Watson AI Lab, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award funded this research.
Journal Reference:
- Radhakrishnan, A., et al. Wide and deep neural networks achieve consistency for classification. Proceedings of the National Academy of Sciences. DOI: 10.1073/pnas.2208779120