Companies have been ramping up their research on convolutional neural networks, a “deep learning” method of training computers to recognize images and speech. When developed correctly, the technique can make computers nearly as capable as humans at those tasks.
Microsoft is one of these companies, and it recently announced advances in its technology after winning a computer vision challenge. Last Thursday, Microsoft took first place in several categories of the ImageNet and Microsoft Common Objects in Context challenges, which bring corporate and academic research teams together to compare approaches to recognizing objects, faces, and emotions in both photographs and videos.
(Related: How deep learning is making an impact on AI)
Jian Sun, a principal research manager at Microsoft Research who led the project, said the company’s system outperformed the competition and won all five main tasks with a single type of neural network and a single system, beating the other entrants on detection by a “large margin.”
Sun said that Microsoft has used deep neural networks for years in products such as the speech translation in Skype Translator.
Microsoft said its system is more effective than existing neural network systems because it enabled researchers to build neural nets roughly five times deeper than its previous ones. According to a Microsoft blog post, the deep neural network the researchers trained three years ago had eight layers, each containing small collections of neurons that look at portions of the input image. Last year, systems with 20 to 30 layers delivered the best results. The newest system has 152 layers, about five times more than any previous Microsoft system, according to Sun.
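To make the idea of stacked layers concrete, here is a minimal sketch (not Microsoft’s actual model) of a small convolutional stack in PyTorch. Each convolutional layer contains filters that look at local portions of the input image; the layer widths and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A tiny convolutional stack: each layer's filters see small patches of the image,
# and deeper layers effectively see progressively larger regions.
plain_stack = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # each filter looks at a 3x3 patch
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # a second layer stacked on the first
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # pool spatial features to a single vector
    nn.Flatten(),
    nn.Linear(64, 1000),                          # 1,000-way classifier (ImageNet-style)
)

# Example forward pass on a batch containing one 224x224 RGB image
scores = plain_stack(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 1000])
```

A real competition-grade network repeats this pattern across many more layers; the challenge, as Sun explains next, is what happens when that stack gets very deep.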
“Regarding future research, what we learned from our extremely deep networks is so powerful and generic that we think it can substantially improve many other vision tasks,” said Sun. “Because our system is a major expansion in the number of layers successfully used in a deep neural network, it’s an important development for the field, broadly.”
In practice, Sun said, simply adding more layers caused the training signal to vanish as it passed through each layer, which made the whole system difficult to train.
This is why the research team spent months figuring out how to add layers and still get results. They hit on a technique they call “deep residual networks,” which is what they used to win the ImageNet contest. Residual learning reformulates the learning procedure and redirects the flow of information through a deep neural network’s layers; by redirecting that flow, the researchers were able to train the whole system.
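A minimal sketch of a residual block, in the general form described in the deep residual learning work, is shown below. The layer widths and use of batch normalization are illustrative assumptions rather than Microsoft’s exact configuration; the key point is the shortcut connection that carries the input forward unchanged.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: two conv layers plus a shortcut that skips them."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut: input carried forward unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # the block only learns a residual correction
        return self.relu(out)

# Stacking many such blocks lets the signal (and its gradient) flow through the
# shortcuts, which is what makes very deep networks trainable in practice.
block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```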
These advances in deep neural networks can be used to improve Microsoft Project Oxford, according to the blog. While the company can’t get into all the specifics, Sun said the research will help the Project Oxford team improve how the Oxford APIs recognize images and perform face recognition. Today, Project Oxford is a set of artificial intelligence tools that developers can use.
Project Oxford expands Microsoft’s portfolio of machine-learning APIs that developers can use to add intelligence features like emotion and video detection, facial and vision recognition, and speech and language understanding. Developers can put these features directly into their applications without prior knowledge of AI, which means they can build more intelligent apps.
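In practice, calling such an API from an application looks roughly like the hypothetical sketch below. The endpoint URL, key, and response fields are placeholders for illustration only, not Project Oxford’s actual API surface; developers should consult Microsoft’s documentation for the real endpoints and parameters.

```python
import requests

ENDPOINT = "https://api.example.com/vision/analyze"  # placeholder, not the real endpoint
API_KEY = "your-subscription-key"                    # placeholder credential

# Send an image to the (hypothetical) analysis endpoint and print the result
with open("photo.jpg", "rb") as image_file:
    response = requests.post(
        ENDPOINT,
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,    # typical Azure-style key header (assumption)
            "Content-Type": "application/octet-stream",
        },
        data=image_file.read(),
    )

response.raise_for_status()
print(response.json())  # e.g. detected objects, faces, or emotions
```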
With both Project Oxford and deep neural networks, Sun said Microsoft wants to see “how far we can push the technology.”