The use of deep learning has grown rapidly over the past decade, thanks to the adoption of cloud-based technology and use of deep learning systems in big data, according to Emergen Research, which expects deep learning to become a $93 billion market by 2028.
But what exactly is deep learning and how does it work?
Deep learning is a subset of machine learning that uses neural networks to learn and make predictions. It has shown strong performance across a variety of tasks, whether they involve text, time series or computer vision. Its success comes primarily from the availability of large datasets and compute power. However, there is more to it than that, and this is what lets deep learning outperform classical machine learning algorithms on many of these tasks.
Deep learning: Neural networks and functions
A neural network is an interconnected network of neurons, with each neuron being a limited function approximator. Taken together, this is why neural networks are considered universal function approximators. If you recall from high school math, a function is a mapping from an input space to an output space. A simple sin(x) function maps from angular space (-180° to 180° or 0° to 360°) to real number space (-1 to 1).
Let’s see why neural networks are considered to be universal function approximators. Each neuron learns a limited function: f(.) = g(W*X), where W is the weight vector to be learned, X is the input vector and g(.) is a non-linear transformation. W*X can be visualized as a line being learned (a hyperplane in high-dimensional space), and g(.) can be any non-linear function that is differentiable almost everywhere, like sigmoid, tanh or ReLU (all commonly used in the deep learning community). Learning in neural networks is nothing but finding the optimum weight vector W. As an example, in y = mx+c, we have two weights: m and c. Depending on the distribution of points in 2D space, we find the optimum values of m and c that satisfy some criterion: the difference between the predicted y and the actual points is minimal across all data points.
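To make the two ideas above concrete, here is a minimal sketch in plain Python. The function names (sigmoid, neuron, fit_line) and the toy data are illustrative assumptions, not part of any library, and the line is fitted with the closed-form least-squares solution rather than with a neural network.

```python
import math

def sigmoid(z):
    """A common choice for the non-linear transformation g(.)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, activation=sigmoid):
    """Compute g(W*X): a weighted sum of the inputs passed through g."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

def fit_line(xs, ys):
    """Find m and c in y = m*x + c that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Toy data (an assumption for illustration): points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
print("m =", m, "c =", c)

# A single neuron with hand-picked weights; here the weighted sum is 0.
activation_value = neuron([0.5, -0.5], [1.0, 1.0])
print("neuron output =", activation_value)
```

In this sketch the intercept c plays the role of a weight attached to a constant input of 1, which is one common way to fold it into the weight vector W.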
The layer effect
Now that each neuron is a nonlinear function, we stack several such neurons in a “layer” where each neuron receives the same set of inputs but learns different weights W. Therefore, each layer has a set of learned functions: [f1, f2, …, fn], whose outputs are called hidden layer values. These values are combined again in the next layer: h(f1, f2, …, fn), and so on. This way, each layer is composed of functions from the previous layer (something like h(f(g(x)))). It has been shown that, through this composition and with enough neurons, a network can approximate a very broad class of complex non-linear functions.
Deep learning is a neural network with many hidden layers (usually taken to mean more than two hidden layers). But effectively, deep learning is a complex composition of functions from layer to layer, through which the network finds the function that defines a mapping from input to output. For example, if the input is an image of a lion and the output is the classification that the image belongs to the class of lions, then deep learning is learning a function that maps image vectors to classes. Similarly, if the input is a word sequence and the output is whether the sentence has a positive, neutral or negative sentiment, then deep learning is learning a map from input text to output classes: neutral, positive or negative.
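The stacking described above can be sketched as a forward pass in plain Python. The weights below are hand-picked for illustration rather than learned, and the helper names (layer, forward) are assumptions made for this example.

```python
def relu(z):
    """ReLU non-linearity: passes positive values, zeroes out the rest."""
    return max(0.0, z)

def identity(z):
    """No non-linearity, used here for the final output neuron."""
    return z

def layer(weight_rows, inputs, activation):
    """One layer: every neuron sees the same inputs but has its own weights."""
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

def forward(layers, inputs):
    """Compose layers so each one consumes the previous layer's values."""
    values = inputs
    for weight_rows, activation in layers:
        values = layer(weight_rows, values, activation)
    return values

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output.
network = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], relu),  # hidden layer [f1, f2, f3]
    ([[1.0, 1.0, 1.0]], identity),                   # h(f1, f2, f3)
]

hidden_values = layer(network[0][0], [2.0, 1.0], relu)
output = forward(network, [2.0, 1.0])
print("hidden layer values:", hidden_values)
print("network output:", output)
```

Adding more (weights, activation) pairs to the network list deepens the composition without changing the forward function.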
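As a rough sketch of the last step in the sentiment example, the snippet below turns three raw output scores into class probabilities and picks a label. The article does not name a specific method; softmax is assumed here because it is a common choice, and the scores are made up rather than produced by a trained network.

```python
import math

# Class order is an assumption made for this example.
LABELS = ["negative", "neutral", "positive"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most probable label together with all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs

# Hypothetical scores from the final layer for one input sentence.
label, probs = classify([0.2, 1.1, 3.0])
print(label, probs)
```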
Deep learning as interpolation
From a biological perspective, humans process images of the world by interpreting them hierarchically, bit by bit, from low-level features like edges and contours to high-level features like objects and scenes. Function composition in neural networks is in line with that, with each composition learning more complex features of an image. The most common neural network architecture used for images is the convolutional neural network (CNN), which learns those features in a hierarchical fashion, after which a fully connected neural network classifies the image features into different classes.
Returning to high school math: given a set of data points in 2D, we try to fit a curve through interpolation that roughly represents a function defining those data points. The more complex the function we fit (in interpolation, for example, complexity is determined by the polynomial degree), the more closely it fits the data; however, the less it generalizes to a new data point. This is where deep learning faces challenges, in what is generally referred to as the overfitting problem: fitting the data as closely as possible, but compromising on generalization. Almost all deep learning architectures have had to handle this important factor in order to learn a general function that performs comparably well on unseen data.
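The trade-off between fitting and generalizing can be illustrated with a small sketch in plain Python. The data is an assumption made up for this example: five points from y = 2x + 1 with a fixed alternating error of 0.5 added. A degree-4 polynomial (built with Lagrange interpolation) passes through every point, while a straight line fitted by least squares does not, yet the line does far better on a new point.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the degree len(xs)-1 polynomial through all (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Training data: y = 2x + 1 plus a fixed alternating error (illustrative).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.5, 5.5, 6.5, 9.5]

m, c = fit_line(xs, ys)
poly_train_error = mse([lagrange_eval(xs, ys, x) for x in xs], ys)
line_train_error = mse([m * x + c for x in xs], ys)

# A new point the models have not seen; the underlying function gives 11.
new_x, new_y = 5.0, 11.0
poly_new_error = abs(lagrange_eval(xs, ys, new_x) - new_y)
line_new_error = abs((m * new_x + c) - new_y)

print("train error  polynomial:", poly_train_error, " line:", line_train_error)
print("new-point error  polynomial:", poly_new_error, " line:", line_new_error)
```

The polynomial wins on the training points and loses badly on the new one, which is the overfitting pattern described above in miniature.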
A deep learning pioneer, Yann LeCun (creator of the convolutional neural network and ACM Turing award winner) posted on his Twitter handle (based on a paper): “Deep Learning is not as impressive as you think because it is mere interpolation resulting from glorified curve fitting. But in high dimensions, there is no such thing as interpolation. In high dimensions, everything is extrapolation.” Thus, as part of function learning, deep learning is doing nothing but interpolation or in some cases, extrapolation. That’s all!
The learning aspect
So, how do we learn this complex function? It depends on the problem at hand, which in turn determines the neural network architecture. If we are interested in image classification, then we use a CNN. If we are interested in time-dependent predictions or text, then we use an RNN or transformers, and if we have a dynamic environment (like car driving), then we use reinforcement learning. Apart from this, learning involves handling different challenges:
- Ensuring the model learns a general function and doesn’t just fit the training data; this is handled by using regularization.
- Choosing the loss function, which depends on the problem at hand; loosely speaking, the loss function is the error between what we want (true value) and what we currently have (current prediction).
- Gradient descent is the algorithm used for converging to an optimal function; choosing the learning rate is challenging because when we are far from the optimum we want to move toward it quickly, and when we are near it we want to move slowly to make sure we converge, ideally to the global minimum.
- Networks with a high number of hidden layers need to handle the vanishing gradient problem; architectural changes like skip connections and an appropriate non-linear activation function help to mitigate it.
Now that we know deep learning is simply learning a complex function, it brings other compute challenges:
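Three of the ingredients in the list above (a loss function, regularization and gradient descent with a learning rate) can be sketched together on the earlier y = mx+c example. This is a minimal illustration under stated assumptions: the loss is mean squared error, the regularization is an L2 penalty on m, and the optional decay parameter shrinks the learning rate over time. The function name train and all default values are choices made for this example.

```python
def train(xs, ys, lr=0.05, l2=0.0, decay=0.0, steps=2000):
    """Fit y = m*x + c by gradient descent on mean squared error.

    l2    : strength of an L2 penalty on m (regularization).
    decay : shrinks the step size over time, so early steps are larger
            and later steps are smaller.
    """
    m, c = 0.0, 0.0
    n = len(xs)
    for step in range(steps):
        step_lr = lr / (1.0 + decay * step)
        # Error between the current prediction and the true value.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Gradients of the loss (MSE + l2 * m^2) with respect to m and c.
        grad_m = 2.0 / n * sum(e * x for e, x in zip(errors, xs)) + 2.0 * l2 * m
        grad_c = 2.0 / n * sum(errors)
        # Move against the gradient.
        m -= step_lr * grad_m
        c -= step_lr * grad_c
    return m, c

# Toy data on the line y = 2x + 1 (illustrative).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, c = train(xs, ys)
m_reg, c_reg = train(xs, ys, l2=0.1)
print("no regularization:", m, c)
print("with L2 penalty:  ", m_reg, c_reg)
```

With the penalty switched on, the learned slope is pulled toward zero, which is the sense in which regularization trades a little training accuracy for a simpler function.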
- To learn a complex function, we require a large amount of data
- To process large data, we require fast compute environments
- We need an infrastructure that supports such environments
Parallel processing with CPUs is not enough to compute millions or billions of weights (also called the parameters of a deep learning model). Learning these weights requires vector (or tensor) multiplications. That is where GPUs come in handy, as they can do parallel vector multiplications very fast. Depending on the deep learning architecture, data size and task at hand, we sometimes require one GPU and sometimes several of them, a decision the data scientist needs to make based on known literature or by measuring the performance on one GPU.
With the use of a proper neural network architecture (number of layers, number of neurons, non-linear function, etc.) along with large enough data, a deep learning network can, in principle, learn a mapping from one vector space to another vector space. That’s what makes deep learning such a powerful tool for any machine learning task.
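To get a feel for how quickly the number of weights grows, here is a small sketch that counts the parameters of a fully connected network from its layer sizes. The layer sizes are hypothetical, and the bias term per neuron is an assumption that goes beyond the f(.) = g(W*X) form used earlier; set include_bias=False to count the weights alone.

```python
def count_parameters(layer_sizes, include_bias=True):
    """Count learnable values in a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    Each pair of adjacent layers contributes n_in * n_out weights,
    plus n_out biases when include_bias is True.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
        if include_bias:
            total += n_out
    return total

# Hypothetical network: 784 inputs, two hidden layers of 512, 10 outputs.
sizes = [784, 512, 512, 10]
with_bias = count_parameters(sizes)
weights_only = count_parameters(sizes, include_bias=False)
print("parameters with biases:", with_bias)
print("weights only:", weights_only)
```

Even this small network has well over half a million parameters, and each training step multiplies inputs against all of them, which is the kind of workload the paragraph above has in mind.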
Abhishek Gupta is the principal data scientist at Talentica Software.