Ten Minutes to Deeply Understand Ten Key Concepts for AI Development

baoshi.rao

In AI, as in other disciplines, regularly reflecting on the history of the field, summarizing its current state of development, and identifying its most important concepts helps one stay on a consistent path. Software engineer James Le recently drew on his experience to summarize ten deep learning methods that are essential for AI research, and the summary is highly instructive.

First, let's define what 'deep learning' is. For many, defining 'deep learning' is indeed challenging because its form has evolved significantly over the past decade.

Let's start by visualizing the position of 'deep learning.' The diagram below illustrates the relationship between the three concepts: AI, machine learning, and deep learning.

The field of AI is relatively broad, machine learning is a subfield of AI, and deep learning is a subset within the field of machine learning.

There are some differences between deep learning networks and 'classical' feedforward multilayer networks, as follows:

Deep learning networks have more neurons than previous networks

Deep learning networks have more complex ways of connecting layers

Deep learning networks require powerful computing capabilities for training

Deep learning networks can perform automatic feature extraction

Therefore, deep learning can be defined as neural networks with a large number of parameters and layers within the following four basic network frameworks:

Unsupervised Pre-trained Networks
Convolutional Neural Networks
Recurrent Neural Networks
Recursive Neural Networks

In this article, I'm primarily interested in comparing the latter three frameworks.

Convolutional Neural Networks are essentially standard neural networks extended spatially through shared weights. CNNs are primarily designed to recognize images through internal convolutions that can detect edges of objects to be identified.

Recurrent Neural Networks are basically standard neural networks extended temporally, where edges flow into the next time step rather than the next layer at the same time step. RNNs are mainly designed to recognize sequences, such as speech signals or text. Their recurrent nature means the network has short-term memory.

Recursive Neural Networks are more similar to hierarchical networks where the input sequence doesn't have a true temporal dimension, but rather the input must be processed hierarchically in a tree-like manner. The following 10 methods can be applied to all these architectures.

1. Backpropagation

Backpropagation, short for "backward propagation of errors," is a method for computing the partial derivatives (the gradient) of a function that has the form of a function composition, as neural networks do. When you solve an optimization problem with a gradient-based method (gradient descent is just one such method), you need to compute the gradient of the function in each iteration.
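As a minimal sketch of this idea, the chain rule can be applied by hand to a tiny composite function and the result compared against a numerical finite-difference approximation. The network shape (one tanh unit feeding one sigmoid unit), the squared-error loss, and all function names are assumptions chosen for illustration, not something taken from the original article.

```python
import math

def forward(w1, w2, x):
    """Tiny composite function: y = sigmoid(w2 * tanh(w1 * x))."""
    h = math.tanh(w1 * x)
    y = 1.0 / (1.0 + math.exp(-(w2 * h)))
    return h, y

def loss(w1, w2, x, target):
    """Squared error between the output and the target."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Analytic gradient of the loss w.r.t. (w1, w2) via the chain rule."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target              # derivative of the squared error
    dL_dz = dL_dy * y * (1.0 - y)   # through the sigmoid, z = w2 * h
    dL_dw2 = dL_dz * h
    dL_dh = dL_dz * w2              # error propagated back to the hidden unit
    dL_dw1 = dL_dh * (1.0 - h ** 2) * x   # through tanh, a = w1 * x
    return dL_dw1, dL_dw2

def finite_difference(w1, w2, x, target, eps=1e-6):
    """Central differences: two loss evaluations per parameter."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2

analytic = backprop(0.5, -1.2, 2.0, 1.0)
numeric = finite_difference(0.5, -1.2, 2.0, 1.0)
print(analytic, numeric)
```

The two results should agree to several decimal places. The numerical version needs extra loss evaluations for every parameter, which is why it is usually reserved for checking an analytic implementation.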

For neural networks, the objective function has a composite form. So how do you compute the gradient? Generally, there are two common methods:

Analytical differentiation. When you know the form of the function, you only need to compute the derivative using the chain rule.
Finite difference methods to approximate differentiation. This method is computationally expensive because the number of function evaluations is O(N), where N is the number of parameters, which makes it more costly than analytical differentiation. However, finite differences are often used to verify a backpropagation implementation during debugging.

2. Stochastic Gradient Descent

An intuitive way to understand gradient descent is to imagine a river tracing its way down from a mountain peak. This river flows along the gradient direction of the mountain terrain toward the lowest point at the mountain's base.

If a human were to descend, the approach might differ. You might first choose a random direction and follow its gradient downward; after a while, you might randomly switch to another downward direction; eventually, you'd find yourself roughly at the valley bottom.

Mathematically speaking:

Stochastic gradient descent is primarily used to solve optimization problems whose objective can be expressed as a sum of n terms f_i, typically one term per training example.

Gradient descent method: every iteration needs the gradients of all n terms. When n is large, calculating all of them in each iteration can be very time-consuming.

The idea of stochastic gradient descent is to randomly select a single ∇f_i in each iteration instead of computing the full sum of all ∇f_i, and to use this randomly chosen direction as the descent direction. In practice, this often reaches a (local) optimum faster than full gradient descent.
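The difference between the two update rules can be sketched on a toy least-squares problem. The data set (noiseless points on y = 3x), the learning rate, the step counts, and the function names below are all assumptions made for illustration.

```python
import random

def grad_i(w, x, y):
    """Gradient of one term f_i(w) = (w * x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def gradient_descent(data, lr=0.1, steps=100):
    """Full gradient descent: every step uses all n per-example gradients."""
    w = 0.0
    for _ in range(steps):
        # Averaging instead of summing only rescales the learning rate.
        g = sum(grad_i(w, x, y) for x, y in data) / len(data)
        w -= lr * g
    return w

def stochastic_gradient_descent(data, lr=0.1, steps=1000, seed=0):
    """SGD: every step uses the gradient of one randomly chosen term."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        w -= lr * grad_i(w, x, y)
    return w

# Noiseless toy data on the line y = 3x, so the optimum is w = 3.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
print(gradient_descent(data), stochastic_gradient_descent(data))
```

Each SGD step here costs one gradient evaluation, while each full gradient descent step costs twenty. On larger data sets that gap is what makes the stochastic variant attractive.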

3. Learning Rate Decay

During model training, we often encounter this situation: after balancing training speed and loss, we select a relatively appropriate learning rate, but the training loss stops decreasing after reaching a certain level. For example, the training loss keeps oscillating between 0.7 and 0.9 without further reduction, as shown in the following figure:

In such cases, it can usually be addressed by appropriately reducing the learning rate. However, lowering the learning rate will extend the required training time. Learning rate decay is a solution that balances the trade-off between training speed and model performance. The basic idea is that the learning rate gradually decreases as training progresses.

There are two main methods to implement learning rate decay:

Linear decay: For example, lowering the learning rate by a fixed amount every 5 epochs.
Exponential decay: For example, multiplying the learning rate by a fixed factor, such as 0.5 or 0.1, every 5 epochs.
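Two simple schedules can be sketched as plain functions of the epoch number: one lowers the rate by a constant amount per epoch, the other multiplies it by a constant factor every few epochs. The function names, parameters, and default values are hypothetical and only serve to make the idea concrete.

```python
def linear_decay(initial_lr, epoch, total_epochs, final_lr=0.0):
    """Move from initial_lr to final_lr in equal steps over total_epochs."""
    frac = min(epoch, total_epochs) / total_epochs
    return initial_lr + frac * (final_lr - initial_lr)

def exponential_step_decay(initial_lr, epoch, factor=0.1, every=5):
    """Multiply the learning rate by `factor` once every `every` epochs."""
    return initial_lr * factor ** (epoch // every)

for epoch in (0, 4, 5, 10):
    print(epoch, linear_decay(0.1, epoch, 10), exponential_step_decay(0.1, epoch))
```

In a training loop, the chosen function would be called at the start of each epoch to set the optimizer's learning rate.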

4. Dropout

Current large-scale neural networks have two main drawbacks:

Prone to overfitting

Dropout can effectively address this issue. Simply put, Dropout randomly sets a neuron's activation to zero with a certain probability p during forward propagation, as illustrated below:

Each time Dropout is applied, it essentially results in a thinner network derived from the original one.

Hinton drew an analogy in his paper: asexual reproduction preserves large segments of excellent genes, while sexual reproduction randomly breaks up and recombines genes, disrupting the co-adaptation of large gene segments. Yet natural selection has favored sexual reproduction, which demonstrates its power. Dropout achieves a similar effect: it forces each neuron to work with a randomly selected set of other neurons, which reduces co-adaptation between neurons and improves generalization.
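A minimal sketch of the forward pass might look as follows. It uses the common "inverted dropout" variant, which rescales the surviving activations by 1/(1-p) so that their expected value is unchanged. That rescaling, the function name, and the parameters are assumptions not stated in the text above.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p (inverted dropout).

    Survivors are divided by (1 - p) so the expected activation stays
    the same; at inference time the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

layer_output = [0.5, 1.0, 1.5, 2.0]
print(dropout(layer_output, p=0.5, rng=random.Random(0)))
print(dropout(layer_output, p=0.5, training=False))
```

Every call with training=True samples a different mask, which corresponds to training a different thinned network on each pass.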

5. Max Pooling

Pooling is another crucial concept in convolutional neural networks, essentially a form of downsampling. Among various nonlinear pooling functions, "Max Pooling" is the most common. It divides the input image into rectangular regions and outputs the maximum value for each sub-region. Intuitively, this mechanism works effectively because after detecting a feature, its exact location is far less important than its relative position to other features. Pooling layers continuously reduce the spatial size of data, thereby decreasing the number of parameters and computational load, which to some extent also controls overfitting. Typically, pooling layers are periodically inserted between convolutional layers in CNNs.
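The operation itself is small enough to sketch in a few lines. The example below assumes a 2D input stored as a list of rows and a square window; the function name and the default window size and stride of 2 are illustrative choices.

```python
def max_pool_2d(image, size=2, stride=2):
    """Slide a size x size window over the image and keep each maximum."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

image = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]
print(max_pool_2d(image))  # [[6, 4], [7, 9]]
```

The 4x4 input shrinks to 2x2, so any layer that follows has a quarter as many spatial positions to process.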

6. Batch Normalization

Neural networks, including deep networks, require careful tuning of weight initialization and learning parameters. Batch normalization makes these tasks much easier.

Weight issues:

Regardless of how weights are initialized, whether randomly or through empirical selection, they are far from the learned weights. Consider a mini-batch: during the early epochs, there will be many outliers among the required feature activations.

Deep neural networks are inherently ill-conditioned: a small perturbation in the initial layers can lead to very large changes in later layers. During backpropagation, these effects distract the gradients, which must compensate for the outliers before the weights can learn to produce the desired output. This requires additional epochs for convergence.

Batch normalization keeps the gradients from being distracted by outliers and directs them toward a common goal by normalizing within each mini-batch.

Learning rate issue: Generally, learning rates need to be kept low so that only a small portion of gradients correct the weights, because the goal is to prevent gradients from abnormal activations from affecting learned activations. With batch normalization, these abnormal activations can be reduced, allowing higher learning rates to accelerate the learning process.
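The normalization step can be sketched for a single feature across one mini-batch. The scale and shift parameters gamma and beta and the small constant eps follow the commonly used formulation of batch normalization and are assumptions here, since the text above does not spell them out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# One outlier (100.0) dominates the raw activations.
activations = [1.0, 2.0, 3.0, 100.0]
normalized = batch_norm(activations)
print(normalized)
```

After normalization the values have roughly zero mean and unit variance, so a single abnormal activation can no longer push the next layer's inputs to an arbitrary scale.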

7. Long Short-Term Memory

LSTM networks have three key aspects that differentiate them from common neurons in recurrent neural networks:

1) They can decide when to let inputs enter the neuron;
2) They can decide when to remember what was computed in the previous time step;
3) They can decide when to pass the output on to the next time step.

The beauty of LSTM lies in its ability to make all these decisions based on the current input itself, as the following diagram shows:

The input signal x(t) at the current time determines all three points above: the input gate decides point 1, the forget gate decides point 2, and the output gate decides point 3. A single input is enough to make all three decisions. (In the standard formulation, each gate also receives the previous hidden state in addition to x(t).) This design is inspired by how our brains work and can handle sudden context switches driven by the input.
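A single time step of an LSTM cell can be sketched with scalar values to show where the three gates act. This follows the standard formulation, in which each gate sees both the current input and the previous hidden state. The parameter layout (a dictionary mapping each gate name to an input weight, a recurrent weight, and a bias) is a hypothetical choice made for readability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of "i", "f", "o", "g" to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new content enters
    f = sigmoid(pre("f"))      # forget gate: how much old memory is kept
    o = sigmoid(pre("o"))      # output gate: how much memory is exposed
    g = math.tanh(pre("g"))    # candidate content computed from the input
    c = f * c_prev + i * g     # updated cell state (the memory)
    h = o * math.tanh(c)       # output passed to the next time step
    return h, c

# Gates biased so the cell keeps its memory and ignores the new input.
remember = {"i": (0.0, 0.0, -20.0), "f": (0.0, 0.0, 20.0),
            "o": (0.0, 0.0, 20.0), "g": (1.0, 0.0, 0.0)}
print(lstm_step(1.0, 0.0, 0.7, remember))
```

With the biases chosen above the cell state stays close to 0.7. Swapping the biases of the input and forget gates would instead overwrite the memory with the new candidate.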

8. Skip-gram

The goal of word embedding models is to learn a high-dimensional dense representation for each term, such that the similarity between embedding vectors reflects the semantic or syntactic similarity between the corresponding words. Skip-gram is a model for learning word embeddings. The main idea behind the skip-gram model (and many other word embedding models) is as follows: two terms are similar if they share similar contexts.

In other words, suppose you have a sentence like "Cats are mammals"; if you replace "cats" with "dogs", the sentence still makes sense. Therefore, in this example, "dogs" and "cats" can share the same context (i.e., "are mammals").

Based on this assumption, you can consider a context window (a window containing k consecutive terms), single out one of its words, and train a neural network on that window. In the skip-gram formulation used by word2vec, the network receives the singled-out word and predicts the other terms in the window; predicting the singled-out word from the other terms is the continuous bag-of-words setup described in the next section. If two words repeatedly share similar contexts in a large corpus, their embedding vectors will end up close to each other.
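The training data for such a model can be sketched as a list of (input word, target word) pairs. The example assumes the common word2vec convention in which the center word is the input and each surrounding word within the window is a prediction target. The function name and the window parameter are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["cats", "are", "mammals"], window=1))
# [('cats', 'are'), ('are', 'cats'), ('are', 'mammals'), ('mammals', 'are')]
```

Running the same function on "dogs are mammals" gives "dogs" the same targets that "cats" had, which is what pulls their embeddings together during training.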

9. Continuous Bag of Words

In natural language processing problems, we want to learn to represent each word in a document as a numerical vector, so that words appearing in similar contexts have vectors close to each other. In the continuous bag of words model, the goal is to use the context surrounding a specific word to predict that specific word. We achieve this by taking numerous sentences from a large corpus. Whenever we encounter a word, we extract the surrounding words. Then, we input these contextual words into a neural network to predict the word that should appear in the middle of this context.

When we have thousands of such pairs of context words and center words, each pair is one training instance for a neural network. We train the network, and in the end the output of the hidden (encoding) layer serves as the embedding of a given word. Conveniently, when we train on a vast number of sentences, words in similar contexts end up with similar vectors.
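The extraction of training instances described above can be sketched as follows: for every position, the surrounding words form the input and the word in the middle is the target. The function name and the window size are assumptions for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) training instances."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples

for context, center in cbow_examples(["the", "cat", "sat", "down"], window=1):
    print(context, "->", center)
```

In a full model the context words would be mapped to vectors, combined (for example averaged) in the hidden layer, and used to predict the center word; only the data preparation is shown here.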

10. Transfer Learning

Let's think about how an image is processed in a CNN. Suppose there's an image; you perform convolution on it, and the output you get is a combination of pixels, which we might temporarily call "edges." We use convolution again, and this time the output will be a combination of edges, which we'll call "lines." If we apply convolution once more, we'll get combinations of lines, and so on.

Each layer is searching for specific patterns corresponding to that level. The final layer of your neural network typically identifies very specific patterns. Perhaps you're working with ImageNet, and your network's final layer might be looking for children, dogs, airplanes, or anything else. If you look two layers back, the network might be searching for eyes, ears, mouths, or wheels. Each layer in a deep convolutional neural network progressively builds higher-level feature representations. The final layers capture specific patterns in your input data, while earlier layers extract broader features containing numerous simple patterns across multiple classes.

Transfer learning occurs when you train a CNN on one dataset, remove the final layer(s), then retrain these last layer(s) with a different dataset. Intuitively, you're retraining the model to recognize different high-level features. This approach dramatically reduces training time, making it particularly valuable when working with limited data or computational resources.
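The retraining step can be sketched without any deep learning library: a fixed function stands in for the frozen, already-trained layers, and only a new final layer is fitted on the new data. Everything in this example is hypothetical, including the stand-in feature extractor, the toy data set, and the logistic output layer trained with plain stochastic gradient descent.

```python
import math
import random

def pretrained_features(x):
    """Stand-in for frozen, already-trained layers (hypothetical)."""
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1])]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_new_head(data, epochs=200, lr=0.5, seed=0):
    """Fit only the new final layer; the feature extractor is never updated."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    order = list(data)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(order)
        for x, label in order:
            feats = pretrained_features(x)
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    feats = pretrained_features(x)
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b) >= 0.5 else 0

# New task: label is 1 when the two inputs sum to a positive number.
new_data = [((1.0, 1.0), 1), ((2.0, 0.5), 1), ((0.5, 1.5), 1),
            ((-1.0, -1.0), 0), ((-0.5, -2.0), 0), ((-1.5, 0.2), 0)]
w, b = train_new_head(new_data)
print([predict(x, w, b) for x, _ in new_data])
```

Only three numbers are learned here (two weights and a bias), which mirrors why retraining just the last layers needs far less data and time than training a whole network.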

Deep learning is highly technical, often presenting new concepts without extensive explanations. Most innovations are validated primarily through experimental results. Mastering deep learning resembles playing with LEGO – challenging to perfect but relatively easy to begin.