Deep Learning | Notes for Deep Learning Lessons of Prof. Hung-yi Lee (4)

1. Tips for DNN
In this lesson, Prof. Lee taught us some tips for training deep neural networks, which include:

  1. Adaptive Learning Rate
  2. New Activation Function
  3. Dropout
  4. Regularization
  5. Early Stopping
[Figure]

1.1 Adaptive Learning Rate
The knowledge about adaptive learning rates has already been introduced in my previous blog: Notes for Deep Learning Lessons of Prof. Hung-yi Lee (2).
1.2 New Activation Function
The reason that we need to find a new activation function, rather than the previous Sigmoid function, can be explained in the following figure. Because the Sigmoid function maps a large range of inputs to a small range of outputs, the influence of the input layer becomes smaller and smaller during forward propagation. From the perspective of back-propagation, the gradient at the input layer becomes so small that we cannot train the earlier layers. This is the vanishing gradient problem.
[Figure]
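As a minimal sketch of this effect (the layer sizes, weight scale, and variable names are my own assumptions, not from the lesson), the following NumPy snippet back-propagates a gradient through a stack of sigmoid layers and shows its norm shrinking layer by layer, since the sigmoid derivative is at most 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(0)
n_layers, width = 10, 32
# Small random weights, as in a typical naive initialization.
weights = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

# Forward pass, keeping each layer's activations for back-propagation.
a = np.random.randn(width)
acts = []
for W in weights:
    a = sigmoid(W @ a)
    acts.append(a)

# Back-propagate a unit gradient and watch its norm shrink toward the input layer.
grad = np.ones(width)
for W, a in reversed(list(zip(weights, acts))):
    grad = W.T @ (grad * a * (1 - a))   # sigmoid'(z) = a * (1 - a) <= 0.25
    print(f"gradient norm: {np.linalg.norm(grad):.2e}")
```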

With the aim of solving this problem, the new activation function introduced in this lesson is ReLU. For positive inputs, the activation function does not change the value of the input, which keeps the gradient from vanishing. For negative inputs, the output is zero, which resembles the working process of our brain: most neurons in the brain are not excited, and they fire only when the stimulation exceeds a certain threshold.
[Figure]

The states of the different neurons can be shown in the following figure. Activated neurons do not change the value of their input, while inactive neurons output zero, so we can regard them as if they did not exist.
[Figure]

So, the effective structure of the neural network can be shown as follows:
[Figure]

ReLU has some other variants. Some people think the gradient should not equal zero but a very small value when the input is less than zero, so the left one (Leaky ReLU) was proposed. Others think this slope should be a learnable parameter, so the right one (Parametric ReLU) was developed.
[Figure]
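As a small sketch (the function names and the example slope are my own), the three variants can be written as:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Fixed small slope for negative inputs.
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, alpha):
    # alpha is a learnable parameter, trained together with the weights.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), parametric_relu(x, alpha=0.1))
```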

ReLU actually gives us a linear activation function, but with only one fixed linear structure. Can we find an activation function that contains different linear structures for different input values? The answer is yes: Maxout can do this.
The following figure tells us that Maxout can do the same thing as ReLU.
[Figure]

The following figures show the way in which Maxout offers different linear structures within one activation function.
[Figure]

[Figure]
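A minimal sketch of a Maxout unit (the group size of two and the variable names are my own assumptions): each unit computes several linear pre-activations and outputs their maximum, and if one of the pieces is fixed to zero it reduces exactly to ReLU.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: W has shape (k, d), b has shape (k,).
    Each of the k rows is one linear piece; the output is their maximum."""
    z = W @ x + b          # k candidate pre-activations
    return np.max(z)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# With k = 2 and the second piece fixed to zero, Maxout behaves like ReLU.
w = rng.standard_normal(4)
W = np.stack([w, np.zeros(4)])
b = np.zeros(2)
print(maxout(x, W, b), max(0.0, w @ x))   # the two values agree
```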

1.3 Early Stopping
Early stopping is an effective way to deal with the problem of overfitting. We stop training the neural network before it reaches the minimum of the loss on the training set, typically at the point where the loss on the validation set stops decreasing.
[Figure]
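A minimal runnable sketch of the idea (the toy 1-D regression data, the patience value, and the tolerance are my own assumptions): keep the parameters that gave the best validation loss and stop once it has not improved for a while.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D regression data split into training and validation sets.
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + rng.normal(0, 0.3, 100)
x_tr, y_tr, x_va, y_va = x[:80], y[:80], x[80:], y[80:]

w, lr = 0.0, 0.05
best_val, best_w, patience, bad = float("inf"), w, 5, 0

for epoch in range(200):
    grad = np.mean(2 * (w * x_tr - y_tr) * x_tr)   # gradient of the training MSE
    w -= lr * grad
    val = np.mean((w * x_va - y_va) ** 2)          # validation loss
    if val < best_val - 1e-6:
        best_val, best_w, bad = val, w, 0          # new best: remember these weights
    else:
        bad += 1
        if bad >= patience:
            break   # stop before the training loss reaches its minimum

print(epoch, best_w, best_val)
```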

1.4 Regularization
Regularization adds a penalty term on the parameters to the loss function in order to avoid overfitting, so that the network prefers small weights.
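For example, with L2 regularization (the notation here is my own), the regularized loss and the resulting gradient-descent update are:

```latex
L'(\theta) = L(\theta) + \lambda \sum_i \theta_i^2,
\qquad
\theta_i \leftarrow (1 - 2\eta\lambda)\,\theta_i - \eta \frac{\partial L}{\partial \theta_i}
```

Because each update first shrinks the weight by a factor of (1 - 2ηλ), this is also known as weight decay.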
1.5 Dropout
Dropout is also a method to avoid overfitting. Dropout means that every time we update the network during training, we randomly drop some of the neurons, so each update is performed on a thinner network.
[Figure]

When we use our network on the testing set, we should not drop any part of the network, and we should multiply all the weights by (1 - p%) if our dropout rate is p% during training.
[Figure]
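A minimal sketch of this rule (the shapes, dropout rate, and variable names are my own assumptions): during training each incoming activation is zeroed with probability p, and at test time nothing is dropped but the weights are scaled by (1 - p) so the expected scale matches training.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                       # dropout rate used during training
W = rng.standard_normal((3, 4))
a = rng.standard_normal(4)    # activations coming into this layer

# Training: each incoming activation is dropped (set to zero) with probability p.
mask = (rng.random(a.shape) >= p).astype(float)
train_out = W @ (a * mask)

# Testing: no dropout, but the weights are multiplied by (1 - p).
test_out = ((1 - p) * W) @ a
print(train_out, test_out)
```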

Dropout can be seen as a method of ensemble, just like Random Forest or XGBoost. The reason why we can regard dropout as a kind of ensemble method can be explained by the following figures.
[Figure]

[Figure]
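As a small sketch of the ensemble view (a single linear unit with my own toy numbers): for a linear unit, the average output over all thinned networks equals the output of the full network with weights scaled by (1 - p), which is why the test-time scaling approximates averaging the whole ensemble.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
p = 0.5
w = rng.standard_normal(4)
a = rng.standard_normal(4)

# Enumerate all 2^4 dropout masks of one linear unit and average their outputs.
outputs = [w @ (a * np.array(mask)) for mask in product([0.0, 1.0], repeat=4)]
print(np.mean(outputs))       # ensemble average over all thinned networks
print(((1 - p) * w) @ a)      # single network with weights scaled by (1 - p)
```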
