22_Weight Initialization Implementation

Benefits of good weight initialization

  • Gradient descent can converge to a minimum more quickly.

  • It is more likely to reduce training error and improve generalization.

Classifier implementation problem

We want to build a classifier that separates red dots from blue dots. Through matplotlib's rcParams, you can change a surprisingly large number of default settings, such as figure size and fonts.

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()

Zero Initialization

The weight matrix of layer l has shape (number of hidden units in the current layer, number of hidden units in the previous layer), i.e. (layers_dims[l], layers_dims[l-1]). After initializing everything to zero, we can see that the network does not learn at all.

def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)            # number of layers in the network
    
    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters
parameters = initialize_parameters_zeros([3,2,1])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
plt.title("Model with Zeros initialization")
axes = plt.gca()
axes.set_xlim([-1.5,1.5])
axes.set_ylim([-1.5,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

The model is predicting 0 for every example.

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so you might as well be training a neural network with n^{[l]} = 1 for every layer; the network is no more powerful than a linear classifier such as logistic regression. (A short numerical sketch follows the two bullets below.)

  • The weights W^{[l]} should be initialized randomly to break symmetry.

  • It is, however, okay to initialize the biases b^{[l]} to zeros. Symmetry is still broken as long as W^{[l]} is initialized randomly.
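
As a quick illustration of the symmetry problem, here is a self-contained sketch (it does not use init_utils; the tiny one-hidden-layer network, the constant c, and the variable names are only for this example): when every weight starts at the same value, the hidden units compute identical activations and receive identical gradients, so they stay identical after every update.

import numpy as np

np.random.seed(0)
X = np.random.randn(2, 5)                      # toy data: 2 features, 5 examples
Y = (np.random.rand(1, 5) > 0.5).astype(float) # arbitrary binary labels

c = 0.5                                        # every weight gets the same constant
W1, b1 = np.full((2, 2), c), np.zeros((2, 1))  # toy 1-hidden-layer net, not the 3-layer model above
W2, b2 = np.full((1, 2), c), np.zeros((1, 1))

# forward pass: LINEAR -> RELU -> LINEAR -> SIGMOID
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))

# backward pass of the cross-entropy loss
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (Z1 > 0)
dW1 = dZ1 @ X.T / X.shape[1]

print(np.allclose(A1[0], A1[1]))   # True: both hidden units output the same activations
print(np.allclose(dW1[0], dW1[1])) # True: ... and receive identical gradient rows

With zeros specifically the situation is even more extreme: every hidden activation is relu(0) = 0, so the output is sigmoid(0) = 0.5 for every example, and the usual a3 > 0.5 threshold maps all of them to class 0, which is exactly the behavior observed above.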

Random Initialization

A three-layer neural network (LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID) is now trained with randomly initialized weights.
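
The initialization cell itself is omitted here, so the following is a minimal sketch of what initialize_parameters_random presumably looks like, based on the hint quoted in the He section below (weights drawn from np.random.randn and scaled by 10, biases set to zero); the seed and the [3, 2, 1] test dimensions are arbitrary.

def initialize_parameters_random(layers_dims):
    np.random.seed(3)               # arbitrary fixed seed, only for reproducibility
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # deliberately large weights (scaled by 10) to expose the problem discussed below
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

parameters = initialize_parameters_random([3, 2, 1])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))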

  • The loss starts from a relatively large value because the weights were initialized to large values: the final sigmoid then outputs values very close to 0 or 1, and when such an example is misclassified the loss term log(a^{[3]}) = log(0) diverges to infinity. (A small numerical sketch follows this list.)

  • Therefore, poor weight initialization can lead to vanishing/exploding gradients, which slows down optimization.

  • If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
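
A tiny self-contained sketch of the effect described in the first bullet (the 200-feature input, the 0.01 / 10 scales, and the clipping constant are arbitrary choices for illustration): with weights drawn as 10 * randn, the pre-activation is large, the sigmoid saturates toward 0 or 1, and one of the cross-entropy terms becomes very large.

import numpy as np

np.random.seed(1)
x = np.random.randn(200, 1)                    # one example with 200 features (arbitrary size)
sigmoid = lambda z: 1. / (1. + np.exp(-z))

for scale in (0.01, 10.):
    W = np.random.randn(1, 200) * scale
    a = np.clip(sigmoid(W @ x), 1e-15, 1 - 1e-15).item()   # clip only to avoid log(0)
    # one of the two cross-entropy terms explodes when the sigmoid saturates
    print(f"scale={scale}: a={a:.3e}, -log(a)={-np.log(a):.2f}, -log(1-a)={-np.log(1-a):.2f}")

With the small scale, a stays near 0.5 and both loss terms stay around log 2; with the large scale, the sigmoid saturates and one term blows up, which is why the training curve for this initialization starts from such a high loss.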

In summary:

  • Initializing weights to very large random values does not work well.

  • Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!

He Initialization

Finally, try "He Initialization"; this is named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar except Xavier initialization uses a scaling factor for the weights W[l] of sqrt(1./layers_dims[l-1]) where He initialization would use sqrt(2./layers_dims[l-1]).)

Exercise: Implement the following function to initialize your parameters with He initialization.

Hint: This function is similar to the previous initialize_parameters_random(...). The only difference is that instead of multiplying np.random.randn(..,..) by 10, you will multiply it by sqrt(2./layers_dims[l-1]) (the square root of 2 over the dimension of the previous layer), which is what He initialization recommends for layers with a ReLU activation.
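
A minimal sketch of the requested function (the function name follows the exercise; the seed and the [2, 4, 1] test dimensions are arbitrary):

def initialize_parameters_he(layers_dims):
    np.random.seed(3)               # arbitrary fixed seed, only for reproducibility
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # He scaling: sqrt(2 / fan_in), suited to ReLU activations
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

parameters = initialize_parameters_he([2, 4, 1])
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))

With this scaling, the variance of each unit's pre-activation stays roughly constant from layer to layer in a ReLU network, which avoids the vanishing/exploding behavior seen with the previous two schemes.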

  • Different initializations lead to different results

  • Random initialization is used to break symmetry and make sure different hidden units can learn different things

  • Don't initialize to values that are too large

  • He initialization works well for networks with ReLU activations.
