Because gradient descent updates the weights using the gradients of the current batch of samples, each update has high variance: the path oscillates on its way toward the minimum, so convergence takes a long time. Momentum, based on an exponentially weighted moving average, keeps accumulating past gradients in a variable v (a dictionary), so it acts as a smoother; v can be read as "velocity". By accumulating previous gradient directions, momentum can travel far horizontally. This cannot be achieved by tuning the learning rate alone: a learning rate that is too large overshoots and can even diverge, while one that is too small makes progress slow, because plain gradient descent keeps stepping in the vertical, oscillating direction either way. What we want is to move only a little vertically while moving far horizontally.
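To see the smoothing effect numerically (a toy sketch, not from the course): apply the EWMA v = βv + (1 − β)g to a gradient sequence that mixes a steady "horizontal" drift with a large "vertical" oscillation, and observe that the oscillation is damped while the drift survives.

```python
# Toy gradient sequence: a steady drift of +1 plus an oscillation of +/-10
# that flips sign every step (raw values swing between -9 and +11).
grads = [1 + (10 if i % 2 == 0 else -10) for i in range(200)]

beta, v, vs = 0.9, 0.0, []
for g in grads:
    v = beta * v + (1 - beta) * g   # exponentially weighted moving average
    vs.append(v)

# The averaged velocity settles near the drift (+1): the oscillating
# component largely cancels out across consecutive steps.
print(min(vs[-20:]), max(vs[-20:]))  # prints values roughly 0.47 and 1.53
```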
Components of Momentum
Momentum consists of two terms, and there are two hyperparameters (β and the learning rate α).
Momentum Term : represents the velocity. The larger β is, the more past gradients are reflected, so the trajectory becomes smoother; but if β is too large, the updates are smoothed too heavily, so be careful. In general β lies between 0.8 and 0.999; if you don't feel the need to tune it, 0.9 is a sensible default. When β = 0, this reduces to plain gradient descent.
Derivative Term : represents the acceleration, weighted by 1 − β.
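Written out, the two terms combine into the update rule used in the implementation below (for each layer's W and b):

```latex
\begin{aligned}
v_{dW} &= \beta\, v_{dW} + (1-\beta)\, dW, &\qquad W &= W - \alpha\, v_{dW} \\
v_{db} &= \beta\, v_{db} + (1-\beta)\, db, &\qquad b &= b - \alpha\, v_{db}
\end{aligned}
```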
To borrow the course's intuition: imagine a bowl, with a little ball rolling down its side. The derivative term imparts acceleration to the ball, so it rolls faster and faster, while β, being a little less than one, plays the role of friction and prevents the ball from speeding up without limit. So rather than gradient descent taking every single step independently of all previous steps, the ball can roll downhill and gain momentum as it accelerates down the bowl. This physics analogy works well for some people; if it doesn't work for you, don't worry about it.
Implementing Momentum
initialize_velocity(parameters) : initializes v to zeros with the same shapes as the weights W and b.
update_parameters_with_momentum : updates the parameters using the momentum formula.
Nesterov Momentum
Nesterov momentum first moves to a look-ahead point using the velocity accumulated from past gradients, then evaluates the gradient at that predicted point and combines it with the velocity for the actual move. Because the correction uses the gradient at the look-ahead point, it should be able to get closer to the minimum.
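A minimal sketch of the look-ahead step, using the classical (non-EWMA) velocity form; `nesterov_step`, `grad_fn`, and the constants here are illustrative, not from the assignment:

```python
def nesterov_step(w, v, grad_fn, beta=0.9, lr=0.1):
    """One Nesterov momentum step: evaluate the gradient at the look-ahead
    point the current velocity would reach, then correct the velocity."""
    lookahead = w - lr * beta * v       # where the accumulated velocity would take us
    v = beta * v + grad_fn(lookahead)   # velocity corrected by the look-ahead gradient
    w = w - lr * v
    return w, v

# Minimizing f(w) = w^2 (gradient 2w) starting from w = 5.0:
w, v = 5.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, lambda x: 2 * x)
# w has converged close to the minimum at 0
```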
Adaptive Learning Rate
Adaptive Gradient (AdaGrad), Root Mean Squared Prop (RMSProp), and Adaptive Moment Estimation (Adam) do not update every weight element with one shared step size. Some weight elements contribute to movement in the vertical (oscillating) direction, and others to movement in the horizontal direction. We therefore want to move slowly in the vertical direction to suppress the oscillation, and quickly in the horizontal direction to reach the minimum sooner.
RMSprop
RMSProp squares the current gradient element-wise, keeps an exponentially weighted average of these squares, and divides the gradient by the square root of that average. In the horizontal direction (W) the gradient is small, its square is even smaller, and dividing by this small denominator yields a relatively large step, so movement is fast horizontally compared to vertically. Conversely, the large vertical-direction (b) gradient produces a large denominator and hence small steps, so vertical movement is suppressed.
Note that when the gradient approaches 0, the quotient would blow up, so a small epsilon is added to the denominator; there is no need to agonize over the value of epsilon.
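The per-element scaling can be sketched as follows (`rmsprop_step` and its constants are illustrative, not from the assignment); note how two gradients that differ by 1000x end up producing nearly the same step size:

```python
import numpy as np

def rmsprop_step(w, s, grad, beta=0.9, lr=0.01, eps=1e-8):
    """One RMSProp step: keep an EWMA of the element-wise squared
    gradients in s, and divide each gradient element by sqrt(s)."""
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)   # eps keeps the denominator away from 0
    return w, s

# Two parameters: a tiny "horizontal" gradient and a "vertical" gradient 1000x larger.
w = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([0.01, 10.0])
w_new, s = rmsprop_step(w, s, grad)
steps = w - w_new
# After normalization by sqrt(s), both coordinates take (almost) the same
# step despite the 1000x difference in raw gradient magnitude.
```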
Adam
Adam combines Momentum and RMSProp, and it is general enough to perform well across a wide variety of data compared with other optimizers. It has four hyperparameters in total, but the Momentum and RMSProp decay rates (β1, β2) and the error term ε have defaults whose results are well proven enough that they rarely need changing, so in practice only the learning rate needs tuning.
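Concretely, one Adam update for a parameter W at step t is (β1 drives the momentum average, β2 the RMSProp average, with bias correction by 1 − β^t):

```latex
\begin{aligned}
v &= \beta_1 v + (1-\beta_1)\, dW, &\qquad v^{\text{corrected}} &= \frac{v}{1-\beta_1^{\,t}} \\
s &= \beta_2 s + (1-\beta_2)\, (dW)^2, &\qquad s^{\text{corrected}} &= \frac{s}{1-\beta_2^{\,t}} \\
W &= W - \alpha\, \frac{v^{\text{corrected}}}{\sqrt{s^{\text{corrected}}} + \varepsilon}
\end{aligned}
```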
Learning Rate Decay
If the learning rate decays over time, the step size shrinks as the iterates approach the minimum, so the algorithm can settle into a tighter region around it.
α = (1 / (1 + decay_rate * epoch_num)) * α0
There are several ways to decay the learning rate.
Exponential Decay
α = 0.95^epoch_num * α0
α = (k / √epoch_num) * α0 ( k : hyperparameter )
α = (k / √t) * α0 ( t : # of minibatch )
Discrete Staircase
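The schedules above, plus a simple discrete-staircase rule (here halving α at a fixed epoch interval; the interval and factor are illustrative choices, and the function names are hypothetical), can be sketched as:

```python
def lr_inverse(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def lr_exponential(alpha0, epoch, base=0.95):
    """Exponential decay: alpha = base^epoch * alpha0"""
    return base ** epoch * alpha0

def lr_staircase(alpha0, epoch, interval=1000):
    """Discrete staircase: halve alpha every `interval` epochs."""
    return alpha0 * 0.5 ** (epoch // interval)
```

For example, lr_inverse(0.2, 1.0, 1) gives 0.1, and lr_staircase(0.4, 2000) has halved the rate twice, giving 0.1.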
Local Optima
Implementation
import math

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import sklearn
import sklearn.datasets

from opt_utils_v1a import (load_params_and_grads, initialize_parameters,
                           forward_propagation, backward_propagation,
                           compute_cost, predict, predict_dec,
                           plot_decision_boundary, load_dataset)
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


def update_parameters_with_gd(parameters, grads, learning_rate):
    """Update parameters using one step of gradient descent.

    parameters -- dict: parameters['W' + str(l)] = Wl, parameters['b' + str(l)] = bl
    grads      -- dict: grads['dW' + str(l)] = dWl, grads['db' + str(l)] = dbl
    learning_rate -- scalar
    """
    L = len(parameters) // 2  # number of layers in the neural network

    for l in range(L):
        # Descend: subtract the gradient. (The original snippet added it,
        # which would perform gradient *ascent*.)
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
    return parameters

parameters, grads, learning_rate = update_parameters_with_gd_test_case()
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
for k in ("W1", "b1", "W2", "b2"):
    print(k, "=\n", parameters[k])


def initialize_velocity(parameters):
    """Initialize the velocity v as a dict with keys "dW1", "db1", ..., "dWL", "dbL"
    and values of zeros with the same shapes as the corresponding parameters."""
    L = len(parameters) // 2
    v = {}
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
    return v

parameters = initialize_velocity_test_case()
v = initialize_velocity(parameters)
for k in ("dW1", "db1", "dW2", "db2"):
    print('v["' + k + '"] =\n', v[k])


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """Update parameters using momentum.

    v    -- dict of current velocities, same keys as grads
    beta -- the momentum hyperparameter, scalar
    """
    L = len(parameters) // 2
    for l in range(L):
        # compute velocities (exponentially weighted average of the gradients)
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads["db" + str(l+1)]
        # update parameters with the velocity instead of the raw gradient
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]
    return parameters, v

parameters, grads, v = update_parameters_with_momentum_test_case()
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01)
for k in ("W1", "b1", "W2", "b2"):
    print(k, "=\n", parameters[k])
for k in ("dW1", "db1", "dW2", "db2"):
    print('v["' + k + '"] =\n', v[k])


def initialize_adam(parameters):
    """Initialize the Adam averages v (gradients) and s (squared gradients)
    as zeros with the same shapes as the corresponding parameters."""
    L = len(parameters) // 2
    v = {}
    s = {}
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
        s["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        s["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
    return v, s

parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
for k in ("dW1", "db1", "dW2", "db2"):
    print('v["' + k + '"] =\n', v[k])
    print('s["' + k + '"] =\n', s[k])


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Update parameters using Adam.

    v       -- moving average of the first gradient (momentum part)
    s       -- moving average of the squared gradient (RMSProp part)
    t       -- Adam update counter (starts at 1), used for bias correction
    beta1   -- exponential decay for the first moment estimates
    beta2   -- exponential decay for the second moment estimates
    epsilon -- prevents division by zero in the update
    """
    L = len(parameters) // 2
    v_corrected = {}  # bias-corrected first moment estimate
    s_corrected = {}  # bias-corrected second moment estimate

    for l in range(L):
        # Moving average of the gradients (momentum part)
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]
        # Bias correction divides by (1 - beta1**t), not (1 - beta1)
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1 ** t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1 ** t)
        # Moving average of the squared gradients (RMSProp part)
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * (grads["dW" + str(l+1)] ** 2)
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * (grads["db" + str(l+1)] ** 2)
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2 ** t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2 ** t)
        # Update: velocity divided by sqrt of the squared-gradient average.
        # Note np.sqrt (not np.square), with epsilon inside the denominator.
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * (
            v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon))
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * (
            v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon))
    return parameters, v, s

parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=2)
for k in ("W1", "b1", "W2", "b2"):
    print(k, "=\n", parameters[k])
for k in ("dW1", "db1", "dW2", "db2"):
    print('v["' + k + '"] =\n', v[k])
    print('s["' + k + '"] =\n', s[k])


train_X, train_Y = load_dataset()


def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) and partition it into mini-batches. This helper is defined
    earlier in the full assignment; a standard version is reproduced here so
    that model() below is runnable."""
    np.random.seed(seed)
    m = X.shape[1]
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))
    mini_batches = []
    num_complete = math.floor(m / mini_batch_size)
    for k in range(num_complete):
        mini_batches.append((shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size],
                             shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]))
    if m % mini_batch_size != 0:  # last, smaller mini-batch
        mini_batches.append((shuffled_X[:, num_complete * mini_batch_size:],
                             shuffled_Y[:, num_complete * mini_batch_size:]))
    return mini_batches


def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64,
          beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000,
          print_cost=True):
    """3-layer neural network model which can be run in different optimizer modes.

    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), shape (1, number of examples)
    layers_dims -- list containing the size of each layer
    optimizer -- "gd", "momentum", or "adam"
    print_cost -- True to print the cost every 1000 epochs
    """
    L = len(layers_dims)
    costs = []   # to keep track of the cost
    t = 0        # counter required for the Adam bias correction
    seed = 10    # for grading: makes the "random" minibatches reproducible
    m = X.shape[1]

    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer state
    if optimizer == "gd":
        pass  # no extra state required for plain gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):
        # Increment the seed to reshuffle the dataset differently each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        cost_total = 0

        for minibatch in minibatches:
            (minibatch_X, minibatch_Y) = minibatch
            a3, caches = forward_propagation(minibatch_X, parameters)
            cost_total += compute_cost(a3, minibatch_Y)
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(
                    parameters, grads, v, s, t, learning_rate, beta1, beta2, epsilon)

        cost_avg = cost_total / m
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost_avg))
        if print_cost and i % 100 == 0:
            costs.append(cost_avg)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()
    return parameters


layers_dims = [train_X.shape[0], 5, 2, 1]

# Train the 3-layer model with each optimizer and plot its decision boundary
for opt, title in (("gd", "Gradient Descent"),
                   ("momentum", "Momentum"),
                   ("adam", "Adam")):
    parameters = model(train_X, train_Y, layers_dims, optimizer=opt)
    predictions = predict(train_X, train_Y, parameters)
    plt.title("Model with " + title + " optimization")
    axes = plt.gca()
    axes.set_xlim([-1.5, 2.5])
    axes.set_ylim([-1, 1.5])
    plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)