21_Batch Normalization

Mini-Batch, Baby Training Set !

Building mini-batches means doing the following two things.

Shuffling
Partitioning

It is a good idea to make the batch size a power of 2, e.g. 16, 32, 64, 128.

Mini-Batch Code

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of full mini-batches of size mini_batch_size in the partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size : (k+1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size : (k+1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size): take everything after the full mini-batches
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size :]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches
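
As a quick sanity check of the two steps (shuffling and partitioning), here is a minimal sketch. It is a compact re-implementation for illustration only, not the function above: the helper name `shuffle_and_partition` and the toy data are assumptions. It checks that X and Y stay paired after shuffling and that the last mini-batch holds the remainder.

```python
import numpy as np

def shuffle_and_partition(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y with one shared permutation, then slice."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    return [
        (X_shuf[:, start:start + mini_batch_size],
         Y_shuf[:, start:start + mini_batch_size])
        for start in range(0, m, mini_batch_size)
    ]

# Toy data (hypothetical): 148 examples whose label equals the first feature,
# so a broken pairing between X and Y would be easy to spot.
m = 148
X = np.vstack([np.arange(m), np.zeros(m)])   # shape (2, 148)
Y = np.arange(m).reshape(1, m)               # shape (1, 148)

batches = shuffle_and_partition(X, Y)
sizes = [batch_X.shape[1] for batch_X, _ in batches]
aligned = all(np.array_equal(batch_X[0], batch_Y[0]) for batch_X, batch_Y in batches)
print(sizes)    # [64, 64, 20]
print(aligned)  # True
```

With 148 examples and a mini-batch size of 64 there are two full mini-batches and one final mini-batch of 20 examples.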

The weights can be updated inside the for loop (once per mini-batch), or the gradients can be stored and the update applied outside the for loop.

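import matplotlib.pyplot as plt

# initialize_parameters, initialize_velocity, initialize_adam, forward_propagation, compute_cost,
# backward_propagation and the update_parameters_with_* helpers are assumed to be defined elsewhere.
def model(X, Y, layers_dims, optimizer, learning_rate = 0.0007, mini_batch_size = 64, beta = 0.9,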
          beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8, num_epochs = 10000, print_cost = True):
    """
    3-layer neural network model which can be run in different optimizer modes.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    optimizer -- update rule to use: "gd", "momentum" or "adam"
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates 
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(layers_dims)             # number of layers in the neural networks
    costs = []                       # to keep track of the cost
    t = 0                            # initializing the counter required for Adam update
    seed = 10                        # For grading purposes, so that your "random" minibatches are the same as ours
    m = X.shape[1]                   # number of training examples
    
    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
        cost_total = 0
        
        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost and add to the cost total
            cost_total += compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2,  epsilon)
        cost_avg = cost_total / m
        
        # Print the cost every 1000 epochs and record it every 100 epochs
        if print_cost and i % 1000 == 0:
            print ("Cost after epoch %i: %f" %(i, cost_avg))
        if print_cost and i % 100 == 0:
            costs.append(cost_avg)
                
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

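Gradient descent can be split into batch gradient descent and stochastic gradient descent, and the mini-batch size tells them apart. With a mini-batch size of 1 it is plain stochastic gradient descent (SGD). With a size greater than 1 but smaller than the training set it is mini-batch stochastic gradient descent. With a size equal to the training set it is batch gradient descent. If you have a small dataset of fewer than 2,000 examples, batch gradient descent is worth considering.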

Stochastic Gradient Descent (SGD) : Same as a mini-batch size of 1. Updates once per sample.
Mini Batch Stochastic Gradient Descent : Updates with the average gradient of each mini-batch.
Batch Gradient Descent : Updates with the average gradient over all data samples.
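
The three variants differ only in how many samples feed each update. The sketch below illustrates this on a toy least-squares problem. Everything in it is an assumption made for illustration (the function `train_linear`, the data, the learning rate), and unlike the code above it stores one example per row.

```python
import numpy as np

def train_linear(X, y, batch_size, learning_rate=0.05, num_epochs=100, seed=0):
    """Fit y ~ X @ w by gradient descent on the mean squared error.

    Returns the learned weights and the total number of weight updates.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    num_updates = 0
    for _ in range(num_epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # average gradient of this batch
            w -= learning_rate * grad
            num_updates += 1
    return w, num_updates

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                     # noise-free targets

for name, batch_size in [("SGD", 1), ("mini-batch SGD", 32), ("batch GD", 200)]:
    w, num_updates = train_linear(X, y, batch_size)
    print(f"{name:15s} updates={num_updates:6d} w={np.round(w, 3)}")
```

Over the same 100 epochs, batch size 1 performs 20,000 updates, batch size 32 performs 700, and batch size 200 (the whole training set) performs 100, one per epoch.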

Batch Gradient Descent has no noise and converges in few updates, although each update is expensive, so training takes a long time. So why use Stochastic Gradient Descent?

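SGD has a chance of reaching the global minimum.
Which training method to use is said to depend on the error manifold, that is, the shape of the loss surface over the parameters. For a convex problem there is a single minimum, so batch gradient descent can work well. On a surface with many local optima besides the global optimum, however, the noise in Stochastic Gradient Descent can actually help the optimizer escape a local optimum and find a better minimum. Mini-batches are widely used because stepping along the average of several noisy samples is more reliable than following a single sample.
SGD is suited to large training sets and is fast.
With large data, vectorization in BGD becomes inefficient: the whole dataset may not fit in RAM, and the computation is much slower.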
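
The trade-off between noise and batch size can be measured directly. The following sketch uses a hypothetical linear-regression setup in which the data, the evaluation point `w`, and the helper names are all assumptions. It compares mini-batch gradients with the full-batch gradient for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1024
X = rng.normal(size=(m, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=m)
w = np.zeros(5)                      # point at which the gradients are evaluated

def mse_gradient(X_part, y_part, w):
    """Gradient of the mean squared error on the given examples."""
    return 2.0 * X_part.T @ (X_part @ w - y_part) / len(y_part)

full_grad = mse_gradient(X, y, w)    # the noise-free batch gradient

def mean_deviation(batch_size, num_trials=200):
    """Average distance between a random mini-batch gradient and the full gradient."""
    deviations = []
    for _ in range(num_trials):
        idx = rng.choice(m, size=batch_size, replace=False)
        deviations.append(np.linalg.norm(mse_gradient(X[idx], y[idx], w) - full_grad))
    return float(np.mean(deviations))

deviation_by_size = {b: mean_deviation(b) for b in (1, 16, 256, m)}
for b, d in deviation_by_size.items():
    print(f"batch size {b:4d}: mean deviation from full gradient = {d:.4f}")
```

The deviation shrinks as the batch grows and is essentially zero when the batch is the whole training set. Single-sample SGD is therefore the noisiest variant and batch gradient descent has no sampling noise.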

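Single Sample SGD
Pros:
1. The sample size is 1.
2. The noise is heavy, but choosing the number of samples per update (the batch size) well lets it escape local minima. => mini-batch
Cons:
1. The weights are updated from randomly drawn data, so the noise is heavy.

MiniBatch SGD
Pros:
1. Escapes local minima.
2. Low computational cost.

Batch Gradient Descent
Pros:
1. Simple.
2. No noise.
3. Takes one large step at a time, so it learns quickly per update.
Cons:
1. Every epoch, the gradients of all samples are computed and averaged to update the weights, so the computational cost is high.
2. Takes a long time.
Note: rarely used in practice.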

[ Reference ]

https://stats.stackexchange.com/questions/49528/batch-gradient-descent-versus-stochastic-gradient-descent

Batch Norm in Machine Learning

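Many machine learning methods, such as KNN, K-Means, Logistic Regression, and SVM, include scaling as one of the data preprocessing steps. A data point is described by several attributes, and because those attributes have different units, Feature Scaling is applied to put them on comparable scales. There are several ways to do the scaling. https://sebastianraschka.com/Articles/2014_about_feature_scaling.html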
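
As a minimal sketch of two common scaling methods, the example below applies standardization (z-score) and min-max scaling to a made-up table of heights and weights. The data values are purely hypothetical.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are height (cm) and weight (kg)
X = np.array([[150.0, 50.0],
              [160.0, 60.0],
              [170.0, 70.0],
              [180.0, 80.0]])

# Standardization (z-score): each feature gets zero mean and unit variance
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each feature is mapped onto the range [0, 1]
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_max.min(axis=0), min_max.max(axis=0))
```

Both methods work column by column, so each feature is rescaled independently of its original unit.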

Covariate Shift : Problems of Deep Learning

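In deep learning as well, the deeper the network gets, the more the values fed into each following layer are affected by the weights before them, so their distribution (for example, their variance) shifts. This is the Covariate Shift problem. Normalization is therefore needed not only for the input to the first layer but also at every hidden layer. In other words, batch normalization is a technique for reducing Internal Covariate Shift, the change in the distribution of activation outputs that occurs as the weight parameters of the previous layers change during training. The details are in the original paper.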

Batch Norm

1. Normalization per mini-batch

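As shown in the figure below, batch normalization computes the mean and variance of each feature over the data in a mini-batch and then normalizes with them. In addition, the bias $b$ in $w^Tx+b$ is omitted: it shifts every example in the mini-batch by the same constant, and the mean subtraction in the normalization removes that constant again, so it has no effect.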
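
Written out for one feature over a mini-batch $x_1, \dots, x_B$, the normalization step and the cancellation of the bias look as follows. The batch size symbol $B$ and the small constant $\epsilon$ (added for numerical stability) are notation assumed here, not taken from the text above.

```latex
\begin{align}
\mu_B &= \frac{1}{B}\sum_{i=1}^{B} x_i \\
\sigma_B^2 &= \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\[4pt]
% Bias cancellation: with z_i = w^T x_i + b,
z_i - \frac{1}{B}\sum_{j=1}^{B} z_j
  &= (w^T x_i + b) - \frac{1}{B}\sum_{j=1}^{B} (w^T x_j + b)
   = w^T x_i - \frac{1}{B}\sum_{j=1}^{B} w^T x_j
\end{align}
```

The last line does not contain $b$ any more, which is why the bias can be dropped when the layer is followed by batch normalization.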

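Normalization is applied before the activation function. Because the normalized x (= z) is confined to a narrow input range, it risks landing only in the near-linear part of the nonlinear function. Therefore, a scale gamma and a shift beta are learned so that the network can move the values out of that region when needed.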
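
A minimal sketch of the training-time forward pass with the learnable gamma and beta might look like this. It follows the (features, examples) layout of the code earlier in this post. The function name `batchnorm_forward`, the `eps` constant, and the concrete gamma and beta values are assumptions for illustration.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each feature (row) of Z over the mini-batch, then scale and shift.

    Z     -- pre-activations, shape (num_features, batch_size)
    gamma -- learnable scale, shape (num_features, 1)
    beta  -- learnable shift, shape (num_features, 1)
    """
    mu = Z.mean(axis=1, keepdims=True)        # per-feature mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-feature variance over the mini-batch
    Z_hat = (Z - mu) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * Z_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))

# gamma = 1 and beta = 0 give plain normalization
out = batchnorm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)))
# other values let the network pick a different mean and spread
shifted = batchnorm_forward(Z, np.full((4, 1), 2.0), np.full((4, 1), 0.5))

print(out.mean(axis=1).round(3), out.std(axis=1).round(3))
print(shifted.mean(axis=1).round(3), shifted.std(axis=1).round(3))
```

After the scale and shift, each feature has mean beta and standard deviation gamma, so the network itself controls which part of the activation function the values fall into.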

2. Normalization right before the activation function

Typically, as shown in the figure below, a batch normalization (BN) layer is inserted right after a Fully Connected (FC) or Convolutional layer and before the activation function.
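
The ordering FC -> BN -> activation can be sketched as below. ReLU as the activation, the layer sizes, and the helper names `normalize` and `fc_bn_relu` are all assumptions for illustration. The example also checks that a bias added in the FC layer makes no difference once BN follows it.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Per-feature normalization over the mini-batch (columns are examples)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def fc_bn_relu(X, W, gamma, beta, b=None):
    """FC layer -> batch norm -> ReLU. The bias b is optional on purpose."""
    Z = W @ X
    if b is not None:
        Z = Z + b
    Z_tilde = gamma * normalize(Z) + beta     # BN sits before the activation
    return np.maximum(0.0, Z_tilde)           # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 32))                  # 3 input features, 32 examples
W = rng.normal(size=(5, 3))                   # 5 hidden units
gamma, beta = np.ones((5, 1)), np.zeros((5, 1))
b = rng.normal(size=(5, 1))

without_bias = fc_bn_relu(X, W, gamma, beta)
with_bias = fc_bn_relu(X, W, gamma, beta, b=b)
print(np.allclose(without_bias, with_bias))
```

Because the two outputs agree, the FC bias is redundant in this arrangement and beta takes over its role.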

Let's find INSIGHT of Batch Norm!

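Suppose we build a classifier for cats in general. However, the dataset I have contains overwhelmingly more black cats. After training on this data, if we feed in a brown cat, the model will probably not classify it as a cat very well.

The reason is that the distribution the model learned differs from the distribution of the input that comes in.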

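This is what is called covariate shift, and it can also be a problem inside the training process of a deep network. Consider the third layer: its weights and biases are trained so that the vectors coming out of its activation function map to the Ground Truth values. But the inputs to this third layer change whenever the weights of the first and second layers change, so the values after its activation function change as well. Batch norm keeps these values from moving much by fixing their mean and variance, which limits how strongly the weights of the earlier layers affect the later ones. Gamma and beta are determined through training.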
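
This stabilizing effect can be illustrated with a small sketch. The weight matrices and the size of the "training step" below are made up, and gamma = 1, beta = 0 are assumed. The first-layer weights change substantially, yet the statistics of what the next layer sees stay fixed when normalization is applied.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Per-feature normalization over the mini-batch (columns are examples)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 256))                          # one mini-batch
W1_before = rng.normal(size=(6, 4))
W1_after = 3.0 * W1_before + rng.normal(size=(6, 4))   # pretend training changed layer 1 a lot

stats = {}
for name, W1 in [("before", W1_before), ("after", W1_after)]:
    Z = W1 @ X                    # what the next layer sees without BN
    Z_bn = normalize(Z)           # what it sees with BN (gamma = 1, beta = 0)
    stats[name] = (Z.std(axis=1), Z_bn.mean(axis=1), Z_bn.std(axis=1))
    print(name, "raw std:", Z.std(axis=1).round(2), "BN std:", Z_bn.std(axis=1).round(2))
```

The raw standard deviations differ strongly between the two weight settings, while the normalized values keep mean 0 and standard deviation 1 in both cases.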