[AAA603] Project Report

2019021166 이민예

Project Motivation

CPU 에 적합하게 설계된 알고리즘의 성능을 GPU 에서도 높은 성능을 가지도록 하는 것은 어려움이 있다. 이러한 알고리즘 중에 하나가 BFS(Breadth First Search)이다. BFS 를 GPU 에서 성능을 향상시키는 선행연구는 다음과 같다.

Harish and Narayanan (IIIT-BFS) (2007) : Sparse graph 에서는 CPU 에서 성능이 우월함.
- Complexity : O(VL+E) > O(V+E) / Sparse : O(V2) > O(V)
Deng et al., (2009) : Matrix-based BFS 를 구현하였으나, 기존의 BFS 보다 느림.
- Complexity : O(V+EL) > O(V+E) / Sparse : O(V+V2) > O(V)

본 프로젝트는 GPU 장치 에서 Memory Coalescing 을 고려하여 너비우선탐색(BFS, Breadth-First Search) 알고리즘 성능을 개선하기 위해, 2010 년에 발표된 <An Effective GPU Implementation of Breadth-First Search> 논문에서 Memory Coalescing 을 보장하는 BFS 알고리즘을 탐구한다. 또한, Rodinia 벤치마크와 본 논문의 Memory Coalescing 측정 지표를 NVIDIA Nsight Compute 를 사용하여 비교해본다.

Memory Coalescing

Warp 안의 연속적인 쓰레드가 연속적인 메모리 구조를 참조할 때, 데이터를 하나로 묶어서 가져올 수 있다. 메모리를 자주 접근하는 것은 성능의 저하(300 Clock cycles per Global Memory, 2 Clock cycles per shared memory)를 가져오는데, 이처럼 Memory Coalescing 을 고려한 프로그램은 DRAM(Global memory)의 Transaction를 줄임으로써 성능을 향상시킬 수 있다.

Breadth-First Search Algorithm

BFS 알고리즘은 시작점 (s) 로부터 방문할 수 있는 모든 노드를 방문한다. Frontier 는 각 Level 별 속한 노드를 의미하며, Frontier Propagation 은 이웃 노드를 탐색하고, 방문하지 않았으면 새로운 Level 의 Frontier 에 ADD 하는 것을 의미한다. 일반적인 BFS 알고리즘은 Frontier 를 저장하기 위한 하나의 Queue 를 사용하여서 구현하며, Complexity 는 다음과 같다.

Complexity : O(V+E) / Sparse Graph O(V)

Type of Parallelism in BFS

각 쓰레드가 각 Frontier 에 할당되어 Frontier Propagation 을 병렬 수행 하는 방식
여러 쓰레드가 Frontier Propagation 을 병렬 처리 하는 방식

본 논문에서는 1번째 방식을 선택함. 그 이유는 실제 그래프는 Sparse Graph 인 경우 많고, 이웃노드들이 많지 않기 때문이다.

Hierarchical Queue for Memory Coalescing

하나의 Queue 기반으로 한 기존 BFS 알고리즘은 병렬 처리 될 경우, 여러 쓰레드가 같은 지점 (Queue 의 TAIL 부분)의 update 되기 때문에 충돌(conflicts)의 문제가 발생한다. 이러한 충돌을 막기 위해서 Hierarchical Queue Structure 를 사용하여서 Memory Coalescing 문제를 해결하였다. Hierarchical Queue Structure 는 Memory Coalescing 이 보장되도록 BFS 를 재설계할 수 있다.

Hierarchical Queue Structure 은 3 Level로 나누어지는데, W Frontier, B Frontier, 그리고 G Frontier 이다.

W-Frontier 는 한 Warp 안에 32개의 쓰레드가 있을 시, 4개의 쓰레드씩, 8개의 쓰레드 그룹으로 나뉘어진다. 한 줄(row)이 W-Frontier 가 된다. 예를 들어, W-Frontiers [0] 의 쓰레드 T1, T9, T17, T25 는 동시 수행되지 않으므로, 기존 BFS 알고리즘의 충돌이 발생하지 않는다. Warp 의 사이즈(ex 32)를 알 경우, W-Frontier[0] 부터 W-Frontier[7] 에 4개의 쓰레드씩 들어가는 것을 미리 알게 되므로, W-Frontier 들의 병렬 처리가 가능하다. 따라서, Global Memory 에 있는 G-Frontier 에 연속적인 쓰레드가 연속적인 메모리 접근에 가능하다.

본 논문에서 제안하는 UIUC-BFS 가 다른 알고리즘에 비해 수행속도가 감소되는 것을 보여준다. 하지만, 수행 속도 외 Memory Coalescing 을 측정하여 비교한 결과표는 없으므로, 본 프로젝트에서 NVIDA Nsight Compute 를 사용하여서 측정하였다.

BFS Algorithm Source Code

Traditional BFS : Rodinia3.1 /cuda/ bfs code 를 사용함. (bfs.cu, kernel.cu, kernel2.cu [2])
Effective GPU Implementation of BFS : Implementing Breadth first search on CUDA using algorithm given in DAC'10 를 사용함. (main_sm.cu, kernel_sm.cu [3])

Memory Coalescing Metrics via ncu

Compute Capability 에 따라 추천하는 GPU Profiling Tool 이 다름. (Pascal 이후 Nsight 를 권장함)

nvprof : Kepler, Maxwell 을 지원한다.
nsight : Pascal, Volta, Turing 을 지원한다.

[1] nvdia 공식문서에 따르면 nvprof 에서는 memory coalscing 을 측정하기 위해서 다음 metrics 을 사용한다.

gld_efficiency (global load efficiency)
gst_efficiency (global store efficiency)

하지만 nsight 에서는 기존 metrics 에 해당하는 지표가 없으므로 Transacionts per Request[4] 로 계산한다.

transaction : the movement of a unit of data between two regions of memory
request : an instruction which accesses memory

효율적인 메모리 사용은 $gld\ efficiency$ 가 낮아야 한다. 두 메모리 간의 transaction 은 적되, 많은 데이터를 가져오면 효율적인 것이다. 비효율적인 접근 패턴 (Access Pattern) 은 적은 데이터(Low Requests) 임에도 다른 두 메모리 계층 간의(Large Transactions) 움직임이 많다.

gld\ efficiency = \frac{Total\ Load\ Transaction}{Total\ Load\ Requests}

Total Load Transaction 과 Total Load Request 는 아래 그림의 1번에 해당한다. 1번에 해당하는 Nsight compute metrics 의 이름은 l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum 이다.

$Total\ Load\ Transaction:$ l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum (# of sectors requested for global loads)
$Total\ Load\ Requests:$ l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum (# of requests sent to T-Stage for global loads)

gst\ efficiency = \frac{Total\ Store\ Transaction}{Total\ Store\ Requests}

$Total\ Store\ Transaction:$ l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum (# of sectors requested for global stores)
$Total\ Store\ Requests:$ l1tex__t_requests_pipe_lsu_mem_global_op_st.sum (# of requests sent to T-Stage for global stores)

Result

rodinia3.1 Benchmark dataset 을 사용하였다. dataset 은 graph4096.txt, graph65536.txt, 그리고 graph1MW6.txt 이다.

graph4096.txt
graph65536.txt
graph1MW_6.txt

Transactions per Request

	bfs	bfs_effecitve
graph4096.txt	3.3014	2.0826
graph65536.txt	3.3001	11.45
graph1MW_6.txt	3.1866	13.7699

graph4096.txt 그래프 데이터를 실행하였을 경우, 전체 Kernel 의 합계에 대한 Transaction per Request 는 bfs_effecitve 가 비교적 우수했지만 개별 커널을 Efficiency 를 보았을 때, 기존 bfs 가 효율적이었다.

기존 bfs 알고리즘은 bfs 실행파일이다. nv-nsight-cu-cli 커맨드에 --metrics 파라미터에 추적하고자 하는 측정 지표 이름을 넣어준다.

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum, l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum ./bfs ./data/graph4096.txt

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum, l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum ./bfs ./data/graph4096.txt

memory coalescing 을 고려한 알고리즘은 bfs_sm 실행 파일이다.

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum, l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum ./bfs_sm2 ./data/graph4096.txt

Appendix

GPU Configuration

Usage

기존 bfs 알고리즘은 bfs 실행파일이다.

nvprof ./bfs {graph65536.txt 의 절대경로}

memory coalescing 을 고려한 알고리즘은 bfs_sm 실행파일이다.

nvprof ./bfs_sm2 {graph65536.txt 의 절대경로}

configuration

make 시, config/make.config 의 CUDA_DIR 변경 필요함

BFS Trace the Application (nvprof)

Reference

PreviousCh7. Swap (스왑)NextgVisor Review

Last updated 3 years ago