Stanford CS149 (Fall 2023) Parallel Computing
Optional Programming Assignment 5: Big Graph Processing in OpenMP
Github repo
Part 1:
Parallel "Top Down" Breadth-First Search (20 points)
使用 OpenMP 加速从起点开始的 BFS
看了一下给出的代码,就是维护两层队列,一层是当前层,一层是下一层,那么能够加速的就是每一次向下一层扩展的过程。
这里选择对第一层 for
循环进行并行化,由于每个点的边数不一样,所以选择动态的调度方式,并且每个线程都有自己的
buffer,最后再合并。
提示中说到了部分情况可以避免 CAS 操作,也就是提前判断一下
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
| void top_down_step(Graph g, vertex_set *frontier, vertex_set *new_frontier, int *distances) { const int new_dist = distances[frontier->vertices[0]] + 1; #pragma omp parallel { Vertex *buffer = new Vertex[g->num_nodes]; int buffer_size = 0;
#pragma omp for schedule(dynamic, 200) for (int i = 0; i < frontier->count; i++) { const int node = frontier->vertices[i]; for (const Vertex *x = outgoing_begin(g, node), *limit = outgoing_end(g, node); x < limit; x++) { if (distances[*x] == NOT_VISITED_MARKER && __sync_bool_compare_and_swap(distances + *x, NOT_VISITED_MARKER, new_dist)) { buffer[buffer_size++] = *x; } } }
int index = __sync_fetch_and_add(&new_frontier->count, buffer_size); memcpy(new_frontier->vertices + index, buffer, buffer_size * sizeof(Vertex));
delete[] buffer; } }
|
Part 2: "Bottom Up" BFS (25
points)
反过来的 BFS,每次枚举没有到达的点判断是否可以到达
一开始试了一下维护没有到达的点的数组,但发现图似乎是不完全联通的,最后会剩下大量的点反复拷贝,反而效果不是很好,于是就保持和
Part1 一样
最终效果也不是很完美,只有 reference 的 75% 左右
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
| void bottom_up_step(Graph g, vertex_set *frontier, vertex_set *new_frontier, int *distances) { const int new_dist = distances[frontier->vertices[0]] + 1;
#pragma omp parallel { Vertex *buffer = new Vertex[g->num_nodes]; int buffer_size = 0;
#pragma omp for schedule(dynamic, 2000) for (int i = 0; i < g->num_nodes; ++i) { if (distances[i] != NOT_VISITED_MARKER) continue; for (const Vertex *x = incoming_begin(g, i), *limit = incoming_end(g, i); x < limit; ++x) { if (distances[*x] == new_dist - 1) { distances[i] = new_dist; buffer[buffer_size++] = i; break; } } }
int index = __sync_fetch_and_add(&new_frontier->count, buffer_size); memcpy(new_frontier->vertices + index, buffer, buffer_size * sizeof(Vertex));
delete[] buffer; } }
void bfs_bottom_up(Graph graph, solution *sol) { vertex_set list1; vertex_set list2; vertex_set_init(&list1, graph->num_nodes); vertex_set_init(&list2, graph->num_nodes);
vertex_set *frontier = &list1; vertex_set *new_frontier = &list2;
for (int i = 0; i < graph->num_nodes; i++) sol->distances[i] = NOT_VISITED_MARKER;
frontier->vertices[frontier->count++] = ROOT_NODE_ID; sol->distances[ROOT_NODE_ID] = 0;
while (frontier->count != 0) { vertex_set_clear(new_frontier);
bottom_up_step(graph, frontier, new_frontier, sol->distances);
vertex_set *tmp = frontier; frontier = new_frontier; new_frontier = tmp; } }
|
Part 3: Hybrid BFS (25 points)
混合 Part 1 和 Part 2 的 BFS
这里简单设了一个
frontier->count < graph->num_nodes * 0.05
的条件,来判断使用哪种 BFS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
| void bfs_hybrid(Graph graph, solution *sol) { vertex_set list1; vertex_set list2; vertex_set_init(&list1, graph->num_nodes); vertex_set_init(&list2, graph->num_nodes);
vertex_set *frontier = &list1; vertex_set *new_frontier = &list2;
for (int i = 0; i < graph->num_nodes; i++) sol->distances[i] = NOT_VISITED_MARKER;
frontier->vertices[frontier->count++] = ROOT_NODE_ID; sol->distances[ROOT_NODE_ID] = 0;
while (frontier->count != 0) { vertex_set_clear(new_frontier);
if (frontier->count < graph->num_nodes * 0.05) { top_down_step(graph, frontier, new_frontier, sol->distances); } else { bottom_up_step(graph, frontier, new_frontier, sol->distances); }
vertex_set *tmp = frontier; frontier = new_frontier; new_frontier = tmp; } }
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
| Your Code: Timing Summary Threads Top Down Bottom Up Hybrid 1: 4.81 (1.00x) 7.03 (1.00x) 2.29 (1.00x) 2: 2.58 (1.86x) 3.66 (1.92x) 1.24 (1.84x) 4: 1.40 (3.43x) 2.12 (3.31x) 0.68 (3.37x) 8: 0.98 (4.91x) 1.37 (5.14x) 0.47 (4.91x) 16: 0.85 (5.64x) 1.22 (5.75x) 0.52 (4.43x) ---------------------------------------------------------- Reference: Timing Summary Threads Top Down Bottom Up Hybrid 1: 5.21 (1.00x) 5.77 (1.00x) 3.06 (1.00x) 2: 2.81 (1.86x) 2.99 (1.93x) 1.63 (1.88x) 4: 1.54 (3.38x) 1.76 (3.28x) 0.91 (3.35x) 8: 1.05 (4.96x) 1.14 (5.05x) 0.63 (4.86x) 16: 1.00 (5.19x) 0.84 (6.89x) 0.64 (4.82x) ---------------------------------------------------------- Correctness:
Speedup vs. Reference: Threads Top Down Bottom Up Hybrid 1: 1.08 0.82 1.34 2: 1.09 0.82 1.31 4: 1.10 0.83 1.35 8: 1.07 0.84 1.36 16: 1.18 0.68 1.23
|
最后的得分(记得把内存开大,我 wsl 的 8G
内存根本不够用,一直被杀进程)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| -------------------------------------------------------------------------- SCORES : | Top-Down | Bott-Up | Hybrid | -------------------------------------------------------------------------- grid1000x1000.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 | -------------------------------------------------------------------------- soc-livejournal1_68m.graph | 1.95 / 2 | 3.00 / 3 | 3.00 / 3 | -------------------------------------------------------------------------- com-orkut_117m.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 | -------------------------------------------------------------------------- random_500m.graph | 7.00 / 7 | 8.00 / 8 | 8.00 / 8 | -------------------------------------------------------------------------- rmat_200m.graph | 7.00 / 7 | 4.87 / 8 | 8.00 / 8 | -------------------------------------------------------------------------- TOTAL | 66.82 / 70 | --------------------------------------------------------------------------
|