Stanford CS149 (Fall 2023) Parallel Computing

Optional Programming Assignment 5: Big Graph Processing in OpenMP

Github repo

Part 1: Parallel "Top Down" Breadth-First Search (20 points)

使用 OpenMP 加速从起点开始的 BFS

看了一下给出的代码,就是维护两层队列,一层是当前层,一层是下一层,那么能够加速的就是每一次向下一层扩展的过程。

这里选择对第一层 for 循环进行并行化,由于每个点的边数不一样,所以选择动态的调度方式,并且每个线程都有自己的 buffer,最后再合并。

提示中说到了部分情况可以避免 CAS 操作,也就是提前判断一下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
void top_down_step(Graph g, vertex_set *frontier, vertex_set *new_frontier,
int *distances) {
const int new_dist = distances[frontier->vertices[0]] + 1;
#pragma omp parallel
{
Vertex *buffer = new Vertex[g->num_nodes];
int buffer_size = 0;

#pragma omp for schedule(dynamic, 200)
for (int i = 0; i < frontier->count; i++) {
const int node = frontier->vertices[i];
for (const Vertex *x = outgoing_begin(g, node),
*limit = outgoing_end(g, node);
x < limit; x++) {
if (distances[*x] == NOT_VISITED_MARKER &&
__sync_bool_compare_and_swap(distances + *x, NOT_VISITED_MARKER,
new_dist)) {
buffer[buffer_size++] = *x;
}
}
}

int index = __sync_fetch_and_add(&new_frontier->count, buffer_size);
memcpy(new_frontier->vertices + index, buffer,
buffer_size * sizeof(Vertex));

delete[] buffer;
}
}

Part 2: "Bottom Up" BFS (25 points)

反过来的 BFS,每次枚举没有到达的点判断是否可以到达

一开始试了一下维护没有到达的点的数组,但发现图似乎是不完全联通的,最后会剩下大量的点反复拷贝,反而效果不是很好,于是就保持和 Part1 一样

最终效果也不是很完美,只有 reference 的 75% 左右

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
void bottom_up_step(Graph g, vertex_set *frontier, vertex_set *new_frontier,
int *distances) {
const int new_dist = distances[frontier->vertices[0]] + 1;

#pragma omp parallel
{
Vertex *buffer = new Vertex[g->num_nodes];
int buffer_size = 0;

#pragma omp for schedule(dynamic, 2000)
for (int i = 0; i < g->num_nodes; ++i) {
if (distances[i] != NOT_VISITED_MARKER)
continue;
for (const Vertex *x = incoming_begin(g, i), *limit = incoming_end(g, i);
x < limit; ++x) {
if (distances[*x] == new_dist - 1) {
distances[i] = new_dist;
buffer[buffer_size++] = i;
break;
}
}
}

int index = __sync_fetch_and_add(&new_frontier->count, buffer_size);
memcpy(new_frontier->vertices + index, buffer,
buffer_size * sizeof(Vertex));

delete[] buffer;
}
}

void bfs_bottom_up(Graph graph, solution *sol) {
vertex_set list1;
vertex_set list2;
vertex_set_init(&list1, graph->num_nodes);
vertex_set_init(&list2, graph->num_nodes);

vertex_set *frontier = &list1;
vertex_set *new_frontier = &list2;

for (int i = 0; i < graph->num_nodes; i++)
sol->distances[i] = NOT_VISITED_MARKER;

frontier->vertices[frontier->count++] = ROOT_NODE_ID;
sol->distances[ROOT_NODE_ID] = 0;

while (frontier->count != 0) {
vertex_set_clear(new_frontier);

bottom_up_step(graph, frontier, new_frontier, sol->distances);

vertex_set *tmp = frontier;
frontier = new_frontier;
new_frontier = tmp;
}
}

Part 3: Hybrid BFS (25 points)

混合 Part 1 和 Part 2 的 BFS

这里简单设了一个 frontier->count < graph->num_nodes * 0.05 的条件,来判断使用哪种 BFS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
void bfs_hybrid(Graph graph, solution *sol) {
vertex_set list1;
vertex_set list2;
vertex_set_init(&list1, graph->num_nodes);
vertex_set_init(&list2, graph->num_nodes);

vertex_set *frontier = &list1;
vertex_set *new_frontier = &list2;

for (int i = 0; i < graph->num_nodes; i++)
sol->distances[i] = NOT_VISITED_MARKER;

frontier->vertices[frontier->count++] = ROOT_NODE_ID;
sol->distances[ROOT_NODE_ID] = 0;

// int remaining_nodes = graph->num_nodes - 1;

while (frontier->count != 0) {
vertex_set_clear(new_frontier);

if (frontier->count < graph->num_nodes * 0.05) {
top_down_step(graph, frontier, new_frontier, sol->distances);
} else {
bottom_up_step(graph, frontier, new_frontier, sol->distances);
}

// remaining_nodes -= frontier->count;

vertex_set *tmp = frontier;
frontier = new_frontier;
new_frontier = tmp;
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Your Code: Timing Summary
Threads Top Down Bottom Up Hybrid
1: 4.81 (1.00x) 7.03 (1.00x) 2.29 (1.00x)
2: 2.58 (1.86x) 3.66 (1.92x) 1.24 (1.84x)
4: 1.40 (3.43x) 2.12 (3.31x) 0.68 (3.37x)
8: 0.98 (4.91x) 1.37 (5.14x) 0.47 (4.91x)
16: 0.85 (5.64x) 1.22 (5.75x) 0.52 (4.43x)
----------------------------------------------------------
Reference: Timing Summary
Threads Top Down Bottom Up Hybrid
1: 5.21 (1.00x) 5.77 (1.00x) 3.06 (1.00x)
2: 2.81 (1.86x) 2.99 (1.93x) 1.63 (1.88x)
4: 1.54 (3.38x) 1.76 (3.28x) 0.91 (3.35x)
8: 1.05 (4.96x) 1.14 (5.05x) 0.63 (4.86x)
16: 1.00 (5.19x) 0.84 (6.89x) 0.64 (4.82x)
----------------------------------------------------------
Correctness:

Speedup vs. Reference:
Threads Top Down Bottom Up Hybrid
1: 1.08 0.82 1.34
2: 1.09 0.82 1.31
4: 1.10 0.83 1.35
8: 1.07 0.84 1.36
16: 1.18 0.68 1.23

最后的得分(记得把内存开大,我 wsl 的 8G 内存根本不够用,一直被杀进程)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
--------------------------------------------------------------------------
SCORES : | Top-Down | Bott-Up | Hybrid |
--------------------------------------------------------------------------
grid1000x1000.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
soc-livejournal1_68m.graph | 1.95 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
com-orkut_117m.graph | 2.00 / 2 | 3.00 / 3 | 3.00 / 3 |
--------------------------------------------------------------------------
random_500m.graph | 7.00 / 7 | 8.00 / 8 | 8.00 / 8 |
--------------------------------------------------------------------------
rmat_200m.graph | 7.00 / 7 | 4.87 / 8 | 8.00 / 8 |
--------------------------------------------------------------------------
TOTAL | 66.82 / 70 |
--------------------------------------------------------------------------