Stanford CS149 (Fall 2023) Parallel Computing

Programming Assignment 3: A Simple Renderer in CUDA

Github repo

Environment Setup

This PA uses CUDA, so you need a machine with an NVIDIA GPU of Compute Capability >= 7.5 (RTX 20-series or newer).

I'm using WSL. In theory, as long as a recent NVIDIA driver is installed on the Windows host, WSL can use the GPU directly; run nvidia-smi to check whether the GPU is detected.

~ » nvidia-smi                                                                                     mizukicry@S-Terminal
Mon Apr 8 09:49:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.65                 Driver Version: 551.86         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4050 ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   40C    P8              4W /  95W  |     506MiB /   6141MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Next is setting up the CUDA environment. In theory the install script the course provides (for Ubuntu 22.04) should just work, but it had no effect for me. I then reinstalled following the official WSL instructions (CUDA Toolkit 12.4), still nothing. Some digging showed the environment variables were simply never set: I use zsh while the course script targets bash, so they have to be added by hand (e.g. prepend /usr/local/cuda-12.4/bin to PATH and /usr/local/cuda-12.4/lib64 to LD_LIBRARY_PATH in ~/.zshrc). Use nvcc -V to check whether the CUDA compiler is found.

~ » nvcc -V                                                                                        mizukicry@S-Terminal
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Part 1: CUDA Warm-Up 1: SAXPY (5 pts)

Implement the SAXPY routine in the provided saxpy.cu (mainly the memory-transfer parts) and measure its performance.

It's straightforward: just call cudaMemcpy. Since I'm not running on AWS, I skipped the performance comparison.

// saxpyCuda --
//
// This function is regular C code running on the CPU. It allocates
// memory on the GPU using CUDA API functions, uses CUDA API functions
// to transfer data from the CPU's memory address space to GPU memory
// address space, and launches the CUDA kernel function on the GPU.
void saxpyCuda(int N, float alpha, float *xarray, float *yarray,
               float *resultarray) {

  // must read both input arrays (xarray and yarray) and write to
  // output array (resultarray)
  int totalBytes = sizeof(float) * 3 * N;

  // compute number of blocks and threads per block. In this
  // application we've hardcoded thread blocks to contain 512 CUDA
  // threads.
  const int threadsPerBlock = 512;

  // Notice the round up here. The code needs to compute the number
  // of thread blocks needed such that there is one thread per
  // element of the arrays. This code is written to work for values
  // of N that are not multiples of threadsPerBlock.
  const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

  // These are pointers to memory allocated *on the GPU*. You should
  // allocate these pointers via cudaMalloc. You can access the
  // resulting buffers from CUDA device kernel code (see the kernel
  // function saxpy_kernel() above) but you cannot access the
  // contents of these buffers from this thread. CPU threads cannot
  // issue loads and stores from GPU memory!
  float *device_x = nullptr;
  float *device_y = nullptr;
  float *device_result = nullptr;

  //
  // CS149 TODO: allocate device memory buffers on the GPU using cudaMalloc.
  //
  // We highly recommend taking a look at NVIDIA's
  // tutorial, which clearly walks you through the few lines of code
  // you need to write for this part of the assignment:
  //
  // https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
  //
  cudaMalloc(&device_x, sizeof(float) * N);
  cudaMalloc(&device_y, sizeof(float) * N);
  cudaMalloc(&device_result, sizeof(float) * N);

  // start timing after allocation of device memory
  double startTime = CycleTimer::currentSeconds();

  //
  // CS149 TODO: copy input arrays to the GPU using cudaMemcpy
  //
  cudaMemcpy(device_x, xarray, sizeof(float) * N, cudaMemcpyHostToDevice);
  cudaMemcpy(device_y, yarray, sizeof(float) * N, cudaMemcpyHostToDevice);

  // run CUDA kernel. (notice the <<< >>> brackets indicating a CUDA
  // kernel launch) Execution on the GPU occurs here.
  saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y,
                                            device_result);

  //
  // CS149 TODO: copy result from GPU back to CPU using cudaMemcpy
  //
  cudaMemcpy(resultarray, device_result, sizeof(float) * N,
             cudaMemcpyDeviceToHost);

  // end timing after result has been copied back into host memory
  double endTime = CycleTimer::currentSeconds();

  cudaError_t errCode = cudaPeekAtLastError();
  if (errCode != cudaSuccess) {
    fprintf(stderr, "WARNING: A CUDA error occurred: code=%d, %s\n", errCode,
            cudaGetErrorString(errCode));
  }

  double overallDuration = endTime - startTime;
  printf("Effective BW by CUDA saxpy: %.3f ms\t\t[%.3f GB/s]\n",
         1000.f * overallDuration, GBPerSec(totalBytes, overallDuration));

  //
  // CS149 TODO: free memory buffers on the GPU using cudaFree
  //
  cudaFree(device_x);
  cudaFree(device_y);
  cudaFree(device_result);
}
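
One thing worth noting: none of the CUDA calls above have their return codes checked until the single cudaPeekAtLastError at the end, which makes failures hard to localize. A minimal error-checking wrapper sketch (the macro name is my own; a similar cudaCheckError helper shows up in the render code in Part 3):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap a CUDA runtime call: print file/line and bail out on failure.
#define CUDA_CHECK(call)                                                       \
  do {                                                                         \
    cudaError_t err_ = (call);                                                 \
    if (err_ != cudaSuccess) {                                                 \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                                       \
      exit(EXIT_FAILURE);                                                      \
    }                                                                          \
  } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&device_x, sizeof(float) * N));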

Part 2: CUDA Warm-Up 2: Parallel Prefix-Sum (10 pts)

First, implement the parallel exclusive prefix-sum algorithm from the lecture slides (the exclusive_scan function in scan/scan.cu).

Honestly, the slides never really explain this algorithm. It looks like the tree-building idea behind a binary indexed tree; I didn't fully work it out, so I just translated the pseudocode given in the PA handout. (This is the work-efficient Blelloch scan: an upsweep builds a tree of partial sums bottom-up, then a downsweep pushes exclusive prefixes back down it.)

For the number of threads per block, I just used the #define THREADS_PER_BLOCK 256 already present in the code.
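
To make the algorithm concrete before the CUDA version, here is a minimal sequential C++ sketch of the same upsweep/downsweep (Blelloch) scan, assuming the array length is a power of two; the two loops correspond one-to-one with the kernel launches below.

#include <vector>

// In-place exclusive prefix sum; data.size() must be a power of two.
void exclusiveScanSequential(std::vector<int> &data) {
  int n = (int)data.size();

  // upsweep: build a reduction tree; afterwards data[n - 1] holds the
  // total and interior tree nodes hold partial sums
  for (int stride = 2; stride <= n; stride <<= 1)
    for (int i = stride - 1; i < n; i += stride)
      data[i] += data[i - stride / 2];

  // the root's exclusive prefix (sum of everything before it) is zero
  data[n - 1] = 0;

  // downsweep: push prefixes back down; the parent's value moves to the
  // left child, and the parent becomes old-left-child + parent
  for (int stride = n; stride >= 2; stride >>= 1)
    for (int i = stride - 1; i < n; i += stride) {
      int t = data[i - stride / 2];
      data[i - stride / 2] = data[i];
      data[i] += t;
    }
}
// e.g. {1, 2, 3, 4} -> {0, 1, 3, 6}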

__global__ void upsweep(int *result, int numThreads, int stride) {
  // each thread folds the left half-segment's sum into the element at
  // the right end of its stride-wide segment
  int threadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadId < numThreads) {
    int index = (threadId + 1) * stride - 1;
    result[index] += result[index - (stride >> 1)];
  }
}

__global__ void downsweep(int *result, int numThreads, int stride) {
  // each thread pushes prefixes down one tree level: the parent's value
  // moves to the left child, and the parent becomes old-left + parent
  int threadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadId < numThreads) {
    int index = (threadId + 1) * stride - 1;
    int t = result[index - (stride >> 1)];
    result[index - (stride >> 1)] = result[index];
    result[index] += t;
  }
}

// exclusive_scan --
//
// Implementation of an exclusive scan on global memory array `input`,
// with results placed in global memory `result`.
//
// N is the logical size of the input and output arrays, however
// students can assume that both the start and result arrays were
// allocated with next power-of-two sizes as described by the comments
// in cudaScan(). This is helpful, since your parallel scan
// will likely write to memory locations beyond N, but of course not
// greater than N rounded up to the next power of 2.
//
// Also, as per the comments in cudaScan(), you can implement an
// "in-place" scan, since the timing harness makes a copy of input and
// places it in result
void exclusive_scan(int *input, int N, int *result) {
  N = nextPow2(N);

  // upsweep phase
  for (int stride = 2; stride <= N; stride <<= 1) {
    int numThreads = N / stride;
    int numBlocks = (numThreads + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    upsweep<<<numBlocks, THREADS_PER_BLOCK>>>(result, numThreads, stride);
    cudaDeviceSynchronize();
  }

  // reset the last element to 0
  cudaMemset(&result[N - 1], 0, sizeof(int));
  cudaDeviceSynchronize();

  // downsweep phase
  for (int stride = N; stride >= 2; stride >>= 1) {
    int numThreads = N / stride;
    int numBlocks = (numThreads + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    downsweep<<<numBlocks, THREADS_PER_BLOCK>>>(result, numThreads, stride);
    cudaDeviceSynchronize();
  }
}
~/codes/CS149/asst3/scan (master*) » ./cudaScan -n 4000000                                                                                   mizukicry@S-Terminal
---------------------------------------------------------
Found 1 CUDA devices
Device 0: NVIDIA GeForce RTX 4050 Laptop GPU
SMs: 20
Global mem: 6140 MB
CUDA Cap: 8.9
---------------------------------------------------------
Array size: 4000000
Student GPU time: 2.802 ms
Scan outputs are correct!

At first, in the device code I computed index and only then bounds-checked it, and testing showed that array sizes beyond 4194304 (2^22) caused errors. I first suspected a stack overflow, but lifting the stack limit changed nothing. Debugging revealed that for threads whose threadId is far past numThreads, the product (threadId + 1) * stride overflows int. That one burned me badly.
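
A minimal illustration of the failure mode, with hypothetical numbers (N = 2^23, so the last upsweep round has stride = 2^23 and only a handful of useful threads, while many more get launched):

#include <cstdio>

int main() {
  int stride = 1 << 23;  // stride of the last upsweep round for N = 2^23
  int threadId = 300000; // a launched thread far beyond numThreads

  // computing the index before the bounds check evaluates a 32-bit
  // product that overflows (undefined behavior; on typical hardware it
  // wraps and can come out negative, slipping past an `index < N`
  // guard and producing a wild write)
  long long wanted = ((long long)threadId + 1) * stride - 1;
  int got = (threadId + 1) * stride - 1;

  printf("64-bit index: %lld\n", wanted);
  printf("32-bit index: %d\n", got);

  // checking `threadId < numThreads` first, as in the kernels above,
  // never evaluates the overflowing product
  return 0;
}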

Incidentally, the provided template never frees the device memory.

Next, implement the find_repeats function on top of the prefix sum.

The rough idea: first write a 1 at each position whose element equals its successor, then run an exclusive prefix sum over those flags to compute each result's position in the output, and finally scatter the indices into the output array.

The slides mention an optimization: when very few threads are needed, there's no point launching THREADS_PER_BLOCK threads per block; taking a min suffices (and the earlier exclusive_scan can be adjusted the same way).
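
For reference, a sequential C++ sketch of the same three steps (flag, exclusive scan, scatter) that the two kernels below implement:

#include <vector>

// Sequential reference: collect indices i with input[i] == input[i + 1].
int findRepeatsSequential(const std::vector<int> &input,
                          std::vector<int> &output) {
  int length = (int)input.size();
  std::vector<int> flags(length, 0), scan(length, 0);

  // step 1: flag each position whose element equals its successor
  for (int i = 0; i + 1 < length; i++)
    flags[i] = (input[i] == input[i + 1]) ? 1 : 0;

  // step 2: exclusive scan of the flags; scan[i] is i's output slot
  for (int i = 1; i < length; i++)
    scan[i] = scan[i - 1] + flags[i - 1];

  // step 3: scatter; flags[length - 1] is always 0, so scan[length - 1]
  // is already the total count (which is why the CUDA version can read
  // the count straight out of a[length - 1])
  output.assign(scan[length - 1], 0);
  for (int i = 0; i + 1 < length; i++)
    if (flags[i])
      output[scan[i]] = i;

  return (int)output.size();
}
// e.g. {1, 1, 2, 2, 3}: flags {1, 0, 1, 0, 0}, scan {0, 1, 1, 2, 2},
//      output {0, 2}, count 2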

// mark each position whose element equals its successor with a 1
__global__ void fr_part1(int *input, int *a, int length) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  a[index] = (index < length - 1 && input[index] == input[index + 1]) ? 1 : 0;
}

// after the exclusive scan, a[index] != a[index + 1] holds exactly at
// the marked positions, and a[index] is that match's output slot
__global__ void fr_part2(int *a, int length, int *output) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < length - 1 && a[index] != a[index + 1]) {
    output[a[index]] = index;
  }
}

// find_repeats --
//
// Given an array of integers `device_input`, returns an array of all
// indices `i` for which `device_input[i] == device_input[i+1]`.
//
// Returns the total number of pairs found
int find_repeats(int *device_input, int length, int *device_output) {
  int N = nextPow2(length);

  int *a;
  cudaMalloc(&a, N * sizeof(int));

  fr_part1<<<(length + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
             std::min(THREADS_PER_BLOCK, length)>>>(device_input, a, length);
  cudaDeviceSynchronize();

  exclusive_scan(a, N, a);

  fr_part2<<<(length + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
             std::min(THREADS_PER_BLOCK, length)>>>(a, length, device_output);
  cudaDeviceSynchronize();

  // the flag at length - 1 is always 0, so after the exclusive scan
  // a[length - 1] already equals the total number of matches
  int result;
  cudaMemcpy(&result, &a[length - 1], sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(a);
  return result;
}
~/codes/CS149/asst3/scan (master*) » ./checker.pl scan                                                                                       mizukicry@S-Terminal
Test: scan
--------------
Running tests:
--------------

Element Count: 1000000
Correctness passed!
Student Time: 1.169
Ref Time: 6.415

Element Count: 10000000
Correctness passed!
Student Time: 9.895
Ref Time: 16.378

Element Count: 20000000
Correctness passed!
Student Time: 19.002
Ref Time: 27.951

Element Count: 40000000
Correctness passed!
Student Time: 37.298
Ref Time: 55.261

-------------------------
Scan Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 6.415           | 1.169           | 1.25            |
| 10000000        | 16.378          | 9.895           | 1.25            |
| 20000000        | 27.951          | 19.002          | 1.25            |
| 40000000        | 55.261          | 37.298          | 1.25            |
-------------------------------------------------------------------------
|                 |                 | Total score:    | 5/5             |
-------------------------------------------------------------------------
~/codes/CS149/asst3/scan (master*) » ./checker.pl find_repeats                                                                               mizukicry@S-Terminal
Test: find_repeats
--------------
Running tests:
--------------

Element Count: 1000000
Correctness passed!
Student Time: 2.025
Ref Time: 2.666

Element Count: 10000000
Correctness passed!
Student Time: 14.239
Ref Time: 19.440

Element Count: 20000000
Correctness passed!
Student Time: 25.547
Ref Time: 35.342

Element Count: 40000000
Correctness passed!
Student Time: 48.733
Ref Time: 70.690

-------------------------
Find_repeats Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 2.666           | 2.025           | 1.25            |
| 10000000        | 19.440          | 14.239          | 1.25            |
| 20000000        | 35.342          | 25.547          | 1.25            |
| 40000000        | 70.690          | 48.733          | 1.25            |
-------------------------------------------------------------------------
|                 |                 | Total score:    | 5/5             |
-------------------------------------------------------------------------

Part 3: A Simple Circle Renderer (85 pts)

The main event of PA3: implementing a simple parallel circle renderer.

(Getting the environment working was genuinely annoying. WSL2 supposedly ships GUI support now, but it didn't work for me, and setting up X11 took forever. Then, inexplicably, the clangd server started crashing whenever it met delete[] in CUDA code; writing code without diagnostics is miserable.)

(I'll also admit my ability to read English documentation is poor; even fairly thin documents take me days because I keep not wanting to read them.)

A quick pass over cudaRenderer.cu shows the code basically does two things: advanceAnimation and render. advanceAnimation just assigns one simple thread per circle and is already correct, so what needs changing is the render path.

The approach in the provided code parallelizes over circles: each circle's thread enumerates the pixels in the circle's bounding box, tests whether each one is inside the circle, and accumulates its color contribution. This obviously fails to respect the required circle compositing order (and concurrent updates to the same pixel race, too).

The easy fix is to flip the order: parallelize over pixels, with each pixel's thread enumerating every circle, testing containment, and accumulating the color contribution in circle order. I tried it, and it works, but it's very wasteful when the circles are small and numerous.
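
For reference, a minimal sketch of that per-pixel variant, written against the same globals and helpers (cuConstRendererParams, shadePixel) used in the final version below, and assuming, as in the starter code, that shadePixel performs the coverage test internally. Ordering and atomicity come for free because each pixel is owned by exactly one thread, but every pixel still walks the full circle list:

__global__ void kernelRenderPixels() {
  int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
  int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
  if (pixelX >= cuConstRendererParams.imageWidth ||
      pixelY >= cuConstRendererParams.imageHeight)
    return;

  float4 *imgPtr = reinterpret_cast<float4 *>(
      &cuConstRendererParams
           .imageData[4 * (pixelY * cuConstRendererParams.imageWidth +
                           pixelX)]);
  float2 pixelCenter =
      make_float2((pixelX + 0.5f) * cuConstRendererParams.invWidth,
                  (pixelY + 0.5f) * cuConstRendererParams.invHeight);

  // walking circles in index order preserves the compositing order;
  // shadePixel skips circles that do not cover this pixel
  for (int c = 0; c < cuConstRendererParams.numCircles; c++) {
    float3 p = *reinterpret_cast<float3 *>(
        &cuConstRendererParams.position[3 * c]);
    shadePixel(c, pixelCenter, p, imgPtr);
  }
}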

So instead, tile the image into blocks of pixels: for each block, first build the list of circles that might cover it, then let the block's pixels compute their color contributions in parallel. In the actual implementation the circles also have to be processed in batches, otherwise shared memory isn't big enough; see the code.

namespace Solution3 {

constexpr int BLOCK_DIM = 16;
constexpr int BLOCK_SIZE = BLOCK_DIM * BLOCK_DIM;

#define SCAN_BLOCK_DIM BLOCK_SIZE
#include "exclusiveScan.cu_inl"

__global__ void kernelRenderCircles() {
  __shared__ uint circleIsInBox[BLOCK_SIZE];
  __shared__ uint circleIndex[BLOCK_SIZE];
  __shared__ uint scratch[2 * BLOCK_SIZE];
  __shared__ int inBoxCircles[BLOCK_SIZE];

  // bounding box of this thread block, in pixels and normalized coords
  int boxL = blockIdx.x * BLOCK_DIM;
  int boxB = blockIdx.y * BLOCK_DIM;
  int boxR = min(boxL + BLOCK_DIM, cuConstRendererParams.imageWidth);
  int boxT = min(boxB + BLOCK_DIM, cuConstRendererParams.imageHeight);
  float boxLNorm = boxL * cuConstRendererParams.invWidth;
  float boxRNorm = boxR * cuConstRendererParams.invWidth;
  float boxTNorm = boxT * cuConstRendererParams.invHeight;
  float boxBNorm = boxB * cuConstRendererParams.invHeight;

  int index = threadIdx.y * BLOCK_DIM + threadIdx.x;
  int pixelX = boxL + threadIdx.x;
  int pixelY = boxB + threadIdx.y;
  int pixelId = pixelY * cuConstRendererParams.imageWidth + pixelX;

  // process circles in chunks of BLOCK_SIZE so the shared-memory
  // buffers never overflow
  for (int i = 0; i < cuConstRendererParams.numCircles; i += BLOCK_SIZE) {
    // each thread tests one circle of the current chunk against the box
    int circleId = i + index;
    if (circleId < cuConstRendererParams.numCircles) {
      float3 p = *reinterpret_cast<float3 *>(
          &cuConstRendererParams.position[3 * circleId]);
      circleIsInBox[index] =
          circleInBox(p.x, p.y, cuConstRendererParams.radius[circleId],
                      boxLNorm, boxRNorm, boxTNorm, boxBNorm);
    } else {
      circleIsInBox[index] = 0;
    }
    __syncthreads();

    // compact the covering circles into inBoxCircles, preserving order
    sharedMemExclusiveScan(index, circleIsInBox, circleIndex, scratch,
                           BLOCK_SIZE);
    if (circleIsInBox[index]) {
      inBoxCircles[circleIndex[index]] = circleId;
    }
    __syncthreads();

    int numCirclesInBox =
        circleIndex[BLOCK_SIZE - 1] + circleIsInBox[BLOCK_SIZE - 1];
    __syncthreads();

    // each thread shades its own pixel against the compacted list
    if (pixelX < boxR && pixelY < boxT) {
      float4 *imgPtr = reinterpret_cast<float4 *>(
          &cuConstRendererParams.imageData[4 * pixelId]);
      for (int j = 0; j < numCirclesInBox; j++) {
        circleId = inBoxCircles[j];
        shadePixel(
            circleId,
            make_float2((pixelX + 0.5) * cuConstRendererParams.invWidth,
                        (pixelY + 0.5) * cuConstRendererParams.invHeight),
            *reinterpret_cast<float3 *>(
                &cuConstRendererParams.position[3 * circleId]),
            imgPtr);
      }
    }
  }
}

void renderCircles(int width, int height) {
  kernelRenderCircles<<<dim3((width + BLOCK_DIM - 1) / BLOCK_DIM,
                             (height + BLOCK_DIM - 1) / BLOCK_DIM),
                        dim3(BLOCK_DIM, BLOCK_DIM)>>>();
  cudaCheckError(cudaDeviceSynchronize());
}
} // namespace Solution3

The final score in WSL is 68/72 (the biglittle scene doesn't reach full marks), and it can dip lower when timings fluctuate. If you haven't written much CUDA you'll run into plenty of problems; I've already spent a bit too much time on this PA, so I'll leave it here.

~/codes/CS149/asst3/render (master*) » ./checker.py                                                                                                                                                                                                  mizukicry@S-Terminal

Running scene: rgb...
[rgb] Correctness passed!
[rgb] Student times: [0.6076, 0.6574, 0.7273]
[rgb] Reference times: [0.688, 0.5434, 0.6265]

Running scene: rgby...
[rgby] Correctness passed!

Running scene: rand10k...
[rand10k] Correctness passed!
[rand10k] Student times: [4.0515, 4.0682, 4.2037]
[rand10k] Reference times: [3.5459, 3.6099, 3.6559]

Running scene: rand100k...
[rand100k] Correctness passed!
[rand100k] Student times: [35.3552, 32.3574, 32.7131]
[rand100k] Reference times: [28.2137, 28.2372, 28.4035]

Running scene: biglittle...
[biglittle] Correctness passed!
[biglittle] Student times: [43.0692, 31.4728, 31.7712]
[biglittle] Reference times: [16.8258, 16.6435, 16.6761]

Running scene: littlebig...
[littlebig] Correctness passed!

Running scene: pattern...
[pattern] Correctness passed!
[pattern] Student times: [0.8449, 0.7732, 0.7803]
[pattern] Reference times: [0.7716, 0.7156, 0.8219]

Running scene: bouncingballs...
[bouncingballs] Correctness passed!

Running scene: hypnosis...
[hypnosis] Correctness passed!

Running scene: fireworks...
[fireworks] Correctness passed!

Running scene: snow...
[snow] Correctness passed!

Running scene: snowsingle...
[snowsingle] Correctness passed!
[snowsingle] Student times: [20.3475, 19.8425, 18.5326]
[snowsingle] Reference times: [18.8349, 19.9449, 19.1156]
------------
Score table:
------------
--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)    | Score          |
--------------------------------------------------------------------------
| rgb             | 0.5434           | 0.6076           | 12             |
| rand10k         | 3.5459           | 4.0515           | 12             |
| rand100k        | 28.2137          | 32.3574          | 12             |
| pattern         | 0.7156           | 0.7732           | 12             |
| snowsingle      | 18.8349          | 18.5326          | 12             |
| biglittle       | 16.6435          | 31.4728          | 8              |
--------------------------------------------------------------------------
|                 |                  | Total score:     | 68/72          |
--------------------------------------------------------------------------