+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory  |
|        ID   ID                                                               Usage       |
|=========================================================================================|
|  No running processes found                                                              |
+-----------------------------------------------------------------------------------------+
Next comes setting up the CUDA environment. In theory the install script provided by the course (for Ubuntu 22.04) should be enough, but it did nothing for me. I then reinstalled following the official WSL instructions (CUDA Toolkit 12.4) and it still didn't work. Some digging suggested the environment variables simply hadn't been set: I use zsh while the course install script assumes bash, so the variables have to be added by hand. Run nvcc -V to check whether the CUDA compiler is picked up:
~ » nvcc -V                                                          mizukicry@S-Terminal
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
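For reference, the manual step amounts to roughly the following in ~/.zshrc (a sketch; the paths assume a default CUDA 12.4 install under /usr/local/cuda-12.4, so adjust them to wherever your toolkit actually lives):

# added by hand for zsh; the course script only edits the bash profile
export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}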
// saxpyCuda --
//
// This function is regular C code running on the CPU. It allocates
// memory on the GPU using CUDA API functions, uses CUDA API functions
// to transfer data from the CPU's memory address space to GPU memory
// address space, and launches the CUDA kernel function on the GPU.
void saxpyCuda(int N, float alpha, float *xarray, float *yarray, float *resultarray) {

    // must read both input arrays (xarray and yarray) and write to
    // output array (resultarray)
    int totalBytes = sizeof(float) * 3 * N;

    // compute number of blocks and threads per block. In this
    // application we've hardcoded thread blocks to contain 512 CUDA
    // threads.
    const int threadsPerBlock = 512;

    // Notice the round up here. The code needs to compute the number
    // of thread blocks needed such that there is one thread per
    // element of the arrays. This code is written to work for values
    // of N that are not multiples of threadsPerBlock.
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    // These are pointers that will be pointers to memory allocated
    // *on the GPU*. You should allocate these pointers via
    // cudaMalloc. You can access the resulting buffers from CUDA
    // device kernel code (see the kernel function saxpy_kernel()
    // above) but you cannot access the contents of these buffers from
    // this thread. CPU threads cannot issue loads and stores from GPU
    // memory!
    float *device_x = nullptr;
    float *device_y = nullptr;
    float *device_result = nullptr;

    //
    // CS149 TODO: allocate device memory buffers on the GPU using cudaMalloc.
    //
    // We highly recommend taking a look at NVIDIA's
    // tutorial, which clearly walks you through the few lines of code
    // you need to write for this part of the assignment:
    //
    // https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
    //
    cudaMalloc(&device_x, sizeof(float) * N);
    cudaMalloc(&device_y, sizeof(float) * N);
    cudaMalloc(&device_result, sizeof(float) * N);

    // start timing after allocation of device memory
    double startTime = CycleTimer::currentSeconds();

    //
    // CS149 TODO: copy input arrays to the GPU using cudaMemcpy
    //
    cudaMemcpy(device_x, xarray, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, yarray, sizeof(float) * N, cudaMemcpyHostToDevice);

    // run CUDA kernel. (notice the <<< >>> brackets indicating a CUDA
    // kernel launch) Execution on the GPU occurs here.
    saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);

    //
    // CS149 TODO: copy result from GPU back to CPU using cudaMemcpy
    //
    cudaMemcpy(resultarray, device_result, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // end timing after result has been copied back into host memory
    double endTime = CycleTimer::currentSeconds();

    cudaError_t errCode = cudaPeekAtLastError();
    if (errCode != cudaSuccess) {
        fprintf(stderr, "WARNING: A CUDA error occurred: code=%d, %s\n",
                errCode, cudaGetErrorString(errCode));
    }

    double overallDuration = endTime - startTime;
    printf("Effective BW by CUDA saxpy: %.3f ms\t\t[%.3f GB/s]\n",
           1000.f * overallDuration, GBPerSec(totalBytes, overallDuration));

    //
    // CS149 TODO: free memory buffers on the GPU using cudaFree
    //
    cudaFree(device_x);
    cudaFree(device_y);
    cudaFree(device_result);
}
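The device kernel referenced by the launch above isn't shown in the snippet; in the starter code it is essentially a one-line per-element function. A sketch of what it looks like (names taken from the launch above):

__global__ void saxpy_kernel(int N, float alpha, float *x, float *y, float *result) {
    // each CUDA thread computes one output element
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // guard needed because the host rounds the launched thread count
    // up to a multiple of threadsPerBlock
    if (index < N)
        result[index] = alpha * x[index] + y[index];
}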
Part 2: CUDA Warm-Up 2: Parallel Prefix-Sum (10 pts)
The first task is to implement the parallel exclusive prefix sum algorithm from the lecture slides (the exclusive_scan function in scan/scan.cu). Honestly, the slides never really explain this algorithm. It looks like the same idea as building a binary indexed tree: an upsweep that accumulates partial sums in place, followed by a downsweep that pushes the prefixes back down the tree. I didn't fully work it out, so I just translated the pseudocode given in the assignment handout.
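For reference, the behaviour the kernels have to reproduce is simply the serial exclusive scan below (my own sketch, not part of the starter code):

// serial exclusive prefix sum: output[0] = 0, output[i] = input[0] + ... + input[i-1]
void exclusive_scan_serial(int *input, int N, int *output) {
    int sum = 0;
    for (int i = 0; i < N; i++) {
        output[i] = sum;
        sum += input[i];
    }
}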
__global__ void upsweep(int *result, int numThreads, int stride) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId < numThreads) {
        // each active thread adds the sum of the left half of its
        // segment into the segment's last element
        int index = (threadId + 1) * stride - 1;
        result[index] += result[index - (stride >> 1)];
    }
}

__global__ void downsweep(int *result, int numThreads, int stride) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId < numThreads) {
        // swap the left child's partial sum into place and push the
        // running prefix down to the right child
        int index = (threadId + 1) * stride - 1;
        int t = result[index - (stride >> 1)];
        result[index - (stride >> 1)] = result[index];
        result[index] += t;
    }
}
// exclusive_scan --
//
// Implementation of an exclusive scan on global memory array `input`,
// with results placed in global memory `result`.
//
// N is the logical size of the input and output arrays, however
// students can assume that both the start and result arrays were
// allocated with next power-of-two sizes as described by the comments
// in cudaScan(). This is helpful, since your parallel scan
// will likely write to memory locations beyond N, but of course not
// greater than N rounded up to the next power of 2.
//
// Also, as per the comments in cudaScan(), you can implement an
// "in-place" scan, since the timing harness makes a copy of input and
// places it in result
void exclusive_scan(int *input, int N, int *result) {
    N = nextPow2(N);

    // upsweep phase
    for (int stride = 2; stride <= N; stride <<= 1) {
        int numThreads = N / stride;
        int numBlocks = (numThreads + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        upsweep<<<numBlocks, THREADS_PER_BLOCK>>>(result, numThreads, stride);
        cudaDeviceSynchronize();
    }

    // reset the last element to 0
    cudaMemset(&result[N - 1], 0, sizeof(int));
    cudaDeviceSynchronize();

    // downsweep phase
    for (int stride = N; stride >= 2; stride >>= 1) {
        int numThreads = N / stride;
        int numBlocks = (numThreads + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        downsweep<<<numBlocks, THREADS_PER_BLOCK>>>(result, numThreads, stride);
        cudaDeviceSynchronize();
    }
}
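A quick hand trace on a tiny input helped me convince myself the two phases are right (my own example, N = 4, scanning [1, 2, 3, 4]):

// start:                  [1, 2, 3, 4]
// upsweep, stride 2:      [1, 3, 3, 7]    (pairwise sums land in the odd slots)
// upsweep, stride 4:      [1, 3, 3, 10]   (the grand total lands in the last slot)
// zero the last element:  [1, 3, 3, 0]
// downsweep, stride 4:    [1, 0, 3, 3]
// downsweep, stride 2:    [0, 1, 3, 6]    = exclusive prefix sums of the input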
~/codes/CS149/asst3/scan (master*) » ./cudaScan -n 4000000           mizukicry@S-Terminal
---------------------------------------------------------
Found 1 CUDA devices
Device 0: NVIDIA GeForce RTX 4050 Laptop GPU
   SMs:        20
   Global mem: 6140 MB
   CUDA Cap:   8.9
---------------------------------------------------------
Array size: 4000000
Student GPU time: 2.802 ms
Scan outputs are correct!
At first, inside the device kernels I computed index and used it for the out-of-bounds check. During testing, any array larger than 4194304 (2^22) elements would fail. I initially suspected a stack overflow, but lifting the stack limit changed nothing. Tracking it down, the real cause was that for threads whose threadId goes far past the valid range, the index computation overflows int. That one cost me a lot of debugging time.
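Roughly, the difference between the broken and the fixed bounds check looks like this (the "broken" variant is reconstructed from memory, not my exact code):

// Broken variant: every launched thread computes `index` before any check,
// including threads past numThreads (there can be many of them, depending on
// how the launch is sized). At the later levels `stride` is huge, so
// (threadId + 1) * stride exceeds INT_MAX for those extra threads, `index`
// wraps negative, passes the `index < N` test, and the access goes out of bounds.
__global__ void upsweep_broken(int *result, int N, int stride) {
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    int index = (threadId + 1) * stride - 1;   // may overflow int
    if (index < N) {                           // check happens too late
        result[index] += result[index - (stride >> 1)];
    }
}
// The fix is simply to bound threadId first (as in the upsweep/downsweep
// kernels above), so the multiplication never runs for out-of-range threads.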
__global__ void fr_part1(int *input, int *a, int length) {
    // flag positions where an element equals its right neighbour
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    a[index] = (index < length - 1 && input[index] == input[index + 1]) ? 1 : 0;
}

__global__ void fr_part2(int *a, int length, int *output) {
    // after `a` has been exclusive-scanned, a[index] != a[index + 1] means the
    // flag at `index` was 1, and a[index] is its slot in the output array
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < length - 1 && a[index] != a[index + 1]) {
        output[a[index]] = index;
    }
}
// find_repeats --
//
// Given an array of integers `device_input`, returns an array of all
// indices `i` for which `device_input[i] == device_input[i+1]`.
//
// Returns the total number of pairs found
int find_repeats(int *device_input, int length, int *device_output) {
    int N = nextPow2(length);