Lecture 20
Learning objectives
After this class, you should be able to:
- Determine bottlenecks to thread occupancy and latency hiding that arise from resource constraints on the GPU and from the resource usage of individual threads.
- Optimize the memory performance of CUDA code by enabling coalescing, avoiding shared memory bank conflicts, and making effective use of the constant cache (a brief illustration of the first two points follows this list).
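As a quick, hedged illustration of the coalescing and bank-conflict points (a sketch with made-up kernel names, not course-provided code):

    /* Coalesced: thread k of a warp reads element k, so the warp's
       accesses fall at consecutive addresses and combine into a few
       memory transactions. */
    __global__ void copyCoalesced(const float *in, float *out, int n){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n)
            out[i] = in[i];
    }

    /* Uncoalesced: adjacent threads read addresses `stride` elements
       apart, so one warp's accesses scatter across many memory
       segments and require many separate transactions. */
    __global__ void copyStrided(const float *in, float *out, int n, int stride){
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if(i < n)
            out[i] = in[i];
    }

For shared memory, indexing with threadIdx.x maps consecutive threads to consecutive banks and is conflict-free; strides that share a common factor with the bank count (16 on GPUs of this era, 32 on newer ones) serialize accesses within a warp.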
Reading assignment
- GPU-5 on Blackboard, under the "course library" tab, except sections 5.1 and 5.3.
Exercises and review questions
- Exercises and review questions on the current lecture's material
- Determine the right number of threads per block for your group project based on the criteria discussed in class. Post your answer on the discussion board.
- (Modified from Kirk and Hwu's book) The following kernel is called with N=256 threads per block and M=1024 blocks (a host-side launch sketch follows the kernel listing).
#define N 256  /* threads per block, as stated above */

__global__ void ScalarProd(float *d_A, float *d_B, float *d_C){
    __shared__ float accum[N];
    /* Each block works on its own N-element slice of the inputs. */
    float *A = d_A + N*blockIdx.x;
    float *B = d_B + N*blockIdx.x;
    int tx = threadIdx.x;
    accum[tx] = A[tx]*B[tx];
    /* Tree reduction in shared memory: halve the number of active
       threads each iteration until accum[0] holds the block's sum. */
    for(int i = N/2; i > 0; i >>= 1){
        __syncthreads();
        if(tx < i)
            accum[tx] += accum[i+tx];
    }
    d_C[blockIdx.x] = accum[0];
}
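For concreteness, a minimal host-side launch consistent with the stated configuration might look like the following sketch (the allocation code and variable names are assumptions for illustration; error checking omitted):

    #define M 1024  /* number of blocks, as stated above */

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, M * N * sizeof(float));  /* one N-float slice per block */
    cudaMalloc((void**)&d_B, M * N * sizeof(float));
    cudaMalloc((void**)&d_C, M * sizeof(float));      /* one result per block */
    /* ... copy the inputs into d_A and d_B with cudaMemcpy ... */
    ScalarProd<<<M, N>>>(d_A, d_B, d_C);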
- How many threads are there in total?
- How many threads are there in a warp?
- How many global memory loads and stores are performed per thread?
- Is there any shared memory bank conflict? Explain your answer.
- Are memory accesses necessarily coalesced on gpu.cs.fsu.edu? Explain your answer.
- Assume that in some code with N threads per block (and a large number of blocks), global memory latency is around 200 clock cycles, register usage is 8 registers per thread, and that we have 2 log2(N) instructions between global memory accesses. For what values of N can the memory access latency be hidden? (One simplified way to set up this estimate is sketched below.)
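As a hedged starting point (a simplified model, not the full solution): if each resident warp issues roughly one instruction per cycle, then the 2 log2(N) independent instructions between accesses mean that covering a 200-cycle latency requires on the order of

    W \gtrsim \frac{200}{2 \log_2 N} = \frac{100}{\log_2 N}

warps' worth of other work per multiprocessor, while the 8 registers per thread cap how many threads can be resident at once; that cap depends on the register file size of the specific device, so check it for the GPU you actually use.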
Preparation for the next lecture
- None. Start preparing for the midterm.
Last modified: 9 Apr 2010