Lecture 20
Learning objectives
After this class, you should be able to:
- Determine bottlenecks to thread occupancy and latency hiding that arise from resource constraints on the GPU and from the resource usage of individual threads.
- Optimize the memory performance of CUDA code by enabling coalescing, avoiding shared memory bank conflicts, and making effective use of the constant cache (a brief illustration of the first two points follows this list).
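As a quick, hedged illustration of the coalescing and bank-conflict points (a sketch with made-up kernel names, not course-provided code):

    /* Coalesced: thread k of a warp reads element k, so the warp's
       accesses fall at consecutive addresses and combine into a few
       memory transactions. */
    __global__ void copyCoalesced(const float *in, float *out, int n){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n)
            out[i] = in[i];
    }

    /* Uncoalesced: adjacent threads read addresses `stride` elements
       apart, so one warp's accesses scatter across many memory
       segments and require many separate transactions. */
    __global__ void copyStrided(const float *in, float *out, int n, int stride){
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if(i < n)
            out[i] = in[i];
    }

For shared memory, indexing with threadIdx.x maps consecutive threads to consecutive banks and is conflict-free; strides that share a common factor with the bank count (16 on GPUs of this era, 32 on newer ones) serialize accesses within a warp.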
Reading assignment
- GPU-5 on Blackboard, under the "course library" tab, except sections 5.1 and 5.3.
Exercises and review questions
- Exercises and review questions on the current lecture's material
- Determine the right number of threads per block for your group project based on the criteria discussed in class. Post your answer on the discussion board.
- (Modified from Kirk and Hwu's book) The following kernel is called with N=256 threads per block and M=1024 blocks (a host-side launch sketch follows the kernel listing).
#define N 256  /* threads per block, as stated above */

__global__ void ScalarProd(float *d_A, float *d_B, float *d_C){
    __shared__ float accum[N];
    /* Each block works on its own N-element slice of the inputs. */
    float *A = d_A + N*blockIdx.x;
    float *B = d_B + N*blockIdx.x;
    int tx = threadIdx.x;
    accum[tx] = A[tx]*B[tx];
    /* Tree reduction in shared memory: halve the number of active
       threads each iteration until accum[0] holds the block's sum. */
    for(int i = N/2; i > 0; i >>= 1){
        __syncthreads();
        if(tx < i)
            accum[tx] += accum[i+tx];
    }
    d_C[blockIdx.x] = accum[0];
}
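For concreteness, a minimal host-side launch consistent with the stated configuration might look like the following sketch (the allocation code and variable names are assumptions for illustration; error checking omitted):

    #define M 1024  /* number of blocks, as stated above */

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, M * N * sizeof(float));  /* one N-float slice per block */
    cudaMalloc((void**)&d_B, M * N * sizeof(float));
    cudaMalloc((void**)&d_C, M * sizeof(float));      /* one result per block */
    /* ... copy the inputs into d_A and d_B with cudaMemcpy ... */
    ScalarProd<<<M, N>>>(d_A, d_B, d_C);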
- How many threads are there in total?
- How many threads are there in a warp?
- How many global memory loads and stores are performed per thread?
- Is there any shared memory bank conflict? Explain your answer.
- Are memory accesses necessarily coalesced on gpu.cs.fsu.edu? Explain your answer.
- Assume that in some code with N threads per block (and a large number of blocks), global memory latency is around 200 clock cycles, register usage is 8 registers per thread, and that we have 2 log2(N) instructions between global memory accesses. For what values of N can the memory access latency be hidden? (One simplified way to set up this estimate is sketched below.)
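As a hedged starting point (a simplified model, not the full solution): if each resident warp issues roughly one instruction per cycle, then the 2 log2(N) independent instructions between accesses mean that covering a 200-cycle latency requires on the order of

    W \gtrsim \frac{200}{2 \log_2 N} = \frac{100}{\log_2 N}

warps' worth of other work per multiprocessor, while the 8 registers per thread cap how many threads can be resident at once; that cap depends on the register file size of the specific device, so check it for the GPU you actually use.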
Preparation for the next lecture
- None. Start preparing for the midterm.
Last modified: 9 Apr 2010