
I have the following code that performs a tiled matrix transpose using shared memory to improve performance. The shared memory tile is padded with one extra column to avoid bank conflicts for a 32x32 thread block.

__global__ void transpose_tiled_padded(float *A, float *B, int n)
{
    // global (row-major) coordinates of the input element this thread reads
    int i_in = blockDim.x*blockIdx.x + threadIdx.x;
    int j_in = blockDim.y*blockIdx.y + threadIdx.y;
    // output coordinates: block indices are swapped to address the transposed tile
    int i_out = blockDim.x*blockIdx.y + threadIdx.x;
    int j_out = blockDim.y*blockIdx.x + threadIdx.y;

    extern __shared__ float tile[];

    // coalesced read of A rows to (padded) shared tile column (transpose)
    tile[threadIdx.y + threadIdx.x*(blockDim.y+1)] = A[i_in + j_in*n];
    __syncthreads();

    // coalesced write from (padded) shared tile column to B rows
    B[i_out + j_out*n] = tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
}

Running this code, I get 100% shared memory efficiency in the NVIDIA visual profiler, as I expect. But, when I run it with a 16x16 thread block, I only get 50% efficiency. Why is that? As far as I can tell, no thread in a warp reads from the same bank with this layout. Or am I mistaken?
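
For reference, a minimal launch of this kernel might look like the sketch below (the matrix size here is just an illustration, not my exact setup). The important part is that the dynamic shared-memory allocation must include the padding column, i.e. blockDim.x*(blockDim.y+1) floats:

int n = 1024;                                 // illustrative n x n matrix size
float *d_A, *d_B;
cudaMalloc(&d_A, n*n*sizeof(float));
cudaMalloc(&d_B, n*n*sizeof(float));

dim3 block(32, 32);                           // or dim3 block(16, 16)
dim3 grid(n/block.x, n/block.y);              // assumes n is divisible by the tile width
size_t smem = block.x*(block.y+1)*sizeof(float);
transpose_tiled_padded<<<grid, block, smem>>>(d_A, d_B, n);
cudaDeviceSynchronize();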

1 Answer


Yes, you are mistaken.

Consider this (read) access for warp 0 in a 16x16 block:

tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     "index"

Here are the relevant calculations for each thread in the warp:

warp lane:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
threadIdx.x:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
threadIdx.y:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
"index":      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
bank:         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  0

So we see that for this warp, the first and the last thread both read from bank 0 (indices 0 and 32). This results in a 2-way bank conflict, 2-way serialization, and 50% efficiency. With a 32x32 block this does not happen: each warp covers exactly one tile row, so the +1 padding places all 32 accesses in distinct banks. With a 16x16 block a warp spans two tile rows, and the wrap-around shown above creates the conflict.
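
If it helps, the same bank arithmetic can be reproduced with a small host-side sketch (this assumes the usual configuration of 32 banks of 4-byte words; the block width and padding are parameters you can change):

#include <stdio.h>

int main(void)
{
    const int bx = 16;            // blockDim.x
    const int pad = 1;            // extra padding column
    const int banks = 32;         // 4-byte-word banks

    int hits[32] = {0};
    for (int lane = 0; lane < 32; ++lane) {            // warp 0 only
        int tx = lane % bx;                            // threadIdx.x
        int ty = lane / bx;                            // threadIdx.y
        int index = tx + ty*(bx + pad);                // same index as the kernel's read
        int bank = index % banks;
        hits[bank]++;
        printf("lane %2d -> index %2d -> bank %2d\n", lane, index, bank);
    }

    int worst = 0;
    for (int b = 0; b < banks; ++b)
        if (hits[b] > worst) worst = hits[b];
    printf("worst case: %d-way conflict\n", worst);    // prints 2 for a 16x16 block, pad 1
    return 0;
}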


4 Comments

I see. So is there a systematic approach to calculating a padding width to ensure no bank conflicts occur for a particular block dimension?
Shouldn't the index for thread x,y=[0,1] be 16? That would mean there are no bank conflicts in that warp lane and, therefore, the conflict must reside somewhere else.
No, that is not correct. It should be 17: blockDim.x is 16, and adding one to it makes it 17. Study the code.
@RobertCrovella you are perfectly correct. My fault for not noticing it. Thank you for the reply.
