
I have the following code that performs a tiled matrix transpose using shared memory to improve performance. The shared memory tile is padded with one extra column to avoid bank conflicts for a 32x32 thread block.

__global__ void transpose_tiled_padded(float *A, float *B, int n)
{
    // global (row-major) coordinates of the input element this thread reads
    int i_in = blockDim.x*blockIdx.x + threadIdx.x;
    int j_in = blockDim.y*blockIdx.y + threadIdx.y;
    // output coordinates: block indices are swapped to address the transposed tile
    int i_out = blockDim.x*blockIdx.y + threadIdx.x;
    int j_out = blockDim.y*blockIdx.x + threadIdx.y;

    extern __shared__ float tile[];

    // coalesced read of A rows to (padded) shared tile column (transpose)
    tile[threadIdx.y + threadIdx.x*(blockDim.y+1)] = A[i_in + j_in*n];
    __syncthreads();

    // coalesced write from (padded) shared tile column to B rows
    B[i_out + j_out*n] = tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
}

Running this code, I get 100% shared memory efficiency in the NVIDIA visual profiler, as I expect. But, when I run it with a 16x16 thread block, I only get 50% efficiency. Why is that? As far as I can tell, no thread in a warp reads from the same bank with this layout. Or am I mistaken?
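
For reference, a minimal launch of this kernel might look like the sketch below (the matrix size here is just an illustration, not my exact setup). The important part is that the dynamic shared-memory allocation must include the padding column, i.e. blockDim.x*(blockDim.y+1) floats:

int n = 1024;                                 // illustrative n x n matrix size
float *d_A, *d_B;
cudaMalloc(&d_A, n*n*sizeof(float));
cudaMalloc(&d_B, n*n*sizeof(float));

dim3 block(32, 32);                           // or dim3 block(16, 16)
dim3 grid(n/block.x, n/block.y);              // assumes n is divisible by the tile width
size_t smem = block.x*(block.y+1)*sizeof(float);
transpose_tiled_padded<<<grid, block, smem>>>(d_A, d_B, n);
cudaDeviceSynchronize();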

1 Answer


Yes, you are mistaken.

Consider this (read) access for warp 0 in a 16x16 block:

tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                     "index"

Here are the relevant calculations for each thread in the warp:

warp lane:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
threadIdx.x:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
threadIdx.y:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
"index":      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
bank:         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  0

So we see that for this warp, the first and the last thread both read from bank 0 (indices 0 and 32). This results in a 2-way bank conflict, 2-way serialization, and 50% efficiency. With a 32x32 block this does not happen: each warp covers exactly one tile row, so the +1 padding places all 32 accesses in distinct banks. With a 16x16 block a warp spans two tile rows, and the wrap-around shown above creates the conflict.
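
If it helps, the same bank arithmetic can be reproduced with a small host-side sketch (this assumes the usual configuration of 32 banks of 4-byte words; the block width and padding are parameters you can change):

#include <stdio.h>

int main(void)
{
    const int bx = 16;            // blockDim.x
    const int pad = 1;            // extra padding column
    const int banks = 32;         // 4-byte-word banks

    int hits[32] = {0};
    for (int lane = 0; lane < 32; ++lane) {            // warp 0 only
        int tx = lane % bx;                            // threadIdx.x
        int ty = lane / bx;                            // threadIdx.y
        int index = tx + ty*(bx + pad);                // same index as the kernel's read
        int bank = index % banks;
        hits[bank]++;
        printf("lane %2d -> index %2d -> bank %2d\n", lane, index, bank);
    }

    int worst = 0;
    for (int b = 0; b < banks; ++b)
        if (hits[b] > worst) worst = hits[b];
    printf("worst case: %d-way conflict\n", worst);    // prints 2 for a 16x16 block, pad 1
    return 0;
}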


4 Comments

I see. So is there a systematic approach to calculating a padding width to ensure no bank conflicts occur for a particular block dimension?
Shouldn't the index for thread x,y=[0,1] be 16? That would mean there are no bank conflicts in that warp lane and, therefore, the conflict must reside somewhere else.
No, that is not correct. It should be 17: blockDim.x is 16, and adding one to it makes it 17. Study the code.
@RobertCrovella you are perfectly correct. My fault for not noticing it. Thank you for the reply.
