0

I am working on a kernel that does a vector reduction. It basically adds up all the positions in the vector and stores the result in position 0.

I'm following this scheme, with blocks of 512 float elements:

reduction scheme

The code:

//scratch[] is a vector located in shared memory with all 512 elements
NUM_ELEMENTS = 512;
for( stride=NUM_ELEMENTS/2; stride>=1; stride = stride/2 ) {
  if (threadIdx.x < stride){
    scratch[threadIdx.x] += scratch[threadIdx.x + stride];
  }
  __syncthreads();
}

The odd thing is, I'm expecting to get shared bank conflicts and I'm not. In the first iteration, thread 0 is adding up position 0 and position 256, which reside in the same bank. Thread 1 is adding up position 1 and position 257, and so on.

All of these operations require each thread in the warp to obtain 2 distinct values from the same bank, yet, I get no conflicts whatsoever:

result

What am I missing?

1 Answer 1

3

The calculation for bank conflicts is on a per memory instruction per request basis. The shared load (right hand side) and the shared store (left hand side) are executed as separate instructions many clock cycles apart.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you. I see what you're saying now and it is embarrassingly obvious. Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.