I'm trying to copy two arrays from global memory to shared memory:

  • The global arrays are of type double and have 32 elements each.
  • The grid and the blocks are both 1D.
  • The grid dimension is 10000 and the number of threads per block is 32.

Code:

__global__ void kernel_0(double px[], double py[], int N)
{
    int ii,
        jj,tid;
    tid=blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ double s_px[256];
    __shared__ double s_py[256];
    __shared__ double s[256];

    s_px[threadIdx.x]=px[tid];
    s_py[threadIdx.x]=py[tid];
    s[threadIdx.x]=py[tid];
    __syncthreads();
}

int main (int argc, char *argv[]){
    double *px, *py , *x, *y, PI, step, *d_x, *d_y,*d_px, *d_py,sharedMemSize;
    int N, Nx, ii;
    PI = 4*atan(1.0);
    Nx = 10000; 
    N = 32; 
    
    px = (double *) malloc(N*sizeof(double));
    py = (double *) malloc(N*sizeof(double));

    // lookup table: sin // from 0 to PI 
    step = 1.0 / (N-1);
    for (ii = 0; ii < N; ii++){ 
        px[ii] = ii*step*PI;
        py[ii] = sin(px[ii]);
    }   

    cudaMalloc( (void **) &d_px, N*sizeof(double) );
    cudaMalloc( (void **) &d_py, N*sizeof(double) );        

    cudaMemcpy( d_px, px, N*sizeof(double), cudaMemcpyHostToDevice );
    cudaMemcpy( d_py, py, N*sizeof(double), cudaMemcpyHostToDevice );
        
    dim3 dimGrid(Nx);
    dim3 dimBlock(N,1,1);
    kernel_0<<< dimGrid, dimBlock>>>(px, py, N);
}

It compiles, but cuda-memcheck reports many errors:

========= Invalid __global__ read of size 8
=========     at 0x00000058 in kernel_0
=========     by thread (31,0,0) in block (6,0,0)
=========     Address 0x11e0db38 is out of bounds
=========
========= ERROR SUMMARY: 96 errors

Can you help me?

  • This kernel will not run as shown here. Please post the actual kernel you are using. Commented Jan 11, 2013 at 22:18
  • Please also include your allocation of the global arrays and the kernel invocation. Probably best if you post a complete, simple, compilable example that demonstrates the issue. Commented Jan 11, 2013 at 22:31
  • I would bet that tid is being used in one of the reads of either px or py, but you would have to show your actual code to confirm it. Commented Jan 11, 2013 at 22:50

1 Answer

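From what I can see, the device arrays (d_px, d_py) are allocated with only 32*sizeof(double) bytes each, yet the kernel is launched with 10000 blocks of 32 threads.

Device (global) memory is shared by all blocks; only shared memory is private to each block. The kernel indexes px and py with tid = blockIdx.x*blockDim.x + threadIdx.x, so every block with blockIdx.x >= 1 reads past the end of the 32-element arrays and should produce an invalid memory access. For example, thread (31,0,0) of block (6,0,0) from the report computes tid = 6*32 + 31 = 223, far beyond the last valid index, 31.

Moreover, the kernel launch passes the host pointers px and py; it should pass the device pointers d_px and d_py.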
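
Both fixes can be put together in a minimal sketch. It assumes the intent is for every block to load the same 32-element lookup table into its own shared memory, which the question does not state explicitly. TABLE_SIZE, the CHECK macro, and the block_sum output array are hypothetical additions; the output exists only so the result can be verified on the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

#define TABLE_SIZE 32  // assumption: table length equals the block size

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Every block copies the same n-element table (n <= TABLE_SIZE) into shared memory.
__global__ void kernel_0(const double *px, const double *py, int n, double *block_sum)
{
    __shared__ double s_px[TABLE_SIZE];
    __shared__ double s_py[TABLE_SIZE];

    int i = threadIdx.x;            // index within the block, not the global tid
    if (i < n) {                    // guard against reading past the table
        s_px[i] = px[i];
        s_py[i] = py[i];
    }
    __syncthreads();

    // Thread 0 sums the shared copies so the host can check them.
    if (i == 0) {
        double sum = 0.0;
        for (int k = 0; k < n; k++) sum += s_px[k] + s_py[k];
        block_sum[blockIdx.x] = sum;
    }
}

int main()
{
    const int N = TABLE_SIZE, Nx = 10000;
    const double PI = 4.0 * atan(1.0);
    double px[TABLE_SIZE], py[TABLE_SIZE], expected = 0.0;

    // Lookup table: sin on [0, PI], as in the question.
    for (int i = 0; i < N; i++) {
        px[i] = i * PI / (N - 1);
        py[i] = sin(px[i]);
        expected += px[i] + py[i];
    }

    double *d_px, *d_py, *d_sum;
    CHECK(cudaMalloc((void **)&d_px, N * sizeof(double)));
    CHECK(cudaMalloc((void **)&d_py, N * sizeof(double)));
    CHECK(cudaMalloc((void **)&d_sum, Nx * sizeof(double)));
    CHECK(cudaMemcpy(d_px, px, N * sizeof(double), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_py, py, N * sizeof(double), cudaMemcpyHostToDevice));

    kernel_0<<<Nx, N>>>(d_px, d_py, N, d_sum);   // device pointers, not px/py
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    double *sums = (double *)malloc(Nx * sizeof(double));
    CHECK(cudaMemcpy(sums, d_sum, Nx * sizeof(double), cudaMemcpyDeviceToHost));

    int bad = 0;
    for (int b = 0; b < Nx; b++)
        if (fabs(sums[b] - expected) > 1e-9) bad++;
    printf("%d of %d blocks produced an unexpected sum\n", bad, Nx);

    free(sums);
    CHECK(cudaFree(d_px)); CHECK(cudaFree(d_py)); CHECK(cudaFree(d_sum));
    return bad != 0;
}
```

If each block is instead meant to work on its own slice of a larger array, the tid indexing can stay as it was, but the device arrays would then need Nx*N elements and a bounds check against that total.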
