CUDA: copy array from global memory to shared memory

Question

I'm trying to copy two arrays from global memory to shared memory.

Both global arrays are of type double and have 32 elements each.
The grid is 1D and the blocks are 1D.
The grid dimension is 10000 and the number of threads per block is 32.

Code:

__global__ void kernel_0(double px[], double py[], int N)
{
    int ii,
        jj,tid;
    tid=blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ double s_px[256];
    __shared__ double s_py[256];
    __shared__ double s[256];

    s_px[threadIdx.x]=px[tid];
    s_py[threadIdx.x]=py[tid];
    s[threadIdx.x]=py[tid];
    __syncthreads();
}

int main (int argc, char *argv[]){
    double *px, *py , *x, *y, PI, step, *d_x, *d_y,*d_px, *d_py,sharedMemSize;
    int N, Nx, ii;
    PI = 4*atan(1.0);
    Nx = 10000; 
    N = 32; 
    
    px = (double *) malloc(N*sizeof(double));
    py = (double *) malloc(N*sizeof(double));

    // lookup table: sin // from 0 to PI 
    step = 1.0 / (N-1);
    for (ii = 0; ii < N; ii++){ 
        px[ii] = ii*step*PI;
        py[ii] = sin(px[ii]);
    }   

    cudaMalloc( (void **) &d_px, N*sizeof(double) );
    cudaMalloc( (void **) &d_py, N*sizeof(double) );        

    cudaMemcpy( d_px, px, N*sizeof(double), cudaMemcpyHostToDevice );
    cudaMemcpy( d_py, py, N*sizeof(double), cudaMemcpyHostToDevice );
        
    dim3 dimGrid(Nx);
    dim3 dimBlock(N,1,1);
    kernel_0<<< dimGrid, dimBlock>>>(px, py, N);
}

It compiles, but cuda-memcheck reports many errors:

========= Invalid __global__ read of size 8
=========     at 0x00000058 in kernel_0
=========     by thread (31,0,0) in block (6,0,0)
=========     Address 0x11e0db38 is out of bounds
=========
========= ERROR SUMMARY: 96 errors

Can you help me?

This kernel will not run as shown here. Please post the actual kernel you are using.
– Pavan Yalamanchili, Commented Jan 11, 2013 at 22:18
Please also include your allocation of the global arrays and the kernel invocation. Probably best if you post a complete, simple, compilable example that demonstrates the issue.
– Robert Crovella, Commented Jan 11, 2013 at 22:31
I would bet that tid is being used in one of the reads of either px or py, but you would have to show your actual code to confirm it.
– talonmies, Commented Jan 11, 2013 at 22:50

Accepted Answer

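From what I can see, the memory allocated for the device pointers (d_px, d_py) is 32*sizeof(double) bytes, but the kernel is launched with 10000 blocks, and every thread reads px[tid] and py[tid] with tid = blockIdx.x*blockDim.x + threadIdx.x.

Device (global) memory is shared by all blocks; only shared memory is allocated per block.
Therefore, for blockIdx.x >= 1, tid is 32 or greater and the read falls outside the 32-element arrays, which is an invalid memory access.

Moreover, the kernel launch should pass the device pointers d_px and d_py, not the host pointers px and py.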
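
The two fixes described above can be sketched as follows. This is only a minimal illustration, not the asker's final code: the names copy_table, run_copy_table, TABLE_SIZE, CUDA_TRY and the mismatches counter are made up for this example, and the sketch assumes every block needs the same 32-entry lookup table, so the global read is indexed by threadIdx.x rather than by the global thread id.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

#define TABLE_SIZE 32  // assumption: same as N in the question

// Hypothetical helper: return -1 from the enclosing function if a CUDA call fails.
// (For brevity, device buffers are not freed on the error path.)
#define CUDA_TRY(call)                                                      \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));  \
            return -1;                                                      \
        }                                                                   \
    } while (0)

// Every block copies the same n-entry table into its own shared memory.
// The global arrays hold only n elements, so they are indexed by
// threadIdx.x; blockIdx.x must not contribute to the index.
__global__ void copy_table(const double *px, const double *py, int n,
                           int *mismatches)
{
    __shared__ double s_px[TABLE_SIZE];
    __shared__ double s_py[TABLE_SIZE];

    int t = threadIdx.x;
    if (t < n) {
        s_px[t] = px[t];
        s_py[t] = py[t];
    }
    __syncthreads();  // the whole table is visible to the block after this

    // Illustrative check only: read a slot written by another thread.
    if (t < n) {
        int other = (t + 1) % n;
        if (s_px[other] != px[other] || s_py[other] != py[other])
            atomicAdd(mismatches, 1);
    }
}

// Builds the sine table on the host, copies it to the device, and launches
// num_blocks blocks of TABLE_SIZE threads. Returns the number of mismatches
// counted by the kernel, or -1 if a CUDA call failed.
int run_copy_table(int num_blocks)
{
    const int n = TABLE_SIZE;
    const double PI = 4.0 * atan(1.0);
    double h_px[TABLE_SIZE], h_py[TABLE_SIZE];
    for (int i = 0; i < n; ++i) {
        h_px[i] = i * PI / (n - 1);
        h_py[i] = sin(h_px[i]);
    }

    double *d_px = nullptr, *d_py = nullptr;
    int *d_mismatches = nullptr;
    int h_mismatches = 0;

    CUDA_TRY(cudaMalloc(&d_px, n * sizeof(double)));
    CUDA_TRY(cudaMalloc(&d_py, n * sizeof(double)));
    CUDA_TRY(cudaMalloc(&d_mismatches, sizeof(int)));
    CUDA_TRY(cudaMemcpy(d_px, h_px, n * sizeof(double), cudaMemcpyHostToDevice));
    CUDA_TRY(cudaMemcpy(d_py, h_py, n * sizeof(double), cudaMemcpyHostToDevice));
    CUDA_TRY(cudaMemset(d_mismatches, 0, sizeof(int)));

    // Pass the device pointers (d_px, d_py), not the host arrays.
    copy_table<<<num_blocks, n>>>(d_px, d_py, n, d_mismatches);
    CUDA_TRY(cudaGetLastError());       // launch errors
    CUDA_TRY(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_TRY(cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int),
                        cudaMemcpyDeviceToHost));
    cudaFree(d_px);
    cudaFree(d_py);
    cudaFree(d_mismatches);
    return h_mismatches;
}
```

Calling run_copy_table(10000) from main mirrors the launch configuration in the question. If each block is instead meant to process its own 32-element slice, the alternative would be to keep the tid indexing and allocate Nx*N elements for each global array.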

answered Jan 12, 2013 at 2:24 by Bharat
edited May 11, 2023 at 15:06 by paleonix
