Revisions to CUDA 4.1 Particle Update

spelling (tessellate)

edited Jul 5, 2012 at 13:53

Second approach: re-tessellating

As I said earlier, it might be wise to tessellate further inside your kernel. Your GPU bus is 128 bits wide, so reading 4 floats (4*32 bits = 128 bits) per thread is about as cheap as reading just one float. You might have to adjust this further if you end up using the ParticleType array; just try to keep your data 128-bit aligned.
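One way to get those 128-bit reads is to treat the arrays as float4, so each thread moves 4 floats with a single load and a single store. This is only a minimal sketch under some assumptions: UpdateParticle4 and UpdatePositions4 are made-up names, the number of floats is assumed to be a multiple of 4, and scaledTime is assumed to already hold frameTime * 0.001f.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant of UpdateParticle: each thread handles one float4,
// i.e. 4 consecutive floats moved with a single 128-bit load and store.
// scaledTime is assumed to be frameTime * 0.001f, computed by the caller.
__global__ void UpdateParticle4(float4* position, const float4* velocity,
                                float scaledTime, int numVec4)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < numVec4) {             // numVec4 = number of float4 elements
        float4 p = position[idx];    // 16-byte load
        float4 v = velocity[idx];    // 16-byte load
        p.x -= v.x * scaledTime;
        p.y -= v.y * scaledTime;
        p.z -= v.z * scaledTime;
        p.w -= v.w * scaledTime;
        position[idx] = p;           // 16-byte store
    }
}

// Hypothetical host helper, only here to make the sketch complete.
// Assumes pos.size() == vel.size() and that the size is a multiple of 4.
// Error checking is omitted for brevity.
void UpdatePositions4(std::vector<float>& pos, const std::vector<float>& vel,
                      float scaledTime)
{
    int numVec4 = static_cast<int>(pos.size() / 4);
    if (numVec4 == 0) return;
    size_t bytes = static_cast<size_t>(numVec4) * sizeof(float4);

    float4* dPos = 0;
    float4* dVel = 0;
    cudaMalloc(&dPos, bytes);   // cudaMalloc memory is suitably aligned for float4
    cudaMalloc(&dVel, bytes);
    cudaMemcpy(dPos, pos.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dVel, vel.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                               // assumed block size
    int blocks = (numVec4 + threads - 1) / threads;  // round up
    UpdateParticle4<<<blocks, threads>>>(dPos, dVel, scaledTime, numVec4);

    // Blocking copy on the default stream, so it also waits for the kernel.
    cudaMemcpy(pos.data(), dPos, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dPos);
    cudaFree(dVel);
}
```

If your particle count is not a multiple of 4, you would need to pad the arrays or handle the tail elements separately.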

added 20 characters in body

edited Mar 25, 2012 at 0:11

kaoD

__global__ void UpdateParticle(float* position, float* velocity, float frameTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int base = idx * 4; // each thread owns 4 consecutive elements

    if (idx < numParticles/4) { // the unrolled loop is 4 iterations
        position[base] = position[base] - velocity[base] * frameTime * 0.001f;
        position[base+1] = position[base+1] - velocity[base+1] * frameTime * 0.001f;
        position[base+2] = position[base+2] - velocity[base+2] * frameTime * 0.001f;
        position[base+3] = position[base+3] - velocity[base+3] * frameTime * 0.001f;
        // ... some more updates
    }
}

added 120 characters in body

edited Mar 24, 2012 at 23:56

kaoD

You can safely overcome that problem by checking that the index is valid. This is your complete kernel:

__global__ void UpdateParticle(float* position, float* velocity, float frameTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Compute the index

    if (idx < numParticles) { // Is this index valid?
        position[idx] = position[idx] - velocity[idx] * frameTime * 0.001f;
        // ... some more updates
    }
}

You might also want to precompute the frameTime * 0.001f bit in a register before anything else (just do float realTime = frameTime * 0.001f and use it instead) or, even better, pass it already transformed from host code. It won't be a problem for such a small number of operations, but registers are a limited resource shared by all the blocks resident on a multiprocessor, so registers (local variables inside your kernel, like idx in my examples, normally live there) can be a bottleneck too. Bear it in mind!
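To make the host-side version concrete, here is a rough sketch of scaling the time once on the host and rounding the grid size up so the index check in the kernel covers the last, partially filled block. UpdateParticleScaled and LaunchUpdate are hypothetical names, the block size of 256 is an arbitrary assumption, and dPosition and dVelocity are assumed to be device pointers holding at least numParticles floats.

```cuda
#include <cuda_runtime.h>

// Same per-particle update as the kernel in this answer, but the time
// factor arrives already scaled, so no thread has to recompute it.
__global__ void UpdateParticleScaled(float* position, const float* velocity,
                                     float realTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Compute the index

    if (idx < numParticles) { // Is this index valid?
        position[idx] = position[idx] - velocity[idx] * realTime;
    }
}

// Hypothetical launcher. dPosition and dVelocity are assumed to be device
// pointers with at least numParticles floats each.
void LaunchUpdate(float* dPosition, const float* dVelocity,
                  float frameTime, int numParticles)
{
    if (numParticles <= 0) return;

    const int threadsPerBlock = 256; // assumed block size
    // Round up: the last block may contain threads past the end of the
    // arrays, which is why the kernel checks idx < numParticles.
    int blocks = (numParticles + threadsPerBlock - 1) / threadsPerBlock;

    float realTime = frameTime * 0.001f; // computed once, on the host

    UpdateParticleScaled<<<blocks, threadsPerBlock>>>(dPosition, dVelocity,
                                                      realTime, numParticles);
}
```

With 1000 particles this launches 4 blocks of 256 threads, and the 24 surplus threads in the last block simply fail the index check and do nothing.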

added 701 characters in body

edited Mar 24, 2012 at 23:50

kaoD

added 701 characters in body

edited Mar 24, 2012 at 23:42

kaoD

added 701 characters in body

edited Mar 24, 2012 at 23:26

kaoD

added 701 characters in body

edited Mar 24, 2012 at 23:20

kaoD

added 1892 characters in body

edited Mar 24, 2012 at 23:12

kaoD

answered Mar 24, 2012 at 23:02

kaoD
