
Second approach: re-tessellating

As I said earlier, it might be wise to tessellate further inside your kernel. Your GPU bus is 128 bits, so reading 4 floats (4*32 bits = 128 bits) per thread is just as cheap as reading just one float. You might have to adjust this further if you end up using the ParticleType array. Just try to keep your data 128-bit aligned.
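One way to make the 128-bit reads explicit is CUDA's built-in float4 type. What follows is only a sketch under a few assumptions: UpdateParticle4 and LaunchUpdateParticle4 are illustrative names, the particle count is a multiple of 4, the buffers come straight from cudaMalloc (so they are suitably aligned for float4 access), and scaledTime is frameTime already multiplied by 0.001f.

```cuda
#include <cuda_runtime.h>

// Illustrative variant: each thread moves 4 floats (128 bits) per access.
// Assumes the arrays hold a multiple of 4 floats and are 16-byte aligned.
__global__ void UpdateParticle4(float4* position, const float4* velocity,
                                float scaledTime, int numVec4)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < numVec4) {
        float4 p = position[idx]; // one 128-bit load
        float4 v = velocity[idx]; // one 128-bit load
        p.x -= v.x * scaledTime;
        p.y -= v.y * scaledTime;
        p.z -= v.z * scaledTime;
        p.w -= v.w * scaledTime;
        position[idx] = p;        // one 128-bit store
    }
}

// Hypothetical host helper: views the float buffers as float4 and launches.
// Any leftover particles (numParticles % 4) are NOT processed by this sketch.
void LaunchUpdateParticle4(float* dPosition, const float* dVelocity,
                           float scaledTime, int numParticles)
{
    int numVec4 = numParticles / 4;
    if (numVec4 <= 0) return;

    const int threadsPerBlock = 256;
    int blocks = (numVec4 + threadsPerBlock - 1) / threadsPerBlock;

    UpdateParticle4<<<blocks, threadsPerBlock>>>(
        reinterpret_cast<float4*>(dPosition),
        reinterpret_cast<const float4*>(dVelocity),
        scaledTime, numVec4);
}
```

If you later interleave other data (such as the ParticleType array) with the positions, the cast is no longer valid as-is and the layout has to be revisited.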

__global__ void UpdateParticle(float* position, float* velocity, float frameTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < numParticles/4) { // the unrolled loop is 4 iterations
        int i = idx * 4; // first of this thread's 4 particles, so neighbouring threads don't overlap
        position[i] = position[i] - velocity[i] * frameTime * 0.001f;
        position[i+1] = position[i+1] - velocity[i+1] * frameTime * 0.001f;
        position[i+2] = position[i+2] - velocity[i+2] * frameTime * 0.001f;
        position[i+3] = position[i+3] - velocity[i+3] * frameTime * 0.001f;
        ... // some more updates
    }
}

You can safely overcome that problem by checking that the index is valid. This is your complete kernel:

__global__ void UpdateParticle(float* position, float* velocity, float frameTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Compute the index

    if (idx < numParticles) { // Is this index valid?
        position[idx] = position[idx] - velocity[idx] * frameTime * 0.001f;
        ... // some more updates
    }
}

You might also want to precompute the frameTime * 0.001f factor once before anything else (just do float realTime = frameTime * 0.001f and use it instead) or, even better, pass it already scaled from host code. It won't matter for such a small number of operations, but keep in mind that registers are a limited resource shared by all the blocks resident on a multiprocessor, so registers (any local variable declared without a memory-space qualifier inside your kernel, like idx in my examples) can become a bottleneck too.
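To show why the index check matters and what passing the value already scaled from host code could look like, here is a rough host-side sketch. The names UpdateParticleScaled and LaunchUpdateParticle and the block size of 256 are my own assumptions, not something from your code.

```cuda
#include <cuda_runtime.h>

// Same update as above, but the time factor arrives already scaled.
__global__ void UpdateParticleScaled(float* position, const float* velocity,
                                     float scaledTime, int numParticles)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid is rounded up to whole blocks, so the last block may contain
    // threads whose idx is past the end of the arrays.
    if (idx < numParticles) {
        position[idx] -= velocity[idx] * scaledTime;
    }
}

// Hypothetical host wrapper: scales frameTime once and rounds the grid up.
void LaunchUpdateParticle(float* dPosition, const float* dVelocity,
                          float frameTime, int numParticles)
{
    if (numParticles <= 0) return;

    const int threadsPerBlock = 256; // assumed; tune for your GPU
    const int blocks = (numParticles + threadsPerBlock - 1) / threadsPerBlock;
    const float scaledTime = frameTime * 0.001f; // computed once on the host

    UpdateParticleScaled<<<blocks, threadsPerBlock>>>(
        dPosition, dVelocity, scaledTime, numParticles);
}
```

With 1000 particles this launches 4 blocks of 256 threads, so 24 threads fail the check and do nothing, which is exactly the case the if guards against.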
