Edit with DMGregory's suggestions:
I tried:
- Multiplying the joint matrices by the position vector, then summing the results.
- Pre-multiplying the model matrix with the joint matrices on the CPU.
It looks like this now:
vec4 positionVec4 = vec4(position, 1.0);
vec4 sum =
meshUniform.jointMatrices[joints[0]] * weights[0] * positionVec4 +
meshUniform.jointMatrices[joints[1]] * weights[1] * positionVec4 +
meshUniform.jointMatrices[joints[2]] * weights[2] * positionVec4 +
meshUniform.jointMatrices[joints[3]] * weights[3] * positionVec4;
positionVec4 = sum;
It still takes 5-6 ms to run.
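One detail in the snippet above may be worth checking. GLSL's `*` is left-associative, so `jointMatrices[j] * weights[i] * positionVec4` first scales the whole mat4 by the weight and only then transforms the vector. The sketch below is an equivalent form that transforms first and scales the resulting vec4 instead. It assumes the same `position`, `joints`, `weights`, and `meshUniform.jointMatrices` declarations as the shader above. Whether it changes the timing is untested, and the compiler may already do this rearrangement.

```glsl
// Sketch only: assumes the same inputs as the shader above
// (position, joints, weights, meshUniform.jointMatrices).
vec4 positionVec4 = vec4(position, 1.0);

// Transform first (mat4 * vec4), then scale the vec4 result by the weight,
// rather than scaling the whole mat4 by the weight before the transform.
vec4 sum =
    weights[0] * (meshUniform.jointMatrices[joints[0]] * positionVec4) +
    weights[1] * (meshUniform.jointMatrices[joints[1]] * positionVec4) +
    weights[2] * (meshUniform.jointMatrices[joints[2]] * positionVec4) +
    weights[3] * (meshUniform.jointMatrices[joints[3]] * positionVec4);

positionVec4 = sum;
```

The result is mathematically the same as the original expression; only the grouping of the multiplications differs.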
Someone on the LWJGL forums posted a question similar to mine in 2012:
http://forum.lwjgl.org/index.php?topic=4519.0
In his last message he said:
using a constant as the array index while accessing boneMatrixes
brings performance up
Sure enough, if I remove the joints array lookup from the code above and use constant indices, like this:
vec4 positionVec4 = vec4(position, 1.0);
vec4 sum =
meshUniform.jointMatrices[0] * weights[0] * positionVec4 +
meshUniform.jointMatrices[1] * weights[1] * positionVec4 +
meshUniform.jointMatrices[2] * weights[2] * positionVec4 +
meshUniform.jointMatrices[3] * weights[3] * positionVec4;
positionVec4 = sum;
it renders in 1 ms. But of course the resulting image is not correct.
Maybe this will give some ideas to people more experienced with OpenGL.