(Follow-up question to: Compile-time ceiling function, for literals, in C?)
Consider the following CUDA function:
__device__ int foo_f() { return ceilf(1007.1111); }
It should be easy to optimize this to produce a device function which simply returns 1008:
mov.u32 %r1, 1008;
st.param.b32 [func_retval0+0], %r1;
ret;
but instead, it compiles (with NVCC 11.5) into the costlier sequence:
mov.f32 %f1, 0f447C0000;
cvt.rzi.s32.f32 %r1, %f1;
st.param.b32 [func_retval0+0], %r1;
ret;
The optimization is also missed if the code is:
static __device__ int poor_mans_ceilf(float x)
{
    return (int) x + ( ((float)(int) x < x) ? 1 : 0);
}
__device__ int foo_pf() { return poor_mans_ceilf(1007.1111); }
which should be even easier for the compiler to "notice" as an optimization opportunity.
So, why does NVCC fail to make this optimization (while typical C/C++ compilers do)? Is there some subtle hitch preventing the optimization in (edit:) the PTX code? I realize that ptxas has its own chance of optimizing this away eventually, but this is not a microarchitecture-specific optimization.
See it all on GodBolt.
PS: I know that this might be circumvented by using constexpr.
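For reference, a minimal sketch of such a constexpr workaround (the names ceil_to_int and foo_cx are illustrative, not from the original post) might look like:
__host__ __device__ constexpr int ceil_to_int(float x)
{
    // truncate, then add 1 if truncation dropped a fractional part
    return (int) x + ( ((float)(int) x < x) ? 1 : 0);
}

__device__ int foo_cx()
{
    constexpr int r = ceil_to_int(1007.1111f);  // evaluated by the front end
    return r;                                   // should lower to a plain mov of 1008
}
Assigning the result to a constexpr variable forces compile-time evaluation, so the emitted PTX should not contain any cvt instruction at all.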
That optimization is performed by ptxas or, alternatively, by the JIT mechanism. Analyzing PTX gives an incomplete picture of what nvcc will do, especially when your question pertains to device code. Study the SASS code, then see if you can find any evidence of 0x3f0 (decimal 1008) there. You might ask yourself at that point why that integer constant is there. You asked specifically about "in device-side code"; here is what I would say: PTX is not device-side code.
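For example, assuming a recent CUDA toolkit and a target such as sm_70 (both arbitrary choices here), the SASS can be dumped with something like:
nvcc -arch=sm_70 -cubin foo.cu -o foo.cubin   # compile the device code down to machine code
cuobjdump -sass foo.cubin                     # disassemble and look for 0x3f0 (i.e. 1008)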