While conceptually, ldexpf(a,b) is just a multiplication a * 2^b, this does not mean it can actually be implemented as a simple floating-point multiplication in all cases. Consider the case ldexpf(-0x1.8p-143f, 270) = -0x1.8p+127, which would require multiplying by 2^270, which is not representable as a float mapped to IEEE-754 binary32 format.
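A small host-side C illustration of the problem: scaling the subnormal input by the full factor in a single multiplication fails because the factor itself overflows binary32:

#include <assert.h>
#include <math.h>

int main (void)
{
    float a = -0x1.8p-143f;            /* subnormal input */
    float exact = ldexpf (a, 270);     /* -0x1.8p+127, a finite float */
    float naive = a * (float)0x1p270;  /* 2^270 overflows binary32 to +inf ... */
    assert (exact == -0x1.8p+127f);
    assert (isinf (naive));            /* ... so the product is -inf, not -0x1.8p+127 */
    return 0;
}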
This means that the actual computation performed by ldexpf() for the general case is more complicated. In particular, between one and three multiplications are required to implement the scaling, and at most one of these multiplications may involve actual rounding. It is also required to correctly handle the special cases called out by the relevant language standard, here ISO C++, which in turn inherits them from ISO C99 (the C99 math library was imported wholesale into C++ as of C++11).
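To make the "up to three multiplications" concrete, here is one hypothetical decomposition for the example above (not the actual library code): splitting 2^270 into three representable factors keeps every intermediate exact.

#include <assert.h>
#include <math.h>

int main (void)
{
    float a = -0x1.8p-143f;
    /* 270 = 127 + 127 + 16; each partial scale factor fits in binary32 */
    float r = a * 0x1p127f;   /* -0x1.8p-16,  exact */
    r = r * 0x1p127f;         /* -0x1.8p+111, exact */
    r = r * 0x1p16f;          /* -0x1.8p+127, exact */
    assert (r == ldexpf (a, 270));
    return 0;
}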
In CUDA, the functionality of the standard math library is provided via PTX-level templates. The NVVM part of the CUDA compiler, which generates PTX, pulls these templates out of a library. PTX is an intermediate representation that is compiled by the optimizing compiler ptxas into machine code (SASS) for the target architecture(s) specified by the programmer.
ptxas performs many standard optimizations such as constant propagation and dead-code elimination, and at present (CUDA 12.8.1) it is able to eliminate most of the code from the ldexpf() template when a compile-time constant second argument is provided, but not all of it. For an sm_89 target, the generated SASS looks something like this:
FSETP.NEU.AND P0, PT, |R0|, +INF , PT
FSETP.EQ.OR P0, PT, |R0|, RZ, !P0
FSETP.GT.OR P1, PT, |R0|, RZ, !P0
@!P0 FMUL R4, R0, 32
@!P1 FADD R0, R0, R0
This looks like the desired multiplication plus instructions for handling special cases triggered by the first argument, about which nothing is known at compile time for the case at hand. It is not clear to me by what logical reasoning or known code transformations the compiler could establish that it is safe to eliminate these checks.
Alternatively, it might be possible to enhance the NVVM part of the compiler that generates PTX so it recognizes ldexpf() as a well-known function and replaces the general PTX template for ldexpf() with specialized code when the second argument to ldexpf() is a small integer known at compile time. As I am not a compiler engineer, I am not sure what is practical; the asker may want to file an enhancement request with NVIDIA.
I checked many of the compilers at Compiler Explorer, and even with -ffast-math none translates ldexpf (x,5) into anything other than a call to the library function. So reducing this to a single instruction does not seem to be a transformation generally available in compiler frameworks (the NVVM portion of the CUDA compiler is derived from the widely used LLVM).
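For what it is worth, the transformation itself is value-preserving: because 32 = 2^5 is exactly representable in binary32, a single multiply matches ldexpf() bit for bit, including for subnormals, overflow to infinity, and infinity inputs. A small host-side C check (the test values are my own choice):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

int main (void)
{
    /* edge-case bit patterns: smallest subnormal, zero, +infinity, FLT_MAX,
       an ordinary value, negative subnormal, -infinity, 2^125 (overflows) */
    uint32_t patterns[] = { 0x00000001u, 0x00000000u, 0x7f800000u, 0x7f7fffffu,
                            0x3fc00000u, 0x80000001u, 0xff800000u, 0x7e000000u };
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        float x, a, b;
        uint32_t ua, ub;
        memcpy (&x, &patterns[i], sizeof x);
        a = ldexpf (x, 5);   /* library scaling */
        b = x * 32.0f;       /* single multiply */
        memcpy (&ua, &a, sizeof ua);
        memcpy (&ub, &b, sizeof ub);
        assert (ua == ub);   /* bit-identical results */
    }
    return 0;
}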
A potential solution might be achieved by deploying a different template for ldexpf() that reduces more easily under standard code transformations implemented by ptxas at this time. Below is an example of such an implementation (lightly tested):

__device__ float raw_ex2 (float a)
{
    asm ("ex2.approx.ftz.f32 %0,%0;" : "+f"(a));
    return a;
}

__device__ float my_ldexpf (float a, int b)
{
    unsigned int abs_b = abs (b);
    int scale = (abs_b <= 126) ? b : __mulhi (0x55555556, b); // roughly b/3
    float t = raw_ex2 ((float)scale);
    float r = a * t;
    r = (abs_b <= 126) ? r : (r * t * raw_ex2 ((float)(b - 2 * scale)));
    r = ((abs_b > 280) && (isinf (a) || (a == 0))) ? a : r; // special case: zero or infinity
    return r;
}

When I compile this with CUDA 12.8.1 and a compile-time constant second argument of 5, it reduces to the desired single multiplication instruction:

FMUL R4, R4, 32

Note that it is entirely possible that this proposed replacement hurts the performance of the general case, so this is a tricky design issue calling for a careful evaluation of all possible design alternatives, something I am unable to accomplish in the course of answering an SO question.
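As a sanity check of the __mulhi (0x55555556, b) step in the implementation above, the following host-side C sketch (with __mulhi emulated via a 64-bit multiply, and the bound of 280 taken from the special-case check) verifies that the decomposition b = scale + scale + (b - 2*scale) is exact and keeps every partial exponent small enough that ex2 yields a normal binary32 value:

#include <assert.h>
#include <stdlib.h>

/* host-side emulation of CUDA's __mulhi(): high 32 bits of a
   signed 32x32-bit multiply */
static int mulhi (int a, int b)
{
    return (int)(((long long)a * b) >> 32);
}

int main (void)
{
    /* |b| <= 280 is the range handled before the zero/infinity
       special case takes over */
    for (int b = -280; b <= 280; b++) {
        int scale = (abs (b) <= 126) ? b : mulhi (0x55555556, b); /* roughly b/3 */
        int rest  = b - 2 * scale;
        assert (scale + scale + rest == b); /* decomposition is exact */
        assert (abs (scale) <= 126);        /* 2^scale is a normal binary32 value */
        assert (abs (rest)  <= 126);        /* 2^rest is a normal binary32 value */
    }
    return 0;
}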