While conceptually, ldexpf(a,b) is just a multiplication a * 2^b, this does not mean it can actually be implemented as a simple floating-point multiplication in all cases. Consider the case ldexpf(-0x1.8p-143f, 270) = -0x1.8p+127, which would require multiplying by 2^270, which is not representable as a float mapped to IEEE-754 binary32 format.
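A small host-side C illustration of the problem: scaling the subnormal input by the full factor in a single multiplication fails because the factor itself overflows binary32:

#include <assert.h>
#include <math.h>

int main (void)
{
    float a = -0x1.8p-143f;            /* subnormal input */
    float exact = ldexpf (a, 270);     /* -0x1.8p+127, a finite float */
    float naive = a * (float)0x1p270;  /* 2^270 overflows binary32 to +inf ... */
    assert (exact == -0x1.8p+127f);
    assert (isinf (naive));            /* ... so the product is -inf, not -0x1.8p+127 */
    return 0;
}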
This means that the actual computation performed by ldexpf() for the general case is more complicated. In particular, between one and three multiplications are required to implement the scaling, and at most one of these multiplications may involve actual rounding. It is also required to correctly handle the special cases called out by the relevant language standard, here ISO C++, which in turn inherits them from ISO C99 (the C99 math library was imported wholesale into C++ as of C++11).
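To make the "up to three multiplications" concrete, here is one hypothetical decomposition for the example above (not the actual library code): splitting 2^270 into three representable factors keeps every intermediate exact.

#include <assert.h>
#include <math.h>

int main (void)
{
    float a = -0x1.8p-143f;
    /* 270 = 127 + 127 + 16; each partial scale factor fits in binary32 */
    float r = a * 0x1p127f;   /* -0x1.8p-16,  exact */
    r = r * 0x1p127f;         /* -0x1.8p+111, exact */
    r = r * 0x1p16f;          /* -0x1.8p+127, exact */
    assert (r == ldexpf (a, 270));
    return 0;
}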
In CUDA, the functionality of the standard math library is provided via PTX-level templates. The NVVM part of the CUDA compiler, which generates PTX, pulls these templates out of a library. PTX is an intermediate representation that is compiled by the optimizing compiler ptxas into machine code (SASS) for the target architecture(s) specified by the programmer.
ptxas performs many standard optimizations such as constant propagation and dead-code elimination, and at present (CUDA 12.8.1) it is able to eliminate most of the code from the ldexpf() template when a compile-time constant second argument is provided, but not all of it. For an sm_89 target, the generated SASS looks something like this:
FSETP.NEU.AND P0, PT, |R0|, +INF , PT
FSETP.EQ.OR P0, PT, |R0|, RZ, !P0
FSETP.GT.OR P1, PT, |R0|, RZ, !P0
@!P0 FMUL R4, R0, 32
@!P1 FADD R0, R0, R0
This looks like the desired multiplication plus instructions for handling special cases triggered by the first argument, about which nothing is known at compile time for the case at hand. It is not clear to me by what logical reasoning or known code transformations the compiler could establish that it is safe to eliminate these checks.
Alternatively, it might be possible to enhance the NVVM part of the compiler that generates PTX so it recognizes ldexpf() as a well-known function and replaces the general PTX template for ldexpf() with specialized code when the second argument to ldexpf() is a small integer known at compile time. As I am not a compiler engineer, I am not sure what is practical; the asker may want to file an enhancement request with NVIDIA.
I checked many of the compilers at Compiler Explorer, and even with -ffast-math none translates ldexpf (x,5) into anything other than a call to the library function. So reducing this to a single instruction does not seem to be a transformation generally available in compiler frameworks (the NVVM portion of the CUDA compiler is derived from the widely used LLVM).
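For what it is worth, the transformation itself is value-preserving: because 32 = 2^5 is exactly representable in binary32, a single multiply matches ldexpf() bit for bit, including for subnormals, overflow to infinity, and infinity inputs. A small host-side C check (the test values are my own choice):

#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

int main (void)
{
    /* edge-case bit patterns: smallest subnormal, zero, +infinity, FLT_MAX,
       an ordinary value, negative subnormal, -infinity, 2^125 (overflows) */
    uint32_t patterns[] = { 0x00000001u, 0x00000000u, 0x7f800000u, 0x7f7fffffu,
                            0x3fc00000u, 0x80000001u, 0xff800000u, 0x7e000000u };
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        float x, a, b;
        uint32_t ua, ub;
        memcpy (&x, &patterns[i], sizeof x);
        a = ldexpf (x, 5);   /* library scaling */
        b = x * 32.0f;       /* single multiply */
        memcpy (&ua, &a, sizeof ua);
        memcpy (&ub, &b, sizeof ub);
        assert (ua == ub);   /* bit-identical results */
    }
    return 0;
}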
A potential solution might be achieved by deploying a different template for ldexpf() that reduces more easily under standard code transformations implemented by ptxas at this time. Below is an example of such an implementation (lightly tested):

__device__ float raw_ex2 (float a)
{
    asm ("ex2.approx.ftz.f32 %0,%0;" : "+f"(a));
    return a;
}

__device__ float my_ldexpf (float a, int b)
{
    unsigned int abs_b = abs (b);
    int scale = (abs_b <= 126) ? b : __mulhi (0x55555556, b); // roughly b/3
    float t = raw_ex2 ((float)scale);
    float r = a * t;
    r = (abs_b <= 126) ? r : (r * t * raw_ex2 ((float)(b - 2 * scale)));
    r = ((abs_b > 280) && (isinf (a) || (a == 0))) ? a : r; // special case: zero or infinity
    return r;
}

When I compile this with CUDA 12.8.1 and a compile-time constant second argument of 5, it reduces to the desired single multiplication instruction:

FMUL R4, R4, 32

Note that it is entirely possible that this proposed replacement hurts the performance of the general case, so this is a tricky design issue calling for a careful evaluation of all possible design alternatives, something I am unable to accomplish in the course of answering an SO question.
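As a sanity check of the __mulhi (0x55555556, b) step in the implementation above, the following host-side C sketch (with __mulhi emulated via a 64-bit multiply, and the bound of 280 taken from the special-case check) verifies that the decomposition b = scale + scale + (b - 2*scale) is exact and keeps every partial exponent small enough that ex2 yields a normal binary32 value:

#include <assert.h>
#include <stdlib.h>

/* host-side emulation of CUDA's __mulhi(): high 32 bits of a
   signed 32x32-bit multiply */
static int mulhi (int a, int b)
{
    return (int)(((long long)a * b) >> 32);
}

int main (void)
{
    /* |b| <= 280 is the range handled before the zero/infinity
       special case takes over */
    for (int b = -280; b <= 280; b++) {
        int scale = (abs (b) <= 126) ? b : mulhi (0x55555556, b); /* roughly b/3 */
        int rest  = b - 2 * scale;
        assert (scale + scale + rest == b); /* decomposition is exact */
        assert (abs (scale) <= 126);        /* 2^scale is a normal binary32 value */
        assert (abs (rest)  <= 126);        /* 2^rest is a normal binary32 value */
    }
    return 0;
}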