ldexp and frexp decompositions for positive numbers.
If you are okay with up to 2-16 amount of relative error, you can express both sides of the transformation using just basic arithmetic and the ld/frexp decomposition.
Note that this is much slower than the struct hack, which can be more succinctly represented as struct.unpack('f', struct.pack('I', value)).
Here is the decomposition method.
def to_bits(x):
man, exp = math.frexp(x)
return int((2 * man + (exp + 125)) * 0x800000)
def from_bits(y):
y -= 0x3e800000
return math.ldexp(
float(0x800000 + y & 0x7fffff) / 0x1000000,
(y - 0x800000) >> 23)
While the from_bits function looks scarier, it is actually nothing more than the inverse of to_bits, modified so that we only perform a single floating point division (not because of speed considerations, just because it should be the sort of mindset we have when we do need to work with machine representations of floats). Therefore, I'll focus on explaining the forward transformation.
Derivation
Recall that a (positive) IEEE 754 floating point number is represented as a tuple of a biased exponent and its mantissa. The lower 23 bits m are the mantissa, and the upper 8 bits e (minus the most significant bit, which we assume to always be zero) represent the exponent, so that
x = (1 + m / 223) * 2e - 127
Let man' = m / 223 and exp' = e - 127, then 0 <= man' < 1 and exp' is an integer. Therefore
(man' + exp' + 127) * 223
gives the IEEE 754 representation.
On the other hand, the frexp decomposition computes a pair man, exp = frexp(x) such that man * 2exp = x, and 0.5 <= man < 1.
A moment of thought will show that man' = 2 * man - 1 and exp' = exp - 1, therefore its IEEE machine representation is
(man' + exp' + 127) * 0x800000 = (2 * man + exp + 125) * 0x800000
Error Analysis
How much roundoff error do we expect? Well, let's assume that frexp introduces no error within its decomposition. This is unfortunately impossible, but we can relax this down the line.
The main feature is the computation 2 * man + (exp + 125). Why? 0x800000 is an perfect power of two, and therefore a floating point multiplication of a power of two will nearly always be lossless (unless we overflow), since the FPU is just adding 23 << 23 to its machine representation (without touching the Mantissa, which is when error arise). Similarly, the multiplication 2 * man is also lossless (akin to just adding 1 << 23 to the machine representation). Furthermore, exp and 125 are integers, so (exp + 125) is also computed to exact precision.
Therefore, we are left to analyze the error behavior of m + e, where 1 <= m < 1 and |e| < 127. In the worst case, m has all 23 bits filled (corresponding to m = 2 - 2-22) and e = +/- 127. Here, this addition will unfortunately clobber the 8 least significant bits of m, since it has to renormalize m (which is at the exponential range of 20) to the exponential range of 28, which means losing 8 bits. However, since a mantissa has 24 significant bits, we effectively lose 2-(24 - 8) amount of precision, which upper-bounds the error.
In a similar line of reasoning for from_bits, you can show that float(0x800000 + y & 0x7fffff) is basically computing the operation (1.0f + m), where m may have up to 23 bits of precision and it is strictly less than 1. Therefore, we're adding a precise number at the scale of 20 with another number at the scale of 2-1, so we expect a loss of one bit. This then suggests that we would incur up to 2-22 relative error in the backwards transformation.
Both of these transformations incur very little roundoff, and if you throw in an extra multiplication into to_bits, you can also bring its error down to just 2-22.
Final Words
Do not do this in production.
- You should never have to explicitly manipulate the machine representation of a number.
- Even if for some ungodly reason you need to do this, you should not do something this hacky.
This is just a clever float-hack that seems fun. It's not meant to be anything more than that.
structmodule is the right one to use for most things which you would use a C-stylestructorunion(as you would find in Java).