Floating point

Floating point numbers are represented as a fractional part and an exponent, which allows a wide range of values to be represented at a fixed precision.   The fractional part is formally called the significand, although the term mantissa is also in common use.

For example, the number 1048576 can be represented in decimal as 1.048576 * 10^6 or in binary as 1.0 * 2^20.
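If you have a C compiler to hand, the %a format of printf prints a number in exactly this significand-and-exponent form, which makes a quick way to check such examples (this snippet is just an illustration, not CATG code):

    #include <stdio.h>

    int main(void)
    {
        /* %a prints the significand in hex and the exponent as a power of two */
        printf("%a\n", 1048576.0);   /* prints 0x1p+20, i.e. 1.0 * 2^20 */
        printf("%a\n", 0.1);         /* 0x1.999999999999ap-4 - not exact in binary */
        return 0;
    }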

There are many ways of representing floating point numbers in memory, but one standard dominates: the IEEE Standard for Floating-Point Arithmetic (IEEE 754).   For practical implementation reasons, as well as for run-time speed, we do not propose to implement the standard itself, only to use its single precision binary format as a convenient representation for floating point numbers.

To get a feel for how floating point numbers are stored, have a play with this float converter: you can input any number and see its bit pattern, hex representation, representation errors, and so on.
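The same kind of inspection can be done in a few lines of C by copying a float's bits into an integer and splitting out the IEEE 754 single precision fields (again a standalone illustration, not CATG code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float f = -6.25f;            /* -1.1001b * 2^2 */
        uint32_t u;
        memcpy(&u, &f, sizeof u);    /* view the same 32 bits as an integer */

        printf("hex      : %08X\n", u);                              /* C0C80000 */
        printf("sign     : %u\n", u >> 31);                          /* 1 */
        printf("exponent : %u (biased by 127)\n", (u >> 23) & 0xFF); /* 129 */
        printf("fraction : %06X\n", u & 0x7FFFFF);                   /* 480000 */
        return 0;
    }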

CATG uses the f3 format for floating point, which uses one word for the exponent and sign and two words for the fractional part.   The sign is stored as the lowest bit of the exponent word, which makes it easy to shift off, leaving a 15 bit signed exponent.   In floating point arithmetic the sign is often dealt with first, leaving positive numbers (sign zero), and this means the exponents can be added/subtracted without any shifting.   The fractional part always has the top bit set, which means the normal 32 bit integer routines can be used.
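A rough C rendering of the layout may help; the struct and names below are my reconstruction (the real code works directly on 16 bit words), but it shows the sign shift and the exponent trick described above:

    #include <stdint.h>

    typedef struct {
        uint16_t exp_sign;   /* word 0: (exponent << 1) | sign */
        uint32_t frac;       /* words 1-2: fractional part, top bit always set */
    } f3;

    /* shift the sign off the low bit, leaving a 15 bit signed exponent */
    static int16_t f3_exponent(f3 x) { return (int16_t)x.exp_sign >> 1; }
    static int     f3_sign(f3 x)     { return x.exp_sign & 1; }

    /* with both signs already zero, exponents add for a multiply without
       any unpacking: (e1 << 1) + (e2 << 1) == (e1 + e2) << 1 */
    static uint16_t f3_mul_exp(f3 x, f3 y) { return x.exp_sign + y.exp_sign; }

(Right-shifting a negative signed value is strictly implementation-defined in C, but is an arithmetic shift on virtually every compiler.)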

In comparison to IEEE 754, there are more bits in the exponent, so a larger range of floating point numbers can be represented.   The 31 bit fractional part has more precision than a 32 bit float and less than a 64 bit double.   There is no support for underflow/overflow, infinity, NaN, etc.   Unlike IEEE 754 numbers, f3 numbers do not sort in numeric order when compared as 48 bit integers.
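The IEEE property alluded to is that two positive singles compare in the same order as their 32 bit patterns, because the biased exponent sits above the fraction; f3 stores a signed exponent with the sign flag in its low bit, so no such shortcut holds.   A quick check of the IEEE side (illustrative only):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t bits(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void)
    {
        /* for positive singles, numeric order matches bit pattern order */
        printf("%08X < %08X : %d\n", bits(1.5f), bits(2.0f),
               bits(1.5f) < bits(2.0f));   /* 3FC00000 < 40000000 : 1 */
        return 0;
    }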

OLD HACK NO LONGER USED - IEEE 754 has effectively a 24 bit significand, so to multiply two numbers together we need space for a 48 bit result, that is, three words.   However, we want the implementation to be as fast as possible, and there isn't enough register space to keep everything in registers, so significant memory manipulation would be needed (e.g. see the implementation of 4mul).   As a compromise, we chose not to compute some of the lower bits, as these are unlikely to affect the result.   This does cost some accuracy: 1.3% of the time we fail to compute a carry bit when we should, so the result is biased too small by 0.013 of a bit on average.   As we know the result will always be too small, we can compensate for this - adding 3 to the result before rounding reduces the number of times the result is in error to 0.4%.   What's more, sometimes the result is now too high by one bit, so the overall bias is reduced to 0.001 of a bit per multiply.   Given the choice between a custom format with a 16 bit significand which gave exact answers, or a standard format with non-standard behavior, the choice was easy - after all, CATG only claims to use the IEEE 754 format, not to be compliant with its behavior.
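For reference, here is an exact 24 x 24 bit significand multiply written out the way a 16 bit machine would see it; the comments mark the partial product the old hack skipped.   The 8/16 bit split and the names are my assumptions, and the bias-before-rounding detail is deliberately not reconstructed:

    #include <stdint.h>

    /* exact 48 bit product of two 24 bit significands, as three 16 bit words */
    void mul24(uint32_t a, uint32_t b, uint16_t out[3])
    {
        uint32_t aH = a >> 16, aL = a & 0xFFFFu;   /* 8 bit high, 16 bit low */
        uint32_t bH = b >> 16, bL = b & 0xFFFFu;

        uint32_t lo  = aL * bL;            /* bits 0..31 - the term the hack skipped */
        uint32_t mid = aH * bL + aL * bH;  /* bits 16..40, fits in 32 bits */
        uint32_t hi  = aH * bH;            /* bits 32..47 */

        uint32_t w1 = (lo >> 16) + (mid & 0xFFFFu);   /* middle word plus carry in */
        uint32_t w2 = hi + (mid >> 16) + (w1 >> 16);  /* top word */

        out[0] = (uint16_t)lo;
        out[1] = (uint16_t)w1;
        out[2] = (uint16_t)w2;
    }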