IAR Embedded Workbench for RX 5.20

Basic data types—floating-point types

In the IAR C/C++ Compiler for RX, floating-point values are represented in standard IEC 60559 format. The sizes for the different floating-point types are:

Type                                    Size     Range (+/-)                Decimals  Exponent  Mantissa  Alignment
__fp16                                  16 bits  ±6.10E-5 to ±65504         3         5 bits    10 bits   2
float                                   32 bits  ±1.18E-38 to ±3.39E+38     7         8 bits    23 bits   4
double with --double=32 (default)       32 bits  ±1.18E-38 to ±3.39E+38     7         8 bits    23 bits   4
double with --double=64                 64 bits  ±2.23E-308 to ±1.79E+308   15        11 bits   52 bits   4
long double with --double=32 (default)  32 bits  ±1.18E-38 to ±3.39E+38     7         8 bits    23 bits   4
long double with --double=64            64 bits  ±2.23E-308 to ±1.79E+308   15        11 bits   52 bits   4

Table 79. Floating-point types 


Note

The size of double and long double depends on the --double={32|64} option; see --double. The type long double uses the same precision as double.
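As a minimal sketch, the active configuration can be checked from C through the standard <float.h> and <limits.h> macros. This assumes that DBL_MANT_DIG follows the --double setting (24 for a 32-bit double, 53 for a 64-bit double). The function names are hypothetical and are not part of any IAR library.

```c
#include <float.h>
#include <limits.h>

/* Returns the storage size of double in bits (32 or 64 for the
   configurations listed in the table). */
unsigned double_storage_bits(void)
{
    return (unsigned)(sizeof(double) * CHAR_BIT);
}

/* Returns 1 if double has a 53-bit significand (52 stored mantissa bits
   plus the implicit leading 1), which corresponds to the 64-bit format.
   Assumption: <float.h> reflects the selected --double option. */
int double_is_64bit(void)
{
    return DBL_MANT_DIG == 53;
}
```

Code that needs 64-bit precision could call double_is_64bit() at startup, or test DBL_MANT_DIG in a preprocessor conditional, to detect a build that uses the default --double=32.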

The __fp16 floating-point type is only a storage type. All numerical operations will operate on values promoted to float.
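To illustrate what promotion to float involves, the following sketch decodes a raw 16-bit pattern into a float in portable C. It assumes the IEC 60559 binary16 layout (1 sign bit, 5 exponent bits with a bias of 15, 10 stored mantissa bits) and a 32-bit float. half_bits_to_float is a hypothetical helper written for illustration, not the conversion routine that the compiler uses.

```c
#include <stdint.h>
#include <string.h>

/* Converts a binary16 bit pattern to the float with the same value. */
float half_bits_to_float(uint16_t h)
{
    uint32_t sign     = ((uint32_t)h >> 15) & 0x1u;
    uint32_t exponent = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t mantissa = (uint32_t)h & 0x3FFu;
    uint32_t bits;
    float result;

    if (exponent == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, mantissa bits carried over. */
        bits = 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0u) {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = ((exponent + 112u) << 23) | (mantissa << 13);
    } else if (mantissa != 0u) {
        /* Subnormal half: shift until the leading 1 is found, so that the
           value becomes a normal float. */
        uint32_t e = 113u;
        while ((mantissa & 0x400u) == 0u) {
            mantissa <<= 1;
            --e;
        }
        bits = (e << 23) | ((mantissa & 0x3FFu) << 13);
    } else {
        bits = 0u; /* zero */
    }

    bits |= sign << 31;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, the pattern 0x7BFF decodes to 65504, the largest finite value in the table above, and 0x0400 decodes to 2^-14, the smallest normal value.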

Floating-point environment

Exception flags are not supported. The feraiseexcept function does not raise any exceptions.
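Because exception flags are not available, code that would otherwise rely on fetestexcept can inspect results directly. The sketch below shows one way to do that using only comparisons. It assumes that an overflowing operation produces an infinity and an invalid operation produces a NaN, as in IEC 60559 default handling. check_result and the fp_status enumeration are hypothetical names introduced for illustration.

```c
#include <float.h>

typedef enum {
    FPS_OK,        /* finite result */
    FPS_OVERFLOW,  /* result is +/- infinity */
    FPS_INVALID    /* result is NaN */
} fp_status;

/* Classifies a float result without using the floating-point environment. */
fp_status check_result(float r)
{
    if (r != r) {
        return FPS_INVALID;   /* only NaN compares unequal to itself */
    }
    if (r > FLT_MAX || r < -FLT_MAX) {
        return FPS_OVERFLOW;  /* beyond the largest finite value */
    }
    return FPS_OK;
}
```

A caller would apply check_result to the result of a computation, for example check_result(a * b), instead of clearing and testing exception flags around it.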

32-bit floating-point format

The representation of a 32-bit floating-point number as an integer is:

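[Figure: bit layout of a 32-bit floating-point number. Bit 31 is the sign bit (S), bits 30 to 23 hold the exponent, and bits 22 to 0 hold the mantissa.]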

The exponent is 8 bits, and the mantissa is 23 bits.

The value of the number is:

(-1)^S * 2^(Exponent-127) * 1.Mantissa

The range of the number is at least:

±1.18E-38 to ±3.39E+38

The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
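The field layout and the value formula for normal numbers can be sketched in C as follows. The type and function names are hypothetical, and the code assumes that float is the 32-bit format described here. The scaling loop is used instead of a library call so that the example depends only on basic arithmetic.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* S, 1 bit */
    uint32_t exponent;  /* biased exponent, 8 bits */
    uint32_t mantissa;  /* stored mantissa, 23 bits */
} float_fields;

/* Splits a float into its sign, exponent, and mantissa fields. */
float_fields split_float(float f)
{
    uint32_t bits;
    float_fields out;

    memcpy(&bits, &f, sizeof bits);
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}

/* Multiplies x by 2^e using exact doublings or halvings. */
double scale_pow2(double x, int e)
{
    while (e > 0) { x *= 2.0; --e; }
    while (e < 0) { x *= 0.5; ++e; }
    return x;
}

/* Evaluates (-1)^S * 2^(Exponent-127) * 1.Mantissa.
   Valid for normal numbers only (exponent field 1 to 254). */
double normal_value(float_fields p)
{
    double significand = 1.0 + (double)p.mantissa / 8388608.0; /* 2^23 */
    double magnitude = scale_pow2(significand, (int)p.exponent - 127);
    return p.sign ? -magnitude : magnitude;
}
```

For -6.5f, split_float yields S = 1, Exponent = 129, and Mantissa = 0x500000, and normal_value rebuilds -6.5 from those fields as -(2^2 * 1.625).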

Representation of special floating-point numbers

This list describes the representation of special floating-point numbers:

  • Zero is represented by zero mantissa and exponent. The sign bit signifies positive or negative zero.

  • Infinity is represented by setting the exponent to the highest value and the mantissa to zero. The sign bit signifies positive or negative infinity.

  • For the float type, Not a number (NaN) is represented by setting the exponent to the highest positive value and the mantissa to a non-zero value. The value of the sign bit is ignored.

  • For the double type, Not a number (NaN) is represented by setting the exponent to 0x7FF and at least one of the highest twenty bits in the mantissa to a non-zero value. The lower thirty-two bits of the mantissa are ignored. The value of the sign bit is also ignored.

  • Subnormal numbers are used for representing values that are too small to be represented as normal values. The drawback is that precision decreases as the values get smaller. The exponent is set to 0 to signify that the number is subnormal, even though the number is treated as if the exponent were 1. Unlike normal numbers, subnormal numbers do not have an implicit 1 as the most significant bit (the MSB) of the mantissa. The value of a subnormal number is:

    (-1)^S * 2^(1-BIAS) * 0.Mantissa

    where BIAS is 127 for 32-bit floating-point numbers (and 1023 for 64-bit floating-point numbers).
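The rules in this list can be sketched as a classifier for 32-bit floating-point bit patterns. It works on the raw integer representation, so it does not depend on how the target handles subnormal operands. The enumeration and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    FP_KIND_ZERO,
    FP_KIND_SUBNORMAL,
    FP_KIND_NORMAL,
    FP_KIND_INFINITY,
    FP_KIND_NAN
} fp_kind;

/* Classifies the integer representation of a 32-bit floating-point number.
   The sign bit (bit 31) does not affect the classification. */
fp_kind classify_float_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0xFFu) {
        /* Highest exponent value: infinity or NaN. */
        return (mantissa == 0u) ? FP_KIND_INFINITY : FP_KIND_NAN;
    }
    if (exponent == 0u) {
        /* Zero exponent: zero or subnormal. */
        return (mantissa == 0u) ? FP_KIND_ZERO : FP_KIND_SUBNORMAL;
    }
    return FP_KIND_NORMAL;
}
```

For example, 0x80000000 is classified as (negative) zero, 0x7F800000 as infinity, 0x7FC00000 as NaN, and 0x00000001 as the smallest subnormal number.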

By default, subnormal numbers are only supported for 64-bit floating-point numbers. However, the RX600 libraries can use the unimplemented processing exception of the CPU to support 32-bit floating-point subnormal numbers.

Note

If the 64-bit FPU is used (--fpu=64), subnormal numbers are not supported for either 32-bit or 64-bit floating-point numbers.

To enable the subnormal number exception handler, use the linker option --redirect like this:

--redirect __float_placeholder=__unimpl_processing_handler

Supporting subnormal numbers for 32-bit floating-point numbers this way incurs a large overhead in both code size and speed compared to a normal FPU instruction, which requires very few CPU cycles. The subnormal number exception handler uses approximately 900 bytes of code space and about 50–200 cycles per exception, depending on the operation and the operands. For that reason, if execution speed is important, try to use floating-point algorithms that do not require subnormal number support for 32-bit floating-point numbers.
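One common way to keep an algorithm out of the subnormal range is to clamp a decaying value to zero before it gets that small. The following is a minimal sketch of the idea for a filter or envelope state that is repeatedly multiplied by a coefficient. DECAY_FLOOR and decay_step are hypothetical names, and the threshold is an arbitrary value chosen far above the smallest normal float (about 1.18E-38). The sketch assumes that the magnitude of coeff is at most 1 and not itself extremely small.

```c
/* Hypothetical threshold: values below this magnitude are treated as zero.
   It is many orders of magnitude above the subnormal range. */
#define DECAY_FLOOR 1.0e-30f

/* Performs one decay step and snaps the state to zero once its magnitude
   drops below DECAY_FLOOR, so later steps never produce a subnormal. */
float decay_step(float state, float coeff)
{
    float next = state * coeff;

    if (next > -DECAY_FLOOR && next < DECAY_FLOOR) {
        next = 0.0f;
    }
    return next;
}
```

Whether such a clamp is acceptable depends on the application: it trades a very small error near zero for the guarantee that the state is either zero or comfortably within the normal range.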

To remove subnormal number handling for 32-bit floating-point numbers, use this linker command:

--redirect __float_placeholder=__floating_point_handler