Supported arithmetic

Rounding directions

Some of the classes of operators presented in the following sections are templated by a rounding direction. This is the direction chosen when converting a real number that cannot be exactly represented in the destination format.

There are eleven directions:

toward zero
away from zero
toward minus infinity (down)
toward plus infinity
to odd mantissas
to nearest, tie breaking to even mantissas
to nearest, tie breaking to odd mantissas
to nearest, tie breaking toward zero
to nearest, tie breaking away from zero
to nearest, tie breaking toward minus infinity
to nearest, tie breaking toward plus infinity

The rounding directions mandated by the IEEE-754 standard are ne (default mode, rounding to nearest), zr, dn, up, and na (introduced for decimal arithmetic).

Floating-point operators

This class of operators covers all the formats whose number sets are \(F(p,d) = \{m \cdot 2^e; |m| < 2^p, e \ge d\}\). In particular, IEEE-754 floating-point formats (with subnormal numbers) are part of this class, if we set apart overflow issues. Both parameters p and d select a particular format. The last parameter selects the rounding direction.

float< precision, minimum_exponent, rounding_direction >(...)

Formats with no minimal exponent (and thus no underflow) are also available:

float< precision, rounding_direction >(...)

Having to remember the precision and minimum exponent parameters may be a bit tedious, so an alternate syntax is provided: instead of these two parameters, a name can be given to the float class.

float< name, rounding_direction >(...)

There are four predefined formats:

IEEE-754 single precision
IEEE-754 double precision
IEEE-754 quadruple precision
extended precision on x86-like processors

Fixed-point operators

This class of operators covers all the formats whose number sets are \(F(e) = \{m \cdot 2^e\}\). The first parameter selects the weight of the least significant bit. The second parameter selects the rounding direction.

fixed< lsb_weight, rounding_direction >(...)

Rounding to integer is a special case of fixed point rounding of weight 0. A syntactic shortcut is provided.

int< rounding_direction >(...)

Miscellaneous operators

The following operators are underspecified and therefore not suitable for formal proofs.

Functions with relative error

This set of functions is defined with related theorems on relative error. They can be used to express properties that cannot be directly expressed through unary rounding operators.

{add|sub|mul}_rel< precision [, minimum_exponent] >(..., ...)

If the minimum exponent is not provided, the bound on the relative error is assumed to be valid on the entire domain. Otherwise the interval \([-2^e,2^e]\) is excluded from the domain.

Talking about the expression add_rel<20,-60>(a,b) is like talking about a fresh expression c such that not |a + b| <= 1b-60 -> |c -/ (a + b)| <= 1b-20.

Rounding operators with homogen properties

To be written.