**A. FIXED-POINT ARITHMETIC**

In this appendix the main issues surrounding fixed-point arithmetic are
discussed, and some suggestions for a possible approach are made. In summary, we
recommend an approximate fixed-point facility ("scaled fractions"), with explicit
rescaling, and supporting arbitrary scale factors.

**A.1 APPROXIMATING REAL ARITHMETIC IN FIXED-POINT**

Most embedded applications require some approximation to the real numbers. For
the past twenty-five years, large general purpose computers have provided such an
approximation via floating point arithmetic. General purpose languages have assumed
the existence of floating point since the earliest FORTRAN days and have benefited
by the simplicity of their numeric data types. Unfortunately, the majority of
embedded computer systems (especially those embedded in weapons systems) provide
inadequate support for floating point. The problems are:

- no floating point hardware;
- floating point hardware which is too slow;
- floating point hardware with insufficient precision.

Notice that this problem is shared by both obsolete and modern equipment. The
PDP-11, for instance, has only recently become available in a militarized form and
is one of the more modern and more popular architectures. Yet most machines in the
PDP-11 family do not provide floating point. The same is true for Data General's
NOVA line of computers.

There are some definite advantages to fixed point. John VonNeumann opposed the
introduction of floating point hardware because it is slower, less precise, and more
expensive than fixed point. Whether through choice or through necessity, a large
number of embedded military applications are still using fixed point reals. For a
variety of reasons they cannot substitute floating point. The fixed point being
used is an approximation to the reals. It is not exact. Computations discard low
order bits just as is done in floating point calculations. The exact fixed point
required by Steelman provides a nice generalization of the integers. It enables
programmers to write clearer programs than can be written using only integers.
Desirable as it is in its own right, however, it cannot fulfill the above described
requirement for an alternative to floating point.

**A.2 AUTOMATICALLY SCALED APPROXIMATE FIXED-POINT**

An assembly language programmer on a fixed point machine starts with the
precision and range of each variable and chooses scale factors for the data and then
any necessary scaling operations. In most cases, there are several ways of
achieving the desired end and the programmer must make a choice, keeping in mind:

- that a large number of rescalings may cause an unacceptable reduction in
efficiency;
- that it is important to avoid overflow; and
- that it is also important not to discard too many low order bits by being
too conservative about avoiding overflow.

A complete description of the difficulties and procedures to follow in scaling
calculations can be found in [Per70]. For our purposes, however, a simple example
will serve to bring the problems into better focus.

Suppose that a, b, and c are integers, and that b and c have the same range.
The assignment

** a := b + c**

would be performed by adding b to c, and then storing the sum in a. The ranges of
integers are small and are accurately known by the programmer. If a, b, and c are
single precision, the addition can be performed in single precision with no loss of
information. If a is double precision with b and c single precision, the compiler
will perform the single to double conversion before the addition only if an overflow
can occur; otherwise, the conversion will be delayed until after the addition.
In the case of x, y, and z fixed point with y and z sharing the same range,
precision, and scaling, the assignment

** X := y + z**

is somewhat more complicated. If x has the same range, precision, and scaling as y
and z, then the only reasonable option is to add y to z and store the sum in x. If
the addition overflows, then the sum would not fit in x anyway. On the other hand,
what if x has the same number of bits but twice the range of y? Rescaling the sum
after the addition incurs the risk of an overflow even though the sum will fit in x.
Rescaling both y and z before performing the addition adds some instructions and
sacrifices some precision. There is an initial tendency to say that the lost
precisioni is unimportant, and that the extra rescaling operation is not very
expensive. Further reflection belies these impressions. Numerical analysts have
made it clear over the years that improper manipulation of low order bits will
ultimately destroy all the accuracy in a computation. In fact, IBM retrofitted
guard digits to the floating point of System 360 at great expense simply to overcome
the premature loss of low order bits. As to the lack of expense in a rescaling, let
us estimate the cost for a typical single accumulator machine with the rescaling
performed by a high speed shift.

**with two rescalings** |
separates 2 cols |
**with one rescaling** |

**cost** |
**code** |
. |
**code** |
**cost** |

2 |
lfs y |
. |
. |
. |

1 |
shr 1 |
. |
. |
. |

2 |
sto temp |
. |
. |
. |

2 |
lda z |
. |
lda z |
2 |

1 |
shr 1 |
. |
. |
. |

2 |
add temp |
. |
add y |
2 |

. |
. |
. |
shr 1 |
1 |

2 |
sto x |
. |
sto x |
2 |

**_________** |
**__________** |
. |
**_________** |
**________** |

**12 cycles** |
**7-12 words** |
. |
**4-7 words** |
**7 cycles** |

Assuming that all memory reference instructions require two cycles and all other
instructions require one cycle, the ratio of times is 1.7 to 1. Depending on the
machine's architecture, the memory reference instructions will require one or two
words. In either case the code space ratios are also about 1.7 to 1. In short, the
cost of adding extra rescalings may easily be unacceptably high. If the rescaling
requires a multiply instead of a shift, the penalty is just that much higher.

When determining the rescaling operations to be performed, an assembly language
programmer potentially has a considerable amount of related information to use. The
size of the values of y and z at that particular point in the computation may be
known. Perhaps they're of opposite signs or perhaps they're both small or perhaps
they're both likely to be close to 1. The precisions of y and z at that point in
the calculation may be known. The intended use of z may be known. A higher order
language translator must obtain this information from the programmer or do without
the information. It is extremely difficult for a translator to use contextual
information when determining rescaling strategy because of the difficulty of
defining such rules in the language. In particular, one would expect that a higher
order language would define the scaling of y+z independently of whether it appeared
in

** x := y + z**

or in

** x := y + z + w**

On the other hand, it is quite possible that an assembly language programmer would
make different scaling decisions in the two different contexts.

Making rescaling decisions is tedious and error prone. There is little doubt
about the optimal level of language support. The programmer would like to declare
the range, precision, and perhaps the scaling of each fixed point variable. The
compiler should take care of everything else. The problem is that devising a scheme
which is implementable, object-time efficient, numerically sound, and nonetheless
completely automatic does not seem to be within the state of the art.

Since integer and floating point arithmetic work automatically, it is useful to
examine the reasons why neither of these approaches can be adapted to approximate
fixed point. Approximate fixed point is intended as an alternative to floating
point, so an examination of the floating point rules seems a reasonable point of
departure.

**A.3 ADAPTING THE FLOATING POINT RULES**

There are two issues to be dealt with, range and precision. The precision of
the result of a floating point operation is defined to be the maximum of the
precisions of its operands. To implement this rule, the compiler need only
transform the implementation precision of the less precise operand to that of the
more precise one and then perform the operation. Floating point values are
approximations. Since the mathematical properties of the built-in operators are to
reduce precision, the definition is numerically sound. No potentially significant
bits are lost. The rules define the range of the result to be large enough to
contain the maximum possible result (subject to an implementation dependent
maximum). Since most floating point formats support a range far exceeding that
required by practical calculations, the implementation of this rule is trivial and
clearly numerically sound. Notice, however, that the implementability of this
approach is highly dependent upon the properties of the floating point format. That
is, the amount of precision maintained in a fixed size computer word is independent
of the magnitude of the number because only the high order significant bits are
retained, and a very large range of values can be maintained because the magnitude
is cleverly encoded. The fixed point format does not share these attributes.

With the floating point format, an execution time decision is made enabling
retention of high-order significant bits; with the fixed-point format this decision
must be made at compile time. Thus, if the fixed point format, emulating the
floating point philosophy, is to limit the number of bits in the result, a compile
time guess must be made about the position of the high order bit. While the
floating point format typically can compactly encode an object varying over two
hundred orders of magnitude with uniform precision, the fixed point format typically
accommodates a range of less than fifteen orders of magnitude with significant losses
of precision for values at the low end of the range. To implement the floating
point range rule in a limited number of bits requires sacrifice of all the low order
bits. Unfortunately, all the significance may be in the low order bits.

**A.4 ADAPTING THE INTEGER RULES**

The rules for integer arithmetic are straightforward. The precision issue is
non-existent. Integer arithmetic is exact and consequently all low order bits must
be retained. The rule for range is, of course, that the range must be large enough
to hold the maximum possible result (up to an implementation dependent maximum).
Adopting these rules for fixed point would yield an exact rather than an approximate
fixed point, but isn't that adequate? Not really. There is at least one problem
with the definition and a whole host of efficiency problems.

When dividing integers, one expects an integer quotient and an integer
remainder. When dividing exact fixed point numbers, one once again expects an exact
quotient and remainder, although the rules may define the quotient to be an integer
or to have some other scale. In the approximate fixed point case, the quotient
should beta good (but how good?) approximation to the true value and the concept of
a remainder is inappropriate. Although vexing, a widely acceptable definition of
division can probably be found. The really major issue is efficiency.

Embedded computers have hardware support for some small number of arithmetic
lengths. Some machines support a single length. Some support single and double. A
very few add triple and/or quadruple lengths. Furthermore, on many of these
machines, even when supported by hardware, multiple length arithmetic is much slower
than single length. If the numerical properties of a computation require
multi-length computations, then even the most efficient implementation must use
multi-length arithmetic. On the other hand, if the computation only requires short
arithmetic, but the language's rules require the compiler to generate long
arithmetic, a degradation of efficiency occurs. If the application is highly
numerical (e.g., guidance, navigation, control, targeting), the introduced
inefficiency may be unacceptable. If the introduced inefficiency causes a
requirement for more bits than the largest supported unit, the language rules may
make the program untranslatable even though the desired computation is supportable
on the hardware. The way that integer arithmetic is actually used in real programs
is compatible with an efficient implementation of the rules. Approximate fixed
point is used for quite a different purpose and the application of similar rules
leads to unacceptable sacrifices of efficiency.

The important properties of integers are:

- Programmers can easily assess the range of a variable with great confidence
and therefore the compiler knows the correct number of bits necessary to
represent each integer object.
- The number of bits required for an integer object is usually much smaller
than the size supported by the hardware.
- Integer expressions are usually simple, rarely involving more than one
explicit operator.
- Integer expressions usually appear in contexts which provide optimization
opportunities (e.g., subscripts, assignment).

Approximate fixed point shares none of these advantages:

- Programmers can determine the range of most fixed point variables with some
expenditure of effort. The number of bits of precision required is quite
another matter. This is a mathematical problem, not a programming issue.
The state of the art of numerical analysis is not sufficient for the
majority of real fixed point programs.
- Because of the uncertainty in the required precision, declared precisions
are usually determined by experimentation. The natural units to experiment
in are single precision, double precision, etc. Thus, the declarations
claim that the number of bits required by a variable is close to the size
supported by the hardware.
- Fixed point expressions are typically complicated. They often represent
numerical approximations to special functions or to ordinary differential
equations. There is ample opportunity for uncertainties to accumulate.
- Fixed point expressions rarely appear in contexts providing optimization
opportunities. Rather, they are used as component parts of larger
expressions and as arguments to functions.

The differences discussed above are fundamental. While a conscientious
application of the integer rules will typically lead to completely single precision
evaluation code, the same rules applied to fixed point will usually lead to
exceeding the maximum available precision. Thus, the integer arithmetic is always
correct and usually efficient while the fixed point arithmetic is likely to be
incorrect and very inefficient.

**A.5 MANUALLY SCALED APPROXIMATE FIXED-POINT**

In the preceding discussion we argued that first, an approximate (rather than
an exact) fixed-point facility is appropriate and second, automatic rescaling is not
desired. One possible approach which meets these criteria is to provide a "scaled
fraction" fixed point facility with the following properties:

- Arbitrary scale factors are allowed.
- Fixed point operations require scales to be identical, and an explicit
rescaling function is provided.
- Variable declarations must include a specification of scale and precision,
and may include a specification of range.

The first property, arbitrary scale factors, is important for many
applications; as noted in SM 3-1G, restricting scales to, say, powers of two is not
sufficient. As an example, pi is a natural scale factor for angles, and the
characteristics of various devices associated with the run-time environment are
likely to impose other scale factors.