. navigate
10. High-Level Input-Output left arrow
right arrow B. Contract Problems
Red Rationale


Approx Real Arithmetic in Fixed-Point     Auto Scaled Appropx Fixed-Point     Adapting Floating Point Rules
Adapting Integer Rules     Manually Scalled Approx Fixed-Point    


In this appendix the main issues surrounding fixed-point arithmetic are discussed, and some suggestions for a possible approach are made. In summary, we recommend an approximate fixed-point facility ("scaled fractions"), with explicit rescaling, and supporting arbitrary scale factors.


Most embedded applications require some approximation to the real numbers. For the past twenty-five years, large general purpose computers have provided such an approximation via floating point arithmetic. General purpose languages have assumed the existence of floating point since the earliest FORTRAN days and have benefited by the simplicity of their numeric data types. Unfortunately, the majority of embedded computer systems (especially those embedded in weapons systems) provide inadequate support for floating point. The problems are:

  • no floating point hardware;

  • floating point hardware which is too slow;

  • floating point hardware with insufficient precision.

Notice that this problem is shared by both obsolete and modern equipment. The PDP-11, for instance, has only recently become available in a militarized form and is one of the more modern and more popular architectures. Yet most machines in the PDP-11 family do not provide floating point. The same is true for Data General's NOVA line of computers.

There are some definite advantages to fixed point. John VonNeumann opposed the introduction of floating point hardware because it is slower, less precise, and more expensive than fixed point. Whether through choice or through necessity, a large number of embedded military applications are still using fixed point reals. For a variety of reasons they cannot substitute floating point. The fixed point being used is an approximation to the reals. It is not exact. Computations discard low order bits just as is done in floating point calculations. The exact fixed point required by Steelman provides a nice generalization of the integers. It enables programmers to write clearer programs than can be written using only integers. Desirable as it is in its own right, however, it cannot fulfill the above described requirement for an alternative to floating point.


An assembly language programmer on a fixed point machine starts with the precision and range of each variable and chooses scale factors for the data and then any necessary scaling operations. In most cases, there are several ways of achieving the desired end and the programmer must make a choice, keeping in mind:

  • that a large number of rescalings may cause an unacceptable reduction in efficiency;

  • that it is important to avoid overflow; and

  • that it is also important not to discard too many low order bits by being too conservative about avoiding overflow.

A complete description of the difficulties and procedures to follow in scaling calculations can be found in [Per70]. For our purposes, however, a simple example will serve to bring the problems into better focus.

Suppose that a, b, and c are integers, and that b and c have the same range. The assignment

      a := b + c

would be performed by adding b to c, and then storing the sum in a. The ranges of integers are small and are accurately known by the programmer. If a, b, and c are single precision, the addition can be performed in single precision with no loss of information. If a is double precision with b and c single precision, the compiler will perform the single to double conversion before the addition only if an overflow can occur; otherwise, the conversion will be delayed until after the addition. In the case of x, y, and z fixed point with y and z sharing the same range, precision, and scaling, the assignment

      X := y + z

is somewhat more complicated. If x has the same range, precision, and scaling as y and z, then the only reasonable option is to add y to z and store the sum in x. If the addition overflows, then the sum would not fit in x anyway. On the other hand, what if x has the same number of bits but twice the range of y? Rescaling the sum after the addition incurs the risk of an overflow even though the sum will fit in x. Rescaling both y and z before performing the addition adds some instructions and sacrifices some precision. There is an initial tendency to say that the lost precisioni is unimportant, and that the extra rescaling operation is not very expensive. Further reflection belies these impressions. Numerical analysts have made it clear over the years that improper manipulation of low order bits will ultimately destroy all the accuracy in a computation. In fact, IBM retrofitted guard digits to the floating point of System 360 at great expense simply to overcome the premature loss of low order bits. As to the lack of expense in a rescaling, let us estimate the cost for a typical single accumulator machine with the rescaling performed by a high speed shift.

with two rescalings separates 2 cols with one rescaling
cost code . code cost
2     lfs y . . .
1     shr 1 . . .
2     sto temp . . .
2     lda z .     lda z 2
1     shr 1 . . .
2     add temp .     add y 2
. . .     shr 1 1
2     sto x .     sto x 2
_________ __________ . _________ ________
12 cycles 7-12 words . 4-7 words 7 cycles

Assuming that all memory reference instructions require two cycles and all other instructions require one cycle, the ratio of times is 1.7 to 1. Depending on the machine's architecture, the memory reference instructions will require one or two words. In either case the code space ratios are also about 1.7 to 1. In short, the cost of adding extra rescalings may easily be unacceptably high. If the rescaling requires a multiply instead of a shift, the penalty is just that much higher.

When determining the rescaling operations to be performed, an assembly language programmer potentially has a considerable amount of related information to use. The size of the values of y and z at that particular point in the computation may be known. Perhaps they're of opposite signs or perhaps they're both small or perhaps they're both likely to be close to 1. The precisions of y and z at that point in the calculation may be known. The intended use of z may be known. A higher order language translator must obtain this information from the programmer or do without the information. It is extremely difficult for a translator to use contextual information when determining rescaling strategy because of the difficulty of defining such rules in the language. In particular, one would expect that a higher order language would define the scaling of y+z independently of whether it appeared in

      x := y + z

or in

      x := y + z + w

On the other hand, it is quite possible that an assembly language programmer would make different scaling decisions in the two different contexts.

Making rescaling decisions is tedious and error prone. There is little doubt about the optimal level of language support. The programmer would like to declare the range, precision, and perhaps the scaling of each fixed point variable. The compiler should take care of everything else. The problem is that devising a scheme which is implementable, object-time efficient, numerically sound, and nonetheless completely automatic does not seem to be within the state of the art.

Since integer and floating point arithmetic work automatically, it is useful to examine the reasons why neither of these approaches can be adapted to approximate fixed point. Approximate fixed point is intended as an alternative to floating point, so an examination of the floating point rules seems a reasonable point of departure.


There are two issues to be dealt with, range and precision. The precision of the result of a floating point operation is defined to be the maximum of the precisions of its operands. To implement this rule, the compiler need only transform the implementation precision of the less precise operand to that of the more precise one and then perform the operation. Floating point values are approximations. Since the mathematical properties of the built-in operators are to reduce precision, the definition is numerically sound. No potentially significant bits are lost. The rules define the range of the result to be large enough to contain the maximum possible result (subject to an implementation dependent maximum). Since most floating point formats support a range far exceeding that required by practical calculations, the implementation of this rule is trivial and clearly numerically sound. Notice, however, that the implementability of this approach is highly dependent upon the properties of the floating point format. That is, the amount of precision maintained in a fixed size computer word is independent of the magnitude of the number because only the high order significant bits are retained, and a very large range of values can be maintained because the magnitude is cleverly encoded. The fixed point format does not share these attributes.

With the floating point format, an execution time decision is made enabling retention of high-order significant bits; with the fixed-point format this decision must be made at compile time. Thus, if the fixed point format, emulating the floating point philosophy, is to limit the number of bits in the result, a compile time guess must be made about the position of the high order bit. While the floating point format typically can compactly encode an object varying over two hundred orders of magnitude with uniform precision, the fixed point format typically accommodates a range of less than fifteen orders of magnitude with significant losses of precision for values at the low end of the range. To implement the floating point range rule in a limited number of bits requires sacrifice of all the low order bits. Unfortunately, all the significance may be in the low order bits.


The rules for integer arithmetic are straightforward. The precision issue is non-existent. Integer arithmetic is exact and consequently all low order bits must be retained. The rule for range is, of course, that the range must be large enough to hold the maximum possible result (up to an implementation dependent maximum). Adopting these rules for fixed point would yield an exact rather than an approximate fixed point, but isn't that adequate? Not really. There is at least one problem with the definition and a whole host of efficiency problems.

When dividing integers, one expects an integer quotient and an integer remainder. When dividing exact fixed point numbers, one once again expects an exact quotient and remainder, although the rules may define the quotient to be an integer or to have some other scale. In the approximate fixed point case, the quotient should beta good (but how good?) approximation to the true value and the concept of a remainder is inappropriate. Although vexing, a widely acceptable definition of division can probably be found. The really major issue is efficiency.

Embedded computers have hardware support for some small number of arithmetic lengths. Some machines support a single length. Some support single and double. A very few add triple and/or quadruple lengths. Furthermore, on many of these machines, even when supported by hardware, multiple length arithmetic is much slower than single length. If the numerical properties of a computation require multi-length computations, then even the most efficient implementation must use multi-length arithmetic. On the other hand, if the computation only requires short arithmetic, but the language's rules require the compiler to generate long arithmetic, a degradation of efficiency occurs. If the application is highly numerical (e.g., guidance, navigation, control, targeting), the introduced inefficiency may be unacceptable. If the introduced inefficiency causes a requirement for more bits than the largest supported unit, the language rules may make the program untranslatable even though the desired computation is supportable on the hardware. The way that integer arithmetic is actually used in real programs is compatible with an efficient implementation of the rules. Approximate fixed point is used for quite a different purpose and the application of similar rules leads to unacceptable sacrifices of efficiency.

The important properties of integers are:

  1. Programmers can easily assess the range of a variable with great confidence and therefore the compiler knows the correct number of bits necessary to represent each integer object.

  2. The number of bits required for an integer object is usually much smaller than the size supported by the hardware.

  3. Integer expressions are usually simple, rarely involving more than one explicit operator.

  4. Integer expressions usually appear in contexts which provide optimization opportunities (e.g., subscripts, assignment).

Approximate fixed point shares none of these advantages:

  1. Programmers can determine the range of most fixed point variables with some expenditure of effort. The number of bits of precision required is quite another matter. This is a mathematical problem, not a programming issue. The state of the art of numerical analysis is not sufficient for the majority of real fixed point programs.

  2. Because of the uncertainty in the required precision, declared precisions are usually determined by experimentation. The natural units to experiment in are single precision, double precision, etc. Thus, the declarations claim that the number of bits required by a variable is close to the size supported by the hardware.

  3. Fixed point expressions are typically complicated. They often represent numerical approximations to special functions or to ordinary differential equations. There is ample opportunity for uncertainties to accumulate.

  4. Fixed point expressions rarely appear in contexts providing optimization opportunities. Rather, they are used as component parts of larger expressions and as arguments to functions.

The differences discussed above are fundamental. While a conscientious application of the integer rules will typically lead to completely single precision evaluation code, the same rules applied to fixed point will usually lead to exceeding the maximum available precision. Thus, the integer arithmetic is always correct and usually efficient while the fixed point arithmetic is likely to be incorrect and very inefficient.


In the preceding discussion we argued that first, an approximate (rather than an exact) fixed-point facility is appropriate and second, automatic rescaling is not desired. One possible approach which meets these criteria is to provide a "scaled fraction" fixed point facility with the following properties:

  1. Arbitrary scale factors are allowed.

  2. Fixed point operations require scales to be identical, and an explicit rescaling function is provided.

  3. Variable declarations must include a specification of scale and precision, and may include a specification of range.

The first property, arbitrary scale factors, is important for many applications; as noted in SM 3-1G, restricting scales to, say, powers of two is not sufficient. As an example, pi is a natural scale factor for angles, and the characteristics of various devices associated with the run-time environment are likely to impose other scale factors.

Approx Real Arithmetic in Fixed-Point     Auto Scaled Appropx Fixed-Point     Adapting Floating Point Rules
Adapting Integer Rules     Manually Scalled Approx Fixed-Point    

10. High-Level Input-Output left arrow
right arrow B. Contract Problems



RED Reference
RED Rationale

Types in RED
Time/Life Computer Languages

Site Index

Overview             Reference ToC             Rationale ToC             Site Index

Home   Favorites   Map

IME logo Copyright © 2009, Mary S. Van Deusen