## IBM CONFIDENTIAL

Date: May 1, 1967

From (location or U.S. mail address): Advanced Computing Systems, Menlo Park, California

Dept. & Bldg.: 985 275 MEI

Subject: Dual Arithmetic on ACS-1

Reference: S.J.C.C., 1967 and our recent conversation

To: Dr. J. E. Bertram

One of the more formidable features of the ILLIAC IV is dual arithmetic, where a pair of floating-point numbers is made to interact with another pair, yielding a pair of independent results:

$$\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} \begin{pmatrix} \phi \\ \phi \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} \rightarrow \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} A_1 \,\phi\, B_1 \\ A_2 \,\phi\, B_2 \end{pmatrix}$$

The scheme is useful on the ILLIAC IV for the following reasons:

- 1. The 64-bit word length is adequate for a pair of hex-floating numbers, each with 8-bit exponent and 24-bit hex-fraction.
- 2. Significant time savings can be achieved in the PE by using the already-wide data paths for dual arithmetic. There may be an extra shift cost of 2 cycles per instruction compared with single 64-bit operations; this extra cost is something like 33% on floating adds (8 cycles rather than a possible 6) and may be more than offset in multiplies because of the shorter fractions.
- 3. For usual partial differential equations even 16 fraction bits may be adequate because of the sizable discretizing error. Parts of computation which call for longer lengths can be localized without serious effort.
- 4. Many problems do exhibit low-order parallelism exploitable by this feature. This even includes Monte Carlo computations, where the precision demand is low, as well as radar signal analysis and pattern analysis in general. Where parallelism is lacking, the two components in the packed word can be detached for individual attention at low timing cost.
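The dual operation and the add timing quoted above can be sketched as follows. This is only an illustration: the helper names (`pack_pair`, `unpack_pair`, `dual`) are hypothetical, the 32-bit halves are treated as opaque bit patterns rather than decoded hex-floating numbers, and ordinary Python floats stand in for the operands when the operation is applied.

```python
from operator import add, mul

MASK32 = (1 << 32) - 1  # one 32-bit half: 8-bit exponent + 24-bit hex fraction


def pack_pair(first: int, second: int) -> int:
    """Place two 32-bit operand images side by side in one 64-bit word."""
    return ((first & MASK32) << 32) | (second & MASK32)


def unpack_pair(word: int) -> tuple[int, int]:
    """Detach the two halves of a packed word for individual attention."""
    return (word >> 32) & MASK32, word & MASK32


def dual(op, a, b):
    """(A1, A2) op (B1, B2) -> (A1 op B1, A2 op B2); the two lanes never interact."""
    return op(a[0], b[0]), op(a[1], b[1])


# Cycle counts quoted in the memo for floating adds on the PE.
SINGLE_ADD_CYCLES = 6
DUAL_ADD_CYCLES = 8

# Relative cost of one dual add versus one single add (2 extra cycles out of 6).
extra_cost = (DUAL_ADD_CYCLES - SINGLE_ADD_CYCLES) / SINGLE_ADD_CYCLES
# A dual add delivers two results, so the cost per result drops.
cycles_per_result = DUAL_ADD_CYCLES / 2

if __name__ == "__main__":
    print(dual(add, (1.5, 2.0), (0.25, 4.0)))
    print(dual(mul, (1.5, 2.0), (0.25, 4.0)))
    print(f"extra cost {extra_cost:.0%}, {cycles_per_result} cycles per result")
```

On these figures a dual add costs about a third more than a single add but yields two results, so each result takes 4 cycles instead of 6.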




With the dual arithmetic feature, the ILLIAC IV PE can claim to be an 8-MIPS machine. Their weather program (NCAR model), run on the full 4-QUAD machine, is said to achieve 600 x 6600, with the upper and lower hemispheres treated "dually".

The proper way to counteract this claim is to install dual arithmetic ourselves. There are several difficulties:

- 1. The 48-bit word length is not adequate for an independent pair of floating point numbers, each with a 12-bit exponent. The fraction would have only 12 bits, small even by the standards of the most optimistic advocates of short-precision arithmetic.
- 2. Unless one performs at a rate of <u>two</u> operations per cycle, the saving in time is <u>invisible</u>. The <u>shifting</u> cost would be a major handicap.
- 3. Excessive hardware to achieve dual arithmetic is more likely on a pipeline machine, where the "fixed-time duration" requirement is compounded by a "uniform flush rate" requirement.
- 4. The operation code repertoire is already near the 256 "limit".
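The word-length arithmetic behind the first difficulty can be checked with a short sketch. The function name is hypothetical, and the count it returns is simply whatever remains of each half-word after the exponent, sign included.

```python
def fraction_bits_independent(word_bits: int, exponent_bits: int) -> int:
    """Bits left per number (sign included) when each half keeps its own exponent."""
    half = word_bits // 2
    return half - exponent_bits


# ILLIAC IV: 64-bit word, 8-bit exponent in each half.
illiac_bits = fraction_bits_independent(64, 8)   # 24

# ACS-1: 48-bit word, 12-bit exponent in each half.
acs_bits = fraction_bits_independent(48, 12)     # 12

if __name__ == "__main__":
    print(illiac_bits, acs_bits)
```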

I would like to advocate a limited form of dual arithmetic in which one exponent is shared by two fractions. This "block-normalization" philosophy is quite acceptable for partial differential equations and matrix computations (cf. discussions in an earlier memo to file, "Mixed floating add operations" by T. C. Chen, dated March 14, 1967). The following advantages of the new dual arithmetic are apparent; many are unique to the block-normalizing format.

- 1. Parallel comparison shifting with one <u>single</u> shifter.
- 2. Parallel add with one 48-bit adder (with, however, extra sign detection, overflow detection, and perhaps partial recomplementation features).
- 3. Parallel post-shifting (normalizing usually just one of the fractions).




- 4. Parallel multiply (with added hardware blocking of carries).
- 5. Only one exponent handling mechanism is needed.
- 6. TWO OPERATIONS PER CYCLE PER UNIT.
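A block-normalized dual add along the lines of this list might look like the sketch below. It is not the ACS-1 design: the representation (one binary exponent plus two signed integer fractions), the choice of `FRAC_BITS = 17`, truncation toward zero on shifting, and all function names are assumptions made for illustration, and exponent range is ignored.

```python
import math

FRAC_BITS = 17  # magnitude bits per fraction (assumed for this sketch)


def shift_right(f: int, n: int) -> int:
    """Shift a signed fraction right by n bits, truncating toward zero."""
    return -((-f) >> n) if f < 0 else f >> n


def from_floats(x1: float, x2: float) -> tuple[int, int, int]:
    """Encode two values as (exponent, f1, f2), normalized on the larger magnitude."""
    big = max(abs(x1), abs(x2))
    if big == 0.0:
        return 0, 0, 0
    _, e = math.frexp(big)  # big = m * 2**e with 0.5 <= m < 1
    scale = 2.0 ** (FRAC_BITS - e)
    return e, int(x1 * scale), int(x2 * scale)


def to_floats(d: tuple[int, int, int]) -> tuple[float, float]:
    """Decode (exponent, f1, f2) back to two Python floats."""
    e, f1, f2 = d
    scale = 2.0 ** (e - FRAC_BITS)
    return f1 * scale, f2 * scale


def dual_add(a, b):
    """Add two block-normalized pairs lane by lane with one shared exponent."""
    ea, a1, a2 = a
    eb, b1, b2 = b
    if a1 == 0 and a2 == 0:
        return b
    if b1 == 0 and b2 == 0:
        return a
    e = max(ea, eb)
    # One comparison shift amount serves both fractions of an operand.
    a1, a2 = shift_right(a1, e - ea), shift_right(a2, e - ea)
    b1, b2 = shift_right(b1, e - eb), shift_right(b2, e - eb)
    s1, s2 = a1 + b1, a2 + b2
    big = max(abs(s1), abs(s2))
    if big == 0:
        return 0, 0, 0
    # Post-shift both sums together so that the larger one is normalized.
    excess = big.bit_length() - FRAC_BITS
    if excess > 0:
        s1, s2 = shift_right(s1, excess), shift_right(s2, excess)
    elif excess < 0:
        s1, s2 = s1 << -excess, s2 << -excess
    return e + excess, s1, s2


if __name__ == "__main__":
    total = dual_add(from_floats(1.5, 0.25), from_floats(2.0, -0.125))
    print(to_floats(total))
```

Because the exponent follows the larger fraction, a component much smaller than its partner loses significance; in this sketch `from_floats(1000.0, 0.001)` keeps no bits of the second value at all.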

(It is suspected that the ILLIAC IV dual operations will turn out to be "block normalized" also, to reduce the circuit count.)
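The idea of sharing one adder between two fractions while blocking the carry between them has a well-known software analogue, sketched here. The lane width of 24 bits, the unsigned wrap-around behaviour, and the names are assumptions; the sign detection, overflow detection and recomplementation mentioned in the list are not modelled.

```python
LANE_BITS = 24                          # assumed: two equal lanes in a 48-bit word
LANE_MASK = (1 << LANE_BITS) - 1
WORD_MASK = (1 << (2 * LANE_BITS)) - 1
# The most significant bit of each lane.
TOP_BITS = (1 << (LANE_BITS - 1)) | (1 << (2 * LANE_BITS - 1))


def pack(hi: int, lo: int) -> int:
    """Place two lane values side by side in one word."""
    return ((hi & LANE_MASK) << LANE_BITS) | (lo & LANE_MASK)


def unpack(word: int) -> tuple[int, int]:
    """Split a word back into its two lane values."""
    return (word >> LANE_BITS) & LANE_MASK, word & LANE_MASK


def dual_add_blocked(a: int, b: int) -> int:
    """Add both lanes with a single wide addition; no carry crosses the lane boundary."""
    # Add with the top bit of each lane cleared, so a lane's carry stops at its own top bit.
    low = (a & ~TOP_BITS & WORD_MASK) + (b & ~TOP_BITS & WORD_MASK)
    # Restore the top bits: sum bit = a_top XOR b_top XOR carry_in (already in low).
    return (low ^ ((a ^ b) & TOP_BITS)) & WORD_MASK


if __name__ == "__main__":
    a, b = pack(5, 0xFFFFFF), pack(7, 1)
    print(unpack(dual_add_blocked(a, b)))   # low lane wraps, high lane unaffected
    print(unpack((a + b) & WORD_MASK))      # plain addition leaks a carry upward
```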

There are still some problems. With exponent unaltered, the fraction length is only 17 bits + sign, adequate only for very limited computations such as the weather problem and radar signal analysis. A better deal might be the format

$$S_1 \; E \; F_1 ; \; S_2 \; F_2 \quad \text{or} \quad S_1 \; E \; F_1 ; \; F_2 \; S_2$$

with

1 bit for S<sub>1</sub>,

8 bits for E (7090 size!),

19 bits for F<sub>1</sub>,

1 bit for S<sub>2</sub>,

19 bits for F<sub>2</sub>,

which will have roughly the same fraction capacity as the hex-fraction of 24 bits.
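The field widths above can be checked against the 48-bit word and packed with a few lines of code. This is a sketch only: the field order S1, E, F1, S2, F2 (most significant first) follows the order of the list, and the names `FIELDS`, `pack_word` and `unpack_word` are hypothetical. The encoding of each field (sign convention, exponent bias) is left open, so fields are handled as raw unsigned integers.

```python
# (name, width in bits), most significant field first; order is an assumption.
FIELDS = [("S1", 1), ("E", 8), ("F1", 19), ("S2", 1), ("F2", 19)]

WORD_BITS = sum(width for _, width in FIELDS)  # 1 + 8 + 19 + 1 + 19 = 48


def pack_word(values: dict[str, int]) -> int:
    """Pack raw field values into one word, checking each against its width."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word


def unpack_word(word: int) -> dict[str, int]:
    """Split a word back into its raw field values."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


if __name__ == "__main__":
    sample = {"S1": 1, "E": 0x81, "F1": 0x40000, "S2": 0, "F2": 0x7FFFF}
    w = pack_word(sample)
    print(WORD_BITS, f"{w:012x}", unpack_word(w) == sample)
```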

There ought to be a reasonably full dual-instruction set, including packing and unpacking (but perhaps no pipelined divide). I feel dual arithmetic to be more useful than double multiply and double divide, and am again advocating their removal to make room for the dual instructions.

Tien Chi Chen

TCC:va

cc: Dr. G. M. Amdahl

Mr. G. F. Nielsen

Mr. R. E. Pickett

Dr. H. Schorr

Dr. E. H. Sussenguth

SADL


L. Conway Archives