[Up] [Previous]

Mixing Old and New

Given the goal of using a 36-bit single-precision floating-point format with a hidden first bit, and the determination that a 60-bit double-precision floating-point format is as precise as is necessary for advanced scientific computation, so far we have examined various ways to organize a computer's memory so as to allow floats composed of three and five fundamental units of twelve bits in length to be handled efficiently.

However, the Cray I illustrated another possibility, that of avoiding the problem altogether.

The Control Data 6600 had a 60-bit word, and reserved 48 bits of a floating-point number for the mantissa.

The Cray I, also designed by Seymour Cray, had a 64-bit word, and it also reserved 48 bits of that word for the mantissa (coefficient) of a floating-point number, simply expanding the exponent field to fill the larger word.

This is unlike the 64-bit double-precision floating-point format in IEEE 754, which reserves 52 bits of such floating-point numbers for the mantissa (significand).
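To check the bookkeeping, the field widths of the three formats can be tallied; note that the exponent widths used below (eleven bits for the Control Data 6600, fifteen for the Cray I) are not stated above, and are my understanding of those machines:

```python
# Word layouts of the formats discussed above; the exponent widths are
# assumptions not stated in the text (sign bit + exponent + mantissa).
formats = {
    "CDC 6600":        (60, 11, 48),
    "Cray I":          (64, 15, 48),
    "IEEE 754 double": (64, 11, 52),
}

for name, (word, exponent, mantissa) in formats.items():
    # one sign bit plus the two fields must fill the word exactly
    assert 1 + exponent + mantissa == word, name
```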

Thus, one could simply go with a perfectly conventional computer having a 36-bit word and a nine-bit byte to meet these goals, with floating-point formats such as shown above.

In practice, though, the fact that the 36-bit floating-point format combines the precision of the IBM 7090 with the exponent range of the IBM 360 suggests that a similar example should be followed with the double-precision format. While making the exponent field of the double-precision number larger, so that its 72-bit length does not lead to unnecessary precision, is a valid thing to do, having four bits less precision than the current IEEE 754 standard, however unneeded those bits might be, would mean that programs couldn't be ported to this new architecture and yield equivalent results.

Thus, in the diagram shown above, the double-precision floating-point format reserves 55 bits for the mantissa. Given that there is a hidden first bit, as in IEEE 754, this matches the 56 bits so reserved in the double-precision format of the IBM System/360, rather than merely the 52 bits so reserved in the IEEE 754 double-precision format.

After all, having 36 and 72 bits to play with, instead of 32 and 64 bits, it's not really difficult to meet or exceed what is done in formats of these shorter lengths.

Also defined is an intermediate-precision format, occupying three 18-bit halfwords; the exponent range covers that of the typical pocket calculator, and a 43-bit mantissa gives a precision of well over 12 decimal digits.
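The claimed precision of the 43-bit mantissa is easily verified; an n-bit binary mantissa carries about n times log10(2) decimal digits:

```python
import math

# Decimal digits carried by an n-bit binary mantissa: n * log10(2).
digits = 43 * math.log10(2)
assert digits > 12        # "well over 12 decimal digits" as stated above
print(f"43 mantissa bits = {digits:.2f} decimal digits")
```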

Given, however, that the System/360 floating-point format was based on an exponent that was a power of sixteen rather than a power of two, the precision of the mantissa was illusory rather than real, with the effective precision being that of a 52-bit mantissa rather than a 55-bit one.
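The loss from hexadecimal normalization can be sketched as follows; since only the leading hexadecimal digit is guaranteed nonzero, up to three leading bits of the mantissa may be zero in the worst case:

```python
# System/360 double precision: 14 hexadecimal digits of mantissa.
mantissa_bits = 4 * 14            # 56 bits as stored
worst_case = mantissa_bits - 3    # leading hex digit may be as small as 0001
assert worst_case == 53           # comparable to a 52-bit mantissa plus a hidden first bit
```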

Thus, a 52-bit mantissa with a hidden first bit, as used in the IEEE 754 standard for 64-bit floating-point numbers, already provides one more bit of real precision than that floating-point format had provided, and so a format like the one shown above will probably be the preferred option.


Going to 72-bit double precision with an enlarged exponent field, to replace 60-bit double precision, eliminates the issues associated with unusual lengths if only those types are used.

But it is considered desirable to include an intermediate precision type as well. For scalar operations, if that type is 54 bits wide, that means it is composed of three 18-bit halfwords, and so it can be easily addressed, and access to it can be made efficient through standard methods of supporting unaligned memory access.

For vector operations, however, this does not seem to work well.

Assume each vector register consists of 64 elements, each 72 bits long, to support a vector of up to 64 double-precision numbers.

This could easily support vectors of single-precision numbers as well; simply put two numbers side by side in each register, and so vectors of single-precision numbers could have up to 128 elements.

Modifying this scheme to support intermediate precision numbers is possible. Change the vector registers so that they contain thirty-two elements, each 144 bits long.

This would fit nicely into a structure built around a nine-bit byte and a 36-bit word. Each individual 144-bit element could contain two double-precision numbers, four single-precision numbers, or three intermediate-precision numbers... if those intermediate-precision numbers were reduced to 48 bits in length.
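The arithmetic behind that compromise is simply that 144 has 36, 48, and 72 as divisors, but not 54:

```python
element = 144
assert element // 36 == 4 and element % 36 == 0   # four single-precision floats
assert element // 48 == 3 and element % 48 == 0   # three 48-bit intermediates
assert element // 72 == 2 and element % 72 == 0   # two double-precision floats
assert element % 54 != 0                          # 54-bit floats leave 36 bits unused
```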

So the connection between the vector unit and memory is kept simple and straightforward, by having two sizes of intermediate-precision floats, one optimal for vector calculations, and one optimal for scalar calculations.

Is this necessary? If one had vector registers made up of individual registers that were 216 bits wide, each could contain six 36-bit single-precision floats, three 72-bit double-precision floats, or four 54-bit intermediate-precision floats.

But then memory would need to be organized around a 216-bit width, and so single-precision and double-precision numbers would no longer be addressed in a simple binary fashion; a division by three would be needed.

This is why a 54-bit width, instead of a 48-bit width, is needed for scalar intermediate-precision numbers, to avoid having to use division by three to address intermediate-precision numbers.
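As a sketch of the addressing argument, assuming addresses are counted in 18-bit halfwords: locating a 54-bit scalar takes only a multiplication, while floats packed several to a 216-bit memory unit would need a division as well:

```python
def halfword_address(base, index, width_bits):
    # a float that is a whole number of halfwords is located by multiplying
    assert width_bits % 18 == 0
    return base + (width_bits // 18) * index

# 36-, 54-, and 72-bit floats occupy 2, 3, and 4 halfwords respectively.
assert halfword_address(0, 5, 54) == 15

# Packed three to a unit (as doubles would be in a 216-bit memory),
# locating element i requires a division and remainder by three instead:
unit, slot = divmod(7, 3)
assert (unit, slot) == (2, 1)
```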

Another wrinkle comes from the fact that it is desirable to keep floating-point numbers in registers in an internal form which avoids the complexity associated with the hidden first bit and gradual underflow when calculating.

So when floating-point numbers are loaded into a vector register, type information would also be set to indicate if the vector register was being used for double-precision numbers or single-precision numbers.

One possible format for the register contents is as shown here:

In the upper half of the diagram is shown how each 144-bit portion of a vector looks in memory: four 36-bit numbers, three 48-bit numbers, or two 72-bit numbers.

The internal form of a number in a format like that of those in IEEE 754 format would need to be at least two bits longer for fast computation: eliminate the hidden first bit, and also increase the length of the exponent by one to allow gradual underflow to be eliminated (from the internal form only, while still fully supporting gradual underflow for numbers as stored in memory) without reducing the range of numbers.
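The reason one extra exponent bit suffices to banish gradual underflow from the internal form can be illustrated with IEEE 754 doubles; a sketch using Python's math.frexp:

```python
import math

# The smallest IEEE 754 double subnormal is 2**-1074, far below the
# smallest normal exponent of -1022; but one more exponent bit roughly
# doubles the exponent range, so such a value can be held internally as
# an ordinary normalized number.
smallest_subnormal = 2.0 ** -1074
m, e = math.frexp(smallest_subnormal)   # value = m * 2**e, with m in [0.5, 1)
assert (m, e) == (0.5, -1073)           # outside the normal range of a double...
assert e > -(2 ** 11)                   # ...but inside a 12-bit exponent's range
```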

However, the original 8087 took 64-bit double-precision floating-point numbers, and worked on them in a temporary real form that was 80 bits long. The exponent was four bits longer, so the other 12 bits, rather than just one, went to the significand.

If one draws from that merely that one is allowed to lengthen the mantissa in the internal form of a floating-point number, then adding one bit to the mantissa of 36-bit floating-point numbers means that three bits would be added to each of the four numbers in a 144-bit unit... making the total number of added bits twelve, which can also be divided evenly among three 48-bit numbers or two 72-bit numbers.

In this way, the vector registers can be made up of 156-bit parts that can be divided up evenly, with no unused bits, for each of the three data types with which they can be filled.

For 36-bit floats, add one exponent bit, one mantissa bit for the hidden first bit, and one extra mantissa bit, for a total of three bits each;

For 48-bit floats, add one exponent bit, one mantissa bit for the hidden first bit, and two extra mantissa bits, for a total of four bits each; and

For 72-bit floats, add one exponent bit, one mantissa bit for the hidden first bit, and four extra mantissa bits, for a total of six bits each.

Add three bits to each of four numbers, or four bits to each of three numbers, or six bits to each of two numbers, and twelve bits have been added in each case.
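The bookkeeping for the 156-bit elements works out as claimed:

```python
# Extra bits per number: one exponent bit, one bit for the hidden first
# bit, and the extra mantissa bits listed in the three cases above.
added = {36: 1 + 1 + 1, 48: 1 + 1 + 2, 72: 1 + 1 + 4}

for width, extra in added.items():
    count = 144 // width                   # 4, 3, or 2 numbers per element
    assert count * (width + extra) == 156  # every element filled exactly
```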

However, this consistency could not be maintained across all numeric types.

Presumably, there would also be offered a 144-bit floating-point type which did not have a hidden first bit, being the maximum-length type, and which is therefore a temporary real type.

Assuming vector operations were supported on this type, since it was already in temporary real format, it would not be lengthened when placed in one of the 156-bit elements of a vector register.

This type, as it lacks the hidden first bit, can be unnormalized, and among the consequences of that is that it can be used as the building block for a 288-bit floating-point format, which, like double-precision on the IBM 704, or extended precision on the IBM System/360 model 85 and later machines, is composed of two floating-point numbers in the format that is half its width.

Given that vector arithmetic instructions work on 36, 48, and 72-bit floats, while scalar arithmetic instructions work on 36, 54, and 72-bit floats, it is obviously necessary to provide instructions to save 48-bit floats from the vector registers into 54-bit floats for later scalar computation, and to load 54-bit floats from memory into the vector registers rounded to 48 bits.

Not perhaps quite as obviously necessary, there should also be instructions for loading 72-bit floats in the vector registers from, and storing them to, 54-bit floats in memory, so that if desired, a scalar computation on 54-bit floats could be supplemented by vector calculations without a loss of precision.

One unfortunate characteristic of this scheme, however, is that only one extra mantissa bit is present in the 36-bit floats. When it comes time to save them to memory, that last bit can be either 0 or 1; relative to the last place of the mantissa being saved, those values represent 0 or 0.5. As it's not clear which way is best to round an exact 0.5, in effect the extra bit contains no information at that point.

Of course, though, it still improves accuracy during the course of a calculation done between the vector registers, by making intermediate results more precise. But this still is uncomfortable.
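The problem can be seen in a sketch of the store-to-memory rounding; with a single guard bit, the only nonzero case is exactly halfway, so some tie-breaking rule, such as round-half-to-even, has to be imposed:

```python
def round_single_guard(mantissa, guard):
    # guard is the lone extra bit; a 1 means exactly half a unit in the
    # last place, so round to the nearest even mantissa to break the tie
    if guard == 0:
        return mantissa
    return mantissa + (mantissa & 1)

assert round_single_guard(0b1010, 1) == 0b1010   # even: the guard bit is dropped
assert round_single_guard(0b1011, 1) == 0b1100   # odd: rounded up
```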

So, one possibility that might be considered would be to lengthen the elements of the vector registers from 144 bits to 168 bits instead of 156 bits.

This would lead to:

36-bit floats would now have four extra mantissa bits;

48-bit floats would now have six extra mantissa bits; and

72-bit floats would now have ten extra mantissa bits.

Adding six extra mantissa bits to 48-bit floats (in addition to the one bit added to the exponent, and the one mantissa bit added for the hidden first bit) would increase the size of the mantissa to match that of a 54-bit float, so now the registers could contain three 54-bit intermediate-precision floating-point numbers. This could perhaps be exploited to switch to having only 54-bit intermediate-precision floating-point numbers, eliminating the extra complexity of having two sizes.
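With 168-bit elements, the accounting still comes out even for all three types:

```python
element = 168
# (width, extra mantissa bits); each number also gains one exponent bit
# and one mantissa bit for the hidden first bit, hence the "+ 2" below.
for width, extra in ((36, 4), (48, 6), (72, 10)):
    count = 144 // width                        # still 4, 3, or 2 numbers
    assert count * (width + 2 + extra) == element
```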

However, ten extra mantissa bits for 72-bit floating-point numbers might seem excessive, even though the 80-bit temporary real format of the 8087 added eleven bits to the precision of the mantissa of double-precision numbers.

Having any extra mantissa bits, while a way to make the vector registers neat and symmetrical, and avoid unused space in them, risks making floating-point arithmetic slower. While retaining more bits in intermediate results may seem to be a positive thing, the fact that they're lost whenever numbers are saved to memory means that they would tend to make floating-point arithmetic less repeatable; and since they're present at some times but not at others, essentially at random from the view of the problem being solved, their benefits might be illusory.

Another theoretical possibility that might be considered with this basic design is this: since the basic idea that led to obtaining a desired set of floating-point precisions, without an unusual memory architecture that accommodated data items of incompatible widths, was to lengthen the exponent field in double-precision numbers beyond what was really needed... why not shorten that exponent field by one bit, and take the first bit of the mantissa out of hiding, thus making the 72-bit floating-point format a temporary real format?

This also has the benefit that an extended precision of 144 bits would then be the logical ultimate length for floating-point numbers, removing the temptation to support 288-bit floats which presumably are of limited utility.

