On the previous page, we have seen a way to reconcile the basic conflict inherent in using floating-point numbers of a nonstandard length by mixing numbers of different lengths in memory.

However, this leaves open the possibility of a very large amount of memory going to waste if the appropriate data types are not required in the right proportions by a program.

On this page, a technique will be examined that, by accepting a small, fixed percentage of memory being wasted, allows efficient addressing of unusual, but desirable, lengths of floating-point numbers.

Let us assume that memory is organized into words of 256 bits in length.

A 256-bit memory bus is used on my imaginary architecture; and it may be noted that an important benefit of a memory bus of that width has been demonstrated by IBM's most recent mainframe as of this writing, the zEnterprise EC12, as well as by its predecessors such as the zEnterprise 196.

These computers use memory organized according to IBM's RAIM (Redundant Array of Independent Memories) technology, as follows: there are five memory banks, each composed of 72-bit memory modules, each providing 64-bit wide memory with an error-correcting code (ECC) offering single-error correction and double-error detection (SECDED). Four of the five memories supply the four 64-bit portions of each 256-bit memory word, and the fifth provides parity; thus, if one memory module fails completely, the relevant portion of the memory word can be reconstructed from the other portions of that word and the parity, yielding a 256-bit wide highly reliable memory.

(Incidentally, while the novel part of RAIM, using one memory unit for parity,
so that one failed unit can be ignored, would seem like one that would have been used before with
memories, and not just with disks, U.S. patent 7,200,770 was granted to IBM for its RAIM
technology. And indeed something slightly similar was used before with memories, as shown by U.S. patent
4,849,978... also issued to IBM; however, the system described in that patent was intended for
after-the-fact reconstruction of the contents of a failed memory unit, not as a method for
conventional error correction in wide memory accesses.)

Five intermediate-precision floating-point numbers of 51 bits in length occupy 255 bits, wasting one bit out of 256.

Seven single-precision floating-point numbers of 36 bits in length occupy 252 bits, wasting four bits out of 256.

The 36-bit floating-point format would follow the IEEE 754 standard, except for having an eight-bit excess-126 exponent. The 51-bit floating-point format would, like the 48-bit one proposed earlier, have a ten-bit excess-510 exponent, thus providing an exponent range and a precision comparable to those of a pocket calculator: the 37-bit mantissa of a 48-bit float with such an exponent provides 11 digits of precision, and the 40-bit mantissa of a 51-bit float gives 12 digits of precision. Scientific calculators tended to display numbers with ten digits of precision, while having 11, 13, or even 15 digits of internal precision.
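Those digit counts can be checked quickly; a sketch in Python (nothing here is part of the architecture, it simply evaluates n times the base-ten logarithm of 2 for the two mantissa lengths):

```python
import math

# An n-bit mantissa provides about n * log10(2) decimal digits of precision.
for bits in (37, 40):
    digits = bits * math.log10(2)
    print(bits, "bits ->", round(digits, 2), "digits")
```

The 37-bit mantissa comes to just over 11 digits, and the 40-bit mantissa to just over 12, as claimed.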

By subdividing the memory unit into aliquot parts, either five parts or seven parts, no floating-point number ever crosses the boundary between one 256-bit memory word and the next, thus achieving one of the two desirable conditions for handling numbers of unusual lengths.

But, if we stop here, we fail to meet the other condition: finding the physical memory address of a particular floating-point number of these sizes within an array requires a division by five, or a division by seven; and I, at least, do not know of a logic circuit that can perform that operation on an address quickly enough to be part of address calculation for instructions.

What I have now come up with, though, is a way to replace division by multiplication of an adequately simple type, provided that we reconcile ourselves to a small amount of additional wasted memory, beyond the one, or four, bits in each 256-bit memory word that we have already accepted.

65 is equal to 5 times 13, and is one greater than 64.

Let us assume that 51-bit floating-point numbers are addressed as follows:

The base register contains a conventional binary address; it may be a byte address, but it will normally be aligned on 32-byte, or 256-bit, boundaries, so if it is a byte address, its last five bits will be zero.

The displacement field of the instruction may, if it is 16 bits in length, consist of a 13 bit displacement in units of 256 bits, followed by three bits which may contain a number from 0 through 4, indicating one of the five 51-bit floating-point numbers contained within a 256-bit memory word. (We will assume, for now, that nonzero values of that field will not be used with indexed addressing, so as to avoid the need to define the behavior in that case.)
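The splitting of such a displacement field can be sketched in a few lines of Python (the function name and the choice of returning a pair are mine, purely for illustration):

```python
def split_displacement(disp16):
    """Split a 16-bit displacement into a 13-bit displacement in units
    of 256-bit memory words, followed by a 3-bit slot number (0 to 4)
    selecting one of the five 51-bit floats within a memory word."""
    word_disp = disp16 >> 3        # upper 13 bits: which 256-bit word
    slot = disp16 & 0b111          # lower 3 bits: which 51-bit slot
    assert slot <= 4, "slot values 5 through 7 are not defined"
    return word_disp, slot
```

So a displacement encoding word 10, slot 3 would be the integer (10 << 3) | 3.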

The index register, when used, however, contains the number of a specific 51-bit floating-point number in an array of 51-bit numbers. It is a simple binary integer, with no other structure.

We refuse to take the index register contents and divide them by 5. Instead, we will leave gaps in the array of 51-bit floating-point numbers so that we can multiply instead of dividing.

Thus, we separate out the last six bits of the index register contents.

We take the number formed by the bits prior to those last six bits, and multiply it by thirteen.

Thirteen times five equals sixty-five. Thirteen memory words of 256 bits contain 65 floating-point numbers that are 51 bits long.

So multiplying the leading bits of the index register by 13, which involves shifting and adding, which can be done quickly, even when addresses are long, gives us a pointer to the beginning of a block of thirteen 256-bit memory words, which can contain 65 51-bit floats.

The last six bits of the index register indicate which of 64 of those 51-bit floats we want (the 65th slot in each block being left unused), and the remaining portion of the address calculation can be performed with the desired level of efficiency either by table look-up or by reciprocal multiplication.
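The whole calculation can be sketched in Python as follows. This is a model, not hardware: I have chosen bit addresses for clarity, the function names are my own, and the table look-up for the final step is shown as a precomputed Python list; a real implementation would use a small ROM or reciprocal multiplication, as noted above.

```python
# Bit offset of each of the 64 usable slots within a block of thirteen
# 256-bit words (3,328 bits): slot k lives in word k // 5, position k % 5.
# The 65th slot of each block is deliberately never generated.
SLOT_OFFSET_51 = [(k // 5) * 256 + (k % 5) * 51 for k in range(64)]

def address_51bit(base_bits, index):
    """Bit address of element `index` of an array of 51-bit floats
    starting at `base_bits`, a 256-bit-aligned bit address."""
    block = index >> 6              # leading bits of the index register
    slot = index & 63               # last six bits: one of 64 used slots
    # block * 13 by shift-and-add: 13 = 8 + 4 + 1
    block_words = (block << 3) + (block << 2) + block
    return base_bits + block_words * 256 + SLOT_OFFSET_51[slot]
```

Element 5, for example, lands at the start of the second 256-bit word of the first block, and element 64 at the start of the second block of thirteen words.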

Now, let's do the same thing for 36-bit floating-point numbers.

Again, the base register would point to a 256-bit memory word to keep things simple.

The displacement, if 16 bits long, would again be a 13-bit displacement in memory words of 256 bits, followed by a three-bit field which would now contain a value from 0 to 6.

The index register, again, just indicates which 36-bit float you want, and is a simple binary integer.

Here, we will take the last eight bits of the index register, to indicate which one of 256 36-bit floating-point numbers we want.

The preceding part of the contents of the index register will be multiplied by 37. That will
yield a pointer to a block of 37 memory words of 256 bits, containing 259 36-bit floating-point
numbers, so *three* of them will be wasted.

Can't we do better than that?

No, not in the sense of wasting fewer number slots per block. Of course, we could still waste only three slots in a larger block, at the cost of more complexity, and so lose a smaller fraction of memory to overhead.

Why not?

In binary notation, the number seven is 111.

Thus, 7 is 111, 14 is 1110, 28 is 11100, and in general, any number of the form 111000...0 is a multiple of 7.

Powers of two look like this: 1, 10, 100, 1000, 10000, and so on.

A power of two *minus one* is a string of ones; 1, 11, 111, 1111, 11111, 111111, and
so on, for the same reason that in decimal notation, powers of ten minus one look like 9, 99, 999,
and so on.

Thus, the residues of powers of two, modulo 7, recur in a cycle of three:

    2^3 - 1    111         8 = 7 + 1            = 7 + 1
    2^4 - 1    1111       16 = (14 + 1) + 1     = 14 + 2
    2^5 - 1    11111      32 = (28 + 3) + 1     = 28 + 4
    2^6 - 1    111111     64 = 63 + 1           = 63 + 1
    2^7 - 1    1111111   128 = (126 + 1) + 1    = 126 + 2
    2^8 - 1    11111111  256 = (252 + 3) + 1    = 252 + 4
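The cycle is easy to confirm by direct computation; a sketch:

```python
# Residues of the powers of two from 2^3 to 2^8, taken modulo 7.
# They recur in a cycle of three: 1, 2, 4, 1, 2, 4, ...
residues = [(1 << n) % 7 for n in range(3, 9)]
print(residues)
```

Since the residue of a power of two modulo 7 is never 6, a power-of-two count of slots can never come within one of filling a whole number of 7-slot words; a residue of 4, as with 256, giving three wasted slots, is the best available.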

And then again, those last eight bits of the index register contents will indicate one of 256 36-bit floating-point numbers, within a block of 37 256-bit words; as noted, this can be done through a table look-up, or division for a number of strictly limited length can be performed, for example by reciprocal multiplication, at an acceptable speed.
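The 36-bit calculation can be sketched in the same illustrative Python style as the 51-bit one; again, bit addresses, the function name, and the explicit table are mine, standing in for a ROM or reciprocal multiplication:

```python
# Bit offset of each of the 256 usable slots within a block of
# thirty-seven 256-bit words: slot k is in word k // 7, position k % 7.
# The last three of the block's 259 slots are never generated.
SLOT_OFFSET_36 = [(k // 7) * 256 + (k % 7) * 36 for k in range(256)]

def address_36bit(base_bits, index):
    """Bit address of element `index` of an array of 36-bit floats
    starting at `base_bits`, a 256-bit-aligned bit address."""
    block = index >> 8              # leading bits of the index register
    slot = index & 255              # last eight bits: one of 256 used slots
    # block * 37 by shift-and-add: 37 = 32 + 4 + 1
    block_words = (block << 5) + (block << 2) + block
    return base_bits + block_words * 256 + SLOT_OFFSET_36[slot]
```

Here element 7 begins the second word of a block, and element 256 begins the second block of thirty-seven words.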

It should be noted that, while this scheme for handling 36-bit and 51-bit floating-point numbers, embodying a principle that can be used for other data items of unusual lengths, solves what I see as the main problems in attempting to do this, it still is not without certain limitations requiring caution.

Let us suppose that 51-bit floating-point is used in a FORTRAN language system with INTERMEDIATE PRECISION as the statement type.

Then, supposing we have the following type declaration statements:

      INTERMEDIATE PRECISION A(125), B(120)
      EQUIVALENCE (A(6), B(1))

By having an offset of five floating-point numbers between the two arrays, even if a nonzero displacement within the 256-bit memory word were not allowed with indexed addressing (as I left the behavior in that case undefined above), the possibility of an error due to an illegal displacement would be avoided.

The result of such statements, assuming straightforward processing of references to the two arrays, would be that the elements of the two arrays would correspond in the following fashion:

 |--------------------------------------------|
 | A(6)  | A(7)  | A(8)  | A(9)  | A(10) |
 | B(1)  | B(2)  | B(3)  | B(4)  | B(5)  |
 |--------------------------------------------|
                     ...
 |--------------------------------------------|
 | A(61) | A(62) | A(63) | A(64) |  ---  |
 | B(56) | B(57) | B(58) | B(59) | B(60) |
 |--------------------------------------------|
 | A(65) | A(66) | A(67) | A(68) | A(69) |
 | B(61) | B(62) | B(63) | B(64) |  ---  |
 |--------------------------------------------|
 | A(70) | A(71) | A(72) | A(73) | A(74) |
 | B(65) | B(66) | B(67) | B(68) | B(69) |
 |--------------------------------------------|

since, as long as references to either array are handled through a base-displacement address pointing to the start of the array in question, the blocks of 65 floating-point numbers (the last of which is not used, to simplify addressing) of which an array is composed begin at the start of each array. An EQUIVALENCE whose offset is not an integral number of blocks therefore causes the kind of discrepancy illustrated above.
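The discrepancy can be demonstrated numerically; a sketch (the function name is mine, and "physical slot" here just means a slot position counted from A's base, including the unused 65th slot of each block):

```python
def slot_position(i):
    """Physical slot occupied by element i (1-based) of an array,
    counted from that array's own base: 65 slots per block of thirteen
    256-bit words, of which only the first 64 are used."""
    k = i - 1
    return (k // 64) * 65 + (k % 64)

# B's base is EQUIVALENCEd five slots past A's base: A(6) = B(1).
# Compare where A(i+5) and B(i) actually land, measured from A's base.
for i in (1, 59, 60, 61):
    a = slot_position(i + 5)       # slot of A(i+5)
    b = 5 + slot_position(i)       # slot of B(i)
    print(i, a == b)
```

The correspondence holds up through B(59) = A(64), then breaks: B(60) lands in the slot A's addressing skips, and from there on B(i) coincides with A(i+4) rather than A(i+5), exactly as in the table above.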

Can't this limitation be avoided? To ensure that the choice of which floating-point slots are skipped is always consistent, one would either have to add annotations to addresses, lengthening them considerably, to specify their offset from the start of a block, or one would have to take the binary address and divide it by 13 or 37. And, of course, the overhead of leaving some potential slots in each block unused was accepted specifically to avoid any need for division during address calculation.

However, another technique does exist which can avoid the problem. Instead of treating the displacement field in the instruction as a combination of a conventional memory address, and a specification of which item in a memory word is to be used, it could be handled in the same way as the index register contents are handled. In that case, the items to be skipped would be the same for all memory-reference instructions using the same base register.

This creates the problem that locations skipped in arrays are now completely inaccessible as data items of this length, and so they can't be used as simple variables.

However, when this technique is used, references to unaligned data items are simply not meaningful. On architectures where displacement fields in instructions normally contain displacements in bytes, therefore, the least significant field of the displacement could be used to select either approach to handling the displacement field.

Note that this is not strictly necessary to avoid this issue; the same area of memory could be accessed by two different base registers, having a small offset between the addresses they contain, in order to make all locations accessible. However, it is assumed that base registers are a valuable and scarce resource, and so added features which, among other things, allow the need for such a trick to be avoided are of value.

With modern arithmetic techniques, such as Sklansky adders and Wallace tree multipliers, the time required to add and multiply floating-point numbers does not grow linearly with their length. As a result, modern microprocessors use the same floating-point hardware for both single-precision and double-precision arithmetic.

And, with pipelining, one can initiate a new floating-point operation per cycle
in any case, and so if arithmetic were made slightly faster by working on shorter
numbers, the benefit would likely be nothing more than shaving one cycle off of the
*latency* of a floating-point instruction.

One piece of good news is that 13 is 1101 in binary, while 37 is 100101 in binary, so both of the multiplications required for address calculation only involve adding three numbers together.
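That is, each multiplication reduces to summing three shifted copies of the input; a sketch:

```python
# 13 = 0b1101 and 37 = 0b100101 each have exactly three set bits, so
# each multiplication is the sum of just three shifted copies of x.
def times_13(x):
    return (x << 3) + (x << 2) + x      # 8x + 4x + x

def times_37(x):
    return (x << 5) + (x << 2) + x      # 32x + 4x + x
```

In hardware, a single level of carry-save addition reduces the three terms to two before one carry-propagate add.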

If one also wished to divide a 256-bit memory word into three 85-bit portions, with one bit left over, then one could use multiplication by 11 (1011 in binary) to omit one portion in a block of 33.
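That 85-bit variant follows the same pattern as the sketches above, and 11 = 0b1011 again has three set bits; a sketch (bit addressing and names mine):

```python
# Three 85-bit portions per 256-bit word; eleven words hold 33 portions,
# of which only the first 32 are used, so the index splits on a 5-bit field.
SLOT_OFFSET_85 = [(k // 3) * 256 + (k % 3) * 85 for k in range(32)]

def address_85bit(base_bits, index):
    """Bit address of element `index` of an array of 85-bit items
    starting at `base_bits`, a 256-bit-aligned bit address."""
    block = index >> 5              # leading bits of the index
    slot = index & 31               # last five bits: one of 32 used portions
    # block * 11 by shift-and-add: 11 = 8 + 2 + 1
    block_words = (block << 3) + (block << 1) + block
    return base_bits + block_words * 256 + SLOT_OFFSET_85[slot]
```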
