External Vector Coprocessing

The idea of replacing an essentially passive array of memory elements by one that is peppered with processors for every so much memory is one that has been proposed many times, but one that has seldom been implemented. It has been considered visionary rather than practical.

On the other hand, banks of arithmetic-logic units set up to perform calculations in parallel have been implemented many times, and have been in use for some time.

A limited amount of vector arithmetic capability had been present in some computers using vacuum tubes (the AN/FSQ-7) or transistors (the TX-2 and the AN/FSQ-32), of a type resembling the MMX feature available with Intel microprocessors or the short vector instructions on the architecture described here.

Many of the early machines which provided long vector capabilities did so through a special rapid pipelining mode rather than by having multiple arithmetic-logic units, one for each element of a vector or at least each element of a sizable chunk of a vector.

It was still obvious from the start that a vector unit should ideally be connected to memory along a data path having the same width in bits as the combined width of the arithmetic units involved.

In order that the position on this wide data bus of each arithmetic unit be fixed, it would appear preferable to organize an external vector unit of this type along the lines of that performing the short vector instructions of the main processor architecture we have been examining, rather than along the lines of that performing the long vector instructions. However, historically, external vector units have tended to be organized with a fixed number of arithmetic-logic units, so that when doing double-precision arithmetic, half the arithmetic-logic units, even if they are ones only capable of single-precision arithmetic, would not be idle. This can be minimized by making the ALU reconfigurable, so that much of the same logic is used whether one long or two short operands are being worked on, but some circuitry (such as that for normalization) will unavoidably be idle in the case of fewer operands.

In addition to supercomputers with vector capabilities, later vector floating-point units were available as third-party add-on systems for both mainframes and minicomputers from companies such as Floating Point Systems.

More recently, IBM's Enterprise Systems Architecture/390 provided for an optional external vector arithmetic unit the internal width of which would vary from one model to another, but which shared a common interface from the programmer's perspective. This vector facility connected directly to the CPU, and, at least in the case of the version used with ES/9000, operated on data in cache. It had internal vector registers, so it provided a basic architecture similar to that of vector supercomputers like the Cray-1 and its successors. Before that, add-on vector units offered for the System/360 and the original System/370, such as the IBM 2938 Array Processor and the IBM 3838 Array Processor, had their own internal memories, to which data would be transferred before the start of a computation involving vectors.

In the context of the specific architecture being presented here on these pages, it may be noted that the vector mode and the short page mode of operation provide memory-to-memory vector operations only, and the modes which provide access to the internal long vector registers also include memory-to-memory vector operations. Circuitry to detect the presence of an appropriate external coprocessor, and to permit the delegation of these operations to that coprocessor, can be added to the design without in any way altering the instructions themselves. However, if an external vector coprocessor were present, it would be useful to have available vector instructions which had a length field which was 16 or even 32 bits in length, instead of only six bits in length, so that a single instruction could cause a long vector to be processed concurrently with normal machine operations. Of course, the external vector coprocessing units would be pipelined, as this is an inexpensive method of permitting them to perform a larger number of operations in a given time.

Since the external vector coprocessing system would be connected directly to main memory, it always treats memory as divided into 32-bit words of the normal size, ignoring any use of 36-bit, 24-bit, 40-bit or 60-bit words by the central processor, and it also does not handle floating-point numbers of the Medium type, but instead provides support for floating-point numbers of the Small type, just as is done by the Short Vector arithmetic unit.

The following illustration depicts the architecture for external vector coprocessing that is envisaged as being associated with the processor architecture described here, as will be discussed below:

Because the external vector coprocessing system could consist of a multiplicity of functional units, such as, for example, a set of sixteen identical chips, and since it works with operands in memory rather than the cache, it would have a wide path to memory. This suggests the possibility of a design in which the external vector coprocessing chips also perform a memory interface function; thus, while the main microprocessor could have, as given as an example, a 256 bit wide data bus, each of sixteen vector coprocessing units could have a data bus to memory of similar width, as well as being connected to the processor data bus, which could then operate at a higher speed than that of the memory.

However, if it is not possible to have a processor data bus that runs more quickly than the bus that can be used with the memory, then a different architecture would be more appropriate:

Here, instead of the chips just being memory controllers in addition to containing external vector processing units, they would also contain external L3 cache. This still serves a useful purpose, as external DRAM has latency limitations in addition to that imposed by the speed of the external data bus being slower than the processor cycle time. Four of these chips, instead of eight, are shown in this diagram, since each of those chips would supply 64 out of the 256 bits of the processor data bus; this way, double-precision floating-point numbers would stay in one piece inside each of them.

This also indicates that the individual chips in that system, having a fixed-width path to memory, would likely work on a principle similar to that of the ALU used for the short vector instructions.

Also, while an elaborate bus structure was shown on this page for allowing, in two steps, vector operands beginning at any arbitrary position to be brought into alignment, for the case of an elaborate implementation of the architecture having a set of 64 ALUs within the chip itself, for the external system, such a capability is unlikely. Instead, if the external system consists of a row of chips, each one coupled to a bank of memory, likely each chip would simply be connected to its two immediately adjacent neighbors. This, of course, would mean that vectors not fully aligned, in terms of the width of the entire assembly of external coprocessor chips, would involve a significant performance penalty, although even that penalty would not outweigh the benefit of avoiding the limited-width data path to the CPU itself.

Note also that the bus between the RAM and each VPU would be able to handle data more quickly than the access time of the RAM, and, thus, the RAM would still be interleaved in this configuration; that would combine with the wider path to memory provided by the multiple vector processor units to further improve memory bandwidth. Essentially, the memory is interleaved because it is slower than the speed of a conventional data bus, and with the vector processing units, multiple conventional data buses are present, each one connecting one of those units to a slice of memory, and a high-speed data bus is provided to the main CPU.

Modern memory modules, used in today's personal computers, include on-chip interleaving of memory banks for higher data transfer speeds. Two-way interleaving appeared first, and more recently, four-way interleaving has made an appearance. Even before this happened, of course, high-performance computers could implement interleaving with external circuitry and by requiring that memory modules be installed in matched sets.

Let us suppose that, in a high-performance implementation, interleaving was taken to what may seem to be an extreme; let the memory be 16-way interleaved, and, in addition, eight of these combination vector coprocessor and memory management chips are used, for a bus to main memory that is 2,048 bits wide. This would mean that the speed at which data is fed to the CPU chip, through its 256-bit external data bus, would be 128 times the basic speed at which data is available from the memory cells used in the main memory.

Using the high-performance chip architecture given here as an example implementation, it takes sixteen fetches of 256 bits of data to fill a single cache line, each cache line being 4,096 bits wide, in order to provide 64 vector arithmetic-logic units each with 64 bits of data in parallel. Thus, even with all this effort being taken to provide data to the processor at a high speed, external data streamed to the chip at the maximum possible rate would only match the rate at which data could be fetched internally from the cache if the memory cells of the cache were but eight times faster than the memory cells used in main memory.

Due to latencies and other issues, the cache would still justify its existence if the disparity in speeds were only eightfold, but it is likely that the disparity will be larger than that.
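The arithmetic behind the figures of 128 and eight can be checked with a short sketch. The numbers are those given above; the variable and function names are illustrative only and are not part of the architecture.

```python
# Figures taken from the text; all names here are illustrative assumptions.
INTERLEAVE_WAYS = 16        # memory banks interleaved behind each chip
COPROCESSOR_CHIPS = 8       # combined vector coprocessor and memory chips
CPU_BUS_BITS = 256          # external data bus of the CPU chip
CACHE_LINE_BITS = 64 * 64   # 64 ALUs, each supplied with 64 bits


def memory_bus_bits():
    """Total width of the path from the chips to main memory."""
    return COPROCESSOR_CHIPS * CPU_BUS_BITS


def cpu_bus_speedup():
    """Transfer rate of the CPU bus relative to one memory cell cycle."""
    return INTERLEAVE_WAYS * COPROCESSOR_CHIPS


def fetches_per_cache_line():
    """Number of 256-bit transfers needed to fill one cache line."""
    return CACHE_LINE_BITS // CPU_BUS_BITS


def break_even_cache_speed():
    """Cache cell speed, relative to main memory cells, at which reading
    one cache line per cache cycle just matches the streaming rate."""
    return cpu_bus_speedup() // fetches_per_cache_line()


print(memory_bus_bits(), cpu_bus_speedup(),
      fetches_per_cache_line(), break_even_cache_speed())
```

The last value is simply 128 divided by 16: the external bus delivers 128 transfers per memory cell cycle, and sixteen of them make up one cache line.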

These instructions have the form:

The length field must be a multiple of 32 bytes, which means its last five bits must be zero, as it is in units of bytes, and all operands must be aligned on a 32 byte, or 256 bit, boundary.
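As a minimal sketch of this rule, an assembler or simulator might test the length and the operand addresses as follows. The function names are hypothetical, and whether a length of zero is acceptable is not stated here, so the sketch does not decide that question.

```python
ALIGNMENT_BYTES = 32  # 256 bits, as stated in the text


def low_five_bits_clear(value):
    """True when value is a multiple of 32, i.e. its last five bits are 0."""
    return (value & (ALIGNMENT_BYTES - 1)) == 0


def operands_acceptable(length_bytes, *addresses):
    """Check the length field and every operand address (in bytes)."""
    return low_five_bits_clear(length_bytes) and all(
        low_five_bits_clear(address) for address in addresses
    )


print(operands_acceptable(4096, 0x1000, 0x2020))  # aligned length and operands
print(operands_acceptable(4096, 0x1000, 0x2010))  # second operand misaligned
```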

These instructions are available from normal mode, scratchpad mode, and compact mode. As well, they are available from simple mode, with their first 16 bits modified as follows:

                              Normal
                              Compact  Simple
                              -------  ------
External Vector Coprocessor   011155   051550
Three-Address                 011177   051770
Single Operand                011144   051440

Their opcodes are:

011155 000004 XVAB      External Vector Add Byte
011155 000005 XVSB      External Vector Subtract Byte

011155 000013 XVMVSM    External Vector Move Small

011155 000016 XVMINB    External Vector Minimum Byte
011155 000017 XVMAXB    External Vector Maximum Byte

011155 000024 XVAH      External Vector Add Halfword
011155 000025 XVSH      External Vector Subtract Halfword
011155 000026 XVMH      External Vector Multiply Halfword
011155 000027 XVDH      External Vector Divide Halfword

011155 000033 XVMVF     External Vector Move Floating

011155 000036 XVMINH    External Vector Minimum Halfword
011155 000037 XVMAXH    External Vector Maximum Halfword

011155 000044 XVA       External Vector Add
011155 000045 XVS       External Vector Subtract
011155 000046 XVM       External Vector Multiply
011155 000047 XVD       External Vector Divide

011155 000052 XVMV      External Vector Move
011155 000053 XVMVD     External Vector Move Double
011155 000054 XVN       External Vector AND
011155 000055 XVO       External Vector OR
011155 000056 XVMIN     External Vector Minimum
011155 000057 XVMAX     External Vector Maximum

011155 000064 XVAL      External Vector Add Long
011155 000065 XVSL      External Vector Subtract Long
011155 000066 XVML      External Vector Multiply Long
011155 000067 XVDL      External Vector Divide Long

011155 000073 XVMVQ     External Vector Move Quad
011155 000074 XVSW      External Vector Swap
011155 000075 XVX       External Vector XOR
011155 000076 XVMINL    External Vector Minimum Long
011155 000077 XVMAXL    External Vector Maximum Long

011155 001004 XVASM     External Vector Add Small
011155 001005 XVSSM     External Vector Subtract Small
011155 001006 XVMSM     External Vector Multiply Small
011155 001007 XVDSM     External Vector Divide Small

011155 001012 XVMINSM   External Vector Minimum Small
011155 001013 XVMAXSM   External Vector Maximum Small

011155 001024 XVAF      External Vector Add Floating
011155 001025 XVSF      External Vector Subtract Floating
011155 001026 XVMF      External Vector Multiply Floating
011155 001027 XVDF      External Vector Divide Floating

011155 001032 XVMINF    External Vector Minimum Floating
011155 001033 XVMAXF    External Vector Maximum Floating
011155 001034 XVAU      External Vector Add Unnormalized
011155 001035 XVSU      External Vector Subtract Unnormalized
011155 001036 XVMU      External Vector Multiply Unnormalized
011155 001037 XVDU      External Vector Divide Unnormalized

011155 001044 XVAD      External Vector Add Double
011155 001045 XVSD      External Vector Subtract Double
011155 001046 XVMD      External Vector Multiply Double
011155 001047 XVDD      External Vector Divide Double

011155 001052 XVMIND    External Vector Minimum Double
011155 001053 XVMAXD    External Vector Maximum Double
011155 001054 XVAUD     External Vector Add Unnormalized Double
011155 001055 XVSUD     External Vector Subtract Unnormalized Double
011155 001056 XVMUD     External Vector Multiply Unnormalized Double
011155 001057 XVDUD     External Vector Divide Unnormalized Double

011155 001064 XVAQ      External Vector Add Quad
011155 001065 XVSQ      External Vector Subtract Quad
011155 001066 XVMQ      External Vector Multiply Quad
011155 001067 XVDQ      External Vector Divide Quad

011155 001072 XVMINQ    External Vector Minimum Quad
011155 001073 XVMAXQ    External Vector Maximum Quad
011155 001074 XVAUQ     External Vector Add Unnormalized Quad
011155 001075 XVSUQ     External Vector Subtract Unnormalized Quad
011155 001076 XVMUQ     External Vector Multiply Unnormalized Quad
011155 001077 XVDUQ     External Vector Divide Unnormalized Quad

and the C bit in the instruction indicates the source operand is a single variable instead of a vector, and the R bit indicates that the destination is to be subtracted from the source, with the result going in the destination, or the source is to be divided by the destination with the result going in the destination. For constant operations, the mnemonics in these cases end in C (Constant) or RC (Reversed Constant) respectively.
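The effect of the C and R bits can be modelled in a few lines. This is only a sketch: the function name is invented, vectors are modelled as Python lists, and the default direction (destination minus source, destination divided by source) is an assumption inferred from the description of the reversed case.

```python
import operator


def two_address(op, dest, source, constant=False, reverse=False):
    """Model one two-address external vector operation.

    constant -- the C bit: source is a single value, not a vector.
    reverse  -- the R bit: the destination is subtracted from (or
                divided into) the source instead of the other way round.
    Assumed default direction: dest = dest op source.
    """
    if constant:
        source = [source] * len(dest)
    if reverse:
        return [op(s, d) for d, s in zip(dest, source)]
    return [op(d, s) for d, s in zip(dest, source)]


# Ordinary vector subtract.
print(two_address(operator.sub, [10, 20, 30], [1, 2, 3]))
# Reversed Constant subtract: 100 minus each element.
print(two_address(operator.sub, [10, 20, 30], 100, constant=True, reverse=True))
```

For commutative operations such as add and multiply the R bit makes no difference, which is why it is described only for subtraction and division.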

The MIN instruction returns the lesser of its two arguments, the MAX instruction returns the greater of its two arguments; just like the ZIN single-operand instruction, to be described below, this helps in performing more sophisticated operations on external vectors without the need for conditional branches on individual elements. This is a technique used in graphics processors; in the main CPU, this problem is dealt with in a different fashion, using mask bits and multi-way vector instructions.
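To illustrate how these instructions stand in for per-element branches, the sketch below clamps a vector to a range using only a minimum and a maximum, and takes the positive part of a vector with ZIN. The Python functions are stand-ins for the instructions, with vectors modelled as lists; the names are illustrative.

```python
def xv_min(a, b):
    """Element-wise lesser of two vectors."""
    return [min(x, y) for x, y in zip(a, b)]


def xv_max(a, b):
    """Element-wise greater of two vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def xv_zin(a):
    """Replace each negative element by zero."""
    return [x if x >= 0 else 0 for x in a]


def clamp(v, low, high):
    """Limit every element to [low, high] with one MIN and one MAX,
    using the constant forms of both operations."""
    n = len(v)
    return xv_max(xv_min(v, [high] * n), [low] * n)


print(clamp([-5, 3, 12], 0, 10))
print(xv_zin([-2, 0, 5]))
```

The program issuing these operations contains no conditional branch on any element; the selection happens inside each vector operation.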

Three-address instructions, in which either the source operand or the operand operand may be constants, but which do not need, and do not have, the option of a reversed direction, are also present.

Note that unnormalized floating-point operations are included; this is important so that it is possible to keep track of significance in what may be an enormous computation.

The single-operand instruction format shows a seven-bit field as available for the opcode. This opcode field is similar to a concatenation of the two-bit type field, and the five-bit opcode field, in a normal floating-point single-operand instruction, but the values of the five-bit opcode field are modified to avoid conflicts between these opcodes and some opcodes that are reserved for special purposes from those used for two-address and three-address instructions; they must also be reserved here, since whether an instruction is a single-address instruction, a two-address instruction, or a three-address instruction, is not indicated until after the first halfword of the instruction.

The opcodes for external vector single-operand instructions are:

011144 000001 XVSINSM  011144 000041 XVSIN    011144 001001 XVSIND   011144 001041 XVSINQ
011144 000002 XVCOSSM  011144 000042 XVCOS    011144 001002 XVCOSD   011144 001042 XVCOSQ
011144 000003 XVTANSM  011144 000043 XVTAN    011144 001003 XVTAND   011144 001043 XVTANQ
011144 000004 XVRSQSM  011144 000044 XVRSQ    011144 001004 XVRSQD   011144 001044 XVRSQQ
011144 000005 XVASNSM  011144 000045 XVASN    011144 001005 XVASND   011144 001045 XVASNQ
011144 000006 XVACSSM  011144 000046 XVACS    011144 001006 XVACSD   011144 001046 XVACSQ
011144 000007 XVATNSM  011144 000047 XVATN    011144 001007 XVATND   011144 001047 XVATNQ

011144 000011 XVSINHSM 011144 000051 XVSINH   011144 001011 XVSINHD  011144 001051 XVSINHQ
011144 000012 XVCOSHSM 011144 000052 XVCOSH   011144 001012 XVCOSHD  011144 001052 XVCOSHQ
011144 000013 XVTANHSM 011144 000053 XVTANH   011144 001013 XVTANHD  011144 001053 XVTANHQ
011144 000014 XVRECSM  011144 000054 XVREC    011144 001014 XVRECD   011144 001054 XVRECQ
011144 000015 XVASNHSM 011144 000055 XVASNH   011144 001015 XVASNHD  011144 001055 XVASNHQ
011144 000016 XVACSHSM 011144 000056 XVACSH   011144 001016 XVACSHD  011144 001056 XVACSHQ
011144 000017 XVATNHSM 011144 000057 XVATNH   011144 001017 XVATNHD  011144 001057 XVATNHQ

                                              011144 001021 XVCLR    011144 001061 XVINV
011144 000022 XVZINB   011144 000062 XVZINH   011144 001022 XVZIN    011144 001062 XVZINL
011144 000023 XVABSB   011144 000063 XVABSH   011144 001023 XVABS    011144 001063 XVABSL
011144 000024 XVSQRSM  011144 000064 XVSQR    011144 001024 XVSQRD   011144 001064 XVSQRQ
011144 000025 XVQBRSM  011144 000065 XVQBR    011144 001025 XVQBRD   011144 001065 XVQBRQ
011144 000026 XVLOGSM  011144 000066 XVLOG    011144 001026 XVLOGD   011144 001066 XVLOGQ
011144 000027 XVEXPSM  011144 000067 XVEXP    011144 001027 XVEXPD   011144 001067 XVEXPQ

011144 000031 XVZINSM  011144 000071 XVZINF   011144 001031 XVZIND   011144 001071 XVZINQ
011144 000032 XVSGNB   011144 000072 XVSGNH   011144 001032 XVSGN    011144 001072 XVSGNL
011144 000033 XVNEGB   011144 000073 XVNEGH   011144 001033 XVNEG    011144 001073 XVNEGL
011144 000034 XVCLRSM  011144 000074 XVCLRF   011144 001034 XVCLRD   011144 001074 XVCLRQ
011144 000035 XVABSSM  011144 000075 XVABSF   011144 001035 XVABSD   011144 001075 XVABSQ
011144 000036 XVSGNSM  011144 000076 XVSGNF   011144 001036 XVSGND   011144 001076 XVSGNQ
011144 000037 XVNEGSM  011144 000077 XVNEGF   011144 001037 XVNEGD   011144 001077 XVNEGQ

The additional opcodes that may be unfamiliar here are RSQ, which calculates the reciprocal of the square root, ZIN, which replaces its argument by zero if it is negative, and REC, which calculates the reciprocal. Note also the presence of a few fixed-point single-operand instructions: fixed-point forms of CLR, ABS, SGN, NEG, and ZIN, as well as INV, which performs a one's complement. Also note that XVINV is distinguished from XVCLR by type bits rather than opcode bits (as was also done for the AND and Swap instructions, and the OR and XOR instructions, to make room for the floating-point move instructions), so that opcode space is available for the fixed-point XVZIN instructions.
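Judging only from the octal values in the table of single-operand opcodes, the two type bits appear as the octal 1000 and octal 40 bits of the second halfword, and the five-bit opcode occupies its last five bits. The decoder below is a sketch built on that observation alone; it does not rely on the instruction format diagram, and the function name is hypothetical.

```python
def decode_single_operand(halfword):
    """Split the second halfword of a single-operand external vector
    instruction into (type, opcode), following the pattern seen in the
    opcode table: type bits at octal 1000 and 40, opcode in the low 5 bits."""
    high_type_bit = (halfword >> 9) & 1
    low_type_bit = (halfword >> 5) & 1
    type_bits = (high_type_bit << 1) | low_type_bit
    opcode = halfword & 0o37
    return type_bits, opcode


# XVSINSM, XVSIN, XVSIND and XVSINQ share opcode 01 and differ in type.
for value in (0o000001, 0o000041, 0o001001, 0o001041):
    print(oct(value), decode_single_operand(value))
```

Applied to 000022, 000062, 001022 and 001062 the same function yields opcode 22 with types 0 through 3, matching the four fixed-point XVZIN forms.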

Multiple-Component Instructions

The diagram below:

illustrates the instruction formats associated with the external vector opcodes which end in three zero bits, which were unused up to this point. As with the other instructions, in the case of advanced compound mode, the 123400 prefix halfword is added. These opcodes are followed by the additional instruction halfwords as shown in the diagrams above, and then the length field (omitted within a stretch, as described below) and the other addressing fields of normal two-address and three-address external vector memory-reference instructions which vary depending on the mode of operation in effect.

Because the positions of later halfwords in the instruction are changed, although the bits in them which indicate whether an instruction is a two-address or a three-address instruction should still be set to their appropriate values, as shown above for the different instruction modes, the first bit of the halfword immediately following the opcode also indicates, if 0, that the instruction is a two-address instruction, and, if 1, that the instruction is a three-address instruction.

These multiple-component instructions treat vectors as composed of repeated groups of two (as illustrated in the first two formats in the diagram above) or four (in the case of the last two formats in the diagram above) numbers.

In a two-address instruction, the values for op1, op2, op3 and op4 have the following meanings:

0010 MOV

0100 ADD
0101 SUB
0110 MUL
0111 DIV

1011 XOR
1100 AND  ADDU
1101 OR   SUBU
1110      MULU
1111      DIVU

and in a three-address instruction, the values for op1, op2, op3 and op4 have the following meanings:

000 ADD
001 SUB
010 MUL
011 DIV
100 AND  ADDU
101 OR   SUBU
110      MULU
111 XOR  DIVU

the ones in the second column replacing the logical operations for floating-point numbers with unnormalized floating-point arithmetic.

In a two-component instruction, operations are performed on pairs of numbers of the type indicated in the first halfword of the instruction. For each component in the result, an operation is indicated, and a source component is indicated, and, if it is a three-address instruction, an operand component is also indicated. All possible destination components are used; the destination component of the operation indicated by op1 is the first of the two destination components, and the destination component of the operation indicated by op2 is the second of two destination components.

In a three-address instruction, there is also a two-bit op field; this indicates how the result of the operation performed on the source component and the operand component is applied to the destination component. The possible values for this field are:

00 replace
01 zero and subtract
10 add
11 subtract

Thus, it is possible to divide the contents of the operand field by the contents of the source field, and then subtract the result from the contents of the destination field, with the result being placed in the destination field.

With two two-component instructions, it is possible to multiply two vectors of complex numbers.
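One way this could work is sketched below, with each complex number stored as a (real, imaginary) pair. The first instruction writes the products involving the real part of the source; the second adds or subtracts the products involving its imaginary part. The data layout, the function names, and the reading of "zero and subtract" as negation of the result are all assumptions made for illustration.

```python
import operator

# The two-bit field applied to (old destination value, new result).
APPLY = {
    "replace": lambda d, r: r,
    "zero and subtract": lambda d, r: -r,  # assumed to mean 0 - r
    "add": lambda d, r: d + r,
    "subtract": lambda d, r: d - r,
}


def two_component(dest, source, operand, slots):
    """Model a three-address two-component instruction.

    slots holds one (op, source_index, operand_index, apply) entry per
    destination component, in the order op1, op2."""
    result = []
    for d, s, o in zip(dest, source, operand):
        d = list(d)
        for k, (op, si, oi, how) in enumerate(slots):
            d[k] = APPLY[how](d[k], op(s[si], o[oi]))
        result.append(tuple(d))
    return result


def complex_multiply(a, b):
    """(ar + ai*i)(br + bi*i) using two two-component instructions."""
    mul = operator.mul
    dest = [(0.0, 0.0)] * len(a)
    # re = ar*br ; im = ar*bi
    dest = two_component(dest, a, b, [(mul, 0, 0, "replace"),
                                      (mul, 0, 1, "replace")])
    # re -= ai*bi ; im += ai*br
    dest = two_component(dest, a, b, [(mul, 1, 1, "subtract"),
                                      (mul, 1, 0, "add")])
    return dest


print(complex_multiply([(1.0, 2.0)], [(3.0, 4.0)]))
```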

In a four-component instruction, we again have a series of four instructions which select a source component, and, in a three-address instruction, an operand component, within the corresponding elements of the source vector and the operand vector of the instruction.

This type of instruction is very similar to, but is a superset of, the type of operation commonly found in pixel (or fragment) and vertex shaders in graphics chips. However, tasks such as the rasterization of polygons still require either a conventional software program, or the use of a conventional special-purpose graphics chip.

Also note that if the external vector coprocessors have a 256-bit path to memory, four-component operations on the Quad floating-point type are not possible unless pairs of external vector coprocessors co-operate in performing them. Narrower paths to main memory could impose more severe restrictions.
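The restriction follows from the width of one group of components. The sketch below assumes that the Floating, Double and Quad types are 32, 64 and 128 bits long respectively, and uses a hypothetical helper to count how many coprocessors must co-operate for a given group.

```python
# Assumed operand widths in bits for the floating-point types.
TYPE_BITS = {"Floating": 32, "Double": 64, "Quad": 128}


def coprocessors_needed(type_name, components, path_bits=256):
    """Number of coprocessors whose memory paths together hold one group."""
    group_bits = TYPE_BITS[type_name] * components
    return -(-group_bits // path_bits)  # ceiling division


print(coprocessors_needed("Double", 4))  # one 256-bit path suffices
print(coprocessors_needed("Quad", 4))    # 512 bits: a pair must co-operate
```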

In many cases, an alternative to using multiple-component instructions would be to use multiple vectors, each vector being of one component. This requires more instructions, but that is not a real cost if the length of a vector handled by an instruction is bounded with sufficient severity that it is shorter than the actual length of the arrays on which operations are being performed. With a 32-bit length for vectors, this is not the case for the external vector coprocessor. The use of long vector instructions with stride is also possible for conversion between the two possible memory organizations.

Using Registers in the External Vector Coprocessor

Because the operands of these instructions have a length indicated by a thirty-two bit length field, it is not practical for the external coprocessor units to possess registers of sufficient size to store a complete operand. However, the ability to use registers, so that not every step in a calculation requires a memory access, is very important. This is handled by providing a special interpretation to the following opcodes for the two-address form of an external vector operation:

011244 000002 XVLB   External Vector Load Byte
011244 000003 XVSTB  External Vector Store Byte

011244 000022 XVLH   External Vector Load Halfword
011244 000023 XVSTH  External Vector Store Halfword

011244 000042 XVL    External Vector Load
011244 000043 XVST   External Vector Store

011244 000062 XVLL   External Vector Load Long
011244 000063 XVSTL  External Vector Store Long

011244 001002 XVLSM  External Vector Load Small
011244 001003 XVSTSM External Vector Store Small

011244 001022 XVLF   External Vector Load Floating
011244 001023 XVSTF  External Vector Store Floating

011244 001042 XVLD   External Vector Load Double
011244 001043 XVSTD  External Vector Store Double

011244 001062 XVLQ   External Vector Load Quad
011244 001063 XVSTQ  External Vector Store Quad

These are memory-to-register instructions. The dX field instead serves as a dR field, and the halfword containing the corresponding address (or, in the case of the short page modes, the indirect bit, the dB field, and the address) is omitted from the instruction, because the destination is a register, and is not in memory; the halfword containing the source address remains present in the instruction. The store instruction also omits the length specification.

A load instruction begins a stretch of code using registers, and a store instruction ends a stretch of code using registers.

The stretch should be treated as if it were a single instruction; no attempt should be made to branch into it, or out of it. For one thing, an attempt to branch into a stretch from instructions not part of a stretch would result in the instructions within it not being interpreted correctly, because the format of instructions is being changed by the omission of the length field. A stretch should be thought of as being similar to a series of instructions within an FLL (Fixed-Length Loop) instruction, since the series is sent once to the external vector coprocessors, but is performed repeatedly by them, as many times as the width of their path to memory fits into the length given in the initial load instruction.

The length specification in the load instruction applies to all the instructions in the stretch. A stretch consists exclusively of memory-reference instructions and single-operand instructions, both fixed and floating. The multiple-component instructions are allowed within a stretch. All these instructions will be treated as external vector instructions; register references will be to the internal registers of the external vector coprocessor.

These registers will be as wide as the bus connecting each external vector coprocessor to memory; the instructions making up the stretch will be repeated, therefore, with a suitable displacement each time, until the entire length of the vector is processed. Thus, none of the vectors referenced in memory may overlap, or the results will be unpredictable.
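The repetition of a stretch can be pictured as a loop over register-width slices of the vector. The sketch below models memory as a bytearray, a single register 32 bytes wide, and byte elements; the function names, the single-register simplification, and the wrap-around on overflow are assumptions made for illustration.

```python
WIDTH_BYTES = 32  # register width, equal to the 256-bit path to memory


def run_stretch(memory, length_bytes, load_addr, steps, store_addr):
    """Repeat one stretch over a vector, one register-width slice per pass.

    steps is a list of (function, address) pairs, each standing for a
    memory-reference instruction inside the stretch that combines the
    register with a vector in memory."""
    if length_bytes % WIDTH_BYTES:
        raise ValueError("length must be a multiple of the register width")
    for disp in range(0, length_bytes, WIDTH_BYTES):
        # The load that opens the stretch fills the register.
        reg = bytes(memory[load_addr + disp:load_addr + disp + WIDTH_BYTES])
        # Every instruction in the stretch uses the same displacement.
        for step_fn, addr in steps:
            mem_slice = memory[addr + disp:addr + disp + WIDTH_BYTES]
            reg = bytes(step_fn(r, m) for r, m in zip(reg, mem_slice))
        # The store that closes the stretch writes the register back.
        memory[store_addr + disp:store_addr + disp + WIDTH_BYTES] = reg


def add_byte(r, m):
    """Byte addition, assumed here to wrap around on overflow."""
    return (r + m) & 0xFF


memory = bytearray(192)
memory[0:64] = bytes(range(64))   # first source vector
memory[64:128] = bytes([1] * 64)  # second source vector
run_stretch(memory, 64, 0, [(add_byte, 64)], 128)
print(list(memory[128:136]))
```

Because each pass reads and writes only its own slice, a destination that overlapped a source at a different displacement would be overwritten before a later pass read it, which is one way to see why overlapping vectors give unpredictable results.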

Also, within a stretch, the opcode 011144 000021 will be used as an instruction prefix, to indicate that the source operand of an instruction is a scalar, to be used as a constant operand acting on every element of a vector.

In addressing modes that provide three-address memory-to-memory scalar operations, they may be used; otherwise, a three-address external vector instruction with the length field omitted may be present within a stretch as well. Register operands are specified by using a zero base register value in the vector register and symmetric vector register modes, as well as in the conventional modes; in the short page modes, the three bits following the index register field, normally zero, are to contain the bits 111 to indicate a register operand.

Since a load instruction begins a stretch, and a store instruction ends a stretch, transfers within the stretch between registers and memory are handled by the following instructions:

011177 000012 XVMVB     External Vector Move Byte
011177 000032 XVMVH     External Vector Move Halfword
011177 000052 XVMV      External Vector Move
011177 000072 XVMVL     External Vector Move Long

011177 000013 XVMVSM    External Vector Move Small
011177 000033 XVMVF     External Vector Move Floating
011177 000053 XVMVD     External Vector Move Double
011177 000073 XVMVQ     External Vector Move Quad

Note that, due to a lack of available opcodes, the move instructions for the floating-point types are grouped with fixed-point opcodes.

Only a limited subset of the available floating-point formats supported by the architecture would be available by means of vector coprocessor operations. It is not yet defined how the format to use would be specified, if any choice at all is made available. Presumably, a limited-width field in the Program Status Block would contain the external vector coprocessor floating-point format in current use, which would be signalled to the external vector coprocessor each time an operation is requested.

It is envisaged that in any case, the Standard floating-point format would be supported, since the IEEE 754 floating-point format has become the standard floating-point format supported by virtually all microprocessors. As a second choice, the Compatible or Modified formats, for compatibility with the IBM System/360 series of computers and their successors, suggest themselves. But another very important alternative format would be the Common floating-point format. This is the floating-point format used when it is desired to increase the total floating-point performance of the central processing unit to its maximum value by modifying Simple Floating-Point operation, which is carried out by the fixed-point arithmetic units, by decreasing the length of the exponent field, so that compatibility with the floating-point arithmetic units is achieved. Allowing the external vector coprocessing units to handle this format as well permits all available floating-point processing capacity to use the same floating-point format to maximize floating-point throughput.

Since the external vector coprocessor only works with memory in its native width, the only variation of the Common floating-point format that would be used would be the one where an excess-128 binary exponent occupied the last eight bits of a 32, 64, or 128 bit long floating-point number with a sign-magnitude mantissa. Note also that the integer arithmetic units of the central processing unit do not provide the guard, round, and sticky bits, or their equivalents, that both the central processing unit floating-point units and the external vector coprocessor provide, and, thus, there is a penalty in accuracy for maximizing floating-point performance in this manner.