
External Vector Coprocessing

The idea of replacing an essentially passive array of memory elements by one that is peppered with processors for every so much memory is one that has been proposed many times, but one that has seldom been implemented. It has been considered visionary rather than practical.

On the other hand, banks of arithmetic-logic units set up to perform calculations in parallel have been implemented many times, and have been in use for some time.

A limited amount of vector arithmetic capability had been present in some computers using vacuum tubes (the AN/FSQ-7) or transistors (the TX-2 and the AN/FSQ-32), of a type resembling the MMX feature available with Intel microprocessors or the short vector instructions on the architecture described here.

Many of the early machines which provided long vector capabilities did so through a special rapid pipelining mode rather than by having multiple arithmetic-logic units, one for each element of a vector or at least each element of a sizable chunk of a vector.

It was still obvious from the start that a vector unit should ideally be connected to memory along a data path having the same width in bits as the combined width of the arithmetic units involved.

In order that the position on this wide data bus of each arithmetic unit be fixed, it would appear preferable to organize an external vector unit of this type along the lines of that performing the short vector instructions of the main processor architecture we have been examining, rather than along the lines of that performing the long vector instructions. However, historically, external vector units have tended to be organized with a fixed number of arithmetic-logic units, so that when doing double-precision arithmetic, half the arithmetic-logic units, even if they are ones only capable of single-precision arithmetic, would not be idle. This can be minimized by making the ALU reconfigurable, so that much of the same logic is used whether one long or two short operands are being worked on, but some circuitry (such as that for normalization) will unavoidably be idle in the case of fewer operands.

In addition to supercomputers with vector capabilities, later vector floating-point units were available as third-party add-on systems for both mainframes and minicomputers from companies such as Floating Point Systems.

More recently, IBM's Enterprise Systems Architecture/390 provided for an optional external vector arithmetic unit the internal width of which would vary from one model to another, but which shared a common interface from the programmer's perspective. This vector facility connected directly to the CPU, and, at least in the case of the version used with ES/9000, operated on data in cache. It had internal vector registers, so it provided a basic architecture similar to that of vector supercomputers like the Cray-1 and its successors. Before that, add-on vector units offered for the System/360 and the original System/370, such as the IBM 2938 Array Processor and the IBM 3838 Array Processor, had their own internal memories, to which data would be transferred before the start of a computation involving vectors.

In the context of the specific architecture being presented here on these pages, it may be noted that the vector mode and the short page mode of operation provide memory-to-memory vector operations only, and the modes which provide access to the internal long vector registers also include memory-to-memory vector operations. Circuitry to detect the presence of an appropriate external coprocessor, and to permit the delegation of these operations to that coprocessor, can be added to the design without in any way altering the instructions themselves. However, if an external vector coprocessor were present, it would be useful to have available vector instructions which had a length field which was 16 or even 32 bits in length, instead of only six bits in length, so that a single instruction could cause a long vector to be processed concurrently with normal machine operations. Of course, the external vector coprocessing units would be pipelined, as this is an inexpensive method of permitting them to perform a larger number of operations in a given time.

Since the external vector coprocessing system would be connected directly to main memory, it always treats memory as divided into 32-bit words of the normal size, ignoring any use of 36-bit, 24-bit, 40-bit or 60-bit words by the central processor, and it also does not handle floating-point numbers of the Medium type, but instead provides support for floating-point numbers of the Small type, just as is done by the Short Vector arithmetic unit.

The following illustration depicts the architecture for external vector coprocessing that is envisaged as being associated with the processor architecture described here, as will be discussed below:

Because the external vector coprocessing system could consist of a multiplicity of functional units, such as, for example, a set of sixteen identical chips, and since it works with operands in memory rather than the cache, it would have a wide path to memory. This suggests the possibility of a design in which the external vector coprocessing chips also perform a memory interface function; thus, while the main microprocessor could have, as given as an example, a 256 bit wide data bus, each of sixteen vector coprocessing units could have a data bus to memory of similar width, as well as being connected to the processor data bus, which could then operate at a higher speed than that of the memory.

This also indicates that the individual chips in that system, having a fixed-width path to memory, would likely work on a principle similar to that of the ALU used for the short vector instructions.

Also, an elaborate bus structure was shown on this page which allows vector operands beginning at any arbitrary position to be brought into alignment in two steps, for the case of an elaborate implementation of the architecture having a set of 64 ALUs within the chip itself; for the external system, such a capability is unlikely. Instead, if the external system consists of a row of chips, each one coupled to a bank of memory, each chip would likely be connected only to its two immediately adjacent neighbors. This, of course, would mean that vectors not fully aligned, in terms of the width of the entire assembly of external coprocessor chips, would involve a significant performance penalty, although even that penalty would not outweigh the benefit of avoiding the limited-width data path to the CPU itself.

Note also that the bus between the RAM and each VPU would be able to handle data more quickly than the access time of the RAM, and, thus, the RAM would still be interleaved in this configuration; that would combine with the wider path to memory provided by the multiple vector processor units to further improve memory bandwidth. Essentially, the memory is interleaved because it is slower than the speed of a conventional data bus, and with the vector processing units, multiple conventional data buses are present, each one connecting one of those units to a slice of memory, and a high-speed data bus is provided to the main CPU.

Modern memory modules, used in today's personal computers, include on-chip interleaving of memory banks for higher data transfer speeds. Two-way interleaving appeared first, and more recently, four-way interleaving has made an appearance. Even before this happened, of course, high-performance computers could implement interleaving with external circuitry and by requiring that memory modules be installed in matched sets.

Let us suppose that, in a high-performance implementation, interleaving was taken to what may seem to be an extreme; let the memory be 16-way interleaved, and, in addition, eight of these combination vector coprocessor and memory management chips are used, for a bus to main memory that is 2,048 bits wide. This would mean that the speed at which data is fed to the CPU chip, through its 256-bit external data bus, would be 128 times the basic speed at which data is available from the memory cells used in the main memory.

Using the high-performance chip architecture given here as an example implementation, it takes sixteen fetches of 256 bits of data to fill a single cache line, each cache line being 4,096 bits wide, in order to provide 64 vector arithmetic-logic units each with 64 bits of data in parallel. Thus, even with all this effort being taken to provide data to the processor at a high speed, external data streamed to the chip at the maximum possible rate would only match the rate at which data could be fetched internally from the cache if the memory cells of the cache were but eight times faster than the memory cells used in main memory.

Due to latencies and other issues, the cache would still justify its existence if the disparity in speeds were only eightfold, but it is likely that the disparity will be larger than that.
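The figures of 128 and eight in the preceding paragraphs follow from simple arithmetic, which the short Python sketch below reproduces. The constant names are illustrative only; the values are those quoted in the text, with each coprocessor chip assumed to have a 256-bit path to memory (2,048 bits divided among eight chips).

```python
# Values quoted in the text; the names themselves are only illustrative.
INTERLEAVE_WAYS = 16      # 16-way interleaved main memory
COPROCESSOR_CHIPS = 8     # combined vector coprocessor / memory interface chips
CHIP_PATH_BITS = 256      # assumed per-chip path to memory (2,048 / 8)
CPU_BUS_BITS = 256        # external data bus of the CPU chip
VECTOR_ALUS = 64          # vector ALUs fed by one cache line
ALU_OPERAND_BITS = 64     # bits delivered to each ALU

memory_bus_bits = COPROCESSOR_CHIPS * CHIP_PATH_BITS
cache_line_bits = VECTOR_ALUS * ALU_OPERAND_BITS

# Speed of the CPU bus relative to the access rate of a single memory cell.
cpu_bus_speedup = INTERLEAVE_WAYS * memory_bus_bits // CPU_BUS_BITS

# Transfers over the CPU bus needed to fill one cache line.
fetches_per_line = cache_line_bits // CPU_BUS_BITS

# Cache-to-memory cell speed ratio at which streaming just keeps pace.
break_even_ratio = cpu_bus_speedup // fetches_per_line

print(memory_bus_bits, cpu_bus_speedup, fetches_per_line, break_even_ratio)
```

Changing the interleave factor or the number of chips in this sketch shows how quickly the break-even ratio moves, which is why the cache remains worthwhile even in this aggressive configuration.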


Opcode space common to all the 16-bit alignment modes, beyond that used for the mode-independent instructions, is available; some of it was used for block transfer instructions for the full cache modes. Instructions to allow direct control of the external vector coprocessor are also defined in this space. Therefore, in vector register mode, and symmetric vector register mode, the external vector coprocessor instructions would take this form:

in short page mode, register short page mode, stack mode, register stack mode, short page extended operate mode, register short page extended operate mode, short page short shift mode, and register short page short shift mode, the external vector coprocessor instructions take the form:

and in nearly all the other modes, the external vector coprocessor instructions look like this:

with the sole exception of advanced compound mode, in which the instructions look the same except for the addition of the prefix halfword used for the operate instructions:

The length field must be a multiple of 32 bytes, which means its last five bits must be zero, as it is in units of bytes, and all operands must be aligned on a 32 byte, or 256 bit, boundary.
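As a minimal sketch of the rule just stated, the following Python helpers check a length and an operand address the way an assembler or simulator might. The function names are hypothetical, and the 32-bit size of the length field is taken from the later discussion of registers rather than from the missing format diagrams.

```python
ALIGNMENT_BYTES = 32          # 256 bits, as stated in the text
LENGTH_FIELD_BITS = 32        # assumed size of the length field


def is_valid_length(length_bytes: int) -> bool:
    """True if the length fits the field and its last five bits are zero."""
    in_range = 0 <= length_bytes < (1 << LENGTH_FIELD_BITS)
    return in_range and (length_bytes & (ALIGNMENT_BYTES - 1)) == 0


def is_aligned(address: int) -> bool:
    """True if the operand address lies on a 32-byte (256-bit) boundary."""
    return address % ALIGNMENT_BYTES == 0


def check_operands(length_bytes: int, *addresses: int) -> None:
    """Raise ValueError for a length or address that breaks the rule."""
    if not is_valid_length(length_bytes):
        raise ValueError(f"length {length_bytes} is not a multiple of 32 bytes")
    for address in addresses:
        if not is_aligned(address):
            raise ValueError(f"address {address:#x} is not 32-byte aligned")
```

For instance, `check_operands(4096, 0x1000, 0x2000)` passes silently, while a length of 40 bytes or an address of `0x1010` raises an error.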

The available opcodes are:

176004 XVAB      External Vector Add Byte
176005 XVSB      External Vector Subtract Byte

176013 XVMVSM    External Vector Move Small

176016 XVMINB    External Vector Minimum Byte
176017 XVMAXB    External Vector Maximum Byte

176024 XVAH      External Vector Add Halfword
176025 XVSH      External Vector Subtract Halfword
176026 XVMH      External Vector Multiply Halfword
176027 XVDH      External Vector Divide Halfword

176033 XVMVF     External Vector Move Floating

176036 XVMINH    External Vector Minimum Halfword
176037 XVMAXH    External Vector Maximum Halfword

176044 XVA       External Vector Add
176045 XVS       External Vector Subtract
176046 XVM       External Vector Multiply
176047 XVD       External Vector Divide

176052 XVMV      External Vector Move
176053 XVMVD     External Vector Move Double
176054 XVN       External Vector AND
176055 XVO       External Vector OR
176056 XVMIN     External Vector Minimum
176057 XVMAX     External Vector Maximum

176064 XVAL      External Vector Add Long
176065 XVSL      External Vector Subtract Long
176066 XVML      External Vector Multiply Long
176067 XVDL      External Vector Divide Long

176073 XVMVQ     External Vector Move Quad
176074 XVSW      External Vector Swap
176075 XVX       External Vector XOR
176076 XVMINL    External Vector Minimum Long
176077 XVMAXL    External Vector Maximum Long

177004 XVASM     External Vector Add Small
177005 XVSSM     External Vector Subtract Small
177006 XVMSM     External Vector Multiply Small
177007 XVDSM     External Vector Divide Small

177012 XVMINSM   External Vector Minimum Small
177013 XVMAXSM   External Vector Maximum Small

177024 XVAF      External Vector Add Floating
177025 XVSF      External Vector Subtract Floating
177026 XVMF      External Vector Multiply Floating
177027 XVDF      External Vector Divide Floating

177032 XVMINF    External Vector Minimum Floating
177033 XVMAXF    External Vector Maximum Floating
177034 XVAU      External Vector Add Unnormalized
177035 XVSU      External Vector Subtract Unnormalized
177036 XVMU      External Vector Multiply Unnormalized
177037 XVDU      External Vector Divide Unnormalized

177044 XVAD      External Vector Add Double
177045 XVSD      External Vector Subtract Double
177046 XVMD      External Vector Multiply Double
177047 XVDD      External Vector Divide Double

177052 XVMIND    External Vector Minimum Double
177053 XVMAXD    External Vector Maximum Double
177054 XVAUD     External Vector Add Unnormalized Double
177055 XVSUD     External Vector Subtract Unnormalized Double
177056 XVMUD     External Vector Multiply Unnormalized Double
177057 XVDUD     External Vector Divide Unnormalized Double

177064 XVAQ      External Vector Add Quad
177065 XVSQ      External Vector Subtract Quad
177066 XVMQ      External Vector Multiply Quad
177067 XVDQ      External Vector Divide Quad

177072 XVMINQ    External Vector Minimum Quad
177073 XVMAXQ    External Vector Maximum Quad
177074 XVAUQ     External Vector Add Unnormalized Quad
177075 XVSUQ     External Vector Subtract Unnormalized Quad
177076 XVMUQ     External Vector Multiply Unnormalized Quad
177077 XVDUQ     External Vector Divide Unnormalized Quad

In these instructions, the C bit indicates that the source operand is a single variable instead of a vector. The R bit indicates that the destination is to be subtracted from the source, or that the source is to be divided by the destination, with the result going in the destination in either case. For constant operations, the mnemonics in these cases end in C (Constant) or RC (Reversed Constant) respectively.
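The effect of the C and R bits can be modelled in a few lines of Python. This is only a behavioural sketch: the function name and its arguments are hypothetical, vectors are ordinary lists, and the unreversed case is assumed to follow the usual two-address convention in which the destination is the left-hand operand.

```python
from operator import sub, truediv


def two_address(op, dest, src, c_bit=False, r_bit=False):
    """Model a two-address external vector operation on Python lists.

    op     -- a binary function such as operator.sub or operator.truediv
    c_bit  -- src is a single value applied to every element of dest
    r_bit  -- reverse the operand order (source op destination)
    The result replaces the destination, as in the text.
    """
    src_elements = [src] * len(dest) if c_bit else list(src)
    if len(src_elements) != len(dest):
        raise ValueError("source and destination lengths differ")
    if r_bit:
        return [op(s, d) for d, s in zip(dest, src_elements)]
    return [op(d, s) for d, s in zip(dest, src_elements)]


# Plain subtract: destination minus source.
plain = two_address(sub, [10, 20, 30], [1, 2, 3])
# Reversed constant divide: the constant 8.0 divided by each element.
reversed_constant = two_address(truediv, [2.0, 4.0], 8.0, c_bit=True, r_bit=True)
```

The reversed constant form is what makes an operation such as "one minus every element" or "a constant divided by every element" possible in a single instruction.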

The MIN instruction returns the lesser of its two arguments, the MAX instruction returns the greater of its two arguments; just like the ZIN single-operand instruction, to be described below, this helps in performing more sophisticated operations on external vectors without the need for conditional branches on individual elements. This is a technique used in graphics processors; in the main CPU, this problem is dealt with in a different fashion, using mask bits and multi-way vector instructions.
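To illustrate how MIN, MAX and ZIN stand in for per-element branches, here is a small Python model. The helper names are hypothetical, and each function is meant to represent one whole-vector instruction; the conditional inside `xv_zin` only models what the hardware would do without branching.

```python
def xv_min(dest, src):
    """Element-wise lesser of two vectors (models a MIN instruction)."""
    return [min(d, s) for d, s in zip(dest, src)]


def xv_max(dest, src):
    """Element-wise greater of two vectors (models a MAX instruction)."""
    return [max(d, s) for d, s in zip(dest, src)]


def xv_zin(vector):
    """Replace negative elements by zero (models a ZIN instruction)."""
    return [0 if x < 0 else x for x in vector]


def clamp(vector, low, high):
    """Clamp every element to [low, high] using one MAX and one MIN."""
    n = len(vector)
    return xv_min(xv_max(vector, [low] * n), [high] * n)


clamped = clamp([-5, 3, 12], 0, 10)
positive_part = xv_zin([-2, 0, 7])
```

A clamp that would otherwise need two comparisons and branches per element becomes two vector instructions, whatever the length of the vector.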

Three-address instructions, in which either the source operand or the operand operand may be a constant, but which do not need, and do not have, the option of a reversed direction, are also present.

Note that unnormalized floating-point operations are included; this is important so that it is possible to keep track of significance in what may be an enormous computation.

The single-operand instruction format shows a seven-bit field as available for the opcode. This opcode field is similar to a concatenation of the two-bit type field and the five-bit opcode field of a normal floating-point single-operand instruction. However, the values of the five-bit opcode field are modified so that they do not conflict with certain opcodes, among those used for two-address and three-address instructions, that are reserved for special purposes. Those opcodes must also be reserved here, since whether an instruction is a single-address, two-address, or three-address instruction is not indicated until after the first halfword of the instruction.

The opcodes for external vector single-operand instructions are:

176001 XVSINSM  176041 XVSIN    177001 XVSIND   177041 XVSINQ
176002 XVCOSSM  176042 XVCOS    177002 XVCOSD   177042 XVCOSQ
176003 XVTANSM  176043 XVTAN    177003 XVTAND   177043 XVTANQ
176004 XVRSQSM  176044 XVRSQ    177004 XVRSQD   177044 XVRSQQ
176005 XVASNSM  176045 XVASN    177005 XVASND   177045 XVASNQ
176006 XVACSSM  176046 XVACS    177006 XVACSD   177046 XVACSQ
176007 XVATNSM  176047 XVATN    177007 XVATND   177047 XVATNQ

176011 XVSINHSM 176051 XVSINH   177011 XVSINHD  177051 XVSINHQ
176012 XVCOSHSM 176052 XVCOSH   177012 XVCOSHD  177052 XVCOSHQ
176013 XVTANHSM 176053 XVTANH   177013 XVTANHD  177053 XVTANHQ
176014 XVRECSM  176054 XVREC    177014 XVRECD   177054 XVRECQ
176015 XVASNHSM 176055 XVASNH   177015 XVASNHD  177055 XVASNHQ
176016 XVACSHSM 176056 XVACSH   177016 XVACSHD  177056 XVACSHQ
176017 XVATNHSM 176057 XVATNH   177017 XVATNHD  177057 XVATNHQ

                                177021 XVCLR    177061 XVINV
176022 XVZINB   176062 XVZINH   177022 XVZIN    177062 XVZINL
176023 XVABSB   176063 XVABSH   177023 XVABS    177063 XVABSL
176024 XVSQRSM  176064 XVSQR    177024 XVSQRD   177064 XVSQRQ
176025 XVQBRSM  176065 XVQBR    177025 XVQBRD   177065 XVQBRQ
176026 XVLOGSM  176066 XVLOG    177026 XVLOGD   177066 XVLOGQ
176027 XVEXPSM  176067 XVEXP    177027 XVEXPD   177067 XVEXPQ

176031 XVZINSM  176071 XVZINF   177031 XVZIND   177071 XVZINQ
176032 XVSGNB   176072 XVSGNH   177032 XVSGN    177072 XVSGNL
176033 XVNEGB   176073 XVNEGH   177033 XVNEG    177073 XVNEGL
176034 XVCLRSM  176074 XVCLRF   177034 XVCLRD   177074 XVCLRQ
176035 XVABSSM  176075 XVABSF   177035 XVABSD   177075 XVABSQ
176036 XVSGNSM  176076 XVSGNF   177036 XVSGND   177076 XVSGNQ
176037 XVNEGSM  176077 XVNEGF   177037 XVNEGD   177077 XVNEGQ

The additional opcodes that may be unfamiliar here are RSQ, which calculates the reciprocal of the square root, ZIN, which replaces its argument by zero if it is negative, and REC, which calculates the reciprocal. Note also the presence of a few fixed-point single-operand instructions, fixed-point forms of CLR, ABS, SGN, NEG, and ZIN, as well as INV, which performs a one's complement. Also note that XVINV is distinguished from XVCLR by type bits rather than opcode bits (as was also done for the AND and Swap instructions, and the OR and XOR instructions, to make room for the floating-point move instructions), so that opcode space is available for the fixed-point XVZIN instructions.
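The pattern in the table of single-operand opcodes can be expressed as a small decoder. The bit positions used below are inferred purely from the octal values listed in the table (for example 176001, 176041, 177001 and 177041 for the four forms of SIN); the format diagram is not reproduced here, so this layout is an assumption, and the function name is hypothetical.

```python
# Type names for the floating-point rows; in the fixed-point rows the same
# two bits select Byte, Halfword, (word) and Long instead.
TYPE_NAMES = ("Small", "Floating", "Double", "Quad")


def decode_single_operand(opcode: int):
    """Split a listed opcode into (type name, five-bit operation code).

    Assumed layout, inferred from the table: the top six bits of the
    halfword are all ones, bit 9 and bit 5 together form the two-bit type,
    bits 6 to 8 are zero, and the low five bits give the operation.
    """
    if not 0 <= opcode < (1 << 16):
        raise ValueError("opcode must fit in a 16-bit halfword")
    if (opcode >> 10) != 0o77 or (opcode >> 6) & 0o7:
        raise ValueError("not in the external vector opcode space")
    type_index = (((opcode >> 9) & 1) << 1) | ((opcode >> 5) & 1)
    operation = opcode & 0o37
    return TYPE_NAMES[type_index], operation


sin_small = decode_single_operand(0o176001)
sin_quad = decode_single_operand(0o177041)
```

Under this reading, all four SIN opcodes decode to operation 1, and only the two type bits differ between them.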

Multiple-Component Instructions

The diagram below:

illustrates the instruction formats associated with the external vector opcodes which end in three zero bits, which were unused up to this point. As with the other instructions, in the case of advanced compound mode, the 123400 prefix halfword is added. These opcodes are followed by the additional instruction halfwords as shown in the diagrams above, and then the length field (omitted within a stretch, as described below) and the other addressing fields of normal two-address and three-address external vector memory-reference instructions which vary depending on the mode of operation in effect.

Because the positions of later halfwords in the instruction are changed, the first bit of the halfword immediately following the opcode also indicates whether the instruction is a two-address instruction (if 0) or a three-address instruction (if 1). The bits in the later halfwords which indicate whether an instruction is a two-address or a three-address instruction should still be set to their appropriate values, as shown above for the different instruction modes.

These multiple-component instructions treat vectors as composed of repeated groups of two (as illustrated in the first two formats in the diagram above) or four (in the case of the last two formats in the diagram above) numbers.

In a two-address instruction, the values for op1, op2, op3 and op4 have the following meanings:

0010 MOV

0100 ADD
0101 SUB
0110 MUL
0111 DIV

1011 XOR
1100 AND  ADDU
1101 OR   SUBU
1110      MULU
1111      DIVU

and in a three-address instruction, the values for op1, op2, op3 and op4 have the following meanings:

000 ADD
001 SUB
010 MUL
011 DIV
100 AND  ADDU
101 OR   SUBU
110      MULU
111 XOR  DIVU

the ones in the second column replacing the logical operations for floating-point numbers with unnormalized floating-point arithmetic.

In a two-component instruction, operations are performed on pairs of numbers of the type indicated in the first halfword of the instruction. For each component in the result, an operation is indicated, and a source component is indicated, and, if it is a three-address instruction, an operand component is also indicated. All possible destination components are used; the destination component of the operation indicated by op1 is the first of the two destination components, and the destination component of the operation indicated by op2 is the second of two destination components.

In a three-address instruction, there is also a two-bit op field; this indicates how the result of the operation performed on the source component and the operand component is applied to the destination component. The possible values for this field are:

00 replace
01 zero and subtract
10 add
11 subtract

Thus, it is possible to divide the contents of the operand field by the contents of the source field, and then subtract the result from the contents of the destination field, with the result being placed in the destination field.

With two two-component instructions, it is possible to multiply two vectors of complex numbers.
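The complex multiplication mentioned above can be worked through with a small Python model of a three-address two-component instruction. Everything here is a sketch under stated assumptions: the function and tuple layout are hypothetical, complex numbers are stored as (real, imaginary) pairs, "zero and subtract" is read as storing the negated result, and the two-bit op field is modelled as selectable separately for each destination component, which the text does not state explicitly.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


def apply_mode(mode, old, value):
    """Combine a result with the destination per the two-bit op field."""
    if mode == "replace":
        return value
    if mode == "zero and subtract":
        return -value          # assumed reading: 0 - value
    if mode == "add":
        return old + value
    if mode == "subtract":
        return old - value
    raise ValueError(f"unknown mode {mode!r}")


def two_component(dest, src, opnd, components):
    """Apply one modelled instruction to vectors of two-number groups.

    components holds one (op, source index, operand index, mode) tuple
    for each of the two destination components.
    """
    for base in range(0, len(dest), 2):
        s, o = src[base:base + 2], opnd[base:base + 2]
        for i, (op, si, oi, mode) in enumerate(components):
            value = OPS[op](s[si], o[oi])
            dest[base + i] = apply_mode(mode, dest[base + i], value)
    return dest


x = [1, 2, 3, 4]      # (1 + 2i), (3 + 4i)
y = [5, 6, 7, 8]      # (5 + 6i), (7 + 8i)
product = [0, 0, 0, 0]
# First instruction: real = a*c, imaginary = a*d.
two_component(product, x, y, [("MUL", 0, 0, "replace"), ("MUL", 0, 1, "replace")])
# Second instruction: real -= b*d, imaginary += b*c.
two_component(product, x, y, [("MUL", 1, 1, "subtract"), ("MUL", 1, 0, "add")])
```

After the two calls, `product` holds the pairs (-7, 16) and (-11, 52), which are the products (1 + 2i)(5 + 6i) and (3 + 4i)(7 + 8i).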

In a four-component instruction, we again have a series of four instructions which select a source component, and, in a three-address instruction, an operand component, within the corresponding elements of the source vector and the operand vector of the instruction.

This type of instruction is very similar to, but is a superset of, the type of operation commonly found in pixel (or fragment) and vertex shaders in graphics chips. However, tasks such as the rasterization of polygons still require either a conventional software program, or the use of a conventional special-purpose graphics chip.

Also note that if the external vector coprocessors have a 256-bit path to memory, four-component operations on the Quad floating-point type are not possible unless pairs of external vector coprocessors co-operate in performing them. Narrower paths to main memory could impose more severe restrictions.

In many cases, an alternative to using multiple-component instructions would be to use multiple vectors, each vector being of one component. This requires more instructions, but that is not a real cost if the length of a vector handled by an instruction is bounded with sufficient severity that it is shorter than the actual length of the arrays on which operations are being performed. With a 32-bit length for vectors, this is not the case for the external vector coprocessor. The use of long vector instructions with stride is also possible for conversion between the two possible memory organizations.
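The two memory organizations, and the strided conversion between them, can be pictured with ordinary Python lists. In this sketch the slice step plays the role of the stride of a long vector instruction; the function names are hypothetical.

```python
def split_components(interleaved, components):
    """Turn repeated groups of numbers into one vector per component."""
    if len(interleaved) % components:
        raise ValueError("length is not a whole number of groups")
    return [interleaved[i::components] for i in range(components)]


def merge_components(planes):
    """Interleave equal-length single-component vectors back into groups."""
    count = len(planes)
    merged = [None] * (count * len(planes[0]))
    for i, plane in enumerate(planes):
        merged[i::count] = plane   # strided store of one component
    return merged


groups = [1, 2, 3, 4, 5, 6]            # three groups of two components
planes = split_components(groups, 2)   # one vector per component
restored = merge_components(planes)
```

With the data split this way, each component can be processed by ordinary single-component vector instructions, at the cost of the extra passes needed for the conversion.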

Using Registers in the External Vector Coprocessor

Because the operands of these instructions have a length indicated by a thirty-two bit length field, it is not practical for the external coprocessor units to possess registers of sufficient size to store a complete operand. However, the ability to use registers, so that not every step in a calculation requires a memory access, is very important. This is handled by providing a special interpretation to the following opcodes for the two-address form of an external vector operation:

176002 XVLB   External Vector Load Byte
176003 XVSTB  External Vector Store Byte

176022 XVLH   External Vector Load Halfword
176023 XVSTH  External Vector Store Halfword

176042 XVL    External Vector Load
176043 XVST   External Vector Store

176062 XVLL   External Vector Load Long
176063 XVSTL  External Vector Store Long

177002 XVLSM  External Vector Load Small
177003 XVSTSM External Vector Store Small

177022 XVLF   External Vector Load Floating
177023 XVSTF  External Vector Store Floating

177042 XVLD   External Vector Load Double
177043 XVSTD  External Vector Store Double

177062 XVLQ   External Vector Load Quad
177063 XVSTQ  External Vector Store Quad

These are memory-to-register instructions. The dX field instead serves as a dR field, and the halfword containing the corresponding address (or, in the case of the short page modes, the indirect bit, the dB field, and the address) is omitted from the instruction, because the destination is a register, and is not in memory; the halfword containing the source address remains present in the instruction. The store instruction also omits the length specification.

A load instruction begins a stretch of code using registers, and a store instruction ends a stretch of code using registers.

The stretch should be treated as if it were a single instruction; no attempt should be made to branch into it, or out of it. For one thing, an attempt to branch into a stretch from instructions not part of a stretch would result in the instructions within it not being interpreted correctly, because the format of instructions is being changed by the omission of the length field. A stretch should be thought of as being similar to a series of instructions within an FLL (Fixed-Length Loop) instruction, since the series is sent once to the external vector coprocessors, but is performed repeatedly by them, as many times as the width of their path to memory lies within the length specification of the initial load instruction.

The length specification in the load instruction applies to all the instructions in the stretch. A stretch consists exclusively of memory-reference instructions and single-operand instructions, both fixed and floating. The multiple-component instructions are allowed within a stretch. All these instructions will be treated as external vector instructions; register references will be to the internal registers of the external vector coprocessor.

These registers will be as wide as the bus connecting each external vector coprocessor to memory; the instructions making up the stretch will be repeated, therefore, with a suitable displacement each time, until the entire length of the vector is processed. Thus, none of the vectors referenced in memory may overlap, or the results will be unpredictable.
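The way a stretch is repeated over a long vector can be sketched in Python as follows. This is a deliberately simplified model: it uses a single register, measures the length and register width in elements rather than bytes, and names vectors in a dictionary instead of using addresses; the function and its parameters are hypothetical.

```python
from operator import add, mul


def run_stretch(memory, load_from, length, body, store_to, width=8):
    """Model a stretch: load, a series of operations, then store.

    memory -- dict mapping vector names to lists of elements
    body   -- list of (operation, vector name) pairs applied to the register
    width  -- register width in elements (assumed; set by the memory path)
    The whole stretch is repeated once per register-sized slice.
    """
    if length % width:
        raise ValueError("length must be a multiple of the register width")
    for offset in range(0, length, width):
        window = slice(offset, offset + width)
        register = list(memory[load_from][window])     # load opens the stretch
        for op, name in body:
            chunk = memory[name][window]
            register = [op(r, m) for r, m in zip(register, chunk)]
        memory[store_to][window] = register            # store closes the stretch


memory = {
    "a": list(range(16)),
    "b": [2] * 16,
    "c": [1] * 16,
    "out": [0] * 16,
}
# out = a * b + c, computed with only one load and one store per slice.
run_stretch(memory, "a", 16, [(mul, "b"), (add, "c")], "out")
```

Because each slice is loaded, processed and stored before the next one begins, a destination that overlapped one of the sources at a different offset would see partly updated data, which is why overlapping vectors give unpredictable results.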

Also, within a stretch, the opcode 176021 will be used as an instruction prefix, to indicate that the source operand of an instruction is a scalar, to be used as a constant operand acting on every element of a vector.

In addressing modes that provide three-address memory-to-memory scalar operations, they may be used; otherwise, a three-address external vector instruction with the length field omitted may be present within a stretch as well. Register operands are specified by using a zero base register value in the vector register and symmetric vector register modes, as well as in the conventional modes; in the short page modes, the three bits following the index register field, normally zero, are to contain the bits 111 to indicate a register operand.

Since a load instruction begins a stretch, and a store instruction ends a stretch, transfers within the stretch between registers and memory are handled by the following instructions:

176012 XVMVB     External Vector Move Byte
176032 XVMVH     External Vector Move Halfword
176052 XVMV      External Vector Move
176072 XVMVL     External Vector Move Long

176013 XVMVSM    External Vector Move Small
176033 XVMVF     External Vector Move Floating
176053 XVMVD     External Vector Move Double
176073 XVMVQ     External Vector Move Quad

Note that, due to a lack of available opcodes, the move instructions for the floating-point types are grouped with fixed-point opcodes.

Only a limited subset of the available floating-point formats supported by the architecture would be available by means of vector coprocessor operations. It is not yet defined how the format to use would be specified, if any choice at all is made available. Presumably, a limited-width field in the Program Status Block would contain the external vector coprocessor floating-point format in current use, which would be signalled to the external vector coprocessor each time an operation is requested.

It is envisaged that in any case, the Standard floating-point format would be supported, since the IEEE 754 floating-point format has become the standard floating-point format supported by virtually all microprocessors. As a second choice, the Compatible or Modified formats, for compatibility with the IBM System/360 series of computers and their successors, suggest themselves. But another very important alternative format would be the Common floating-point format. This is the floating-point format used when it is desired to increase the total floating-point performance of the central processing unit to its maximum value by modifying Simple Floating-Point operation, which is carried out by the fixed-point arithmetic units, by decreasing the length of the exponent field, so that compatibility with the floating-point arithmetic units is achieved. Allowing the external vector coprocessing units to handle this format as well permits all available floating-point processing capacity to use the same floating-point format to maximize floating-point throughput.

Since the external vector coprocessor only works with memory in its native width, the only variation of the Common floating-point format that would be used would be the one where an excess-128 binary exponent occupied the last eight bits of a 32, 64, or 128 bit long floating-point number with a sign-magnitude mantissa. Note also that the integer arithmetic units of the central processing unit do not provide the guard, round, and sticky bits, or their equivalents, that both the central processing unit floating-point units and the external vector coprocessor provide, and, thus, there is a penalty in accuracy for maximizing floating-point performance in this manner.

