The idea of replacing an essentially passive array of memory elements by one that is peppered with processors for every so much memory is one that has been proposed many times, but one that has seldom been implemented. It has been considered visionary rather than practical.
On the other hand, banks of arithmetic-logic units set up to perform calculations in parallel have been implemented many times, and have been in use for some time.
A limited amount of vector arithmetic capability had been present in some computers using vacuum tubes (the AN/FSQ-7) or transistors (the TX-2 and the AN/FSQ-32), of a type resembling the MMX feature available with Intel microprocessors or the short vector instructions on the architecture described here.
Many of the early machines which provided long vector capabilities did so through a special rapid pipelining mode rather than by having multiple arithmetic-logic units, one for each element of a vector or at least each element of a sizable chunk of a vector.
It was still obvious from the start that a vector unit should ideally be connected to memory along a data path having the same width in bits as the combined width of the arithmetic units involved.
In order that the position on this wide data bus of each arithmetic unit be fixed, it would appear preferable to organize an external vector unit of this type along the lines of that performing the short vector instructions of the main processor architecture we have been examining, rather than along the lines of that performing the long vector instructions. However, historically, external vector units have tended to be organized with a fixed number of arithmetic-logic units, so that when doing double-precision arithmetic, half the arithmetic-logic units, even if they are ones only capable of single-precision arithmetic, would not be idle. This can be minimized by making the ALU reconfigurable, so that much of the same logic is used whether one long or two short operands are being worked on, but some circuitry (such as that for normalization) will unavoidably be idle in the case of fewer operands.
In addition to supercomputers with vector capabilities, later vector floating-point units were available as third-party add-on systems for both mainframes and minicomputers from companies such as Floating Point Systems.
More recently, IBM's Enterprise Systems Architecture/390 provided for an optional external vector arithmetic unit the internal width of which would vary from one model to another, but which shared a common interface from the programmer's perspective. This vector facility connected directly to the CPU, and, at least in the case of the version used with ES/9000, operated on data in cache. It had internal vector registers, so it provided a basic architecture similar to that of vector supercomputers like the Cray-1 and its successors. Before that, add-on vector units offered for the System/360 and the original System/370, such as the IBM 2938 Array Processor and the IBM 3838 Array Processor, had their own internal memories, to which data would be transferred before the start of a computation involving vectors.
In the context of the specific architecture being presented here on these pages, it may be noted that the vector mode and the short page mode of operation provide memory-to-memory vector operations only, and the modes which provide access to the internal long vector registers also include memory-to-memory vector operations. Circuitry to detect the presence of an appropriate external coprocessor, and to permit the delegation of these operations to that coprocessor, can be added to the design without in any way altering the instructions themselves. However, if an external vector coprocessor were present, it would be useful to have available vector instructions which had a length field which was 16 or even 32 bits in length, instead of only six bits in length, so that a single instruction could cause a long vector to be processed concurrently with normal machine operations. Of course, the external vector coprocessing units would be pipelined, as this is an inexpensive method of permitting them to perform a larger number of operations in a given time.
Since the external vector coprocessing system would be connected directly to main memory, it always treats memory as divided into 32-bit words of the normal size, ignoring any use of 36-bit, 24-bit, 40-bit or 60-bit words by the central processor, and it also does not handle floating-point numbers of the Medium type, but instead provides support for floating-point numbers of the Small type, just as is done by the Short Vector arithmetic unit.
The following illustration depicts the architecture for external vector coprocessing that is envisaged as being associated with the processor architecture described here, as will be discussed below:

Because the external vector coprocessing system could consist of a multiplicity of functional units, such as a set of sixteen identical chips, and since it works with operands in memory rather than the cache, it would have a wide path to memory. This suggests the possibility of a design in which the external vector coprocessing chips also perform a memory interface function; thus, while the main microprocessor could have, for example, a 256 bit wide data bus, each of sixteen vector coprocessing units could have a data bus to memory of similar width, as well as being connected to the processor data bus, which could then operate at a higher speed than that of the memory.
This also indicates that the individual chips in that system, having a fixed-width path to memory, would likely work on a principle similar to that of the ALU used for the short vector instructions.
Also, an elaborate bus structure was shown on this page which allows vector operands beginning at any arbitrary position to be brought into alignment in two steps, for the case of an elaborate implementation of the architecture having a set of 64 ALUs within the chip itself; for the external system, such a capability is unlikely. Instead, if the external system consists of a row of chips, each one coupled to a bank of memory, each chip would likely be connected only to its two immediately adjacent neighbors. This, of course, would mean that vectors not fully aligned, in terms of the width of the entire assembly of external coprocessor chips, would incur a significant performance penalty, although even that penalty would not outweigh the benefit of avoiding the limited-width data path to the CPU itself.
Note also that the bus between the RAM and each VPU would be able to handle data more quickly than the access time of the RAM, and, thus, the RAM would still be interleaved in this configuration; that would combine with the wider path to memory provided by the multiple vector processor units to further improve memory bandwidth. Essentially, the memory is interleaved because it is slower than the speed of a conventional data bus, and with the vector processing units, multiple conventional data buses are present, each one connecting one of those units to a slice of memory, and a high-speed data bus is provided to the main CPU.
Modern memory modules, used in today's personal computers, include on-chip interleaving of memory banks for higher data transfer speeds. Two-way interleaving appeared first, and more recently, four-way interleaving has made an appearance. Even before this happened, of course, high-performance computers could implement interleaving with external circuitry and by requiring that memory modules be installed in matched sets.
Let us suppose that, in a high-performance implementation, interleaving was taken to what may seem to be an extreme; let the memory be 16-way interleaved, and, in addition, eight of these combination vector coprocessor and memory management chips are used, for a bus to main memory that is 2,048 bits wide. This would mean that the speed at which data is fed to the CPU chip, through its 256-bit external data bus, would be 128 times the basic speed at which data is available from the memory cells used in the main memory.
Using the high-performance chip architecture given here as an example implementation, it takes sixteen fetches of 256 bits of data to fill a single cache line, each cache line being 4,096 bits wide, in order to provide 64 vector arithmetic-logic units each with 64 bits of data in parallel. Thus, even with all this effort being taken to provide data to the processor at a high speed, external data streamed to the chip at the maximum possible rate would only match the rate at which data could be fetched internally from the cache if the memory cells of the cache were but eight times faster than the memory cells used in main memory.
Due to latencies and other issues, the cache would still justify its existence if the disparity in speeds were only eightfold, but it is likely that the disparity will be larger than that.
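The arithmetic behind the figures of 128 and eight given above can be checked with a short calculation. The following is only a sketch; the variable names are illustrative, and the numbers are simply those of the example configuration described in the text.

```python
# Figures taken from the example configuration in the text;
# the variable names are illustrative only.
INTERLEAVE_WAYS = 16     # ways of interleaving of the main memory
COPROCESSOR_CHIPS = 8    # combined vector coprocessor / memory interface chips
CHIP_BUS_BITS = 256      # width of each chip's path to its slice of memory
CPU_BUS_BITS = 256       # width of the external data bus of the CPU
VECTOR_ALUS = 64         # vector arithmetic-logic units within the CPU chip
ALU_WIDTH_BITS = 64      # bits supplied to each of those units

# Total width of the path to main memory.
memory_bus_bits = COPROCESSOR_CHIPS * CHIP_BUS_BITS

# Rate of the CPU bus relative to the basic rate of one memory cell.
cpu_bus_speedup = INTERLEAVE_WAYS * (memory_bus_bits // CPU_BUS_BITS)

# Number of CPU bus transfers needed to fill one cache line.
cache_line_bits = VECTOR_ALUS * ALU_WIDTH_BITS
fetches_per_line = cache_line_bits // CPU_BUS_BITS

# Cache speed, relative to main memory cells, at which streaming from
# outside would just keep pace with fetching whole lines from the cache.
break_even_ratio = cpu_bus_speedup // fetches_per_line

print(memory_bus_bits, cpu_bus_speedup, fetches_per_line, break_even_ratio)
```

If the cache is more than `break_even_ratio` times faster than the main memory cells, as the text expects, the cache remains the faster source of operands even in this extreme configuration.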
Opcode space common to all the 16-bit alignment modes, beyond that used for the mode-independent instructions, is available; some of it was used for the block transfer instructions of the full cache modes. Instructions to allow direct control of the external vector coprocessor are also defined in this space. Therefore, in vector register mode, and symmetric vector register mode, the external vector coprocessor instructions would take this form:

in short page mode, register short page mode, stack mode, register stack mode, short page extended operate mode, register short page extended operate mode, short page short shift mode, and register short page short shift mode, the external vector coprocessor instructions take the form:

and in nearly all the other modes, the external vector coprocessor instructions look like this:

with the sole exception of advanced compound mode, in which the instructions look the same except for the addition of the prefix halfword used for the operate instructions:

The length field must be a multiple of 32 bytes, which means its last five bits must be zero, as it is in units of bytes, and all operands must be aligned on a 32 byte, or 256 bit, boundary.
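The two restrictions just stated amount to a test on the low five bits of the length and of each address. A minimal sketch of such a check follows; the function name and the idea of performing the check in software are assumptions made for illustration, and nothing here is part of the architecture itself.

```python
ALIGN_BYTES = 32  # 256 bits, the alignment unit stated in the text


def is_valid_external_vector(length_bytes, *addresses):
    """Return True if the length and every operand address are
    multiples of 32 bytes, that is, if their last five bits are zero."""
    low_bits = ALIGN_BYTES - 1  # binary 11111
    if length_bytes & low_bits:
        return False
    return all((address & low_bits) == 0 for address in addresses)


# A 4096-byte vector with operands at aligned addresses is acceptable;
# a length of 100 bytes, or an operand at address 0x1008, is not.
print(is_valid_external_vector(4096, 0x1000, 0x2000))
print(is_valid_external_vector(100, 0x1000, 0x2000))
print(is_valid_external_vector(4096, 0x1008, 0x2000))
```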
The available opcodes are:
176004 XVAB     External Vector Add Byte
176005 XVSB     External Vector Subtract Byte
176013 XVMVSM   External Vector Move Small
176016 XVMINB   External Vector Minimum Byte
176017 XVMAXB   External Vector Maximum Byte
176024 XVAH     External Vector Add Halfword
176025 XVSH     External Vector Subtract Halfword
176026 XVMH     External Vector Multiply Halfword
176027 XVDH     External Vector Divide Halfword
176033 XVMVF    External Vector Move Floating
176036 XVMINH   External Vector Minimum Halfword
176037 XVMAXH   External Vector Maximum Halfword
176044 XVA      External Vector Add
176045 XVS      External Vector Subtract
176046 XVM      External Vector Multiply
176047 XVD      External Vector Divide
176052 XVMV     External Vector Move
176053 XVMVD    External Vector Move Double
176054 XVN      External Vector AND
176055 XVO      External Vector OR
176056 XVMIN    External Vector Minimum
176057 XVMAX    External Vector Maximum
176064 XVAL     External Vector Add Long
176065 XVSL     External Vector Subtract Long
176066 XVML     External Vector Multiply Long
176067 XVDL     External Vector Divide Long
176073 XVMVQ    External Vector Move Quad
176074 XVSW     External Vector Swap
176075 XVX      External Vector XOR
176076 XVMINL   External Vector Minimum Long
176077 XVMAXL   External Vector Maximum Long
177004 XVASM    External Vector Add Small
177005 XVSSM    External Vector Subtract Small
177006 XVMSM    External Vector Multiply Small
177007 XVDSM    External Vector Divide Small
177012 XVMINSM  External Vector Minimum Small
177013 XVMAXSM  External Vector Maximum Small
177024 XVAF     External Vector Add Floating
177025 XVSF     External Vector Subtract Floating
177026 XVMF     External Vector Multiply Floating
177027 XVDF     External Vector Divide Floating
177032 XVMINF   External Vector Minimum Floating
177033 XVMAXF   External Vector Maximum Floating
177034 XVAU     External Vector Add Unnormalized
177035 XVSU     External Vector Subtract Unnormalized
177036 XVMU     External Vector Multiply Unnormalized
177037 XVDU     External Vector Divide Unnormalized
177044 XVAD     External Vector Add Double
177045 XVSD     External Vector Subtract Double
177046 XVMD     External Vector Multiply Double
177047 XVDD     External Vector Divide Double
177052 XVMIND   External Vector Minimum Double
177053 XVMAXD   External Vector Maximum Double
177054 XVAUD    External Vector Add Unnormalized Double
177055 XVSUD    External Vector Subtract Unnormalized Double
177056 XVMUD    External Vector Multiply Unnormalized Double
177057 XVDUD    External Vector Divide Unnormalized Double
177064 XVAQ     External Vector Add Quad
177065 XVSQ     External Vector Subtract Quad
177066 XVMQ     External Vector Multiply Quad
177067 XVDQ     External Vector Divide Quad
177072 XVMINQ   External Vector Minimum Quad
177073 XVMAXQ   External Vector Maximum Quad
177074 XVAUQ    External Vector Add Unnormalized Quad
177075 XVSUQ    External Vector Subtract Unnormalized Quad
177076 XVMUQ    External Vector Multiply Unnormalized Quad
177077 XVDUQ    External Vector Divide Unnormalized Quad
and the C bit in the instruction indicates the source operand is a single variable instead of a vector, and the R bit indicates that the destination is to be subtracted from the source, with the result going in the destination, or the source is to be divided by the destination with the result going in the destination. For constant operations, the mnemonics in these cases end in C (Constant) or RC (Reversed Constant) respectively.
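The effect of the C and R bits can be modelled in a few lines. This is a sketch only: the function name is hypothetical, vectors are represented as Python lists, and it is assumed that the ordinary (non-reversed) two-address form computes destination = destination op source.

```python
import operator


def xv_two_address(op, dest, source, c_bit=False, r_bit=False):
    """Model of a two-address external vector operation.

    c_bit: the source is a single value applied to every element.
    r_bit: the operands are reversed, so that, for subtraction, the
           destination is subtracted from the source, and, for division,
           the source is divided by the destination.
    The result always goes to the destination, returned here as a new list.
    """
    src = [source] * len(dest) if c_bit else list(source)
    if r_bit:
        return [op(s, d) for d, s in zip(dest, src)]
    return [op(d, s) for d, s in zip(dest, src)]


# Ordinary vector subtract: each destination element minus its source element.
print(xv_two_address(operator.sub, [10, 20, 30], [1, 2, 3]))

# Reversed constant subtract (an "RC" form): 100 minus each element.
print(xv_two_address(operator.sub, [10, 20, 30], 100, c_bit=True, r_bit=True))

# Reversed constant divide: 1.0 divided by each element.
print(xv_two_address(operator.truediv, [2.0, 4.0], 1.0, c_bit=True, r_bit=True))
```

The reversed constant forms are what make operations such as "a constant minus each element" possible in a single instruction, which is presumably why reversal is offered only for the non-commutative operations.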
The MIN instruction returns the lesser of its two arguments, the MAX instruction returns the greater of its two arguments; just like the ZIN single-operand instruction, to be described below, this helps in performing more sophisticated operations on external vectors without the need for conditional branches on individual elements. This is a technique used in graphics processors; in the main CPU, this problem is dealt with in a different fashion, using mask bits and multi-way vector instructions.
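As an illustration of how these instructions replace per-element branching, the following sketch clamps every element of a vector to the range from zero to a limit by a ZIN followed by a MIN against a constant. The function names are hypothetical, and the per-element conditional expression in Python merely stands in for the selection the hardware would perform on all elements at once.

```python
def xv_zin(vector):
    """ZIN: replace each negative element by zero."""
    return [0 if x < 0 else x for x in vector]


def xv_min_constant(vector, constant):
    """MIN with a constant source: the lesser of each element and the constant."""
    return [min(x, constant) for x in vector]


def clamp_to_limit(vector, limit):
    """Clamp to 0..limit with two whole-vector operations and no
    branch in the instruction stream that depends on any element."""
    return xv_min_constant(xv_zin(vector), limit)


print(clamp_to_limit([-5, 3, 300, 0], 255))
```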
Three-address instructions, in which either the source or the operand may be a constant, but which do not need, and do not have, the option of a reversed direction, are also present.
Note that unnormalized floating-point operations are included; this is important so that it is possible to keep track of significance in what may be an enormous computation.
The single-operand instruction format shows a seven-bit field as available for the opcode. This field resembles a concatenation of the two-bit type field and the five-bit opcode field of a normal floating-point single-operand instruction. The values of the five-bit opcode field are modified, however, to avoid conflicts with certain opcodes, among those used for two-address and three-address instructions, that are reserved for special purposes. Those opcodes must be reserved here as well, since whether an instruction is a single-address, two-address, or three-address instruction is not indicated until after its first halfword.
The opcodes for external vector single-operand instructions are:
176001 XVSINSM 176041 XVSIN 177001 XVSIND 177041 XVSINQ
176002 XVCOSSM 176042 XVCOS 177002 XVCOSD 177042 XVCOSQ
176003 XVTANSM 176043 XVTAN 177003 XVTAND 177043 XVTANQ
176004 XVRSQSM 176044 XVRSQ 177004 XVRSQD 177044 XVRSQQ
176005 XVASNSM 176045 XVASN 177005 XVASND 177045 XVASNQ
176006 XVACSSM 176046 XVACS 177006 XVACSD 177046 XVACSQ
176007 XVATNSM 176047 XVATN 177007 XVATND 177047 XVATNQ
176011 XVSINHSM 176051 XVSINH 177011 XVSINHD 177051 XVSINHQ
176012 XVCOSHSM 176052 XVCOSH 177012 XVCOSHD 177052 XVCOSHQ
176013 XVTANHSM 176053 XVTANH 177013 XVTANHD 177053 XVTANHQ
176014 XVRECSM 176054 XVREC 177014 XVRECD 177054 XVRECQ
176015 XVASNHSM 176055 XVASNH 177015 XVASNHD 177055 XVASNHQ
176016 XVACSHSM 176056 XVACSH 177016 XVACSHD 177056 XVACSHQ
176017 XVATNHSM 176057 XVATNH 177017 XVATNHD 177057 XVATNHQ
177021 XVCLR 177061 XVINV
176022 XVZINB 176062 XVZINH 177022 XVZIN 177062 XVZINL
176023 XVABSB 176063 XVABSH 177023 XVABS 177063 XVABSL
176024 XVSQRSM 176064 XVSQR 177024 XVSQRD 177064 XVSQRQ
176025 XVQBRSM 176065 XVQBR 177025 XVQBRD 177065 XVQBRQ
176026 XVLOGSM 176066 XVLOG 177026 XVLOGD 177066 XVLOGQ
176027 XVEXPSM 176067 XVEXP 177027 XVEXPD 177067 XVEXPQ
176031 XVZINSM 176071 XVZINF 177031 XVZIND 177071 XVZINQ
176032 XVSGNB 176072 XVSGNH 177032 XVSGN 177072 XVSGNL
176033 XVNEGB 176073 XVNEGH 177033 XVNEG 177073 XVNEGL
176034 XVCLRSM 176074 XVCLRF 177034 XVCLRD 177074 XVCLRQ
176035 XVABSSM 176075 XVABSF 177035 XVABSD 177075 XVABSQ
176036 XVSGNSM 176076 XVSGNF 177036 XVSGND 177076 XVSGNQ
176037 XVNEGSM 176077 XVNEGF 177037 XVNEGD 177077 XVNEGQ
The additional opcodes that may be unfamiliar here are RSQ, which calculates the reciprocal of the square root, ZIN, which replaces its argument by zero if it is negative, and REC, which calculates the reciprocal. Note also the presence of a few fixed-point single-operand instructions, fixed-point forms of CLR, ABS, SGN, NEG, and ZIN, as well as INV, which performs a one's complement. Also note that XVINV is distinguished from XVCLR by type bits rather than opcode bits (as was also done for the AND and Swap instructions, and the OR and XOR instructions, to make room for the floating-point move instructions), so that opcode space is available for the fixed-point XVZIN instructions.
The diagram below:

illustrates the instruction formats associated with the external vector opcodes which end in three zero bits, which were unused up to this point. As with the other instructions, in the case of advanced compound mode, the 123400 prefix halfword is added. These opcodes are followed by the additional instruction halfwords as shown in the diagrams above, and then the length field (omitted within a stretch, as described below) and the other addressing fields of normal two-address and three-address external vector memory-reference instructions which vary depending on the mode of operation in effect.
Because the positions of later halfwords in the instruction are changed, although the bits in them which indicate whether an instruction is a two-address or a three-address instruction should still be set to their appropriate values, as shown above for the different instruction modes, the first bit of the halfword immediately following the opcode also indicates, if 0, that the instruction is a two-address instruction, and, if 1, that the instruction is a three-address instruction.
These multiple-component instructions treat vectors as composed of repeated groups of two (as illustrated in the first two formats in the diagram above) or four (in the case of the last two formats in the diagram above) numbers.
In a two-address instruction, the values for op1, op2, op3 and op4 have the following meanings:
0010  MOV
0100  ADD
0101  SUB
0110  MUL
0111  DIV
1011  XOR
1100  AND   ADDU
1101  OR    SUBU
1110        MULU
1111        DIVU
and in a three-address instruction, the values for op1, op2, op3 and op4 have the following meanings:
000  ADD
001  SUB
010  MUL
011  DIV
100  AND   ADDU
101  OR    SUBU
110        MULU
111  XOR   DIVU
the ones in the second column replacing the logical operations for floating-point numbers with unnormalized floating-point arithmetic.
In a two-component instruction, operations are performed on pairs of numbers of the type indicated in the first halfword of the instruction. For each component in the result, an operation is indicated, and a source component is indicated, and, if it is a three-address instruction, an operand component is also indicated. All possible destination components are used; the destination component of the operation indicated by op1 is the first of the two destination components, and the destination component of the operation indicated by op2 is the second of two destination components.
In a three-address instruction, there is also a two-bit op field; this indicates how the result of the operation performed on the source component and the operand component is applied to the destination component. The possible values for this field are:
00  replace
01  zero and subtract
10  add
11  subtract
Thus, it is possible to divide the contents of the operand field by the contents of the source field, and then subtract the result from the contents of the destination field, with the result being placed in the destination field.
With two two-component instructions, it is possible to multiply two vectors of complex numbers.
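To make this concrete, here is a sketch of one way the two instructions might be arranged. Several things are assumptions, since the instruction diagrams are not reproduced here: that each destination component carries its own two-bit op value, that component 0 holds the real part and component 1 the imaginary part, and that "zero and subtract" yields the negated result. The function and table names are hypothetical.

```python
import operator

# Arithmetic selected by op1/op2 (only a subset is modelled here).
ARITH = {"ADD": operator.add, "SUB": operator.sub,
         "MUL": operator.mul, "DIV": operator.truediv}

# The two-bit op field: how a result is applied to the destination component.
APPLY = {
    "replace": lambda d, r: r,
    "zero and subtract": lambda d, r: -r,
    "add": lambda d, r: d + r,
    "subtract": lambda d, r: d - r,
}


def two_component(dest, source, operand, slots):
    """Model of a three-address two-component instruction.

    slots holds one tuple per destination component:
    (arithmetic, source component, operand component, apply mode).
    The vectors are flat lists of pairs; dest is updated in place.
    """
    for base in range(0, len(dest), 2):
        src = source[base:base + 2]
        opd = operand[base:base + 2]
        for i, (arith, s, o, mode) in enumerate(slots):
            result = ARITH[arith](src[s], opd[o])
            dest[base + i] = APPLY[mode](dest[base + i], result)


def complex_multiply(x, y):
    """(a + bi)(c + di) = (ac - bd) + (ad + bc)i, in two instructions."""
    out = [0] * len(x)
    # First instruction: real = a*c, imaginary = a*d.
    two_component(out, x, y, [("MUL", 0, 0, "replace"),
                              ("MUL", 0, 1, "replace")])
    # Second instruction: real -= b*d, imaginary += b*c.
    two_component(out, x, y, [("MUL", 1, 1, "subtract"),
                              ("MUL", 1, 0, "add")])
    return out


# (1 + 2i)(3 + 4i) and (3 - i)(0 + 2i), stored as consecutive pairs.
print(complex_multiply([1, 2, 3, -1], [3, 4, 0, 2]))
```

The first instruction uses the real part of the source for both products, and the second uses the imaginary part, so each instruction needs only one source component per destination component, as the format requires.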
In a four-component instruction, we again have a series of four instructions which select a source component, and, in a three-address instruction, an operand component, within the corresponding elements of the source vector and the operand vector of the instruction.
This type of instruction is very similar to, but is a superset of, the type of operation commonly found in pixel (or fragment) and vertex shaders in graphics chips. However, tasks such as the rasterization of polygons still require either a conventional software program, or the use of a conventional special-purpose graphics chip.
Also note that if the external vector coprocessors have a 256-bit path to memory, four-component operations on the Quad floating-point type are not possible unless pairs of external vector coprocessors co-operate in performing them. Narrower paths to main memory could impose more severe restrictions.
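The restriction follows from simple arithmetic, sketched below. The function name is hypothetical, and the element widths of 128 bits for Quad and 64 bits for Double are assumptions made for this illustration.

```python
def coprocessors_needed(components, element_bits, path_bits=256):
    """Number of external vector coprocessors that must co-operate so that
    one complete group of components fits in their combined path to memory."""
    group_bits = components * element_bits
    return -(-group_bits // path_bits)  # division rounded upwards


print(coprocessors_needed(4, 128))        # four Quad components, 256-bit path
print(coprocessors_needed(2, 128))        # two Quad components
print(coprocessors_needed(4, 64))         # four Double components
print(coprocessors_needed(4, 128, 128))   # a narrower, 128-bit path
```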
In many cases, an alternative to using multiple-component instructions would be to use multiple vectors, each vector being of one component. This requires more instructions, but that is not a real cost if the length of a vector handled by an instruction is bounded with sufficient severity that it is shorter than the actual length of the arrays on which operations are being performed. With a 32-bit length for vectors, this is not the case for the external vector coprocessor. The use of long vector instructions with stride is also possible for conversion between the two possible memory organizations.
Because the operands of these instructions have a length indicated by a thirty-two bit length field, it is not practical for the external coprocessor units to possess registers of sufficient size to store a complete operand. However, the ability to use registers, so that not every step in a calculation requires a memory access, is very important. This is handled by providing a special interpretation to the following opcodes for the two-address form of an external vector operation:
176002 XVLB    External Vector Load Byte
176003 XVSTB   External Vector Store Byte
176022 XVLH    External Vector Load Halfword
176023 XVSTH   External Vector Store Halfword
176042 XVL     External Vector Load
176043 XVST    External Vector Store
176062 XVLL    External Vector Load Long
176063 XVSTL   External Vector Store Long
177002 XVLSM   External Vector Load Small
177003 XVSTSM  External Vector Store Small
177022 XVLF    External Vector Load Floating
177023 XVSTF   External Vector Store Floating
177042 XVLD    External Vector Load Double
177043 XVSTD   External Vector Store Double
177062 XVLQ    External Vector Load Quad
177063 XVSTQ   External Vector Store Quad
These are memory-to-register instructions. The dX field instead serves as a dR field, and the halfword containing the corresponding address (or, in the case of the short page modes, the indirect bit, the dB field, and the address) is omitted from the instruction, because the destination is a register, and is not in memory; the halfword containing the source address remains present in the instruction. The store instruction also omits the length specification.
A load instruction begins a stretch of code using registers, and a store instruction ends a stretch of code using registers.
The stretch should be treated as if it were a single instruction; no attempt should be made to branch into it, or out of it. For one thing, an attempt to branch into a stretch from instructions not part of a stretch would result in the instructions within it not being interpreted correctly, because the format of instructions is being changed by the omission of the length field. A stretch should be thought of as being similar to a series of instructions within an FLL (Fixed-Length Loop) instruction, since the series is sent once to the external vector coprocessors, but is performed repeatedly by them, as many times as the width of their path to memory lies within the length specification of the initial load instruction.
The length specification in the load instruction applies to all the instructions in the stretch. A stretch consists exclusively of memory-reference instructions and single-operand instructions, both fixed and floating. The multiple-component instructions are allowed within a stretch. All these instructions will be treated as external vector instructions; register references will be to the internal registers of the external vector coprocessor.
These registers will be as wide as the bus connecting each external vector coprocessor to memory; the instructions making up the stretch will be repeated, therefore, with a suitable displacement each time, until the entire length of the vector is processed. Thus, none of the vectors referenced in memory may overlap, or the results will be unpredictable.
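The repetition of a stretch can be pictured with a small simulation. This is a sketch under stated assumptions: a single register 32 bytes (256 bits) wide, memory modelled as a Python bytearray, and a hypothetical `body` function standing in for the register-to-register instructions between the load and the store.

```python
REGISTER_BYTES = 32  # assumed: registers match a 256-bit path to memory


def run_stretch(memory, length_bytes, load_addr, store_addr, body):
    """Repeat load, body, store once for each register-width piece of the
    vector, with the displacement advancing by the register width each time.
    Returns the number of repetitions performed."""
    if length_bytes % REGISTER_BYTES:
        raise ValueError("length must be a multiple of the register width")
    repetitions = 0
    for offset in range(0, length_bytes, REGISTER_BYTES):
        src = load_addr + offset
        dst = store_addr + offset
        register = bytes(memory[src:src + REGISTER_BYTES])  # the load
        register = body(register)                           # the stretch body
        memory[dst:dst + REGISTER_BYTES] = register         # the store
        repetitions += 1
    return repetitions


def add_one_to_each_byte(register):
    """Stand-in for a byte add with a constant operand of 1."""
    return bytes((b + 1) & 0xFF for b in register)


# A 64-byte source vector at address 0, and a result vector at address 64.
memory = bytearray(range(64)) + bytearray(64)
passes = run_stretch(memory, 64, load_addr=0, store_addr=64,
                     body=add_one_to_each_byte)
print(passes, list(memory[64:72]))
```

Because each piece is stored before the next is loaded, a destination that overlapped a source at a different displacement would have some of its elements read after they had already been overwritten, which is why overlapping vectors give unpredictable results.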
Also, within a stretch, the opcode 176021 will be used as an instruction prefix, to indicate that the source operand of an instruction is a scalar, to be used as a constant operand acting on every element of a vector.
In addressing modes that provide three-address memory-to-memory scalar operations, they may be used; otherwise, a three-address external vector instruction with the length field omitted may be present within a stretch as well. Register operands are specified by using a zero base register value in the vector register and symmetric vector register modes, as well as in the conventional modes; in the short page modes, the three bits following the index register field, normally zero, are to contain the bits 111 to indicate a register operand.
Since a load instruction begins a stretch, and a store instruction ends a stretch, transfers within the stretch between registers and memory are handled by the following instructions:
176012 XVMVB   External Vector Move Byte
176032 XVMVH   External Vector Move Halfword
176052 XVMV    External Vector Move
176072 XVMVL   External Vector Move Long
176013 XVMVSM  External Vector Move Small
176033 XVMVF   External Vector Move Floating
176053 XVMVD   External Vector Move Double
176073 XVMVQ   External Vector Move Quad
Note that, due to a lack of available opcodes, the move instructions for the floating-point types are grouped with fixed-point opcodes.
Only a limited subset of the available floating-point formats supported by the architecture would be available by means of vector coprocessor operations. It is not yet defined how the format to use would be specified, if any choice at all is made available. Presumably, a limited-width field in the Program Status Block would contain the external vector coprocessor floating-point format in current use, which would be signalled to the external vector coprocessor each time an operation is requested.
It is envisaged that in any case, the Standard floating-point format would be supported, since the IEEE 754 floating-point format has become the standard floating-point format supported by virtually all microprocessors. As a second choice, the Compatible or Modified formats, for compatibility with the IBM System/360 series of computers and their successors, suggest themselves. But another very important alternative format would be the Common floating-point format. This is the floating-point format used when it is desired to increase the total floating-point performance of the central processing unit to its maximum value by modifying Simple Floating-Point operation, which is carried out by the fixed-point arithmetic units, by decreasing the length of the exponent field, so that compatibility with the floating-point arithmetic units is achieved. Allowing the external vector coprocessing units to handle this format as well permits all available floating-point processing capacity to use the same floating-point format to maximize floating-point throughput.
Since the external vector coprocessor only works with memory in its native width, the only variation of the Common floating-point format that would be used would be the one where an excess-128 binary exponent occupied the last eight bits of a 32, 64, or 128 bit long floating-point number with a sign-magnitude mantissa. Note also that the integer arithmetic units of the central processing unit do not provide the guard, round, and sticky bits, or their equivalents, that both the central processing unit floating-point units and the external vector coprocessor provide, and, thus, there is a penalty in accuracy for maximizing floating-point performance in this manner.