The decoding of the memory-reference instruction formats in this mode is intricate, and so the instruction formats have been divided up among a number of diagrams.
This first diagram shows all the addressing modes, except that, of the modes in which the first two bits of the instruction are 00, only those on the first level of instruction decoding appear; those on higher levels with those first two bits will be shown in later diagrams:

Note that the opcode is split into two fields. The opcodes are the same seven-bit opcodes as used in other instruction modes, but here the last five bits of the opcode come first. Also, recall that the first two bits of the opcode of a memory-reference instruction can only be either 00, 01, or 10.
Thus, when the bits in the second opcode field of an instruction format are 11, that indicates this addressing mode is not the one used, and, instead, another addressing mode, with the second opcode field located at a more rightward position, is the one used. It is this rightward movement of the second opcode field which places an instruction either on the first level of instruction decoding, if no such movement has taken place, or on a higher level of decoding, if this movement has occurred.
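The way a value of 11 in the second opcode field pushes an instruction to a higher level of decoding can be sketched as follows. This is only an illustration: the function name and the list of field positions are hypothetical, since the actual bit positions are given by the diagrams and not by the text.

```python
def decoding_level(bits: str, field_offsets: list[int]) -> int:
    """Return the 1-based decoding level of an instruction.

    bits          -- the instruction as a string of '0'/'1', leftmost bit first
    field_offsets -- HYPOTHETICAL start positions of the two-bit second opcode
                     field at each level, ordered from left to right
    """
    for level, offset in enumerate(field_offsets, start=1):
        # 00, 01 and 10 are valid first two bits of a memory-reference opcode;
        # 11 means "not this mode, look further to the right".
        if bits[offset:offset + 2] != "11":
            return level
    raise ValueError("every candidate opcode field contained 11")


# Example with made-up positions 5 and 9 in a 16-bit halfword.
first_level = "00000" + "01" + "000000000"
second_level = "00000" + "11" + "00" + "10" + "00000"
print(decoding_level(first_level, [5, 9]))
print(decoding_level(second_level, [5, 9]))
```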
The Vector Register to Scratchpad mode, which continues from the Vector Register Constant mode to the second level of decoding, is also shown in the diagram above.
The functions of the addressing modes illustrated in the diagram above are:
Register to Register: the source and destination operands of this scalar instruction format are general registers.
Vector Register: this vector operation has one of the eight vector registers as its source and destination operands.
Vector Scratchpad to Scratchpad: the source operand is one of the sixty-four vector scratchpad registers, and the destination operand is the supplementary registers.
Vector Register to Scratchpad: the source operand is one of the eight vector registers, and the destination operand is the supplementary registers; this is a vector operation, and a range of the 64 elements in each vector register and the supplementary registers is indicated, together with an optional mask, if the M bit is one, found in the register indicated by the mR field.
Vector: these instructions are three-address memory-to-memory instructions. The source, operand, and destination addresses may individually be selected to be indirect. Since no vector register operands are involved, note the presence of a single length field rather than two six-bit fields giving a range. Also, for that reason, a full 16 bits are used for the length field, so a single instruction can result in repeated full-width vector operations.
Also note that, under appropriate circumstances (i.e., the operand type is not the Medium floating type, the data word length is the normal 32 bits, not 24, 36, or 40 bits, and possibly with the additional restrictions that the operand is floating, and the current floating-point format is one that the relevant external circuitry supports) this instruction may trigger the operation of the external circuitry whose function is outlined in this section rather than making use of the on-chip vector arithmetic-logic unit which makes this mode possible.
The P bit in the instruction, if 0, indicates the instruction functions normally, and completes before the next instruction begins. If it is 1, the next instruction may, particularly in the case where external floating-point units are used, begin while this instruction is still processing. Note that in the case of an external floating-point unit, if this bit is 0, this does not necessarily mean that the processor is tied up waiting for the instruction to complete; instead, the operating system will simply switch to other processes that are waiting to execute until the instruction does complete.
Note that because the external vector coprocessor may be in use by other processes, waiting for the last instruction of this type issued by the current process is done by means of a supervisor call, not a specific instruction dedicated to that purpose.
It is envisaged that a typical full implementation of this architecture may include a 256-bit wide data bus linking the processor to memory, and an internal on-chip data path, 4,096 bits wide, linking the sixty-four arithmetic-logic units that make up the long vector arithmetic unit to the on-chip cache.
Thus, for an external vector processor to be faster than the long vector arithmetic unit in some cases, specifically when the operation is not cacheable, it is sufficient for the external vector processor to be connected to memory by a path that is 512 bits wide or wider, and for it to operate on 512 bits or more at once.
On the other hand, when the specified operation is cacheable, and assuming cache memory is 16 times faster than main memory, so that a reasonable 16-way interleave of main memory makes it possible to fill the cache at full speed, the data path to the external vector processor would have to be 65,536 bits wide to be as fast or faster than the on-chip long vector arithmetic unit. This does not count the penalty for flushing the on-chip cache, in the event the cacheable vector operation deals with operands that are already in the cache.
It is expected that the actual width of an external vector coprocessor will lie somewhere between the two extremes of 512 bits and 65,536 bits. As a result, the C bit is included in the instruction; if this bit is a 1, the instruction will always be executed by the internal long vector arithmetic-logic unit, even if an external vector coprocessor is available.
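The arithmetic behind the two extremes can be restated as a small calculation. This is a sketch using only the figures quoted above (a 256-bit memory bus, a 4,096-bit on-chip path, and cache assumed to be 16 times faster than main memory); the constant and function names are illustrative.

```python
BUS_BITS = 256             # processor-to-memory data bus
ON_CHIP_PATH_BITS = 4096   # 64 ALUs x 64 bits, path to the on-chip cache
CACHE_SPEEDUP = 16         # cache assumed 16 times faster than main memory


def break_even_external_width(cacheable: bool) -> int:
    """Width in bits at which an external vector processor, fed at
    main-memory speed, just matches the on-chip long vector unit."""
    if cacheable:
        # The on-chip unit moves 4,096 bits per cache cycle, and there are
        # 16 cache cycles per main-memory cycle.
        return ON_CHIP_PATH_BITS * CACHE_SPEEDUP
    # Without caching, the on-chip unit is limited by the memory bus.
    return BUS_BITS


print(break_even_external_width(cacheable=False))  # 256, so 512 bits is faster
print(break_even_external_width(cacheable=True))   # 65536
```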
Note also that the REP and CPS instructions offer another way to operate on vectors of more than 64 elements, one that makes efficient use of the fact that the arithmetic units used for vector operations are themselves pipelined.
Two-Address Vector: This instruction format is similar to the Vector instruction format, but is a two-address instruction.
Scratchpad Memory Reference: in this scalar instruction format, the destination operand is one of the sixty-four supplementary registers, and the source operand is in memory. Indirect addressing is possible in this format.
Vector to Vector Scratchpad: in these instructions, the destination operand is one of the sixty-four vector scratchpad registers, and the source operand is a vector in memory. The I bit indicates that indirect addressing is used for the memory operand; in that case, the effective address of the source operand points to a thirty-two bit absolute address (or a sixty-four bit absolute address if 64-bit addressing is in effect) giving the location of the vector rather than to the vector itself. As in the formats described above, the dSl field, if nonzero, shortens the vector in memory, but does not cause elements at its beginning to be skipped. Also, the M bit, if one, causes the mR field to be used to indicate an arithmetic/index register which contains a mask to be used with the instruction.
Vector Register Constant: here, the destination operand is one of the eight vector registers, but the source operand is a scalar found in a general register. This type of instruction can be used, for example, to multiply every element of a vector stored in one of the vector registers by the same value, found in the specified general register.
This diagram shows the instruction modes that continue onwards from the register-to-register format:

The functions of the addressing modes illustrated in this diagram are:
Memory Reference: the source operand of this instruction format is a memory location, the destination a general register.
Memory to Memory: the source and destination operands of this scalar instruction format are memory locations.
Long Scratchpad: in this scalar instruction format, the source is one of the sixty-four supplementary registers, and the destination is a general register. An additional three-bit opcode field allows additional instructions to be used.
Register to Scratchpad: here, a general register is the source operand, and the destination operand is one of the sixty-four supplementary registers.
Scratchpad to Scratchpad: these scalar instructions have their source and destination operands within the set of sixty-four supplementary registers.
Aux Register Memory Reference: these scalar instructions are used for operations involving the thirty-two base registers available in this mode as the destination, with a location in memory as the source operand.
This diagram shows the first half of the instruction modes that continue onwards from the vector register format:

and their descriptions are:
Vector to Vector Register: this vector instruction format has one of the eight vector registers as its destination operand, and a vector located in memory as its source operand. Note that the dSl and dSh fields indicate the range of locations within the vector register used, but they only indicate the length of the vector in memory; the range does not cause locations at the beginning of the memory vector to be skipped; all shortening of the vector in memory through the range is applied to the end. If the mask is used, however, zero mask bits do refer to specific locations within vectors in memory.
Condensed Vector: these instructions are three-address memory-to-memory instructions. The source, operand, and destination addresses may individually be selected to be indirect. Since no vector register operands are involved, note the presence of a single length field rather than two six-bit fields giving a range. In this shorter format for memory-to-memory vector operations, only the arithmetic/index registers, and not the scratchpad registers, are used as indexes.
If the oB field contains all zeroes, this indicates the instruction is a two-address instruction, and the halfword containing the 16-bit address of the operand operand is omitted from the instruction.
The modes shown in the diagram with a stride represent a partial implementation of another feature found on Cray supercomputers.
The stride is a signed 10-bit field giving the displacement between successive vector elements in memory. The displacement is in units of the operand size, not bytes; thus, if the stride field contains 1, the instruction is a conventional vector operation, and if it contains zero, the memory operand is a scalar value.
The purpose of this is to facilitate matrix multiplications. Since a nonunit stride will lead to extra memory accesses in most implementations, the optimal way in which to perform matrix multiplication will be to load each column of the left matrix into the vector registers or vector scratchpad in turn, requiring an operation with nonunit stride, and then performing successive multiply and accumulate operations involving that operand in the register space with the rows of the right matrix, which is accessed using unit stride.
Thus, stride is only specified for two-address instructions with a register destination.
Vector to Vector Scratchpad with Stride: in these instructions, the destination operand is one of the sixty-four vector scratchpad registers, and the source operand is a vector in memory. The I bit indicates that indirect addressing is used for the memory operand; in that case, the effective address of the source operand points to a thirty-two bit absolute address (or a sixty-four bit absolute address if 64-bit addressing is in effect) giving the location of the vector rather than to the vector itself. As in the formats described above, the dSl field, if nonzero, shortens the vector in memory, but does not cause elements at its beginning to be skipped. Also, the M bit, if one, causes the mR field to be used to indicate an arithmetic/index register which contains a mask to be used with the instruction. And this format includes a ten-bit signed stride field, as described above.
Vector to Vector Register with Stride: this vector instruction format has one of the eight vector registers as its destination operand, and a vector located in memory as its source operand. Note that the dSl and dSh fields indicate the range of locations within the vector register used, but they only indicate the length of the vector in memory; the range does not cause locations at the beginning of the memory vector to be skipped; all shortening of the vector in memory through the range is applied to the end. If the mask is used, however, zero mask bits do refer to specific locations within vectors in memory. And this format includes a ten-bit signed stride field, as described above.
Vector to Scratchpad with Stride: in this vector instruction format, a vector in memory is the source operand, and the supplementary registers are the destination operand. And this format includes a ten-bit signed stride field, as described above.
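The interaction of the register range and the stride in the Vector to Vector Register with Stride format can be modeled roughly as below. Several things here are assumptions rather than statements from the text: the signed field is taken to be two's complement, the range given by dSl and dSh is taken to be inclusive, and memory is modeled as a Python list of operands so that indexes are already in units of operand size. The function names are illustrative, and masking is left out.

```python
def sign_extend_10(field: int) -> int:
    """Interpret a 10-bit field as a signed value (two's complement ASSUMED)."""
    field &= 0x3FF
    return field - 0x400 if field & 0x200 else field


def load_with_stride(memory, base, stride_field, vreg, ds_low, ds_high):
    """Fill vreg[ds_low..ds_high] (inclusive, ASSUMED) from memory.

    The first memory element used is always memory[base]: the range selects
    positions in the register but only shortens the memory vector at its end.
    """
    stride = sign_extend_10(stride_field)
    for n, i in enumerate(range(ds_low, ds_high + 1)):
        vreg[i] = memory[base + n * stride]
    return vreg


# A 3 x 4 matrix stored by rows; column 1 is read with a stride of 4.
matrix = list(range(12))
vreg = load_with_stride(matrix, 1, 4, [None] * 64, 2, 4)
print(vreg[2:5])  # [1, 5, 9]
```

With a stride field of 0 every selected register element receives `memory[base]`, which matches the description of the memory operand acting as a scalar.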
This diagram shows the second half of the instruction modes that continue onwards from the vector register format:

And these addressing modes are described below:
Constant: this memory-to-memory vector instruction format has a source and a destination operand in memory, but the operand operand is in a register: the S bit, if 0, indicates the oR field is used to indicate one of the general registers as that operand; if it is 1, the oS field is used to indicate one of the scratchpad registers as that operand.
Reversed Constant: this memory-to-memory vector instruction format has a destination and an operand operand in memory, but the source operand is in a register: the S bit, if 0, indicates the sR field is used to indicate one of the general registers as that operand; if it is 1, the sS field is used to indicate one of the scratchpad registers as that operand.
Vector to Scratchpad: in this vector instruction format, a vector in memory is the source operand, and the supplementary registers are the destination operand.
Long Vector Register Constant: here, the destination operand is one of the eight vector registers, but the source operand is a scalar found in a general register. This type of instruction can be used, for example, to multiply every element of a vector stored in one of the vector registers by the same value, found in the specified general register. Compared to the Vector Register Constant format, this format includes two additional fields. The first of these is the R bit, which if set, indicates the operation is reversed; in the case of a division operation, that would mean that the destination operand supplies the divisor instead of the source operand; the quotient still is stored in the destination. The second is a three-bit auxiliary opcode field, which, as with other addressing modes, permits additional instructions to be used with the given addressing mode.
Vector Scratchpad Constant: here, the destination operand is one of the sixty-four vector scratchpad registers.
Long Vector Register to Scratchpad: the source operand is one of the eight vector registers, and the destination operand is the supplementary registers; this is a vector operation, and a range of the 64 elements in each vector register and the supplementary registers is indicated, together with an optional mask, if the M bit is one, found in the register indicated by the mR field. Again, this format provides for an extra opcode field.
Long Vector Register: this form of the instruction with a vector register as source and destination includes an extra opcode field to permit additional instructions to be defined.
Long Vector Scratchpad to Scratchpad: these instructions have one of the vector scratchpad registers as the source, and the supplementary registers as their destination.
Vector Scratchpad to Vector Register: here, one of the sixty-four vector scratchpad registers is the source, and one of the eight vector registers is the destination.
Vector Register to Vector Scratchpad: here, one of the eight vector registers is the source, and one of the sixty-four vector scratchpad registers is the destination.
Vector Scratchpad: in this instruction format, the source and the destination are both found among the sixty-four vector scratchpad registers.
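The effect of the R bit in the Long Vector Register Constant format, described above for division, can be illustrated with a short model. This is a sketch only: a vector register is represented as a Python list of floats, and the function and parameter names are illustrative.

```python
def vector_constant_divide(dest, scalar, r_bit=False):
    """Divide a vector register (dest) and a scalar source, in place.

    r_bit False: the scalar source is the divisor,   dest[i] = dest[i] / scalar
    r_bit True:  the destination is the divisor,     dest[i] = scalar / dest[i]
    In both cases the quotient is stored in the destination.
    """
    for i, element in enumerate(dest):
        dest[i] = scalar / element if r_bit else element / scalar
    return dest


print(vector_constant_divide([2.0, 4.0, 8.0], 8.0))               # [0.25, 0.5, 1.0]
print(vector_constant_divide([2.0, 4.0, 8.0], 8.0, r_bit=True))   # [4.0, 2.0, 1.0]
```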
For compactness, an extra row is not included for the case when an index register field is zero, indicating that there is no indexing.
Note also that instructions in the scratchpad memory reference format are scalar operations, although most scalar operations have been made descendants of the register-to-register format.
Because the sixty-four supplementary arithmetic/index registers are 64 bits long, rather than 32 bits long, in order that vector operations on the long type are possible, these additional opcodes:
050xxx  I   Insert
052xxx  UL  Unsigned Load
are defined for the vector addressing modes and those scalar addressing modes having a supplementary register as their destination register, controlling sign extension for loading a 32-bit value into a 64-bit register.
Note that the memory reference and scratchpad memory reference instructions provide for two index register fields, one where an accumulator/index register is indicated, and one where a supplementary accumulator/index register is indicated.
The short scratchpad instruction always has register 0 as its destination.
In the instructions with an additional 2-bit opcode field, the contents of that field are to be prepended to the contents of the 5-bit opcode field, producing a 7-bit opcode that is to be interpreted as in normal mode, providing access to all eight basic data types. As that 2-bit opcode field will only contain the values 00, 01, and 10 when serving this purpose, a value of 11, associated with the other instructions, is used to indicate the instruction is to be decoded in a different manner, providing additional addressing modes.
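The assembly of the 7-bit opcode from the two fields can be sketched as follows. The function name and the use of `None` to signal the escape case are illustrative choices, not part of the architecture.

```python
def full_opcode(two_bit_field: int, five_bit_field: int):
    """Prepend the 2-bit field to the 5-bit field to form a 7-bit opcode.

    Returns None when the 2-bit field is 11, which does not form an opcode
    but signals that the instruction must be decoded in a different manner.
    """
    if not 0 <= two_bit_field <= 0b11 or not 0 <= five_bit_field <= 0b11111:
        raise ValueError("field value out of range")
    if two_bit_field == 0b11:
        return None
    return (two_bit_field << 5) | five_bit_field


print(format(full_opcode(0b10, 0b00101), "07b"))  # 1000101
print(full_opcode(0b11, 0b00101))                 # None
```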
The Aux Register Memory-Reference instructions only have a four-bit opcode field, to select a fixed-point operation. The multiplication and division operations are not included among those available.
The memory to memory vector instructions are three-address instructions, while the vector to scratchpad instructions are two-address instructions.
Also note that, as in other modes, the memory to memory vector instructions, unlike the short vector instructions, are not maskable. The vector to scratchpad instructions, on the other hand, are maskable; the mR field refers to a pair of arithmetic/index registers, and may contain 0, 2, 4 or 6. The mapping of bits in the mask register pair to particular supplementary registers is fixed: the most significant bit in the first word maps to supplementary register 0, the least significant bit in the second word maps to supplementary register 63.
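The fixed mapping from mask bits to supplementary registers can be written out as a lookup. This sketch assumes that each arithmetic/index register of the pair holds 32 bits and that the bits between the two stated endpoints are assigned in order; the function name is illustrative.

```python
def mask_bit(first_word: int, second_word: int, register: int) -> int:
    """Return the mask bit that governs a given supplementary register.

    Register 0 maps to the most significant bit of the first word and
    register 63 to the least significant bit of the second word; a 32-bit
    width for each word of the pair is ASSUMED.
    """
    if not 0 <= register <= 63:
        raise ValueError("supplementary register number must be 0..63")
    word = first_word if register < 32 else second_word
    return (word >> (31 - register % 32)) & 1


print(mask_bit(0x80000000, 0x00000000, 0))   # 1
print(mask_bit(0x00000000, 0x00000001, 63))  # 1
```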
The short vector instructions are designed to be implemented by means of an arithmetic unit with a 256-bit wide register that can be partitioned into multiple areas, and in which all operations are carried out in parallel.
The long vector instructions, on the other hand, follow the principles used in computers where there is only a single conventional ALU, but which can still operate on several operands concurrently because it is pipelined. As a result, the Medium floating-point format is allowed with long vector instructions.
It is envisaged that the short vector arithmetic unit will be pipelined also, and, thus, in order that the long vector instructions, when available, will result in more arithmetic operations in a given time than the short vector instructions, it will be necessary to implement long vector operations with some degree of parallelism.
When parallelism is employed, special interconnections are required to permit the long vector arithmetic units to work not only on groups of consecutive double-precision floating-point numbers or consecutive long integers, which correspond to the most likely width of the data path between the long vector arithmetic units and the cache, but also on the same number of consecutive bytes, the same number of consecutive halfwords, the same number of consecutive integer or floating-point numbers, or, for that matter, the same number of consecutive extended (or quad) precision floating-point numbers.
Eight consecutive bytes, four consecutive 16-bit halfwords, and two consecutive 32-bit integers or floating-point numbers all occupy the same area in memory as one 64-bit double precision floating-point number or long integer, so circuitry is required to send them to consecutive arithmetic-logic units instead of to the same one.
The degree of parallelism provided can vary from one implementation to another. If complete parallelism for a long vector instruction is provided, by means of a bank of sixty-four arithmetic-logic units for long vector operations, some further additional circuitry, to allow MIMD parallel computing as well as SIMD parallel computing, will permit the execution of the instructions defined for this mode described in the section on cache-internal parallel computing.
One possible way to implement the capability of a parallel long vector arithmetic unit to operate on operands of differing size as though it were merely a single, but deeply-pipelined, ALU, while limiting the amount of interconnecting circuitry required is shown below:

A circuit that simply rearranges the 512 bytes contained in sixty-four data words of 64 bits each from
0 1 2 3 4 5 .... 508 509 510 511
to
0 128 1 129 2 130 .... 126 510 127 511
will, when applied once, move the bytes in consecutive 32-bit operands to consecutive 64-bit data words, thus delivering them to the appropriate ALU, and, if applied twice, will move the bytes in consecutive 16-bit operands to consecutive ALUs with a 64-bit path to cache memory, and if applied three times will deliver consecutive bytes to consecutive ALUs.
But the bytes will not be in the right positions; this can be dealt with by using a circuit with a function that is partially the inverse of this function, but which operates locally on the eight bytes dealt with by each single ALU, that is, rearranging the eight bytes in each of the sixty-four groups of eight bytes from:
0 1 2 3 4 5 6 7
to
0 2 4 6 1 3 5 7
Inverse versions of these two circuits will also be required for returning values to memory.
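The local rearrangement and its inverse can be modeled in a few lines. This is a sketch under one stated reading of the listing above, namely that output position i of each group receives the input byte whose number appears in position i; the function names are illustrative, and the global rearrangement is not modeled.

```python
LOCAL_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # taken from the listing above


def local_gather(data: bytes) -> bytes:
    """Rearrange each group of eight bytes so that position i of the output
    holds input byte LOCAL_ORDER[i] of the same group."""
    if len(data) % 8:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(data), 8):
        group = data[start:start + 8]
        out.extend(group[i] for i in LOCAL_ORDER)
    return bytes(out)


def local_gather_inverse(data: bytes) -> bytes:
    """Undo local_gather, as needed when returning values to memory."""
    if len(data) % 8:
        raise ValueError("length must be a multiple of 8")
    out = bytearray(len(data))
    for start in range(0, len(data), 8):
        for position, source in enumerate(LOCAL_ORDER):
            out[start + source] = data[start + position]
    return bytes(out)


print(list(local_gather(bytes(range(8)))))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Under this reading, applying `local_gather` three times to a group returns the bytes to their original order.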
Given a 64-bit path into each ALU, so that doubleword operands proceed directly from cache into the ALU, the operations shown above for different sizes of operands are:
For word operands, one global scatter operation, followed by one local gather operation.
For halfword operands, two consecutive global scatter operations, followed by two consecutive local gather operations.
For byte operands, three consecutive global scatter operations.
Then, a shift step, if necessary, selects which of the two groups of words, or which of the four groups of halfwords, or which of the eight groups of bytes, in a group of doublewords, is operated on in the bank of ALUs, and finally a masking step removes the unused inputs.
Also, it may be noted that it is not necessary for a long vector operand to be aligned on a boundary representing 64 of the data items of which it consists. The circuitry required to deal with this expeditiously is also useful for more general inter-ALU communications, and is therefore described in a subsequent section concerning a MIMD capability obtained by adding a simple control unit to each ALU.
Note that this type of circuitry appears inconsistent with the use of the Medium floating-point type with long vector instructions. However, the circuitry provided for data memory width control can be used to permit efficient operation on this floating-point type for some combinations of floating-point format and memory width, as follows:
If the memory width provides normal 32-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 24-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 40-bit words.
If the memory width provides 24-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 36-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.
If the memory width provides 40-bit words, for formats other than the Standard format the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.
For all other combinations of memory width and floating-point format, the Medium floating-point type may not be used with long vector instructions.
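The combinations listed above can be collected into a lookup table. This is a restatement of the rules, not an addition to them; it assumes that "most formats" means every floating-point format other than the Standard format, and the names are illustrative.

```python
# (memory word width in bits, Standard format?) -> word width used for caching
MEDIUM_CACHE_WORD_BITS = {
    (32, False): 24,
    (32, True): 40,
    (24, False): 36,
    (24, True): 60,
    (40, False): 60,
}


def medium_cache_word_bits(memory_word_bits: int, standard_format: bool):
    """Return the caching word width that lets long vector instructions use
    the Medium floating-point type, or None if the combination is excluded."""
    return MEDIUM_CACHE_WORD_BITS.get((memory_word_bits, standard_format))


print(medium_cache_word_bits(32, False))  # 24
print(medium_cache_word_bits(40, True))   # None
```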
Note also that values in the Medium floating-point type need to be aligned on 16-bit boundaries, not 48-bit boundaries, and thus caching to deal with these values may involve an offset that would not occur if a memory width of 48 bits is applied to all memory beginning with location zero; however, such offsets can also be used when the memory width is changed normally; the actual restriction imposed on the address associated with a cache line is that it must be a multiple of the width of the data bus to main memory.