In addition to long vector instructions, a complete set of addressing modes which allow the use of the 64 supplementary registers for scalar instructions is provided. Some of the possible modes are illustrated below:

One addressing mode uses the eight scratchpad registers as base registers with a 25-bit displacement, so that each points to an area of memory of up to 33,554,432 bytes. In addition, a mode is provided in which the sixty-four supplementary registers serve as both the base register and the index register, with a 19-bit displacement, so that each of these base registers indicates an area of up to 524,288 bytes.
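The two area sizes quoted above follow directly from the displacement widths. A minimal check, assuming the displacement is an unsigned byte offset (the function name is hypothetical, not part of the architecture):

```python
def reach_in_bytes(displacement_bits: int) -> int:
    """Number of distinct byte offsets an unsigned displacement can express."""
    return 1 << displacement_bits


# Scratchpad registers as base registers: 25-bit displacement.
SCRATCHPAD_BASE_REACH = reach_in_bytes(25)

# Supplementary registers as base registers: 19-bit displacement.
SUPPLEMENTARY_BASE_REACH = reach_in_bytes(19)
```

Shortening the displacement by six bits, from 25 to 19, reduces the reach by a factor of 64.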
One use of this addressing mode is to avoid the need to switch to general register mode when translating programs directly from the assembly language of architectures whose general registers serve as base registers, as index registers, and as arithmetic accumulators.
Another useful feature is that the long memory-reference format allows unaligned memory references, which cannot be expressed in the normal short memory-reference format of this mode.
When the first bit of the second halfword is a one, we proceed to the long vector instructions, which are similar in format.

The intent is that two bits in the first halfword indicate whether the source and destination operands are supplementary registers. If an operand is not, the first three bits of its six-bit source or destination field indicate the addressing mode for that operand in an orthogonal manner. Any of the addressing modes illustrated here in the most common combinations can thus be used for either the source or the destination operand, allowing, for example, memory-to-memory instructions where both the source and destination operands are specified in the long indexed format.
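The operand decoding just described might be sketched as follows. This is an illustration under stated assumptions, not the actual decoder: "first three bits" is taken to mean the most significant three bits of the six-bit field, the meaning of the remaining three bits is not given here and is simply passed through, and all names are hypothetical.

```python
from typing import NamedTuple, Optional


class Operand(NamedTuple):
    is_supplementary_register: bool
    register: Optional[int]       # 0..63 when the operand is a supplementary register
    mode: Optional[int]           # 0..7 addressing mode otherwise
    mode_argument: Optional[int]  # remaining three bits (meaning not specified here)


def decode_operand(field: int, is_register: bool) -> Operand:
    """Decode one six-bit source or destination field.

    is_register is the corresponding bit from the first halfword.
    """
    if not 0 <= field < 64:
        raise ValueError("operand field must fit in six bits")
    if is_register:
        # All six bits select one of the sixty-four supplementary registers.
        return Operand(True, field, None, None)
    # Assumption: the first three bits are the high-order bits of the field.
    return Operand(False, None, field >> 3, field & 0b111)
```

Because the source and destination fields are decoded by the same rule, any mode can appear on either side, which is what makes the encoding orthogonal.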
How this works may be made clearer by the diagram below:

The vector instructions with ten-bit opcodes shown above all involve the vector registers, which hold at most 64 items. They therefore include a word that indicates whether a mask register is used, and that contains two fields giving the first and the last of the 64 element positions in the vector register that are to be used.
When the three-bit opcode field in the second halfword of the instruction is zero, the four-bit opcode field and the three-bit type field in the first halfword of the instruction indicate an operation in these instruction formats in the same way as they do for the register-to-register instructions as described on a previous page.
For other values of the three-bit opcode field in the second halfword of the instruction, the additional opcodes are:
101x1x x1xxxx SFSWH   Simple Floating Swap Halfword
103x1x x1xxxx SFCH    Simple Floating Compare Halfword
105x1x x1xxxx SFLH    Simple Floating Load Halfword
107x1x x1xxxx SFSTH   Simple Floating Store Halfword
111x1x x1xxxx SFAH    Simple Floating Add Halfword
113x1x x1xxxx SFSH    Simple Floating Subtract Halfword
115x1x x1xxxx SFMH    Simple Floating Multiply Halfword
117x1x x1xxxx SFDH    Simple Floating Divide Halfword
121x1x x1xxxx SFMEUH  Simple Floating Multiply Extensibly Unnormalized Halfword
123x1x x1xxxx SFDEUH  Simple Floating Divide Extensibly Unnormalized Halfword
125x1x x1xxxx SFLUH   Simple Floating Load Unnormalized Halfword
127x1x x1xxxx SFSTUH  Simple Floating Store Unnormalized Halfword
131x1x x1xxxx SFAUH   Simple Floating Add Unnormalized Halfword
133x1x x1xxxx SFSUH   Simple Floating Subtract Unnormalized Halfword
135x1x x1xxxx SFMUH   Simple Floating Multiply Unnormalized Halfword
137x1x x1xxxx SFDUH   Simple Floating Divide Unnormalized Halfword

101x2x x1xxxx SFSW    Simple Floating Swap
103x2x x1xxxx SFC     Simple Floating Compare
105x2x x1xxxx SFL     Simple Floating Load
107x2x x1xxxx SFST    Simple Floating Store
111x2x x1xxxx SFA     Simple Floating Add
113x2x x1xxxx SFS     Simple Floating Subtract
115x2x x1xxxx SFM     Simple Floating Multiply
117x2x x1xxxx SFD     Simple Floating Divide
121x2x x1xxxx SFMEU   Simple Floating Multiply Extensibly Unnormalized
123x2x x1xxxx SFDEU   Simple Floating Divide Extensibly Unnormalized
125x2x x1xxxx SFLU    Simple Floating Load Unnormalized
127x2x x1xxxx SFSTU   Simple Floating Store Unnormalized
131x2x x1xxxx SFAU    Simple Floating Add Unnormalized
133x2x x1xxxx SFSU    Simple Floating Subtract Unnormalized
135x2x x1xxxx SFMU    Simple Floating Multiply Unnormalized
137x2x x1xxxx SFDU    Simple Floating Divide Unnormalized

101x3x x1xxxx SFSWL   Simple Floating Swap Long
103x3x x1xxxx SFCL    Simple Floating Compare Long
105x3x x1xxxx SFLL    Simple Floating Load Long
107x3x x1xxxx SFSTL   Simple Floating Store Long
111x3x x1xxxx SFAL    Simple Floating Add Long
113x3x x1xxxx SFSL    Simple Floating Subtract Long
115x3x x1xxxx SFML    Simple Floating Multiply Long
117x3x x1xxxx SFDL    Simple Floating Divide Long
121x3x x1xxxx SFMEUL  Simple Floating Multiply Extensibly Unnormalized Long
123x3x x1xxxx SFDEUL  Simple Floating Divide Extensibly Unnormalized Long
125x3x x1xxxx SFLUL   Simple Floating Load Unnormalized Long
127x3x x1xxxx SFSTUL  Simple Floating Store Unnormalized Long
131x3x x1xxxx SFAUL   Simple Floating Add Unnormalized Long
133x3x x1xxxx SFSUL   Simple Floating Subtract Unnormalized Long
135x3x x1xxxx SFMUL   Simple Floating Multiply Unnormalized Long
137x3x x1xxxx SFDUL   Simple Floating Divide Unnormalized Long

103x4x x1xxxx RPC     Register Packed Compare
105x4x x1xxxx RPME    Register Packed Multiply Extensibly
107x4x x1xxxx RPDE    Register Packed Divide Extensibly
111x4x x1xxxx RPA     Register Packed Add
113x4x x1xxxx RPS     Register Packed Subtract
115x4x x1xxxx RPM     Register Packed Multiply
117x4x x1xxxx RPD     Register Packed Divide
123x4x x1xxxx RPCL    Register Packed Compare Long
125x4x x1xxxx RPMEL   Register Packed Multiply Extensibly Long
127x4x x1xxxx RPDEL   Register Packed Divide Extensibly Long
131x4x x1xxxx RPAL    Register Packed Add Long
133x4x x1xxxx RPSL    Register Packed Subtract Long
135x4x x1xxxx RPML    Register Packed Multiply Long
137x4x x1xxxx RPDL    Register Packed Divide Long

103x5x x1xxxx RCDC    Register Compressed Decimal Compare
105x5x x1xxxx RCDME   Register Compressed Decimal Multiply Extensibly
107x5x x1xxxx RCDDE   Register Compressed Decimal Divide Extensibly
111x5x x1xxxx RCDA    Register Compressed Decimal Add
113x5x x1xxxx RCDS    Register Compressed Decimal Subtract
115x5x x1xxxx RCDM    Register Compressed Decimal Multiply
117x5x x1xxxx RCDD    Register Compressed Decimal Divide
123x5x x1xxxx RCDCL   Register Compressed Decimal Compare Long
125x5x x1xxxx RCDMEL  Register Compressed Decimal Multiply Extensibly Long
127x5x x1xxxx RCDDEL  Register Compressed Decimal Divide Extensibly Long
131x5x x1xxxx RCDAL   Register Compressed Decimal Add Long
133x5x x1xxxx RCDSL   Register Compressed Decimal Subtract Long
135x5x x1xxxx RCDML   Register Compressed Decimal Multiply Long
137x5x x1xxxx RCDDL   Register Compressed Decimal Divide Long

101x6x x1xxxx SWMDE   Swap Medium Decimal Exponent
103x6x x1xxxx CMDE    Compare Medium Decimal Exponent
105x6x x1xxxx LMDE    Load Medium Decimal Exponent
107x6x x1xxxx STMDE   Store Medium Decimal Exponent
111x6x x1xxxx AMDE    Add Medium Decimal Exponent
113x6x x1xxxx SMDE    Subtract Medium Decimal Exponent
115x6x x1xxxx MMDE    Multiply Medium Decimal Exponent
117x6x x1xxxx DMDE    Divide Medium Decimal Exponent
125x6x x1xxxx LUMDE   Load Unnormalized Medium Decimal Exponent
127x6x x1xxxx STUMDE  Store Unnormalized Medium Decimal Exponent
131x6x x1xxxx AUMDE   Add Unnormalized Medium Decimal Exponent
133x6x x1xxxx SUMDE   Subtract Unnormalized Medium Decimal Exponent
135x6x x1xxxx MUMDE   Multiply Unnormalized Medium Decimal Exponent
137x6x x1xxxx DUMDE   Divide Unnormalized Medium Decimal Exponent

101x7x x1xxxx SWDDE   Swap Double Decimal Exponent
103x7x x1xxxx CDDE    Compare Double Decimal Exponent
105x7x x1xxxx LDDE    Load Double Decimal Exponent
107x7x x1xxxx STDDE   Store Double Decimal Exponent
111x7x x1xxxx ADDE    Add Double Decimal Exponent
113x7x x1xxxx SDDE    Subtract Double Decimal Exponent
115x7x x1xxxx MDDE    Multiply Double Decimal Exponent
117x7x x1xxxx DDDE    Divide Double Decimal Exponent
125x7x x1xxxx LUDDE   Load Unnormalized Double Decimal Exponent
127x7x x1xxxx STUDDE  Store Unnormalized Double Decimal Exponent
131x7x x1xxxx AUDDE   Add Unnormalized Double Decimal Exponent
133x7x x1xxxx SUDDE   Subtract Unnormalized Double Decimal Exponent
135x7x x1xxxx MUDDE   Multiply Unnormalized Double Decimal Exponent
137x7x x1xxxx DUDDE   Divide Unnormalized Double Decimal Exponent
An alternative format of vector instruction for memory-to-memory vector operations also exists, having a 13-bit opcode and a 16-bit length field. Two bits in the instruction indicate whether the source or the operand (the number divided by in division, the number subtracted in subtraction) is a scalar instead of a vector (these bits are labelled C for constant), and three bits indicate whether a stride is present for each of the operands. The 16-bit fields giving the strides are located in destination, operand, and source order after the length field and before the address fields. This format is illustrated below:

As yet, no opcodes are defined which are only available with the longer opcode field of this instruction format, however.
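The ordering of the optional stride fields in the memory-to-memory format can be sketched as below. This is only an illustration: the function names are hypothetical, the strides are assumed to be signed 16-bit values (as the stride is in the Long Vector Memory Reference with Stride mode), and the input is simply the sequence of halfwords that follows the length field.

```python
def to_signed16(value: int) -> int:
    """Interpret a 16-bit field as a two's complement signed number (assumption)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def read_stride_fields(halfwords, dest_stride, operand_stride, source_stride):
    """Pull the stride fields that are present, in destination, operand, source order.

    halfwords: the 16-bit fields that follow the length field.
    Returns the strides found and the halfwords left over (the address fields).
    """
    remaining = iter(halfwords)
    strides = {}
    for name, present in (("destination", dest_stride),
                          ("operand", operand_stride),
                          ("source", source_stride)):
        if present:
            strides[name] = to_signed16(next(remaining))
    return strides, list(remaining)
```

For example, with stride bits set for the destination and the source only, the first halfword after the length field is the destination stride and the second is the source stride; everything after that belongs to the address fields.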
The functions of some of the addressing modes illustrated in the diagrams above are:
Vector Scratchpad: in this instruction format, the source and the destination are both found among the sixty-four vector scratchpad registers.
Scratchpad to Vector Scratchpad: the source operand is one of the supplementary registers, and the destination operand is one of the sixty-four vector scratchpad registers.
Vector Register to Vector Scratchpad: the source operand is one of the eight vector registers, and the destination operand is one of the sixty-four vector scratchpad registers.
Long Vector Long Memory Reference: the source operand is one of the eight vector registers, and the destination operand is a vector in memory. This is a vector operation, and a range of the 64 elements in the vector scratchpad register used is indicated, together with an optional mask, found in the register indicated by the mR field, if the M bit is one. When the range indicates a vector of fewer than 64 elements, the starting and ending elements indicate which elements of the vector scratchpad register are used, while the operand in memory is simply a vector of fewer than 64 elements which starts at the effective address. Elements of the vector that are to be ignored due to the use of the mask register, however, remain in their assigned positions within the vector, which may begin with an ignored element, whether it is in a vector scratchpad register or in memory.
Long Vector Indexed: this is the indexed form of the Long Vector Long Memory Reference mode described above.
Long Vector Memory Reference: again, the source operand is one of the eight vector registers, and the destination operand is a vector in memory. Here, one of the Address/Base registers is used as the base register, and the displacement is 16 bits in length.
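The range and mask behaviour described for these memory-reference modes can be modelled in a few lines. The sketch below is one reading of the text, with hypothetical names: memory is modelled as a list of elements rather than bytes, and mask bit i is assumed to govern register element i.

```python
def store_vector_range(register, first, last, memory, address, mask=None):
    """Store register elements first..last to memory starting at address.

    The memory operand is compact: register element `first` goes to the
    effective address itself. A masked-off element is skipped, but it still
    occupies its slot, so later elements do not move up to fill the gap.
    """
    for offset, index in enumerate(range(first, last + 1)):
        if mask is None or mask[index]:
            memory[address + offset] = register[index]


# Usage: store elements 10..13 with element 10 masked off.
register = list(range(100, 164))   # 64 elements holding 100..163
memory = [None] * 16
mask = [True] * 64
mask[10] = False
store_vector_range(register, 10, 13, memory, 2, mask)
```

After this call the slot at the effective address is left untouched, and elements 11 to 13 land in the three slots that follow it, which illustrates a vector that begins with an ignored element.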
Long Vector Memory Reference with Stride represents a partial implementation of another feature found on Cray supercomputers.
The stride is a signed 16-bit field, giving the displacement between successive vector elements in memory. If the stride field contains a 1, the instruction is a conventional vector operation, thus, the displacement is in units of operand size, not bytes; if it is a zero, the memory operand is a scalar value.
The purpose of this is to facilitate matrix multiplication. Since a nonunit stride will lead to extra memory accesses in most implementations, the optimal way to perform matrix multiplication will be to load each column of the left matrix into the vector registers or vector scratchpad in turn, which requires an operation with nonunit stride, and then to perform successive multiply and accumulate operations on that operand in the register space with the rows of the right matrix, which are accessed using unit stride.
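The stride rule and the matrix multiplication scheme just outlined can be sketched together. This is an illustrative model rather than the instruction sequence itself: the function names are hypothetical, matrices are assumed to be stored in row-major order as flat lists, and the strided load works in units of elements, as the stride field does.

```python
def element_addresses(base, stride, operand_size, count):
    """Byte addresses of successive elements; the stride counts operands, not bytes.

    A stride of 1 gives a conventional vector, and a stride of 0 makes every
    access hit the same location, so the memory operand acts as a scalar.
    """
    return [base + i * stride * operand_size for i in range(count)]


def load_strided(memory, start, stride, count):
    """Gather `count` elements from a flat list, `stride` elements apart."""
    return [memory[start + i * stride] for i in range(count)]


def matmul(a, b, m, n, p):
    """Multiply an m x n matrix by an n x p matrix, both flat and row-major."""
    c = [0] * (m * p)
    for k in range(n):
        column = load_strided(a, k, n, m)    # column k of the left matrix: stride n
        row = load_strided(b, k * p, 1, p)   # row k of the right matrix: unit stride
        for i in range(m):
            for j in range(p):               # multiply and accumulate along the row
                c[i * p + j] += column[i] * row[j]
    return c
```

Only the column loads use a nonunit stride, once per column of the left matrix; every access to the right matrix and to the result runs along a row.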
Note that the base register field may be zero, indicating Long Vector Long Memory Reference with Stride or Long Vector Indexed with Stride as well.
Note that the different possibilities for the source operand have been illustrated in the modes described above. The destination operand may also be varied between any of the types shown. Two examples to illustrate this are shown.
Vector Scratchpad to Vector Register: here, one of the sixty-four vector scratchpad registers is the source, and one of the eight vector registers is the destination.
Vector Register: this vector operation has one of the eight vector registers as its source and destination operands.
Finally, one mode is shown illustrating how the halfword giving the range of registers used for a vector, and indicating masking, may be omitted if it is not required.
Unmasked Long Vector: in the form of this mode illustrated, the source and destination are each one of the sixty-four long vector scratchpad registers. Again, the source and destination operands may be replaced with operands of any of the types shown above, so that the scratchpad registers, one of the long vector registers, or a vector in memory (with or without stride, by means of memory reference, long memory reference, or indexed access) may be used.
Because the sixty-four supplementary arithmetic/index registers are 64 bits long, rather than 32 bits long, in order that vector operations on the long type are possible, these additional opcodes:
xx1000xxxx010xxx x000 I   Insert
xx1010xxxx010xxx x000 UL  Unsigned Load
are defined for the vector addressing modes and those scalar addressing modes having a supplementary register as their destination register, controlling sign extension for loading a 32-bit value into a 64-bit register.
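The difference between the loads of a 32-bit value into a 64-bit supplementary register can be illustrated as follows. The function names are hypothetical, and the behaviour shown for Insert (replacing the low 32 bits and leaving the upper 32 bits alone) is an assumption based on the usual meaning of that instruction name; the text above only says that these opcodes control sign extension.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def load_signed(value32: int) -> int:
    """Ordinary load: copy bit 31 of the value into the upper 32 bits."""
    value32 &= MASK32
    if value32 & 0x80000000:
        return value32 | (MASK64 ^ MASK32)
    return value32


def load_unsigned(value32: int) -> int:
    """Unsigned Load: the upper 32 bits of the register become zero."""
    return value32 & MASK32


def insert(old64: int, value32: int) -> int:
    """Insert (assumed behaviour): the upper 32 bits of the register are kept."""
    return (old64 & (MASK64 ^ MASK32)) | (value32 & MASK32)
```

For the value 0x80000000, the ordinary load yields 0xFFFFFFFF80000000 while the unsigned load yields 0x0000000080000000; for values with bit 31 clear the two agree.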
The short vector instructions are designed to be implemented by means of an arithmetic unit with a 256-bit wide register that can be partitioned into multiple areas, and in which all operations are carried out in parallel.
The long vector instructions, on the other hand, follow the principles used in computers where there is only a single conventional ALU, but which can still operate on several operands concurrently because it is pipelined. As a result, the Medium floating-point format is allowed with long vector instructions.
It is envisaged that the short vector arithmetic unit will be pipelined also, and, thus, in order that the long vector instructions, when available, will result in more arithmetic operations in a given time than the short vector instructions, it will be necessary to implement long vector operations with some degree of parallelism.
When parallelism is employed, special interconnections are required to permit the long vector arithmetic units to work not only on groups of consecutive double-precision floating-point numbers or consecutive long integers, which correspond to the most likely width of the data path between the long vector arithmetic units and the cache, but also on the same number of consecutive bytes, the same number of consecutive halfwords, the same number of consecutive integer or floating-point numbers, or, for that matter, the same number of consecutive extended (or quad) precision floating-point numbers.
Eight consecutive bytes, four consecutive 16-bit halfwords, and two consecutive 32-bit integers or floating-point numbers all occupy the same area in memory as one 64-bit double precision floating-point number or long integer, so circuitry is required to send them to consecutive arithmetic-logic units instead of to the same one.
The degree of parallelism provided can vary from one implementation to another. If complete parallelism for a long vector instruction is provided, by means of a bank of sixty-four arithmetic-logic units for long vector operations, some further additional circuitry, to allow MIMD parallel computing as well as SIMD parallel computing, will permit the execution of the instructions defined for this mode described in the section on cache-internal parallel computing.
One possible way to implement the capability of a parallel long vector arithmetic unit to operate on operands of differing size as though it were merely a single, but deeply-pipelined, ALU, while limiting the amount of interconnecting circuitry required is shown below:

A circuit that simply rearranges the 512 bytes contained in sixty-four data words of 64 bits each from
0 1 2 3 4 5 .... 508 509 510 511
to
0 256 1 257 2 258 .... 254 510 255 511
will, when applied once, move the bytes in consecutive 32-bit operands to consecutive 64-bit data words, thus delivering them to the appropriate ALU, and, if applied twice, will move the bytes in consecutive 16-bit operands to consecutive ALUs with a 64-bit path to cache memory, and if applied three times will deliver consecutive bytes to consecutive ALUs.
But the bytes will not be in the right positions; this can be dealt with by using a circuit with a function that is partially the inverse of this function, but which operates locally on the eight bytes dealt with by each single ALU, that is, rearranging the eight bytes in each of the sixty-four groups of eight bytes from:
0 1 2 3 4 5 6 7
to
0 2 4 6 1 3 5 7
Inverse versions of these two circuits will also be required for returning values to memory.
Given a 64-bit path into each ALU, so that doubleword operands proceed directly from cache into the ALU, the operations shown above for different sizes of operands are:
For word operands, one global scatter operation, followed by one local gather operation.
For halfword operands, two consecutive global scatter operations, followed by two consecutive local gather operations.
For byte operands, three consecutive global scatter operations.
Then, a shift step, if necessary, selects which of the two groups of words, or which of the four groups of halfwords, or which of the eight groups of bytes, in a group of doublewords, is operated on in the bank of ALUs, and finally a masking step removes the unused inputs.
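The scatter and gather steps listed above can be checked with a small model of the 512 bytes held in sixty-four 64-bit data words. This is a sketch under stated assumptions, with hypothetical function names: the global rearrangement is taken to be a perfect shuffle that interleaves the two 256-byte halves, which is the rearrangement for which the stated step counts work out, and the shift and masking steps are left out.

```python
def global_scatter(data):
    """Interleave the first and second halves: 0, half, 1, half + 1, ..."""
    half = len(data) // 2
    out = []
    for i in range(half):
        out.append(data[i])
        out.append(data[half + i])
    return out


def local_gather(data):
    """Within each group of eight bytes, reorder 0 1 2 3 4 5 6 7 to 0 2 4 6 1 3 5 7."""
    order = (0, 2, 4, 6, 1, 3, 5, 7)
    out = []
    for start in range(0, len(data), 8):
        group = data[start:start + 8]
        out.extend(group[k] for k in order)
    return out


# Operand size in bytes -> (global scatter steps, local gather steps).
STEPS = {8: (0, 0), 4: (1, 1), 2: (2, 2), 1: (3, 0)}


def distribute(data, operand_bytes):
    """Route consecutive operands of the given size to consecutive ALUs."""
    scatters, gathers = STEPS[operand_bytes]
    for _ in range(scatters):
        data = global_scatter(data)
    for _ in range(gathers):
        data = local_gather(data)
    return data


def alu_input(data, alu):
    """The eight bytes delivered to one ALU."""
    return data[8 * alu:8 * alu + 8]
```

With the bytes numbered 0 to 511, the word case delivers bytes 0 to 3 and 256 to 259 to the first ALU, that is, words 0 and 64, and the next ALU receives words 1 and 65. The shift step then selects which of the two words is operated on.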
Also, it may be noted that it is not necessary for a long vector operand to be aligned on a boundary representing 64 of the data items of which it consists. The circuitry required to deal with this expeditiously is also useful for more general inter-ALU communications, and is therefore described in a subsequent section concerning a MIMD capability obtained by adding a simple control unit to each ALU.
Note that this type of circuitry appears inconsistent with the use of the Medium floating-point type with long vector instructions. However, the circuitry provided for data memory width control can be used to permit efficient operation on this floating-point type for some combinations of floating-point format and memory width, as follows:
If the memory width provides normal 32-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 24-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 40-bit words.
If the memory width provides 24-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 36-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.
If the memory width provides 40-bit words, for formats other than the Standard format the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.
For all other combinations of memory width and floating-point format, the Medium floating-point type may not be used with long vector instructions.
Note also that values in the Medium floating-point type need to be aligned on 16-bit boundaries, not 48-bit boundaries, and thus caching to deal with these values may involve an offset that would not occur if a memory width of 48 bits is applied to all memory beginning with location zero; however, such offsets can also be used when the memory width is changed normally; the actual restriction imposed on the address associated with a cache line is that it must be a multiple of the width of the data bus to main memory.