
Long Vector Instructions

In addition to long vector instructions, a complete set of addressing modes which allow the use of the 64 supplementary registers for scalar instructions is provided. Some of the possible modes are illustrated below:

In addition to an addressing mode in which the eight scratchpad registers serve as base registers, each pointing to an area of memory which may contain up to 33,554,432 bytes because the displacement is 25 bits long, a mode is provided in which the sixty-four supplementary registers are used for both the base register and the index register. In this mode the displacement is 19 bits long, so each of these base registers indicates an area containing up to 524,288 bytes.

One of the uses of this addressing mode is to avoid the need to switch to general register mode when it is desired to translate programs directly from the assembly language of architectures whose general registers serve as base registers, as arithmetic accumulators, and as index registers.

Another useful feature is that the long memory reference instruction allows unaligned memory-reference instructions, which cannot be expressed in the normal short memory-reference format of this mode.

When the first bit of the second halfword is a one, we proceed to the long vector instructions, which are similar in format.

The intent is that two bits in the first halfword indicate whether the source and destination operands are supplementary registers. If they are not, the first three bits of the six-bit source or destination field indicate the addressing mode for that operand in an orthogonal manner, so that any of the addressing modes illustrated here in the most common combinations can be used for either the source or the destination operand. This allows, for example, memory-to-memory instructions where both the source and destination operands are specified in the long indexed format.

How this works may be made clearer by the diagram below:

The vector instructions with ten-bit opcodes shown above all involve the vector registers, which contain at most 64 items. Thus, they include a word indicating whether a mask register is used, and containing two fields indicating the first and the last of the positions within a 64-element vector register that are to be used.

When the three-bit opcode field in the second halfword of the instruction is zero, the four-bit opcode field and the three-bit type field in the first halfword of the instruction indicate an operation in these instruction formats in the same way as they do for the register-to-register instructions as described on a previous page.

For other values of the three-bit opcode field in the second halfword of the instruction, the additional opcodes are:

101x1x x1xxxx  SFSWH  Simple Floating Swap Halfword
103x1x x1xxxx  SFCH   Simple Floating Compare Halfword
105x1x x1xxxx  SFLH   Simple Floating Load Halfword
107x1x x1xxxx  SFSTH  Simple Floating Store Halfword
111x1x x1xxxx  SFAH   Simple Floating Add Halfword
113x1x x1xxxx  SFSH   Simple Floating Subtract Halfword
115x1x x1xxxx  SFMH   Simple Floating Multiply Halfword
117x1x x1xxxx  SFDH   Simple Floating Divide Halfword

121x1x x1xxxx  SFMEUH Simple Floating Multiply Extensibly Unnormalized Halfword
123x1x x1xxxx  SFDEUH Simple Floating Divide Extensibly Unnormalized Halfword
125x1x x1xxxx  SFLUH  Simple Floating Load Unnormalized Halfword
127x1x x1xxxx  SFSTUH Simple Floating Store Unnormalized Halfword
131x1x x1xxxx  SFAUH  Simple Floating Add Unnormalized Halfword
133x1x x1xxxx  SFSUH  Simple Floating Subtract Unnormalized Halfword
135x1x x1xxxx  SFMUH  Simple Floating Multiply Unnormalized Halfword
137x1x x1xxxx  SFDUH  Simple Floating Divide Unnormalized Halfword

101x2x x1xxxx  SFSW   Simple Floating Swap
103x2x x1xxxx  SFC    Simple Floating Compare
105x2x x1xxxx  SFL    Simple Floating Load
107x2x x1xxxx  SFST   Simple Floating Store
111x2x x1xxxx  SFA    Simple Floating Add
113x2x x1xxxx  SFS    Simple Floating Subtract
115x2x x1xxxx  SFM    Simple Floating Multiply
117x2x x1xxxx  SFD    Simple Floating Divide

121x2x x1xxxx  SFMEU  Simple Floating Multiply Extensibly Unnormalized
123x2x x1xxxx  SFDEU  Simple Floating Divide Extensibly Unnormalized
125x2x x1xxxx  SFLU   Simple Floating Load Unnormalized
127x2x x1xxxx  SFSTU  Simple Floating Store Unnormalized
131x2x x1xxxx  SFAU   Simple Floating Add Unnormalized
133x2x x1xxxx  SFSU   Simple Floating Subtract Unnormalized
135x2x x1xxxx  SFMU   Simple Floating Multiply Unnormalized
137x2x x1xxxx  SFDU   Simple Floating Divide Unnormalized

101x3x x1xxxx  SFSWL  Simple Floating Swap Long
103x3x x1xxxx  SFCL   Simple Floating Compare Long
105x3x x1xxxx  SFLL   Simple Floating Load Long
107x3x x1xxxx  SFSTL  Simple Floating Store Long
111x3x x1xxxx  SFAL   Simple Floating Add Long
113x3x x1xxxx  SFSL   Simple Floating Subtract Long
115x3x x1xxxx  SFML   Simple Floating Multiply Long
117x3x x1xxxx  SFDL   Simple Floating Divide Long

121x3x x1xxxx  SFMEUL Simple Floating Multiply Extensibly Unnormalized Long
123x3x x1xxxx  SFDEUL Simple Floating Divide Extensibly Unnormalized Long
125x3x x1xxxx  SFLUL  Simple Floating Load Unnormalized Long
127x3x x1xxxx  SFSTUL Simple Floating Store Unnormalized Long
131x3x x1xxxx  SFAUL  Simple Floating Add Unnormalized Long
133x3x x1xxxx  SFSUL  Simple Floating Subtract Unnormalized Long
135x3x x1xxxx  SFMUL  Simple Floating Multiply Unnormalized Long
137x3x x1xxxx  SFDUL  Simple Floating Divide Unnormalized Long

103x4x x1xxxx  RPC    Register Packed Compare
105x4x x1xxxx  RPME   Register Packed Multiply Extensibly
107x4x x1xxxx  RPDE   Register Packed Divide Extensibly
111x4x x1xxxx  RPA    Register Packed Add
113x4x x1xxxx  RPS    Register Packed Subtract
115x4x x1xxxx  RPM    Register Packed Multiply
117x4x x1xxxx  RPD    Register Packed Divide

123x4x x1xxxx  RPCL   Register Packed Compare Long
125x4x x1xxxx  RPMEL  Register Packed Multiply Extensibly Long
127x4x x1xxxx  RPDEL  Register Packed Divide Extensibly Long
131x4x x1xxxx  RPAL   Register Packed Add Long
133x4x x1xxxx  RPSL   Register Packed Subtract Long
135x4x x1xxxx  RPML   Register Packed Multiply Long
137x4x x1xxxx  RPDL   Register Packed Divide Long

103x5x x1xxxx  RCDC   Register Compressed Decimal Compare
105x5x x1xxxx  RCDME  Register Compressed Decimal Multiply Extensibly
107x5x x1xxxx  RCDDE  Register Compressed Decimal Divide Extensibly
111x5x x1xxxx  RCDA   Register Compressed Decimal Add
113x5x x1xxxx  RCDS   Register Compressed Decimal Subtract
115x5x x1xxxx  RCDM   Register Compressed Decimal Multiply
117x5x x1xxxx  RCDD   Register Compressed Decimal Divide

123x5x x1xxxx  RCDCL  Register Compressed Decimal Compare Long
125x5x x1xxxx  RCDMEL Register Compressed Decimal Multiply Extensibly Long
127x5x x1xxxx  RCDDEL Register Compressed Decimal Divide Extensibly Long
131x5x x1xxxx  RCDAL  Register Compressed Decimal Add Long
133x5x x1xxxx  RCDSL  Register Compressed Decimal Subtract Long
135x5x x1xxxx  RCDML  Register Compressed Decimal Multiply Long
137x5x x1xxxx  RCDDL  Register Compressed Decimal Divide Long

101x6x x1xxxx  SWMDE  Swap Medium Decimal Exponent
103x6x x1xxxx  CMDE   Compare Medium Decimal Exponent
105x6x x1xxxx  LMDE   Load Medium Decimal Exponent
107x6x x1xxxx  STMDE  Store Medium Decimal Exponent
111x6x x1xxxx  AMDE   Add Medium Decimal Exponent
113x6x x1xxxx  SMDE   Subtract Medium Decimal Exponent
115x6x x1xxxx  MMDE   Multiply Medium Decimal Exponent
117x6x x1xxxx  DMDE   Divide Medium Decimal Exponent

125x6x x1xxxx  LUMDE  Load Unnormalized Medium Decimal Exponent
127x6x x1xxxx  STUMDE Store Unnormalized Medium Decimal Exponent
131x6x x1xxxx  AUMDE  Add Unnormalized Medium Decimal Exponent
133x6x x1xxxx  SUMDE  Subtract Unnormalized Medium Decimal Exponent
135x6x x1xxxx  MUMDE  Multiply Unnormalized Medium Decimal Exponent
137x6x x1xxxx  DUMDE  Divide Unnormalized Medium Decimal Exponent

101x7x x1xxxx  SWDDE  Swap Double Decimal Exponent
103x7x x1xxxx  CDDE   Compare Double Decimal Exponent
105x7x x1xxxx  LDDE   Load Double Decimal Exponent
107x7x x1xxxx  STDDE  Store Double Decimal Exponent
111x7x x1xxxx  ADDE   Add Double Decimal Exponent
113x7x x1xxxx  SDDE   Subtract Double Decimal Exponent
115x7x x1xxxx  MDDE   Multiply Double Decimal Exponent
117x7x x1xxxx  DDDE   Divide Double Decimal Exponent

125x7x x1xxxx  LUDDE  Load Unnormalized Double Decimal Exponent
127x7x x1xxxx  STUDDE Store Unnormalized Double Decimal Exponent
131x7x x1xxxx  AUDDE  Add Unnormalized Double Decimal Exponent
133x7x x1xxxx  SUDDE  Subtract Unnormalized Double Decimal Exponent
135x7x x1xxxx  MUDDE  Multiply Unnormalized Double Decimal Exponent
137x7x x1xxxx  DUDDE  Divide Unnormalized Double Decimal Exponent

An alternative format of vector instruction for memory-to-memory vector operations also exists, having a 13-bit opcode and a 16-bit length field. Two bits in the instruction indicate whether the source or operand (the number divided by in division, the number subtracted in subtraction) operands are scalars instead of vectors (these bits are labelled C for constant), and three bits indicate whether a stride is present for each of the operands. The 16-bit fields giving the stride are located in destination, operand, and source order after the length field and before the address fields. This format is illustrated below:

As yet, no opcodes are defined which are only available with the longer opcode field of this instruction format, however.

The functions of some of the addressing modes illustrated in the diagrams above are:

Vector Scratchpad: in this instruction format, the source and the destination are both found among the sixty-four vector scratchpad registers.

Scratchpad to Vector Scratchpad: the source operand is one of the supplementary registers, and the destination operand is one of the sixty-four vector scratchpad registers.

Vector Register to Vector Scratchpad: the source operand is one of the eight vector registers, and the destination operand is one of the sixty-four vector scratchpad registers.

Long Vector Long Memory Reference: the source operand is one of the eight vector registers, and the destination operand is a vector in memory. This is a vector operation, and a range of the 64 elements in the vector scratchpad register used is indicated, together with an optional mask, if the M bit is one, found in the register indicated by the mR field. When a range is used to indicate that a vector of fewer than 64 elements is involved, the starting and ending elements indicate which elements of the vector scratchpad register are used, while the operand in memory is simply a vector of fewer than 64 elements which starts at the effective address. Elements of the vector that are to be ignored due to the use of the mask register, however, are in their assigned positions within the vector, which may begin with an ignored element, whether it is in a vector scratchpad register or in memory.
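The interaction of the range and the mask described above can be sketched in Python. This is only an illustration under stated assumptions: the function name store_vector_range is hypothetical, memory is modelled as a list indexed in units of one element, and the mask is taken to be indexed by register element position, which the text does not state explicitly.

```python
VECTOR_LENGTH = 64  # elements per vector register, as stated in the text


def store_vector_range(register, memory, address, first, last, mask=None):
    """Store register[first..last] to memory starting at `address`.

    Assumptions (not from the source): memory is a list indexed in units of
    one element, and mask[i] is True when register element i takes part.
    """
    if len(register) != VECTOR_LENGTH:
        raise ValueError("a vector register holds 64 elements")
    if not 0 <= first <= last < VECTOR_LENGTH:
        raise ValueError("range must lie within the register")
    for offset, position in enumerate(range(first, last + 1)):
        # The memory operand is packed: element `first` lands at `address`.
        if mask is None or mask[position]:
            memory[address + offset] = register[position]
        # A masked-off element still occupies its slot; it is only skipped.
    return memory


register = list(range(100, 164))
memory = [0] * 16
odd_only = [i % 2 == 1 for i in range(VECTOR_LENGTH)]
store_vector_range(register, memory, 4, 10, 13, odd_only)
print(memory[4:8])  # [0, 111, 0, 113]
```

Note that the first memory slot is left untouched here because element 10 is masked off: the vector in memory begins with an ignored element, rather than being compacted.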

Long Vector Indexed: this is the indexed form of the Long Vector Long Memory Reference mode described above.

Long Vector Memory Reference: again, the source operand is one of the eight vector registers, and the destination operand is a vector in memory. Here, one of the Address/Base registers is used as the base register, and the displacement is 16 bits in length.

Long Vector Memory Reference with Stride represents a partial implementation of another feature found on Cray supercomputers.

The stride is a signed 16-bit field, giving the displacement between successive vector elements in memory in units of operand size, not bytes. Thus, if the stride field contains a 1, the instruction is a conventional vector operation; if it contains a zero, the memory operand is a scalar value.

The purpose of this is to facilitate matrix multiplications. Since a nonunit stride will lead to extra memory accesses in most implementations, the optimal way in which to perform matrix multiplication will be to load each column of the left matrix into the vector registers or vector scratchpad in turn, requiring an operation with nonunit stride, and then to perform successive multiply and accumulate operations involving that operand in the register space with the rows of the right matrix, which are accessed using unit stride.
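One reading of this scheme can be sketched in Python. The sketch assumes that both matrices are stored in row-major order in a flat list, and that the multiply and accumulate step scales a row of the right matrix by each element of the loaded column; the function names load_strided and matmul_by_columns are hypothetical, and this is an illustration of the access pattern rather than a model of the instructions themselves.

```python
def load_strided(memory, address, stride, count):
    """Gather `count` elements, `stride` elements apart, starting at `address`.

    As in the text, the stride is in units of operand size: a stride of 1 is
    an ordinary vector, and a stride of 0 repeats a single scalar value.
    """
    return [memory[address + i * stride] for i in range(count)]


def matmul_by_columns(a, b, rows, inner, cols):
    """Multiply row-major flat matrices a (rows x inner) and b (inner x cols)."""
    c = [0] * (rows * cols)
    for k in range(inner):
        # One nonunit-stride access per column of the left matrix.
        column = load_strided(a, k, inner, rows)
        # Row k of the right matrix is contiguous, so unit stride suffices.
        row = load_strided(b, k * cols, 1, cols)
        for i in range(rows):
            # Multiply and accumulate: column[i] times the whole row.
            for j in range(cols):
                c[i * cols + j] += column[i] * row[j]
    return c


a = [1, 2, 3,
     4, 5, 6]          # 2 x 3
b = [7, 8,
     9, 10,
     11, 12]           # 3 x 2
print(matmul_by_columns(a, b, 2, 3, 2))  # [58, 64, 139, 154]
```

Only the column loads use a nonunit stride, once per column of the left matrix; every access inside the inner accumulation loop is to contiguous data.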

Note that the base register field may be zero, indicating Long Vector Long Memory Reference with Stride or Long Vector Indexed with Stride as well.

Note that the different possibilities for the source operand have been illustrated in the modes described above. The destination operand may also be varied between any of the types shown. Two examples to illustrate this are shown.

Vector Scratchpad to Vector Register: here, one of the sixty-four vector scratchpad registers is the source, and one of the eight vector registers is the destination.

Vector Register: this vector operation has one of the eight vector registers as its source and destination operands.

Finally, one mode is shown illustrating how the halfword giving the range of registers used for a vector, and indicating masking, may be omitted if it is not required.

Unmasked Long Vector: in the form of this mode illustrated, the source and destination are each one of the sixty-four long vector scratchpad registers. Again, the source and destination operands may be replaced with operands of any of the types shown above, so the scratchpad registers, or one of the long vector registers, or a vector in memory, with or without stride, by means of memory reference, long memory reference, or indexed access may be used.

Because the sixty-four supplementary arithmetic/index registers are 64 bits long, rather than 32 bits long, in order that vector operations on the long type are possible, these additional opcodes:

xx1000xxxx010xxx x000   I     Insert
xx1010xxxx010xxx x000   UL    Unsigned Load

are defined for the vector addressing modes and those scalar addressing modes having a supplementary register as their destination register, controlling sign extension for loading a 32-bit value into a 64-bit register.
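Since the supplementary registers are 64 bits long, the difference that sign extension makes can be shown with a short Python sketch. It assumes that an ordinary load sign-extends and that Unsigned Load zero-extends, which is the natural reading of the names but is not spelled out on this page; the function names are hypothetical, and the Insert operation is deliberately not modelled because its behavior is not described here.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def load_signed(value32):
    """Sign-extend a 32-bit value to a 64-bit register image (assumed behavior)."""
    value32 &= MASK32
    if value32 & 0x80000000:
        # Copy the sign bit into the upper 32 bits.
        return value32 | (MASK64 ^ MASK32)
    return value32


def load_unsigned(value32):
    """Zero-extend a 32-bit value to a 64-bit register image (assumed behavior)."""
    return value32 & MASK32


print(hex(load_signed(0xFFFFFFFE)))    # 0xfffffffffffffffe
print(hex(load_unsigned(0xFFFFFFFE)))  # 0xfffffffe
```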

The short vector instructions are designed to be implemented by means of an arithmetic unit with a 256-bit wide register that can be partitioned into multiple areas, and in which all operations are carried out in parallel.

The long vector instructions, on the other hand, follow the principles used in computers where there is only a single conventional ALU, but which can still operate on several operands concurrently because it is pipelined. As a result, the Medium floating-point format is allowed with long vector instructions.

It is envisaged that the short vector arithmetic unit will be pipelined also, and, thus, in order that the long vector instructions, when available, will result in more arithmetic operations in a given time than the short vector instructions, it will be necessary to implement long vector operations with some degree of parallelism.

When parallelism is employed, special interconnections are required to permit the long vector arithmetic units to work not only on groups of consecutive double-precision floating-point numbers or consecutive long integers, which correspond to the most likely width of the data path between the long vector arithmetic units and the cache, but also on the same number of consecutive bytes, the same number of consecutive halfwords, the same number of consecutive integer or floating-point numbers, or, for that matter, the same number of consecutive extended (or quad) precision floating-point numbers.

Eight consecutive bytes, four consecutive 16-bit halfwords, and two consecutive 32-bit integers or floating-point numbers all occupy the same area in memory as one 64-bit double precision floating-point number or long integer, so circuitry is required to send them to consecutive arithmetic-logic units instead of to the same one.

The degree of parallelism provided can vary from one implementation to another. If complete parallelism for a long vector instruction is provided, by means of a bank of sixty-four arithmetic-logic units for long vector operations, some further additional circuitry, to allow MIMD parallel computing as well as SIMD parallel computing, will permit the execution of the instructions defined for this mode described in the section on cache-internal parallel computing.

One possible way to implement the capability of a parallel long vector arithmetic unit to operate on operands of differing size as though it were merely a single, but deeply-pipelined, ALU, while limiting the amount of interconnecting circuitry required is shown below:

A circuit that simply rearranges the 512 bytes contained in sixty-four data words of 64 bits each from

0 1 2 3 4 5 .... 508 509 510 511

to

0 256 1 257 2 258 .... 254 510 255 511

will, when applied once, move the bytes in consecutive 32-bit operands to consecutive 64-bit data words, thus delivering them to the appropriate ALU, and, if applied twice, will move the bytes in consecutive 16-bit operands to consecutive ALUs with a 64-bit path to cache memory, and if applied three times will deliver consecutive bytes to consecutive ALUs.

But the bytes will not be in the right positions; this can be dealt with by using a circuit with a function that is partially the inverse of this function, but which operates locally on the eight bytes dealt with by each single ALU, that is, rearranging the eight bytes in each of the sixty-four groups of eight bytes from:

0 1 2 3 4 5 6 7

to

0 2 4 6 1 3 5 7

Inverse versions of these two circuits will also be required for returning values to memory.

Given a 64-bit path into each ALU, so that doubleword operands proceed directly from cache into the ALU, the operations shown above for different sizes of operands are:

For word operands, one global scatter operation, followed by one local gather operation.

For halfword operands, two consecutive global scatter operations, followed by two consecutive local gather operations.

For byte operands, three consecutive global scatter operations.

Then, a shift step, if necessary, selects which of the two groups of words, or which of the four groups of halfwords, or which of the eight groups of bytes, in a group of doublewords, is operated on in the bank of ALUs, and finally a masking step removes the unused inputs.
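The scatter and gather steps listed above can be checked with a small Python sketch. It takes the global rearrangement to be the interleaving of the two 256-byte halves of the 512-byte block, an assumption chosen because it is the one under which the step counts given above deliver each operand to its own ALU; the function names are hypothetical, and the shift and masking steps are not modelled.

```python
def global_shuffle(data):
    """Interleave the two halves: 0 256 1 257 ... 255 511 for 512 bytes."""
    half = len(data) // 2
    out = []
    for i in range(half):
        out.append(data[i])
        out.append(data[half + i])
    return out


def local_gather(data):
    """Within each group of eight, reorder 0 1 2 3 4 5 6 7 to 0 2 4 6 1 3 5 7."""
    order = (0, 2, 4, 6, 1, 3, 5, 7)
    out = []
    for base in range(0, len(data), 8):
        out.extend(data[base + q] for q in order)
    return out


def deliver(data, operand_bytes):
    """Route 512 bytes so that operand j of each group reaches ALU j."""
    # (global steps, local steps) per operand size, as listed in the text.
    steps = {8: (0, 0), 4: (1, 1), 2: (2, 2), 1: (3, 0)}
    global_steps, local_steps = steps[operand_bytes]
    for _ in range(global_steps):
        data = global_shuffle(data)
    for _ in range(local_steps):
        data = local_gather(data)
    return data


block = list(range(512))           # each byte labelled with its address
print(deliver(block, 4)[0:8])      # [0, 1, 2, 3, 256, 257, 258, 259]
print(deliver(block, 2)[0:8])      # [0, 1, 128, 129, 256, 257, 384, 385]
print(deliver(block, 1)[0:8])      # [0, 64, 128, 192, 256, 320, 384, 448]
```

In each case the eight bytes arriving at ALU 0 hold operand 0 from every group, with the bytes of each operand in their original order, so the subsequent shift step only has to choose which group to operate on.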

Also, it may be noted that it is not necessary for a long vector operand to be aligned on a boundary representing 64 of the data items of which it consists. The circuitry required to deal with this expeditiously is also useful for more general inter-ALU communications, and is therefore described in a subsequent section concerning a MIMD capability obtained by adding a simple control unit to each ALU.

Note that this type of circuitry appears inconsistent with the use of the Medium floating-point type with long vector instructions. However, the circuitry provided for data memory width control can be used to permit efficient operation on this floating-point type for some combinations of floating-point format and memory width, as follows:

If the memory width provides normal 32-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 24-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 40-bit words.

If the memory width provides 24-bit words, for most formats the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 36-bit words; for the Standard format, the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.

If the memory width provides 40-bit words, for formats other than the Standard format the Medium floating-point type may be handled by caching memory containing variables of that type as if it consisted of 60-bit words.

For all other combinations of memory width and floating-point format, the Medium floating-point type may not be used with long vector instructions.

Note also that values in the Medium floating-point type need to be aligned on 16-bit boundaries, not 48-bit boundaries, and thus caching to deal with these values may involve an offset that would not occur if a memory width of 48 bits is applied to all memory beginning with location zero; however, such offsets can also be used when the memory width is changed normally; the actual restriction imposed on the address associated with a cache line is that it must be a multiple of the width of the data bus to main memory.

