This is now my seventh attempt to propose a successor to my original Concertina architecture.
I hope that this time I have found a way to achieve the goals I have set for myself while avoiding excessive complexity. The basic instruction formats for this architecture have the form:
All instructions are, or at least start out as, 32 bits in length, and the way in which they are processed is organized to be suitable for an implementation in which instructions are fetched eight at a time, in blocks of 256 bits.
The intent of this design is that any portion of a program which consists only of instructions that are 32 bits long may be executed without any overhead caused by the possibility of instructions of other lengths being present. The principle used to achieve this is as follows: when, and only when, instructions of other lengths are needed, one or more instructions in the preceding block indicate how many of the first few 32-bit instruction words in the next block are to be skipped over as containing data additional to normal 32-bit instructions. Those instructions which are longer appear in the instruction stream as normal 32-bit instructions, but contain a pointer to the additional information they require within that skipped-over part of the block.
An obvious source of inspiration for this design is the "heads and tails" scheme proposed by Heidi Pan. However, certain key elements of her idea are left out, making the instruction arrangement more conventional and yielding some advantages, at least as I perceive them.
Because the lengths of instructions and data conform to the width of physical memory, allocating space for a short field giving the number of instructions in each block would very likely force a large amount of unused space in each block. Thus, instead, I put that short field in some instructions, with the expectation that, almost all of the time, an instruction with such a field will be present in the block preceding any block not completely filled with 32-bit instructions. So I waste extra copies of the 3-bit field most of the time, instead of wasting 29 bits in almost every block. I also avoid the complexity of the alternative approach of making the basic length of an instruction, say, 31 bits, so that using three bits for a length field would be less likely to waste a large area of the block (with the data part still containing 8, 16, 32, and 64 bit items).
The sequence of standard-length instruction portions, the "heads" of the "heads and tails" format, begins here in the middle of the block, at the boundary between heads and tails, and proceeds to the end of the block, rather than beginning at the start of the block and continuing to the boundary. This has the advantage that when there is a branch to an instruction, there is no need to find the field which specifies how many instructions are in the block. This difference, of course, is necessitated by the previous difference.
Also, in the "heads and tails" scheme, since the "tail" portions of instructions are in sequence from the end of the block, there is no need for instructions to reserve space for pointers to their "tail" portions, if any. Here, there is always an explicit pointer to any additional data forming part of an instruction in addition to its primary 32 bits. This has the advantage of eliminating any need to process the instructions in a block serially; "tail" portions can be fetched and processed for any instruction independently of the others.
The instruction formats shown in the twelfth and thirteenth lines of the diagram are similar to those of many RISC architectures. There are memory-reference instructions, and register-to-register operate instructions that work with a bank of 32 registers.
The bit in the register-to-register instructions marked B, if it is zero, indicates that the instruction is guaranteed not to be dependent on the preceding instruction. This allows more rapid processing of instructions; they are considered to be grouped in blocks, where the first instruction of a block either has the B bit set, or is of a type without a B bit. Every instruction within a block must be one that can be safely executed in parallel.
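The grouping rule implied by the B bit can be sketched as follows. This is only an illustration: the function name and the list-of-bits input are hypothetical, and instructions of a type without a B bit are modelled here as if B were 1, since the text says they begin a new group.

```python
from typing import List


def split_into_groups(b_bits: List[int]) -> List[List[int]]:
    """Return lists of instruction indices that may execute in parallel.

    b_bits[i] is the B bit of instruction i. Instructions of a type with
    no B bit are modelled as B = 1 (an assumption of this sketch).
    """
    groups: List[List[int]] = []
    for i, b in enumerate(b_bits):
        if b == 1 or not groups:
            # B set (or very first instruction): start a new group.
            groups.append([i])
        else:
            # B clear: guaranteed independent of its predecessor,
            # so it joins the current group.
            groups[-1].append(i)
    return groups
```

Each inner list holds the positions of instructions that the B bits declare safe to issue together.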
The register-to-register instructions also have a bit marked C, which must be set to allow the instruction to change the condition codes. The same bit is also provided for the augmented memory-reference instructions, but in a different position.
Unlike most RISC architectures, but like the System/360, memory-reference instructions offer full base-index addressing.
There are 32 integer data registers, each one 64 bits wide, but only 8 address registers, also each 64 bits wide. As well, there are 32 floating-point data registers, each one 128 bits wide.
The base register field refers to one of the eight address registers, except that address register zero is not used as a base register.
The index register field indicates that indexing is not taking place if it contains all zeroes. Otherwise, it indicates the register that is to be used as an index register as follows:
001  Integer data register 1
010  Integer data register 2
011  Integer data register 3
100  Address register 0
101  Address register 1
110  Address register 2
111  Address register 3
Thus, some of the index registers are among the integer data registers, to allow index values to be the result of complex calculations. Since many programs will not require seven different base register values, however, some of the address registers may also be used as index registers; and, specifically, address register 0, which would otherwise not be useful, is allowed to serve as an index register.
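As a small illustration, the index register field can be decoded with a lookup table. The names used here are hypothetical; only the mapping itself comes from the list above.

```python
from typing import Optional, Tuple

# Mapping of the three-bit index field to (register bank, register number).
INDEX_REGISTER_MAP = {
    0b001: ("integer", 1),
    0b010: ("integer", 2),
    0b011: ("integer", 3),
    0b100: ("address", 0),
    0b101: ("address", 1),
    0b110: ("address", 2),
    0b111: ("address", 3),
}


def decode_index_field(field: int) -> Optional[Tuple[str, int]]:
    """Return the index register selected by a 3-bit field, or None if unindexed."""
    if not 0 <= field <= 0b111:
        raise ValueError("index field must fit in three bits")
    # 000 is absent from the map, so .get() yields None: no indexing.
    return INDEX_REGISTER_MAP.get(field)
```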
The five-bit opcodes of the memory-reference instructions begin either with 0 or with 10, leaving the two bits 11 as the first two bits of instructions of many other types, particularly the various register-to-register instructions.
All the instructions which start with 11 reserve the last three bits of their primary 32-bit portion for a three-bit field. In every 32-bit instruction slot in a 256-bit block, except the last one, the value in this field, from 0 to 7, indicates the number of instruction slots in the following block that do not contain the first 32-bit portion of an instruction, and are thus available for additional data.
In the last 32-bit instruction slot in a block, which must always contain the primary 32 bits of an instruction if the flow of execution passes through that block (areas of memory used for data, of course, need not contain any instructions), if the first two bits of those 32 bits are 11, the last three bits are the number of skipped-over instruction slots in that block itself, rather than the following one.
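A minimal sketch of how the skip fields might be read from a block is given below. It assumes that "the first two bits" are the two most significant bits of a 32-bit word and "the last three bits" the three least significant, that a block is given as a list of eight integers, and it ignores the non-instruction bit patterns that may occupy the last slot. All function names are hypothetical.

```python
from typing import List, Optional

SLOTS_PER_BLOCK = 8  # eight 32-bit slots in a 256-bit block


def has_skip_field(word: int) -> bool:
    """True if the word starts with bits 11 (assumed to be the top two bits)."""
    return ((word >> 30) & 0b11) == 0b11


def skip_for_next_block(block: List[int], own_skip: int = 0) -> int:
    """Number of leading slots of the NEXT block that hold data.

    own_skip is the number of leading slots of this block that hold data
    and so must not be examined as instructions.
    """
    for slot in range(own_skip, SLOTS_PER_BLOCK - 1):
        if has_skip_field(block[slot]):
            return block[slot] & 0b111
    return 0  # no skip field present: next block is all instructions


def skip_for_own_block(block: List[int]) -> Optional[int]:
    """Skip count for this block itself, if the last slot carries one."""
    last = block[SLOTS_PER_BLOCK - 1]
    return (last & 0b111) if has_skip_field(last) else None
```

Since all skip fields in slots other than the last are expected to hold the same value, the sketch simply returns the first one it finds.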
It should be noted, however, that there is an inherent unavoidable situation where branching to the wrong location is capable of causing problems.
It has been noted above that there may be more than one instruction with a three-bit skip field in a block, and, if so, all the skip fields should contain the same value. It should also be noted that if a block contains one or more instructions with a skip field, it is not necessarily the case that the second-last instruction in a block contains a skip field, and, as noted above, the skip field in the final instruction pertains to the current block rather than the next one.
Thus, if a block contains one or more instructions with a skip field, and these skip fields contain a nonzero value, and a branch is made to that block after the last instruction with a skip field (referring to the next block) in that block, then the entire following block will be treated as composed of normal 32-bit instructions, with none skipped over (unless, of course, that block ends in an instruction with a skip field). Therefore, to avoid undesired operation, this situation must not be allowed to happen.
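An assembler or linker could check for this hazard along the following lines. This is a sketch under assumptions: the top two bits of a word being 11 marks a skip field, the low three bits hold it, and the function name and arguments are hypothetical. It looks only at the block being entered, so it cannot see whether the following block ends in an instruction with its own skip field.

```python
from typing import List


def entry_is_safe(block: List[int], entry_slot: int, own_skip: int = 0) -> bool:
    """Check that a branch into `block` at `entry_slot` still passes a skip field.

    block      -- eight 32-bit words
    entry_slot -- slot index (0..7) of the branch target
    own_skip   -- leading slots of this block that hold data, not instructions
    """
    # Slots (excluding the last) whose instruction carries a nonzero skip field.
    nonzero = [
        slot
        for slot in range(own_skip, 7)
        if ((block[slot] >> 30) & 0b11) == 0b11 and (block[slot] & 0b111) != 0
    ]
    if not nonzero:
        return True  # the next block has nothing to skip over
    # Safe only if execution from entry_slot onward meets one of them.
    return any(slot >= entry_slot for slot in nonzero)
```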
For a memory-reference instruction, the base register field may not contain all zeroes.
The first through the eleventh lines of the diagram show how this portion of the opcode space is used instead. The first line shows how it is used to pack two instructions, each 14 bits in length, into a single 32-bit instruction slot.
Note that the first of the two 14-bit instructions is split into two pieces, and the last 12 bits of the first small instruction are placed earlier in the compound instruction than its first 2 bits. The compound instruction is split on the left, rather than on the right, so that the bits of fields containing register numbers, shift counts, or displacements remain contiguous, and only bits of the opcode are split up.
The formats of the possible 14-bit instructions are shown on the right-hand side of the diagram.
There are register-to-register two-operand instructions which use only the first eight registers in either the integer bank or the floating-point bank, shown in the first line of that section of the diagram. These instructions cannot set the condition codes; setting them must be done by a full-sized instruction.
The second through fifth lines of this portion of the diagram show the shift instructions; prefixing is used on the field giving the length of the shift so as to conserve scarce opcode space here.
The sixth line of this portion of the diagram shows the format of a conditional branch instruction with an eight-bit displacement; this signed displacement can be from -128 to +127. A value of zero refers to the address immediately following the instruction. The displacement is in units of 16 bits, and is strictly according to memory address, so values that point to the second half of a conventional 32-bit instruction, or to the first part of a block which doesn't contain instructions, have to be avoided. Obviously, doing it the other way, and skipping all invalid values, would make branching slow, as well as causing ambiguities.
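The target address calculation can be illustrated as below. The function names are hypothetical, and the address "immediately following the instruction" is passed in as a parameter rather than derived, since how it is determined for a 14-bit instruction is not spelled out here.

```python
def sign_extend_8(value: int) -> int:
    """Interpret the low eight bits of value as a signed number (-128..127)."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def short_branch_target(next_address: int, displacement_field: int) -> int:
    """Target of a short conditional branch.

    next_address       -- byte address immediately following the instruction
    displacement_field -- raw eight-bit displacement from the instruction
    The displacement counts 16-bit units, so it is doubled to get bytes.
    """
    return next_address + 2 * sign_extend_8(displacement_field)
```

With a byte-addressed memory, the reachable range is therefore 256 bytes backward to 254 bytes forward of the following address.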
The eighth line of the main diagram shows how part of the remaining opcode space, available when the base register field of a memory-reference instruction is zero, is used to provide memory-reference instructions whose destination is an address register.
The opcodes for these instructions are:
010  LAR   Load Address Register
011  STAR  Store Address Register
100  JS    Jump to Subroutine
101  LEA   Load Effective Address
In the case of the Jump to Subroutine instruction, the destination register field shows where the return address is to be stored, in the address register it indicates.
The Load Effective Address instruction calculates the source address, but does not access memory, instead storing the address value in the destination register.
The ninth line shows an additional group of instructions in this space. This is another form of memory-reference instruction, designed to avoid the need to keep reloading base registers in a program which operates on multiple large arrays.
The last seven bits of the instruction indicate which array is being accessed, and indexing is applied to the pointer to the start of the array, rather than to the array selection field in the instruction itself. Thus, Array Mode is a form of indirect addressing.
If the R bit is a zero, the 128 integer registers in the enlarged register file are used as the source of the array pointers. If that bit is a one, the array pointers are taken from memory, and the region of memory in which they are contained is that to which the contents of address register zero point.
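A rough sketch of Array Mode address formation follows. Several details are assumptions made only for illustration: that the pointers in memory form a table of consecutive 8-byte entries starting where address register zero points, that the index value is simply added to the pointer, and that addresses wrap at 64 bits. The function and parameter names are hypothetical.

```python
from typing import Callable, Sequence

MASK64 = (1 << 64) - 1
POINTER_BYTES = 8  # assumed size of one array pointer in memory


def array_mode_address(
    array_number: int,
    r_bit: int,
    index_value: int,
    enlarged_int_regs: Sequence[int],
    read_u64: Callable[[int], int],
    address_reg0: int,
) -> int:
    """Effective address for an Array Mode memory reference."""
    if not 0 <= array_number < 128:
        raise ValueError("array selection field is seven bits")
    if r_bit == 0:
        # Pointer comes from the enlarged (128-entry) integer register file.
        pointer = enlarged_int_regs[array_number]
    else:
        # Pointer comes from a table in memory addressed by address register 0.
        pointer = read_u64(address_reg0 + POINTER_BYTES * array_number)
    # Indexing is applied to the array's start pointer.
    return (pointer + index_value) & MASK64
```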
Indirect addressing is generally avoided in modern computer architectures largely due to the complexities that may result in handling the various cases in which one or both of the two memory accesses involved in executing the instruction experience a page fault.
This issue is completely avoided in the case where only register indirect addressing involving the enlarged register file is used. When, instead, the array pointers are kept in memory, since there is only a relatively small number of them, the issue can also be handled through requiring that these arrays of addresses, when used, are always kept in cache.
The fifteenth line shows a modified form of the register-to-register instruction in which only a source and destination register are specified. This allows for a longer opcode field which can specify additional arithmetic operations.
The fourteenth line of the diagram shows an instruction format in which the source operand is an immediate value. But instead of being in a fixed location as part of the instruction, it is referenced by a pointer, like a memory operand instead of an immediate value, which is why this instruction format has been labelled "pseudo-immediate".
The pointer, however, points to one of the 32 bytes of the current 256 bit instruction block, so accessing the item it refers to, from what has already been fetched into the instruction buffer, should be at least as rapid as accessing an operand in a register, and thus in practice there should be no significant difference between operands of this form and conventional immediate operands.
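Fetching a pseudo-immediate operand can be sketched as a slice of the already-fetched block. The sketch assumes a five-bit pointer giving the first byte of the operand and big-endian byte order; both are assumptions of this example, as is the function name.

```python
def fetch_pseudo_immediate(block: bytes, pointer: int, size: int) -> int:
    """Read a `size`-byte operand from the current 32-byte instruction block.

    block   -- the 256-bit block already in the instruction buffer
    pointer -- byte offset within the block (0..31)
    size    -- operand length in bytes
    """
    if len(block) != 32:
        raise ValueError("an instruction block is 32 bytes")
    if not 0 <= pointer < 32:
        raise ValueError("pointer must select one of the 32 bytes")
    if pointer + size > 32:
        raise ValueError("operand would run past the end of the block")
    # Big-endian byte order is assumed here.
    return int.from_bytes(block[pointer:pointer + size], "big")
```

No memory access beyond the instruction fetch itself is involved, which is the point made above about the speed of such operands.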
Note that there is no need, as well, with this scheme, to place restrictions on what can be in a block that contains a branch target. Since the pseudo-immediates are pointed to, instead of being in locations deduced from what has happened in previous instructions in the block, as long as one only branches to actual code and not to constant values, things will simply work: the interpretation of instructions after the branch point will not depend on what portion of the block prior to the branch point consists of instructions, and what portion is skipped over to contain immediates or other data.
In various lines of the diagram are additional instruction formats that allow the instruction to be longer than 32 bits. Here, the pointer to additional material is four bits long, so that instructions may be lengthened in steps of 16 bits. How many of those are used depends on the particular instruction.
Note that the 32-bit main body of the instruction is shown on the left of the diagram, and possible formats for the additional bits of the instruction, which actually will precede it in the location the pointer field indicates, are shown on the right of the diagram.
This differs from the case in the fourteenth and sixteenth lines of the diagram; there, the instruction is longer because of the presence of an immediate operand, which can be of any of the available data formats, and thus no attempt is made to illustrate possible immediate operands to the left of the base 32 bits of the instruction.
The sixteenth, eighteenth and twenty-second lines of the diagram show instruction formats similar to those in the fifteenth, seventeenth and twenty-first lines, except that they are modified so that the source operand is immediate. Thus, an immediate operand may be combined with the formats shown in those lines.
Of course, the need for two pointers in the instruction could be avoided, as one pointer could be used to indicate where the additional portion of the instruction proper, and the immediate operand, are both present, one following the other. This arrangement, however, has the advantages of being more orthogonal and of making it easier to properly align immediate values to help speed execution. As well, it is possible for two instructions in a block to have pointers which indicate the same immediate operand, or even the same additional instruction bits, in order to save space.
The nineteenth line of the diagram, along with the twenty-first through twenty-fourth lines, show formats for an alternate set of instructions which use additional banks of registers that have 128 registers in them instead of 32 registers.
The purpose of the enlarged register bank is to allow a single program to use the processor at full speed without need for the technique of out-of-order execution. It is, however, envisaged as difficult to arrange programs to solve real-world problems so as to have lengthy segments which perform multiple independent calculations within the registers, as this would require.
Thus, in the twenty-first line, a register-to-register instruction format is shown with two seven-bit fields to specify the source and destination registers. In the twenty-second line, a format is shown with one seven-bit field for the destination register, with an immediate source operand. In the twenty-third line is a form of instruction that links the 128 registers of the enlarged register files with the 32 registers of the standard register files.
In the twenty-fourth line, an instruction format is shown that makes use of additional instruction bits to allow full three-address instructions which can use any of the 128 registers in the enlarged register files in each of the three roles.
In the nineteenth line, an instruction format is shown that allows a destination, operand, and source register to all be specified. Only the destination register field is seven bits long. The operand and source register fields are each four bits long, and thus those registers must belong to the same group of sixteen registers as the destination register.
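The restriction to one group of sixteen registers can be pictured as follows, assuming the groups are aligned (registers 0 to 15, 16 to 31, and so on), so that the upper three bits of the destination register number supply the group. The function name is hypothetical.

```python
def expand_short_register(dest: int, short_field: int) -> int:
    """Form a full 7-bit register number from a 4-bit operand or source field.

    The upper three bits are borrowed from the 7-bit destination register,
    which selects one of eight assumed-aligned groups of sixteen registers.
    """
    if not 0 <= dest < 128:
        raise ValueError("destination register field is seven bits")
    if not 0 <= short_field < 16:
        raise ValueError("operand and source fields are four bits")
    return (dest & 0b1110000) | short_field
```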
The opcodes of the memory-reference instructions are:
00000  LB    Load Byte
00001  STB   Store Byte
00010  ULB   Unsigned Load Byte
00011  IB    Insert Byte
00100  LH    Load Halfword
00101  STH   Store Halfword
00110  ULH   Unsigned Load Halfword
00111  IH    Insert Halfword
01000  L     Load
01001  ST    Store
01010  UL    Unsigned Load
01011  I     Insert
01100  LL    Load Long
01101  STL   Store Long
01110  JC    Jump on Condition
01111  JSRD  Jump to Subroutine Return in Data
10000  LM    Load Medium
10001  STM   Store Medium
10010  LF    Load Floating
10011  STF   Store Floating
10100  LD    Load Double
10101  STD   Store Double
10110  LQ    Load Quad
10111  STQ   Store Quad
Byte, Halfword, Word, and Long are integer formats 8, 16, 32, and 64 bits in length respectively; Medium, Floating, Double, and Quad are floating-point formats 48, 32, 64, and 128 bits in length respectively. Due to their odd length, Medium-format floating-point numbers are considered to be aligned when they are aligned to a 16-bit boundary.
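The format sizes and the alignment rule for Medium can be captured in a small table. The alignment of the other formats to their own length is an assumption of this sketch, made because only the Medium case is stated explicitly; the names are hypothetical.

```python
# Length in bits of each data format named above.
FORMAT_BITS = {
    "Byte": 8, "Halfword": 16, "Word": 32, "Long": 64,          # integer
    "Medium": 48, "Floating": 32, "Double": 64, "Quad": 128,    # floating-point
}


def alignment_bytes(fmt: str) -> int:
    """Alignment boundary, in bytes, for the given format."""
    if fmt == "Medium":
        return 2  # 48-bit items are aligned on 16-bit boundaries
    # Assumed: every other format is aligned to its own length.
    return FORMAT_BITS[fmt] // 8


def is_aligned(address: int, fmt: str) -> bool:
    """True if a byte address is suitably aligned for the format."""
    return address % alignment_bytes(fmt) == 0
```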
In the case of the Jump on Condition instruction, the destination register field is used as part of the opcode, to indicate the condition under which branching takes place. The various forms of the Jump on Condition instruction are:
01110 00000  NOP  No-operation
01110 00001  JL   Jump if low
01110 00010  JE   Jump if equal
01110 00011  JLE  Jump if low or equal
01110 00100  JH   Jump if high
01110 00101  JNE  Jump if not equal
01110 00110  JHE  Jump if high or equal
01110 00111  J    Jump
01110 01000  JV   Jump if overflow
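The condition values listed above read naturally as a bit mask, with one bit each for low, equal, high, and overflow, much like the mask in a System/360 Branch on Condition. That reading is an inference from the list, not something stated in the text, and the behaviour of unlisted combinations is simply assumed here. The names are hypothetical.

```python
LOW, EQUAL, HIGH, OVERFLOW = 0b0001, 0b0010, 0b0100, 0b1000


def jump_taken(cond_field: int, low: bool, equal: bool, high: bool,
               overflow: bool = False) -> bool:
    """Decide whether a Jump on Condition is taken.

    cond_field is the value from the destination register field; the jump
    is taken if any condition selected by the mask currently holds.
    """
    state = ((LOW if low else 0) | (EQUAL if equal else 0)
             | (HIGH if high else 0) | (OVERFLOW if overflow else 0))
    return (cond_field & state) != 0
```

Under this reading, 00000 selects no condition and so never jumps, while 00111 selects all three comparison outcomes and so always jumps after a comparison.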
In the case of the Jump to Subroutine Return in Data instruction, the destination register field indicates where the return address is to be placed, in one of the thirty-two normal integer data registers. As placing a return address in an address register may be more useful, the simple Jump to Subroutine instruction is the one in the base register group of instructions.
The basic floating-point formats supported by this architecture are patterned after those of IEEE 754, as used by many other computers, and are shown below:
In addition to the standard 32-bit and 64-bit types specified by IEEE 754, a similar type occupying 48 bits is defined. The size of the exponent field is chosen to be the minimum that allows numbers from 10^-99 to 10^99 to be represented, and with that exponent field, 11 digits of precision are provided. Thus, this format matches the precision provided by many pocket calculators, as well as used in mathematical tables and mechanical calculators; thus, historically, it appears to be a good fit to what many scientific problems require.
The extended-precision format of floating-point number is the one used in the registers for floating-point numbers of all precisions.
Thus, when a register-to-register operation involving a shorter precision is performed, some bits of the register are ignored when operands are taken, but all bits are filled to provide a valid extended-precision number with the proper value when results are returned.
This has the positive consequence that denormals do not require any additional overhead. It also means that single-precision, double-precision, and intermediate-precision numbers, in internal form, have a slightly greater numeric range than they do in external form.
This means that some computations may continue, and produce a correct result, which would otherwise fail if the numbers were kept in external form all the way through. However, this still tends to be viewed as a drawback, as it means that computations are less consistent in their results. Also, this means that instructions to store floating-point numbers other than extended precision floating point numbers may fail with an overflow or underflow error.
A later page will discuss the formats of Decimal Floating-Point numbers. One of those formats involves representing decimal digits using a modified form of Chen-Ho encoding. These numbers will also be converted to an internal form to speed computation; in the internal form, ten-bit fields representing three decimal digits will be converted to normal four-bit BCD digits. This will expand a 128-bit Decimal Floating-Point number to more than 128 bits. As a result, while there will still be 128 Decimal Floating-Point registers, a Decimal Floating-Point register will consist of both the corresponding 128-bit floating-point register and the corresponding 64-bit integer register.
The architecture provides vector instructions similar to those provided by many processors today. A vector as used by a short vector instruction is 256 bits long, and may contain four 64-bit or eight 32-bit floating point numbers, or four 64-bit, eight 32-bit, sixteen 16-bit, or thirty-two 8-bit integers.
There are sixteen short vector registers. Each short vector register occupies two consecutive 128-bit floating-point registers; thus, short vector register 0 is floating-point registers 0 and 1, short vector register 1 is floating-point registers 2 and 3, and so on.
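The register pairing and element counts can be written out as a brief sketch; the function names are hypothetical.

```python
SHORT_VECTOR_BITS = 256


def short_vector_fp_registers(n: int) -> tuple:
    """Pair of 128-bit floating-point registers backing short vector register n."""
    if not 0 <= n < 16:
        raise ValueError("there are sixteen short vector registers")
    return (2 * n, 2 * n + 1)


def elements_per_short_vector(element_bits: int) -> int:
    """Number of elements of the given width in one 256-bit short vector."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    return SHORT_VECTOR_BITS // element_bits
```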
The formats of the short vector instructions are shown in the tenth, twenty-fifth, and twenty-sixth lines of the diagram.
The component elements of a short vector are stored in the short vector registers in the same format as they are stored in memory. This is unlike the case with scalar calculations, where floating-point numbers are converted to an internal form to speed computation. Therefore, a status bit indicating that denormals are to be treated as zeroes is provided in the program status word, but it only affects short vector calculations.
The eleventh and twentieth lines of the diagram show the formats of the long vector instructions.
These use sets of sixty-four vector registers, each of which contains space for sixty-four elements in the vector. Thus, the floating-point long vector registers consist of sixty-four 128-bit elements each, and the integer long vector registers consist of sixty-four 64-bit elements each.
Numbers are converted to internal form for long vector instructions as for scalar instructions and unlike short vector instructions, and so the same considerations with respect to overflow and underflow apply.
Hybrid vector instructions use the floating-point long vector registers only, and also use the same instruction formats as long vector instructions.
These instructions, however, put data in those registers in external format, and pack the data within those registers; thus, a vector register would contain 128 rather than 64 floating-point numbers if they are in 64-bit double precision.
Stride is not supported with hybrid vector instructions.
The length field in the instruction indicates the number of 256-bit short vectors, rather than the number of individual elements, in the vector.
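The capacity figures for hybrid vector instructions follow from simple arithmetic, sketched below. It is assumed here that the length field holds the count of short vectors directly (not, say, the count minus one); the names are hypothetical.

```python
LONG_VECTOR_ELEMENTS = 64   # elements per long vector register
FP_ELEMENT_BITS = 128       # width of one floating-point long vector element
SHORT_VECTOR_BITS = 256

# Total storage in one floating-point long vector register.
REGISTER_BITS = LONG_VECTOR_ELEMENTS * FP_ELEMENT_BITS


def hybrid_capacity(element_bits: int) -> int:
    """Packed elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


def hybrid_element_count(length_field: int, element_bits: int) -> int:
    """Elements processed when the length field counts 256-bit short vectors."""
    return length_field * (SHORT_VECTOR_BITS // element_bits)
```

A full register holds 8192 bits, that is, 32 short vectors, which for 64-bit double precision gives the 128 elements mentioned above.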
Like short vector instructions, these instructions are affected by the denormals-are-zero status bit.
The twenty-seventh through thirty-first lines of the diagram show configurations of bits which, when they appear in the last instruction slot of a block, do not denote an instruction.
The bit configurations in the twenty-seventh and twenty-eighth lines provide bits which indicate predication for the instructions in earlier instruction slots in the block.
Since these are not instructions, the B and C bits are available to specify the format of these bits.
The twenty-seventh line of the diagram shows a flexible format for the instruction slot contents which indicate predication. Here, there may be instructions in all seven other instruction slots in the block, but only three of them can be predicated. Thus, there is also a bit marked SC, and there are three fields with four bits to indicate a flag bit or a condition code, each preceded by an S bit. But in addition, there is a seven-bit field, split up into two bits in the B and C bit positions, and the remaining five following the SC bit, in which up to three 1 bits will indicate which of the seven preceding 32-bit instruction slots of the 256-bit instruction block those fields correspond.
If the SC bit is 0, the four-bit field always indicates a flag bit. If the S bit is zero, the instruction is executed if the flag is set; if the S bit is one, the instruction is executed if the flag is cleared.
If the SC bit is 1, then if the S bit is zero, the four-bit field still indicates a flag bit, and the instruction is executed if the flag is set. If the S bit is one, the four-bit field indicates for what values of the condition codes the instruction is executed, with the same interpretation as used in the conditional jump instructions.
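The interaction of the SC and S bits can be summarized in a few lines. In this sketch the condition code test is supplied by the caller as a function, since its encoding is that of the conditional jump instructions; the function and parameter names are hypothetical.

```python
from typing import Callable, Sequence


def predicate_allows(
    sc: int,
    s: int,
    field: int,
    flags: Sequence[bool],
    condition_holds: Callable[[int], bool],
) -> bool:
    """Decide whether a predicated instruction executes.

    sc, s           -- the SC bit of the slot and the S bit of this predicate
    field           -- the four-bit field (flag number or condition selector)
    flags           -- the sixteen flag bits
    condition_holds -- evaluates a condition selector against the condition codes
    """
    if sc == 1 and s == 1:
        # Only in this case does the field name a condition, not a flag.
        return condition_holds(field)
    flag = flags[field]
    # S = 0: execute if the flag is set.
    # S = 1 (which reaches here only when SC = 0): execute if it is cleared.
    return flag if s == 0 else not flag
```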
As shown in the twenty-eighth line of the diagram, the three bits corresponding to each instruction, if all zero, indicate that the instruction is executed unconditionally; if they have a value from 1 to 7, the corresponding flag bit, 1 to 7, from among the sixteen flag bits numbered 0 to 15, is used to control whether the instruction is executed.
In the twenty-ninth line of the diagram, a format is shown which instead provides bits used for instruction prefixing. If the bits corresponding to an instruction slot are 000, the instruction in that slot executes normally; for values from 001 to 111, one of seven alternate instruction sets is selected.
In the thirtieth and thirty-first lines of the diagram, a format is shown which allows one alternate instruction set to be selected, and applied to any of the seven preceding instruction slots, and one predicate to be specified, and applied to any of the seven preceding instruction slots.
In the thirtieth line, the predicate refers to a flag bit from 1 to 7; the S bit, if 0, indicates that the flag must be set for the instruction to execute, and, if 1, that the flag must be cleared for the instruction to execute. In the thirty-first line, the predicate is instead a selected condition indicated by the condition code bits.