This is now my sixth attempt to propose a successor to my original Concertina architecture. This time, the architecture is designed to follow some of the aspects of the RISC style of computer architecture: memory-reference instructions are load and store instructions, and all instructions are 32 bits long.
Many, but not all, instructions include a Break (B) bit. If this bit is set on an instruction that has it, it indicates the start of a new series of instructions that can be initiated simultaneously without checking for dependencies. As resource conflicts are implementation-dependent, the absence of a break bit between instructions will not suppress checking that multiple instructions do not attempt to use the same arithmetic unit.
Instructions the format of which does not include a break bit are treated as if their break bit was set.
Eight instructions, each 32 bits long, are fetched at a time in one fully-aligned 256-bit instruction block. Because all instructions are the same length, they may be decoded simultaneously.
If no breaks occur during an entire instruction block, the first instruction in the following instruction block is treated as though its break bit is set whether or not that is the case. Thus, a maximum of fifteen instructions may have their execution initiated in the same cycle. Additional break bits may be inserted in an implementation-dependent manner to enforce a lower limit on the number of instructions that may be initiated in one cycle.
The break bit by itself does not assert that dependencies exist; they are found, if present, through interlock bits on the various registers, and will result, if necessary, in additional delays of the instruction stream.
The basic complement of registers is as follows:
There are three banks each of both integer registers, each 64 bits in length, and of floating-point registers, each 128 bits in length.
One bank contains eight registers, another contains 32 registers, and the third contains 128 registers, of each type.
Also, an implementation including the long vector feature will consist of one or more banks of four cores that may be linked together.
The complement of registers provided to the programmer is illustrated below:
The diagram also shows how the sixteen short vector registers, each 256 bits long, are formed by using two of the 32 128-bit RISC floating-point registers, and how the eight long vector registers are formed by using sixteen of the 128 registers from four cores, with the elements of the vectors being distributed among the cores.
As well, there is one bank of eight base registers, each 64 bits in length.
Also, there are two registers, each 32 bits long, used for instruction prefixing.
Instructions may be fetched, decoded, and executed in parallel in groups of up to eight instructions, thus allowing the processor to behave like a Digital Signal Processor (DSP) with a Very Long Instruction Word (VLIW) architecure. In order for it to be able to process a large number of instructions at the same time at various pipeline stages, a large complement of registers - 128 integer and 128 floating-point registers - is required.
However, not all programs may be written in such a way as to run efficiently on a VLIW processor designed like a DSP. Therefore, it is expected that implementations of this architecture will have multiple modes of operation that permit them to achieve high levels of utilization of on-chip computational resources with different types of program.
A chip will have as many sets of eight base registers, of two prefix registers, and as many status words and program counters, as it may have threads running simultaneously.
However, a standard implementation may consist of cores containing only one set of banks of 8, 32, and 128 registers for integer and floating-point values.
The intent is that a core may operate in two or three possible modes.
The first mode is with one thread running on the core, having access to banks of 8, 32, and 128 registers, so as to be able to use the banks of 128 registers for VLIW code.
The second mode is with four threads running on the core, with access only to banks of 8 and 32 registers, but not to the banks of 128 registers.
In this second mode, the banks of 32 and 128 registers on the chip will each be divided into four parts, and each program will see one quarter of the bank of 32 registers as its bank of 8 registers, and one quarter of the bank of 128 registers as its bank of 32 registers.
The third mode is with four threads running on the core, with access only to banks of 8 registers.
In this mode, each thread will have a bank of 8 registers taken from the bank of 32 registers, and in addition will have 32 rename registers taken from the bank of 128 registers; the programs will execute out-of-order.
As well, I intend to offer long vector instructions similar to those of the Cray I supercomputer. This will involve having eight vector registers containing vectors of up to 64 elements.
Rather than defining a new bank of registers, however, I envisage this feature only being available in an implementation that has four cores on a single die.
Since stride is only applicable to memory-reference instructions, and register-to-register operate instructions involve corresponding elements of the vector registers involved, I propose to use the banks of 128 DSP registers of four cores to supply the space for the vector registers.
The first sixteen of those 128 DSP registers would act as vector register 0, and successive elements of each vector would rotate between cores.
Although all instructions are 32 bits long to permit multiple instructions to be simultaneously decoded, there are four mechanisms provided to allow additional information to be associated with an instruction so that an instruction can perform operations which require more than 32 bits to be specified, thus allowing the computer to behave like one with a variable word length supporting instruction words longer than 32 bits.
The first mechanism is this: some register-to-register operate instructions allow the register specification to indicate that the source operand is an immediate operand.
In this case, to avoid interference with the parallel decoding of instructions, the immediate operands do not normally follow the instruction body immediately. Instead, any immediate operands called for in the instructions within a given 256-bit block of eight instructions are placed in order, independently of alignment, at the beginning of the following block of eight 32-bit instruction words, thus suppressing the decoding of instructions in the positions used for immediates.
This mechanism is used for immediate values 32, 64, or 128 bits in length. Instruction formats are also provided for operating on 8-bit and 16-bit immediate values within a 32-bit instruction.
The second mechanism involves the two prefix registers.
This method drew inspiration from the Fujitsu SPARC64 X processor, with its SXAR instruction. That instruction put 24 bits of immediate data into the Extended Arithmetic Register (XAR), which would modify the following one or two instructions by associating 12 bits of additional data with them, allowing the instruction set to be extended.
Here, as I was defining a new instruction set, rather than extending an existing one, as Fujitsu did with the SPARC architecture, I included an E bit in the instruction, so that after placing data in the prefix registers, it could be used later on instead of immediately.
One register, the Supplement Prefix Register, has the following format:
1 bit: Single/Double 1 bit: Second of double 15 bits: First double prefix, first half of single prefix 15 bits: Second double prefix, second half of single prefix
The other register, the Augment Prefix Register, has this format:
1 bit: Direct 1 bit: Active 1 bit: unused 2 bits: Counter 1 bit: Single or First Double Augmented 1 bit: Second Double Augmented 1 bit: unused 6 bits: First augment item 6 bits: Second augment item 6 bits: Third augment item 6 bits: Fourth augment item
When an instruction with the Extend (E) bit set is decoded, additional information is associated with the instruction in the following fashion:
If the Direct bit in the Augment Prefix Register is set, then the two counter bits are the number, minus one, of the 6-bit augment item to be applied to the instruction. If the two counter bits are not both one, they are incremented.
If that bit is not set, the information to be associated with the instruction is instead found in the Supplement Prefix Register.
The Active big in the Augment Prefix Register must also be set for an instruction with the Extend bit set to retrieve prefix information from either prefix register, so that if the registers are cleared no prefixing will take place.
If the Single/Double bit in the Supplement Prefix Register is 1, one of the two 15-bit fields is used as the source of that information, as indicated by the Second of Double bit; the first one if it is 0, the second if it is 1. That bit is then set to 1.
The two bits named Single or First Double Augmented and Second Double Augmented in the Augment Prefix Register are, for the corresponding one of those two 15 bit fields, controlling whether, if 0, only the last 12 bits of the field are to be associated with the instruction, or, if 1, that all 15 bits of the field are to be associated with the instruction.
If the Single/Double bit in the Supplement Prefix Register is 0, then the 30-bit field produced by the concatenation of those two 15-bit fields is used as the source of the bits to be associated with the instruction.
The bit named Single or First Double Augmented in the Augment Prefix Register controls whether, if it is 0, the last 24 bits only of that 30-bit field are associated with the instruction, or if it is 1, all 30 bits are so associated.
These two registers are loaded by a group of instructions that include a 24-bit field of immediate data.
One such instruction loads the 24 least significant bits of the Supplement Prefix Register, and clears the Single/Double bit in that register. Another instead loads the 12 least significant bits of each of the two 15-bit fields in the Supplement Prefix Register, and sets the Single/Double bit in that register.
Each of those instructions comes in two forms. One form also clears the bits in the last 30 bits of the Supplement Prefix Register that it does not load, and as well clears the Single or First Double Augmented and, also, if applicable, the Second Double Augmented bits in the Augment Prefix Register. The other form takes a six-bit augment item from the Augment Prefix Register, advancing its counter unless it is already at 11, and uses it to fill in either the most significant 3 bits of each 15-bit instruction prefix or the most significant 6 bits of the 30-bit instruction prefix, being established in the Supplement Prefix Register. This form sets the Single or First Double Augmented, and, also, if applicable, the Second Double Augmented bits in the Augment Prefix Register.
Thus, four instructions can be given a 6-bit prefix, or two instructions a 12-bit prefix, or one instruction a 24-bit prefix, by a single additional 32-bit instruction. In addition, a 32-bit instruction can be used to provide four augment items which can then be used by four subsequent 32-bit instructions so as either to give two instructions a 15-bit prefix instead of a 12-bit one, or one instruction a 30-bit prefix instead of a 24-bit one.
There are also two instructions that load the last 24 bits of the Augment Prefix Register. One sets the Direct bit, indicating that instructions following with the E bit set will take one augment item as their prefix; the other clears that bit, so that the augment items will instead only be taken to lengthen 12 or 24 bit prefixes into 15 or 30 bit prefixes when the Supplement Prefix Register is loaded.
And, therefore, an instruction with the Extend (E) bit set can find an instruction prefix waiting for it that is either 6, 12, 15, 24, or 30 bits in length.
The decoding of the instruction prefix depends on the type of the instruction itself as well as which length of instruction prefix is encountered.
This scheme permits instructions to be lengthened with a high degree of flexibility, and with a limited number of overhead bits in the case where the instructions to be lengthened in a program are grouped in sequences of instructions that all need to be lengthened by the same amount. If instructions with different prefix sizes are mixed, on the other hand, some of the fields in instructions loading the prefix registers will be unused.
Another important consideration to note is that the instructions with the E bit set that use the contents of the prefix registers must be in a block of 256 bits of instructions subsequent to the one in which the instructions loading the prefix registers occurred, so that the prefix register contents are available during instruction decoding.
The third method involves instructions that directly add bits to every instruction in the following block of eight 32-bit instruction words.
One of these instructions belongs to the same group of instructions as the ones that set the prefix registers, but in this case it applies three bits to every instruction word in the next block. The bits are used for instruction predication; if the associated bits are 000, the corresponding instruction is not predicated; otherwise, one of seven flag bits is indicated, and the instruction will only execute if that flag bit is set.
The other instruction in this category only has an eight-bit immediate field, and it associates a single bit with every 32 bits of the following block. However, its purpose is not to make instructions longer, but shorter: if the corresponding bit is a one, the 32-bit instruction word is decoded as two 16-bit register-to-register instructions.
This instruction also contains a single 16-bit instruction in the same format so as to reduce the overhead of indicating 16-bit instructions.
The fourth method addresses the limitations of the second method of lengthening instructions. Because the prefix registers have to be loaded prior to the block of instructions in which their contents are used, a given block of instructions can only have prefixed instructions of the same type in it. The last prefix can be re-used, so it is still possible for all eight instructions in a block to be prefixed with that method, but the variety of prefixed instructions is very limited.
In the fourth method, the final instruction slot of a block contains a special type of instruction called a "Special Block Indicator". If that is present, the following instruction block contains, at its beginning (preceding any immediate values that may also be present in that block) a sequence of prefix codes. These prefix codes are preceded by length indicators; thus, they are serially decoded in the same manner as instructions for computers with variable-length instructions like the IBM System/360.
Thus, while high-performance programs that look like VLIW programs for a DSP, or that look like RISC programs are composed entirely of 32-bit instructions, with the performance advantages that may derive from that, if it is necessary to write a program that uses a lot of longer instructions, such programs can pay the extra decoding costs required without either imposing such costs on other programs, or requiring the computer to switch between different modes of operation for fixed-length and variable-length instructions.
And, given the other three mechanisms of lengthening instructions, even when some extra-long instructions are required, the extra overhead of decoding variable-length instructions is only needed a part of the time.
One consequence of having these four methods of making instructins longer is that if instruction prefixes have been set up as per the second method outlined above, then in the case of an instruction block which begins with immediate values as per the block method, the instructions in that block cannot be branch targets, as this could lead to the immediate values, as they would in the case of a branch into their instruction block still be decoded, even though, preceding the branch target, they would not be executed, grabbing one or more instruction prefixes in error.
The instruction prefix registers are cleared by most instructions which transfer control. They are cleared by all subroutine call instructions, and it should be noted that the prefix registers must be saved and restored by interrupt service routines. The one case where they are not cleared when control is transferred is when that transfer of control is performed by a conditional jump instruction with program-relative addressing. This is useful if the same prefix will be reused many times by instructions in a loop, but it does also mean that branching into code may result in prefixed instructions receiving unexpected prefixes.
Since non-prefixed instructions are sufficient for the normal basic operations of the computer, it should be possible to avoid the use of instructions with the E bit set entirely in security-critical code. As well, where additional security or virtual memory features associate memory with a specific process, retaining the prefix registers on a branch to code belonging to a different process will not be allowed.
Because programs can be interrupted, some implementation details should be noted here.
In the case of the programmer-visible prefix registers, another pair of registers are required that function as shadow registers, recording the contents of the prefix registers at the time control entered the instruction block.
In the case of the third method, although there are no visible registers, two copies of the bits that are set up by the special instructions are required; one with the bits as set up to affect the interpretation of the instructions in the current block, and another with the bits as set up for the next block.
The basic floating-point formats supported by this architecture are patterned after those of IEEE 754, as used by many other computers, and are shown below:
In addition to the standard 32-bit and 64-bit types specified by IEEE 754, a similar type occupying 48 bits is defined. The size of the exponent field is chosen to be the minimum that allows numbers from 10^-99 to 10^99 to be represented, and with that exponent field, 11 digits of precision are provided. Thus, this format matches the precision provided by many pocket calculators, as well as used in mathematical tables and mechanical calculators; thus, historically, it appears to be a good fit to what many scientific problems require.
As well, these formats provide, if one adds the hidden first bit, mantissas of 24, 36, and 53 bits, respectively. Booth Encoding, in its simplest form, roughly divides the number of partial products to be added in order to perform a multiplication by two, although one or two extra products may be needed for overhead. Wallace Tree multipliers of 5, 6, and 7 stages respectively combine 13, 19, and 28 partial products, which is a good fit to these formats.
The Motorola 68000 architecture defined a 96-bit floating-point format which had 16 unused bits, so as to provide the same precision as the 80-bit temporary real format provided on the 8087 coprocessor for the Intel 8086 microprocessor. Thus, there is precedent for defining floating-point formats with unused bits, even in the microprocessor era.
The 128-bit floating-point format defined for this architecture has six unused bits. It resembles the internal formats on other architectures associated with IEEE 754 in not having a hidden first bit, and in having sixteen bits dedicated to the sign and the exponent.
Six unused bits, and sixteen bits used for the sign and exponent, leave 106 bits for the mantissa, which is exactly twice 53.
The IBM 704 computer had hardware floating-point, operating only on 36-bit single-precision numbers. However, its floating-point instructions retained the entire product of two numbers, as it had to be calculated to its last bits for an accurate result, and it also retained additional bits in the case of addition and subtraction, as those bits could simply be copied from the smaller of the two operands without extra calculation.
This meant that software to perform double-precision calculations was relatively simple and fast.
The IBM System/360 Model 85 introduced a 128-bit extended precision floating-point type to the System/360 series. Like double-precision on the IBM 704, these numbers had the format of a pair of floating-point numbers, with the exponent field of the second one either being ignored or containing an offset exponent.
Thus, the intent behind leaving six bits unused in the 128-bit extended precision floating-point format in this architecture is to allow the double-precision floating-point arithmetic unit to be used for extended precision calculations, as such calculations are presumed to be relatively rarely needed, even if it is still useful to provide instructions for them.
The original Concertina architecture began as a short series of pages to illustrate how a computer works through describing the instruction set of an imaginary computer.
Because I designed that instruction set in line with some of my preferences, such as:
I ended up expanding the instruction set in line with some other preferences, so as to cover a wide range of possible uses for a computer, by adding features such as:
Although I did include some discussion of what an implementation might look like, I was not too concerned about the practicalities of implementation.
In this, and some of the previous iterations of the Concertina II design, I have attempted to include the ability to execute programs which resemble those for a Digital Signal Processor. As the purpose of doing this is to increase performance, the question of implementation takes on more importance.
Proceeding fairly naïvely, taking the Itanium and the TMS320C6000 as inspirations, I considered that it might be feasible to fetch and decode eight 32-bit instructions in parallel, in hopes that at least some of the time it would also be possible to execute those eight instructions in parallel.
The Itanium showed that a computer could have a bank of 128 registers.
If I was hoping to have a pipelined architecture, then for a design with a reasonably high clock rate, it might be that the average instruction could take as many as eight clock cycles for its execute portion, I had estimated. While the floating-point multiply will be a frequent instruction, though, most instructions do relatively simple things, and so this may be pessimistic, unless the size and complexity of the architecture inflates the number of cycles needed due to delays in the transport of data.
But based on that estimate, with eight instructions in parallel per cycle, after eight instructions, there would be sixty-four instructions that would have to execute, when running at full speed, before dependent instructions could be brought in.
This is why I decided that the ISA would need to have banks of 128 registers; allowing eight instructions to be fetched in parallel was raising an expectation that eight instructions could be executed in parallel, which ends up by implying the computer needs to be able to handle 64 interleaved sequences of arithmetic operations at once.
With both 128 integer registers and 128 floating-point registers, each of those sequences of operations could be using four registers, which would be a bare minimum.
In discussions with people who know far more than I do about the real-world practicalities of implementing a microprocessor, it has been brought to my attention that there are significant obstacles to achieving the kind of performance of which this design seems to be offering an implied promise.
I had myself been fully aware that very few programs would actually have the necessary amount of instruction-level parallelism (ILP) to support eight instructions per clock (IPC).
One of the first objections brought to my attention is that accessing 128 registers requires enough multiplexing as to impose an additional clock cycle of delay in any register access. I think it may be possible to deal with that through designing the multiplexer so that some of the time delay can be absorbed in the execute portion of the pipeline, where it would not add to instruction latency.
A more serious objection, more recently advanced, is that practical experience has shown that this kind of design would require a register file with more ports than could be made to work.
In the case of read ports, there is a trivial brute-force solution; if fa In the case of write ports, however, there is no such simple solution. In practice, designs with
more than two write ports often work by queuing the additional writes, involving significant
complexity and delay. So with eight instructions being started in every cycle, and those instructions completing after
variable numbers of cycles, there could easily be, in a single cycle, sixteen results that need to
be stored, and quite possibly more than eight of them would have to be stored in one of the two banks
of 128 registers. There are ways to deal with this to some extent. Because storing data in flip-flops is a relatively
simple operation, it might be possible to perform register writes on half cycles. One brute-force solution that seemed attractive to me is to take the banks of 128 registers, and
divide them into, for example, sixteen banks of eight registers, or even thirty-two banks of four
registers. Since instructions take variable amounts of time to complete, even if programs are written so that
each of the calculation sequences uses its own aligned group of registers, instead of being allocated
variable numbers of registers straddling these boundaries, could collisions be avoided? If the banks into which the division is made are so small that one has sixty-four of them overall,
instead of eight of them overall, allocating instructions so that dependent instructions are only
put in the program when the instructions preceding them have had time to complete would
actually also take care of this conflict. However, even if there may be workarounds, I still need to note that of course my competencies are
limited, and this design may call for a number of things beyond the current state of the art. And it may well be that an implementation that only offers up to 4 IPC instead of 8 IPC would
still offer a sufficiently high performance to be quite useful.
In the case of write ports, however, there is no such simple solution. In practice, designs with more than two write ports often work by queuing the additional writes, involving significant complexity and delay.
So with eight instructions being started in every cycle, and those instructions completing after variable numbers of cycles, there could easily be, in a single cycle, sixteen results that need to be stored, and quite possibly more than eight of them would have to be stored in one of the two banks of 128 registers.
There are ways to deal with this to some extent. Because storing data in flip-flops is a relatively simple operation, it might be possible to perform register writes on half cycles.
One brute-force solution that seemed attractive to me is to take the banks of 128 registers, and divide them into, for example, sixteen banks of eight registers, or even thirty-two banks of four registers.
Since instructions take variable amounts of time to complete, even if programs are written so that each of the calculation sequences uses its own aligned group of registers, instead of being allocated variable numbers of registers straddling these boundaries, could collisions be avoided?
If the banks into which the division is made are so small that one has sixty-four of them overall, instead of eight of them overall, allocating instructions so that dependent instructions are only put in the program when the instructions preceding them have had time to complete would actually also take care of this conflict.
However, even if there may be workarounds, I still need to note that of course my competencies are limited, and this design may call for a number of things beyond the current state of the art.
And it may well be that an implementation that only offers up to 4 IPC instead of 8 IPC would still offer a sufficiently high performance to be quite useful.