This is the current draft of an attempt on my part to propose a successor to my original Concertina architecture. Once again, it builds on previous attempts; a major goal is to keep overhead to a minimum, and ensure that program code is compact.
The design attempts to combine many of the benefits of RISC, CISC, and VLIW architectures.
The basic instruction set consists entirely of 32-bit instructions, there are 32 integer general registers and 32 floating-point registers, and those instructions that perform arithmetic or logical operations include a bit for enabling changes to the condition codes as a result of those instructions. These are characteristics found in RISC architectures.
Having register banks of 32 registers allows different calculations to be intertwined in the code, and being able to control whether instructions affect the condition codes allows more intervening instructions between an instruction that sets the condition codes and a branch instruction that makes use of those results. Both of these things allow code to be designed to offer some of the same benefits as are obtained from out-of-order execution, without the hardware overhead. At the microprocessor clock rates in use today, these measures are normally not enough to be effective on their own; however, if code written this way is combined with simultaneous multi-threading (SMT), then there is still the potential for competing with out-of-order execution.
Instructions are organized into 256-bit blocks which contain eight 32-bit instruction slots.
The first bit of each instruction is a break bit; when it is set, it indicates that the instruction cannot be executed simultaneously with the instruction that precedes it.
This allows rapid superscalar execution of suitable code without the overhead of interlock and out-of-order circuitry, and is a feature of VLIW designs.
The first instruction slot may contain a header which indicates that several instruction slots are not to be decoded. If, instead, the block is composed entirely of instructions, the break bit in the first instruction will be a one (1); given that it takes time to fetch a block of instructions, simultaneous execution cannot cross block boundaries. So if the first bit of a block is instead a zero (0), the block begins with a header.
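The rule for telling a header from an instruction, and the way break bits divide a block into groups of instructions that may execute together, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "first bit" of a slot is taken to be its most significant bit, the block is read in big-endian order, and the function names split_block, has_header and issue_groups are hypothetical, not part of the architecture.

```python
def split_block(block):
    """Split a 256-bit block into eight 32-bit slots (big-endian order assumed)."""
    if len(block) != 32:
        raise ValueError("a block is exactly 32 bytes (256 bits)")
    return [int.from_bytes(block[i:i + 4], "big") for i in range(0, 32, 4)]


def first_bit(word):
    """Return the first bit of a 32-bit slot (assumed to be the most significant bit)."""
    return (word >> 31) & 1


def has_header(block):
    """A block made only of instructions starts with a break bit of 1,
    so a leading 0 bit marks the first slot as a header."""
    return first_bit(split_block(block)[0]) == 0


def issue_groups(slots):
    """Group the slot indices of a headerless block.

    A slot whose break bit is set starts a new group; a slot whose break
    bit is clear may execute together with the slot before it.
    """
    groups = []
    for index, word in enumerate(slots):
        if first_bit(word) == 1 or not groups:
            groups.append([index])
        else:
            groups[-1].append(index)
    return groups
```

For a headerless block whose break bits are set in slots 0, 3 and 6, issue_groups returns three groups, covering slots 0 to 2, 3 to 5, and 6 to 7.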
In this way, all the instructions in a block can be decoded in parallel, and yet immediate values corresponding in length to any of the major data types used in the architecture can be placed in the block without having to allow for instructions having a large number of different possible lengths. This avoids the need for a separate memory access when a program makes use of a constant value.
The first header format also includes space to indicate instruction predication, so that some instructions can be marked for conditional execution without the need for branching.
Programs consist of 256-bit blocks of program code, each of which contains eight 32-bit instruction slots.
Instructions may not cross block boundaries.
The form of a block prefix is illustrated below:
As noted above, a block may also consist entirely of instructions without a prefix.
If there is no header present, or if the header is in the first format, a block may only contain 32-bit instructions.
The second block format allows 16-bit, 32-bit, 48-bit, 64-bit, and 80-bit instructions; it indicates whether each 16-bit instruction slot is the beginning of a 16-bit instruction, a 32-bit instruction, or a longer instruction, so only the instructions longer than 32 bits contain within themselves bits to indicate their length.
Some instructions contain short pointers, which point within the 256-bit instruction block in which the instruction is contained, to constant values used by the instruction, or to an additional portion of the instruction. An instruction that does this can only be placed within a block that has a header.
The constant values so indicated are called pseudo-immediates. As with an ordinary constant held in memory, the instruction contains a pointer to the value; but because the pointer is a short one, with a destination within the same code block as the instruction, the data is normally fetched along with the instruction instead of requiring an additional memory access, thus providing the same benefit as an immediate mode instruction.
Headers of the first type begin with
This header format only allows 32-bit instructions in the block.
Then, for each of the remaining instruction slots after the header, there are four bits, consisting of an S bit and a three bit flag field.
If the flag field contains a nonzero number, then the instruction in the corresponding slot is predicated based on the flag indicated by that number. If the preceding S bit is zero, the instruction is executed if the flag is set; if the preceding S bit is one, the instruction is executed if the flag is cleared.
0000 indicates an instruction that is not predicated.
1000 indicates an instruction slot that does not contain an instruction.
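As a minimal sketch of how one of these four-bit fields might be interpreted, the following Python function takes a single field, with the S bit as its most significant bit and the flag number in the low three bits, and reports what it means for the corresponding slot. The function name and the dictionary keys are hypothetical; the value 0000 is treated as an unpredicated instruction, which is what the rule for nonzero flag fields implies. Where the fields sit within the header word is not assumed here.

```python
def decode_predicate_field(field):
    """Interpret one 4-bit predication field: an S bit followed by a 3-bit flag number."""
    if not 0 <= field <= 0b1111:
        raise ValueError("a predication field is four bits long")
    s_bit = (field >> 3) & 1
    flag = field & 0b111
    if flag == 0:
        # Flag field of zero: no predication. S = 0 is an ordinary
        # instruction; S = 1 marks a slot that holds no instruction.
        return {"present": s_bit == 0, "predicated": False}
    return {
        "present": True,
        "predicated": True,
        "flag": flag,
        # S = 0: execute if the flag is set; S = 1: execute if it is cleared.
        "execute_if_flag_set": s_bit == 0,
    }
```

For example, the field 0011 describes an instruction that executes only when flag 3 is set, while 1101 describes one that executes only when flag 5 is cleared.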
Headers of the second type begin with 00, followed by two bits that are not both zero, since the first available 16-bit instruction slot must be the beginning of an instruction of some length.
This header format allows all available instructions to be used in the block.
Headers of the third type begin with 01, as the first available 16-bit instruction slot must contain an instruction.
This header format allows both 32-bit instructions and 16-bit instructions, despite the fact that these instructions do not indicate their length within themselves, and only a single bit is used to indicate the presence of an instruction, not its length.
This is achieved as follows: a 1 bit within the instruction start field normally indicates the start of a 32-bit instruction, but it will instead indicate the start of a 16-bit instruction when it logically cannot indicate the start of a 32-bit instruction because one of the following two conditions is met: the 1 bit is followed by another 1 bit, or it is in the final position of the instruction start field.
Although this does place some limitations on the flexibility of the instruction sequence (for example, it may force wasted space if there are no 16-bit immediates used and the last instruction in the block is 16 bits long, as immediate values must be aligned), this block format is still valuable because it offers an opportunity to have only 16 bits of overhead instead of 32 bits of overhead from a header.
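The rule for reading the instruction start field of the third header format can be sketched as follows. The function name decode_start_field is hypothetical, and the field is passed in as a plain list of bits, one per 16-bit slot, so that nothing is assumed about its length or its position in the header. One further assumption is made: a 0 bit that is not the second half of a 32-bit instruction marks a slot that holds no instruction.

```python
def decode_start_field(bits):
    """Return (slot_index, length_in_bits) for each instruction marked in the field.

    A 1 bit starts a 32-bit instruction unless it is followed by another
    1 bit or sits in the final position, in which case it starts a
    16-bit instruction.
    """
    instructions = []
    i = 0
    n = len(bits)
    while i < n:
        if bits[i] == 0:
            # Assumption: a stray 0 bit is a slot with no instruction in it.
            i += 1
        elif i == n - 1 or bits[i + 1] == 1:
            instructions.append((i, 16))
            i += 1
        else:
            # The following 0 bit is the second half of this instruction.
            instructions.append((i, 32))
            i += 2
    return instructions
```

With the field 1 0 1 1 0 0 1, this yields a 32-bit instruction in slot 0, a 16-bit instruction in slot 2, a 32-bit instruction in slot 3, and a 16-bit instruction in slot 6, with slot 5 left unused.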
These block formats achieve the following goals:
If a program, or a part of a program, consists only of 32-bit instructions, it is not necessary to consume space with headers, so no overhead is imposed in that case.
It is possible, using the third header format, to save space by using 16-bit instructions in a block, and the cost of that is only the length of a 16-bit header, making it easier to achieve a net savings in program length.
The second header format allows the use of instructions longer than 32 bits, avoiding the need to resort to complicated setup preparations to perform operations that can't be completely described in a single 32-bit instruction within the available opcode space.
The first header format allows instructions to be predicated; this method of conditional execution is often faster and more efficient than branching. It only works with 32-bit instructions, but this is an acceptable limitation.
All three header formats allow the indication of areas in the block that do not contain instructions, and are to be skipped over, and so the instructions that make use of pseudo-immediates can be used in all of them. This allows programs to use constant data values that are fetched along with the instruction stream instead of incurring the overhead of a data memory access; this is useful on modern machines, as CPUs are very fast indeed, while DRAM accesses involve a large amount of latency.
The complement of registers included with this architecture is as follows:
There are 32 integer registers, each of which is 64 bits in length, numbered from 0 to 31.
Registers 1 through 7 may be used as index registers.
Registers 25 through 31 may be used as base registers, each of which points to an area of 65,536 bytes in length.
Register 24 serves as a base register which points to an area 32,768 bytes in length.
Registers 9 through 15 may be used as base registers, each of which points to an area of 4,096 bytes in length.
At least part of the area of 4,096 bytes in length pointed to by register 8 will normally be used to contain up to 512 pointers, each 64 bits in length, for use in either Array Mode addressing or Address Table addressing.
Registers 17 through 23 may be used as base registers, each of which points to an area of 1,048,576 bytes in length. This addressing format is used for 48-bit extended memory-reference instructions.
Register 16 may be used as a base register which points to an area 512 bytes in length. This is where the operands of the 16-bit memory-reference instructions used in association with blocks having a header in the eleventh header format are found.
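The assignment of area sizes to base registers given above can be collected into a small lookup table. The Python sketch below does this, and also derives the number of displacement bits needed to reach every byte of each area, on the assumption that displacements are unsigned and count bytes; the names BASE_REGISTER_AREAS, area_size and displacement_bits are hypothetical.

```python
# (registers, size of the area in bytes), as listed in the text.
BASE_REGISTER_AREAS = [
    (range(25, 32), 65_536),
    (range(24, 25), 32_768),
    (range(9, 16), 4_096),
    (range(8, 9), 4_096),       # normally holds up to 512 64-bit pointers
    (range(17, 24), 1_048_576),
    (range(16, 17), 512),
]


def area_size(reg):
    """Return the size in bytes of the area a base register points to, or None."""
    for regs, size in BASE_REGISTER_AREAS:
        if reg in regs:
            return size
    return None


def displacement_bits(reg):
    """Bits needed for a byte displacement covering the whole area (assumption)."""
    size = area_size(reg)
    return None if size is None else size.bit_length() - 1
```

Under that assumption, registers 25 through 31 correspond to 16-bit displacements, register 24 to 15 bits, registers 9 through 15 to 12 bits, registers 17 through 23 to 20 bits, and register 16 to 9 bits. The 512 pointers of 8 bytes each exactly fill the 4,096-byte area of register 8.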
There are 32 floating-point registers, each of which is 128 bits in length, numbered from 0 to 31.
Floating point numbers in IEEE 754 format have exponent fields of different length, depending on the size of the number. For faster computation, floating-point numbers are stored in floating-point registers in an internal form which corresponds to the format in which extended precision floating-point numbers are stored in memory: with a 15-bit exponent field, and without a hidden first bit in the significand.
As 128-bit extended floating-point numbers are already in this format in memory, all floating-point numbers will fit in a 128-bit register, although shorter floating-point numbers are expanded.
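To make the expansion concrete, here is a rough Python sketch of how a 64-bit IEEE 754 number might be widened into the internal form: one sign bit, a 15-bit exponent, and a 112-bit significand whose leading bit is explicit. The exponent bias of 16,383 is an assumption (it is the usual bias for 15-bit exponent fields), the function name expand_binary64 is hypothetical, and subnormals, infinities and NaNs are left out to keep the sketch short.

```python
import struct


def expand_binary64(x):
    """Return (sign, 15-bit exponent, 112-bit significand) for a normal double or zero."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp11 = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp11 == 0 and frac == 0:
        return sign, 0, 0          # signed zero
    if exp11 == 0 or exp11 == 0x7FF:
        raise ValueError("subnormals, infinities and NaNs are not handled here")
    # Re-bias the exponent from 1023 to 16383 (assumed bias of the internal form).
    exp15 = exp11 - 1023 + 16383
    # Make the hidden bit explicit, then left-align 53 bits in 112.
    significand = ((1 << 52) | frac) << (112 - 53)
    return sign, exp15, significand
```

For instance, 1.0 becomes sign 0, exponent 16,383, and a significand with only its topmost bit set.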
However, the 32 floating-point registers may also be used for Decimal Floating-Point (DFP) numbers. These numbers will also be expanded into an internal form for faster computation, but that internal form may take more than 128 bits. In order to allow the floating-point registers to behave as if they are 160 bits long, the last four short vector registers are used to provide 32 additional bits to each of the floating-point registers (four registers of 256 bits supply 1,024 bits, which is 32 bits for each of the 32 registers).
There are 16 short vector registers, each of which is 256 bits in length.
Each of these registers may contain:
As well, they may contain sixteen 16-bit short floating-point numbers in one of two formats.
These numbers all remain in these registers in the same format as that in which they appear in memory.
Two additional groups of registers exist, that should be viewed as optional features of the architecture:
The first consists of sixty-four long vector registers, where each long vector register is composed of sixty-four floating-point registers, each 128 bits in length.
The second consists of two extended register banks of 128 registers; one of 128 64-bit integer registers, and one of 128 128-bit floating-point registers. These are primarily intended to allow code to be written with a higher degree of instruction-level parallelism (ILP) for use in the block formats that offer VLIW features.
In addition to an implementation possibly not offering these as features, possibilities such as the following exist:
An implementation might offer eight-way SMT, but with the following restrictions:
Only two of the eight simultaneous threads at most may be executed out-of-order, with rename registers.
There will be only one set of long vector registers, and so only one thread may use them.
There will be only one set of banks of 128 registers, and so only one thread may use them.
There are no rename registers for the long vector registers or the banks of 128 registers. However, this does not preclude a thread using either or both of these features from being executed out-of-order in respect of the instructions that don't use those features.
Thus, for example, one situation that might arise on a core with this kind of large-scale implementation when fully-loaded might be this:
Five low-priority threads are running in-order;
One higher-priority thread is running out-of-order;
One thread uses the banks of 128 registers; it is running in-order, since it is using VLIW features to run at high speed instead;
One thread uses the long vector registers, and it is also running out-of-order so that scalar calculations within the thread will run as fast as possible.
Therefore, the way to think of these optional register banks is that they are only to be used in programs that are expected to be the only program of their kind running on the computer at a given time, which permits them to monopolize one or both of these additional resources that allow the program to execute with higher performance.
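The restrictions of the hypothetical eight-way SMT implementation described above can be written down as a simple check. The Python sketch below is only an illustration; the Thread record and the function check_core_load are invented names, and the limits are the ones listed in the example implementation, not requirements of the architecture itself.

```python
from collections import namedtuple

Thread = namedtuple(
    "Thread", "name out_of_order uses_long_vectors uses_register_banks"
)


def check_core_load(threads):
    """Return a list of violated restrictions (empty if the load is allowed)."""
    problems = []
    if len(threads) > 8:
        problems.append("more than eight simultaneous threads")
    if sum(t.out_of_order for t in threads) > 2:
        problems.append("more than two threads running out-of-order")
    if sum(t.uses_long_vectors for t in threads) > 1:
        problems.append("more than one thread using the long vector registers")
    if sum(t.uses_register_banks for t in threads) > 1:
        problems.append("more than one thread using the banks of 128 registers")
    return problems


# The fully-loaded situation described in the text.
example_load = (
    [Thread("low-%d" % i, False, False, False) for i in range(5)]
    + [Thread("high", True, False, False)]
    + [Thread("vliw", False, False, True)]
    + [Thread("vector", True, True, False)]
)
```

Calling check_core_load(example_load) returns an empty list, since that load uses eight threads, two of them out-of-order, and each optional register set is claimed by only one thread.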
As for how data values are stored:
Signed integer values are stored in binary two's complement format.
Floating-point numbers are stored in IEEE 754 format.
The architecture is big-endian: the most significant bits of a value are stored in the byte at the lowest numbered address.
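These storage conventions are easy to illustrate in Python, whose standard library can produce big-endian two's complement integers and big-endian IEEE 754 values directly. The helper names store_int, load_int and store_double are hypothetical and exist only for this illustration.

```python
import struct


def store_int(value, size):
    """Encode a signed integer as big-endian two's complement in `size` bytes."""
    return value.to_bytes(size, "big", signed=True)


def load_int(data):
    """Decode a big-endian two's complement integer."""
    return int.from_bytes(data, "big", signed=True)


def store_double(x):
    """Encode a 64-bit IEEE 754 number with its most significant byte first."""
    return struct.pack(">d", x)
```

Storing 0x01020304 in four bytes places the byte 01 at the lowest address and 04 at the highest, and storing -2 in four bytes gives FF FF FF FE.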