Cache-Internal Parallel Computing

or, in full,

Long Vector Arithmetic-Logic Unit Reduced Instruction Set Computing Multiple Instruction Stream/Multiple Data Stream Cache-Internal Parallel Computing

The vector register modes of operation described in the preceding portion of this section are meant to let the computer perform a large amount of computation at high speed without pausing to communicate with main memory. For a full-speed implementation of the architecture, two banks of sixty-four arithmetic/logic units, one for floating-point numbers and one for integers, were envisaged as appropriate, along with a large number of additional registers. Such an implementation could also provide an additional parallel computing capability at a relatively modest additional cost in circuitry.

Vector arithmetic is applicable only to some types of problem. Other problems that can benefit from parallel computation, such as sorting large arrays of data, require the multiple instruction stream/multiple data stream (MIMD) model of parallel computation. As a control unit that decodes instructions in a simple format would be fairly small compared to a high-speed floating-point arithmetic unit, it seemed appropriate to attempt to add such a feature to the architecture.
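As a rough illustration of why sorting calls for independent instruction streams, the following Python sketch splits an array into chunks, sorts each chunk separately, and merges the results. The function names and the choice of insertion sort are assumptions made for this example only; nothing here is taken from the architecture itself, and the chunks are processed one after another rather than truly in parallel. The point is that the inner loop runs a different number of times for each chunk, depending on its data, so the streams cannot proceed in lockstep as vector operations do.

```python
import heapq

def insertion_sort(values):
    """Return a sorted copy of values and the number of element moves made."""
    out = list(values)
    moves = 0
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        # The length of this loop depends on the data, not just on its size.
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
            moves += 1
        out[j + 1] = key
    return out, moves

def sort_in_streams(data, streams=4):
    """Sort each chunk independently, then merge the sorted chunks."""
    size = max(1, (len(data) + streams - 1) // streams)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Conceptually, each chunk would be handled by its own instruction stream.
    results = [insertion_sort(chunk) for chunk in chunks]
    merged = list(heapq.merge(*(r[0] for r in results)))
    return merged, [r[1] for r in results]

data = [1, 2, 3, 4, 8, 7, 6, 5, 2, 9, 1, 7, 3, 3, 3, 3]
merged, moves_per_stream = sort_in_streams(data, streams=4)
print(merged)
print(moves_per_stream)
```

The second line printed shows the work done per chunk; equal-sized chunks can require quite different amounts of it.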

This architecture is, however, envisaged as a response to current technological trends, which permit larger and more complex microprocessors to be placed on a single chip. A full-featured implementation of the architecture might involve a chip having a 256-bit-wide external data bus, combined with a 59-bit address bus and individual byte select lines. With only one address bus, sixty-four on-chip RISC processors could not all access memory simultaneously.
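The scale of this limitation can be estimated with a little arithmetic, sketched here in Python. The bus width and processor count come from the text; the assumptions that the bus completes one full-width transfer per bus cycle, that each transfer serves a single processor (since there is only one address), and that the processors take turns evenly are made only for this example.

```python
BUS_BITS = 256         # external data bus width, from the text
NUM_PROCESSORS = 64    # on-chip RISC processors, from the text

# Assumption: one full-width transfer per bus cycle, one address per transfer,
# so each transfer can serve only one processor, and processors take turns.
bytes_per_transfer = BUS_BITS // 8
cycles_between_turns = NUM_PROCESSORS
average_bytes_per_cycle = bytes_per_transfer / NUM_PROCESSORS

print(bytes_per_transfer)       # bytes delivered in one transfer
print(cycles_between_turns)     # bus cycles a processor waits for its next turn
print(average_bytes_per_cycle)  # average bytes per bus cycle per processor
```

Under these assumptions each processor would average only half a byte of external memory traffic per bus cycle, which suggests why the processors need to work from on-chip storage.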

As noted in the section on the basic elements of this architecture, a high-powered implementation would have, in addition to 384 Kbytes of on-chip register space, perhaps 8 or 32 megabytes of on-chip cache.

The long vector registers provide each of the 64 ALUs with one accumulator from the supplementary registers, eight registers from the long vector registers, and 64 scratchpad locations from the long vector scratchpad. The integer registers are all 64 bits long, unlike the 32-bit arithmetic/index registers of the main computer.

Although providing a wider and more complex data path to the on-chip cache would not require putting more pins on the package in which the chip is enclosed, it would still significantly increase the complexity of both the cache and the register file. Vector operations, by contrast, require only a single wide path between these internal memories and the banks of 64 ALUs.

But vector operations, in themselves, are also relatively rarely used. So that these ALUs may contribute more fully to the normal operation of the computer, they are also available for superscalar execution: on every cycle, it is possible to initiate an operation not just in the main ALU, but also in each of these 64 subsidiary ALUs. For this purpose, and given that these ALUs lack independent paths to the cache and the register file, not to mention main memory, each of the 64 ALUs needs a local memory under its own control, holding the data for the thread of execution whose instructions are directed to that ALU. This is shown in the overview diagram of the architecture as a Level 1 cache. This Level 1 cache could contain 4,096 bytes for each processor, corresponding to eight lines of the main, or Level 2, cache; each such line contains sixteen units 256 bits wide, and thus sixty-four doublewords or 512 bytes.
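The cache sizes quoted above can be checked with a few lines of Python. The constants are taken from the text, with a doubleword taken to be 64 bits, as the figures of sixty-four doublewords and 512 bytes imply. The combined size of all 64 Level 1 caches is derived here and is not a figure stated in the text.

```python
UNIT_BITS = 256          # width of one unit of a Level 2 cache line
UNITS_PER_LINE = 16      # units in one Level 2 cache line
DOUBLEWORD_BITS = 64     # implied by "sixty-four doublewords or 512 bytes"
L1_BYTES_PER_ALU = 4096  # Level 1 cache per processor
NUM_ALUS = 64            # subsidiary ALUs, each with its own Level 1 cache

line_bytes = UNIT_BITS * UNITS_PER_LINE // 8
line_doublewords = line_bytes * 8 // DOUBLEWORD_BITS
lines_per_l1 = L1_BYTES_PER_ALU // line_bytes
total_l1_bytes = L1_BYTES_PER_ALU * NUM_ALUS  # derived, not stated in the text

print(line_bytes)        # bytes in one Level 2 line
print(line_doublewords)  # doublewords in one Level 2 line
print(lines_per_l1)      # Level 2 lines that fit in one Level 1 cache
print(total_l1_bytes)    # combined Level 1 capacity across all 64 ALUs
```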

These 64 processors proceed independently of the main CPU. A process that initiates a parallel computation therefore continues to run after doing so, and the parallel computation continues while the CPU turns to other threads or switches to other processes. A process that has nothing left to do but wait for a result from the parallel processors can thus relinquish control to the operating system.
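This pattern of starting a computation, carrying on, and waiting only when the result is needed has a familiar software analogue, sketched below with Python's standard concurrent.futures module. It is only an analogy: the worker threads stand in for the 64 processors, the names partial_sum, launch, and run_demo are invented for this example, and nothing here describes how the architecture itself would start or collect a parallel computation.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 64  # stands in for the 64 subsidiary processors

def partial_sum(chunk):
    """Work done independently by one worker."""
    return sum(chunk)

def launch(data, pool):
    """Hand one chunk to each worker and return at once, without waiting."""
    size = max(1, (len(data) + NUM_WORKERS - 1) // NUM_WORKERS)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [pool.submit(partial_sum, chunk) for chunk in chunks]

def run_demo():
    data = list(range(1000))
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = launch(data, pool)
        # The initiating code keeps running while the workers compute.
        other_work = len(data)
        # Only here does it block; a waiting thread gives up the CPU to others.
        total = sum(f.result() for f in futures)
    return other_work, total

print(run_demo())
```

Blocking on the result plays the role of relinquishing control to the operating system: the waiting code consumes no processor time until the workers finish.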