
Chapter 1 Architectural Overview
unit added. Other execution units are included to deal with off–chip access via load
and store instructions; and to deal with branch instruction execution. All six function
units, except the integer multiplier, produce their results in a single cycle.
High speed operation can only be obtained if as many as possible of the function
units can be kept productively busy during the same processor cycles. This will place
a heavy demand on instruction decoding and operand forwarding. Several instruc-
tions will have to be decoded in the same cycle and forwarded to the appropriate
execution unit. The demand for operands for these instructions will be considerably
higher than that faced by a scalar processor. The following sections describe
some of the difficulties encountered when attempting to execute more than one
instruction per cycle. Architectural techniques which overcome the inherent difficul-
ties are presented.
1.7.1 Instruction Issue and Data Dependency
The term instruction issue refers to the passing of an instruction from the processor
decode stage to an execution unit. With a scalar processor, instructions are issued
in–order, that is, in the order the decoder received them from cache or off–chip
memory. Instructions naturally complete in–order. However, with a
RISC processor out–of–order completion is not unusual for certain instructions. Typ-
ically load and store instructions are allowed to execute in parallel with other instruc-
tions. These instructions are issued in–order; they don’t complete immediately but
some time (a few cycles) later. The instructions following loads or stores are issued
and execute in parallel unless there are data dependencies. Dependencies arise
when, for example, a load instruction is followed by an operation on the loaded data.
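The load case just described can be illustrated with a small model. The following Python sketch is not taken from any 29K tool. The register names, the three–cycle load latency, and the limit of one issue per cycle are all assumptions chosen for illustration. Instructions issue in–order, and an instruction waits only when one of its source registers has not yet been produced.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers,
# and the number of cycles before the result can be read by a later instruction.
Instr = namedtuple("Instr", "name dest srcs latency")

def issue_in_order(program):
    """Return the issue cycle of each instruction (at most one issue per cycle)."""
    ready_at = {}        # register -> first cycle in which its new value is readable
    issue_cycles = []
    cycle = 0
    for ins in program:
        # In-order issue: stall here until every source operand is available.
        start = max([cycle] + [ready_at.get(reg, 0) for reg in ins.srcs])
        issue_cycles.append(start)
        if ins.dest is not None:
            ready_at[ins.dest] = start + ins.latency
        cycle = start + 1
    return issue_cycles

# Assumed example: the load takes 3 cycles; the first add is independent of it,
# the second add operates on the loaded data.
PROGRAM = [
    Instr("load", "r1", ("r2",), 3),
    Instr("add",  "r3", ("r4", "r5"), 1),
    Instr("add",  "r6", ("r1", "r3"), 1),
]

print(issue_in_order(PROGRAM))   # [0, 1, 3]
```

The independent add issues in the cycle after the load, while the add which uses the loaded data is held until cycle 3, when the load result becomes available.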
A superscalar processor can reduce total execution time for a code sequence if it
allows all instruction types to complete out–of–order. Instruction issue need not stop
after an instruction is issued to a function unit which takes multiple cycles to com-
plete. Consequently, function units with long latency may complete their operation
after a subsequent instruction issued to a low latency function unit. The Am29050
processor allows long latency floating–point operations to execute in parallel with
integer operations. The processor has an additional port on its register file for
writing–back the results of floating–point operations. An additional port is required
to avoid the contention which would arise with an integer operation writing back its
result at the same time. Most instructions are issued to an integer unit which, with a
RISC processor, has a latency of only one cycle. However, a superscalar processor is
very likely to have more than one integer unit, each operating in parallel.
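Out–of–order completion, and the write–back contention which an additional register file port avoids, can be sketched in the same way. The Python model below is only an illustration. The three–cycle floating–point latency, the unit names, and the port numbering are assumptions, not Am29050 specifications.

```python
from collections import Counter

def completion_cycles(issued):
    """issued: list of (name, unit, issue_cycle, latency) tuples."""
    return [(name, unit, issue + latency) for name, unit, issue, latency in issued]

def completion_order(issued):
    """Instruction names sorted by the cycle in which their result is written back."""
    return [name for name, _, _ in sorted(completion_cycles(issued), key=lambda t: t[2])]

def port_conflicts(issued, port_of_unit):
    """Cycles in which more than one result is written through the same port."""
    writes = Counter((port_of_unit[unit], done)
                     for _, unit, done in completion_cycles(issued))
    return sorted(cycle for (_, cycle), count in writes.items() if count > 1)

# Assumed example: one long latency floating-point operation followed by
# two single-cycle integer operations, issued in-order in successive cycles.
ISSUED = [
    ("fmul", "fp",  0, 3),
    ("add",  "int", 1, 1),
    ("sub",  "int", 2, 1),
]

SHARED_PORT    = {"fp": 0, "int": 0}   # a single write port for all results
SEPARATE_PORTS = {"fp": 1, "int": 0}   # an additional port for floating-point results

print(completion_order(ISSUED))                # ['add', 'fmul', 'sub']
print(port_conflicts(ISSUED, SHARED_PORT))     # [3]
print(port_conflicts(ISSUED, SEPARATE_PORTS))  # []
```

The add completes before the earlier issued fmul, which is out–of–order completion. With a single write port, the fmul and the sub both attempt to write back in cycle 3. With the additional port, the conflict does not arise.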