
Evaluating and Programming the 29K RISC Family
Figure 1-9 shows the operand busses supplying source operands from the
reorder buffer to the reservation stations. However, in some cases, when an
instruction is decoded and its operand register number is presented to the reorder
buffer, no entry is found. This indicates that the buffer currently holds no copy of
the required register. Consequently, the real register in the register file must be
accessed to obtain the data. For this reason the register file is provided with read
ports (4), which supply data to the operand bus.
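The lookup just described can be sketched in a few lines of Python. This is only a minimal illustration, not the processor's actual logic: the names ReorderBuffer, allocate, lookup and read_operand are hypothetical, the buffer is modeled as a simple list ordered from oldest to newest entry, the newest matching entry is assumed to be the current copy of a register, and results that are still pending are ignored.

```python
class ReorderBuffer:
    """Toy reorder buffer: a list of (register, value) entries, oldest first."""

    def __init__(self):
        self.entries = []

    def allocate(self, register, value):
        # Record a result destined for `register` (hypothetical helper).
        self.entries.append((register, value))

    def lookup(self, register):
        # Search newest-first so that the most recent copy wins.
        for reg, value in reversed(self.entries):
            if reg == register:
                return value
        return None  # the buffer holds no copy of this register


def read_operand(register, reorder_buffer, register_file):
    """Return (source, value) for one source operand."""
    value = reorder_buffer.lookup(register)
    if value is not None:
        return ("reorder buffer", value)
    # No entry found: fall back to the real register in the register file.
    return ("register file", register_file[register])


# Assumed example contents; the register names follow the 29K convention.
register_file = {"gr96": 7, "gr97": 100, "gr98": 5}
rob = ReorderBuffer()
rob.allocate("gr98", 15)  # a newer copy of gr98 held in the buffer

print(read_operand("gr98", rob, register_file))  # ('reorder buffer', 15)
print(read_operand("gr97", rob, register_file))  # ('register file', 100)
```

In this sketch gr98 is supplied by the reorder buffer, while gr97 has no entry and is therefore read from the register file.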
1.7.4 Branch Prediction
Out–of–order instruction issue places a heavy demand on instruction decoding.
If reservation stations are to be kept filled, instruction decode must proceed at a rate
equal to, or greater than, instruction execution. Otherwise, performance will be
limited by the ability to decode instructions. The major obstacle to efficient
decoder operation is branch instructions. Unfortunately, instruction sequences
typically contain only about five or six instructions before a further branch–type
instruction is encountered. Compilers that generate code specifically for
superscalar processors try to increase this critical parameter. Additionally, because
the target of a branch instruction need not be aligned on a cache block boundary,
the efficiency of the decoding process can be reduced further.
The decoder fetches instructions and places them into the instruction window
for issue by a function unit. If an average decode rate of more than two instructions
per cycle is to be achieved, it is likely that a four–instruction decoder (or better) will
be required. In fact, AMD’s product overview indicates a four–instruction decoder is
used. To study this further, first examine the code below. The first target sequence
begins at address label L13. The linker need not align the L13 label on a cache block
boundary (a cache block size of four instructions is assumed). The same
alignment issue occurs with the second target sequence beginning at label L14. The
decoder is presented with a complete cache block rather than sequential instructions
from within the block. This requires a 128–bit bus (four 32–bit instructions) between
the instruction cache and the decode unit, which is essential if instructions are to be
decoded in parallel.
Figure 1-10 shows a possible cache block assignment, assuming the target of the first
instruction sequence begins in the second entry of the cache block. The target of the
second sequence begins in the third instruction of the block.
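The effect of this alignment on decoding can be sketched in Python. The sketch rests on assumptions that go beyond the text: one cache block is presented to the decoder per cycle, the branch is taken so that the run ends with its delay slot, and the function names blocks_for_run and decode_rate are hypothetical. The run length of five corresponds to the instructions from L13 through the delay slot in the listing that follows.

```python
def blocks_for_run(start_slot, run_length, block_size=4):
    """Return the number of useful instructions in each cache block fetched
    for a run of `run_length` instructions entered at `start_slot` (0-based)."""
    if not 0 <= start_slot < block_size:
        raise ValueError("start_slot must lie inside the block")
    if run_length <= 0:
        raise ValueError("run_length must be positive")
    useful = []
    remaining = run_length
    slot = start_slot
    while remaining > 0:
        take = min(block_size - slot, remaining)
        useful.append(take)
        remaining -= take
        slot = 0  # later blocks are entered at their first instruction
    return useful


def decode_rate(start_slot, run_length, block_size=4):
    """Average useful instructions per block, assuming one block per cycle."""
    useful = blocks_for_run(start_slot, run_length, block_size)
    return sum(useful) / len(useful)


# Five instructions entered at the second entry of the block (slot 1).
print(blocks_for_run(1, 5))  # [3, 2]
print(decode_rate(1, 5))     # 2.5
# A six-instruction run entered at the last slot spills into a third block.
print(blocks_for_run(3, 6))  # [1, 4, 1]
print(decode_rate(3, 6))     # 2.0
```

Under these assumptions, a four-instruction decoder delivers well below four useful instructions per cycle on short runs, and an unfavorable entry slot can cost an additional block fetch.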
L13:                                ;target of a branch
        add     gr98,gr98,10        ;gr98 = gr98 + 10
        sll     gr99,gr99,2
        cpgt    gr97,gr97,gr98
        jmpt    gr97,L14            ;conditional branch to L14
        add     lr4,lr4,gr99        ;branch delay slot, see section 1.13
L15:
        load    0,0,gr97,lr4
        store   0,0,gr97,gr96