Evaluating and Programming the 29K RISC Family
the M bit is not set when a cache block is reallocated, the out–going block is not
copied back.
When a data cache is added to a processor, maintaining data consistency can be
difficult. Problems arise when more than one processor or data controller (such as
a DMA controller) accesses the same memory region. The Am29040 processor uses bus
snooping to solve this problem. The method relies on the processor monitoring all
accesses performed on the memory system. The processor intervenes or updates its
cache when an access is attempted on a currently cached data value. Cache
consistency is dealt with in detail in section 5.14.4.
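The interplay between the M bit, copy–back and bus snooping can be illustrated
with a small model. The following Python sketch is only a toy, written under
several assumptions: the class and method names are hypothetical, each block
holds a single word, and the snoop responses are a simplification rather than the
actual Am29040 protocol (see section 5.14.4 for that).

```python
class CacheBlock:
    """One cached block: its data plus a modified (M) bit."""

    def __init__(self, data):
        self.data = data
        self.modified = False


class ToyCopyBackCache:
    """Toy copy-back cache that snoops a shared memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, shared with other masters
        self.blocks = {}       # dict: address -> CacheBlock

    def load(self, addr):
        # On a miss, allocate a block and fill it from memory.
        if addr not in self.blocks:
            self.blocks[addr] = CacheBlock(self.memory[addr])
        return self.blocks[addr].data

    def store(self, addr, value):
        self.load(addr)                # allocate on a write miss (assumption)
        block = self.blocks[addr]
        block.data = value
        block.modified = True          # memory now holds a stale value

    def evict(self, addr):
        # Reallocation: the out-going block is copied back only if M is set.
        block = self.blocks.pop(addr)
        if block.modified:
            self.memory[addr] = block.data

    def snoop_read(self, addr):
        # Another bus master reads addr: intervene if the cache holds newer data.
        block = self.blocks.get(addr)
        if block is not None and block.modified:
            return block.data
        return self.memory[addr]

    def snoop_write(self, addr, value):
        # Another bus master writes addr: update the cached copy as well.
        self.memory[addr] = value
        block = self.blocks.get(addr)
        if block is not None:
            block.data = value
            block.modified = False     # cache and memory agree again


memory = {0x100: 1}
cache = ToyCopyBackCache(memory)
cache.store(0x100, 7)                  # memory still holds 1, M bit is set
dma_value = cache.snoop_read(0x100)    # a DMA read is served from the cache
```

Without the intervention in snoop_read, the DMA controller in this model would
read the stale value still held in memory.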
Via the MMU, each memory page can be separately marked as “non–cached”,
“copy–back”, or “write–through”. A two–word write–through buffer assists with
writes to memory. It enables multiple store instructions to be in execution
without the processor pipeline stalling. Data accesses which hit in the cache
require 2–cycle access times. Two cycles, rather than one, are required due to the
potentially high internal clock speed. However, load instructions do not cause
pipeline stalling if the instruction immediately following the load does not
require the data being accessed. The data cache operation is explained in detail
in section 5.14.2.
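The effect of the 2–cycle load on instruction scheduling can be sketched with a
simple cycle counter. The Python model below is an illustration only: it assumes
every load hits in the cache, that all other instructions issue in one cycle, and
that a dependent instruction directly after a load waits one extra cycle. The
function name and the (operation, destination, sources) tuple format are
hypothetical.

```python
def count_cycles(program, load_latency=2):
    """Count cycles for a list of (op, dest, sources) instruction tuples."""
    cycles = 0
    previous = None
    for op, dest, sources in program:
        cycles += 1
        if previous is not None:
            prev_op, prev_dest, _ = previous
            # Stall only when the instruction right after a load uses its result.
            if prev_op == "load" and prev_dest in sources:
                cycles += load_latency - 1
        previous = (op, dest, sources)
    return cycles


# The add needs r1 immediately after the load: one stall cycle.
dependent = [
    ("load", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),
    ("sub", "r5", ("r6", "r7")),
]

# Moving the independent sub between the load and the add hides the latency.
scheduled = [
    ("load", "r1", ("r2",)),
    ("sub", "r5", ("r6", "r7")),
    ("add", "r3", ("r1", "r4")),
]

dependent_cycles = count_cycles(dependent)   # 4 in this model
scheduled_cycles = count_cycles(scheduled)   # 3 in this model
```

In this model, reordering the same three instructions saves one cycle, which is
the kind of scheduling a compiler or assembly programmer would aim for.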
Scalable bus clocking is supported, enabling the processor to run at twice the
speed of the off–chip memory system. Scalable Clocking was first introduced with
the Am29030 processors, and is described in the previous section on the Am29030.
If cache hit rates are sufficiently high, Scalable Clocking enables high
performance systems to be built around relatively slow memory systems. It also
offers an excellent upgrade path when additional performance is required in the
future. The maximum on–chip clock speed is 50 MHz.
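The dependence on hit rate can be made concrete with a weighted average. The
Python sketch below is a rough model, not measured data: the 2–cycle hit time and
the 2:1 clock ratio come from the text, while the miss cost of four bus cycles
and the function name are assumptions chosen purely for illustration.

```python
def average_access_cycles(hit_rate, hit_cycles=2, miss_bus_cycles=4, clock_ratio=2):
    """Average data access time, in internal processor cycles.

    miss_bus_cycles is a hypothetical off-chip access cost in bus cycles;
    each bus cycle lasts clock_ratio internal cycles.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_cycles = miss_bus_cycles * clock_ratio
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


all_hits = average_access_cycles(1.0)     # every access costs the 2-cycle hit time
mostly_hits = average_access_cycles(0.75)
all_misses = average_access_cycles(0.0)   # every access pays the off-chip cost
```

As the hit rate approaches 1, the average approaches the on–chip hit time, so the
slower off–chip memory matters less and less.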
The Am29040 processor supports integer multiply directly. A latency of two cycles
applies to integer multiply instructions (most 29K instructions require only one
cycle). Again, this is a result of the potentially high internal clocking speeds
of the processor. Most 29K processors take a trap when an integer multiply is
attempted, leaving trapware to emulate the missing instruction. The ability to
perform high speed multiply makes the processor a better choice for
calculation–intensive applications such as digital signal processing. Note that
floating–point performance should also improve with the Am29040, as
floating–point emulation routines can make use of the integer multiply
instruction.
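To give a feel for the work an emulation routine must do when the multiply
instruction is missing, the following Python sketch multiplies two 32–bit values
using only shifts and adds. It is a generic shift–and–add illustration, not the
actual 29K trapware; the function name is hypothetical, and only the low 32 bits
of the product are produced.

```python
MASK32 = 0xFFFFFFFF


def emulated_multiply(a, b):
    """Return the low 32 bits of a * b using only shifts and adds."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                          # add the multiplicand for each set bit
            product = (product + a) & MASK32
        a = (a << 1) & MASK32              # next partial product
        b >>= 1
    return product


small = emulated_multiply(6, 7)
wrapped = emulated_multiply(0xFFFFFFFF, 2)   # the result wraps to 32 bits
```

The loop may run up to 32 times for a single multiply, which suggests why a
two–cycle hardware multiply is attractive for calculation–intensive code.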
The Am29040 has two Translation Look–Aside Buffers (TLBs). Having two TLBs enables
a larger number of virtual–to–physical address translations to be cached (held in
a TLB register) at any time. This reduces the TLB reload overhead. The TLB format
is similar to the arrangement used with the Am29243 microcontroller. Each TLB has
16 entries (8 sets, two entries per set). The page size used by each TLB can be
the same or different. If the TLB page sizes are the same, a four–way set
associative MMU can be constructed with supporting software. Alternatively, one
TLB can be used for code and the second, with a larger page size, for data buffers
or shared