In addition, the ECU supports DMA accesses that hit in the external cache and maintains data coherency between the external cache and main memory. The size of the external cache can be 256 Kbytes, 512 Kbytes, 1 Mbyte, or 2 Mbytes (the line size is always 64 bytes). Cache lines have only three states: modified, exclusive, or invalid.
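As a point of reference with the fixed 64-byte line size, these configurations correspond to 4,096 lines (256 Kbytes), 8,192 lines (512 Kbytes), 16,384 lines (1 Mbyte), and 32,768 lines (2 Mbytes).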
The combination of the load buffer and the ECU is fully pipelined. For programs with large data sets, instructions are scheduled with load latencies based on the L2-cache latency, so the L2-cache acts as a large primary cache. Floating-point applications use this feature to effectively “hide” D-cache misses. Coherency is maintained between all caches and external PCI DMA references.
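As an illustration of this scheduling style, the sketch below (plain C, with illustrative names; not taken from this data sheet) rotates a reduction loop so that each load is issued one iteration before its value is consumed, giving the load roughly a full loop iteration of L2-cache latency to complete:

    /* Minimal sketch: issue each load one iteration ahead of its use so an
     * L2-cache-latency load overlaps useful work instead of stalling.
     * A compiler scheduling for E-cache latency performs a similar rotation. */
    double sum_scheduled(const double *a, long n)   /* assumes n >= 1 */
    {
        double next = a[0];            /* first load issued up front          */
        double sum  = 0.0;
        for (long i = 0; i < n - 1; i++) {
            double cur = next;         /* value loaded in the previous pass   */
            next = a[i + 1];           /* start the load for the next pass    */
            sum += cur;                /* consume the already-loaded value    */
        }
        return sum + next;             /* fold in the final element           */
    }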
The ECU overlaps processing during load and store misses. Stores that hit the L2-cache can proceed while a load miss is being processed. The ECU is also capable of processing reads and writes without a costly turnaround penalty on the bidirectional L2-cache data bus.
Block loads and block stores (these move a 64-byte line of data between memory or the L2-cache and the floating-point register file) provide high transfer bandwidth. By not installing into the L2-cache on a miss, they avoid polluting the cache with data that is touched only once.
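A minimal sketch of how such a transfer can be expressed, assuming a GCC toolchain targeting 64-bit SPARC V9 (the function name is illustrative, and the ASI value 0xF0 is ASI_BLK_P, the primary-space block-transfer ASI):

    /* Copy one 64-byte block with the block load/store instructions
     * (LDDA/STDA using ASI 0xF0, ASI_BLK_P).  The transfer passes through
     * eight double-precision registers (%f0-%f14) and does not allocate
     * the line in the E-cache.  Assumes 64-byte-aligned pointers.  Since
     * this noinline function contains no other floating-point code and
     * %f0-%f14 are not preserved across calls by the V9 ABI, no FP
     * clobbers are declared; a production version would be more careful. */
    static void __attribute__((noinline))
    block_copy_64(const void *src, void *dst)
    {
        __asm__ __volatile__(
            "ldda   [%0] 0xf0, %%f0 \n\t"   /* block load 64 bytes into %f0-%f14 */
            "membar #Sync           \n\t"   /* wait for the block load to finish */
            "stda   %%f0, [%1] 0xf0 \n\t"   /* block store %f0-%f14 to memory    */
            "membar #Sync"                  /* order the store before later code */
            : /* no outputs */
            : "r" (src), "r" (dst)
            : "memory");
    }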
The ECU also provides support for multiple outstanding data transfer requests to the MCU and PBM.
Memory Controller Unit (MCU)
All transactions to the DRAM and UPA64S subsystems are handled by the MCU. The external pins controlled
by the MCU operate at divisions of the processor clock:
• UPA64S runs at 1/4 the processor clock rate.
• Data transfers to the DRAM transceivers are programmable to occur typically at 1/4, 1/5, or 1/6 of the processor clock rate.
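For example, at a 440 MHz processor clock, the UPA64S interface runs at 110 MHz, and the DRAM transceiver transfer rate can be programmed to 110 MHz (divide by 4), 88 MHz (divide by 5), or roughly 73 MHz (divide by 6).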
External data transceivers allow the DRAM data to be twice as wide as data from the processor’s MEMDATA
pins, so the EDO CAS cycle is only 26.5 ns at 440 MHz. The MCU supports 50 or 60 ns EDO DRAMs from
many major vendors.
Use of faster DRAMs allows higher-than-quoted performance, because the various components of memory delay are programmable.
Instruction Cache (I-Cache)
The I-cache is a 16-Kbyte, two-way set-associative cache with 32-byte blocks. The cache is physically indexed and physically tagged. The set is predicted as part of the “next field” so that only the index bits of an address are necessary to address the cache. (This means only 13 bits, which matches the minimum page size.) The instruction cache returns up to four instructions from a line that is eight instructions wide.
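In other words, each way spans 16 Kbytes / 2 = 8 Kbytes (2^13 bytes), so the 13 index and offset bits lie entirely within the smallest (8-Kbyte) page; they are therefore identical in the virtual and physical address, and the cache can be indexed without waiting for address translation.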
Data Cache (D-Cache)
The data cache is a write-through, non-allocating, 16-Kbyte, direct-mapped cache with two 16-byte sub-blocks per line. It is virtually indexed and physically tagged. The tag array is dual-ported so that tag updates due to line fills do not collide with tag reads for incoming loads. Snoops to the D-cache use the second tag port so that an incoming load can proceed without being held up by a snoop.
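The sketch below shows, in plain C (illustrative names only), how a data address decomposes for a cache of this geometry: a 5-bit byte offset within the 32-byte line (with bit 4 selecting the 16-byte sub-block) and a 9-bit index selecting one of the 512 lines:

    #include <stdint.h>
    #include <stdio.h>

    /* Geometry from the description above: 16 KB direct-mapped,
     * 32-byte lines made of two 16-byte sub-blocks => 512 lines. */
    #define DC_LINE_BYTES  32u
    #define DC_SIZE_BYTES  (16u * 1024u)
    #define DC_LINES       (DC_SIZE_BYTES / DC_LINE_BYTES)   /* 512 */

    int main(void)
    {
        uint64_t va     = 0x12345678ull;                    /* arbitrary example address  */
        uint64_t offset = va & (DC_LINE_BYTES - 1);         /* byte offset, bits [4:0]    */
        uint64_t sub    = (offset >> 4) & 1;                /* 16-byte sub-block, bit [4] */
        uint64_t index  = (va / DC_LINE_BYTES) % DC_LINES;  /* line index, bits [13:5]    */

        printf("index %llu, sub-block %llu, byte-in-sub-block %llu\n",
               (unsigned long long)index, (unsigned long long)sub,
               (unsigned long long)(offset & 0xfu));
        return 0;
    }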