
TM1300 Data Book
Philips Semiconductors
overhead of the inner loop has been eliminated, further
increasing the performance advantage.
4.4.2 More Unrolling
The code transformations of the previous section
achieved impressive performance improvements, but
given the VLIW nature of the TM1300 CPU, more can be
done to exploit TM1300’s parallelism.
The code in Figure 4-12 has a loop containing only 4
operations (excluding loop overhead). Since TM1300’s
branches have a 3-instruction delay and each instruction
can contain up to 5 operations, the shortest possible loop
spans 4 instructions and offers 20 operation slots. A fully
utilized minimum-sized loop can therefore contain 16
operations (20 minus loop overhead).
The TM1300 compilation system performs a wide variety
of powerful code transformation and scheduling
optimizations to ensure that the VLIW capabilities of the CPU
are exploited. It is still wise, however, to make program
parallelism explicit in source code when possible. Explicit
parallelism can only help the compiler produce a
fast-running program.
To this end, we can unroll the loop of Figure 4-12 some
number of times to create explicit parallelism and help
the compiler create a fast-running loop. In this case,
where the number of iterations is a power of two, it
makes sense to unroll by a factor that is also a power of
two, so that the factor divides the iteration count evenly
and the code stays clean.
Figure 4-15 shows the loop unrolled by a factor of eight.
The compiler can apply common sub-expression
elimination and other optimizations to eliminate extraneous
operations in the array indexing, but, again,
improvements in the source code can only help the compiler
produce the best possible code and fastest-running
program.
Figure 4-16 shows one way to modify the code for
simpler array indexing.
Figure 4-14. The loop of Figure 4-13 recoded with 32-bit array accesses and the ume8uu custom operation.
/* View the byte arrays as 32-bit words; each word packs 4 pixels. */
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (row = 0; row < 16; row += 1)
{
    int rowoffset = row * 4;    /* 4 words per 16-pixel row */
    for (col4 = 0; col4 < 4; col4 += 1)
        cost += UME8UU(IA[rowoffset + col4], IB[rowoffset + col4]);
}
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
/* 64 words in total, 8 per pass, so the loop body runs 8 times. */
for (i = 0; i < 64; i += 8)
{
    /* Eight independent operations the scheduler can issue in parallel. */
    cost0 = UME8UU(IA[i+0], IB[i+0]);
    cost1 = UME8UU(IA[i+1], IB[i+1]);
    cost2 = UME8UU(IA[i+2], IB[i+2]);
    cost3 = UME8UU(IA[i+3], IB[i+3]);
    cost4 = UME8UU(IA[i+4], IB[i+4]);
    cost5 = UME8UU(IA[i+5], IB[i+5]);
    cost6 = UME8UU(IA[i+6], IB[i+6]);
    cost7 = UME8UU(IA[i+7], IB[i+7]);
    cost += cost0 + cost1 + cost2 +
            cost3 + cost4 + cost5 +
            cost6 + cost7;
}
Figure 4-15. Unrolled version of Figure 4-12. This
code makes good use of TM1300’s VLIW capabilities.
unsigned char A[16][16];
unsigned char B[16][16];
...
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
/* Advance the pointers each pass instead of recomputing i+k offsets. */
for (i = 0; i < 64; i += 8, IA += 8, IB += 8)
{
    cost0 = UME8UU(IA[0], IB[0]);
    cost1 = UME8UU(IA[1], IB[1]);
    cost2 = UME8UU(IA[2], IB[2]);
    cost3 = UME8UU(IA[3], IB[3]);
    cost4 = UME8UU(IA[4], IB[4]);
    cost5 = UME8UU(IA[5], IB[5]);
    cost6 = UME8UU(IA[6], IB[6]);
    cost7 = UME8UU(IA[7], IB[7]);
    cost += cost0 + cost1 + cost2 +
            cost3 + cost4 + cost5 +
            cost6 + cost7;
}
Figure 4-16. Code from Figure 4-15 with simplified
array index calculations.