Philips Semiconductors
Custom Operations for Multimedia
PRELIMINARY SPECIFICATION
4-9
Excluding the array accesses, the loop body in
Figure 4-11
is now recognizable as the function per-
formed by the ume8uu custom operation: the sum of 4
absolute values of 4 differences. To use the ume8uu op-
eration, however, the code must access the arrays with
32-bit word pointers instead of with 8-bit byte pointers.
Figure 4-13
shows the loop recoded to access A[][] and
B[][] as one-dimensional instead of two-dimensional ar-
rays. We take advantage of our knowledge of C-lan-
guage array storage conventions to perform this code
transformation. Recoding to use one-dimensional arrays
prepares the code for transformation to 32-bit array ac-
cesses.
(From here on, until the final code is shown, the declara-
tions of the A and B arrays will be omitted from the code
fragments for the sake of brevity.)
Figure 4-14
shows the loop of
Figure 4-13
recoded to
use ume8uu. Once again taking advantage of our knowl-
edge of the C-language array storage conventions, the
one-dimensional byte array is now accessed as a one-di-
mensional 32-bit-word array. The declarations of the
pointers IA and IB as pointers to integers is the key, but
also notice that the multiplier in the expression for row
offset has been scaled from 16 to 4 to account for the fact
that there are 4 bytes in a 32-bit word.
Of course, since we are now using one-dimensional ar-
rays to access the pixel data, it is natural to use a single
for loop instead of two.
Figure 4-12
shows this stream-
lined version of the code without the inner loop. Since C-
language arrays are stored as a linear vector of values,
we can simply increase the number of iterations of the
outer loop from 16 to 64 to traverse the entire array.
The recoding and use of the ume8uu operation has re-
sulted in a substantial improvement in the performance
of the match-cost loop. In the original version, the code
executed 1280 operations (including loads, adds, sub-
tracts, and absolute values); in the restructured version,
there are only 256 operations
—
128 loads, 64 ume8uu
operations, and 64 additions. This is a factor of five re-
duction in the number of operations executed. Also, the
unsigned char A[16][16];
unsigned char B[16][16];
.
.
.
for (row = 0; row < 16; row += 1)
{
for (col = 0; col < 16; col += 4)
{
cost0 = abs(A[row][col+0] – B[row][col+0]);
cost1 = abs(A[row][col+1] – B[row][col+1]);
cost2 = abs(A[row][col+2] – B[row][col+2]);
cost3 = abs(A[row][col+3] – B[row][col+3]);
cost += cost0 + cost1 + cost2 + cost3;
Figure 4-11. Parallel version of
Figure 4-10
.
Figure 4-12. The loop of
Figure 4-14
with the inner
loop eliminated.
unsigned int *IA = (unsigned int *) A;
unsigned int *IB = (unsigned int *) B;
for (i = 0; i < 64; i += 1)
cost += UME8UU(IA[i], IB[i]);
Figure 4-13. The loop of
Figure 4-11
recoded with one-dimensional array accesses.
unsigned char A[16][16];
unsigned char B[16][16];
.
.
.
unsigned char *CA = A;
unsigned char *CB = B;
for (row = 0; row < 16; row += 1)
{
int rowoffset = row * 16;
for (col = 0; col < 16; col += 4)
{
cost0 = abs(CA[rowoffset + col+0] – CB[rowoffset + col+0]);
cost1 = abs(CA[rowoffset + col+1] – CB[rowoffset + col+1]);
cost2 = abs(CA[rowoffset + col+2] – CB[rowoffset + col+2]);
cost3 = abs(CA[rowoffset + col+3] – CB[rowoffset + col+3]);
cost += cost0 + cost1 + cost2 + cost3;