Philips Semiconductors
Custom Operations for Multimedia
PRELIMINARY SPECIFICATION
4-5
A straightforward coding of the reconstruction algorithm
might look as shown in
Figure 4-4
. This implementation
shares many of the undesirable properties of the first ex-
ample of byte-matrix transposition. The code accesses
memory a byte at a time instead of a word at a time,
which wastes 75% of the available bandwidth. Also, in
light of the many quad-byte-parallel operations intro-
duced in
Section 4.1.2,
“
Introduction to Custom Opera-
tions,
”
it seems inefficient to spend three separate addi-
tions and one shift to process a single eight-bit pixel.
Perhaps even more unfortunate for a VLIW processor
like PNX1300 is the branch-intensive code that performs
the saturation testing; eliminating these branches could
reap a significant performance gain.
Since MPEG decoding is the kind of task for which
PNX1300 was created, there are two custom opera-
tions
—
quadavg and dspuquadaddui
—
that exactly fit this
important MPEG kernel (and other kernels). These cus-
tom operations process four pairs of 8-bit pixel values in
parallel. In addition, dspuquadaddui performs saturation
tests in hardware, which eliminates any need to execute
explicit tests and branches.
For readers familiar with the details of MPEG algorithms,
the use of eight-bit IDCT values later in this example may
be confusing. The standard MPEG implementation calls
for nine-bit IDCT values, but extensive analysis has
shown that values outside the range [
–
128..127] occur
so rarely that they can be considered unimportant. Pur-
suant to this observation, the IDCT values are clipped
into the eight-bit range [
–
128..127] with saturating arith-
metic before the frame reconstruction code runs. The as-
sumption that this saturation occurs permits some of
PNX1300
’
s custom operations to have clean, simple def-
initions.
The first step in seeing how custom operations can be of
value in this case, is to unroll the loop by a factor of four.
The unrolled code is shown in
Figure 4-5
. This creates
code that is parallel with respect to the four pixel compu-
tations. As it is easily seen in the code, the four groups of
computations (one group per pixel) do not depend on
each other.
After some experience is gained with custom operations,
it is not necessary to unroll loops to discover situations
where custom operations are useful. Often, a good pro-
grammer with knowledge of the function of the custom
operations can see by simple inspection opportunities to
exploit custom operations.
To understand how quadavg and dspuquadaddui can be
used in this code, we examine the function of these cus-
tom operations.
The quadavg custom operation performs pixel averaging
on four pairs of pixels in parallel. Formally, the operation
of quadavg is as follows:
quadavg rscr1 rsrc2 -> rdest
takes arguments in registers rsrc1 and rsrc2, and it com-
putes a result into register rdest. rsrc1 = [abcd], rsrc2 =
[wxyz], and rdest = [pqrs] where a, b, c, d, w, x, y, z, p, q,
r, and s are all unsigned eight-bit values. Then, quadavg
computes the output vector [pqrs] as follows:
p = (a + w + 1) >> 1
q = (b + x + 1) >> 1
r = (c + y + 1) >> 1
s = (d + z + 1) >> 1
The pixel averaging in
Figure 4-5
is evident in the first
statement of each of the four groups of statements. The
rest of the code
—
adding idct[i] value and performing the
saturation test
—
can be performed by the dspuquadad-
dui operation. Formally, its function is as follows:
dspuquadaddui rsrc1 rsrc2 -> rdest
takes arguments in registers rsrc1 and rsrc2, and it com-
putes a result into register rdest. rsrc1 = [efgh], rsrc2 =
[stuv], and rdest = [ijkl] where e, f, g, h, i, j, k, and l are
unsigned 8-bit values; s, t, u, and v are signed 8-bit val-
ues. Then, dspuquadaddui computes the output vector
[ijkl] as follows:
i = uclipi(e + s, 255)
j = uclipi(f + t, 255)
k = uclipi(g + u, 255)
l = uclipi(h + v, 255)
The uclipi operation is defined in this case as it is for the
separate PNX1300 operation of the same name de-
scribed in
Appendix A,
“
PNX1300/01/02/11 DSPCPU
Operations,
”
. Its definition is as follows:
void reconstruct (unsigned char *back,
unsigned char *forward,
char *idct,
unsigned char *destination)
{
int i, temp;
for (i = 0; i < 64; i += 1)
{
temp = ((back[i] + forward[i] + 1) >> 1) + idct[i];
if (temp > 255)
temp = 255;
else if (temp < 0)
temp = 0;
destination[i] = temp;
}
}
Figure 4-4. Straightforward code for MPEG frame reconstruction.