Philips Semiconductors
Custom Operations for Multimedia
PRELIMINARY SPECIFICATION
4.1.3 Example Uses of Custom Ops
The next three sections illustrate the advantages of using
custom operations. The more complex examples also show how
custom operations can be integrated into application code,
using listings of C-language program fragments. The examples
progress from simple to intricate; the most interesting ones
are taken from actual multimedia code, such as MPEG
decompression.
4.2 EXAMPLE 1: BYTE-MATRIX TRANSPOSITION
The goal of this example is to provide a simple, introductory
illustration of how custom operations can significantly
increase processing speed in small kernels of applications.
As in most uses of custom operations, the speedup in this
case comes from their ability to operate on multiple data
items in parallel.
Imagine that our task is to transpose a packed, 4-by-4
matrix of bytes in memory; the matrix might, for example,
contain 8-bit pixel values. Figure 4-1 illustrates both the
organization of the matrix in memory and the task to be
performed in standard mathematical notation.
Performing this operation with traditional microprocessor
instructions is straightforward but time-consuming. One
way to perform the manipulation is to perform 12 load-
byte instructions (since only 12 of the 16 bytes need to
be repositioned) and 12 store-byte instructions that place
the bytes back in memory in their new positions. Another
way would be to perform four load-word instructions, re-
position the bytes in registers, and then perform four
store-word instructions. Unfortunately, repositioning the
bytes in registers would require a large number of in-
structions to properly shift and mask the bytes. Perform-
ing the 24 loads and stores makes implicit use of the
shifting and masking hardware in the load/store units and
thus yields a shorter instruction sequence.
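The byte-at-a-time approach described above can be sketched in portable C. This is an illustrative sketch only, not code from the data book; the function name `transpose4x4_bytewise` is hypothetical. It exchanges the six off-diagonal byte pairs, which corresponds to the 12 byte loads and 12 byte stores counted in the text.

```c
#include <stdint.h>

/* Hypothetical helper: transpose a packed 4-by-4 byte matrix in place,
 * one byte at a time. The 4 diagonal bytes never move; the other 12 are
 * exchanged in 6 pairs, i.e. 12 byte loads and 12 byte stores. */
void transpose4x4_bytewise(uint8_t m[16])
{
    for (int row = 0; row < 4; row++) {
        for (int col = row + 1; col < 4; col++) {
            uint8_t upper = m[4 * row + col];   /* load byte  */
            uint8_t lower = m[4 * col + row];   /* load byte  */
            m[4 * row + col] = lower;           /* store byte */
            m[4 * col + row] = upper;           /* store byte */
        }
    }
}
```

Every one of those 24 accesses goes through the load/store unit, which is the cost the following paragraph discusses.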
The problem with performing 24 loads and stores is that
loads and stores are inherently slow operations because
they must access at least the cache and possibly slower
layers in the memory hierarchy. Further, performing byte
loads and stores when 32-bit word-wide accesses run
just as fast wastes the power of the cache/memory inter-
face. We would prefer a fast algorithm that takes full ad-
vantage of cache/memory bandwidth while not requiring
an inordinate number of byte-manipulation instructions.
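For comparison, the word-wide alternative that relies on generic shifts and masks might look like the sketch below. It assumes, following the bit numbering in Figure 4-1, that the first byte of each row sits in bits 31..24 of its word; the names `byte_of` and `transpose4x4_shift_mask` are hypothetical. Memory traffic drops to four word loads and four word stores, but every output word needs several shift, mask, and OR steps.

```c
#include <stdint.h>

/* Hypothetical helper: extract byte `col` of a packed word,
 * where col 0 is the most significant byte (bits 31..24). */
static uint32_t byte_of(uint32_t word, int col)
{
    return (word >> (24 - 8 * col)) & 0xFFu;
}

/* Transpose a 4-by-4 byte matrix held in four packed 32-bit words,
 * using only word accesses plus generic shift, mask, and OR steps. */
void transpose4x4_shift_mask(uint32_t m[4])
{
    /* Four word loads. */
    uint32_t r0 = m[0], r1 = m[1], r2 = m[2], r3 = m[3];
    uint32_t out[4];

    /* Output row `col` collects byte `col` from each input row. */
    for (int col = 0; col < 4; col++) {
        out[col] = (byte_of(r0, col) << 24)
                 | (byte_of(r1, col) << 16)
                 | (byte_of(r2, col) << 8)
                 |  byte_of(r3, col);
    }

    /* Four word stores. */
    for (int row = 0; row < 4; row++) {
        m[row] = out[row];
    }
}
```

The byte-manipulation work in the loop is what the merge and pack operations described next are meant to replace.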
PNX1300 has custom operations that merge and pack bytes
and 16-bit halfwords directly and in parallel. Four of
these operations can be applied in this case to speed up
the manipulation of bytes that are packed into words.
Figure 4-2 shows the application of these operations to
the byte-matrix transposition problem, and the left side of
Figure 4-3 shows a list of the operations needed to
implement the matrix transpose. When assembled into actual
PNX1300 instructions, these custom operations
would be packed as tightly as dependencies allow, up to
five operations per instruction.
Note that a programmer would not need to program at
this level (PNX1300 assembler). The matrix transpose
can be expressed just as efficiently in C-language
source code, as shown on the right side of Figure 4-3.
The low-level code is shown here for illustration purposes
only.
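To make the merge-then-pack data flow concrete on a machine without these operations, the sketch below emulates them in portable C. It is not the listing from Figure 4-3. The `emu_` functions are hypothetical stand-ins, and their semantics are assumptions: `mergemsb` and `mergelsb` are taken to interleave the two most-significant (or least-significant) bytes of two words, and the halfword pack operations, assumed here to be named `pack16msb` and `pack16lsb`, are taken to combine the upper (or lower) 16 bits of two words. Byte `a` is assumed to occupy bits 31..24, as in Figure 4-1.

```c
#include <stdint.h>

/* Assumed semantics: result bytes are a3 b3 a2 b2 (byte 3 = bits 31..24). */
static uint32_t emu_mergemsb(uint32_t a, uint32_t b)
{
    return (a & 0xFF000000u) | ((b >> 8) & 0x00FF0000u)
         | ((a >> 8) & 0x0000FF00u) | ((b >> 16) & 0x000000FFu);
}

/* Assumed semantics: result bytes are a1 b1 a0 b0. */
static uint32_t emu_mergelsb(uint32_t a, uint32_t b)
{
    return ((a << 16) & 0xFF000000u) | ((b << 8) & 0x00FF0000u)
         | ((a << 8) & 0x0000FF00u) | (b & 0x000000FFu);
}

/* Assumed semantics: upper halfword of a, then upper halfword of b. */
static uint32_t emu_pack16msb(uint32_t a, uint32_t b)
{
    return (a & 0xFFFF0000u) | (b >> 16);
}

/* Assumed semantics: lower halfword of a, then lower halfword of b. */
static uint32_t emu_pack16lsb(uint32_t a, uint32_t b)
{
    return (a << 16) | (b & 0x0000FFFFu);
}

/* Transpose four packed rows: 4 merges, then 4 packs. */
void transpose4x4_custom_ops(uint32_t m[4])
{
    uint32_t t0 = emu_mergemsb(m[0], m[1]);   /* a e b f */
    uint32_t t1 = emu_mergemsb(m[2], m[3]);   /* i m j n */
    uint32_t t2 = emu_mergelsb(m[0], m[1]);   /* c g d h */
    uint32_t t3 = emu_mergelsb(m[2], m[3]);   /* k o l p */

    m[0] = emu_pack16msb(t0, t1);             /* a e i m */
    m[1] = emu_pack16lsb(t0, t1);             /* b f j n */
    m[2] = emu_pack16msb(t2, t3);             /* c g k o */
    m[3] = emu_pack16lsb(t2, t3);             /* d h l p */
}
```

On the real hardware each `emu_` call would be a single custom operation, so the whole transpose reduces to four word loads, eight custom operations, and four word stores.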
The first sequence of four load-word operations in
Figure 4-3 brings the packed words of the input matrix
into registers R10, R11, R12, and R13. The next sequence
of four merge operations places intermediate results in
registers R14, R15, R16, and R17. The next
sequence of four pack operations could then replace the
original operands or place the transposed matrix in sep-
arate registers if the original matrix operands were need-
Table 4-2. Key Multimedia Custom Operations Listed
by Operand Size

Op. Size  Custom Op      Description
--------  -------------  ------------------------------------------
8-bit     quadumax       Unsigned bytewise quad max
          quadumin       Unsigned bytewise quad min
          dspuquadaddui  Quad clipped add of unsigned/signed bytes
          ifir8ii        Signed sum of products of signed bytes
          ifir8iu        Signed sum of products of signed/unsigned
                         bytes
          ufir8uu        Unsigned sum of products of unsigned bytes
          mergelsb       Merge least-significant bytes
          mergemsb       Merge most-significant bytes
          packbytes      Pack least-significant bytes
          quadavg        Unsigned bytewise quad average
          quadumulmsb    Unsigned quad 8-bit multiply most
                         significant
          ume8ii         Unsigned sum of absolute values of signed
                         8-bit differences
          ume8uu         Unsigned sum of absolute values of unsigned
                         8-bit differences
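As one illustration of what a table entry means, the description of `ume8uu` (unsigned sum of absolute values of unsigned 8-bit differences) can be written out in portable C. This is a sketch of the described arithmetic only, under the assumption that the operation treats each 32-bit operand as four packed unsigned bytes; `emu_ume8uu` is a hypothetical name, not a PNX1300 intrinsic.

```c
#include <stdint.h>

/* Hypothetical emulation: treat x and y as four packed unsigned bytes
 * and return the sum of the absolute byte-wise differences. */
uint32_t emu_ume8uu(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int32_t xb = (int32_t)((x >> shift) & 0xFFu);
        int32_t yb = (int32_t)((y >> shift) & 0xFFu);
        int32_t diff = xb - yb;
        sum += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}
```

A sum of absolute differences of this kind is a common building block in block-matching kernels, where one call would cover four pixels at a time.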
Memory (bit 31 at left, bit 0 at right):

          Row Major                  Column Major
n+0:      a  b  c  d                 a  e  i  m
n+4:      e  f  g  h    Transpose    b  f  j  n
n+8:      i  j  k  l    -------->    c  g  k  o
n+12:     m  n  o  p                 d  h  l  p

Matrix notation:

| a  b  c  d |                 | a  e  i  m |
| e  f  g  h |    Transpose    | b  f  j  n |
| i  j  k  l |    -------->    | c  g  k  o |
| m  n  o  p |                 | d  h  l  p |

Figure 4-1. Byte-matrix transposition. Top shows
byte matrices packed into memory words; bottom
shows mathematical matrix representation.