SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

1.5 SIMD Width

VCOP has 8 lanes of 40 bits each. When mapped to 32-bit lanes on C7x, there are 16 lanes available, potentially doubling the throughput of a kernel.

Many kernels are written to be independent of the SIMD width, using the macro VCOP_SIMD_WIDTH to abstract the number of lanes. Some of these kernels can be successfully built in host emulation mode for wider (or narrower) machines simply by changing the value of the macro. In host emulation mode, VCOP_SIMD_WIDTH must now be defined on the command line or before inclusion of vcop_host_emulation.h.
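As a minimal sketch of what width independence means, the plain C fragment below models a kernel whose loop structure depends only on VCOP_SIMD_WIDTH. It does not use vcop_host_emulation.h or any VCOP types; the function name add_vectors_simd and the fallback value of 8 are assumptions made for illustration. The macro is guarded with #ifndef so that a value supplied on the command line (for example, -DVCOP_SIMD_WIDTH=16 with a typical host compiler) takes precedence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Honor a value given on the command line; 8 is an assumed fallback
 * that matches the original VCOP lane count. */
#ifndef VCOP_SIMD_WIDTH
#define VCOP_SIMD_WIDTH 8
#endif

/* Hypothetical width-independent kernel: element-wise add of two arrays.
 * The inner loop stands in for the lanes of one vector operation.
 * Returns the number of vector iterations performed. */
static size_t add_vectors_simd(const int32_t *a, const int32_t *b,
                               int32_t *out, size_t n)
{
    /* The data layout must suit the chosen width. */
    assert(n % VCOP_SIMD_WIDTH == 0);

    size_t iterations = 0;
    for (size_t i = 0; i < n; i += VCOP_SIMD_WIDTH) {
        for (size_t lane = 0; lane < VCOP_SIMD_WIDTH; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
        ++iterations;
    }
    return iterations;
}
```

For 64 elements, this sketch performs 64 / VCOP_SIMD_WIDTH vector iterations, so rebuilding with a width of 16 instead of 8 halves the iteration count without touching the kernel body.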

The SIMD width used by VCC is controlled by the --vcop_simd option. Kernels that qualify for SIMD 16 are not automatically detected or transformed, so for a SIMD width of 16, --vcop_simd=16 must be specified explicitly. This option controls the translation sequence calls to the VM. To support this option, the generated C source file also defines VCOP_SIMD_WIDTH.

Some kernels depend on a specific SIMD width and will not work correctly if extended to 16-way SIMD. Furthermore, increasing the SIMD factor may depend on certain properties of the data layout in memory. For example, image widths may be required to be multiples of 16 instead of 8. It is not possible for the migration tool to automatically detect these cases.
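Because the tool cannot detect layout requirements such as the image-width constraint, a caller may want to check them explicitly before switching widths. The helpers below are a hedged sketch in plain C; the names width_fits_simd and pad_width_to_simd are hypothetical and are not part of the VCOP or C7x tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if width is a whole number of vectors for the given lane count. */
static bool width_fits_simd(size_t width, size_t simd_width)
{
    return simd_width != 0 && width % simd_width == 0;
}

/* Round width up to the next multiple of simd_width, for example to
 * decide how much row padding a wider SIMD factor would need. */
static size_t pad_width_to_simd(size_t width, size_t simd_width)
{
    assert(simd_width != 0);
    return (width + simd_width - 1) / simd_width * simd_width;
}
```

An image width of 24 satisfies the 8-way requirement but not the 16-way one; under this sketch it would have to be padded to 32 before a 16-way kernel could process whole rows.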

The following are examples of VCOP operations that cannot be trivially extended to 16-way SIMD.

VBITPK – if the kernel assumes results are 8-bit values
VBITTR – assumes 8x8 transpose
VBITUNPK – if the bit mask (src1[0]) is assumed to be 8 bits
Interleave/De-interleave, including de-interleaving loads, interleaving stores, and vector operations – if the kernel assumes interleaving on 8-lane boundaries. (If the kernel avoids making assumptions about vector sizes or layouts, for example simply using de-interleave-on-read and interleave-on-write to improve throughput without regard to layout, then interleaving can be extended to wider SIMD widths.)
Load with custom distribution – the kernel C source format is tied to 8 lanes
Load with expand – if the kernel assumes an 8-bit predicate
Load with nbits – if the kernel assumes an 8-bit type for the packed bit vector in memory
Lookup table and histogram – the table layout in memory is tied to VCOP’s 8-bank memory architecture, but certain cases of 16-way lookup and histogram are supported. See Section 5.5.8.
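Several of the items above share one failure mode: a value that carries one bit per lane fits in 8 bits at 8-way SIMD but needs 16 bits at 16-way SIMD. The sketch below illustrates this with a simplified plain C model of bit packing. The function pack_lane_bits and its convention (bit i is set when lane i is non-zero) are assumptions for illustration only and do not reproduce the documented VBITPK semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of bit packing: bit i of the result is set when
 * lane i is non-zero. This is an assumed convention for illustration,
 * not the documented VBITPK behavior. */
static uint32_t pack_lane_bits(const int32_t *lanes, size_t num_lanes)
{
    assert(num_lanes <= 32);
    uint32_t packed = 0;
    for (size_t i = 0; i < num_lanes; ++i) {
        if (lanes[i] != 0) {
            packed |= (uint32_t)1 << i;
        }
    }
    return packed;
}
```

With 16 lanes in which only lanes 0 and 9 are non-zero, this model yields 0x0201. Storing that result in an 8-bit type keeps only 0x01, silently dropping the contribution of lanes 8 to 15, which is the kind of assumption that has to be found and fixed by hand.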