SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1
VCOP’s fundamental load operation fetches a vector of 8 consecutive elements from memory into the 8 lanes of a vector register. In memory the elements are 8-, 16-, or 32-bit signed or unsigned values; they are sign- or zero-extended into the 40-bit lanes of the register. On C7x the migration tool models VCOP’s 40-bit lanes as 32-bit lanes; a vector register contains 16 of these lanes. (In 8-way SIMD mode only 8 of these 16 lanes are used; in 16-way SIMD mode all 16 are used.)
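The widening behavior described above can be illustrated with a small host-side model. This is only a sketch in portable C++, not code from the migration tool: the names LaneVector and model_load are hypothetical, and the model assumes that a 32-bit lane simply holds the element's value after the usual C++ integral conversion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Host-side model only: one 32-bit lane per element, mirroring how the
// migration tool models VCOP's 40-bit lanes on C7x. Names are hypothetical.
template <std::size_t Lanes>
using LaneVector = std::array<std::int32_t, Lanes>;

// Read `Lanes` consecutive elements of type T starting at `base` and widen
// each one into a 32-bit lane. A signed T is sign-extended and an unsigned T
// is zero-extended by the integral conversion.
template <typename T, std::size_t Lanes = 8>
LaneVector<Lanes> model_load(const std::uint8_t* base) {
    static_assert(std::is_integral<T>::value && sizeof(T) <= 4,
                  "elements are 8-, 16-, or 32-bit integers");
    assert(base != nullptr);
    LaneVector<Lanes> out{};
    for (std::size_t i = 0; i < Lanes; ++i) {
        T element;
        // memcpy avoids alignment and aliasing problems on the host.
        std::memcpy(&element, base + i * sizeof(T), sizeof(T));
        out[i] = static_cast<std::int32_t>(element);
    }
    return out;
}
```

Reading the same 16 bits as a signed and as an unsigned element shows the difference: the bit pattern 0xFFFF becomes -1 in the first case and 65535 in the second.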
VCOP has various additional “distribution modes” that provide for alternate data layouts in memory: for example, as the data is read in it may be oversampled (read each element multiple times), undersampled (skip elements), deinterleaved, and so on. Descriptions of the distribution modes can be found in the VCOP CPU manual or the programmer’s guide.
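As a rough illustration of what such layouts mean, the following host-side sketch models three generic patterns: oversampling by 2, undersampling by 2, and deinterleaving. The function names and the factor of 2 are assumptions chosen for the example; the exact definition of each VCOP distribution mode is given in the VCOP documentation and may differ in detail.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes8 = std::array<std::int32_t, 8>;

// Oversample by 2: each memory element is read twice (a a b b c c d d).
// Consumes 4 source elements.
inline Lanes8 oversample2(const std::int32_t* mem, std::size_t count) {
    assert(count >= 4);
    Lanes8 v{};
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = mem[i / 2];
    return v;
}

// Undersample by 2: every other memory element is skipped (a c e g ...).
// Touches source indices 0, 2, ..., 14.
inline Lanes8 undersample2(const std::int32_t* mem, std::size_t count) {
    assert(count >= 15);
    Lanes8 v{};
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = mem[2 * i];
    return v;
}

// Deinterleave: even-indexed elements go to one vector and odd-indexed
// elements to another. Consumes 16 source elements.
struct Lanes8Pair {
    Lanes8 even;
    Lanes8 odd;
};

inline Lanes8Pair deinterleave(const std::int32_t* mem, std::size_t count) {
    assert(count >= 16);
    Lanes8Pair p{};
    for (std::size_t i = 0; i < 8; ++i) {
        p.even[i] = mem[2 * i];
        p.odd[i] = mem[2 * i + 1];
    }
    return p;
}
```

In each case the only thing that changes is the mapping from lane index to memory index, which is why a distribution mode can be treated as a parameter of the load rather than as a separate operation.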
Thus, load operations are characterized primarily by the element type in memory and distribution mode. The virtual machine has a template class called vcop_load that implements the various combinations using C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SE, SA, or indirect) and the number of SIMD lanes to emulate (8 or 16). The class has two methods, load() and SE_load(), which implement the load operation as specified by the template parameters. The load() method implements non-SE-based loads, and the SE_load() method implements SE-based loads.
For example, here is the translation of a VCOP load instruction using the CIRC2 distribution mode and SA-based addressing.
Kernel-C:
__vptr_uint16 in;
Vreg = in[Agen0].circ2();
translates to:
vcop_load<ushort, circ2, sa1adv>::load(Vreg, (uchar*)(tvals->p1));
The template parameters ushort and circ2 specify the data type and distribution mode. The template parameter sa1adv tells the template to use SA-based addressing, with SA1 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the destination vector register (passed by reference, since the load writes into it). The tvals->p1 expression is the base address. The load() method, when the vcop_load template is specialized for ushort, circ2, and sa1adv, generates the specific C7x sequence that implements that combination.
The specific C7x instruction sequences generated by specialized load() methods that result from all combinations of type and distribution mode can be determined by examining the template specializations in the header files.
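The way template specialization selects one load sequence per combination can be sketched on the host as follows. This is not the vcop_load class from the tool's headers: load_sketch, linear_mode, repeat2_mode, and direct_addr are hypothetical names, the addressing parameter is an unused placeholder, and each specialization is a plain loop rather than a C7x instruction sequence.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical tag types standing in for distribution modes and addressing.
struct linear_mode {};   // consecutive elements
struct repeat2_mode {};  // each element read twice
struct direct_addr {};   // placeholder; this model ignores the addressing kind

template <std::size_t Lanes>
using Vec = std::array<std::int32_t, Lanes>;

// Fetch element `index` of type T and widen it to a 32-bit lane.
template <typename T>
std::int32_t read_element(const std::uint8_t* base, std::size_t index) {
    T e;
    std::memcpy(&e, base + index * sizeof(T), sizeof(T));
    return static_cast<std::int32_t>(e);
}

// Primary template is declared but not defined, so an unsupported
// combination of parameters fails at compile time.
template <typename T, typename Mode, typename Addr, std::size_t Lanes = 8>
struct load_sketch;

// Specialization for the linear pattern.
template <typename T, typename Addr, std::size_t Lanes>
struct load_sketch<T, linear_mode, Addr, Lanes> {
    static void load(Vec<Lanes>& dst, const std::uint8_t* base) {
        assert(base != nullptr);
        for (std::size_t i = 0; i < Lanes; ++i) dst[i] = read_element<T>(base, i);
    }
};

// Specialization for the repeat-by-2 pattern.
template <typename T, typename Addr, std::size_t Lanes>
struct load_sketch<T, repeat2_mode, Addr, Lanes> {
    static void load(Vec<Lanes>& dst, const std::uint8_t* base) {
        assert(base != nullptr);
        for (std::size_t i = 0; i < Lanes; ++i) dst[i] = read_element<T>(base, i / 2);
    }
};
```

A call such as load_sketch<std::uint16_t, repeat2_mode, direct_addr>::load(v, p) has the same shape as the generated call shown earlier: the compile-time parameters pick the specialization, and the runtime arguments supply the destination and the base address.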
The load() methods must load the vector elements according to the distribution mode and apply sign- or zero-extension. Most combinations of type and distribution mode are covered by a single C7x instruction. A few need additional instructions to exactly mimic VCOP’s specific modes.
In general, SE-based loads, invoked via the SE_load() method, simply rely on the SE configuration as set up in the init() function (see Section 1.5), and translate to a simple access of the corresponding SE source register.
The basic setup for the SE is based on the data type. For example, signed 16-bit data uses element type __SE_ELETYPE_16BIT and __SE_PROMOTE_2X_SIGNEXT for sign-extension to 32 bits. The default vector length for 8-way SIMD is __SE_VECLEN_32BYTE, for 8 lanes of 32-bit (4-byte) data. Additional SE features used to implement the specific distribution modes are configured by the various specializations of the vcop_load template.
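The relationship between data type and basic SE setup can be sketched as a compile-time lookup. The enums and the basic_setup function below are hypothetical host-side stand-ins, not the C7x SE template API. Only the signed 16-bit, 8-lane case is taken from the text above; the 8-bit, 32-bit, unsigned, and 16-lane entries are assumptions that extend the same rule (promote to 32 bits, 4 bytes per lane).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the SE element-type and promote settings.
enum class EleBits { b8 = 8, b16 = 16, b32 = 32 };
enum class Promote { off, zeroext_2x, signext_2x, zeroext_4x, signext_4x };

struct SeBasicSetup {
    EleBits ele;
    Promote promote;
    std::size_t veclen_bytes;
};

// Choose the promotion that widens an element of `bytes` bytes to 32 bits.
// 2 bytes needs 2x, 1 byte needs 4x (assumed), 4 bytes needs none (assumed).
constexpr Promote promote_for(std::size_t bytes, bool is_signed) {
    return bytes == 4 ? Promote::off
         : bytes == 2 ? (is_signed ? Promote::signext_2x : Promote::zeroext_2x)
                      : (is_signed ? Promote::signext_4x : Promote::zeroext_4x);
}

template <typename T, std::size_t Lanes = 8>
constexpr SeBasicSetup basic_setup() {
    static_assert(std::is_integral<T>::value && sizeof(T) <= 4,
                  "8-, 16-, or 32-bit integer elements only");
    static_assert(Lanes == 8 || Lanes == 16, "8 or 16 SIMD lanes are emulated");
    // Vector length: one 32-bit (4-byte) lane per emulated SIMD lane.
    return SeBasicSetup{static_cast<EleBits>(sizeof(T) * 8),
                        promote_for(sizeof(T), std::is_signed<T>::value),
                        Lanes * sizeof(std::int32_t)};
}
```

For signed 16-bit data with the default of 8 lanes, this model yields a 16-bit element type, 2x sign-extending promotion, and a 32-byte vector length, which matches the example in the text.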
C7x can speculatively load without risk of faulting. Therefore most load sequences simply load a full C7x vector’s worth of data—that is, 16 lanes of 32-bit values. In 8-way SIMD mode, the extra values are simply unused.