SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1
As with loads, VCOP store operations are characterized by data type and distribution mode. Beyond distribution mode, two considerations dominate the translation of VCOP stores to C7x: packing and lane masking.
Packing is the opposite of sign extension. The source data in C7x registers is 32 bits wide. When storing to 16- or 8-bit element types, the elements must be truncated to that size. The C7x has direct instruction support for such packing stores. Packing depends on size but not signedness. Signed and unsigned types generally use the exact same sequence, so there are three fundamental translation sequences for each mode, corresponding to 8-bit, 16-bit, or 32-bit data.
Signedness does come into play for rounding or saturation; these are covered in Section 5.3.3 and Section 5.3.4.
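The packing behavior described above can be illustrated with a small host-side model. This is only a sketch in portable C++, not C7x code: the function name pack_store and the use of std::array to stand in for a vector register are assumptions made for illustration. It shows that truncating a 32-bit lane to an 8- or 16-bit element yields the same stored bits for signed and unsigned element types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical host-side model (not a C7x intrinsic): store the first
// `lanes` 32-bit register lanes to memory as elements of type T, keeping
// only the low-order bits of each lane.
template <typename T, std::size_t N>
void pack_store(const std::array<std::uint32_t, N>& vreg, T* dst, std::size_t lanes)
{
    static_assert(sizeof(T) <= sizeof(std::uint32_t), "element wider than a lane");
    assert(lanes <= N);
    using U = std::make_unsigned_t<T>;
    for (std::size_t i = 0; i < lanes; ++i) {
        // Modular truncation to the element width; the resulting bit
        // pattern does not depend on whether T is signed or unsigned.
        const U low = static_cast<U>(vreg[i]);
        std::memcpy(&dst[i], &low, sizeof(T));
    }
}
```

Calling pack_store with a std::int8_t destination and again with a std::uint8_t destination writes identical bytes, which is why one sequence per element size is enough in this model.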
Unlike loads, translation of stores is sensitive to the number of lanes. While excess lanes can be safely loaded and ignored, stores must take care to store only the number of lanes being modeled. That means in the default 8-way SIMD mode, stores are limited to storing only 8 lanes, even though the C7x vectors contain 16 lanes of (32-bit) data. There is instruction support for such partial-vector stores for some cases, but not all. In particular, there are no partial-vector packing stores. Thus packing stores in 8-way SIMD mode that use regular indirect addressing require an explicit predicate to mask the store of unused lanes. This lane-masking predicate is constant and can be computed outside the loop.
When using SA-based addressing, the SA automatically provides lane masking based on the VECLEN flag in the SA setup vector. In this case the explicit lane masks are not needed.
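The explicit lane mask used with regular indirect addressing can be modeled on a host as follows. This is a sketch under stated assumptions: Vec16, Pred16, make_lane_mask, and masked_pack_store_u8 are hypothetical names, and a boolean array stands in for a hardware predicate. The point is that the mask depends only on the SIMD width, so it can be built once before the loop, and that masked-off lanes leave memory untouched.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorLanes = 16;  // 32-bit lanes in the modeled vector

using Vec16  = std::array<std::uint32_t, kVectorLanes>;  // stand-in for a vector register
using Pred16 = std::array<bool, kVectorLanes>;           // stand-in for a predicate

// Build a predicate that enables only the first `simd_lanes` lanes.
// The result is loop-invariant, so it can be computed once up front.
Pred16 make_lane_mask(std::size_t simd_lanes)
{
    assert(simd_lanes <= kVectorLanes);
    Pred16 mask{};  // all lanes disabled
    for (std::size_t i = 0; i < simd_lanes; ++i) {
        mask[i] = true;
    }
    return mask;
}

// Predicated packing store to 8-bit elements: lanes whose predicate is
// false do not write to memory at all.
void masked_pack_store_u8(const Vec16& vreg, const Pred16& mask, std::uint8_t* dst)
{
    for (std::size_t i = 0; i < kVectorLanes; ++i) {
        if (mask[i]) {
            dst[i] = static_cast<std::uint8_t>(vreg[i]);  // keep the low 8 bits
        }
    }
}
```

In a loop, make_lane_mask(8) would be called once before the first iteration and the same mask passed to every store.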
The virtual machine has a template class called vcop_store that implements the various combinations in terms of C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SA or indirect), and the number of SIMD lanes to emulate (8 or 16). The class has two methods: store() for regular stores and store_pred() for predicated stores. For example, here is the translation of a store instruction using the DS2 distribution mode and SA-based addressing.
Kernel-C source:
__vptr_s8 out;
out[Agen1].ds2() = Vreg;
translates to:
vcop_store<char, ds2, sa0adv>::store(Vreg, (uchar *)(tvals->p2));
The template parameters char and ds2 specify the data type and distribution mode. The template parameter sa0adv tells the template to use SA-based addressing, with SA0 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the source vector register. The tvals->p2 expression is the base address.
Most unconditional stores that use the basic store distribution modes translate to a one- or two-instruction sequence. Collating stores write a data-dependent number of elements in packed fashion and are therefore more involved.
Efficiency Warning: Collating Stores |
---|
Collating stores require a long translation sequence and may perform poorly. |
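To make the cost of collating stores more concrete, the following host-side sketch models what "a data-dependent number of elements in packed fashion" could look like. It is not the translation sequence used by the migration tool. The name collate_store is hypothetical, and the sketch assumes that a per-lane predicate selects which elements are stored. Because the output position of each element depends on how many earlier lanes were selected, the operation does not map onto a simple fixed-stride store.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of a collating store (assumption: a per-lane
// predicate selects the elements). Selected lanes among the first
// `simd_lanes` are written contiguously to dst; the return value is the
// number of elements written, which a caller would use to advance dst.
template <std::size_t N>
std::size_t collate_store(const std::array<std::uint32_t, N>& vreg,
                          const std::array<bool, N>& pred,
                          std::size_t simd_lanes,
                          std::uint32_t* dst)
{
    assert(simd_lanes <= N);
    std::size_t written = 0;
    for (std::size_t i = 0; i < simd_lanes; ++i) {
        if (pred[i]) {
            dst[written++] = vreg[i];  // output index depends on earlier lanes
        }
    }
    return written;
}
```

The returned count is only known after all lanes have been examined, which is one reason a vectorized equivalent needs more steps than an ordinary store.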