UDMAP: UDMA transfers with ICNTs and/or src/dst addr NOT aligned to 64B fail when used in "event trigger" mode
Details:
Note: The following description uses a C7x DSP core as an example, but it applies to any
other processing core that can program the UDMA.
For DSP algorithm processing on C6x/C7x, the software often uses UDMA in NavSS or
DRU in MSMC. In many cases, UDMA is used instead of DRU, because DRU channels are
reserved in many use-cases for C7x/MMA deep learning operations. In typical DSP
algorithm processing, data is DMA'ed block by block to L2 memory for the DSP, and
the DSP operates on the data in L2 memory instead of operating from DDR (through
the cache). The typical DMA setup and event trigger for this operation are shown
below; this is referred to as "2D trigger and wait" in the following example.
For each "frame":
1. Set up a TR, typically a 3- or 4-dimension TR:
   - Set TYPE = 4D_BLOCK_MOVE_REPACKING_INDIRECTION
   - Set EVENT_SIZE = ICNT2_DEC
   - Set TRIGGER0 = GLOBAL0
   - Set TRIGGER0_TYPE = ICNT2_DEC
   - Set TRIGGER1 = NONE
   - ICNT0 x ICNT1 is block width x block height
   - ICNT2 = number of blocks
   - ICNT3 = 1
   - src addr = DDR
   - dst addr = C6x L2 memory
2. Submit this TR.
   - This TR starts a transfer on GLOBAL TRIGGER0 and transfers ICNT0xICNT1
     bytes, then raises an event.
3. For each block, do the following:
   3.1 Trigger DMA by setting GLOBAL TRIGGER0
   3.2 Wait for the event that indicates that the block is transferred
   3.3 Do DSP processing
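The control flow of the sequence above can be sketched as a small host-side simulation. This is not UDMA driver code: every name here (`sim_tr_t`, `sim_trigger`, `sim_wait_event`, `sim_process_frame`) is hypothetical, the source blocks are assumed to be stored back to back, the "DSP processing" is just a byte sum, and the erratum itself is not modeled. It only illustrates the one-trigger, one-event, one-block rhythm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of one TR; NOT the real UDMA TR layout. */
typedef struct {
    const uint8_t *src;    /* stands in for the DDR source */
    uint8_t       *dst;    /* stands in for the L2 block buffer */
    uint16_t       icnt0;  /* block width in bytes */
    uint16_t       icnt1;  /* block height in rows */
    uint16_t       icnt2;  /* number of blocks */
    uint16_t       done;   /* blocks transferred so far */
    int            event_pending; /* models the per-block event */
} sim_tr_t;

/* Step 3.1: models setting GLOBAL TRIGGER0; moves one ICNT0 x ICNT1 block. */
static void sim_trigger(sim_tr_t *tr)
{
    size_t block_bytes = (size_t)tr->icnt0 * tr->icnt1;

    if (tr->done < tr->icnt2) {
        /* Assumption: blocks are contiguous in the source buffer. */
        memcpy(tr->dst, tr->src + (size_t)tr->done * block_bytes, block_bytes);
        tr->done++;
        tr->event_pending = 1;
    }
}

/* Step 3.2: models the event check; returns 1 if the event arrived. */
static int sim_wait_event(sim_tr_t *tr)
{
    int got = tr->event_pending;

    tr->event_pending = 0;
    return got;
}

/* Steps 1 to 3 for one frame. Returns the number of blocks processed and
 * stores the byte sum of all processed blocks in *sum_out. */
uint32_t sim_process_frame(const uint8_t *frame, uint8_t *l2_buf,
                           uint16_t width, uint16_t height,
                           uint16_t nblocks, uint32_t *sum_out)
{
    sim_tr_t tr = { frame, l2_buf, width, height, nblocks, 0, 0 }; /* steps 1, 2 */
    size_t   block_bytes = (size_t)width * height;
    uint32_t sum = 0;
    uint32_t processed = 0;

    for (uint16_t b = 0; b < nblocks; b++) {
        sim_trigger(&tr);                 /* step 3.1 */
        if (!sim_wait_event(&tr)) {       /* step 3.2 */
            break;                        /* event never arrived */
        }
        for (size_t i = 0; i < block_bytes; i++) {
            sum += l2_buf[i];             /* step 3.3: placeholder processing */
        }
        processed++;
    }
    *sum_out = sum;
    return processed;
}
```

For example, a 24-byte frame split into three 4 x 2 blocks would be handled by calling `sim_process_frame(frame, l2, 4, 2, 3, &sum)` with an 8-byte `l2` buffer.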
This is a simplified sequence; in the actual algorithm, there can be multiple
channels doing DDR to L2 or L2 to DDR transfers in a "ping-pong" manner, such that
DSP processing and DMA run in parallel. The event itself is programmed
appropriately at the channel OES registers, and the event status check is done using
a free bit in IA for UDMA.
When any of the following conditions occurs, the event in step 3.2 is not
received for the first trigger:
- Condition 1: ICNT0xICNT1 is NOT a multiple of 64.
- Condition 2: src or dst is NOT a multiple of 64.
- Condition 3: ICNT0xICNT1 is NOT a multiple of 64 and src/dst address is NOT a
  multiple of 64.
Making ICNT0xICNT1 and the src/dst addr a multiple of only 16B or 32B shows the
same issue: the event is not received. Only 64B alignment makes it work.
Conditions in which it works:
- If ICNT0xICNT1 is made a multiple of 64 and the src/dst address a multiple of
  64, the test case passes.
- If DRU is used instead of UDMA, then the test passes. The TR must be submitted
  to DRU through the UDMA DRU external channel. With DRU and with ICNTs and
  src/dst addr unaligned, the user can trigger and get events as expected when
  the TR is programmed such that the number of events and number of triggers in
  a frame is 1, i.e., ICNT2 = 1 in the above case, or EVENT_SIZE = COMPLETION and
  trigger is NONE. The completion event then occurs as expected. This is not
  feasible for the use-cases in question.
The above is an example for "2D trigger and wait"; the same constraint applies to
"1D trigger and wait" and "3D trigger and wait":
- For "1D trigger and wait", ICNT0 MUST be a multiple of 64
- For "3D trigger and wait", ICNT0xICNT1xICNT2 MUST be a multiple of 64
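The alignment constraints above can be expressed as a small check that software could run before submitting a TR. This is a minimal sketch: the function name `trigger_wait_is_aligned` and its parameter layout are hypothetical, ICNTs are assumed to be 16-bit counts, and the src/dst address check is assumed to apply to the 1D and 3D cases in the same way as in the 2D example.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERRATUM_ALIGN_BYTES 64u  /* alignment stated in this advisory */

/* Returns true when a "dims"-D trigger-and-wait transfer (dims = 1, 2 or 3)
 * meets the 64B constraints: the bytes moved per trigger and both addresses
 * are multiples of 64. Returns false for an unsupported dims value. */
bool trigger_wait_is_aligned(unsigned dims,
                             uint16_t icnt0, uint16_t icnt1, uint16_t icnt2,
                             uint64_t src_addr, uint64_t dst_addr)
{
    uint64_t bytes_per_trigger;

    if (dims < 1u || dims > 3u) {
        return false;
    }

    /* 1D: ICNT0, 2D: ICNT0 x ICNT1, 3D: ICNT0 x ICNT1 x ICNT2.
     * Three 16-bit counts multiply to less than 2^48, so no overflow. */
    bytes_per_trigger = icnt0;
    if (dims >= 2u) {
        bytes_per_trigger *= icnt1;
    }
    if (dims >= 3u) {
        bytes_per_trigger *= icnt2;
    }

    return (bytes_per_trigger % ERRATUM_ALIGN_BYTES == 0u) &&
           (src_addr % ERRATUM_ALIGN_BYTES == 0u) &&
           (dst_addr % ERRATUM_ALIGN_BYTES == 0u);
}
```

A transfer for which this returns false falls under the conditions described in this advisory.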
Workaround(s):
Set the EOL flag in the TR for UDMAP as shown in the following examples:
- 1D trigger and wait
  - TR.FLAGS |= CSL_FMK(UDMAP_TR_FLAGS_EOL, CSL_UDMAP_TR_FLAGS_EOL_ICNT0);
- 2D trigger and wait
  - TR.FLAGS |= CSL_FMK(UDMAP_TR_FLAGS_EOL, CSL_UDMAP_TR_FLAGS_EOL_ICNT0_ICNT1);
- 3D trigger and wait
  - TR.FLAGS |= CSL_FMK(UDMAP_TR_FLAGS_EOL, CSL_UDMAP_TR_FLAGS_EOL_ICNT0_ICNT1_ICNT2);
There is no performance impact due to this workaround.
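The per-dimension choice of EOL value can be wrapped in one helper, sketched below. To keep the sketch self-contained it does not include the CSL headers: the `EX_` shift, mask and enum values are placeholders chosen for illustration and are assumptions, not the real register encoding. Real code should use `CSL_FMK` with the `CSL_UDMAP_TR_FLAGS_EOL_*` definitions as in the workaround lines. The helper name `ex_apply_eol_workaround` is hypothetical.

```c
#include <stdint.h>

/* Placeholder encoding for illustration only (assumption). Real code should
 * take the shift, mask and EOL values from the CSL UDMAP definitions. */
#define EX_TR_FLAGS_EOL_SHIFT 28u
#define EX_TR_FLAGS_EOL_MASK  (0x7u << EX_TR_FLAGS_EOL_SHIFT)

enum ex_eol {
    EX_EOL_ICNT0             = 1, /* 1D trigger and wait */
    EX_EOL_ICNT0_ICNT1       = 2, /* 2D trigger and wait */
    EX_EOL_ICNT0_ICNT1_ICNT2 = 3  /* 3D trigger and wait */
};

/* Returns "flags" with the EOL field set for a dims-D trigger-and-wait
 * transfer. For a dims value other than 1, 2 or 3 the flags are returned
 * unchanged. */
uint32_t ex_apply_eol_workaround(uint32_t flags, unsigned dims)
{
    uint32_t eol;

    switch (dims) {
    case 1u: eol = EX_EOL_ICNT0;             break;
    case 2u: eol = EX_EOL_ICNT0_ICNT1;       break;
    case 3u: eol = EX_EOL_ICNT0_ICNT1_ICNT2; break;
    default: return flags;
    }

    /* Clear the field first so a stale EOL value cannot be OR-ed together
     * with the new one, then insert the selected value. */
    flags &= ~EX_TR_FLAGS_EOL_MASK;
    flags |= (eol << EX_TR_FLAGS_EOL_SHIFT) & EX_TR_FLAGS_EOL_MASK;
    return flags;
}
```

Unlike the plain `|=` form, this sketch clears the field before writing it, which only matters when the flags word is reused between TRs with different dimensionality.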