SPRADD8 November 2024 F29H850TU, F29H859TU-Q1
Traditionally, branch, call, and return operations incur overhead because of the instruction pipeline. The CPU fetches and decodes an instruction, and determines only in the Decode-2 phase of the pipeline that a branch, call, or return must occur. By this time, the pipeline already holds the instructions that follow, and these must be flushed before the instruction at the discontinuity destination is fetched. Flushing these instructions is the source of the overhead.
The C29 CPU has a 9-stage pipeline, and the discontinuity decision occurs in the Decode-2 (D2) phase. By then, the three instructions that follow a discontinuity instruction are already in the pipeline, in the Fetch-1, Fetch-2, and Decode-1 phases. In addition to regular branch, call, and return instructions, the C29 ISA supports delayed versions, identified by a trailing D in the mnemonic (for example, CALLD and RETD). When a delayed discontinuity instruction is used, the three instructions immediately following it are always executed, whether or not the discontinuity occurs (as with a conditional branch). These three instructions are referred to as delay slots. When the C29 compiler uses the delayed version of an instruction, it places useful instructions in the delay slots, reducing the discontinuity overhead from three cycles to effectively zero.
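The overhead arithmetic described above can be sketched as a small cost model. This is an illustrative simplification, not a description of the hardware: it assumes one instruction packet issues per cycle, that nothing else stalls, and that any delay slot not filled with useful work counts as a wasted cycle. The function and constant names are hypothetical.

```python
# Simplified cost model for one taken branch, call, or return.
# Assumption: one instruction packet issues per cycle and nothing else stalls.

SLOTS_BEHIND_D2 = 3  # Fetch-1, Fetch-2, and Decode-1, as described in the text


def wasted_cycles(delayed: bool, useful_slots: int = 0) -> int:
    """Return the cycles lost to one taken discontinuity in this model.

    delayed      -- True for the delayed form (for example CALLD or RETD)
    useful_slots -- delay slots filled with useful work; unfilled slots are
                    counted as wasted cycles (an assumption of this sketch)
    """
    if not delayed:
        # Regular form: the three younger instructions are flushed.
        return SLOTS_BEHIND_D2
    if not 0 <= useful_slots <= SLOTS_BEHIND_D2:
        raise ValueError("useful_slots must be between 0 and 3")
    return SLOTS_BEHIND_D2 - useful_slots


print(wasted_cycles(delayed=False))                 # regular form
print(wasted_cycles(delayed=True, useful_slots=3))  # all three slots filled
```

In this model the regular form loses three cycles and the delayed form with all three slots filled loses none, which matches the "three cycles to effectively zero" figure in the text.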
The two examples below illustrate how a compiler uses delay slots: a delayed call (CALLD) and a delayed return (RETD).
@CALLD funcA ; Call funcA
||LD.32 A4,@pointer1 ; Load A4 with pointer1 value from memory
LD.32 A5,@pointer2 ; Load A5 with pointer2 value from memory
||SUB.U16 A6,SP,#34 ; A6 points to value on stack offset -34
MV A7,#ArrayB ; Load A7 with address of ArrayB
||LD.32 D0,@variable1 ; Load D0 with variable1 from memory
LD.32 D1,@variable2 ; Load D1 with variable2 from memory
; Total Cycles = 4
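The cycle count of the CALLD listing can be checked by counting instruction packets, where a line beginning with `||` joins the packet opened by the preceding instruction. The Python sketch below does only that. It assumes each packet takes one cycle with no stalls, and the helper name `count_packets` is hypothetical.

```python
# Count instruction packets in a listing: a line starting with "||" joins
# the packet opened by the previous instruction line.

CALLD_LISTING = """\
@CALLD funcA
||LD.32 A4,@pointer1
LD.32 A5,@pointer2
||SUB.U16 A6,SP,#34
MV A7,#ArrayB
||LD.32 D0,@variable1
LD.32 D1,@variable2
"""


def count_packets(listing: str) -> int:
    """Return the number of packets, assuming one packet issues per cycle."""
    packets = 0
    for raw in listing.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        if not line.startswith("||"):
            packets += 1  # a line without "||" opens a new packet
    return packets


print(count_packets(CALLD_LISTING))  # 4
```

The seven instructions group into four packets: the CALLD packet followed by three delay-slot packets, all of which do argument set-up for funcA.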
funcA: ADD.U16 SP,SP,#24 ; Allocate local stack space
ST.64 *(SP-#24),XM2 ; Save XM2, XM4, XM6 registers on stack
ST.64 *(SP-#16),XM4
ST.64 *(SP-#8),XM6
... user code...
RETD *(SP-#32) ; packet 1: Return and restore RPC from stack
||MV M0,M3 ; Place return value in register M0
LD.64 XM6,*(SP-#8) ; packet 2: Restore XM6 from stack
LD.64 XM4,*(SP-#16) ; packet 3: Restore XM4 from stack
LD.64 XM2,*(SP-#24) ; packet 4: Restore XM2 from stack
||SUB.U16 SP,SP,#32 ; Deallocate local + return stack space
; Total Cycles = 4
The above examples are models of how the C29 compiler uses delay slots. In practice, delay slots are used for more than function argument passing, register restoration, and stack deallocation. They often contain instructions that implement the actual functionality of user code.
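The stack offsets in the funcA listing can be cross-checked with a short calculation. The Python sketch below uses only the byte offsets that appear in the listing. The 8-byte return slot is inferred from the difference between the 32 bytes released and the 24 bytes allocated, and the model assumes the call stores that return state at the caller's SP and advances SP past it before the prologue runs. The function name and the starting address are hypothetical.

```python
# Byte offsets taken from the funcA listing. The stack grows upward there:
# ADD allocates and SUB deallocates.

LOCAL_BYTES = 24      # ADD.U16 SP,SP,#24
TOTAL_RELEASED = 32   # SUB.U16 SP,SP,#32
RETURN_SLOT_BYTES = TOTAL_RELEASED - LOCAL_BYTES  # inferred, not stated: 8

SAVE_OFFSETS = {"XM2": 24, "XM4": 16, "XM6": 8}   # ST.64 *(SP-#n),reg
RPC_OFFSET = 32                                   # RETD *(SP-#32)


def frame_layout(sp_before_call: int) -> dict:
    """Return the address of each saved item for a given caller SP.

    Assumption: the call advances SP by RETURN_SLOT_BYTES before the
    prologue adds LOCAL_BYTES.
    """
    sp_in_body = sp_before_call + RETURN_SLOT_BYTES + LOCAL_BYTES
    layout = {reg: sp_in_body - off for reg, off in SAVE_OFFSETS.items()}
    layout["RPC"] = sp_in_body - RPC_OFFSET
    layout["SP after return"] = sp_in_body - TOTAL_RELEASED
    return layout


for name, addr in frame_layout(0x1000).items():  # 0x1000 is a made-up address
    print(f"{name:16s} 0x{addr:04X}")
```

Under these assumptions the three 64-bit saves occupy consecutive 8-byte slots directly above the return slot, and SP returns to its value before the call once the final delay-slot packet has executed.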