6.4 DSP Subsystem
The device includes two identical instances (DSP1 and DSP2) of a digital signal processor (DSP) subsystem, based on TI's standard TMS320C66x™ DSP CorePac core.
The TMS320C66x DSP core enhances the TMS320C674x™ core, which merges the C674x™ floating point and the C64x+™ fixed-point instruction set architectures. The C66x DSP is object-code compatible with the C64x+/C674x DSPs.
For more information on the TMS320C66x core CPU, see the TMS320C66x DSP CPU and Instruction Set Reference Guide (SPRUGH7).
The DSP subsystem integrated in the device includes the following components:
- A TMS320C66x™ CorePac DSP core that encompasses:
- L1 program-dedicated (L1P) cacheable memory
- L1 data-dedicated (L1D) cacheable memory
- L2 (program and data) cacheable memory
- Extended Memory Controller (XMC)
- External Memory Controller (EMC)
- DSP CorePac located interrupt controller (INTC)
- DSP CorePac located power-down controller (PDC)
- Dedicated enhanced direct memory access (EDMA) engine to transfer data between memories and peripherals external to the DSP subsystem and local DSP memory (most commonly L2 SRAM). External DMA requests are passed through the DSP system-level (SYS) wakeup logic and are collected from the DSP1/DSP2 dedicated outputs of the device DMA Events Crossbar for each of the two subsystems.
- A level 2 (L2) interconnect network (DSP NoC) to allow connectivity between different modules of the subsystem or the remainder of the device via the device L3_MAIN interconnect.
- Two memory management units (on EDMA L2 interconnect and DSP MDMA paths) for accessing the device L3_MAIN interconnect address space
- Dedicated system control logic (DSP_SYSTEM) responsible for power management, clock generation, and connection to the device power, reset, and clock management (PRCM) module
The TMS320C66x Instruction Set Architecture (ISA) is the latest for the C6000 family. As with its predecessors (C64x, C64x+ and C674x), the C66x is an advanced VLIW architecture with 8 functional units (two multiplier units and six arithmetic logic units) that operate in parallel. The C66x CPU has a total of 64 general-purpose 32-bit registers.
Some features of the DSP C6000 family devices are:
- Advanced VLIW CPU with eight functional units (two multipliers and six ALUs) which:
- Executes up to eight instructions per cycle for up to ten times the performance of typical DSPs
- Allows designers to develop highly effective RISC-like code for fast development time
- Instruction packing
- Gives code size equivalence for eight instructions executed serially or in parallel
- Reduces code size, program fetches, and power consumption
- Conditional execution of most instructions
- Reduces costly branching
- Increases parallelism for higher sustained performance
- Efficient code execution on independent functional units
- Industry's most efficient C compiler on DSP benchmark suite
- Industry's first assembly optimizer for fast development and improved parallelization
- 8-/16-/32-/64-bit data support, providing efficient memory support for a variety of applications
- 40-bit arithmetic options which add extra precision for vocoders and other computationally intensive applications
- Saturation and normalization to provide support for key arithmetic operations
- Field manipulation and instruction extract, set, clear, and bit counting support common operations found in control and data manipulation applications.
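The saturation and normalization support listed above can be illustrated with a portable C model. This is a minimal sketch of the arithmetic only, not of any particular C6000 instruction or compiler intrinsic; the function names `sat_add32` and `norm32` are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit signed add that clamps instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* exact in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Hypothetical helper: number of redundant sign bits in x (0..31),
 * that is, how many bits below the sign bit repeat the sign bit. */
static int norm32(int32_t x)
{
    uint32_t u = (uint32_t)x;
    if (x < 0) u = ~u;          /* fold negatives onto the non-negative case */
    int n = 0;
    /* Count zero bits from bit 30 downward until the first set bit. */
    for (uint32_t mask = UINT32_C(1) << 30; mask != 0 && (u & mask) == 0; mask >>= 1)
        ++n;
    assert(n >= 0 && n <= 31);
    return n;
}
```

Shifting a value left by its `norm32` count scales it to full range without overflow, which is the usual reason the two operations appear together in fixed-point code.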
The C66x CPU has the following additional features:
- Each multiplier can perform two 16 × 16-bit or four 8 × 8-bit multiplies every clock cycle.
- Quad 8-bit and dual 16-bit instruction set extensions with data flow support
- Support for non-aligned 32-bit (word) and 64-bit (double word) memory accesses
- Special communication-specific instructions have been added to address common operations in error-correcting codes.
- Bit count and rotate hardware extends support for bit-level algorithms.
- Compact instructions: Common instructions (AND, ADD, LD, MPY) have 16-bit versions to reduce code size.
- Protected mode operation: A two-level system of privileged program execution to support higher-capability operating systems and system features such as memory protection.
- Exceptions support for error detection and program redirection to provide robust code execution
- Hardware support for modulo loop operation to reduce code size and allow interrupts during fully-pipelined code
- Each multiplier can perform 32 × 32-bit multiplies
- Additional instructions to support complex multiplies allowing up to eight 16-bit multiply/add/subtracts per clock cycle
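To make the multiply/add/subtract count in the complex-multiply item above concrete, the following scalar C model shows the operations in one complex multiply with 16-bit parts: four 16 × 16-bit multiplies, one subtract, and one add. It is a sketch under stated assumptions, not the definition of any C66x instruction; the types `cplx16` and `cplx32` and the function `cplx16_mul` are hypothetical, and the result is kept at full 32-bit precision without rounding or saturation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types: complex value with 16-bit parts, product with 32-bit parts. */
typedef struct { int16_t re, im; } cplx16;
typedef struct { int32_t re, im; } cplx32;

/* (a.re + j*a.im) * (b.re + j*b.im):
 * four 16x16 multiplies, one subtract, one add. */
static cplx32 cplx16_mul(cplx16 a, cplx16 b)
{
    /* Excluding -32768 keeps both sums within the int32_t range. */
    assert(a.re != INT16_MIN && a.im != INT16_MIN);
    assert(b.re != INT16_MIN && b.im != INT16_MIN);

    cplx32 r;
    r.re = (int32_t)a.re * b.re - (int32_t)a.im * b.im;
    r.im = (int32_t)a.re * b.im + (int32_t)a.im * b.re;
    return r;
}
```

Two such complex multiplies account for eight 16-bit multiplies plus the associated adds and subtracts, which matches the per-cycle figure quoted in the list.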
The TMS320C66x has the following key improvements to the ISA:
- 4x Multiply Accumulate improvement for both fixed and floating point
- Improvement of the floating point arithmetic
- Enhancement of the vector processing capability for fixed and floating point
- Addition of domain-specific instructions for complex arithmetic and matrix operations
On the C66x ISA, the vector processing capability is improved by extending the width of the SIMD instructions. The C674x DSP supports 2-way SIMD operations for 16-bit data and 4-way SIMD operations for 8-bit data. The C66x enhances this capability with the addition of SIMD instructions for 32-bit data, allowing operation on 128-bit vectors. For example, the QMPY32 instruction performs the element-by-element multiplication of two vectors of four 32-bit elements each.
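The element-wise behavior described for QMPY32 can be sketched as a portable scalar loop over four 32-bit lanes. This is an illustrative model, not the instruction or a compiler intrinsic: the function name `vec4x32_mul` is hypothetical, and keeping only the low 32 bits of each product is an assumption made here; SPRUGH7 is the authority on the exact result format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a 4-way 32-bit SIMD multiply.
 * Each array represents one 128-bit vector as four 32-bit lanes. */
static void vec4x32_mul(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    assert(a != NULL && b != NULL && out != NULL);

    for (size_t i = 0; i < 4; ++i) {
        /* Multiply as unsigned so that wrap-around is well defined;
         * the low 32 bits are the same as for the signed product. */
        uint32_t low = (uint32_t)a[i] * (uint32_t)b[i];
        out[i] = (int32_t)low;
    }
}
```

On the DSP all four lane products are issued as a single SIMD operation, whereas this model computes them one after another.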
The C66x ISA also includes a set of specific instructions to handle complex arithmetic and matrix operations.
- TMS320C66x DSP CorePac memory components:
- A 32-KiB L1 program memory (L1P) configurable as cache and/or SRAM:
- When configured as a cache, the L1P is a 1-way set-associative cache with a 32-byte cache line
- The DSP CorePac L1P memory controller provides bandwidth management, memory protection, and power-down functions
- The L1P is capable of cache block and global coherence operations
- The L1P controller has an Error Detection (ED) mechanism, including necessary SRAM
- The L1P memory can be fully configured as a cache or SRAM
- Page size for L1P memory is 2 KiB
- A 32-KiB L1 data memory (L1D), configurable as cache and/or SRAM:
- When configured as a cache, the L1D is a 2-way set-associative cache with a 64-byte cache line
- The DSP CorePac L1D memory controller provides bandwidth management, memory protection, and power-down functions
- The L1D memory can be fully configured as a cache or SRAM
- No support for error correction or detection
- Page size for L1D memory is 2 KiB
- A 288-KiB (program and data) L2 memory, only part of which is cacheable:
- When configured as a cache, the L2 memory is a 4-way set-associative cache with a 128-byte cache line
- Only 256 KiB of L2 memory can be configured as cache or SRAM
- 32 KiB of the L2 memory is always mapped as SRAM
- The L2 memory controller has an Error Correction Code (ECC) and ED mechanism, including necessary SRAM
- The L2 memory controller supports hardware prefetching and also provides bandwidth management, memory protection, and power-down functions.
- Page size for L2 memory is 16 KiB
- The External Memory Controller (EMC) is a bridge from the C66x CorePac to the rest of the DSP subsystem and device. It has:
- a 32-bit configuration port (CFG) providing access to local subsystem resources (like DSP_EDMA, DSP_SYSTEM, and so forth) or to L3_MAIN resources accessible via the CFG address range.
- a 128-bit slave-DMA port (SDMA) which provides accesses of system masters outside the DSP subsystem to resources inside the DSP subsystem or C66x DSP CorePac memories, i.e. when the DSP subsystem is the slave in a transaction.
- The Extended Memory Controller (XMC) processes requests from the L2 Cache Controller (which are a result of CPU instruction fetches, load/store commands, cache operations) to device resources via the C66x DSP CorePac 128-bit master DMA (MDMA) port:
- Memory protection for addresses outside C66x DSP CorePac generated over device L3_MAIN on the MDMA port
- Prefetch, multi-in-flight requests
- A DSP local Interrupt Controller (INTC) in the DSP C66x CorePac interfaces the system events to the DSP C66x core CPU interrupt and exception inputs. Each DSP subsystem C66x CorePac interrupt controller supports up to 128 system events, of which 64 interrupts are external to the DSP subsystems and are collected from the DSP1/DSP2 dedicated outputs of the device Interrupt Crossbar.
- Local Enhanced Direct Memory Access (EDMA) controller features:
- Channel controller (CC): 64-channel, 128 PaRAM, 2 Queues
- 2 × Third-party Transfer Controllers (TPTC0 and TPTC1):
- Each TC has a 128-bit read port and a 128-bit write port
- 2-KiB FIFOs on each TPTC
- 1-dimensional/2-dimensional (1D/2D) addressing
- Chaining capability
- DSP subsystem integrated MMUs:
- Two MMUs are integrated:
- The MMU0 is located between DSP MDMA master port and the device L3_MAIN interconnect and can be optionally bypassed
- The MMU1 is located between the EDMA master port and the device L3_MAIN interconnect
- A DSP local Power-Down Controller (PDC) is responsible for powering down various parts of the DSP C66x CorePac, or the entire DSP C66x CorePac.
- The DSP subsystem System Control logic provides:
- Slave idle and master standby protocols with device PRCM for powerdown
- OCP Disconnect handshake for init and target busses
- Asynchronous reset
- Power-down modes:
- "Clockstop" mode featuring wake-up on interrupt event. The DMA event wake-up is managed in software.
- The device DSP subsystems are supplied by a PRCM DPLL, but each of DSP1 and DSP2 integrates its own PLL module outside the C66x CorePac for clock gating and division.
- The device DSP subsystem has the following port instances to connect to the remaining part of the device:
- A 128-bit initiator (DSP MDMA master) port for MDMA/Cache requests
- A 128-bit initiator (DSP EDMA master) port for EDMA requests
- A 32-bit initiator (DSP CFG master) port for configuration requests
- A 128-bit target (DSP slave) port for requests to DSP memories and various peripherals
- C66x DSP subsystem (DSPSS) safety aspects:
- Above mentioned memory ECC/ED mechanisms
- MMUs enable mapping of only the necessary application space to the processor
- Memory Protection Units internal to the DSPSS (in L1P, L1D and L2 memory controllers) and external to DSPSS (firewalls) to help define legal accesses and raise exceptions on illegal accesses
- Exceptions: Memory errors, various DSP errors, MMU errors, and some system errors are detected and cause exceptions. The exceptions can be handled by the DSP or by a designated safety processor at the chip level. Note that it may not be possible for the safety processor to completely handle some exceptions.
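The cache parameters given in the memory components list above (capacity, associativity, and line size) determine the number of sets in each cache. The following C sketch derives those numbers, assuming each memory is configured with its maximum cache size. The `cache_geom` type and function names are hypothetical, and the simple modulo set index is an assumption for illustration; the source does not state how the hardware selects a set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor for a set-associative cache. */
typedef struct {
    uint32_t size_bytes;  /* capacity configured as cache */
    uint32_t ways;        /* associativity (1 = direct-mapped) */
    uint32_t line_bytes;  /* cache line size */
} cache_geom;

/* Number of sets = capacity / (ways * line size). */
static uint32_t cache_num_sets(cache_geom g)
{
    assert(g.ways != 0 && g.line_bytes != 0);
    assert(g.size_bytes % (g.ways * g.line_bytes) == 0);
    return g.size_bytes / (g.ways * g.line_bytes);
}

/* Assumed set selection: (address / line size) modulo number of sets. */
static uint32_t cache_set_index(cache_geom g, uint32_t addr)
{
    return (addr / g.line_bytes) % cache_num_sets(g);
}

/* Geometries from the list above, assuming the maximum cache configuration. */
static const cache_geom L1P_CACHE = {  32u * 1024u, 1u,  32u };
static const cache_geom L1D_CACHE = {  32u * 1024u, 2u,  64u };
static const cache_geom L2_CACHE  = { 256u * 1024u, 4u, 128u };
```

Under these assumptions, L1P has 1024 sets, L1D has 256 sets, and L2 has 512 sets. Configuring part of a memory as SRAM reduces `size_bytes` and therefore the set count.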
Unsupported features on the C66x DSP core for the device are:
- The Extended Memory Controller MPAX (memory protection and address extension) 36-bit addressing is NOT supported
Known DSP subsystem powermode restrictions for the device are:
- "Full logic / RAM retention" mode, featuring wake-up on either an interrupt or a DMA event (logic in the “always on” domain), is not supported. Only OFF mode is supported by the DSP subsystem, which requires a full boot.
For more information about C66x debug/trace support, see the On-Chip Debug Support chapter of the device TRM.