TMS320C8x Features
As the flagship of TI's industry-leading TMS320 DSP
family, the TMS320C8x generation is a true breakthrough in digital signal
processing that promises to change the way we process information. The 'C80,
the first member of the 'C8x generation, is the highest performance and
most highly integrated DSP ever produced by Texas Instruments. Four advanced
DSPs and a RISC master processor are integrated on a single chip to deliver
over two billion RISC-like operations per second (BOPS).
The latest addition to the 'C8x generation is the 'C82, a scaled-down
version of the 'C80. The 'C82 provides two advanced DSPs coupled with a
RISC master processor for high-performance, cost-sensitive applications.
'C8x Key specifications ('C82 specifications shown in parentheses):
- 32-bit RISC master processor with IEEE-754 floating-point hardware
- Four 32-bit parallel, advanced DSPs (Two 32-bit parallel, advanced
DSPs)
- 50 Kbytes of on-chip SRAM (44 Kbytes of on-chip SRAM)
- Video controller with dual frame timers (No video controller)
- Built-in internal emulation and boundary scan paths through an IEEE
1149.1 test access port
- Transfer controller for cache servicing and transferring data between
external memory and internal SRAM
- Each DSP uses a 32-bit local port and global access to on-chip SRAM
data
- Instruction ports consist of a 64-bit port for each DSP and a 32-bit
port for the master processor
- On-chip cache or data RAM is accessed via a 64-bit port
- On-chip processors use crossbar switching to access on-chip RAM
- Direct interface to DRAM, SDRAM, SRAM and VRAM
'C8x Key applications:
- Large-scale video conferencing ('C80)
- Desktop video conferencing ('C82)
- Video phones ('C82)
- Digital switching for cellular base stations ('C82)
- Image processing
- Video processing
- Multimedia workstations
- 2-D and 3-D graphics accelerators
- Virtual reality
- Real-time compression systems
- Security
- Radar/sonar systems
- Cable TV video compression
- Document imaging
Features By Device
TMS320C8x Master Processor (MP)
The master processor (MP) is a 32-bit RISC processor with an integral
IEEE-754 floating-point unit. As with other RISC processors, all
accesses to memory are performed with load and store instructions,
and most integer and logical operations are performed on registers
in a single cycle. The floating-point instructions are pipelined;
therefore, you can start a single-precision multiply or any floating-point
add instruction on each clock cycle. Moreover, the floating-point
unit approaches 100 MFLOPS in performance at a 50-MHz internal clock
rate.
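The 100-MFLOPS figure follows from the clock rate if one multiply and
one add can both start in the same cycle. The sketch below only restates
that arithmetic; the function name peak_mflops and the assumption of two
floating-point operations per cycle are illustrative and not taken from
TI documentation.

```c
/* Peak floating-point rate in MFLOPS: clock rate in MHz multiplied by
   the number of floating-point operations that can start per cycle.
   Assumption: one multiply plus one add per cycle gives 2. */
double peak_mflops(double clock_mhz, unsigned flops_per_cycle)
{
    return clock_mhz * (double)flops_per_cycle;
}
```

Calling peak_mflops(50.0, 2) gives the 100-MFLOPS peak quoted above;
sustained rates depend on how well the pipeline stays full.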
Floating-point operations use the same register file as the integer
and logic unit. A register scoreboard ensures that correct register-access
sequences are maintained.
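As a rough software model of what a register scoreboard does, the
sketch below keeps one "pending" bit per register and lets an
instruction issue only when none of its registers is still awaiting a
result. The type and function names are hypothetical, and the actual MP
hardware may track different conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit n set means register n still awaits the
   result of a load or floating-point operation (n must be below 32). */
typedef struct {
    uint32_t pending;
} scoreboard_t;

/* Mark a destination register as awaiting a result. */
void sb_mark_pending(scoreboard_t *sb, unsigned reg)
{
    sb->pending |= (UINT32_C(1) << reg);
}

/* Clear the pending bit once the result has been written back. */
void sb_complete(scoreboard_t *sb, unsigned reg)
{
    sb->pending &= ~(UINT32_C(1) << reg);
}

/* An instruction may issue only if neither of its sources nor its
   destination is still pending. */
bool sb_can_issue(const scoreboard_t *sb,
                  unsigned src1, unsigned src2, unsigned dst)
{
    uint32_t needed = (UINT32_C(1) << src1)
                    | (UINT32_C(1) << src2)
                    | (UINT32_C(1) << dst);
    return (sb->pending & needed) == 0;
}
```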
The MP is structured for efficient execution of C code. For example,
the MP provides an R0 register, often called a zeroing register,
which compiled C code uses heavily. The MP instruction set is also
tailored to the operations that C compilers commonly generate.
Features of the master processor include:
- 32-bit RISC CPU delivering 50 MIPS @ 50 MHz
- Targeted for high-level languages
- IEEE-754 100-MFLOP floating-point unit
- Parallel multiply, add, and load/store
- 31 32-bit registers
- Single file for integer and floating point
- Loads and FPU results are scoreboarded
- Instruction and data cache control
- 4K-byte instruction cache
- 4K-byte data cache
- 2K-byte parameter RAM ('C80), 4K-byte parameter RAM ('C82)
TMS320C8x MP Floating-Point Unit
The MP's floating-point unit performs IEEE-754 floating-point
operations in 32-bit single-precision and 64-bit
double-precision formats. Conversion between different
formats is also supported. In addition, the floating-point unit
provides vector floating-point operations with the option of performing
a parallel load or store to improve program efficiency.
Hardware support for the floating-point unit consists of a full
double-precision floating-point add unit and a 32-bit single-precision
floating-point multiply unit:
- IEEE-754 floating point
- Hardware exception handling
- FP add unit with double-precision ALU
- 1-cycle adds/subs/compares (single and double) and conversions
- 6-cycle single- and 20-cycle double-precision divide
- 9-cycle single- and 26-cycle double-precision square root
- The floating-point multiply unit performs all multiplies (integer
and floating-point), divides, and square roots.
- 1-cycle single-precision multiply
- 4-cycle double-precision multiply
- Pipelined: can start a new instruction every cycle
- 3-stage pipeline
- Register file scoreboard prevents "races"
- Vector FP for 100-MFLOP operation
- Parallel multiply, add, and 64-bit load (p++) in one cycle
- 4 double-precision accumulator registers support pipelining
- Supports matrix multiplies, DCTs, and FFTs
- FP status and interrupt-enable registers
- MP's test-and-branch instructions access FP status
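The vector multiply-add with several accumulators can be pictured in
plain C as a dot product that keeps four independent partial sums, so
that consecutive multiply-adds do not wait on each other. This is only
a portable sketch of the idea; the name dot4 is hypothetical and the
code does not use the MP's actual vector instructions.

```c
#include <stddef.h>

/* Dot product with four independent double-precision partial sums,
   mirroring the idea of four accumulators feeding a pipelined
   multiply-add. Each loop step consumes four element pairs. */
double dot4(const float *a, const float *b, size_t n)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        acc[0] += (double)a[i]     * (double)b[i];
        acc[1] += (double)a[i + 1] * (double)b[i + 1];
        acc[2] += (double)a[i + 2] * (double)b[i + 2];
        acc[3] += (double)a[i + 3] * (double)b[i + 3];
    }
    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        acc[0] += (double)a[i] * (double)b[i];
    }
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

The same pattern underlies the inner loops of matrix multiplies, DCTs,
and FFTs mentioned in the list above.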
TMS320C8x Parallel Processing Advanced Digital Signal Processors (PP)
The parallel processing advanced digital signal processors (PPs)
provide much of the 'C8x's performance. The PPs are designed to
perform digital signal processing along with bit-field and multiple-pixel
manipulation. These processors have advanced features that are
not found in any other DSP or general-purpose processor and can
perform in excess of ten RISC-like operations in each cycle.
In order to specify the multiple parallel operations that the
PPs can perform, a wide instruction word of 64 bits is used. The
instruction has fields that independently control the data unit
and the two address units. All instructions execute in a minimum
of a single cycle.
Each PP has a register file of 44 user-visible registers. All
registers can be the source or destination of ALU or memory operations.
The register set is divided into files according to each register's
function. Features of the PP include:
- Two address units
- Up to two memory operations/cycle
- Single-cycle multiplier
- One 16-bit or two 8-bit results/cycle
- Splittable 3-input ALU
- Multiple operations in each pass
- Up to four 8-bit results/cycle
- Pixel and bit field hardware
- 3-input ALU with mixed arithmetic and Boolean operations
- Can perform masking at the same time as an add or subtract
- Flexible data path feeding 3-input ALU
- Fast bit and field processing
- Address data paths can be used for general-purpose arithmetic
- Byte/halfword multiple arithmetic
- Single instruction stream, multiple data stream (SIMD) processing
within each processor
- Better handling of pixels and Z-buffers than in other DSPs or
general-purpose processors
- Eight primary data registers, d0 to d7 (D registers), that can perform
up to seven reads and four writes
- Two multiplier sources, three ALU sources, one multiplier result,
one ALU result, and three LD/ST/MOVE
- Splittable multiplier for fast pixel math
- Any D register can be used on a multiply-with-parallel-add
- Three levels of zero-overhead loops
- Conditional operations (for ALU, load/store, and/or register
source)
TMS320C8x PP Data Unit
The parallel-processing advanced DSP (PP) data unit has two data
paths; each data path has its own set of hardware that functions
independently of the other data path.
The ALU data path includes a barrel rotator, mask generator, 1-bit
to n-bit expander, and a 3-input ALU that can combine the mask
or expander output with register data to create over 2,000 different
processing options. The 3-input ALU can perform 512 logical and/or
mixed logical and arithmetic operations that support masking or
merging and addition/subtraction in a single pass. The ALU can
also be split to perform multiple 8-bit or 16-bit operations in
parallel.
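The operations described above can be imitated in portable C to show
what they compute. The sketch below models a mask generator, a
mask-and-add, a mask merge, and a split add on four 8-bit lanes of a
32-bit word. All function names are hypothetical; on the PP each of
these would be a single ALU pass, whereas the C versions take several
steps.

```c
#include <stdint.h>

/* Mask generator: a word with the low n bits set (n from 0 to 32). */
uint32_t make_mask(unsigned n)
{
    return (n >= 32) ? UINT32_C(0xFFFFFFFF) : ((UINT32_C(1) << n) - 1);
}

/* Three-input "mask and add": add only the masked bits of b to a. */
uint32_t add_masked(uint32_t a, uint32_t b, uint32_t mask)
{
    return a + (b & mask);
}

/* Three-input merge: take bits of a where mask is 1, else bits of b. */
uint32_t merge_masked(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Split add: four independent 8-bit additions inside one 32-bit word.
   The low 7 bits of each lane are added normally; the top bit of each
   lane is then fixed up with XOR so no carry crosses a lane boundary. */
uint32_t split_add8(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & UINT32_C(0x7F7F7F7F)) + (b & UINT32_C(0x7F7F7F7F));
    return low7 ^ ((a ^ b) & UINT32_C(0x80808080));
}
```

Each lane of split_add8 wraps around modulo 256, which is one possible
behavior; saturation or carry capture are not modeled here.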
The PP data unit features are:
- 3-input ALU (512 operations)
- Mixed arithmetic and Boolean in one cycle (mask and add/sub
in one pass)
- Mask/merge and field processing
- Splittable for multibyte operations
- 16-bit × 16-bit multiplier (32-bit results)
- Rounding for DCT accuracy
- Splittable into two 8-bit × 8-bit multipliers (16-bit results)
- Flexible data path
- Barrel rotator
- Mask generator
- N-to-1 and 1-to-N translations via mf register
- Left/rightmost one and bit-change
- 44 user-visible registers
- Any register can be operand of ALU
- Eight D registers
- Conditional operations
- Conditional choice of register pair source
- Conditional save of result
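The two multiplier modes listed above can be modeled in C as follows.
This is a sketch of the arithmetic only: the function names, the
choice of signed operands in full mode and unsigned operands in split
mode, and the way the two 16-bit products are packed into one word are
all assumptions made for illustration.

```c
#include <stdint.h>

/* Full mode: one 16-bit x 16-bit multiply with a 32-bit result. */
int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Split mode: each 16-bit operand carries two unsigned 8-bit values.
   The low bytes and the high bytes are multiplied separately, and the
   two 16-bit products are packed into one 32-bit word
   (high product in bits 31..16, low product in bits 15..0). */
uint32_t mul8x2(uint16_t a, uint16_t b)
{
    uint32_t lo = (uint32_t)(a & 0xFFu) * (uint32_t)(b & 0xFFu);
    uint32_t hi = (uint32_t)(a >> 8) * (uint32_t)(b >> 8);
    return (hi << 16) | lo;
}
```

Because 255 × 255 = 65025 fits in 16 bits, the two products in split
mode can never overlap.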
TMS320C8x Transfer Controller (TC)
The transfer controller (TC) is a combined DMA machine and memory
interface that intelligently queues, prioritizes, and services
the data requests and cache misses of the MP and the PPs. The
transfer controller interfaces directly with the on-chip SRAMs.
Through the TC, all of the processors can access the system external
to the chip. In addition, data-cache or instruction-cache misses
are automatically handled by the TC.
Data transfers are specifically requested by the PPs or the MP
in the form of linked-list packet transfers, which are handled
by the TC. These requests allow multidimensional blocks of information
to be transferred between a source and destination, either of
which can be on-chip or off-chip. Packet-oriented data transfers
offer compatibility with several local area network standards,
such as ATM.
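To make the idea of a linked list of multidimensional transfers
concrete, the sketch below models a two-dimensional packet as a
descriptor holding source and destination pointers, a pitch for each
side, and a line size and count. The descriptor layout and all names
are hypothetical and do not reproduce the real 'C8x packet-transfer
parameter format; the copy is done by the CPU here, whereas on the
device the TC performs it autonomously.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 2-D packet descriptor. Pitches are the byte distances
   between the starts of consecutive lines on each side. */
typedef struct packet2d {
    const uint8_t *src;
    uint8_t *dst;
    size_t src_pitch;
    size_t dst_pitch;
    size_t line_bytes;            /* bytes copied per line */
    size_t line_count;            /* number of lines */
    const struct packet2d *next;  /* next packet in the list, or NULL */
} packet2d_t;

/* Copy one rectangular block line by line. */
void packet2d_copy(const packet2d_t *p)
{
    for (size_t row = 0; row < p->line_count; ++row) {
        memcpy(p->dst + row * p->dst_pitch,
               p->src + row * p->src_pitch,
               p->line_bytes);
    }
}

/* Service every packet in a linked list, in order. */
void packet_list_run(const packet2d_t *head)
{
    for (const packet2d_t *p = head; p != NULL; p = p->next) {
        packet2d_copy(p);
    }
}
```

A typical use is to pull a small window out of a larger image: set
src_pitch to the image width, dst_pitch to the window width, and point
src at the window's top-left pixel.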
The TC performs:
- Cache fills and writes
- Direct loads and stores from/to off-chip memory via direct external
access (DEA) requests
- Block movement of data via packet transfers
- Refresh and SRT (shift register transfer) cycles needed to
maintain DRAMs and VRAM capture/display buffers, respectively
Features of the TC include:
- 400 Mbytes/s external bandwidth
- Direct DRAM, VRAM, SRAM, and SDRAM control
- Dynamic bus sizing (64, 32, 16, or 8 bits)
- Packet transfers controlled autonomously by transfer controller
- Linear x/y addressing
- Independent source and destination
- Automatic byte alignment
- Intelligent request queuing and prioritization
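The 400-Mbytes/s figure is consistent with a 64-bit bus moving one word
per cycle at 50 MHz. The sketch below restates that arithmetic for each
supported bus width; the function name and the assumption of exactly
one transfer per 50-MHz clock cycle are illustrative, not taken from
the device data sheet.

```c
/* Peak external bandwidth in Mbytes/s, assuming one bus transfer per
   clock cycle: (bus width in bits / 8) bytes times the clock in MHz. */
unsigned peak_mbytes_per_s(unsigned bus_bits, unsigned clock_mhz)
{
    return (bus_bits / 8u) * clock_mhz;
}
```

Under this assumption, narrowing the bus to 32, 16, or 8 bits scales
the peak to 200, 100, or 50 Mbytes/s.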
The 'C82 TC includes a memory configuration cache that consists
of six 32-bit words that describe the properties of the six most
recently used banks of memory. The cache automatically loads a configuration
word each time a new bank is accessed, and it can be
locked into a set high or low priority. The configuration cache
reduces the number of pins necessary in the 'C82 and in support
chips.
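A six-entry cache of the most recently used banks behaves much like a
small least-recently-used (LRU) table. The sketch below models that
behavior in C under stated assumptions: the structure, the function
names, and the strict LRU replacement policy are hypothetical, and the
priority-locking feature is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define CFG_ENTRIES 6

/* Hypothetical model: entry 0 is the most recently used bank. */
typedef struct {
    uint32_t bank[CFG_ENTRIES];
    uint32_t config[CFG_ENTRIES];
    unsigned count;               /* number of valid entries */
} cfg_cache_t;

/* Shift entries 0..pos-1 back by one and place the pair at the front. */
static void cfg_place_front(cfg_cache_t *c, unsigned pos,
                            uint32_t bank, uint32_t word)
{
    for (unsigned i = pos; i > 0; --i) {
        c->bank[i] = c->bank[i - 1];
        c->config[i] = c->config[i - 1];
    }
    c->bank[0] = bank;
    c->config[0] = word;
}

/* Look up a bank. On a hit, return true, store its configuration word,
   and promote the entry to most recently used. */
bool cfg_lookup(cfg_cache_t *c, uint32_t bank, uint32_t *word_out)
{
    for (unsigned pos = 0; pos < c->count; ++pos) {
        if (c->bank[pos] == bank) {
            uint32_t word = c->config[pos];
            cfg_place_front(c, pos, bank, word);
            *word_out = word;
            return true;
        }
    }
    return false;
}

/* After a miss, load a new word; when full, the least recently used
   entry (the last one) is overwritten. */
void cfg_insert(cfg_cache_t *c, uint32_t bank, uint32_t word)
{
    if (c->count < CFG_ENTRIES) {
        c->count++;
    }
    cfg_place_front(c, c->count - 1, bank, word);
}
```

In this model a caller first tries cfg_lookup and, on a miss, fetches
the configuration word by whatever means the system provides and then
calls cfg_insert.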