Slide 1: Energy Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing
Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari.
Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende (CS)
Slide 2: Outline
Motivation
The proposed Coarse Grain Reconfigurable Array (CGRA)
Architectural overview
Computational model
Post Layout Results
Comparison
Conclusion
Slide 3: The Challenge
Nowadays, Digital Signal Processing (DSP) is extensively used for several applications
Multimedia
Image analysis and processing
Speech processing
Wireless communication
These applications impose strict hardware requirements
High performance
Real-time operations
High computational load
Intensive arithmetic operations (add, sub, shift, mult, mult-acc)
Energy-efficiency
Portable devices
Flexibility
Support multiple applications
Keep pace with rapidly evolving algorithms
Slide 4: Executing DSP on various architectures
[Figure: architecture spectrum, from Full Custom Solutions through Reconfigurable Computing (CGRA, FPGA) to General Purpose Processors & Programmable Digital Signal Processors; axes: increasing flexibility, increasing performance]
Reconfigurable computing architectures provide an intermediate tradeoff between flexibility and performance
Slide 5: Reconfigurable Computing
FPGAs are very flexible, …
Gate-level functions
General routing
…, but the flexibility is very expensive
FPGAs are slower than ASICs, have lower logic density, and are inefficient for word-level operations
Long reconfiguration time
CGRAs use multi-bit-wide PEs and more speed-, area- and power-efficient routing structures
A compromise between programmability and fixed functionality
Flexible and efficient within an application domain
Slide 6: Architectural Overview
[Figure: array of Reconfigurable Cells (each a RAM + PE pair) linked by Latched Programmable Switches; a Central Controller for I/O data and configuration (Config. Data, Elab. Data); Host Interface (Data, Addr.); External Memory Interface]
Distributed small RAMs and a purpose-designed interconnection scheme to achieve high performance
Run-time reconfigurable cells to achieve high flexibility within the target application domain
Distributed control logic to reduce control complexity and enhance data parallelism
Slide 7: The Reconfigurable Cell
I/O interface similar to a conventional RAM
2 input/output data ports
2 input address ports
1 output address port
I/O control signals
Main components
Dual Port SRAM (256*8-bit) data memory
Reconfigurable 8-bit PE
Internal Control Unit
Two operative states
Loading
Executing
[Figure: RC block diagram with Input Stage (AddrA/B_ext, Data_InA/B_ext), RAM Interface, Dual Port SRAM (256*8-bit), Control Unit with Config. Mem (Config. Data, control signals), PE (8-bit), and Output Stage (Addr_Out_ext, Data_OutA/B_ext)]
Slide 8: Functionality of the RC in the executing state
[Figure: four RAM/PE configurations of the RC in the executing state]
(a) feed-forward mode
(b) feed-back mode
(c) route-through mode
(d) route-through mode (double throughput)
Slide 9: The Processing Element
Single clock cycle operations
ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
Fast and low-cost
[Figure: PE datapath with A-Register and B-Register (8-bit); operand selection through S0-S6 with constant inputs; MULT1 and MULT2 (8x4-bit); HA-based compressor (4-bit) and 3:2 FA-based compressor (8-bit); Adder1 (4-bit, S7=cin) with 4-bit register producing O[3:0]; Adder2 (8-bit) with 8-bit register producing O[11:4]; Adder3 (4-bit) with 4-bit register producing O[15:12]; carry signals co1 and co2]
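As a rough illustration of how an 8x8-bit product can be composed from the two 8x4-bit multipliers named in the figure, the Python sketch below splits one operand into nibbles and recombines the partial products. The function names, the assignment of nibbles to MULT1 and MULT2, and the unsigned-operand assumption are illustrative and not taken from the slides. The 4/8/4-bit segmented addition only loosely mirrors Adder1, Adder2 and Adder3 with the carries co1 and co2.

```python
def mul8x8_from_two_8x4(a: int, b: int) -> int:
    """Unsigned 8x8-bit multiply built from two 8x4-bit partial products."""
    if not (0 <= a <= 0xFF and 0 <= b <= 0xFF):
        raise ValueError("operands must be unsigned 8-bit values")
    low_partial = a * (b & 0xF)    # assumed role of MULT1: a times low nibble of b
    high_partial = a * (b >> 4)    # assumed role of MULT2: a times high nibble of b
    return low_partial + (high_partial << 4)


def segmented_add16(x: int, y: int, cin: int = 0) -> int:
    """16-bit addition split into 4-, 8- and 4-bit segments chained by carries."""
    s1 = (x & 0xF) + (y & 0xF) + cin                  # bits [3:0]
    co1 = s1 >> 4
    s2 = ((x >> 4) & 0xFF) + ((y >> 4) & 0xFF) + co1  # bits [11:4]
    co2 = s2 >> 8
    s3 = ((x >> 12) & 0xF) + ((y >> 12) & 0xF) + co2  # bits [15:12]
    return ((s3 & 0xF) << 12) | ((s2 & 0xFF) << 4) | (s1 & 0xF)


def mul_acc(acc: int, a: int, b: int) -> int:
    """MUL-ACC step: add the 8x8 product to a 16-bit accumulator (wraps at 16 bits)."""
    return segmented_add16(acc, mul8x8_from_two_8x4(a, b))
```

Calling mul_acc repeatedly over a pair of vectors gives a 16-bit dot product, which is the kind of operation the MUL-ACC entry in the list above refers to.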
Slide 10: The Control Unit
Instructions define the execution of vector/block operations on a large data stream
Each instruction consists of several fields
op_code specifies the operation code;
#ops specifies the number of operations to be performed in the current instruction;
address descriptors specify the data organization in the memory.
[Figure: Control Unit block diagram with Configuration Data, Config. Memory, Instr. Counter, Instruction Decoder (op_code, #ops; PE & I/O control signals), Addresses Generator (Address Descriptors; AddrA_int, AddrB_int, Addr_ext), and Handshake & Elab. Control (Handshake Signals)]
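To make the instruction fields more concrete, here is a minimal Python sketch of one vector instruction being executed over a 256-word RAM. Everything beyond the three field names (op_code, #ops, address descriptors) is hypothetical: the class names, the simplified base/step descriptor, the write-back of results into the same RAM, and the wrapping of results to 8 bits are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AddressDescriptor:
    base: int  # first address (hypothetical field)
    step: int  # address increment per operation (hypothetical field)


@dataclass
class Instruction:
    op_code: str                # operation code
    n_ops: int                  # the "#ops" field: operations in this instruction
    src_a: AddressDescriptor    # where operand A lives
    src_b: AddressDescriptor    # where operand B lives
    dst: AddressDescriptor      # where results go (assumption: same RAM)


# Small subset of operations; results wrap to 8 bits to fit a RAM word (assumption).
OPS = {
    "ADD": lambda a, b: (a + b) & 0xFF,
    "SUB": lambda a, b: (a - b) & 0xFF,
    "MUL": lambda a, b: (a * b) & 0xFF,
}


def execute(instr: Instruction, ram: list) -> None:
    """Apply one vector instruction element by element."""
    op = OPS[instr.op_code]
    for i in range(instr.n_ops):
        a = ram[instr.src_a.base + i * instr.src_a.step]
        b = ram[instr.src_b.base + i * instr.src_b.step]
        ram[instr.dst.base + i * instr.dst.step] = op(a, b)
```

A single instruction therefore describes a whole vector operation: the decoder needs only the operation, the count, and a description of where the data sits.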
Slide 11: The Address Generator
[Figure: address generator datapath with base_address, step (step_register), skip (skip_register), subset down counter with =0 detection (end_subset control_signal), address_calculation_adder, and addr_register holding current_address]
Continuous vector forward scan (Step=1, Subset=8, Skip=0)
Continuous vector reverse scan (Step=-1, Subset=8, Skip=0)
Continuous vector (column mode) forward/reverse scan (Step=n/-n, Subset=8, Skip=0)
Block scan (forward/reverse mode) (Step=1/-1, Subset=3, Skip=n-3/-n+3)
Sparse vector forward scan (Step=2, Subset=4, Skip=0)
Sparse vector reverse scan (Step=-2, Subset=4, Skip=0)
Sparse vector (column mode) forward/reverse scan (Step=2n/-2n, Subset=4, Skip=0)
Rotating vector forward scan (Step=1, Subset=8, Skip=-7)
Rotating vector reverse scan (Step=-1, Subset=8, Skip=+7)
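The Step/Subset/Skip settings listed on this slide can be explored with a small Python model. The semantics used here are an assumption inferred from the listed parameter values, not a description of the actual hardware: Subset is taken as the number of addresses per subset, and at the end of each subset the address advances by Step plus Skip. The function name and the count argument are hypothetical, and wrap-around at the 256-word RAM boundary is not modelled.

```python
def generate_addresses(base_address: int, step: int, subset: int,
                       skip: int, count: int) -> list:
    """Return `count` addresses for the given scan parameters.

    Assumed behaviour: a down counter starts at `subset`; each emitted
    address decrements it. When it reaches zero the address moves by
    step + skip and the counter reloads, otherwise it moves by step.
    """
    addresses = []
    addr = base_address
    remaining = subset
    for _ in range(count):
        addresses.append(addr)
        remaining -= 1
        if remaining == 0:
            addr += step + skip   # end of subset: apply the extra skip
            remaining = subset
        else:
            addr += step
    return addresses


# Block scan of a 3-wide block in a row-major array with n = 8 columns
block = generate_addresses(0, step=1, subset=3, skip=8 - 3, count=9)

# Rotating vector forward scan: an 8-element window that slides by one
window = generate_addresses(0, step=1, subset=8, skip=-7, count=16)
```

Under these assumptions the block scan visits three consecutive addresses on each of three successive rows, and the rotating scan revisits the same data shifted by one position, which is the access pattern a FIR filter needs.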
Slide 12: The Interconnection Topology
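[Figure: interconnection topology around an RC, with links in the NW, N, NE, W, E, SW, S and SE directions]
Neighbor interconnections
Interleaved interconnections
Link widths shown: N-bit and 2N-bit
Programmable Latched Switches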
Slide 13: Applications Mapping: Block-level pipelining
The computation is organized in concurrently executing kernels
Each kernel is implemented by an RC
A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
RCs communicate by sending addressed packets of data
Memory data loading of each cell is overlapped with the data production of the previous cell
An execution is performed as soon as all necessary input data are available
Data synchronization is realized by handshake signals
No explicit temporal scheduling of execution is required
[Figure: Load/Execute timeline for three consecutive cells, each with its own RAM and PE]
RC(i-1): Load Execute Load Execute Load Execute Load
RC(i): Load Execute Load Execute Load Execute
RC(i+1): Load Execute Load Execute Load
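The overlap between one cell's loading and the previous cell's execution can be pictured with a short Python sketch. It assumes, purely for illustration, that Load and Execute phases take one equal time slot each and that each cell starts one slot after its predecessor; in the architecture itself the phases are triggered by handshake signals rather than by a fixed schedule. The function name and the "-" idle marker are hypothetical.

```python
def pipeline_timeline(num_cells: int, num_slots: int) -> list:
    """Build a Load/Execute timeline for a chain of cells.

    Cell k is idle ("-") for its first k slots, then alternates
    Load and Execute, so its Load coincides with the Execute of cell k-1.
    """
    timeline = []
    for cell in range(num_cells):
        row = []
        for slot in range(num_slots):
            if slot < cell:
                row.append("-")
            elif (slot - cell) % 2 == 0:
                row.append("Load")
            else:
                row.append("Execute")
        timeline.append(row)
    return timeline


for cell, row in enumerate(pipeline_timeline(num_cells=3, num_slots=7)):
    print(f"cell {cell}: " + " ".join(f"{phase:8s}" for phase in row))
```

Reading the printed rows column by column shows that, once the pipeline is full, every cell that is loading sits next to a predecessor that is executing.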
Slide 14: Applications Mapping: Flexible computational load balancing
Parallelism in both vertical/temporal and horizontal/spatial directions
Horizontal computational load balancing achieved via data parallelism
Vertical computational load balancing achieved by increasing the number of pipeline stages
[Figure: two mappings of RAM(1)-RAM(4) and PE(1)-PE(4), labelled "Data parallel" and "Function parallel"]
Slide 15: Architecture evaluation
Hardware-assisted simulation environment developed using a XILINX XC4VLX200 device
The implemented system includes 64 RCs organized in 4x4 quadrants
The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr/RGB color space conversion, 2D-DCT, 2D-FIR)
Physical Evaluation for the ST 90nm CMOS technology
Reconfigurable Cell
Synthesis done with Synopsys Design Compiler
Physical Design done with Cadence SoC Encounter, also considering manufacturing (such as DRCs and antennas) and Signal Integrity (SI) issues
Interconnections
Preliminary electrical simulations were performed
The results obtained were compared with a 90nm CMOS Virtex-4 FPGA
Slide 16: RC Layout
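[Figure: RC layout showing Input Stage, Dual Port SRAM (256*8-bit), RAM Interface, Configuration Memory, PE, Control Unit, Output Stage]
Technology: CMOS 90nm
Supply voltage: 1.0 V
Frequency: 1 GHz
Core Area: 79.52 x 10^3 um2 (about 0.0795 mm2 per RC, consistent with 13 RCs occupying 1.034 mm2)
Avg. Dyn. Power @1 GHz: 20 mW
Leakage Power: 627.6 uW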
Slide 17: Resource usage/energy/performance tradeoff comparison: new architecture vs. Xilinx Virtex-4
Proposed Reconfigurable Array vs. Virtex-4 FPGA (CORE Generator); throughput and energy efficiency refer to an 8*8 image block
Color Space Conversion
Proposed array: 13 RCs / 1.034 mm2, 13.3 MOPS, 45.9 MOPS/W
Virtex-4: 436 Slices + 2 BRAM / 1.572 mm2, 1.7 MOPS, 29.1 MOPS/W
2D separable 4x4 FIR
Proposed array: 20 RCs / 1.590 mm2, 10.5 MOPS, 23.9 MOPS/W
Virtex-4: 440 Slices + 2 BRAM / 1.657 mm2, 1.3 MOPS, 18.4 MOPS/W
2D-DCT (8x8)
Proposed array: 22 RCs / 1.749 mm2, 10.2 MOPS, 20.8 MOPS/W
Virtex-4: 786 Slices + 3 BRAM / 2.919 mm2, 2.1 MOPS, 14.2 MOPS/W
•Speedups ranging from 4.8X to 8X
•Energy efficiency improvement ranging from 24% to 58%
•Area saving up to 40%
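The speedup and area figures can be re-derived from the numbers on this slide with a few lines of Python. The sketch pairs the area and throughput values with the three benchmarks in the order Color Space Conversion, 2D FIR, 2D-DCT; that pairing is an assumption about how the flattened table should be read, and the dictionary layout and function names are hypothetical.

```python
# (area in mm2, throughput in MOPS) for each implementation, as read from the slide
results = {
    "Color Space Conversion": {"array": (1.034, 13.3), "fpga": (1.572, 1.7)},
    "2D separable 4x4 FIR":   {"array": (1.590, 10.5), "fpga": (1.657, 1.3)},
    "2D-DCT (8x8)":           {"array": (1.749, 10.2), "fpga": (2.919, 2.1)},
}


def speedup(entry: dict) -> float:
    """Throughput of the array divided by throughput of the FPGA."""
    return entry["array"][1] / entry["fpga"][1]


def area_saving(entry: dict) -> float:
    """Fraction of FPGA area saved by the array implementation."""
    return 1.0 - entry["array"][0] / entry["fpga"][0]


for name, entry in results.items():
    print(f"{name}: speedup {speedup(entry):.2f}x, "
          f"area saving {area_saving(entry):.1%}")
```

With this pairing the smallest and largest speedups land close to the quoted 4.8X and 8X, and the largest area saving, for the 2D-DCT, is close to the quoted 40%.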
Slide 18: Conclusion
Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
Performance improvement at a low cost
Exploits spatial and temporal parallelism
High arithmetic processing capability
High-bandwidth, low-latency memory access
Performance/energy/area evaluations for representative tasks belonging to the target application domain
The results obtained demonstrate significant advantages with respect to a conventional FPGA
Speedups ranging from 4.8X to 8X
Energy efficiency improvement ranging from 24% to 58%
Area saving up to 40%