
RCIM 2008 - - UniCal 

Published:  December 20, 2009
 
Notes:
 
Slide 1: Energy-Efficient Coarse-Grain Reconfigurable Array for Accelerating Digital Signal Processing
  Pasquale Corsonello, Fabio Frustaci, Marco Lanuzza, Stefania Perri, Paolo Zicari
  Department of Electronics, Computer Science and Systems (DEIS), University of Calabria, Rende (CS)
Slide 2: Outline
  - Motivation
  - The proposed Coarse-Grain Reconfigurable Array (CGRA): architectural overview, computational model
  - Post-layout results
  - Comparison
  - Conclusion
Slide 3: The Challenge
  Digital Signal Processing (DSP) is now used extensively in several application areas:
  - Multimedia
  - Image analysis and processing
  - Speech processing
  - Wireless communication
  These applications impose strict hardware requirements:
  - High performance: real-time operation, high computational load, intensive arithmetic operations (add, sub, shift, mult, mult-acc)
  - Energy efficiency: portable devices
  - Flexibility: support for multiple applications, keeping pace with rapidly evolving algorithms
Slide 4: Executing DSP on various architectures
  Diagram: full-custom solutions, reconfigurable computing (CGRA, FPGA), and general-purpose processors and programmable digital signal processors, placed along two opposing axes, increasing flexibility and increasing performance.
  Reconfigurable computing architectures provide an intermediate tradeoff between flexibility and performance.
Slide 5: Reconfigurable Computing
  FPGAs are very flexible (gate-level functions, general routing), but the flexibility is very expensive:
  - FPGAs are slower than ASICs, have lower logic density, and are inefficient for word-level operations
  - Long reconfiguration time
  CGRAs use multi-bit-wide processing elements (PEs) and routing structures that are more speed-, area- and power-efficient:
  - A compromise between programmability and fixed functionality
  - Flexible and efficient within an application domain
Slide 6: Architectural Overview
  Diagram: an array of reconfigurable cells, each pairing a RAM with a PE, connected through latched programmable switches; an I/O data and configuration central controller; a host interface and an external memory interface carrying configuration and elaboration data and addresses.
  - Distributed small RAMs and a purpose-designed interconnection scheme achieve high performance
  - Run-time reconfigurable cells achieve high flexibility within the target application domain
  - Distributed control logic reduces control complexity and enhances data parallelism
Slide 7: The Reconfigurable Cell
  - I/O interface similar to a conventional RAM: 2 input/output data ports, 2 input address ports, 1 output address port, I/O control signals
  - Dual-port SRAM (256 x 8-bit) data memory
  - Reconfigurable 8-bit PE
  - Internal control unit with two operative states: loading and executing
  Diagram labels: Input Stage (AddrA/B_ext, Data_InA/B_ext), RAM Interface, Dual-Port SRAM (256 x 8-bit), Control Unit, Config. Mem, PE (8-bit), Output Stage (Addr_Out_ext, Data_OutA/B_ext), control signals, Config. Data.
Slide 8: Functionality of the RC in the executing state
  Four configurations, each drawn as a RAM/PE pair:
  - (a) feed-forward mode
  - (b) feed-back mode
  - (c) route-through mode
  - (d) route-through mode (double throughput)
Slide 9: The Processing Element
  - Single-clock-cycle operations: ADD, SUB, ACC, INC, DEC, MUL, MUL-ACC, SHIFT
  - Fast and low-cost
  Diagram labels (datapath): A-Register (8-bit) and B-Register (8-bit); MULT1 (8x4-bit) and MULT2 (8x4-bit); HA-based Compressor (4-bit) and 3:2 (FA-based) Compressor (8-bit); Adder1 (4-bit) with a 4-bit register giving O[3:0], Adder2 (8-bit) with an 8-bit register giving O[11:4], Adder3 (4-bit) with a 4-bit register giving O[15:12]; carries co1 and co2; signals S0 to S6 and S7 = cin; constants 00000001, 0001, 00000000, 0000.
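The datapath on this slide uses two 8x4-bit multipliers rather than a single 8x8 unit. A minimal Python sketch of how two such partial products can be combined into a 16-bit product is given here. The function names, the 16-bit wrap-around, and the semantics chosen for each opcode (for example SHIFT as a left shift by the B operand) are illustrative assumptions, not details taken from the slides.

```python
MASK8 = 0xFF      # operand registers A and B are 8 bits wide
MASK16 = 0xFFFF   # the output O[15:0] is 16 bits wide


def mul8_via_two_8x4(a, b):
    """Unsigned 8x8 multiply built from two 8x4 partial products."""
    a &= MASK8
    b &= MASK8
    low = a * (b & 0x0F)   # partial product with the low nibble of b
    high = a * (b >> 4)    # partial product with the high nibble of b
    # The high partial product is weighted by 2**4 before the final addition.
    return (low + (high << 4)) & MASK16


def pe_execute(op, a, b=0, acc=0):
    """Hypothetical PE model; every opcode meaning below is an assumption."""
    a &= MASK8
    b &= MASK8
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a - b
    elif op == "ACC":
        result = acc + a
    elif op == "INC":
        result = a + 1
    elif op == "DEC":
        result = a - 1
    elif op == "MUL":
        result = mul8_via_two_8x4(a, b)
    elif op == "MUL-ACC":
        result = acc + mul8_via_two_8x4(a, b)
    elif op == "SHIFT":
        result = a << (b & 0x0F)   # assumed: left shift by the low nibble of b
    else:
        raise ValueError(f"unknown op: {op}")
    return result & MASK16         # wrap to the 16-bit output width


if __name__ == "__main__":
    print(pe_execute("MUL-ACC", 3, 4, acc=10))
```

Splitting the multiplier this way keeps each partial product at 12 bits, which is consistent with the 4-bit, 8-bit and 4-bit adder chain listed on the slide, although the exact mapping of partial products onto MULT1 and MULT2 is not stated there.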
Slide 10: The Control Unit
  Instructions define the execution of vector/block operations on a large data stream. Each instruction consists of several fields:
  - op_code specifies the operation code
  - #ops specifies the number of operations to be performed in the current instruction
  - address descriptors specify the data organization in memory
  Diagram labels: Configuration Data, Config. Memory, Instr. Counter, Instruction Decoder (op_code, #ops, Address Descriptors), Addresses Generator (AddrA_int, AddrB_int, Addr_ext), Handshake & Elab. Control (Handshake Signals), PE & I/O control signals.
Slide 11: The Address Generator
  Diagram labels: base_address, step, step_register, skip, skip_register, subset down counter (=0 raises end_subset), control_signal, address_calculation_adder, addr_register, current_address.
  Scan modes and their parameters:
  - Continuous vector forward scan: Step=1, Subset=8, Skip=0
  - Continuous vector reverse scan: Step=-1, Subset=8, Skip=0
  - Continuous vector (column mode) forward/reverse scan: Step=n/-n, Subset=8, Skip=0
  - Sparse vector forward scan: Step=2, Subset=4, Skip=0
  - Sparse vector reverse scan: Step=-2, Subset=4, Skip=0
  - Sparse vector (column mode) forward/reverse scan: Step=2n/-2n, Subset=4, Skip=0
  - Rotating vector forward scan: Step=1, Subset=8, Skip=-7
  - Rotating vector reverse scan: Step=-1, Subset=8, Skip=+7
  - Block scan (forward/reverse mode): Step=1/-1, Subset=3, Skip=n-3/-n+3
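The parameter sets on this slide suggest a simple update rule: step is added after every access, and skip is added on top of step when the subset down-counter expires. The Python sketch below models that rule. The rule is inferred from the listed examples rather than stated on the slide, and the function name and the unbounded integer addresses (the real data RAM has 256 entries) are assumptions.

```python
from itertools import islice


def address_stream(base_address, step, subset, skip):
    """Yield addresses for one scan mode (assumed update rule).

    Within a subset the address advances by `step`; after the last
    element of a subset it advances by `step + skip` and the subset
    down-counter is reloaded.
    """
    addr = base_address
    remaining = subset              # models the subset down-counter
    while True:
        yield addr
        remaining -= 1
        if remaining == 0:          # end_subset: apply the extra skip
            addr += step + skip
            remaining = subset
        else:
            addr += step


if __name__ == "__main__":
    n = 8  # assumed row length of a row-major matrix
    # Block scan: a 3-wide block, moving down one row per subset.
    print(list(islice(address_stream(0, 1, 3, n - 3), 9)))
    # Rotating vector forward scan: an 8-element window that slides by one.
    print(list(islice(address_stream(0, 1, 8, -7), 16)))
```

Under this rule, the block scan with n = 8 visits addresses 0, 1, 2, then 8, 9, 10, and so on, and Skip=0 reduces every mode to a plain strided scan.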
Slide 12: The Interconnection Topology
  Diagram: each cell connects in eight directions (NW, N, NE, W, E, SW, S, SE) through neighbor interconnections and interleaved interconnections, drawn as N-bit and 2N-bit links, with programmable latched switches.
Slide 13: Applications Mapping: Block-level pipelining
  - The computation is organized in concurrently executing kernels
  - Each kernel is implemented by an RC
  - A kernel consumes a set of input data, performs one or more computations, and produces a set of output data
  - RCs communicate by sending addressed packets of data
  - Memory data loading of each cell is overlapped with data production by the previous cell
  - An execution is performed as soon as all necessary input data are available
  - Data synchronization is realized by handshake signals
  - No explicit temporal scheduling of execution is required
  Diagram: RC(i-1), RC(i) and RC(i+1), each with its RAM and PE, alternate Load and Execute phases, staggered from cell to cell.
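The firing rule described on this slide (a cell executes as soon as its input data are available, with no explicit schedule) can be illustrated with a small timing model. The Python sketch below is a simplification under stated assumptions: each stage has a fixed per-block latency, a cell handles one block at a time, and load/execute overlap inside a cell is not modeled. The function name and the latency values are hypothetical.

```python
def pipeline_schedule(num_blocks, stage_latency):
    """Return finish[k][i]: the time at which stage i finishes block k.

    A stage starts block k when two conditions hold, mimicking the
    handshake: the previous stage has produced block k, and this
    stage has finished block k - 1.
    """
    num_stages = len(stage_latency)
    finish = [[0] * num_stages for _ in range(num_blocks)]
    for k in range(num_blocks):
        for i, latency in enumerate(stage_latency):
            data_ready = finish[k][i - 1] if i > 0 else 0   # input available
            cell_free = finish[k - 1][i] if k > 0 else 0    # cell is idle
            finish[k][i] = max(data_ready, cell_free) + latency
    return finish


if __name__ == "__main__":
    # Three blocks through three balanced stages of 2 time units each.
    for row in pipeline_schedule(3, [2, 2, 2]):
        print(row)
```

With three balanced stages the last block finishes at time 10, against 18 for fully sequential execution, and no start times had to be assigned in advance: each one follows from data availability.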
Slide 14: Applications Mapping: Flexible computational load balancing
  - Parallelism in both vertical/temporal and horizontal/spatial directions
  - Horizontal computational load balancing is achieved via data parallelism
  - Vertical computational load balancing is achieved by increasing the number of pipeline stages
  Diagram: two arrangements of cells RAM(1)/PE(1) to RAM(4)/PE(4), labelled data parallel and function parallel.
Slide 15: Architecture evaluation
  - Hardware-assisted simulation environment developed using a Xilinx XC4VLX200 device
  - The implemented system includes 64 RCs organized in 4x4 quadrants
  - The number of required clock cycles was precisely evaluated for different DSP benchmarks (YCbCr/RGB, 2D-DCT, 2D-FIR)
  - Physical evaluation for the ST 90 nm CMOS technology:
    - Reconfigurable cell: synthesis done with Synopsys Design Compiler; physical design done with Cadence SoC Encounter, also considering manufacturing (such as DRCs and antennas) and signal integrity (SI) issues
    - Interconnections: preliminary electrical simulations were performed
  - The results were compared with a 90 nm CMOS Virtex-4 FPGA
Slide 16: RC Layout
  Layout blocks: Input Stage, RAM Interface, Dual-Port SRAM (256 x 8-bit), Configuration Memory, PE, Control Unit, Output Stage.
  - Technology: CMOS 90 nm
  - Supply voltage: 1.0 V
  - Frequency: 1 GHz
  - Core area: 79.52 um2
  - Avg. dynamic power @ 1 GHz: 20 mW
  - Leakage power: 627.6 uW
Slide 17: Resource usage/energy/performance tradeoff comparison: proposed array vs. Xilinx Virtex-4
  Resources / area [mm2]:
    Algorithm                 Proposed Reconfigurable Array   Virtex-4 FPGA (CORE Generator)
    Color Space Conversion    13 RCs / 1.034                  436 Slices + 2 BRAM / 1.572
    2D separable 4x4 FIR      20 RCs / 1.590                  440 Slices + 2 BRAM / 1.657
    2D-DCT (8x8)              22 RCs / 1.749                  786 Slices + 3 BRAM / 2.919
  Throughput [MOPS] and energy efficiency [MOPS/W] for an 8x8 image block, one line per benchmark in the order the values appear on the slide (the slide text does not preserve which algorithm each line belongs to):
    Proposed [MOPS]   Proposed [MOPS/W]   Virtex-4 [MOPS]   Virtex-4 [MOPS/W]
    13.3              45.9                1.7               29.1
    10.5              23.9                1.3               18.4
    10.2              20.8                2.1               14.2
  - Speedups ranging from 4.8X to 8X
  - Energy efficiency improvement ranging from 24% to 58%
  - Area saving up to 40%
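The speedup and area-saving figures quoted on this slide can be checked against the tabulated values with a few lines of arithmetic. The Python sketch below assumes that the throughput values pair up in the order they appear on the slide; the variable names are illustrative.

```python
# (proposed array, Virtex-4) throughput in MOPS, in slide order.
throughput = [(13.3, 1.7), (10.5, 1.3), (10.2, 2.1)]

# (proposed array, Virtex-4) area in mm2: color space conversion,
# 2D separable 4x4 FIR, 2D-DCT (8x8).
area = [(1.034, 1.572), (1.590, 1.657), (1.749, 2.919)]

# Number of reconfigurable cells used by each benchmark.
rc_counts = [13, 20, 22]

# Speedup is the ratio of proposed to Virtex-4 throughput.
speedups = [proposed / fpga for proposed, fpga in throughput]

# Area saving is the fraction of the Virtex-4 area that is avoided.
area_savings = [1 - proposed / fpga for proposed, fpga in area]

# Area occupied by one cell, derived from the proposed-array column.
area_per_rc = [proposed / count for (proposed, _), count in zip(area, rc_counts)]

if __name__ == "__main__":
    print("speedups:", [round(s, 2) for s in speedups])
    print("area savings:", [round(a, 3) for a in area_savings])
    print("area per RC [mm2]:", [round(a, 4) for a in area_per_rc])
```

The ratios come out between roughly 4.9X and 8.1X for speedup and up to about 40% for area saving, in line with the bullet points, and each benchmark implies about 0.0795 mm2 per RC.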
Slide 18: Conclusion
  - Presented the VLSI implementation of a new coarse-grain reconfigurable architecture optimized for high-throughput DSP applications
  - Performance improvement at low cost:
    - Exploits spatial and temporal parallelism
    - High arithmetic processing capability
    - High-bandwidth, low-latency memory access
  - Performance/energy/area evaluations for representative tasks belonging to the target application domain
  - The results demonstrate significant advantages over a conventional FPGA:
    - Speedups ranging from 4.8X to 8X
    - Energy efficiency improvement ranging from 24% to 58%
    - Area saving up to 40%

   