
# Accelerator-Rich Architectures — From Single-chip to Datacenters

#### **Jason Cong**

Chancellor's Professor, UCLA; Director, Center for Domain-Specific Computing

> cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong



| AES 128-bit key<br>128-bit data | Throughput     | Power  | Figure of Merit<br>(Gb/s/W) |
|---------------------------------|----------------|--------|-----------------------------|
| 0.18 µm CMOS                    | 3.84 Gbits/sec | 350 mW | 11 (1/1)                    |
| FPGA [1]                    | 1.32 Gbit/sec  | 490 mW | 2.7 (1/4)                   |
| ASM StrongARM [2]           | 31 Mbit/sec    | 240 mW | 0.13 (1/85)                 |
| ASM Pentium III [3]         | 648 Mbits/sec  | 41.4 W | 0.015 (1/800)               |
| C Emb. Sparc [4]            | 133 Kbits/sec  | 120 mW | 0.0011 (1/10,000)           |
| Java [5] Emb. Sparc         | 450 bits/sec   | 120 mW | 0.0000037 (1/3,000,000)     |

[2] Dag Arne Osvik: 544 cycles AES-ECB on StrongARM SA-1110
 [3] Helger Lipmaa: PIII assembly, handcoded + Intel Pentium III (1.13 GHz) datasheet
 [4] gcc, 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 µm CMOS
 [5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 µm CMOS

Source: P Schaumont and I Verbauwhede, "Domain specific codesign for embedded security," IEEE Computer 36(4), 2003
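
The figure-of-merit column is throughput divided by power. As a minimal sketch, the snippet below recomputes it from the throughput and power columns of the table. The helper name `figure_of_merit` and the dictionary layout are illustrative choices, not part of the cited study.

```python
# Recompute the figure of merit (Gb/s/W) from the throughput and power columns.
# Values are copied from the table above; the helper name is illustrative.

PLATFORMS = [
    # (platform, throughput in bits/s, power in watts)
    ("0.18 um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("ASM Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_second, watts):
    """Return energy efficiency in Gb/s per watt."""
    return bits_per_second / 1e9 / watts


merits = {name: figure_of_merit(bps, w) for name, bps, w in PLATFORMS}
baseline = merits["0.18 um CMOS"]

for name, fom in merits.items():
    # The last column is how many times less efficient the platform is
    # than the dedicated CMOS design.
    print(f"{name:16s} {fom:12.3g} Gb/s/W   1/{baseline / fom:,.0f}")
```

The recomputed values land close to the table's figure-of-merit column. The parenthesized ratios in the table are rounder than what this arithmetic gives, so they are best read as orders of magnitude.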









| Operation                     | Processor ALU    | 45 nm TSMC library |
|-------------------------------|------------------|--------------------|
| 32-bit add                    | 0.122 nJ @ 2 GHz | 0.002 nJ @ 1 GHz   |
| 32-bit multiply               | 0.120 nJ @ 2 GHz | 0.007 nJ @ 1 GHz   |
| Single-precision FP operation | 0.150 nJ @ 2 GHz | 0.008 nJ @ 500 MHz |

Why are processor units so expensive?

- ALU can perform multiple operations
  - Add/sub/bitwise XOR/OR/AND
- 64-bit ALU
- Dynamic/domino logic used to run at high frequency
- Higher power dissipation
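
To make the gap concrete, the short sketch below divides the processor ALU energy by the library-cell energy for each operation in the table. The dictionary name is illustrative. The two columns are quoted at different clock frequencies, so the ratios are indicative rather than a strict like-for-like comparison.

```python
# Energy per operation in nanojoules, copied from the table above.
ENERGY_NJ = {
    # operation: (processor ALU, 45 nm TSMC library cell)
    "32-bit add": (0.122, 0.002),
    "32-bit multiply": (0.120, 0.007),
    "single-precision FP": (0.150, 0.008),
}

# How many times more energy the processor ALU spends per operation.
overhead = {op: cpu / lib for op, (cpu, lib) in ENERGY_NJ.items()}

for op, ratio in overhead.items():
    print(f"{op:22s} processor costs about {ratio:.0f}x the energy")
```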





### So, What Shall We Do with Processors? Our Proposal – Accelerator-Rich Architectures

- A customizable heterogeneous platform (CHP)
  - With a sea of dedicated and composable accelerators
  - Most computations are carried out on accelerators, not on processors!
- A fundamental departure from von Neumann architecture
- Why now?
  - Previous architectures are device/transistor limited
  - Von Neumann architecture allows maximum device reuse
    - One pipeline serves all functions, fully utilized
- Future architectures
  - Plenty of transistors, but power/energy limited (dark silicon)
  - Customization and specialization for maximum energy efficiency
- A story of specialization
  - Different regions are responsible for different functions
  - The remarkable advancement of civilization also came from specialization
    - More advanced societies have a higher degree of specialization











# Medical Image Processing Pipeline

| Stage          | Formulation | Method |
|----------------|-------------|--------|
| reconstruction | Medical images exhibit sparsity, and can be sampled at a rate $\ll$ classical Shannon-Nyquist theory: $\min_{u} \sum_{\text{sampled points}} \lVert ARu - S \rVert^2 + \lambda \sum_{\forall \text{voxels}} \lvert \operatorname{grad}(u) \rvert$ | compressive<br>sensing |
| denoising      | $\forall \text{voxel}: u(i) = \sqrt{\left(\sum_{\text{voxel } j \in \text{volume}} w_{i,j} f(j)^2\right) - 2\sigma^2}, \quad w_{i,j} = \frac{1}{Z(i)} e^{-\frac{\frac{1}{S}\sum_{k=1}^{S} \lvert y_k - z_k \rvert^2}{h^2}}$ | total variational<br>algorithm |
| registration   | $v = \frac{\partial u}{\partial t} + v \cdot \nabla u$; $\mu \Delta v + (\mu + \eta) \nabla (\nabla \cdot v) = -\left[ T(x - u) - R(x) \right] \nabla T(x - u)$ | fluid<br>registration |
| segmentation   | $\frac{\partial \varphi}{\partial t} = \lvert \nabla \varphi \rvert \left[ F(\text{data}, \varphi) + \lambda \operatorname{div}\left( \frac{\nabla \varphi}{\lvert \nabla \varphi \rvert} \right) \right]$; $\text{surface}(t) = \{ \text{voxels } x : \varphi(x, t) = 0 \}$ | level set<br>methods |
| analysis       | $\frac{\partial v}{\partial t} + (v \cdot \nabla) v = -\nabla p + \nu \Delta v + f(x,t)$; $\frac{\partial v_i}{\partial t} + \sum_{j=1}^3 v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \nu \sum_{j=1}^3 \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x,t)$ | Navier-Stokes<br>equations |
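
As a rough illustration of the reconstruction row, the sketch below evaluates a toy 1-D version of the compressive-sensing objective: a data-fidelity term over the sampled points plus a penalty on the gradient magnitude. It rests on simplifying assumptions that are not from the slides: the operator AR is taken to be plain subsampling at known indices, and the gradient is a forward difference. The function and variable names are hypothetical.

```python
# Toy 1-D version of the compressive-sensing objective
#   sum over sampled points (ARu - S)^2  +  lambda * sum over voxels |grad(u)|
# Assumption: AR simply picks the voxels listed in sample_idx.

def cs_objective(u, sample_idx, samples, lam):
    """Data-fidelity term plus a total-variation (sparse-gradient) penalty."""
    fidelity = sum((u[i] - s) ** 2 for i, s in zip(sample_idx, samples))
    total_variation = sum(abs(b - a) for a, b in zip(u, u[1:]))
    return fidelity + lam * total_variation


# A piecewise-constant signal observed at only 3 of its 8 voxels.
truth = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
sample_idx = [1, 4, 7]
samples = [truth[i] for i in sample_idx]

# Both candidates match every sample, so only the gradient penalty differs.
smooth = truth
wiggly = [0.5, 0.0, 0.7, 0.2, 1.0, 0.3, 0.9, 0.0]

print(cs_objective(smooth, sample_idx, samples, lam=0.1))
print(cs_objective(wiggly, sample_idx, samples, lam=0.1))
```

Both candidates agree with the three measurements, yet the piecewise-constant one scores lower. That is the sense in which sparsity lets the image be recovered from far fewer samples than the classical sampling rate would suggest.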

|                                                |                                                             | GPU *<br>(NVIDIA Tesla M2075)                                    | FPGA<br>(Xilinx V6) | Monolithic<br>Accelerators                    |
|------------------------------------------------|-------------------------------------------------------------|------------------------------------------------------------------|---------------------|-----------------------------------------------|
| Deblur                                         | Performance                                                 | 2.4X                                                             | 3.4X                | 7.8X                                          |
|                                                | Energy                                                      | 0.3X                                                             | 3.2X                | 32X                                           |
| Denoise                                        | Performance                                                 | 16.6X                                                            | 1.6X                | 3.5X                                          |
|                                                | Energy                                                      | 1.4X                                                             | 1.2X                | 13X                                           |
| Segmentation                                   | Performance                                                 | 73X                                                              | 16X                 | 16X                                           |
|                                                | Energy                                                      | 6.1X                                                             | 3.6X                | 53X                                           |
| Registration                                   | Performance                                                 | 3.9X                                                             | 6.7X                | 15X                                           |
|                                                | Energy                                                      | 0.4X                                                             | 3.2X                | 60X                                           |
| Average                                        | Performance                                                 | 24X                                                              | 6.9X                | 10X                                           |
|                                                | Energy                                                      | 2X                                                               | 2.8X                | 39.8X                                         |

NOTE: GPU power values were full-system measurements obtained using the Kill-A-Watt device, making them relatively inflated compared to other McPAT-generated values.

Results relative to Quad Co… Accelerators are s…



# Possibility of Accelerator Composition – Use of Accelerator Building Blocks (ABBs)

| ABBs                         | Denoise      | Deblur       | Registration | Segmentation |
|------------------------------|--------------|--------------|--------------|--------------|
| Float Reciprocal (FInv)      | $\checkmark$ | $\checkmark$ |              | $\checkmark$ |
| Float Square-Root (FSqrt)    | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Float Polynomial-16 (Poly16) | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| Float Divide (FDiv)          | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |

[Figure: ABB block diagram; recoverable labels: counter, 2to4 Decoder, Id[3:0], ADD/SUB/MUL (ASM)]
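
The table suggests that a small set of ABB types covers all four kernels, which is what makes composition plausible. The sketch below models that idea in a deliberately simplified way. The pool sizes, the `compose` helper, and the one-instance-per-type requirement are all assumptions for illustration; a real design would need several instances per kernel plus interconnect and memory allocation.

```python
# ABB usage per kernel, transcribed from the check marks in the table above.
ABB_USAGE = {
    "Denoise": {"FInv", "FSqrt", "Poly16", "FDiv"},
    "Deblur": {"FInv", "FSqrt", "Poly16", "FDiv"},
    "Registration": {"FSqrt", "Poly16", "FDiv"},
    "Segmentation": {"FInv", "FSqrt", "Poly16", "FDiv"},
}


def compose(kernel, free_abbs):
    """Try to build a virtual accelerator for `kernel` from the free pool.

    `free_abbs` maps ABB type -> number of idle instances (hypothetical pool).
    Returns the set of claimed ABB types, or None if any type is unavailable.
    """
    needed = ABB_USAGE[kernel]
    if any(free_abbs.get(abb, 0) < 1 for abb in needed):
        return None
    for abb in needed:
        free_abbs[abb] -= 1
    return needed


# ABB types that every kernel uses are the best candidates for sharing.
shared_by_all = set.intersection(*ABB_USAGE.values())

pool = {"FInv": 1, "FSqrt": 2, "Poly16": 2, "FDiv": 2}
first = compose("Denoise", pool)        # claims one instance of each type
second = compose("Registration", pool)  # does not need FInv, so it still fits
third = compose("Deblur", pool)         # needs FInv, but none is left

print(sorted(shared_by_all), first is not None, second is not None, third)
```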







|              |             | GPU *<br>(NVIDIA Tesla M2075) | FPGA<br>(Xilinx V6) | Monolithic<br>Accelerators | Composable<br>Accelerators |
|--------------|-------------|-------------------------------|---------------------|----------------------------|----------------------------|
| Deblur       | Performance | 2.4X                          | 3.4X                | 7.8X                       | 21X                        |
|              | Energy      | 0.3X                          | 3.2X                | 32X                        | 55X                        |
| Denoise      | Performance | 16.6X                         | 1.6X                | 3.5X                       | 11X                        |
|              | Energy      | 1.4X                          | 1.2X                | 13X                        | 29X                        |
| Segmentation | Performance | 73X                           | 16X                 | 16X                        | 77X                        |
|              | Energy      | 6.1X                          | 3.6X                | 53X                        | 186X                       |
| Registration | Performance | 3.9X                          | 6.7X                | 15X                        | 58X                        |
|              | Energy      | 0.4X                          | 3.2X                | 60X                        | 144X                       |
| Average      | Performance | 24X                           | 6.9X                | 10X                        | 42X                        |
|              | Energy      | 2X                            | 2.8X                | 39.8X                      | 103X                       |
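
For readers who want to check the Average rows, the sketch below takes arithmetic means of the per-application entries for the two accelerator columns and computes the gain of composable over monolithic accelerators. The table does not state how its averages were formed, so the arithmetic mean is an assumption; it gives 41.75 and 103.5 for the composable column, close to the reported 42X and 103X.

```python
from statistics import mean

# Per-application gains from the table above: (performance, energy).
GAINS = {
    "Monolithic": {"Deblur": (7.8, 32), "Denoise": (3.5, 13),
                   "Segmentation": (16, 53), "Registration": (15, 60)},
    "Composable": {"Deblur": (21, 55), "Denoise": (11, 29),
                   "Segmentation": (77, 186), "Registration": (58, 144)},
}

# Arithmetic mean of each column (assumed averaging method).
averages = {
    design: (mean(p for p, _ in apps.values()), mean(e for _, e in apps.values()))
    for design, apps in GAINS.items()
}

# Gain of composable over monolithic accelerators, per application.
relative = {
    app: (GAINS["Composable"][app][0] / GAINS["Monolithic"][app][0],
          GAINS["Composable"][app][1] / GAINS["Monolithic"][app][1])
    for app in GAINS["Composable"]
}

for design, (perf, energy) in averages.items():
    print(f"{design}: mean performance {perf:.2f}X, mean energy {energy:.1f}X")
for app, (perf, energy) in relative.items():
    print(f"{app}: {perf:.1f}x faster, {energy:.1f}x less energy than monolithic")
```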



# New Research Opportunities for Accelerator-Rich Architectures

- Memory support
- Communication support
- Prototyping and validation
- Software support









- Use Xilinx AXI4 bus IP
- Memory sharing among accelerators vs private memories → huge area savings
- Our customized crossbar vs conventional bus → both area savings and performance improvement
  - Conventional bus performs arbitration at every memory access, and is optimized for general-purpose access patterns → extra logic and delay spent on arbitrators

|                               | Memory usage   | Interconnect cost in # of LUTs | Accelerator subtask runtime |
|-------------------------------|----------------|--------------------------------|-----------------------------|
| Private memories              | 3328 KB (177%) | 0                              | 10.3 µs                     |
| Shared memories via AXI buses | 768 KB (41%)   | 50043 (33%)                    | 117 µs                      |
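
The two rows that survive in the table already show the trade-off a conventional bus imposes. The sketch below simply computes the ratios from those figures; the variable names are illustrative.

```python
# Measurements copied from the table above.
private = {"memory_kb": 3328, "luts": 0, "runtime_us": 10.3}
shared_axi = {"memory_kb": 768, "luts": 50043, "runtime_us": 117.0}

# Sharing shrinks the memory footprint but slows each accelerator subtask.
memory_saving = private["memory_kb"] / shared_axi["memory_kb"]
slowdown = shared_axi["runtime_us"] / private["runtime_us"]

print(f"memory saved: {memory_saving:.2f}x, subtask slowdown: {slowdown:.1f}x")
```

Sharing over AXI cuts memory use by roughly 4.3x, but costs about 50K LUTs of interconnect and slows the subtask by roughly 11x.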





# Qualcomm Neural Processing Units (NPUs)

A new class of processors mimicking human perception and cognition (Oct. 2013)









- Press releases
  - Xilinx: "Xilinx Demonstrates Industry's First QPI 1.1 Interface with FPGAs at Intel Developer Forum"
  - Altera: "Altera is demonstrating its QPI 1.1 intellectual property (IP) solution to support both the Caching Agent and Home Agent in a Pactron Vigor Development Platform at the Intel Developers Forum (IDF) Beijing, April 10-11, in Altera's booth #E120." "The Altera Stratix V FPGA transceiver has been qualified to support the Intel QPI electrical specification at 8 Gbps." "Our QPI 1.1 solution provides developers of data centers and high-performance computing applications a platform to significantly increase their compute performance while reducing system cost and power," said David Gamba, director of the compute and storage product line at Altera.
  - "What Power8 and OpenPOWER Might Mean for HP…", Timothy Prickett Morgan, April 23, 2014: "IBM is making a big play in hybrid computing, seeking to marry its POWER8 processors …" "IBM is working with FPGA makers … running over the CAPI interface …"



























































# Initial Experimental Results

- Cluster setting
  - 4 CPU nodes (Xeon), each connected to one FPGA node (ML605)
- User application
  - Application I: Logistic Regression (LR), 2x FPGA speedup
  - Application II: Neural Network (NN) Training, 9x FPGA speedup

|               | LR first | NN first | LR, NN simul. |
|---------------|----------|----------|---------------|
| Local AM      | 6.14s    | 0.62s    | 1.23s         |
| Global AM     | 0.85s    | 0.62s    | 0.62s         |
| Speedup       | 7.22x    |          | 2x            |
| Energy saving | 10.2x    |          | 1.45x         |

- With local AM, the first application will occupy all the accelerator resources
- With global AM (resource revision), more acc/FPGA resources will be allocated to applications with higher acceleration potential (NN)
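
The difference between the two policies can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler used in the experiment: both applications are assumed to request all four FPGAs, the global policy is modeled as a greedy grant in order of expected speedup, and the function names are hypothetical. The last line recomputes the reported speedup for the "LR first" column from the two measured times.

```python
def allocate_global(fpgas, apps):
    """Hand FPGAs to applications in order of expected speedup.

    `apps` maps application name -> (requested FPGAs, expected speedup).
    This greedy policy is an assumption for illustration only.
    """
    grants = {name: 0 for name in apps}
    by_potential = sorted(apps, key=lambda name: apps[name][1], reverse=True)
    for name in by_potential:
        grants[name] = min(apps[name][0], fpgas)
        fpgas -= grants[name]
    return grants


def allocate_local(fpgas, arrival_order, apps):
    """First come, first served: early arrivals take whatever they request."""
    grants = {name: 0 for name in apps}
    for name in arrival_order:
        grants[name] = min(apps[name][0], fpgas)
        fpgas -= grants[name]
    return grants


# Speedups from the slide: LR 2x, NN 9x. Requesting all 4 FPGAs is assumed.
apps = {"LR": (4, 2.0), "NN": (4, 9.0)}

local = allocate_local(4, ["LR", "NN"], apps)   # LR arrives first
global_ = allocate_global(4, apps)

# Reported times for the "LR first" column: local AM vs global AM.
speedup = 6.14 / 0.85

print(local, global_, f"{speedup:.2f}x")
```

Under the local policy LR, which arrives first, holds every FPGA even though NN would benefit more. The greedy global policy reverses that, which is the extreme case of the reallocation the bullets above describe.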







- New era of computing
  - Future computing platforms will have a sea-of-accelerators
  - With efficient support for customization and specialization
- Accelerators at all levels
  - Chip-level
  - Server node level
  - Data center level
- Customizable and composable accelerators offer the right trade-off between flexibility and efficiency
- Software is the key
  - Programming models
    - OpenMP 4.0, OpenCL, Hadoop/MapReduce + C/C++, …
  - Compilation support
  - Runtime management





