# QCD TO THE EXASCALE

#### Kate Clark, August 2nd 2016





# CONTENTS

# Charting our way to the Exascale GPUs Block solvers - Locality Multigrid - Parallelism To the Exascale and Beyond





### CHARTING OUR WAY TO THE EXASCALE

# HPC's Biggest Challenge: Power



#### Power of 300 Petaflop CPU-only Supercomputer





Power for the city of San Francisco









## THE RISE OF LEAKAGE





### LEAKAGE IS KILLING VOLTAGE SCALING

#### The Good Old Days

Leakage not important, voltage scaled with feature size

> L' = L/2V' = V/2E' =  $CV^2$  = E/8f' = 2fD' =  $1/L^2$  = 4DP' = P

L/2 => 4x transistors => 8x capability => same power

#### The New Reality

Leakage limiting threshold voltage, voltage scaling dying

L' = 
$$L/2$$
  
V' =  $-V$   
E' =  $CV^2$  =  $E/2$   
f' =  $2f$   
D' =  $1/L2$  =  $4D$   
P' =  $4P$ 

L/2 => 4x transistors => 8x capability => 4x power, or...

2x capability, same power, ¼ area









### 64-bit DP 20 pJ







#### 64-bit DP 20 pJ

50 pJ

256-bit access 8 kB SRAM









256-bit access 8 kB SRAM

#### 20mm 28nm IC





### 28nm IC







- Need to expose massive concurrence
  - Exaflop at O(GHz) clocks  $\Rightarrow$  O(billion-way) parallelism!
- Need to expose and exploit locality
  - Data *motion* more expensive than computation  $\bullet$
  - > 100:1 global v. local energy  $\bullet$

### SOFTWARE IMPLICATIONS









# THE RISE OF THE GPU

#### • Pascal P100 (2016)

- 3584 FP32 processing elements
- 10.6 TFLOPS FP32 / 5.3 TFLOPS FP64 peak
- Deep memory hierarchy
- Programmed using a thread model
  - Architecture abstraction is CUDA
  - Fine-grained parallelism is required
- Diversity of programming languages
  - CUDA C / C++ / Fortran
  - C / C++ / Fortran + OpenACC / OpenMP
  - Python, Java, etc.

# WHAT IS A GPU?



### HETEROGENOUS NODE PHILOSOPHY **Optimize for Efficiency**





### **TESLA PASCAL P100**

#### Tesla P100 for NVLink-enabled Servers



#### Tesla P100 for PCIe-Based Servers



5.3 TF DP · 10.6 TF SP · 21 TF HP 720 GB/sec Memory Bandwidth 16 GB HBM2

4.7 TF DP · 9.3 TF SP · 18.7 TF HP Config 1: 16 GB (HBM2), 720 GB/sec Config 2: 12 GB (HBM2), 540 GB/sec



- Pascal includes NVLink interconnect technology
  - 4x NVLink connections per GPU
  - 160 GB/s bi-directional bandwidth per GPU
- NVIDIA DGX-1
  - 8x P100 GPUs
  - 2<sup>3</sup> NVLink topology
  - 4x EDR IB (via PCIe switches)
  - 2x Intel Xeon hosts
- 1 DGX-1 equivalent inter-GPU bandwidth to 64 nodes of Titan

### DGX-1



↔ NVLink ↔ PCle



### US to Build Two Flagship Supercomputers Powered by the Tesla Platform











### Major Step Forward on the Path to Exascale

100-300 PFLOPS Peak 10x in Scientific App Performance IBM POWER9 CPU + NVIDIA Volta GPU NVLink High Speed Interconnect 40 TFLOPS per Node, >3,400 Nodes 2017







Just 4 nodes in Summit would make the Top500 list of supercomputers today





#### Similar Power as Titan 5-10x Faster 1/5th the Size



#### 150 PF = 3M Laptops One laptop for Every Resident in State of Mississippi



# QUDA

- "QCD on CUDA" http://lattice.github.com/quda (open source, BSD license)
- Chroma, CPS, MILC, TIFR, tmLQCD, etc.
  - Latest release 0.8.0 (8th February 2016)
- Provides:
  - Various solvers for all major fermionic discretizations, with multi-GPU support Additional performance-critical routines needed for gauge-field generation

#### • Maximize performance

- Exploit physical symmetries to minimize memory traffic
- Mixed-precision methods
- Autotuning for high performance on all CUDA-capable architectures
- Domain-decomposed (Schwarz) preconditioners for strong scaling \_\_\_\_\_
- Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
- Multi-source solvers
- Multigrid solvers for optimal convergence
- A research tool for how to reach the exascale







# • Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD,





### **QUDA CONTRIBUTORS** (multigrid collaborators in green)

Ron Babich (NVIDIA) Michael Baldhauf (Regensburg) Kip Barros (LANL) Rich Brower (Boston University) Nuno Cardoso (NCSA) Kate Clark (NVIDIA) Michael Cheng (Boston University) Carleton DeTar (Utah University) Justin Foley (Utah -> NIH) Joel Giedt (Rensselaer Polytechnic Institute) Arjun Gambhir (William and Mary)

Steve Gottlieb (Indiana University) Dean Howarth (Rensselaer Polytechnic Institute) Bálint Joó (Jlab) Hyung-Jin Kim (BNL -> Samsung) Claudio Rebbi (Boston University) Guochun Shi (NCSA -> Google) Mario Schröck (INFN) Alexei Strelchenko (FNAL) Alejandro Vaquero (INFN) Mathias Wagner (NVIDIA) Frank Winter (Jlab)





|                                                                 | Q                                                          | UDA - I                                                                                                                 |                                                                                                              |  |  |  |
|-----------------------------------------------------------------|------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--|--|--|
|                                                                 |                                                            | <u>nttp:/</u>                                                                                                           | Alattice                                                                                                     |  |  |  |
| Lattice /                                                       | quda                                                       |                                                                                                                         |                                                                                                              |  |  |  |
| <> Code                                                         | Issues 107                                                 | 1) Pull requests 2                                                                                                      | 🗏 Wiki 🔸 Pi                                                                                                  |  |  |  |
| QUDA is a l                                                     | ibrary for performin                                       | ng calculations in latti                                                                                                | ce QCD on GPU                                                                                                |  |  |  |
| <b>7 4,621</b> commits                                          |                                                            |                                                                                                                         | 49 branches                                                                                                  |  |  |  |
| Ū                                                               | <b>4,621</b> commits                                       | <b>9 49</b> bra                                                                                                         | anches                                                                                                       |  |  |  |
| Branch: dev                                                     | 4,621 commits                                              | uest                                                                                                                    | anches                                                                                                       |  |  |  |
| Branch: deve                                                    | 4,621 commits elop  New pull requests swagner committed on | uest<br>GitHub Merge pull reque                                                                                         | est #487 from lattice                                                                                        |  |  |  |
| Branch: development                                             | 4,621 commits elop  New pull requests swagner committed on | uest<br>GitHub Merge pull reque                                                                                         | enches<br>est #487 from lattice<br>m, if staggered fer                                                       |  |  |  |
| Branch: devel<br>mathias                                        | 4,621 commits elop  New pull requ swagner committed on     | uest<br>GitHub Merge pull reque<br>In ColorSpinorPara<br>Correctly set volum                                            | enches<br>est #487 from lattice<br>m, if staggered fer<br>neCB for parity sub                                |  |  |  |
| Branch: deve<br>mathias                                         | 4,621 commits elop  New pull requ swagner committed on     | uest<br>GitHub Merge pull reque<br>In ColorSpinorPara<br>Correctly set volum<br>Requestingtest 1                        | est #487 from lattice<br>m, if staggered fer<br>neCB for parity sub<br>with staggered_d                      |  |  |  |
| Branch: deve<br>mathias<br>include<br>ib<br>itests<br>juitignor | 4,621 commits elop  New pull requ swagner committed on     | uest<br>GitHub Merge pull reque<br>In ColorSpinorPara<br>Correctly set volum<br>Requestingtest 1<br>Updates to .gitigno | est #487 from lattice<br>m, if staggered fer<br>heCB for parity sub<br>with staggered_d<br>ore and renamed m |  |  |  |

| CE QCD ON GPUS<br>e.github.com/quda +                                               |                                                                                              |                                                                                       |       |                          |                   |                                                                                |                               |  |
|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|-------|--------------------------|-------------------|--------------------------------------------------------------------------------|-------------------------------|--|
|                                                                                     |                                                                                              | ● Watch ▼                                                                             | 36    | ★ Unstar                 | 45                | <b>%</b> Fork                                                                  | 29                            |  |
| Pulse                                                                               | Graphs                                                                                       | Settings                                                                              |       |                          |                   |                                                                                |                               |  |
| Us. http://lattice.github.com/quda — Edit                                           |                                                                                              |                                                                                       |       |                          |                   |                                                                                |                               |  |
| 19 releases                                                                         |                                                                                              | <b>L</b> 16 contributors                                                              |       |                          |                   |                                                                                |                               |  |
|                                                                                     |                                                                                              |                                                                                       |       |                          |                   |                                                                                | _                             |  |
|                                                                                     | Create new                                                                                   | / file Upload                                                                         | files | Find file                | Clone             | or download                                                                    | d 🔻                           |  |
| ice/hotfix/che                                                                      | Create new                                                                                   | file Upload                                                                           | files | Find file<br>Latest comr | Clone<br>nit f3e2 | or download<br>2aa7 a day a                                                    | d <b>-</b>                    |  |
| ice/hotfix/cheo                                                                     | Create new<br>ckerboard-re                                                                   | file Upload<br>ference                                                                | files | Find file<br>Latest comm | Clone<br>nit f3e2 | or download<br>2aa7 a day a<br>11 days a                                       | d -<br>go                     |  |
| ice/hotfix/cheo<br>ermions then                                                     | Create new<br>ckerboard-re<br>set field din                                                  | file Upload<br>ference<br>nension t                                                   | files | Find file<br>Latest com  | Clone<br>nit f3e2 | or download<br>2aa7 a day a<br>11 days a<br>a day a                            | d -<br>go<br>go               |  |
| ice/hotfix/cheo<br>ermions then<br>ubset referer                                    | Create new<br>ckerboard-re<br>set field din<br>nces – need t                                 | file Upload<br>ference<br>nension t<br>to check p<br>dagM operato                     | r     | Find file                | Clone             | or download<br>2aa7 a day a<br>11 days a<br>a day a<br>11 days a               | d ✓<br>go<br>go               |  |
| ice/hotfix/cheo<br>ermions then<br>ubset referer<br>_dslash_test r<br>multigrid_bei | Create new<br>ckerboard-re<br>set field din<br>nces – need t<br>now tests Mo<br>nchmark to r | file Upload<br>ference •••<br>nension t<br>to check p<br>dagM operato<br>multigrid_be | r     | Find file                | Clone<br>nit f3e2 | or download<br>2aa7 a day a<br>11 days a<br>a day a<br>11 days a<br>3 months a | d <b>↓</b><br>900<br>90<br>90 |  |



# MAPPING THE DIRAC OPERATOR TO CUDA

Assign a single space-time point to each thread V = XYZT threads, e.g., V =  $24^4$  =>  $3.3 \times 10^6$  threads Looping over direction each thread must Load the neighboring spinor (24 numbers x8) Load the color matrix connecting the sites (18 numbers x8) Do the computation Save the result (24 numbers) Each thread has (Wilson Dslash) 0.92 naive arithmetic intensity QUDA reduces memory traffic Exact SU(3) matrix compression (18 => 12 or 8 real numbers) Use 16-bit fixed-point representation with mixed-precision solver





### WILSON-DSLASH PERFORMANCE K20X (2012), ECC on, V = 24<sup>3</sup>xT

GPU TECHNOLOGY CONFERENCE







### PASCAL

- Newly launched Pascal P100 GPU provides significant performance uplift through stacked HBM memory
- Wilson-clover dslash 3x speedup vs K40
  - double ~500 GFLOPS
  - single ~1000 GFLOPS
  - half ~2000 GFLOPS







#### P100





### LQCD WITH GPU GENERATION Single precision Wilson-dslash performance





### SOLVERS FOR MULTIPLE RIGHT HAND SIDES

### **CONJUGATE GRADIENT** just as a reminder

### procedure CG $r^{(0)} = b - Ar^{(0)}$ $p^{(0)} = r^{(0)}$ for $k = 1, 2, \ldots$ until converged **do** $z^{(k-1)} = Ap^{(k-1)}$ $\alpha^{(k-1)} = \frac{|r^{(k-1)}|^2}{\langle (p^{(k-1)}), z^{(k-1)} \rangle}$ $x^{(k)} = x(k-1) + \alpha^{(k-1)}p^{(k-1)}$ $r^{(k)} = r^{(k-1)} - \alpha^{(k-1)} z^{(k-1)}$ $\beta^{(k-1)} = \frac{|r^{(k-1)}|^2}{|r^{(k)}|^2}$ $p^{(k)} = r^{(k)} + \beta^{(k-1)} p^{(k-1)}$ end for end procedure



### QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7



### QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7

exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats



### QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7

exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats

### WILSON CLOVER DSLASH

Volume =  $32^4$ 





QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7

exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats

Smearing kills symmetry: stuck with 18 floats



QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7

exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats

Smearing kills symmetry: stuck with 18 floats

Reuse gauge field for multiple rhs


# QCD PERFORMANCE LIMITERS

QCD is **memory bandwidth bound** Dslash arithmetic intensity for HISQ ~ 0.7

exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats

Smearing kills symmetry: stuck with 18 floats

Reuse gauge field for multiple rhs





| 6     | 8 | 10 | 12 | 14 | 16 |
|-------|---|----|----|----|----|
| # rhs |   |    |    |    |    |















## **CONJUGATE GRADIENT** using multi-src Dslash

procedure CG with multi-src Dslash  $r_i = b_i - A x_i^{(0)}$  $p_i^{(0)} = r_i^{(0)}$ for  $k = 1, 2, \ldots$  until converged **do**  $\left\{z_i^{(k-1)}\right\} = A\left\{p^{(k-1)}\right\}$  $\alpha_i^{(k-1)} = |r_i^{(k-1)}|^2 / \langle (p_i^{(k-1)}), z_i^{(k-1)} \rangle$  $x_i^{(k)} = x_i^{(k-1)} + \alpha_i^{(k-1)} p_i^{(k-1)}$  $r_i^{(k)} = r_i^{(k-1)} - \alpha_i^{(k-1)} z_i^{(k-1)}$  $\beta_i^{(k-1)} = |r_i^{(k-1)}|^2 / |r_i^{(k)}|^2$  $p_i^{(k)} = r_i^{(k)} + \beta_i^{(k-1)} p_i^{(k-1)}$ end for

end procedure

### exploit multi-src Dslash performance

### do all the linear algebra for each rhs

### same iteration count as CG



## **MULTI-SRC CG ON PASCAL** Volume 24<sup>4</sup>, HISQ



### ➡ P100 single



## **BLOCK KRYLOV SOLVERS** Share the Krylov space

BlockCG solver suggested by O'Leary in early 80's retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations



## **BLOCK KRYLOV SOLVERS** Share the Krylov space

BlockCG solver suggested by O'Leary in early 80's retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations

### **Application in QCD:**

Nakamura et. (modified block BiCGStab) Birk and Frommer (block methods, including block methods for multi shift)





## procedure BLOCKCG $R^{(0)} = B - AX^{(0)}$ $P^{(0)} = R^{(0)}$ for $k = 1, 2, \ldots$ until converged do $Z^{(k-1)} = AP^{(k-1)}$ $\alpha^{(k-1)} = \left[ (P^{(k-1)})^H Z^{(k-1)} \right]^{-1} (R^{(k-1)})^H R^{(k-1)}$ $X^{(k)} = X^{(k-1)} + P^{(k-1)}\alpha^{(k-1)}$ $R^{(k)} = R^{(k-1)} - Z^{(k-1)} \alpha^{(k-1)}$ $\beta^{(k-1)} = \left[ (R^{(k-1)})^H R^{(k-1)} \right]^{-1} (R^{(k)})^H R^{(k)}$ $P^{(k)} = R^{(k)} - P^{(k-1)}\beta^{(k-1)}$ end for

end procedure



## procedure BLOCKCG $R^{(0)} = B - AX^{(0)}$ $P^{(0)} = R^{(0)}$ for $k = 1, 2, \ldots$ until converged do $Z^{(k-1)} = AP^{(k-1)}$ $\alpha^{(k-1)} = \left[ (P^{(k-1)})^H Z^{(k-1)} \right]^{-1} (R^{(k-1)})^H R^{(k-1)}$ $X^{(k)} = X^{(k-1)} + P^{(k-1)}\alpha^{(k-1)}$ $R^{(k)} = R^{(k-1)} - Z^{(k-1)} \alpha^{(k-1)}$ $\beta^{(k-1)} = \left[ (R^{(k-1)})^H R^{(k-1)} \right]^{-1} (R^{(k)})^H R^{(k)}$ $P^{(k)} = R^{(k)} - P^{(k-1)}\beta^{(k-1)}$ end for



end procedure





end for

## procedure BLOCKCG $R^{(0)} = B - AX^{(0)}$ $P^{(0)} = R^{(0)}$ for $k = 1, 2, \ldots$ until converged do $Z^{(k-1)} = AP^{(k-1)}$ $\alpha^{(k-1)} = \left[ (P^{(k-1)})^H Z^{(k-1)} \right]^{-1} (R^{(k-1)})^H R^{(k-1)}$ $X^{(k)} = X^{(k-1)} + P^{(k-1)}\alpha^{(k-1)}$ $R^{(k)} = R^{(k-1)} - Z^{(k-1)} \alpha^{(k-1)}$ $\beta^{(k-1)} = \left[ (R^{(k-1)})^H R^{(k-1)} \right]^{-1} (R^{(k)})^H R^{(k)}$ $P^{(k)} = R^{(k)} - P^{(k-1)}\beta^{(k-1)}$

end procedure





## procedure BLOCKCG $R^{(0)} = B - AX^{(0)}$ $P^{(0)} = R^{(0)}$ for $k = 1, 2, \ldots$ until converged do $Z^{(k-1)} = AP^{(k-1)}$ $\alpha^{(k-1)} = \left[ (P^{(k-1)})^H Z^{(k-1)} \right]^{-1} (R^{(k-1)})^H R^{(k-1)}$ $X^{(k)} = X^{(k-1)} + P^{(k-1)}\alpha^{(k-1)}$ $R^{(k)} = R^{(k-1)} - Z^{(k-1)} \alpha^{(k-1)}$ $\beta^{(k-1)} = \left[ (R^{(k-1)})^H R^{(k-1)} \right]^{-1} (R^{(k)})^H R^{(k)}$ $P^{(k)} = R^{(k)} - P^{(k-1)}\beta^{(k-1)}$ end for

end procedure





| 4                  | — 8  | — 12 |      | 16   |      |
|--------------------|------|------|------|------|------|
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
|                    |      |      |      |      |      |
| 2000<br>iterations | 2500 | 3000 | 3500 | 4000 | 4500 |



























## WHY DOESN'T EVERYONE USE IT? BlockCG is not always numerically stable

- Remedy with orthogonalization: Gram-Schmidt or modified Gram-Schmidt
  - Quadratic scaling with # rhs
  - Becomes prohibitive





## WHY DOESN'T EVERYONE USE IT? BlockCG is not always numerically stable

- Remedy with orthogonalization: Gram-Schmidt or modified Gram-Schmidt
  - Quadratic scaling with # rhs
  - Becomes prohibitive





## WHY DOESN'T EVERYONE USE IT? BlockCG is not always numerically stable

- Remedy with orthogonalization: Gram-Schmidt or modified Gram-Schmidt
  - Quadratic scaling with # rhs
  - Becomes prohibitive







$$y_i = \sum a_{ij} x_j + y_i$$

CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

| SM   |             |            |            |          |            |         |            |          |         |            |            |          |            |
|------|-------------|------------|------------|----------|------------|---------|------------|----------|---------|------------|------------|----------|------------|
|      |             |            |            |          |            |         | Instructio | on Cache |         |            |            |          |            |
|      |             | 1          | nstructio  | on Buffe | r.         |         |            |          |         | Ì          | nstructio  | on Buffe | r          |
|      |             |            | Warp So    | cheduler |            |         |            |          |         |            | Warp So    | heduler  |            |
|      | Dispate     | ch Unit    |            |          | Dispat     | ch Unit |            |          | Dispate | ch Unit    |            | -        | Dispa      |
|      |             | Regist     | er File (3 | 32,768 x | 32-bit)    |         |            |          |         | Regist     | er File (3 | 32,768 x | 32-bit)    |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core        | DP<br>Unit | Core       | Core     | DP<br>Unit | LD/ST   | SFU        | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
|      |             |            |            |          |            |         | Texture /  | L1 Cache |         |            |            |          |            |
|      | Tex Tex Tex |            |            |          |            |         |            |          |         |            |            |          |            |
|      |             |            |            |          |            | 6       | 4KB Shar   | ed Memo  | ry      |            |            |          |            |

**Note:** Kepler GK110 had a 3:1 ratio of SP units to DP units.

| atch Unit |     |
|-----------|-----|
| LD/ST     | SFU |
|           |     |
| Tex       |     |
|           |     |



$$y_i = \sum a_{ij} x_j + y_i$$

CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

| SM   |                    |              |            |          |            |           |           |          |            |            |            |            |            |  |
|------|--------------------|--------------|------------|----------|------------|-----------|-----------|----------|------------|------------|------------|------------|------------|--|
|      |                    |              |            |          |            |           | Instructi | on Cache |            |            |            |            |            |  |
|      |                    | 1            | nstructio  | on Buffe | IF:        |           |           |          |            | 1          | nstructio  | on Buffe   | Г          |  |
| Ļ    |                    |              | Warp Sc    | heduler  |            |           |           |          |            |            | Warp So    | cheduler   | duler      |  |
|      | Dispate            | :h Unit<br>L |            |          | Dispat     | ch Unit   |           |          | Dispate    | ch Unit    |            |            | Disp       |  |
|      |                    | Regist       | er File (3 | 32,768 x | 32-bit)    |           |           |          |            | Regist     | er File (3 | 32,768 x   | 32-bit     |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | SFU       | Core      | Core     | DP<br>Unit | Core       | Core       | DP<br>Unit |            |  |
|      |                    |              |            |          |            | Texture / | L1 Cache  |          |            |            |            |            |            |  |
|      | Te                 | x            |            |          |            | Tex       |           |          |            |            |            |            |            |  |
|      | 64KB Shared Memory |              |            |          |            |           |           |          |            |            |            |            |            |  |



| atch Unit |     |
|-----------|-----|
| LD/ST     | SFU |
|           |     |
| Tex       |     |
|           |     |



$$y_i = \sum a_{ij} x_j + y_i$$

## CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

$$y_0(0) = a_{00}x_0(0) + a_{01}x_1(0) + \dots$$
  

$$y_1(0) = a_{10}x_0(0) + a_{11}x_1(0) + \dots$$
  

$$y_2(0) = a_{20}x_0(0) + a_{21}x_1(0) + \dots$$
  

$$y_3(0) = a_{30}x_0(0) + a_{31}x_1(0) + \dots$$

| SM   |                    |              |            |          |            |           |           |          |            |            |            |            |            |  |
|------|--------------------|--------------|------------|----------|------------|-----------|-----------|----------|------------|------------|------------|------------|------------|--|
|      |                    |              |            |          |            |           | Instructi | on Cache |            |            |            |            |            |  |
|      |                    | 1            | nstructio  | on Buffe | IF:        |           |           |          |            | 1          | nstructio  | on Buffe   | Г          |  |
| Ļ    |                    |              | Warp Sc    | heduler  |            |           |           |          |            |            | Warp So    | cheduler   | duler      |  |
|      | Dispate            | :h Unit<br>L |            |          | Dispat     | ch Unit   |           |          | Dispate    | ch Unit    |            |            | Disp       |  |
|      |                    | Regist       | er File (3 | 32,768 x | 32-bit)    |           |           |          |            | Regist     | er File (3 | 32,768 x   | 32-bit     |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST     | SFU       | Core     | Core       | DP<br>Unit | Core       | Core       | DP<br>Unit |  |
| Core | Core               | DP<br>Unit   | Core       | Core     | DP<br>Unit | SFU       | Core      | Core     | DP<br>Unit | Core       | Core       | DP<br>Unit |            |  |
|      |                    |              |            |          |            | Texture / | L1 Cache  |          |            |            |            |            |            |  |
|      | Te                 | x            |            |          |            | Tex       |           |          |            |            |            |            |            |  |
|      | 64KB Shared Memory |              |            |          |            |           |           |          |            |            |            |            |            |  |



| atch Unit |     |
|-----------|-----|
| LD/ST     | SFU |
|           |     |
| Tex       |     |
|           |     |



$$y_i = \sum a_{ij} x_j + y_i$$

## CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

$$y_0(0) = a_{00}x_0(0) + a_{01}x_1(0) + \dots$$
  

$$y_1(0) = a_{10}x_0(0) + a_{11}x_1(0) + \dots$$
  

$$y_2(0) = a_{20}x_0(0) + a_{21}x_1(0) + \dots$$
  

$$y_3(0) = a_{30}x_0(0) + a_{31}x_1(0) + \dots$$

| SM   |         |              |            |          |            |       |           |          |         |            |            |          |            |
|------|---------|--------------|------------|----------|------------|-------|-----------|----------|---------|------------|------------|----------|------------|
|      |         |              |            | D //     |            |       | Instructi | on Cache |         |            |            | D //     |            |
|      | _       |              | Warn Sc    | on Buffe | IF.        | _     | _         |          | _       |            | Warp Sc    | on Buffe | r.         |
|      | Dispate | :h Unit<br>L | waip ac    |          |            |       | Dispate   | ch Unit  | Waip St |            | Dispa      |          |            |
|      |         | Regist       | er File (3 | 32,768 x | 32-bit)    |       |           |          |         | Regist     | er File (3 | 32,768 x | 32-bit)    |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
|      |         |              |            |          |            |       | Texture / | L1 Cache |         |            |            |          |            |
|      | Те      | x            |            |          | т          | ex    |           |          | Т       | ex         |            |          |            |
|      |         |              |            |          |            | 6     | 4KB Shai  | red Memo | ry      |            |            |          |            |



| atch Unit |     |
|-----------|-----|
| LD/ST     | SFU |
|           |     |
| Tex       |     |
|           |     |



$$y_i = \sum a_{ij} x_j + y_i$$

CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

$$y_{0}(0) = a_{00} x_{0}(0) + a_{01} x_{1}(0) + \dots$$
  

$$y_{1}(0) = a_{10} x_{0}(0) + a_{11} x_{1}(0) + \dots$$
  

$$y_{2}(0) = a_{20} x_{0}(0) + a_{21} x_{1}(0) + \dots$$
  

$$y_{3}(0) = a_{30} x_{0}(0) + a_{31} x_{1}(0) + \dots$$

### cache reuse

| SM   |         |              |            |          |            |       |           |          |         |            |            |          |            |
|------|---------|--------------|------------|----------|------------|-------|-----------|----------|---------|------------|------------|----------|------------|
|      |         |              |            | D //     |            |       | Instructi | on Cache |         |            |            | D //     |            |
|      | _       |              | Warn Sc    | on Buffe | IF.        | _     | _         |          | _       |            | Warp Sc    | on Buffe | r.         |
|      | Dispate | :h Unit<br>L | waip ac    |          |            |       | Dispate   | ch Unit  | Waip St |            | Dispa      |          |            |
|      |         | Regist       | er File (3 | 32,768 x | 32-bit)    |       |           |          |         | Regist     | er File (3 | 32,768 x | 32-bit)    |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
| Core | Core    | DP<br>Unit   | Core       | Core     | DP<br>Unit | LD/ST | SFU       | Core     | Core    | DP<br>Unit | Core       | Core     | DP<br>Unit |
|      |         |              |            |          |            |       | Texture / | L1 Cache |         |            |            |          |            |
|      | Те      | x            |            |          | т          | ex    |           |          | Т       | ex         |            |          |            |
|      |         |              |            |          |            | 6     | 4KB Shai  | red Memo | ry      |            |            |          |            |



| atch Unit |     |
|-----------|-----|
| LD/ST     | SFU |
|           |     |
| Tex       |     |
|           |     |

## **EXPLOIT GPU ARCHITECTURE** to overcome quadratic scaling



$$y_i = \sum a_{ij} x_j + y_i$$

CUDA supports two dimensional grid blocks: easy to exploit locality for texture cache / shared memory

$$y_{0}(0) = a_{00} x_{0}(0) + a_{01} x_{1}(0) + \dots$$
  

$$y_{1}(0) = a_{10} x_{0}(0) + a_{11} x_{1}(0) + \dots$$
  

$$y_{2}(0) = a_{20} x_{0}(0) + a_{21} x_{1}(0) + \dots$$
  

$$y_{3}(0) = a_{30} x_{0}(0) + a_{31} x_{1}(0) + \dots$$

### cache reuse

|      | Instruction Buffer<br>Warp Scheduler |            |            |          |               |       |           |          | Instruction Buffer<br>Warp Scheduler |            |            |          |            |  |
|------|--------------------------------------|------------|------------|----------|---------------|-------|-----------|----------|--------------------------------------|------------|------------|----------|------------|--|
|      |                                      |            |            |          |               |       |           |          |                                      |            |            |          |            |  |
|      | Dispatch Unit                        |            |            |          | Dispatch Unit |       |           |          | Dispatch Unit                        |            |            | Dispa    |            |  |
|      |                                      | Regist     | er File (: | 32,768 x | 32-bit)       |       |           |          |                                      | Regist     | er File (: | 32,768 x | 32-bit)    |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
| Core | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit    | LD/ST | SFU       | Core     | Core                                 | DP<br>Unit | Core       | Core     | DP<br>Unit |  |
|      |                                      |            |            |          |               |       | Texture / | L1 Cache | 1                                    |            |            |          |            |  |
|      | Tex                                  |            |            |          | Tex           |       |           |          | Tex                                  |            |            |          |            |  |
|      | 64KB Shar                            |            |            |          |               |       |           |          | red Memory                           |            |            |          |            |  |



| atch Unit |       |     |  |  |  |  |
|-----------|-------|-----|--|--|--|--|
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
|           | LD/ST | SFU |  |  |  |  |
| Tex       |       |     |  |  |  |  |
|           |       |     |  |  |  |  |





### # rhs

## CholQR

 $B = R^H R$ Gram-Matrix: Cholesky Decomposition  $S^H S = B$  of  $m \times m$  matrix  $Q = RS^{-1}$ apply to vectors axpy  $m \times m$  (output, input)

- Linear algebra and orthogonalization stay constant
- Large benefits from multi-src Dslash but relative importance of Dslash reduces

 $m \times m$  dot products of length n

Cost [A.U.]







## CholQR

 $B = R^H R$ Gram-Matrix: Cholesky Decomposition  $S^H S = B$  of  $m \times m$  matrix  $Q = RS^{-1}$ apply to vectors axpy  $m \times m$  (output, input)

- Linear algebra and orthogonalization stay constant
- Large benefits from multi-src Dslash but relative importance of Dslash reduces

 $m \times m$  dot products of length n

Cost [A.U.]





## CholQR

 $B = R^H R$ Gram-Matrix: Cholesky Decomposition  $S^H S = B$  of  $m \times m$  matrix  $Q = RS^{-1}$ apply to vectors axpy  $m \times m$  (output, input)

- Linear algebra and orthogonalization stay constant
- Large benefits from multi-src Dslash but relative importance of Dslash reduces

 $m \times m$  dot products of length n

Cost [A.U.]



Vector operation



## CholQR

 $B = R^H R$ Gram-Matrix: Cholesky Decomposition  $S^H S = B$  of  $m \times m$  matrix  $Q = RS^{-1}$ apply to vectors axpy  $m \times m$  (output, input)

- Linear algebra and orthogonalization stay constant
- Large benefits from multi-src Dslash but relative importance of Dslash reduces

 $m \times m$  dot products of length n

Cost [A.U.]



Vector operation



# WORK TO BE DONE



stability needs real world testing orthogonalization might be necessary

iteration count improvement may depend on gauge field and sources

need to finish up implementation

add mixed precision



# **SPEEDUP OVER CG**



<sup>#</sup> rhs




Reuse gauge field for Dslash





## Reuse gauge field for Dslash

## Reduced iteration count





## Reuse gauge field for Dslash

## Reduced iteration count





Reuse gauge field for Dslash

Reduced iteration count

Avoid quadratic scaling in LA

Speedups up to 10x

Scalable algorithm for future architectures





## WHY MULTIGRID?



Babich et al 2010



|                                  | - |
|----------------------------------|---|
|                                  |   |
|                                  | _ |
|                                  |   |
|                                  |   |
|                                  | - |
|                                  |   |
|                                  | _ |
|                                  |   |
|                                  |   |
|                                  | - |
|                                  |   |
|                                  | _ |
|                                  |   |
|                                  |   |
|                                  | - |
|                                  |   |
|                                  | _ |
|                                  | 1 |
| odes)                            |   |
| ark                              |   |
| ark<br><b>le</b><br>Cohen<br>orn |   |
| ark<br><b>le</b><br>Cohen<br>orn |   |
| ark<br><b>le</b><br>Cohen<br>orn |   |
| ark<br>le<br>Cohen<br>orn        |   |

## ADAPTIVE GEOMETRIC MULTIGRID

Adaptively find candidate null-space vectors

Dynamically learn the null space and use this to define the prolongator

Algorithm is self learning

Setup

- 1. Set solver to be simple smoother
- 2. Apply current solver to random vector  $v_i = P(D) \eta_i$
- 3. If convergence good enough, solver setup complete
- 4. Construct prolongator using fixed coarsening  $(1 P R) v_k = 0$  $\rightarrow$  Typically use 4<sup>4</sup> geometric blocks

 $\blacksquare$  Preserve chirality when coarsening R =  $\gamma_5 P^{\dagger} \gamma_5 = P^{\dagger}$ 

- 5. Construct coarse operator ( $D_c = R D P$ )
- 6. Recurse on coarse problem
- 7. Set solver to be augmented V-cycle, goto 2

Babich *et al* 2010



Falgout

see also Inexact Deflation (Lüscher, 2007) \_ocal coherence = weak approximation theory

43





# THE CHALLENGE OF MULTIGRID ON GPU



GPU requirements very different from CPU Each thread is slow, but O(10,000) threads per GPU Fine grids run very efficiently High parallel throughput problem Coarse grids are worst possible scenario More cores than degrees of freedom Increasingly serial and latency bound Little's law (bytes = bandwidth \* latency) Amdahl's law limiter

Multigrid exposes many of the problems expected at the Exascale





## INGREDIENTS FOR PARALLEL ADAPTIVE MULTIGRID

- Multigrid setup
  - Block orthogonalization of null space vectors \_\_\_\_
  - Batched QR decomposition
- Smoothing (relaxation on a given grid)
  - Repurpose existing solvers —
- Prolongation
  - interpolation from coarse grid to fine grid
  - one-to-many mapping
- Restriction
  - restriction from fine grid to coarse grid
  - many-to-one mapping
- Coarse Operator construction (setup)
  - Evaluate *R A P* locally
  - Batched (small) dense matrix multiplication —
- Coarse grid solver
  - Need optimal coarse-grid operator



 $x - \hat{v}$ 









## **COARSE GRID OPERATOR**

 Coarse operator looks like a Dirac operator (many more colors) - Link matrices have dimension 2N, x 2N, (e.g., 48 x 48)

$$\hat{D}_{\mathbf{i}\hat{s}\hat{c},\mathbf{j}\hat{s}'\hat{c}'} = -\sum_{\mu} \left[ Y_{\mathbf{i}\hat{s}\hat{c},\mathbf{j}\hat{s}'\hat{c}'}^{-\mu} \delta_{\mathbf{i}+\mu,\mathbf{j}} + \right]$$

- Fine vs. Coarse grid parallelization
  - Fine grid operator has plenty of grid-level parallelism
    - E.g., 16x16x16x16 = 65536 lattice sites
  - Coarse grid operator has diminishing grid-level parallelism
    - first coarse grid 4x4x4x4= 256 lattice sites
    - second coarse grid 2x2x2x2 = 16 lattice sites

## Current GPUs have up to 3840 processing elements

- Need to consider finer-grained parallelization - Increase parallelism to use all GPU resources
  - Load balancing



 $Y_{\mathbf{i}\hat{s}\hat{c},\mathbf{j}\hat{s}'\hat{c}'}^{+\mu\dagger}\delta_{\mathbf{i}-\mu,\mathbf{j}}\Big| + (M - X_{\mathbf{i}\hat{s}\hat{c},\mathbf{j}\hat{s}'\hat{c}'})\,\delta_{\mathbf{i}\hat{s}\hat{c},\mathbf{j}\hat{s}'\hat{c}'}.$ 







### 1. Grid parallelism Volume of threads



$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} \Rightarrow \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$$

## SOURCE OF PARALLELISM







4. Dot-product partitioning 4-way thread parallelism + ILP

## **COARSE GRID OPERATOR PERFORMANCE** Tesla K20X (Titan), FP32, N<sub>vec</sub> = 24



### 24,576-way parallel

## 16-way parallel





## **COARSE GRID OPERATOR PERFORMANCE** 8-core Haswell 2.4 GHz (solid line) vs M6000 (dashed lined), FP32

100





- Larger grids favor less fine grained
- Coarse grids favor most fine grained
- GPU is nearly always faster than CPU
- Expect in future that coarse grids will favor CPUs
- For now, use GPU exclusively









### **COARSE GRID OPERATOR PERFORMANCE** ◆ Kepler (FP32) ◆ Maxwell (FP32) ◆ Pascal (FP32) 600





# IMPROVING STRONG SCALING



| s  |           | 16497 | .913 μs | 5    | 1630.3 ms | 1630.35 ms                                                                                                                                              | 1630.4 ms                | 1630.45 ms                                                         | 1630.5 m |
|----|-----------|-------|---------|------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------------------------------------------|----------|
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          | cudaMemcpy                                                         |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      | void quda | ::coarseDslashKernel <float< th=""><th>, quda::colorspinor::Fie</th><th>eldOrderCB<float, i<="" int="2," th=""><th>nt=24, v</th></float,></th></float<> | , quda::colorspinor::Fie | eldOrderCB <float, i<="" int="2," th=""><th>nt=24, v</th></float,> | nt=24, v |
|    |           |       |         |      | void quda | a::coarseDslashKernel <float< th=""><th>, quda::colorspinor::Fie</th><th>eldOrderCB<float, i<="" int="2," th=""><th>nt=24,</th></float,></th></float<>  | , quda::colorspinor::Fie | eldOrderCB <float, i<="" int="2," th=""><th>nt=24,</th></float,>   | nt=24,   |
|    | 0         | m     | or      | ncr  | we to     |                                                                                                                                                         |                          |                                                                    |          |
| u  |           |       | CI      | iich | ys u      |                                                                                                                                                         |                          |                                                                    | M        |
| r  |           |       | 0       | lato | nov       |                                                                                                                                                         |                          |                                                                    |          |
| 10 | <b>-U</b> | uc    | C       | ialt | TICY      |                                                                                                                                                         |                          |                                                                    |          |
|    |           |       |         |      |           |                                                                                                                                                         |                          |                                                                    |          |

|         | 1603 <mark>321.896 µs</mark> | 1603.3 ms                                                                                                           | 1603.35 ms               | 1603.4 ms            | 1603.4 |
|---------|------------------------------|---------------------------------------------------------------------------------------------------------------------|--------------------------|----------------------|--------|
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     | cudaMemcpy               |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
| void qu | uda::coarseDslashKer         | rnel <float, quda::co<="" td=""><td>lorspinor::FieldOrderCB&lt;</td><td>float, int=2, int=24</td><td></td></float,> | lorspinor::FieldOrderCB< | float, int=2, int=24 |        |
| void qı | uda::coarseDslashKer         | rnel <float, quda::co<="" td=""><td>lorspinor::FieldOrderCB&lt;</td><td>float, int=2, int=24</td><td></td></float,> | lorspinor::FieldOrderCB< | float, int=2, int=24 |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |
|         |                              |                                                                                                                     |                          |                      |        |







- Compare MG against the state-of-the-art traditional Krylov solver on GPU
  - BiCGstab in double/half precision with reliable updates
  - 12/8 reconstruct
  - Symmetric red-black preconditioning
- Adaptive Multigrid algorithm
  - GCR outer solver wraps 3-level MG preconditioner
  - GCR restarts done in double, everything else in single
  - 24 null-space vectors on fine grid
  - Minimum Residual smoother
  - Symmetric red-black preconditioning on each level
  - GCR coarse-grid solver

$$\mathbf{g} \ \hat{M}_{ee} = 1_{ee} - \kappa^2 A_{ee}^{-1} D_{eo} A_{oo}^{-1} D_{oe}$$





12



## MULTIGRID VERSUS BICGSTAB Wilson, $V = 24^3x64$ , single workstation (3x M6000)

### 

-0.410 -0.405 -0.40 Mass parameter













| _      | ltera    | tions  | GFLOPs   |        |  |
|--------|----------|--------|----------|--------|--|
| mass   | BiCGstab | GCR-MG | BiCGstab | GCR-MG |  |
| -0.400 | 251      | 15     | 980      | 376    |  |
| -0.405 | 372      | 16     | 980      | 372    |  |
| -0.410 | 510      | 17     | 980      | 353    |  |
| -0.415 | 866      | 18     | 980      | 314    |  |
| -0.420 | 3103     | 19     | 980      | 293    |  |







Number of Nodes

## MULTIGRID VERSUS BICGSTAB Wilson-clover, Strong scaling on Titan (K20X)









57

# MULTIGRID TIMING BREAKDOWN



Number of Nodes

## ERROR REDUCTION AND VARIANCE $V = 40^3 x 256$ , $m_{\pi} = 230 \text{ MeV}$







- Absolute Performance tuning, e.g., half precision on coarse grids
- Strong scaling improvements:
  - Combine with Schwarz preconditioner
  - Accelerate coarse grid solver: CA-GMRES instead of GCR, deflation
  - More flexible coarse grid distribution, e.g., redundant nodes
- Investigate off load of coarse grids to the CPU
  - Use CPU and GPU simultaneously using additive MG
- Full off load of setup phase to GPU required for HMC

## MULTIGRID FUTURE WORK





## HIERARCHICAL ALGORITHMS ON HETEROGENEOUS ARCHITECTURES





## **MULTI-SRC SOLVERS**

- Coarse Dslash / Prolongator / Restrictor
- Multi-src solvers increase locality through link-field reuse • Multi-grid operators even more so since link matrices are 48x48
- Coarsest grids also latency limited
  - Kernel level latency
  - Network latency
- Multi-src solvers are a solution
  - More parallelism
  - Bigger messages



TO THE EXASCALE AND BEYOND

## SOFTWARE AND ALGORITHMS

- Algorithms continue to innovate rapidly inside and outside of LQCD
  - E.g., Krylov solvers
    - Communication avoiding solvers (Demmel et al)
    - Cooperative Krylov methods (Bhaya et al)
    - Enlarged Krylov space methods (Grigori et al)
- Software can be the problem
  - Hierarchical / overlapping grids break most LQCD frameworks
  - Used to calling solvers in serial fashion
  - Precision often baked in



## FINE-GRAINED PARALLELISM AND THE **IMPLICATIONS FOR DLLS**

• Traditional DSL approach is to abstract the grid parallelism

```
Matrix u;
Vector x, y;
y = u * x;
```

- CUDA / C++ meta template magic, etc.
- This abstraction breaks with multigrid
  - Not enough grid parallelism
- Platform and algorithmic independent conjecture

# a requirement at the Exascale (and beyond)"

• Compiler / front end will then transform this expression into a data parallel operation using OpenMP /

"Fine-grained parallelization will becoming increasingly





# PRECISION

- How much precision is really required?
  - 16-bit solvers with high precision reliable updates (arXiv:0911.3191)
  - Truncate trailing mantissa bits in halo exchange (P. Boyle, arXiv:1402.2585)
  - Use fp16 for coarse grid solve (Heybrock *et al*, arXiv:1512.04506)
- Large HMC runs with tolerance ~10<sup>-13</sup>
  - What happens when the volumes get bigger?
- Precision optimization an important part of the puzzle





## COMMUNICATION

### Chroma 48<sup>3</sup>x512 lattice Relative Scaling (Application Time)



"XK7" node = XK7 (1x K20X + 1x Interlagos) "XE6" node = XE6 (2x Interlagos)



## TO THE EXASCALE AND BEYOND

- (At least) four challenges to overcome
  - Parallelism
  - Locality
  - Communication
  - Latency
- •What's the answer to the above?

## Algorithms

\*and the ability to express those algorithms





## HMC AND STREAM PARALLELISM

- Network bandwidth increasing becoming a limiting factor
- Starting to see hierarchy of network bandwidths
  - CORAL: fat nodes with thin network
- HMC requires strong scaling
  - HMC has a flat communications profile
  - Limited by slowest connection
- Split determinants and impose task parallelism
  - Each fat node computes one determinant contribution
  - Eliminates slow connection from fermion solver

$$\int dU e^{-S_g(U)} \det(\mathcal{M}) = \int dU e^{-S_g(U)} \prod_{i=1}^n \det(\mathcal{M}^{1/n})$$



