GPU-acceleration of the ELPA2 distributed eigensolver for dense symmetric and hermitian eigenproblems☆
Introduction
Finding the eigenvalues and eigenvectors of large dense matrices is a frequent problem in computational science and engineering. For example, in Kohn–Sham density-functional theory (KS-DFT) [1], [2], the many-electron problem for the Born–Oppenheimer electronic ground state is reduced to a system of single-particle equations that can be discretized into a generalized eigenproblem of the matrix form
$\mathbf{H}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\Sigma}$. (1)
Here the Hamiltonian matrix $\mathbf{H}$ and the overlap matrix $\mathbf{S}$ are real symmetric or complex Hermitian, and $\mathbf{S}$ is positive definite. The matrix $\mathbf{C}$ and the diagonal matrix $\boldsymbol{\Sigma}$ contain the eigenvectors and eigenvalues, respectively, of this eigensystem. In the framework of KS-DFT, $\mathbf{C}$ and $\boldsymbol{\Sigma}$ (or the information they carry, at least) are needed for the construction of $\mathbf{H}$. Therefore, Eq. (1) is a non-linear problem and must be solved self-consistently. It is possible, and has already been implemented in various codes [3], [4], [5], [6], to restrict the computational cost of the construction of $\mathbf{H}$ to scale linearly with respect to the system size for any semi-local and hybrid exchange–correlation functional. In contrast, the solution of a dense eigenproblem (“diagonalization”) scales as $O(N^3)$ with the matrix dimension $N$, quickly growing to become prohibitive as $N$ increases to large values.
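As a concrete illustration (not part of the original text), the following minimal Python/SciPy sketch builds a small random eigenproblem of the form of Eq. (1) and solves it with a dense generalized eigensolver; the matrix size and the way $\mathbf{H}$ and $\mathbf{S}$ are generated are arbitrary assumptions made only for the example.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 1000
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)          # real symmetric "Hamiltonian"
S = rng.standard_normal((n, n)); S = S @ S.T + n * np.eye(n)   # symmetric positive-definite "overlap"

# Dense generalized eigensolver: H C = S C Sigma
w, C = eigh(H, S)

# Verify Eq. (1) column by column: H c_i = w_i S c_i
print(np.allclose(H @ C, S @ C * w, atol=1e-8))
```

The cost of this dense solve grows as $O(n^3)$ with the matrix dimension, which is the scaling bottleneck discussed above.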
Today, simulations in materials science and computational chemistry make up a large fraction of supercomputer time used in production (see, e.g., the workload analysis by the National Energy Research Scientific Computing Center [7], [8]). In particular, DFT simulations running on high-performance computers are facilitating scientific discoveries across a broad range of disciplines. The progress of this community relies inherently on the availability of scalable solvers on new hardware. In the past, various developments have originated in this community to tackle Eq. (1) [9], [10].
- The algorithm typically employed in a conventional dense eigensolver [11], [12] is tridiagonalization, which brings the original matrix to a tridiagonal form by a series of Householder transformations. This algorithm suffers from the inefficiency of BLAS level-2 matrix–vector operations. New algorithms such as pentadiagonalization [13] and two-stage tridiagonalization [14], [15], [16], [17] have been developed, leading to enhanced performance over the conventional one-stage tridiagonalization approach.
- Iterative eigensolvers [18], [19], [20], [21] are commonly employed by DFT codes, particularly those based on plane-wave (PW) basis functions and pseudopotentials. In that case, the large number of basis functions needed for an accurate calculation, i.e. the large dimension of the matrices in Eq. (1), makes a direct solution of Eq. (1) infeasible. Iterative eigensolvers are well suited to find a small fraction of low-lying eigenstates, commensurate with the needs of a PW-based code. When using spatially localized basis functions, such as linear combinations of atomic orbitals (LCAO), the fraction of needed eigenpairs out of the full matrix dimension can be fairly large. In this scenario, iterative solvers no longer have an advantage over direct eigensolvers.
- With localized basis functions, locality in the physical system can be translated to sparsity in the $\mathbf{H}$ and $\mathbf{S}$ matrices. Methods exploiting this sparsity can be formulated to scale as $O(N)$ [22], [23], [24], [25], [26], [27] by circumventing the explicit solution of Eq. (1). In particular, linear-scaling algorithms in a density matrix formalism have been successfully applied to simulations of one million atoms [28], [29]. Despite this success in extreme-scale simulations, reduced-scaling methods come with a computational prefactor that is much larger than that of the diagonalization method. Moreover, the applicability and optimal performance of reduced-scaling methods are often limited to certain problem types.
As of today, dense eigensolvers, with their small computational prefactor and general applicability, remain the default method in most LCAO codes. Even in PW codes, the performance of a dense eigensolver is still crucial, because at some stage of an iterative solver there will typically be a reduced-size dense eigenproblem, the size of which scales with the number of valence electrons in the system being simulated. Therefore, any improvement made to a dense eigensolver would benefit the entire electronic structure community, and the broader field of computational physics in general.
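To make the reduced-size dense eigenproblem inside an iterative solver concrete, the following minimal Rayleigh–Ritz sketch (Python/NumPy; the dimensions, the random matrix, and the random trial subspace are illustrative assumptions, not taken from the paper) shows how the full problem is projected onto a small subspace whose resulting dense matrix must then be diagonalized.

```python
import numpy as np
from scipy.linalg import eigh, qr

rng = np.random.default_rng(2)
N, m = 2000, 64                                  # full dimension vs. subspace size (~ number of wanted states)
H = rng.standard_normal((N, N)); H = 0.5 * (H + H.T)

# Rayleigh-Ritz step of a subspace iteration:
V, _ = qr(rng.standard_normal((N, m)), mode='economic')   # orthonormal trial vectors spanning the subspace
Hs = V.T @ H @ V                                           # reduced m x m dense matrix
w, Y = eigh(Hs)                                            # the reduced-size dense eigenproblem
ritz_vectors = V @ Y                                       # approximate eigenvectors of H
```

The size of this reduced dense problem grows with the number of wanted eigenstates, which is why a fast dense eigensolver matters even in PW codes.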
The ubiquitous adoption of graphics processing units (GPUs) in high-performance computing opens up new opportunities to accelerate the solution of dense eigenproblems. A GPU device consists of hundreds to thousands of parallel cores operating at a relatively low frequency. These cores are naturally suited for parallel computational tasks, such as the vector and matrix operations found in dense linear algebra. On top of that, GPUs typically have a power efficiency superior to that of traditional central processing units (CPUs), and therefore play an important role in supercomputing on the path towards exascale. According to the November 2020 release of the TOP500 list [30], six of the top ten machines have GPU accelerators, including Summit, the world’s second fastest computer (the fastest from June 2018 to June 2020), based on IBM POWER9 CPUs and NVIDIA Tesla Volta V100 GPUs.
GPU-accelerated eigensolvers have long been available in GPU-oriented linear algebra packages such as cuSOLVER (single GPU) [31], cuSOLVER-MG (multiple GPUs) [31], and MAGMA (CPUs and multiple GPUs) [32], [33]. These packages are designed and optimized for shared-memory host architectures. They can be very fast, but the problem size they can tackle is limited by the memory capacity of a single compute node. Fully exploiting the power of GPU-accelerated supercomputers would require a distributed-memory implementation.
The MPI-parallel, distributed-memory ELPA library [34] implements the conventional one-stage diagonalization method and the two-stage diagonalization proposed in Refs. [14], [15], known as the “ELPA1” and “ELPA2” solvers, respectively. The ELPA library is being used in a large number of quantum chemistry and solid state physics software packages (mentioned either in the publications cited here or in the documentation of the respective package), including ABINIT [35], BerkeleyGW [36], CP2K [37], CPMD [38], DFTB+ [39], FHI-aims [4], GPAW [40], NWChem [41], Octopus [42], OpenMX [43], QuantumATK [44], Quantum ESPRESSO [45], SIESTA [6], VASP [46], and WIEN2k [47]. Both the ELPA1 and ELPA2 solvers have been ported to GPUs by substituting BLAS calls with the corresponding cuBLAS functions [48], making ELPA, to our knowledge, the first publicly available, distributed-memory, GPU-accelerated eigensolver. We note the ongoing effort of the SLATE project [49], a more general distributed-memory dense linear algebra framework being developed to replace ScaLAPACK on modern high-performance computing (HPC) machines with large numbers of CPU cores and accelerators. The eigensolver implementation in SLATE adopts the same two-stage diagonalization method as implemented in ELPA2, but does not return eigenvectors at the time of writing (September 2020).
For brevity, the CPU-only and GPU-accelerated versions of ELPA1 and ELPA2 are hereafter referred to as CPU-ELPA1, CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2, respectively. This paper presents the first complete overview of GPU-ELPA2, including in-depth benchmarks. The only previously published benchmark data of GPU-ELPA2 were restricted to single-node tests of a preliminary version, shown as a single line of data in Fig. 5 of Ref. [50]. When using two IBM POWER8 CPUs (24 cores in total) and four NVIDIA Pascal P100 GPUs, GPU-ELPA1 delivered a performance boost of up to 11.9x compared to CPU-ELPA1 using 24 CPU cores. In this regime, the performance of GPU-ELPA1 was still better than that of GPU-ELPA2. The extensive set of benchmarks in the present paper demonstrates the regimes in which the present version of GPU-ELPA2 provides a significant advantage. Historically, the GPU port of ELPA2 has its roots in 2013, when Peter Messmer of NVIDIA programmed the first proof-of-concept version of GPU-ELPA2. The code was then refactored and merged into the mainline version of ELPA, and has been available in released versions of the ELPA eigensolver library since 2016. In this paper, we report the current, complete version of GPU-ELPA2, including our latest optimizations and developments that enable a performance improvement on distributed-memory, GPU-accelerated architectures. Specifically, several synchronizations and memory transfers between CPUs and GPUs have been optimized. Additionally, some kernels in one of the major computational steps, the tridiagonal-to-banded back-transformation of eigenvectors (Eq. (6d)), have been rewritten.
The rest of this paper is organized as follows. First, we briefly review the two-stage diagonalization algorithm, in particular the tridiagonal-to-banded back-transformation of eigenvectors and its CPU implementation in the ELPA library. Next, we outline the GPU acceleration strategies employed in GPU-ELPA2, and elaborate on our CUDA implementation of the tridiagonal-to-banded back-transformation of eigenvectors, which is essentially a GPU extension of the algorithm in Refs. [16], [51]. We then benchmark the performance and scalability of the GPU-ELPA2 solver on two GPU-accelerated computers: the Talos cluster at the Max Planck Computing and Data Facility in Garching, Germany, based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer at Oak Ridge National Laboratory in Tennessee, USA, based on IBM POWER9 CPUs and NVIDIA Volta GPUs. Finally, we demonstrate the performance of the GPU-ELPA2 solver for practical computational physics problems on one of the world’s fastest supercomputers, Summit, for routine semi-local KS-DFT calculations including thousands of atoms, without sacrificing any accuracy.
Section snippets
Overview of the two-stage tridiagonalization
The textbook procedure [11], [12] to solve a dense generalized eigenproblem like the one in Eq. (1) first computes the Cholesky factorization of the overlap matrix,
$\mathbf{S} = \mathbf{L}\mathbf{L}^{\dagger}$, (2)
then uses $\mathbf{L}$ to transform Eq. (1) to a standard eigenproblem,
$\tilde{\mathbf{H}}\tilde{\mathbf{C}} = \tilde{\mathbf{C}}\boldsymbol{\Sigma}$, where $\tilde{\mathbf{H}} = \mathbf{L}^{-1}\mathbf{H}(\mathbf{L}^{-1})^{\dagger}$ and $\tilde{\mathbf{C}} = \mathbf{L}^{\dagger}\mathbf{C}$. (3)
The eigenvectors $\tilde{\mathbf{C}}$ of Eq. (3) must be back-transformed in order to retrieve the eigenvectors of Eq. (1), i.e.
$\mathbf{C} = (\mathbf{L}^{-1})^{\dagger}\tilde{\mathbf{C}}$, (4)
where $\dagger$ denotes the transpose for real symmetric matrices and the conjugate transpose for complex Hermitian matrices.
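The reduction in Eqs. (2)–(4) can be sketched in a few lines of NumPy/SciPy (an illustration only; the matrix sizes and the random $\mathbf{H}$ and $\mathbf{S}$ are assumptions, and a production library such as ELPA performs these steps on distributed matrices rather than in serial).

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

rng = np.random.default_rng(1)
n = 500
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)          # symmetric "Hamiltonian"
S = rng.standard_normal((n, n)); S = S @ S.T + n * np.eye(n)   # symmetric positive-definite "overlap"

L = cholesky(S, lower=True)                                    # Eq. (2): S = L L^T
# Eq. (3): H~ = L^{-1} H L^{-T}, built with two triangular solves instead of an explicit inverse
Ht = solve_triangular(L, solve_triangular(L, H, lower=True).T, lower=True).T
w, Ct = eigh(Ht)                                               # standard eigenproblem H~ C~ = C~ Sigma
C = solve_triangular(L, Ct, lower=True, trans='T')             # Eq. (4): C = L^{-T} C~
print(np.allclose(H @ C, S @ C * w, atol=1e-8))                # verify H C = S C Sigma
```

Using triangular solves rather than forming $\mathbf{L}^{-1}$ explicitly mirrors how the transformation is usually carried out in practice.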
The direct solution of Eq. (3) is based on tridiagonalization, that is, the full matrix $\tilde{\mathbf{H}}$ is transformed to a tridiagonal matrix $\mathbf{T}$.
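For contrast with the two-stage scheme, the sketch below (Python/NumPy; matrix size and the random test matrix are illustrative assumptions) implements the conventional one-stage tridiagonalization with explicit Householder reflectors, solves the tridiagonal problem, and back-transforms the eigenvectors. The rank-1 updates inside the loop are exactly the BLAS level-2 operations that dominate the one-stage approach and that ELPA2's two-stage full-to-banded-to-tridiagonal reduction replaces with more efficient blocked operations.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def householder_tridiagonalize(A):
    """Reduce a real symmetric matrix A to tridiagonal form T = Q^T A Q (one-stage)."""
    n = A.shape[0]
    T = A.copy()
    Q = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        # Apply the reflector H_k = I - 2 v v^T from the left (rows k+1..) and right (columns k+1..)
        T[k + 1:, :] -= 2.0 * np.outer(v, v @ T[k + 1:, :])
        T[:, k + 1:] -= 2.0 * np.outer(T[:, k + 1:] @ v, v)
        # Accumulate Q = H_1 H_2 ... so that T = Q^T A Q
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)
    return T, Q

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
T, Q = householder_tridiagonalize(A)
w, V = eigh_tridiagonal(np.diag(T), np.diag(T, 1))   # solve the tridiagonal eigenproblem
C = Q @ V                                            # back-transform eigenvectors of T to those of A
print(np.allclose(A @ C, C * w, atol=1e-8))
```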
GPU acceleration of ELPA2
The two-stage tridiagonalization is implemented as five separate subroutines in ELPA2, corresponding to the five steps in Eq. (6). The input, output, and internal working matrices of ELPA2 are distributed across CPU cores. ELPA2 relies on its own explicit MPI calls to handle distributed linear algebra operations. The efficient MPI communication pattern in the CPU version of ELPA2 is not altered in the GPU version, where GPU acceleration is mainly realized by substituting local BLAS calls with the corresponding cuBLAS functions.
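To illustrate this substitution pattern (ELPA itself is written in Fortran and calls cuBLAS directly; the CuPy code below is only a hedged sketch of the same idea, with the block sizes chosen arbitrarily), the local block of a distributed matrix is moved to the GPU once, updated there with GEMM-type operations, and copied back only when the result is needed, so that the MPI communication pattern on the host can stay unchanged.

```python
# Illustrative sketch only; assumes CuPy and an NVIDIA GPU are available.
import numpy as np
import cupy as cp

n_loc, k = 4096, 256
a_host = np.random.rand(n_loc, n_loc)   # locally owned block of a distributed matrix
u_host = np.random.rand(n_loc, k)
v_host = np.random.rand(n_loc, k)

a = cp.asarray(a_host)                  # one-time host-to-device transfer of the local block
u = cp.asarray(u_host)
v = cp.asarray(v_host)

# Trailing-matrix style update A <- A - U V^T - V U^T, executed as cuBLAS GEMMs on the GPU
a -= u @ v.T + v @ u.T
cp.cuda.Stream.null.synchronize()       # ensure GPU work is finished before copying back

a_host = cp.asnumpy(a)                  # device-to-host transfer only when the result is needed
```

Keeping the matrix resident on the device between successive updates is what limits the host–device transfer overhead mentioned earlier.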
Performance and scalability
The performance of the GPU-accelerated ELPA2 solver is benchmarked on the Talos cluster at the Max Planck Computing and Data Facility and the Summit supercomputer at Oak Ridge National Laboratory. Each node of Talos has two Intel Xeon Gold 6148 CPUs (40 cores in total) and two NVIDIA Tesla Volta V100 GPUs (each with 32 GB of high-bandwidth memory, a double-precision peak of 7.0 TFLOP/s, and a 32 GB/s PCIe interconnect). Benchmarks presented in Fig. 5 are performed on Talos. The ELPA code is compiled with the
Application in materials calculations
Finally, we demonstrate the efficiency of GPU-ELPA2 in actual materials simulations. Two atomic structure models are selected as test systems, namely Cu₂BaSnS₄ and graphene on SiC, as shown in Fig. 10(a) and (b), respectively. Cu₂BaSnS₄ is a semiconductor that was designed as a potential photovoltaic absorber material [65], [66]. In this class of materials, large supercells can be of importance to understand, for example, the impact of dynamical properties, of defects, or of deliberately
Conclusions
In this paper, we report our GPU-oriented optimizations of the two-stage tridiagonalization eigensolver ELPA2 in the ELPA library. The local BLAS operations in ELPA2 are offloaded to GPUs via the cuBLAS library. The tridiagonal-to-banded back-transformation of eigenvectors, which cannot be easily written as BLAS operations, is GPU-accelerated by a CUDA implementation. The overall performance of the GPU-accelerated ELPA2 solver is promising. It delivers a significant performance boost over the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was supported by the National Science Foundation (NSF), USA under Award No. 1450280, and was partly supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division. Yu was supported by a fellowship from the Molecular Sciences Software Institute under NSF, USA Award No. 1547580. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, USA which is
References (70)
☆ The review of this paper was arranged by Prof. Stephan Fritzsche.