GPU-acceleration of the ELPA2 distributed eigensolver for dense symmetric and hermitian eigenproblems☆
Introduction
Finding the eigenvalues and eigenvectors of large dense matrices is a frequent problem in computational science and engineering. For example, in Kohn–Sham density-functional theory (KS-DFT) [1], [2], the many-electron problem for the Born–Oppenheimer electronic ground state is reduced to a system of single-particle equations that can be discretized into a generalized eigenproblem of the matrix form
$\mathbf{H}\mathbf{C} = \mathbf{S}\mathbf{C}\boldsymbol{\Sigma}$. (1)
Here the Hamiltonian matrix $\mathbf{H}$ and the overlap matrix $\mathbf{S}$ are real symmetric or complex Hermitian, and $\mathbf{S}$ is positive definite. The matrix $\mathbf{C}$ and the diagonal matrix $\boldsymbol{\Sigma}$ contain the eigenvectors and eigenvalues, respectively, of this eigensystem. In the framework of KS-DFT, $\mathbf{C}$ and $\boldsymbol{\Sigma}$ (or the information they carry, at least) are needed for the construction of $\mathbf{H}$. Therefore, Eq. (1) is a non-linear problem and must be solved self-consistently. It is possible, and has already been implemented in various codes [3], [4], [5], [6], to restrict the computational cost of the construction of $\mathbf{H}$ to scale linearly with respect to the system size for any semi-local and hybrid exchange–correlation functional. In contrast, the solution of a dense eigenproblem (“diagonalization”) scales as $O(N^3)$ with the matrix dimension $N$, quickly growing to become prohibitive as $N$ increases to large values.
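As a concrete illustration (not part of the original text), the following minimal Python/SciPy sketch builds a small random eigenproblem of the form of Eq. (1) and solves it with a dense generalized eigensolver; the matrix size and the way $\mathbf{H}$ and $\mathbf{S}$ are generated are arbitrary assumptions made only for the example.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 1000
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)          # real symmetric "Hamiltonian"
S = rng.standard_normal((n, n)); S = S @ S.T + n * np.eye(n)   # symmetric positive-definite "overlap"

# Dense generalized eigensolver: H C = S C Sigma
w, C = eigh(H, S)

# Verify Eq. (1) column by column: H c_i = w_i S c_i
print(np.allclose(H @ C, S @ C * w, atol=1e-8))
```

The cost of this dense solve grows as $O(n^3)$ with the matrix dimension, which is the scaling bottleneck discussed above.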
Today, simulations in materials science and computational chemistry make up a large fraction of supercomputer time used in production (see, e.g., the workload analysis by the National Energy Research Scientific Computing Center [7], [8]). In particular, DFT simulations running on high-performance computers are facilitating scientific discoveries across a broad range of disciplines. The progress of this community relies inherently on the availability of scalable solvers on new hardware. In the past, various developments have originated in this community to tackle Eq. (1) [9], [10].
- The algorithm typically employed in a conventional dense eigensolver [11], [12] is tridiagonalization, which brings the original matrix to a tridiagonal form by a series of Householder transformations. This algorithm suffers from the inefficiency of BLAS level-2 matrix–vector operations. New algorithms such as pentadiagonalization [13] and two-stage tridiagonalization [14], [15], [16], [17] have been developed, leading to enhanced performance over the conventional one-stage tridiagonalization approach.
- Iterative eigensolvers [18], [19], [20], [21] are commonly employed by DFT codes, particularly those based on plane-wave (PW) basis functions and pseudopotentials. In that case, the large number of basis functions needed for an accurate calculation, i.e. the large dimension of the matrices in Eq. (1), makes a direct solution of Eq. (1) infeasible. Iterative eigensolvers are well suited to find a small fraction of low-lying eigenstates, commensurate with the needs of a PW-based code. When using spatially localized basis functions, such as linear combinations of atomic orbitals (LCAO), the fraction of needed eigenpairs out of the full matrix dimension can be fairly large. In this scenario, iterative solvers no longer have an advantage over direct eigensolvers.
- With localized basis functions, locality in the physical system can be translated to sparsity in the $\mathbf{H}$ and $\mathbf{S}$ matrices. Methods exploiting this sparsity can be formulated to scale as $O(N)$ [22], [23], [24], [25], [26], [27] by circumventing the explicit solution of Eq. (1). In particular, linear-scaling algorithms in a density matrix formalism have been successfully applied to simulations of one million atoms [28], [29]. Despite this success in extreme-scale simulations, reduced-scaling methods come with a computational prefactor that is much larger than that of the diagonalization method. Moreover, the applicability and optimal performance of reduced-scaling methods are often limited to certain problem types.
As of today, dense eigensolvers, with their small computational prefactor and general applicability, remain the default method in most LCAO codes. Even in PW codes, the performance of a dense eigensolver is still crucial, because at some stage of an iterative solver there will typically be a reduced-size dense eigenproblem, the size of which scales with the number of valence electrons in the system being simulated. Therefore, any improvement made to a dense eigensolver would benefit the entire electronic structure community, and the broader field of computational physics in general.
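To make the reduced-size dense eigenproblem inside an iterative solver concrete, the following minimal Rayleigh–Ritz sketch (Python/NumPy; the dimensions, the random matrix, and the random trial subspace are illustrative assumptions, not taken from the paper) shows how the full problem is projected onto a small subspace whose resulting dense matrix must then be diagonalized.

```python
import numpy as np
from scipy.linalg import eigh, qr

rng = np.random.default_rng(2)
N, m = 2000, 64                                  # full dimension vs. subspace size (~ number of wanted states)
H = rng.standard_normal((N, N)); H = 0.5 * (H + H.T)

# Rayleigh-Ritz step of a subspace iteration:
V, _ = qr(rng.standard_normal((N, m)), mode='economic')   # orthonormal trial vectors spanning the subspace
Hs = V.T @ H @ V                                           # reduced m x m dense matrix
w, Y = eigh(Hs)                                            # the reduced-size dense eigenproblem
ritz_vectors = V @ Y                                       # approximate eigenvectors of H
```

The size of this reduced dense problem grows with the number of wanted eigenstates, which is why a fast dense eigensolver matters even in PW codes.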
The ubiquitous adoption of graphics processing units (GPUs) in high-performance computing opens up new opportunities to accelerate the solution of dense eigenproblems. A GPU device consists of hundreds to thousands of parallel cores operating at a relatively low frequency. These cores are naturally suited for parallel computational tasks, such as the vector and matrix operations found in dense linear algebra. On top of that, GPUs typically have a power efficiency superior to that of traditional central processing units (CPUs), and therefore play an important role in supercomputing on the path towards exascale. According to the November 2020 release of the TOP500 list [30], six of the top ten machines have GPU accelerators, including Summit, the world’s second fastest computer (the fastest from June 2018 to June 2020), based on IBM POWER9 CPUs and NVIDIA Tesla Volta V100 GPUs.
GPU-accelerated eigensolvers have long been available in GPU-oriented linear algebra packages such as cuSOLVER (single GPU) [31], cuSOLVER-MG (multiple GPUs) [31], and MAGMA (CPUs and multiple GPUs) [32], [33]. These packages are designed and optimized for shared-memory host architectures. They can be very fast, but the problem size they can tackle is limited by the memory capacity of a single compute node. Fully exploiting the power of GPU-accelerated supercomputers would require a distributed-memory implementation.
The MPI-parallel, distributed-memory ELPA library [34] implements the conventional one-stage diagonalization method and the two-stage diagonalization proposed in Refs. [14], [15], known as the “ELPA1” and “ELPA2” solvers, respectively. The ELPA library is being used in a large number of quantum chemistry and solid state physics software packages (mentioned either in the publications cited here or in the documentation of the respective package), including ABINIT [35], BerkeleyGW [36], CP2K [37], CPMD [38], DFTB+ [39], FHI-aims [4], GPAW [40], NWChem [41], Octopus [42], OpenMX [43], QuantumATK [44], Quantum ESPRESSO [45], SIESTA [6], VASP [46], and WIEN2k [47]. Both the ELPA1 and ELPA2 solvers have been ported to GPUs by substituting BLAS calls with the corresponding cuBLAS functions [48], making ELPA, to our knowledge, the first publicly available, distributed-memory, GPU-accelerated eigensolver. We note the ongoing effort of the SLATE project [49], a more general distributed-memory dense linear algebra framework being developed to replace ScaLAPACK on modern high-performance computing (HPC) machines with large numbers of CPU cores and accelerators. The eigensolver implementation in SLATE adopts the same two-stage diagonalization method as implemented in ELPA2, but does not return eigenvectors at the time of writing (September 2020).
For brevity, the CPU-only and GPU-accelerated versions of ELPA1 and ELPA2 are hereafter referred to as CPU-ELPA1, CPU-ELPA2, GPU-ELPA1, and GPU-ELPA2, respectively. This paper presents the first complete overview of GPU-ELPA2, including in-depth benchmarks. The only previously published benchmark data of GPU-ELPA2 were restricted to single-node tests of a preliminary version, shown as a single line of data in Fig. 5 of Ref. [50]. When using two IBM POWER8 CPUs (24 cores in total) and four NVIDIA Pascal P100 GPUs, GPU-ELPA1 delivered a performance boost of up to 11.9x compared to CPU-ELPA1 using 24 CPU cores. In this regime, the performance of GPU-ELPA1 was still better than that of GPU-ELPA2. The extensive set of benchmarks in the present paper demonstrates the regimes in which the present version of GPU-ELPA2 provides a significant advantage. Historically, the GPU port of ELPA2 has its roots in 2013, when Peter Messmer of NVIDIA programmed the first proof-of-concept version of GPU-ELPA2. The code was then refactored and merged into the mainline version of ELPA, and has been available in released versions of the ELPA eigensolver library since 2016. In this paper, we report the current, complete version of GPU-ELPA2, including our latest optimizations and developments that enable a performance improvement on distributed-memory, GPU-accelerated architectures. Specifically, several synchronizations and memory transfers between CPUs and GPUs have been optimized. Additionally, some kernels in one of the major computational steps, the tridiagonal-to-banded back-transformation of eigenvectors (Eq. (6d)), have been rewritten.
The rest of this paper is organized as follows. First, we briefly review the two-stage diagonalization algorithm, in particular the tridiagonal-to-banded back-transformation of eigenvectors and its CPU implementation in the ELPA library. Next, we outline the GPU acceleration strategies employed in GPU-ELPA2, and elaborate on our CUDA implementation of the tridiagonal-to-banded back-transformation of eigenvectors, which is essentially a GPU extension of the algorithm in Refs. [16], [51]. We then benchmark the performance and scalability of the GPU-ELPA2 solver on two GPU-accelerated computers: the Talos cluster at the Max Planck Computing and Data Facility in Garching, Germany, based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer at Oak Ridge National Laboratory in Tennessee, USA, based on IBM POWER9 CPUs and NVIDIA Volta GPUs. Finally, we demonstrate the performance of the GPU-ELPA2 solver for practical computational physics problems on one of the world’s fastest supercomputers, Summit, for routine semi-local KS-DFT calculations including thousands of atoms, without sacrificing any accuracy.
Section snippets
Overview of the two-stage tridiagonalization
The textbook procedure [11], [12] to solve a dense generalized eigenproblem like the one in Eq. (1) first computes the Cholesky factorization of the overlap matrix,
$\mathbf{S} = \mathbf{L}\mathbf{L}^{\dagger}$, (2)
then uses $\mathbf{L}$ to transform Eq. (1) to a standard eigenproblem,
$\tilde{\mathbf{H}}\tilde{\mathbf{C}} = \tilde{\mathbf{C}}\boldsymbol{\Sigma}$, where $\tilde{\mathbf{H}} = \mathbf{L}^{-1}\mathbf{H}(\mathbf{L}^{-1})^{\dagger}$ and $\tilde{\mathbf{C}} = \mathbf{L}^{\dagger}\mathbf{C}$. (3)
The eigenvectors $\tilde{\mathbf{C}}$ of Eq. (3) must be back-transformed in order to retrieve the eigenvectors of Eq. (1), i.e.
$\mathbf{C} = (\mathbf{L}^{-1})^{\dagger}\tilde{\mathbf{C}}$, (4)
where $\dagger$ denotes the transpose for real symmetric matrices and the conjugate transpose for complex Hermitian matrices.
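The reduction in Eqs. (2)–(4) can be sketched in a few lines of NumPy/SciPy (an illustration only; the matrix sizes and the random $\mathbf{H}$ and $\mathbf{S}$ are assumptions, and a production library such as ELPA performs these steps on distributed matrices rather than in serial).

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

rng = np.random.default_rng(1)
n = 500
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)          # symmetric "Hamiltonian"
S = rng.standard_normal((n, n)); S = S @ S.T + n * np.eye(n)   # symmetric positive-definite "overlap"

L = cholesky(S, lower=True)                                    # Eq. (2): S = L L^T
# Eq. (3): H~ = L^{-1} H L^{-T}, built with two triangular solves instead of an explicit inverse
Ht = solve_triangular(L, solve_triangular(L, H, lower=True).T, lower=True).T
w, Ct = eigh(Ht)                                               # standard eigenproblem H~ C~ = C~ Sigma
C = solve_triangular(L, Ct, lower=True, trans='T')             # Eq. (4): C = L^{-T} C~
print(np.allclose(H @ C, S @ C * w, atol=1e-8))                # verify H C = S C Sigma
```

Using triangular solves rather than forming $\mathbf{L}^{-1}$ explicitly mirrors how the transformation is usually carried out in practice.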
The direct solution of Eq. (3) is based on tridiagonalization, that is, the full matrix $\tilde{\mathbf{H}}$ is transformed to a tridiagonal matrix $\mathbf{T}$.
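For contrast with the two-stage scheme, the sketch below (Python/NumPy; matrix size and the random test matrix are illustrative assumptions) implements the conventional one-stage tridiagonalization with explicit Householder reflectors, solves the tridiagonal problem, and back-transforms the eigenvectors. The rank-1 updates inside the loop are exactly the BLAS level-2 operations that dominate the one-stage approach and that ELPA2's two-stage full-to-banded-to-tridiagonal reduction replaces with more efficient blocked operations.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def householder_tridiagonalize(A):
    """Reduce a real symmetric matrix A to tridiagonal form T = Q^T A Q (one-stage)."""
    n = A.shape[0]
    T = A.copy()
    Q = np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        # Apply the reflector H_k = I - 2 v v^T from the left (rows k+1..) and right (columns k+1..)
        T[k + 1:, :] -= 2.0 * np.outer(v, v @ T[k + 1:, :])
        T[:, k + 1:] -= 2.0 * np.outer(T[:, k + 1:] @ v, v)
        # Accumulate Q = H_1 H_2 ... so that T = Q^T A Q
        Q[:, k + 1:] -= 2.0 * np.outer(Q[:, k + 1:] @ v, v)
    return T, Q

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)
T, Q = householder_tridiagonalize(A)
w, V = eigh_tridiagonal(np.diag(T), np.diag(T, 1))   # solve the tridiagonal eigenproblem
C = Q @ V                                            # back-transform eigenvectors of T to those of A
print(np.allclose(A @ C, C * w, atol=1e-8))
```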
GPU acceleration of ELPA2
The two-stage tridiagonalization is implemented as five separate subroutines in ELPA2, corresponding to the five steps in Eq. (6). The input, output, and internal working matrices of ELPA2 are distributed across CPU cores. ELPA2 relies on its own explicit MPI calls to handle distributed linear algebra operations. The efficient MPI communication pattern in the CPU version of ELPA2 is not altered in the GPU version, where GPU acceleration is mainly realized by substituting local BLAS calls with the corresponding cuBLAS functions.
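To illustrate this substitution pattern (ELPA itself is written in Fortran and calls cuBLAS directly; the CuPy code below is only a hedged sketch of the same idea, with the block sizes chosen arbitrarily), the local block of a distributed matrix is moved to the GPU once, updated there with GEMM-type operations, and copied back only when the result is needed, so that the MPI communication pattern on the host can stay unchanged.

```python
# Illustrative sketch only; assumes CuPy and an NVIDIA GPU are available.
import numpy as np
import cupy as cp

n_loc, k = 4096, 256
a_host = np.random.rand(n_loc, n_loc)   # locally owned block of a distributed matrix
u_host = np.random.rand(n_loc, k)
v_host = np.random.rand(n_loc, k)

a = cp.asarray(a_host)                  # one-time host-to-device transfer of the local block
u = cp.asarray(u_host)
v = cp.asarray(v_host)

# Trailing-matrix style update A <- A - U V^T - V U^T, executed as cuBLAS GEMMs on the GPU
a -= u @ v.T + v @ u.T
cp.cuda.Stream.null.synchronize()       # ensure GPU work is finished before copying back

a_host = cp.asnumpy(a)                  # device-to-host transfer only when the result is needed
```

Keeping the matrix resident on the device between successive updates is what limits the host–device transfer overhead mentioned earlier.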
Performance and scalability
The performance of the GPU-accelerated ELPA2 solver is benchmarked on the Talos cluster at the Max Planck Computing and Data Facility and the Summit supercomputer at Oak Ridge National Laboratory. Each node of Talos has two Intel Xeon Gold 6148 CPUs (40 cores in total) and two NVIDIA Tesla Volta V100 GPUs (each with 32 GB of high-bandwidth memory, a double-precision peak of 7.0 TFLOP/s, and a 32 GB/s PCIe interconnect). Benchmarks presented in Fig. 5 are performed on Talos. The ELPA code is compiled with the
Application in materials calculations
Finally, we demonstrate the efficiency of GPU-ELPA2 in actual materials simulations. Two atomic structure models are selected as test systems, namely Cu₂BaSnS₄ and graphene on SiC, as shown in Fig. 10(a) and (b), respectively. Cu₂BaSnS₄ is a semiconductor that was designed as a potential photovoltaic absorber material [65], [66]. In this class of materials, large supercells can be of importance to understand, for example, the impact of dynamical properties, of defects, or of deliberately
Conclusions
In this paper, we report our GPU-oriented optimizations of the two-stage tridiagonalization eigensolver ELPA2 in the ELPA library. The local BLAS operations in ELPA2 are offloaded to GPUs via the cuBLAS library. The tridiagonal-to-banded back-transformation of eigenvectors, which cannot be easily written as BLAS operations, is GPU-accelerated by a CUDA implementation. The overall performance of the GPU-accelerated ELPA2 solver is promising. It delivers a significant performance boost over the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was supported by the National Science Foundation (NSF), USA under Award No. 1450280, and was partly supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division. Yu was supported by a fellowship from the Molecular Sciences Software Institute under NSF, USA Award No. 1547580. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, USA which is
References (70)
☆ The review of this paper was arranged by Prof. Stephan Fritzsche.