FHI-aims and the ELPA2 Kernel

The "ELPA kernels" are very small subroutines that drastically speed up the execution of a particular set of transformations in the ELPA eigenvalue solver when tailored to a specific CPU and architecture. If the ELPA2_KERNEL variable is not set, a generic ELPA kernel is used. While this is not inherently bad, the generic kernel cannot take advantage of the often substantial speedups enabled by CPU-specific instruction sets.

Choosing the correct ELPA2 Kernel

ELPA generally uses BLAS level-3 routines for compute-intensive work so that the performance of ELPA mainly depends on the quality of the BLAS implementation. One exception is the back-transformation of the eigenvectors in the two-stage solver (ELPA2). In this case, BLAS level-3 routines cannot be used effectively due to the specifics of the algorithm.

The back-transformation of eigenvectors in ELPA2 has been put into a routine of its own so that this can be replaced by hand-tailored, optimized kernels for specific platforms.

Available Kernels

The redistributed ELPA library in ELSI offers the following kernels:

  • Generic: The generic Fortran version of the ELPA2 kernels should work on every platform. It contains some hand optimizations (loop unrolling). This is the default kernel.
  • AVX: Optimized kernels using Intel AVX instruction sets.
  • AVX2: Optimized kernels using Intel AVX2 instruction sets.
  • AVX512: Optimized kernels using Intel AVX512 instruction sets.

These kernels may be chosen by setting the ELPA2_KERNEL CMake keyword during configuration. When the USE_GPU_CUDA option is enabled, the back-transformation of eigenvectors in ELPA2 can be offloaded to GPUs.

Additional kernels are available in the upstream version of ELPA.
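As an illustration, a minimal fragment of an initial_cache.cmake file selecting one of these kernels might look as follows (a hedged sketch: only the kernel-related lines are shown, AVX2 is an arbitrary example choice, and all other settings are omitted):

```cmake
# Select the AVX2 ELPA2 kernel at configure time
# (ELPA2_KERNEL and USE_GPU_CUDA are the variables named in this document)
set(ELPA2_KERNEL "AVX2" CACHE STRING "")

# Alternatively, offload the ELPA2 back-transformation to GPUs:
# set(USE_GPU_CUDA ON CACHE BOOL "" FORCE)
```

A complete initial_cache.cmake example is given in the Examples section below.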

Choosing Kernels Based on Available Instruction Sets

To figure out which instruction sets are supported on a computer/node, you need to know what model of CPU the system is using. A good summary of Intel processors and their supported vector instructions is given here.

If you are still unsure about your system, you can run the following commands depending on your platform:

  • Linux: cat /proc/cpuinfo | grep flags
Linux example: Compute node with 2 Intel Xeon Gold 6148 (Skylake) processors, MPCDF HPC system Cobra

Running the command cat /proc/cpuinfo | grep flags shows:

$ cat /proc/cpuinfo | grep flags
    flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
    rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
    tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch 
    cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep 
    bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 
    xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities
We can see that this CPU supports AVX, AVX2, and AVX512 instructions. We would thus set the variable ELPA2_KERNEL to AVX512, the newest instruction set supported by this CPU.
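The decision made by hand above can be sketched as a small POSIX shell helper. This is a hypothetical illustration, not part of ELSI or FHI-aims; pick_elpa2_kernel is a made-up name, and the function simply maps a /proc/cpuinfo "flags" line to the matching ELPA2_KERNEL value:

```shell
#!/bin/sh
# Hypothetical helper: given a /proc/cpuinfo "flags" line as the first
# argument, print the most specific ELPA2_KERNEL value it supports.
# Checks the newest instruction set first, falling back to Generic.
pick_elpa2_kernel() {
    flags="$1"
    case " $flags " in
        *" avx512f "*) echo "AVX512" ;;
        *" avx2 "*)    echo "AVX2"   ;;
        *" avx "*)     echo "AVX"    ;;
        *)             echo "Generic" ;;
    esac
}

# Example usage on a Linux machine (skipped where /proc/cpuinfo is absent):
if [ -r /proc/cpuinfo ]; then
    pick_elpa2_kernel "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

For the Skylake flags shown above, this prints AVX512, matching the choice made by inspection.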

  • MacOS: sysctl -a | grep machdep.cpu (on Intel-based Macs, sysctl -a | grep machdep.cpu.features lists the supported instruction sets directly)
MacOS example: Macbook Pro, M1 Max, 2021

Running the command sysctl -a | grep machdep.cpu shows:

$ sysctl -a | grep machdep.cpu
    machdep.cpu.cores_per_package: 10
    machdep.cpu.core_count: 10
    machdep.cpu.logical_per_package: 10
    machdep.cpu.thread_count: 10
    machdep.cpu.brand_string: Apple M1 Max 
The Apple M1 Max is an ARM-based chip and does not support AVX instructions, so ELPA2_KERNEL should be left at its default, Generic.
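This architecture check can also be done before inspecting any CPU flags: the AVX kernels only apply to x86_64 CPUs. The following sketch (kernel_family_for_arch is a hypothetical helper name) maps the output of `uname -m` to the applicable kernel family:

```shell
#!/bin/sh
# Hypothetical sketch: ARM machines (e.g. Apple Silicon, where
# `uname -m` prints "arm64") always use the Generic kernel; only
# x86_64 machines warrant a further check for AVX support.
kernel_family_for_arch() {
    case "$1" in
        x86_64|amd64)  echo "x86_64: check cpuinfo/sysctl for AVX level" ;;
        arm64|aarch64) echo "Generic" ;;
        *)             echo "Generic" ;;
    esac
}

# Example usage on the current machine:
kernel_family_for_arch "$(uname -m)"
```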

Notes
  • In the generic kernel for real-valued matrices, a complex variable is used to coax better optimizations from the compiler. This produces correct code; however, some compilers may report a warning.
  • In Fortran, an assumed-size array has an asterisk as the upper bound of its last dimension. An assumed-size array must be a dummy argument. Although this feature is no longer recommended by the Fortran standard, we have found that using assumed-size arrays in the ELPA2 kernels is critical for performance, as it keeps the compiler from creating temporary arrays at runtime.

Examples

An example initial_cache.cmake file selecting the ELPA2 kernel for a processor that supports the AVX512 instruction set can be found below.

initial_cache.cmake file for Intel compilers, libraries, and AVX512:

set(CMAKE_Fortran_COMPILER "mpiifort" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS "-fc=ifort -O3 -ip -xmic-avx512 -fp-model precise" CACHE STRING "" FORCE)
set(Fortran_MIN_FLAGS "-fc=ifort -O0 -fp-model precise" CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER "icc" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS "-O3 -ip -xmic-avx512 -fp-model precise -std=gnu99" CACHE STRING "" FORCE)
set(CMAKE_CXX_COMPILER "icpc" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "-O3 -ip -xmic-avx512 -fp-model precise" CACHE STRING "" FORCE)
set(LIB_PATHS "/usr/local/apps/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64" CACHE STRING "" FORCE)
set(LIBS "mkl_intel_lp64 mkl_sequential mkl_core mkl_blacs_intelmpi_lp64 mkl_scalapack_lp64" CACHE STRING "" FORCE)
set(USE_MPI ON CACHE BOOL "" FORCE)
set(USE_SCALAPACK ON CACHE BOOL "" FORCE)
set(ELPA2_KERNEL "AVX512" CACHE STRING "")

One important note on -x... options

With Intel's compilers, DO NOT use the architecture-specific -x... options unless you know exactly what CPU you will be running on. While the -x... options can make an important difference when used correctly, they hardwire specific chip instructions into the binary. A binary compiled on one chip (say, the head node of a cluster) will appear to work on other chips 99% of the time, yet individual results can suddenly be numerically completely wrong when it is run on another chip (say, a compute node of the same cluster). This issue has bitten some of our users - please pay attention to it. In short, you need to know the exact CPU installed on the node, and this can be found out by reading /proc/cpuinfo on that node.
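One way to catch this mismatch before it bites is to compare the CPU flags of the build host and the run host. The helper below is a hypothetical sketch (missing_flags is a made-up name): given the two "flags" lines, it prints every instruction-set flag present at build time but absent on the run host.

```shell
#!/bin/sh
# Hypothetical sketch: list flags present on the build host but missing
# on the run host. A binary built with -x... options relying on any
# flag printed here may crash or silently compute wrong numbers.
missing_flags() {
    build_flags="$1"
    run_flags="$2"
    for f in $build_flags; do
        case " $run_flags " in
            *" $f "*) ;;             # flag also present on run host: OK
            *) printf '%s\n' "$f" ;; # flag missing on run host: danger
        esac
    done
}

# Example: head node supports AVX512, compute node only AVX2
missing_flags "sse4_2 avx avx2 avx512f" "sse4_2 avx avx2"  # prints: avx512f
```

In practice the two arguments would come from `grep -m1 '^flags' /proc/cpuinfo` on each node; an empty result means every build-time flag is also available at run time.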