Part 2: Small benchmark: Ac-Ala19-LysH+
Benchmarks
The `benchmarks` subdirectory of the FHI-aims repository contains two benchmark cases with reference results that can be used to (roughly) assess the parallel performance of FHI-aims on HPC platforms.
These cases demonstrate both the scaling with the number of CPU cores and the expected approximate runtimes for FHI-aims on current (2021) Intel Skylake hardware with a fast interconnect between nodes.
Reference output files are provided for 40 and 240 CPU cores for the small case, and for 240, 960, and 1920 CPU cores for the large case, generated on the Max Planck Society's Cobra supercomputer. The CPUs used in these runs are Intel Xeon Skylake-SP processors (20 cores per node @ 2.4 GHz), dating from 2018. The interconnect is a fast OmniPath network. The ELPA kernel was set to "generic" (using optimized AVX kernels for the ELPA library would likely generate faster code), and Intel Fortran and MKL versions from 2020 were employed.
Example `submit.sh` scripts are specific to the SLURM queueing system and to the installation on the Cobra supercomputer.
Two typical log-log scaling plots are also provided with the benchmarks.
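Since the shipped `submit.sh` scripts are site-specific, a minimal SLURM script for the small (40-core) case might look like the following sketch. The partition and module names and the binary path are assumptions and must be adapted to your cluster:

```shell
#!/bin/bash -l
#SBATCH --job-name=aims_small_bench
#SBATCH --nodes=2                 # 2 nodes x 20 cores = 40 MPI tasks (small case)
#SBATCH --ntasks-per-node=20
#SBATCH --time=01:00:00
#SBATCH --partition=general       # hypothetical partition name - adapt to your site

# Hypothetical module names - adapt to your installation:
module load intel mkl impi

# Run from the directory containing control.in and geometry.in;
# the path to the FHI-aims binary must be adapted:
srun /path/to/aims.x > aims.out 2>&1
```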
About the small benchmark
In the following, we will compute the small benchmark twice: once with the settings found in the current benchmark folder of the FHI-aims distribution (`FHIaims/benchmarks/Ac-Lys-Ala19-H/`) and once with the FHI-aims default settings, adding only `use_local_index`. The origin of this benchmark dates back to 2016. However, the FHI-aims default settings have improved since then, and there is actually no need (except for `use_local_index`) to set them explicitly. The only reason to still use the "old" `control.in` is to keep the benchmark consistent over time. For a production calculation of this system, setting only `use_local_index` would be sufficient.
"Small" benchmark
The benchmark timings shown for this benchmark include 10 MD steps (i.e. ten converged SCF cycles). Since this is a "small" test case (220 atoms, 2072 basis functions), scalability on a current (2016) HPC system is expected to extend up to several hundred CPU cores. This smaller benchmark can be used to test if the FHI-aims build (compiler version, compiler flags, libraries etc.) works correctly on the cluster - just compare your results with the benchmark reference.
Optimizing performance in an HPC environment
There are several keywords which can help improve performance in large-scale simulations. The general part (everything except the species definitions) of a `control.in` could look something like this:
```
# VB, KML, JF 2021
#
# Input file control.in : All computational details for structure geometry.in
#
# * First, general computational parameters:
#
# Physics choices (model):
#
xc pbe
vdw_correction_hirshfeld
spin none
relativistic none
charge 1.0
#
# SCF cycle:
#
mixer pulay
n_max_pulay 10
charge_mix_param 0.4
#
# accuracy / efficiency for normal scf
#
density_update_method density_matrix
empty_states 3
#
#
use_local_index .true.
load_balancing .true.
# For MD:
MD_run 0.01 NVT_nose-hoover 300 1700
MD_thermostat_units cm^-1
MD_time_step 0.001
output_level MD_light
```
- `mixer pulay` sets the density-mixing algorithm for the SCF cycle to the Pulay mixer. The Pulay mixer usually leads to convergence within a few SCF cycles for many systems and is therefore also the default. `n_max_pulay` defines the maximum number of previous SCF steps which are mixed into the output density of a given step. The density change between steps is additionally multiplied by `charge_mix_param`. Smaller values of the latter thus lead to more conservative, more stable but slower convergence, whereas larger values allow more drastic changes in density between SCF steps. Usually, FHI-aims is capable of figuring out the correct mixing parameters. Only test and set these parameters explicitly if you know what you are doing - typically, the defaults set by the code should work with acceptable performance as well.
- `density_update_method` defines whether the electron density is updated using the Kohn-Sham orbitals or the density matrix. This keyword is only relevant for mid-sized non-periodic systems. The orbital-based density update is cheaper for small systems but scales as \(O(N^2)\) with system size \(N\). The density-matrix-based update scales as \(O(N)\) and is therefore the method of choice for large systems. In an intermediate size range, the code makes a decision based on our own past benchmarking, but an explicit benchmark for a given case (e.g., prior to a long molecular dynamics run) can offer some additional speedup. For periodic systems, only the density-matrix-based update is available.
- `empty_states` specifies the number of unoccupied Kohn-Sham states computed by the eigensolver. Reducing this number decreases the time spent in the eigensolver. However, test this reduction before using it: be certain that your system is non-metallic and that no post-processing will be performed which needs unoccupied states explicitly, e.g. spin-orbit coupling, or correlated methods such as GW. For such cases, increasing `empty_states` might be desirable to get converged results.
- When `use_local_index` is set to `.true.`, each process will store only those parts of the Hamiltonian and overlap matrices which are influenced by any grid point that is assigned to that process. If `.false.`, each process will store a full copy of these matrices, greatly increasing memory requirements. Setting `use_local_index .true.` is always recommended for system sizes greater than (say) 100 atoms, as long as it is supported internally by the code for the requested operations.
- `load_balancing`: While `use_local_index` resolves memory bottlenecks, it is not optimized for computational time; the computational workload may be unevenly distributed between processes. `load_balancing` redistributes grid points between processes to eliminate this negative impact on performance. We recommend trying this keyword in addition to `use_local_index` to improve performance.
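To build some intuition for the role of `charge_mix_param`, the effect of a mixing parameter can be illustrated with a toy fixed-point iteration. This is a didactic sketch in plain Python, not FHI-aims code: for the map below, undamped iteration diverges, while damped (linearly mixed) iteration with a small enough mixing parameter converges, with smaller values needing more steps:

```python
def mix_iterate(f, x0, alpha, tol=1e-10, max_iter=1000):
    """Damped fixed-point iteration: x_new = x + alpha * (f(x) - x).

    alpha plays the role of a mixing parameter: smaller values are
    more stable but need more iterations to converge.
    Returns (approximate fixed point, number of iterations used).
    """
    x = x0
    for n in range(1, max_iter + 1):
        x_new = x + alpha * (f(x) - x)
        if abs(x_new - x) < tol:
            return x_new, n
        x = x_new
    return x, max_iter

# Toy map with fixed point x* = 1; undamped iteration (alpha = 1)
# diverges because |f'(x)| = 2 > 1.
f = lambda x: -2.0 * x + 3.0

x_a, n_a = mix_iterate(f, 0.0, alpha=0.4)   # converges quickly
x_b, n_b = mix_iterate(f, 0.0, alpha=0.1)   # converges, but needs more steps
print(x_a, n_a, x_b, n_b)
```

The SCF analogue is only loose (FHI-aims mixes densities, not scalars, and Pulay mixing additionally extrapolates over previous steps), but the stability-versus-speed trade-off of the damping parameter is the same.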
We run this small benchmark on 40 and 240 cores. A summary of timings and memory used in a calculation is printed at the end of the main output:
```
------------------------------------------------------------
Leaving FHI-aims.
Date : 20211014, Time : 191131.295
Computational steps:
| Number of self-consistency cycles : 110
| Number of SCF (re)initializations : 10
| Number of molecular dynamics steps : 10
| Number of force evaluations : 10
Detailed time accounting : max(cpu_time) wall_clock(cpu1)
| Total time : 653.596 s 658.322 s
| Preparation time : 0.218 s 1.478 s
| Boundary condition initalization : 0.059 s 0.009 s
| Grid partitioning : 9.478 s 9.496 s
| Preloading free-atom quantities on grid : 15.578 s 15.670 s
| Free-atom superposition energy : 1.796 s 1.807 s
| Total time for integrations : 90.555 s 91.271 s
| Total time for solution of K.-S. equations : 24.952 s 25.248 s
| Total time for EV reorthonormalization : 0.485 s 0.487 s
| Total time for density & force components : 200.626 s 200.666 s
| Total time for mixing : 1.301 s 1.215 s
| Total time for Hartree multipole update : 1.351 s 1.351 s
| Total time for Hartree multipole sum : 232.417 s 232.440 s
| Total time for total energy evaluation : 0.016 s 0.082 s
| Total time NSC force correction : 65.551 s 65.668 s
| Total time for vdW correction : 7.628 s 7.631 s
Partial memory accounting:
| Residual value for overall tracked memory usage across tasks: 0.000000 MB (should be 0.000000 MB)
| Peak values for overall tracked memory usage:
| Minimum: 19.816 MB (on task 3 after allocating d_wave)
| Maximum: 21.986 MB (on task 30 after allocating d_wave)
| Average: 21.171 MB
| Largest tracked array allocation:
| Minimum: 4.367 MB (hessian_basis_wave on task 0)
| Maximum: 4.367 MB (hessian_basis_wave on task 0)
| Average: 4.367 MB
Note: These values currently only include a subset of arrays which are explicitly tracked.
The "true" memory usage will be greater.
Have a nice day.
------------------------------------------------------------
```
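Because the time-accounting block has a regular format, the per-component timings can also be extracted programmatically, e.g. to compare runs at different core counts. The following Python sketch assumes the line format shown above; adapt the regular expression if your FHI-aims version prints differently:

```python
import re

# Matches timing lines such as:
# | Total time for integrations : 90.555 s 91.271 s
_TIMING = re.compile(r"\|\s*(.+?)\s*:\s*([0-9.]+)\s*s\s+([0-9.]+)\s*s")

def parse_time_accounting(text):
    """Return {description: (max_cpu_time, wall_clock)} in seconds."""
    timings = {}
    for line in text.splitlines():
        m = _TIMING.search(line)
        if m:
            timings[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return timings

sample = """
| Total time : 653.596 s 658.322 s
| Total time for integrations : 90.555 s 91.271 s
"""
print(parse_time_accounting(sample)["Total time"])  # -> (653.596, 658.322)
```

Counter lines without the trailing "s ... s" pattern (e.g. "| Number of force evaluations : 10") are skipped automatically.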
`cpu_time` vs. `wall_clock`
In the above snippet, you find two timings: `cpu_time` and `wall_clock`. It is important to note that the `wall_clock` time is the timing one should look at: it is the actual elapsed time as measured by the system.
Additionally, more detailed timing information is printed throughout the run, depending on the type of calculation. In our MD run, for example, the time spent on each SCF step, for each geometry, is summarized like this:
```
Convergence: q app. | density | eigen (eV) | Etot (eV) | forces (eV/A) | CPU time | Clock time
SCF 1 : 0.10E+01 | 0.23E+01 | 0.14E+04 | 0.93E+02 | . | 4.561 s | 4.562 s
SCF 2 : 0.10E+01 | 0.11E+01 | 0.78E+03 | 0.91E+01 | . | 3.726 s | 3.727 s
SCF 3 : 0.10E+01 | 0.55E+00 | 0.39E+03 | 0.21E+01 | . | 3.950 s | 3.949 s
SCF 4 : 0.10E+01 | 0.26E+00 | 0.48E+02 | 0.53E+00 | . | 4.135 s | 4.135 s
SCF 5 : 0.10E+01 | 0.54E-01 | -0.14E+02 | 0.39E-01 | . | 3.566 s | 3.566 s
SCF 6 : 0.10E+01 | 0.20E-01 | 0.37E+01 | 0.12E-01 | . | 4.105 s | 4.105 s
SCF 7 : 0.10E+01 | 0.80E-02 | -0.31E+01 | 0.37E-02 | . | 4.097 s | 4.097 s
SCF 8 : 0.10E+01 | 0.26E-02 | -0.25E+01 | 0.15E-02 | . | 4.110 s | 4.110 s
SCF 9 : 0.10E+01 | 0.88E-03 | -0.16E+00 | 0.20E-03 | . | 4.111 s | 4.110 s
SCF 10 : 0.10E+01 | 0.28E-03 | 0.63E-01 | -0.14E-05 | . | 4.110 s | 4.110 s
SCF 11 : 0.10E+01 | 0.12E-03 | 0.94E-02 | 0.75E-05 | . | 4.087 s | 4.087 s
SCF 12 : 0.10E+01 | 0.41E-04 | -0.14E-01 | 0.57E-05 | . | 4.009 s | 4.009 s
SCF 13 : 0.10E+01 | 0.16E-04 | -0.24E-03 | 0.18E-06 | 0.20E-01 | 25.396 s | 25.400 s
```
Here are the most important time components visualized, for the two different numbers of cores:
As you can see, most parts of the calculation scale fairly well with the number of cores. What sticks out is the eigenvalue solver, which takes approximately the same time on 40 and on 240 cores. The reason is that the actual computational time is already nearly negligible (keep in mind that the eigensolver is called in every single SCF step; with \(<30\,{\rm s}\) cumulative time over \(>100\) SCF steps, each call to the eigensolver takes \(<0.3\,{\rm s}\)). At this point, the total time is dominated by the communication between the individual cores, which is not reduced when using more cores.
The memory requirements, on the other hand, do not decrease significantly when distributing over more cores in this case. The highest tracked value of used memory on any process decreases only slightly, from 21.986 MB on 40 cores to 18.962 MB on 240 cores. The reason is that some information is required on all processes and cannot really be distributed; instead, copies of identical information are stored on each process. In principle, this residual memory usage could be reduced further, but on any modern computer system there is no need to do so.
It should be noted here that FHI-aims tracks only the memory usage of large arrays. The printed value is thus only a lower bound. The real memory usage, including small arrays and scalars, will be larger.
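The scaling behavior discussed above can be quantified with the usual speedup and parallel-efficiency definitions. A small Python helper follows; note that the 240-core time in the example call is a hypothetical placeholder for illustration, not a value taken from the reference outputs:

```python
def parallel_efficiency(cores_ref, t_ref, cores, t):
    """Speedup and parallel efficiency relative to a reference run.

    speedup    = t_ref / t
    efficiency = speedup / (cores / cores_ref)   (1.0 = ideal scaling)
    """
    speedup = t_ref / t
    efficiency = speedup / (cores / cores_ref)
    return speedup, efficiency

# Reference: the 40-core total wall-clock time from the output above (658.322 s).
# The 240-core time (150.0 s) is a hypothetical placeholder, not a measured value.
s, e = parallel_efficiency(40, 658.322, 240, 150.0)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

Plugging in your own wall-clock totals for two core counts gives a quick check of how far from ideal scaling a given system size is.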
Comparison: All default settings
To assess the impact of the performance keywords used in the above example, we rerun the same calculation without these keywords, effectively setting them to their respective defaults, and compare the results. Only the `use_local_index` keyword is kept, since it should always be set to improve memory efficiency.
```
# VB, December 2016
#
# Input file control.in : All computational details for structure geometry.in
#
# * First, general computational parameters:
#
# Physics choices (model):
#
xc pbe
vdw_correction_hirshfeld
spin none
relativistic none
charge 1.0
# keyword that is always useful for large-scale calculations when allowed (memory efficiency)
use_local_index .true.
# For MD:
MD_run 0.01 NVT_nose-hoover 300 1700
MD_thermostat_units cm^-1
MD_time_step 0.001
output_level MD_light
```
Here are the results using the default settings: