Part 2: Small benchmark: Ac-Ala19-LysH+
Benchmarks
The `benchmarks` subdirectory of the FHI-aims repository contains two benchmark cases with reference results that can be used to (roughly) assess the parallel performance of FHI-aims on HPC platforms.
These cases demonstrate both the scaling with the number of CPU cores and the expected approximate runtimes for FHI-aims on current (2021) Intel Skylake hardware with a fast interconnect between nodes.
Reference output files are provided for 40 and 240 CPU cores for the small case, and for 240, 960, and 1920 CPU cores for the large case, generated on the Max Planck Society's Cobra supercomputer. The CPUs used in these runs are Intel Xeon Skylake-SP processors (20 cores per node @ 2.4 GHz), dating from 2018. The interconnect is a fast OmniPath network. The ELPA kernel was set to "generic" (using optimized AVX kernels for the ELPA library would likely generate faster code), and Intel Fortran and MKL versions from 2020 were employed.
Example `submit.sh` scripts are specific to the SLURM queueing system and to the installation on the Cobra supercomputer.
Two typical log-log scaling plots are also provided with the benchmarks.
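Since the shipped `submit.sh` scripts are site-specific, a minimal SLURM script for the small (40-core) case might look like the following sketch. The partition and module names and the binary path are assumptions and must be adapted to your cluster:

```shell
#!/bin/bash -l
#SBATCH --job-name=aims_small_bench
#SBATCH --nodes=2                 # 2 nodes x 20 cores = 40 MPI tasks (small case)
#SBATCH --ntasks-per-node=20
#SBATCH --time=01:00:00
#SBATCH --partition=general       # hypothetical partition name - adapt to your site

# Hypothetical module names - adapt to your installation:
module load intel mkl impi

# Run from the directory containing control.in and geometry.in;
# the path to the FHI-aims binary must be adapted:
srun /path/to/aims.x > aims.out 2>&1
```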
About the small benchmark
In the following, we will compute the small benchmark twice: once with the settings found in the current benchmark folder of the FHI-aims distribution (`FHIaims/benchmarks/Ac-Lys-Ala19-H/`) and once with the FHI-aims default settings, adding only `use_local_index`. The origin of this benchmark dates back to 2016. However, the FHI-aims default settings have improved since then, and there is actually no need (except for `use_local_index`) to set them explicitly. The only reason to still use the "old" `control.in` is to keep the benchmark consistent over time. For a production calculation of this system, setting only `use_local_index` would be sufficient.
"Small" benchmark
The benchmark timings shown for this benchmark include 10 MD steps (i.e. ten converged SCF cycles). Since this is a "small" test case (220 atoms, 2072 basis functions), scalability on a current (2016) HPC system is expected to extend up to several hundred CPU cores. This smaller benchmark can be used to test if the FHI-aims build (compiler version, compiler flags, libraries etc.) works correctly on the cluster - just compare your results with the benchmark reference.
Optimizing performance in an HPC environment
There are several keywords which can help improve performance in large-scale simulations. The general part (everything except the species definitions) of a `control.in` could look something like this:
```
# VB, KML, JF 2021
#
# Input file control.in : All computational details for structure geometry.in
#
# * First, general computational parameters:
#
# Physics choices (model):
#
xc pbe
vdw_correction_hirshfeld
spin none
relativistic none
charge 1.0
#
# SCF cycle:
#
mixer pulay
n_max_pulay 10
charge_mix_param 0.4
#
# accuracy / efficiency for normal scf
#
density_update_method density_matrix
empty_states 3
#
#
use_local_index .true.
load_balancing .true.
# For MD:
MD_run 0.01 NVT_nose-hoover 300 1700
MD_thermostat_units cm^-1
MD_time_step 0.001
output_level MD_light
```
- `mixer pulay` sets the density-mixing algorithm for the SCF cycle to the Pulay mixer. The Pulay mixer usually leads to convergence within a few SCF cycles for many systems and is therefore also the default. `n_max_pulay` defines the maximum number of previous SCF steps which are mixed into the output density of a given step. The density change between steps is additionally multiplied by `charge_mix_param`. Smaller values of the latter thus lead to more conservative, more stable but slower convergence, whereas larger values allow more drastic changes in density between SCF steps. Usually, FHI-aims is capable of figuring out the correct mixing parameters. Only test and set these parameters explicitly if you know what you are doing - typically, the defaults set by the code should work with acceptable performance as well.
- `density_update_method` defines whether the electron density is updated using the Kohn-Sham orbitals or the density matrix. This keyword is only relevant for mid-sized non-periodic systems. The orbital-based density update is cheaper for small systems but scales as \(O(N^2)\) with system size \(N\). The density-matrix-based update scales as \(O(N)\) and is therefore the method of choice for large systems. In an intermediate size range, the code makes a decision based on our own past benchmarking, but an explicit benchmark for a given case (e.g., prior to a long molecular dynamics run) can offer some additional speedup. For periodic systems, only the density-matrix-based update is available.
- `empty_states` specifies the number of unoccupied Kohn-Sham states computed by the eigensolver. Reducing this number decreases the time spent in the eigensolver. However, test this reduction before using it: be certain that your system is non-metallic and that no post-processing will be performed which needs unoccupied states explicitly, e.g. spin-orbit coupling, or correlated methods such as GW. For such cases, increasing `empty_states` might be desirable to get converged results.
- When `use_local_index` is set to `.true.`, each process will store only those parts of the Hamiltonian and overlap matrices which are influenced by any grid point that is assigned to that process. If `.false.`, each process will store a full copy of these matrices, greatly increasing memory requirements. Setting `use_local_index .true.` is always recommended for system sizes greater than (say) 100 atoms, as long as it is supported internally by the code for the requested operations.
- `load_balancing`: While `use_local_index` resolves memory bottlenecks, it is not optimized for computational time; the computational workload may be unevenly distributed between processes. `load_balancing` redistributes grid points between processes to eliminate this negative impact on performance. We recommend trying this keyword in addition to `use_local_index` to improve performance.
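To build some intuition for the role of `charge_mix_param`, the effect of a mixing parameter can be illustrated with a toy fixed-point iteration. This is a didactic sketch in plain Python, not FHI-aims code: for the map below, undamped iteration diverges, while damped (linearly mixed) iteration with a small enough mixing parameter converges, with smaller values needing more steps:

```python
def mix_iterate(f, x0, alpha, tol=1e-10, max_iter=1000):
    """Damped fixed-point iteration: x_new = x + alpha * (f(x) - x).

    alpha plays the role of a mixing parameter: smaller values are
    more stable but need more iterations to converge.
    Returns (approximate fixed point, number of iterations used).
    """
    x = x0
    for n in range(1, max_iter + 1):
        x_new = x + alpha * (f(x) - x)
        if abs(x_new - x) < tol:
            return x_new, n
        x = x_new
    return x, max_iter

# Toy map with fixed point x* = 1; undamped iteration (alpha = 1)
# diverges because |f'(x)| = 2 > 1.
f = lambda x: -2.0 * x + 3.0

x_a, n_a = mix_iterate(f, 0.0, alpha=0.4)   # converges quickly
x_b, n_b = mix_iterate(f, 0.0, alpha=0.1)   # converges, but needs more steps
print(x_a, n_a, x_b, n_b)
```

The SCF analogue is only loose (FHI-aims mixes densities, not scalars, and Pulay mixing additionally extrapolates over previous steps), but the stability-versus-speed trade-off of the damping parameter is the same.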
We run this small benchmark on 40 and 240 cores. A summary of timings and memory used in a calculation is printed at the end of the main output:
```
------------------------------------------------------------
Leaving FHI-aims.
Date : 20211014, Time : 191131.295
Computational steps:
| Number of self-consistency cycles : 110
| Number of SCF (re)initializations : 10
| Number of molecular dynamics steps : 10
| Number of force evaluations : 10
Detailed time accounting : max(cpu_time) wall_clock(cpu1)
| Total time : 653.596 s 658.322 s
| Preparation time : 0.218 s 1.478 s
| Boundary condition initalization : 0.059 s 0.009 s
| Grid partitioning : 9.478 s 9.496 s
| Preloading free-atom quantities on grid : 15.578 s 15.670 s
| Free-atom superposition energy : 1.796 s 1.807 s
| Total time for integrations : 90.555 s 91.271 s
| Total time for solution of K.-S. equations : 24.952 s 25.248 s
| Total time for EV reorthonormalization : 0.485 s 0.487 s
| Total time for density & force components : 200.626 s 200.666 s
| Total time for mixing : 1.301 s 1.215 s
| Total time for Hartree multipole update : 1.351 s 1.351 s
| Total time for Hartree multipole sum : 232.417 s 232.440 s
| Total time for total energy evaluation : 0.016 s 0.082 s
| Total time NSC force correction : 65.551 s 65.668 s
| Total time for vdW correction : 7.628 s 7.631 s
Partial memory accounting:
| Residual value for overall tracked memory usage across tasks: 0.000000 MB (should be 0.000000 MB)
| Peak values for overall tracked memory usage:
| Minimum: 19.816 MB (on task 3 after allocating d_wave)
| Maximum: 21.986 MB (on task 30 after allocating d_wave)
| Average: 21.171 MB
| Largest tracked array allocation:
| Minimum: 4.367 MB (hessian_basis_wave on task 0)
| Maximum: 4.367 MB (hessian_basis_wave on task 0)
| Average: 4.367 MB
Note: These values currently only include a subset of arrays which are explicitly tracked.
The "true" memory usage will be greater.
Have a nice day.
------------------------------------------------------------
```
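Because the time-accounting block has a regular format, the per-component timings can also be extracted programmatically, e.g. to compare runs at different core counts. The following Python sketch assumes the line format shown above; adapt the regular expression if your FHI-aims version prints differently:

```python
import re

# Matches timing lines such as:
# | Total time for integrations : 90.555 s 91.271 s
_TIMING = re.compile(r"\|\s*(.+?)\s*:\s*([0-9.]+)\s*s\s+([0-9.]+)\s*s")

def parse_time_accounting(text):
    """Return {description: (max_cpu_time, wall_clock)} in seconds."""
    timings = {}
    for line in text.splitlines():
        m = _TIMING.search(line)
        if m:
            timings[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return timings

sample = """
| Total time : 653.596 s 658.322 s
| Total time for integrations : 90.555 s 91.271 s
"""
print(parse_time_accounting(sample)["Total time"])  # -> (653.596, 658.322)
```

Counter lines without the trailing "s ... s" pattern (e.g. "| Number of force evaluations : 10") are skipped automatically.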
`cpu_time` vs. `wall_clock`
In the above snippet, you find two timings: `cpu_time` and `wall_clock`. It is important to note that the `wall_clock` time is the timing one should look at: it is the actual elapsed time as measured by the system.
Additionally, more detailed timing information is printed throughout the run, depending on the type of calculation. In our MD run, for example, the time spent on each SCF step, for each geometry, is summarized like this:
```
Convergence: q app. | density | eigen (eV) | Etot (eV) | forces (eV/A) | CPU time | Clock time
SCF 1 : 0.10E+01 | 0.23E+01 | 0.14E+04 | 0.93E+02 | . | 4.561 s | 4.562 s
SCF 2 : 0.10E+01 | 0.11E+01 | 0.78E+03 | 0.91E+01 | . | 3.726 s | 3.727 s
SCF 3 : 0.10E+01 | 0.55E+00 | 0.39E+03 | 0.21E+01 | . | 3.950 s | 3.949 s
SCF 4 : 0.10E+01 | 0.26E+00 | 0.48E+02 | 0.53E+00 | . | 4.135 s | 4.135 s
SCF 5 : 0.10E+01 | 0.54E-01 | -0.14E+02 | 0.39E-01 | . | 3.566 s | 3.566 s
SCF 6 : 0.10E+01 | 0.20E-01 | 0.37E+01 | 0.12E-01 | . | 4.105 s | 4.105 s
SCF 7 : 0.10E+01 | 0.80E-02 | -0.31E+01 | 0.37E-02 | . | 4.097 s | 4.097 s
SCF 8 : 0.10E+01 | 0.26E-02 | -0.25E+01 | 0.15E-02 | . | 4.110 s | 4.110 s
SCF 9 : 0.10E+01 | 0.88E-03 | -0.16E+00 | 0.20E-03 | . | 4.111 s | 4.110 s
SCF 10 : 0.10E+01 | 0.28E-03 | 0.63E-01 | -0.14E-05 | . | 4.110 s | 4.110 s
SCF 11 : 0.10E+01 | 0.12E-03 | 0.94E-02 | 0.75E-05 | . | 4.087 s | 4.087 s
SCF 12 : 0.10E+01 | 0.41E-04 | -0.14E-01 | 0.57E-05 | . | 4.009 s | 4.009 s
SCF 13 : 0.10E+01 | 0.16E-04 | -0.24E-03 | 0.18E-06 | 0.20E-01 | 25.396 s | 25.400 s
```
Here are the most important time components visualized, for the two different numbers of cores:
As you can see, most parts of the calculation scale fairly well with the number of cores. What sticks out is the eigenvalue solver, which takes approximately the same time on 40 and on 240 cores. The reason is that the actual computational time is already nearly negligible (keep in mind that the eigensolver is called in every single SCF step; with \(<30\,{\rm s}\) cumulative time over \(>100\) SCF steps, each call to the eigensolver takes \(<0.3\,{\rm s}\)). At this point, the total time is dominated by the communication between the individual cores, which is not reduced when using more cores.
The memory requirements, on the other hand, do not decrease significantly when distributing over more cores in this case. The highest tracked value of used memory on any process decreases only slightly, from 21.986 MB on 40 cores to 18.962 MB on 240 cores. The reason is that some information is required on all processes and cannot really be distributed; instead, copies of identical information are stored on each process. In principle, this residual memory usage could be reduced further, but on any modern computer system there is no need to do so.
It should be noted here that FHI-aims tracks only the memory usage of large arrays. The printed value is thus only a lower bound. The real memory usage, including small arrays and scalars, will be larger.
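The scaling behavior discussed above can be quantified with the usual speedup and parallel-efficiency definitions. A small Python helper follows; note that the 240-core time in the example call is a hypothetical placeholder for illustration, not a value taken from the reference outputs:

```python
def parallel_efficiency(cores_ref, t_ref, cores, t):
    """Speedup and parallel efficiency relative to a reference run.

    speedup    = t_ref / t
    efficiency = speedup / (cores / cores_ref)   (1.0 = ideal scaling)
    """
    speedup = t_ref / t
    efficiency = speedup / (cores / cores_ref)
    return speedup, efficiency

# Reference: the 40-core total wall-clock time from the output above (658.322 s).
# The 240-core time (150.0 s) is a hypothetical placeholder, not a measured value.
s, e = parallel_efficiency(40, 658.322, 240, 150.0)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

Plugging in your own wall-clock totals for two core counts gives a quick check of how far from ideal scaling a given system size is.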
Comparison: All default settings
To assess the impact of the performance keywords used in the above example, we rerun the same calculation without these keywords, effectively setting them to their respective defaults, and compare the results. Only the `use_local_index` keyword is kept, since it should always be set to improve memory efficiency.
```
# VB, December 2016
#
# Input file control.in : All computational details for structure geometry.in
#
# * First, general computational parameters:
#
# Physics choices (model):
#
xc pbe
vdw_correction_hirshfeld
spin none
relativistic none
charge 1.0
# keyword that is always useful for large-scale calculations when allowed (memory efficiency)
use_local_index .true.
# For MD:
MD_run 0.01 NVT_nose-hoover 300 1700
MD_thermostat_units cm^-1
MD_time_step 0.001
output_level MD_light
```
Here are the results using the default settings: