Part 3: Large benchmark: Graphene on SiC(0001)

In contrast to the smaller benchmark discussed in the previous part, the next test case illustrates the general philosophy of a larger benchmark, aimed primarily at testing computational speed. In a real 'production' calculation, FHI-aims will perform equivalent tasks repeatedly. This is the case especially for the steps in the SCF procedure, and to a lesser degree also for the single point calculations in a molecular dynamics or geometry optimization run. Either way, the run time of a typical calculation will be heavily dominated by the SCF procedure (except when computationally demanding post-processing is performed).

For each geometry, FHI-aims typically runs one initialization, followed by N = 10-40 SCF iterations that all perform the same tasks, then possibly one force/stress iteration and any post-processing, which can also be relevant. The number of SCF steps until convergence is largely (in an ideal world: completely) independent of the computing environment, and the computational time per SCF step is roughly constant, as the same tasks are performed in each step, just with different numbers. To compare computational speed between different setups, it is therefore often sufficient to compare the time needed for a single SCF step. Running additional steps only consumes computing time without providing additional information.
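The reasoning above amounts to a simple additive cost model for one geometry. The following is a minimal sketch of that model; the function name `estimate_run_time` and all timing values are hypothetical and only illustrate how a single measured SCF step can be extrapolated.

```python
def estimate_run_time(t_init, t_scf, n_scf, t_forces=0.0):
    """Rough wall-clock estimate (seconds) for one geometry.

    Assumes one initialization, n_scf SCF steps of equal cost,
    and an optional final force/stress step.
    """
    return t_init + n_scf * t_scf + t_forces


# Hypothetical timings in seconds, for illustration only.
estimate = estimate_run_time(t_init=200.0, t_scf=100.0, n_scf=25)
print(f"Estimated time for one geometry: {estimate:.0f} s")
```

Because the SCF term is multiplied by the number of steps, a change in the time per SCF step carries over almost directly to the total run time.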

For a benchmark specific to a particular task you intend to perform, you might also want to:

benchmark any other type of operation / iteration that is relevant for the planned simulation (e.g., forces/stresses, post-processing, etc.)
plot memory estimates

Here, we focus on the simpler case in which the SCF procedure is the bottleneck. We benchmark a run that consists of:

One initialization iteration (which differs from the other SCF iterations)
Two SCF iterations only, without forces

The SiC system used in this test case (see https://doi.org/10.1103/PhysRevLett.111.065502 for the physics behind this test case) is larger than the previous one and is expected to show excellent scaling up to several thousand CPU cores.

system

Note: This is a large system - 1648 atoms in the unit cell, with tight settings. The scaling test that we propose here runs up to 2,000 CPU cores (see below). As you will see, this number is chosen to be appropriate for this case. However, do not simply assume that other cases - importantly, cases that are much smaller in size and/or that use light settings - will run efficiently with the same large core counts as well. Use the strategy laid out in this tutorial to test what resources are actually appropriate for your own intended production workload.

Here is the general part of the control.in file:

#########################################################################################
#
#   input file control.in tight for scaling studies
#
#########################################################################################
#
#  Physical model
#
  xc                 pbe
  vdw_correction_hirshfeld
  spin               none
  relativistic       atomic_zora scalar
  charge             0.
  k_grid 1 1 1
#
#  SCF convergence
# 
  sc_iter_limit      2
#
  use_local_index .true.

Results

First step: Check the numbers (energies) that come out of the benchmark. Do they look correct?
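One way to make this check systematic is to compare the total energies reported after the same SCF iteration across all runs. The snippet below is a minimal sketch; the energies and the tolerance `tol_ev` are hypothetical placeholders, and a suitable tolerance depends on your own accuracy requirements.

```python
def energies_agree(energies_ev, tol_ev=1e-5):
    """Return True if all energies lie within tol_ev of the first entry."""
    reference = energies_ev[0]
    return all(abs(e - reference) <= tol_ev for e in energies_ev)


# Hypothetical total energies (eV) after the same SCF iteration,
# keyed by the number of cores used. Not actual benchmark results.
energies = {240: -1234567.890123, 960: -1234567.890124, 1920: -1234567.890122}

if energies_agree(list(energies.values())):
    print("Energies agree across core counts.")
else:
    print("Energies differ: inspect the runs before trusting the timings.")
```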

Next, let us look at the timing results:

total time

init time

The total time of this run is interesting, but it is even more instructive to plot the individual subtimings. In particular, the SCF iterations would heavily dominate the overall run time in a real run, since they are executed far more often than the other types of steps. In a 'single point' (i.e. single geometry, no relaxation or MD) calculation like this, FHI-aims will print detailed timing information after each SCF step like this:

  End self-consistency iteration #     1       :  max(cpu_time)    wall_clock(cpu1)
  | Time for this iteration                     :      106.736 s         106.804 s
  | Charge density update                       :       22.398 s          22.401 s
  | Density mixing & preconditioning            :       10.253 s          10.251 s
  | Hartree multipole update                    :        0.245 s           0.244 s
  | Hartree multipole summation                 :        6.516 s           6.529 s
  | Integration                                 :       12.979 s          12.982 s
  | Solution of K.-S. eqns.                     :       54.295 s          54.320 s
  | Total energy evaluation                     :        0.051 s           0.001 s

Similar information is given about the initialization.
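To collect these subtimings for plotting, the block can be parsed with a short script. The following is a sketch that assumes the exact line layout shown above; the regular expression and the helper name `parse_timings` are illustrative and are not part of FHI-aims. It handles a single timing block, so for a full output file you would apply it per iteration.

```python
import re

SAMPLE = """\
  End self-consistency iteration #     1       :  max(cpu_time)    wall_clock(cpu1)
  | Time for this iteration                     :      106.736 s         106.804 s
  | Charge density update                       :       22.398 s          22.401 s
  | Density mixing & preconditioning            :       10.253 s          10.251 s
  | Hartree multipole update                    :        0.245 s           0.244 s
  | Hartree multipole summation                 :        6.516 s           6.529 s
  | Integration                                 :       12.979 s          12.982 s
  | Solution of K.-S. eqns.                     :       54.295 s          54.320 s
  | Total energy evaluation                     :        0.051 s           0.001 s
"""

# Matches lines of the form "| <label> : <cpu> s <wall> s".
TIMING_LINE = re.compile(
    r"^\s*\|\s*(?P<label>.+?)\s*:\s*(?P<cpu>[\d.]+) s\s+(?P<wall>[\d.]+) s\s*$"
)


def parse_timings(text):
    """Return {label: (cpu_seconds, wall_seconds)} for one timing block."""
    timings = {}
    for line in text.splitlines():
        match = TIMING_LINE.match(line)
        if match:
            timings[match.group("label")] = (
                float(match.group("cpu")),
                float(match.group("wall")),
            )
    return timings


timings = parse_timings(SAMPLE)
total_wall = timings["Time for this iteration"][1]
for label, (_, wall) in timings.items():
    print(f"{label:35s} {wall:9.3f} s  {wall / total_wall:6.1%}")
```

For the sample shown, the solution of the Kohn-Sham equations accounts for roughly half of the wall-clock time of the iteration.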

Let us now see how the time required for one SCF step, and its most important components, scale with the number of cores:

step time

In contrast to the smaller test case discussed in the previous part, we can now clearly see that not only the workload but also the stored data is distributed between processes. From the final part of the main output file we can extract the memory estimate printed by FHI-aims (here for 240 cores):

    Partial memory accounting:
    | Residual value for overall tracked memory usage across tasks:     0.000000 MB (should be 0.000000 MB)
    | Peak values for overall tracked memory usage:
    |   Minimum:      474.466 MB (on task   3 after allocating gradient_basis_wave)
    |   Maximum:      713.025 MB (on task  60 after allocating gradient_basis_wave)
    |   Average:      608.440 MB
    | Largest tracked array allocation:
    |   Minimum:      120.469 MB (ovlp on task 229)
    |   Maximum:      189.818 MB (ham_ovlp_work on task  60)
    |   Average:      138.929 MB
    Note:  These values currently only include a subset of arrays which are explicitly tracked.
    The "true" memory usage will be greater.

The highest tracked memory usage on any process is 713.025 MB on 240 cores; it drops to 294.877 MB on 960 cores and to 217.522 MB on 1920 cores.
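These three values can be compared with what a perfectly distributed memory footprint would give, where the per-task memory is proportional to 1/(number of cores). The sketch below uses only the numbers quoted above; the helper name `memory_reduction` is hypothetical.

```python
# Peak tracked memory per task (MB), as quoted in the text.
peak_mb = {240: 713.025, 960: 294.877, 1920: 217.522}


def memory_reduction(peaks):
    """Return {cores: (actual_factor, ideal_factor)} relative to the smallest run."""
    base_cores = min(peaks)
    base_mem = peaks[base_cores]
    return {
        cores: (base_mem / mem, cores / base_cores)
        for cores, mem in sorted(peaks.items())
    }


for cores, (actual, ideal) in memory_reduction(peak_mb).items():
    print(f"{cores:5d} cores: memory reduced {actual:4.2f}x (ideal {ideal:4.1f}x)")
```

The reduction is smaller than the ideal factor, which is consistent with part of the tracked memory not being distributed across tasks.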

Recommendations

If the parallelization of the code were ideal and there were no communication overhead, the execution time would scale as 1/N, where N is the number of CPU cores used. This regime is indicated by the slope of the thin brown lines signifying "linear scaling" in all plots above.

However, in practice, every code will have some per-task overhead that is not parallelizable. In addition, the larger N, the more communication will be needed to synchronize the execution of different computational pieces between different cores.

In short, as we add more CPU cores and MPI tasks, the execution time will deviate from 1/N scaling and will eventually flatten out or even increase, a behavior anticipated by "Amdahl's law".
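The flattening can be illustrated with Amdahl's law, which gives the speedup on N cores as 1 / (s + (1 - s) / N) for a non-parallelizable fraction s of the work. The following sketch uses a hypothetical serial fraction chosen only for illustration; it is not a measured property of FHI-aims. Note that Amdahl's law alone describes only the flattening. An actual increase in run time comes from communication overhead, which is not modeled here.

```python
def amdahl_speedup(n_cores, serial_fraction):
    """Speedup relative to one core according to Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# Hypothetical serial fraction of 0.1 %, for illustration only.
SERIAL_FRACTION = 0.001

for n in (240, 480, 960, 1920):
    speedup = amdahl_speedup(n, SERIAL_FRACTION)
    print(f"{n:5d} cores: speedup {speedup:7.1f}, efficiency {speedup / n:5.1%}")
```

Even with such a small serial fraction, the speedup can never exceed 1/s, so the parallel efficiency drops steadily as cores are added.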

From a resource usage point of view, it is frequently desirable to optimize the number of "node-hours" needed for a given task - i.e., number of nodes used multiplied by the number of wall-clock hours spent on a given task.

As the number of CPU cores increases, the number of node-hours used will also increase slowly unless the code shows ideal 1/N scaling.
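Node-hours and parallel efficiency are easy to tabulate once the step times are known. The sketch below assumes 48 cores per node and uses made-up step times; both are hypothetical placeholders that should be replaced with the values for your own machine and measurements.

```python
CORES_PER_NODE = 48  # assumption: depends on the actual hardware

# Hypothetical wall-clock times (s) for one SCF step, keyed by core count.
step_time_s = {240: 1000.0, 480: 520.0, 960: 290.0, 1920: 180.0}


def node_hours(cores, wall_seconds, cores_per_node=CORES_PER_NODE):
    """Number of nodes multiplied by the wall-clock time in hours."""
    return (cores / cores_per_node) * wall_seconds / 3600.0


def parallel_efficiency(ref_cores, ref_seconds, cores, wall_seconds):
    """Efficiency relative to a reference run; 1.0 means ideal 1/N scaling."""
    return (ref_cores * ref_seconds) / (cores * wall_seconds)


ref_cores = min(step_time_s)
ref_seconds = step_time_s[ref_cores]
for cores, seconds in sorted(step_time_s.items()):
    cost = node_hours(cores, seconds)
    eff = parallel_efficiency(ref_cores, ref_seconds, cores, seconds)
    print(f"{cores:5d} cores: {cost:5.2f} node-hours, efficiency {eff:5.1%}")
```

With ideal scaling the node-hours would stay constant; any efficiency below 100 % shows up directly as additional node-hours.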

In the example of SiC above, qualitatively reasonable scaling is certainly observed up to 1,000 CPU cores. For 2,000 CPU cores, the resource usage increases somewhat, although the overall wall clock continues to decrease.

The eigenvalue solver is the scaling bottleneck for semilocal DFT calculations. FHI-aims, in fact, makes use of the rather efficient ELPA eigenvalue solver, which keeps simulations of large systems manageable within our approach.

Improvement compared to older results

Let us compare the timing for 1 SCF step to the results from an older (2016) benchmark:

step time 2016

The time for 1 SCF step has improved since 2016, from approximately 4000 s to 2300 s (240 cores). The main reason for the speedup is the improvement of the hardware, but some code improvements internal to FHI-aims also make a difference.