# Part 1: Preparing the Computer Environment

As noted in the introduction, a key part of successfully executing a large-scale FHI-aims run is to understand the computer on which the run will be executed. While this is not all that is needed, many initial problems can be avoided by taking the time to understand the infrastructure we are running on and the software environment in which FHI-aims is embedded.

This part does not yet do much with FHI-aims itself. However, it is critical to read it and assemble the information mentioned below. The pieces needed to run FHI-aims successfully are not many, but managing them all can become complicated if one is not careful.

The tutorial below lists a number of key questions that one should understand (or find out about) in order to successfully and efficiently execute FHI-aims on a given supercomputer.

The answers to the questions listed below are usually included in the documentation that comes with the supercomputer you are using. Regardless of whether that documentation is easy to find, use the questions below as a checklist. You must understand the basic layout of the computer you are operating on; otherwise, it may not be possible for you to understand how FHI-aims can be run successfully on it.

Task: Collect the information below for a (super)computer of your choice. An example is given below.

## What computer am I running on?

Typical supercomputers allow you to run on more than one physical computer at a time. The individual computers that make up a full supercomputer are called compute nodes. Each node contains a number of central processing units, i.e., processor cores. Each node may also include other computing components, called accelerators; most frequently, these are graphical processing units (GPUs), used to accelerate particular numerical operations.

Between one another, the nodes are usually connected with a fast communication network, so that data can be transferred back and forth rapidly between individual compute nodes. Finally, the nodes will all be connected to a file system, within which they can read input files and write output files as needed.

Before starting, the key questions to answer are:

• What types of compute nodes are available on the supercomputer in question?

Often, a single supercomputer can include more than a single type of node. Understanding which type of node you are running on is essential. Do not attempt to mix nodes of different kinds in the same FHI-aims run.

• What type of processor (manufacturer, model number) is used in each compute node?

This sounds tedious and, yes, it can be. However, it cannot be overstated how critical it is to obtain at least a basic understanding of the processors you are using.

The reason is that you will be using binary code (i.e., the aims.<version>.x file, the FHI-aims executable), typically compiled from the source code for one particular architecture but not for another.

• How many physical (i.e., actual) processor cores does each compute node in my supercomputer contain?

It is important to know this number since FHI-aims will parallelize its operations over these physical processor cores. FHI-aims uses a parallelization standard called the "Message Passing Interface", or MPI. Essentially, the MPI infrastructure on a given computer is a set of libraries that allows the code to pass data between individual processes (called MPI tasks) that execute a single FHI-aims calculation in parallel.

In FHI-aims, each MPI task should always be executed on exactly one physical processor core - no two MPI tasks on a single core. Unless memory needs constrain you, you should also attempt to use all processor cores that each compute node offers. All this means that you will need to know how many cores are available in each compute node.
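The core layout of a Linux compute node can be inspected directly. A minimal sketch using standard Linux tools (the exact output format varies between distributions):

```shell
# Count the CPUs visible on this node.
nproc --all                              # logical CPUs; may include hyperthreads
lscpu | grep -E '^(Socket|Core|Thread)'  # sockets, cores per socket, threads per core
# Physical cores = Socket(s) x Core(s) per socket. For FHI-aims, plan one
# MPI task per physical core, not per hyperthread.
```

Run this on a compute node (e.g., in an interactive job), not only on the login node, since login nodes often have different hardware.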

• Are all the processor cores that I am using in my planned run the same make and model?

If they are, that is as it should be. If they are not, do not run the code across different processor types in the same parallel run. Unless you know exactly what you are doing, different processor types can individually produce slightly different numbers, and mixing them in the same run can and will lead to wrong results. Usually, supercomputers are set up such that a parallel run executes on a single type of processor core throughout, i.e., one does not often encounter a truly heterogeneous architecture. But if you do, make sure that your run uses only a single type of processor core.

• Which types of optimization and instruction sets are supported by the processor cores available?

The actual mathematical instructions that you will be using are hardwired into the chip. However, different processor core makes and models will support different instruction sets. For fast execution, it is important to use the instruction set available on your architecture. Do not use any instruction set that isn't supported by the processor cores to be used.
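On Linux x86 nodes, the instruction sets supported by a processor can be read directly from `/proc/cpuinfo`. A quick, Linux-specific check (output varies by CPU model):

```shell
# List the SIMD instruction-set flags advertised by the CPU (Linux x86 only).
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse|avx)' | sort -u \
  || echo "no x86 SIMD flags found (non-x86 architecture?)"
# A Skylake-SP node, for example, would include avx512f in this list.
```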

The actual instruction set that is used is determined by what is compiled into your aims.<version>.x binary file. In other words, the compiler options used when building the FHI-aims code determine what you will be using. Compiling FHI-aims is not part of this tutorial. However, detailed explanations, including known examples for specific supercomputers, are available in the Wiki at https://aims-git.rz-berlin.mpg.de/, to which every user of FHI-aims should have access.

In summary, you need to ensure that the binary code you are about to run uses instructions that the computer actually supports. Mostly, the compiler takes care of this for you.

However, be aware that there is also a selection of code libraries that you will be selecting at runtime, i.e., when queueing an actual job for execution. As laid out below, those libraries include mathematical libraries, MPI libraries and perhaps others, determined by your queueing system. You must ensure that the libraries loaded at runtime use instructions that are supported by the processor core architecture you are using or a given run may fail or give wrong numbers.
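Whether a binary and the currently loaded libraries match can be checked with `ldd`, which resolves the shared libraries a binary will load at runtime. A sketch, where `aims.x` is a placeholder for your actual FHI-aims executable name:

```shell
# 'ldd' resolves the shared libraries a binary will load at runtime.
# Substitute your actual FHI-aims executable for the placeholder:
#   ldd ./aims.x | grep -Ei 'mkl|mpi'
# Demonstrated here on a binary that exists on any Linux system:
ldd /bin/sh
# Any "not found" line means a required module is not loaded or
# LD_LIBRARY_PATH is incomplete for this binary.
```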

• How much memory does each compute node include, and how much memory is available per core?

This part is easier. Each of your compute nodes offers only a finite amount of memory at runtime, and you need to know how much is available. Since (as mentioned above) you will likely be using one MPI task per available processor core, it is additionally helpful to know how much memory is available per core: simply divide the total amount of memory in a node by the number of processor cores in that node.

Usually, 2 GB (gigabytes) per processor core are comfortable for FHI-aims. However, exactly how much memory you will need depends on the type of computation requested.

Running out of memory is a frequent mode of failure for any high-performance code and typically, the failure will not come with an understandable error message. So pay attention to how much memory you have and develop a sense for how much memory you will need for your intended run. This can be done by careful testing and planning ahead of time (also covered in this tutorial).
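The memory-per-core arithmetic is simple. The sketch below uses the 192 GB / 48-core numbers of the Stampede2 SKX nodes discussed later in this part:

```shell
# Memory per core = total node memory / physical cores per node.
total_gb=192   # total memory of one node, in GB
cores=48       # physical cores per node
echo "$((total_gb / cores)) GB per core"   # prints "4 GB per core"
# The node's actual total memory can be read on Linux with, e.g.:
#   free -g | awk '/^Mem:/ {print $2 " GB total"}'
```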

• How many compute nodes of a given type are available on my supercomputer?

After all the previous details, this is a truly simple question. How big is the supercomputer overall, and how many of those nodes are theoretically available to you, for single runs?

In production calculations, you will usually request more than just a single compute node. You will rarely want to request the whole supercomputer at once - instead, you should choose carefully to use a number of nodes that is sufficiently large for your calculation (memory-wise and time-wise) but not too large, in order to guarantee an efficient use of the available resources. In any case, it will be helpful to know the resources that are theoretically available, in order to guarantee feasibility of a given run.

While the wall clock time for computational operations generally decreases as a function of the number of compute nodes used, the wall clock time spent communicating data between nodes increases as more nodes are involved. The computational efficiency of a given run, measured in node-hours required (number of compute nodes used times number of wall-clock hours used for the computation), will therefore usually get worse as the overall number of nodes increases. Determining the right number of nodes for a given type of run on a given computer requires testing, and this tutorial will provide some examples of how such a test is done.
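As a hypothetical illustration of this trade-off (the timings below are invented for the example, not measured):

```shell
# Node-hours = number of nodes x wall-clock hours.
# Invented example: a run taking 4.0 h on 1 node finishes in 1.25 h on 4 nodes.
awk 'BEGIN { printf "1 node : %.1f node-hours\n", 1 * 4.0  }'   # 4.0 node-hours
awk 'BEGIN { printf "4 nodes: %.1f node-hours\n", 4 * 1.25 }'   # 5.0 node-hours
# The 4-node run returns results sooner but consumes 25% more node-hours,
# i.e., its parallel efficiency relative to the 1-node run is 4.0/5.0 = 80%.
```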

• What is the file system that my supercomputer is using and do I have enough disk space available?

Disk space is usually not a problem for standard FHI-aims runs, but using restart files or producing large output such as cube files or molecular dynamics trajectories can lead to enormous disk space needs when large systems are being simulated.

Computer centers frequently limit the disk space available on users' home directories quite severely and, instead, require users to use work directories on a different file system that is not backed up.

You need to ensure that you start your actual FHI-aims run in a directory located on a file system that is large enough for your output files. Additionally, this exact file system must be visible on all compute nodes included in your run. If you are using only one compute node, this is no problem; if you are using more than one node, each of them must see the same file system. Otherwise, FHI-aims will not be able to run successfully.
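A quick way to see which file system the current directory lives on and how much space is left (standard Linux; the multi-node check is a Slurm-specific sketch):

```shell
# Report the file system backing the current directory and its free space.
df -h .
# Inside a multi-node Slurm job, a sanity check that every allocated node
# sees this same directory (hypothetical sketch; requires srun and Slurm):
#   srun -N "$SLURM_NNODES" ls -d "$PWD"
```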

• Do my compute nodes contain accelerators (typically, GPUs)?

In addition to regular CPU cores, some supercomputer nodes include accelerators for particularly compute-heavy tasks. The most common accelerator type available today is the graphical processing unit, or GPU. If accelerators are available, you will want to use them if you can, since the raw computing power they provide can far outweigh the computing power available through the CPUs in the same nodes.

FHI-aims supports GPUs for some operations (semilocal DFT and the eigenvalue solver), but not for others. This tutorial, unfortunately, does not cover GPUs. However, some information is available in the manual and in the Wiki at https://aims-git.rz-berlin.mpg.de/. The GPU capabilities of FHI-aims are described in at least two papers:

• William P. Huhn, Bjoern Lange, Victor W.-z. Yu, Mina Yoon and Volker Blum, “GPU-Accelerated Large-Scale Electronic Structure Theory with a First-Principles All-Electron Code,” Computer Physics Communications 254, 107314 (2020). https://dx.doi.org/10.1016/j.cpc.2020.107314

• Victor Wen-zhe Yu, Jonathan Moussa, Pavel Kus, Andreas Marek, Peter Messmer, Mina Yoon, Hermann Lederer and Volker Blum, “GPU-Acceleration of the ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems,” Computer Physics Communications 262, 107808 (2021). https://dx.doi.org/10.1016/j.cpc.2020.107808

If you are interested in GPU acceleration, those papers are worth looking into.

## Which compilers and libraries will be used to support my computation?

FHI-aims comes as a large source code, but it depends on several external software pieces that are not part of FHI-aims and that are usually present on, and specific to, a given supercomputer. These pieces are:

• Fortran compiler,
• C compiler,
• Mathematical libraries (serial and parallel linear algebra),
• Message-Passing Interface (MPI) library for parallel code execution.

These four pieces must be present and specified correctly when compiling the code.

Critically, the same four pieces must also be set again at runtime, i.e., when executing an actual FHI-aims run through a queueing system.

Finding out which version(s) of this software are available and which ones work correctly with FHI-aims is, unfortunately, very specific to different supercomputers and therefore not always straightforward. Nevertheless, this is a key step. When first building the FHI-aims code, review the information in the Wiki at https://aims-git.rz-berlin.mpg.de/. Test carefully.

Often, the modules present on a supercomputer allow you to select which particular compilers and libraries are used. However, the modules in question must also be requested in the job submission script used to execute the code through a queueing system.

In cases where no module environment is present on a computer, the libraries to be used must still be findable through environment variables. The LD_LIBRARY_PATH environment variable is typically the one that lists which directories will be searched for libraries when a given binary file starts executing.
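A minimal way to inspect this search path, assuming a POSIX shell (the module commands in the comment are the ones used in the Stampede2 example later in this part, not universal names):

```shell
# Print the runtime library search path, one directory per line.
echo "${LD_LIBRARY_PATH:-<empty>}" | tr ':' '\n'
# On a machine with a module system, loading modules typically extends this
# variable, e.g. (module names from the Stampede2 example, not universal):
#   module load intel/19.1.1 impi/19.0.9
```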

One final and equally critical piece to consider is

• the queueing system used to submit computational tasks to the computer

Job submission usually happens by submitting a shell script with some initial instructions for the queueing system, using a command such as:

```shell
sbatch submit.sh
```


or similar, where sbatch is a job submission command (in this example, associated with the slurm queueing system) and submit.sh is a script with instructions, provided by you.

The names of these commands can vary. They will be provided by the supercomputing center in question and they are, in practice, typically different from one supercomputer to the next.

What is essential to understand, however, is that the queueing system can act on its own: it can set and unset libraries, environment variables and other critical settings that matter for your run, with consequences ranging from outright failure to severe performance degradation.

Be sure to understand the actions of the queueing system, and be sure to set as many of the necessary environment options as possible explicitly before executing FHI-aims itself. Many different combinations of compilers and libraries are possible, but they need to be suitable for the computer and hardware available for your run. When in doubt, test.

## An example

A simple way to ensure that the FHI-aims code functions in a basic form (not sufficient to guarantee full correctness of the FHI-aims installation) is to set up a job submission script that executes one of the test cases that is provided with FHI-aims.

The test case we will use is one that takes a few seconds at most: the "H$_2$O relaxation" testcase provided in testcases/H2O-relaxation along with any distributed version of FHI-aims.

We here demonstrate this for the Stampede2 supercomputer at Texas Advanced Computing Center (TACC).

Following the questions above, we find:

• Compute nodes: It turns out there are two types of nodes: SKX, hosting Intel Skylake processors, and KNL, hosting Intel Knights Landing processors. We choose SKX nodes in the following.
• Processor type in each SKX node: Intel Xeon Platinum 8160 ("Skylake").
• Number of processor cores per SKX compute node: 48
• Are all processors we are attempting to use the same? We must make sure that only SKX nodes, with the same processor type, are used in our run. It turns out that on Stampede2, this choice is made when choosing the correct job queue to submit to, in the job submission script (see below).
• What types of optimizations and instruction sets are available for this processor type? It turns out that one can find further specifications of this processor type online. In particular, this processor supports the Intel AVX512 instruction set, for which special compiler optimizations are available, covered at the Wiki entry at https://aims-git.rz-berlin.mpg.de/.
• How much memory is available in each compute node? 192 GB per node, i.e., 4 GB per processor core. This is a comfortable amount.
• How many SKX compute nodes are available on Stampede2? 1,736 nodes, at the time of writing. This is far more than we will need.
• File system: It turns out that Stampede2's user guide states that the file systems called $WORK or $SCRATCH have enough quota for some runs, whereas $HOME does not. This is an important consideration. In addition, files on the $SCRATCH file system are deleted 10 days (!) after they were last accessed, so anything important must be saved elsewhere immediately. None of this is specific to FHI-aims, but for successful execution of simulations with FHI-aims on Stampede2, one must know this information.
• Accelerators: No accelerators (no GPUs) on the SKX nodes of Stampede2.

Furthermore, a detailed inspection of the available software is necessary. We use, in this case, the command `module avail` as a starting point to learn which software exists in prearranged form on Stampede2. Specifically, we choose:

• Intel Fortran compiler as provided in a module called intel/19.1.1
• Intel C compiler as provided in a module called intel/19.1.1
• Intel mathematical libraries ("math kernel library", mkl) as provided in the module intel/19.1.1
• Intel MPI library as provided in a module called impi/19.0.9

It is important to note that there are several other options for compilers and libraries available on Stampede2. Some of them may work, while other combinations might not work with one another or, in the worst case, could introduce numerical errors into our computations. How do we know to use the above choices? We tested. Additionally, a summary of recommended compiler and library choices to build and use the FHI-aims code on Stampede2 is provided in the Wiki at https://aims-git.rz-berlin.mpg.de/. As a user of FHI-aims, you should have or request access to this Wiki.

Finally, the queueing system on Stampede2 is a version of Slurm, for which we use the following job submission script (after some further inspection of the user documentation provided for Stampede2):

```shell
#!/bin/bash
#----------------------------------------------------
# Sample Slurm job script
#   for TACC Stampede2 SKX nodes
#   modified from a script supplied by TACC
#
#   *** MPI Job on SKX Normal Queue ***
#
# Notes:
#
#   -- Launch this script by executing
#      "sbatch submit.sh" on a Stampede2 login node.
#
#   -- Use ibrun to launch MPI codes on TACC systems.
#      Do not use mpirun or mpiexec.
#
#   -- Max recommended MPI ranks per SKX node: 48
#
#   -- If you're running out of memory, try running
#      fewer MPI tasks per node.
#
#   This example is for 1 node, using 48 cores per node, on
#   Stampede2's "SKX" nodes (Intel Skylake processors).
#
#   An equivalent script must be customized for other supercomputers,
#   following the documentation of that particular supercomputer.
#
#   Even on Stampede2 itself, this script must be customized according to your
#   own account and intended run. As an absolute minimum, the words
#
#     - YOUR_ALLOCATION_CODE
#     - PATH_TO_YOUR_FHIAIMS_BINARY
#     - VERSION
#
#   must be replaced by information specific to your own account, directory structure
#   and FHI-aims binary file. This script will not function if you do not make those
#   replacements.
#
#----------------------------------------------------
#SBATCH -J aims           # Job name
#SBATCH -o aims.o%j       # Name of stdout output file
#SBATCH -e aims.e%j       # Name of stderr error file
#SBATCH -p skx-normal     # Queue (partition) name; skx-large for > 128 nodes
#SBATCH -N 1              # Total number of nodes
#SBATCH -t 08:00:00       # Run time (hh:mm:ss)
#SBATCH -A YOUR_ALLOCATION_CODE       # Allocation name (required if you have more than one)
#SBATCH --mail-type=all   # Send email at begin and end of job

# Other commands must follow all #SBATCH directives...
module list
pwd
date

export MKL_DYNAMIC=FALSE
ulimit -s unlimited

# Launch MPI code...
AIMS=PATH_TO_YOUR_FHIAIMS_BINARY/aims.VERSION.scalapack.mpi.x
ibrun $AIMS > aims.out         # Use ibrun instead of mpirun or mpiexec
# ---------------------------------------------------
```


This job submission script example can also be inspected in the solutions folder, as can the results of the calculation.

The relaxation of the H$_2$O molecule worked as expected. Inspect the total energy and relaxation trajectory (e.g., using GIMS) and compare with the test case output supplied with FHI-aims to convince yourself of this.

One interesting tidbit can be found at the very end of the calculation. Specifically, the timing output looks like this:

```
------------------------------------------------------------
Leaving FHI-aims.
Date     :  20211010, Time     :  202237.144

Computational steps:
| Number of self-consistency cycles          :           47
| Number of SCF (re)initializations          :            5
| Number of relaxation steps                 :            4

Detailed time accounting                     :  max(cpu_time)    wall_clock(cpu1)
| Total time                                  :        1.491 s           5.408 s
| Preparation time                            :        0.126 s           0.336 s
| Boundary condition initalization            :        0.002 s           0.036 s
| Grid partitioning                           :        0.037 s           0.128 s
| Free-atom superposition energy              :        0.003 s           0.003 s
| Total time for integrations                 :        0.078 s           0.223 s
| Total time for solution of K.-S. equations  :        0.150 s           0.297 s
| Total time for EV reorthonormalization      :        0.005 s           0.054 s
| Total time for density & force components   :        0.059 s           0.157 s
| Total time for mixing                       :        1.452 s           0.227 s
| Total time for Hartree multipole update     :        0.053 s           0.058 s
| Total time for Hartree multipole sum        :        0.093 s           0.155 s
| Total time for total energy evaluation      :        0.015 s           0.078 s
| Total time NSC force correction             :        0.011 s           0.012 s

Partial memory accounting:
| Residual value for overall tracked memory usage across tasks:     0.000000 MB (should be 0.000000 MB)
| Peak values for overall tracked memory usage:
|   Minimum:        0.216 MB (on task 24 after allocating hessian_basis_wave)
|   Maximum:        0.217 MB (on task  0 after allocating hessian_basis_wave)
|   Average:        0.217 MB
| Largest tracked array allocation:
|   Minimum:        0.114 MB (hessian_basis_wave on task  0)
|   Maximum:        0.114 MB (hessian_basis_wave on task  0)
|   Average:        0.114 MB
Note:  These values currently only include a subset of arrays which are explicitly tracked.
The "true" memory usage will be greater.

Have a nice day.
------------------------------------------------------------
```


Interestingly, the wall clock time (the actually elapsed time for this run) is listed as 5.408 s. The wall clock time as measured here is real and this is the measure that you want to use to assess the resources used in any computation.

In contrast, the elapsed "CPU time" (measured per MPI task) is much lower, 1.491 s. Why is this the case?

The "CPU time" is an internal measure of time used that can be particular to a compiler or operating system and is not always indicative of the actual time used. Ideally, this measure is just a cross-check. If all is well, the "CPU time" should roughly agree with the "wall clock time" on a given, CPU-only computer.

Here, the two time measures do not agree because, compared to the size of the problem (very small), a relatively enormous amount of resources (48 cores of a modern CPU) was thrown at it. Those 48 cores were no longer used efficiently, although the resulting overall time is, of course, completely negligible.

Nevertheless, the testcase output supplied in the actual FHI-aims code distribution at the time of writing only took 1.618 s wall clock time, using four CPU cores. Clearly, this test case is too small to be useful as a real test of a supercomputer. We will come to meaningful test cases later in this tutorial.

## Regression tests: Ensuring that the FHI-aims code works correctly on a given computer

After assembling the information related to a given supercomputer and creating a binary executable version of FHI-aims that works in a given software environment, we could in principle start to run tests of larger systems directly.

If your FHI-aims binary on that computer was already compiled and thoroughly tested by someone else on that computer, moving on to tests of larger systems would be ok.

If you, however, just compiled FHI-aims for the first time on a computer, using a combination of compilers and libraries that was not yet tested, you are not yet done.

A single working testcase of FHI-aims, i.e., the H$_2$O testcase described above, is not at all sufficient to ensure that all compilers and libraries you are using will provide correct output for larger and/or different problem settings. Before going further, it is critical to test a newly compiled binary file using many more tests, i.e., the so-called "regression tests" supplied with FHI-aims.

This set of tests is supplied with the FHI-aims distribution, including a python script that allows for convenient execution and analysis of all regression tests to ensure that they pass.

Unfortunately, the process to execute them can vary heavily from one computer to another, depending on factors such as whether python scripts are allowed within a job submission script, whether multiple instances of the mpirun command can be executed in quick succession on a compute node, etc. Executing the regression tests can be a longer task and takes some care. It is therefore not directly part of this tutorial.

However, information about compiling the code and executing the regression tests can be found at the Wiki of the FHI-aims gitlab server. A short tutorial about running the regression tests can be found in the wiki: https://aims-git.rz-berlin.mpg.de/aims/FHIaims/-/wikis/How-to-run-the-regression-tests

We highly recommend familiarizing yourself with the regression tests and ensuring that they execute correctly, at least for a small number of CPU cores (some tests are too small for execution on a large number of cores). Then, proceed to use the benchmarks supplied with FHI-aims to make sure that the computer you are using performs correctly and with the expected speed (see the next parts of this tutorial).