Tutorial. Working with a resource manager on an HPC infrastructure | LATIN AMERICA HIGH PERFORMANCE COMPUTING CONFERENCE: CARLA 2021


Registration for this event is closed.

Synopsis:

Instructors: Eugenio Guerra - Esteban Osorio

Language: Spanish

National Laboratory for High Performance Computing (NLHPC)

Attendance: 80 people

Requirements:
To follow the course and the commands the instructors will run, knowledge of Linux is recommended.

This workshop shows how to use Slurm, the resource management system present on the vast majority of TOP500 supercomputers. The Leftraru-Guacolda cluster of the National Laboratory for High Performance Computing (NLHPC) will be used.

The tutorial consists of two sessions of 4 hours each.

The contents of session 1 are as follows:

Module I

NLHPC infrastructure
Presentation of NLHPC infrastructure
Accessing the cluster and submitting jobs
NLHPC Login Nodes
Basic use of Slurm
Using the srun command and its parameters
Using the sbatch command
Basic script
Queuing jobs
Monitoring jobs
Cancelling jobs
Resource underutilization
Other basic tasks
Available software
Listing available software
Using available software
Computational efficiency
Others
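
As an illustration of the "Basic script" topic above, here is a minimal sketch of what a Slurm batch script looks like, generated from Python so that the pieces are easy to see. The function name `build_batch_script`, the job name, and the resource values are illustrative assumptions, not the script used in the course; partition names and limits on Leftraru-Guacolda are not given here and are left out.

```python
def build_batch_script(job_name, ntasks, cpus_per_task, time_limit, command):
    """Return the text of a minimal Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",            # name shown in the queue
        f"#SBATCH --ntasks={ntasks}",                # number of tasks (processes)
        f"#SBATCH --cpus-per-task={cpus_per_task}",  # cores per task
        f"#SBATCH --time={time_limit}",              # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",                # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Illustrative values only; adjust to the resources the job really needs.
script = build_batch_script("hello", 1, 1, "00:05:00", "srun hostname")
print(script)
```

Saved to a file such as `job.sh` (a hypothetical name), the script would typically be submitted with `sbatch job.sh`, followed in the queue with `squeue -u $USER`, and cancelled with `scancel` plus the job ID.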

Module II

Parallel programming (basic notions)
Shared Memory Model (OpenMP)
Message passing model (MPI)
Running simulations
Sequential jobs
OpenMP jobs
MPI jobs
Multiple sequential jobs (job arrays)
Jobs that use GPUs
Job dependencies
Task scheduling using crontab
Checkpoint/Restart
Simulation monitoring
Monitoring simulations using HTTP
Monitoring simulations using Ganglia
Utilization graphs in notification email
Installing and compiling applications
Compilers and flags used
Compiling programs from source code
Installing Python modules
Installing R packages
Frequent problems
Cancellation due to excess memory
Cancellation due to CPU underutilization
Cancellation due to memory underutilization
Resource overuse
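
To make the "job array" and "job dependencies" topics above more concrete, the following sketch builds the corresponding `sbatch` command lines in Python. The helper names `array_flag` and `dependency_flag`, the script names `simulate.sh` and `postprocess.sh`, and the job ID 12345 are all hypothetical; only the `--array` and `--dependency` options themselves are standard Slurm.

```python
def array_flag(first, last, max_parallel=None):
    """Build an sbatch --array option, e.g. --array=1-10%4 (hypothetical helper)."""
    spec = f"--array={first}-{last}"
    if max_parallel is not None:
        # %N limits how many array tasks run at the same time
        spec += f"%{max_parallel}"
    return spec


def dependency_flag(job_ids, condition="afterok"):
    """Build an sbatch --dependency option, e.g. --dependency=afterok:12345."""
    ids = ":".join(str(j) for j in job_ids)
    return f"--dependency={condition}:{ids}"


# Ten independent sequential runs, at most four running at once.
array_cmd = ["sbatch", array_flag(1, 10, max_parallel=4), "simulate.sh"]
# A follow-up job that starts only if job 12345 (made-up ID) finishes successfully.
post_cmd = ["sbatch", dependency_flag([12345]), "postprocess.sh"]

print(" ".join(array_cmd))
print(" ".join(post_cmd))
```

Inside an array job, each task can read its own index from the `SLURM_ARRAY_TASK_ID` environment variable, which is a common way to select a different input file per task.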