ROOM 2. Scaling Data Analytics with Dask | LATIN AMERICA HIGH PERFORMANCE COMPUTING CONFERENCE: CARLA 2023

LATIN AMERICA HIGH PERFORMANCE COMPUTING CONFERENCE CARLA 2023

ROOM 3

Monday September 18, 2023

14:00 to 17:30 hours

Instructors:

Aurelio Antonio Vivas Meza (Colombia), Universidad de los Andes, Argonne National Laboratory
Diego Fernando Roa (Colombia), University of Delaware, Argonne National Laboratory
Natalia Clementi (Argentina), Coiled
John Alexander Sanabria Ordoñez (Colombia), Universidad del Valle

Program:

Module 1. Fundamentals
- Parallel programming fundamentals in Dask
- Lab 1: Parallel programming
- Out of core computations
- Lab 2: Out of core computations
- Local and HPC Clusters
- Lab 3: Local and HPC Clusters
- Dask Dashboard and Performance Metrics
- Lab 4: Dashboard and Performance Metrics
Module 2. Dask Data Parallel Collections
- Lab 5: Dask Array
- Lab 6: Dask DataFrame
- Lab 7: Dask Bags

Chair(s):

Information

Most scientific research supported by High Performance Computing (HPC) systems now begins with large simulations and data collection campaigns from scientific instruments, followed by large data analytics workflows. This has motivated the convergence of HPC and Big Data analytics in both hardware and software. On the one hand, supercomputer design has been pushed towards architectures that meet the needs of both numerical computation and Big Data analysis. On the other hand, many data analytics tools in the Big Data ecosystem, such as Dask, have been adapted to HPC systems.

Dask is an open-source library for parallel and distributed computing in Python. Dask extends scientific Python libraries and their data collections, such as NumPy, Xarray, pandas, and scikit-learn, among others, so that they can run in parallel on anything from a local machine to distributed systems such as the cloud and high-performance computing systems.
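As a rough illustration of how a Dask collection mirrors the library it extends, the sketch below computes the same mean with NumPy and with a Dask array. The array size and the chunk size are arbitrary choices made for this example and are not taken from the workshop material.

```python
import numpy as np
import dask.array as da

# A NumPy array, computed eagerly and held entirely in memory.
x_np = np.arange(1_000_000, dtype="float64")
mean_np = x_np.mean()

# The same data as a Dask array split into 10 chunks of 100,000 elements.
# The chunk size is an arbitrary value chosen for this sketch.
x_da = da.from_array(x_np, chunks=100_000)

# Operations on a Dask array are lazy: this only builds a task graph.
lazy_mean = x_da.mean()

# compute() executes the graph, processing chunks in parallel,
# and returns a concrete result.
mean_da = lazy_mean.compute()

print(mean_np, mean_da)
```

The Dask version uses the same method names as NumPy, with `compute()` as the explicit step that triggers execution.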

Student prerequisites

Experience with Python and with installing libraries via pip or Anaconda is desirable but not required.

Access. Students must use their personal equipment

Users are expected to have administrator privileges on their machine.
A Linux-based OS is recommended.
Users will need to install the dask[complete] Python library. For Lab 2 only, nodejs must also be installed via APT.
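One possible way to check these requirements before the session is the short script below. It is a sketch only: the function name `check_environment` is hypothetical, and it assumes the Node.js executable is exposed as `node` or `nodejs` on the PATH, which may differ on some systems.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict describing which workshop prerequisites were found."""
    return {
        # Version of the running Python interpreter.
        "python": sys.version.split()[0],
        # True if the dask package can be imported.
        "dask": importlib.util.find_spec("dask") is not None,
        # dask[complete] also installs the distributed scheduler.
        "distributed": importlib.util.find_spec("distributed") is not None,
        # Path to the Node.js executable, or None if it is not on the PATH.
        # The executable names are an assumption for this sketch.
        "node": shutil.which("node") or shutil.which("nodejs"),
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {found if found else 'NOT FOUND'}")
```

Any entry reported as NOT FOUND points to a component that still needs to be installed.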

References:

Dask Fundamentals Tutorial for High Performance Computing https://github.com/DonAurelio/dask-tutorial-2023

Instructor(s):
Aurelio Vivas