ROOM 2. Scaling Data Analytics with Dask

ROOM 3

Monday September 18, 2023

14:00 to 17:30 hours

Instructors:

  • Aurelio Antonio Vivas Meza (Colombia), Universidad de los Andes, Argonne National Laboratory 
  • Diego Fernando Roa (Colombia), University of Delaware, Argonne National Laboratory
  • Natalia Clementi (Argentina), Coiled
  • John Alexander Sanabria Ordoñez (Colombia), Universidad del Valle

Program:

  • Module 1. Fundamentals
    • Parallel programming fundamentals in Dask
    • Lab 1: Parallel programming
    • Out of core computations
    • Lab 2: Out of core computations
    • Local and HPC Clusters
    • Lab 3: Local and HPC Clusters
    • Dask Dashboard and Performance Metrics
    • Lab 4: Dashboard and Performance Metrics
  • Module 2. Dask Data Parallel Collections
    • Lab 5: Dask Array
    • Lab 6: Dask DataFrame
    • Lab 7: Dask Bags

Information

Nowadays, most scientific research supported by High Performance Computing (HPC) systems begins with large simulations and data collection campaigns from scientific instruments, followed by large data analytics workflows. This has motivated the convergence of HPC and Big Data analytics in both hardware and software. On the one hand, supercomputer design has been pushed toward architectures that meet the needs of both numerical computation and Big Data analysis. On the other hand, many data analytics tools in the Big Data ecosystem, such as Dask, have been adapted to HPC systems.

Dask is an open-source library for parallel and distributed computing in Python. It extends scientific Python libraries such as NumPy, Xarray, pandas, and scikit-learn so that their workloads can run in parallel, from a local machine to distributed systems such as the cloud and high-performance computing clusters.
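As a rough illustration of this idea, the sketch below uses a chunked Dask array with a NumPy-like interface, then a small task graph built with dask.delayed. It assumes dask[complete] is installed; the array size, chunk size, and the helper name square are arbitrary choices made for this example and are not taken from the tutorial material.

```python
import dask
import dask.array as da

# A 4000 x 4000 array of ones, split into 16 chunks of 1000 x 1000.
# Each chunk is an ordinary NumPy array that can be processed independently.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# NumPy-style operations only build a task graph; nothing runs yet.
total = (x * 2).sum()

# compute() executes the graph, by default on a local thread pool.
result = total.compute()
print(result)


# dask.delayed turns plain Python functions into lazy tasks.
@dask.delayed
def square(n):  # hypothetical helper, for illustration only
    return n * n


squares = [square(n) for n in range(5)]
sum_of_squares = dask.delayed(sum)(squares).compute()
print(sum_of_squares)
```

The same code can, in principle, run against a cluster by creating a dask.distributed Client before calling compute(); that step is left out here to keep the sketch local.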

Student prerequisites

  • Experience with Python and with installing libraries via pip or Anaconda is desirable but not required.

Access. Students must use their personal equipment.

  • Students are expected to have administrator privileges on their machines.
  • A Linux-based OS is recommended.
  • Students will be required to install the dask[complete] Python library and, only for Lab 2, nodejs via APT.
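One possible way to check a machine before the session is sketched below. It uses only the Python standard library. The package list is an assumption about what dask[complete] pulls in and what the labs may use, and the executable names node and nodejs are assumptions about how the APT package exposes Node.js.

```python
from importlib.util import find_spec
from shutil import which

# Assumed package list: not an official requirement list for the labs.
REQUIRED = ["dask", "distributed", "numpy", "pandas", "bokeh"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


def has_nodejs():
    """Return True if a Node.js executable is on the PATH (assumed names)."""
    return which("node") is not None or which("nodejs") is not None


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print('Possible fix: python -m pip install "dask[complete]"')
else:
    print("All listed packages were found.")

if not has_nodejs():
    print("Node.js not found; on Debian/Ubuntu: sudo apt install nodejs")
```

The check only confirms that the packages are importable by name; it does not verify versions.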


Instructor(s):
Aurelio Vivas