Agenda is subject to change. Times listed below are in Pacific.

Lesson Materials: https://github.com/sdsc/sdsc-summer-institute-2024

Tuesday, July 30
Preparation day (virtual)

Pacific time

Session

9:00 AM – 11:00 AM

1.0 Preparation Day - Welcome & Orientation
Robert Sinkovits, Director of Education and Training

Accounts, Login, Environment, Running Jobs and Logging into Expanse User Portal

Q&A wrap up

Monday, August 5

Pacific time

Main Room Session

8:00 AM – 8:30 AM

Check-in & Registration

8:30 AM - 9:30 AM  Welcome

9:30 AM - 10:15 AM

2.1 Parallel Computing Concepts

Robert Sinkovits, Director of Education and Training
Advanced cyberinfrastructure users, whether they develop their own software or run 3rd party applications, should understand fundamental parallel computing concepts. Here we cover supercomputer architectures, the differences between threads and processes, implementations of parallelism (e.g., OpenMP and MPI), strong and weak scaling, limitations on scalability (Amdahl’s and Gustafson’s Laws) and benchmarking. We also discuss how to choose the appropriate number of cores, nodes or GPUs when running your applications and, when appropriate, the best balance between threads and processes. This session does not assume any programming experience.

10:15 AM – 11:00 AM

2.2 Hardware Overview
Andreas Goetz, Research Scientist & Principal Investigator

All users of advanced CI can benefit from a basic understanding of hardware, to determine which factors affect application performance. Here we give an overview starting from CPUs (processors, cores, hyperthreading, instruction sets), the anatomy of a compute node (sockets, memory, attached devices, accelerators), to an overview of cluster architecture (login and compute nodes, interconnects). We also cover how to obtain hardware information using Linux tools, pseudo-filesystems and commonly used hardware utilization monitoring tools.
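Some of the hardware details discussed here can be queried directly; a minimal sketch (the `/proc` pseudo-files are Linux-specific, hence the guard):

```python
# A minimal sketch of querying hardware information from Python; the
# /proc pseudo-filesystem exists only on Linux, so we check for it first.
import os
import platform

print("logical cores:", os.cpu_count())   # includes hyperthreads
print("machine:", platform.machine())     # e.g. x86_64

# On Linux, /proc exposes CPU and memory details as plain-text files.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(f.readline().strip())       # first line, e.g. "MemTotal: ... kB"
```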

11:00 AM - 11:15 AM

Break

11:15 AM – 12:15 PM

2.3 Intermediate Linux
Andreas Goetz, Research Scientist & Principal Investigator

Effective use of Linux based compute resources via the command line interface (CLI) can significantly increase researcher productivity. Assuming basic familiarity with the Linux CLI we cover some more advanced concepts with focus on the Bash shell. Among others this includes the filesystem hierarchy, file permissions, symbolic and hard links, wildcards and file globbing, finding commands and files, environment variables and modules, configuration files, aliases, history and tips for effective Bash shell scripting.
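One of the concepts above, hard versus symbolic links, can be demonstrated with the standard library (the same semantics as `ln` and `ln -s`); this is a self-contained sketch using a throwaway temporary directory:

```python
# Contrast hard links and symbolic links using Python's standard library.
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "data.txt")
    with open(target, "w") as f:
        f.write("hello")

    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    os.link(target, hard)      # hard link: a second name for the same inode
    os.symlink(target, soft)   # symbolic link: a file that stores a path

    nlink = os.stat(target).st_nlink       # 2: the inode now has two names
    os.remove(target)                      # delete the original name
    hard_survives = os.path.exists(hard)   # True: data reachable via hard link
    soft_survives = os.path.exists(soft)   # False: the symlink now dangles

print(nlink, hard_survives, soft_survives)   # 2 True False
```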

12:15 PM - 1:45 PM Lunch

1:45 PM – 2:45 PM

2.4 Batch Computing
Marty Kandes, Computational and Data Science Research Specialist
As computational and data requirements grow, researchers may find that they need to make the transition from dedicated resources (e.g., laptops, desktops) to campus clusters or nationally allocated systems. Jobs on these shared resources are typically executed under the control of a batch submission system such as Slurm, PBS, LSF or SGE. This requires a different mindset since the job needs to be configured so that the application(s) can be run non-interactively and at a time determined by the scheduler. The user also needs to specify the job duration, account information, hardware requirements and partition or queue. The goals of this session are to introduce participants to the fundamentals of batch computing before diving into the details of any particular workload manager to help them become more proficient, help ease porting of applications to different resources, and to allow CI Users to understand concepts such as fair share scheduling and backfilling.

2:45 PM – 3:00 PM

Break

3:00 PM – 4:00 PM

2.5 Interactive Computing
Mary Thomas, Computational Data Scientist
Interactive computing refers to working with software that accepts input from the user as it runs. This applies not only to business and office applications, such as word processing and spreadsheet software, but also to HPC use cases involving code development, real-time data exploration and advanced visualizations run across one or more compute nodes. Interactive computing is often used when applications require large memory, have data sets too large to be practical to download to local devices, need access to higher core counts or rely on software that is difficult to install. User inputs are entered via a command line interface (CLI) or application GUI (e.g., Jupyter Notebooks, MATLAB, RStudio), and actions are initiated on remote compute nodes as a result. This session will introduce participants to advanced CI concepts and what’s going on "under the hood" when they are using interactive tools. Topics covered will include mechanisms for accessing interactive resources; commonalities and differences between batch and interactive computing; understanding the differences between web-based services and X11/GUI applications; monitoring jobs running on interactive nodes; and an overview of Open OnDemand portals.

4:00 PM - 4:30 PM Q&A + Wrap-up
4:45 PM - 7:00 PM

Evening Reception (Off-Campus)

Transportation will be provided.

 

Tuesday, August 6

Pacific time

Main Room Session

8:00 AM – 8:30 AM

Check-in & Light Breakfast

8:30 AM – 9:00 AM

3.1 Getting Help
Nicole Wolter, Computational Data Scientist
Reducing the time and effort needed to address problems related to application performance, batch job submission or data management can minimize frustration and enable users to become more productive. In this session we will cover common problems and best practices for resolving issues.

9:00 AM – 10:00 AM

3.2 Data Management
Marty Kandes, Computational and Data Science Research Specialist

Proper data management is essential for the effective use of advanced CI. This session will cover an overview of file systems, data compression, archives (tar files), checksums and MD5 digests, downloading data using wget and curl, data transfer and long-term storage solutions.
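
Two of the primitives listed above, tar archives and MD5 digests, are available directly from the Python standard library; a small self-contained sketch (the file contents are invented for illustration):

```python
# Sketch of two data-management primitives: bundling files into a
# compressed tar archive and computing an MD5 digest for integrity checks.
import hashlib
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "results.txt")
    with open(path, "w") as f:
        f.write("42\n")

    # Create a compressed archive (the equivalent of `tar czf`).
    archive = os.path.join(d, "results.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(path, arcname="results.txt")
    with tarfile.open(archive) as tar:
        members = tar.getnames()

    # MD5 digest, used to verify a file after transfer (like `md5sum`).
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()

print(members, md5)
```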

10:00 AM – 10:15 AM

Break

10:15 AM – 11:00 AM

3.3 Security
Scott Sakai, Senior Security Analyst, SDSC
Maintaining a secure CI environment is essential for ensuring the integrity of resources, data and research. In this session we will discuss the importance of, and best practices for, maintaining a secure environment.

11:00 AM – 12:00 PM

3.4 Code Migration
Mahidhar Tatineni, Director of User Services
Introduction to porting codes and workflows to SDSC resources. We will cover typical approaches to moving your computations to our resources: use of applications/software packages already available on the system; compiling code from source, with information on compilers, libraries, and optimization flags to use; setting up Python and R environments on our systems; use of conda-based environments; managing workflows; and use of containerized solutions via Singularity.

12:00 PM - 1:30 PM

Lunch

1:30 PM – 2:45 PM

3.5 High Throughput Computing
Marty Kandes, Computational and Data Science Research Specialist
High-throughput computing (HTC) workloads are characterized by large numbers of small jobs. These frequently involve parameter sweeps where the same type of calculation is done repeatedly with different input values or data processing pipelines where an identical set of operations is applied to many files. This session covers the characteristics and potential pitfalls of HTC, job bundling, the Open Science Grid and the resources available through the Partnership to Advance Throughput Computing (PATh).
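
The parameter-sweep pattern described above can be sketched in a few lines; here a local worker pool stands in for the cluster, and `simulate` is a made-up placeholder for the per-parameter calculation, which on a real system would run as one batch job (or one task in a bundled job) per input.

```python
# Toy parameter sweep in the HTC spirit: the same small, independent
# calculation applied to many input values.
from concurrent.futures import ThreadPoolExecutor

def simulate(x):
    # Placeholder for an independent per-parameter computation.
    return x * x

params = range(10)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, params))

print(results)   # [0, 1, 4, ..., 81]: order matches the inputs
```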

2:45 PM - 3:00 PM Break

3:00 PM - 4:30 PM

3.6 Linux Tools for File Processing
Bob Sinkovits, Director of Education and Training
Many computational and data processing workloads require pre-processing of input files to get the data into a format that is compatible with the user’s application and/or post-processing of output files to extract key results. While these tasks could be done by hand, the process can be time-consuming, tedious and, worst of all, error prone. In this session we cover the Linux tools awk, sed, grep, sort, head, tail, cut, paste, cat and split, which help users easily automate these tasks.
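
As a point of comparison, the kind of grep/cut/sort pipeline these tools build can also be expressed in a few lines of Python; the log-like text below is invented for illustration.

```python
# Python equivalents of a typical filter/extract/sort pipeline over a
# small log-like text (contents made up for illustration).
text = """\
jobA 12.5 done
jobB 3.1 failed
jobC 7.9 done
"""

rows = [line.split() for line in text.splitlines()]   # awk-style field split
done = [r for r in rows if r[2] == "done"]            # like `grep done`
times = sorted(float(r[1]) for r in done)             # like `cut | sort -n`
print(times)   # [7.9, 12.5]
```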

4:30 PM - 4:45 PM Q&A + Wrap-up

 

Wednesday, August 7

Pacific time

Main Room Session

Breakout Room Session

8:00 AM – 8:30 AM

Check-in & Light Breakfast

8:30 AM – 10:00 AM

4.1a Intro to Git & GitHub
Elham Khoda, Computational and Data Science Research Specialist
This session will provide an overview of GitHub and introduce version control with Git and GitHub for beginners. Participants will learn to create a repository on GitHub, manage files, use pull requests, merge changes, rebase branches, and more.

4.1b Advanced Git & GitHub
Fernando Garzon, Computational and Data Science Research Specialist

In today's fast-paced software development world, mastering GitHub and Git is a game-changer. This session will enhance your understanding beyond the basics, introducing advanced techniques to streamline workflows, manage complex projects, and automate tasks. You'll discover how to maximize productivity without compromising quality. By the end of this talk, you'll have a deeper grasp of GitHub's potential and a curiosity to explore tools like GitHub Actions, documentation, and automation features. Join us to elevate your development experience and skills.

10:00 AM – 10:15 AM

Break

10:15 AM – 12:30 PM

4.2a Python for HPC
Andrea Zonca, Senior Computational Scientist

In this session we will introduce four key technologies in the Python ecosystem that provide significant benefits for scientific applications run in supercomputing environments. Previous Python experience is recommended but not required.
(1) First, we will learn how to speed up Python code by compiling it on the fly with numba. (2) Next, we will introduce threads, processes and the Global Interpreter Lock (GIL), and leverage first numba and then dask to use all available cores on a machine. (3) Finally, we will distribute computations across multiple nodes by launching dask workers in a separate Expanse job.
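
The Global Interpreter Lock mentioned in step (2) can be illustrated with the standard library alone: pure-Python CPU-bound work gives the same answer with threads, but, unlike numba- or process-based approaches, typically no speedup, because only one thread executes Python bytecode at a time.

```python
# Minimal GIL illustration: thread-based execution of CPU-bound Python
# code is correct, but timings (not shown) would reveal little speedup.
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    """Count primes below `limit` by naive trial division (CPU-bound)."""
    return sum(all(n % d for d in range(2, int(n ** 0.5) + 1))
               for n in range(2, limit))

chunks = [2000, 2000, 2000, 2000]
serial = sum(count_primes(c) for c in chunks)
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = sum(pool.map(count_primes, chunks))

print(serial == threaded)   # True: same result either way
```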

4.2b Information Visualization Concepts
Isaac Nealey, Data Visualization Research Specialist 
This session will provide an introductory understanding of information visualization concepts and how they can be leveraged to select and use effective visual idioms for different data types such as tabular, geospatial, and graph data. We will discuss and critique some example visualization designs, cover practical rules of thumb for canonical information visualization, and address the challenges facing practitioners of scientific data visualization.

12:30 PM - 2:00 PM Group Photo
Lunch
 

2:00 PM – 4:30 PM

4.3a Conducting Scientific Visualization with VTK and Unreal Engine 5
Isaac Nealey, Data Visualization Research Specialist 
This tutorial will provide a high-level overview of scientific visualization techniques with two open-source software suites. We will primarily cover structured mesh-based data, but we will address other commonly encountered datasets such as point clouds and geographic data as well. Attendees will employ a state-of-the-art game engine to render 3D scenes containing scientific data. This is a hands-on workshop that will require an adequate computer and the installation of large software packages.

4.3b Scalable Machine Learning 
Mai Nguyen, Lead for Data Analytics
Paul Rodriguez, Research Analyst

Machine learning is an integral part of knowledge discovery in a wide variety of applications. From scientific domains to social media analytics, the data that needs to be analyzed has become massive and complex.

This session introduces approaches that can be used to perform machine learning at scale. Tools and procedures for executing machine learning techniques on HPC will be presented.  Spark will also be covered for scalable data analytics and machine learning. Please note: Knowledge of fundamental machine learning algorithms and techniques is required.

 

4:30 PM - 4:45 PM Q&A + Wrap-up

 

Thursday, August 8

Pacific time

Main Room Session

Breakout Room Session

8:00 AM – 8:30 AM

Check-in & Light Breakfast

8:30 AM - 9:30 AM

5.1 Scaling up Interactive Data Analysis in Jupyter Lab: From Laptop to HPC
Peter Rose, Director of Structural Bioinformatics Laboratory
In this session we will demonstrate scaling up data analysis to larger-than-memory (out-of-core) datasets and processing them in parallel on CPU and GPU nodes. In the hands-on exercise we will compare the Pandas, Dask, Spark, cuDF, and Dask-cuDF dataframe libraries for handling large datasets. We also cover setting up reproducible and transferable software environments for data analysis.
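
The out-of-core pattern those libraries implement can be sketched with the standard library: process a file piece by piece so memory use stays bounded regardless of file size. The in-memory "file" below stands in for one too large to load at once.

```python
# Chunked, out-of-core aggregation: read and reduce 100 rows at a time.
import csv
import io

# Stand-in for a file too large to load in one go (contents made up).
data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

total = 0
chunk = []
for row in csv.DictReader(data):
    chunk.append(int(row["value"]))
    if len(chunk) == 100:          # aggregate one bounded chunk at a time
        total += sum(chunk)
        chunk = []
total += sum(chunk)                # flush the final partial chunk

print(total)   # 499500, the same answer as loading everything at once
```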

9:30 AM - 9:45 AM Break

9:45 AM – 12:15 PM

5.2a Performance Tuning
Bob Sinkovits, Director of Education and Training 
This session is targeted at attendees who both do their own code development and need their calculations to finish as quickly as possible. We will cover the effective use of cache, loop-level optimizations, force reductions, optimizing compilers and their limitations, short-circuiting, time-space tradeoffs and more. Exercises will be done mostly in C, but emphasis will be on general techniques that can be applied in any language.
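
One loop-level optimization from the list above, hoisting a loop-invariant computation out of the loop body, can be sketched briefly. Python is used here for compactness even though the session's exercises are in C; the idea is language-independent.

```python
# Loop-invariant hoisting: sqrt(x) does not change inside the loop, so
# compute it once instead of on every iteration.
import math

def naive(values, x):
    out = []
    for v in values:
        out.append(v * math.sqrt(x))   # sqrt(x) recomputed every pass
    return out

def hoisted(values, x):
    s = math.sqrt(x)                   # invariant computed exactly once
    return [v * s for v in values]

vals = list(range(5))
print(naive(vals, 2.0) == hoisted(vals, 2.0))   # True: same result, less work
```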

5.2b Deep Learning - Part 1
Mai Nguyen, Lead for Data Analytics
Paul Rodriguez, Computational Data Scientist
Deep learning, a subfield of machine learning, has seen tremendous growth and success in the past few years. Deep learning approaches have achieved state-of-the-art performance across many domains, including image classification, speech recognition, and biomedical applications. This session provides an introduction to neural networks and deep learning concepts and approaches. Examples utilizing deep learning will be presented, and hands-on exercises will be covered using Keras. Please note: Knowledge of fundamental machine learning concepts and techniques is required.

12:15 PM - 1:45 PM Lunch  

1:45 PM – 4:30 PM

5.3a GPU Computing and Programming
Andreas Goetz, Research Scientist and Principal Investigator

This session introduces massively parallel computing with graphics processing units (GPUs). The use of GPUs is popular across all scientific domains since GPUs can significantly accelerate time to solution for many computational tasks. Participants will be introduced to essential background of the GPU chip architecture and will learn how to program GPUs via the use of libraries, OpenACC compiler directives, and CUDA programming. The session will incorporate hands-on exercises for participants to acquire the basic skills to use and develop GPU aware applications.

5.3b Deep Learning – Part 2
Mai Nguyen, Lead for Data Analytics
Paul Rodriguez, Computational Data Scientist
This session continues and extends Deep Learning - Part 1 by going into more advanced examples. Concepts regarding architecture, layers, and applications will be presented. Additionally, more advanced tutorials and hands-on exercises with larger deep convolutional networks and transfer learning will be executed on GPUs. There will also be a chance to learn Keras in more depth and become familiar with building more flexible models.

4:30 PM - 4:45 PM

Q&A + Wrap-up

5:00 PM - 7:00 PM Dinner at the 15th Floor

Friday, August 9

Pacific time

Main Room Session

8:00 AM – 8:30 AM

Check-in & Light Breakfast

8:30 AM – 11:30 AM

6.1a Parallel Computing using MPI & OpenMP
Mahidhar Tatineni, Director of User Services
This session is targeted at attendees who are looking for a hands-on introduction to parallel computing using MPI and OpenMP programming. The session will start with an introduction and basic information for getting started with MPI. An overview of the common MPI routines useful for beginning MPI programmers will be provided, including MPI environment setup, point-to-point communications, and collective communications routines. Simple examples illustrating distributed-memory computing with common MPI routines will be covered. The OpenMP section will provide an overview of constructs and directives for specifying parallel regions, work sharing, synchronization and data scope. Simple examples will illustrate the OpenMP shared-memory programming model and important runtime environment variables. Hands-on exercises for both MPI and OpenMP will be done in C and Fortran.

 

6.1b Knowledge Management and Knowledge Graph
Subhasis Dasgupta, Computational and Data Researcher
Jon Stephen, Computational and Data Researcher

In this session, we have three connected sections. The first section will help participants understand knowledge management and how to implement it, specifically within the scientific community. It will also highlight the fundamental shift in the machine learning paradigm and how to incorporate knowledge management into daily processes. This section will cover the basic concepts of knowledge management, from ontology development to document management. In the next part, we will cover two fundamental knowledge management techniques that can help users design their knowledge pipelines and improve their daily processes. The first technique focuses on how to use large language models (LLMs) beyond traditional engineering methods and other related technologies.


Finally, we are going to delve into the fascinating topic of knowledge graph technology. In this section, we'll explore how to effectively build and use knowledge graphs to manage knowledge pipelines. We'll also discuss how to use LLMs to construct and navigate knowledge management systems.
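
At its core, a knowledge graph is a collection of subject-predicate-object triples that can be queried by pattern; a toy sketch (the facts and predicate names below are invented for illustration, and real systems use dedicated graph stores rather than Python lists):

```python
# Toy knowledge graph as subject-predicate-object triples, with a simple
# pattern-matching query (None acts as a wildcard).
triples = [
    ("Expanse", "is_a", "supercomputer"),
    ("Expanse", "located_at", "SDSC"),
    ("SDSC", "part_of", "UC San Diego"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(query(subject="Expanse"))       # every fact about Expanse
print(query(predicate="part_of"))     # every part_of relationship
```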

11:30 AM – 11:45 AM

6.2 Overview of Voyager 
Amit Majumdar, Division Director of Data-Enabled Scientific Computing
Voyager provides an innovative system architecture uniquely optimized for deep learning operations using well-established frameworks such as PyTorch and TensorFlow. Voyager comprises 42 training nodes of Supermicro X12 Habana Gaudi Training Servers; each training node contains 8 GAUDI HL-205 training processor cards, which have 100 GbE non-blocking, all-to-all connections among the 8 cards within a node; the 42 training nodes are connected via a high-performance, low-latency 400 GbE switch interconnect. Voyager’s architecture has already shown highly scalable AI application performance in areas such as LLMs (with billions of parameters, e.g., GPT2-XL and GPT3-XL), convolutional neural network-based image processing, and graph neural network-based high-energy particle physics.

11:45 AM - 12:00 PM

6.3 Overview of COSMOS

Mahidhar Tatineni, Director of User Services

12:15 PM – 12:30 PM

Closing Remarks

Robert Sinkovits, Director of Education and Training

Lunch boxes will be provided*