Agenda

Preparation Day: Wednesday, July 27, 2022
Institute: Monday, August 1 – Friday, August 5, 2022
This event will be held virtually. All times listed are Pacific Time.

All program content will be found on the GitHub Repository
https://github.com/sdsc/sdsc-summer-institute-2022

Wednesday, July 27

Pacific time	Session
9:00 AM – 11:00 AM	1.0 Preparation Day - Welcome & Orientation Robert Sinkovits, Director of Education and Training Accounts, Login, Environment, Running Jobs and Logging into Expanse User Portal. Q&A wrap up.

Monday, August 1

Pacific time	Main Room Session
8:00 AM – 8:15 AM	Welcome
8:15 AM – 9:15 AM	2.1 Parallel Computing Concepts Robert Sinkovits, Director of Education and Training Advanced cyberinfrastructure users, whether they develop their own software or run 3rd party applications, should understand fundamental parallel computing concepts. Here we cover supercomputer architectures, the differences between threads and processes, implementations of parallelism (e.g., OpenMP and MPI), strong and weak scaling, limitations on scalability (Amdahl’s and Gustafson’s Laws) and benchmarking. We also discuss how to choose the appropriate number of cores, nodes or GPUs when running your applications and, when appropriate, the best balance between threads and processes. This session does not assume any programming experience.
9:15 AM – 10:00 AM	2.2 Hardware Overview Andreas Goetz, Research Scientist & Principal Investigator All users of advanced CI can benefit from a basic understanding of hardware, to determine which factors affect application performance. Here we give an overview starting from CPUs (processors, cores, hyperthreading, instruction sets), the anatomy of a compute node (sockets, memory, attached devices, accelerators), to an overview of cluster architecture (login and compute nodes, interconnects). We also cover how to obtain hardware information using Linux tools, pseudo-filesystems and commonly used hardware utilization monitoring tools.
10:00 AM – 10:15 AM	Break
10:15 AM – 11:30 AM	2.3 Intermediate Linux Andreas Goetz, Research Scientist & Principal Investigator Effective use of Linux-based compute resources via the command line interface (CLI) can significantly increase researcher productivity. Assuming basic familiarity with the Linux CLI, we cover some more advanced concepts with a focus on the Bash shell. Among other topics, this includes the filesystem hierarchy, file permissions, symbolic and hard links, wildcards and file globbing, finding commands and files, environment variables and modules, configuration files, aliases, history and tips for effective Bash shell scripting.
11:30 AM – 12:30 PM	2.4 Batch Computing Mary Thomas, Computational Data Scientist As computational and data requirements grow, researchers may find that they need to make the transition from dedicated resources (e.g., laptops, desktops) to campus clusters or nationally allocated systems. Jobs on these shared resources are typically executed under the control of a batch submission system such as Slurm, PBS, LSF or SGE. This requires a different mindset since the job needs to be configured so that the application(s) can be run non-interactively and at a time determined by the scheduler. The user also needs to specify the job duration, account information, hardware requirements and partition or queue. The goal of this session is to introduce participants to the fundamentals of batch computing before diving into the details of any particular workload manager. This will help them become more proficient, ease the porting of applications to different resources, and allow CI users to understand concepts such as fair share scheduling and backfilling.
12:30 PM – 12:45 PM	Break
12:45 PM – 2:15 PM	2.5 Data Management Marty Kandes, Computational and Data Science Research Specialist Proper data management is essential for the effective use of advanced CI. This session will cover an overview of file systems, data compression, archives (tar files), checksums and MD5 digests, downloading data using wget and curl, data transfer and long-term storage solutions.

Tuesday, August 2

Pacific time	Main Room Session
8:00 AM – 8:30 AM	3.1 Security Nicole Wolter, Computational Data Scientist Maintaining a secure CI environment is essential for ensuring the integrity of resources, data and research. In this session we will discuss the importance of, and best practices for, maintaining a secure environment.
8:30 AM – 9:30 AM	3.2 Interactive Computing Mary Thomas, Computational Data Scientist Interactive computing refers to working with software that accepts input from the user as it runs. This applies not only to business and office applications, such as word processing and spreadsheet software, but also to HPC use cases involving code development, real-time data exploration and advanced visualizations run across one or more compute nodes. Interactive computing is often used when applications require large memory, have large data sets that are not practical to download to local devices, need access to higher core counts or rely on software that is difficult to install. User inputs are entered via a command line interface (CLI) or application GUI (e.g., Jupyter Notebooks, Matlab, RStudio). Actions are initiated on remote compute nodes as a result of user inputs. This session will introduce participants to advanced CI concepts and what’s going on "under the hood" when they are using interactive tools. Topics covered will include mechanisms for accessing interactive resources; commonalities and differences between batch and interactive computing; the differences between web-based services and X11/GUI applications; monitoring jobs running on interactive nodes; and an overview of Open OnDemand portals.
9:30 AM – 9:45 AM	Break
9:45 AM – 10:30 AM	3.3 Getting Help Nicole Wolter, Computational Data Scientist Reducing the time and effort needed to address problems related to application performance, batch job submission or data management can minimize frustration and enable users to become more productive. In this session we will cover common problems and best practices for resolving issues.
10:30 AM – 11:30 AM	3.4 Code Migration Mahidhar Tatineni, Director of User Services Introduction to porting codes and workflows to SDSC resources. We will cover typical approaches to moving your computations to our resources: use of applications and software packages already available on the system; compiling code from source, with information on the compilers, libraries and optimization flags to use; setting up Python and R environments on our systems; use of conda-based environments; managing workflows; and use of containerized solutions via Singularity.
11:30 AM – 11:45 AM	Break
11:45 AM – 12:45 PM	3.5 High Throughput Computing Marty Kandes, Computational and Data Science Research Specialist High-throughput computing (HTC) workloads are characterized by large numbers of small jobs. These frequently involve parameter sweeps where the same type of calculation is done repeatedly with different input values or data processing pipelines where an identical set of operations is applied to many files. This session covers the characteristics and potential pitfalls of HTC, job bundling, the Open Science Grid and the resources available through the Partnership to Advance Throughput Computing (PATh).
12:45 PM – 1:45 PM	3.6 Linux Tools for File Processing Robert Sinkovits, Director of Education and Training Many computational and data processing workloads require pre-processing of input files to get the data into a format that is compatible with the user’s application and/or post-processing of output files to extract key results. While these tasks could be done by hand, the process can be time-consuming, tedious and, worst of all, error prone. In this session we cover the Linux tools awk, sed, grep, sort, head, tail, cut, paste, cat and split, which will help users to easily implement automation.

Wednesday, August 3

Pacific time	Main Room Session	Breakout Room Session
8:00 AM – 9:30 AM	4.1a Intro to Git & GitHub Mary Thomas, Computational Data Scientist This session will provide an overview of GitHub and introduce version control with Git/GitHub for beginners. Participants will learn to create a repository on GitHub and manage files, use pull requests, merge changes, rebase branches, etc.	4.1b Advanced Git & GitHub Marty Kandes, Computational and Data Science Research Specialist You should already be familiar with creating pull requests, merging, and rebasing branches.
9:30 AM – 9:45 AM	Break
9:45 AM – 12:00 PM	4.2a Python for HPC Mahidhar Tatineni, Director of User Services In this session we will introduce four key technologies in the Python ecosystem that provide significant benefits for scientific applications run in supercomputing environments. Previous Python experience is recommended but not required. (1) First, we will learn how to speed up Python code by compiling it on-the-fly with numba. (2) Then we will introduce threads, processes and the Global Interpreter Lock, and we will leverage first numba, then dask, to use all available cores on a machine. (3) Finally, we will distribute computations across multiple nodes by launching dask workers on a separate Expanse job.	4.2b A Short Introduction to Data Science and its Applications Ilkay Altintas, Chief Data Science Officer Subhasis Dasgupta, Computational and Data Researcher Shweta Purawat, Computational and Data Researcher The new era of data science is here. Our lives as well as any field of science, engineering, business, and society are continuously transformed by our ability to collect meaningful data in a systematic fashion and turn that into value. These needs not only push for new and innovative capabilities in composable data management and analytical methods that can scale in an anytime anywhere fashion, but also require methods to bridge the gap between applications and compose such capabilities within solution architectures. In this short overview, we will show you a plethora of applications that are enabled by data science techniques and describe the process and cyberinfrastructure used within these projects to solve questions.
12:00 PM – 2:30 PM	4.3a Performance Tuning Bob Sinkovits, Director for Scientific Computing Applications This session is targeted at attendees who both do their own code development and need their calculations to finish as quickly as possible. We will cover the effective use of cache, loop-level optimizations, force reductions, optimizing compilers and their limitations, short-circuiting, time-space tradeoffs and more. Exercises will be done mostly in C, but emphasis will be on general techniques that can be applied in any language.	4.3b Scalable Machine Learning Mai Nguyen, Lead for Data Analytics Paul Rodriguez, Research Analyst Machine learning is an integral part of knowledge discovery in a wide variety of applications. From scientific domains to social media analytics, the data that needs to be analyzed has become massive and complex. This session introduces approaches that can be used to perform machine learning at scale. Tools and procedures for executing machine learning techniques on HPC will be presented. Spark will also be covered for scalable data analytics and machine learning. Please note: Knowledge of fundamental machine learning algorithms and techniques is required.

Thursday, August 4

Pacific time	Main Room Session	Breakout Room Session
8:00 AM – 10:30 AM	5.1a Scientific Visualization for mesh-based data with VisIt Amit Chourasia, Senior Visualization Scientist This tutorial will provide a high-level overview of scientific visualization techniques and their applicability for structured mesh-based data (such as rectilinear grids). Attendees will follow along with hands-on exercises to employ different types of techniques using VisIt software and also perform remote visualization on the Expanse cluster.	5.1b Deep Learning - Part 1 Mai Nguyen, Lead for Data Analytics Paul Rodriguez, Computational Data Scientist Deep learning, a subfield of machine learning, has seen tremendous growth and success in the past few years. Deep learning approaches have achieved state-of-the-art performance across many domains, including image classification, speech recognition, and biomedical applications. This session provides an introduction to neural networks and deep learning concepts and approaches. Examples utilizing deep learning will be presented, and hands-on exercises will be covered using Keras. Please note: Knowledge of fundamental machine learning concepts and techniques is required.
10:30 AM – 10:45 AM	Break
10:45 AM – 1:30 PM	5.2a GPU Computing and Programming Andreas Goetz, Research Scientist and Principal Investigator This session introduces massively parallel computing with graphics processing units (GPUs). The use of GPUs is popular across all scientific domains since GPUs can significantly accelerate time to solution for many computational tasks. Participants will be introduced to essential background on the GPU chip architecture and will learn how to program GPUs via the use of libraries, OpenACC compiler directives, and CUDA programming. The session will incorporate hands-on exercises for participants to acquire the basic skills to use and develop GPU-aware applications.	5.2b Deep Learning – Part 2 Mai Nguyen, Lead for Data Analytics Paul Rodriguez, Computational Data Scientist This session continues and extends Deep Learning - Part 1 by going into more advanced examples. Concepts regarding architecture, layers, and applications will be presented. Additionally, more advanced tutorials and hands-on exercises with larger deep convolutional networks and transfer learning will be executed on GPUs. There will also be a chance to learn Keras in more depth and become familiar with building more flexible models.
1:30 PM – 2:00 PM	5.3 An Introduction to Singularity: Containers for Scientific and High-Performance Computing Martin Kandes, Computational & Data Science Research Specialist

Friday, August 5

Pacific time	Main Room Session	Breakout Room Session
8:00 AM – 11:00 AM	6.1a Parallel Computing using MPI & OpenMP Mahidhar Tatineni, Director of User Services This session is targeted at attendees who are looking for a hands-on introduction to parallel computing using MPI and OpenMP programming. The session will start with an introduction and basic information for getting started with MPI. An overview of the common MPI routines that are useful for beginner MPI programmers, including MPI environment set up, point-to-point communications, and collective communications routines will be provided. Simple examples illustrating distributed memory computing, with the use of common MPI routines, will be covered. The OpenMP section will provide an overview of constructs and directives for specifying parallel regions, work sharing, synchronization and data scope. Simple examples will be used to illustrate the use of the OpenMP shared-memory programming model and important run time environment variables. Hands-on exercises for both MPI and OpenMP will be done in C and FORTRAN.	6.1b Information Visualization Concepts Amit Chourasia, Senior Visualization Scientist This tutorial will provide a ground-up understanding of information visualization concepts and how they can be leveraged to select and use effective visual idioms for different data types (such as spreadsheet data, geospatial, graph, etc.). Example visualization designs and fixing problems with existing visualizations will be discussed. Practical rules of thumb for visualization will be discussed as well.
11:00 AM – 11:15 AM	Break
11:15 AM – 12:00 PM	6.2 Scaling up Interactive Data Analysis in Jupyter Lab: From Laptop to HPC Peter Rose, Director of Structural Bioinformatics Laboratory In this session we will demonstrate scaling up data analysis to larger than memory (out-of-core) datasets and processing them in parallel on CPU and GPU nodes. In the hands-on exercise we will compare Pandas, Dask, Spark, cuDF, and Dask-cuDF dataframe libraries for handling large datasets. We also cover setting up reproducible and transferable software environments for data analysis.
12:00 PM – 12:15 PM	Closing Remarks Robert Sinkovits, Director of Education and Training
