Scalable and Reproducible Structural Bioinformatics Workshop & Hackathon 2018

Application of Big Data Technology

San Diego Supercomputer Center/University of California, San Diego
May 7 - 9, 2018

This 3-day hands-on workshop introduces participants to the development of fast and scalable structural bioinformatics methods using state-of-the-art Big Data technologies, data mining, machine learning, 3D visualization, and deployment in Jupyter Notebooks. The first two days of the workshop combine lectures, hands-on applications, and programming sessions. On the third day participants apply the new technologies to their own projects.

This workshop is held at the University of California, San Diego and hosted by the Structural Bioinformatics Laboratory at SDSC.


This workshop is sponsored by the NIH Big Data to Knowledge (BD2K) initiative. Air travel and 4-day lodging can be provided for non-commercial participants, including a limited number of international participants. Apply now to secure your place in the workshop. Participants will be selected based on the best fit to the program.

Target Audience

The workshop is aimed at graduate students, postdocs, staff, faculty, industrial researchers, and scientific software developers who develop software for Structural Bioinformatics applications. Intermediate to advanced programming skills in high-level languages (Python, Java, C++) and basic knowledge of SQL are required. The tutorials will be in Python.


In this workshop you will learn how to use the following technologies and apply them to your own projects.

MMTF-PySpark and MMTF-Spark are open source projects that provide APIs and sample applications for the scalable mining of 3D structures, including structures from the Protein Data Bank, homology models from SWISS-MODEL, and de novo models (e.g., Rosetta). These projects use Big Data technologies to enable high-performance parallel processing of macromolecular structures.

MMTF (Macromolecular Transmission Format) is a compact data format for high-performance processing of 3D structures.

Apache Spark is the most popular Big Data framework for distributed parallel computing.

Apache Spark SQL and Apache Spark ML (Machine Learning) are scalable data analytics frameworks.

Jupyter Notebooks and their integration with machine learning and 3D visualization tools (NGLView, py3Dmol) in Python.

Workshop Outcomes

Learn how to apply Big Data technologies to problems in Structural Bioinformatics

Implement and deploy scalable analysis, visualization, and machine learning in Jupyter Notebooks

Foster collaborations among SDSC scientists and workshop participants

Become a contributor to open source projects