Book of Workshops
Preface
Yale’s Public Health Data Science and Data Equity (DSDE) team hosts workshops, tutorials, and information sessions known as Coffee, Cookie and Coding (C³) Workshops. These sessions are designed to help Public Health and Biostatistics master's-level students at Yale effectively leverage computational tools and analytical methods in their educational and professional endeavors. While primarily intended for the Yale community, all are welcome and encouraged to attend and benefit from our offerings.
You can find out more about past and upcoming work on our website. Additional tutorials will soon be available on our YouTube channel. If you are affiliated with Yale, you can set up an office hour appointment with one of our data scientists on our Bookings Page.
Each module adapted for asynchronous learning is presented as its own chapter. To help build your technical vocabulary, new key terms will be highlighted throughout. These terms will appear as new terms and new code commands, and hovering over them will display their definitions. All terms are summarized in a glossary at the end of each section where they are first introduced.
List of Offerings
Workshops Adapted for Asynchronous Learning
Getting Started with Git and GitHub
Find the session materials and other related content in the Book of Workshops chapter. All code can be found on its GitHub page.
Part 1 Instructor:
Shelby Golden, M.S., Data Scientist I
Part 1 Learning Goals:
- Understand the purpose and value of Git and GitHub in managing coding projects.
- Learn how Git manages files for version control locally and distributes them through GitHub.
- Set up and configure your local Git installation and your GitHub account, connecting the two over either HTTPS or SSH keys.
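As a rough illustration of the last goal, a first-time Git configuration might look like the sketch below. The name, email address, and throwaway directory are placeholders rather than values from the workshop, and the `demo_git` helper is a hypothetical wrapper that exists only so the demo does not touch your real settings. The workshop's own steps may differ.

```shell
# Sketch only: "Ada Example" and the email address are placeholders.
# demo_git is a hypothetical helper that points Git at a scratch HOME,
# so this demo never edits your real ~/.gitconfig.
demo_home="$(mktemp -d)"
demo_git() { HOME="$demo_home" XDG_CONFIG_HOME="$demo_home/.config" git "$@"; }

# Identity that Git records on every commit you make.
demo_git config --global user.name "Ada Example"
demo_git config --global user.email "ada@example.com"

# Read the values back to confirm what Git stored.
configured_name="$(demo_git config --global user.name)"
configured_email="$(demo_git config --global user.email)"
echo "$configured_name <$configured_email>"

# Connecting to GitHub is interactive, so it is shown here as comments only.
# SSH route: create a key pair, add the public key to your GitHub account,
# then check the connection.
#   ssh-keygen -t ed25519 -C "ada@example.com"
#   ssh -T git@github.com
# HTTPS route: GitHub expects a personal access token in place of a password
# when Git prompts for credentials.
```

To configure your actual account, run the same `git config --global` commands with plain `git` in place of `demo_git`.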
Part 2 Instructors:
Shelby Golden, M.S., Data Scientist I
Howard Baik, M.S., Former Data Scientist I at Yale School of Public Health
Part 2 Learning Goals:
- Understand the purpose and value of Git and GitHub in managing coding projects.
- Get hands-on experience using Git and GitHub for solo projects through a worked example showing common workflows.
- Learn how to use GitHub to support collaboration and teamwork on group projects.
Data Visualization with ggplot2
Find the session materials and other related content in the Book of Workshops chapter. All code can be found on its GitHub page.
Instructor:
Shelby Golden, M.S., Data Scientist I
Learning Goals:
- Classify the Grammar of Graphics layers used in ggplot2 syntax.
- Apply different geometries, use layering effectively, and polish the result.
- Build interactive plots, work with map projections, and leverage AI-assisted coding.
Live Workshop Content
Choosing Good Subsamples for Measuring New Variables
Date Held: Monday, February 3rd, 2025
Find the session materials and other related content on the GitHub page.
Instructor:
Professor Thomas Lumley, Chair in Biostatistics, Department of Statistics, University of Auckland
Learning Goals:
Researchers often want to add more measurements to an existing database. These might be new assays on stored specimens, or coding of free-text responses, or validation of EHR data against clinical notes. It is expensive to measure the new variables on everyone, so subsampling is attractive. It is possible to do much better than simple random sampling when measuring additional data: any information you already have can be used to identify the most informative records to measure. It's also possible to recover a lot of information from the records that are not chosen in the subsample. Software already exists in R to support most analyses you would want to do of the subsampled data.
Analyzing Larger Data in R
Date Held: Tuesday, February 4th, 2025
Find the session materials and other related content on the GitHub page.
Instructor:
Professor Thomas Lumley, Chair in Biostatistics, Department of Statistics, University of Auckland
Learning Goals:
Even with growing computer power, researchers sometimes want to work with datasets that are much bigger than computer memory. The interfaces to allow selection, summarization, and aggregation of very large datasets from R are increasingly transparent and easy to set up. I will demonstrate simple analysis of large datasets in R. I will also show how some more complicated analyses can be partitioned between R and a database to exploit the advantages of both systems. I will primarily use duckdb, but I will refer to other large-data interfaces.
Getting Started with R Shiny: Build and Deploy Interactive Web Apps
Date Held: Monday, February 17th, 2025
Find the session materials and other related content on the GitHub page.
Instructor:
Howard Baik, M.S., Former Data Scientist I at Yale School of Public Health
Learning Goals:
- Understand the structure of an R Shiny app, including UI (User Interface) and server components.
- Use reactive elements to make your app dynamic and interactive.
- Build a simple web application using a data example.
- Deploy Shiny apps using a self-service platform like shinyapps.io.
Tutorials and Information Sessions
Information Session on HPC and AI/Clarity
Date Held: Monday, January 27th, 2025
Find the session materials and other related content on the GitHub page.
The High-Performance Computing (HPC) Session Instructor:
Aya Nawano, Ph.D., Computational Research Support Analyst
Yale’s AI Clarity Platform Session Instructors:
Wies Rafi, Ph.D., Associate Chief Information Officer of the Health Sciences Division
Hadar Call, Associate Chief Information Officer, Enterprise Systems and Platforms
Learning Goals:
- Understand the services provided by the Yale Center for Research Computing (YCRC), including high-performance computing clusters (HPCs), and how they benefit researchers. Learn how to request access to YCRC’s HPC resources for your research needs.
- Gain insights into Yale’s Clarity platform, its secure AI tools, and enhanced data protection features. Explore future capabilities of the Clarity platform, including creating custom chatbots and integrating AI models into research and teaching applications.
Meet The Creators
Shelby Golden, M.S.
Data Scientist I
Shelby Golden is a data scientist with a background in computational mathematics, molecular biology, and biochemistry. She holds a Master of Science in Applied Computational Mathematics from Johns Hopkins University and dual Bachelor of Science degrees in Molecular, Cellular, and Developmental Biology and in Biochemistry, with a minor in Engineering in Applied Mathematics.
Howard Baik, M.S.
Former Data Scientist I at Yale School of Public Health
Howard Baik is a Data Scientist with a background in biostatistics and computer science. He holds a Master of Science in Biostatistics, and a Bachelor of Science in Statistics with a minor in Mathematics, both from the University of Washington. He is working towards a Post-Baccalaureate Degree in Computer Science from Oregon State University.
Meet The Guest Speakers
Wies Rafi, Ph.D.
Associate Chief Information Officer of the Health Sciences Division
As Associate Chief Information Officer of the Health Sciences Division, Wies reports directly to the IT CIO, John Barden. In this role, he provides strategic IT guidance and support across multiple Yale schools and health entities, enhancing coordination, efficiencies, and communication in research computing, academic technologies, client services, administrative systems, and business analytics.
He was a lecturer for the Yale AI Clarity Platform portion of the "Information Session on HPC and AI/Clarity" on Monday, January 27th, 2025.
Hadar Call
Associate CIO, Enterprise Systems and Platforms
Hadar was appointed Associate CIO of Enterprise Systems and Platforms in November 2023. In this role, she leads teams that manage systems for Yale Advancement, Cultural Heritage, and Facilities, and oversees the IT Learning and Development team to enhance the IT workforce through training and development initiatives.
She was a lecturer for the Yale AI Clarity Platform portion of the "Information Session on HPC and AI/Clarity" on Monday, January 27th, 2025.
Aya Nawano, Ph.D.
Computational Research Support Analyst
Aya Nawano is a Computational Research Support Analyst at the Yale Center for Research Computing, which supports the advanced computing needs of Yale’s research community. She primarily works with researchers in the Faculty of Arts and Sciences who use the Bouchet, Grace, and Milgram HPC clusters. Her expertise includes molecular dynamics, MATLAB, and tightly coupled parallel workflows (MPI).
She was the lecturer for the HPC portion of the "Information Session on HPC and AI/Clarity" on Monday, January 27th, 2025.
Professor Thomas Lumley
Chair in Biostatistics, Department of Statistics, University of Auckland
Thomas Lumley is a Professor in the Department of Statistics at the University of Auckland, specializing in statistical methodology and data analysis, particularly in epidemiology, bioinformatics, and complex surveys. He is known for his significant contributions to statistical software development and bridging the gap between theoretical and practical applications in the analysis of complex datasets.
He was the lecturer for "Choosing Good Subsamples for Measuring New Variables" on Monday, February 3rd, 2025, and "Analyzing Larger Data in R" on Tuesday, February 4th, 2025.