Data Visualization with ggplot2

This material was authored by Shelby Golden, M.S. You can learn more about the instructors here. Special thanks to Professor Daniel Weinberger for allowing us to use his plot code in this workshop.

Introduction

“the grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).” — Professor Leland Wilkinson, Yale alumnus, in The Grammar of Graphics

Throughout the data processing pipeline, whether immediately after data import for exploratory data analysis or in preparation for communicating results, researchers frequently visualize their data. Before Professor Hadley Wickham and his colleagues developed the ggplot2 package, now part of the tidyverse, creating complex or customized graphics beyond typical chart types was challenging and often required tedious, verbose, hard-to-maintain code.

In this workshop, we will explore the evolution of graphical representations of data, which led to the development of the Grammar of Graphics used by ggplot2. This approach replaces rigid chart types with flexible, customizable rules for generating graphs. You will see that different chart types, such as pie charts and bar charts, often represent the same underlying visualization with only minor adjustments.

With a better understanding of the domain language used by ggplot2, we will examine each grammatical layer that makes up a graphical object. We will discuss the order in which these layers are processed relative to one another, how data and aesthetic mappings are inherited by subsequent layers, and how to use each layer as it was designed.

While the core focus of this workshop is on introducing students to the paradigm-shifting approach of visualizing data using graphical grammar instead of traditional charts, we will also discuss some of the advanced features that ggplot2 offers. For example, we will briefly touch on map projections and interactive plots, although these topics will not be covered in depth.

Workshop Learning Goals

  • Classify the Grammar of Graphics layers used in ggplot2 syntax.
  • Apply different geometries, use layering effectively, and polish the result.
  • Explore interactive plots, map projections, and AI-assisted coding.

We have prepared real-world examples, including challenge questions and answers, for the hands-on portion of this workshop. To fully engage with these materials, please download the provided codebase. Detailed instructions for this process can be found below under Accessing the Codespaces.

Slides and Handouts

You can download the complete slide deck with annotations and the in-person workshop handout. Comments are saved in the bottom left corner of each slide, and references for this webpage are located in the Appendix. As you go through the workshop materials, you might find it helpful to refer to the ggplot2 package documentation and the tidyverse cheat sheets, especially the function references page 1–3.

Notice that layer-specific code is summarized at the top of each slide, with unique aspects highlighted in purple. These summaries are not comprehensive but are intended to illustrate the components used in each layer. Students are encouraged to explore the ggplot2 documentation for complete details.

The code used to generate the example graphs demonstrates how each layer contributes to the final figure. To view the entire code, please follow the instructions on the Worked-Through Example page.

Accessing the Codespaces

In this workshop, you’ll need to access the R code prepared for the workshop discussions and challenge questions. If you haven’t already, please download the latest version of R to your device. We also recommend using the latest version of RStudio as your Integrated Development Environment (IDE) 4.

Attribution and Ownership

Please note that all materials provided in this workshop, including any code added to your personal repository, belong to DSDE. When using or referencing this material, please cite it correctly to give proper credit to the original authors.

This workshop was created using R (v 4.5.1) in the RStudio IDE (v 2025.09.2+418). The renv package (v 1.1.5) is included to reproduce the same coding environment, ensuring all relevant packages and package versions are stored. If you encounter issues running the scripts, please check that the environment is initialized and that you are using the same versions of R and RStudio.

Initializing the Environment

  1. Download the prepared codespace, which contains a comprehensive code environment configured with the RStudio IDE for in-class discussions and challenge questions with solutions.
  2. Unzip the downloaded directory and move it to the location where you want to house the project.

    Command-Line Application
    cd "file_path/Downloads/"                         # Open the directory the file was downloaded into.
    unzip Data-Visualization-with-ggplot2.zip         # Unzip the file.
    mv Data-Visualization-with-ggplot2 /new_path/     # Move the unzipped directory to the new location.
  3. Launch the project by opening the *.Rproj file in RStudio.

    Command-Line Application
    cd "/new_path/Data-Visualization-with-ggplot2"    # Navigate to the new location directory.
    
    # Open the *.Rproj file using RStudio
    open Data-Visualization-with-ggplot2.Rproj        # For macOS
    start "" "Data-Visualization-with-ggplot2.Rproj"  # For Windows
  4. Open Discussion and Challenge Questions.R. To ensure this script opens in the correct project RStudio window, open it from the “Files” pane inside RStudio.

  5. In the R console, activate the environment by running the following lines of code. Please note that some of the packages used in this workshop, such as arrow, may take a while to install.

    RStudio Console
    renv::init()          # Initialize the project.
    renv::restore()       # Download packages and their version saved in the lockfile.
Note

If you are asked to update packages, say no. The renv package is intended to recreate the environment in which the project was created, making it reproducible. You are ready to proceed when running renv::restore() gives the output:

RStudio Output
- The library is already synchronized with the lockfile.

If you experience any trouble with this step, you might want to confirm that you are using R (v 4.5.1) in the RStudio IDE (v 2025.09.2+418). You can also read more about renv in its vignette 5.
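
Before troubleshooting versions, it can also help to confirm that the unzipped directory contains the files that RStudio and renv look for. The following is a minimal shell sketch and not part of the workshop materials: check_project is a hypothetical helper, renv.lock is the standard name renv gives its lockfile, and the .Rproj file name and /new_path/ location are taken from the steps above and may differ on your device.

```shell
# Hypothetical helper (not part of the workshop codebase): report whether the
# files needed to open the project and restore its packages are present.
check_project() {
  dir="$1"
  missing=0
  for f in "renv.lock" "Data-Visualization-with-ggplot2.Rproj"; do
    if [ -f "$dir/$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call; replace the placeholder path with your project location.
check_project "/new_path/Data-Visualization-with-ggplot2" ||
  echo "Some files are missing; try unzipping the downloaded codespace again."
```

If either file is reported as missing, the project was most likely opened from the wrong directory or the archive was only partially extracted.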

Note

A .gitignore file has been included in the directory, specifying the files and file types that do not need to be version controlled or shared with GitHub. Please note that on some devices, this file might be hidden depending on your settings.
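
To check whether the file is present but hidden, you can list the directory with the -a flag from a command-line application. The sketch below uses a throwaway directory and two illustrative ignore patterns so that it can run anywhere. The patterns .Rhistory and .RData are assumptions and not necessarily the contents of the workshop’s .gitignore. In practice, you would run ls -a inside the project directory.

```shell
# Throwaway directory standing in for the project folder (hypothetical).
demo_dir="$(mktemp -d)"

# Illustrative ignore patterns only; the workshop's .gitignore may differ.
printf '.Rhistory\n.RData\n' > "$demo_dir/.gitignore"

ls "$demo_dir"               # Default listing omits dotfiles such as .gitignore.
ls -a "$demo_dir"            # The -a flag includes hidden files in the listing.
cat "$demo_dir/.gitignore"   # Print the patterns Git will ignore.
```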

Deploying to GitHub

With a Git-initialized project, you can add it to a remote repository on GitHub. Below are instructions to help you establish a remote copy of this codebase on GitHub. If you are new to using Git and GitHub for code-based projects, please check out our workshop on the topic: Getting Started with Git and GitHub.

  1. Log in to your personal GitHub account.

  2. In the top-right of the page navigation bar, select the dropdown menu and click New repository 6.

  3. Fill out the following sections:

    1. Adjust the GitHub account owner as needed and create the name for the new repository.

    2. It is good practice to initially set the repository to “Private”.

    3. Do NOT use a template or include a description, README.md, .gitignore, or license.

  4. Open the command-line application (e.g., Terminal on macOS or Git Bash on Windows) and navigate to the project directory.

    Command-Line Application
    cd "/file_path/Data-Visualization-with-ggplot2"
  5. Confirm that Git has been initialized by checking the project status. If the project is initialized, you should see that Git is present and tracking files.

    Command-Line Application
    git status

    If you instead see the following message, Git has not been initialized in this directory. Run git init in the project directory before continuing.

    Command-Line Output
    fatal: not a git repository (or any of the parent directories): .git
  6. If git status shows untracked files or unstaged changes to tracked files, add and commit these to the project’s .git directory. If this is your first time committing the directory contents, you will see all files listed as untracked.

    Command-Line Application
    git add .                       # Stage all changed and new files for version tracking.
    git commit -m "first commit"    # Commit the changes with a message.
  7. Name the current branch main.

    Command-Line Application
    git branch -a                   # List all branches to see whether 'main' already exists.
    git branch -M main              # Rename the current branch to main (overwrites an existing 'main').
  8. In the newly created GitHub repository, under “Quick setup,” you will find the repository’s SSH or HTTPS URL. Copy one of these URLs to define the remote location and transfer protocol you want to use between your local device and this GitHub repository.

    For example, if the repository name is “NEW-REPOSITORY,” the URLs will look like this:

    # SSH
    git@github.com:EXAMPLE-USER/NEW-REPOSITORY.git
    
    # HTTPS
    https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git
  9. Set the GitHub repository location and save it as origin. Replace the placeholder URL with the repository URL you copied from GitHub in the previous step.

    Command-Line Application
    # using SSH
    git remote add origin git@github.com:EXAMPLE-USER/NEW-REPOSITORY.git
    
    # or using HTTPS
    git remote add origin https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git
  10. Finally, push the directory to your empty GitHub repository. Since you are establishing a new repository, use -u origin main to create and set the upstream branch in the GitHub repository.

    Command-Line Application
    git push -u origin main         # Push to remote and set upstream.
  11. Refresh the page to confirm the transfer and linking were successful.
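
The local part of steps 5 through 9 can be rehearsed end to end without touching your project or GitHub. The sketch below is a hedged illustration, not the workshop’s own script: it works in a throwaway directory, passes a made-up identity to git commit so that it runs on a device where Git has not been configured, and keeps the placeholder EXAMPLE-USER/NEW-REPOSITORY URL from step 8. It stops before git push, since pushing requires a real repository and credentials.

```shell
# Throwaway directory standing in for the project folder (hypothetical).
repo_dir="$(mktemp -d)"
cd "$repo_dir"
echo "demo" > README.txt

# Step 5: initialize Git only if this directory is not already a repository.
git rev-parse --is-inside-work-tree >/dev/null 2>&1 || git init --quiet

# Step 6: stage and commit. The inline identity is for this demo only.
git add .
git -c user.name="Example User" -c user.email="user@example.com" \
  commit --quiet -m "first commit"

# Step 7: rename the current branch to main.
git branch -M main

# Step 9: register the placeholder GitHub URL under the name origin.
git remote add origin https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git

# Verify the result before pushing.
current_branch="$(git rev-parse --abbrev-ref HEAD)"
origin_url="$(git remote get-url origin)"
echo "branch: $current_branch"
echo "origin: $origin_url"
```

Running git remote -v in your real project gives the same confirmation: it should list origin with the URL you copied from GitHub for both fetch and push.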

About the Data

Respiratory syncytial virus (RSV) is a common virus that infects the nose, throat, and lungs, spreading mainly in fall and winter, with peaks in December and January. While it often causes mild illness in healthy individuals, it can lead to hospitalization in infants under six months and older adults. RSV can cause severe illnesses like bronchiolitis and pneumonia, especially in children under one year, making it a leading cause of these conditions. For more information, visit the CDC’s About RSV page 7.

National trends of RSV infections are monitored through the Centers for Disease Control and Prevention (CDC) Respiratory Syncytial Virus Hospitalization Surveillance Network, known as RSV-NET. RSV-NET is one of three programs that make up the Respiratory Virus Hospitalization Surveillance Network (RESP-NET), alongside the Coronavirus Disease 2019 (COVID-19) Hospitalization Surveillance Network (COVID-NET) and the Influenza Hospitalization Surveillance Network (FluSurv-NET) 8.

As of this writing, 161 counties and county equivalents across 13 states participate in RSV-NET by submitting data on laboratory-confirmed, RSV-associated hospitalizations among children and adults. This surveillance network covers approximately 30 million people, representing about 9% of the U.S. population. Although its demographic representation is comparable to the overall U.S. population, its findings may not be fully generalizable to the entire country. For more information, visit the CDC’s information page about RSV-NET 9.

The cleaned and harmonized version of the RSV-NET dataset was compiled as part of YSPH’s PopHIVE project 10. It reflects data downloaded from the Weekly Rates of Laboratory-Confirmed RSV Hospitalizations from the RSV-NET Surveillance System page on data.gov, which is updated weekly and accessible via the RSV-NET surveillance program’s interactive dashboard 9,11. The data used here were downloaded on December 23, 2024. The code used to prepare the dataset for this module can be found in the YSPH DSDE’s Book of Workshops GitHub repository: Data-Visualization-with-ggplot2/Code/Cleaning Script.R 12–14.

Section Glossary

Integrated Development Environment (IDE): A software application that combines tools for editing, building, testing, and debugging code into a single, user-friendly interface 4.

References

1. Wickham, H. et al. Function references. ggplot2 Documentation.
2. Contributors, P. Data Visualization with ggplot2: Cheat Sheet. (Springer-Verlag, 2024).
3. Wickham, H. et al. ggplot2: Elegant Graphics for Data Analysis Documentation. (Springer-Verlag, 2016).
4.
5. Ushey, K., Wickham, H. & Posit. Introduction to renv.
6. GitHub. Creating a new repository. GitHub Docs.
7. Centers for Disease Control and Prevention (CDC). About RSV. (2025).
8. Centers for Disease Control and Prevention (CDC). Respiratory virus hospitalization surveillance network (RESP-NET). (2025).
9. Centers for Disease Control and Prevention (CDC). Respiratory syncytial virus hospitalization surveillance network (RSV-NET). cdc.gov (2025).
10. Yale School of Public Health. PopHIVE.
11.
12. Answer by Hong Ooi. lm() within mutate() in group_by(). Stack Overflow (2016).
13.
14. Centers for Disease Control and Prevention (CDC). MMWR weeks definition.
15. Wickham, H., Çetinkaya-Rundel, M. & Grolemund, G. R for Data Science. (O’Reilly Media, 2023).
16. Wickham, H., Navarro, D. & Pedersen, T. L. ggplot2: Elegant Graphics for Data Analysis. (Springer, 2010).
17. Pedersen, T. L. ggplot2 workshop part 1. YouTube (2020).
18. Pedersen, T. L. ggplot2 workshop part 2. YouTube (2020).
19. Emaasit, D. & Various. ggplot2 extensions gallery.
20.
21. Wickham, H. et al. Extending ggplot2 vignette. ggplot2 Documentation.
22. Wickham, H. et al. Using ggplot2 in packages vignette. ggplot2 Documentation.
23. Tufte, E. The Visual Display of Quantitative Information. (Graphics Press, LLC, 1983).
24. Kelleher, C. & Wagener, T. Ten guidelines for effective data visualization in scientific publications. Environmental Modelling & Software 26, 822–827 (2011).
25.
26. Tufte, E. Envisioning Information. (Graphics Press, LLC, 1990).
27. Cleveland, W. The Elements of Graphing Data. (Wadsworth Advanced Books and Software, 1985).
28. Cleveland, W. S. Visualizing Data. (Hobart Press, 1995).
29. Wilkinson, L. The Grammar of Graphics. (Springer-Verlag, 1999). doi:10.1007/0-387-28695-0.