Introduction to Programming in R and Data Wrangling with tidyverse

Updated: February 16th, 2026

This material represents two versions of the same workshop: one for graduate students during the academic year, and one for undergraduate students in the Big Data Summer Immersion at Yale (BDSY) program, which includes an introduction to programming in R.

The introduction to programming in R was authored by Shelby Golden, M.S., and the introduction to tidyverse is a collaborative effort by Shelby Golden, M.S., and Howard Baik, M.S. You can learn more about the instructors here.

Workshop Learning Goals

Over the course of the workshop chapter, we will take you through:

Programming in R Learning Goals:

Tip: Coming Soon!

Data Wrangling with tidyverse Learning Goals:

  • Explore the tidyverse ecosystem and its integrated approach to data analysis using a domain-specific language.
  • Develop proficiency with tidyr, dplyr, and stringr through a real-world worked example.
  • Practice data manipulation skills by answering prepared questions using COVID-19 data.

We have prepared real-world examples, including questions and solutions, for the hands-on portion of this workshop. To fully engage with these materials, please download the provided codebase. Detailed instructions for this process can be found below under Accessing the Codespaces.

Slides and Handouts

Download the complete slide decks with annotations and the in-person workshop handouts. Comments have been saved in the bottom left corner of each slide, and references for this webpage are located in the Appendix.

Programming in R Materials:

Tip: Coming Soon!

Data Wrangling with tidyverse Materials:

Accessing the Codespaces

In this workshop, you’ll work through questions written as R scripts. If you haven’t already, please install the latest version of R on your device. We also recommend using the latest version of RStudio as your Integrated Development Environment (IDE).

Important: Attribution and Ownership

Please note that all materials provided in this workshop, including any code added to your personal repository, belong to DSDE. When using or referencing this material, please cite it correctly to give proper credit to the original authors.

This workshop was created using R (v 4.5.2) in the RStudio IDE (v 2026.01.0+392). The renv package (v 1.1.7) is included to reproduce the same coding environment by recording all relevant packages and their versions. If you encounter issues running the scripts, please check that the environment is initialized and that you are using the same versions of R and RStudio.

Initializing the Environment

  1. Download the prepared codebase, which is configured as an RStudio project and includes code for in-class questions with solutions. Unzip the downloaded directory and move it to the location where you want to keep the project.

    Command-Line Application
    cd "file_path/Downloads/"                               # Open the directory the file was downloaded into.
    unzip A-Journey-into-the-World-of-tidyverse.zip         # Unzip the file.
    mv A-Journey-into-the-World-of-tidyverse "/new_path/"   # Move the unzipped directory to the new location.
  2. Launch the project by opening the *.Rproj file in RStudio.

    Command-Line Application
    cd "/new_path/A-Journey-into-the-World-of-tidyverse"    # Navigate to the new location directory.
    
    # Open the *.Rproj file using RStudio
    open A-Journey-into-the-World-of-tidyverse.Rproj        # For macOS
    start "" "A-Journey-into-the-World-of-tidyverse.Rproj"  # For Windows
  3. Open Questions.R. To make sure the script opens in the project’s RStudio window, open it from the “Files” pane inside RStudio.

  4. In the R console, activate the environment by running the following lines of code.

    RStudio Console
    renv::init()          # Initialize the project.
    renv::restore()       # Install the packages and versions recorded in the lockfile.
Note

If you are asked to update packages, say no. The renv package is intended to recreate the environment in which the project was created, making it reproducible. You are ready to proceed when running renv::restore() gives the output:

RStudio Output
- The library is already synchronized with the lockfile.

If you experience any trouble with this step, confirm that you are using the versions of R and RStudio listed above. You can also read more about renv in its vignette [1].

Note

A .gitignore file has been included in the directory, specifying the files and file types that do not need to be version controlled or shared with GitHub. Please note that on some devices, this file might be hidden depending on your settings.
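One way to check for hidden files from the command line is sketched below. It assumes a Unix-like shell (Terminal on macOS or Git Bash on Windows). The scratch directory and the two ignore patterns are illustrative only and are not the contents of the workshop's actual .gitignore.

```shell
# Create a scratch directory containing only a hidden file (illustrative).
demo_dir="$(mktemp -d)"
printf '.Rhistory\n.RData\n' > "$demo_dir/.gitignore"

# Plain ls skips names that start with a dot; ls -a includes them.
visible="$(ls "$demo_dir")"
all_files="$(ls -a "$demo_dir")"

echo "ls shows:    ${visible:-nothing}"
echo "ls -a shows:"
echo "$all_files"

# Remove the scratch directory again.
rm -r "$demo_dir"
```

In the project directory itself, running ls -a should list .gitignore alongside the other files.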

Deploying to GitHub

With a Git-initialized project, you can add it to a remote repository on GitHub. Below are instructions to help you establish a remote copy of this codebase on GitHub. If you are new to using Git and GitHub for code-based projects, please check out our workshop on the topic: Getting Started with Git and GitHub.

  1. Log in to your personal GitHub account.

  2. In the top-right corner of the page navigation bar, open the dropdown menu and click New repository [2].

  3. Fill out the following sections:

    1. Adjust the GitHub account owner as needed and create the name for the new repository.

    2. It is good practice to initially set the repository to “Private”.

    3. Do NOT use a template or include a description, README.md, .gitignore, or license.

  4. Open the command-line application (e.g., Terminal on macOS or Git Bash on Windows) and navigate to the project directory.

    Command-Line Application
    cd "/file_path/A-Journey-into-the-World-of-tidyverse"
  5. Confirm that Git has been initialized by checking the project status. If the project is initialized, you should see that Git is present and tracking files.

    Command-Line Application
    git status

    If you instead see the following message, you will need to initialize Git first.

    Command-Line Output
    fatal: not a git repository (or any of the parent directories): .git
    Command-Line Application
    git init      # Initialize Git tracking.
  6. If git status shows untracked files or untracked changes to tracked files, add and commit these to the project’s .git directory. If this is your first time committing the directory contents, you will see all files listed as untracked.

    Command-Line Application
    git add .                       # Stage all changed and new files for version tracking.
    git commit -m "first commit"    # Commit the changes with a message.
  7. Make sure the current branch is named main.

    Command-Line Application
    git branch -a                   # List all branches to see whether 'main' already exists.
    git branch -M main              # Rename the current branch to main.
  8. In the newly created GitHub repository, under “Quick setup,” you will find the repository’s SSH or HTTPS URL. Copy one of these URLs to define the remote location and transfer protocol you want to use between your local device and this GitHub repository.

    For example, if the repository name is “NEW-REPOSITORY,” the URLs will look like this:

    # SSH
    git@github.com:EXAMPLE-USER/NEW-REPOSITORY.git
    
    # HTTPS
    https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git
  9. Set the GitHub repository location and save it as origin. Replace the placeholder URL with the repository URL you copied from GitHub in the previous step.

    Command-Line Application
    # using SSH
    git remote add origin git@github.com:EXAMPLE-USER/NEW-REPOSITORY.git
    
    # or using HTTPS
    git remote add origin https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git
  10. Finally, push the directory to your empty GitHub repository. Since you are establishing a new repository, use -u origin main to create and set the upstream branch in the GitHub repository.

    Command-Line Application
    git push -u origin main         # Push to remote and set upstream.
  11. Refresh the page to confirm the transfer and linking were successful.
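The local part of these steps (initialize, commit, name the branch, register the remote) can be rehearsed in a throwaway directory before touching the real project. The sketch below assumes Git is installed. The README.md file, the user name and email, and the remote URL are placeholders, and nothing is pushed.

```shell
# Work in a throwaway directory so the real project is untouched.
repo_dir="$(mktemp -d)"
cd "$repo_dir" || exit 1

git init --quiet                          # Step 5: initialize Git tracking.
echo "# scratch" > README.md              # Placeholder file (hypothetical).
git add .                                 # Step 6: stage all new files.
# The -c options supply a throwaway identity for this one commit only.
git -c user.name="Example User" -c user.email="user@example.com" \
    commit --quiet -m "first commit"
git branch -M main                        # Step 7: name the current branch main.

# Step 9: register the placeholder URL under the name origin.
git remote add origin https://github.com/EXAMPLE-USER/NEW-REPOSITORY.git

# Inspect the result before any push.
current_branch="$(git rev-parse --abbrev-ref HEAD)"
origin_url="$(git remote get-url origin)"
echo "branch: $current_branch"
echo "origin: $origin_url"
```

If the branch and remote URL printed here match what you expect, the same checks (git rev-parse --abbrev-ref HEAD and git remote get-url origin) can be run in the real project directory before pushing.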

About the Data

The Johns Hopkins Coronavirus Resource Center (JHU CRC) tracked and compiled global COVID-19 pandemic data from January 22, 2020, to March 10, 2023 [3]. The data are publicly available through their two GitHub repositories. For this workshop, we imported the cumulative case and death counts for the U.S. from their CSSEGISandData GitHub repository. The raw files for these two datasets are in the csse_covid_19_data/csse_covid_19_time_series subdirectory (original source): time_series_covid19_confirmed_US.csv and time_series_covid19_deaths_US.csv [4,5].
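As a rough sketch, the locations of the two raw files can be assembled from the pieces named above. The base URL is an assumption: it presumes the files are served from the COVID-19 repository under the CSSEGISandData account on its master branch, which may differ from the snapshot used for this workshop. The download commands are left commented out because they need network access.

```shell
# Assumed location of the upstream repository (adjust if the layout differs).
base_url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master"
subdir="csse_covid_19_data/csse_covid_19_time_series"

# Full URLs for the two U.S. time-series files named in the text.
cases_url="$base_url/$subdir/time_series_covid19_confirmed_US.csv"
deaths_url="$base_url/$subdir/time_series_covid19_deaths_US.csv"

echo "$cases_url"
echo "$deaths_url"

# To download the files (requires network access), uncomment:
# curl -L -O "$cases_url"
# curl -L -O "$deaths_url"
```

The cleaned datasets shipped with the workshop codebase are the ones used in the exercises, so downloading the raw files is only needed to retrace the cleaning steps.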

The data dictionary provided by JHU CRC can be found here: Cases and Deaths Datasets Data Dictionary [6]. For our purposes, we cleaned and harmonized the data and smoothed it using isotonic regression. This included harmonizing the U.S. Census Bureau’s 2010 to 2019 population projections with the 2020 to 2023 vintages.

Details about these steps can be found in the Intro-to-Programming-in-R/Code directory (link to code). The cleaned datasets are in the Intro-to-Programming-in-R/Data directory (link to data).

References

1. Ushey, K., Wickham, H. & Posit. Introduction to renv.
2. GitHub. Creating a new repository. GitHub Docs.
3. Moss, Dr. B. et al. Johns Hopkins Coronavirus Resource Center. (2020).
4. Johns Hopkins University Coronavirus Resource Center. Time series COVID-19 cases and deaths US. Center for Systems Science and Engineering (CSSE) (2023).
5. Johns Hopkins University Coronavirus Resource Center. CSSEGISandData. Center for Systems Science and Engineering (CSSE) (2020).
6. Johns Hopkins University Coronavirus Resource Center. CSSEGISandData data dictionary. Center for Systems Science and Engineering (CSSE) (2020).