Welcome

Materials and setup

NOTE: skip this section if you are not running R locally

You should have R installed. If not, install it before continuing.

Notes and examples for this workshop are available at RCS Workshop Materials.

Start RStudio and create a new project:

  • On Windows click the start button and search for rstudio. On Mac RStudio will be in your applications folder.
  • In RStudio go to File -> New Project.
  • Choose New Directory and New Project.
  • Choose a name and location for your new project directory.

Workshop goals and approach

In this workshop you will

  • learn R basics,
  • learn about the R package ecosystem,
  • practice reading files and manipulating data in R.

A more general goal is to get you comfortable with R so that it seems less scary and mystifying than it perhaps does now. Note that this is by no means a complete or thorough introduction to R! It’s just enough to get you started.

This workshop is relatively informal, example-oriented, and hands-on. We won’t spend much time examining language features in detail. Instead we will work through an example, and learn some things about R along the way.

As an example project we will analyze the popularity of baby names in the US from 1960 through 2017. Among the questions we will use R to answer are:

  • In which year did your name achieve peak popularity?
  • How many children were born each year?
  • What are the most popular names overall? For girls? For boys?

Graphical User Interfaces (GUIs)

There are many different ways you can interact with R. See the Data Science Tools workshop notes for details.

For this workshop we encourage you to use RStudio; it is a good R-specific IDE that mostly just works.

Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio.

  • Start the RStudio program
  • In RStudio, go to File -> New File -> R Script

The window in the upper-left is your R script. This is where you will write instructions for R to carry out.

The window in the lower-left is the R console. This is where results will be displayed.

Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!


  1. Try to get R to add 2 plus 2.
  2. Try to calculate the square root of 10.
  3. R includes extensive documentation, including a manual named “An Introduction to R”. Use the RStudio help pane to locate this manual.

Exercise 0 solution

## 1. 2 plus 2
2 + 2
## [1] 4
## or
sum(2, 2)
## [1] 4
## 2. square root of 10:
sqrt(10)
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
## 3. Find "An Introduction to R".
## Go to the main help page by running help.start() or using the GUI
## menu, then find and click on the link to "An Introduction to R".

R basics

Function calls

The general form for calling R functions is

## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)

Arguments can be matched by name; unnamed arguments will be matched by position.
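As a small illustration of the two matching rules, the sketch below calls the built-in seq() function three ways. The variable names (by_position, by_name, reordered) are made up for this example.

```r
# seq() generates a sequence; its first three arguments are from, to, and by.
by_position <- seq(1, 10, 3)                   # matched by position
by_name     <- seq(from = 1, to = 10, by = 3)  # matched by name
reordered   <- seq(by = 3, to = 10, from = 1)  # named arguments may come in any order

by_position
identical(by_position, by_name)    # same call, written two ways
identical(by_name, reordered)
```

Naming arguments is more typing, but it makes a call easier to read and protects you from getting the order wrong.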

Assignment

Values can be assigned names and used in subsequent operations

  • The <- operator (less than followed by a dash) is used to save values
  • The name on the left gets the value on the right.
sqrt(10) ## calculate square root of 10; result is not stored anywhere
## [1] 3.162278
x <- sqrt(10) ## assign result to a variable named x

Names should start with a letter, and contain only letters, numbers, underscores, and periods.
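Once a value has a name, that name can stand in for the value in later calculations. A minimal sketch, in which x_squared and x.rounded are names invented for illustration:

```r
x <- sqrt(10)              # store the result under the name x
x_squared <- x^2           # reuse x in a later calculation
x.rounded <- round(x, 2)   # periods and underscores are both allowed in names

x_squared
x.rounded
```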

Asking R for help

You can ask R for help using the help() function, or the ? shortcut.

help(help)
?help
?sqrt

The help() function can be used to look up the documentation for a function or for a package. We can learn how to use the stats package by reading its documentation like this:

help(package = "stats")
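If you do not know the exact name of a function, a few other base R tools may help. The sketch below is only a starting point. The interactive() guard is there so the help viewer only opens when you are working at the console.

```r
matches <- apropos("sqrt")   # names of available objects that contain "sqrt"
matches

args(sum)                    # show the arguments a function accepts

if (interactive()) {
  help.search("square root") # search installed documentation; ??"square root" is a shortcut
  example(sqrt)              # run the examples from the sqrt help page
}
```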

Getting data into R

R has data reading functionality built-in – see e.g., help(read.table). However, faster and more robust tools are available, and so to make things easier on ourselves we will use a contributed package called readr instead. This requires that we learn a little bit about packages in R.

Installing and using R packages

A large number of contributed packages are available. If you are looking for a package for a specific task, https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.

You can install a package in R using the install.packages() function. Once a package is installed you may use the library() function to attach it so that it can be used.

## install.packages("readr")
library(readr)
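The install line above is commented out so that the package is not reinstalled every time the code runs. One common alternative, sketched here, is to install only when the package is missing. The helper name is_installed is hypothetical and not part of R.

```r
# Hypothetical helper: TRUE if the named package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install readr only if it is not already available, then attach it.
if (!is_installed("readr")) {
  install.packages("readr")
}
library(readr)
```

Installation happens once per computer, while library() is needed in every new R session.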

Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can import data from common plain-text formats.

Data Type                 Function
comma separated           read_csv()
tab separated             read_tsv()
other delimited formats   read_delim()
whitespace separated      read_table()
fixed width               read_fwf()

Note: You may be confused by the existence of similar functions, e.g., read.csv() and read.delim(). These are legacy functions that tend to be slower and less robust than the readr functions. One way to tell them apart is that the faster, more robust versions use underscores in their names (e.g., read_csv()) while the older functions use dots (e.g., read.csv()). My advice is to use the more robust newer versions, i.e., the ones with underscores.
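To see how the choice of reader follows the file format, here is a small sketch that writes two tiny files to a temporary folder and reads them back. The file contents and the object names (toy_csv, toy_delim) are made up for illustration, and read_delim() is told which delimiter to expect through its delim argument.

```r
library(readr)

# A comma-separated file: use read_csv()
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Count", "Alex,10", "Mark,12"), csv_file)
toy_csv <- read_csv(csv_file)

# A semicolon-separated file: use read_delim() and name the delimiter
txt_file <- tempfile(fileext = ".txt")
writeLines(c("Name;Count", "Alex;10", "Mark;12"), txt_file)
toy_delim <- read_delim(txt_file, delim = ";")

toy_csv
toy_delim
```

Both calls should produce the same two-row table, because only the delimiter differs between the two files.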

Baby names data

The examples in this workshop use US baby names data retrieved from https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. A cleaned and merged version of these data is available at http://tutorials.iq.harvard.edu/data/babyNames.csv.

Exercise 1: Reading the baby names data

Make sure you have installed the readr package and attached it with library(readr).

Baby names data are available at "http://tutorials.iq.harvard.edu/data/babyNames.csv".

  1. Open the read_csv() help page to determine how to use it to read in data.

  2. Read the baby names data using the read_csv() function and assign the result with the name baby_names.

Exercise 1 solution

## read ?read_csv
baby_names <- read_csv("http://tutorials.iq.harvard.edu/data/babyNames.csv")

Popularity of your name

In this section we will pull out specific names and examine changes in their popularity over time.

The baby_names object we created in the last exercise is a data.frame. There are many other data structures in R, but for now we’ll focus on working with data.frames.
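A few base R functions are handy for getting a first look at any data.frame. The sketch below uses a small stand-in table, toy_names (a name made up for this example), built from four 1992 rows for “Alex” and “Mark” with the same columns as baby_names. The same calls should work on baby_names itself.

```r
# A four-row stand-in with the same columns as baby_names
toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

class(toy_names)     # what kind of object is this?
dim(toy_names)       # number of rows, number of columns
names(toy_names)     # column names
str(toy_names)       # compact summary of each column
head(toy_names, 2)   # first two rows
toy_names$Count      # pull out one column as a vector
```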

R has decent data manipulation tools built-in – see e.g., help(Extract). However, in recent years there has been a big surge in well-designed contributed packages for R. In fact, the package readr is part of a powerful collection of R packages designed specifically for data science and called tidyverse. All packages included in tidyverse share an underlying design philosophy, grammar, and data structures.

For this part we will load another tidyverse package called dplyr.

## install.packages("dplyr")
library(dplyr)

Filtering and arranging data

One way to find the year in which your name was the most popular is to filter the data, keeping just the rows corresponding to your name, and then arrange (sort) them by Count.

To demonstrate these techniques we’ll try to determine whether “Alex” or “Mark” was more popular in 1992 among boys. We start by filtering the data so that we keep only rows where Year is equal to 1992 and Name is either “Alex” or “Mark”.

baby_names_alexmark <- filter(baby_names, 
             Year == 1992 & (Name == "Alex" | Name == "Mark"))
baby_names_alexmark
## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Alex  Girls   366  1992
## 2 Mark  Girls    20  1992
## 3 Mark  Boys   8743  1992
## 4 Alex  Boys   7348  1992

Notice that we can combine conditions using & (AND) and | (OR).

In this case we can see that “Mark” is more popular among boys, but to make it even easier we can arrange the data so that the most popular name is listed first.

arrange(baby_names_alexmark, Count)
## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Mark  Girls    20  1992
## 2 Alex  Girls   366  1992
## 3 Alex  Boys   7348  1992
## 4 Mark  Boys   8743  1992
arrange(baby_names_alexmark, desc(Count))
## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Mark  Boys   8743  1992
## 2 Alex  Boys   7348  1992
## 3 Alex  Girls   366  1992
## 4 Mark  Girls    20  1992

Other logical operators

In the previous example we used == to filter rows. Other relational and logical operators are listed below.

Operator Meaning
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
%in% contained in

These operators may be combined with & (and) or | (or).
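Here is a short sketch of a few of these operators inside filter(). It uses a small stand-in table, toy_names, built from four 1992 rows for “Alex” and “Mark”. The object names (toy_names, popular, girls, extremes) are made up for this example.

```r
library(dplyr)

toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

# %in% is a compact alternative to Name == "Alex" | Name == "Mark"
popular <- filter(toy_names, Name %in% c("Alex", "Mark") & Count >= 1000)

# != keeps every row that does NOT match
girls <- filter(toy_names, Sex != "Boys")

# | keeps rows that satisfy either condition
extremes <- filter(toy_names, Count < 100 | Count > 8000)

popular
girls
extremes
```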

Exercise 2.1: Peak popularity of your name

In this exercise you will discover the year your name reached its maximum popularity.

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby_names. The file is located at "http://tutorials.iq.harvard.edu/data/babyNames.csv"

Make sure you have installed the dplyr package and attached it with library(dplyr).

  1. Use filter() to extract data for your name (or another name of your choice).
  2. Arrange the data you produced in step 1 above by Count. In which year was the name most popular?
  3. BONUS (optional): Filter the data to extract only the row containing the most popular boys name in 1999.

Exercise 2.1 solution

# 1.  Use `filter()` to extract data for your name (or another name of your choice).  
baby_names_george <- filter(baby_names, Name == "George")
# 2.  Arrange the data you produced in step 1 above by `Count`. 
#     In which year was the name most popular?
arrange(baby_names_george, desc(Count))
## # A tibble: 111 x 4
##    Name   Sex   Count  Year
##    <chr>  <chr> <dbl> <dbl>
##  1 George Boys  14063  1960
##  2 George Boys  13638  1961
##  3 George Boys  12553  1962
##  4 George Boys  12084  1963
##  5 George Boys  11793  1964
##  6 George Boys  10683  1965
##  7 George Boys   9942  1966
##  8 George Boys   9702  1967
##  9 George Boys   9388  1968
## 10 George Boys   9203  1969
## # ... with 101 more rows
# 3.  BONUS (optional): Filter the data to extract _only_ the 
#     row containing the most popular boys name in 1999.
baby_names_boys1999 <- filter(baby_names, 
                    Year == 1999 & Sex == "Boys")
filter(baby_names_boys1999, Count == max(Count))
## # A tibble: 1 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Jacob Boys  35361  1999

Pipe operator in R

R has a very handy operator called the “pipe”, which looks like this: %>%. It lets you “chain” several function calls: each function returns an object, which is fed into the next call in a single statement, without needing extra variables to store the intermediate results. The point of the pipe is to help you write code that is easier to read and understand, as we will see below.

There is no need to load any additional packages as the operator is made available via the magrittr package installed and loaded as part of dplyr. Let’s rewrite the sequence of commands to output ordered counts for names “Alex” or “Mark”.

baby_names %>% 
  filter(Year == 1992 & (Name == "Alex" | Name == "Mark")) %>%
  arrange(desc(Count))
## # A tibble: 4 x 4
##   Name  Sex   Count  Year
##   <chr> <chr> <dbl> <dbl>
## 1 Mark  Boys   8743  1992
## 2 Alex  Boys   7348  1992
## 3 Alex  Girls   366  1992
## 4 Mark  Girls    20  1992

Hint: try pronouncing “then” whenever you see %>%. Notice that we avoided creating an intermediate variable baby_names_alexmark and performed the entire task in just “one line”!
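It may help to see that a pipeline is just another way of writing nested function calls. The sketch below does the same filter-then-arrange task both ways on a small stand-in table and checks that the results match. The object names (toy_names, nested, piped) are made up for illustration.

```r
library(dplyr)

toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

# Nested calls: read from the inside out
nested <- arrange(filter(toy_names, Sex == "Boys"), desc(Count))

# Piped calls: read from top to bottom
piped <- toy_names %>%
  filter(Sex == "Boys") %>%
  arrange(desc(Count))

identical(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the data argument disappears from filter() and arrange() in the piped version.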

Exercise 2.2: Peak popularity of your name

Rewrite the solution to Exercise 2.1 using pipes. Remember that we were looking for the year your name reached its maximum popularity. For that, we filtered the data and then arranged by Count.

Exercise 2.2 solution

baby_names %>% 
  filter(Name == "George") %>%
  arrange(desc(Count))
## # A tibble: 111 x 4
##    Name   Sex   Count  Year
##    <chr>  <chr> <dbl> <dbl>
##  1 George Boys  14063  1960
##  2 George Boys  13638  1961
##  3 George Boys  12553  1962
##  4 George Boys  12084  1963
##  5 George Boys  11793  1964
##  6 George Boys  10683  1965
##  7 George Boys   9942  1966
##  8 George Boys   9702  1967
##  9 George Boys   9388  1968
## 10 George Boys   9203  1969
## # ... with 101 more rows

Percent choosing one of the top 10 names

You may have noticed that the percentage of babies given the most popular name of the year appears to have decreased over time. We can compute a more robust measure of the popularity of the most popular names by calculating the number of babies given one of the top 10 girl or boy names of the year.

In order to compute this measure we need to operate within groups, collapsing each group into a single summary statistic. We can achieve this using the summarize() function.

First, let’s see how this function works without grouping. The following code outputs the total number of girls in the data:

baby_names %>% 
  filter(Sex == "Girls") %>%
  summarize(Girls_n = sum(Count))
## # A tibble: 1 x 1
##     Girls_n
##       <dbl>
## 1 101422255

Next, using group_by() and summarize() together, we can calculate the number of babies born each year:

bn_by_year <-
  baby_names %>%
  group_by(Year) %>%
  summarize(Total = sum(Count))

head(bn_by_year)
## # A tibble: 6 x 2
##    Year   Total
##   <dbl>   <dbl>
## 1  1960 4154377
## 2  1961 4140244
## 3  1962 4035234
## 4  1963 3958791
## 5  1964 3887800
## 6  1965 3626029
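The same pattern works for any grouping variable. As a sketch, the code below groups a small stand-in table by Sex, first to collapse each group to a total and then to keep the most popular name within each group. The object names (toy_names, totals_by_sex, top_by_sex) are made up for illustration.

```r
library(dplyr)

toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

# Collapse each group to one row.
# Boys: 8743 + 7348 = 16091; Girls: 366 + 20 = 386
totals_by_sex <- toy_names %>%
  group_by(Sex) %>%
  summarize(Total = sum(Count))

# filter() also respects groups: max(Count) is computed separately for each Sex
top_by_sex <- toy_names %>%
  group_by(Sex) %>%
  filter(Count == max(Count))

totals_by_sex
top_by_sex
```

summarize() returns one row per group, while a grouped filter() keeps the original rows that pass the test within their own group.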

Saving our Work

Now that we have made some changes to our data set, we might want to save those changes to a file.

You can list all the objects in your current workspace using ls():

ls() # list objects in our workspace
# rm(list=ls()) # remove all objects from our workspace 

Data frames can be saved using the write_csv() and write_rds() functions from the readr package.

# write data to a .csv file
write_csv(baby_names, "babyNames.csv")
# write data to an .rds file (R's own serialized format)
write_rds(baby_names, "babyNames.rds")
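To check that a saved file can be read back, you can do a quick round trip. The sketch below writes a small stand-in table to a temporary folder and reads it back with read_rds() and read_csv(). The file names and object names are made up for illustration.

```r
library(readr)

toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

# Write to a temporary folder so nothing in the project is overwritten
rds_file <- file.path(tempdir(), "toy_names.rds")
csv_file <- file.path(tempdir(), "toy_names.csv")
write_rds(toy_names, rds_file)
write_csv(toy_names, csv_file)

# Read both files back in
from_rds <- read_rds(rds_file)
from_csv <- read_csv(csv_file)

all.equal(from_rds, toy_names)   # the .rds copy should match the original object
from_csv
```

A .csv file is plain text that other programs can open, while an .rds file stores the R object itself, including its column types.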

Best Practices for Writing R Code

  • Start each program with a description of what it does.
  • Then load all required packages.
  • Consider what working directory you are in when sourcing a script.
  • Use comments to mark off sections of code.
  • Put function definitions at the top of your file, or in a separate file if there are many.
  • Name and style code consistently.
  • Break code into small, discrete pieces.
  • Factor out common operations rather than repeating them.
  • Keep all of the source files for a project in one directory and use relative paths to access them.
  • Keep track of the memory used by your program.
  • Always start with a clean environment instead of saving the workspace.
  • Keep track of session information in your project folder.
  • Have someone else review your code.
  • Use version control.
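As a rough sketch of how several of these suggestions might look in one script, here is a possible layout. It is only one of many reasonable ways to organize a file. The helper total_by_year and the stand-in data are made up for illustration.

```r
## ------------------------------------------------------------
## Description: total number of babies per year (illustrative)
## ------------------------------------------------------------

## Packages ---------------------------------------------------
library(dplyr)

## Functions --------------------------------------------------
# Hypothetical helper: total Count per Year for a table shaped like baby_names
total_by_year <- function(data) {
  data %>%
    group_by(Year) %>%
    summarize(Total = sum(Count))
}

## Data -------------------------------------------------------
# In a real project you would read a file using a relative path, e.g.
# baby_names <- readr::read_csv("babyNames.csv")
# A small stand-in table is used here so the sketch runs on its own.
toy_names <- data.frame(
  Name  = c("Alex", "Mark", "Mark", "Alex"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 8743, 7348),
  Year  = c(1992, 1992, 1992, 1992)
)

## Analysis ---------------------------------------------------
yearly_totals <- total_by_year(toy_names)
yearly_totals

## Session information ----------------------------------------
session_log <- capture.output(sessionInfo())
# writeLines(session_log, "sessionInfo.txt")  # uncomment to save in the project folder
```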

Wrap-up

Help us make this workshop better!

Please take a moment to fill out a very short feedback form. These workshops exist for you – tell us what you need! http://tinyurl.com/R-intro-feedback