NOTE: skip this section if you are not running R locally
You should have R installed. If not:
Notes and examples for this workshop are available at RCS Workshop Materials.
Start RStudio and create a new project:
File -> New Project, then choose New Directory and New Project.
In this workshop you will
A more general goal is to get you comfortable with R so that it seems less scary and mystifying than it perhaps does now. Note that this is by no means a complete or thorough introduction to R! It’s just enough to get you started.
This workshop is relatively informal, example-oriented, and hands-on. We won’t spend much time examining language features in detail. Instead we will work through an example, and learn some things about R along the way.
As an example project we will analyze the popularity of baby names in the US from 1960 through 2017. Among the questions we will use R to answer are:
There are many different ways you can interact with R. See the Data Science Tools workshop notes for details.
For this workshop we encourage you to use RStudio; it is a good R-specific IDE that mostly just works.
Note: skip this section if you are not using RStudio.
The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
The window in the lower-left is the R console. This is where results will be displayed.
The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to figure it out.
Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!
Exercise 0 solution ———————————————————————-
## 1. 2 plus 2
2 + 2
## [1] 4
## or
sum(2, 2)
## [1] 4
## 2. square root of 10:
sqrt(10)
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
## 3. Find "An Introduction to R".
## Go to the main help page by running 'help.start()' or using the GUI
## menu, find and click on the link to "An Introduction to R".
The general form for calling R functions is
## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)
Arguments can be matched by name; unnamed arguments will be matched by position.
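As a small illustration of argument matching, the sketch below calls the base R function `seq()` with named, positional, and mixed arguments. The particular values are arbitrary and chosen only for the example.

```r
## Named arguments: the order does not matter
a <- seq(from = 1, to = 10, by = 3)
b <- seq(by = 3, to = 10, from = 1)

## Unnamed arguments are matched by position: from, to, by
c_pos <- seq(1, 10, 3)

## Mixed: the first two are matched by position, `by` is matched by name
d <- seq(1, 10, by = 3)

a                     ## [1]  1  4  7 10
identical(a, b)       ## TRUE
identical(a, c_pos)   ## TRUE
```

Naming arguments tends to make calls easier to read, especially for functions with many arguments.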
Values can be assigned names and used in subsequent operations. The `<-` operator (less than followed by a dash) is used to save values.
sqrt(10) ## calculate square root of 10; result is not stored anywhere
## [1] 3.162278
x <- sqrt(10) ## assign result to a variable named x
Names should start with a letter, and contain only letters, numbers, underscores, and periods.
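A short sketch of how a stored value can be reused in later calculations. The name `x_squared` is made up for this example; it also shows that underscores are allowed in names.

```r
x <- sqrt(10)        ## store the result under the name x
x_squared <- x^2     ## reuse the stored value in a later calculation

round(x, 2)                ## [1] 3.16
all.equal(x_squared, 10)   ## TRUE (up to floating point tolerance)
```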
You can ask R for help using the `help()` function, or the `?` shortcut.
help(help)
?help
?sqrt
The `help()` function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the `stats` package by reading its documentation like this:
help(package = "stats")
R has data reading functionality built-in – see e.g., `help(read.table)`. However, faster and more robust tools are available, and so to make things easier on ourselves we will use a contributed package called `readr` instead. This requires that we learn a little bit about packages in R.
A large number of contributed packages are available. If you are looking for a package for a specific task, https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.
You can install a package in R using the `install.packages()` function. Once a package is installed you may use the `library()` function to attach it so that it can be used.
## install.packages("readr")
library(readr)
In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can import data from common plain-text formats.
Data Type | Function |
---|---|
comma separated | read_csv() |
tab separated | read_tsv() |
other delimited formats | read_delim() |
whitespace separated | read_table() |
fixed width | read_fwf() |
Note: You may be confused by the existence of similar functions, e.g., `read.csv()` and `read.delim()`. These are legacy functions that tend to be slower and less robust than the `readr` functions. One way to tell them apart is that the faster, more robust versions use underscores in their names (e.g., `read_csv()`) while the older functions use dots (e.g., `read.csv()`). My advice is to use the more robust newer versions, i.e., the ones with underscores.
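To see the two families side by side, here is a minimal sketch that reads the same tiny CSV file with both functions. The file is a temporary one created just for this example, and the object names `toy_readr` and `toy_base` are hypothetical.

```r
library(readr)

## Write a tiny CSV to a temporary file (illustrative rows only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Sex,Count,Year",
             "Alex,Boys,7348,1992",
             "Mark,Boys,8743,1992"),
           tmp)

toy_readr <- read_csv(tmp)   ## readr: returns a tibble
toy_base  <- read.csv(tmp)   ## base R: returns a plain data.frame

class(toy_readr)
class(toy_base)
```

Both calls return the same two rows; the difference is in the class of the result and in how column types are guessed and reported.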
The examples in this workshop use US baby names data retrieved from https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. A cleaned and merged version of these data is available at http://tutorials.iq.harvard.edu/data/babyNames.csv.
Make sure you have installed the `readr` package and attached it with `library(readr)`.
Baby names data are available at "http://tutorials.iq.harvard.edu/data/babyNames.csv".
Open the `read_csv()` help page to determine how to use it to read in data.
Read the baby names data using the `read_csv()` function and assign the result with the name `baby_names`.
Exercise 1 solution ———————————————————————-
## read ?read_csv
baby_names <- read_csv("http://tutorials.iq.harvard.edu/data/babyNames.csv")
In this section we will pull out specific names and examine changes in their popularity over time.
The `baby_names` object we created in the last exercise is a `data.frame`. There are many other data structures in R, but for now we’ll focus on working with `data.frames`.
R has decent data manipulation tools built-in – see e.g., `help(Extract)`. However, in recent years there has been a big surge in well-designed contributed packages for R. In fact, the package `readr` is part of a powerful collection of R packages designed specifically for data science, called the `tidyverse`. All packages included in the `tidyverse` share an underlying design philosophy, grammar, and data structures.
For this part we will load another `tidyverse` package called `dplyr`.
## install.packages("dplyr")
library(dplyr)
One way to find the year in which your name was the most popular is to filter out just the rows corresponding to your name, and then arrange (sort) by Count.
To demonstrate these techniques we’ll try to determine whether “Alex” or “Mark” was more popular in 1992 among boys. We start by filtering the data so that we keep only rows where Year is equal to `1992` and Name is either “Alex” or “Mark”.
baby_names_alexmark <- filter(baby_names,
Year == 1992 & (Name == "Alex" | Name == "Mark"))
baby_names_alexmark
## # A tibble: 4 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 Alex Girls 366 1992
## 2 Mark Girls 20 1992
## 3 Mark Boys 8743 1992
## 4 Alex Boys 7348 1992
Notice that we can combine conditions using `&` (AND) and `|` (OR).
In this case we can see that “Mark” is more popular among boys, but to make it even easier we can arrange the data so that the most popular name is listed first.
arrange(baby_names_alexmark, Count)
## # A tibble: 4 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 Mark Girls 20 1992
## 2 Alex Girls 366 1992
## 3 Alex Boys 7348 1992
## 4 Mark Boys 8743 1992
arrange(baby_names_alexmark, desc(Count))
## # A tibble: 4 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 Mark Boys 8743 1992
## 2 Alex Boys 7348 1992
## 3 Alex Girls 366 1992
## 4 Mark Girls 20 1992
In the previous example we used `==` to filter rows. Other relational and logical operators are listed below.
Operator | Meaning |
---|---|
`==` | equal to |
`!=` | not equal to |
`>` | greater than |
`>=` | greater than or equal to |
`<` | less than |
`<=` | less than or equal to |
`%in%` | contained in |
These operators may be combined with `&` (and) or `|` (or).
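The sketch below tries a few of these operators on a small made-up table with the same columns as `baby_names`. Some of the counts are invented for illustration, and the names `toy`, `with_or`, `with_in`, and `big_boys` are hypothetical.

```r
library(dplyr)

## Illustrative rows in the same shape as baby_names
toy <- tibble(
  Name  = c("Alex", "Mark", "Mary", "Alex", "Mark"),
  Sex   = c("Boys", "Boys", "Girls", "Girls", "Boys"),
  Count = c(7348, 8743, 8450, 366, 8359),
  Year  = c(1992, 1992, 1992, 1992, 1993)
)

## %in% is a compact alternative to chaining == with |
with_or <- filter(toy, Year == 1992 & (Name == "Alex" | Name == "Mark"))
with_in <- filter(toy, Year == 1992 & Name %in% c("Alex", "Mark"))
identical(with_or, with_in)   ## TRUE

## != and >= combine in the same way
big_boys <- filter(toy, Sex != "Girls" & Count >= 8000)
big_boys
```

Using `%in%` becomes more convenient as the list of values to match grows.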
In this exercise you will discover the year your name reached its maximum popularity.
Read in the “babyNames.csv” file if you have not already done so, assigning the result to `baby_names`. The file is located at "http://tutorials.iq.harvard.edu/data/babyNames.csv".
Make sure you have installed the `dplyr` package and attached it with `library(dplyr)`.
1. Use `filter()` to extract data for your name (or another name of your choice).
2. Arrange the data you produced in step 1 above by `Count`. In which year was the name most popular?
3. BONUS (optional): Filter the data to extract only the row containing the most popular boys name in 1999.
Exercise 2.1 solution ———————————————————————-
# 1. Use `filter()` to extract data for your name (or another name of your choice).
baby_names_george <- filter(baby_names, Name == "George")
# 2. Arrange the data you produced in step 1 above by `Count`.
# In which year was the name most popular?
arrange(baby_names_george, desc(Count))
## # A tibble: 111 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 George Boys 14063 1960
## 2 George Boys 13638 1961
## 3 George Boys 12553 1962
## 4 George Boys 12084 1963
## 5 George Boys 11793 1964
## 6 George Boys 10683 1965
## 7 George Boys 9942 1966
## 8 George Boys 9702 1967
## 9 George Boys 9388 1968
## 10 George Boys 9203 1969
## # ... with 101 more rows
# 3. BONUS (optional): Filter the data to extract _only_ the
# row containing the most popular boys name in 1999.
baby_names_boys1999 <- filter(baby_names,
Year == 1999 & Sex == "Boys")
filter(baby_names_boys1999, Count == max(Count))
## # A tibble: 1 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 Jacob Boys 35361 1999
R has a very handy operator called the “pipe”, which looks like this: `%>%`. It allows you to “chain” several function calls: each function returns an object, and the pipe feeds that object into the next call in a single statement, without needing extra variables to store the intermediate results. The point of the pipe is to help you write code that is easier to read and understand, as we will see below.
There is no need to load any additional packages: the operator comes from the `magrittr` package, which is installed along with `dplyr` and made available when `dplyr` is attached. Let’s rewrite the sequence of commands to output ordered counts for the names “Alex” and “Mark”.
baby_names %>%
filter(Year == 1992 & (Name == "Alex" | Name == "Mark")) %>%
arrange(desc(Count))
## # A tibble: 4 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 Mark Boys 8743 1992
## 2 Alex Boys 7348 1992
## 3 Alex Girls 366 1992
## 4 Mark Girls 20 1992
Hint: try pronouncing “then” whenever you see `%>%`. Notice that we avoided creating the intermediate variable `baby_names_alexmark` and performed the entire task in just “one line”!
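To make the equivalence concrete, the following sketch writes the same operation once as a nested call and once with pipes, using a four-row table that mirrors the 1992 output shown earlier. The names `toy`, `nested`, and `piped` are made up for the example.

```r
library(dplyr)

toy <- tibble(
  Name  = c("Alex", "Mark", "Alex", "Mark"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(366, 20, 7348, 8743),
  Year  = 1992
)

## Nested call: has to be read from the inside out
nested <- arrange(filter(toy, Sex == "Boys"), desc(Count))

## Piped call: reads top to bottom, "toy, then filter, then arrange"
piped <- toy %>%
  filter(Sex == "Boys") %>%
  arrange(desc(Count))

identical(nested, piped)   ## TRUE
```

The pipe passes the object on its left as the first argument of the function on its right, which is why the two versions give the same result.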
Rewrite the solution to Exercise 2.1 using pipes. Remember that we were looking for the year your name reached its maximum popularity. For that, we filtered the data and then arranged by Count.
Exercise 2.2 solution ———————————————————————-
baby_names %>%
filter(Name == "George") %>%
arrange(desc(Count))
## # A tibble: 111 x 4
## Name Sex Count Year
## <chr> <chr> <dbl> <dbl>
## 1 George Boys 14063 1960
## 2 George Boys 13638 1961
## 3 George Boys 12553 1962
## 4 George Boys 12084 1963
## 5 George Boys 11793 1964
## 6 George Boys 10683 1965
## 7 George Boys 9942 1966
## 8 George Boys 9702 1967
## 9 George Boys 9388 1968
## 10 George Boys 9203 1969
## # ... with 101 more rows
It can be difficult to spot trends when looking at summary tables. Plotting the data makes it easier to identify interesting patterns.
R has decent plotting tools built-in – see e.g., `help(plot)`. However, again, we will make use of an excellent contributed package from the `tidyverse` called `ggplot2`.
## install.packages("ggplot2")
library(ggplot2)
For quick and simple plots we can use the `qplot()` function. For example, we can plot the number of babies given the name “Diana” over time like this:
baby_names_diana <- filter(baby_names, Name == "Diana")
qplot(x = Year, y = Count,
data = baby_names_diana)
Interestingly, there are usually some gender-atypical names, even for very strongly gendered names like “Diana”. Splitting these trends out by Sex is very easy:
qplot(x = Year, y = Count, color = Sex,
data = baby_names_diana)
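Note that `qplot()` has been deprecated since ggplot2 3.4.0, so recent versions print a warning when it is used. A rough equivalent using `ggplot()` is sketched below; the data frame `toy_diana` and its counts are made up so that the example runs without downloading anything.

```r
library(ggplot2)

## Invented counts with the same columns as baby_names_diana
toy_diana <- data.frame(
  Year  = c(1960, 1960, 1961, 1961),
  Sex   = c("Girls", "Boys", "Girls", "Boys"),
  Count = c(6000, 10, 5800, 12)
)

## Same mapping as qplot(x = Year, y = Count, color = Sex, data = ...)
p <- ggplot(toy_diana, aes(x = Year, y = Count, color = Sex)) +
  geom_point()
p
```

Replacing `geom_point()` with `geom_line()` corresponds to passing `geom = "line"` to `qplot()`.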
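Make sure the `ggplot2` package is installed, and that you have attached it using `library(ggplot2)`.
1. Use `filter()` to extract data for your name (same as in previous exercise).
2. Plot the data you produced in step 1 above, with `Year` on the x-axis and `Count` on the y-axis.
3. Adjust the plot so that it shows boys and girls in different colors.
4. BONUS (Optional): Adjust the plot to use lines instead of points.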
Exercise 3 solution ———————————————————————-
# 1. Use `filter()` to extract data for your name (same as previous exercise)
baby_names_george <- filter(baby_names, Name == "George")
# 2. Plot the data you produced in step 1 above, with `Year` on the x-axis
# and `Count` on the y-axis.
qplot(x = Year, y = Count, data = baby_names_george)
# 3. Adjust the plot so that it shows boys and girls in different colors.
qplot(x = Year, y = Count, color = Sex, data = baby_names_george)
# 4. BONUS (Optional): Adjust the plot to use lines instead of points.
qplot(x = Year, y = Count, color = Sex, data = baby_names_george, geom = "line")
Our next goal is to find out which names have been the most popular.
So far we’ve used `Count` as a measure of popularity. A better approach is to use proportion or rank to avoid confounding popularity with the number of babies born in a given year.
The `mutate()` function makes it easy to add or modify the columns of a `data.frame`. For example, we can use it to rescale the count of each name in each year:
baby_names <- mutate(baby_names, Count_1K = Count/1000)
baby_names ## same as print(baby_names)
## # A tibble: 1,352,203 x 5
## Name Sex Count Year Count_1K
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Mary Girls 51474 1960 51.5
## 2 Susan Girls 39200 1960 39.2
## 3 Linda Girls 37314 1960 37.3
## 4 Karen Girls 36376 1960 36.4
## 5 Donna Girls 34133 1960 34.1
## 6 Lisa Girls 33702 1960 33.7
## 7 Patricia Girls 32102 1960 32.1
## 8 Debra Girls 26737 1960 26.7
## 9 Cynthia Girls 26725 1960 26.7
## 10 Deborah Girls 25264 1960 25.3
## # ... with 1,352,193 more rows
Notice that executing the second line asked R to print all 1.35 million rows of our `data.frame` (though R was “smart” enough to truncate the output at 10 rows and give us a gentle reminder that there were “1,352,193 more”)! If we would just like to glance at the first 6 lines we can use `head()`:
head(baby_names)
## # A tibble: 6 x 5
## Name Sex Count Year Count_1K
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Mary Girls 51474 1960 51.5
## 2 Susan Girls 39200 1960 39.2
## 3 Linda Girls 37314 1960 37.3
## 4 Karen Girls 36376 1960 36.4
## 5 Donna Girls 34133 1960 34.1
## 6 Lisa Girls 33702 1960 33.7
Finally, the `select()` function allows us to subset the `data.frame` by columns. We can then assign the output to a new object.
baby_names_scaled <- select(baby_names, Name, Sex, Year, Count_1K)
head(baby_names_scaled)
## # A tibble: 6 x 4
## Name Sex Year Count_1K
## <chr> <chr> <dbl> <dbl>
## 1 Mary Girls 1960 51.5
## 2 Susan Girls 1960 39.2
## 3 Linda Girls 1960 37.3
## 4 Karen Girls 1960 36.4
## 5 Donna Girls 1960 34.1
## 6 Lisa Girls 1960 33.7
`select()` can also be used with pipes:
baby_names %>%
select(Name, Sex, Year, Count_1K) %>%
head
## # A tibble: 6 x 4
## Name Sex Year Count_1K
## <chr> <chr> <dbl> <dbl>
## 1 Mary Girls 1960 51.5
## 2 Susan Girls 1960 39.2
## 3 Linda Girls 1960 37.3
## 4 Karen Girls 1960 36.4
## 5 Donna Girls 1960 34.1
## 6 Lisa Girls 1960 33.7
Because of the nested nature of our data, we want to compute rank or proportion within each `Sex` by `Year` group. The `dplyr` package makes this relatively straightforward.
baby_names <-
baby_names %>%
group_by(Year, Sex) %>%
mutate(Rank = rank(-Count)) %>%
arrange(Rank, Year, Sex) %>%
ungroup
baby_names
## # A tibble: 1,352,203 x 6
## Name Sex Count Year Count_1K Rank
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 David Boys 85928 1960 85.9 1
## 2 Mary Girls 51474 1960 51.5 1
## 3 Michael Boys 86922 1961 86.9 1
## 4 Mary Girls 47676 1961 47.7 1
## 5 Michael Boys 85037 1962 85.0 1
## 6 Lisa Girls 46080 1962 46.1 1
## 7 Michael Boys 83789 1963 83.8 1
## 8 Lisa Girls 56037 1963 56.0 1
## 9 Michael Boys 82653 1964 82.7 1
## 10 Lisa Girls 54276 1964 54.3 1
## # ... with 1,352,193 more rows
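One detail worth knowing about `rank()`: by default, tied values share the average of their ranks, so a later filter such as `Rank == 1` would miss names that tie for first place. Whether such ties occur in the baby names data is not checked here; the sketch below uses made-up counts and shows `min_rank()` from `dplyr` as one possible alternative.

```r
library(dplyr)

counts <- c(10, 30, 30, 5)   ## made-up counts with a tie for first place

rank(-counts)            ## ties share the average rank: 3.0 1.5 1.5 4.0
min_rank(desc(counts))   ## ties share the smallest rank: 3 1 1 4
```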
Note that the data remains grouped until you change the groups by running `group_by()` again or remove grouping information with `ungroup()`.
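A minimal sketch of this behavior on a made-up four-row table: `group_vars()` from `dplyr` reports the current grouping variables, and `sum()` inside `mutate()` is computed within each group. The column name `GroupTotal` and the counts are invented for the example.

```r
library(dplyr)

toy <- tibble(
  Name  = c("Mary", "Lisa", "David", "Michael"),
  Sex   = c("Girls", "Girls", "Boys", "Boys"),
  Count = c(300, 100, 150, 50),
  Year  = 1960
)

grouped <- toy %>%
  group_by(Year, Sex) %>%
  mutate(GroupTotal = sum(Count))   ## sum() is taken within each group

group_vars(grouped)      ## "Year" "Sex"
grouped$GroupTotal       ## 400 400 200 200

ungrouped <- ungroup(grouped)
group_vars(ungrouped)    ## character(0)
```

Forgetting to ungroup is a common source of surprises, because later `mutate()` or `summarize()` calls keep working group by group.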
In this exercise your goal is to identify the most popular names for each year.
1. Use `mutate()` and `group_by()` to create a column named “Proportion” where `Proportion = Count/sum(Count)` for each `Year X Sex` group. Use pipes wherever it makes sense.
## baby_names <- baby_names %>% ungroup
2. Use `mutate()` and `group_by()` to create a column named “Rank” where `Rank = rank(-Count)` for each `Year X Sex` group.
3. Filter the baby names data to display only the most popular name for each `Year X Sex` group. Output columns Name, Sex, and Proportion.
4. Plot the data produced in step 3, putting `Year` on the x-axis and `Proportion` on the y-axis. How has the proportion of babies given the most popular name changed over time?
5. BONUS (optional): Which names are the most popular for both boys and girls?
Exercise 4 solution ———————————————————————-
## 1. Use `mutate()` and `group_by()` to create a column named "Proportion"
## where `Proportion = Count/sum(Count)` for each `Year X Sex` group.
baby_names <-
baby_names %>%
group_by(Year, Sex) %>%
mutate(Proportion = Count/sum(Count)) %>%
ungroup
## 2. Use `mutate()` and `group_by()` to create a column named "Rank" where
## `Rank = rank(-Count)` for each `Year X Sex` group.
baby_names <-
baby_names %>%
group_by(Year, Sex) %>%
mutate(Rank = rank(-Count)) %>%
ungroup
## 3. Filter the baby names data to display only the most popular name
## for each `Year X Sex` group. Output columns Name, Sex, and Proportion.
top1 <- filter(baby_names, Rank == 1)
top1 %>%
select(Name, Sex, Proportion)
## # A tibble: 116 x 3
## Name Sex Proportion
## <chr> <chr> <dbl>
## 1 David Boys 0.0403
## 2 Mary Girls 0.0255
## 3 Michael Boys 0.0409
## 4 Mary Girls 0.0236
## 5 Michael Boys 0.0411
## 6 Lisa Girls 0.0234
## 7 Michael Boys 0.0412
## 8 Lisa Girls 0.0291
## 9 Michael Boys 0.0415
## 10 Lisa Girls 0.0286
## # ... with 106 more rows
## 4. Plot the data produced in step 3, putting `Year` on the x-axis
## and `Proportion` on the y-axis. How has the proportion of babies
## given the most popular name changed over time?
qplot(x = Year,
y = Proportion,
color = Sex,
data = top1,
geom = "line")
## 5. BONUS (optional): Which names are the most popular for both boys
## and girls?
## Count.x will hold the boys' counts and Count.y the girls' counts
bn_boys <- baby_names %>%
filter(Sex == "Boys") %>%
select(Name, Year, Count)
bn_girls <- baby_names %>%
filter(Sex == "Girls") %>%
select(Name, Year, Count)
girls_and_boys <- inner_join(bn_boys,
bn_girls,
by = c("Year", "Name"))
head(girls_and_boys)
## # A tibble: 6 x 4
## Name Year Count.x Count.y
## <chr> <dbl> <dbl> <dbl>
## 1 David 1960 85928 223
## 2 Michael 1961 86922 325
## 3 Michael 1962 85037 354
## 4 Michael 1963 83789 377
## 5 Michael 1964 82653 302
## 6 Michael 1965 81019 355
girls_and_boys <- mutate(girls_and_boys,
Product = Count.x * Count.y,
Rank = rank(-Product))
head(girls_and_boys)
## # A tibble: 6 x 6
## Name Year Count.x Count.y Product Rank
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 David 1960 85928 223 19161944 200
## 2 Michael 1961 86922 325 28249650 109
## 3 Michael 1962 85037 354 30103098 98
## 4 Michael 1963 83789 377 31588453 90
## 5 Michael 1964 82653 302 24961206 130
## 6 Michael 1965 81019 355 28761745 106
filter(girls_and_boys, Rank == 1)
## # A tibble: 1 x 6
## Name Year Count.x Count.y Product Rank
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Taylor 1993 7688 21266 163493008 1
You may have noticed that the percentage of babies given the most popular name of the year appears to have decreased over time. We can compute a more robust measure of the popularity of the most popular names by calculating the number of babies given one of the top 10 girl or boy names of the year.
In order to compute this measure we need to operate within groups, as we did using `mutate()` above, but this time we need to collapse each group into a single summary statistic. We can achieve this using the `summarize()` function.
First, let’s see how this function works without grouping. The following code outputs the total number of girls in the data:
baby_names %>%
filter(Sex == "Girls") %>%
summarize(Girls_n = sum(Count))
## # A tibble: 1 x 1
## Girls_n
## <dbl>
## 1 101422255
Next, using `group_by()` and `summarize()` together, we can calculate the number of babies born each year:
bn_by_year <-
baby_names %>%
group_by(Year) %>%
summarize(Total = sum(Count))
head(bn_by_year)
## # A tibble: 6 x 2
## Year Total
## <dbl> <dbl>
## 1 1960 4154377
## 2 1961 4140244
## 3 1962 4035234
## 4 1963 3958791
## 5 1964 3887800
## 6 1965 3626029
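The difference between `mutate()` and `summarize()` on grouped data can be seen on a small made-up table: `mutate()` keeps every row, while `summarize()` returns one row per group. The counts and the name `by_year` are invented for the example.

```r
library(dplyr)

toy <- tibble(
  Name  = c("Mary", "David", "Mary", "Michael"),
  Sex   = c("Girls", "Boys", "Girls", "Boys"),
  Count = c(300, 200, 250, 150),
  Year  = c(1960, 1960, 1961, 1961)
)

## mutate() keeps all four rows and repeats the yearly total on each
toy %>%
  group_by(Year) %>%
  mutate(Total = sum(Count)) %>%
  ungroup()

## summarize() collapses each Year group to a single row
by_year <- toy %>%
  group_by(Year) %>%
  summarize(Total = sum(Count))
by_year
```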
In this exercise we will plot trends in the proportion of boys and girls given one of the 10 most popular names each year.
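1. Filter the `baby_names` data, retaining only the 10 most popular girl and boy names for each year.
2. Summarize the data produced in step one to calculate the total Proportion of boys and girls given one of the top 10 names each year.
3. Plot the data produced in step 2, with year on the x-axis and total proportion on the y axis. Color by sex.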
Exercise 5 solution ———————————————————————-
## 1. Filter the baby_names data, retaining only the 10 most
## popular girl and boy names for each year.
most_popular <-
baby_names %>%
group_by(Year, Sex) %>%
filter(Rank <= 10)
most_popular
## # A tibble: 1,160 x 7
## # Groups: Year, Sex [116]
## Name Sex Count Year Count_1K Rank Proportion
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 David Boys 85928 1960 85.9 1 0.0403
## 2 Mary Girls 51474 1960 51.5 1 0.0255
## 3 Michael Boys 86922 1961 86.9 1 0.0409
## 4 Mary Girls 47676 1961 47.7 1 0.0236
## 5 Michael Boys 85037 1962 85.0 1 0.0411
## 6 Lisa Girls 46080 1962 46.1 1 0.0234
## 7 Michael Boys 83789 1963 83.8 1 0.0412
## 8 Lisa Girls 56037 1963 56.0 1 0.0291
## 9 Michael Boys 82653 1964 82.7 1 0.0415
## 10 Lisa Girls 54276 1964 54.3 1 0.0286
## # ... with 1,150 more rows
## 2. Summarize the data produced in step one to calculate the total
## Proportion of boys and girls given one of the top 10 names
## each year.
## the most_popular data.frame is already grouped by Year and Sex
top10 <-
most_popular %>%
summarize(TotalProportion = sum(Proportion))
## 3. Plot the data produced in step 2, with year on the x-axis
## and total proportion on the y axis. Color by sex.
qplot(x = Year,
y = TotalProportion,
color = Sex,
data = top10,
geom = "line")
Now that we have made some changes to our data set, we might want to save those changes to a file.
You can list all the objects in your current workspace using `ls()`:
ls() # list objects in our workspace
# rm(list=ls()) # remove all objects from our workspace
The `data.frames` can be saved using the functions `write_csv()` and `write_rds()` from the `readr` package.
# write data to a .csv file
write_csv(baby_names, "babyNames.csv")
# write data to an R file
write_rds(baby_names, "babyNames.rds")
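The two formats differ in what they preserve. The sketch below round-trips a small made-up data frame through temporary files: a CSV file stores only text, so column types are guessed again on reading, whereas an RDS file restores the R object as it was saved. The objects `toy`, `from_csv`, and `from_rds` are hypothetical.

```r
library(readr)

toy <- data.frame(Name = c("Mary", "David"),
                  Sex  = factor(c("Girls", "Boys")),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(toy, csv_path)
write_rds(toy, rds_path)

from_csv <- read_csv(csv_path)   ## column types are guessed from the text
from_rds <- read_rds(rds_path)   ## the saved R object is restored

class(from_csv$Sex)        ## "character": the factor was written as text
class(from_rds$Sex)        ## "factor"
identical(from_rds, toy)   ## TRUE
```

CSV is the safer choice for sharing data with other programs; RDS is convenient when the file will only be read back into R.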
Please take a moment to fill out a very short feedback form. These workshops exist for you – tell us what you need! http://tinyurl.com/R-intro-feedback