Data Transformation


Learning objectives

  • Be able to use the six major dplyr verbs (filter, select, arrange, mutate, group_by, summarize)
  • Be able to use and understand the advantages of the magrittr pipe: %>%

Data manipulation using dplyr

It is an often bemoaned fact that a data scientist spends much, and often most, of her time wrangling data: getting it organized and clean. Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. In this lesson we will learn an efficient set of tools that can handle the vast majority of most data management tasks.

Enter dplyr, a package for making data manipulation easier.

Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.

install.packages("dplyr") ## install

You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; I’d recommend choosing the RStudio mirror.

library("dplyr")          ## load

You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.

What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases.dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.

The tasks of dplyr

There are five actions we often want to apply to a tabular dataset:

  • Filter rows
  • Filter columns
  • Arrange rows
  • Make new columns
  • Summarize groups

We are about to see how to do each of those things using the dplyr package. Everything we’re going to learn to do can also be done using “base R”, but dplyr makes it easier, and the syntax is consistent, and it actually makes the computations faster.

filter()

Suppose we want to see just the gapminder data for the USA. First, we need to know how “USA” is written in the dataset: Is it USA or United States or what? We can see all the unique values of a variable with the unique function.

unique(gapminder$country)
##   [1] "Afghanistan"              "Albania"                 
##   [3] "Algeria"                  "Angola"                  
##   [5] "Argentina"                "Australia"               
##   [7] "Austria"                  "Bahrain"                 
##   [9] "Bangladesh"               "Belgium"                 
##  [11] "Benin"                    "Bolivia"                 
##  [13] "Bosnia and Herzegovina"   "Botswana"                
##  [15] "Brazil"                   "Bulgaria"                
##  [17] "Burkina Faso"             "Burundi"                 
##  [19] "Cambodia"                 "Cameroon"                
##  [21] "Canada"                   "Central African Republic"
##  [23] "Chad"                     "Chile"                   
##  [25] "China"                    "Colombia"                
##  [27] "Comoros"                  "Congo Dem. Rep."         
##  [29] "Congo Rep."               "Costa Rica"              
##  [31] "Cote d'Ivoire"            "Croatia"                 
##  [33] "Cuba"                     "Czech Republic"          
##  [35] "Denmark"                  "Djibouti"                
##  [37] "Dominican Republic"       "Ecuador"                 
##  [39] "Egypt"                    "El Salvador"             
##  [41] "Equatorial Guinea"        "Eritrea"                 
##  [43] "Ethiopia"                 "Finland"                 
##  [45] "France"                   "Gabon"                   
##  [47] "Gambia"                   "Germany"                 
##  [49] "Ghana"                    "Greece"                  
##  [51] "Guatemala"                "Guinea"                  
##  [53] "Guinea-Bissau"            "Haiti"                   
##  [55] "Honduras"                 "Hong Kong China"         
##  [57] "Hungary"                  "Iceland"                 
##  [59] "India"                    "Indonesia"               
##  [61] "Iran"                     "Iraq"                    
##  [63] "Ireland"                  "Israel"                  
##  [65] "Italy"                    "Jamaica"                 
##  [67] "Japan"                    "Jordan"                  
##  [69] "Kenya"                    "Korea Dem. Rep."         
##  [71] "Korea Rep."               "Kuwait"                  
##  [73] "Lebanon"                  "Lesotho"                 
##  [75] "Liberia"                  "Libya"                   
##  [77] "Madagascar"               "Malawi"                  
##  [79] "Malaysia"                 "Mali"                    
##  [81] "Mauritania"               "Mauritius"               
##  [83] "Mexico"                   "Mongolia"                
##  [85] "Montenegro"               "Morocco"                 
##  [87] "Mozambique"               "Myanmar"                 
##  [89] "Namibia"                  "Nepal"                   
##  [91] "Netherlands"              "New Zealand"             
##  [93] "Nicaragua"                "Niger"                   
##  [95] "Nigeria"                  "Norway"                  
##  [97] "Oman"                     "Pakistan"                
##  [99] "Panama"                   "Paraguay"                
## [101] "Peru"                     "Philippines"             
## [103] "Poland"                   "Portugal"                
## [105] "Puerto Rico"              "Reunion"                 
## [107] "Romania"                  "Rwanda"                  
## [109] "Sao Tome and Principe"    "Saudi Arabia"            
## [111] "Senegal"                  "Serbia"                  
## [113] "Sierra Leone"             "Singapore"               
## [115] "Slovak Republic"          "Slovenia"                
## [117] "Somalia"                  "South Africa"            
## [119] "Spain"                    "Sri Lanka"               
## [121] "Sudan"                    "Swaziland"               
## [123] "Sweden"                   "Switzerland"             
## [125] "Syria"                    "Taiwan"                  
## [127] "Tanzania"                 "Thailand"                
## [129] "Togo"                     "Trinidad and Tobago"     
## [131] "Tunisia"                  "Turkey"                  
## [133] "Uganda"                   "United Kingdom"          
## [135] "United States"            "Uruguay"                 
## [137] "Venezuela"                "Vietnam"                 
## [139] "West Bank and Gaza"       "Yemen Rep."              
## [141] "Zambia"                   "Zimbabwe"

Okay, now we want to see just the rows of the data.frame where country is “United States”. The syntax for all dplyr functions is the same: The first argument is the data.frame, the rest of the arguments are whatever you want to do in that data.frame.

filter(gapminder, country == "United States")
##          country year       pop continent lifeExp gdpPercap
## 1  United States 1952 157553000  Americas  68.440  13990.48
## 2  United States 1957 171984000  Americas  69.490  14847.13
## 3  United States 1962 186538000  Americas  70.210  16173.15
## 4  United States 1967 198712000  Americas  70.760  19530.37
## 5  United States 1972 209896000  Americas  71.340  21806.04
## 6  United States 1977 220239000  Americas  73.380  24072.63
## 7  United States 1982 232187835  Americas  74.650  25009.56
## 8  United States 1987 242803533  Americas  75.020  29884.35
## 9  United States 1992 256894189  Americas  76.090  32003.93
## 10 United States 1997 272911760  Americas  76.810  35767.43
## 11 United States 2002 287675526  Americas  77.310  39097.10
## 12 United States 2007 301139947  Americas  78.242  42951.65

We can also apply multiple conditions, e.g. the US after 2000:

filter(gapminder, country == "United States" & year > 2000)
##         country year       pop continent lifeExp gdpPercap
## 1 United States 2002 287675526  Americas  77.310  39097.10
## 2 United States 2007 301139947  Americas  78.242  42951.65

We can also use “or” conditions with the vertical pipe: |. Notice that the variable (column) names don’t go in quotes, but values of character variables do.

filter(gapminder, country == "United States" | country == "Mexico")
##          country year       pop continent lifeExp gdpPercap
## 1         Mexico 1952  30144317  Americas  50.789  3478.126
## 2         Mexico 1957  35015548  Americas  55.190  4131.547
## 3         Mexico 1962  41121485  Americas  58.299  4581.609
## 4         Mexico 1967  47995559  Americas  60.110  5754.734
## 5         Mexico 1972  55984294  Americas  62.361  6809.407
## 6         Mexico 1977  63759976  Americas  65.032  7674.929
## 7         Mexico 1982  71640904  Americas  67.405  9611.148
## 8         Mexico 1987  80122492  Americas  69.498  8688.156
## 9         Mexico 1992  88111030  Americas  71.455  9472.384
## 10        Mexico 1997  95895146  Americas  73.670  9767.298
## 11        Mexico 2002 102479927  Americas  74.902 10742.441
## 12        Mexico 2007 108700891  Americas  76.195 11977.575
## 13 United States 1952 157553000  Americas  68.440 13990.482
## 14 United States 1957 171984000  Americas  69.490 14847.127
## 15 United States 1962 186538000  Americas  70.210 16173.146
## 16 United States 1967 198712000  Americas  70.760 19530.366
## 17 United States 1972 209896000  Americas  71.340 21806.036
## 18 United States 1977 220239000  Americas  73.380 24072.632
## 19 United States 1982 232187835  Americas  74.650 25009.559
## 20 United States 1987 242803533  Americas  75.020 29884.350
## 21 United States 1992 256894189  Americas  76.090 32003.932
## 22 United States 1997 272911760  Americas  76.810 35767.433
## 23 United States 2002 287675526  Americas  77.310 39097.100
## 24 United States 2007 301139947  Americas  78.242 42951.653

A good, handy reference list for the operators (and, or, etc) can be found here.

select()

filter returned a subset of the data.frame’s rows. select returns a subset of the data.frame’s columns.

Suppose we only want to see country and life expectancy.

select(gapminder, country, lifeExp)

Hmm, we can’t really see all 1704 rows. We can just look at the top of the data.frame with the head function. We could assign the output of our select command to a new variable and display the head of that, or we can wrap our select function in head. Note that if we don’t assign the output of a function to a variable, the output is printed, but nothing changes. E.g. the last command didn’t remove the other columns from the gapminder data.frame, it just printed the results of that function call. Understand how the nesting functions works: head expects a data.frame as its argument, and select returns a data.frame. The output of select becomes the input to head.

gapCountryExp <- select(gapminder, country, lifeExp)
head(gapCountryExp)
##       country lifeExp
## 1 Afghanistan  28.801
## 2 Afghanistan  30.332
## 3 Afghanistan  31.997
## 4 Afghanistan  34.020
## 5 Afghanistan  36.088
## 6 Afghanistan  38.438
head(select(gapminder, country, lifeExp))
##       country lifeExp
## 1 Afghanistan  28.801
## 2 Afghanistan  30.332
## 3 Afghanistan  31.997
## 4 Afghanistan  34.020
## 5 Afghanistan  36.088
## 6 Afghanistan  38.438

arrange()

You can order the rows of a data.frame by a variable using arrange. Suppose we want to see the most populous countries. Again, we wrap the results in head to just print the first few rows:

head(arrange(gapminder, pop))
##                 country year   pop continent lifeExp gdpPercap
## 1 Sao Tome and Principe 1952 60011    Africa  46.471  879.5836
## 2 Sao Tome and Principe 1957 61325    Africa  48.945  860.7369
## 3              Djibouti 1952 63149    Africa  34.812 2669.5295
## 4 Sao Tome and Principe 1962 65345    Africa  51.893 1071.5511
## 5 Sao Tome and Principe 1967 70787    Africa  54.425 1384.8406
## 6              Djibouti 1957 71851    Africa  37.328 2864.9691

Hmm, we didn’t get the most populous countries. By default, arrange sorts the variable in increasing order. We could see the most populous countries by examining the tail of the last command, or we can sort the data.frame by descending population by wrapping the variable in desc():

head(arrange(gapminder, desc(pop)))
##   country year        pop continent lifeExp gdpPercap
## 1   China 2007 1318683096      Asia  72.961  4959.115
## 2   China 2002 1280400000      Asia  72.028  3119.281
## 3   China 1997 1230075000      Asia  70.426  2289.234
## 4   China 1992 1164970000      Asia  68.690  1655.784
## 5   India 2007 1110396331      Asia  64.698  2452.210
## 6   China 1987 1084035000      Asia  67.274  1378.904

arrange can also sort by multiple variables. It will sort the data.frame by the first variable, and if there are any ties in that variable, they will be sorted by the next variable, and so on. Here we sort from newest to oldest, and within year from richest to poorest:

head(arrange(gapminder, desc(year), desc(gdpPercap)))
##           country year       pop continent lifeExp gdpPercap
## 1          Norway 2007   4627926    Europe  80.196  49357.19
## 2          Kuwait 2007   2505559      Asia  77.588  47306.99
## 3       Singapore 2007   4553009      Asia  79.972  47143.18
## 4   United States 2007 301139947  Americas  78.242  42951.65
## 5         Ireland 2007   4109086    Europe  78.885  40676.00
## 6 Hong Kong China 2007   6980412      Asia  82.208  39724.98

Shoutout Q: Would we get the same output if we switched the order of desc(year) and desc(gdpPercap) in the last line?

Pipes

Suppose we want to look at all the countries where life expectancy is greater than 80 years, sorted from poorest to richest. First, we filter, then we arrange. Each function expects a data.frame as its first argument and returns a data.frame as its output. So we could wrap them like we did with head(select( ... )) above:

arrange(filter(gapminder, lifeExp > 80), gdpPercap)
##            country year       pop continent lifeExp gdpPercap
## 1      New Zealand 2007   4115771      <NA>  80.204  25185.01
## 2           Israel 2007   6426679      Asia  80.745  25523.28
## 3            Italy 2002  57926999    Europe  80.240  27968.10
## 4            Italy 2007  58147733    Europe  80.546  28569.72
## 5            Japan 2002 127065841      Asia  82.000  28604.59
## 6            Japan 1997 125956499      Asia  80.690  28816.58
## 7            Spain 2007  40448191    Europe  80.941  28821.06
## 8           Sweden 2002   8954175    Europe  80.040  29341.63
## 9  Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 10          France 2007  61083916    Europe  80.657  30470.02
## 11       Australia 2002  19546792      <NA>  80.370  30687.75
## 12         Iceland 2002    288030    Europe  80.500  31163.20
## 13           Japan 2007 127467972      Asia  82.603  31656.07
## 14          Sweden 2007   9031088    Europe  80.884  33859.75
## 15       Australia 2007  20434176      <NA>  81.235  34435.37
## 16     Switzerland 2002   7361757    Europe  80.620  34480.96
## 17         Iceland 2007    301931    Europe  81.757  36180.79
## 18          Canada 2007  33390141  Americas  80.653  36319.24
## 19     Switzerland 2007   7554661    Europe  81.701  37506.42
## 20 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 21          Norway 2007   4627926    Europe  80.196  49357.19

Or, we could assign the intermediate data.frame to a variable:

lifeExpGreater80 <- filter(gapminder, lifeExp > 80)
arrange(lifeExpGreater80, gdpPercap)
##            country year       pop continent lifeExp gdpPercap
## 1      New Zealand 2007   4115771      <NA>  80.204  25185.01
## 2           Israel 2007   6426679      Asia  80.745  25523.28
## 3            Italy 2002  57926999    Europe  80.240  27968.10
## 4            Italy 2007  58147733    Europe  80.546  28569.72
## 5            Japan 2002 127065841      Asia  82.000  28604.59
## 6            Japan 1997 125956499      Asia  80.690  28816.58
## 7            Spain 2007  40448191    Europe  80.941  28821.06
## 8           Sweden 2002   8954175    Europe  80.040  29341.63
## 9  Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 10          France 2007  61083916    Europe  80.657  30470.02
## 11       Australia 2002  19546792      <NA>  80.370  30687.75
## 12         Iceland 2002    288030    Europe  80.500  31163.20
## 13           Japan 2007 127467972      Asia  82.603  31656.07
## 14          Sweden 2007   9031088    Europe  80.884  33859.75
## 15       Australia 2007  20434176      <NA>  81.235  34435.37
## 16     Switzerland 2002   7361757    Europe  80.620  34480.96
## 17         Iceland 2007    301931    Europe  81.757  36180.79
## 18          Canada 2007  33390141  Americas  80.653  36319.24
## 19     Switzerland 2007   7554661    Europe  81.701  37506.42
## 20 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 21          Norway 2007   4627926    Europe  80.196  49357.19

The first option is difficult to read, and the second option clutters our Environment with a data.frame that we will only use once. This is especially true if we need to do multiple operations:

gapCountryExp <- select(gapminder, country, lifeExp)
gapCountryExpFilter <- filter(gapCountryExp, lifeExp > 80)
gapCountryExpFilterSorted <- arrange(gapCountryExpFilter, lifeExp)
gapCountryExpFilterSorted 
##            country lifeExp
## 1           Sweden  80.040
## 2           Norway  80.196
## 3      New Zealand  80.204
## 4            Italy  80.240
## 5        Australia  80.370
## 6          Iceland  80.500
## 7            Italy  80.546
## 8      Switzerland  80.620
## 9           Canada  80.653
## 10          France  80.657
## 11           Japan  80.690
## 12          Israel  80.745
## 13          Sweden  80.884
## 14           Spain  80.941
## 15       Australia  81.235
## 16 Hong Kong China  81.495
## 17     Switzerland  81.701
## 18         Iceland  81.757
## 19           Japan  82.000
## 20 Hong Kong China  82.208
## 21           Japan  82.603

or

arrange(filter(select(gapminder, country, lifeExp), lifeExp > 80), lifeExp)
##            country lifeExp
## 1           Sweden  80.040
## 2           Norway  80.196
## 3      New Zealand  80.204
## 4            Italy  80.240
## 5        Australia  80.370
## 6          Iceland  80.500
## 7            Italy  80.546
## 8      Switzerland  80.620
## 9           Canada  80.653
## 10          France  80.657
## 11           Japan  80.690
## 12          Israel  80.745
## 13          Sweden  80.884
## 14           Spain  80.941
## 15       Australia  81.235
## 16 Hong Kong China  81.495
## 17     Switzerland  81.701
## 18         Iceland  81.757
## 19           Japan  82.000
## 20 Hong Kong China  82.208
## 21           Japan  82.603

Eeewww! Environment clutter, or obfuscated code? There is a better way, which makes both writing and reading the code easier. The pipe from the magrittr package (which is automatically installed and loaded with dplyr) takes the output of first line, and plugs it in as the first argument of the next line.

filter(gapminder, lifeExp > 80) %>%
    arrange(gdpPercap)
##            country year       pop continent lifeExp gdpPercap
## 1      New Zealand 2007   4115771      <NA>  80.204  25185.01
## 2           Israel 2007   6426679      Asia  80.745  25523.28
## 3            Italy 2002  57926999    Europe  80.240  27968.10
## 4            Italy 2007  58147733    Europe  80.546  28569.72
## 5            Japan 2002 127065841      Asia  82.000  28604.59
## 6            Japan 1997 125956499      Asia  80.690  28816.58
## 7            Spain 2007  40448191    Europe  80.941  28821.06
## 8           Sweden 2002   8954175    Europe  80.040  29341.63
## 9  Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 10          France 2007  61083916    Europe  80.657  30470.02
## 11       Australia 2002  19546792      <NA>  80.370  30687.75
## 12         Iceland 2002    288030    Europe  80.500  31163.20
## 13           Japan 2007 127467972      Asia  82.603  31656.07
## 14          Sweden 2007   9031088    Europe  80.884  33859.75
## 15       Australia 2007  20434176      <NA>  81.235  34435.37
## 16     Switzerland 2002   7361757    Europe  80.620  34480.96
## 17         Iceland 2007    301931    Europe  81.757  36180.79
## 18          Canada 2007  33390141  Americas  80.653  36319.24
## 19     Switzerland 2007   7554661    Europe  81.701  37506.42
## 20 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 21          Norway 2007   4627926    Europe  80.196  49357.19

Another way to write this is

gapminder %>%
    filter(lifeExp > 80) %>%
    arrange(gdpPercap)
##            country year       pop continent lifeExp gdpPercap
## 1      New Zealand 2007   4115771      <NA>  80.204  25185.01
## 2           Israel 2007   6426679      Asia  80.745  25523.28
## 3            Italy 2002  57926999    Europe  80.240  27968.10
## 4            Italy 2007  58147733    Europe  80.546  28569.72
## 5            Japan 2002 127065841      Asia  82.000  28604.59
## 6            Japan 1997 125956499      Asia  80.690  28816.58
## 7            Spain 2007  40448191    Europe  80.941  28821.06
## 8           Sweden 2002   8954175    Europe  80.040  29341.63
## 9  Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 10          France 2007  61083916    Europe  80.657  30470.02
## 11       Australia 2002  19546792      <NA>  80.370  30687.75
## 12         Iceland 2002    288030    Europe  80.500  31163.20
## 13           Japan 2007 127467972      Asia  82.603  31656.07
## 14          Sweden 2007   9031088    Europe  80.884  33859.75
## 15       Australia 2007  20434176      <NA>  81.235  34435.37
## 16     Switzerland 2002   7361757    Europe  80.620  34480.96
## 17         Iceland 2007    301931    Europe  81.757  36180.79
## 18          Canada 2007  33390141  Americas  80.653  36319.24
## 19     Switzerland 2007   7554661    Europe  81.701  37506.42
## 20 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 21          Norway 2007   4627926    Europe  80.196  49357.19

Whatever goes through the pipe becomes the first argument of the function after the pipe. This is convenient, because all dplyr functions produce a data.frame as their output and take a data.frame as the first argument. Since R ignores white-space, we can put each function on a new line, which RStudio will automatically indent, making everything easy to read. Now each line represents a step in a sequential operation. You can read this as “Take the gapminder data.frame, filter to the rows where lifeExp is greater than 80, and arrange by gdpPercap.”

Shoutout Q: How would we modify this if we wanted to include the select statement as in the 3rd bad example above?

Making your code easier for humans to read will save you lots of time. The human reading it is usually future-you, and operations that seem simple when you’re writing them will look like gibberish when you’re three weeks removed from them, let alone three months or three years or another person. Make your code as easy to read as possible by using the pipe where appropriate, leaving white space, using descriptive variable names, being consistent with spacing and naming, and liberally commenting code.

If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:

lifeExpGreater80 <- gapminder %>% 
    filter(lifeExp > 80) %>%
    arrange(gdpPercap)

lifeExpGreater80
##            country year       pop continent lifeExp gdpPercap
## 1      New Zealand 2007   4115771      <NA>  80.204  25185.01
## 2           Israel 2007   6426679      Asia  80.745  25523.28
## 3            Italy 2002  57926999    Europe  80.240  27968.10
## 4            Italy 2007  58147733    Europe  80.546  28569.72
## 5            Japan 2002 127065841      Asia  82.000  28604.59
## 6            Japan 1997 125956499      Asia  80.690  28816.58
## 7            Spain 2007  40448191    Europe  80.941  28821.06
## 8           Sweden 2002   8954175    Europe  80.040  29341.63
## 9  Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 10          France 2007  61083916    Europe  80.657  30470.02
## 11       Australia 2002  19546792      <NA>  80.370  30687.75
## 12         Iceland 2002    288030    Europe  80.500  31163.20
## 13           Japan 2007 127467972      Asia  82.603  31656.07
## 14          Sweden 2007   9031088    Europe  80.884  33859.75
## 15       Australia 2007  20434176      <NA>  81.235  34435.37
## 16     Switzerland 2002   7361757    Europe  80.620  34480.96
## 17         Iceland 2007    301931    Europe  81.757  36180.79
## 18          Canada 2007  33390141  Americas  80.653  36319.24
## 19     Switzerland 2007   7554661    Europe  81.701  37506.42
## 20 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 21          Norway 2007   4627926    Europe  80.196  49357.19

Challenge: Data Reduction

  1. Produce a data.frame with only the names and years of countries where the life expectancy is less than 70, from highest to lowest.
  2. Do this same operation again though only for African and Oceanic countries, sorted by the year from most- to least-recent.

More tasks of dplyr

mutate()

We have learned how to drop rows, drop columns, and rearrange rows. To make a new column we use the mutate function. As usual, the first argument is a data.frame. The second argument is the name of the new column you want to create, followed by an equal sign, followed by what to put in that column. You can reference other variables in the data.frame, and mutate will treat each row independently. E.g. we can calculate the total GDP of each country in each year by multiplying the per-capita GDP by the population. We pass the output of mutate to head to keep the display under control. Pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded.

How would we view the highest-total-gdp countries?

mutate(gapminder, total_gdp = gdpPercap * pop) %>%
    head()
##       country year      pop continent lifeExp gdpPercap   total_gdp
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453  6567086330
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530  7585448670
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007  8758855797
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971  9648014150
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811  9678553274
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134 11697659231

You can create multiple columns in the same function call, separating them by commas. E.g. suppose we want the base-10 logarithm of each country’s population and per-capita gdp:

gapminder %>%
    mutate(log_gdp = log10(gdpPercap), log_pop = log10(pop)) %>%
    arrange(desc(log_gdp)) %>%
    head()
##   country year     pop continent lifeExp gdpPercap  log_gdp  log_pop
## 1  Kuwait 1957  212846      Asia  58.033 113523.13 5.055084 5.328065
## 2  Kuwait 1972  841934      Asia  67.712 109347.87 5.038810 5.925278
## 3  Kuwait 1952  160000      Asia  55.565 108382.35 5.034959 5.204120
## 4  Kuwait 1962  358266      Asia  60.470  95458.11 4.979813 5.554206
## 5  Kuwait 1967  575003      Asia  64.624  80894.88 4.907921 5.759670
## 6  Kuwait 1977 1140357      Asia  69.343  59265.48 4.772802 6.057041

MCQ: Data Reduction

Produce a data.frame with only the names and years of countries where per capita gdp is less than a dollar a day sorted from most- to least-recent.

  • Tip: The gdpPercap variable is annual gdp. You’ll need to adjust.
  • Tip: For complex tasks, it often helps to use pencil and paper to write/draw/map the various steps needed and how they fit together before writing any code.

What is the annual per-capita gdp, rounded to the nearest dollar, of the first row in the data.frame?

  1. $278
  2. $312
  3. $331
  4. $339

group_by() and summarize()

Often we want to calculate a new variable, but rather than keeping each row as an independent observation, we want to group observations together to calculate some summary statistic. To do this we need two functions, one to do the grouping and one to calculate the summary statistic: group_by and summarize. By itself group_by doesn’t change a data.frame; it just sets up the grouping. summarize then goes over each group in the data.frame and does whatever calculation you want. E.g. suppose we want the average global gdp for each year. While we’re at it, let’s calculate the mean and median and see how they differ.

gapminder %>%
    group_by(year) %>%
    summarize(mean_gdp = mean(gdpPercap), median_gdp = median(gdpPercap))
## # A tibble: 12 x 3
##     year  mean_gdp median_gdp
##    <int>     <dbl>      <dbl>
##  1  1952  3725.276   1968.528
##  2  1957  4299.408   2173.220
##  3  1962  4725.812   2335.440
##  4  1967  5483.653   2678.335
##  5  1972  6770.083   3339.129
##  6  1977  7313.166   3798.609
##  7  1982  7518.902   4216.228
##  8  1987  7900.920   4280.300
##  9  1992  8158.609   4386.086
## 10  1997  9090.175   4781.825
## 11  2002  9917.848   5319.805
## 12  2007 11680.072   6124.371

Note that summarize() eliminates any other columns. Why?

There are several different summary statistics that can be generated from our data. The R base package provides many built-in functions such as mean, median, min, max, and range. By default, all R functions operating on vectors that contains missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove). An alternate option is to use the function is.na(), which evaluates to true if the value passed to it is not a number. This function is more useful as a part of a filter, where you can filter out everything that is not a number. For that purpose you would do something like

gapminder %>%
  filter(!is.na(someColumn)) %>%
  head

The ! symbol negates it, so we’re asking for everything that is not an NA.

We often want to calculate the number of entries within a group. E.g. we might wonder if our dataset is balanced by country. We can do this with the n() function, or dplyr provides a count function as a convenience:

gapminder %>%
    group_by(country) %>%
    summarize(number_entries = n())
## # A tibble: 142 x 2
##        country number_entries
##          <chr>          <int>
##  1 Afghanistan             12
##  2     Albania             12
##  3     Algeria             12
##  4      Angola             12
##  5   Argentina             12
##  6   Australia             12
##  7     Austria             12
##  8     Bahrain             12
##  9  Bangladesh             12
## 10     Belgium             12
## # ... with 132 more rows
count(gapminder, country)
## # A tibble: 142 x 2
##        country     n
##          <chr> <int>
##  1 Afghanistan    12
##  2     Albania    12
##  3     Algeria    12
##  4      Angola    12
##  5   Argentina    12
##  6   Australia    12
##  7     Austria    12
##  8     Bahrain    12
##  9  Bangladesh    12
## 10     Belgium    12
## # ... with 132 more rows

The highly observant will notice the output of this looks a little different than most of our data.frames – why didn’t it print out all 60 rows, and what’s with that header info at the top? dplyr converted the data.frame to a table-data.frame, or tbl_df object. The most salient difference between them is that tbl_dfs never print more than a few rows, which can be nice. If you like this behavior, you can convert any data.frame to a tbl_df like so:

gapminder <- tbl_df(gapminder)
gapminder
## # A tibble: 1,704 x 6
##        country  year      pop continent lifeExp gdpPercap
##          <chr> <int>    <dbl>    <fctr>   <dbl>     <dbl>
##  1 Afghanistan  1952  8425333      Asia  28.801  779.4453
##  2 Afghanistan  1957  9240934      Asia  30.332  820.8530
##  3 Afghanistan  1962 10267083      Asia  31.997  853.1007
##  4 Afghanistan  1967 11537966      Asia  34.020  836.1971
##  5 Afghanistan  1972 13079460      Asia  36.088  739.9811
##  6 Afghanistan  1977 14880372      Asia  38.438  786.1134
##  7 Afghanistan  1982 12881816      Asia  39.854  978.0114
##  8 Afghanistan  1987 13867957      Asia  40.822  852.3959
##  9 Afghanistan  1992 16317921      Asia  41.674  649.3414
## 10 Afghanistan  1997 22227415      Asia  41.763  635.3414
## # ... with 1,694 more rows

Now it prints nicely, so we don’t need to use head.

We can also do multiple groupings. Suppose we want the maximum life expectancy in each continent for each year. We group by continent and year and calculate the maximum with the max function:

gapminder %>%
    group_by(continent, year) %>%
    summarize(longest_life = max(lifeExp))
## # A tibble: 60 x 3
## # Groups:   continent [?]
##    continent  year longest_life
##       <fctr> <int>        <dbl>
##  1    Europe  1952        72.67
##  2    Europe  1957        73.47
##  3    Europe  1962        73.68
##  4    Europe  1967        74.16
##  5    Europe  1972        74.72
##  6    Europe  1977        76.11
##  7    Europe  1982        76.99
##  8    Europe  1987        77.41
##  9    Europe  1992        78.77
## 10    Europe  1997        79.39
## # ... with 50 more rows

Hmm, we got the longest life expectancy for each continent-year, but we didn’t get the country. To get the country, we have to ask R “Where lifeExp is at a maximum, what is the entry in country?” For that we use the which.max function. max returns the maximum value; which.max returns the location of the maximum value. That is illustrated in the following example:

max(c(1, 7, 4))
## [1] 7
which.max(c(1, 7, 4))
## [1] 2

Now, back to the question: Where lifeExp is at a maximum, what is the entry in country?

gapminder %>%
    group_by(continent, year) %>%
    summarize(longest_life = max(lifeExp), country = country[which.max(lifeExp)])
## # A tibble: 60 x 4
## # Groups:   continent [?]
##    continent  year longest_life     country
##       <fctr> <int>        <dbl>       <chr>
##  1    Europe  1952        72.67      Norway
##  2    Europe  1957        73.47     Iceland
##  3    Europe  1962        73.68     Iceland
##  4    Europe  1967        74.16      Sweden
##  5    Europe  1972        74.72      Sweden
##  6    Europe  1977        76.11     Iceland
##  7    Europe  1982        76.99     Iceland
##  8    Europe  1987        77.41 Switzerland
##  9    Europe  1992        78.77     Iceland
## 10    Europe  1997        79.39      Sweden
## # ... with 50 more rows

Challenge – Part 1

  • Calculate a new column: the total GDP of each country in each year.
  • Calculate the variance – var() of countries’ gdps in each year.
  • Is country-level GDP getting more or less equal over time?

Challenge – Part 2

  • Modify the code you just wrote to calculate the variance in both country-level GDP and per-capita GDP.
  • Do both measures support the conclusion you arrived at above?

Resources

That is the core of dplyr’s functionality, but it does more. RStudio makes a great cheatsheet that covers all the dplyr functions we just learned, plus what we will learn in the next lesson: keeping data tidy.