Data Read/Write

There are many ways to get data into and out of R. This lesson aims to give you the tools to navigate your many options. RStudio’s Data Import Cheat Sheet provides an excellent overview. Working in the social sciences you are likely to come across the following types of data:

Data Type Suggested Library
CSV readr
SPSS, Stata, or SAS haven
Excel readxl
Databases DBI
Web APIs httr
Web Scraping rvest

To keep this lesson manageable we will focus primarily on reading and writing .csv files, and we will show how working with Stata .dta files is quite similar.

Paths to Files and Folders

One of the trickiest parts of working with R is telling it where to find files on disk. I found this introduction to working with files and folders quite helpful. Here’s a quick overview of that article.

Whenever you are working in R, your R session has a current working directory. This is the folder where R is working from. So, if you tell R to open a file named my-data.csv it will look in the current working directory to see if it can find that file. If that file does not exist in that directory you’ll receive an error message.

library(readr)

read_csv("my-data.csv")
## Error: 'my-data.csv' does not exist in current working directory ('/home/travis/build/hbs-rcs/R_Intro-gapminder/content/post').

Often it is helpful to know where R is working from, to see the current working directory use the getwd() function:

getwd()
## [1] "/home/travis/build/hbs-rcs/R_Intro-gapminder/content/post"

As R compiles this lesson, it’s working from the directory listed above. This is likely different from the output you see running the same function on your computer. To change the working directory use the setwd() function. Unfortunately paths on Windows are a little different compared to paths on Mac and Linux. RStudio makes navigating your file system a little easier. Use the “Files” pane to navigate into the folder where you want to be working, click the “More” button and select “Set As Working Directory”. You’ll see that RStudio calls the setwd() function with the appropriate path for your operating system.

Read from CSV

A comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

Download the data for this lesson using this link. This CSV file contains country-by-year level data on GDP, population, and life expectancy.

Use RStudio’s “Files” pane to navigate to where you downloaded the data.

Read the CSV with the read_csv() function from the readr package. The argument to read_csv() is the location of the file, relative to your working directory. Run the following code to read the data into R:

library(readr)

gapminder <- read_csv("gapminder-FiveYearData.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_integer(),
##   pop = col_double(),
##   continent = col_character(),
##   lifeExp = col_double(),
##   gdpPercap = col_double()
## )

If readr is not installed on your computer, use install.packages("readr") to install it.

We have created a new object in the global environment called gapminder. Looking at RStudio’s “Environment” pane we see this object has type tbl_df, it has 6 columns and 1704 rows. If we want to check that our data has been loaded, we can print the variable’s value:

gapminder
## # A tibble: 1,704 x 6
##        country  year      pop continent lifeExp gdpPercap
##          <chr> <int>    <dbl>     <chr>   <dbl>     <dbl>
##  1 Afghanistan  1952  8425333      Asia  28.801  779.4453
##  2 Afghanistan  1957  9240934      Asia  30.332  820.8530
##  3 Afghanistan  1962 10267083      Asia  31.997  853.1007
##  4 Afghanistan  1967 11537966      Asia  34.020  836.1971
##  5 Afghanistan  1972 13079460      Asia  36.088  739.9811
##  6 Afghanistan  1977 14880372      Asia  38.438  786.1134
##  7 Afghanistan  1982 12881816      Asia  39.854  978.0114
##  8 Afghanistan  1987 13867957      Asia  40.822  852.3959
##  9 Afghanistan  1992 16317921      Asia  41.674  649.3414
## 10 Afghanistan  1997 22227415      Asia  41.763  635.3414
## # ... with 1,694 more rows

Congratulations! You have successfully read data into R.

Inspect Data

Now that you have data in R, let’s run through a few functions that are helpful for inspecting your data.

Use the head() function to see the first few rows of a data frame.

head(gapminder)
## # A tibble: 6 x 6
##       country  year      pop continent lifeExp gdpPercap
##         <chr> <int>    <dbl>     <chr>   <dbl>     <dbl>
## 1 Afghanistan  1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan  1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan  1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan  1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan  1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan  1977 14880372      Asia  38.438  786.1134

Use the tail() function to see the last few rows:

tail(gapminder)
## # A tibble: 6 x 6
##    country  year      pop continent lifeExp gdpPercap
##      <chr> <int>    <dbl>     <chr>   <dbl>     <dbl>
## 1 Zimbabwe  1982  7636524    Africa  60.363  788.8550
## 2 Zimbabwe  1987  9216418    Africa  62.351  706.1573
## 3 Zimbabwe  1992 10704340    Africa  60.377  693.4208
## 4 Zimbabwe  1997 11404948    Africa  46.809  792.4500
## 5 Zimbabwe  2002 11926563    Africa  39.989  672.0386
## 6 Zimbabwe  2007 12311143    Africa  43.487  469.7093

Use the str() function to inspect the structure of an object:

str(gapminder, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Note: If you use the read.csv() function from base R, the default behavior is to convert character columns into factors (integer values with character labels).

str(read.csv("gapminder-FiveYearData.csv"))
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

How big is my data?

Three frequently used functions will allow you to see the shape of your data. dim() will return a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)

dim(gapminder)
## [1] 1704    6

You can also assess these independently using the nrow() and ncol() functions:

nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6

What variables are in my data?

If your data has column headers, this information can be retrieved with the colnames() command. More colloquially, you can use names() if your object is a data.frame.

colnames(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
names(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Likewise, the rownames() command will retrieve the row headers, if present. If not, this command will simply return the row number:

# look at the head of all the rownames
head(rownames(gapminder))
## [1] "1" "2" "3" "4" "5" "6"

How are the variables distributed?

The summary() function is a very convenient way to get summary statistics for each column:

summary(gapminder)
##    country               year           pop             continent        
##  Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
##  Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
##  Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
##                     Mean   :1980   Mean   :2.960e+07                     
##                     3rd Qu.:1993   3rd Qu.:1.959e+07                     
##                     Max.   :2007   Max.   :1.319e+09                     
##     lifeExp        gdpPercap       
##  Min.   :23.60   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:  1202.1  
##  Median :60.71   Median :  3531.8  
##  Mean   :59.47   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :113523.1

As you can see, this information is more valuable for some columns than others. We can use the table() function to tabulate categorical variables.

table(gapminder$continent)
## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

Create Data in Code

Often it is useful to create variables to hold scalar values, for instance

path <- "~/my-file.txt"
i <- 10

Other times it’s useful to create vectors:

filenames <- c("one.txt", "two.txt")
indices <- c(2, 5, 7)

And sometimes it is useful to create small data frames in code. The read_csv() function makes this easy as well:

animals <- read_csv(
    "name   , feel    , type         , weight_lbs
     Andrew , Furry   , Dog          , 45
     Bob    , Furry   , Cat          , 8
     Carl   , Squishy , Sea Cucumber , 1.1
     Doug   , Spiny   , Sea Urchin   , 0.8"
)

Let’s look at the data we’ve created.

animals
## # A tibble: 4 x 4
##     name    feel         type weight_lbs
##    <chr>   <chr>        <chr>      <dbl>
## 1 Andrew   Furry          Dog       45.0
## 2    Bob   Furry          Cat        8.0
## 3   Carl Squishy Sea Cucumber        1.1
## 4   Doug   Spiny   Sea Urchin        0.8

We have created a table of animals. There are four rows (animals) and four columns (characteristics of each animal). Andrew is a furry dog that weighs 45 pounds.

Write to CSV

Let’s use the write_csv() function from the readr package to store our animals data in a CSV file:

write_csv(animals, "my-animals.csv")

In RStudio, we can use the “Files” pane to “Go To Working Directory” and “View File”. Notice that all the extra white space around the data has been trimmed away.

Stata DTA Files

It is easy to read Stata .dta files using the read_dta() function from the haven package:

library(haven)

effort <- read_dta("http://data.princeton.edu/wws509/datasets/effort.dta")
head(effort)
## # A tibble: 6 x 4
##     country setting effort change
##       <chr>   <dbl>  <dbl>  <dbl>
## 1   Bolivia      46      0      1
## 2    Brazil      74      0     10
## 3     Chile      89     16     29
## 4  Colombia      77     16     25
## 5 CostaRica      84     21     29
## 6      Cuba      89     15     40

Writing to a .dta file can be done using the write_dta() function from the haven package:

write_dta(effort, "my-effort.dta")