Data Read/Write
There are many ways to get data into and out of R. This lesson aims to give you the tools to navigate your many options. RStudio’s Data Import Cheat Sheet provides an excellent overview. Working in the social sciences you are likely to come across the following types of data:
Data Type | Suggested Library |
---|---|
CSV | readr |
SPSS, Stata, or SAS | haven |
Excel | readxl |
Databases | DBI |
Web APIs | httr |
Web Scraping | rvest |
To keep this lesson manageable we will focus primarily on reading and writing .csv files, and we will show how working with Stata .dta files is quite similar.
Paths to Files and Folders
One of the trickiest parts of working with R is telling it where to find files on disk. I found this introduction to working with files and folders quite helpful. Here’s a quick overview of that article.
Whenever you are working in R, your R session has a current working directory. This is the folder where R is working from. So, if you tell R to open a file named my-data.csv
it will look in the current working directory to see if it can find that file. If that file does not exist in that directory you’ll receive an error message.
library(readr)
read_csv("my-data.csv")
## Error: 'my-data.csv' does not exist in current working directory ('/home/travis/build/hbs-rcs/R_Intro-gapminder/content/post').
Often it is helpful to know where R is working from, to see the current working directory use the getwd()
function:
getwd()
## [1] "/home/travis/build/hbs-rcs/R_Intro-gapminder/content/post"
As R compiles this lesson, it’s working from the directory listed above. This is likely different from the output you see running the same function on your computer. To change the working directory use the setwd()
function. Unfortunately paths on Windows are a little different compared to paths on Mac and Linux. RStudio makes navigating your file system a little easier. Use the “Files” pane to navigate into the folder where you want to be working, click the “More” button and select “Set As Working Directory”. You’ll see that RStudio calls the setwd()
function with the appropriate path for your operating system.
Read from CSV
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
Download the data for this lesson using this link. This CSV file contains country-by-year level data on GDP, population, and life expectancy.
Use RStudio’s “Files” pane to navigate to where you downloaded the data.
Read the CSV with the read_csv()
function from the readr package. The argument to read_csv()
is the location of the file, relative to your working directory. Run the following code to read the data into R:
library(readr)
gapminder <- read_csv("gapminder-FiveYearData.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_integer(),
## pop = col_double(),
## continent = col_character(),
## lifeExp = col_double(),
## gdpPercap = col_double()
## )
If readr is not installed on your computer, use install.packages("readr")
to install it.
We have created a new object in the global environment called gapminder
. Looking at RStudio’s “Environment” pane we see this object has type tbl_df, it has 6 columns and 1704 rows. If we want to check that our data has been loaded, we can print the variable’s value:
gapminder
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## 7 Afghanistan 1982 12881816 Asia 39.854 978.0114
## 8 Afghanistan 1987 13867957 Asia 40.822 852.3959
## 9 Afghanistan 1992 16317921 Asia 41.674 649.3414
## 10 Afghanistan 1997 22227415 Asia 41.763 635.3414
## # ... with 1,694 more rows
Congratulations! You have successfully read data into R.
Inspect Data
Now that you have data in R, let’s run through a few functions that are helpful for inspecting your data.
Use the head()
function to see the first few rows of a data frame.
head(gapminder)
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
Use the tail()
function to see the last few rows:
tail(gapminder)
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <chr> <int> <dbl> <chr> <dbl> <dbl>
## 1 Zimbabwe 1982 7636524 Africa 60.363 788.8550
## 2 Zimbabwe 1987 9216418 Africa 62.351 706.1573
## 3 Zimbabwe 1992 10704340 Africa 60.377 693.4208
## 4 Zimbabwe 1997 11404948 Africa 46.809 792.4500
## 5 Zimbabwe 2002 11926563 Africa 39.989 672.0386
## 6 Zimbabwe 2007 12311143 Africa 43.487 469.7093
Use the str()
function to inspect the structure of an object:
str(gapminder, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Note: If you use the read.csv()
function from base R, the default behavior is to convert character columns into factors (integer values with character labels).
str(read.csv("gapminder-FiveYearData.csv"))
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
How big is my data?
Three frequently used functions will allow you to see the shape of your data. dim()
will return a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
dim(gapminder)
## [1] 1704 6
You can also assess these independently using the nrow()
and ncol()
functions:
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6
What variables are in my data?
If your data has column headers, this information can be retrieved with the colnames()
command. More colloquially, you can use names()
if your object is a data.frame
.
colnames(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
names(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
Likewise, the rownames()
command will retrieve the row headers, if present. If not, this command will simply return the row number:
# look at the head of all the rownames
head(rownames(gapminder))
## [1] "1" "2" "3" "4" "5" "6"
How are the variables distributed?
The summary()
function is a very convenient way to get summary statistics for each column:
summary(gapminder)
## country year pop continent
## Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
## Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
## Mode :character Median :1980 Median :7.024e+06 Mode :character
## Mean :1980 Mean :2.960e+07
## 3rd Qu.:1993 3rd Qu.:1.959e+07
## Max. :2007 Max. :1.319e+09
## lifeExp gdpPercap
## Min. :23.60 Min. : 241.2
## 1st Qu.:48.20 1st Qu.: 1202.1
## Median :60.71 Median : 3531.8
## Mean :59.47 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.: 9325.5
## Max. :82.60 Max. :113523.1
As you can see, this information is more valuable for some columns than others. We can use the table()
function to tabulate categorical variables.
table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
Create Data in Code
Often it is useful to create variables to hold scalar values, for instance
path <- "~/my-file.txt"
i <- 10
Other times it’s useful to create vectors:
filenames <- c("one.txt", "two.txt")
indices <- c(2, 5, 7)
And sometimes it is useful to create small data frames in code. The read_csv()
function makes this easy as well:
animals <- read_csv(
"name , feel , type , weight_lbs
Andrew , Furry , Dog , 45
Bob , Furry , Cat , 8
Carl , Squishy , Sea Cucumber , 1.1
Doug , Spiny , Sea Urchin , 0.8"
)
Let’s look at the data we’ve created.
animals
## # A tibble: 4 x 4
## name feel type weight_lbs
## <chr> <chr> <chr> <dbl>
## 1 Andrew Furry Dog 45.0
## 2 Bob Furry Cat 8.0
## 3 Carl Squishy Sea Cucumber 1.1
## 4 Doug Spiny Sea Urchin 0.8
We have created a table of animals. There are four rows (animals) and four columns (characteristics of each animal). Andrew is a furry dog that weighs 45 pounds.
Write to CSV
Let’s use the write_csv()
function from the readr package to store our animals data in a CSV file:
write_csv(animals, "my-animals.csv")
In RStudio, we can use the “Files” pane to “Go To Working Directory” and “View File”. Notice that all the extra white space around the data has been trimmed away.
Stata DTA Files
It is easy to read Stata .dta files using the read_dta()
function from the haven package:
library(haven)
effort <- read_dta("http://data.princeton.edu/wws509/datasets/effort.dta")
head(effort)
## # A tibble: 6 x 4
## country setting effort change
## <chr> <dbl> <dbl> <dbl>
## 1 Bolivia 46 0 1
## 2 Brazil 74 0 10
## 3 Chile 89 16 29
## 4 Colombia 77 16 25
## 5 CostaRica 84 21 29
## 6 Cuba 89 15 40
Writing to a .dta file can be done using the write_dta()
function from the haven package:
write_dta(effort, "my-effort.dta")