Easier web scraping in R with tidyverse
I recently used R for a moderately complicated scraping task, and found that using
tools and techniques from the tidyverse made for a very pleasant web scraping
experience, especially for retrieving nested data. In particular, the nest/unnest
functions in the tidyr
package make it easy to implement breadth-first scrapers in R by
nesting the results from each level and then expanding to a tabular structure. This
approach has the advantage of making it easy to follow the program logic, and it also
makes it very easy to store retrieved values in a convenient format.
Example: HBS workshops
As a simple example of a website with a nested structure consider https://training.rcs.hbs.org/workshops. This site lists workshops nested within categories.
Start at the top and store results in tibbles
Using the tidyverse packages along with rvest make web scraping in R more convenient.
library(tidyverse)
library(rvest)
To retrieve workshop information from https://training.rcs.hbs.org we can start by creating
a tibble
to store the data we will retrieve from the site. To begin with this tibble
has
only one row and one column containing the URL of the starting page. This might seem like
a strange way to start, but it helps us keep a consistent and clean pattern as we descend
through the nested structure of the website.
ws_data <- tibble(start_url = "https://training.rcs.hbs.org/workshops")
Store retrieved data in list columns and unnest as needed
Next we mutate the data, reading the page containing the outer-most collection and extracting the information we need. The information we extract includes URLs at the next level of the tree we are traversing. Because we will retrieve multiple elements we store the result in a list-column.
ws_data <- ws_data %>%
mutate(category = map(start_url,
~ read_html(.) %>%
html_nodes(".menu-depth-2 a") %>%
{tibble(name = html_text(.),
url = html_attr(., "href"))})
)
glimpse(ws_data)
## Observations: 1
## Variables: 2
## $ start_url <chr> "https://training.rcs.hbs.org/workshops"
## $ category <list> [<tbl_df[7 x 2]>]
Our data structure still only has one row, but we can easily expand it so that it has one row per category.
ws_data <- ws_data %>%
unnest(category, names_sep = "_", keep_empty = TRUE)
glimpse(ws_data)
## Observations: 7
## Variables: 3
## $ start_url <chr> "https://training.rcs.hbs.org/workshops", "https://tr...
## $ category_name <chr> "HBS Grid Training ", "R", "Stata", "Python", "Other ...
## $ category_url <chr> "https://training.rcs.hbs.org/compute-grid-training",...
Each of the categories contains one or more workshops, so the next step is to iterate over categories and retrieve the all the workshop links. Because we want to retrieve more than one value for each category we store the result in a list-column.
ws_data <- ws_data %>%
mutate(workshop = map(category_url,
~ read_html(.) %>%
html_nodes(".menu-depth-3 a") %>%
{tibble(name = html_text(.),
url = html_attr(., "href"))})
)
glimpse(ws_data)
## Observations: 7
## Variables: 4
## $ start_url <chr> "https://training.rcs.hbs.org/workshops", "https://tr...
## $ category_name <chr> "HBS Grid Training ", "R", "Stata", "Python", "Other ...
## $ category_url <chr> "https://training.rcs.hbs.org/compute-grid-training",...
## $ workshop <list> [<tbl_df[0 x 2]>, <tbl_df[5 x 2]>, <tbl_df[2 x 2]>, ...
As before we unnest the data, making sure to keep empty rows.
ws_data <- ws_data %>%
unnest(workshop, names_sep = "_", keep_empty = TRUE)
glimpse(ws_data)
## Observations: 18
## Variables: 5
## $ start_url <chr> "https://training.rcs.hbs.org/workshops", "https://tr...
## $ category_name <chr> "HBS Grid Training ", "R", "R", "R", "R", "R", "Stata...
## $ category_url <chr> "https://training.rcs.hbs.org/compute-grid-training",...
## $ workshop_name <chr> NA, "Introduction to R", "Introduction to R Graphics ...
## $ workshop_url <chr> NA, "https://training.rcs.hbs.org/introduction-r", "h...
Putting it all together
As simple as it is, the code examples above can be simplified even further by modularizing the data processing functions. Here is the whole simplified program for retrieving workshop information, in less than 20 lines of code.
library(tidyverse)
library(rvest)
get_links <- function(url, css) {
read_html(url) %>%
html_nodes(css) %>%
{tibble(name = html_text(.),
url = html_attr(., "href"))}
}
ws_data <- tibble(start_url = "https://training.rcs.hbs.org/workshops")
ws_data <- ws_data %>%
mutate(category = map(start_url, get_links, css = ".menu-depth-2 a")) %>%
unnest(category, names_sep = "_", keep_empty = TRUE) %>%
mutate(workshop = map(category_url, get_links, css = ".menu-depth-3 a")) %>%
unnest(workshop, names_sep = "_", keep_empty = TRUE)
glimpse(ws_data)
## Observations: 18
## Variables: 5
## $ start_url <chr> "https://training.rcs.hbs.org/workshops", "https://tr...
## $ category_name <chr> "HBS Grid Training ", "R", "R", "R", "R", "R", "Stata...
## $ category_url <chr> "https://training.rcs.hbs.org/compute-grid-training",...
## $ workshop_name <chr> NA, "Introduction to R", "Introduction to R Graphics ...
## $ workshop_url <chr> NA, "https://training.rcs.hbs.org/introduction-r", "h...
Conclusions
The key pattern is mutate
to a list-column
containing tibbles
and then unnest
to maintain a tabular record of URLs and results at
each level. This expands the data structure as you descend
through each level, resulting in a nice clean tabular structure at the end. At each level
unest(names_sep = "_")
produces a consistent naming scheme with minimal effort. Finally,
this pattern generalizes easily to cases where you wish to retrieve multiple pieces of
information at each level.