Easier web scraping in R with tidyverse

I recently used R for a moderately complicated scraping task and found that tools and techniques from the tidyverse made for a very pleasant web scraping experience, especially when retrieving nested data. In particular, the nest/unnest functions in the tidyr package make it easy to implement breadth-first scrapers in R: you nest the results from each level, then expand them into a tabular structure. This approach keeps the program logic easy to follow, and it stores the retrieved values in a convenient format.
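To make the nest-then-expand idea concrete, here is a minimal sketch rather than the scraper I actually wrote. It assumes tidyr 1.0 or later, and it runs offline: the `site` list and the `fetch_page()` helper are hypothetical stand-ins for real pages and for calling read_html() on a URL.

```r
library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical in-memory "site" (page name -> HTML) so the sketch needs no network.
site <- list(
  index = '<ul><li><a href="a">Page A</a></li><li><a href="b">Page B</a></li></ul>',
  a     = '<p class="item">a1</p><p class="item">a2</p>',
  b     = '<p class="item">b1</p>'
)

# Hypothetical stand-in for read_html(url): parse the stored HTML string instead.
fetch_page <- function(name) read_html(site[[name]])

# Level 1: one row per link found on the index page.
links <- fetch_page("index") %>% html_nodes("a")
level1 <- tibble(
  title = html_text(links),
  href  = html_attr(links, "href")
)

# Level 2: visit every linked page and keep what it returns in a list-column,
# so each row carries its own nested results.
level2 <- level1 %>%
  mutate(items = map(href, ~ fetch_page(.x) %>% html_nodes(".item") %>% html_text()))

# Expand the nested results into a flat table: one row per (page, item) pair.
result <- level2 %>% unnest(cols = items)
```

Each level of the crawl adds one list-column, and each unnest() call flattens one level, so the shape of the data frame mirrors the shape of the site at every step.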

Web Scraping in R

Let’s walk through some steps for web scraping with R. On this Wikipedia page there is a table of visa requirements that I want to scrape. Let’s use the rvest package to get the HTML associated with that page:

    library(rvest)
    html <- read_html("https://en.wikipedia.org/wiki/Visa_requirements_for_United_States_citizens")
    html

    ## {html_document}
    ## <html class="client-nojs" lang="en" dir="ltr">
    ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
    ## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Now let’s use the html_nodes() function to extract the table of interest.