Web Scraping in R

Let’s walk through some steps for web scraping with R. On this Wikipedia page there is a table of visa requirements that I want to scrape. Let’s use the rvest package to get the HTML associated with that page:

library(rvest)

html <- read_html("https://en.wikipedia.org/wiki/Visa_requirements_for_United_States_citizens")
html
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Now let’s use the html_node() function, which returns the first element matching a selector, to extract the table of interest. I used Chrome’s Developer Tools to get the XPath of the table (see the notes at the end of the post for how to do it):

referenced_by <- html_node(html, xpath='//*[@id="mw-content-text"]/div/table[1]')
referenced_by
## {html_node}
## <table class="sortable wikitable">
## [1] <tbody>\n<tr>\n<th style="width:18%;">Country\n</th>\n<th style="wid ...
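
If you want to try the selection step without hitting Wikipedia, here is a minimal sketch on a tiny hand-written page. The HTML, the "content" id and the two table rows are made up for illustration; only read_html(), html_node() and html_attr() come from rvest. It also shows that a CSS selector can stand in for the XPath when the table has a distinctive class.

```r
library(rvest)

# A tiny hand-written page standing in for the real article (hypothetical data).
page <- read_html('<html><body>
  <div id="content">
    <table class="wikitable">
      <tr><th>Country</th><th>Visa requirement</th></tr>
      <tr><td>Albania</td><td>Visa not required[5]</td></tr>
      <tr><td>Algeria</td><td>Visa required[8]</td></tr>
    </table>
  </div>
</body></html>')

# Select the same table two ways: by XPath and by CSS selector.
by_xpath <- html_node(page, xpath = '//*[@id="content"]/table[1]')
by_css   <- html_node(page, css = "table.wikitable")

# Both selections should report the same class attribute.
html_attr(by_xpath, "class")
html_attr(by_css, "class")
```

A CSS selector is often less brittle than a copied XPath, because it does not depend on the exact nesting of the page.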

Now let’s convert that HTML table into a data frame.

visa_requirements <- html_table(referenced_by)
head(visa_requirements[,1:3])
##               Country          Visa requirement     Allowed stay
## 1         Afghanistan       Visa required[2][3]                 
## 2             Albania   Visa not required[5][6]        1 year[7]
## 3             Algeria       Visa required[8][9]                 
## 4             Andorra     Visa not required[10] 3 months[11][12]
## 5              Angola         eVisa[13][14][15]          30 days
## 6 Antigua and Barbuda Visa not required[18][19]     6 months[20]

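Finally, we can clean the footnote references from columns 2 and 3 using gsub(). The pattern \\[.* matches from the first opening bracket to the end of the string, so every footnote marker in a cell is removed in one pass.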

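# Re-create the data frame so this block can be run on its own
visa_requirements <- html_table(referenced_by)
# Drop everything from the first "[" onward, e.g. "Visa required[2][3]" becomes "Visa required"
visa_requirements$`Visa requirement` <- gsub("\\[.*", "", visa_requirements$`Visa requirement`)
visa_requirements$`Allowed stay` <- gsub("\\[.*", "", visa_requirements$`Allowed stay`)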
head(visa_requirements[,1:3])
##               Country  Visa requirement Allowed stay
## 1         Afghanistan     Visa required             
## 2             Albania Visa not required       1 year
## 3             Algeria     Visa required             
## 4             Andorra Visa not required     3 months
## 5              Angola             eVisa      30 days
## 6 Antigua and Barbuda Visa not required     6 months
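
One thing to keep in mind about that pattern: it also removes any text that follows the first bracket. Here is a small base R sketch comparing it with a narrower pattern that strips only the numeric markers. The sample strings are made up (the last one is not from the Wikipedia table), and the narrower pattern is an alternative I am suggesting, not what the code above uses.

```r
# Sample cell values modelled on the table above (the last one is hypothetical).
cells <- c("Visa required[2][3]", "1 year[7]", "30 days", "Free[1] on arrival")

# Pattern used in the post: drop everything from the first "[" to the end.
strip_tail <- gsub("\\[.*", "", cells)

# Narrower alternative: drop only bracketed numeric markers such as "[12]".
strip_refs <- gsub("\\[[0-9]+\\]", "", cells)

strip_tail
strip_refs
```

For the first three values both patterns give the same result. For the last one, strip_tail keeps only "Free" while strip_refs keeps "Free on arrival", so the narrower pattern is the safer choice when markers can appear mid-cell.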

We’ve only scratched the surface here, but we hope this example shows off the convenience of the rvest package.

Notes:

  • Chrome’s Developer Tools can be launched by right-clicking on the page and selecting Inspect. Then, in the Elements panel, mouse over the HTML code until the table of interest is highlighted on the page. Right-click that element and select Copy -> Copy XPath.

  • If writing custom scraping scripts in R is not the route you want to take, our team recently discovered a nice and flexible commercial tool, Mozenda. As of 8/8/2019, they offer a 30-day trial of the full product.
