Web Scraping in R
Let’s walk through some steps for web scraping with R. On this Wikipedia page there is a table of visa requirements that I want to scrape. Let’s use the rvest package to get the HTML associated with that page:
library(rvest)
html <- read_html("https://en.wikipedia.org/wiki/Visa_requirements_for_United_States_citizens")
html
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
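The printed output confirms we have an HTML document. As a quick sanity check (a small aside on my part, not part of the original workflow), html_text() can extract the page title to confirm we fetched the right page:
# Quick sanity check: the page <title> text should mention visa requirements
html_text(html_node(html, "title"))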
Now let’s use the html_node() function to extract the table of interest. I used Chrome’s Developer Tools to get the XPath of the table (see the notes at the end of the post for how to do this):
referenced_by <- html_node(html, xpath='//*[@id="mw-content-text"]/div/table[1]')
referenced_by
## {html_node}
## <table class="sortable wikitable">
## [1] <tbody>\n<tr>\n<th style="width:18%;">Country\n</th>\n<th style="wid ...
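I used XPath here, but html_node() also accepts CSS selectors via its css argument. A minimal alternative sketch, assuming the target is the first table on the page with the wikitable class (Wikipedia’s markup can change, so verify the selector in Developer Tools):
# Same table via a CSS selector instead of XPath
html_node(html, css = "table.wikitable")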
Now let’s convert that HTML table into a data frame.
visa_requirements <- html_table(referenced_by)
head(visa_requirements[,1:3])
## Country Visa requirement Allowed stay
## 1 Afghanistan Visa required[2][3]
## 2 Albania Visa not required[5][6] 1 year[7]
## 3 Algeria Visa required[8][9]
## 4 Andorra Visa not required[10] 3 months[11][12]
## 5 Angola eVisa[13][14][15] 30 days
## 6 Antigua and Barbuda Visa not required[18][19] 6 months[20]
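One hedge worth knowing about: html_table() can fail on tables whose rows contain unequal numbers of cells. In the rvest releases current as of this writing (0.x), passing fill = TRUE pads short rows with NA. This table parses fine without it, but the call would look like this:
# Only needed for ragged tables; harmless for this one
visa_requirements <- html_table(referenced_by, fill = TRUE)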
Finally, we can clean the footnote references (the bracketed numbers such as [2][3]) from columns 2 and 3 using gsub(). The pattern "\\[.*" removes everything from the first opening bracket to the end of the string, which works here because the footnotes always come last.
visa_requirements$`Visa requirement` <- gsub("\\[.*","",visa_requirements$`Visa requirement`)
visa_requirements$`Allowed stay` <- gsub("\\[.*","",visa_requirements$`Allowed stay`)
head(visa_requirements[,1:3])
## Country Visa requirement Allowed stay
## 1 Afghanistan Visa required
## 2 Albania Visa not required 1 year
## 3 Algeria Visa required
## 4 Andorra Visa not required 3 months
## 5 Angola eVisa 30 days
## 6 Antigua and Barbuda Visa not required 6 months
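The gsub() approach works column by column. If you’re already using the tidyverse, the same cleanup can be applied to every character column at once. A minimal sketch, assuming dplyr (1.0 or later, for across()) and stringr are installed, and using a slightly stricter pattern that removes each bracketed marker rather than everything after the first bracket:
library(dplyr)
library(stringr)

# Remove every bracketed footnote marker, e.g. "[2][3]",
# from all character columns in one pass
visa_requirements <- visa_requirements %>%
  mutate(across(where(is.character), ~ str_remove_all(.x, "\\[[^\\]]*\\]")))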
We’ve only scratched the surface here, but I hope this example shows off the convenience of the rvest package.
Notes:
Chrome’s Developer Tools can be launched by right-clicking on the page and selecting Inspect. Then, in the Elements panel, mouse over the HTML code until the table of interest is highlighted on the page. Right-click the highlighted element and select Copy -> Copy XPath.
If writing custom scraping scripts in R is not the route you want to take, our team has recently discovered a very nice and flexible commercial tool, Mozenda. As of 8/8/2019, they offer a 30-day trial of the full product.