Easier web scraping in R with tidyverse

I recently used R for a moderately complicated scraping task, and found that tools and techniques from the tidyverse made for a very pleasant web scraping experience, especially for retrieving nested data. In particular, the nest/unnest functions in the tidyr package make it easy to implement breadth-first scrapers in R: nest the results from each level, then expand them into a tabular structure. This approach keeps the program logic easy to follow, and it stores the retrieved values in a convenient format.
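To make the level-by-level idea concrete, here is a minimal sketch rather than the code from the actual scraping task. It assumes a hypothetical in-memory `fake_site` list in place of real HTTP requests, and a hypothetical helper `get_links()` that returns the links found on a page. Each level is collected into a list-column with `purrr::map()` and then expanded with `tidyr::unnest()`.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical stand-in for a website: each page maps to the links found on it.
# A real scraper would fetch and parse the page here instead.
fake_site <- list(
  "/index" = c("/a", "/b"),
  "/a"     = c("/a/1", "/a/2"),
  "/b"     = c("/b/1")
)

# Hypothetical helper: return the links on a page (empty if the page has none).
get_links <- function(url) {
  links <- fake_site[[url]]
  if (is.null(links)) character(0) else links
}

scraped <- tibble(page = "/index") %>%
  mutate(section = map(page, get_links)) %>%    # level 1: nested list-column
  unnest(section) %>%                           # expand to one row per section
  mutate(article = map(section, get_links)) %>% # level 2: nested list-column
  unnest(article)                               # expand to one row per article

print(scraped)
```

Each `mutate()`/`unnest()` pair handles one level of the site, so the result is a flat table with one row per page, section and article path, and the parent values are carried along automatically.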

Web Scraping in R

Let’s walk through some steps for web scraping with R. On this Wikipedia page there is a table of visa requirements that I want to scrape. Let’s use the rvest package to get the HTML associated with that page:

    library(rvest)
    html <- read_html("https://en.wikipedia.org/wiki/Visa_requirements_for_United_States_citizens")
    html
    ## {html_document}
    ## <html class="client-nojs" lang="en" dir="ltr">
    ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
    ## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Now let’s use the html_nodes() function to extract the table of interest.
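The extraction step might look something like the sketch below. It does not use the live Wikipedia page: the inline HTML, the `wikitable` class selector, and the country names are all assumptions chosen for illustration, so the selector would need to be checked against the real page.

```r
library(rvest)

# Hypothetical miniature page standing in for the Wikipedia article.
page <- read_html('
  <html><body>
    <table class="wikitable">
      <tr><th>Country</th><th>Visa requirement</th></tr>
      <tr><td>Exampleland</td><td>Visa not required</td></tr>
      <tr><td>Samplestan</td><td>Visa required</td></tr>
    </table>
  </body></html>')

# Select every table matching the CSS selector (assumed class name).
tables <- html_nodes(page, "table.wikitable")

# Convert the first matching table into a data frame.
visa <- html_table(tables[[1]])
print(visa)
```

Because `html_nodes()` returns every match, it is worth checking `length(tables)` on a real page before picking an index.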

Use machine learning for causal effect in observational study

A simulation for OLS model

In an observational study, we need to assume that we have the correct functional form in order to estimate the causal effect correctly, in addition to assuming that the treatment is exogenous.

    library(MASS)
    library(ggplot2)
    library(dplyr)
    library(tmle)
    library(glmnet)

    set.seed(366)
    nobs <- 2000

    # Pairwise association parameters for the three variables x, z and w
    xw <- .8
    xz <- .5
    zw <- .6
    nrow <- 3
    ncol <- 3

    # 3 x 3 covariance matrix, ordered as (x, z, w)
    covarMat = matrix(
      c(1^2,  xz^2, xw^2,
        xz^2, 1^2,  zw^2,
        xw^2, zw^2, 1^2),
      nrow = ncol, ncol = ncol
    )
    mu <- rep(0, 3)

    # Draw correlated normal variables and name the columns
    rawvars <- mvrnorm(n = nobs, mu = mu, Sigma = covarMat)
    df <- tbl_df(rawvars)
    names(df) <- c('x', 'z', 'w')
    df <- df %>% mutate(log.