Introduction

Workshop description

  • This is an intermediate/advanced R course
  • Appropriate for those with basic knowledge of R
  • This is not a statistics course!
  • Learning objectives:
    • Learn the R formula interface
    • Specify factor contrasts to test specific hypotheses
    • Perform model comparisons
    • Run and interpret a variety of regression models in R

Materials and Setup

Lab computer users: Log in using the user name and password on the board to your left.

Laptop users:

Everyone:

Launch RStudio

  • Open the RStudio program from the Windows start menu
  • Create a project in the Rstatistics folder you downloaded earlier:
    • File => New Project => Existing Directory => Browse and select the Rstatistics folder.

Set working directory

It is often helpful to start your R session by setting your working directory so you don’t have to type the full path names to your data and other files.

You might also start by listing the files in your working directory.
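
For example (a sketch; the path shown is only an illustration, so substitute the location of your own Rstatistics folder):

setwd("~/Desktop/Rstatistics") # adjust to your own location
getwd()      # confirm the working directory
list.files() # list the files in the working directory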

## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rstatistics"
## [1] "Exam.rds"          "NatHealth2008MI"   "NatHealth2011.rds"
## [4] "states.dta"        "states.rds"

Load the states data

The states.dta data comes from http://anawida.de/teach/SS14/anawida/4.linReg/data/states.dta.txt and appears to be originally from Statistics with Stata by Lawrence C. Hamilton.
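
A sketch of loading the data and collecting the variable labels (states.info is a name we chose; we assume states.rds was created from foreign::read.dta, which stores the Stata variable labels in a var.labels attribute):

states.data <- readRDS("states.rds")
# gather the variable names and labels stored as attributes
states.info <- data.frame(attributes(states.data)[c("names", "var.labels")])
tail(states.info, 8) # the last few variables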

##      names                      var.labels
## 14    csat        Mean composite SAT score
## 15    vsat           Mean verbal SAT score
## 16    msat             Mean math SAT score
## 17 percent       % HS graduates taking SAT
## 18 expense Per pupil expenditures prim&sec
## 19  income Median household income, $1,000
## 20    high             % adults HS diploma
## 21 college         % adults college degree

Linear regression

Examine the data before fitting models

Start by examining the data to check for problems.
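
For example, summary statistics and correlations for the two variables we will model (sts.ex.sat is a name we chose):

sts.ex.sat <- subset(states.data, select = c("expense", "csat"))
summary(sts.ex.sat) # summary of the expense and csat columns
cor(sts.ex.sat)     # correlation between expense and csat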

##     expense          csat       
##  Min.   :2960   Min.   : 832.0  
##  1st Qu.:4352   1st Qu.: 888.0  
##  Median :5000   Median : 926.0  
##  Mean   :5236   Mean   : 944.1  
##  3rd Qu.:5794   3rd Qu.: 997.0  
##  Max.   :9259   Max.   :1093.0
##            expense       csat
## expense  1.0000000 -0.4662978
## csat    -0.4662978  1.0000000

Plot the data before fitting models

Plot the data to look for multivariate outliers, non-linear relationships, etc.
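
A minimal sketch using base graphics:

plot(sts.ex.sat) # scatterplot of expense vs. csat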

Linear regression example

  • Linear regression models can be fit with the lm() function
  • For example, we can use lm to predict SAT scores based on per-pupil expenditures:
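
A sketch (sat.mod is a name we chose; the formula and data frame match the Call shown in the output):

sat.mod <- lm(csat ~ expense, data = states.data) # fit the model
summary(sat.mod) # show regression coefficients table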
## 
## Call:
## lm(formula = csat ~ expense, data = states.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -131.811  -38.085    5.607   37.852  136.495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.061e+03  3.270e+01   32.44  < 2e-16 ***
## expense     -2.228e-02  6.037e-03   -3.69 0.000563 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.81 on 49 degrees of freedom
## Multiple R-squared:  0.2174, Adjusted R-squared:  0.2015 
## F-statistic: 13.61 on 1 and 49 DF,  p-value: 0.0005631

Why is the association between expense and SAT scores negative?

Many people find it surprising that per-pupil expenditure is negatively related to SAT scores. The beauty of multiple regression is that we can try to pull these influences apart. What would the association between expense and SAT scores be if there were no difference among the states in the percentage of students taking the SAT?
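
A sketch of the model controlling for percent (the percentage of high-school graduates taking the SAT):

summary(lm(csat ~ expense + percent, data = states.data))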

## 
## Call:
## lm(formula = csat ~ expense + percent, data = states.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.921 -24.318   1.741  15.502  75.623 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 989.807403  18.395770  53.806  < 2e-16 ***
## expense       0.008604   0.004204   2.046   0.0462 *  
## percent      -2.537700   0.224912 -11.283 4.21e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.62 on 48 degrees of freedom
## Multiple R-squared:  0.7857, Adjusted R-squared:  0.7768 
## F-statistic: 88.01 on 2 and 48 DF,  p-value: < 2.2e-16

The lm class and methods

OK, we fit our model. Now what?

  • Examine the model object:
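
A sketch of ways to inspect the fitted object (sat.mod from above):

class(sat.mod) # what kind of object is it?
names(sat.mod) # components stored in the object
methods(class = class(sat.mod))[1:9] # first few methods available for lm objects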
## [1] "lm"
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
## [1] "add1.lm"                   "alias.lm"                 
## [3] "anova.lm"                  "case.names.lm"            
## [5] "coerce,oldClass,S3-method" "confint.lm"               
## [7] "cooks.distance.lm"         "deviance.lm"              
## [9] "dfbeta.lm"
  • Use function methods to get more information about the fit
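
For example, confint() computes confidence intervals for the coefficients:

confint(sat.mod) # 95% confidence intervals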
##                    2.5 %        97.5 %
## (Intercept) 995.01753164 1126.44735626
## expense      -0.03440768   -0.01014361

Linear Regression Assumptions

  • Ordinary least squares regression relies on several assumptions, including that the residuals are normally distributed and homoscedastic, that the errors are independent, and that the relationships are linear.
  • Investigate these assumptions visually by plotting your model:
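
A sketch using the plot method for lm objects, which draws the standard diagnostic plots:

par(mfrow = c(2, 2)) # arrange the plots in a 2-by-2 grid
plot(sat.mod)        # residuals, Q-Q, scale-location, and leverage plots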

Comparing models

Do congressional voting patterns predict SAT scores over and above expense? Fit two models and compare them:
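
A sketch (sat.voting.mod is a name we chose; house and senate contain missing values, so both models are refit on the same complete cases before comparing):

sat.voting.mod <- lm(csat ~ expense + house + senate,
                     data = na.omit(states.data))
sat.mod <- update(sat.mod, data = na.omit(states.data))
anova(sat.mod, sat.voting.mod) # compare the two models
coef(summary(sat.voting.mod))  # coefficients of the larger model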

## Analysis of Variance Table
## 
## Model 1: csat ~ expense
## Model 2: csat ~ expense + house + senate
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     46 169050                              
## 2     44 149284  2     19766 2.9128 0.06486 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept) 1082.93438041 38.633812740 28.0307405 1.067795e-29
## expense       -0.01870832  0.009691494 -1.9303852 6.001998e-02
## house         -1.44243754  0.600478382 -2.4021473 2.058666e-02
## senate         0.49817861  0.513561356  0.9700469 3.373256e-01

Exercise 0: least squares regression

Use the states.rds data set. Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to

  1. Examine/plot the data before fitting the model
  2. Print and interpret the model summary
  3. Plot the model to look for deviations from modeling assumptions

Select one or more additional predictors to add to your model and repeat steps 1-3. Is this model significantly better than the model with metro as the only predictor?

Interactions and factors

Modeling interactions

Interactions allow us to assess the extent to which the association between one predictor and the outcome depends on a second predictor. For example: Does the association between expense and SAT scores depend on the median income in the state?
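
A sketch (sat.expense.by.income is a name we chose; in the formula interface, expense * income expands to the two main effects plus their interaction):

sat.expense.by.income <- lm(csat ~ expense * income, data = states.data)
coef(summary(sat.expense.by.income)) # show the regression coefficients table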

##                     Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)     1.380364e+03 1.720863e+02  8.021351 2.367069e-10
## expense        -6.384067e-02 3.270087e-02 -1.952262 5.687837e-02
## income         -1.049785e+01 4.991463e+00 -2.103161 4.083253e-02
## expense:income  1.384647e-03 8.635529e-04  1.603431 1.155395e-01

Regression with categorical predictors

Let’s try to predict SAT scores from region, a categorical variable. Note that you must make sure R does not think your categorical variable is numeric.
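
A sketch (sat.region is a name we chose; region is already a factor here, but the factor() call shows how you would convert it):

str(states.data$region) # check that region is a factor
states.data$region <- factor(states.data$region)
sat.region <- lm(csat ~ region, data = states.data) # add region to the model
coef(summary(sat.region)) # show the regression coefficients table
anova(sat.region)         # show ANOVA table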

##  Factor w/ 4 levels "West","N. East",..: 3 1 1 3 1 1 2 3 NA 3 ...
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)   946.30769   14.79582 63.9577807 1.352577e-46
## regionN. East -56.75214   23.13285 -2.4533141 1.800383e-02
## regionSouth   -16.30769   19.91948 -0.8186806 4.171898e-01
## regionMidwest  63.77564   21.35592  2.9863209 4.514152e-03
## Analysis of Variance Table
## 
## Response: csat
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## region     3  82049 27349.8  9.6102 4.859e-05 ***
## Residuals 46 130912  2845.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, make sure to tell R which variables are categorical by converting them to factors!

Setting factor reference groups and contrasts

In the previous example we used the default contrasts for region. The default in R is treatment contrasts, with the first level as the reference. We can change the reference group or use another coding scheme with the C function.
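
A sketch of both approaches:

contrasts(states.data$region) # print the default (treatment) contrasts
# change the reference group to the 4th level (Midwest)
coef(summary(lm(csat ~ C(region, base = 4), data = states.data)))
# use Helmert contrasts instead
coef(summary(lm(csat ~ C(region, contr.helmert), data = states.data)))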

##         N. East South Midwest
## West          0     0       0
## N. East       1     0       0
## South         0     1       0
## Midwest       0     0       1
##                        Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)          1010.08333   15.39998 65.589930 4.296307e-47
## C(region, base = 4)1  -63.77564   21.35592 -2.986321 4.514152e-03
## C(region, base = 4)2 -120.52778   23.52385 -5.123641 5.798399e-06
## C(region, base = 4)3  -80.08333   20.37225 -3.931000 2.826007e-04
##                             Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)               943.986645   7.706155 122.4977451 1.689670e-59
## C(region, contr.helmert)1 -28.376068  11.566423  -2.4533141 1.800383e-02
## C(region, contr.helmert)2   4.022792   5.884552   0.6836191 4.976450e-01
## C(region, contr.helmert)3  22.032229   4.446777   4.9546509 1.023364e-05

See also ?contrasts, ?contr.treatment, and ?relevel.

Exercise 1: interactions and factors

Use the states data set.

  1. Add on to the regression equation that you created in Exercise 0 by generating an interaction term and testing the interaction.

  2. Try adding region to the model. Are there significant differences across the four regions?

Regression with binary outcomes

Logistic regression

Thus far we have used the lm function to fit our regression models. lm is great, but limited: in particular, it only fits models for continuous dependent variables. For categorical dependent variables we can use the glm() function.

For these models we will use a different dataset, drawn from the National Health Interview Survey. From the CDC website:

The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. For over 50 years, the U.S. Census Bureau has been the data collection agent for the National Health Interview Survey. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.

Load the National Health Interview Survey data:
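
A sketch, using the file from the directory listing earlier:

NH11 <- readRDS("NatHealth2011.rds")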

Logistic regression example

Let’s predict the probability of being diagnosed with hypertension based on age, sex, sleep, and bmi.
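
A sketch (hyp.out is a name we chose, and we assume the hypertension variable is named hypev; we first collapse it to its No/Yes levels, since the raw factor also contains refused/not-ascertained/don't-know codes):

str(NH11$hypev)    # check structure of hypev
levels(NH11$hypev) # check levels of hypev
# keep only the No/Yes levels; everything else becomes NA
NH11$hypev <- factor(NH11$hypev, levels = c("2 No", "1 Yes"))
# run our regression model
hyp.out <- glm(hypev ~ age_p + sex + sleep + bmi,
               data = NH11, family = "binomial")
coef(summary(hyp.out))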

##  Factor w/ 5 levels "1 Yes","2 No",..: 2 2 1 2 2 1 2 2 1 2 ...
## [1] "1 Yes"             "2 No"              "7 Refused"        
## [4] "8 Not ascertained" "9 Don't know"
##                 Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept) -4.269466028 0.0564947294 -75.572820 0.000000e+00
## age_p        0.060699303 0.0008227207  73.778743 0.000000e+00
## sex2 Female -0.144025092 0.0267976605  -5.374540 7.677854e-08
## sleep       -0.007035776 0.0016397197  -4.290841 1.779981e-05
## bmi          0.018571704 0.0009510828  19.526906 6.485172e-85

Logistic regression coefficients

Generalized linear models use link functions, so raw coefficients are difficult to interpret. For example, the age coefficient of 0.06 in the previous model tells us that for every one unit increase in age, the log odds of a hypertension diagnosis increase by 0.06. Since most of us are not used to thinking in log odds, this is not too helpful!

One solution is to transform the coefficients to make them easier to interpret: exponentiating the log-odds coefficients converts them to odds ratios.
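
A sketch (hyp.out.tab is a name we chose; only the point estimates are exponentiated, so the other columns remain on the original scale):

hyp.out.tab <- coef(summary(hyp.out))
hyp.out.tab[, "Estimate"] <- exp(coef(hyp.out)) # log odds -> odds ratios
hyp.out.tab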

##               Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept) 0.01398925 0.0564947294 -75.572820 0.000000e+00
## age_p       1.06257935 0.0008227207  73.778743 0.000000e+00
## sex2 Female 0.86586602 0.0267976605  -5.374540 7.677854e-08
## sleep       0.99298892 0.0016397197  -4.290841 1.779981e-05
## bmi         1.01874523 0.0009510828  19.526906 6.485172e-85

Generating predicted values

In addition to transforming the log-odds produced by glm to odds, we can use the predict() function to make direct statements about the predictors in our model. For example, we can ask “How much more likely is a 63 year old female to have hypertension compared to a 33 year old female?”
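
A sketch (predDat is a name we chose; bmi and sleep are held at their means):

# create a dataset with predictors set at desired levels
predDat <- with(NH11,
                expand.grid(age_p = c(33, 63),
                            sex = "2 Female",
                            bmi = mean(bmi, na.rm = TRUE),
                            sleep = mean(sleep, na.rm = TRUE)))
# predict hypertension at those levels
cbind(predDat, predict(hyp.out, type = "response",
                       se.fit = TRUE, newdata = predDat))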

##   age_p      sex      bmi   sleep       fit      se.fit residual.scale
## 1    33 2 Female 29.89565 7.86221 0.1289227 0.002849622              1
## 2    63 2 Female 29.89565 7.86221 0.4776303 0.004816059              1

This tells us that a 33 year old female has a 13% probability of having been diagnosed with hypertension, while a 63 year old female has a 48% probability of having been diagnosed.

Packages for computing and graphing predicted values

Instead of doing all this ourselves, we can use the effects package to compute quantities of interest for us (cf. the Zelig package).
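
A sketch using allEffects() from the effects package:

library(effects)
plot(allEffects(hyp.out))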

## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

Exercise 2: logistic regression

Use the NH11 data set that we loaded earlier.

  1. Use glm to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (r_maritl).
  2. Predict the probability of working for each level of marital status.

Note that the data is not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.

Multilevel Modeling

Multilevel modeling overview

  • Multi-level (AKA hierarchical) models are a type of mixed-effects model
  • Used to model variation due to group membership where the goal is to generalize to a population of groups
  • Can model different intercepts and/or slopes for each group
  • Mixed-effects models include two types of predictors: fixed effects and random effects
    • Fixed effects – observed levels are of direct interest (e.g., sex, political party…)
    • Random effects – observed levels are not of direct interest: the goal is to make inferences to a population represented by the observed levels
  • In R the lme4 package is the most popular for mixed-effects models
  • Use the lmer function for linear mixed models, glmer for generalized linear mixed models (loaded below)
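
Loading lme4 also loads the Matrix package it depends on, as the message below shows:

library(lme4)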
## Loading required package: Matrix

The Exam data

The Exam data set contains exam scores of 4,059 students from 65 schools in Inner London. The variable names are as follows:

variable   Description
school     School ID - a factor.
normexam   Normalized exam score.
schgend    School gender - a factor. Levels are ‘mixed’, ‘boys’, and ‘girls’.
schavg     School average of intake score.
vr         Student level Verbal Reasoning (VR) score band at intake - ‘bottom 25%’, ‘mid 50%’, and ‘top 25%’.
intake     Band of student’s intake score - a factor. Levels are ‘bottom 25%’, ‘mid 50%’, and ‘top 25%’.
standLRT   Standardised LR test score.
sex        Sex of the student - levels are ‘F’ and ‘M’.
type       School type - levels are ‘Mxd’ and ‘Sngl’.
student    Student id (within school) - a factor.
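
A sketch of loading the data, using the file from the directory listing earlier:

Exam <- readRDS("Exam.rds")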

The null model and ICC

As a preliminary step it is often useful to partition the variance in the dependent variable into the various levels. This can be accomplished by running a null model (i.e., a model with a random effects grouping structure, but no fixed-effects predictors).
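
A sketch (Norm1 matches the model name in the comparisons later on; REML = FALSE requests the maximum-likelihood fit shown in the output):

Norm1 <- lmer(normexam ~ 1 + (1 | school),
              data = na.omit(Exam), REML = FALSE)
summary(Norm1)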

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: normexam ~ 1 + (1 | school)
##    Data: na.omit(Exam)
## 
##      AIC      BIC   logLik deviance df.resid 
##   9964.9   9983.5  -4979.4   9958.9     3659 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8751 -0.6460  0.0045  0.6898  3.6842 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept) 0.1606   0.4007  
##  Residual             0.8520   0.9231  
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) -0.01301    0.05270  -0.247

The ICC is .161/(.161 + .852) = .16: 16% of the variance is at the school level.
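
A sketch of computing the ICC from the fitted model (vc is a name we chose):

vc <- as.data.frame(VarCorr(Norm1))
vc$vcov[1] / sum(vc$vcov) # school variance / total variance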

Adding fixed-effects predictors

Predict exam scores from students’ standardized test scores.
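
A sketch, extending the null model:

Norm2 <- lmer(normexam ~ standLRT + (1 | school),
              data = na.omit(Exam), REML = FALSE)
summary(Norm2)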

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: normexam ~ standLRT + (1 | school)
##    Data: na.omit(Exam)
## 
##      AIC      BIC   logLik deviance df.resid 
##   8480.1   8505.0  -4236.1   8472.1     3658 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6722 -0.6300  0.0236  0.6770  3.3347 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept) 0.09197  0.3033  
##  Residual             0.56909  0.7544  
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 1.913e-05  4.022e-02    0.00
## standLRT    5.670e-01  1.324e-02   42.84
## 
## Correlation of Fixed Effects:
##          (Intr)
## standLRT 0.007

Multiple degree of freedom comparisons

As with lm and glm models, you can compare the two lmer models using the anova function.
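
For example, a likelihood ratio test comparing the two models:

anova(Norm1, Norm2)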

## Data: na.omit(Exam)
## Models:
## Norm1: normexam ~ 1 + (1 | school)
## Norm2: normexam ~ standLRT + (1 | school)
##       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## Norm1  3 9964.9 9983.5 -4979.4   9958.9                             
## Norm2  4 8480.1 8505.0 -4236.1   8472.1 1486.8      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random slopes

Add a random effect of students’ standardized test scores as well. Now in addition to estimating the distribution of intercepts across schools, we also estimate the distribution of the slopes of exam scores on standardized test scores.
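
A sketch (the (standLRT | school) term requests both a random intercept and a random slope for standLRT):

Norm3 <- lmer(normexam ~ standLRT + (standLRT | school),
              data = na.omit(Exam), REML = FALSE)
summary(Norm3)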

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: normexam ~ standLRT + (standLRT | school)
##    Data: na.omit(Exam)
## 
##      AIC      BIC   logLik deviance df.resid 
##   8444.1   8481.4  -4216.1   8432.1     3656 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7894 -0.6233  0.0242  0.6736  3.4372 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  school   (Intercept) 0.08926  0.2988       
##           standLRT    0.01592  0.1262   0.50
##  Residual             0.55591  0.7456       
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) -0.01237    0.03978  -0.311
## standLRT     0.56111    0.02100  26.714
## 
## Correlation of Fixed Effects:
##          (Intr)
## standLRT 0.361

Test the significance of the random slope

To test the significance of a random slope, just compare models with and without the random slope term:
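
For example:

anova(Norm2, Norm3) # likelihood ratio test for the random slope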

## Data: na.omit(Exam)
## Models:
## Norm2: normexam ~ standLRT + (1 | school)
## Norm3: normexam ~ standLRT + (standLRT | school)
##       Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)    
## Norm2  4 8480.1 8505.0 -4236.1   8472.1                            
## Norm3  6 8444.1 8481.4 -4216.1   8432.1 40.01      2  2.051e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 3: multilevel modeling

Use the bh1996 dataset: data(bh1996, package="multilevel")

From the data documentation:

Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants - a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).

  1. Create a null model predicting wellbeing (“WBEING”)
  2. Calculate the ICC for your null model
  3. Run a second multi-level model that adds two individual-level predictors, average number of hours worked (“HRS”) and leadership skills (“LEAD”) to the model and interpret your output.
  4. Now, add a random effect of average number of hours worked (“HRS”) to the model and interpret your output. Test the significance of this random term.

Exercise solutions

Exercise 0 prototype

Use the states.rds data set.

Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to

  1. Examine/plot the data before fitting the model
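
A sketch (states.en.met is a name we chose):

states <- readRDS("states.rds")
states.en.met <- subset(states, select = c("metro", "energy"))
summary(states.en.met)
plot(states.en.met) # scatterplot of metro vs. energy
cor(states.en.met, use = "pairwise.complete.obs") # both variables have an NA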
##      metro            energy     
##  Min.   : 20.40   Min.   :200.0  
##  1st Qu.: 46.98   1st Qu.:285.0  
##  Median : 67.55   Median :320.0  
##  Mean   : 64.07   Mean   :354.5  
##  3rd Qu.: 81.58   3rd Qu.:371.5  
##  Max.   :100.00   Max.   :991.0  
##  NA's   :1        NA's   :1

##             metro     energy
## metro   1.0000000 -0.3397445
## energy -0.3397445  1.0000000
  2. Print and interpret the model summary
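
A sketch (en.mod is a name we chose; the formula and data match the Call in the output):

en.mod <- lm(energy ~ metro, data = states)
summary(en.mod)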
## 
## Call:
## lm(formula = energy ~ metro, data = states)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.51  -64.54  -30.87   18.71  583.97 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 501.0292    61.8136   8.105 1.53e-10 ***
## metro        -2.2871     0.9139  -2.503   0.0158 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140.2 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1154, Adjusted R-squared:  0.097 
## F-statistic: 6.263 on 1 and 48 DF,  p-value: 0.01578
  3. Plot the model to look for deviations from modeling assumptions
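
A sketch using the lm plot method:

par(mfrow = c(2, 2))
plot(en.mod)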

Select one or more additional predictors to add to your model and repeat steps 1-3. Is this model significantly better than the model with metro as the only predictor?
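
A sketch adding pop and waste as predictors (en.mult.mod is a name we chose):

states.en.more <- subset(states, select = c("energy", "metro", "pop", "waste"))
summary(states.en.more)
cor(states.en.more, use = "pairwise.complete.obs")
en.mult.mod <- lm(energy ~ metro + pop + waste, data = states)
summary(en.mult.mod)
anova(en.mod, en.mult.mod) # is the bigger model significantly better?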

##      energy          metro             pop               waste       
##  Min.   :200.0   Min.   : 20.40   Min.   :  454000   Min.   :0.5400  
##  1st Qu.:285.0   1st Qu.: 46.98   1st Qu.: 1299750   1st Qu.:0.8225  
##  Median :320.0   Median : 67.55   Median : 3390500   Median :0.9600  
##  Mean   :354.5   Mean   : 64.07   Mean   : 4962040   Mean   :0.9888  
##  3rd Qu.:371.5   3rd Qu.: 81.58   3rd Qu.: 5898000   3rd Qu.:1.1450  
##  Max.   :991.0   Max.   :100.00   Max.   :29760000   Max.   :1.5100  
##  NA's   :1       NA's   :1        NA's   :1          NA's   :1

##            energy      metro        pop      waste
## energy  1.0000000 -0.3397445 -0.1840357 -0.2526499
## metro  -0.3397445  1.0000000  0.5653562  0.4877881
## pop    -0.1840357  0.5653562  1.0000000  0.5255713
## waste  -0.2526499  0.4877881  0.5255713  1.0000000
## 
## Call:
## lm(formula = energy ~ metro + pop + waste, data = states)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -224.62  -67.48  -31.76   12.65  589.48 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.617e+02  9.905e+01   5.671    9e-07 ***
## metro       -2.079e+00  1.168e+00  -1.780   0.0816 .  
## pop          1.649e-06  4.809e-06   0.343   0.7332    
## waste       -8.307e+01  1.042e+02  -0.797   0.4295    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.2 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1276, Adjusted R-squared:  0.07067 
## F-statistic: 2.242 on 3 and 46 DF,  p-value: 0.09599
## Analysis of Variance Table
## 
## Model 1: energy ~ metro
## Model 2: energy ~ metro + pop + waste
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     48 943103                           
## 2     46 930153  2     12949 0.3202 0.7276

Exercise 1 prototype

Use the states data set.

  1. Add on to the regression equation that you created in Exercise 0 by generating an interaction term and testing the interaction.
  2. Try adding region to the model. Are there significant differences across the four regions?
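
A sketch covering both steps with a single model (en.int.mod is a name we chose; metro * waste expands to the main effects plus their interaction):

en.int.mod <- lm(energy ~ metro * waste + region, data = states)
anova(en.int.mod)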
## Analysis of Variance Table
## 
## Response: energy
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## metro        1 123064  123064  6.4435 0.01484 *
## waste        1  10572   10572  0.5535 0.46093  
## region       3 111128   37043  1.9395 0.13749  
## metro:waste  1    156     156  0.0082 0.92835  
## Residuals   43 821247   19099                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 2 prototype

Use the NH11 data set that we loaded earlier. Note that the data is not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.

  1. Use glm to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (r_maritl).
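
A sketch (mod.wk.age.mar is a name we chose; everwrk is collapsed to its No/Yes levels and unused marital-status levels are dropped before fitting):

# check the variables of interest
summary(NH11[, c("everwrk", "age_p", "r_maritl")])
# clean up: keep only the No/Yes levels of everwrk
NH11 <- transform(NH11,
                  everwrk = factor(everwrk, levels = c("2 No", "1 Yes")),
                  r_maritl = droplevels(r_maritl))
mod.wk.age.mar <- glm(everwrk ~ age_p + r_maritl,
                      data = NH11, family = "binomial")
summary(mod.wk.age.mar)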
##               everwrk          age_p      
##  1 Yes            :12153   Min.   :18.00  
##  2 No             : 1887   1st Qu.:33.00  
##  7 Refused        :   17   Median :47.00  
##  8 Not ascertained:    0   Mean   :48.11  
##  9 Don't know     :    8   3rd Qu.:62.00  
##  NA's             :18949   Max.   :85.00  
##                                           
##                             r_maritl    
##  1 Married - spouse in household:13943  
##  7 Never married                : 7763  
##  5 Divorced                     : 4511  
##  4 Widowed                      : 3069  
##  8 Living with partner          : 2002  
##  6 Separated                    : 1121  
##  (Other)                        :  605
## 
## Call:
## glm(formula = everwrk ~ age_p + r_maritl, family = "binomial", 
##     data = NH11)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0436  -0.5650  -0.4391  -0.3370   2.7308  
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                 -0.440248   0.093538  -4.707
## age_p                                       -0.029812   0.001645 -18.118
## r_maritl2 Married - spouse not in household  0.049675   0.217310   0.229
## r_maritl4 Widowed                            0.683618   0.084335   8.106
## r_maritl5 Divorced                          -0.730115   0.111681  -6.538
## r_maritl6 Separated                         -0.128091   0.151366  -0.846
## r_maritl7 Never married                      0.343611   0.069222   4.964
## r_maritl8 Living with partner               -0.443583   0.137770  -3.220
## r_maritl9 Unknown marital status             0.395480   0.492967   0.802
##                                             Pr(>|z|)    
## (Intercept)                                 2.52e-06 ***
## age_p                                        < 2e-16 ***
## r_maritl2 Married - spouse not in household  0.81919    
## r_maritl4 Widowed                           5.23e-16 ***
## r_maritl5 Divorced                          6.25e-11 ***
## r_maritl6 Separated                          0.39742    
## r_maritl7 Never married                     6.91e-07 ***
## r_maritl8 Living with partner                0.00128 ** 
## r_maritl9 Unknown marital status             0.42241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11082  on 14039  degrees of freedom
## Residual deviance: 10309  on 14031  degrees of freedom
##   (18974 observations deleted due to missingness)
## AIC: 10327
## 
## Number of Fisher Scoring iterations: 5
  2. Predict the probability of working for each level of marital status.
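
A sketch using Effect() from the effects package, which averages over the other predictors:

library(effects)
data.frame(Effect("r_maritl", mod.wk.age.mar))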
##                              r_maritl        fit         se      lower
## 1     1 Married - spouse in household 0.10822000 0.04413754 0.10014980
## 2 2 Married - spouse not in household 0.11310823 0.21326041 0.07746061
## 3                           4 Widowed 0.19381087 0.06806325 0.17381358
## 4                          5 Divorced 0.05524394 0.10272953 0.04562877
## 5                         6 Separated 0.09646417 0.14579706 0.07426824
## 6                     7 Never married 0.14611000 0.05978759 0.13208775
## 7               8 Living with partner 0.07224958 0.13285112 0.05662466
## 8            9 Unknown marital status 0.15270076 0.49100994 0.06440837
##        upper
## 1 0.11685606
## 2 0.16227532
## 3 0.21550873
## 4 0.06674358
## 5 0.12440219
## 6 0.16134411
## 7 0.09176661
## 8 0.32055728

Exercise 3 prototype

Use the bh1996 dataset:

From the data documentation:

Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants - a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).

Note that the group identifier is named “GRP”.

  1. Create a null model predicting wellbeing (“WBEING”)
  2. Run a second multi-level model that adds two individual-level predictors, average number of hours worked (“HRS”) and leadership skills (“LEAD”) to the model and interpret your output.
  3. Now, add a random effect of average number of hours worked (“HRS”) to the model and interpret your output. Test the significance of this random term.
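
A sketch of one way through these steps (model names are our own; lmer comes from the lme4 package loaded earlier):

data(bh1996, package = "multilevel")
# 1. null model with a random intercept for group
mod.grp0 <- lmer(WBEING ~ 1 + (1 | GRP), data = bh1996)
summary(mod.grp0)
# the ICC is the GRP intercept variance divided by the total variance
# (both appear in the Random effects table of the summary)
# 2. add the individual-level predictors
mod.grp1 <- lmer(WBEING ~ HRS + LEAD + (1 | GRP), data = bh1996)
summary(mod.grp1)
# 3. add a random slope for HRS and test it against the simpler model
mod.grp2 <- lmer(WBEING ~ HRS + LEAD + (1 + HRS | GRP), data = bh1996)
anova(mod.grp1, mod.grp2)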

Wrap-up

Help us make this workshop better!

  • Please take a moment to fill out a very short feedback form.