Lab computer users: Log in using the user name and password on the board to your left.
Laptop users:
Everyone:
File => New Project => Existing Directory => Browse
and select the Rstatistics folder. It is often helpful to start your R session by setting your working directory so you don’t have to type the full path names to your data and other files.
# set the working directory
# setwd("~/Desktop/Rstatistics")
# setwd("C:/Users/dataclass/Desktop/Rstatistics")
You might also start by listing the files in your working directory
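The listing below shows the working directory and the files it contains. The exact calls are not shown in the source, but they were probably along these lines (the dataSets folder name is an assumption based on the read calls used later):
# print the current working directory
getwd()
# list the data files in the dataSets folder
list.files("dataSets")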
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rstatistics"
## [1] "Exam.rds" "NatHealth2008MI" "NatHealth2011.rds"
## [4] "states.dta" "states.rds"
The states.dta data comes from http://anawida.de/teach/SS14/anawida/4.linReg/data/states.dta.txt and appears to come originally from Statistics with Stata by Lawrence C. Hamilton.
# read the states data
states.data <- readRDS("dataSets/states.rds")
#get labels
states.info <- data.frame(attributes(states.data)[c("names", "var.labels")])
#look at last few labels
tail(states.info, 8)
## names var.labels
## 14 csat Mean composite SAT score
## 15 vsat Mean verbal SAT score
## 16 msat Mean math SAT score
## 17 percent % HS graduates taking SAT
## 18 expense Per pupil expenditures prim&sec
## 19 income Median household income, $1,000
## 20 high % adults HS diploma
## 21 college % adults college degree
Start by examining the data to check for problems.
# summary of expense and csat columns, all rows
sts.ex.sat <- subset(states.data, select = c("expense", "csat"))
summary(sts.ex.sat)
## expense csat
## Min. :2960 Min. : 832.0
## 1st Qu.:4352 1st Qu.: 888.0
## Median :5000 Median : 926.0
## Mean :5236 Mean : 944.1
## 3rd Qu.:5794 3rd Qu.: 997.0
## Max. :9259 Max. :1093.0
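The correlation matrix below was presumably computed with something like the following (a sketch; the call is not shown in the source):
# correlation between expense and csat
cor(sts.ex.sat)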
## expense csat
## expense 1.0000000 -0.4662978
## csat -0.4662978 1.0000000
Plot the data to look for multivariate outliers, non-linear relationships, etc.
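A minimal sketch of such a plot, assuming the two-column subset created above:
# scatterplot of expense vs csat
plot(sts.ex.sat)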
Use the lm() function to predict SAT scores based on per-pupil expenditures:
# Fit our regression model
sat.mod <- lm(csat ~ expense, # regression formula
data=states.data) # data set
# Summarize and print the results
summary(sat.mod) # show regression coefficients table
##
## Call:
## lm(formula = csat ~ expense, data = states.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.811 -38.085 5.607 37.852 136.495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.061e+03 3.270e+01 32.44 < 2e-16 ***
## expense -2.228e-02 6.037e-03 -3.69 0.000563 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.81 on 49 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2015
## F-statistic: 13.61 on 1 and 49 DF, p-value: 0.0005631
Many people find it surprising that per-pupil expenditure is negatively related to SAT scores. The beauty of multiple regression is that we can try to pull these relationships apart. What would the association between expense and SAT scores be if there were no difference among the states in the percentage of students taking the SAT?
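The output below comes from a model that adds percent (the percentage of high-school graduates taking the SAT) as a predictor; the formula matches the Call shown in the output, while the object name here is an assumption:
# fit a model controlling for the percentage of students taking the SAT
sat.mod2 <- lm(csat ~ expense + percent, data = states.data)
summary(sat.mod2)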
##
## Call:
## lm(formula = csat ~ expense + percent, data = states.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.921 -24.318 1.741 15.502 75.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 989.807403 18.395770 53.806 < 2e-16 ***
## expense 0.008604 0.004204 2.046 0.0462 *
## percent -2.537700 0.224912 -11.283 4.21e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.62 on 48 degrees of freedom
## Multiple R-squared: 0.7857, Adjusted R-squared: 0.7768
## F-statistic: 88.01 on 2 and 48 DF, p-value: < 2.2e-16
OK, we fit our model. Now what?
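The listings that follow examine the fitted model object. The calls are not shown in the source, but were probably close to this sketch (the [1:9] subset is inferred from the truncated methods listing):
class(sat.mod)                        # model objects have class "lm"
names(sat.mod)                        # components stored in the model object
methods(class = class(sat.mod))[1:9]  # some functions that operate on lm objects
confint(sat.mod)                      # confidence intervals for the coefficients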
## [1] "lm"
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
## [1] "add1.lm" "alias.lm"
## [3] "anova.lm" "case.names.lm"
## [5] "coerce,oldClass,S3-method" "confint.lm"
## [7] "cooks.distance.lm" "deviance.lm"
## [9] "dfbeta.lm"
## 2.5 % 97.5 %
## (Intercept) 995.01753164 1126.44735626
## expense -0.03440768 -0.01014361
par(mar = c(4, 4, 2, 2), mfrow = c(1, 2)) #optional
plot(sat.mod, which = c(1, 2)) # "which" argument optional
Do congressional voting patterns predict SAT scores over and above expense? Fit two models and compare them:
# fit another model, adding house and senate as predictors
sat.voting.mod <- lm(csat ~ expense + house + senate,
data = na.omit(states.data))
sat.mod <- update(sat.mod, data=na.omit(states.data))
# compare using the anova() function
anova(sat.mod, sat.voting.mod)
## Analysis of Variance Table
##
## Model 1: csat ~ expense
## Model 2: csat ~ expense + house + senate
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 169050
## 2 44 149284 2 19766 2.9128 0.06486 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
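The coefficient table below was presumably printed with:
coef(summary(sat.voting.mod))  # show regression coefficients table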
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1082.93438041 38.633812740 28.0307405 1.067795e-29
## expense -0.01870832 0.009691494 -1.9303852 6.001998e-02
## house -1.44243754 0.600478382 -2.4021473 2.058666e-02
## senate 0.49817861 0.513561356 0.9700469 3.373256e-01
Use the states.rds data set. Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to examine the model summary and to plot the model to look for deviations from modeling assumptions. Then select one or more additional predictors to add to your model and repeat these steps. Is this model significantly better than the model with metro as the only predictor?
Interactions allow us to assess the extent to which the association between one predictor and the outcome depends on a second predictor. For example: does the association between expense and SAT scores depend on the median income in the state?
#Add the interaction to the model
sat.expense.by.percent <- lm(csat ~ expense*income,
data=states.data)
#Show the results
coef(summary(sat.expense.by.percent)) # show regression coefficients table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.380364e+03 1.720863e+02 8.021351 2.367069e-10
## expense -6.384067e-02 3.270087e-02 -1.952262 5.687837e-02
## income -1.049785e+01 4.991463e+00 -2.103161 4.083253e-02
## expense:income 1.384647e-03 8.635529e-04 1.603431 1.155395e-01
Let’s try to predict SAT scores from region, a categorical variable. Note that you must make sure R does not think your categorical variable is numeric.
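The structure shown below was probably inspected with a call like:
# check how R stores the region variable
str(states.data$region)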
## Factor w/ 4 levels "West","N. East",..: 3 1 1 3 1 1 2 3 NA 3 ...
states.data$region <- factor(states.data$region)
#Add region to the model
sat.region <- lm(csat ~ region,
data=states.data)
#Show the results
coef(summary(sat.region)) # show regression coefficients table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 946.30769 14.79582 63.9577807 1.352577e-46
## regionN. East -56.75214 23.13285 -2.4533141 1.800383e-02
## regionSouth -16.30769 19.91948 -0.8186806 4.171898e-01
## regionMidwest 63.77564 21.35592 2.9863209 4.514152e-03
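The ANOVA table that follows was presumably produced with:
# show the ANOVA table for the region model
anova(sat.region)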
## Analysis of Variance Table
##
## Response: csat
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 82049 27349.8 9.6102 4.859e-05 ***
## Residuals 46 130912 2845.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, make sure to tell R which variables are categorical by converting them to factors!
In the previous example we used the default contrasts for region. The default in R is treatment contrasts, with the first level as the reference. We can change the reference group or use another coding scheme using the C() function.
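For example, the default treatment contrasts for region look like this (a sketch; the call is not shown in the source):
# print the contrast matrix for region
contrasts(states.data$region)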
## N. East South Midwest
## West 0 0 0
## N. East 1 0 0
## South 0 1 0
## Midwest 0 0 1
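The coefficients below come from refitting the model with the fourth level (Midwest) as the reference group; a sketch of the call implied by the coefficient labels:
# change the reference group to the 4th region
coef(summary(lm(csat ~ C(region, base = 4), data = states.data)))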
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1010.08333 15.39998 65.589930 4.296307e-47
## C(region, base = 4)1 -63.77564 21.35592 -2.986321 4.514152e-03
## C(region, base = 4)2 -120.52778 23.52385 -5.123641 5.798399e-06
## C(region, base = 4)3 -80.08333 20.37225 -3.931000 2.826007e-04
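And with Helmert contrasts, which compare each level with the mean of the preceding levels; again a sketch implied by the coefficient labels:
# use Helmert contrasts for region
coef(summary(lm(csat ~ C(region, contr.helmert), data = states.data)))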
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 943.986645 7.706155 122.4977451 1.689670e-59
## C(region, contr.helmert)1 -28.376068 11.566423 -2.4533141 1.800383e-02
## C(region, contr.helmert)2 4.022792 5.884552 0.6836191 4.976450e-01
## C(region, contr.helmert)3 22.032229 4.446777 4.9546509 1.023364e-05
See also ?contrasts, ?contr.treatment, and ?relevel.
Use the states data set.
Add on to the regression equation that you created in exercise 1 by generating an interaction term and testing the interaction.
Try adding region to the model. Are there significant differences across the four regions?
Thus far we have used the lm() function to fit our regression models. lm() is great, but limited; in particular, it only fits models for continuous dependent variables. For categorical dependent variables we can use the glm() function.
For these models we will use a different dataset, drawn from the National Health Interview Survey. From the CDC website:
The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. For over 50 years, the U.S. Census Bureau has been the data collection agent for the National Health Interview Survey. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.
Load the National Health Interview Survey data:
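A sketch of the loading step, assuming the NatHealth2011.rds file listed earlier in the dataSets folder and the NH11 object name used below:
# load the NHIS data
NH11 <- readRDS("dataSets/NatHealth2011.rds")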
Let’s predict the probability of being diagnosed with hypertension based on age, sex, sleep, and bmi.
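The structure and levels shown below were probably checked with:
# examine the hypertension variable
str(NH11$hypev)
levels(NH11$hypev)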
## Factor w/ 5 levels "1 Yes","2 No",..: 2 2 1 2 2 1 2 2 1 2 ...
## [1] "1 Yes" "2 No" "7 Refused"
## [4] "8 Not ascertained" "9 Don't know"
# collapse all missing values to NA
NH11$hypev <- factor(NH11$hypev, levels=c("2 No", "1 Yes"))
# run our regression model
hyp.out <- glm(hypev~age_p+sex+sleep+bmi,
data=NH11, family="binomial")
coef(summary(hyp.out))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.269466028 0.0564947294 -75.572820 0.000000e+00
## age_p 0.060699303 0.0008227207 73.778743 0.000000e+00
## sex2 Female -0.144025092 0.0267976605 -5.374540 7.677854e-08
## sleep -0.007035776 0.0016397197 -4.290841 1.779981e-05
## bmi 0.018571704 0.0009510828 19.526906 6.485172e-85
Generalized linear models use link functions, so raw coefficients are difficult to interpret. For example, the age coefficient of 0.06 in the previous model tells us that for every one unit increase in age, the log odds of a hypertension diagnosis increase by 0.06. Since most of us are not used to thinking in log odds, this is not too helpful!
One solution is to transform the coefficients to make them easier to interpret:
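The table below shows the same coefficients with the estimates exponentiated to the odds-ratio scale; a sketch of how it was likely produced (the hyp.out.tab object name is an assumption):
hyp.out.tab <- coef(summary(hyp.out))
# replace the log-odds estimates with odds ratios
hyp.out.tab[, "Estimate"] <- exp(coef(hyp.out))
hyp.out.tab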
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01398925 0.0564947294 -75.572820 0.000000e+00
## age_p 1.06257935 0.0008227207 73.778743 0.000000e+00
## sex2 Female 0.86586602 0.0267976605 -5.374540 7.677854e-08
## sleep 0.99298892 0.0016397197 -4.290841 1.779981e-05
## bmi 1.01874523 0.0009510828 19.526906 6.485172e-85
In addition to transforming the log-odds produced by glm() to odds, we can use the predict() function to make direct statements about the predictors in our model. For example, we can ask “How much more likely is a 63 year old female to have hypertension compared to a 33 year old female?”
# Create a dataset with predictors set at desired levels
predDat <- with(NH11,
expand.grid(age_p = c(33, 63),
sex = "2 Female",
bmi = mean(bmi, na.rm = TRUE),
sleep = mean(sleep, na.rm = TRUE)))
# predict hypertension at those levels
cbind(predDat, predict(hyp.out, type = "response",
se.fit = TRUE, interval="confidence",
newdata = predDat))
## age_p sex bmi sleep fit se.fit residual.scale
## 1 33 2 Female 29.89565 7.86221 0.1289227 0.002849622 1
## 2 63 2 Female 29.89565 7.86221 0.4776303 0.004816059 1
This tells us that a 33 year old female has a 13% probability of having been diagnosed with hypertension, while a 63 year old female has a 48% probability of having been diagnosed.
Instead of doing all this ourselves, we can use the effects package to compute quantities of interest for us (cf. the Zelig package).
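A sketch using the effects package; the allEffects() call is an assumption, but the loading messages below suggest the package was attached at this point:
library(effects)
# plot the estimated effect of each predictor
plot(allEffects(hyp.out))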
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
Use the NH11 data set that we loaded earlier.
Note that the data is not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.
We will use the lme4 package, which provides the lmer() function for linear mixed models and glmer() for generalized linear mixed models.
## Loading required package: Matrix
The Exam data set contains exam scores of 4,059 students from 65 schools in Inner London. The variable names are as follows:
Variable | Description
---|---
school | School ID - a factor.
normexam | Normalized exam score.
schgend | School gender - a factor. Levels are ‘mixed’, ‘boys’, and ‘girls’.
schavg | School average of intake score.
vr | Student level Verbal Reasoning (VR) score band at intake - ‘bottom 25%’, ‘mid 50%’, and ‘top 25%’.
intake | Band of student’s intake score - a factor. Levels are ‘bottom 25%’, ‘mid 50%’, and ‘top 25%’.
standLRT | Standardised LR test score.
sex | Sex of the student - levels are ‘F’ and ‘M’.
type | School type - levels are ‘Mxd’ and ‘Sngl’.
student | Student ID (within school) - a factor.
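A sketch of loading these data, assuming the Exam.rds file from the dataSets folder listed earlier and the Exam object name used in the models below:
# load the school exam data
Exam <- readRDS("dataSets/Exam.rds")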
As a preliminary step it is often useful to partition the variance in the dependent variable into the various levels. This can be accomplished by running a null model (i.e., a model with a random effects grouping structure, but no fixed-effects predictors).
# null model, grouping by school but no fixed effects
Norm1 <- lmer(normexam ~ 1 + (1|school),
data=na.omit(Exam), REML = FALSE)
summary(Norm1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: normexam ~ 1 + (1 | school)
## Data: na.omit(Exam)
##
## AIC BIC logLik deviance df.resid
## 9964.9 9983.5 -4979.4 9958.9 3659
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8751 -0.6460 0.0045 0.6898 3.6842
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 0.1606 0.4007
## Residual 0.8520 0.9231
## Number of obs: 3662, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.01301 0.05270 -0.247
The intraclass correlation (ICC) is 0.161/(0.161 + 0.852) = 0.16: about 16% of the variance in exam scores is at the school level.
Predict exam scores from students’ standardized test scores.
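The model summarized below adds standLRT as a fixed-effect predictor; a sketch of the call, with the Norm2 object name taken from the model comparison further down:
Norm2 <- lmer(normexam ~ standLRT + (1|school),
              data = na.omit(Exam), REML = FALSE)
summary(Norm2)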
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: normexam ~ standLRT + (1 | school)
## Data: na.omit(Exam)
##
## AIC BIC logLik deviance df.resid
## 8480.1 8505.0 -4236.1 8472.1 3658
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.6722 -0.6300 0.0236 0.6770 3.3347
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 0.09197 0.3033
## Residual 0.56909 0.7544
## Number of obs: 3662, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.913e-05 4.022e-02 0.00
## standLRT 5.670e-01 1.324e-02 42.84
##
## Correlation of Fixed Effects:
## (Intr)
## standLRT 0.007
As with lm and glm models, you can compare the two lmer models using the anova() function.
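The comparison below was presumably produced with:
# compare the null model to the model with standLRT
anova(Norm1, Norm2)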
## Data: na.omit(Exam)
## Models:
## Norm1: normexam ~ 1 + (1 | school)
## Norm2: normexam ~ standLRT + (1 | school)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## Norm1 3 9964.9 9983.5 -4979.4 9958.9
## Norm2 4 8480.1 8505.0 -4236.1 8472.1 1486.8 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Add a random effect of students’ standardized test scores as well. Now in addition to estimating the distribution of intercepts across schools, we also estimate the distribution of the slope of exam on standardized test.
Norm3 <- lmer(normexam~standLRT + (standLRT|school),
data = na.omit(Exam),
REML = FALSE)
summary(Norm3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: normexam ~ standLRT + (standLRT | school)
## Data: na.omit(Exam)
##
## AIC BIC logLik deviance df.resid
## 8444.1 8481.4 -4216.1 8432.1 3656
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.7894 -0.6233 0.0242 0.6736 3.4372
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## school (Intercept) 0.08926 0.2988
## standLRT 0.01592 0.1262 0.50
## Residual 0.55591 0.7456
## Number of obs: 3662, groups: school, 65
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.01237 0.03978 -0.311
## standLRT 0.56111 0.02100 26.714
##
## Correlation of Fixed Effects:
## (Intr)
## standLRT 0.361
To test the significance of a random slope, just compare models with and without the random slope term:
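A sketch of that comparison:
# compare models with and without the random slope
anova(Norm2, Norm3)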
## Data: na.omit(Exam)
## Models:
## Norm2: normexam ~ standLRT + (1 | school)
## Norm3: normexam ~ standLRT + (standLRT | school)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## Norm2 4 8480.1 8505.0 -4236.1 8472.1
## Norm3 6 8444.1 8481.4 -4216.1 8432.1 40.01 2 2.051e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use the dataset, bh1996: data(bh1996, package="multilevel")
From the data documentation:
Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants - a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).
Use the states.rds data set. Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to examine and summarize the data before fitting the model:
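A sketch of the code behind the summaries below; the states data frame name is taken from the model calls later in this section, and states.en.met is a hypothetical object name:
# summarize the metro and energy columns and check their correlation
states.en.met <- subset(states, select = c("metro", "energy"))
summary(states.en.met)
cor(states.en.met, use = "pairwise")  # pairwise handling of missing values is an assumption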
## metro energy
## Min. : 20.40 Min. :200.0
## 1st Qu.: 46.98 1st Qu.:285.0
## Median : 67.55 Median :320.0
## Mean : 64.07 Mean :354.5
## 3rd Qu.: 81.58 3rd Qu.:371.5
## Max. :100.00 Max. :991.0
## NA's :1 NA's :1
## metro energy
## metro 1.0000000 -0.3397445
## energy -0.3397445 1.0000000
Next, print and interpret the model summary:
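A sketch of the model fit, with mod.en.met as an assumed object name (the formula matches the Call in the summary output below):
# fit the simple regression of energy on metro
mod.en.met <- lm(energy ~ metro, data = states)
summary(mod.en.met)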
##
## Call:
## lm(formula = energy ~ metro, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.51 -64.54 -30.87 18.71 583.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 501.0292 61.8136 8.105 1.53e-10 ***
## metro -2.2871 0.9139 -2.503 0.0158 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.2 on 48 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1154, Adjusted R-squared: 0.097
## F-statistic: 6.263 on 1 and 48 DF, p-value: 0.01578
Then plot the model to look for deviations from modeling assumptions. Finally, select one or more additional predictors to add to your model and repeat these steps. Is this model significantly better than the model with metro as the only predictor?
states.en.met.pop.wst <- subset(states, select = c("energy", "metro", "pop", "waste"))
summary(states.en.met.pop.wst)
## energy metro pop waste
## Min. :200.0 Min. : 20.40 Min. : 454000 Min. :0.5400
## 1st Qu.:285.0 1st Qu.: 46.98 1st Qu.: 1299750 1st Qu.:0.8225
## Median :320.0 Median : 67.55 Median : 3390500 Median :0.9600
## Mean :354.5 Mean : 64.07 Mean : 4962040 Mean :0.9888
## 3rd Qu.:371.5 3rd Qu.: 81.58 3rd Qu.: 5898000 3rd Qu.:1.1450
## Max. :991.0 Max. :100.00 Max. :29760000 Max. :1.5100
## NA's :1 NA's :1 NA's :1 NA's :1
## energy metro pop waste
## energy 1.0000000 -0.3397445 -0.1840357 -0.2526499
## metro -0.3397445 1.0000000 0.5653562 0.4877881
## pop -0.1840357 0.5653562 1.0000000 0.5255713
## waste -0.2526499 0.4877881 0.5255713 1.0000000
mod.en.met.pop.waste <- lm(energy ~ metro + pop + waste, data = states)
summary(mod.en.met.pop.waste)
##
## Call:
## lm(formula = energy ~ metro + pop + waste, data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -224.62 -67.48 -31.76 12.65 589.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.617e+02 9.905e+01 5.671 9e-07 ***
## metro -2.079e+00 1.168e+00 -1.780 0.0816 .
## pop 1.649e-06 4.809e-06 0.343 0.7332
## waste -8.307e+01 1.042e+02 -0.797 0.4295
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.2 on 46 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1276, Adjusted R-squared: 0.07067
## F-statistic: 2.242 on 3 and 46 DF, p-value: 0.09599
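The model comparison below was presumably run with anova(), assuming the single-predictor model object from the sketch above (mod.en.met):
# compare the metro-only model to the model with pop and waste added
anova(mod.en.met, mod.en.met.pop.waste)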
## Analysis of Variance Table
##
## Model 1: energy ~ metro
## Model 2: energy ~ metro + pop + waste
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 943103
## 2 46 930153 2 12949 0.3202 0.7276
Use the states data set.
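The ANOVA table below corresponds to a model with a metro-by-waste interaction plus region; a sketch of the call (the object name and exact formula are assumptions inferred from the terms in the table):
mod.en.interaction <- lm(energy ~ metro * waste + region, data = states)
anova(mod.en.interaction)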
## Analysis of Variance Table
##
## Response: energy
## Df Sum Sq Mean Sq F value Pr(>F)
## metro 1 123064 123064 6.4435 0.01484 *
## waste 1 10572 10572 0.5535 0.46093
## region 3 111128 37043 1.9395 0.13749
## metro:waste 1 156 156 0.0082 0.92835
## Residuals 43 821247 19099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use the NH11 data set that we loaded earlier. Note that the data is not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.
nh11.wrk.age.mar <- subset(NH11, select = c("everwrk", "age_p", "r_maritl"))
summary(nh11.wrk.age.mar)
## everwrk age_p
## 1 Yes :12153 Min. :18.00
## 2 No : 1887 1st Qu.:33.00
## 7 Refused : 17 Median :47.00
## 8 Not ascertained: 0 Mean :48.11
## 9 Don't know : 8 3rd Qu.:62.00
## NA's :18949 Max. :85.00
##
## r_maritl
## 1 Married - spouse in household:13943
## 7 Never married : 7763
## 5 Divorced : 4511
## 4 Widowed : 3069
## 8 Living with partner : 2002
## 6 Separated : 1121
## (Other) : 605
NH11 <- transform(NH11,
everwrk = factor(everwrk,
levels = c("1 Yes", "2 No")),
r_maritl = droplevels(r_maritl))
mod.wk.age.mar <- glm(everwrk ~ age_p + r_maritl, data = NH11,
family = "binomial")
summary(mod.wk.age.mar)
##
## Call:
## glm(formula = everwrk ~ age_p + r_maritl, family = "binomial",
## data = NH11)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0436 -0.5650 -0.4391 -0.3370 2.7308
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.440248 0.093538 -4.707
## age_p -0.029812 0.001645 -18.118
## r_maritl2 Married - spouse not in household 0.049675 0.217310 0.229
## r_maritl4 Widowed 0.683618 0.084335 8.106
## r_maritl5 Divorced -0.730115 0.111681 -6.538
## r_maritl6 Separated -0.128091 0.151366 -0.846
## r_maritl7 Never married 0.343611 0.069222 4.964
## r_maritl8 Living with partner -0.443583 0.137770 -3.220
## r_maritl9 Unknown marital status 0.395480 0.492967 0.802
## Pr(>|z|)
## (Intercept) 2.52e-06 ***
## age_p < 2e-16 ***
## r_maritl2 Married - spouse not in household 0.81919
## r_maritl4 Widowed 5.23e-16 ***
## r_maritl5 Divorced 6.25e-11 ***
## r_maritl6 Separated 0.39742
## r_maritl7 Never married 6.91e-07 ***
## r_maritl8 Living with partner 0.00128 **
## r_maritl9 Unknown marital status 0.42241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11082 on 14039 degrees of freedom
## Residual deviance: 10309 on 14031 degrees of freedom
## (18974 observations deleted due to missingness)
## AIC: 10327
##
## Number of Fisher Scoring iterations: 5
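The table of predicted probabilities below (fit, se, lower, and upper for each marital status) looks like output from the effects package; a sketch of how it might have been produced (this call is an assumption, not shown in the source):
library(effects)
# predicted probability of ever having worked at each level of marital status
data.frame(Effect("r_maritl", mod.wk.age.mar))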
## r_maritl fit se lower
## 1 1 Married - spouse in household 0.10822000 0.04413754 0.10014980
## 2 2 Married - spouse not in household 0.11310823 0.21326041 0.07746061
## 3 4 Widowed 0.19381087 0.06806325 0.17381358
## 4 5 Divorced 0.05524394 0.10272953 0.04562877
## 5 6 Separated 0.09646417 0.14579706 0.07426824
## 6 7 Never married 0.14611000 0.05978759 0.13208775
## 7 8 Living with partner 0.07224958 0.13285112 0.05662466
## 8 9 Unknown marital status 0.15270076 0.49100994 0.06440837
## upper
## 1 0.11685606
## 2 0.16227532
## 3 0.21550873
## 4 0.06674358
## 5 0.12440219
## 6 0.16134411
## 7 0.09176661
## 8 0.32055728
Use the dataset, bh1996:
From the data documentation:
Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants - a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).
Note that the group identifier is named “GRP”.