In this session (part of Introduction to Statistical Learning in R; Regression Analysis – Linear, Logistic Regression & Dummy Variables by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License), we turn to regression modelling. The appropriate approach depends on the type of outcome data:
Type of Data | Category | Approach |
---|---|---|
Scale/continuous | – | Multiple linear regression |
Categorical | Nominal: Binary | Logistic regression |
Categorical | Nominal: Multiple | Multiple logistic regression |
Categorical | Ordinal: Multiple | Ordinal logistic regression |
Categorical | Count | Poisson regression |
A Multiple Linear Regression captures the average linear relationship.
# clean workspace
rm(list=ls())
# load data
load("../data/data_census.RData")
Fig.1 Relationship between unemployment and long-term illness
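A figure like this can be reproduced with a scatter plot and a fitted regression line. A minimal ggplot2 sketch, assuming the census data frame contains a column with the % unemployed in each district (the column name Unemployed is an assumption, not taken from the data description):

# scatter plot of % unemployed vs % long-term ill with a fitted OLS line
# (the column name `Unemployed` is an assumed/illustrative name)
library(ggplot2)
ggplot(census, aes(x = Unemployed, y = illness)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()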
This simple regression line can be described using the equation:
\[y = \alpha + \beta_{1} x_{1} + \epsilon\]
\(y\): dependent variable (or outcome variable, predicted variable, explained variable)
\(\alpha\): constant (or intercept)
\(\beta_{1}\): regression coefficient (or slope, coefficient, beta coefficient, estimated coefficient, estimated parameter)
\(x\): independent variable (or explanatory variable, covariate, predictor, control variable)
\(\epsilon\): error term
Why is a multiple regression model known as ‘ordinary least squares’ or OLS regression?
The regression line is computed by minimising the sum of squared residuals; that is, the sum of the squared vertical distances from each point in the scatter plot to the regression line. The line that minimises this sum is the regression line.
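Formally, the OLS estimates of \(\alpha\) and \(\beta_{1}\) are the values that minimise the sum of squared residuals:

\[\min_{\alpha, \beta_{1}} \sum_{i=1}^{n} \left( y_{i} - \alpha - \beta_{1} x_{1i} \right)^{2}\]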
TASK #2 What would the regression line look like if \(\beta\) = 0?
If you add more independent variables, the regression equation becomes:
\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + ... + \beta_{k} x_{k} + \epsilon\]
Interpretation
Intercept (\(\alpha\)): is the estimated average value of \(y\) when all explanatory variables are zero.
Slope (\(\beta\)): is the estimated average change in \(y\) for a one unit change in \(x\), when all other explanatory variables are held constant.
Let’s explore the following question: how do local factors affect residents’ long-term illness in the UK?
We can use our census data and explore how the local population can explain differences in the percentage of ill population across districts in the UK.
We want to estimate the following model:
\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \epsilon\] \[illness = \alpha + \beta_{1} noqualification + \beta_{2} professionals + \beta_{3} elderly + \epsilon\]
# specify a model equation
eq <- illness ~ No_Quals + Professionals + Age_65plus
model <- lm(formula = eq, data = census)
# coefficients
round(coefficients(model),2)
## (Intercept) No_Quals Professionals Age_65plus
## 4.00 0.44 -0.10 0.29
i.e.
\[\hat{y} = 4 + 0.44 x_{1} - 0.10 x_{2} + 0.29 x_{3}\]
Interpretation: A one percentage point increase in the share of people with no qualifications is associated with a 0.44 percentage point rise in the percentage of the local population with a long-term illness, with the percentages of professionals and elderly population held constant.
Individual predictions:
# using base functions
census$p_illness <- predict.lm(model)
Let’s see the observed and predicted % of long-term illness for Liverpool:
census[166, c(1, 6, 24)]
District | illness | p_illness |
---|---|---|
Liverpool | 22.4 | 20.1 |
TASK #3 Obtain the % of long-term illness for Cheltenham.
You can use predictions to assess ‘what-if’ scenarios.
# get typical values
attach(census)
a_x1 <- mean(No_Quals)
a_x2 <- mean(Professionals)
# prediction if the population aged 65+ is 30%
4.00 + 0.44*a_x1 - 0.10*a_x2 + 0.29*30
## [1] 21.53108
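The same 'what-if' prediction can be obtained with predict() and a new data frame, which avoids typing rounded coefficients by hand; a minimal sketch reusing the typical values computed above:

# what-if prediction using predict() rather than hand-typed coefficients
scenario <- data.frame(No_Quals = a_x1, Professionals = a_x2, Age_65plus = 30)
predict(model, newdata = scenario)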
Full model output:
summary(model)
##
## Call:
## lm(formula = eq, data = census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2275 -1.0969 -0.3077 0.9173 5.4294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.00288 1.20566 3.320 0.000996 ***
## No_Quals 0.44312 0.03419 12.960 < 2e-16 ***
## Professionals -0.09663 0.04417 -2.188 0.029375 *
## Age_65plus 0.28828 0.02157 13.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.568 on 344 degrees of freedom
## Multiple R-squared: 0.7917, Adjusted R-squared: 0.7899
## F-statistic: 435.9 on 3 and 344 DF, p-value: < 2.2e-16
The summary output indicates:
Residuals: summary statistics of the distribution of residuals (model error)
Coefficients: model coefficients and their statistical significance
R-squared: indicates how much of the variation in the dependent variable is ‘explained’ by the explanatory variables.
F statistic: indicates whether the fitted model is a statistically significant improvement on the null model.
R-squared and F test: a problem with R-squared is that every time an additional variable is added to the model, the R-squared will increase, even if the variable adds little explanatory power. Measures that penalise model complexity, such as the Akaike Information Criterion (AIC), can therefore be used to compare models:
# compute AIC
AIC(model)
## [1] 1306.781
A limitation: AIC values can only be compared between models fitted to the same data, whereas R-squared can be compared between models fitted to different data.
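For example, the AIC can be used to choose between two models fitted to the same census data; the model with the lower AIC is preferred. A minimal sketch (dropping Professionals is purely illustrative):

# compare AIC across two models fitted to the same data
model_reduced <- lm(illness ~ No_Quals + Age_65plus, data = census)
AIC(model_reduced, model)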
Alternatives:
# compute correlation
cor(p_illness, illness)
## [1] 0.8897922
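For a linear model with an intercept, squaring this correlation between predicted and observed values recovers the R-squared reported by summary() (here approximately 0.79):

# squared correlation between predicted and observed equals the R-squared
cor(p_illness, illness)^2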
# compute MSE (add_predictions() is from the modelr package)
census %>%
add_predictions(model) %>%
summarise(MSE = mean((illness - pred)^2))
MSE |
---|
2.43 |
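Because the MSE is in squared units, its square root (the RMSE) is often easier to interpret, as it is on the same scale as the outcome (here about 1.6 percentage points of long-term illness); a minimal sketch reusing the same pipeline:

# root mean squared error, on the same scale as `illness`
census %>%
  add_predictions(model) %>%
  summarise(RMSE = sqrt(mean((illness - pred)^2)))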
We assess the sign, size and statistical significance of coefficients.
# summary output and CIs in a single table
cbind(summary(model)$coefficients, confint(model) )
## Estimate Std. Error t value Pr(>|t|) 2.5 %
## (Intercept) 4.00287606 1.20565723 3.320078 9.963810e-04 1.6314881
## No_Quals 0.44312392 0.03419171 12.959982 1.486323e-31 0.3758728
## Professionals -0.09663492 0.04417488 -2.187554 2.937473e-02 -0.1835218
## Age_65plus 0.28828276 0.02157319 13.363009 4.278867e-33 0.2458508
## 97.5 %
## (Intercept) 6.37426402
## No_Quals 0.51037505
## Professionals -0.00974805
## Age_65plus 0.33071472
You can also visualise the regression outputs using coefplot():
Fig.2 plotting regression coefficients
coefplot(model, title = " ", xlab = "Coefficient", ylab = " ",
intercept=FALSE, horizontal = FALSE,
color = "black", offset=0.2, sort='magnitude', fillColor = "grey")
TASK #4 Estimate an additional model and compare the estimates with the model above.
So far, we have explored how to treat independent variables measured on a continuous scale. What if we want to understand differences across qualitative attributes, e.g.:
Is the rate of long-term illness higher in particular regions of the UK? Does it vary much across regions?
Is there a gender pay gap?
How does ethnicity influence the industry people work in?
How does marital status influence house ownership?
What age group is more migratory?
Now we focus on how to quantify the relationship between qualitative attributes and an outcome variable.
What Are Dummy Variables (DVs)?
are categorical variables that take the value of 0 or 1
serve to quantify the relationship between an outcome variable and a qualitative attribute
We can create DVs from categorical data or by classifying continuous variables into categories (e.g. income quantiles).
id | gender | female |
---|---|---|
1 | female | 1 |
2 | female | 1 |
3 | male | 0 |
4 | female | 1 |
id | qualification | no qualification | degree |
---|---|---|---|
1 | degree | 0 | 1 |
2 | no qual | 1 | 0 |
3 | no qual | 1 | 0 |
4 | 2+ A levels | 0 | 0 |
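In R, dummies like those in the tables above can be created with ifelse(), either from a categorical variable or by classifying a continuous one; a minimal sketch with a small hypothetical data frame (the data frame and column names are illustrative, not part of the census data):

# a hypothetical example data frame
df <- data.frame(gender = c("female", "female", "male", "female"),
                 income = c(18000, 25000, 32000, 41000))
# dummy from a categorical variable: 1 if female, 0 otherwise
df$female <- ifelse(df$gender == "female", 1, 0)
# dummy from a continuous variable: 1 if income is above the median
df$high_income <- ifelse(df$income > median(df$income), 1, 0)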
Let’s explore how the % of long-term illness varies across regions in the UK.
We have 10 regions in the dataset, so let’s create our dummy variables.
# create dummy variables (dummy.code() is from the psych package)
region_dv <- dummy.code(census$Region)
census <- cbind(census, region_dv)
census[1:10,c(2,24:34)]
Region | p_illness | South East | East of England | East Midlands | North West | South West | London | West Midlands | Wales | Yorkshire and The Humber | North East |
---|---|---|---|---|---|---|---|---|---|---|---|
South East | 20.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
North West | 21.2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
East Midlands | 20.4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 21.7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
East Midlands | 22.2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 17.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 14.7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
East of England | 19.2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
London | 18.9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
London | 13.3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
NOTE: To fit regression models, we need a base category, i.e. we need to exclude one of the categories. If our variable has 5 categories, we only include 4 dummy variables. The category that is excluded serves as our base category.
\[illness = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \beta_{5} D_{2} + \dots + \beta_{12} D_{9} + \epsilon\]
attach(census)
# specify a model equation
eq2 <- illness ~ No_Quals + Professionals + Age_65plus + as.matrix(census[,c(25:29, 31:34)])
model2 <- lm(formula = eq2, data = census)
export_summs(model2)
Model 1 | |
---|---|
(Intercept) | 3.09 ** |
(0.95) | |
No_Quals | 0.45 *** |
(0.03) | |
Professionals | -0.01 |
(0.04) | |
Age_65plus | 0.30 *** |
(0.02) | |
as.matrix(census[, c(25:29, 31:34)])South East | -1.03 *** |
(0.28) | |
as.matrix(census[, c(25:29, 31:34)])East of England | -1.73 *** |
(0.30) | |
as.matrix(census[, c(25:29, 31:34)])East Midlands | -0.71 * |
(0.31) | |
as.matrix(census[, c(25:29, 31:34)])North West | 0.98 ** |
(0.32) | |
as.matrix(census[, c(25:29, 31:34)])South West | -0.00 |
(0.35) | |
as.matrix(census[, c(25:29, 31:34)])West Midlands | -1.00 ** |
(0.34) | |
as.matrix(census[, c(25:29, 31:34)])Wales | 2.46 *** |
(0.37) | |
as.matrix(census[, c(25:29, 31:34)])Yorkshire and The Humber | -0.87 * |
(0.36) | |
as.matrix(census[, c(25:29, 31:34)])North East | 1.25 ** |
(0.43) | |
N | 348 |
R2 | 0.89 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Fig.3 regression coefficients
# plot coefficients
coefplot(model2, title = " ", xlab = "Coefficient", ylab = " ",
intercept=FALSE, horizontal = FALSE,
color = "black", offset=0.2, sort='magnitude', fillColor = "grey")
If the independent variable you want to include as a DV is already a factor, you can simply pass it to lm():
eq3 <- illness ~ No_Quals + Professionals + Age_65plus + Region
model3 <- lm(formula = eq3, data = census)
Or you can create dummy variables in the lm() function using factor(). Note: this functionality may not be available with all packages and modelling frameworks.
# dummy variables using `factor()`
eq4 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model4 <- lm(formula = eq4, data = census)
TASK #5 Compare the results from these approaches by running:
# comparing models
export_summs(model2, model3, model4)
We can use the relevel() function to directly specify the base category.
Note:
It achieves this by changing the order of the levels in a factor variable
This causes a permanent change in the order of the factor levels, so remember to reverse the change or create a second version of your data frame.
relevel() only works for unordered factors.
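A simple safeguard is to store the original level order before re-levelling so the change can be undone later; a minimal sketch:

# keep a copy of the original level order so the change can be reversed
original_levels <- levels(census$Region)
# to restore the original ordering later:
# census$Region <- factor(census$Region, levels = original_levels)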
# check the order of levels
levels(census$Region)
## [1] "East Midlands" "East of England"
## [3] "London" "North East"
## [5] "North West" "South East"
## [7] "South West" "Wales"
## [9] "West Midlands" "Yorkshire and The Humber"
# change reference category
census$Region <- relevel(census$Region, ref="West Midlands")
# check levels again
levels(census$Region)
## [1] "West Midlands" "East Midlands"
## [3] "East of England" "London"
## [5] "North East" "North West"
## [7] "South East" "South West"
## [9] "Wales" "Yorkshire and The Humber"
# estimate a new regression
eq5 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model5 <- lm(formula = eq5, data = census)
export_summs(model5)
Model 1 | |
---|---|
(Intercept) | 2.09 * |
(1.03) | |
No_Quals | 0.45 *** |
(0.03) | |
Professionals | -0.01 |
(0.04) | |
Age_65plus | 0.30 *** |
(0.02) | |
factor(Region)East Midlands | 0.29 |
(0.28) | |
factor(Region)East of England | -0.73 ** |
(0.27) | |
factor(Region)London | 1.00 ** |
(0.34) | |
factor(Region)North East | 2.25 *** |
(0.40) | |
factor(Region)North West | 1.97 *** |
(0.28) | |
factor(Region)South East | -0.04 |
(0.27) | |
factor(Region)South West | 0.99 ** |
(0.32) | |
factor(Region)Wales | 3.45 *** |
(0.33) | |
factor(Region)Yorkshire and The Humber | 0.13 |
(0.33) | |
N | 348 |
R2 | 0.89 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
We have learned how to estimate the associations for models with independent continuous and categorical variables when the outcome variable is continuous.
\[y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \epsilon\]
What about when the outcome variable is categorical?
Examples:
Health outcomes/behaviours: e.g. smoking, drinking, cancer, heart attack, HIV, etc.
Employment outcomes: unemployed, employed full-time, employed part-time, self-employed, job satisfaction, etc.
Decision making processes: University, Brexit vote, travelling, migration, long-distance commuting, etc.
To motivate this session, we will seek to answer the following question: who is more likely to use sustainable transport?
Read QLFS data
# clean workspace
rm(list=ls())
# load data
load("../data/data_qlfs.RData")
Create our sustainable transport variable
qlfs <- mutate(qlfs,
sus_tr = ifelse(
TravelMode == "Motorbike,moped,scooter"
| TravelMode == "Bicycle"
| TravelMode =="Walk"
, 1, 0))
Probabilities, odds and log-odds all use a different denominator, so they have different meanings.
An example:
attach(qlfs)
t1 <- table(sus_tr, Sex)
Probability of using sustainable transport if female:
1,148 / (26,189 + 1,148) = 0.042
Odds of using sustainable transport if female:
1,148 / 26,189 = 0.044
Log-odds of using sustainable transport if female
\(log({1,148}/{26,189})\) = \(log(0.044)\) = -3.12
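These quantities can also be computed directly from the cross-tabulation t1; a minimal sketch, assuming the rows of t1 are labelled "0"/"1" (non-user/user of sustainable transport) and the relevant column of Sex is labelled "Female":

# probability, odds and log-odds of using sustainable transport for females
p_female    <- t1["1", "Female"] / sum(t1[, "Female"])   # probability
odds_female <- t1["1", "Female"] / t1["0", "Female"]     # odds
log(odds_female)                                         # log-odds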
Note:
Logistic regression is one of a family of regression models known as generalised linear models (GLMs).
GLMs are characterised by three properties:
The assumed distribution of the model errors
The transformation (link function) applied to the outcome variable
The way in which model error (deviance) is measured (log likelihood)
Logistic regression models:
Predict the log-odds of an event happening based on at least one independent variable vs. linear regression - estimates the average value of the outcome
Dependent variable: qualitative (dummy or binary) variable vs. linear regression - continuous variable
Requires estimates of the dependent variable to lie between 0 and 1 (i.e. valid probabilities) vs. linear regression - unbounded continuous values
Assumes a logistic (binomial) distribution vs. linear regression - normal distribution
Various names: binary regression model, discrete choice model, probability regression model, and qualitative response regression model
In mathematical terms, for a model with a single independent variable:
\[log \left( \frac{p}{1-p} \right) = \alpha + \beta_{1} x_{1}\] which can be rearranged as:
\[ p = \frac{exp(\alpha + \beta_{1} x_{1}) }{ 1 + exp(\alpha + \beta_{1} x_{1})} \]
\(p\): probability of an event happening
\(\alpha\): regression intercept
\(\beta_{1}\): regression coefficient associated with \(x_{1}\)
Intercept is the log-odds of an event happening (i.e. \(y=1\)) if the values of all explanatory variables \(x\)s are zero.
Slope is the estimated change in the log-odds for one unit change in \(x_{k}\), holding all other variables constant.
The exponentiated coefficients (odds ratios) give the expected change in the odds for a one-unit change in \(x_{k}\), holding all other variables constant.
\(exp(\beta_{k})\) = 1, if \(\beta_{k} = 0\): indicates the event is equally likely to occur (the odds are unchanged)
\(exp(\beta_{k})\) > 1, if \(\beta_{k} > 0\): indicates the event is more likely to occur (the odds are multiplied by \(exp(\beta_{k})\), so they increase)
\(exp(\beta_{k})\) < 1, if \(\beta_{k} < 0\): indicates the event is less likely to occur (the odds are multiplied by \(exp(\beta_{k})\), so they decrease)
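As a quick numerical illustration (the coefficients here are made up, not taken from any model in this session):

# exponentiating illustrative coefficients to obtain odds ratios
exp(0.69)   # ~2.0: the odds roughly double
exp(-0.69)  # ~0.5: the odds roughly halve
exp(0)      #  1.0: the odds are unchanged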
In R, we use the basic syntax for fitting a generalised linear model:
# general function
glm(outcome ~ predictor(s), data = dataframe,
    family = name of assumed error distribution(link = name of a link function),
    na.action = na.exclude OR na.fail)
Hence, to fit a logistic regression model, the code required is:
# estimate a logistic regression
glm(outcome ~ predictor(s), data= data.frame,
family=binomial(link="logit"),
na.action=na.exclude)
# specify a model
eq1 <- sus_tr ~ Sex + AgeGroup + NetPay
# estimate model
model1 <- glm(eq1, data= qlfs,
family=binomial(link="logit"),
na.action=na.exclude)
# coefficients (log-odds)
round(coefficients(model1),2)
## (Intercept) SexFemale AgeGroup30-44 AgeGroup45-64 AgeGroup65+
## -3.51 0.09 0.43 0.17 -2.25
## NetPay
## 0.00
# odds ratio
round(exp(coef(model1)),2)
## odds ratios and 95% CI
round(exp(cbind("Odds-ratio" = coef(model1), confint(model1))),2)
## Waiting for profiling to be done...
## Odds-ratio 2.5 % 97.5 %
## (Intercept) 0.03 0.03 0.03
## SexFemale 1.09 0.98 1.21
## AgeGroup30-44 1.53 1.33 1.76
## AgeGroup45-64 1.19 1.04 1.36
## AgeGroup65+ 0.11 0.08 0.14
## NetPay 1.00 1.00 1.00
What is the base category for Sex? And for AgeGroup? Hint: the excluded category.
Interpretation:
For SexFemale, an \(exp(\beta_{k})\) of \(1.09\) indicates that being female (versus male) changes the odds of using sustainable transport by a factor of \(1.09\), holding all other variables constant; i.e. the odds of using sustainable transport for females are 1.09 times those for males.
For NetPay, an \(exp(\beta_{k})\) of \(1\) indicates that for every one-pound change in net pay, the odds of using sustainable transport (versus not using it) change by a factor of \(1\), holding all other variables constant; i.e. at this level of rounding, the odds are essentially unchanged per additional pound.
Note: R produces the odds ratio for the intercept but it is rarely interpreted.
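Because the NetPay odds ratio refers to a one-pound change, rescaling to a larger increment is often more informative; a minimal sketch using the unrounded coefficient (reported in the full summary below):

# odds ratio for a £100 (rather than £1) increase in net pay
exp(coef(model1)["NetPay"] * 100)   # roughly 1.2, i.e. noticeably higher odds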
Full model output:
summary(model1)
##
## Call:
## glm(formula = eq1, family = binomial(link = "logit"), data = qlfs,
## na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4193 -0.2991 -0.2515 -0.0821 3.3800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.511e+00 6.306e-02 -55.680 < 2e-16 ***
## SexFemale 8.791e-02 5.319e-02 1.653 0.0984 .
## AgeGroup30-44 4.269e-01 7.175e-02 5.950 2.68e-09 ***
## AgeGroup45-64 1.719e-01 6.908e-02 2.488 0.0128 *
## AgeGroup65+ -2.253e+00 1.607e-01 -14.026 < 2e-16 ***
## NetPay 2.062e-03 8.697e-05 23.711 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14406 on 45220 degrees of freedom
## Residual deviance: 12842 on 45215 degrees of freedom
## (39471 observations deleted due to missingness)
## AIC: 12854
##
## Number of Fisher Scoring iterations: 8
The summary output indicates:
Residuals: summary statistics of the distribution of residuals (model error)
Coefficients: model coefficients and their statistical significance. Note: the logistic regression coefficients give the change in the log-odds of the outcome for a one-unit increase in the predictor variable, e.g. being female, versus male, changes the log-odds of using sustainable transport by 0.09.
Deviance: deviances for the null and full model. The smaller the residual deviance relative to the null deviance, the better the model.
AIC: Akaike’s Information Criterion - the lower the better the model.
A formal chi-squared test compares the null and residual model deviances:
# computing the p-value for the test
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0
TASK #6: What is the null hypothesis of this test?
Interpretation of the test: the chi-square statistic of 1563.814 with 5 degrees of freedom and an associated p-value of essentially zero (well below 0.001) tells us that our model fits the data significantly better than an empty model.
To compare across different model specifications use the AIC.
Create a new data set for Female with varying values for AgeGroup and NetPay:
df_p <- with(qlfs, data.frame(Sex = factor("Female", levels = c("Male", "Female"))
, AgeGroup = factor(c("16-29", "30-44", "45-64", "65+"))
, NetPay = rep(seq(from = 200, to = 800, length.out = 100), 4)
)
)
Computing probabilities and their CIs:
# add se
df_pci <- cbind(df_p, predict(model1
, newdata = df_p,
type = "link",
se = TRUE)
)
# add CIs
df_pci <- within(df_pci, {
p_prob <- plogis(fit)
ci95_lb <- plogis(fit - (1.96 * se.fit))
ci95_ub <- plogis(fit + (1.96 * se.fit))
})
#View
head(df_pci)
Sex | AgeGroup | NetPay | fit | se.fit | residual.scale | ci95_ub | ci95_lb | p_prob |
---|---|---|---|---|---|---|---|---|
Female | 16-29 | 200 | -3.01 | 0.0614 | 1 | 0.0526 | 0.0418 | 0.0469 |
Female | 30-44 | 206 | -2.57 | 0.0484 | 1 | 0.0775 | 0.065 | 0.071 |
Female | 45-64 | 212 | -2.81 | 0.0466 | 1 | 0.0616 | 0.0519 | 0.0566 |
Female | 65+ | 218 | -5.23 | 0.155 | 1 | 0.00722 | 0.00395 | 0.00534 |
Female | 16-29 | 224 | -2.96 | 0.062 | 1 | 0.0552 | 0.0438 | 0.0492 |
Female | 30-44 | 230 | -2.52 | 0.0484 | 1 | 0.0812 | 0.0681 | 0.0744 |
Fig.4 Predicted probabilities
ggplot(df_pci, aes(x = NetPay, y = p_prob)) +
geom_ribbon(aes(ymin = ci95_lb, ymax = ci95_ub, fill = AgeGroup), alpha = 0.2) +
geom_line(aes(colour = AgeGroup), size = 1) +
theme_classic()
Function | Description |
---|---|
lm() | estimate linear regression |
glm() | estimate logistic regression |
predict.lm() | create prediction based on linear regression estimates |
predict() | predict probabilities based on logistic regression estimates |
mutate() | create new variables |
dummy.code() | create dummy variables |
relevel() | change reference category |
cbind() | combine data objects into a single data frame |
coefplot() | plot coefficients |
export_summs() | create and export regression tables |