In this session (part of Introduction to Statistical Learning in R; Regression Analysis – Linear, Logistic Regression & Dummy Variables by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License), we turn to regression modelling. The appropriate approach depends on the type of outcome data:
Type of Data | Category | Approach |
---|---|---|
Scale/continuous | – | Multiple linear regression |
Categorical | Nominal: Binary | Logistic regression |
Categorical | Nominal: Multiple | Multiple logistic regression |
Categorical | Ordinal: Multiple | Ordinal logistic regression |
Categorical | Count | Poisson regression |
A Multiple Linear Regression captures the average linear relationship.
# clean workspace
rm(list=ls())
# load data
load("../data/data_census.RData")
Fig.1 Relationship between unemployment and long-term illness
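A figure like this can be reproduced with a scatter plot and a fitted regression line. A minimal ggplot2 sketch, assuming the census data frame contains a column with the % unemployed in each district (the column name Unemployed is an assumption, not taken from the data description):

# scatter plot of % unemployed vs % long-term ill with a fitted OLS line
# (the column name `Unemployed` is an assumed/illustrative name)
library(ggplot2)
ggplot(census, aes(x = Unemployed, y = illness)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()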
This simple regression line can be described using the equation:
\[y = \alpha + \beta_{1} x_{1} + \epsilon\]
\(y\): dependent variable (or outcome variable, predicted variable, explained variable)
\(\alpha\): constant (or intercept)
\(\beta_{1}\): regression coefficient (or slope, coefficient, beta coefficient, estimated coefficient, estimated parameter)
\(x\): independent variable (or explanatory variable, covariate, predictor, control variable)
\(\epsilon\): error term
Why is a multiple regression model known as ‘ordinary least squares’ or OLS regression?
The regression line is computed by minimising the sum of squared residuals; that is, the sum of the squared vertical distances from each point in the scatter plot to the regression line. The line that minimises this sum is the regression line.
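Formally, the OLS estimates of \(\alpha\) and \(\beta_{1}\) are the values that minimise the sum of squared residuals:

\[\min_{\alpha, \beta_{1}} \sum_{i=1}^{n} \left( y_{i} - \alpha - \beta_{1} x_{1i} \right)^{2}\]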
TASK #2 What would the regression line look like if \(\beta\) = 0?
If you add more independent variables, the regression equation becomes:
\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + ... + \beta_{k} x_{k} + \epsilon\]
Interpretation
Intercept (\(\alpha\)): is the estimated average value of \(y\) when all explanatory variables are zero.
Slope (\(\beta\)): is the estimated average change in \(y\) for a one unit change in \(x\), when all other explanatory variables are held constant.
Let’s explore the following question: how do local factors affect residents’ long-term illness in the UK?
We can use our census data and explore how the local population can explain differences in the percentage of ill population across districts in the UK.
We want to estimate the following model:
\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \epsilon\] \[illness = \alpha + \beta_{1} noqualification + \beta_{2} professionals + \beta_{3} elderly + \epsilon\]
# specify a model equation
eq <- illness ~ No_Quals + Professionals + Age_65plus
model <- lm(formula = eq, data = census)
# coefficients
round(coefficients(model),2)
## (Intercept) No_Quals Professionals Age_65plus
## 4.00 0.44 -0.10 0.29
i.e.
\[\hat{y} = 4 + 0.44 x_{1} - 0.10 x_{2} + 0.29 x_{3}\]
Interpretation: A one percentage point increase in the share of people with no qualifications is associated with a 0.44 percentage point rise in the percentage of the local population with a long-term illness, with the percentages of professionals and elderly population held constant.
Individual predictions:
# using base functions
census$p_illness <- predict.lm(model)
Let’s see the observed and predicted % of long-term illness for Liverpool:
census[166, c(1, 6, 24)]
District | illness | p_illness |
---|---|---|
Liverpool | 22.4 | 20.1 |
TASK #3 Obtain the % of long-term illness for Cheltenham.
You can use predictions to assess ‘what-if’ scenarios.
# get typical values
attach(census)
a_x1 <- mean(No_Quals)
a_x2 <- mean(Professionals)
# prediction if the population aged 65+ is 30%
4.00 + 0.44*a_x1 - 0.10*a_x2 + 0.29*30
## [1] 21.53108
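The same 'what-if' prediction can be obtained with predict() and a new data frame, which avoids typing rounded coefficients by hand; a minimal sketch reusing the typical values computed above:

# what-if prediction using predict() rather than hand-typed coefficients
scenario <- data.frame(No_Quals = a_x1, Professionals = a_x2, Age_65plus = 30)
predict(model, newdata = scenario)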
Full model output:
summary(model)
##
## Call:
## lm(formula = eq, data = census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2275 -1.0969 -0.3077 0.9173 5.4294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.00288 1.20566 3.320 0.000996 ***
## No_Quals 0.44312 0.03419 12.960 < 2e-16 ***
## Professionals -0.09663 0.04417 -2.188 0.029375 *
## Age_65plus 0.28828 0.02157 13.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.568 on 344 degrees of freedom
## Multiple R-squared: 0.7917, Adjusted R-squared: 0.7899
## F-statistic: 435.9 on 3 and 344 DF, p-value: < 2.2e-16
The summary output indicates:
Residuals: summary statistics of the distribution of residuals (model error)
Coefficients: model coefficients and their statistical significance
R-squared: indicates how much of the variation in the dependent variable is ‘explained’ by the explanatory variables.
F statistic: indicates whether the fitted model is a statistically significant improvement on the null model.
R-squared and F test: a problem with R-squared is that every time an additional variable is added to the model, the R-squared will increase, even if the variable adds little explanatory power. Measures that penalise model complexity, such as the Akaike Information Criterion (AIC), can therefore be used to compare models:
# compute AIC
AIC(model)
## [1] 1306.781
A limitation: AIC values can only be compared between models fitted to the same data, whereas R-squared can be compared between models fitted to different data.
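For example, the AIC can be used to choose between two models fitted to the same census data; the model with the lower AIC is preferred. A minimal sketch (dropping Professionals is purely illustrative):

# compare AIC across two models fitted to the same data
model_reduced <- lm(illness ~ No_Quals + Age_65plus, data = census)
AIC(model_reduced, model)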
Alternatives:
# compute correlation
cor(p_illness, illness)
## [1] 0.8897922
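For a linear model with an intercept, squaring this correlation between predicted and observed values recovers the R-squared reported by summary() (here approximately 0.79):

# squared correlation between predicted and observed equals the R-squared
cor(p_illness, illness)^2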
# compute MSE (add_predictions() is from the modelr package)
census %>%
add_predictions(model) %>%
summarise(MSE = mean((illness - pred)^2))
MSE |
---|
2.43 |
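Because the MSE is in squared units, its square root (the RMSE) is often easier to interpret, as it is on the same scale as the outcome (here about 1.6 percentage points of long-term illness); a minimal sketch reusing the same pipeline:

# root mean squared error, on the same scale as `illness`
census %>%
  add_predictions(model) %>%
  summarise(RMSE = sqrt(mean((illness - pred)^2)))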
We assess the sign, size and statistical significance of coefficients.
# summary output and CIs in a single table
cbind(summary(model)$coefficients, confint(model) )
## Estimate Std. Error t value Pr(>|t|) 2.5 %
## (Intercept) 4.00287606 1.20565723 3.320078 9.963810e-04 1.6314881
## No_Quals 0.44312392 0.03419171 12.959982 1.486323e-31 0.3758728
## Professionals -0.09663492 0.04417488 -2.187554 2.937473e-02 -0.1835218
## Age_65plus 0.28828276 0.02157319 13.363009 4.278867e-33 0.2458508
## 97.5 %
## (Intercept) 6.37426402
## No_Quals 0.51037505
## Professionals -0.00974805
## Age_65plus 0.33071472
You can also visualise the regression outputs using coefplot():
Fig.2 plotting regression coefficients
coefplot(model, title = " ", xlab = "Coefficient", ylab = " ",
intercept=FALSE, horizontal = FALSE,
color = "black", offset=0.2, sort='magnitude', fillColor = "grey")
TASK #4 Estimate an additional model and compare the estimates with the model above.
So far, we have explored how to treat independent variables measured on a continuous scale. What if we want to understand differences across qualitative attributes, e.g.:
Is the rate of long-term illness higher in particular regions of the UK? Does it vary much across regions?
Is there a gender pay gap?
How does ethnicity influence the industry people work in?
How does marital status influence house ownership?
What age group is more migratory?
Now we focus on how to quantify the relationship between qualitative attributes and an outcome variable.
What Are Dummy Variables (DVs)?
are categorical variables that take the value of 0 or 1
serve to quantify the relationship between an outcome variable and a qualitative attribute
We can create DVs from categorical data or by classifying continuous variables into categories (e.g. income quantiles).
id | gender | female |
---|---|---|
1 | female | 1 |
2 | female | 1 |
3 | male | 0 |
4 | female | 1 |
id | qualification | no qualification | degree |
---|---|---|---|
1 | degree | 0 | 1 |
2 | no qual | 1 | 0 |
3 | no qual | 1 | 0 |
4 | 2+ A levels | 0 | 0 |
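In R, dummies like those in the tables above can be created with ifelse(), either from a categorical variable or by classifying a continuous one; a minimal sketch with a small hypothetical data frame (the data frame and column names are illustrative, not part of the census data):

# a hypothetical example data frame
df <- data.frame(gender = c("female", "female", "male", "female"),
                 income = c(18000, 25000, 32000, 41000))
# dummy from a categorical variable: 1 if female, 0 otherwise
df$female <- ifelse(df$gender == "female", 1, 0)
# dummy from a continuous variable: 1 if income is above the median
df$high_income <- ifelse(df$income > median(df$income), 1, 0)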
Let’s explore how the % of long-term illness varies across regions in the UK.
We have 10 regions in the dataset, so let’s create our dummy variables.
# create dummy variables (dummy.code() is from the psych package)
region_dv <- dummy.code(census$Region)
census <- cbind(census, region_dv)
census[1:10,c(2,24:34)]
Region | p_illness | South East | East of England | East Midlands | North West | South West | London | West Midlands | Wales | Yorkshire and The Humber | North East |
---|---|---|---|---|---|---|---|---|---|---|---|
South East | 20.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
North West | 21.2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
East Midlands | 20.4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 21.7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
East Midlands | 22.2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 17.3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
South East | 14.7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
East of England | 19.2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
London | 18.9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
London | 13.3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
NOTE: To fit regression models, we need a base category, i.e. we need to exclude one of the categories. If our variable has 5 categories, we only include 4 dummy variables. The category that is excluded serves as our base category.
\[illness = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \beta_{5} D_{2} + \dots + \beta_{12} D_{9} + \epsilon\]
attach(census)
# specify a model equation
eq2 <- illness ~ No_Quals + Professionals + Age_65plus + as.matrix(census[,c(25:29, 31:34)])
model2 <- lm(formula = eq2, data = census)
export_summs(model2)
Model 1 | |
---|---|
(Intercept) | 3.09 ** |
(0.95) | |
No_Quals | 0.45 *** |
(0.03) | |
Professionals | -0.01 |
(0.04) | |
Age_65plus | 0.30 *** |
(0.02) | |
as.matrix(census[, c(25:29, 31:34)])South East | -1.03 *** |
(0.28) | |
as.matrix(census[, c(25:29, 31:34)])East of England | -1.73 *** |
(0.30) | |
as.matrix(census[, c(25:29, 31:34)])East Midlands | -0.71 * |
(0.31) | |
as.matrix(census[, c(25:29, 31:34)])North West | 0.98 ** |
(0.32) | |
as.matrix(census[, c(25:29, 31:34)])South West | -0.00 |
(0.35) | |
as.matrix(census[, c(25:29, 31:34)])West Midlands | -1.00 ** |
(0.34) | |
as.matrix(census[, c(25:29, 31:34)])Wales | 2.46 *** |
(0.37) | |
as.matrix(census[, c(25:29, 31:34)])Yorkshire and The Humber | -0.87 * |
(0.36) | |
as.matrix(census[, c(25:29, 31:34)])North East | 1.25 ** |
(0.43) | |
N | 348 |
R2 | 0.89 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Fig.3 regression coefficients
# plot coefficients
coefplot(model2, title = " ", xlab = "Coefficient", ylab = " ",
intercept=FALSE, horizontal = FALSE,
color = "black", offset=0.2, sort='magnitude', fillColor = "grey")
If the independent variable you want to include as a DV is already a factor, you can simply pass it to lm():
eq3 <- illness ~ No_Quals + Professionals + Age_65plus + Region
model3 <- lm(formula = eq3, data = census)
Or you can create dummy variables in the lm() function using factor(). Note: this functionality may not be available with all packages and modelling frameworks.
# dummy variables using `factor()`
eq4 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model4 <- lm(formula = eq4, data = census)
TASK #5 Compare the results from these approaches by running:
# comparing models
export_summs(model2, model3, model4)
We can use the relevel() function to directly specify the base category.
Note:
It achieves this by changing the order of the levels in a factor variable
This causes a permanent change in the order of the factor levels, so remember to reverse the change or create a second version of your data frame.
relevel() only works for unordered factors.
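A simple safeguard is to store the original level order before re-levelling so the change can be undone later; a minimal sketch:

# keep a copy of the original level order so the change can be reversed
original_levels <- levels(census$Region)
# to restore the original ordering later:
# census$Region <- factor(census$Region, levels = original_levels)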
# check the order of levels
levels(census$Region)
## [1] "East Midlands" "East of England"
## [3] "London" "North East"
## [5] "North West" "South East"
## [7] "South West" "Wales"
## [9] "West Midlands" "Yorkshire and The Humber"
# change reference category
census$Region <- relevel(census$Region, ref="West Midlands")
# check levels again
levels(census$Region)
## [1] "West Midlands" "East Midlands"
## [3] "East of England" "London"
## [5] "North East" "North West"
## [7] "South East" "South West"
## [9] "Wales" "Yorkshire and The Humber"
# estimate a new regression
eq5 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model5 <- lm(formula = eq5, data = census)
export_summs(model5)
Model 1 | |
---|---|
(Intercept) | 2.09 * |
(1.03) | |
No_Quals | 0.45 *** |
(0.03) | |
Professionals | -0.01 |
(0.04) | |
Age_65plus | 0.30 *** |
(0.02) | |
factor(Region)East Midlands | 0.29 |
(0.28) | |
factor(Region)East of England | -0.73 ** |
(0.27) | |
factor(Region)London | 1.00 ** |
(0.34) | |
factor(Region)North East | 2.25 *** |
(0.40) | |
factor(Region)North West | 1.97 *** |
(0.28) | |
factor(Region)South East | -0.04 |
(0.27) | |
factor(Region)South West | 0.99 ** |
(0.32) | |
factor(Region)Wales | 3.45 *** |
(0.33) | |
factor(Region)Yorkshire and The Humber | 0.13 |
(0.33) | |
N | 348 |
R2 | 0.89 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
We have learned how to estimate the associations for models with independent continuous and categorical variables when the outcome variable is continuous.
\[y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \epsilon\]
What about when the outcome variable is categorical?
Examples:
Health outcomes/behaviours: e.g. smoking, drinking, cancer, heart attack, HIV, etc.
Employment outcomes: unemployed, employed full-time, employed part-time, self-employed, job satisfaction, etc.
Decision making processes: University, Brexit vote, travelling, migration, long-distance commuting, etc.
To motivate this session, we will seek to answer the following question: who is more likely to use sustainable transport?
Read QLFS data
# clean workspace
rm(list=ls())
# load data
load("../data/data_qlfs.RData")
Create our sustainable transport variable
qlfs <- mutate(qlfs,
sus_tr = ifelse(
TravelMode == "Motorbike,moped,scooter"
| TravelMode == "Bicycle"
| TravelMode =="Walk"
, 1, 0))
Probabilities, odds and log-odds all use a different denominator, so they have different meanings.
An example:
attach(qlfs)
t1 <- table(sus_tr, Sex)
Probability of using sustainable transport if female:
1,148 / (26,189 + 1,148) = 0.042
Odds of using sustainable transport if female:
1,148 / 26,189 = 0.044
Log-odds of using sustainable transport if female
\(log({1,148}/{26,189})\) = \(log(0.044)\) = -3.12
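These quantities can also be computed directly from the cross-tabulation t1; a minimal sketch, assuming the rows of t1 are labelled "0"/"1" (non-user/user of sustainable transport) and the relevant column of Sex is labelled "Female":

# probability, odds and log-odds of using sustainable transport for females
p_female    <- t1["1", "Female"] / sum(t1[, "Female"])   # probability
odds_female <- t1["1", "Female"] / t1["0", "Female"]     # odds
log(odds_female)                                         # log-odds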
Note:
Logistic regression is one of a family of regression models known as generalised linear models (GLMs).
GLMs are characterised by three properties:
The assumed distribution of the model errors
The transformation (link function) applied to the outcome variable
The way in which model error (deviance) is measured (log likelihood)
Logistic regression models:
Predict the log-odds of an event happening based on at least one independent variable vs. linear regression - estimates the average value of the outcome
Dependent variable: qualitative (dummy or binary) variable vs. linear regression - continuous variable
Requires estimates of the dependent variable to lie between 0 and 1 (i.e. valid probabilities) vs. linear regression - unbounded continuous values
Assumes a logistic (binomial) distribution vs. linear regression - normal distribution
Various names: binary regression model, discrete choice model, probability regression model, and qualitative response regression model
In mathematical terms, for a model with a single independent variable:
\[log \left( \frac{p}{1-p} \right) = \alpha + \beta_{1} x_{1}\] which can be rearranged as:
\[ p = \frac{exp(\alpha + \beta_{1} x_{1}) }{ 1 + exp(\alpha + \beta_{1} x_{1})} \]
\(p\): probability of an event happening
\(\alpha\): regression intercept
\(\beta_{1}\): regression coefficient associated with \(x_{1}\)
Intercept is the log-odds of an event happening (i.e. \(y=1\)) if the values of all explanatory variables \(x\)s are zero.
Slope is the estimated change in the log-odds for one unit change in \(x_{k}\), holding all other variables constant.
The exponentiated coefficients (odds ratios) give the expected change in the odds for a one-unit change in \(x_{k}\), holding all other variables constant.
\(exp(\beta_{k})\) = 1, if \(\beta_{k} = 0\): indicates the event is equally likely to occur (the odds are unchanged)
\(exp(\beta_{k})\) > 1, if \(\beta_{k} > 0\): indicates the event is more likely to occur (the odds are multiplied by \(exp(\beta_{k})\), so they increase)
\(exp(\beta_{k})\) < 1, if \(\beta_{k} < 0\): indicates the event is less likely to occur (the odds are multiplied by \(exp(\beta_{k})\), so they decrease)
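As a quick numerical illustration (the coefficients here are made up, not taken from any model in this session):

# exponentiating illustrative coefficients to obtain odds ratios
exp(0.69)   # ~2.0: the odds roughly double
exp(-0.69)  # ~0.5: the odds roughly halve
exp(0)      #  1.0: the odds are unchanged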
In R, we use the basic syntax for fitting a generalised linear model:
# general function
glm(outcome ~ predictor(s), data = dataframe,
    family = name of assumed error distribution(link = name of a link function),
    na.action = na.exclude OR na.fail)
Hence, to fit a logistic regression model, the code required is:
# estimate a logistic regression
glm(outcome ~ predictor(s), data= data.frame,
family=binomial(link="logit"),
na.action=na.exclude)
# specify a model
eq1 <- sus_tr ~ Sex + AgeGroup + NetPay
# estimate model
model1 <- glm(eq1, data= qlfs,
family=binomial(link="logit"),
na.action=na.exclude)
# coefficients (log-odds)
round(coefficients(model1),2)
## (Intercept) SexFemale AgeGroup30-44 AgeGroup45-64 AgeGroup65+
## -3.51 0.09 0.43 0.17 -2.25
## NetPay
## 0.00
# odds ratio
round(exp(coef(model1)),2)
## odds ratios and 95% CI
round(exp(cbind("Odds-ratio" = coef(model1), confint(model1))),2)
## Waiting for profiling to be done...
## Odds-ratio 2.5 % 97.5 %
## (Intercept) 0.03 0.03 0.03
## SexFemale 1.09 0.98 1.21
## AgeGroup30-44 1.53 1.33 1.76
## AgeGroup45-64 1.19 1.04 1.36
## AgeGroup65+ 0.11 0.08 0.14
## NetPay 1.00 1.00 1.00
What is the base category for Sex? And for AgeGroup? Hint: the excluded category.
Interpretation:
For SexFemale, an \(exp(\beta_{k})\) of \(1.09\) indicates that being female (versus male) changes the odds of using sustainable transport by a factor of \(1.09\), holding all other variables constant; i.e. the odds of using sustainable transport for females are 1.09 times those for males.
For NetPay, an \(exp(\beta_{k})\) of \(1\) indicates that for every one-pound change in net pay, the odds of using sustainable transport (versus not using it) change by a factor of \(1\), holding all other variables constant; i.e. at this level of rounding, the odds are essentially unchanged per additional pound.
Note: R produces the odds ratio for the intercept but it is rarely interpreted.
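Because the NetPay odds ratio refers to a one-pound change, rescaling to a larger increment is often more informative; a minimal sketch using the unrounded coefficient (reported in the full summary below):

# odds ratio for a £100 (rather than £1) increase in net pay
exp(coef(model1)["NetPay"] * 100)   # roughly 1.2, i.e. noticeably higher odds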
Full model output:
summary(model1)
##
## Call:
## glm(formula = eq1, family = binomial(link = "logit"), data = qlfs,
## na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4193 -0.2991 -0.2515 -0.0821 3.3800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.511e+00 6.306e-02 -55.680 < 2e-16 ***
## SexFemale 8.791e-02 5.319e-02 1.653 0.0984 .
## AgeGroup30-44 4.269e-01 7.175e-02 5.950 2.68e-09 ***
## AgeGroup45-64 1.719e-01 6.908e-02 2.488 0.0128 *
## AgeGroup65+ -2.253e+00 1.607e-01 -14.026 < 2e-16 ***
## NetPay 2.062e-03 8.697e-05 23.711 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14406 on 45220 degrees of freedom
## Residual deviance: 12842 on 45215 degrees of freedom
## (39471 observations deleted due to missingness)
## AIC: 12854
##
## Number of Fisher Scoring iterations: 8
The summary output indicates:
Residuals: summary statistics of the distribution of residuals (model error)
Coefficients: model coefficients and their statistical significance. Note: the logistic regression coefficients give the change in the log-odds of the outcome for a one-unit increase in the predictor variable, e.g. being female, versus male, changes the log-odds of using sustainable transport by 0.09.
Deviance: deviances for the null and full model. The smaller the residual deviance relative to the null deviance, the better the model.
AIC: Akaike’s Information Criterion - the lower the better the model.
A formal chi-squared test compares the null and residual model deviances:
# computing the p-value for the test
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0
TASK #6: What is the null hypothesis of this test?
Interpretation of the test: the chi-square statistic of 1563.814 with 5 degrees of freedom and an associated p-value of essentially zero (well below 0.001) tells us that our model fits the data significantly better than an empty model.
To compare across different model specifications use the AIC.
Create a new data set for Female with varying values for AgeGroup and NetPay:
df_p <- with(qlfs, data.frame(Sex = factor("Female", levels = c("Male", "Female"))
, AgeGroup = factor(c("16-29", "30-44", "45-64", "65+"))
, NetPay = rep(seq(from = 200, to = 800, length.out = 100), 4)
)
)
Computing probabilities and their CIs:
# add se
df_pci <- cbind(df_p, predict(model1
, newdata = df_p,
type = "link",
se = TRUE)
)
# add CIs
df_pci <- within(df_pci, {
p_prob <- plogis(fit)
ci95_lb <- plogis(fit - (1.96 * se.fit))
ci95_ub <- plogis(fit + (1.96 * se.fit))
})
#View
head(df_pci)
Sex | AgeGroup | NetPay | fit | se.fit | residual.scale | ci95_ub | ci95_lb | p_prob |
---|---|---|---|---|---|---|---|---|
Female | 16-29 | 200 | -3.01 | 0.0614 | 1 | 0.0526 | 0.0418 | 0.0469 |
Female | 30-44 | 206 | -2.57 | 0.0484 | 1 | 0.0775 | 0.065 | 0.071 |
Female | 45-64 | 212 | -2.81 | 0.0466 | 1 | 0.0616 | 0.0519 | 0.0566 |
Female | 65+ | 218 | -5.23 | 0.155 | 1 | 0.00722 | 0.00395 | 0.00534 |
Female | 16-29 | 224 | -2.96 | 0.062 | 1 | 0.0552 | 0.0438 | 0.0492 |
Female | 30-44 | 230 | -2.52 | 0.0484 | 1 | 0.0812 | 0.0681 | 0.0744 |
Fig.4 Predicted probabilities
ggplot(df_pci, aes(x = NetPay, y = p_prob)) +
geom_ribbon(aes(ymin = ci95_lb, ymax = ci95_ub, fill = AgeGroup), alpha = 0.2) +
geom_line(aes(colour = AgeGroup), size = 1) +
theme_classic()
Function | Description |
---|---|
lm() | estimate linear regression |
glm() | estimate logistic regression |
predict.lm() | create prediction based on linear regression estimates |
predict() | predict probabilities based on logistic regression estimates |
mutate() | create new variables |
dummy.code() | create dummy variables |
relevel() | change reference category |
cbind() | combine data objects into a single data frame |
coefplot() | plot coefficients |
export_summs() | create and export regression tables |