Regression Analysis

Linear, Logistic Regression & Dummy Variables

Francisco Rowe

2020-08-31

In this session1 Part of Introduction to Statistical Learning in R Creative Commons License
Regression Analysis – Linear, Logistic Regression & Dummy Variables by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
, we turned to examining regression modelling which enables:

1 Modelling Approaches & Data Types

Type of Data Category Approach
Scale/continuous Categorical: Binary Multiple linear regression
Categorical Nominal: Binary Logistic regression
Categorical Nominal: Multiple Multiple logistic regression
Categorical Ordinal: Multiple Ordinal logistic regression
Categorical Count Poisson regression

2 Multiple Linear Regression

2.1 The intuition

A Multiple Linear Regression captures the average linear relationship.

# clean workspace
rm(list=ls())
# load data
load("../data/data_census.RData")
## `geom_smooth()` using formula 'y ~ x'

Fig.1 relationship unemployment and illness Fig.1 relationship unemployment and illness

This simple regression line can described using the equation:

\[y = \alpha + \beta_{1} x_{1} + \epsilon\]

Why is a multiple regression model known as ‘ordinary least squares’ or OLS regression?

The regression line is computed minimising the total sum squares; that is, the sum of distances from each point in the scatter point to the regression line. The line which minimises this sum is the regression line.

TASK #2 How would the regression line be if \(\beta\) = 0?

If you add more independent variables, the regression equation becomes:

\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + ... + \beta_{k} x_{k} + \epsilon\]

##Interpretation

Intercept (\(\alpha\)): is the estimated average value of \(y\) when the value of \(x\) is zero.

Slope (\(\beta\)): is the estimated average change in \(y\) for a one unit change in \(x\), when all other explanatory variables are held constant.

2.2 Estimation

Let’s explore the following question: how do local factors affect residents’ long-term illness in the UK?

We can use our census data and explore how the local population can explain differences in the percentage of ill population across districts in the UK.

We want to estimate the following model:

\[y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \epsilon\] \[illness = \alpha + \beta_{1} noqualification + \beta_{2} professionals + \beta_{3} elderly + \epsilon\]

# specify a model equation
eq <- illness ~ No_Quals + Professionals + Age_65plus
model <- lm(formula = eq, data = census)
# coefficients
round(coefficients(model),2)
##   (Intercept)      No_Quals Professionals    Age_65plus 
##          4.00          0.44         -0.10          0.29

ie.

\[\hat{y} = 4 + 0.44 x_{1} - 0.10 x_{2} + 0.29 x_{3}\]

Interpretation: A 1% point increase in the percentage of people with no qualification is associated with a rise in the percentage of local long-term ill population by .44%, with the percentage of professional and elderly population held constant

2.3 Predictions

Individual predictions:

# using base functions
census$p_illness <- predict.lm(model)

Let’s see the observed and predicted % of long-term illness for Liverpool:

census[166, c(1, 6, 24)]
Districtillnessp_illness
Liverpool22.420.1

TASK #3 Obtain the % of long-term illness for Cheltenham.

You can use predictions to assess ‘what-if’ scenarios.

# get typical values
attach(census)
a_x1 <- mean(No_Quals)
a_x2 <- mean(Professionals)
# prediction if the population aged 65+ is 30%
4.00 + 0.44*a_x1 - 0.10*a_x2 + 0.29*30
## [1] 21.53108

2.4 Model Assessment

Full model output:

summary(model)
## 
## Call:
## lm(formula = eq, data = census)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2275 -1.0969 -0.3077  0.9173  5.4294 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.00288    1.20566   3.320 0.000996 ***
## No_Quals       0.44312    0.03419  12.960  < 2e-16 ***
## Professionals -0.09663    0.04417  -2.188 0.029375 *  
## Age_65plus     0.28828    0.02157  13.363  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.568 on 344 degrees of freedom
## Multiple R-squared:  0.7917, Adjusted R-squared:  0.7899 
## F-statistic: 435.9 on 3 and 344 DF,  p-value: < 2.2e-16

The summary output indicates:

2.4.1 Overall Fit

R-squared and F test: A problem with R-squared is hat every time an additional variable is added to the model, the R-squared will increase.

  • Adjusted R-squared and Akaike’s Information Criterion (AIC): take into account the number of explanatory variables in the model when assessing model fit. The \(AIC\) is more robust so that adding extra variables to the model doesn’t necessarily lead to an improvement. Smaller AIC scores indicate better fit.
# compute AIC
AIC(model)
## [1] 1306.781

A limitation: AIC can only be compared between models fitted to the same data; R-squared can be compared between models fitted to any data.

Alternatives:

  • Correlation coefficient: between the observed and predicted value of your dependent variable. The resulting measure is comparable between models and datasets, but assumes relationships are linear.
# compute correlation
cor(p_illness, illness)
## [1] 0.8897922
  • MSE: enables comparison between models and datasets.
# compute MSE
census %>% 
  add_predictions(model) %>%
  summarise(MSE = mean((illness - pred)^2))
MSE
2.43

2.4.2 Individual Coefficients

We assess the sign, size and statistical significance of coefficients.

# summary output and CIs in a single table
cbind(summary(model)$coefficients, confint(model) )
##                  Estimate Std. Error   t value     Pr(>|t|)      2.5 %
## (Intercept)    4.00287606 1.20565723  3.320078 9.963810e-04  1.6314881
## No_Quals       0.44312392 0.03419171 12.959982 1.486323e-31  0.3758728
## Professionals -0.09663492 0.04417488 -2.187554 2.937473e-02 -0.1835218
## Age_65plus     0.28828276 0.02157319 13.363009 4.278867e-33  0.2458508
##                    97.5 %
## (Intercept)    6.37426402
## No_Quals       0.51037505
## Professionals -0.00974805
## Age_65plus     0.33071472

You can also visualise the regression outputs using coefplot:

Fig.2 plotting regression coefficients Fig.2 plotting regression coefficients

coefplot(model, title = " ", xlab = "Coefficient", ylab = " ",
         intercept=FALSE, horizontal = FALSE,
         color = "black", offset=0.2, sort='magnitude', fillColor = "grey")

TASK #4 Estimate an additional model and compare the estimates with model above

3 Dummy Variables

So far, we have explored how to treat independent variables measured in a continuous scale. What about if we want to understand differences across qualitative attributes eg.

Now we focus on how to quantify the relationship between qualitative attributes and an outcome variable.

3.1 The Theory

What Are Dummy Variables (DVs)?

We can create DVs from categorial data or classifying continuous variables into categories (eg. income quantiles).

3.1.1 Example: 2 Categories

id gender female
1 female 1
2 female 1
3 male 0
4 female 1

3.1.2 Example: 3+ Categories

id qualification no qualification degree
1 degree 0 1
2 no qual 1 0
3 no qual 1 0
4 2+ A levels 0 0

3.2 Practice

Let’s explore how the % of long-term illness vary across regions in the UK.

We have 10 regions in the dataset, so let’s create our dummy variables.

# create dummy variables
region_dv <- dummy.code(census$Region)
census <- cbind(census, region_dv)
census[1:10,c(2,24:34)]
Regionp_illnessSouth EastEast of EnglandEast MidlandsNorth WestSouth WestLondonWest MidlandsWalesYorkshire and The HumberNorth East
South East20.81000000000
North West21.20001000000
East Midlands20.40010000000
South East21.71000000000
East Midlands22.20010000000
South East17.31000000000
South East14.71000000000
East of England19.20100000000
London18.90000010000
London13.30000010000

NOTE: To fit regression models, we need a base category ie. we need to exclude one of the categories. If our variables has 5 categories, we only include 4 dummy variables. The category that is excluded serves as our base category.

3.3 Estimation Using DVs

\[illness = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \beta_{5} D_{2} + .... + \beta_{12} D_{9} + \epsilon\]

attach(census)
# specify a model equation
eq2 <- illness ~ No_Quals + Professionals + Age_65plus + as.matrix(census[,c(25:29, 31:34)])
model2 <- lm(formula = eq2, data = census)
export_summs(model2)
Model 1
(Intercept)3.09 ** 
(0.95)   
No_Quals0.45 ***
(0.03)   
Professionals-0.01    
(0.04)   
Age_65plus0.30 ***
(0.02)   
as.matrix(census[, c(25:29, 31:34)])South East-1.03 ***
(0.28)   
as.matrix(census[, c(25:29, 31:34)])East of England-1.73 ***
(0.30)   
as.matrix(census[, c(25:29, 31:34)])East Midlands-0.71 *  
(0.31)   
as.matrix(census[, c(25:29, 31:34)])North West0.98 ** 
(0.32)   
as.matrix(census[, c(25:29, 31:34)])South West-0.00    
(0.35)   
as.matrix(census[, c(25:29, 31:34)])West Midlands-1.00 ** 
(0.34)   
as.matrix(census[, c(25:29, 31:34)])Wales2.46 ***
(0.37)   
as.matrix(census[, c(25:29, 31:34)])Yorkshire and The Humber-0.87 *  
(0.36)   
as.matrix(census[, c(25:29, 31:34)])North East1.25 ** 
(0.43)   
N348       
R20.89    
*** p < 0.001; ** p < 0.01; * p < 0.05.
In a graph:

Fig.3 regression coefficients Fig.3 regression coefficients

# plot coefficients
coefplot(model2, title = " ", xlab = "Coefficient", ylab = " ",
         intercept=FALSE, horizontal = FALSE,
         color = "black", offset=0.2, sort='magnitude', fillColor = "grey")

3.4 Alternative approaches to DVs

If the independent variable you want as a DV is a factor, you can just execute lm():

eq3 <- illness ~ No_Quals + Professionals + Age_65plus + Region
model3 <-lm(formula = eq3, data = census)

Or you can create dummy variables in the lm() function using factor(). Note: this functionality may not be available with all packages and modelling frameworks.

# dummy variables using `factor()`
eq4 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model4 <- lm(formula = eq4, data = census)

TASK #5 Compare the results from these approaches by:

# comparing models  
export_summs(model2, model3, model4)

3.5 Changing The Base Category

We can use the relevel() function to directly specify the base category.

Note:

# check the order of levels
levels(census$Region)
##  [1] "East Midlands"            "East of England"         
##  [3] "London"                   "North East"              
##  [5] "North West"               "South East"              
##  [7] "South West"               "Wales"                   
##  [9] "West Midlands"            "Yorkshire and The Humber"
# change reference category
census$Region <- relevel(census$Region, ref="West Midlands")
# check levels again
levels(census$Region)
##  [1] "West Midlands"            "East Midlands"           
##  [3] "East of England"          "London"                  
##  [5] "North East"               "North West"              
##  [7] "South East"               "South West"              
##  [9] "Wales"                    "Yorkshire and The Humber"
# estimate a new regression
eq5 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model5 <- lm(formula = eq5, data = census)
export_summs(model5)
Model 1
(Intercept)2.09 *  
(1.03)   
No_Quals0.45 ***
(0.03)   
Professionals-0.01    
(0.04)   
Age_65plus0.30 ***
(0.02)   
factor(Region)East Midlands0.29    
(0.28)   
factor(Region)East of England-0.73 ** 
(0.27)   
factor(Region)London1.00 ** 
(0.34)   
factor(Region)North East2.25 ***
(0.40)   
factor(Region)North West1.97 ***
(0.28)   
factor(Region)South East-0.04    
(0.27)   
factor(Region)South West0.99 ** 
(0.32)   
factor(Region)Wales3.45 ***
(0.33)   
factor(Region)Yorkshire and The Humber0.13    
(0.33)   
N348       
R20.89    
*** p < 0.001; ** p < 0.01; * p < 0.05.

4 Logistic Regression

We have learned how to estimate the associations for models with independent continuous and categorical variables when the outcome variable is continuous.

\[y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \epsilon\]

What about when the outcome variable is categorical?

Examples:

To motivate this session, we will seek to answer the following question: Who are more likely to use sustainable transport?

Read QLFS data

# clean workspace
rm(list=ls())
# load data
load("../data/data_qlfs.RData") 

Create our sustainable transport variable

qlfs <- mutate(qlfs, 
               sus_tr = ifelse(
                 TravelMode == "Motorbike,moped,scooter" 
                 | TravelMode == "Bicycle"
                 | TravelMode =="Walk"
                 , 1, 0))

4.1 Probability, Odds & Log-odds

They all use a different denominator so they have different meaning.

An example:

attach(qlfs)
t1 <- table(sus_tr, Sex)

4.1.1 Probability

Probability of using sustainable transport if female:

1148 / (26189 + 1148) = 0.041

4.1.2 Odds

Odds of using sustainable transport if female:

1,148 / 26,189 = 0.044

4.1.3 Log-odds

Log-odds of using sustainable transport if female

\(log({1,148}/{26,189})\) = \(log(0.044)\) = -3.12

Note:

  • Probabilities can vary between 0 and 1
  • Odds can vary between 0 and infinity
  • Log-odds can vary between -ve infinity and +ve infinity

4.2 Formal Definition

Logistic regression is one of a family of regression models known as “General Linear Models” (GLMs).

GLMs are characterised by three properties:

Logistic regression models:

Various names: binary regression model, discrete choice model, probability regression model, and qualitative response regression model

In mathemathical terms for a model with a single independent variable:

\[log(p(1-p)) = \alpha + \beta_{1} x_{1}\] which can be rearrange:

\[ p = \frac{exp(\alpha + \beta_{1} x_{1}) }{ 1 + exp(\alpha + \beta_{1} x_{1})} \]

\(p\): probability of an event happening

\(\alpha\): regression intercept

\(\beta_{1}\): regression coefficient associated to \(x_{1}\)

4.3 Interpretation

4.3.1 Log-odds

Intercept is the log-odds of an event happening (ie. \(y=1\)) if the value of the explanatory variables \(x\)s is zero.

Slope is the estimated change in the log-odds for one unit change in \(x_{k}\), holding all other variables constant.

4.3.2 Interpretation of \(Exp(\beta_{k} x_{k})\)

They give the expected change in the odds for a unit change in \(x\)s, holding all other variables constant.

\(Exp(\beta_{k} x_{k})\) = 1, if \(\beta = 0\): indicates is equally likely to occur.

\(Exp(\beta_{k} x_{k})\) > 1, if \(\beta > 0\): indicates an event is more likely to occur, or the odds are ‘\(\beta_{k}\) times larger’

\(Exp(\beta_{k} x_{k})\) < 1, if \(\beta < 0\): indicates an event is less likely to occur, or the odds are ‘\(\beta_{k}\) times smaller’

4.4 Practice

4.4.1 General Function

In R, we use the basic syntax for fitting a general linear model:

# general function
glm(outcome ~ predictor(s), data = dataframe,
    family = name of assumed error distribution(link = name of a link function),
             na.action = na.exlcude  OR  na.fail) )

Hence, to fit a logistic regression model, the code required is:

# estimate a logistic regression
glm(outcome ~ predictor(s), data= data.frame,
    family=binomial(link="logit"),
    na.action=na.exclude)

4.5 Estimation

# specify a model
eq1 <- sus_tr ~ Sex + AgeGroup + NetPay
# estimate model
model1 <- glm(eq1, data= qlfs,
                      family=binomial(link="logit"),
                      na.action=na.exclude)

# coefficients (log-odds)
round(coefficients(model1),2)
##   (Intercept)     SexFemale AgeGroup30-44 AgeGroup45-64   AgeGroup65+ 
##         -3.51          0.09          0.43          0.17         -2.25 
##        NetPay 
##          0.00
# odds ratio
round(exp(coef(model1)),2)
## odds ratios and 95% CI
round(exp(cbind("Odds-ratio" = coef(model1), confint(model1))),2)
## Waiting for profiling to be done...
##               Odds-ratio 2.5 % 97.5 %
## (Intercept)         0.03  0.03   0.03
## SexFemale           1.09  0.98   1.21
## AgeGroup30-44       1.53  1.33   1.76
## AgeGroup45-64       1.19  1.04   1.36
## AgeGroup65+         0.11  0.08   0.14
## NetPay              1.00  1.00   1.00

What is the base category for Sex? and for Age Group? Hint: excluded category

Interpretation:

For SexFemale, a \(Exp(\beta_{k} x_{k})\) of \(1.09\) indicates that being female changes the odds of using sustainable transport (versus males) by a factor of \(1.09\), holding all other variables constant. i.e. females are 1.09 more likely than males to use sustainable transport.

For NetPay, a \(Exp(\beta_{k} x_{k})\) of \(1\) indicates that for every one pound change, the odds of using sustainable transport by a factor of \(1\) (versus not using sustainable transport), holding all other variables constant. ie. equal chance.

Note: R produces the odds ratio for the intercept but it is rarely interpreted.

4.6 Model Assessment

Full model output:

summary(model1)
## 
## Call:
## glm(formula = eq1, family = binomial(link = "logit"), data = qlfs, 
##     na.action = na.exclude)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4193  -0.2991  -0.2515  -0.0821   3.3800  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.511e+00  6.306e-02 -55.680  < 2e-16 ***
## SexFemale      8.791e-02  5.319e-02   1.653   0.0984 .  
## AgeGroup30-44  4.269e-01  7.175e-02   5.950 2.68e-09 ***
## AgeGroup45-64  1.719e-01  6.908e-02   2.488   0.0128 *  
## AgeGroup65+   -2.253e+00  1.607e-01 -14.026  < 2e-16 ***
## NetPay         2.062e-03  8.697e-05  23.711  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14406  on 45220  degrees of freedom
## Residual deviance: 12842  on 45215  degrees of freedom
##   (39471 observations deleted due to missingness)
## AIC: 12854
## 
## Number of Fisher Scoring iterations: 8

The summary output indicates:

4.6.1 Overall Fit

Formal chi-squared test based on the null and residual model deviances.

# computing the p-value for the test
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0

TASK #5: What is the null hypothesis of this test?

Interpretation of the test: The chi-square of 1563.814 with 5 degrees of freedom and an associated p-value of less than 0 tells us that our model is a significantly better fit to the data than an empty model.

To compare across different model specifications use the AIC.

4.7 Visualising Probabilities

4.7.1 Estimating Probabilities

Create a new data set for Female with varying values for AgeGroup and PayNet

df_p <- with(qlfs, data.frame(Sex = factor("Female", levels = c("Male", "Female"))
  , AgeGroup = factor(c("16-29", "30-44", "45-64", "65+"))
  , NetPay = rep(seq(from = 200, to = 800, length.out = 100), 4)
  )
)

Computing probabilities and their CIs:

# add se
df_pci <- cbind(df_p, predict(model1
                             , newdata = df_p,
                             type = "link",
                             se = TRUE)
               )
# add CIs
df_pci <- within(df_pci, {
    p_prob <- plogis(fit)
    ci95_lb <- plogis(fit - (1.96 * se.fit))
    ci95_ub <- plogis(fit + (1.96 * se.fit))
})

#View
head(df_pci)
SexAgeGroupNetPayfitse.fitresidual.scaleci95_ubci95_lbp_prob
Female16-29200-3.010.061410.0526 0.0418 0.0469 
Female30-44206-2.570.048410.0775 0.065  0.071  
Female45-64212-2.810.046610.0616 0.0519 0.0566 
Female65+218-5.230.155 10.007220.003950.00534
Female16-29224-2.960.062 10.0552 0.0438 0.0492 
Female30-44230-2.520.048410.0812 0.0681 0.0744 

4.7.2 Visualising Probabilities