# Regression Analysis

### Linear, Logistic Regression & Dummy Variables

#### 2020-08-31

• exploring the relationship between an outcome variable and multiple variables; and,
• predicting possible future scenarios.

# 1 Modelling Approaches & Data Types

Type of Data Category Approach
Scale/continuous Categorical: Binary Multiple linear regression
Categorical Nominal: Binary Logistic regression
Categorical Nominal: Multiple Multiple logistic regression
Categorical Ordinal: Multiple Ordinal logistic regression
Categorical Count Poisson regression

# 2 Multiple Linear Regression

## 2.1 The intuition

A Multiple Linear Regression captures the average linear relationship.

# clean workspace
rm(list=ls())
load("../data/data_census.RData")
## geom_smooth() using formula 'y ~ x' Fig.1 relationship unemployment and illness

This simple regression line can described using the equation:

$y = \alpha + \beta_{1} x_{1} + \epsilon$

• $$y$$: dependent variable (or outcome variable, predicted variable, explained variable)

• $$\alpha$$: constant (or intercept)

• $$\beta_{1}$$: regression coefficient (or slope, coefficient, beta coefficient, estimated coefficient, estimated parameter)

• $$x$$: independent variable (or explanatory variable, covariate, predictor, control variable)

• $$\epsilon$$: error term

Why is a multiple regression model known as ‘ordinary least squares’ or OLS regression?

The regression line is computed minimising the total sum squares; that is, the sum of distances from each point in the scatter point to the regression line. The line which minimises this sum is the regression line.

TASK #2 How would the regression line be if $$\beta$$ = 0?

If you add more independent variables, the regression equation becomes:

$y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + ... + \beta_{k} x_{k} + \epsilon$

##Interpretation

Intercept ($$\alpha$$): is the estimated average value of $$y$$ when the value of $$x$$ is zero.

Slope ($$\beta$$): is the estimated average change in $$y$$ for a one unit change in $$x$$, when all other explanatory variables are held constant.

## 2.2 Estimation

Let’s explore the following question: how do local factors affect residents’ long-term illness in the UK?

We can use our census data and explore how the local population can explain differences in the percentage of ill population across districts in the UK.

We want to estimate the following model:

$y = \alpha + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \epsilon$ $illness = \alpha + \beta_{1} noqualification + \beta_{2} professionals + \beta_{3} elderly + \epsilon$

# specify a model equation
eq <- illness ~ No_Quals + Professionals + Age_65plus
model <- lm(formula = eq, data = census)
# coefficients
round(coefficients(model),2)
##   (Intercept)      No_Quals Professionals    Age_65plus
##          4.00          0.44         -0.10          0.29

ie.

$\hat{y} = 4 + 0.44 x_{1} - 0.10 x_{2} + 0.29 x_{3}$

Interpretation: A 1% point increase in the percentage of people with no qualification is associated with a rise in the percentage of local long-term ill population by .44%, with the percentage of professional and elderly population held constant

## 2.3 Predictions

Individual predictions:

# using base functions
census$p_illness <- predict.lm(model) Let’s see the observed and predicted % of long-term illness for Liverpool: census[166, c(1, 6, 24)] Districtillnessp_illness Liverpool22.420.1 TASK #3 Obtain the % of long-term illness for Cheltenham. You can use predictions to assess ‘what-if’ scenarios. # get typical values attach(census) a_x1 <- mean(No_Quals) a_x2 <- mean(Professionals) # prediction if the population aged 65+ is 30% 4.00 + 0.44*a_x1 - 0.10*a_x2 + 0.29*30 ##  21.53108 ## 2.4 Model Assessment Full model output: summary(model) ## ## Call: ## lm(formula = eq, data = census) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.2275 -1.0969 -0.3077 0.9173 5.4294 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.00288 1.20566 3.320 0.000996 *** ## No_Quals 0.44312 0.03419 12.960 < 2e-16 *** ## Professionals -0.09663 0.04417 -2.188 0.029375 * ## Age_65plus 0.28828 0.02157 13.363 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.568 on 344 degrees of freedom ## Multiple R-squared: 0.7917, Adjusted R-squared: 0.7899 ## F-statistic: 435.9 on 3 and 344 DF, p-value: < 2.2e-16 The summary output indicates: • Residuals: summary statistics of the distribution of residuals (model error) • Coefficients: model coefficients and their statistical significance • R-squared: indicates how much of the variation in the dependent variable is ‘explained’ by the explanatory variables. • F statistics: indicates whether the fitted model is a statistically significant improvement on the null model. ### 2.4.1 Overall Fit R-squared and F test: A problem with R-squared is hat every time an additional variable is added to the model, the R-squared will increase. • Adjusted R-squared and Akaike’s Information Criterion (AIC): take into account the number of explanatory variables in the model when assessing model fit. The $$AIC$$ is more robust so that adding extra variables to the model doesn’t necessarily lead to an improvement. Smaller AIC scores indicate better fit. # compute AIC AIC(model) ##  1306.781 A limitation: AIC can only be compared between models fitted to the same data; R-squared can be compared between models fitted to any data. Alternatives: • Correlation coefficient: between the observed and predicted value of your dependent variable. The resulting measure is comparable between models and datasets, but assumes relationships are linear. # compute correlation cor(p_illness, illness) ##  0.8897922 • MSE: enables comparison between models and datasets. # compute MSE census %>% add_predictions(model) %>% summarise(MSE = mean((illness - pred)^2)) MSE 2.43 ### 2.4.2 Individual Coefficients We assess the sign, size and statistical significance of coefficients. # summary output and CIs in a single table cbind(summary(model)$coefficients, confint(model) )
##                  Estimate Std. Error   t value     Pr(>|t|)      2.5 %
## (Intercept)    4.00287606 1.20565723  3.320078 9.963810e-04  1.6314881
## No_Quals       0.44312392 0.03419171 12.959982 1.486323e-31  0.3758728
## Professionals -0.09663492 0.04417488 -2.187554 2.937473e-02 -0.1835218
## Age_65plus     0.28828276 0.02157319 13.363009 4.278867e-33  0.2458508
##                    97.5 %
## (Intercept)    6.37426402
## No_Quals       0.51037505
## Professionals -0.00974805
## Age_65plus     0.33071472

You can also visualise the regression outputs using coefplot: Fig.2 plotting regression coefficients

coefplot(model, title = " ", xlab = "Coefficient", ylab = " ",
intercept=FALSE, horizontal = FALSE,
color = "black", offset=0.2, sort='magnitude', fillColor = "grey")

TASK #4 Estimate an additional model and compare the estimates with model above

# 3 Dummy Variables

So far, we have explored how to treat independent variables measured in a continuous scale. What about if we want to understand differences across qualitative attributes eg.

• Where is the rate of long-term illness higher in particular regions of the UK? Does it vary much across regions?

• Is there a gender pay gap?

• How does ethnicity influence the industry people work?

• How does marital status influence house ownership?

• What age group is more migratory?

Now we focus on how to quantify the relationship between qualitative attributes and an outcome variable.

## 3.1 The Theory

What Are Dummy Variables (DVs)?

• are categorical variables that take the value of 0 or 1

• serve to quantify the relationship between an outcome variable and a qualitative attribute

We can create DVs from categorial data or classifying continuous variables into categories (eg. income quantiles).

id gender female
1 female 1
2 female 1
3 male 0
4 female 1

### 3.1.2 Example: 3+ Categories

id qualification no qualification degree
1 degree 0 1
2 no qual 1 0
3 no qual 1 0
4 2+ A levels 0 0

## 3.2 Practice

Let’s explore how the % of long-term illness vary across regions in the UK.

We have 10 regions in the dataset, so let’s create our dummy variables.

# create dummy variables
region_dv <- dummy.code(census$Region) census <- cbind(census, region_dv) census[1:10,c(2,24:34)] Regionp_illnessSouth EastEast of EnglandEast MidlandsNorth WestSouth WestLondonWest MidlandsWalesYorkshire and The HumberNorth East South East20.81000000000 North West21.20001000000 East Midlands20.40010000000 South East21.71000000000 East Midlands22.20010000000 South East17.31000000000 South East14.71000000000 East of England19.20100000000 London18.90000010000 London13.30000010000 NOTE: To fit regression models, we need a base category ie. we need to exclude one of the categories. If our variables has 5 categories, we only include 4 dummy variables. The category that is excluded serves as our base category. ## 3.3 Estimation Using DVs $illness = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \beta_{5} D_{2} + .... + \beta_{12} D_{9} + \epsilon$ attach(census) # specify a model equation eq2 <- illness ~ No_Quals + Professionals + Age_65plus + as.matrix(census[,c(25:29, 31:34)]) model2 <- lm(formula = eq2, data = census) export_summs(model2) Model 1 (Intercept)3.09 ** (0.95) No_Quals0.45 *** (0.03) Professionals-0.01 (0.04) Age_65plus0.30 *** (0.02) as.matrix(census[, c(25:29, 31:34)])South East-1.03 *** (0.28) as.matrix(census[, c(25:29, 31:34)])East of England-1.73 *** (0.30) as.matrix(census[, c(25:29, 31:34)])East Midlands-0.71 * (0.31) as.matrix(census[, c(25:29, 31:34)])North West0.98 ** (0.32) as.matrix(census[, c(25:29, 31:34)])South West-0.00 (0.35) as.matrix(census[, c(25:29, 31:34)])West Midlands-1.00 ** (0.34) as.matrix(census[, c(25:29, 31:34)])Wales2.46 *** (0.37) as.matrix(census[, c(25:29, 31:34)])Yorkshire and The Humber-0.87 * (0.36) as.matrix(census[, c(25:29, 31:34)])North East1.25 ** (0.43) N348 R20.89 *** p < 0.001; ** p < 0.01; * p < 0.05. In a graph: Fig.3 regression coefficients # plot coefficients coefplot(model2, title = " ", xlab = "Coefficient", ylab = " ", intercept=FALSE, horizontal = FALSE, color = "black", offset=0.2, sort='magnitude', fillColor = "grey") ## 3.4 Alternative approaches to DVs If the independent variable you want as a DV is a factor, you can just execute lm(): eq3 <- illness ~ No_Quals + Professionals + Age_65plus + Region model3 <-lm(formula = eq3, data = census) Or you can create dummy variables in the lm() function using factor(). Note: this functionality may not be available with all packages and modelling frameworks. # dummy variables using factor() eq4 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region) model4 <- lm(formula = eq4, data = census) TASK #5 Compare the results from these approaches by: # comparing models export_summs(model2, model3, model4) ## 3.5 Changing The Base Category We can use the relevel() function to directly specify the base category. Note: • It achieves this by changing the order of the levels in a factor variable • This causes a permanent change in the order of the factor levels so remember to reverse the change, or create a second version of your data frame • relevel() only works for unordered factors # check the order of levels levels(census$Region)
##   "East Midlands"            "East of England"
##   "London"                   "North East"
##   "North West"               "South East"
##   "South West"               "Wales"
##   "West Midlands"            "Yorkshire and The Humber"
# change reference category
census$Region <- relevel(census$Region, ref="West Midlands")
# check levels again
levels(census\$Region)
##   "West Midlands"            "East Midlands"
##   "East of England"          "London"
##   "North East"               "North West"
##   "South East"               "South West"
##   "Wales"                    "Yorkshire and The Humber"
# estimate a new regression
eq5 <- illness ~ No_Quals + Professionals + Age_65plus + factor(Region)
model5 <- lm(formula = eq5, data = census)
export_summs(model5)
Model 1
(Intercept)2.09 *
(1.03)
No_Quals0.45 ***
(0.03)
Professionals-0.01
(0.04)
Age_65plus0.30 ***
(0.02)
factor(Region)East Midlands0.29
(0.28)
factor(Region)East of England-0.73 **
(0.27)
factor(Region)London1.00 **
(0.34)
factor(Region)North East2.25 ***
(0.40)
factor(Region)North West1.97 ***
(0.28)
factor(Region)South East-0.04
(0.27)
factor(Region)South West0.99 **
(0.32)
factor(Region)Wales3.45 ***
(0.33)
factor(Region)Yorkshire and The Humber0.13
(0.33)
N348
R20.89
*** p < 0.001; ** p < 0.01; * p < 0.05.

# 4 Logistic Regression

We have learned how to estimate the associations for models with independent continuous and categorical variables when the outcome variable is continuous.

$y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} D_{1} + \epsilon$

What about when the outcome variable is categorical?

Examples:

• Health outcomes/behaviours: eg. smoking, drinking, cancer, heart attack, HIV, etc.

• Employment outcomes: unemployed, employed full-time, employed part-time, self-employed, job satisfaction, etc.

• Decision making processes: University, Brexit vote, travelling, migration, long-distance commuting, etc.

To motivate this session, we will seek to answer the following question: Who are more likely to use sustainable transport?

# clean workspace
rm(list=ls())
load("../data/data_qlfs.RData") 

Create our sustainable transport variable

qlfs <- mutate(qlfs,
sus_tr = ifelse(
TravelMode == "Motorbike,moped,scooter"
| TravelMode == "Bicycle"
| TravelMode =="Walk"
, 1, 0))

## 4.1 Probability, Odds & Log-odds

They all use a different denominator so they have different meaning.

An example:

attach(qlfs)
t1 <- table(sus_tr, Sex)

### 4.1.1 Probability

Probability of using sustainable transport if female:

1148 / (26189 + 1148) = 0.041

### 4.1.2 Odds

Odds of using sustainable transport if female:

1,148 / 26,189 = 0.044

### 4.1.3 Log-odds

Log-odds of using sustainable transport if female

$$log({1,148}/{26,189})$$ = $$log(0.044)$$ = -3.12

Note:

• Probabilities can vary between 0 and 1
• Odds can vary between 0 and infinity
• Log-odds can vary between -ve infinity and +ve infinity

## 4.2 Formal Definition

Logistic regression is one of a family of regression models known as “General Linear Models” (GLMs).

GLMs are characterised by three properties:

• The assumed distribution of the model errors

• The transformation (link function) applied to the outcome variable

• The way in which model error (deviance) is measured (log likelihood)

Logistic regression models:

• Predict the the log-odds of an event happening based on at least one independent variable vs. linear regression - estimates the average value

• Dependent variable: qualitative (dummy or binary) variable vs. linear regression - continuous variable

• Requires estimates of the dependent variable to lie between 0 and 1 (i.e. positive values) vs. linear regression - continuous variable

• Assumes a logistic (binomial) distribution vs. linear regression - normal distribution

Various names: binary regression model, discrete choice model, probability regression model, and qualitative response regression model

In mathemathical terms for a model with a single independent variable:

$log(p(1-p)) = \alpha + \beta_{1} x_{1}$ which can be rearrange:

$p = \frac{exp(\alpha + \beta_{1} x_{1}) }{ 1 + exp(\alpha + \beta_{1} x_{1})}$

$$p$$: probability of an event happening

$$\alpha$$: regression intercept

$$\beta_{1}$$: regression coefficient associated to $$x_{1}$$

## 4.3 Interpretation

### 4.3.1 Log-odds

Intercept is the log-odds of an event happening (ie. $$y=1$$) if the value of the explanatory variables $$x$$s is zero.

Slope is the estimated change in the log-odds for one unit change in $$x_{k}$$, holding all other variables constant.

### 4.3.2 Interpretation of $$Exp(\beta_{k} x_{k})$$

They give the expected change in the odds for a unit change in $$x$$s, holding all other variables constant.

$$Exp(\beta_{k} x_{k})$$ = 1, if $$\beta = 0$$: indicates is equally likely to occur.

$$Exp(\beta_{k} x_{k})$$ > 1, if $$\beta > 0$$: indicates an event is more likely to occur, or the odds are ‘$$\beta_{k}$$ times larger’

$$Exp(\beta_{k} x_{k})$$ < 1, if $$\beta < 0$$: indicates an event is less likely to occur, or the odds are ‘$$\beta_{k}$$ times smaller’

## 4.4 Practice

### 4.4.1 General Function

In R, we use the basic syntax for fitting a general linear model:

# general function
glm(outcome ~ predictor(s), data = dataframe,
family = name of assumed error distribution(link = name of a link function),
na.action = na.exlcude  OR  na.fail) )

Hence, to fit a logistic regression model, the code required is:

# estimate a logistic regression
glm(outcome ~ predictor(s), data= data.frame,
na.action=na.exclude)

## 4.5 Estimation

# specify a model
eq1 <- sus_tr ~ Sex + AgeGroup + NetPay
# estimate model
model1 <- glm(eq1, data= qlfs,
na.action=na.exclude)

# coefficients (log-odds)
round(coefficients(model1),2)
##   (Intercept)     SexFemale AgeGroup30-44 AgeGroup45-64   AgeGroup65+
##         -3.51          0.09          0.43          0.17         -2.25
##        NetPay
##          0.00
# odds ratio
round(exp(coef(model1)),2)
## odds ratios and 95% CI
round(exp(cbind("Odds-ratio" = coef(model1), confint(model1))),2)
## Waiting for profiling to be done...
##               Odds-ratio 2.5 % 97.5 %
## (Intercept)         0.03  0.03   0.03
## SexFemale           1.09  0.98   1.21
## AgeGroup30-44       1.53  1.33   1.76
## AgeGroup45-64       1.19  1.04   1.36
## AgeGroup65+         0.11  0.08   0.14
## NetPay              1.00  1.00   1.00

What is the base category for Sex? and for Age Group? Hint: excluded category

Interpretation:

For SexFemale, a $$Exp(\beta_{k} x_{k})$$ of $$1.09$$ indicates that being female changes the odds of using sustainable transport (versus males) by a factor of $$1.09$$, holding all other variables constant. i.e. females are 1.09 more likely than males to use sustainable transport.

For NetPay, a $$Exp(\beta_{k} x_{k})$$ of $$1$$ indicates that for every one pound change, the odds of using sustainable transport by a factor of $$1$$ (versus not using sustainable transport), holding all other variables constant. ie. equal chance.

Note: R produces the odds ratio for the intercept but it is rarely interpreted.

## 4.6 Model Assessment

Full model output:

summary(model1)
##
## Call:
## glm(formula = eq1, family = binomial(link = "logit"), data = qlfs,
##     na.action = na.exclude)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.4193  -0.2991  -0.2515  -0.0821   3.3800
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -3.511e+00  6.306e-02 -55.680  < 2e-16 ***
## SexFemale      8.791e-02  5.319e-02   1.653   0.0984 .
## AgeGroup30-44  4.269e-01  7.175e-02   5.950 2.68e-09 ***
## AgeGroup45-64  1.719e-01  6.908e-02   2.488   0.0128 *
## AgeGroup65+   -2.253e+00  1.607e-01 -14.026  < 2e-16 ***
## NetPay         2.062e-03  8.697e-05  23.711  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 14406  on 45220  degrees of freedom
## Residual deviance: 12842  on 45215  degrees of freedom
##   (39471 observations deleted due to missingness)
## AIC: 12854
##
## Number of Fisher Scoring iterations: 8

The summary output indicates:

• Residuals: summary statistics of the distribution of residuals (model error)

• Coefficients: model coefficients and their statistical significance. Note: The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable eg. being female, versus male, changes the log odds of using sustainable transport by 0.09.

• Deviance: Deviances for a null and full model. If the Residual deviance < Null deviance, the better the model

• AIC: Akaike’s Information Criterion - the lower the better the model.

### 4.6.1 Overall Fit

Formal chi-squared test based on the null and residual model deviances.

# computing the p-value for the test
with(model1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
##  0

TASK #5: What is the null hypothesis of this test?

Interpretation of the test: The chi-square of 1563.814 with 5 degrees of freedom and an associated p-value of less than 0 tells us that our model is a significantly better fit to the data than an empty model.

To compare across different model specifications use the AIC.

## 4.7 Visualising Probabilities

### 4.7.1 Estimating Probabilities

Create a new data set for Female with varying values for AgeGroup and PayNet

df_p <- with(qlfs, data.frame(Sex = factor("Female", levels = c("Male", "Female"))
, AgeGroup = factor(c("16-29", "30-44", "45-64", "65+"))
, NetPay = rep(seq(from = 200, to = 800, length.out = 100), 4)
)
)

Computing probabilities and their CIs:

# add se
df_pci <- cbind(df_p, predict(model1
, newdata = df_p,
se = TRUE)
)
df_pci <- within(df_pci, {
p_prob <- plogis(fit)
ci95_lb <- plogis(fit - (1.96 * se.fit))
ci95_ub <- plogis(fit + (1.96 * se.fit))
})

#View
head(df_pci)
SexAgeGroupNetPayfitse.fitresidual.scaleci95_ubci95_lbp_prob
Female16-29200-3.010.061410.0526 0.0418 0.0469
Female30-44206-2.570.048410.0775 0.065  0.071
Female45-64212-2.810.046610.0616 0.0519 0.0566
Female65+218-5.230.155 10.007220.003950.00534
Female16-29224-2.960.062 10.0552 0.0438 0.0492
Female30-44230-2.520.048410.0812 0.0681 0.0744