In this session, we turn our focus to understanding how we can measure the relationship between two variables.
Part of Introduction to Statistical Learning in R. Correlation – Correlation Visualisation & Measures by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
We are often interested in associations, e.g. how is unemployment associated with education? How is commuting associated with income?
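# clean workspace
rm(list = ls())
# load the packages used in this session
library(ggplot2)
library(Hmisc)
library(ggcorrplot)
library(RColorBrewer)
library(viridis)
# load data
load("../data/data_census.RData")
# introduce the data
head(census)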
Each row corresponds to a district.
The dataset records the % of persons in each district falling into a range of socio-demographic categories, except for the first five columns, which record geographic information.
TASK #1 Explore the structure of the data
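As a starting point for this task, the base R functions below are commonly used to inspect a data frame. This is only a sketch: `toy` and its values are made up for illustration and stand in for `census`, although the column names `Unemployed` and `No_Quals` are borrowed from it.

```r
# Hypothetical stand-in for the census data frame (values are invented)
toy <- data.frame(
  district   = c("A", "B", "C"),
  Unemployed = c(4.2, 6.1, 3.5),
  No_Quals   = c(20.1, 27.4, 18.9)
)

dim(toy)      # number of rows and columns
names(toy)    # column names
str(toy)      # type and first values of each column
summary(toy)  # basic descriptive statistics per column
```

Replacing `toy` with `census` should give the same kind of overview for the real data.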
Fig. 1 Relationship between unemployment and education across UK districts
ggplot(data = census) +
  geom_point(aes(y = Unemployed, x = No_Quals)) +
  geom_smooth(aes(y = Unemployed, x = No_Quals), method = "lm", se = FALSE) +
  # Add labels
  labs(title = paste(" "), y = "Unemployed (%)", x = "No Qualification (%)") +
  theme_classic() +
  theme(axis.text = element_text(size = 14))
## `geom_smooth()` using formula 'y ~ x'
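A correlation coefficient measures the strength of the relationship between two variables. Pearson and Spearman coefficients range from -1 to 1, and the sign gives the direction of the relationship.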
Measure | Type of Data |
---|---|
Pearson | symmetrical continuous distributions |
Spearman Rank | one or both skewed distributions |
Spearman Rank | both ordinal |
Cramer’s V | one or both nominal |
In practice: Cramer’s V is rarely used.
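Base R has no built-in function for Cramér's V, but it can be derived from the chi-squared statistic as V = sqrt(chi2 / (n * (k - 1))), where n is the number of observations and k is the smaller of the number of rows and columns in the contingency table. The sketch below assumes two nominal vectors; the helper `cramers_v` and the toy vectors `tenure` and `region` are hypothetical and not part of the census data.

```r
# Hypothetical helper (not part of base R): Cramér's V for two nominal vectors
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Chi-squared statistic without continuity correction; warnings about
  # small expected counts are suppressed for this toy example
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n <- sum(tab)        # total number of observations
  k <- min(dim(tab))   # smaller of the number of rows and columns
  sqrt(chi2 / (n * (k - 1)))
}

# Invented toy data: two nominal variables with two categories each
tenure <- c("own", "rent", "own", "rent", "own",
            "rent", "own", "own", "rent", "rent")
region <- c("north", "north", "south", "south", "north",
            "south", "north", "south", "north", "south")

cramers_v(tenure, region)
```

Values close to 0 indicate little association and values close to 1 a strong one; unlike Pearson and Spearman, Cramér's V has no sign.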
Correlation between continuous variables
attach(census)
# Pearson correlation
cor( No_Quals, Unemployed, method="pearson")
## [1] 0.5500458
# Spearman correlation
cor( No_Quals, Unemployed, method="spearman")
## [1] 0.569688
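To see why Spearman's coefficient copes better with skewed data, it helps to know that it is Pearson's coefficient computed on the ranks of the values rather than on the values themselves. The sketch below checks this on simulated data; `x` and `y` are invented here and are unrelated to the census variables.

```r
set.seed(1)

# Simulated, right-skewed data (hypothetical, for illustration only)
x <- rexp(50)
y <- x^2 + rnorm(50, sd = 0.5)

pearson          <- cor(x, y, method = "pearson")
spearman         <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y), method = "pearson")

# Spearman's coefficient matches Pearson's coefficient on the ranks
all.equal(spearman, pearson_on_ranks)
```

Because ranks ignore how far apart the values are, a few extreme observations influence Spearman's coefficient less than Pearson's.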
To compute the correlation between all possible pairs of variables in a data frame:
pc <- cor( census[ , -c(1:5) ], method="pearson" )
round(pc, 2)
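A full correlation matrix can be hard to scan by eye. One possible approach, sketched below on simulated data, is to pull out the row for a single variable and sort it by absolute size. The data frame `toy` and its columns `a` to `d` are hypothetical; the same indexing should work on a matrix such as `pc`.

```r
set.seed(42)

# Simulated data (hypothetical): b is strongly related to a,
# c is negatively related to a, d is unrelated
toy   <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.3)
toy$c <- -toy$a + rnorm(100, sd = 1)
toy$d <- rnorm(100)

m <- cor(toy, method = "pearson")

# Correlations of every other variable with "a" (drop the diagonal entry)
with_a <- m["a", colnames(m) != "a"]

# Sort from strongest to weakest, ignoring the sign
ranked <- with_a[order(abs(with_a), decreasing = TRUE)]
round(ranked, 2)
```

Sorting on the absolute value matters because a strong negative correlation is just as strong as a positive one of the same size.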
TASK #2 Identify the 3 variables most strongly and most weakly correlated with the % of residents in ill health (illness).
TASK #3 Create graphs visualising examples of strong, moderate and weak correlations.
To also obtain the statistical significance of each correlation, we use a different function, rcorr, from the Hmisc package.
pc <- rcorr(as.matrix(census[, 6:10]), type = "pearson")
pc
##                  illness Age_65plus Couple_with_kids Crowded Flats
## illness             1.00       0.51            -0.45   -0.32 -0.40
## Age_65plus          0.51       1.00            -0.11   -0.69 -0.56
## Couple_with_kids   -0.45      -0.11             1.00   -0.20 -0.46
## Crowded            -0.32      -0.69            -0.20    1.00  0.70
## Flats              -0.40      -0.56            -0.46    0.70  1.00
##
## n= 348
##
##
## P
##                  illness Age_65plus Couple_with_kids Crowded Flats
## illness                  0.0000     0.0000           0.0000  0.0000
## Age_65plus       0.0000             0.0374           0.0000  0.0000
## Couple_with_kids 0.0000  0.0374                      0.0002  0.0000
## Crowded          0.0000  0.0000     0.0002                   0.0000
## Flats            0.0000  0.0000     0.0000           0.0000
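The P block reports, for each pair of variables, the p-value of a test of the null hypothesis that the true correlation is zero. For a single pair, the same kind of test is available in base R through `cor.test()`. The sketch below uses simulated vectors `x` and `y`, which are hypothetical and not taken from the census data.

```r
set.seed(7)

# Simulated data (hypothetical), with the same number of rows as the output above
x <- rnorm(348)
y <- 0.5 * x + rnorm(348)

res <- cor.test(x, y, method = "pearson")

res$estimate  # the correlation coefficient
res$p.value   # p-value for the null hypothesis of zero correlation
res$conf.int  # 95% confidence interval for the coefficient
```

The object returned by rcorr stores its results in a similar way: the coefficients, the number of observations and the p-values are held in the elements `r`, `n` and `P`.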
Visualising correlation matrices
Fig.2 Correlogram
# get correlations
pc <- cor( census[ , -c(1:5) ], method="pearson" )
ggcorrplot(pc)
You can adjust the options and add the statistical significance:
# get p-values (computed from the data, not from the correlation matrix)
sig <- cor_pmat(census[ , -c(1:5) ])
# draw correlogram
ggcorrplot(pc, method = "square", type = "upper",
           ggtheme = ggplot2::theme_classic,
           hc.order = TRUE, colors = brewer.pal(n = 3, name = "RdBu"),
           outline.col = "white", lab = FALSE,
           p.mat = sig) + scale_fill_viridis(option = "inferno")
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
There are more functions and packages you can explore:
- corrplot, from the corrplot package
- ggcorr, from the GGally package
Correlation between continuous and categorical variables
Using data from the Quarterly Labour Force Survey (QLFS), we can compare density distributions across groups:
Fig. 3 Net Pay Density by Ethnicity
ggplot(data = df) +
  geom_density(alpha = 0.5, colour = "#FF6666", aes(x = NetPay, fill = EthnicGroup))
Another way to visualise data distributions is using boxplots:
ggplot(data = df) +
  geom_boxplot(aes(x = EthnicGroup, y = NetPay, fill = EthnicGroup)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 8), legend.position = "none") +
  scale_x_discrete(name = "Ethnic Group") +
  scale_y_continuous(limits = c(0, 1500), name = "Weight (Net Pay (weekly))")
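The centre line of each box is the group median and the box spans the interquartile range, so a numeric companion to the boxplot is a table of these summaries by group. The sketch below is one way to produce it in base R; `toy_pay`, its columns `group` and `pay`, and all values are invented and merely stand in for `EthnicGroup` and `NetPay` in `df`.

```r
# Hypothetical stand-in for the QLFS data frame (values are invented)
toy_pay <- data.frame(
  group = rep(c("g1", "g2", "g3"), each = 4),
  pay   = c(300, 350, 400, 450,
            500, 520, 560, 600,
            280, 310, 330, 900)
)

# Median and interquartile range of pay within each group
medians <- tapply(toy_pay$pay, toy_pay$group, median)
iqrs    <- tapply(toy_pay$pay, toy_pay$group, IQR)

data.frame(median = medians, IQR = iqrs)
```

Large differences in the medians relative to the spread within groups suggest that the continuous variable is associated with the categorical one.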
Function | Description |
---|---|
cor() | compute Pearson’s (method = "pearson") or Spearman’s (method = "spearman") correlation |
rcorr(), cor_pmat() | compute the statistical significance (p-values) of correlations |