Correlation

Correlation Visualisation & Measures

Francisco Rowe

2020-08-31

In this session, we turn our focus to understanding how we can measure the relationship between two variables.

Note: Part of Introduction to Statistical Learning in R. Correlation – Correlation Visualisation & Measures by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

We are often interested in associations, e.g. How is unemployment associated with education? How is commuting associated with income?

1 Read data

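# clean workspace
rm(list=ls())
# load the packages used in this session
library(ggplot2)      # plots
library(Hmisc)        # rcorr()
library(ggcorrplot)   # ggcorrplot(), cor_pmat()
library(RColorBrewer) # brewer.pal()
library(viridis)      # scale_fill_viridis()
# load data
load("../data/data_census.RData")
# preview the first rows of the data
head(census)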

TASK #1 Explore the structure of the data

2 Correlation: The Logic

Fig.1 Relationship between unemployment and education across UK districts

ggplot(data=census) +
  geom_point( aes(y= Unemployed, x= No_Quals) )   +
  geom_smooth(aes(y= Unemployed, x= No_Quals), method = "lm", se=FALSE) +
    # Add labels
    labs(title= paste(" "), y="Unemployed (%)", x="No Qualification (%)") +
    theme_classic() +
    theme(axis.text=element_text(size=14))
## `geom_smooth()` using formula 'y ~ x'

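A correlation coefficient measures the strength and direction of the relationship between two variables. Pearson's and Spearman's coefficients range from -1 (perfect negative relationship) through 0 (no relationship) to 1 (perfect positive relationship).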
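To make the idea concrete, here is a minimal sketch of how Pearson's coefficient can be computed by hand: the covariance of the two variables divided by the product of their standard deviations. It uses R's built-in mtcars data rather than the census data, so the variable names are only an illustration.

```r
# Toy data from R's built-in mtcars dataset (an assumption for illustration,
# not the census data used in this session)
x <- mtcars$hp
y <- mtcars$mpg

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor() should return the same value
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```

The sign of the coefficient gives the direction of the relationship, and its absolute value gives the strength.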

2.1 Correlation & Type of Data

Measure | Type of Data
--- | ---
Pearson | symmetrical continuous distributions
Spearman Rank | one or both skewed distributions
Spearman Rank | both ordinal
Cramer’s V | one or both nominal

In practice: Cramer’s V is rarely used.
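For completeness, Cramer’s V can be derived from a chi-squared test on a contingency table. The sketch below is one common way to compute it, using R's built-in HairEyeColor table as stand-in data (an assumption; the census data are not used here).

```r
# Toy data: hair colour by eye colour from R's built-in HairEyeColor table
# (an assumption for illustration, not part of the census data)
tab <- margin.table(HairEyeColor, c(1, 2))

# Chi-squared statistic for the contingency table
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

n <- sum(tab)                      # number of observations
k <- min(nrow(tab), ncol(tab))     # smaller of the two table dimensions

# Cramer's V: ranges from 0 (no association) to 1 (perfect association)
v <- sqrt(chi2 / (n * (k - 1)))
v
```

Unlike Pearson's and Spearman's coefficients, Cramer’s V has no sign, because nominal categories have no order.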

3 Correlation

Correlation between continuous variables

# Pearson correlation
cor(census$No_Quals, census$Unemployed, method="pearson")
## [1] 0.5500458
# Spearman correlation
cor(census$No_Quals, census$Unemployed, method="spearman")
## [1] 0.569688
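Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, which is why it is less sensitive to skewed distributions. A minimal sketch with simulated data (the variables x and y are made up for illustration):

```r
# Simulated, right-skewed toy data (an assumption for illustration)
set.seed(1)
x <- rexp(100)
y <- x^2 + rnorm(100, sd = 0.5)

# Spearman correlation as computed by cor()
rho_builtin <- cor(x, y, method = "spearman")

# Pearson correlation of the ranks gives the same value
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_builtin, rho_ranks)
```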

To compute the correlation between all pairs of variables in a data frame:

pc <- cor( census[ , -c(1:5) ], method="pearson" )
round(pc, 2)

TASK #2 Identify the 3 variables most strongly and most weakly correlated with the % of residents in ill health (illness).

TASK #3 Create graphs visualising examples of strong, moderate and weak correlations.

4 Testing Statistical Significance

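To test whether correlations are statistically significant, we use a different function, rcorr(), from the Hmisc package. It returns the correlation matrix, the number of observations (n) and the p-values (P).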

pc <- rcorr(as.matrix(census[, 6:10]), type = "pearson")
pc
##                  illness Age_65plus Couple_with_kids Crowded Flats
## illness             1.00       0.51            -0.45   -0.32 -0.40
## Age_65plus          0.51       1.00            -0.11   -0.69 -0.56
## Couple_with_kids   -0.45      -0.11             1.00   -0.20 -0.46
## Crowded            -0.32      -0.69            -0.20    1.00  0.70
## Flats              -0.40      -0.56            -0.46    0.70  1.00
## 
## n= 348 
## 
## 
## P
##                  illness Age_65plus Couple_with_kids Crowded Flats 
## illness                  0.0000     0.0000           0.0000  0.0000
## Age_65plus       0.0000             0.0374           0.0000  0.0000
## Couple_with_kids 0.0000  0.0374                      0.0002  0.0000
## Crowded          0.0000  0.0000     0.0002                   0.0000
## Flats            0.0000  0.0000     0.0000           0.0000
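To see where such p-values come from, the sketch below applies the standard t test for a Pearson correlation by hand and compares it with base R's cor.test(). It uses the built-in mtcars data as a stand-in (an assumption), and it illustrates the usual test rather than the internals of rcorr().

```r
# Toy data from R's built-in mtcars dataset (an assumption for illustration)
x <- mtcars$mpg
y <- mtcars$wt
n <- length(x)

r <- cor(x, y, method = "pearson")

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# two-sided p-value
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# cor.test() performs the same test
p_builtin <- cor.test(x, y, method = "pearson")$p.value

all.equal(p_manual, p_builtin)
```

For a given correlation, a larger sample gives a larger t statistic, so even weak correlations can be statistically significant when n is large.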

5 Visualisation

Visualising correlation matrices

Fig.2 Correlogram

# get correlations
pc <- cor( census[ , -c(1:5) ], method="pearson" )
ggcorrplot(pc)

You can adjust the options and add the statistical significance:

# get p-values: cor_pmat() expects the data, not the correlation matrix
sig <- cor_pmat(census[ , -c(1:5) ])
# draw correlogram
# note: scale_fill_viridis() replaces the colours set via `colors`
ggcorrplot(pc, method = "square", type= "upper", 
          ggtheme = ggplot2::theme_classic,
          hc.order= TRUE, colors = brewer.pal(n = 3, name = "RdBu"), 
          outline.col = "white", lab = FALSE,
          p.mat = sig) + scale_fill_viridis(option="inferno")
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

There are more functions and packages you can explore.

Correlation between continuous and categorical variables

Using the Quarterly Labour Force Survey (QLFS) and density distributions:

Fig.3 Net Pay Density by Ethnicity

# df is assumed to hold the QLFS data, with columns NetPay and EthnicGroup
ggplot(data= df) +
  geom_density(alpha=0.5, colour="#FF6666", aes(x = NetPay, fill = EthnicGroup))

Another way to visualise data distributions is using boxplots:

ggplot(data = df) +
  geom_boxplot(aes(x = EthnicGroup, y= NetPay, fill= EthnicGroup)) +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=8), legend.position="none") +
  scale_x_discrete(name="Ethnic Group") +
  scale_y_continuous(limits = c(0, 1500), name="Weight (Net Pay (weekly))")

6 Appendix: Concepts and Functions to Remember

Function | Description
--- | ---
cor() | compute Pearson’s (method="pearson") or Spearman’s (method="spearman") correlation
rcorr(), cor_pmat() | compute statistical significance