We are often interested in associations. For example: How is unemployment associated with education? How is commuting associated with income?

1 Read data

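# clean workspace
rm(list = ls())
# load packages used below
library(ggplot2)     # plots
library(Hmisc)       # rcorr()
library(ggcorrplot)  # ggcorrplot(), cor_pmat()
# load data
load("../data/data_census.RData")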

# introduce the data
head(census)

Each row corresponds to a district.
The dataset records the percentage of persons in each district falling into a range of socio-demographic categories, except for the first five columns, which record geographic information.

TASK #1 Explore the structure of the data
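A few base R functions are a common starting point for this kind of exploration. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) rather than `census`, so it can be run on its own; the same calls should apply to `census`.

```r
# Hypothetical stand-in for `census`: two geography columns, two percentages
toy <- data.frame(
  district   = c("A", "B", "C", "D"),
  region     = c("North", "North", "South", "South"),
  Unemployed = c(4.1, 6.3, 3.2, 5.0),
  No_Quals   = c(20.5, 28.1, 15.7, 24.9)
)

dim(toy)      # number of rows (districts) and columns (variables)
names(toy)    # column names
str(toy)      # type of each column and its first values
summary(toy)  # minimum, quartiles, mean and maximum of numeric columns
```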

2 Correlation: The Logic

Fig.1 Relationship between unemployment and education across UK districts

ggplot(data = census) +
  geom_point(aes(y = Unemployed, x = No_Quals)) +
  geom_smooth(aes(y = Unemployed, x = No_Quals), method = "lm", se = FALSE) +
  # Add labels
  labs(title = paste(" "), y = "Unemployed (%)", x = "No Qualification (%)") +
  theme_classic() +
  theme(axis.text = element_text(size = 14))

## `geom_smooth()` using formula 'y ~ x'

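The correlation coefficient measures the strength and direction of the relationship between two variables. It ranges from -1 (perfect negative association) to 1 (perfect positive association), with 0 indicating no linear (Pearson) or monotonic (Spearman) association.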
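To make the definition concrete, Pearson's coefficient can be computed by hand and compared with `cor()`. This is a minimal sketch on made-up values; `x`, `y`, `r_manual` and `r_builtin` are illustrative names, not part of the census data.

```r
# Made-up values for illustration only
x <- c(10, 12, 15, 19, 24, 30)
y <- c(2.0, 2.9, 3.1, 4.2, 4.8, 6.5)

# Pearson's r: sum of cross-products of deviations from the means,
# divided by the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y, method = "pearson")

# Both should agree up to floating-point error
all.equal(r_manual, r_builtin)
```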

2.1 Correlation & Type of Data

Measure	Type of Data
Pearson	symmetrical continuous distributions
Spearman Rank	one or both skewed distributions
Spearman Rank	both ordinal
Cramer’s V	one or both nominal

In practice, Cramer’s V is rarely used.
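For reference, Cramer's V is derived from the chi-squared statistic of a contingency table: V = sqrt(chi2 / (n * (k - 1))), where n is the total count and k is the smaller of the number of rows and columns. A minimal base R sketch follows; the helper name `cramers_v` and the counts are made up for illustration and do not come from the census data.

```r
# Hypothetical helper: Cramer's V for a two-way contingency table
cramers_v <- function(tab) {
  # chi-squared statistic without continuity correction
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)                        # total number of observations
  k <- min(nrow(tab), ncol(tab))       # smaller table dimension
  sqrt(chi2 / (n * (k - 1)))
}

# Made-up counts: tenure (rows) by region (columns)
tab <- matrix(c(30, 10, 20,
                15, 25, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(tenure = c("Own", "Rent"),
                              region = c("North", "Midlands", "South")))

cramers_v(tab)  # between 0 (no association) and 1 (perfect association)
```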

3 Correlation

Correlation between continuous variables

# Pearson correlation
with(census, cor(No_Quals, Unemployed, method = "pearson"))

## [1] 0.5500458

# Spearman correlation
with(census, cor(No_Quals, Unemployed, method = "spearman"))

## [1] 0.569688
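The two coefficients differ because Spearman's works on ranks rather than raw values. The sketch below illustrates this with made-up data in which `y` always rises with `x` but not in a straight line; the variable names are illustrative only.

```r
# Made-up data: y rises with x, but not in a straight line
x <- 1:10
y <- exp(x / 2)

cor(x, y, method = "pearson")   # below 1: the relationship is not linear
cor(x, y, method = "spearman")  # 1: the ordering of x and y matches exactly

# Spearman's coefficient is Pearson's coefficient computed on ranks
all.equal(cor(x, y, method = "spearman"),
          cor(rank(x), rank(y), method = "pearson"))
```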

To compute the correlation between all pairs of variables in a data frame:

pc <- cor( census[ , -c(1:5) ], method="pearson" )
round(pc, 2)

TASK #2 Identify the 3 variables most strongly and most weakly correlated with the % of residents in ill health (illness).
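One possible approach is to take a single row of the correlation matrix and order it by absolute value. The sketch below uses the built-in `mtcars` data as a stand-in for `census`, with `mpg` playing the role of the variable of interest; `pc_demo`, `r_mpg` and `ranked` are illustrative names.

```r
# mtcars ships with R and stands in for `census` here
pc_demo <- cor(mtcars, method = "pearson")

# Correlations of every other variable with mpg (dropping mpg itself)
r_mpg <- pc_demo["mpg", colnames(pc_demo) != "mpg"]

# Order by absolute value: sign gives direction, magnitude gives strength
ranked <- r_mpg[order(abs(r_mpg), decreasing = TRUE)]

head(ranked, 3)  # three strongest correlations
tail(ranked, 3)  # three weakest correlations
```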

TASK #3 Create graphs visualising examples of strong, moderate and weak correlations.

4 Testing Statistical Significance

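To test statistical significance we use a different function, rcorr(), from the Hmisc package. It returns the correlation matrix, the number of observations and the p-values.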

pc <- rcorr(as.matrix(census[, 6:10]), type = "pearson")
pc

##                  illness Age_65plus Couple_with_kids Crowded Flats
## illness             1.00       0.51            -0.45   -0.32 -0.40
## Age_65plus          0.51       1.00            -0.11   -0.69 -0.56
## Couple_with_kids   -0.45      -0.11             1.00   -0.20 -0.46
## Crowded            -0.32      -0.69            -0.20    1.00  0.70
## Flats              -0.40      -0.56            -0.46    0.70  1.00
## 
## n= 348 
## 
## 
## P
##                  illness Age_65plus Couple_with_kids Crowded Flats 
## illness                  0.0000     0.0000           0.0000  0.0000
## Age_65plus       0.0000             0.0374           0.0000  0.0000
## Couple_with_kids 0.0000  0.0374                      0.0002  0.0000
## Crowded          0.0000  0.0000     0.0002                   0.0000
## Flats            0.0000  0.0000     0.0000           0.0000
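The p-values printed as 0.0000 above appear to be rounded to four decimal places, so they are very small rather than exactly zero. For a single pair of variables, base R's `cor.test()` reports the unrounded p-value together with a confidence interval. A minimal sketch, using the built-in `mtcars` data as a stand-in for `census`:

```r
# mtcars ships with R and stands in for `census` here
ct <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

ct$estimate   # the correlation coefficient
ct$p.value    # unrounded p-value for H0: the true correlation is 0
ct$conf.int   # 95% confidence interval for the coefficient
```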

5 Visualisation

Visualising correlation matrices

Fig.2 Correlogram

# get correlations
pc <- cor( census[ , -c(1:5) ], method="pearson" )
ggcorrplot(pc)

You can adjust the options and add the statistical significance:

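# get p-values from the data (not from the correlation matrix)
sig <- cor_pmat(census[ , -c(1:5) ])
# draw correlogram (brewer.pal() needs RColorBrewer; scale_fill_viridis() needs viridis)
library(RColorBrewer)
library(viridis)
ggcorrplot(pc, method = "square", type = "upper",
           ggtheme = ggplot2::theme_classic,
           hc.order = TRUE, colors = brewer.pal(n = 3, name = "RdBu"),
           outline.col = "white", lab = FALSE,
           p.mat = sig) + scale_fill_viridis(option = "inferno")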

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

There are more functions and packages you can explore:

corrplot() - corrplot package
ggcorr() - GGally package

Correlation between continuous and categorical variables

Using the Quarterly Labour Force Survey (QLFS) and density distributions:

Fig.3 Net pay density by ethnicity

ggplot(data= df) +
  geom_density(alpha=0.5, colour="#FF6666", aes(x = NetPay, fill = EthnicGroup))

Another way to visualise data distributions is using boxplots:

ggplot(data = df) +
  geom_boxplot(aes(x = EthnicGroup, y= NetPay, fill= EthnicGroup)) +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=8), legend.position="none") +
  scale_x_discrete(name="Ethnic Group") +
  scale_y_continuous(limits = c(0, 1500), name="Weight (Net Pay (weekly))")
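The thick line in each box is the group median, so a table of medians gives a numeric companion to the plot. The sketch below uses made-up data; `pay`, `Group` and the values are hypothetical and only mimic the shape of the `df` used above.

```r
# Made-up weekly pay for two hypothetical groups
pay <- data.frame(
  Group  = rep(c("G1", "G2"), each = 5),
  NetPay = c(310, 420, 380, 510, 450,
             290, 350, 330, 600, 400)
)

# Median and interquartile range of NetPay within each group
aggregate(NetPay ~ Group, data = pay,
          FUN = function(v) c(median = median(v), IQR = IQR(v)))
```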

Function	Description
cor()	compute Pearson’s (method="pearson") or Spearman’s (method="spearman") correlation
rcorr(), cor_pmat()	compute statistical significance

Correlation

Correlation Visualisation & Measures

Francisco Rowe

2020-08-31