In this session1, we continue with Descriptive Statistics, focusing on understanding how we can characterise a variable distribution. Each variable distribution has two key components, known as moments in Statistics: centrality and spread. We will look at the appropriate statistical measure of centrality and spread depending on the type of data in analysis.

1 Part of Introduction to Statistical Learning in R. Descriptive Statistics – Measures of Centrality & Dispersion by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
# clean workspace
rm(list=ls())
# load data
load("../data/data_qlfs.RData")
Recall, there are two main types of data:

- Categorical: the variable has response categories.
  - Nominal: no specific order to the categories eg. gender (male/female)
  - Ordinal: categories have a clear ranking eg. age groups (young; middle aged; old)
- Scale: the variable is a precise measure of a quantity.
  - Continuous (skewed): distribution of measures NOT symmetrical about the mean (skew ≠ 0) eg. income (to nearest penny)
  - Continuous (symmetrical): distribution of measures IS symmetrical about the mean (skew = 0) eg. height (to nearest mm)
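One quick way to check the data type of each variable is to inspect how R stores it: factors correspond to categorical (nominal or ordinal) variables, whilst integer and numeric vectors correspond to scale variables.

# inspect how each variable is stored
str(qlfs)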
The graphs below illustrate the difference between a symmetrical and a skewed distribution:
Fig.1 Skewness of a distribution
TASK #1 Work out the data type for each of the following variables:
Central tendency is statistical jargon for ‘the average / mean’.
It is important to realise that the statistic used to measure the average varies by data type:
Data Type | Measure of average |
---|---|
Nominal | Mode |
Ordinal | Median |
Scale (skewed)1 | Median |
Scale (symmetrical)1 | Mean |
1 Symmetrical = skew close to 0 (ie. in the range -0.5 to 0.5)
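To decide between the mean and the median for a scale variable, we can estimate its skew. Base R has no skewness function (add-on packages such as e1071 provide one), but as a rough sketch the moment-based formula can be coded directly:

# moment-based sample skewness: values near 0 indicate a symmetrical distribution
skew <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}
skew(qlfs$Age) # within -0.5 to 0.5 suggests the mean; otherwise use the median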
# attach data
attach(qlfs)
mean(Age)
## [1] 46.08521
median(Age)
## [1] 46
R does not have a standard built-in function to calculate the statistical mode (the base mode( ) function returns the storage mode of an object instead), so we create a user-defined function to calculate the mode of a data set in R.
## create the function
mode <- function(v) {
  uniqv <- unique(v)                          # the distinct values of v
  uniqv[which.max(tabulate(match(v, uniqv)))] # return the value with the highest count
}
mode(Age)
## [1] 41
Or you can use:
summary(Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 32.00 46.00 46.09 60.00 99.00
Or, for all the variables in the data:
summary(qlfs)
TASK #2 Calculate the most appropriate measure of the ‘average’ for each of the following variables:
Dispersion is statistical jargon for the spread of a distribution.
The graphs below show two symmetrical distributions; one with a wide spread (large dispersion); and one with a narrow spread (small dispersion).
Fig.2 Dispersion
The most appropriate statistic for summarising the dispersion (spread) of a distribution also varies by data type:
Data Type | Measure of dispersion |
---|---|
Nominal | % Misclassified |
Ordinal | % Misclassified |
Scale (skewed) | Inter-Quartile Range |
Scale (symmetrical) | Standard Deviation / Variance |
People and places are complex entities. The art of statistical analysis is to distil the essence of this complexity into simplified models of reality.
The average is the simplest model of all, and is widely used: we are frequently told that average wages, health or educational outcomes, or sales have gone up or down.
This begs the immediate question: how well does our model represent the reality?
We can best illustrate this by focussing on only the first 10 cases in our data, and treating them as if they were the entire population (Fig. 3).
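The code creating this subset is not shown here; a minimal sketch, assuming we simply take the first 10 rows of qlfs (stored as qlfs.10, which is reused below):

# treat the first 10 respondents as if they were the entire population
qlfs.10 <- qlfs[1:10, ]
# plot each respondent's age (a sketch of Fig.3)
plot(1:10, qlfs.10$Age, xlab = "Respondent", ylab = "Age", pch = 16)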
Fig.3 Age of 10 respondents
As can be seen from this graph, the age of these 10 respondents varies from 18 to 75.
A simple model of respondent’s age is their average (mean) age.
We can add a dashed horizontal line to the graph to represent this model (Fig. 4).
Fig.4 Adding the mean age
Then we can add vertical lines measuring the distance (deviation) of each observation from the mean (Fig. 5).
Fig.5 Measuring the deviation
Clearly, each vertical line in the graph in Fig.5 provides a measure of the difference between one of the observations and the model or mean age.
We can represent this information as a data.frame:
df <- data.frame(Respondent = 1:10,
                 Age = qlfs.10$Age,
                 model = mean(qlfs.10$Age),
                 error = qlfs.10$Age - mean(qlfs.10$Age))
df
Respondent | Age | model | error |
---|---|---|---|
1 | 42 | 54 | -12 |
2 | 43 | 54 | -11 |
3 | 18 | 54 | -36 |
4 | 75 | 54 | 21 |
5 | 69 | 54 | 15 |
6 | 73 | 54 | 19 |
7 | 40 | 54 | -14 |
8 | 42 | 54 | -12 |
9 | 72 | 54 | 18 |
10 | 66 | 54 | 12 |
The model we are using to summarise the ages of the population members is the average (central tendency) of the dataset: specifically, the mean age of all members of the population, ie. 54.

The difference between the first person's age and the mean age is 42 - 54 = -12. This difference is known as the model error.
Looking back to the graph, it is also clear that the greater the total length of the vertical lines (deviations from the model), the worse the model fits the data.
What happens if we change our model from the assumption that everyone is of mean age to the assumption that everyone is aged 21? See Fig.6.
Fig.6 Error comparison
The answer is that the total length of the vertical distances from the model is clearly greater when the model is Age = 21 than when the model is Age = 54.
There are a number of ways of summarising the overall model error.
We can measure the total amount of error by summing up the deviations:
\[ \mbox{Total Error} = \sum{(x_i - \overline{x}) } \]
This is easy to calculate in R using the sum( ) function:
sum(df$error)
## [1] 0
Why was the total error 0? Because the mean sits at the balance point of the data: the negative deviations below the mean exactly cancel out the positive deviations above it.

To stop the errors cancelling out, we can square them before summing. This works because a negative times a negative makes a positive. Hence we want to find:
\[\mbox{Total Squared Error} = \sum{ \left(x_i - \overline{x}\right )^2 }\]
Squaring all of the errors gives the following:
df$square.error <- df$error^2
df
Respondent | Age | model | error | square.error |
---|---|---|---|---|
1 | 42 | 54 | -12 | 144 |
2 | 43 | 54 | -11 | 121 |
3 | 18 | 54 | -36 | 1296 |
4 | 75 | 54 | 21 | 441 |
5 | 69 | 54 | 15 | 225 |
6 | 73 | 54 | 19 | 361 |
7 | 40 | 54 | -14 | 196 |
8 | 42 | 54 | -12 | 144 |
9 | 72 | 54 | 18 | 324 |
10 | 66 | 54 | 12 | 144 |
From which we can derive the total squared error:
sum(df$square.error)
## [1] 3396
The problem with Total Squared Error is that the larger the number of observations, the more scope there is for model error. Ten observations with a squared error of 0.1 each have a Total Squared Error of 10 x 0.1 = 1. Yet one hundred observations, also with a squared error of 0.1 each, will have a larger Total Squared Error, because 100 x 0.1 = 10.
Of more interest is the variance: the average (mean) squared error per respondent.
\[\mbox{Var} = \sigma^2 = \frac{ \sum{ \left (x_i - \overline{x}\right )^2 } }{N}\]
This is easily calculated:
# First count and store the number of observations (rows) in the data.frame
N <- nrow(df)
N
## [1] 10
# Then calculate the variance
variance <- sum(df$square.error) / N
variance
## [1] 339.6
The larger the dispersion (spread) of the data around the mean, the greater the variance (Fig.7).
Fig.7 Variance comparison
In statistics, a distinction is often drawn between a population and a sample.
A population is the entire set of entities of interest (e.g. all persons or households in the UK).
A sample is a random sub-set of population entities (e.g. persons or household captured by the QLFS survey).
The method for calculating the variance used above assumes that we are calculating the variance of a set of population values around the population mean.
We can calculate the variance of the sample values around the sample mean in a similar fashion. However, if we want to use the sample variance as an estimate of the population variance, we need to make a slight adjustment. This is because the unadjusted sample variance is a biased estimator of the population variance. It is biased because a sample is likely to omit some of the relatively rare values at the extreme ends of the population distribution. Due to the omission of some of these rarer extreme values, the unadjusted variance of the sample will on average be slightly smaller than the variance of the population.
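This bias is easy to verify by simulation. The sketch below draws many small samples from a synthetic population with known variance (400) and compares the two estimators:

set.seed(1)
pop <- rnorm(100000, mean = 50, sd = 20)  # synthetic population, true variance = 400
samples <- replicate(10000, sample(pop, 5))
# unadjusted (divide by N) vs adjusted (divide by N-1) variance for each sample
v.unadj <- apply(samples, 2, function(s) sum((s - mean(s))^2) / length(s))
v.adj <- apply(samples, 2, function(s) sum((s - mean(s))^2) / (length(s) - 1))
mean(v.unadj) # systematically below 400
mean(v.adj)   # close to 400 on average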
In order to correct for this bias, we simply need to divide the total squared error by N-1 instead of by N.
variance.adjusted <- sum(df$square.error) / (N - 1)
variance.adjusted
## [1] 377.3333
ie. \(\mbox{Var}_{sample}= s^2 = \frac{ \sum{ \left (x_i - \overline{x}\right )^2 } }{N-1}\)
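Note that R's built-in var( ) function applies this N-1 adjustment automatically, so it reproduces our adjusted estimate:

var(df$Age)
## [1] 377.3333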
The adjusted estimator of the population variance will always be larger than the unadjusted estimator, whatever the sample size; but the size of the difference between the estimators depends upon the size of the sample.
For large samples, dividing by N or by N-1 yields almost identical estimates of the population variance. For example, 100 / 100 = 1, whilst 100 / 99 = 1.01. In contrast, for small samples the difference in estimated population variance can be noticeable: for example, 100 / 10 = 10, whilst 100 / 9 = 11.1. This reflects the fact that the smaller the sample size, the less likely the sample is to be fully representative of the population from which it is drawn.
One problem with the variance as a measure of dispersion is that it is measured in squared units. For instance, the variance of the distribution of respondents by age is measured in years squared; but years squared isn't a particularly intuitive measure of dispersion.
Fortunately we can overcome this by finding the square-root of the variance, which is known as the standard deviation. The standard deviation of a set of population values around the population mean can be calculated as follows:
\(\mbox{Std Dev} = \sigma = \sqrt{\frac{ \sum{ \left (x_i - \overline{x}\right )^2 } }{N}}\)
The standard deviation is measured in the same units as the original observations. For Age (measured in years), this would be years.
standard.deviation <- variance^0.5
standard.deviation
## [1] 18.42824
As might be expected, the larger the value of the standard deviation, the greater the dispersion (spread) of the observations around the population mean.
If calculating the standard deviation based upon a sample, we once again need to divide by N-1 rather than N:
standard.deviation.adjusted <- variance.adjusted^0.5
standard.deviation.adjusted
## [1] 19.42507
# or using the built-in R function:
sd(df$Age)
## [1] 19.42507
Formally, this is:
\(\mbox{Std Dev}_{sample} = s = \sqrt{ \frac{ \sum{ \left (x_i - \overline{x}\right )^2 } }{N-1} }\)
In practice we normally use sample measures of dispersion.
The variance and standard deviation are both based upon deviations from the mean. However, the mean is only a good measure of the average for symmetrical (non-skewed) distributions. Consequently the variance and standard deviation are poor measures of dispersion for asymmetrical distributions.
For continuous data with a skewed distribution a more robust measure of dispersion is the Inter-Quartile Range (IQR).
To calculate the IQR, the data values first need to be listed in rank order, from lowest to highest, and then divided into four equal parts, or quartiles. The 'lower' quartile contains the 25% of data points with the lowest values, whilst the 'upper' quartile contains the 25% of data points with the highest values. The inter-quartile range (IQR) is the difference between the upper and lower quartiles, ie. the difference between the 25th and 75th percentiles of the data distribution.
In the graph below colours flag the quartiles, whilst dashed vertical lines mark the 25th and 75th percentiles of the data distribution.
Fig.8 Symmetrical distribution
The advantage of the Inter-Quartile Range is that a few rogue outliers make little difference to its value, because the IQR focuses on the dispersion of the central 50% of observations:
Fig.9 Asymmetrical distribution
Calculating the IQR in R requires use of the quantile( ) function:
# Create a data frame of the observations plotted in the first graph above
df <- data.frame(x= c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,8,8,8,9,9,10))
#Find 25th percentile of variable x:
quantile(df$x, 0.25)
## 25%
## 4
#Find 75th percentile of variable x:
quantile(df$x, 0.75)
## 75%
## 7
#Find the IQR (75th - 25th percentile):
IQR <- quantile(df$x, 0.75) - quantile(df$x, 0.25)
IQR
## 75%
## 3
#Get rid of the annoying and misleading '75%' label:
attributes(IQR) <- NULL
IQR
## [1] 3
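Base R also provides a built-in IQR( ) function that computes this in a single step. As a quick illustration of the IQR's robustness, appending one extreme outlier leaves it unchanged here, whilst the standard deviation inflates:

IQR(df$x)            # same result in a single step
x.out <- c(df$x, 90) # append one extreme (hypothetical) outlier
IQR(x.out)           # still 3: the central 50% of observations is unaffected
sd(df$x); sd(x.out)  # the standard deviation, by contrast, jumps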
In practice we don't typically compute the IQR; instead we use boxplots to visualise the distribution of variables. Returning to our example, we can explore the distribution of age by educational qualification (Fig.10).
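The QLFS variable holding educational qualification is not named in these notes; assuming a factor called, say, HiQual, a boxplot along the lines of Fig.10 could be drawn with:

# boxplot of age by qualification ('HiQual' is a hypothetical variable name)
boxplot(Age ~ HiQual, data = qlfs, xlab = "Highest qualification", ylab = "Age")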
Fig.10 Boxplots of age by qualification
For categorical data, the equivalent of assuming that all numerical values equal the mean is to assume that the observed values are spread evenly across the available categories.
In the graph below, we can see the distribution of 12 selected QLFS respondents across tenure categories (black bars), plus the distribution that would apply if respondents were spread evenly across categories (dashed horizontal line) - Fig.11.
Fig.11 Twelve selected individuals by tenure
If the twelve respondents were split equally across the four categories, there would be 3 respondents in each category (ie. 12 / 4 = 3).
For this model (of even spread across categories), the model error is the difference between the observed frequency in each category and the frequency that would arise if the respondents were evenly distributed.
Tenure | Obs.Freq | Model.Freq | Error |
---|---|---|---|
Owned | 7 | 3 | 4 |
Mortgaged | 4 | 3 | 1 |
Rented: Social Landlord | 0 | 3 | -3 |
Rented: Private Landlord | 1 | 3 | -2 |
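The code building this table does not appear above; reconstructing it from the table itself (a sketch; the original may have tabulated the tenure variable directly):

# rebuild the observed and model frequencies as a data.frame
df <- data.frame(Tenure = c("Owned", "Mortgaged",
                            "Rented: Social Landlord", "Rented: Private Landlord"),
                 Obs.Freq = c(7, 4, 0, 1),
                 Model.Freq = rep(3, 4))
df$Error <- df$Obs.Freq - df$Model.Freq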
As before, the sum of errors cancels out to 0:
sum(df$Error)
## [1] 0
We therefore need to find an alternative summary measure of overall error. For categorical data this summary measure is based not on squared error, but on absolute error.
The absolute value of any number is the value of that number ignoring any negative sign.
eg.
abs(-15)
## [1] 15
abs(15)
## [1] 15
The sum of the absolute errors is known as the Total Absolute Error (TAE).
\(TAE = \sum|f_i - \bar{f}|\)
For our example:
df$Abs.Error <- abs(df$Error)
df
## Tenure Obs.Freq Model.Freq Error Abs.Error
## 1 Owned 7 3 4 4
## 2 Mortgaged 4 3 1 1
## 3 Rented: Social Landlord 0 3 -3 3
## 4 Rented: Private Landlord 1 3 -2 2
TAE <- sum(df$Abs.Error)
TAE
## [1] 10
Total Absolute Error, just like the Total Squared Error, needs to be adjusted to take account of sample size, since the more observations there are, the larger the total error is likely to be.
The Standardised Absolute Error (SAE) records the average model error per observation.
\(SAE = \frac{\sum|f_i - \bar{f}|}{N}\)
In R, we can use the sum( ) function to find the number of observations (the total of the observed frequencies):
SAE <- TAE / sum(df$Obs.Freq)
SAE
## [1] 0.8333333
A final refinement is to rescale the Standardised Absolute Error by dividing it by two, in order to find the proportion misclassified (PM).
\(PM = \frac{\sum|f_i - \bar{f}|}{2N}\)
PM <- SAE / 2
PM
## [1] 0.4166667
Or you can run the following function:
pm <- function(y, x) {
  Y <- y / sum(y)             # observed proportions
  X <- x / sum(x)             # expected (model) proportions
  ID <- 0.5 * sum(abs(Y - X)) # proportion misclassified
  round(ID, 3)
} # y: observed frequency; x: expected model frequency
pm(df$Obs.Freq, df$Model.Freq)
## [1] 0.417
The proportion misclassified represents the smallest proportion of cases that would have to change categories in order to achieve a perfect fit with the model ie. an even distribution across categories.
A value of 0 (0%) indicates that the observed frequency distribution exactly matches the model frequency distribution (ie. an even spread across the available categories). A value of 1 (100%) indicates no overlap between the observed and model frequency distributions. In the example above, 41.67% of our population of 12 respondents (ie. 5 people) would have to change Tenure category to match the model assumption of an even distribution across categories:
Observed | Change | Result |
---|---|---|
7 | -4 | 3 |
4 | -1 | 3 |
0 | +3 | 3 |
1 | +2 | 3 |
As can be seen above, five persons have to leave their original category (-4 + -1 = -5); and the same five persons then have to join a new category (+3 + +2 = +5).
The proportion misclassified is also variously known as the Index of Dissimilarity (IoD) or the Gini Coefficient. All three are mathematically equivalent.
Within Geography the IoD is widely used to study the spatial segregation of population sub-groups. Within economics the Gini Coefficient is widely used to study the distribution of income.
Here our assumed model has been one of an even distribution, over a set of either social or spatial categories. More commonly an alternative model is used. For example, the spatial distribution of ‘Whites’ could be treated as the model / standard against which the spatial distribution of other ethnic groups is compared.
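Because pm( ) simply compares two frequency distributions, the same function handles such alternative models. For example, with purely hypothetical counts of two ethnic groups across five areas:

# hypothetical counts across five areas (illustration only)
whites <- c(120, 80, 60, 30, 10)
group.b <- c(20, 25, 40, 60, 55)
pm(group.b, whites) # share of group.b that would need to relocate to match 'whites'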
TASK #3 Using the QLFS visualise and calculate the proportion misclassified for two categorical variables of your choice.
Function | Description |
---|---|
mean() | calculate mean |
median() | calculate median |
mode() | return the mode (use the function provided in this notebook) |
sd() | calculate sample standard deviation |
quantile() | return a specific percentile of a set of numbers |
pm() | proportion misclassified (use the function provided in this notebook) |
sum() | sum of a variable vector |
summary() | return min, 1st quartile, median, mean, 3rd quartile and max values |