In this session, we move on to examine some of the principles of inferential statistics. The previous sessions offered tools to descriptively profile our sample of data: understanding its distribution, composition, central tendency and spread, for example differences in unemployment rates across ethnic groups.
Note: Part of Introduction to Statistical Learning in R. Measuring Statistical Significance – Hypothesis Testing & Confidence Intervals by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
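# clean workspace
rm(list=ls())
# load packages used below (%>%, mutate(), filter(), count(), ggplot())
library(dplyr)
library(ggplot2)
# load data
load("../data/data_qlfs.RData")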
A percentage comprises two parts: a numerator and a denominator.
\[\% = \frac{\text{numerator}}{\text{denominator}} \times 100\]
Let’s have a look at employment status:
attach(qlfs)
prop.table(table(WorkStatus)) * 100
## WorkStatus
## Employee (full-time) Employee (part-time)
## 35.973087 12.547969
## Self-employed (full-time) Self-employed (part-time)
## 6.070272 2.427112
## Unemployed (ILO definition) Student (FT or non-working)
## 4.364565 7.461999
## Looking after family/home Sick, injured or disabled
## 4.857962 5.375031
## Other Retired
## 2.075754 18.846250
Two-way cross tabulations:
prop.table(table(WorkStatus, Sex), 1) * 100
## Sex
## WorkStatus Male Female
## Employee (full-time) 59.098781 40.901219
## Employee (part-time) 18.627743 81.372257
## Self-employed (full-time) 79.454023 20.545977
## Self-employed (part-time) 41.375770 58.624230
## Unemployed (ILO definition) 56.123323 43.876677
## Student (FT or non-working) 46.919352 53.080648
## Looking after family/home 9.463965 90.536035
## Sick, injured or disabled 48.678720 51.321280
## Other 44.357743 55.642257
## Retired 43.911146 56.088854
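The second argument of prop.table() decides which total is used as the denominator, and this changes how the percentages should be read. A minimal sketch with a made-up 2 x 2 table (the counts and the object names toy, row_pct and col_pct are hypothetical, not taken from the QLFS data) shows the difference between row and column percentages:

```r
# Hypothetical counts, for illustration only (not QLFS figures)
toy <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Status = c("Employed", "Unemployed"),
                              Sex = c("Male", "Female")))
toy <- as.table(toy)

# margin = 1: each row sums to 100 (share of each sex within a status)
row_pct <- prop.table(toy, 1) * 100

# margin = 2: each column sums to 100 (share of each status within a sex)
col_pct <- prop.table(toy, 2) * 100

row_pct
col_pct
```

With margin 1 the denominator is the row total; with margin 2 it is the column total, so the same cell count can produce quite different percentages.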
TASK #1 What percentage of females are unemployed? How does this differ from the percentage above?
What is the percentage of unemployment across ethnic groups?
First, identify the unemployed:
qlfs <- qlfs %>%
  # mutate() to create new variables
  # "Unemployed (ILO definition)" is the 5th level of WorkStatus
  mutate(unemp = ifelse(as.numeric(WorkStatus) == 5, "Yes", "No"))
attach(qlfs)
## The following objects are masked from qlfs (pos = 3):
##
## Age, AgeGroup, Case_ID, CountryOfBirth, DegreeClass, EthnicGroup,
## Family_ID, FamilySize, FamilyToddlers, FamilyType, GovtRegion,
## GrossPay, Hhold_ID, HHoldSize, HighestQual, HoursWorked, LastMoved,
## MaritalStatus, NetPay, NSSEC, Person_ID, Religion, Sex, Tenure,
## TravelMode, TravelTime, WorkStatus, YoungestChild
And then the percentage of unemployment by ethnic group:
df_pune <- as.data.frame.matrix(prop.table(
with(qlfs, table(EthnicGroup, unemp))
, 1) * 100)
df_pune
## No Yes
## White 95.91176 4.088237
## Mixed/multiple ethnic background 92.02552 7.974482
## Indian 94.72436 5.275637
## Pakistani 90.52133 9.478673
## Bangladeshi 94.11765 5.882353
## Chinese 96.54179 3.458213
## Other Asian background 94.61538 5.384615
## Black/African/Caribbean/Black British 90.75044 9.249564
## Other ethnic group 94.04040 5.959596
df_pune$EthnicGroup <- rownames(df_pune)
Visually:
Fig.1 % of unemployed by ethnicity
# get percentages
prc_df <- qlfs %>%
filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
count(unemp, EthnicGroup) %>%
group_by(EthnicGroup) %>%
mutate(percent = (n / sum(n))*100) %>%
ungroup() %>%
filter(unemp=="Yes")
# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
geom_point(stat = "identity") +
# Add labels
labs(title= paste(" "), x="% of Unemployed", y="Ethnic Group") +
theme_classic() +
theme(axis.text=element_text(size=14))
How do we know whether these percentages are statistically different, given that they are based on a random sample from a larger population? We are using estimates from a sample to make guesses about the population. Our sample estimates may or may not be similar to the population values. There is uncertainty due to sampling error: a different sample may produce different estimates. We know, however, that larger samples tend to produce more reliable sample estimates.
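Sampling error and the effect of sample size can be illustrated with a small simulation. This is only a sketch: the population value of 6%, the two sample sizes and the helper sample_pcts() are assumptions made for illustration, not QLFS figures.

```r
set.seed(123)  # for reproducibility

# Hypothetical population in which 6% are unemployed (assumed value)
true_p <- 0.06

# Draw `reps` random samples of size n and return each sample percentage
sample_pcts <- function(n, reps = 2000) {
  rbinom(reps, size = n, prob = true_p) / n * 100
}

pct_small <- sample_pcts(n = 100)
pct_large <- sample_pcts(n = 2500)

# Spread of the estimates across repeated samples
sd(pct_small)
sd(pct_large)
```

The estimates from the larger samples should cluster much more tightly around the population value than those from the smaller samples.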
There are two issues then:
We don’t know if our sample estimates are close to the population values, e.g. gender composition.
We can’t be certain whether differences between sample estimates reflect real differences in the population.
To deal with these issues, we use two related approaches:
Confidence Intervals
Hypothesis Testing
In this session, we cover the theory of hypothesis testing (we will return to its practice during the regression analysis session) and elaborate on confidence intervals, which address the problem we are interested in: assessing whether percentages across groups are statistically different.
A confidence interval provides additional information on variability about the point estimate. A range of values for a population parameter is calculated using a single sample.
A specific confidence interval will either contain or will not contain the population parameter.
We often use a 95% level of confidence (LOC): across repeated samples, 95% of the confidence intervals constructed around the sample estimate should contain the population parameter, assuming that any difference between the sample estimate and the population parameter is attributable to random sampling error alone.
For example, our estimate of unemployment among Bangladeshis is 5.9% (23/391 x 100). This estimate has an associated 95% confidence interval of 3.6% to 8.2%. In other words, we are 95% confident that the percentage of unemployed Bangladeshis in the population as a whole falls in the range 3.6% to 8.2%.
Level of significance: \[ \alpha = {100\% - LOC} \]
So if 95%:
\[ 5\% = 100\% - 95\% \]
\[95\% \: CI = p \pm 1.96 \times se(p)\] where se(p) is the standard error of the sample estimate p.
The formula for calculating the standard error for the sample estimate of a percentage is: \[se(p) = \sqrt{ \frac {p \times (100-p)} {N} } \] where p = the estimated percentage (numerator / denominator x 100) and N = the size of the denominator upon which the percentage is based.
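The 95% level of confidence describes what happens over repeated sampling, and a small simulation can illustrate this. It is only a sketch: the population value (20%), the sample size (500) and the number of repetitions are assumptions chosen for illustration rather than QLFS figures.

```r
set.seed(42)  # for reproducibility

# Hypothetical population value and sample size (assumptions for illustration)
true_pct <- 20
n <- 500
reps <- 5000

# Sample percentages from repeated random samples
p_hat <- rbinom(reps, size = n, prob = true_pct / 100) / n * 100

# 95% confidence interval around each sample estimate
se <- sqrt(p_hat * (100 - p_hat) / n)
lower <- p_hat - 1.96 * se
upper <- p_hat + 1.96 * se

# Share of intervals that contain the population value
coverage <- mean(lower <= true_pct & true_pct <= upper)
coverage
```

The share of intervals containing the population value should be close to 95%, although it will vary slightly with the seed, the population value and the sample size.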
For Bangladeshis who are unemployed:
# recall the percentage
p <- prc_df[5,4]
p
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 5.88
# denominator
NG <- qlfs %>%
filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
count(EthnicGroup)
N <- NG[5,2]
# standard error
se <- ( (p * (100 - p)) / N)^0.5
se
## percent
## 1 1.189933
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)
ci95_lb
## percent
## 1 3.550083
ci95_ub
## percent
## 1 8.214623
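The same interval can be reproduced directly from the counts quoted earlier (23 unemployed out of 391). As a cross-check, base R's prop.test() also returns a 95% interval for a single proportion, but it is based on a different method (a score interval), so its limits are not expected to match the p ± 1.96 × se(p) formula exactly. A minimal sketch, with wald_ci and score_ci as illustrative names:

```r
# Counts quoted in the text: 23 unemployed out of 391 Bangladeshi respondents
x <- 23
n <- 391

p  <- x / n * 100                 # estimated percentage
se <- sqrt(p * (100 - p) / n)     # standard error of the percentage
wald_ci <- c(lower = p - 1.96 * se, upper = p + 1.96 * se)
wald_ci

# Score-based interval from prop.test(), converted to percentages
score_ci <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int) * 100
score_ci
```

Both intervals are centred near the same estimate; the score interval is not symmetric around p, which is one reason it is sometimes preferred for small percentages.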
Calculating the 95% CI for one percentage is all well and good, but often we want to calculate and compare the 95% CIs for multiple percentages at the same time:
Fig.2 CIs
p <- prc_df[,4]
se <- ((p * (100 - p)) / NG[,2])^0.5
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)
prc_df <- data.frame(prc_df, ci95_lb, ci95_ub)
colnames(prc_df) <- c("unemp", "EthnicGroup", "n", "percent", "ci95_lb", "ci95_ub")
# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
geom_point(stat = "identity") +
geom_segment( aes( y = EthnicGroup, yend = EthnicGroup, x = ci95_lb, xend = ci95_ub) ) +
# Add labels
labs(title= paste(" "), x = "% of Unemployed", y = "Ethnic Group") +
theme_classic() +
# Axis label size
theme(axis.text=element_text(size=14))
We can draw various conclusions from this graph:
The narrower the confidence interval, the more confident we are about a given survey estimate.
We are more confident about the survey estimates of the unemployment rate for some ethnic groups than for others.
This is mainly a reflection of the size of the ethnic group sample. The fewer respondents from a given ethnic group in the survey, the wider the confidence interval.
If two confidence intervals overlap, then we can’t be 95% confident which unemployment rate is larger in the population as a whole, e.g. the survey estimate of the unemployment rate for Indians looks notably higher than the rate for the Chinese. However, since the confidence intervals overlap, we can’t be 95% confident that this is true in the population as a whole.
If two confidence intervals do not overlap, then we can be at least 95% confident that one rate is higher than the other in the population as a whole, e.g. the 95% CIs of the survey estimates for the Black and Chinese unemployment rates do not overlap; therefore we are at least 95% confident that in the population as a whole the Black unemployment rate is higher than the Chinese unemployment rate.
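Two of the points above, that smaller groups have wider intervals and that conclusions rest on whether intervals overlap, can be sketched in a few lines. The helpers ci_halfwidth() and ci_overlap() and all numbers below are hypothetical and serve only as an illustration.

```r
# Half-width of a 95% CI for a percentage p based on N respondents
ci_halfwidth <- function(p, N) 1.96 * sqrt(p * (100 - p) / N)

# Same estimated percentage, increasingly large (hypothetical) samples
N <- c(100, 400, 1600)
ci_halfwidth(p = 6, N = N)   # quadrupling N halves the width

# Do two intervals overlap? TRUE if they share any values
ci_overlap <- function(lb1, ub1, lb2, ub2) {
  lb1 <= ub2 & lb2 <= ub1
}

# Hypothetical intervals, for illustration only
ci_overlap(3.5, 8.2, 7.0, 12.0)   # TRUE: the intervals overlap
ci_overlap(1.5, 5.4, 7.0, 11.5)   # FALSE: the intervals do not overlap
```

Because N sits under a square root in the standard error, a group needs four times as many respondents to halve the width of its interval.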
Hypothesis testing is used to formally validate / test a hypothesis, i.e. test if there is enough statistical evidence for a claim.
Null hypothesis: Claim or assumption to be tested, e.g. group percentages are equal
Alternative hypothesis: Alternative claim if evidence doesn’t support the null hypothesis, e.g. group percentages are different, or smaller than, or greater than
Not significant at 5%, if \(|z|\) < 1.96 (or p-value \(>\) 5%)
Significant at 5%, if \(|z| \ge\) 1.96 (or p-value \(\le\) 5%)
p-value: the probability of obtaining the observed (or a more extreme) result, assuming that the null hypothesis is true. The smaller the p-value, the higher the significance, indicating that the hypothesis under consideration may not adequately explain the observation.
Compute the test statistic (z)
Interpretation / conclusion of the results
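The link between the z statistic, the 1.96 threshold and the p-value can be sketched by hand. The helpers p_from_z() and two_prop_z() and the counts used below are hypothetical, introduced only for illustration; the z statistic shown is the usual pooled version for comparing two proportions, cross-checked against prop.test() without continuity correction.

```r
# Two-sided p-value implied by a z statistic
p_from_z <- function(z) 2 * pnorm(-abs(z))
p_from_z(1.96)   # approximately 0.05

# z statistic for the difference between two proportions,
# using the pooled proportion under the null hypothesis of equality
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  (p1 - p2) / se
}

# Hypothetical counts, for illustration only
z <- two_prop_z(x1 = 30, n1 = 500, x2 = 55, n2 = 600)
z
p_from_z(z)

# Cross-check: prop.test() without continuity correction reports z squared
check <- prop.test(x = c(30, 55), n = c(500, 600), correct = FALSE)
c(z_squared = z^2, X_squared = unname(check$statistic))
```

By default prop.test() applies a continuity correction, so its X-squared value will be slightly smaller than z squared unless correct = FALSE is set.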
Testing the difference in unemployment between Chinese (3.5%) and Other Asian background (5.4%).
# get n of unemployed people
n_une <- prc_df[6:7, 3]
# total population
n_pop <- as.numeric(unlist(NG[6:7, 2])) # unlist simplifies the structure of the data to produce a vector
prop.test(x = n_une, n = n_pop)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: n_une out of n_pop
## X-squared = 1.5542, df = 1, p-value = 0.2125
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.046256039 0.007727997
## sample estimates:
## prop 1 prop 2
## 0.03458213 0.05384615
The output:
The p-value of the test is 0.2125, which is greater than the significance level \(\alpha\) = 0.05. We can conclude that the proportions of unemployed people among the Chinese and Other Asian background groups are not significantly different.
This is equivalent to the conclusion we would reach using CIs.
p-values are often used or interpreted incorrectly. There has been an intense debate about moving beyond the p-value < 5% threshold.
Some common misunderstandings regarding p-values:
The 0.05 significance level is arbitrary
The p-value does not prove the null hypothesis is true, but indicates the degree of compatibility between the dataset and the hypothesis
The p-value does not indicate the size or importance of the observed effect
Solution: analyse data in multiple ways to see whether different analyses converge on the same answer. Logic, background knowledge and experimental design should be considered alongside p-values and similar metrics to reach a conclusion and decide on its certainty.
Function | Description |
---|---|
mutate() | create new variables |
filter() | select observations based on their values |
prop.table() | compute proportions |
!is.na() | is not NA |
%>% | pipe operator to chain functions together |
group_by(), ungroup() | group/ungroup based on categorical variables |
prop.test() | test of equal or given proportions |