# Measuring Statistical Significance

### Hypothesis Testing & Confidence Intervals

#### 2020-08-31

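# clean workspace and load packages (dplyr for %>%, mutate(), filter(); ggplot2 for plots)
rm(list=ls())
library(dplyr)
library(ggplot2)
load("../data/data_qlfs.RData")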

# 2 Percentage

A percentage comprises two parts: numerator and denominator.

$\% = \frac{numerator}{denominator} \times 100$

Let’s have a look at employment status:

attach(qlfs)
prop.table(table(WorkStatus)) * 100
## WorkStatus
##        Employee (full-time)        Employee (part-time)
##                   35.973087                   12.547969
##   Self-employed (full-time)   Self-employed (part-time)
##                    6.070272                    2.427112
## Unemployed (ILO definition) Student (FT or non-working)
##                    4.364565                    7.461999
##   Looking after family/home   Sick, injured or disabled
##                    4.857962                    5.375031
##                       Other                     Retired
##                    2.075754                   18.846250

Two-way cross tabulations:

prop.table(table(WorkStatus, Sex), 1) * 100
##                              Sex
## WorkStatus                         Male    Female
##   Employee (full-time)        59.098781 40.901219
##   Employee (part-time)        18.627743 81.372257
##   Self-employed (full-time)   79.454023 20.545977
##   Self-employed (part-time)   41.375770 58.624230
##   Unemployed (ILO definition) 56.123323 43.876677
##   Student (FT or non-working) 46.919352 53.080648
##   Looking after family/home    9.463965 90.536035
##   Sick, injured or disabled   48.678720 51.321280
##   Other                       44.357743 55.642257
##   Retired                     43.911146 56.088854
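
The second argument of prop.table() decides which margin sums to 100, and that changes what the percentage means. A minimal sketch on toy data (the ten people below are hypothetical, not QLFS respondents) shows the two readings side by side:

```r
# Toy data (hypothetical, not from the QLFS): 10 people by status and sex
status <- factor(c("Employed", "Employed", "Employed", "Employed", "Employed", "Employed",
                   "Unemployed", "Unemployed", "Unemployed", "Unemployed"))
sex <- factor(c("Male", "Male", "Male", "Male", "Female", "Female",
                "Male", "Female", "Female", "Female"))
tab <- table(status, sex)

row_pct <- prop.table(tab, 1) * 100  # each row sums to 100: sex split within a status
col_pct <- prop.table(tab, 2) * 100  # each column sums to 100: status split within a sex

row_pct["Unemployed", "Female"]  # % of the unemployed who are female
col_pct["Unemployed", "Female"]  # % of females who are unemployed
```

The same cell of the table answers two different questions depending on the denominator chosen.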

TASK #1 What percentage of females are unemployed? How does this differ from the percentages above?

# 3 The Problem

What is the percentage of unemployment across ethnic groups?

First, identify the unemployed:

qlfs <- qlfs %>%
  # mutate() creates a new variable; "Unemployed (ILO definition)" is the 5th level of WorkStatus
  mutate(unemp = ifelse(as.numeric(WorkStatus) == 5, "Yes", "No"))
attach(qlfs)
## The following objects are masked from qlfs (pos = 3):
##
##     Age, AgeGroup, Case_ID, CountryOfBirth, DegreeClass, EthnicGroup,
##     Family_ID, FamilySize, FamilyToddlers, FamilyType, GovtRegion,
##     GrossPay, Hhold_ID, HHoldSize, HighestQual, HoursWorked, LastMoved,
##     MaritalStatus, NetPay, NSSEC, Person_ID, Religion, Sex, Tenure,
##     TravelMode, TravelTime, WorkStatus, YoungestChild
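
Recoding by level number works only as long as the factor levels keep their order. As a hedged alternative, the sketch below matches on the label instead; the toy factor `ws` is hypothetical and simply reuses the first five WorkStatus labels shown earlier:

```r
# Toy factor with the same first five levels, in the same order as WorkStatus above
levs <- c("Employee (full-time)", "Employee (part-time)",
          "Self-employed (full-time)", "Self-employed (part-time)",
          "Unemployed (ILO definition)")
ws <- factor(c("Employee (full-time)", "Unemployed (ILO definition)", NA,
               "Self-employed (part-time)"), levels = levs)

# Recode by position: depends on the order of the levels
by_code <- ifelse(as.numeric(ws) == 5, "Yes", "No")
# Recode by label: depends only on the label text
by_label <- ifelse(ws == "Unemployed (ILO definition)", "Yes", "No")

identical(by_code, by_label)  # both approaches agree here, and both keep NA as NA
```

Matching on the label is a little longer to type, but it would still give the right answer if the levels were ever reordered.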

And then the percentage of unemployment by ethnic group:

df_pune <- as.data.frame.matrix(prop.table(
with(qlfs, table(EthnicGroup, unemp))
, 1) * 100)
df_pune
##                                             No      Yes
## White                                 95.91176 4.088237
## Mixed/multiple ethnic background      92.02552 7.974482
## Indian                                94.72436 5.275637
## Pakistani                             90.52133 9.478673
## Chinese                               96.54179 3.458213
## Other Asian background                94.61538 5.384615
## Black/African/Caribbean/Black British 90.75044 9.249564
## Other ethnic group                    94.04040 5.959596
df_pune$EthnicGroup <- rownames(df_pune)

Visually: Fig.1 % of unemployed by ethnicity

# get percentages
prc_df <- qlfs %>%
filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
count(unemp, EthnicGroup) %>%
group_by(EthnicGroup) %>%
mutate(percent = (n / sum(n))*100) %>%
ungroup() %>%
filter(unemp=="Yes")
# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
geom_point(stat = "identity") +
labs(title= paste(" "), x="% of Unemployed", y="Ethnic Group") +
theme_classic() +
theme(axis.text=element_text(size=14))

How do we know whether these percentages are statistically different if they are based on a random sample from a larger population? We are using estimates from a sample to make guesses about the population. Our sample estimates may or may not be close to the population values. There is uncertainty due to sampling error: a different sample may produce different estimates. We know, however, that larger samples tend to produce more reliable sample estimates.
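
A small simulation can make sampling error concrete. The sketch below assumes a population in which 5% of people are unemployed (an illustrative value, not a QLFS figure) and compares the spread of sample percentages for small and large samples; the function name `sample_percent` is made up for this example:

```r
set.seed(42)  # arbitrary seed, for reproducibility only

true_p <- 0.05  # assumed population unemployment proportion (illustrative)

# Draw `reps` random samples of size n and return the sample percentage from each
sample_percent <- function(n, reps = 2000) {
  rbinom(reps, size = n, prob = true_p) / n * 100
}

small <- sample_percent(100)
large <- sample_percent(2500)

sd(small)  # spread of the estimates from samples of 100
sd(large)  # spread of the estimates from samples of 2500 (noticeably smaller)
```

Every sample is drawn from the same population, yet the estimates differ from sample to sample, and they vary less when the samples are larger.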

There are two issues then:

1. We don’t know whether our sample estimates are close to the population values, e.g. gender composition.

2. We can’t be certain whether the estimates for different groups are truly different.

To deal with these issues, we use two related approaches:

• Confidence Intervals

• Hypothesis Testing

In this session, we cover the theory of hypothesis testing (though we will return to its practice during the regression analysis session) and elaborate on confidence intervals, which address the problem we are interested in: assessing whether percentages across groups are statistically different.

# 4 Confidence Intervals (CIs)

## 4.1 The Theory

A confidence interval provides additional information on the variability around the point estimate. A range of values for a population parameter is calculated using a single sample.

A specific confidence interval will either contain or will not contain the population parameter.

We often use a 95% level of confidence (LOC): if we drew many samples and built an interval around each sample estimate, about 95% of those intervals would contain the population parameter, assuming that any difference between the sample estimate and the population parameter is attributable to random sampling error alone.

For example, our estimate of unemployment among Bangladeshis is 5.9% (23/391 x 100). This estimate has an associated 95% confidence interval of 3.6% to 8.2%. In other words, we are 95% confident that the percentage of unemployed Bangladeshis in the population as a whole falls in the range 3.6% to 8.2%.
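
The "95% of the time" reading can be checked by simulation. The sketch below assumes a true population percentage of 10% and samples of 400 (both illustrative values), builds an interval around each sample estimate using the standard-error formula from section 4.2, and counts how often the interval contains the true value:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

true_p <- 10   # assumed population percentage (illustrative)
n      <- 400  # assumed sample size (illustrative)
reps   <- 5000 # number of simulated samples

p_hat <- rbinom(reps, size = n, prob = true_p / 100) / n * 100  # sample percentages
se    <- sqrt(p_hat * (100 - p_hat) / n)                        # standard error of each
covered <- (p_hat - 1.96 * se <= true_p) & (true_p <= p_hat + 1.96 * se)

mean(covered)  # share of intervals that contain the true value; close to 0.95
```

Any single interval either contains the true value or it does not; the 95% describes the procedure across repeated samples.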

Level of significance: $\alpha = {100\% - LOC}$

So if 95%:

$5\% = 100\% - 95\%$

## 4.2 CI Calculation

$95\% \: CI = p \pm 1.96 \times se(p)$ where $se(p)$ is the standard error of the sample estimate.

The formula for calculating the standard error for the sample estimate of a percentage is:

$se(p) = \sqrt{ \frac {p \times (100-p)} {N} }$

where p = the estimated percentage (numerator / denominator x 100) and N = the size of the denominator upon which the percentage is based.

# recall the percentage
p <- prc_df[5,4]
p
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1    5.88
# denominator
NG <- qlfs %>%
filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
count(EthnicGroup)
N <- NG[5,2]
# standard error
se <- ( (p * (100 - p)) / N)^0.5
se
##    percent
## 1 1.189933
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)
ci95_lb 
##    percent
## 1 3.550083
ci95_ub
##    percent
## 1 8.214623
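
The same steps can be wrapped into a small reusable function. This is a sketch rather than part of the original workflow; the name `ci_percent` is hypothetical, and the inputs are the Bangladeshi counts quoted in section 4.1 (23 unemployed out of 391):

```r
# Hypothetical helper: 95% CI for a percentage from a count x and a denominator n
ci_percent <- function(x, n, z = 1.96) {
  p  <- x / n * 100               # estimated percentage
  se <- sqrt(p * (100 - p) / n)   # standard error of the percentage
  c(percent = p, se = se, lower = p - z * se, upper = p + z * se)
}

ci_percent(23, 391)  # reproduces the percentage, se and bounds computed step by step above
```

Working from plain numbers rather than one-cell tibbles also returns an ordinary named vector, which is easier to reuse.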

## 4.3 Comparing CIs

Calculating the 95% CI for one percentage is all well and good, but often we want to calculate and compare the 95% CIs for multiple percentages at the same time:

Fig.2 CIs

p <- prc_df[,4]
se <- ((p * (100 - p)) / NG[,2])^0.5
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)

prc_df <- data.frame(prc_df, ci95_lb, ci95_ub)
colnames(prc_df) <- c("unemp", "EthnicGroup", "n", "percent", "ci95_lb", "ci95_ub")

# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
geom_point(stat = "identity") +
geom_segment( aes( y = EthnicGroup, yend = EthnicGroup, x = ci95_lb, xend = ci95_ub) ) +
labs(title= paste(" "), x = "% of Unemployed", y = "Ethnic Group") +
theme_classic() +
# Axis label size
theme(axis.text=element_text(size=14))

We can draw various conclusions from this graph:

1. The narrower the confidence interval, the more confident we are about a given survey estimate.

2. We are more confident about the survey estimates of the unemployment rate for some ethnic groups than for others.

3. This is mainly a reflection of the size of the ethnic group sample. The fewer respondents from a given ethnic group in the survey, the wider the confidence interval.

4. If two confidence intervals overlap, then we can’t be 95% confident which unemployment rate is larger in the population as a whole, e.g. the survey estimate of the unemployment rate for Indians looks notably higher than the rate for the Chinese. However, since the confidence intervals overlap, we can’t be 95% confident that this is true in the population as a whole. (This is a cautious rule of thumb: a formal test can still find a significant difference when two intervals overlap only slightly.)

5. If two confidence intervals do not overlap, then we can be at least 95% confident that one rate is higher than the other in the population as a whole, e.g. the 95% CIs of the survey estimates for the Black and Chinese unemployment rates do not overlap; therefore we are at least 95% confident that in the population as a whole the Black unemployment rate is higher than the Chinese unemployment rate.
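
The overlap rule in points 4 and 5 can also be checked programmatically rather than by eye. A minimal sketch in base R, using hypothetical counts for three groups (A, B and C are not QLFS groups) and a made-up helper `ci_overlap`:

```r
# Hypothetical counts for three groups (illustrative, not QLFS figures)
groups <- data.frame(
  group = c("A", "B", "C"),
  x     = c(40, 18, 90),      # number unemployed
  n     = c(1000, 300, 1000)  # group size
)

groups$percent <- groups$x / groups$n * 100
groups$se      <- sqrt(groups$percent * (100 - groups$percent) / groups$n)
groups$lower   <- groups$percent - 1.96 * groups$se
groups$upper   <- groups$percent + 1.96 * groups$se

# Two intervals overlap when each one starts before the other ends
ci_overlap <- function(i, j, d = groups) {
  d$lower[i] <= d$upper[j] && d$lower[j] <= d$upper[i]
}

ci_overlap(1, 2)  # A vs B: TRUE, the intervals overlap
ci_overlap(1, 3)  # A vs C: FALSE, the intervals do not overlap
```

Group B has the smallest sample and therefore the widest interval, which is why it overlaps with A even though its point estimate is higher.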

# 5 Hypothesis Testing

Hypothesis testing is used to formally test a hypothesis, i.e. to test whether there is enough statistical evidence for a claim.

## 5.1 The Theory

1. State your hypothesis e.g. group percentages are equal. Two hypotheses:

Null hypothesis: Claim or assumption to be tested e.g. group percentages are equal

Alternative hypothesis: Alternative claim if the evidence doesn’t support the null hypothesis e.g. group percentages are different, or one is smaller than, or greater than, the other

2. Establish a decision rule based on a LOC (or significance level) e.g. 95% (or 5%):

Not significant at 5%, if $|z| < 1.96$ (or p-value $>$ 5%)

Significant at 5%, if $|z| \ge 1.96$ (or p-value $\le$ 5%)

p-value: the probability of obtaining the observed (or a more extreme) result, assuming that the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis, indicating that it may not adequately explain the observation.

3. Compute the test statistic (z)

4. Interpretation / conclusion of the results

Fig. 2. Hypothesis testing: Rejection regions.
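
The four steps can be sketched for the difference between two proportions. The helper below is hypothetical (the name `two_prop_z` and the counts are made up for illustration) and uses the standard pooled z-test, without the continuity correction that prop.test() applies by default:

```r
# Hypothetical helper: z-test for the difference between two proportions
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z  <- (p1 - p2) / se             # test statistic
  p_value <- 2 * pnorm(-abs(z))    # two-sided p-value
  list(z = z, p_value = p_value, significant = abs(z) >= 1.96)
}

# Illustrative counts (not QLFS figures): 40 of 1000 vs 90 of 1000 unemployed
res <- two_prop_z(40, 1000, 90, 1000)
res$z            # test statistic
res$p_value      # two-sided p-value
res$significant  # decision at the 5% level
```

The square of this z statistic equals the X-squared value that prop.test() reports when called with correct = FALSE.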

## 5.2 Implementation

Testing the difference in unemployment between Chinese (3.5%) and Other Asian background (5.4%).

# get n of unemployed people
n_une <- prc_df[6:7, 3]
# total population
n_pop <- as.numeric(unlist(NG[6:7, 2])) # unlist simplifies the structure of the data to produce a vector
prop.test(x = n_une, n = n_pop)
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  n_une out of n_pop
## X-squared = 1.5542, df = 1, p-value = 0.2125
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.046256039  0.007727997
## sample estimates:
##     prop 1     prop 2
## 0.03458213 0.05384615

The output:

• the value of Pearson’s chi-squared test statistic
• a p-value
• a 95% confidence interval for the difference between the two proportions
• an estimated probability of success, i.e. the proportion of unemployed people in each of the two groups
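
Each of these components can be pulled out of the object that prop.test() returns. The sketch below does not use the QLFS data; it assumes counts of 12 out of 347 and 42 out of 780, which were back-calculated from the printed proportions and test statistic and are therefore an assumption rather than figures stated in the output:

```r
# Counts back-calculated from the printed output (an assumption):
# 12 of 347 Chinese and 42 of 780 Other Asian respondents unemployed
res <- prop.test(x = c(12, 42), n = c(347, 780))

res$statistic  # Pearson's chi-squared statistic (X-squared)
res$p.value    # p-value
res$conf.int   # 95% CI for the difference prop 1 - prop 2
res$estimate   # the two sample proportions
```

Storing the result in an object makes it easy to reuse the p-value or the interval in later code instead of copying numbers from the printed output.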

The p-value of the test is 0.2125, which is greater than the significance level $\alpha$ = 0.05. We can conclude that the proportions of unemployed people among the Chinese and Other Asian background groups are not significantly different.

This is consistent with the conclusion we would reach using CIs.

# 6 Misuse of p-values

p-values are often used or interpreted incorrectly. There has been an intense debate about moving beyond the “p-value < 5%” threshold.

Some points about p-values that are commonly misunderstood:

• The 0.05 significance level is arbitrary

• The p-value does not prove that the null hypothesis is true or false; it indicates the degree of compatibility between the dataset and the hypothesis

• The p-value does not indicate the size or importance of the observed effect
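
The last point can be illustrated with two tests that share the same effect size. The counts below are illustrative, not QLFS figures: in both cases the groups have unemployment rates of 5% and 6%, and only the sample size changes:

```r
# Illustrative counts (not QLFS figures): 5% vs 6% unemployment in both cases
small_n <- prop.test(x = c(50, 60),     n = c(1000, 1000))
large_n <- prop.test(x = c(5000, 6000), n = c(100000, 100000))

small_n$estimate  # proportions 0.05 and 0.06
large_n$estimate  # proportions 0.05 and 0.06 again: the same effect size
small_n$p.value   # above 0.05
large_n$p.value   # far below 0.05
```

The one-percentage-point gap is identical in both tests; the p-value shrinks only because the larger sample estimates that gap more precisely.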

Solution: analyse data in multiple ways to see whether different analyses converge on the same answer. Logic, background knowledge and experimental design should be considered alongside p-values and similar metrics to reach a conclusion and decide on its certainty.

# 7 Appendix: Concepts and Functions to Remember

| Function | Description |
|---|---|
| mutate() | create new variables |
| filter() | select observations based on their values |
| prop.table() | compute proportions |
| !is.na() | is not NA |
| %>% | pipe operator to chain functions together |
| group_by(), ungroup() | group/ungroup based on categorical variables |
| prop.test() | test of equal or given proportions |