In this session, we examine some of the principles of *Inferential Statistics*. The previous sessions offered tools to descriptively profile our sample of data, understanding its distribution, composition, tendency and spread, for example differences in the unemployment rate across ethnic groups.

This session is part of Introduction to Statistical Learning in R. Measuring Statistical Significance – Hypothesis Testing & Confidence Intervals by Francisco Rowe is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

```
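# clean workspace
rm(list=ls())
# load packages used below (%>%, mutate(), filter(), count(), ggplot())
library(dplyr)
library(ggplot2)
# load data
load("../data/data_qlfs.RData")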
```

A percentage comprises two parts: *numerator* and *denominator*.

\[\% = \frac{numerator}{denominator} \times 100\]

Let's have a look at employment status:

```
attach(qlfs)
prop.table(table(WorkStatus)) * 100
```

```
## WorkStatus
## Employee (full-time) Employee (part-time)
## 35.973087 12.547969
## Self-employed (full-time) Self-employed (part-time)
## 6.070272 2.427112
## Unemployed (ILO definition) Student (FT or non-working)
## 4.364565 7.461999
## Looking after family/home Sick, injured or disabled
## 4.857962 5.375031
## Other Retired
## 2.075754 18.846250
```

Two-way cross tabulations:

`prop.table(table(WorkStatus, Sex), 1) * 100`

```
## Sex
## WorkStatus Male Female
## Employee (full-time) 59.098781 40.901219
## Employee (part-time) 18.627743 81.372257
## Self-employed (full-time) 79.454023 20.545977
## Self-employed (part-time) 41.375770 58.624230
## Unemployed (ILO definition) 56.123323 43.876677
## Student (FT or non-working) 46.919352 53.080648
## Looking after family/home 9.463965 90.536035
## Sick, injured or disabled 48.678720 51.321280
## Other 44.357743 55.642257
## Retired 43.911146 56.088854
```
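The second argument of `prop.table()` decides which totals the percentages are based on: `1` gives row percentages (each row sums to 100) and `2` gives column percentages (each column sums to 100). A minimal sketch with a small made-up table may help; the counts and the object name `toy` are hypothetical and are not taken from the QLFS data:

```r
# hypothetical counts, for illustration only
toy <- as.table(matrix(c(30, 10,
                         20, 40),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Status = c("Employed", "Unemployed"),
                                       Sex = c("Male", "Female"))))
# row percentages: within each status, what share is male / female?
row_pct <- prop.table(toy, 1) * 100
# column percentages: within each sex, what share is in each status?
col_pct <- prop.table(toy, 2) * 100
row_pct
col_pct
```

The same cell therefore answers two different questions depending on the margin chosen.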

**TASK #1** What is the percentage of females who are unemployed? How does this differ from the percentage above?

What is the percentage of unemployment across ethnic groups?

First, identify the unemployed:

```
qlfs <- qlfs %>%
  # mutate() to create new variables
  mutate(unemp = ifelse(as.numeric(WorkStatus) == 5, "Yes", "No"))
```

`attach(qlfs)`

```
## The following objects are masked from qlfs (pos = 3):
##
## Age, AgeGroup, Case_ID, CountryOfBirth, DegreeClass, EthnicGroup,
## Family_ID, FamilySize, FamilyToddlers, FamilyType, GovtRegion,
## GrossPay, Hhold_ID, HHoldSize, HighestQual, HoursWorked, LastMoved,
## MaritalStatus, NetPay, NSSEC, Person_ID, Religion, Sex, Tenure,
## TravelMode, TravelTime, WorkStatus, YoungestChild
```

And then the percentage of unemployment by ethnic group:

```
df_pune <- as.data.frame.matrix(
  prop.table(with(qlfs, table(EthnicGroup, unemp)), 1) * 100
)
df_pune
```

```
## No Yes
## White 95.91176 4.088237
## Mixed/multiple ethnic background 92.02552 7.974482
## Indian 94.72436 5.275637
## Pakistani 90.52133 9.478673
## Bangladeshi 94.11765 5.882353
## Chinese 96.54179 3.458213
## Other Asian background 94.61538 5.384615
## Black/African/Caribbean/Black British 90.75044 9.249564
## Other ethnic group 94.04040 5.959596
```

`df_pune$EthnicGroup <- rownames(df_pune)`

Visually:
Fig.1 % of unemployed by ethnicity

```
# get percentages
prc_df <- qlfs %>%
  filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
  count(unemp, EthnicGroup) %>%
  group_by(EthnicGroup) %>%
  mutate(percent = (n / sum(n)) * 100) %>%
  ungroup() %>%
  filter(unemp == "Yes")
# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
  geom_point(stat = "identity") +
  # Add labels
  labs(title = paste(" "), x = "% of Unemployed", y = "Ethnic Group") +
  theme_classic() +
  theme(axis.text = element_text(size = 14))
```

How do we know whether these percentages are statistically different if they are based on a random sample from a larger population? We are using estimates from a sample to make guesses about the population. Our sample estimates may or may not be similar to the population values. There is uncertainty due to *sampling error*: a different sample may produce different estimates. We know, however, that larger samples tend to produce more reliable sample estimates.
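Sampling error can be illustrated with a short simulation. This is a sketch under stated assumptions: the population below is artificial (100,000 people with a 5% unemployment rate), and the names `population` and `sample_pct` are hypothetical. It draws repeated samples of two sizes and compares how much the sample percentages vary.

```r
set.seed(1)  # make the simulation reproducible
# hypothetical population: 100,000 people, 5% unemployed (1 = unemployed)
population <- rep(c(1, 0), times = c(5000, 95000))
# draw 1,000 random samples of size n and return the % unemployed in each
sample_pct <- function(n) {
  replicate(1000, mean(sample(population, n)) * 100)
}
pct_small <- sample_pct(100)
pct_large <- sample_pct(2000)
# spread of the sample estimates around the true value of 5%
sd(pct_small)
sd(pct_large)
```

The estimates from the larger samples cluster more tightly around the population value, which is what "more reliable" means here.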

There are two issues then:

- We don't know if our sample estimates are close to the population values, e.g. gender composition.
- We can't be certain whether the percentages for different groups are really different.

To deal with these issues, we use two related approaches:

- Confidence Intervals
- Hypothesis Testing

**In this session, we cover the theory of hypothesis testing (we will return to its practice in the regression analysis session) and elaborate on confidence intervals, which address the problem we are interested in: assessing whether percentages across groups are statistically different.**

A confidence interval provides additional information on the variability around a point estimate. It is a range of values for a population parameter, calculated using a single sample.

A specific confidence interval will either contain or will not contain the population parameter.

We often use a 95% level of confidence (LOC): if we drew many samples and constructed a confidence interval around each sample estimate, about 95% of those intervals would contain the population parameter, assuming that any difference between the sample estimate and the population parameter is attributable to random sampling error alone.

For example, our estimate of unemployment among Bangladeshis is 5.9% (23/391 x 100). This estimate has an associated 95% confidence interval of 3.6% to 8.2%. In other words, we are 95% confident that the percentage of unemployed Bangladeshis in the population as a whole falls in the range 3.6% to 8.2%.

*Level of significance*: \[ \alpha = {100\% - LOC} \]

So if 95%:

\[ 5\% = 100\% - 95\% \]

\[95\% \:CI = p \pm 1.96 \times se(p)\] where se(p) is the standard error of the sample estimate p.

The formula for calculating the standard error for the sample estimate of a percentage is: \[se(p) = \sqrt{ \frac {p \times (100-p)} {N} } \] where `p` = the estimated percentage (`numerator / denominator x 100`) and `N` = the size of the denominator upon which the percentage is based.
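What a 95% level of confidence means in practice can be checked by simulation. The sketch below assumes a hypothetical population percentage (`true_p`) and sample size (`n`); it builds the interval `p ± 1.96 × se(p)` for many simulated samples and counts how often the interval contains the true value. Because this interval is an approximation, the simulated share should be close to, but not exactly, 95%.

```r
set.seed(42)   # make the simulation reproducible
true_p <- 5    # hypothetical population percentage
n <- 1000      # hypothetical sample size
covered <- replicate(2000, {
  x <- rbinom(1, n, true_p / 100)   # number unemployed in one simulated sample
  p <- x / n * 100                  # sample percentage
  se <- sqrt(p * (100 - p) / n)     # standard error of the percentage
  # TRUE if this sample's interval contains the true percentage
  (p - 1.96 * se <= true_p) && (true_p <= p + 1.96 * se)
})
mean(covered) * 100   # share of intervals that contain the true value
```

Any single interval either contains the population value or it does not; the 95% describes the procedure over repeated samples.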

For Bangladeshis who are unemployed:

```
# recall the percentage
p <- prc_df[5,4]
p
```

```
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 5.88
```

```
# denominator
NG <- qlfs %>%
  filter(!is.na(unemp), !is.na(EthnicGroup)) %>%
  count(EthnicGroup)
N <- NG[5,2]
```

```
# standard error
se <- ( (p * (100 - p)) / N)^0.5
se
```

```
## percent
## 1 1.189933
```

```
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)
ci95_lb
```

```
## percent
## 1 3.550083
```

`ci95_ub`

```
## percent
## 1 8.214623
```
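The steps above can be wrapped in a small reusable function. This is only a sketch: `ci_percent` is a hypothetical helper name, not a function from any package, and it applies the same standard error formula and the same 1.96 multiplier used in the text.

```r
# hypothetical helper: CI for a percentage, given numerator x and denominator n
ci_percent <- function(x, n, z = 1.96) {
  p <- x / n * 100                 # estimated percentage
  se <- sqrt(p * (100 - p) / n)    # standard error of the percentage
  data.frame(percent = p, se = se,
             ci_lb = p - z * se,   # lower bound
             ci_ub = p + z * se)   # upper bound
}
# Bangladeshi example from the text: 23 unemployed out of 391
res <- ci_percent(23, 391)
res
```

Because the arithmetic is vectorised, `x` and `n` can also be vectors of the same length, giving one row per group.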

Calculating the 95% CI for one percentage is all well and good, but often we want to calculate and compare the 95% CIs for multiple percentages at the same time:

Fig.2 CIs

```
p <- prc_df[,4]
se <- ((p * (100 - p)) / NG[,2])^0.5
ci95_lb <- p - (1.96 * se)
ci95_ub <- p + (1.96 * se)
prc_df <- data.frame(prc_df, ci95_lb, ci95_ub)
colnames(prc_df) <- c("unemp", "EthnicGroup", "n", "percent", "ci95_lb", "ci95_ub")
# plot
ggplot(data = prc_df, aes(y = EthnicGroup, x = percent)) +
  geom_point(stat = "identity") +
  geom_segment(aes(y = EthnicGroup, yend = EthnicGroup, x = ci95_lb, xend = ci95_ub)) +
  # Add labels
  labs(title = paste(" "), x = "% of Unemployed", y = "Ethnic Group") +
  theme_classic() +
  # Axis label size
  theme(axis.text = element_text(size = 14))
```

We can draw various conclusions from this graph:

- The narrower the confidence interval, the more confident we are about a given survey estimate.
- We are more confident about the survey estimates of the unemployment rate for some ethnic groups than for others.
- This is mainly a reflection of the size of the ethnic group sample. The fewer respondents from a given ethnic group in the survey, the wider the confidence interval.
- If two confidence intervals overlap, then we can't be 95% confident which unemployment rate is larger in the population as a whole, e.g. the survey estimate of the unemployment rate for Indians looks notably higher than the rate for the Chinese. However, since the confidence intervals overlap, we can't be 95% confident that this is true in the population as a whole.
- If two confidence intervals do not overlap, then we can be at least 95% confident that one rate is higher than the other in the population as a whole, e.g. the 95% CIs of the survey estimates for the Black and Chinese unemployment rates do not overlap; therefore we are at least 95% confident that in the population as a whole the Black unemployment rate is higher than the Chinese unemployment rate.

Hypothesis testing is used to formally test a hypothesis, i.e. to test whether there is enough statistical evidence for a claim.

- State your hypothesis, e.g. group percentages are equal. Two hypotheses:

*Null hypothesis*: Claim or assumption to be tested, e.g. group percentages are equal

*Alternative hypothesis*: Alternative claim if the evidence doesn't support the null hypothesis, e.g. group percentages are different, or smaller than, or greater than

- Establish a decision rule based on a LOC (or significance level), e.g. 95% (or 5%):

Not significant at 5%, if \(|z|\) < 1.96 (or *p-value* \(>\) 5%)

Significant at 5%, if \(|z| \ge\) 1.96 (or *p-value* \(\le\) 5%)

*p-value: the probability of the observed (or a more extreme) result, assuming that the null hypothesis is true. The smaller the p-value, the higher the significance, indicating that the hypothesis under consideration may not adequately explain the observation.*

- Compute the test statistic (z)

- Interpretation / conclusion of the results
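The steps above can be sketched by hand for two groups. The counts below are hypothetical (they are not QLFS figures), and the test statistic shown is the standard two-proportion z statistic with a pooled proportion under the null hypothesis of equal proportions, which is an assumption about the form of z intended here.

```r
# hypothetical counts: unemployed (x) and group size (n) for two groups
x <- c(40, 65)
n <- c(1000, 1000)
p_hat <- x / n                                  # sample proportions
p_pool <- sum(x) / sum(n)                       # pooled proportion under the null
se_diff <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- (p_hat[1] - p_hat[2]) / se_diff            # test statistic
p_value <- 2 * pnorm(-abs(z))                   # two-sided p-value
significant <- abs(z) >= 1.96                   # decision rule at the 5% level
c(z = z, p_value = p_value)
significant
```

For the same counts, `prop.test(x, n, correct = FALSE)` reports an X-squared statistic equal to the square of this z, so the two routes lead to the same decision.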

Testing the difference in unemployment between Chinese (3.5%) and Other Asian background (5.4%).

```
# get n of unemployed people
n_une <- prc_df[6:7, 3]
# total population
n_pop <- as.numeric(unlist(NG[6:7, 2])) # unlist simplifies the structure of the data to produce a vector
prop.test(x = n_une, n = n_pop)
```

```
##
## 2-sample test for equality of proportions with continuity correction
##
## data: n_une out of n_pop
## X-squared = 1.5542, df = 1, p-value = 0.2125
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.046256039 0.007727997
## sample estimates:
## prop 1 prop 2
## 0.03458213 0.05384615
```

The output:

- the value of Pearson's chi-squared test statistic
- a p-value
- a 95% confidence interval for the difference between the two proportions
- an estimated probability of success, i.e. the proportion of unemployed people in each of the two groups

The *p-value* of the test is 0.2125, which is greater than the significance level \(\alpha\) = 0.05. We can conclude that the proportions of unemployed people among the Chinese and Other Asian background groups are *not* significantly different.

This is equivalent to the conclusion we would reach using CIs.

*p-values* are often used or interpreted incorrectly. There has been an intense debate about moving beyond `p-value < 5%`. Read here, here, here, here and here.

Some common misunderstandings regarding *p-values*:

- The 0.05 significance level is arbitrary
- The p-value does not prove the null hypothesis is true, but indicates the degree of compatibility between the dataset and the hypothesis
- The p-value does not indicate the size or importance of the observed effect

Solution: analyse data in multiple ways to see whether different analyses converge on the same answer. Logic, background knowledge and experimental design should be considered alongside p-values and similar metrics to reach a conclusion and decide on its certainty.
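To illustrate that a p-value does not indicate the size of an effect, the sketch below runs `prop.test()` on two sets of hypothetical counts with exactly the same difference in proportions (4% versus 5%) but very different sample sizes. The counts and the object names `test_small` and `test_large` are made up for illustration.

```r
# hypothetical counts: the same 4% vs 5% difference at two sample sizes
test_small <- prop.test(x = c(8, 10), n = c(200, 200))
test_large <- prop.test(x = c(800, 1000), n = c(20000, 20000))
test_small$estimate   # sample proportions: 0.04 and 0.05
test_large$estimate   # the same proportions
test_small$p.value    # not significant at the 5% level
test_large$p.value    # significant at the 5% level
```

The estimated difference is identical in both cases; only the amount of data, and therefore the p-value, changes.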

Function | Description |
---|---|
mutate() | create new variables |
filter() | select observations based on their values |
prop.table() | compute proportions |
!is.na() | is not NA |
%>% | pipe operator to chain functions together |
group_by(), ungroup() | group/ungroup based on categorical variables |
prop.test() | test of equal or given proportions |