This notebook illustrates three ways to estimate a logistic regression model based on individual-level data and aggregate data and arrive to the same model estimates.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For the purposes of this notebook, let’s assume we are interested in fitting a migration model and we start with individual level data. We assume that the decision to migration is a function of gender
and age
. The decision to migrate is captured with a binary variable: 1 if an individual migrated in the last year; 0 otherwise. Here we use simulated data so let’s hope that our estimates will be representative of a real migration process.
rm(list=ls())
set.seed(5)
df <- tibble(gender = as.factor(sample(c("male","female"),
300,
replace = TRUE,
prob=c(0.6, 0.4))),
age_bracket = as.factor(sample(c("[15<25]","[25-45]", "[45-65]"),
300,
replace = TRUE,
prob=c(0.35, 0.55, 0.1))),
migrate = rbinom(300,
1,
prob = 0.2))
df
## # A tibble: 300 x 3
## gender age_bracket migrate
## <fct> <fct> <int>
## 1 male [15<25] 1
## 2 female [25-45] 0
## 3 female [15<25] 0
## 4 male [25-45] 1
## 5 male [25-45] 1
## 6 female [45-65] 0
## 7 male [15<25] 0
## 8 female [25-45] 0
## 9 female [25-45] 0
## 10 male [15<25] 0
## # … with 290 more rows
Let’s first a logistic regression on individual level data
eq1 <- migrate ~ gender + age_bracket
m1 <- glm(eq1, data = df, family = binomial("logit"))
summary(m1)
##
## Call:
## glm(formula = eq1, family = binomial("logit"), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8081 -0.7528 -0.6683 -0.6564 1.8116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.07602 0.30248 -3.557 0.000375 ***
## gendermale -0.30941 0.28676 -1.079 0.280595
## age_bracket[25-45] -0.04005 0.30686 -0.131 0.896158
## age_bracket[45-65] 0.12441 0.48543 0.256 0.797731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 316.14 on 299 degrees of freedom
## Residual deviance: 314.70 on 296 degrees of freedom
## AIC: 322.7
##
## Number of Fisher Scoring iterations: 4
Now let’s fit the same model on aggregate data. To this end, we first need to aggregate the data:
df_agg<-df %>% group_by(gender, age_bracket) %>%
summarise( pop_count = n(),
moves = sum(migrate)) %>%
ungroup() %>%
mutate(prob_mig = moves / pop_count)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
df_agg
## # A tibble: 6 x 5
## gender age_bracket pop_count moves prob_mig
## <fct> <fct> <int> <int> <dbl>
## 1 female [15<25] 33 8 0.242
## 2 female [25-45] 65 16 0.246
## 3 female [45-65] 20 6 0.3
## 4 male [15<25] 68 14 0.206
## 5 male [25-45] 103 20 0.194
## 6 male [45-65] 11 2 0.182
Let’s now estimate the model:
eq2 <- prob_mig ~ gender + age_bracket
m2 <- glm(eq2,
data=df_agg,
weights = pop_count,
family = binomial("logit"))
summary(m2)
##
## Call:
## glm(formula = eq2, family = binomial("logit"), data = df_agg,
## weights = pop_count)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.15696 -0.01099 0.21226 0.11797 0.00952 -0.31889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.07602 0.30248 -3.557 0.000375 ***
## gendermale -0.30941 0.28676 -1.079 0.280595
## age_bracket[25-45] -0.04005 0.30686 -0.131 0.896158
## age_bracket[45-65] 0.12441 0.48543 0.256 0.797731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.62887 on 5 degrees of freedom
## Residual deviance: 0.18551 on 2 degrees of freedom
## AIC: 30.793
##
## Number of Fisher Scoring iterations: 3
Compare these estimates to our individual-level model and note that we arrive to the same model estimates!
Let’s try a second alternative and use cbind
. It is useful to know this if you ever want to fit a multinomial logistic model in R using the package VGAM
df_agg$stayers <- df_agg$pop_count - df_agg$moves
eq3 <- cbind(moves, stayers) ~ gender + age_bracket
m3 <- glm(eq3,
data = df_agg,
family = binomial("logit"))
summary(m3)
##
## Call:
## glm(formula = eq3, family = binomial("logit"), data = df_agg)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.15696 -0.01099 0.21226 0.11797 0.00952 -0.31889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.07602 0.30248 -3.557 0.000375 ***
## gendermale -0.30941 0.28676 -1.079 0.280595
## age_bracket[25-45] -0.04005 0.30686 -0.131 0.896158
## age_bracket[45-65] 0.12441 0.48543 0.256 0.797731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.62887 on 5 degrees of freedom
## Residual deviance: 0.18551 on 2 degrees of freedom
## AIC: 30.793
##
## Number of Fisher Scoring iterations: 3
For more on logistic regression, check out this course: Introduction to Statistical Learning in R