Fitting Logistic Regression with Aggregate Data in R

Example Using Simulated Migration Data

Francisco Rowe

2021-09-28

Aim

This notebook illustrates three ways to estimate a logistic regression model based on individual-level data and aggregate data and arrive to the same model estimates.

Dependencies

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data

For the purposes of this notebook, let’s assume we are interested in fitting a migration model and we start with individual level data. We assume that the decision to migration is a function of gender and age. The decision to migrate is captured with a binary variable: 1 if an individual migrated in the last year; 0 otherwise. Here we use simulated data so let’s hope that our estimates will be representative of a real migration process.

rm(list=ls())
set.seed(5)
df <- tibble(gender = as.factor(sample(c("male","female"), 
                                       300, 
                                       replace = TRUE, 
                                       prob=c(0.6, 0.4))),
           age_bracket = as.factor(sample(c("[15<25]","[25-45]", "[45-65]"), 
                                          300, 
                                          replace = TRUE, 
                                          prob=c(0.35, 0.55, 0.1))),
           migrate = rbinom(300, 
                             1, 
                             prob = 0.2))

df
## # A tibble: 300 x 3
##    gender age_bracket migrate
##    <fct>  <fct>         <int>
##  1 male   [15<25]           1
##  2 female [25-45]           0
##  3 female [15<25]           0
##  4 male   [25-45]           1
##  5 male   [25-45]           1
##  6 female [45-65]           0
##  7 male   [15<25]           0
##  8 female [25-45]           0
##  9 female [25-45]           0
## 10 male   [15<25]           0
## # … with 290 more rows

Logistic Regression on Individual-Level Data

Let’s first a logistic regression on individual level data

eq1 <- migrate ~ gender + age_bracket
m1 <- glm(eq1, data = df, family = binomial("logit"))
summary(m1)
## 
## Call:
## glm(formula = eq1, family = binomial("logit"), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8081  -0.7528  -0.6683  -0.6564   1.8116  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.07602    0.30248  -3.557 0.000375 ***
## gendermale         -0.30941    0.28676  -1.079 0.280595    
## age_bracket[25-45] -0.04005    0.30686  -0.131 0.896158    
## age_bracket[45-65]  0.12441    0.48543   0.256 0.797731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.14  on 299  degrees of freedom
## Residual deviance: 314.70  on 296  degrees of freedom
## AIC: 322.7
## 
## Number of Fisher Scoring iterations: 4

Logistic Regression on Aggregate Data

Now let’s fit the same model on aggregate data. To this end, we first need to aggregate the data:

df_agg<-df %>% group_by(gender, age_bracket) %>%
  summarise( pop_count = n(), 
             moves = sum(migrate)) %>%
  ungroup() %>% 
  mutate(prob_mig = moves / pop_count)
## `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
df_agg
## # A tibble: 6 x 5
##   gender age_bracket pop_count moves prob_mig
##   <fct>  <fct>           <int> <int>    <dbl>
## 1 female [15<25]            33     8    0.242
## 2 female [25-45]            65    16    0.246
## 3 female [45-65]            20     6    0.3  
## 4 male   [15<25]            68    14    0.206
## 5 male   [25-45]           103    20    0.194
## 6 male   [45-65]            11     2    0.182

Let’s now estimate the model:

eq2 <- prob_mig ~ gender + age_bracket
m2 <- glm(eq2,
          data=df_agg,
          weights = pop_count, 
          family = binomial("logit"))
summary(m2)
## 
## Call:
## glm(formula = eq2, family = binomial("logit"), data = df_agg, 
##     weights = pop_count)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
## -0.15696  -0.01099   0.21226   0.11797   0.00952  -0.31889  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.07602    0.30248  -3.557 0.000375 ***
## gendermale         -0.30941    0.28676  -1.079 0.280595    
## age_bracket[25-45] -0.04005    0.30686  -0.131 0.896158    
## age_bracket[45-65]  0.12441    0.48543   0.256 0.797731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.62887  on 5  degrees of freedom
## Residual deviance: 0.18551  on 2  degrees of freedom
## AIC: 30.793
## 
## Number of Fisher Scoring iterations: 3

Compare these estimates to our individual-level model and note that we arrive to the same model estimates!

Logistic Regression Using cbind

Let’s try a second alternative and use cbind. It is useful to know this if you ever want to fit a multinomial logistic model in R using the package VGAM

df_agg$stayers <- df_agg$pop_count - df_agg$moves

eq3 <- cbind(moves, stayers) ~ gender + age_bracket

m3 <- glm(eq3, 
          data = df_agg,
          family = binomial("logit"))
summary(m3)
## 
## Call:
## glm(formula = eq3, family = binomial("logit"), data = df_agg)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
## -0.15696  -0.01099   0.21226   0.11797   0.00952  -0.31889  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.07602    0.30248  -3.557 0.000375 ***
## gendermale         -0.30941    0.28676  -1.079 0.280595    
## age_bracket[25-45] -0.04005    0.30686  -0.131 0.896158    
## age_bracket[45-65]  0.12441    0.48543   0.256 0.797731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.62887  on 5  degrees of freedom
## Residual deviance: 0.18551  on 2  degrees of freedom
## AIC: 30.793
## 
## Number of Fisher Scoring iterations: 3

For more on logistic regression, check out this course: Introduction to Statistical Learning in R