@fcorowe
)Let’s start by identifying a problem / a research question. This will give a purpose for our analysis and define what what want to do. We may want to understand why people are moving out from certain areas of the country, or we may want to understand why they may be moving to certain destinations. Both questions are equally important from a range of different perspectives.
Can you identify reasons?
We have identified our purpose. Let’s aim to understand the extent of spatial variation in net migration gains and losses i.e. certain areas have experience population growth due to migration while others have registered population decline.
We’ll work with the same data describe in our first session of today Mapping data.
At this stage, an important aspect is to define a metric or set of metrics which we can use to quantify the extent of spatial variation in net migration gains and losses. Here we’ll use a simple yet useful metric: net migration flows.
# clean workspace
rm(list=ls())
# load data
df_long <- read_csv("../data/internal_migration/Detailed_Estimates_2020_LA_2021_Dataset_1.csv")
# id for origins and destinations
orig_la_nm <- as.data.frame(unique(df_long$OutLA))
dest_la_nm <- as.data.frame(unique(df_long$InLA))
# read shapefile
la_shp <- st_read("../data/Local_Authority_Districts_(May_2021)_UK_BFE_V3/LAD_MAY_2021_UK_BFE_V2.shp")
## Reading layer `LAD_MAY_2021_UK_BFE_V2' from data source `/Users/Franciscorowe 1/Dropbox/Francisco/Research/github_projects/courses/udd_gds_course/data/Local_Authority_Districts_(May_2021)_UK_BFE_V3/LAD_MAY_2021_UK_BFE_V2.shp' using driver `ESRI Shapefile'
## Simple feature collection with 374 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -116.1928 ymin: 5333.81 xmax: 655989 ymax: 1220310
## projected CRS: OSGB 1936 / British National Grid
# out-migration
outflows <- df_long %>%
group_by(OutLA) %>%
dplyr::summarise( n = sum(Moves, na.rm = T))
# in-migration
inflows <- df_long %>%
group_by(InLA) %>%
dplyr::summarise( n = sum(Moves, na.rm = T))
# net migration
indicators <- full_join(outflows,
inflows,
by = c("OutLA" = "InLA")) %>%
mutate_if(is.numeric, ~replace(., is.na(.), 0)) %>%
mutate_if(is.numeric, round) %>%
rename(
outflows = n.x,
inflows = n.y
) %>%
mutate(
netflows = (inflows - outflows)
)
# join dfs
la_shp <- left_join(la_shp, indicators, by = c("LAD21CD" = "OutLA"))
# id for country name initial
la_shp$ctry_nm <- substr(la_shp$LAD21CD, 1, 1)
la_shp$ctry_nm <- as.factor(la_shp$ctry_nm)
# simplify boundaries
la_shp_simple <- st_simplify(la_shp,
preserveTopology =T,
dTolerance = 1000) # 1km
# ensure geometry is valid
la_shp_simple <- sf::st_make_valid(la_shp_simple)
Before diving into more sophisticated analysis, a good starting point is to run exploratory spatial data analysis (ESDA). ESDAs are usually divided into two main groups: (1) global spatial autocorrelation: which focuses on the overall trend or the degree of spatial clustering in a variable;
(2) local spatial autocorrelation: which focuses on spatial instability: the departure of parts of a map from the general trend. it is useful to identify hot or cold spots.
Recall: Spatial autocorrelation relates to the degree to which the similarity in values between observations in a variable in neighbouring areas.
A key idea to develop some intuition here is the idea of spatial randomness i.e. a situation in which values of an observation is unrelated to location, and therefore a variable’s distribution does not follow a no discernible pattern over space.
Spatial autocorrelation can be defined as the “absence of spatial randomness”. This gives rise to two main classes of autocorrelation:
(1) Positive spatial autocorrelation: when similar values tend to group together in similar locations; and,
(2) Negative spatial autocorrelation, where similar values tend to be dispersed and further apart from each other in nearby locations.
Here we will explore spatial autocorrelation looking at how we can identify its presence, nature, and strength.
Let’s start with some simple exploration of the data creating a scatterplot.
What can this tell us?
ggplot(la_shp_simple, aes(x = outflows, y = inflows)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("In-migration") +
xlab("Out-migration") +
theme_classic()
Now let’s run a simple linear regression model. Since we know inflows and outflows are highly correlated, we will use only outflows and avoid potential problems of multicollinearity.
eq1 <- netflows ~ outflows
m1 <- lm(
eq1,
la_shp_simple
)
summary(m1)
##
## Call:
## lm(formula = eq1, data = la_shp_simple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3778.6 -1466.9 -541.3 632.3 19349.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2812.08126 211.60203 13.29 <2e-16 ***
## outflows -0.56483 0.02902 -19.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2735 on 329 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 0.5351, Adjusted R-squared: 0.5337
## F-statistic: 378.7 on 1 and 329 DF, p-value: < 2.2e-16
This initially exploration is helpful but it does not tell use anything about how net migration outcomes are distributed across space. Maps are nice tools for this.
What do we learn from this?
tm_shape(la_shp_simple) +
tm_fill(col = "netflows", style = "equal", palette = viridis(6), title = "Net migration") +
tm_borders(lwd = 0) +
tm_facets(by = "ctry_nm", ncol = 2) +
tm_layout(legend.title.size = 1,
legend.text.size = 0.6,
legend.position = c("right","top"),
legend.bg.color = "white",
legend.bg.alpha = 1)
Yes, there is some spatial pattering: similar values tend to cluster together in space.
How can we measure this apparently spatial clustering or spatial dependence? Is it statistically significant?
To measure spatial dependence and further explore it, we will need to create an spatial lag. An spatial lag is the product of a spatial weight matrix and a given variable. The spatial lag of a variable is the average value of that variable in the neighborhood; that is, using the values of all the areas which are defined as neighbours; hence, the concept of spatial lag is inherently related to the concept of spatial weight matrix.
So first let’s build and standardise a spatial weight matrix. For this example, we’ll use the 10 k nearest neighbours.
Try other spatial weights matrices definitions
# replace nas with 0s to avoid issues
la_shp_simple <- la_shp_simple %>% mutate_if(
is.numeric, ~replace(., is.na(.), 0)
)
# create knn list
coords <- st_centroid(st_geometry(la_shp_simple))
col_knn <- knearneigh(coords, k=10)
# create nb object
hnb <- knn2nb(col_knn)
# create spatial weights matrix (note it row-standardizes by default)
hknn <- nb2listw(hnb)
hknn
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 374
## Number of nonzero links: 3740
## Percentage nonzero weights: 2.673797
## Average number of links: 10
## Non-symmetric neighbours list
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 374 139876 374 65.56 1532.14
Have a go at interpreting the summary of the spatial weight matrix
Once we have built a spatial weights matrix, we can compute an spatial lag. A spatial lag offers a quantitative way to represent spatial dependence, specifically the degree of connection between geographic units.
Remember: the spatial lag is the product of a spatial weights matrix and a given variable and amounts to the average value of the variable in the neighborhood of each variable’s value.
We use the row-standardised matrix for this and compute the spatial lag of the migration outflows.
outflows_lag <- lag.listw(hknn, la_shp_simple$outflows)
head(outflows_lag)
## [1] 4533.2 4712.7 5027.6 4334.3 4706.2 3357.4
The way to interpret the spatial lag outflows_lag
for the first observation: Hartlepool, where 2,660 people out-migrated is surrounded by neighbouring local authorities where, on average, 4,533 people also left.
We first start exploring global spatial autocorrelation. To this end, we will focus on the Moran Plot and Moran’s I statistics.
The moran plot is a way of visualising the nature and strength of spatial autocorrelation. It’s essentially an scatter plot between a variable and its spatial lag. To more easily interpret the plot, variables are standardised.
ggplot(la_shp_simple, aes(x = outflows, y = outflows_lag)) +
geom_point() +
geom_smooth(method = "lm") +
ylab("Out-migration lag") +
xlab("Out-migration") +
theme_classic()
la_shp_simple_eng <- cbind(la_shp_simple, as.data.frame(outflows_lag))
la_shp_simple_eng <- la_shp_simple_eng %>% filter(
ctry_nm == "E"
) %>%
mutate(
st_outflows = ( outflows - mean(outflows)) / sd(outflows),
st_outflows_lag = ( outflows_lag - mean(outflows_lag)) / sd(outflows_lag)
)
In a standardised Moran Plot, average values are centered around zero and dispersion is expressed in standard deviations. The rule of thumb is that values greater or smaller than two standard deviations can be considered outliers. A standardised Moran Plot can also be used to visualise local spatial autocorrelation.
Do you recall what local spatial autocorrelation is?
We can observe local spatial autocorraltion by partitioning the Moran Plot into four quadrants that represent different situations:
ggplot(la_shp_simple_eng, aes(x = st_outflows, y = st_outflows_lag)) +
geom_point() +
geom_smooth(method = "lm") +
geom_hline(yintercept = 0, color = "grey", alpha =.5) +
geom_vline(xintercept = 0, color = "grey", alpha =.5) +
ylab("Out-migration lag") +
xlab("Out-migration") +
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
What do we learn from the Moran Plot?
To measure global spatial autocorrelation, we can use the Moran’s I. The Moran Plot and intrinsicaly related. The value of Moran’s I corresponds with the slope of the linear fit on the Moran Plot. We can compute it by running:
moran.test(la_shp_simple$outflows, listw = hknn, zero.policy = TRUE, na.action = na.omit)
##
## Moran I test under randomisation
##
## data: la_shp_simple$outflows
## weights: hknn
##
## Moran I statistic standard deviate = 16.415, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.3438737067 -0.0026809651 0.0004457342
What does the Moran’s I tell us?
Finally let’s explore how we can use our spatial lag variable in a regression model and what it can tell us. So far, we have measured spatial dependence in isolation. But that spatial dependence could be associated to a particular factor that could be explicitly measured and included in a model. So it is worth considering spatial dependence in a wider context, analysing its degree as other variables are accounted in a regression model. We can do this plugging our spatial lag variable into a regression model.
eq2 <- netflows ~ outflows + outflows_lag
m2 <- lm(
eq2,
la_shp_simple
)
summary(m2)
##
## Call:
## lm(formula = eq2, data = la_shp_simple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3821.6 -1643.4 -561.3 649.2 19979.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2080.16270 239.26428 8.694 <2e-16 ***
## outflows -0.52909 0.03319 -15.943 <2e-16 ***
## outflows_lag 0.05764 0.05505 1.047 0.296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2710 on 371 degrees of freedom
## Multiple R-squared: 0.4856, Adjusted R-squared: 0.4828
## F-statistic: 175.1 on 2 and 371 DF, p-value: < 2.2e-16
We can see that the coefficient associated with the spatial lag is positive but not statistically significant. This would imply that the average number of people leaving neighbouring local authority districts are not necessarily related to the volume of people leaving a given area. It could be related to additional factors of the same area in question.
Final Note: Introducing a spatial lag of an explanatory variable, as we have done here, is the most straightforward way of incorporating the notion of spatial dependence in a linear regression framework. It does not require additional changes to the modelling structure, can be estimated via OLS and the interpretation is similar to interpreting non-spatial variables. However, other model specifications are more common in the field of spatial econometrics, specifically: the spatial lag and spatial error model. While both built on the notion of spatial lag, they require a different modelling and estimation strategy.
Rowe, F., Arribas-Bel, D. 2021. Spatial Modelling for Data Scientists.
Excellent references to continue your learning on spatial econometrics are:
Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Vol. 4. Springer Science & Business Media.
Anselin, Luc. 2003. Spatial Externalities, Spatial Multipliers, and Spatial Econometrics. International Regional Science Review 26 (2): 153–66.
Anselin, Luc, and Sergio J. Rey. 2014. Modern Spatial Econometrics in Practice: A Guide to Geoda, Geodaspace and Pysal. GeoDa Press LLC.