4 Spatial Interaction Modelling

This chapter introduces origin-destination thinking through activity-space flows in Uruguay. It aims to illustrate appropriate approaches to understand the structure of origin-destination (OD) data tables and model interactions between places of origin and destination. These techniques are useful to analyse and model any form of spatial interaction data capturing the intensity between a set of origins and destinations such as internal migration, commuting, trade and internet data flows (Rowe and Patias 2020; Fotheringham and O’Kelly 1989).

By the end of this chapter, you should be able to:

Describe origin-destination flow data, aggregate activity-space records into a usable OD table, and distinguish between local retention and non-local flows.
Join OD data to Uruguay ADM2 polygons and derived centroids, compute inter-area distances, and visualise the main flow structure with maps and simple summaries.
Fit and interpret a simple gravity-style spatial interaction model in R, assess its fit, and use it to generate basic flow predictions.

4.1 Dependencies

library(tidyverse)
library(sf)
library(spData)
library(broom)
library(scales)
library(viridis)

# Load the shared course plotting and mapping style
source("style/data-visualisation_theme.R")

4.2 Data

We use Meta’s activity-space data on the places that Facebook users tend to visit from where they live. The file used here is a Uruguay-only subset of a larger export. Each raw row links a home area to a visited area and reports a visit_fraction. In this chapter, visit_fraction is the raw flow-strength variable in the downloaded file. The release contains records for one date and distinguishes between daytime and nighttime activity spaces.

An important detail is that the original data reports repeated observations for an OD pair. The same home_gadm_id and visit_gadm_id can appear various times because multiple fine-grained observations are mapped into the same GADM polygons. As a result, the raw row count is larger than the number of distinct polygon OD pairs. For our purposes, we will focus on analysis at the ADM2 level and daytime patterns. We therefore filter to daytime only and then aggregate repeated rows to one OD record per origin-destination pair using the average raw visit_fraction. In the aggregated table, that average is stored in a new column called visit_share. We will do this data transformation in the next section.

Task

Ensure you understand all the data files we are processing.

activity_path <- "data/fb-activity-spaces/activity_space_distributions_20260209_t_to_z.csv"
gadm_path <- "data/boundaries/gadm36_URY_gpkg/gadm36_URY.gpkg"
label_path <- "data/boundaries/gadm36_ury_adm2_labels.csv"

# Check that the main input files are available before reading anything
if (!file.exists(activity_path)) {
  stop("The chapter expects the fb-activity-spaces CSV to be present in data/fb-activity-spaces/.")
}

if (!file.exists(gadm_path)) {
  stop("The chapter expects the GADM 3.6 Uruguay GeoPackage in data/boundaries/gadm36_URY_gpkg/.")
}

if (!file.exists(label_path)) {
  stop("The chapter expects the GADM 3.6 label crosswalk in data/boundaries/gadm36_ury_adm2_labels.csv.")
}

# Read the raw activity-space extract used throughout the chapter
activity_raw <- read_csv(
  activity_path,
  show_col_types = FALSE
) |>
  # Drop the import index column if it is present
  dplyr::select(-any_of("X")) |>
  # Parse the release date field for simple filtering and checking
  mutate(ds = as.Date(ds))

activity_uy <- activity_raw |>
  # Keep only the Uruguay rows required by the course workflow
  filter(country == "UY") |>
  # Standardise names and order the day/night categories for later filtering
  mutate(
    home_gadm_name = str_squish(home_gadm_name),
    visit_gadm_name = str_squish(visit_gadm_name),
    day_or_night = factor(day_or_night, levels = c("daytime", "nighttime"))
  ) |>
  # Drop incomplete OD records before building centroids and distances
  filter(
    !is.na(home_gadm_id),
    !is.na(visit_gadm_id),
    !is.na(home_longitude),
    !is.na(home_latitude),
    !is.na(visit_longitude),
    !is.na(visit_latitude)
  )

# Stop early if the Uruguay subset is unexpectedly empty
if (nrow(activity_uy) == 0) {
  stop("No Uruguay rows were found after filtering country == 'UY'.")
}

# Focus the practical workflow on the daytime activity-space records
activity_daytime <- activity_uy |>
  filter(day_or_night == "daytime")

if (nrow(activity_daytime) == 0) {
  stop("No daytime Uruguay rows were found after filtering day_or_night == 'daytime'.")
}

# Keep the set of ADM2 IDs that appear in the daytime OD system
gadm_ids <- union(unique(activity_daytime$home_gadm_id), unique(activity_daytime$visit_gadm_id))

# Read the cleaned display labels used to replace placeholder GADM names
label_lookup <- read_csv(
  label_path,
  show_col_types = FALSE
) |>
  dplyr::select(gadm_id, label_short, label_full, label_confidence)

# Read the national outline for country-level maps and context
uruguay <- st_read(
  gadm_path,
  layer = "gadm36_URY_0",
  quiet = TRUE
) |>
  st_make_valid()

adm2_polygons <- st_read(
  gadm_path,
  layer = "gadm36_URY_2",
  quiet = TRUE
) |>
  # Keep only the ADM2 polygons referenced by the activity-space data
  filter(GID_2 %in% gadm_ids) |>
  # Repair geometries before mapping and centroid creation
  st_make_valid() |>
  # Keep the fields needed for joins, labels, and mapping
  transmute(
    gadm_id = GID_2,
    gadm_name_raw = NAME_2,
    department_name = NAME_1,
    geometry = geom
  ) |>
  # Join the cleaner teaching labels onto the GADM polygons
  left_join(label_lookup, by = "gadm_id") |>
  mutate(
    # Prefer the cleaned label file for clearer maps and tables
    gadm_name = coalesce(label_short, gadm_name_raw)
  )

adm2_centroids <- adm2_polygons |>
  # Derive one representative point per polygon for distances and flow lines
  st_point_on_surface()

# Extract centroid coordinates so they can be reused in OD line construction
adm2_centroid_coords <- st_coordinates(adm2_centroids)

adm2_centroids <- adm2_centroids |>
  # Store centroid coordinates explicitly for later line construction
  mutate(
    longitude = adm2_centroid_coords[, 1],
    latitude = adm2_centroid_coords[, 2]
  )

# Reproject centroids to a metric CRS before calculating distances
adm2_centroids_utm <- st_transform(adm2_centroids, 32721)

glimpse(activity_daytime)

Rows: 40,813
Columns: 15
$ ...1                <dbl> 478, 593, 929, 1674, 1848, 2057, 2069, 2647, 2793,…
$ home_latitude       <dbl> -32.525, -34.525, -34.475, -34.375, -34.675, -34.5…
$ home_longitude      <dbl> -54.525, -56.025, -56.425, -56.525, -55.875, -56.7…
$ home_gadm_name      <chr> "n.a47", "n.a183", "n.a338", "n.a154", "n.a343", "…
$ home_gadm_id        <chr> "URY.3.3_1", "URY.2.8_1", "URY.2.10_1", "URY.16.3_…
$ home_polygon_level  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ visit_latitude      <dbl> -34.925, -34.925, -33.725, -34.875, -34.875, -33.4…
$ visit_longitude     <dbl> -56.125, -56.125, -53.475, -55.125, -55.125, -54.5…
$ visit_gadm_name     <chr> "n.a354", "n.a354", "n.a289", "n.a339", "n.a339", …
$ visit_gadm_id       <chr> "URY.10.1_1", "URY.10.1_1", "URY.14.3_1", "URY.9.5…
$ visit_polygon_level <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ country             <chr> "UY", "UY", "UY", "UY", "UY", "UY", "UY", "UY", "U…
$ visit_fraction      <dbl> 9.998422e-04, 4.903432e-04, 1.109440e-03, 4.169515…
$ day_or_night        <fct> daytime, daytime, daytime, daytime, daytime, dayti…
$ ds                  <date> 2026-02-09, 2026-02-09, 2026-02-09, 2026-02-09, 2…

Let us first explore the broad structure of the data by summarising key features: origins, destinations and OD pairs.

od_snapshot <- tibble(
  measure = c(
    "Raw rows in the Uruguay daytime subset",
    "Distinct home areas",
    "Distinct visit areas",
    "Distinct observed origin-destination pairs",
    "Possible origin-destination pairs",
    "Observed share of possible pairs",
    "Dates in the released file"
  ),
  value = c(
    nrow(activity_daytime),
    n_distinct(activity_daytime$home_gadm_id),
    n_distinct(activity_daytime$visit_gadm_id),
    n_distinct(paste(activity_daytime$home_gadm_id, activity_daytime$visit_gadm_id)),
    n_distinct(activity_daytime$home_gadm_id) * n_distinct(activity_daytime$visit_gadm_id),
    round(
      n_distinct(paste(activity_daytime$home_gadm_id, activity_daytime$visit_gadm_id)) /
        (n_distinct(activity_daytime$home_gadm_id) * n_distinct(activity_daytime$visit_gadm_id)),
      3
    ),
    n_distinct(activity_daytime$ds)
  )
)

od_snapshot

# A tibble: 7 × 2
  measure                                        value
  <chr>                                          <dbl>
1 Raw rows in the Uruguay daytime subset     40813    
2 Distinct home areas                          173    
3 Distinct visit areas                         203    
4 Distinct observed origin-destination pairs  5569    
5 Possible origin-destination pairs          35119    
6 Observed share of possible pairs               0.159
7 Dates in the released file                     1

4.3 Data Wrangling

As we identified, our dataset has repeated OD observations. For our analysis, we want a unique OD pair that we can analyse. We focus on the daytime records and average the raw visit_fraction within each origin-destination pair so that each row in the OD table represents one directed daytime link with one mean flow-strength value. In the code below, that aggregated mean is stored in the new column visit_share.

od_total <- activity_daytime |>
  # Aggregate repeated daytime rows to one OD record per polygon pair
  group_by(
    home_gadm_id,
    home_gadm_name,
    visit_gadm_id,
    visit_gadm_name
  ) |>
  summarise(
    visit_share = mean(visit_fraction, na.rm = TRUE),
    n_raw_rows = n(),
    .groups = "drop"
  )
head(od_total)

# A tibble: 6 × 6
  home_gadm_id home_gadm_name visit_gadm_id visit_gadm_name visit_share
  <chr>        <chr>          <chr>         <chr>                 <dbl>
1 URY.1.1_1    n.a2           URY.1.1_1     n.a2               0.145   
2 URY.1.1_1    n.a2           URY.1.2_1     n.a3               0.00512 
3 URY.1.1_1    n.a2           URY.1.3_1     n.a4               0.00361 
4 URY.1.1_1    n.a2           URY.1.5_1     n.a6               0.0103  
5 URY.1.1_1    n.a2           URY.1.6_1     n.a7               0.000677
6 URY.1.1_1    n.a2           URY.1.7_1     n.a8               0.00453 
# ℹ 1 more variable: n_raw_rows <int>

This is the dataset we will use for the rest of the chapter. Each row now represents one directed daytime link from a home area to a visited area. The raw variable visit_fraction has been averaged within each ADM2 OD pair and stored as visit_share. This keeps the raw repetition in the release from dominating the polygon-level analysis. Once that structure is in place, we can start asking familiar spatial interaction questions: which origins connect to many destinations, which destinations attract many origins, how much movement stays local, and how quickly flows thin out with distance?

Important

Important: For the analysis of spatial interaction data, we are generally interested in understanding the data from three perspectives: origin, destination and their interactions. So we create summary statistics that help us understand all these three dimensions.

Next we will focus on origins. The same principles can be applied to destination areas. Then we will transition to OD pairs which will be the focus of our analysis.

We will draw a distinction between local retention and non-local visits. It is especially useful as it shows that self-links are also part of the OD system rather than an awkward exception to remove without thought. That indicates local activity representing movement within the same ADM2 area. We will explore the outcomes of these transformations in the next section.

origin_summary <- od_total |>
  # Summarise outbound structure for each home area
  group_by(home_gadm_id, home_gadm_name) |>
  summarise(
    # Add up the full average daytime flow sent out by each origin
    total_outbound_share = sum(visit_share),
    # Separate within-area retention from links to other destinations
    local_retention_share = sum(visit_share[home_gadm_id == visit_gadm_id], na.rm = TRUE),
    non_local_share = sum(visit_share[home_gadm_id != visit_gadm_id], na.rm = TRUE),
    # Count how many distinct destinations each origin connects to
    n_destinations = n_distinct(visit_gadm_id),
    # Keep the strongest single outgoing link for each origin
    top_destination_share = max(visit_share, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Replace any missing summaries with zeros before mapping and ranking
  mutate(
    local_retention_share = replace_na(local_retention_share, 0),
    non_local_share = replace_na(non_local_share, 0),
    top_destination_share = replace_na(top_destination_share, 0)
  ) |>
  # Join the cleaned short labels used elsewhere in the chapter
  left_join(
    label_lookup |>
      select(home_gadm_id = gadm_id, home_label = label_short),
    by = "home_gadm_id"
  ) |>
  # Fall back to the original name if no cleaned label is available
  mutate(home_label = coalesce(home_label, home_gadm_name)) |>
  # Rank origins by the strength of their non-local connections
  arrange(desc(non_local_share))

origin_summary |>
  # Show the leading origins as a quick diagnostic table
  slice_head(n = 10) |>
  select(home_gadm_id, home_label, non_local_share, local_retention_share, n_destinations, top_destination_share)

# A tibble: 10 × 6
   home_gadm_id home_label  non_local_share local_retention_share n_destinations
   <chr>        <chr>                 <dbl>                 <dbl>          <int>
 1 URY.18.11_1  Tacuarembó…           0.533                0.0685             11
 2 URY.5.4_1    Durazno un…           0.516                0.0610             18
 3 URY.6.6_1    Flores uni…           0.426                0.0636              6
 4 URY.17.11_1  Soriano un…           0.403                0.113              32
 5 URY.16.4_1   Libertad              0.374                0.0415             10
 6 URY.6.4_1    Flores uni…           0.369                0.0402              5
 7 URY.4.2_1    Carmelo               0.366                0.0488             11
 8 URY.6.3_1    Flores uni…           0.362                0.0728             12
 9 URY.7.3_1    Sarandí Gr…           0.330                0.178              14
10 URY.18.7_1   Tacuarembó…           0.330                0.0699             16
# ℹ 1 more variable: top_destination_share <dbl>

We can prepare the destination summary at the same stage. This keeps the main analytical summary tables together before we move on to maps and modelling.

destination_summary <- od_total |>
  # Summarise inbound attraction for each destination
  group_by(visit_gadm_id, visit_gadm_name) |>
  summarise(
    # Add up the full average daytime flow arriving at each destination
    destination_mass = sum(visit_share),
    # Keep only the inflow coming from other ADM2 areas
    inbound_non_local_share = sum(visit_share[home_gadm_id != visit_gadm_id], na.rm = TRUE),
    # Count how many distinct origins send flow to each destination
    n_origins = n_distinct(home_gadm_id),
    .groups = "drop"
  ) |>
  # Join the cleaned short labels used elsewhere in the chapter
  left_join(
    label_lookup |>
      select(visit_gadm_id = gadm_id, visit_label = label_short),
    by = "visit_gadm_id"
  ) |>
  # Fall back to the original name if no cleaned label is available
  mutate(visit_label = coalesce(visit_label, visit_gadm_name))

4.3.1 Auxiliary: Area identifies and coordinates

For the analysis of spatial interaction data, we often seek to measure distances and create maps focusing on origins, destinations or their interactions. For this purpose, it is generally handy to have two types of auxiliary tables that will help us connect the OD data to the Uruguay ADM2 geography.

We create two lookup tables. The first polygon_lookup keeps the area IDs and names as a regular table. This is useful when we want to join labels or other area attributes back into our OD summaries. The second centroid_lookup keeps one point and its coordinates for each area. This is useful later when we calculate distances between areas and draw flow lines from origins to destinations.

polygon_lookup <- adm2_polygons |>
  # Keep the key polygon identifiers for later joins
  st_drop_geometry()

centroid_lookup <- adm2_centroids |>
  # Keep centroid coordinates alongside the matching GADM IDs
  st_drop_geometry() |>
  select(gadm_id, gadm_name, label_full, label_confidence, department_name, longitude, latitude)

This step is useful because we now have two geographic layers ready for use: one with the ADM2 area boundaries and one with one point for each area. That means we can do three practical operations more easily: (1) map the areas, (2) measure distances between them, and (3) draw lines linking origins to destinations.

4.3.2 Measuring distance

We can now measure distances between origins and destinations. We measure the distance between the representative point of the origin area and of destination. Because distance should be measured in kilometres rather than in longitude and latitude degrees, we do this in a projected CRS. We use EPSG:32721, which works well for Uruguay.

distance_matrix <- st_distance(adm2_centroids_utm)
dimnames(distance_matrix) <- list(adm2_centroids_utm$gadm_id, adm2_centroids_utm$gadm_id)

od_total <- od_total |>
  # Attach centroid-to-centroid distance to every OD pair
  mutate(
    distance_km = map2_dbl(
      home_gadm_id,
      visit_gadm_id,
      \(origin_id, destination_id) {
        as.numeric(distance_matrix[origin_id, destination_id]) / 1000
      }
    )
  )

The result is an additional column to our OD table including a distance value for each OD pair. That makes the later steps easier because the same table can now support summaries, plots, maps and regression models.

distance_summary <- od_total |>
  summarise(
    # Calculate the average distance, giving more weight to stronger flows
    mean_distance_km = weighted.mean(distance_km, visit_share, na.rm = TRUE),
    # Keep the median as a less outlier-sensitive summary of separation
    median_distance_km = median(distance_km, na.rm = TRUE),
    # Measure how much flow stays within the same ADM2 area
    share_within_area = sum(visit_share[distance_km == 0], na.rm = TRUE) / sum(visit_share, na.rm = TRUE),
    # Measure how much flow involves longer-distance links
    share_over_100km = sum(visit_share[distance_km > 100], na.rm = TRUE) / sum(visit_share, na.rm = TRUE)
  ) |>
  # Round the summary measures for easier reading in the chapter
  mutate(across(where(is.numeric), round, 3))

distance_summary

# A tibble: 1 × 4
  mean_distance_km median_distance_km share_within_area share_over_100km
             <dbl>              <dbl>             <dbl>            <dbl>
1             33.3               115.             0.393            0.057

4.4 Data Exploration

We first focus on gaining an descriptive understanding of the patterns of activity spaces. We begin by exploring its overall distribution with a histogram.

plot_flow_distribution <- ggplot(od_total, aes(x = visit_share)) +
  # Combine a histogram and kernel density to describe the OD outcome first
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 60,
    fill = "grey85",
    colour = "white",
    linewidth = 0.2
  ) +
  geom_density(fill = "#2A6F97", alpha = 0.45, colour = "#2A6F97", linewidth = 0.5) +
  labs(
    title = "Distribution of visit activity",
    x = "Average daytime flow strength (`visit_share`)",
    y = "Density"
  ) +
  theme_plot_course()

print(plot_flow_distribution)

What does this histogram tells you? Most places record a low activity or connections with other areas. A very small set of places have residents visiting other places for a large amount of time.

Next we can start looking at the geography of space activity. A common approach is to understand the effect of distance. Typically distance is expected to have a deterrent effect on the interaction between places. We tend to see that in most empirical applications, including migration, commuting and trade. We then analyse how distance influences the frequency of visits.

distance_bands <- od_total |>
  # Group daytime OD links into broad, interpretable distance classes
  mutate(
    distance_band = case_when(
      distance_km == 0 ~ "Within same area",
      distance_km < 25 ~ "0 to <25 km",
      distance_km < 100 ~ "25 to <100 km",
      TRUE ~ "100+ km"
    ),
    distance_band = factor(
      distance_band,
      levels = c("Within same area", "0 to <25 km", "25 to <100 km", "100+ km")
    )
  ) |>
  group_by(distance_band) |>
  summarise(
    total_share = sum(visit_share, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(share_of_daytime_flow = total_share / sum(total_share, na.rm = TRUE))

plot_distance_bands <- ggplot(
  distance_bands,
  aes(x = distance_band, y = share_of_daytime_flow)
) +
  # Show the broad distance profile of the daytime OD system
  geom_col(fill = "#2A6F97", colour = "white", linewidth = 0.2) +
  scale_y_continuous(labels = label_percent(accuracy = 1)) +
  labs(
    title = "Share of visits",
    x = "Centroid-based distance bands",
    y = "Share of visits"
  ) +
  theme_plot_course()

print(plot_distance_bands)

We can then seek to gain a better understanding of places of origins and / or destinations displaying unusual activity i.e. key origins and destinations. To this end, we join the origin and destination summaries (we created) to the ADM2 polygons for mapping.

origin_polygons <- adm2_polygons |>
  # Join origin summaries back to the ADM2 polygons
  left_join(
    origin_summary |>
      select(home_gadm_id, home_label, non_local_share, local_retention_share),
    by = c("gadm_id" = "home_gadm_id")
  )

destination_polygons <- adm2_polygons |>
  # Join destination summaries back to the ADM2 polygons
  left_join(
    destination_summary |>
      select(visit_gadm_id, visit_label, inbound_non_local_share),
    by = c("gadm_id" = "visit_gadm_id")
  )

We explore the origins. What conclusions can you infer from the maps below?

origin_non_local_labels <- adm2_centroids |>
  # Label only the strongest non-local origins to keep the map readable
  left_join(
    origin_summary |>
      select(home_gadm_id, home_label, non_local_share),
    by = c("gadm_id" = "home_gadm_id")
  ) |>
  filter(!is.na(non_local_share)) |>
  slice_max(non_local_share, n = 12, with_ties = FALSE)

map_origin_non_local <- ggplot() +
  # Draw the national outline first for spatial context
  geom_sf(data = uruguay, fill = "grey97", colour = "grey75", linewidth = 0.25) +
  # Fill polygons by their share of non-local daytime activity
  geom_sf(
    data = filter(origin_polygons, !is.na(non_local_share)),
    aes(fill = non_local_share),
    colour = "white",
    linewidth = 0.08
  ) +
  # Label the strongest origins with the cleaned short names
  geom_sf_text(
    data = origin_non_local_labels,
    aes(label = home_label),
    size = 2.2,
    colour = "grey20",
    check_overlap = TRUE
  ) +
  scale_fill_course_seq_c(labels = label_number(accuracy = 0.001)) +
  labs(
    title = "Origin shares of non-local activity to other areas",
    fill = "Non-local\n`visit_share`"
  ) +
  theme_map_course()

print(map_origin_non_local)

Let us focus on local visits. What can we observe?

origin_local_labels <- adm2_centroids |>
  # Label only the strongest local-retention origins
  left_join(
    origin_summary |>
      select(home_gadm_id, home_label, local_retention_share),
    by = c("gadm_id" = "home_gadm_id")
  ) |>
  filter(!is.na(local_retention_share)) |>
  slice_max(local_retention_share, n = 12, with_ties = FALSE)

map_origin_local <- ggplot() +
  # Draw the national outline first for spatial context
  geom_sf(data = uruguay, fill = "grey97", colour = "grey75", linewidth = 0.25) +
  # Fill polygons by their within-area daytime activity
  geom_sf(
    data = filter(origin_polygons, !is.na(local_retention_share)),
    aes(fill = local_retention_share),
    colour = "white",
    linewidth = 0.08
  ) +
  # Label the strongest local-retention origins
  geom_sf_text(
    data = origin_local_labels,
    aes(label = home_label),
    size = 2.2,
    colour = "grey20",
    check_overlap = TRUE
  ) +
  scale_fill_course_seq_c(labels = label_number(accuracy = 0.001)) +
  labs(
    title = "Origin share of local activity",
    fill = "Local\n`visit_share`"
  ) +
  theme_map_course()

print(map_origin_local)

Let us change our perspective and now look at areas of destination.

destination_labels <- adm2_centroids |>
  # Label only the destinations with the strongest non-local inflow
  left_join(
    destination_summary |>
      select(visit_gadm_id, visit_label, inbound_non_local_share),
    by = c("gadm_id" = "visit_gadm_id")
  ) |>
  filter(!is.na(inbound_non_local_share)) |>
  slice_max(inbound_non_local_share, n = 12, with_ties = FALSE)

map_destination_non_local <- ggplot() +
  # Draw the national outline first for spatial context
  geom_sf(data = uruguay, fill = "grey97", colour = "grey75", linewidth = 0.25) +
  # Fill polygons by the non-local inflow they receive
  geom_sf(
    data = filter(destination_polygons, !is.na(inbound_non_local_share)),
    aes(fill = inbound_non_local_share),
    colour = "white",
    linewidth = 0.08
  ) +
  # Label the destinations with the strongest non-local inflow
  geom_sf_text(
    data = destination_labels,
    aes(label = visit_label),
    size = 2.2,
    colour = "grey20",
    check_overlap = TRUE
  ) +
  scale_fill_course_seq_c(labels = label_number(accuracy = 0.001)) +
  labs(
    title = "Destination share of non-local activity",
    fill = "Inbound non-local\n`visit_share`"
  ) +
  theme_map_course()

print(map_destination_non_local)

4.4.1 Seeing observed flows

The area-based maps above help us understand origins and destinations separately. Now let us quickly understand which origins and destinations are strongly connected through resident visits. Often visualising spatial interaction data is challenging. The main challenge is that they become cluttered very quickly. So we often need a selective visual strategy rather than drawing every single interaction.

A first task is to keep only the strongest inter-area flows and turn each one into a line joining the origin point to the destination point. This is one of the most common practical steps in OD analysis because it lets us see the main structure without overwhelming the map.

top_flows_tbl <- od_total |>
  # Exclude self-flows so the map highlights between-area links
  filter(home_gadm_id != visit_gadm_id) |>
  # Rank links by their aggregated daytime flow strength
  arrange(desc(visit_share)) |>
  # Keep only the strongest flows so the map stays readable
  slice_head(n = 40) |>
  # Attach origin centroid coordinates and labels
  left_join(
    centroid_lookup |>
      rename(
        home_gadm_id = gadm_id,
        home_gadm_name_polygon = gadm_name,
        home_longitude = longitude,
        home_latitude = latitude,
        home_department = department_name
      ),
    by = "home_gadm_id"
  ) |>
  # Attach destination centroid coordinates and labels
  left_join(
    centroid_lookup |>
      rename(
        visit_gadm_id = gadm_id,
        visit_gadm_name_polygon = gadm_name,
        visit_longitude = longitude,
        visit_latitude = latitude,
        visit_department = department_name
      ),
    by = "visit_gadm_id"
  )

top_flow_geometries <- pmap(
  top_flows_tbl[, c("home_longitude", "home_latitude", "visit_longitude", "visit_latitude")],
  \(home_longitude, home_latitude, visit_longitude, visit_latitude) {
    # Build one line from the origin point to the destination point
    st_linestring(
      matrix(
        c(home_longitude, home_latitude, visit_longitude, visit_latitude),
        ncol = 2,
        byrow = TRUE
      )
    )
  }
)

top_flows <- st_sf(
  top_flows_tbl,
  # Turn the flow table plus line geometries into an sf object for mapping
  geometry = st_sfc(top_flow_geometries, crs = 4326)
)

top_flow_labels <- top_flows_tbl |>
  # Label only the strongest origins in the flow-line map
  distinct(home_gadm_id, home_gadm_name_polygon, home_longitude, home_latitude) |>
  # Keep a small set of origin labels so the map remains legible
  slice_head(n = 12) |>
  st_as_sf(coords = c("home_longitude", "home_latitude"), crs = 4326, remove = FALSE)

map_top_flows <- ggplot() +
  # Draw the country outline underneath the flow lines
  geom_sf(data = uruguay, fill = "grey97", colour = "grey80", linewidth = 0.25) +
  # Add the ADM2 polygons lightly so the flow lines sit over the actual geography
  geom_sf(data = adm2_polygons, fill = NA, colour = "grey88", linewidth = 0.08) +
  # Plot the strongest observed inter-area flows
  geom_sf(
    data = top_flows,
    aes(linewidth = visit_share, colour = visit_share),
    alpha = 0.6,
    lineend = "round"
  ) +
  # Add the centroid locations used to generate the lines
  geom_sf(data = adm2_centroids, colour = "grey25", size = 0.55, alpha = 0.55) +
  # Label the strongest origin nodes with the cleaned short names
  geom_sf_text(
    data = top_flow_labels,
    aes(label = home_gadm_name_polygon),
    size = 2.1,
    colour = "grey20",
    check_overlap = TRUE
  ) +
  scale_linewidth_continuous(labels = label_number(accuracy = 0.001), range = c(0.2, 2.5)) +
  scale_colour_course_seq_c(labels = label_number(accuracy = 0.001)) +
  labs(
    title = "The top 40 strongest visit activity connections",
    linewidth = "`visit_share`",
    colour = "`visit_share`"
  ) +
  theme_map_course()

print(map_top_flows)

4.5 Spatial interaction data modelling

Now that we have a better understanding of the data, we can start focusing on modelling. The core idea in this section is to fit a model that can capture the particular characteristics of our variable of interest (share of visits) using a set of predictors that describe the nature of our visit variable. We start from the simplest model and then progressively build complexity until we get to a satisfying point. Often this is a conventional linear regression before moving to the more sophisticated models.

The intuition is simple. Interactions between places often depend on three broad ingredients:

some measure of how much activity an origin can send out
some measure of how attractive or important a destination is
some measure of separation or impedance between them, often distance

4.5.1 Baseline model

In formal research, spatial interaction models are often estimated with Poisson or related count models and may include balancing constraints, origin effects, destination effects and richer measures of accessibility (Fotheringham and O’Kelly 1989; Rowe, Lovelace, and Dennett 2024). For this course, we use a lighter version that remains easy to read and adapt later. The goal is to understand the modelling logic rather than to cover every variant.

One of the standard ways to express that logic is with a simple gravity model:

\[ T_{ij} = k \frac{O_i^{\alpha} D_j^{\gamma}}{d_{ij}^{\beta}} \]

where:

T_ij is the flow from origin i to destination j
O_i captures the sending capacity of the origin
D_j captures the attractiveness of the destination
d_ij represents the separation between them, often distance
k rescales the whole system
alpha, gamma, and beta control how strongly origin size, destination size, and distance shape the flow

In standard practice, a log-linear approach is often used:

\[ \log(T_{ij}) = \alpha_0 + \alpha_1 \log(O_i) + \alpha_2 \log(D_j) - \alpha_3 \log(d_{ij}) + \epsilon_{ij} \]

We follow this logic. We do not implement a full gravity model in all its classical forms, but it captures the core intuition in a way that is transparent and easy to estimate with the OD table at hand.

We want to build a first model of inter-area daytime flows. So we first need to create a modelling table with one outcome and a small set of predictors. We keep only inter-area flows so that the model focuses on movement between different places, then attach three predictors:

origin_mass: total outbound visit_share generated by the origin
destination_mass: total inbound visit_share attracted by the destination
distance_km: centroid-to-centroid separation

Note

In an standard gravity model using flows, origin_mass is approximated by population at origins and destination_mass by population at destinations as we do in Rowe and Arribas-Bel (2022).

od_model <- od_total |>
  # Focus on inter-area links for the simplified interaction model
  filter(home_gadm_id != visit_gadm_id) |>
  left_join(
    origin_summary |>
      select(home_gadm_id, origin_mass = total_outbound_share),
    by = "home_gadm_id"
  ) |>
  left_join(
    destination_summary |>
      select(visit_gadm_id, destination_mass),
    by = "visit_gadm_id"
  ) |>
  # Add department context so we can test whether within-department links behave differently
  left_join(
    polygon_lookup |>
      select(home_gadm_id = gadm_id, home_department = department_name),
    by = "home_gadm_id"
  ) |>
  left_join(
    polygon_lookup |>
      select(visit_gadm_id = gadm_id, visit_department = department_name),
    by = "visit_gadm_id"
  ) |>
  mutate(
    # Mark OD pairs that remain within the same department
    same_department = as.integer(home_department == visit_department),
    # Log-transform the main quantities to match the usual interaction logic
    log_flow = log(visit_share),
    log_origin_mass = log(origin_mass),
    log_destination_mass = log(destination_mass),
    log_distance_km = log(distance_km)
  ) |>
  drop_na(log_flow, log_origin_mass, log_destination_mass, log_distance_km)

interaction_model <- lm(
  log_flow ~ log_origin_mass + log_destination_mass + log_distance_km,
  data = od_model
)

Interpretation and inference

What can we infer from the estimated coefficients? Do our results match the basic gravity-model intuition? That is: whether larger origins should tend to send more flow, more attractive destinations should tend to receive more flow, and greater distance should tend to reduce the observed connection between them.

model_results <- tidy(interaction_model) |>
  mutate(across(where(is.numeric), round, 3))

model_results

# A tibble: 4 × 5
  term                 estimate std.error statistic p.value
  <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)             2.97      0.14       21.2       0
2 log_origin_mass         2.08      0.05       42.0       0
3 log_destination_mass    0.695     0.035      20.0       0
4 log_distance_km        -1.26      0.024     -51.8       0

Model assessment

A next step is to assess our model to judge whether the baseline model captures any meaningful structure in the OD data. We can use familiar summary measures and then compare fitted and observed values visually. R^2 tells us how much variation in logged flow-strength is captured by the model, adjusted R^2 penalises unnecessary complexity, and AIC gives a comparative measure of fit that is most useful when comparing models estimated on a similar basis.

model_fit_summary <- glance(interaction_model) |>
  transmute(
    r_squared = round(r.squared, 3),
    adj_r_squared = round(adj.r.squared, 3),
    sigma = round(sigma, 3),
    aic = round(AIC, 1),
    bic = round(BIC, 1),
    n_obs = nobs
  )

model_fit_summary

# A tibble: 1 × 6
  r_squared adj_r_squared sigma    aic   bic n_obs
      <dbl>         <dbl> <dbl>  <dbl> <dbl> <int>
1     0.534         0.534  1.48 19525. 19558  5396

These values should be read as guides rather than as a final verdict on the model. A higher R^2 means the baseline gravity-style regression is capturing more of the broad OD structure, while a lower AIC would indicate a better trade-off between fit and parsimony when comparing similar linear specifications. For a broader overview of regression-based spatial analysis workflows can follow up with texts such as Bivand, Pebesma, and Gómez-Rubio (2013).

We can also assess how predicted aligned with observed share visits.

model_diagnostics <- augment(interaction_model) |>
  mutate(
    observed_flow = exp(.resid + .fitted),
    fitted_flow = exp(.fitted)
  )

plot_model_fit <- ggplot(
  model_diagnostics,
  aes(x = fitted_flow, y = observed_flow)
) +
  # Compare observed and fitted values of visit_share on the original scale
  geom_point(alpha = 0.2, colour = "#2A6F97", size = 1) +
  geom_abline(intercept = 0, slope = 1, colour = "#B2182B", linewidth = 0.5) +
  scale_x_log10(labels = label_number(accuracy = 0.001)) +
  scale_y_log10(labels = label_number(accuracy = 0.001)) +
  labs(
    title = "Predicted vs observed share visits",
    x = "Fitted `visit_share`",
    y = "Observed `visit_share`"
  ) +
  theme_plot_course()

print(plot_model_fit)

4.5.2 Improving the model

The baseline model is a helpful starting point, but it is unlikely to capture the whole structure of the daytime OD system. In practice, we often improve a model step by step rather than replacing it all at once.

Here we follow three routes. First, we add a contextual variable that records whether an origin and destination lie in the same department. Second, we let the effect of distance become more flexible by adding a squared logged-distance term. Third, we fit a fractional logit model, which is often more suitable when the dependent variable is a proportion bounded between 0 and 1, as visit_share.

These three routes are aligned around the following principles:

Use better approximations to model your dependent variable.
Recognise the structure of your data.
Get better predictors.

interaction_model_context <- lm(
  log_flow ~ log_origin_mass + log_destination_mass + log_distance_km + same_department,
  data = od_model
)

interaction_model_nonlinear <- lm(
  log_flow ~ log_origin_mass + log_destination_mass + log_distance_km +
    I(log_distance_km^2) + same_department,
  data = od_model
)

interaction_model_fractional <- glm(
  visit_share ~ log_origin_mass + log_destination_mass + log_distance_km +
    same_department,
  family = quasibinomial(link = "logit"),
  data = od_model
)

To compare these models fairly, it helps to separate two ideas. Some fit measures are model-specific, while others can be compared more broadly on the response scale. We therefore begin with response-scale measures that can be compared across all four specifications.

model_predictions <- bind_rows(
  od_model |>
    transmute(
      model = "Baseline gravity-style",
      observed = visit_share,
      fitted = exp(predict(interaction_model, newdata = od_model))
    ),
  od_model |>
    transmute(
      model = "Add same-department effect",
      observed = visit_share,
      fitted = exp(predict(interaction_model_context, newdata = od_model))
    ),
  od_model |>
    transmute(
      model = "Add nonlinear distance decay",
      observed = visit_share,
      fitted = exp(predict(interaction_model_nonlinear, newdata = od_model))
    ),
  od_model |>
    transmute(
      model = "Fractional logit for proportions",
      observed = visit_share,
      fitted = predict(interaction_model_fractional, newdata = od_model, type = "response")
    )
) 

response_scale_comparison <- model_predictions |>
  group_by(model) |>
  summarise(
    rmse = sqrt(mean((observed - fitted)^2, na.rm = TRUE)),
    mae = mean(abs(observed - fitted), na.rm = TRUE),
    correlation = cor(observed, fitted, use = "complete.obs"),
    .groups = "drop"
  ) |>
  transmute(
    model,
    rmse = round(rmse, 4),
    mae = round(mae, 4),
    correlation = round(correlation, 3)
  )

response_scale_comparison

# A tibble: 4 × 4
  model                              rmse    mae correlation
  <chr>                             <dbl>  <dbl>       <dbl>
1 Add nonlinear distance decay     0.0708 0.0049       0.256
2 Add same-department effect       0.0125 0.0038       0.559
3 Baseline gravity-style           0.0127 0.0038       0.539
4 Fractional logit for proportions 0.0117 0.0042       0.603

These measures are useful because they compare all models on the original scale of visit_share. Lower RMSE and MAE indicate better fit on the response scale, while a higher correlation means the model is tracking the broad rank order of the observed flows more closely.

We can then add a second table with model-specific fit measures. For the linear models, R^2, adjusted R^2, AIC, and BIC remain informative. For the fractional logit, a pseudo-R^2 and the deviance measures are more appropriate.

lm_specific_comparison <- bind_rows(
  glance(interaction_model) |> mutate(model = "Baseline gravity-style"),
  glance(interaction_model_context) |> mutate(model = "Add same-department effect"),
  glance(interaction_model_nonlinear) |> mutate(model = "Add nonlinear distance decay")
) |>
  transmute(
    model,
    r_squared = round(r.squared, 3),
    adj_r_squared = round(adj.r.squared, 3),
    aic = round(AIC, 1),
    bic = round(BIC, 1)
  )

fractional_specific_comparison <- tibble(
  model = "Fractional logit for proportions",
  pseudo_r2 = round(1 - interaction_model_fractional$deviance / interaction_model_fractional$null.deviance, 3),
  null_deviance = round(interaction_model_fractional$null.deviance, 1),
  residual_deviance = round(interaction_model_fractional$deviance, 1)
)

lm_specific_comparison

# A tibble: 3 × 5
  model                        r_squared adj_r_squared    aic    bic
  <chr>                            <dbl>         <dbl>  <dbl>  <dbl>
1 Baseline gravity-style           0.534         0.534 19525. 19558 
2 Add same-department effect       0.546         0.546 19383. 19423.
3 Add nonlinear distance decay     0.567         0.567 19127. 19174.

fractional_specific_comparison

# A tibble: 1 × 4
  model                            pseudo_r2 null_deviance residual_deviance
  <chr>                                <dbl>         <dbl>             <dbl>
1 Fractional logit for proportions     0.612          86.6              33.6

Based on all these metrics, what model do you think is best?

We use the fractional logit as the preferred practical model. It is attractive here because visit_share is a proportion, so the model respects the bounded nature of the outcome while still keeping the gravity-style logic easy to explain.

preferred_model <- interaction_model_fractional

preferred_results <- tidy(preferred_model) |>
  mutate(across(where(is.numeric), round, 3))

preferred_results

# A tibble: 5 × 5
  term                 estimate std.error statistic p.value
  <chr>                   <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)             3.13      0.156     20.0        0
2 log_origin_mass         1.69      0.05      34.1        0
3 log_destination_mass    0.591     0.036     16.2        0
4 log_distance_km        -1.26      0.032    -39.7        0
5 same_department         0.395     0.049      8.14       0

4.6 Model prediction

So far we have put all of our modeling efforts in understanding the model we fit and improving such model so it fits our data as closely as possible. If we are confident enough that we can represent our target process, we can start using the model for prediction. In applied work, this is often the bridge between a conceptual gravity model and a planning task such as forecasting travel demand, identifying strong corridors or identifying links that deserve closer scrutiny.

We use the fitted model to produce a structured estimate of the expected flow strength. That estimate is based on the broad ingredients of the system, such as origin strength, destination attractiveness, distance, and whether the link stays within the same department.

In a fuller forecasting exercise, we would ideally compare predictions against a later observation window that was not used to fit the model. In this chapter we do not have that second period, so the example below should be read as a practical illustration of model-based fitted prediction rather than as a true out-of-sample forecast. Even so, it is still useful as illustration because it shows how the model turns its coefficients back into concrete expected flows between places.

Before mapping the predictions, we remind ourselves how the candidate models performed on the response scale. That gives us a simple reason for taking the preferred fractional logit forward into the prediction stage.

response_scale_comparison

# A tibble: 4 × 4
  model                              rmse    mae correlation
  <chr>                             <dbl>  <dbl>       <dbl>
1 Add nonlinear distance decay     0.0708 0.0049       0.256
2 Add same-department effect       0.0125 0.0038       0.559
3 Baseline gravity-style           0.0127 0.0038       0.539
4 Fractional logit for proportions 0.0117 0.0042       0.603

We generate fitted values from the preferred model and identify the links with the strongest predicted visit_share. We then attach origin and destination coordinates so those predictions can be mapped.

predicted_top_flows_tbl <- od_model |>
  # Convert fitted values back to the original flow scale
  mutate(predicted_flow = predict(preferred_model, newdata = od_model, type = "response")) |>
  arrange(desc(predicted_flow)) |>
  slice_head(n = 40) |>
  left_join(
    centroid_lookup |>
      rename(
        home_gadm_id = gadm_id,
        pred_home_name = gadm_name,
        pred_home_longitude = longitude,
        pred_home_latitude = latitude
      ),
    by = "home_gadm_id"
  ) |>
  left_join(
    centroid_lookup |>
      rename(
        visit_gadm_id = gadm_id,
        pred_visit_name = gadm_name,
        pred_visit_longitude = longitude,
        pred_visit_latitude = latitude
      ),
    by = "visit_gadm_id"
  )

predicted_flow_geometries <- pmap(
  predicted_top_flows_tbl[, c("pred_home_longitude", "pred_home_latitude", "pred_visit_longitude", "pred_visit_latitude")],
  \(pred_home_longitude, pred_home_latitude, pred_visit_longitude, pred_visit_latitude) {
    st_linestring(
      matrix(
        c(pred_home_longitude, pred_home_latitude, pred_visit_longitude, pred_visit_latitude),
        ncol = 2,
        byrow = TRUE
      )
    )
  }
)

predicted_top_flows <- st_sf(
  predicted_top_flows_tbl,
  geometry = st_sfc(predicted_flow_geometries, crs = 4326)
)

map_predicted_flows <- ggplot() +
  # Draw the national outline and ADM2 boundaries for predicted links
  geom_sf(data = uruguay, fill = "grey97", colour = "grey80", linewidth = 0.25) +
  geom_sf(data = adm2_polygons, fill = NA, colour = "grey88", linewidth = 0.08) +
  # Show the strongest fitted flows implied by the interaction model
  geom_sf(
    data = predicted_top_flows,
    aes(linewidth = predicted_flow, colour = predicted_flow),
    alpha = 0.6,
    lineend = "round"
  ) +
  geom_sf(data = adm2_centroids, colour = "grey25", size = 0.55, alpha = 0.55) +
  scale_linewidth_continuous(labels = label_number(accuracy = 0.001), range = c(0.2, 2.5)) +
  scale_colour_course_seq_c(labels = label_number(accuracy = 0.001)) +
  labs(
title = "Predicted top 40 strongest visit activity connections",
    linewidth = "Predicted\n`visit_share`",
    colour = "Predicted\n`visit_share`"
  ) +
  theme_map_course()

print(map_predicted_flows)

The map gives a useful overview. We complement this with a table which more easily identifies the strongest predicted links.

predicted_top_flows_tbl |>
  transmute(
    origin = pred_home_name,
    destination = pred_visit_name,
    predicted_flow = round(predicted_flow, 4),
    distance_km = round(distance_km, 1),
    same_department
  ) |>
  slice_head(n = 12)

# A tibble: 12 × 5
   origin           destination      predicted_flow distance_km same_department
   <chr>            <chr>                     <dbl>       <dbl>           <int>
 1 Flores unit 118  Ismael Cortinas          0.413          6                 1
 2 Soriano unit 136 Soriano unit 129         0.183         15.8               1
 3 Flores unit 106  Ismael Cortinas          0.181         16.3               1
 4 Soriano unit 136 Soriano unit 122         0.123         22.8               1
 5 Flores unit 100  Ismael Cortinas          0.118         30.7               1
 6 Durazno unit 72  Villa del Carmen         0.108         32.4               1
 7 Parque del Plata Progreso                 0.108         10.3               1
 8 Valentín         Salto unit 21            0.0944        29.9               1
 9 Ismael Cortinas  Flores unit 118          0.0861         6                 1
10 Sarandí Grande   Fray Marcos              0.0821        28.8               1
11 Parque del Plata Los Cerrillos            0.0801        12.2               1
12 Durazno unit 72  Durazno unit 81          0.0786        33.4               1

Two points are worth keeping in mind. First, these are not the original observations. They are fitted values implied by the preferred fractional logit model. Second, the point is not just to rank links from highest to lowest. The point is that we now have a model that can be used again to predict flows in other datasets or under different assumptions. Once a new OD table becomes available, or once we want to test a policy-relevant scenario such as reducing distance frictions or comparing within- and between-department flows, the same modelling logic can be applied again.

4.7 What to take away

The most important lesson from this chapter is not a single function call. It is the analytical sequence. With OD data, we usually need to:

define what one flow record should represent
aggregate to that OD level
attach a spatial representation of origins and destinations
compute separation measures such as distance
map the structure selectively
move into a model that links flow strength to origin mass, destination mass, and separation

That sequence is transferable well beyond this dataset. Students working on commuting, school catchments, service access, migration, or transport demand will often follow the same logic even if the substantive application changes. The details differ, but the OD way of thinking remains the same.