Comparing mgcv::bam with glmmLDTS for the modeling and prediction of spotted seal haul-out behavior

Author: Josh M. London
Affiliation: Alaska Fisheries Science Center, NOAA Fisheries
Published: September 3, 2023

This is an evolving document

Content, figures, and analysis within this document are evolving based on new things learned and feedback from others. This is an open learning exercise, not something to be cited or applied outside of this specific context.

Inspiration & Objectives

The glmmLDTS package (Ver Hoef, London, and Boveng 2009) was developed and published over ten years ago and, at the time, provided a fast-computing solution for generalized linear mixed models with massive data sets and many repeated measures on subjects. While the package has general applicability, it was developed with behavioral observations from telemetry deployments on seals in mind (see example data in Ver Hoef, London, and Boveng (2009) and analysis in London et al. (2012)). In recent years, the mgcv package’s bam() model fitting function has emerged as a common solution for fitting similarly massive (if not much more massive) data sets (Wood et al. 2017; Wood, Goude, and Shaw 2014; Li and Wood 2019). The bam() approach differs in that it fits a generalized additive model (GAM) and employs additional numerical methods designed for data sets containing upwards of several tens of thousands of observations. Random effects are included as special penalized smooths, and there is also support for an AR1 structure in the model to account for temporal autocorrelation.

While mgcv::bam() relies on a low-rank approximation, it uses full likelihood estimation. glmmLDTS, on the other hand, relies on quasi-likelihood but with a full-rank approach. Additional simulation studies would be required to fully understand and thoroughly compare the two (and other frameworks such as glmmTMB, gam, INLA (first-order Laplace approximation), spaMM, spBayes, and spmodel (second-order Laplace approximation; see Ver Hoef et al. (2023))).

Here, the focus is more on whether the speed of mgcv::bam() makes it a useful tool for efficient model exploration and comparison in the specific case of spotted seal haul-out behavior from a large telemetry data set.

Initial testing and exploration of the mgcv::bam() approach suggests that models taking a few hours to fit with glmmLDTS can be fit in just a few seconds with mgcv::bam(). Thus, there’s keen interest in understanding whether the results of the fits are comparable and if mgcv::bam() might provide an astonishing improvement in computation time without sacrificing important statistical considerations and ecological insight.

Here, we compare speed, model performance, and predictions of glmmLDTS and mgcv::bam() for a data set of spotted seal haul-out observations from deployed telemetry devices. The primary aim of this model is to describe haul-out behavior during the critical periods of pupping, breeding, and molting and also to provide estimates of availability during aerial surveys such that counts of seals can provide estimates of population abundance. The data set consists of 114,852 hourly observations from deployments on 105 individual spotted seals.

Exploring glmmLDTS and Fit Results

The specification below was used for the glmmLDTS model:

fit_spotted <- function(HO_spotted) {
  glmmLDTS(fixed.formula = dry ~ age_sex + 
             sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + 
             poly(day, 4, raw=TRUE) + 
             wind*temp2m + pressure + precip + 
             age_sex:poly(day, 4, raw=TRUE),
           random.formula = dry ~ speno,  # random effect: individual seal
           data = HO_spotted,
           EstMeth = "REML",
           timecol = "time_vec",          # time index for the AR1 structure
           group.vec = "ar1_id")          # blocks of consecutive observations
}
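The function is defined above but the call itself isn’t shown; a minimal sketch of what it might look like, wrapped in system.time() to capture the timing discussed below (assuming HO_spotted is already loaded in the session):

fit_time <- system.time(
  spotted_fit_glmm <- fit_spotted(HO_spotted)  # fitted object used throughout
)
fit_time[["elapsed"]] / 3600  # elapsed fit time in hours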

As a general description of the model, dry is a binary response that indicates whether a seal was out of the water for the majority of a given hour. A collection of fixed-effect predictors are included: age and sex class; weather conditions determined from climate reanalysis (e.g., temperature at 2 m, wind, precipitation); linear, quadratic, cubic, and quartic effects of day-of-year to represent temporal changes in behavior; and time of day as a continuous formulation based on a Fourier series (sin1, cos1, sin2, …) that provides a flexible model while preserving the inherent circularity needed for time-of-day effects (i.e., hour 0 should equal hour 24). This formulation also represents hour-of-day with just 6 parameters, a considerable reduction compared to a 24-level factor. The random effect is speno (the seal identifier), which is further blocked into runs of consecutive observations; ar1_id specifies this grouping.
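For concreteness, those Fourier terms are just the first three harmonics of a 24-hour cycle; a minimal sketch of their construction (assuming a solar_hour column with values 0–23, as used later in create_newdata()):

HO_spotted <- HO_spotted %>%
  mutate(
    # harmonic k of a 24-hour cycle: sin(2*pi*k*h/24) and cos(2*pi*k*h/24)
    sin1 = sin(pi * solar_hour / 12), cos1 = cos(pi * solar_hour / 12),
    sin2 = sin(pi * solar_hour / 6),  cos2 = cos(pi * solar_hour / 6),
    sin3 = sin(pi * solar_hour / 4),  cos3 = cos(pi * solar_hour / 4)
  )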

Exact timing for the fit depends on the compute platform used. For an Apple M2 Silicon laptop (MacBook Air) the model fit took 3.114 hours. Estimates of the fixed effects from the glmmLDTS model are provided below in Table 1.

Table 1:

Preview of fixed effects from the glmmLDTS model

    effect         levels         estimate      std.err      df      t.value    prob.t
 1  intercept                    -0.577619483  0.196772228  114822  -2.935473   0.00333
 2  age_sex        ADULT.F        0.000000000           NA      NA         NA        NA
 3  age_sex        ADULT.M        0.379491785  0.277958097  114822   1.365284   0.17217
 4  age_sex        SUBADULT      -0.272680722  0.250518429  114822  -1.088466   0.27639
 5  age_sex        YOUNG OF YEAR  1.961605614  0.375635757  114822   5.222095   0.00000
    ... rows 6 through 35 omitted ...
36  age_sex:day_4  YOUNG OF YEAR  0.007596039  0.004649195  114822   1.633840   0.10230

Exploring mgcv::bam() Fit & Results

The specification below was used for the mgcv::bam() model:

m1 <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + 
    poly(day, 4, raw=TRUE) + 
    wind*temp2m + pressure + precip + 
    age_sex:poly(day, 4, raw=TRUE),
  data = spotted_model_data,
  family = binomial,
  discrete = TRUE)

The initial model specification was meant to match the previous glmmLDTS model. The s(speno, bs = "re") term is the smooth term for the random effect. All other predictors are the same. Note that the specification for m1 here does not include any AR1 structure for temporal autocorrelation. To include this, we need to provide a value for \(\rho\) (or rho). We can examine the autocorrelation in the model residuals and use the lag-1 value for \(\rho\).

# extract the lag-1 autocorrelation of the residuals from the uncorrelated fit
lag1 <- acf(resid(m1), plot = FALSE)$acf[2]

The value for lag-1 autocorrelation is 0.8255, which is rather high but not surprising. We can now update our model specification with a value for rho as well as the AR.start argument, which specifies (TRUE/FALSE) the start point of each speno/block.
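For reference, ar1_start is a logical vector flagging the first record of each ar1_id block; a sketch of how it might be constructed (assuming the data are ordered in time so that each block is contiguous):

# TRUE at the first observation of each ar1_id block, FALSE elsewhere;
# bam() restarts the AR1 process wherever AR.start is TRUE
ar1_start <- !duplicated(spotted_model_data$ar1_id)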

m2 <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") + 
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 + 
    poly(day, 4, raw=TRUE) + 
    wind*temp2m + pressure + precip + 
    age_sex:poly(day, 4, raw=TRUE),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE)

Timing for the mgcv::bam() fit is approximately 4 seconds on an Apple M1 Silicon laptop. Estimates of the fixed effects from the mgcv::bam() model are provided below in Table 2.

Table 2:

Preview of fixed effects from the bam model

    term                                              estimate      std.error    statistic  p.value
 1  (Intercept)                                      -0.594777463  0.196709850  -3.023628  2.497631e-03
 2  age_sexADULT.M                                    0.385801763  0.278617516   1.384700  1.661441e-01
 3  age_sexSUBADULT                                  -0.281249113  0.250613891  -1.122241  2.617601e-01
 4  age_sexYOUNG OF YEAR                              1.950551281  0.372985192   5.229568  1.699070e-07
 5  sin1                                             -0.045638470  0.020558838  -2.219895  2.642586e-02
    ... rows 6 through 30 omitted ...
31  age_sexYOUNG OF YEAR:poly(day, 4, raw = TRUE)4    0.007798545  0.004568144   1.707158  8.779265e-02

The estimates in Table 2 align quite well with those presented in Table 1, and the same holds true for the estimates of standard error. This provides a good indication that both approaches are interpreting the data in a very similar manner. There is, however, one discrepancy: the cos3 term is estimated as 0 in the mgcv::bam() model but has a non-zero estimate in the glmmLDTS output (see Table 3). Understanding the nature of this is worth pursuing further at some point.

Table 3:

Terms with zero estimate in the bam model

term estimate std.error statistic p.value
cos3 0 0 NaN NaN
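Such zeroed terms can also be listed directly from the fitted coefficient vector (a quick sketch):

# coefficients that bam() fixed at exactly zero; mgcv zeroes coefficients it
# cannot identify (e.g., a rank deficiency in the model matrix), which may be
# what is happening with cos3 here
names(which(coef(m2) == 0))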

Comparing Prediction Estimates and Confidence Intervals

So, it seems that both the original glmmLDTS approach and the mgcv::bam() approach can provide very similar fits with nearly identical coefficient estimates and standard errors. But this is to be expected, I think, given that a GLMM can be considered a special case of a GAM and the only smooth term included was for the random effect. The next area to explore is how the models compare with respect to prediction estimates and the confidence intervals around those predictions.
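That expectation is easy to check on a toy example. The sketch below (simulated data; all parameter values are assumptions for illustration) fits a random-intercept logistic GLMM via mgcv's bs = "re" smooth and recovers the simulated random-effect standard deviation:

library(mgcv)
set.seed(42)
n_id <- 30; n_per <- 50
sim <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_per)),
  x  = rnorm(n_id * n_per)
)
b0 <- rnorm(n_id, sd = 0.8)  # true random intercepts, SD = 0.8
sim$y <- rbinom(nrow(sim), 1, plogis(-0.5 + 0.7 * sim$x + b0[sim$id]))

fit <- gam(y ~ x + s(id, bs = "re"), data = sim,
           family = binomial, method = "REML")
gam.vcomp(fit)  # estimated SD of the random intercepts, comparable to 0.8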

For this prediction exercise, we’ll want to create a new data frame that includes all of the values we’d like to predict at. For this particular model, it’s a rather complex exercise, hence the lengthy function below (feel free to unfold the code block if interested).

Code
create_newdata <- function(data, age_sex) {
  df_list <- vector(mode = "list", length = length(age_sex))

  for (a_s in age_sex) {
    range_yday <- data %>% filter(age_sex == a_s ) %>%
      summarize(start_day = min(yday),
                end_day = max(yday))
    start_day <- range_yday$start_day
    end_day <- range_yday$end_day
    n_days = (end_day - start_day) + 1

    # for wx covariates we'll use a gam to get values by 
    # day/hour since wx is likely to vary w/in day over the season

    gam.baro <-
      gam(pressure ~ s(yday), 
          data = data, 
          method = "REML")
    gam.temp <-
      gam(temp2m ~ s(yday) + s(as.numeric(solar_hour)), 
          data = data, 
          method = "REML")
    gam.wind <-
      gam(wind ~ s(yday) + s(as.numeric(solar_hour)), 
          data = data, 
          method = "REML")
    gam.precip <-
      gam(precip ~ s(yday), 
          data = data, 
          method = "REML")

    wx_new_data <- data.frame(
      solar_hour = rep(0:23, each = n_days),
      yday = rep(start_day:end_day, times = 24)
    )

    temp_pred <- predict(gam.temp, newdata = wx_new_data)
    wind_pred <- predict(gam.wind, newdata = wx_new_data)
    baro_pred <- predict(gam.baro, newdata = wx_new_data)
    precip_pred <- predict(gam.precip, newdata = wx_new_data)

    df <- data.frame(
      age_sex = a_s,
      solar_hour = rep(0:23, each = n_days),
      yday = rep(start_day:end_day, times = 24),
      northing = mean(data$northing),
      temp2m = temp_pred,
      wind = wind_pred,
      pressure = baro_pred,
      precip = precip_pred
    ) %>%
      mutate(
        sin1 = sin(pi * solar_hour / 12),
        cos1 = cos(pi * solar_hour / 12),
        sin2 = sin(pi * solar_hour / 6),
        cos2 = cos(pi * solar_hour / 6),
        sin3 = sin(pi * solar_hour / 4),
        cos3 = cos(pi * solar_hour / 4)
      ) %>%
      mutate(day = (yday - 120) / 10,
             day_2 = day^2,
             day_3 = day^3,
             day_4 = day^4)
    df_list[[a_s]] <- df
  }
  if (length(age_sex) > 1) {
    df_out <- bind_rows(df_list) %>%
      mutate(age_sex = forcats::fct_relevel(
        age_sex,c("ADULT.F","ADULT.M","SUBADULT","YOUNG OF YEAR"))
      )
  } else {
    df_out <- bind_rows(df_list)
  }
  df_out
}

spotted_newdata <- create_newdata(
  data = spotted_fit_glmm$dataset,
  age_sex = levels(spotted_fit_glmm$dataset$age_sex)
  )

Predicting from the glmmLDTS model

Now, we need a function to predict from the glmmLDTS model. This isn’t an included function within the package, so we’ll create our own.

Code
predict.glmmLDTS <- function(glmmLDTS_model, newdata,
                             type = "response") {
  # create the model matrix, dropping the response from the fixed formula
  spotted_mm <- model.matrix(glmmLDTS_model$fixed.formula[-2],
                             data = newdata)

  # drop rows for inestimable baseline levels (std.err is NA) so the
  # coefficient vector aligns with the model matrix columns
  fit_coef <- glmmLDTS_model$fixed.effects %>%
    filter(!is.na(std.err)) %>%
    pull(estimate)

  predicts <-  
    tibble(logit_fits = as.vector(spotted_mm %*% fit_coef)) %>% 
    mutate(logit_fits_se = sqrt(diag(
             spotted_mm %*% glmmLDTS_model$covb %*% t(spotted_mm)
             )),
           logit_fits_lo95 = logit_fits - 1.96*logit_fits_se,
           logit_fits_up95 = logit_fits + 1.96*logit_fits_se,
           ho_prob = plogis(spotted_mm %*% fit_coef),
           lower95 = plogis(logit_fits_lo95),
           upper95 = plogis(logit_fits_up95)
    )
  return(predicts)
}

spotted_newdata <- spotted_newdata %>% 
  dplyr::bind_cols(
    predict.glmmLDTS(spotted_fit_glmm, spotted_newdata)
    )

Predicting from the mgcv::bam() model

The mgcv package includes a predict() function for us, so we can proceed directly to prediction using our spotted_newdata and bind the results. The predict.bam() function requires a column for speno even though we are excluding the random effect (s(speno)) from the model prediction. As with the glmmLDTS prediction exercise, we’ll specify type = "link" and calculate our confidence intervals directly.

spenos <- unique(spotted_model_data$speno) %>% 
  as.character()

spotted_newdata <- spotted_newdata %>% 
  mutate(speno = sample(spenos,1))

spotted_newdata <- spotted_newdata %>%
  bind_cols(
    predict(
      m2,
      spotted_newdata,
      type = "link",
      se.fit = TRUE,
      exclude = "s(speno)"
    ) %>%
      as_tibble()
  ) %>%
  rename(fit_bam = fit, se_bam = se.fit) %>%
  mutate(
    logit_bam_lo95 = fit_bam - 1.96 * se_bam,
    logit_bam_up95 = fit_bam + 1.96 * se_bam,
    ho_prob_bam = plogis(fit_bam),
    lower95_bam = plogis(logit_bam_lo95),
    upper95_bam = plogis(logit_bam_up95)
  )

Now, with predictions from both model approaches in hand, we can visualize the comparison of predictions and associated standard errors (Figure 1).

Code
ggplot(data = spotted_newdata) +
  geom_point(aes(x = ho_prob, y = ho_prob_bam),
             alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0) +
  coord_cartesian() +
  xlab("Predicted HO Probability - glmmLDTS") +
  ylab("Predicted HO Probability - mgcv::bam") +
  ggtitle("Comparing Predicted Haul-out Probability",
          subtitle = "values shown are on the response scale") +
  theme_minimal() + theme(legend.position = "none")

Figure 1: XY plot comparing the predicted haul-out probability (response scale) for the same data between the glmmLDTS model fit and the mgcv::bam() model fit

This looks great!

Now, let’s visualize the predictions with a heat map (Figure 2) that depicts haul-out probability changes over the season and hour of day for each of our age and sex classes.

Code
plot_df <- spotted_newdata %>%
  mutate(date = lubridate::as_date(yday, origin = "2015-01-01"),
         month = lubridate::month(date,label=TRUE),
         day = lubridate::day(date)) %>%
  filter(!month %in% c("Jul","Aug"))

p1 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )

p1 <- p1 + facet_grid(age_sex~month, scales = "free_x")
p1 <- p1 + scale_x_continuous(breaks = c(5,10,15,20,25)) +
  scale_y_continuous(breaks = c(4,12,20)) +
  coord_cartesian(expand=FALSE)
p1 <- p1 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour="white")) +
  theme(axis.ticks=element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - glmmLDTS")


p2 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob_bam)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )
p2 <- p2 + facet_grid(age_sex~month, scales = "free_x")
p2 <- p2 + scale_x_continuous(breaks = c(5,10,15,20,25)) +
  scale_y_continuous(breaks = c(4,12,20)) +
  coord_cartesian(expand=FALSE)
p2 <- p2 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour="white")) +
  theme(axis.ticks=element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - mgcv::bam")

p1/p2

Figure 2: Heat maps showing the predicted haul-out probability (response scale) variability over the season and within day for the same data between the glmmLDTS model fit and the mgcv::bam() model fit

That looks great, too!

So far, we’ve shown that the model estimates, associated standard errors, and prediction values for haul-out probability are very similar between the glmmLDTS and mgcv::bam() approach. The final comparison to evaluate is the prediction standard errors. We’ll start with an XY plot similar to what we did for the prediction estimates.

Code
ggplot(data = spotted_newdata) +
  geom_point(aes(x = logit_fits_se, y = se_bam),
             alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0) +
  coord_cartesian() + ylim(c(0,NA)) + xlim(c(0,NA)) +
  xlab("Prediction Standard Error - glmmLDTS") +
  ylab("Prediction Standard Error - mgcv::bam") +
  ggtitle("Comparing Prediction Standard Errors",
          subtitle = "values shown are on the logit scale, not response") +
  theme_minimal() + theme(legend.position = "none")

XY plot comparing the predicted standard errors (link scale) for the same data between the glmmLDTS model fit and the mgcv::bam() model fit

Just like the other comparisons, this also looks very promising. As a final visualization, let’s fix the solar_hour parameter to noon and compare predictions and standard errors for both models across age and sex class.

p3 <- ggplot(plot_df %>% filter(solar_hour == 12), aes(date,ho_prob)) +
  geom_ribbon(aes(ymin=lower95,ymax=upper95), fill = "grey30", alpha = 0.3) +
  geom_line(aes(date,ho_prob)) +
  geom_ribbon(aes(ymin=lower95_bam,ymax=upper95_bam), fill = "orange", alpha = 0.3) +
  geom_line(aes(date, ho_prob_bam), color = "orange") +
  facet_grid(. ~ age_sex) +
  xlab("day or year") +
  ylab("haul-out probability") +
  ggtitle("Comparing Predictions & Standard Errors",
          subtitle = "glmmLDTS is the darker shaded region and orange is mgcv::bam") +
  theme_minimal() +
  theme(legend.position = "none")

p3

Figure 3: Comparison of predictions and standard errors for the glmmLDTS model (darker shade) and the mgcv::bam() model (orange). Solar hour was fixed at noon

So, it seems we can adopt mgcv::bam() as a quick surrogate for glmmLDTS and gain an astonishing decrease in computation time. But we should probably temper any enthusiasm to rely on mgcv::bam() as a complete replacement for glmmLDTS in this case until proper comparisons with simulated data can be done.

Additionally, it is worth considering a fairer comparison in terms of time. This would involve using Gaussian process basis functions, including in the model formula something like y ~ x1 + ... + xn + s(time, bs = "gp", k = floor(n/2), m = c(2, 3, 1)) to get a reduced-rank version of the AR1 model for the residuals (also specifying method = "REML" in the call to mgcv::bam()).
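A hedged sketch of what that might look like with the current model terms (using time_vec as the time index is an assumption, and k = floor(n/2) simply mirrors the suggestion above; with n = 114,852 that basis would be very large, so this is illustrative only, not something actually fit here):

n <- nrow(spotted_model_data)
m_gp <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 +
    poly(day, 4, raw = TRUE) +
    wind * temp2m + pressure + precip +
    age_sex:poly(day, 4, raw = TRUE) +
    s(time_vec, bs = "gp", k = floor(n / 2), m = c(2, 3, 1)),
  data = spotted_model_data,
  family = binomial,
  method = "REML")  # discrete = TRUE requires method = "fREML", so it is omitted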

What About a More GAM/BAM-centric Approach?

Up to this point, the focus has been on comparing the same model specification (other than the approach for including random effects) between glmmLDTS and mgcv::bam(). However, a more GAM-like approach to mgcv::bam() might rely on a different specification that includes more smooth terms.

That might be interesting to explore while we are at it. The mgcv package provides a range of options for specifying smooths within the model. For the environmental covariates, we’ll specify a cubic regression spline with a shrinkage component (bs = "cs"). For day of year (yday) we’ll specify a cubic regression spline that varies by age and sex class (bs = "cr", by = age_sex), and for solar hour we’ll specify a cyclic cubic spline (bs = "cc"). Interactions are handled as tensor product smooths. It wasn’t clear whether the te() or ti() approach would be best, so both are fit and compared with AIC.

spotted_model_data <- spotted_model_data %>% 
  mutate(solar_hour = as.integer(solar_hour),
         wind_temp2 = wind*temp2m)


m5ti <- mgcv::bam(
  dry ~ s(speno, bs = "re") +
    s(yday, bs="cr",by=age_sex)+ s(solar_hour, bs="cc")+
    ti(yday, solar_hour, bs=c("cr","cc")) + age_sex +
  s(precip, bs = "cs") + s(temp2m, bs = "cs") +
  s(wind, bs = "cs") + ti(wind, temp2m, bs = "cs") + s(pressure, bs = "cs"),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE
)

m5te <- mgcv::bam(
  dry ~ s(speno, bs = "re") +
    te(yday, solar_hour,bs=c("cr","cc"), by=age_sex) + age_sex +
  te(wind, temp2m, bs = "cs") + s(pressure, bs = "cs"),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE
)
AIC(m5ti, m5te)
           df       AIC
m5ti 155.3137 -27419.44
m5te 156.9675 -26427.64

Ok, looks like the ti() approach is favored. Now, let’s do some predictions!

spotted_newdata <- spotted_newdata %>% 
  mutate(wind_temp2 = wind*temp2m)

m5_predict <- predict(
      m5ti,
      spotted_newdata,
      type = "link",
      se.fit = TRUE,
      exclude = "s(speno)"
    ) %>%
      as_tibble() %>% 
  rename(fit_bam2 = fit, se_bam2 = se.fit) %>% 
  mutate(
    logit_bam2_lo95 = fit_bam2 - 1.96 * se_bam2,
    logit_bam2_up95 = fit_bam2 + 1.96 * se_bam2,
    ho_prob_bam2 = plogis(fit_bam2),
    lower95_bam2 = plogis(logit_bam2_lo95),
    upper95_bam2 = plogis(logit_bam2_up95)
  )

spotted_newdata <- spotted_newdata %>% 
  bind_cols(m5_predict)

And we can visualize those predictions, similar to our other plots:

Code
plot_df <- spotted_newdata %>%
  mutate(date = lubridate::as_date(yday, origin = "2015-01-01"),
         month = lubridate::month(date,label=TRUE),
         day = lubridate::day(date)) %>%
  filter(!month %in% c("Jul","Aug"))

p1 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob_bam2)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )

p1 <- p1 + facet_grid(age_sex~month, scales = "free_x")
p1 <- p1 + scale_x_continuous(breaks = c(5,10,15,20,25)) +
  scale_y_continuous(breaks = c(4,12,20)) +
  coord_cartesian(expand=FALSE)
p1 <- p1 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour="white")) +
  theme(axis.ticks=element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - mgcv::bam with smooths")

p1

Figure 4: Heat map showing the predicted haul-out probability (response scale) variability over the season and within day from the mgcv::bam() with smooths model fit

That’s not so bad and, actually, pretty good given the very different approach. I imagine it speaks to the strong signal in the underlying data. An interesting aspect of this result is that we now see a period of increased haul-out behavior in adult females that coincides with pupping and nursing. There’s also an increased signal for adult males in that time that would coincide with their courting behavior and hauling out with nursing females. This difference was especially noticeable compared to early model specifications of glmmLDTS that only included up to a 3rd-order polynomial for day of year. Inclusion of a 4th-order polynomial has brought the two models more in line, and it’s worth considering whether the GAM-centric approach is over-fitting and needs some additional constraints.
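If over-fitting is the concern, one simple constraint (a sketch; k = 6 is an assumed, untuned value) is to limit the basis dimension of the per-class seasonal smooth:

# identical to m5ti except the seasonal smooth gets a small basis dimension
m5ti_k6 <- mgcv::bam(
  dry ~ s(speno, bs = "re") +
    s(yday, bs = "cr", by = age_sex, k = 6) + s(solar_hour, bs = "cc") +
    ti(yday, solar_hour, bs = c("cr", "cc")) + age_sex +
    s(precip, bs = "cs") + s(temp2m, bs = "cs") +
    s(wind, bs = "cs") + ti(wind, temp2m, bs = "cs") + s(pressure, bs = "cs"),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE
)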

Code
p3 <- ggplot(plot_df %>% filter(solar_hour == 12), aes(date,ho_prob)) +
  geom_ribbon(aes(ymin=lower95,ymax=upper95), fill = "grey30", alpha = 0.3) +
  geom_line(aes(date,ho_prob)) +
  geom_ribbon(aes(ymin=lower95_bam2,ymax=upper95_bam2), fill = "orange", alpha = 0.3) +
  geom_line(aes(date, ho_prob_bam2), color = "orange") +
  facet_grid(. ~ age_sex) +
  xlab("day or year") +
  ylab("haul-out probability") +
  ggtitle("Comparing Predictions & Standard Errors",
          subtitle = "glmmLDTS is the darker shaded region and orange is mgcv::bam with smooths") +
  theme_minimal() +
  theme(legend.position = "none")

p3

Figure 5: Comparison of predictions and standard errors for the glmmLDTS model (darker shade) and the mgcv::bam() with smooths model (orange). Solar hour was fixed at noon

Acknowledgments

Special thanks to Debbie Russell, Matt Carter, Peter Boveng, and Jay Ver Hoef for their thoughts, discussions, and encouragement to explore. And, to Paul Conn for his originating work on this analysis.

References

Ver Hoef, Jay M., Eryn Blagg, Michael Dumelle, Philip M. Dixon, Dale L. Zimmerman, and Paul Conn. 2023. “Marginal Inference for Hierarchical Generalized Linear Mixed Models with Patterned Covariance Matrices Using the Laplace Approximation.” https://doi.org/10.48550/ARXIV.2305.02978.
Li, Zheyuan, and Simon N. Wood. 2019. “Faster Model Matrix Crossproducts for Large Generalized Linear Models with Discretized Covariates.” Statistics and Computing 30 (1): 19–25. https://doi.org/10.1007/s11222-019-09864-2.
London, Josh M., Jay M. Ver Hoef, Steven J. Jeffries, Monique M. Lance, and Peter L. Boveng. 2012. “Haul-Out Behavior of Harbor Seals (Phoca Vitulina) in Hood Canal, Washington.” PLoS ONE 7 (6): e38180. https://doi.org/10.1371/journal.pone.0038180.
Ver Hoef, Jay M., Josh M. London, and Peter L. Boveng. 2009. “Fast Computing of Some Generalized Linear Mixed Pseudo-Models with Temporal Autocorrelation.” Computational Statistics 25 (1): 39–55. https://doi.org/10.1007/s00180-009-0160-1.
Wood, Simon N., Yannig Goude, and Simon Shaw. 2014. “Generalized Additive Models for Large Data Sets.” Journal of the Royal Statistical Society Series C: Applied Statistics 64 (1): 139–55. https://doi.org/10.1111/rssc.12068.
Wood, Simon N., Zheyuan Li, Gavin Shaddick, and Nicole H. Augustin. 2017. “Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data.” Journal of the American Statistical Association 112 (519): 1199–1210. https://doi.org/10.1080/01621459.2016.1195744.