Comparing mgcv::bam with glmmLDTS for the modeling and prediction of spotted seal haul-out behavior
Content, figures, and analysis within this document are evolving based on new things learned and feedback from others. This is an open learning exercise, not something to be cited or applied outside of this specific context.
Inspiration & Objectives
The `glmmLDTS` package (Ver Hoef, London, and Boveng 2009) was developed and published over ten years ago and, at the time, provided a fast-computing solution for generalized linear mixed models with massive data sets and many repeated measures on subjects. While the package has general applicability, it was developed with behavioral observations from telemetry deployments on seals in mind (see the example data in Ver Hoef, London, and Boveng (2009) and the analysis in London et al. (2012)). In recent years, the `mgcv` package's `bam()` model fitting function has emerged as a common solution for fitting similarly massive (if not much more massive) data sets (Wood et al. 2017; Wood, Goude, and Shaw 2014; Li and Wood 2019). The `bam()` approach differs in that it fits a generalized additive model (GAM) and employs additional numerical methods designed for data sets containing upwards of several tens of thousands of observations. Random effects are included as special penalized smooths, and there is also support for an AR1 structure in the model to account for temporal autocorrelation.
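As a minimal sketch of that pattern (hypothetical data and column names, not the spotted seal data), a random intercept enters as a penalized smooth and the AR1 structure is supplied through `rho` and `AR.start`:

```r
library(mgcv)

# Hypothetical illustration: a binary response observed in ordered series
# within subjects.
set.seed(42)
d <- data.frame(
  y  = rbinom(1000, 1, 0.5),
  x  = rnorm(1000),
  id = factor(rep(1:10, each = 100))
)
d$start_of_series <- !duplicated(d$id)  # TRUE at the first row of each series

# Random intercept for id entered as a penalized smooth, s(id, bs = "re");
# rho supplies an assumed lag-1 AR1 correlation for residuals within each
# series, and AR.start flags where each series begins.
b <- bam(y ~ x + s(id, bs = "re"),
         data = d,
         family = binomial,
         rho = 0.5,
         AR.start = d$start_of_series,
         discrete = TRUE)
```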
While `mgcv::bam()` relies on a low-rank approximation, it uses full likelihood estimation. `glmmLDTS`, on the other hand, relies on quasi-likelihood but with a full-rank approach. Additional simulation studies would be required to fully understand and thoroughly compare the two (and other frameworks such as `glmmTMB`, `gam`, `INLA` (first-order Laplace approximation), `spaMM`, `spBayes`, and `spmodel` (second-order Laplace approximation; see Hoef et al. 2023)).
Here, the focus is more on whether the speed of `mgcv::bam()` can be used as a tool for efficient model exploration and comparison in the specific case of spotted seal haul-out behavior from a large telemetry data set.
Initial testing and exploration of the `mgcv::bam()` approach suggests that models taking a few hours to fit with `glmmLDTS` can be fit in just a few seconds with `mgcv::bam()`. Thus, there is keen interest in understanding whether the results of the fits are comparable and whether `mgcv::bam()` might provide an astonishing improvement in computation time without sacrificing important statistical considerations and ecological insight.
Here, we compare speed, model performance, and predictions of `glmmLDTS` and `mgcv::bam()` for a data set of spotted seal haul-out observations from deployed telemetry devices. The primary aim of this model is to describe haul-out behavior during the critical periods of pupping, breeding, and molting, and also to provide estimates of availability during aerial surveys such that counts of seals can provide estimates of population abundance. The data set consists of 114,852 observations from deployments on 105 individual spotted seals.
Exploring glmmLDTS and Fit Results

The model specification below was used to fit the `glmmLDTS` model:
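```r
fit_spotted <- function(HO_spotted) {
  glmmLDTS(fixed.formula = dry ~ age_sex +
             sin1 + cos1 + sin2 + cos2 + sin3 + cos3 +
             poly(day, 4, raw = TRUE) +
             wind*temp2m + pressure + precip +
             age_sex:poly(day, 4, raw = TRUE),
           random.formula = dry ~ speno,
           data = HO_spotted,
           EstMeth = "REML",
           timecol = "time_vec",
           group.vec = "ar1_id")
}
```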
As a general description of the model, dry is a binary response that indicates whether a seal was out of the water for the majority of a given hour. A collection of fixed-effect predictors is included: age and sex class; weather conditions determined from climate reanalysis (e.g., temperature at 2 m, wind, precipitation); linear, quadratic, cubic, and quartic effects of day-of-year to represent temporal changes in behavior; and time of day as a continuous formulation based on a Fourier series (sin1, cos1, sin2, ...) that provides a flexible model while preserving the inherent circularity needed for time-of-day effects (i.e., hour 0 should be equal to hour 24). It also represents hour-of-day with 6 parameters, a considerable reduction compared to a 24-level factor. The random effect is speno (the seal identifier), which is further blocked into runs of consecutive observations; ar1_id specifies this grouping.
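For concreteness, here is a minimal sketch of those six Fourier terms as functions of hour of day (matching the construction used in `create_newdata()` later in this document):

```r
# Six Fourier terms for hour of day (0-23) with a 24-hour period.
# Because the period is 24, hour 0 and hour 24 yield identical values,
# preserving the circularity of time of day.
hour <- 0:23
tod <- data.frame(
  sin1 = sin(pi * hour / 12),  # first harmonic (24 h period)
  cos1 = cos(pi * hour / 12),
  sin2 = sin(pi * hour / 6),   # second harmonic (12 h period)
  cos2 = cos(pi * hour / 6),
  sin3 = sin(pi * hour / 4),   # third harmonic (8 h period)
  cos3 = cos(pi * hour / 4)
)
```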
Exact timing for the fit depends on the compute platform used. For an Apple M2 silicon laptop (MacBook Air), the model fit took 3.114 hours. Estimates of the fixed effects from the `glmmLDTS` model are provided below in Table 1.
Table 1: Fixed-effect estimates from the `glmmLDTS` model (rows 6..35 omitted).

| | effect | levels | estimate | std.err | df | t.value | prob.t |
|---|---|---|---|---|---|---|---|
| 1 | intercept | | -0.577619483 | 0.196772228 | 114822 | -2.935473 | 0.00333 |
| 2 | age_sex | ADULT.F | 0.000000000 | NA | NA | NA | NA |
| 3 | age_sex | ADULT.M | 0.379491785 | 0.277958097 | 114822 | 1.365284 | 0.17217 |
| 4 | age_sex | SUBADULT | -0.272680722 | 0.250518429 | 114822 | -1.088466 | 0.27639 |
| 5 | age_sex | YOUNG OF YEAR | 1.961605614 | 0.375635757 | 114822 | 5.222095 | 0.00000 |
| 6..35 | ... | ... | ... | ... | ... | ... | ... |
| 36 | age_sex:day_4 | YOUNG OF YEAR | 0.007596039 | 0.004649195 | 114822 | 1.633840 | 0.10230 |
Exploring mgcv::bam() Fit & Results

The model specification below was used to fit the `mgcv::bam()` model:
```r
m1 <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 +
    poly(day, 4, raw = TRUE) +
    wind*temp2m + pressure + precip +
    age_sex:poly(day, 4, raw = TRUE),
  data = spotted_model_data,
  family = binomial,
  discrete = TRUE)
```
The initial model specification was meant to match the previous `glmmLDTS` model. The `s(speno, bs = "re")` term is the smooth term for the random effect. All other predictors are the same. Note that the specification for `m1` here does not include any AR1 structure for temporal autocorrelation. To include this, we need to provide a value for \(\rho\) (or `rho`). We can examine the autocorrelation within the model and use the lag-1 value for \(\rho\).
```r
lag1 <- acf(resid(m1), plot = FALSE)[1][[1]]
```
The value for lag-1 autocorrelation is 0.8255, which is rather high but not surprising. We can now update our model specification with a value for `rho` as well as the `AR.start` argument, which specifies (TRUE/FALSE) the start point of each speno/block.
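The construction of `ar1_start` isn't shown in this document; a minimal sketch (an assumption based on the `ar1_id` blocks and `time_vec` ordering from the `glmmLDTS` call) might look like:

```r
library(dplyr)

# Assumed construction: TRUE at the first row of each ar1_id block so the
# AR1 correlation is not carried across independent blocks of observations.
spotted_model_data <- spotted_model_data %>%
  arrange(ar1_id, time_vec) %>%
  group_by(ar1_id) %>%
  mutate(ar1_start = row_number() == 1) %>%
  ungroup()

ar1_start <- spotted_model_data$ar1_start
```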
```r
m2 <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") +
    sin1 + cos1 + sin2 + cos2 + sin3 + cos3 +
    poly(day, 4, raw = TRUE) +
    wind*temp2m + pressure + precip +
    age_sex:poly(day, 4, raw = TRUE),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE)
```
Timing for the `mgcv::bam()` fit is approximately 4 seconds on an Apple M1 silicon laptop. Estimates of the fixed effects from the `mgcv::bam()` model are provided below in Table 2.
Table 2: Parametric coefficient estimates from the `mgcv::bam()` model (rows 6..30 omitted).

| | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| 1 | (Intercept) | -0.594777463 | 0.196709850 | -3.023628 | 2.497631e-03 |
| 2 | age_sexADULT.M | 0.385801763 | 0.278617516 | 1.384700 | 1.661441e-01 |
| 3 | age_sexSUBADULT | -0.281249113 | 0.250613891 | -1.122241 | 2.617601e-01 |
| 4 | age_sexYOUNG OF YEAR | 1.950551281 | 0.372985192 | 5.229568 | 1.699070e-07 |
| 5 | sin1 | -0.045638470 | 0.020558838 | -2.219895 | 2.642586e-02 |
| 6..30 | ... | ... | ... | ... | ... |
| 31 | age_sexYOUNG OF YEAR:poly(day, 4, raw = TRUE)4 | 0.007798545 | 0.004568144 | 1.707158 | 8.779265e-02 |
The estimates in Table 2 align quite well with those presented in Table 1. The same holds true for the estimates of standard error. This provides a good indication that both approaches are interpreting the data in a very similar manner. There is, however, one difference: the `cos3` term is estimated as 0 in the `mgcv::bam()` model but has a non-zero estimate in the `glmmLDTS` output (see Table 3). Understanding the nature of this is worth pursuing further at some point.
Table 3: The `cos3` term as estimated by `mgcv::bam()`.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| cos3 | 0 | 0 | NaN | NaN |
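One way to make this coefficient comparison systematically is sketched below; it assumes the `broom` package is available and that dropping the NA reference-level rows from the `glmmLDTS` table leaves the two sets of coefficients in matching order:

```r
library(dplyr)
library(broom)

# Parametric (non-smooth) coefficients from the bam fit; the column names
# match those shown in Table 2.
bam_coefs <- tidy(m2, parametric = TRUE)

# glmmLDTS fixed effects, dropping reference-level rows (estimate 0, NA std.err).
glmm_coefs <- spotted_fit_glmm$fixed.effects %>%
  filter(!is.na(std.err))

# Side-by-side estimates (assumes identical ordering of the two tables).
data.frame(term = bam_coefs$term,
           glmmLDTS = glmm_coefs$estimate,
           bam = bam_coefs$estimate)
```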
Comparing Prediction Estimates and Confidence Intervals
So, it seems that both the original `glmmLDTS` approach and the `mgcv::bam()` approach can provide very similar fits with nearly identical coefficient estimates and standard errors. But this is to be expected, I think, given that the GLMM can be considered a special case of the GAM and the only smooth term included was for the random effect. The next area to explore is how the models might compare with respect to prediction estimates and the confidence intervals around those predictions.

For this prediction exercise, we'll want to create a new data frame that includes all of the values we'd like to predict at. For this particular model, it's a rather complex exercise and, thus, the rather involved function below.
```r
create_newdata <- function(data, age_sex) {
  df_list <- vector(mode = "list", length = length({{ age_sex }}))

  for (a_s in {{ age_sex }}) {
    range_yday <- data %>%
      filter(age_sex == a_s) %>%
      summarize(start_day = min(yday),
                end_day = max(yday))
    start_day <- range_yday$start_day
    end_day <- range_yday$end_day
    n_days <- (end_day - start_day) + 1

    # for wx covariates we'll use a gam to get values by
    # day/hour since wx is likely to vary w/in day over the season
    gam.baro <- gam(pressure ~ s(yday),
                    data = data,
                    method = "REML")
    gam.temp <- gam(temp2m ~ s(yday) + s(as.numeric(solar_hour)),
                    data = data,
                    method = "REML")
    gam.wind <- gam(wind ~ s(yday) + s(as.numeric(solar_hour)),
                    data = data,
                    method = "REML")
    gam.precip <- gam(precip ~ s(yday),
                      data = data,
                      method = "REML")

    wx_new_data <- data.frame(
      solar_hour = rep(0:23, each = n_days),
      yday = rep(start_day:end_day, times = 24)
    )

    temp_pred <- predict(gam.temp, newdata = wx_new_data)
    wind_pred <- predict(gam.wind, newdata = wx_new_data)
    baro_pred <- predict(gam.baro, newdata = wx_new_data)
    precip_pred <- predict(gam.precip, newdata = wx_new_data)

    df <- data.frame(
      age_sex = a_s,
      solar_hour = rep(0:23, each = n_days),
      yday = rep(start_day:end_day, times = 24),
      northing = mean(data$northing),
      temp2m = temp_pred,
      wind = wind_pred,
      pressure = baro_pred,
      precip = precip_pred
    ) %>%
      mutate(
        sin1 = sin(pi * solar_hour / 12),
        cos1 = cos(pi * solar_hour / 12),
        sin2 = sin(pi * solar_hour / 6),
        cos2 = cos(pi * solar_hour / 6),
        sin3 = sin(pi * solar_hour / 4),
        cos3 = cos(pi * solar_hour / 4)
      ) %>%
      mutate(day = (yday - 120) / 10,
             day_2 = day^2,
             day_3 = day^3,
             day_4 = day^4)

    df_list[[a_s]] <- df
  }
  if (length({{ age_sex }}) > 1) {
    df_out <- bind_rows(df_list) %>%
      mutate(age_sex = forcats::fct_relevel(
        age_sex,
        c("ADULT.F", "ADULT.M", "SUBADULT", "YOUNG OF YEAR"))
      )
  } else {
    df_out <- bind_rows(df_list)
  }
  df_out
}
```
```r
spotted_newdata <- create_newdata(
  data = spotted_fit_glmm$dataset,
  age_sex = levels(spotted_fit_glmm$dataset$age_sex)
)
```
Predicting from the glmmLDTS model

Now, we need a function to predict from the `glmmLDTS` model. This isn't an included function within the package, so we'll create our own.
```r
predict.glmmLDTS <- function(glmmLDTS_model, newdata,
                             type = "response") {
  # create the model matrix
  spotted_mm <- model.matrix(glmmLDTS_model$fixed.formula[-2],
                             data = newdata)
  # clean up extra intercept and get coef
  fit_coef <- glmmLDTS_model$fixed.effects %>%
    filter(!is.na(std.err)) %>%
    pull(estimate)

  predicts <- tibble(logit_fits = as.vector(spotted_mm %*% fit_coef)) %>%
    mutate(logit_fits_se = sqrt(diag(
      spotted_mm %*% glmmLDTS_model$covb %*% t(spotted_mm)
    )),
    logit_fits_lo95 = logit_fits - 1.96 * logit_fits_se,
    logit_fits_up95 = logit_fits + 1.96 * logit_fits_se,
    ho_prob = plogis(spotted_mm %*% fit_coef),
    lower95 = plogis(logit_fits_lo95),
    upper95 = plogis(logit_fits_up95)
    )
  return(predicts)
}
```
```r
spotted_newdata <- spotted_newdata %>%
  dplyr::bind_cols(
    predict.glmmLDTS(spotted_fit_glmm, spotted_newdata)
  )
```
Predicting from the mgcv::bam() model

The `mgcv` package includes a `predict()` function for us, so we can proceed directly to prediction using our `spotted_newdata` and bind the results. The `predict.bam()` function requires a column for `speno` even though we are excluding the random effect (`s(speno)`) from the model prediction. As with the `glmmLDTS` prediction exercise, we'll specify `type = "link"` and calculate our confidence intervals directly.
```r
spenos <- unique(spotted_model_data$speno) %>%
  as.character()

spotted_newdata <- spotted_newdata %>%
  mutate(speno = sample(spenos, 1))

spotted_newdata <- spotted_newdata %>%
  bind_cols(
    predict(
      m2,
      spotted_newdata,
      type = "link",
      se.fit = TRUE,
      exclude = "s(speno)"
    ) %>%
      as_tibble()
  ) %>%
  rename(fit_bam = fit, se_bam = se.fit) %>%
  mutate(
    logit_bam_lo95 = fit_bam - 1.96 * se_bam,
    logit_bam_up95 = fit_bam + 1.96 * se_bam,
    ho_prob_bam = plogis(fit_bam),
    lower95_bam = plogis(logit_bam_lo95),
    upper95_bam = plogis(logit_bam_up95)
  )
```
Now, with predictions from both model approaches in hand, we can visualize the comparison of predictions and associated standard errors (Figure 1).
```r
ggplot(data = spotted_newdata) +
  geom_point(aes(x = ho_prob, y = ho_prob_bam),
             alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0) +
  coord_cartesian() +
  xlab("Predicted HO Probability - glmmLDTS") +
  ylab("Predicted HO Probability - mgcv::bam") +
  ggtitle("Comparing Predicted Haul-out Probability",
          subtitle = "values shown are on the response scale") +
  theme_minimal() + theme(legend.position = "none")
```
This looks great!
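To put a number on that agreement, a quick summary sketch could supplement the figure:

```r
# Summarize agreement between the two sets of predictions (response scale).
spotted_newdata %>%
  summarize(
    correlation   = cor(as.numeric(ho_prob), ho_prob_bam),
    mean_abs_diff = mean(abs(as.numeric(ho_prob) - ho_prob_bam)),
    max_abs_diff  = max(abs(as.numeric(ho_prob) - ho_prob_bam))
  )
```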
Now, let’s visualize the predictions with a heat map (Figure 2) that depicts haul-out probability changes over the season and hour of day for each of our age and sex classes.
```r
plot_df <- spotted_newdata %>%
  mutate(date = lubridate::as_date(yday, origin = "2015-01-01"),
         month = lubridate::month(date, label = TRUE),
         day = lubridate::day(date)) %>%
  filter(!month %in% c("Jul", "Aug"))
```
```r
p1 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )
p1 <- p1 + facet_grid(age_sex ~ month, scales = "free_x")
p1 <- p1 + scale_x_continuous(breaks = c(5, 10, 15, 20, 25)) +
  scale_y_continuous(breaks = c(4, 12, 20)) +
  coord_cartesian(expand = FALSE)
p1 <- p1 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour = "white")) +
  theme(axis.ticks = element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - glmmLDTS")

p2 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob_bam)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )
p2 <- p2 + facet_grid(age_sex ~ month, scales = "free_x")
p2 <- p2 + scale_x_continuous(breaks = c(5, 10, 15, 20, 25)) +
  scale_y_continuous(breaks = c(4, 12, 20)) +
  coord_cartesian(expand = FALSE)
p2 <- p2 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour = "white")) +
  theme(axis.ticks = element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - mgcv::bam")

p1 / p2
```
That looks great, too!
So far, we’ve shown that the model estimates, associated standard errors, and prediction values for haul-out probability are very similar between the glmmLDTS
and mgcv::bam()
approach. The final comparison to evaluate is the prediction standard errors. We’ll start with an XY plot similar to what we did for the prediction estimates.
```r
ggplot(data = spotted_newdata) +
  geom_point(aes(x = logit_fits_se, y = se_bam),
             alpha = 0.1) +
  geom_abline(slope = 1, intercept = 0) +
  coord_cartesian() + ylim(c(0, NA)) + xlim(c(0, NA)) +
  xlab("Prediction Standard Error - glmmLDTS") +
  ylab("Prediction Standard Error - mgcv::bam") +
  ggtitle("Comparing Prediction Standard Errors",
          subtitle = "values shown are on the logit scale, not response") +
  theme_minimal() + theme(legend.position = "none")
```
Just like the other comparisons, this also looks very promising. As a final visualization, let's fix the `solar_hour` parameter to noon and compare predictions and standard errors for both models across age and sex class.
```r
p3 <- ggplot(plot_df %>% filter(solar_hour == 12), aes(date, ho_prob)) +
  geom_ribbon(aes(ymin = lower95, ymax = upper95), fill = "grey30", alpha = 0.3) +
  geom_line(aes(date, ho_prob)) +
  geom_ribbon(aes(ymin = lower95_bam, ymax = upper95_bam), fill = "orange", alpha = 0.3) +
  geom_line(aes(date, ho_prob_bam), color = "orange") +
  facet_grid(. ~ age_sex) +
  xlab("day of year") +
  ylab("haul-out probability") +
  ggtitle("Comparing Predictions & Standard Errors",
          subtitle = "glmmLDTS is the darker shaded region and orange is mgcv::bam") +
  theme_minimal() +
  theme(legend.position = "none")

p3
```
So, it seems we can adopt `mgcv::bam()` as a quick surrogate for `glmmLDTS` and gain an astonishing decrease in computation time. But we should probably temper any enthusiasm to rely on `mgcv::bam()` as a complete replacement for `glmmLDTS` in this case until proper comparisons with simulated data can be done.
Additionally, it is worth considering a possibly fairer comparison, in terms of time. This would involve use of Gaussian process basis functions by including a model formula something like `y ~ x1 + ... + xn + s(time, bs = "gp", k = floor(n/2), m = c(2, 3, 1))` to get a reduced-rank version of the AR1 model for the residuals (also specifying `method = "REML"` in the call to `mgcv::bam()`).
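A rough sketch of what that might look like here (assumptions: `time_vec` serves as the time index, and because `floor(n / 2)` would be impractically large for this data set, a much smaller basis dimension would likely be chosen in practice):

```r
n <- nrow(spotted_model_data)

# Reduced-rank Gaussian process smooth of the time index in place of the
# AR1 residual structure; m = c(2, 3, 1) selects the covariance function
# as suggested above.
m_gp <- mgcv::bam(
  dry ~ age_sex + s(speno, bs = "re") +
    s(time_vec, bs = "gp", k = floor(n / 2), m = c(2, 3, 1)),
  data = spotted_model_data,
  family = binomial,
  method = "REML"
)
```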
What About a More GAM/BAM-centric Approach?
Up to this point, the focus has been on comparing the same model specification (other than the approach for including random effects) between `glmmLDTS` and `mgcv::bam()`. However, a more GAM-like approach to `mgcv::bam()` might rely on a different specification that includes more smooth terms. That might be interesting to explore while we are at it. The `mgcv` package provides a range of options for specifying smooths within the model. For the environmental covariates, we'll specify a simple cubic spline with a shrinkage component (`bs = "cs"`). For the day of year (`yday`) we'll specify a cubic spline that varies by age and sex class (`bs = "cr", by = age_sex`), and for solar hour we'll specify a circular spline (`bs = "cc"`). Interactions are handled as tensor product smooths. It wasn't clear whether the `te()` or the `ti()` approach would be best, so both are fit and compared with AIC.
```r
spotted_model_data <- spotted_model_data %>%
  mutate(solar_hour = as.integer(solar_hour),
         wind_temp2 = wind * temp2m)
```
```r
m5ti <- mgcv::bam(
  dry ~ s(speno, bs = "re") +
    s(yday, bs = "cr", by = age_sex) + s(solar_hour, bs = "cc") +
    ti(yday, solar_hour, bs = c("cr", "cc")) + age_sex +
    s(precip, bs = "cs") + s(temp2m, bs = "cs") +
    s(wind, bs = "cs") + ti(wind, temp2m, bs = "cs") + s(pressure, bs = "cs"),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE
)
```
```r
m5te <- mgcv::bam(
  dry ~ s(speno, bs = "re") +
    te(yday, solar_hour, bs = c("cr", "cc"), by = age_sex) + age_sex +
    te(wind, temp2m, bs = "cs") + s(pressure, bs = "cs"),
  data = spotted_model_data,
  family = binomial,
  AR.start = ar1_start,
  rho = lag1,
  discrete = TRUE
)
```
```r
AIC(m5ti, m5te)
```

```
           df       AIC
m5ti 155.3137 -27419.44
m5te 156.9675 -26427.64
```
Ok, looks like the `ti()` approach is favored. Now, let's do some predictions!
```r
spotted_newdata <- spotted_newdata %>%
  mutate(wind_temp2 = wind * temp2m)

m5_predict <- predict(
  m5ti,
  spotted_newdata,
  type = "link",
  se.fit = TRUE,
  exclude = "s(speno)"
) %>%
  as_tibble() %>%
  rename(fit_bam2 = fit, se_bam2 = se.fit) %>%
  mutate(
    logit_bam2_lo95 = fit_bam2 - 1.96 * se_bam2,
    logit_bam2_up95 = fit_bam2 + 1.96 * se_bam2,
    ho_prob_bam2 = plogis(fit_bam2),
    lower95_bam2 = plogis(logit_bam2_lo95),
    upper95_bam2 = plogis(logit_bam2_up95)
  )

spotted_newdata <- spotted_newdata %>%
  bind_cols(m5_predict)
```
And, visualize those predictions similar to our other plots:
```r
plot_df <- spotted_newdata %>%
  mutate(date = lubridate::as_date(yday, origin = "2015-01-01"),
         month = lubridate::month(date, label = TRUE),
         day = lubridate::day(date)) %>%
  filter(!month %in% c("Jul", "Aug"))

p1 <- ggplot(plot_df, aes(day, solar_hour, fill = ho_prob_bam2)) +
  geom_tile(color = "white", linewidth = 0) +
  scale_fill_gradientn(
    colors = rev(met.brewer("Hiroshige")),
    name = "haul-out probability",
    aesthetics = "fill",
    limits = c(0, 1),
    breaks = c(0.25, 0.50, 0.75),
    guide = guide_colorbar(
      title.position = "bottom",
      barwidth = 15,
      barheight = 0.5,
      title.hjust = 0.5
    )
  )
p1 <- p1 + facet_grid(age_sex ~ month, scales = "free_x")
p1 <- p1 + scale_x_continuous(breaks = c(5, 10, 15, 20, 25)) +
  scale_y_continuous(breaks = c(4, 12, 20)) +
  coord_cartesian(expand = FALSE)
p1 <- p1 + theme_minimal() +
  theme(panel.spacing = unit(0.1, "lines")) +
  theme(legend.position = "bottom") +
  theme(strip.background = element_rect(colour = "white")) +
  theme(axis.ticks = element_blank()) +
  xlab("day of month") + ylab("local solar hour") +
  ggtitle("Spotted seal haul-out predictions - mgcv::bam with smooths")

p1
```
That’s not so bad and, actually, pretty good given the very different approach. I imagine it speaks to the strong signal in the underlying data. An interesting aspect of this result is that we, now, see a period of increased haul-out behavior in adult females that coincides with pupping and nursing. There’s also an increased signal for adult males in that time that would coincide with their courting behavior and hauling out with nursing females. This difference was especially noticeable compared to early model specifications of glmmLDTS
that only included up to a 3rd order polynomial for day of year. Inclusion of a 4th order polynomial has brought the two models more inline and it’s worth considering whether the gam-centric approach is over-fitting and needs to implement some additional constraints.
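One way to start probing that over-fitting question is sketched below using standard `mgcv` diagnostics (not run as part of this comparison):

```r
# Check whether the basis dimension (k) of each smooth appears adequate;
# k-index values below 1 with low p-values can indicate k is too small,
# while EDF well below k' suggests the penalty is already constraining
# the wiggliness of that term.
mgcv::k.check(m5ti)

# The summary lists the effective degrees of freedom (edf) for each smooth,
# which shows how much flexibility each term is actually using.
summary(m5ti)
```

Finally, we can repeat the noon comparison plot for this model: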
```r
p3 <- ggplot(plot_df %>% filter(solar_hour == 12), aes(date, ho_prob)) +
  geom_ribbon(aes(ymin = lower95, ymax = upper95), fill = "grey30", alpha = 0.3) +
  geom_line(aes(date, ho_prob)) +
  geom_ribbon(aes(ymin = lower95_bam2, ymax = upper95_bam2), fill = "orange", alpha = 0.3) +
  geom_line(aes(date, ho_prob_bam2), color = "orange") +
  facet_grid(. ~ age_sex) +
  xlab("day of year") +
  ylab("haul-out probability") +
  ggtitle("Comparing Predictions & Standard Errors",
          subtitle = "glmmLDTS is the darker shaded region and orange is mgcv::bam with smooths") +
  theme_minimal() +
  theme(legend.position = "none")

p3
```
Acknowledgments
Special thanks to Debbie Russell, Matt Carter, Peter Boveng, and Jay Ver Hoef for their thoughts, discussions, and encouragement to explore. And, to Paul Conn for his originating work on this analysis.