---
title: Longitudinal analysis using Frequentist methods
author: "[Julian Faraway](https://julianfaraway.github.io/)"
date: "`r format(Sys.time(), '%d %B %Y')`"
format:
  gfm:
    toc: true
---
```{r}
#| label: global_options
#| include: false
knitr::opts_chunk$set(comment=NA,
                      echo = TRUE,
                      fig.path="figs/",
                      dev = 'svglite',
                      fig.ext = ".svg",
                      warning=TRUE,
                      message=TRUE)
knitr::opts_knit$set(global.par = TRUE)
```
```{r}
#| label: graphopts
#| include: false
par(mgp=c(1.5,0.5,0), mar=c(3.1,3.1,0.1,0), pch=20)
ggplot2::theme_set(ggplot2::theme_bw())
```
See the [introduction](index.md) for an overview.
See a [mostly Bayesian analysis](longitudinal.md) of the same data.
This example is discussed in more detail in my book
[Extending the Linear Model with R](https://julianfaraway.github.io/faraway/ELM/).
Required libraries:
```{r}
library(faraway)
library(ggplot2)
```
# Data
The Panel Study of Income Dynamics (PSID), begun in 1968, is a
longitudinal study of a representative sample of U.S. individuals.
The study is conducted at the Survey Research
Center, Institute for Social Research, University of Michigan, and is
still continuing. There are currently 8700 households in the study
and many variables are measured. We chose to analyze a random subset
of this data, consisting of 85 heads of household who were aged
25--39 in 1968 and had complete data for at least 11 of the
years between 1968 and 1990. The variables included were annual
income, gender, years of education, and age in 1968.
Load in the data:
```{r}
data(psid, package="faraway")
head(psid)
summary(psid)
psid$cyear <- psid$year-78
```
We have centered the time variable `year`. Here are some plots for just the first 20 subjects:
```{r}
#| label: psidplot
psid20 = subset(psid, person <= 20)
ggplot(psid20, aes(x=year, y=income))+geom_line()+facet_wrap(~ person)
ggplot(psid20, aes(x=year, y=income+100, group=person)) +geom_line()+facet_wrap(~ sex)+scale_y_log10()
```
# Mixed Effect Model
Suppose that the income change over time can be partly predicted by
the subject's age, sex and educational level. The variation may be partitioned into two components.
Clearly, there are other factors that will affect a subject's income.
These factors may cause the income to be generally higher or lower, or
they may cause the income to grow at a faster or slower rate. We can
model this variation with a random intercept and slope, respectively,
for each subject. We also expect that there will be some year-to-year
variation within each subject. For simplicity, let us initially assume
that this error is homogeneous and uncorrelated. We have already centered the year to aid interpretation.
# LME4
See the discussion for the [single random effect example](pulpfreq.md#LME4)
for some introduction.
```{r}
library(lme4)
mmod = lmer(log(income) ~ cyear*sex +age+educ+(1 | person) + (0 + cyear | person), psid)
summary(mmod, cor=FALSE)
```
This model can be written as:
```{math}
\begin{aligned}
\mathrm{\log(income)}_{ij} &= \mu + \beta_{y} \mathrm{year}_i + \beta_s \mathrm{sex}_j + \beta_{ys} \mathrm{sex}_j \times \mathrm{year}_i + \beta_e \mathrm{educ}_j + \beta_a \mathrm{age}_j \\ &+ \gamma^0_j + \gamma^1_j \mathrm{year}_i + \epsilon_{ij}
\end{aligned}
```
where $i$ indexes the year and $j$ indexes the individual. We have:
```{math}
\left(
\begin{array}{c}
\gamma^0_j \\
\gamma^1_j
\end{array}
\right) \sim
N(0,\sigma^2 D)
```
We have chosen not to allow a correlation between the intercept and slope random effects,
and so $D$ is a diagonal matrix. It is possible to allow such a correlation in
`lmer()` models (but not in all the Bayesian models). As it happens, if you do fit
a model with this correlation, you find that it is not significant, so dropping
it does not make much difference in this instance.
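For reference, a sketch of the fit that does allow this correlation (the model name `mmodc` is just illustrative):
```{r}
# allow the random intercept and slope to be correlated (non-diagonal D)
mmodc = lmer(log(income) ~ cyear*sex + age + educ + (cyear | person), psid)
summary(mmodc, cor=FALSE)
```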
We can test the interaction term in the fixed effect part of the model:
```{r}
library(pbkrtest)
mmod = lmer(log(income) ~ cyear*sex + age + educ +
(1 | person) + (0 + cyear | person),
psid, REML=FALSE)
mmodr <- lmer(log(income) ~ cyear + sex + age + educ +
(1 | person) + (0 + cyear | person),
psid, REML=FALSE)
KRmodcomp(mmod,mmodr)
```
We find that the interaction is statistically significant. We can also compute
bootstrap confidence intervals:
```{r}
#| label: psidboot
#| cache: TRUE
confint(mmod, method="boot", oldNames=FALSE)
```
We see that all the standard deviations are well above zero. The age
effect does not look significant.
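If we wanted a formal test of the age effect, a sketch using another Kenward-Roger comparison (the model name `mmodna` is illustrative):
```{r}
# drop age from the fixed effects and test with the Kenward-Roger approximation
mmodna = lmer(log(income) ~ cyear*sex + educ +
              (1 | person) + (0 + cyear | person),
              psid, REML=FALSE)
KRmodcomp(mmod, mmodna)
```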
# NLME
See the discussion for the [single random effect example](pulpfreq.md#NLME)
for some introduction.
The syntax for specifying the random effects part of the model is different from `lme4`:
```{r}
library(nlme)
nlmod = lme(log(income) ~ cyear*sex + age + educ,
psid,
~ cyear | person)
summary(nlmod)
```
The results are very similar, although not identical, to the `lme4` fit. Part of
the difference is that the `~ cyear | person` specification also estimates a
correlation between the random intercept and slope, which we excluded from the
`lme4` model; as noted above, that correlation is not significant, so it makes
little difference. Small differences in the convergence criteria used by the
fitting algorithms also contribute.
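If we wanted to match the diagonal random-effects structure used in the `lme4` fit, a sketch using `pdDiag()` (the model name `nlmodd` is just illustrative):
```{r}
# diagonal covariance for the random intercept and slope (no correlation),
# matching the lme4 specification above
nlmodd = lme(log(income) ~ cyear*sex + age + educ,
             data = psid,
             random = list(person = pdDiag(~ cyear)))
summary(nlmodd)
```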
`nlme` does provide t-tests on the fixed effects, which are useful
in this instance. We see that age is not a significant effect.
We can also do the sequential ANOVA:
```{r}
anova(nlmod)
```
Only the test on the last term agrees with the previous t-tests because
of the sequential nature of the testing (i.e. different models
are being compared).
More useful are the marginal tests provided by:
```{r}
library(car)
Anova(nlmod)
```
# MMRM
See the discussion for the [single random effect example](pulpfreq.md#MMRM)
for some introduction.
```{r}
library(mmrm)
```
It is not currently possible to fit the random slope and
intercept model in `mmrm`, but we can fit some other models.
We need a factor form of the time variable, and `mmrm` insists
that the subject variable also be a factor.
```{r}
psid$visit = factor(psid$cyear)
psid$person = factor(psid$person)
```
The most general model uses an unstructured covariance matrix
for the random component of the model. Since we have 23 timepoints,
that means 23*24/2 = 276 parameters in total: one covariance for each
pair of timepoints plus the 23 variances on the diagonal.
```{r}
mmmodu = mmrm(log(income) ~ cyear*sex + age+educ+ us(visit | person), psid)
summary(mmmodu)
```
Given the number of parameters, this takes a while to fit (although it was
faster than I expected).
We might assume that the correlation between timepoints depends only on their
distance apart, giving a Toeplitz structure for the covariance matrix of the
random subject component:
```{r}
mmmodt = mmrm(log(income) ~ cyear*sex + age+educ+ toep(visit | person), psid)
summary(mmmodt)
```
This brings us down to 23 parameters for the variance matrix.
The fit generates a convergence warning but the output looks reasonable.
We can reasonably assume the much stronger AR1 (first-order autoregressive)
structure on successive timepoints:
```{r}
mmmoda = mmrm(log(income) ~ cyear*sex + age+educ+ ar1(visit | person), psid)
summary(mmmoda)
```
Now we have only 2 parameters for the variance matrix.
There is a plausible myth that in mixed effect models where you are mainly
interested in the fixed effects, the random structure does not matter so
much provided some flexibility is given to account for the correlation
between observations on the same subject.
The unstructured covariance uses far too many parameters so we might discard
that. Let's compare the Toeplitz fixed effect fit:
```{r}
tidy(mmmodt) |> knitr::kable()
```
with the AR1 fixed effect fit:
```{r}
tidy(mmmoda) |> knitr::kable()
```
We see that there is not much difference in the fixed effect fits and
their standard errors. It does affect whether the interaction
is statistically significant (but p-values are themselves sensitive to
small changes in the data, so there really is not much difference between
p=0.06 and p=0.04). These results are also quite similar to the `nlme` and
`lme4` output.
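One way to put numbers on the comparison of covariance structures is via information criteria. A sketch, assuming the `AIC()` and `BIC()` methods supplied by `mmrm` (all three fits use REML with the same fixed effects, so the values are comparable):
```{r}
# compare the three mmrm covariance structures on AIC and BIC
data.frame(structure = c("unstructured", "Toeplitz", "AR1"),
           AIC = c(AIC(mmmodu), AIC(mmmodt), AIC(mmmoda)),
           BIC = c(BIC(mmmodu), BIC(mmmodt), BIC(mmmoda)))
```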
# GLMMTMB
See the discussion for the [single random effect example](pulpfreq.md#GLMMTMB)
for some introduction.
```{r}
library(glmmTMB)
```
The default fit uses ML, so we use REML here for the purposes of comparison
with `lme4`:
```{r}
gtmod <- glmmTMB(log(income) ~ cyear*sex +age+educ+(1 | person) + (0 + cyear | person), psid, REML=TRUE)
summary(gtmod)
```
This matches the `lme4` output earlier.
`glmmTMB` has some additional covariance structures for the
random part of the model. In particular, it has the AR1 option
as used for `mmrm` above:
```{r}
gtmoda <- glmmTMB(log(income) ~ cyear*sex +age+educ+ ar1(0 + visit | person),
data=psid,
REML=TRUE)
summary(gtmoda)
```
The results are similar but not the same as the `mmrm` output. The
correlation is estimated as 0.77 whereas `mmrm` gives:
```{r}
cm = mmmoda$cov
cm[1,2]/cm[1,1]
```
- `glmmTMB` can also handle the Toeplitz and unstructured covariance forms.
- It would also be possible to fit an AR1-type model using `gls()` from
`nlme` (a sketch follows).
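A sketch of the `gls()` approach (the model name `glsmod` is illustrative and the fit is not examined further here):
```{r}
# marginal model: AR(1) correlation within person for the errors
glsmod = gls(log(income) ~ cyear*sex + age + educ,
             data = psid,
             correlation = corAR1(form = ~ cyear | person))
summary(glsmod)
```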
# Discussion
We have contending models from all four packages, but only
`glmmTMB` can attempt them all. The question of which
random structure is appropriate for this data is
not easily answered. There is a suggestion that this choice
may not be crucial if the fixed effects are our main interest.
# Package version info
```{r}
sessionInfo()
```