Main

BWAS use non-invasive MRI to identify associations between inter-individual differences in behaviour, cognition, biological or clinical measurements and brain structure or function1,2. A fundamental goal of BWAS is to identify true underlying biological associations that improve our understanding of how brain organization and function are linked to health across the lifespan.

Recent studies have raised concerns about the replicability of BWAS1,2,3. Statistical replicability is typically defined as the probability of obtaining consistent results from hypothesis tests across different studies. Like statistical power, replicability is a function of both the standardized effect size and the sample size5,6,7. Low replicability in BWAS has been attributed to a combination of small sample sizes, small standardized effect sizes and bad research practices (such as p-hacking and publication bias)1,2,8,9,10,11,12. The most obvious solution to increase the replicability in BWAS is to increase study sample sizes. Several recent studies have shown that thousands of study participants are required to obtain replicable findings in BWAS1,2. However, massive sample sizes are often infeasible in practice.
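The dependence of replicability on standardized effect size and sample size can be illustrated with a small Monte Carlo sketch (an illustrative simulation under a normal approximation, not the estimation procedure used in this study; here replicability is taken as the probability that two independent two-group studies both reject at a significance level of 0.05 with effects in the same direction):

```python
import numpy as np

rng = np.random.default_rng(0)

def replicability(d, n_per_group, n_sim=4000):
    """Monte Carlo estimate of the probability that two independent
    two-group studies (true Cohen's d, n per group) both reject at
    alpha = 0.05 with estimated effects in the same direction.

    Uses the normal approximation z ~ N(d * sqrt(n/2), 1) for the
    balanced two-sample z statistic.
    """
    mu = d * np.sqrt(n_per_group / 2)
    z1 = rng.normal(mu, 1.0, n_sim)
    z2 = rng.normal(mu, 1.0, n_sim)
    both = (np.abs(z1) > 1.96) & (np.abs(z2) > 1.96) & (z1 * z2 > 0)
    return both.mean()

# A small standardized effect needs thousands of participants,
# whereas a large effect replicates with far fewer.
r_small = replicability(d=0.1, n_per_group=2000)
r_large = replicability(d=0.5, n_per_group=100)
```

At BWAS-scale effects (d near 0.1), high replicability requires samples in the thousands, whereas the same replicability is reached with roughly a hundred participants per group when d is five times larger.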

Standardized effect sizes (such as Pearson’s correlation and Cohen’s d) are statistical values that depend not only on the underlying biological association in the population but also on the study design. Two studies of the same biological effect with different study designs will have different standardized effect sizes. For example, a contrast of brain function between groups with and without depression will yield a different Cohen’s d depending on whether the design measures more extreme or less extreme depressed states contemporaneously with the measures of brain function, even though the underlying biological effect is the same. Although researchers cannot increase the magnitude of the underlying biological association, its standardized effect size — and thus its replicability — can be increased by critical features of study design.

In this study, we focus on identifying modifiable study design features that can be used to improve the replicability of BWAS by increasing standardized effect sizes. Increasing standardized effect sizes through study design before data collection stands in contrast to bad research practices that can artificially inflate reported effect sizes, such as p-hacking and publication bias. There has been very little research regarding how modifications to the study design might improve BWAS replicability. Specifically, we focus on two major design features that directly influence standardized effect sizes: variation in sampling scheme and longitudinal designs1,13,14,15. Of note, these design features can be implemented without inflating the sample estimate of the underlying biological effect when using correctly specified models16. By increasing the replicability of BWAS through study design, we can more efficiently utilize the US$1.8 billion average annual investment in neuroimaging research from the US National Institutes of Health (https://reporter.nih.gov/search/_dNnH1VaiEKU_vZLZ7L2xw/projects/charts).

Here we conduct a comprehensive investigation of cross-sectional and longitudinal BWAS designs by capitalizing on multiple large-scale data resources. Specifically, we begin by analysing and meta-analysing 63 neuroimaging datasets including 77,695 scans from 60,900 cognitively normal participants from the Lifespan Brain Chart Consortium4 (LBCC). We leverage data from the UK Biobank (UKB; up to 29,031 scans), the Alzheimer’s Disease Neuroimaging Initiative (ADNI; 2,232 scans) and the Adolescent Brain Cognitive Development study (ABCD; up to 17,210 scans) to investigate the most commonly measured phenotypes of brain structure and function. To ensure that our results are broadly generalizable, we evaluate associations with diverse covariates of interest, including age, sex, cognition and psychopathology. To facilitate comparison between BWAS designs, we also introduce a new version of the robust effect size index (RESI)17,18,19 that allows us to demonstrate how longitudinal study design directly impacts standardized effect sizes.

Standardized effect sizes depend on study design

To fit each study-level analysis, we regressed each of the global brain measures (total grey matter volume (GMV), total subcortical grey matter volume (sGMV), total white matter volume (WMV) and mean cortical thickness) and regional brain measures (regional GMV and cortical thickness, based on Desikan–Killiany parcellation20) on sex and age in each of the 63 neuroimaging datasets from the LBCC. Age was modelled using a non-linear spline function in linear regression models for the cross-sectional datasets and generalized estimating equations (GEEs) for the longitudinal datasets (Methods). Site effects were removed before the regressions using ComBat21,22 (Methods). Analyses for total GMV, total sGMV and total WMV used all 63 neuroimaging datasets (16 longitudinal; Supplementary Table 1). Analyses of regional brain volumes and cortical thickness used 43 neuroimaging datasets (13 longitudinal; Methods and Supplementary Table 2).

Throughout the present study, we used the RESI17,18,19 as a measure of standardized effect size. The RESI is a recently developed index that is equal to one-half of Cohen’s d when the assumptions underlying Cohen’s d hold17 (Methods; section 3 in Supplementary Information). We used the RESI as a standardized effect size because it is broadly applicable to many types of models and is robust to model misspecification. To investigate the effects of study design features on the RESI, we performed meta-analyses for the four global brain measures and two regional brain measures in the LBCC to model the association of study-level design features with standardized effect sizes. Study design features were quantified by the sample mean, standard deviation and skewness of the age covariate (entered as non-linear terms) and a binary variable indicating the design type (cross-sectional or longitudinal). After obtaining the estimates of the standardized effect sizes of age and sex in each analysis of the global and regional brain measures, we conducted meta-analyses of the estimated standardized effect sizes using weighted linear regression models with the study design features as covariates (Methods).
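The scale of the RESI can be illustrated numerically (a minimal sketch, not the authors' implementation; the published estimator is available in the RESI R package). For a one-degree-of-freedom Wald test, the index can be computed from the z statistic, and for a balanced two-group comparison it recovers approximately half of Cohen's d:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated two-group comparison with true Cohen's d = 0.5.
n = 200_000
group = rng.integers(0, 2, n).astype(float)   # 0/1 group indicator
y = 0.5 * group + rng.normal(size=n)          # unit residual s.d.

# Wald (z) statistic for the group coefficient from a simple OLS fit.
x = group - group.mean()
beta = (x @ y) / (x @ x)
resid = y - y.mean() - beta * x
se = np.sqrt((resid @ resid) / (n - 2) / (x @ x))
z = beta / se

# RESI estimate for a one-degree-of-freedom test:
# sqrt(max(0, (z^2 - 1) / n)), which is approximately d / 2 here.
resi = np.sqrt(max(0.0, (z ** 2 - 1) / n))
```

With d = 0.5 the estimate lands near 0.25, matching the stated one-half relationship between the RESI and Cohen's d.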

For total GMV, the partial regression plots of the effect of each study design feature demonstrate a strong cubic-shaped relationship between the standardized effect size for the total GMV–age association and the study population mean age. This cubic shape indicates that the strength of the age effect varies with respect to the age of the population being studied. The largest age effect on total GMV in the human lifespan occurs during early and late adulthood (Fig. 1a and Supplementary Table 3). There is also a strong positive linear effect of the study population standard deviation of age on the standardized effect size for the total GMV–age association. For each unit increase in the standard deviation of age (in years), the expected standardized effect size increases by about 0.1 (Fig. 1a). This aligns with the well-known statistical relationship between correlation strength and covariate standard deviation23. Plots for total sGMV, total WMV and mean cortical thickness show U-shaped changes of the age effect with respect to the study population mean age (Fig. 1b–d). A similar but sometimes weaker relationship is seen between the expected standardized effect size and the study population standard deviation and skewness of the age covariate (Fig. 1b–d and Supplementary Tables 4–6). Finally, the meta-analyses also show a moderate effect of study design on the standardized effect size of age on each of the global brain measures (Fig. 1a–d and Supplementary Tables 3–6). The average standardized effect size for total GMV–age associations in longitudinal studies (RESI = 0.39) is substantially larger than in cross-sectional studies (RESI = 0.08) after controlling for the other study design variables, corresponding to a more than 380% increase in the standardized effect size for longitudinal studies. This value quantifies the systematic differences in the standardized effect sizes between the cross-sectional and longitudinal studies among the 63 neuroimaging studies.
Of note, longitudinal study design does not improve the standardized effect size for biological sex, because sex does not vary within participants in these studies (Supplementary Tables 7–10 and Supplementary Fig. 2).

Fig. 1: Meta-analyses reveal study design features that are associated with larger standardized effect sizes of age on different brain measures.
figure 1

a–d, Partial regression plots of the meta-analyses of standardized effect sizes (RESI) for the association between age and global brain measures — total GMV (a), total sGMV (b), total WMV (c) and mean cortical thickness (CT) (d) — show that standardized effect sizes vary with respect to the mean and standard deviation (s.d.) of age in each study. Box plots show the median (horizontal line), interquartile range (grey box), min-max values (vertical lines), and outliers (points). ‘| other’ means after fixing the other features at constant levels: design was cross-sectional, mean age of 45 years, sample age s.d. = 7 and/or skewness of age = 0 (symmetric). The blue curves are the expected standardized effect sizes for age from the locally estimated scatterplot smoothing (LOESS) curves. The grey shading areas are the 95% confidence bands from the LOESS curves. e,f, The effects of study design features on the standardized effect sizes for the associations between age and regional brain measures (regional GMV (e) and regional cortical thickness (f)). Regions with Benjamini–Hochberg-adjusted P < 0.05 are shown in colour. FDR, false discovery rate.

For regional GMV and cortical thickness, similar effects of study design features also occur across regions (Fig. 1e,f; 34 regions per hemisphere). In most of the regions, the standardized effect sizes of age on regional GMV and cortical thickness are strongly associated with the study population standard deviation of age. Longitudinal study designs generally have a positive effect on the standardized effect sizes for regional GMV–age associations and a positive but weaker effect on the standardized effect sizes for regional cortical thickness–age associations. To improve the comparability of standardized effect sizes between cross-sectional and longitudinal studies, we propose a new effect size index: the cross-sectional RESI for longitudinal datasets (section 3 in Supplementary Information). The cross-sectional RESI for a longitudinal dataset represents the RESI that would have been obtained in the same study population if the study had been conducted cross-sectionally. This newly developed effect size index allows us to quantify the benefit of using a longitudinal study design in a single dataset (section 3.3 in Supplementary Information).

The meta-analysis results demonstrate that standardized effect sizes are dependent on study design features, such as mean age, standard deviation of the age of the sample population, and cross-sectional or longitudinal design. Moreover, the results suggest that modifying study design features, such as increasing variability and conducting longitudinal studies, can increase the standardized effect sizes in BWAS.

Improved sampling boosts replicability

To investigate whether modifying the variability of the age covariate increases standardized effect sizes and replicability, we implemented three sampling schemes that produce different sample standard deviations of the age covariate. We treated the large-scale cross-sectional UKB data as the population and drew samples whose age distributions follow a pre-specified shape (bell shaped, uniform and U shaped; Methods and Extended Data Fig. 1a). In the UKB, the U-shaped sampling scheme on age increases the standardized effect size for the total GMV–age association by 60% compared with bell shaped and by 27% compared with uniform (Fig. 2a), with an associated increase in replicability (Fig. 2b). To achieve 80% replicability for detecting the total GMV–age association (Methods), fewer than 100 participants are sufficient if using the U-shaped sampling scheme, whereas about 200 participants are needed if the bell-shaped sampling scheme is used (Fig. 2b). A similar pattern can be seen for the regional outcomes of GMV and cortical thickness (Fig. 2c–f). The U-shaped sampling scheme typically provides the largest standardized effect sizes of age and the highest replicability, followed by the uniform and bell-shaped schemes. The U-shaped sampling scheme shows greater region-specific improvement in the standardized effect sizes and replicability for regional GMV–age and regional cortical thickness–age associations than the bell-shaped scheme (Extended Data Fig. 1d,e).
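Sampling toward a target covariate distribution can be sketched as importance resampling (a hypothetical sketch: the simulated population, bin count and target densities below are illustrative assumptions, not the exact procedure applied to the UKB):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "population": ages roughly bell shaped, clipped to 45-75.
pop_age = rng.normal(60, 7, size=50_000).clip(45, 75)

def sample_with_target(ages, n, target_pdf, rng):
    """Draw n indices (without replacement) so that the sampled ages
    approximate target_pdf, by weighting each participant by
    target density / empirical density (importance resampling)."""
    hist, edges = np.histogram(ages, bins=30, density=True)
    bin_idx = np.clip(np.digitize(ages, edges) - 1, 0, len(hist) - 1)
    w = target_pdf(ages) / np.maximum(hist[bin_idx], 1e-12)
    w /= w.sum()
    return rng.choice(len(ages), size=n, replace=False, p=w)

mid = (pop_age.min() + pop_age.max()) / 2

# U-shaped target: density grows quadratically away from the centre.
u_shaped = lambda a: (a - mid) ** 2 + 1e-3
# Uniform target over the observed age range.
uniform = lambda a: np.ones_like(a)

idx_u = sample_with_target(pop_age, 500, u_shaped, rng)
idx_flat = sample_with_target(pop_age, 500, uniform, rng)
```

In line with the ordering described above, the U-shaped scheme yields the largest sample standard deviation of age, followed by the uniform scheme and then the untouched bell-shaped population.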

Fig. 2: Increased standardized effect sizes and replicability for age associations with different brain measures under three sampling schemes in the UKB study.
figure 2

n = 29,031 for total GMV and n = 29,030 for regional GMV and cortical thickness. The sampling schemes target different age distributions to increase the variability of age: bell shaped < uniform < U shaped (Extended Data Fig. 1a). a,b, Using the sampling schemes, increasing age variability increases the standardized effect sizes (a) and replicability (b; at significance level of 0.05) for total GMV–age association. c–f, The same result holds for regional GMV (c,d) and regional cortical thickness (e,f). The curves represent the average standardized effect size or estimated replicability at a given sample size and sampling scheme. The shaded areas represent the corresponding 95% confidence bands. The bold curves are the average standardized effect size or replicability across all regions with significant uncorrected effects using the full UKB data (c–f).

To investigate the effect of increasing the variability of the age covariate longitudinally, we implemented sampling schemes to adjust the between-subject and within-subject variability of age in the bootstrap samples from the longitudinal ADNI dataset. In the bootstrap samples, each participant had two measurements (baseline and a follow-up). To imitate the true operation of a study, we selected the two measurements of each participant based on baseline age and the follow-up age by targeting specific distributions for the baseline age and the age change at the follow-up time point (Methods; Extended Data Fig. 1b,c). Increasing between-subject and within-subject variability of age increases the average observed standardized effect sizes, with corresponding increases in replicability (Fig. 3). A U-shaped between-subject sampling scheme on age increases the standardized effect size for total GMV–age association by 23.6% compared with bell shaped and by 12.1% compared with uniform, when using the uniform within-subject sampling scheme (Fig. 3a).

Fig. 3: Increased standardized effect sizes and replicability for age associations with structural brain measures under different longitudinal sampling schemes in the ADNI data.
figure 3

Three different sampling schemes (Extended Data Fig. 1b,c) are implemented in bootstrap analyses to modify the between-subject and within-subject variability of age, respectively. a,b, Implementing the sampling schemes results in higher (between-subject and/or within-subject) variability and increases the standardized effect size (a) and replicability (b; at significance level of 0.05) for the total GMV–age association. The curves represent the average standardized effect size or estimated replicability, and the shaded areas are the 95% confidence bands across the 1,000 bootstraps. c,d, Increasing the number of measurements from one to two per participant provides the most benefit on standardized effect size (c) and replicability (d) for the total GMV–age association when using uniform between-subject and within-subject sampling schemes and n = 30. The points represent the mean standardized effect sizes or estimated replicability, and the whiskers are the 95% confidence intervals. e–h, Increased standardized effect sizes (e,g) and replicability (f,h) for the associations of age with regional GMV (e,f) and regional cortical thickness (g,h) across all brain regions under different sampling schemes. The bold curves are the average standardized effect size or estimated replicability across all regions with significant uncorrected effects using the full ADNI data. The shaded areas are the corresponding 95% confidence bands. Increasing the between-subject and within-subject variability of age by implementing different sampling schemes can increase the standardized effect sizes of age, and the associated replicability, on regional GMV and regional cortical thickness.

In addition, we investigated the effect of the number of measurements per participant on the standardized effect size and replicability in longitudinal data using the ADNI dataset. Adding a single additional measurement after the baseline increases the standardized effect size for the total GMV–age association by 156% and replicability by 350%. The benefit of further measurements beyond the second is minimal (Fig. 3c,d). Finally, we also evaluated the effects of the longitudinal sampling schemes on regional GMV and cortical thickness in the ADNI dataset (Fig. 3e–h). When sampling two measurements per participant, the between-subject and within-subject sampling schemes producing larger age variability increase the standardized effect size and replicability across most regions.

Together, these results suggest that having larger spacing in between-subject and within-subject age measurements increases standardized effect size and replicability. Most of the benefit of the number of within-subject measurements is due to the first additional measurement after baseline.

Sampling benefit varies by brain measure

As standardized effect sizes for brain–age associations are often larger than those for brain–behaviour associations, we investigated whether the proposed sampling schemes are effective for various non-brain covariates and their associations with structural and functional brain measures in all participants (with and without neuropsychiatric symptoms) with cross-sectional and longitudinal measurements from the ABCD dataset. The non-brain covariates include the NIH toolbox24, Child Behavior Checklist (CBCL), body mass index (BMI), birth weight and handedness (Methods; Supplementary Tables 11 and 12). Functional connectivity is used as a functional brain measure and is computed for all pairs of regions in the Gordon atlas25 (Methods). We used the bell-shaped and U-shaped target sampling distributions to control the between-subject and within-subject variability of each non-brain covariate (Methods). For each non-brain covariate, we show the results for the four combinations of between-subject and within-subject sampling schemes. Overall, there is a consistent benefit to increasing the between-subject variability of the covariate (Fig. 4 and Extended Data Fig. 2). These preferred sampling schemes lead to a more than 1.8-fold reduction in the sample size needed for 80% replicability and a more than 1.4-fold increase in the standardized effect size for over 50% of associations. Moreover, increasing the between-subject variability of the covariates increased the standardized effect sizes of 72% of covariate–outcome associations (Extended Data Fig. 3).

Fig. 4: Heterogeneous improvement of standardized effect sizes for select cognitive, mental health and demographic associations with structural and functional brain measures in the ABCD study with bootstrapped samples of n = 500.
figure 4

a,b, U-shaped between-subject sampling scheme (blue) that increases between-subject variability of the non-brain covariate produces larger standardized effect sizes (a) and reduces the number of participants scanned to obtain 80% replicability (b) in total GMV. The points and triangles are the average standardized effect sizes across bootstraps, and the whiskers are the 95% confidence intervals. Increasing within-subject sampling (triangles) can reduce standardized effect sizes. c–f, A similar pattern holds in regional GMV (c,d) and regional cortical thickness (e,f) as in panels a,b. The boxplots show the distributions of the standardized effect sizes across regions (or region pairs for functional connectivity). Box plots are as in Fig. 1. g,h, By contrast, regional pairwise functional connectivity standardized effect sizes improve by increasing between-subject (blue) and within-subject (dashed borders) variability (g) with a corresponding reduction in the number of participants scanned for 80% replicability (h). See Extended Data Fig. 2 for the results for all non-brain covariates examined.

Importantly, increasing within-subject variability decreases the standardized effect sizes for many structural associations (Fig. 4a–f and Extended Data Fig. 2a–f), suggesting that conducting longitudinal analyses can result in decreased replicability compared with cross-sectional analyses. For the functional connectivity outcomes, there is a slight positive effect of increasing within-subject variability (Fig. 4g,h and Extended Data Fig. 2g,h). To evaluate the lower replicability of the structural associations with increasing within-subject variability, we compared cross-sectional standardized effect sizes of the non-brain covariates on each brain measure using the baseline measurements to the standardized effect sizes estimated using the full longitudinal data (Fig. 5a–d and Extended Data Fig. 4). Consistent with the reduction in standardized effect size by increasing within-subject variability, for most structural associations (GMV and cortical thickness), conducting cross-sectional analyses using the baseline measurements results in larger standardized effect sizes (and higher replicability) than conducting analyses using the full longitudinal data. This finding holds when fitting a cross-sectional model only using the 2-year follow-up measurement (Extended Data Fig. 4). Identical results are found using linear mixed models with individual-specific random intercepts, which are commonly used in BWAS (Supplementary Fig. 3). Together, these results suggest that the benefit of conducting longitudinal studies and larger within-subject variability is highly dependent on the brain–behaviour association. Counterintuitively, longitudinal designs can reduce the standardized effect sizes and replicability.

Fig. 5: Longitudinal study designs can reduce standardized effect sizes and replicability due to differences in between-subject versus within-subject associations of brain and behavioural measures.
figure 5

Plots show the distribution of the standardized effect sizes. a–c, Cross-sectional analyses (using only the baseline measures; indicated by ‘1st’ on the x axes) can have larger standardized effect sizes than the same longitudinal analyses (using the full longitudinal data; indicated by ‘all’ on the x axes) for total GMV (a), regional GMV (b) and regional cortical thickness (c) in the ABCD dataset. Data in a are estimates (points) with 95% confidence intervals (whiskers). Box plots in b–d are as in Fig. 1. d, Functional connectivity measures do not show such a reduction of standardized effect sizes in longitudinal modelling. See Extended Data Fig. 4 for the results for all non-brain covariates examined. e,f, Most regional GMV associations (e) have larger between-subject parameter estimates (βb; x axis) than within-subject parameter estimates (βw; y axis; see equation (13) in Supplementary Information), whereas functional connectivity associations (f) show less heterogeneous relationships between the two parameters.

Accurate longitudinal models are crucial

To investigate why increasing within-subject variability or using longitudinal designs is not beneficial for some associations, we examined an assumption common to GEEs and linear mixed models in BWAS. These widely used models assume a consistent association strength between the brain measure and the non-brain covariate across between-subject and within-subject changes in the non-brain covariate. However, the between-subject and within-subject association strengths can differ because non-brain measures can be more variable than structural brain measures for various reasons. For example, crystallized composite scores may vary within a participant longitudinally because of time-of-day effects, lack of sleep or natural noise in the measurement. By contrast, GMV is measured more precisely and is not vulnerable to the other sources of variability that might accompany the crystallized composite score. This combination leads to a low within-subject association between these variables (Supplementary Table 13). Functional connectivity measures are more similar to crystallized composite scores in that they are subject to higher within-subject variability and natural noise, so they have a higher potential for stronger within-subject associations with crystallized composite scores (that is, the two measures are more likely to vary together based on factors such as time of day and lack of sleep). To demonstrate this, we fitted models that estimated distinct effects for between-subject and within-subject associations in the ABCD dataset (Methods) and found large between-subject parameter estimates and small within-subject parameter estimates for total and regional GMV (Fig. 5e, Supplementary Table 13 and section 5.2 in Supplementary Information), whereas the functional connectivity associations are distributed more evenly across between-subject and within-subject parameters (Fig. 5f).
If the between-subject and within-subject associations are different, these widely used longitudinal models average the two associations (equation (13) in section 5 in Supplementary Information). Fitting these associations separately avoids averaging the larger effect with the smaller effect and can inform our understanding of brain–behaviour associations (section 5.2 in Supplementary Information). This approach ameliorates the reduction in standardized effect sizes caused by longitudinal designs for structural brain measures in the ABCD (Extended Data Figs. 5 and 6 and section 5 in Supplementary Information). This longitudinal model has a similar between-subject standardized effect size to the cross-sectional model (see ‘Estimation of the between-subject and within-subject effects’ in the Methods section; Extended Data Fig. 6). In short, longitudinal designs can be detrimental to replicability when the between-subject and within-subject effects differ and the model is incorrectly specified.
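The within-between decomposition described above can be sketched in a few lines (a simulation under assumed effect sizes, with the between-subject effect set to 1 and the within-subject effect to 0, roughly mimicking the structural associations discussed; this illustrates the modelling idea only, not the study's GEE implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two measurements per participant. The covariate's person mean drives the
# outcome (between-subject effect = 1); deviations from the person mean
# do not (within-subject effect = 0).
n_sub, n_obs = 2000, 2
x = rng.normal(size=(n_sub, 1)) + rng.normal(size=(n_sub, n_obs))
xb = x.mean(axis=1, keepdims=True)   # between-subject component (person mean)
xw = x - xb                          # within-subject component (deviation)
y = 1.0 * xb + 0.0 * xw + rng.normal(size=(n_sub, n_obs))

def ols_slope(a, b):
    """Univariate OLS slope of b on a. Valid here because xb and xw are
    orthogonal by construction."""
    a = a.ravel() - a.mean()
    return float(a @ b.ravel()) / float(a @ a)

pooled = ols_slope(x, y)                       # single-coefficient model
beta_b = ols_slope(np.repeat(xb, n_obs), y)    # separate between effect
beta_w = ols_slope(xw, y)                      # separate within effect
```

The single pooled slope lands strictly between the two true effects (near 0.75 here, the variance-weighted average of 1 and 0 given these component variances), whereas the separate regressors recover the distinct between-subject and within-subject effects, which is the averaging behaviour described above.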

Optimal design considerations

With increasing evidence of small standardized effect sizes and low replicability in BWAS, optimizing study design to increase standardized effect sizes and replicability is a critical prerequisite for progress1,26. Our results demonstrate that standardized effect size and replicability can be increased by enriched sampling of participants with small and large values of the covariate of interest. This is well known in linear models, in which the standardized effect size is explicitly a function of the standard deviation of the covariate23. We showed that designing a study to have a larger covariate standard deviation increases standardized effect sizes by a median factor of 1.4, even when there is non-linearity in the association, such as with age and GMV (Supplementary Fig. 1). When the association is strongly non-monotonic — as in the case of a U-shaped relationship between covariate and outcome — sampling the tails more heavily could decrease replicability and diminish our ability to detect non-linearities in the centre of the study population. In such a case, sampling to obtain a uniform distribution of the covariate balances power across the range of the covariate and can increase replicability relative to random sampling when the covariate has a normal distribution in the population. Increasing between-subject variability was beneficial in more than 72% of the association pairs that we studied, despite the presence of such non-linearities (Extended Data Fig. 3).
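The dependence of a correlation on the covariate's standard deviation can be verified directly. For y = beta*x + e, the population correlation is beta*sd(x) / sqrt(beta^2*sd(x)^2 + var(e)), which is monotone increasing in sd(x). A minimal simulation with assumed slope and noise values (illustrative numbers, not estimates from any of the datasets above):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_corr(sd_x, beta=0.02, sigma_e=1.0, n=100_000):
    """Sample correlation between x and y = beta*x + e when x has s.d. sd_x."""
    x = rng.normal(0.0, sd_x, n)
    y = beta * x + rng.normal(0.0, sigma_e, n)
    return np.corrcoef(x, y)[0, 1]

# The unstandardized slope beta is held fixed; only the covariate spread
# changes, yet the standardized effect size nearly doubles.
r_narrow = sample_corr(sd_x=7.0)    # e.g. a bell-shaped age sample
r_wide = sample_corr(sd_x=14.0)     # e.g. a broader, enriched sample
```

This mirrors the empirical result above: widening the covariate distribution raises the standardized effect size even though the underlying (unstandardized) association is unchanged.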

Because standardized effect sizes are dependent on study design, careful design choices can simultaneously increase standardized effect sizes and study replicability. Two-phase, extreme group and outcome-dependent sampling designs can inform which participants should be selected for imaging from a larger sample to increase the efficiency and standardized effect sizes of brain–behaviour associations27,28,29,30,31,32,33. For example, given the high degree of accessibility of cognitive and behavioural testing (for example, to be performed virtually or electronically), individuals scoring at the extremes on a testing scale or battery (‘phase I’) could be prioritized for subsequent brain scanning (‘phase II’). When there are multiple covariates of interest, multivariate two-phase designs can be used to increase standardized effect sizes and replicability34. Multivariate designs are also needed to stratify sampling to avoid confounding by other sociodemographic variables. Together, the use of optimal designs can increase both standardized effect sizes and replicability relative to a design that uses random sampling31. If desired, weighted regression (such as inverse probability weighting) can be combined with optimized designs to estimate a standardized effect size that is consistent with the standardized effect size if the study had been conducted in the full population34,35,36. Choosing highly reliable psychometric measurements or interventions (for example, medications or neuromodulation within a clinical trial)37,38,39 may also be effective for increasing replicability. The decision to pursue an optimized design will depend on other practical factors, such as the cost and complexity of acquiring other (non-imaging) measures of interest and the specific translational goals of the research.

Longitudinal design considerations

In the meta-analysis, longitudinal studies of the total GMV–age associations have, on average, more than 380% larger standardized effect sizes than cross-sectional studies. However, in subsequent analyses, we noticed that the benefit of conducting a longitudinal design is highly dependent on both the between-subject and the within-subject effects. When the between-subject and the within-subject effects are equal and the within-subject brain measurement error is low, longitudinal studies offer larger standardized effect sizes than cross-sectional studies40 (section 5.1 in Supplementary Information). This combination of equal between-subject and within-subject effects and low within-subject measurement error is the reason that there is a benefit of longitudinal design in the ADNI for the total GMV–age association (Supplementary Fig. 4). Comparing efficiency per measurement supports the approach of collecting two measurements per participant in this scenario (section 5.1 in Supplementary Information).

Longitudinal models offer the unique ability to separately estimate between-subject and within-subject effects. When the between-subject and within-subject effects differ but are fitted as a single effect, the model implicitly assumes that they are equal and the interpretation of the resulting coefficient becomes complicated: the effect becomes a weighted average of the between-subject and within-subject effects, with weights determined by the study design features (section 5 in Supplementary Information). The apparent lack of benefit of longitudinal designs in the ABCD for the study of GMV associations arises because within-subject changes in the non-brain measures are not associated with within-subject changes in GMV (Fig. 5e and Supplementary Table 13). The smaller standardized effect sizes that we found in longitudinal analyses are due to the contribution of the smaller within-subject effect to the weighted average of the between-subject and within-subject effects (equation (14) in section 5 in Supplementary Information). Fitting the between-subject and within-subject effects separately prevents averaging the two effects (section 5.2 in Supplementary Information). These two effects are often not directly comparable with the effect obtained from a cross-sectional model because they have different interpretations41,42,43,44,45 (section 5.2 in Supplementary Information). Using sampling strategies to increase the between-subject and within-subject variability of the covariate will increase the standardized effect sizes for between-subject and within-subject associations, respectively (Extended Data Fig. 5).

Design and analysis recommendations

Although it is difficult to provide universal recommendations for study design and analysis, the present study provides general guidelines for designing and analysing BWAS for optimal standardized effect sizes and replicability based on both empirical and theoretical results (Extended Data Figs. 7 and 8). Although the decision for a particular design or analysis strategy may depend on unknown features of the brain and non-brain measures and their associations, these characteristics can be evaluated in pilot data or the analysis dataset (Supplementary Fig. 4 and section 5.2 in Supplementary Information). One general principle that increases standardized effect sizes for most associations is to increase the covariate standard deviation (for example, through two-phase, extreme group and outcome-dependent sampling), which is practically applicable to a wide range of BWAS contexts. Longitudinal designs can be helpful and optimal even when the between-subject and within-subject effects differ, if modelled correctly. Moreover, longitudinal BWAS enable us to study between-subject and within-subject effects separately, and they should be used when the two effects are hypothesized to be different. Although striving for large sample sizes remains important when designing a study, our findings emphasize the importance of considering other design features to improve standardized effect sizes and replicability of BWAS.

Methods

LBCC dataset and processing

The original LBCC dataset included 123,984 MRI scans from 101,457 human participants across more than 100 studies (which include multiple publicly available datasets46,47,48,49,50,51,52,53,54,55,56) and was described in previous work4 (see Supplementary Information and supplementary table S1 from ref. 4). We filtered to the subset of cognitively normal participants whose data were processed using FreeSurfer (v6.1). Studies were curated for the analysis by excluding duplicated observations and studies with fewer than 4 unique age points, sample size less than 20 and/or only participants of one sex. If there were fewer than three participants having longitudinal observations, only the baseline observations were included and the study was considered cross-sectional. If a participant had changing demographic information during the longitudinal follow-up (for example, changing biological sex), only the most recent observation was included. We updated the LBCC dataset with the ABCD release 5, resulting in a final dataset that includes 77,695 MRI scans from 60,900 cognitively normal participants with available total GMV, sGMV and GMV measures across 63 studies (Supplementary Table 1). In this dataset, 74,148 MRI scans from 57,538 participants across 43 studies have complete-case regional brain measures (regional GMV, regional surface area and regional cortical thickness, based on Desikan–Killiany parcellation20; Supplementary Table 2). The global brain measure mean cortical thickness was derived using the regional brain measures (see below).

Structural brain measures

Details of data processing have been described in our previous work4. In brief, total GMV, sGMV and WMV were estimated from T1-weighted and T2-weighted (when available) MRIs using the ‘aseg’ output from FreeSurfer (v6.0.1). All three cerebrum tissue volumes were extracted from the aseg.stats files output by the recon-all process: ‘Total cortical gray matter volume’ for GMV; ‘Total cerebral white matter volume’ for WMV; and ‘Subcortical gray matter volume’ for sGMV (inclusive of the thalamus, caudate nucleus, putamen, pallidum, hippocampus, amygdala and nucleus accumbens area; https://freesurfer.net/fswiki/SubcorticalSegmentation). Regional GMV and cortical thickness across 68 regions (34 per hemisphere, based on Desikan–Killiany parcellation20) were obtained from the aparc.stats files output by the recon-all process. Mean cortical thickness across the whole brain is the weighted average of the regional cortical thickness weighted by the corresponding regional surface areas.
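As a concrete illustration of the last step (toy values, not real FreeSurfer output), the whole-brain mean cortical thickness is the surface-area-weighted average of the regional thicknesses:

```python
# Sketch: mean cortical thickness as the surface-area-weighted average of
# regional cortical thickness (illustrative values for four regions).
thickness = [2.5, 3.1, 2.2, 2.8]          # mm, one value per region
area = [1200.0, 800.0, 1500.0, 900.0]     # mm^2, matching surface areas

mean_thickness = sum(t * a for t, a in zip(thickness, area)) / sum(area)
print(round(mean_thickness, 3))  # 2.568
```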

Preprocessing specific to ABCD

Functional connectivity measures

Longitudinal functional connectivity measures were obtained from the ABCD-BIDS community collection, which houses a community-shared and continually updated ABCD neuroimaging dataset available under Brain Imaging Data Structure (BIDS) standards. The data used in these analyses were processed using the abcd-hcp-pipeline (v0.1.3), an updated version of The Human Connectome Project MRI pipeline57. In brief, resting-state functional MRI time series were demeaned and detrended, and a generalized linear model was used to regress out mean white matter, cerebrospinal fluid and global signal, as well as motion variables; the time series were then band-pass filtered. High-motion frames (filtered frame displacement > 0.2 mm) were censored during the demeaning and detrending. After preprocessing, the time series were parcellated using the 352 regions of the Gordon atlas (including 19 subcortical structures), and pairwise Pearson correlations were computed among the regions. Functional connectivity measures were estimated from resting-state fMRI time series using a minimum of 5 min of data. After Fisher’s z-transformation, the connectivities were averaged within each pair of the 24 canonical functional networks25, forming 276 inter-network and 24 intra-network connectivities.
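The aggregation step can be sketched as follows (illustrative correlations and network labels; the real pipeline operates on the 352-region Gordon parcellation):

```python
import math

# Sketch of the connectivity aggregation: pairwise Pearson correlations are
# Fisher z-transformed, then averaged within network blocks. The region-to-
# network labels and r values below are made up for illustration.
labels = ["A", "A", "B", "B"]            # network assignment per region
r = {(0, 1): 0.6, (0, 2): 0.1, (0, 3): 0.2,
     (1, 2): 0.15, (1, 3): 0.05, (2, 3): 0.5}

blocks = {}
for (i, j), rij in r.items():
    key = tuple(sorted((labels[i], labels[j])))       # e.g. ("A", "B")
    blocks.setdefault(key, []).append(math.atanh(rij))  # Fisher z = atanh(r)

network_fc = {k: sum(z) / len(z) for k, z in blocks.items()}
# ("A", "A") and ("B", "B") are intra-network; ("A", "B") is inter-network.
```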

Cognitive and other covariates

The ABCD dataset is a large-scale repository aiming to track the brain and psychological development of over 10,000 children 9–16 years of age by measuring hundreds of variables, including demographic, physical, cognitive and mental health variables58. We used release 5 of the ABCD study to examine the effect of the sampling schemes on other types of covariates including cognition (fully corrected T-scores of the individual subscales and total composite scores of the NIH Toolbox24), mental health (total problem CBCL syndrome scale) and other common demographic variables (BMI, birth weight and handedness). For each of the covariates, we evaluated the effect of the sampling schemes on their associations with the global and regional structural brain measures and functional connectivity after controlling for non-linear age and sex (and for functional connectivity outcomes only, mean frame displacement).

For the analyses of structural brain measures, there were three non-brain covariates with fewer than 5% non-missing follow-ups at both 2-year and 4-year follow-ups (that is, the Dimensional Change Card Sort Test, Cognition Fluid Composite and Cognition Total Composite Score; Supplementary Table 11), and only their baseline cognitive measurements were included in the analyses. For the remaining 11 variables (that is, the Picture Vocabulary Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Pattern Comparison Processing Speed Test, Picture Sequence Memory Test, Oral Reading Recognition Test, Crystallized Composite, CBCL, birth weight, BMI and handedness), all of the available baseline, 2-year and 4-year follow-up observations were used. For the analyses of the functional connectivity, only the baseline observations for the List Sorting Working Memory Test were used due to missingness (Supplementary Table 12).

Records with BMI outside the lower and upper 1% quantiles (that is, BMI < 13.5 or BMI > 36.9) were considered data-entry errors and replaced with missing values. The variable handedness was imputed using the last observation carried forward.
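A minimal sketch of these two cleaning rules (cut-offs from the text, data made up):

```python
# BMI outside the 1%/99% quantile cut-offs (13.5 and 36.9, from the text)
# is set to missing; handedness is imputed with the last observation
# carried forward (LOCF). All values below are illustrative.
bmi = [21.3, 12.0, 25.7, 40.2, 18.9]
bmi_clean = [b if 13.5 <= b <= 36.9 else None for b in bmi]

handedness = ["right", None, None, "left", None]  # per-visit, in time order
filled, last = [], None
for h in handedness:
    last = h if h is not None else last
    filled.append(last)

print(bmi_clean)  # [21.3, None, 25.7, None, 18.9]
print(filled)     # ['right', 'right', 'right', 'left', 'left']
```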

Statistical analysis

Removal of site effects

For multisite or multistudy neuroimaging studies, it is necessary to control for potential heterogeneity between sites to obtain unconfounded and generalizable results. Before estimating the main effects of age and sex on the global and regional brain measures (total GMV, total WMV, total sGMV, mean cortical thickness, regional GMV and regional cortical thickness), we applied ComBat21 and LongComBat22 in cross-sectional datasets and longitudinal datasets, respectively, to remove the potential site effects. The ComBat algorithm involves several steps including data standardization, site-effect estimation, empirical Bayesian adjustment, removing estimated site effects and data rescaling.

In the analysis of cross-sectional datasets, the models for ComBat were specified as a linear regression model illustrated below using total GMV:

$$\text{GMV} \sim \text{ns}(\text{age, d.f.}=2)+\text{sex}+\text{site},$$

where ns denotes natural cubic splines on 2 d.f., which means that there were two boundary knots and one interval knot placed at the median of the covariate age. Splines were used to accommodate non-linearity in the age effect. For the longitudinal datasets, the model for LongComBat used a linear mixed effects model with participant-specific random intercepts:

$$\text{GMV} \sim (1|\text{participant})+\text{ns}(\text{age, d.f.}=2)+\text{sex}+\text{site}.$$

When estimating the effects of other non-brain covariates in the ABCD dataset, ComBat was used to control the site effects, respectively, for each of the cross-sectional covariates. The ComBat models were specified as illustrated below using GMV:

$$\text{GMV} \sim \text{ns}(\text{age, d.f.}=2)+\text{sex}+x+\text{site},$$

where x denotes the non-brain covariate. LongComBat was used for each of the longitudinal covariates with a linear mixed effects model with participant-specific random intercepts only:

$$\text{GMV} \sim (1|\text{participant})+\text{ns}(\text{age, d.f.}=2)+\text{sex}+x+\text{site}.$$

When estimating the effects of other covariates on the functional connectivity (FC) in the ABCD data, we additionally controlled for the mean frame displacement (FD) of the frames remaining after scrubbing. The LongComBat models were specified as:

$$\text{FC} \sim (1|\text{participant})+\text{ns}(\text{age, d.f.}=2)+\text{sex}+\text{ns}(\text{mean}\_\text{FD, d.f.}=3)+x+\text{site}.$$

ComBat and LongComBat were implemented using the neuroCombat59 and longCombat60 R packages. Site effects were removed before all subsequent analyses, including the bootstrap analyses described below.

RESI for association strength

The RESI is a recently developed standardized effect size index that has consistent interpretation across many model types, encompassing all types of test statistics in most regression models17,18. In brief, the RESI is a standardized effect size parameter describing the deviation of the true parameter value (or values) \(\beta \) from the reference value (or values) \({\beta }_{0}\) from the statistical null hypothesis \({H}_{0}:\,\beta ={\beta }_{0}\),

$$S=\sqrt{{(\beta -{\beta }_{0})}^{T}{{\varSigma }_{\beta }}^{-1}(\beta -{\beta }_{0})},$$

where S denotes the RESI parameter, \(\beta \) and \({\beta }_{0}\) can be vectors, T denotes the matrix transpose, and \({\varSigma }_{\beta }\) is the covariance matrix for \(\sqrt{N}\hat{\beta }\) (where \(\hat{\beta }\) is the estimator for \(\beta \) and N is the number of participants; section 3 in Supplementary Information).

In previous work, we defined a consistent estimator for RESI17,

$$\hat{S}={\left(\max \left\{0,\frac{{T}^{2}-m}{N}\right\}\right)}^{1/2},$$

where \({{T}}^{2}\) is the chi-squared test statistic \({T}^{2}=N{(\hat{\beta }-{\beta }_{0})}^{T}{{\varSigma }_{\beta }}^{-1}(\hat{\beta }-{\beta }_{0})\) for testing the null hypothesis \({H}_{0}:\,\beta ={\beta }_{0}\), \(m\) is the number of parameters being tested (that is, the length of \(\beta \)) and \(N\) is the number of participants.
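The estimator can be sketched in a few lines (hypothetical inputs; the analyses themselves used the ‘RESI’ R package):

```python
import math

def resi_hat(t2: float, m: int, n: int) -> float:
    """Consistent RESI estimator: sqrt(max(0, (T^2 - m) / N)), where T^2 is
    the chi-squared statistic, m the number of tested parameters and N the
    number of participants."""
    return math.sqrt(max(0.0, (t2 - m) / n))

# Example: a chi-squared statistic of 26 on m = 1 tested parameter from
# N = 400 participants gives a RESI estimate of 0.25 ("medium").
print(resi_hat(26.0, 1, 400))  # 0.25
```

The truncation at zero keeps the estimate well defined when the statistic is smaller than its null expectation.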

As the RESI is generally applicable across different models and data types, it also applies in settings where Cohen’s d is defined. In this scenario, the RESI is equal to ½ Cohen’s d17, so Cohen’s suggested thresholds for effect size can be adapted for the RESI: small (RESI = 0.1), medium (RESI = 0.25) and large (RESI = 0.4). Moreover, the RESI is robust: when the assumptions of Cohen’s d are not satisfied, such as when the variances of the groups are unequal, the RESI remains a consistent estimator whereas Cohen’s d does not. The confidence intervals for the RESI in our analyses were constructed using 1,000 non-parametric bootstraps18.

The systematic difference in standardized effect sizes between cross-sectional and longitudinal studies poses extra challenges for comparing and aggregating standardized effect size estimates across studies with different designs. To improve comparability, we proposed a new effect size index for longitudinal datasets: the cross-sectional RESI (CS-RESI). The CS-RESI of a longitudinal dataset represents the RESI in the same study population if the study had been conducted cross-sectionally. The detailed definition, point estimator and confidence interval construction procedure for CS-RESI can be found in section 3 in Supplementary Information. Comprehensive statistical simulation studies demonstrated the validity of the proposed estimator and confidence interval for CS-RESI (section 3.2 in Supplementary Information). With CS-RESI, the benefit of using a longitudinal study design in a single dataset can be quantified (section 3.3 in Supplementary Information).

Study-level models

After removing the site effects using ComBat or LongComBat in the multisite data, we estimated the effects of age and sex on each of the global or regional brain measures using GEEs and linear regression models in the longitudinal datasets and cross-sectional datasets, respectively. The mean model was specified as below after ComBat or LongComBat:

$${y}_{ij} \sim {\rm{ns}}({{\rm{age}}}_{ij},{\rm{d.f.}}=2)+{{\rm{sex}}}_{i},$$

where yij was taken to be a global brain measure (that is, total GMV, WMV, sGMV or mean cortical thickness) or a regional brain measure (that is, regional GMV or cortical thickness) at the j-th visit of participant i, with j = 1 for cross-sectional datasets. The age effect was estimated using natural cubic splines with 2 d.f., meaning that there were two boundary knots and one interval knot placed at the median of the covariate age. For the GEEs, we used an exchangeable working correlation structure and an identity link function. The model assumes that the mean is correctly specified but makes no assumptions about the error distribution. The GEEs were fitted with the ‘geepack’ package61 in R. We used the RESI as a standardized effect size measure. RESIs and confidence intervals were computed using the ‘RESI’ R package (v1.2.0)19.

Meta-analysis of the age and sex effects

The weighted linear regression model for the meta-analysis of age effects across the studies was specified as:

$${\widehat{S}}_{{\rm{age}},k}={{\rm{design}}}_{k}+{\rm{ns}}[{\rm{mean}}{({\rm{age}})}_{k},3]+{\rm{ns}}[{\rm{s.d.}}{({\rm{age}})}_{k},3]+{\rm{ns}}[{\rm{skew}}{({\rm{age}})}_{k},3]+{{\epsilon }}_{k},$$

where \({\widehat{S}}_{{\rm{age}},k}\) denotes the estimated RESI for study k, and the weights were the inverse of the standard error of each RESI estimate. The sample mean, standard deviation (s.d.) and skewness of the age were included as non-linear terms estimated using natural splines with 3 d.f. (that is, two boundary knots plus two interval knots at the 33rd and 66th percentiles of the covariates), and a binary variable indicating the design type (cross-sectional or longitudinal) was also included.

The weighted linear regression model for the meta-analysis of sex effects across the studies was specified as

$${\widehat{S}}_{{\rm{sex}},k}={{\rm{design}}}_{k}+{\rm{ns}}[{\rm{mean}}{({\rm{age}})}_{k},3]+{\rm{ns}}[{\rm{s.d.}}{({\rm{age}})}_{k},3]+{\rm{ns}}[{\rm{pr}}{({\rm{male}})}_{k},3]+{{\epsilon }}_{k},$$

where \({\widehat{S}}_{{\rm{sex}},k}\) denotes the estimated RESI of sex for study k, and the weights were the inverse of the standard error of each RESI estimate. The sample mean, standard deviation of the age covariate and the proportion of males in each study were included as non-linear terms estimated using natural splines with 3 d.f., and a binary variable indicating the design type (cross-sectional or longitudinal) was also included.

These meta-analyses were performed for each of the global and regional brain measures. Inferences were performed using robust standard errors62. In the partial regression plots, the expected standardized effect sizes for the age effects were estimated from the meta-analysis model after fixing mean age at 45 years, standard deviation of age at 7 years and/or skewness at 0; the expected standardized effect sizes for the sex effects were estimated from the meta-analysis model after fixing mean age at 45 years, standard deviation of age at 7 years and/or proportion of males at 0.5.

Sampling schemes for age in the UKB and ADNI

We used bootstrapping to evaluate the effect of different sampling schemes, with different target sample covariate distributions, on the standardized effect sizes and replicability in the cross-sectional UKB and longitudinal ADNI datasets. For a given sample size and sampling scheme, 1,000 bootstrap replicates were conducted. The standardized effect size was estimated as the mean standardized effect size (that is, RESI) across the bootstrap replicates. The 95% confidence interval for the standardized effect size was estimated using the lower and upper 2.5% quantiles across the 1,000 estimates from the bootstrap replicates. Power was calculated as the proportion of bootstrap replicates producing P values less than or equal to 5% for the associations that were significant at 0.05 in the full sample. In the UKB, only one region was not significant for age in each of GMV and cortical thickness, and in the ADNI, only one and four regions were not significant for age in GMV and cortical thickness, respectively. Replicability in previous work has been defined as having a significant P value and the same sign for the regression coefficient. Because we were fitting non-linear effects, we defined replicability as the probability that two independent studies both have significant P values, which is equivalent to the power squared. The 95% confidence intervals for replicability were derived using Wilson’s method63.
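The bootstrap summary quantities can be sketched as follows (toy p-values; exactly how Wilson’s interval is combined with the power-squared definition is our reading of the text, so treat the CI step as an assumption):

```python
import math

# Power is the share of bootstrap p-values <= 0.05; replicability is power
# squared. The p-values below are hypothetical bootstrap results.
pvals = [0.01] * 85 + [0.20] * 15          # 100 toy bootstrap replicates
power = sum(p <= 0.05 for p in pvals) / len(pvals)
replicability = power ** 2

def wilson_ci(k: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion k successes out of n."""
    p = k / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * math.sqrt(
        p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

print(power, replicability)   # 0.85 and ~0.72
# Squaring these endpoints gives one plausible CI for replicability
# (an assumption, not a statement of the paper's exact procedure).
print(wilson_ci(85, 100))
```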

In the UKB dataset, to modify the (between-subject) variability of the age variable, we used three target sampling distributions (Extended Data Fig. 1a): bell shaped, in which most participants are distributed in the middle of the age range; uniform, in which participants are equally distributed across the age range; and U shaped, in which most participants are distributed near the limits of the age range in the study. The samples with a U-shaped age distribution had the largest sample variance of age, followed by those with a uniform age distribution and those with a bell-shaped age distribution. The bell-shaped and U-shaped functions were proportional to a quadratic function. To sample according to these distributions, each record was first inversely weighted by the frequency of records with age falling within ±0.5 years of the age for that record, which targets the uniform sampling distribution. These weights were then rescaled to derive the bell-shaped and U-shaped sampling distributions. Records with age < 50 or age > 78 years were winsorized at 50 or 78 years, respectively, when assigning weights, to limit the effects of outliers on weight assignment, but the actual age values were used when analysing each bootstrapped dataset.
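The inverse-frequency weighting step can be sketched as follows (ages made up; the ±0.5-year window follows the text, and the rescaling towards bell-shaped or U-shaped targets is only indicated in a comment):

```python
# Sketch of inverse-frequency weighting that targets a uniform age
# distribution: each record is weighted by 1 / (number of records whose
# age falls within +/- 0.5 years of its own age). Ages are illustrative.
ages = [50.2, 50.4, 50.6, 60.0, 70.1, 70.3]

def uniform_weight(a: float, all_ages: list) -> float:
    freq = sum(abs(a - b) <= 0.5 for b in all_ages)  # includes the record itself
    return 1.0 / freq

w = [uniform_weight(a, ages) for a in ages]
total = sum(w)
probs = [x / total for x in w]   # sampling probabilities for the bootstrap

# Rescaling these weights (e.g. by a quadratic function of age) would
# target the bell-shaped or U-shaped distributions instead.
print(probs)
```

The isolated record at age 60 receives the largest sampling probability, exactly because it is under-represented in the raw data.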

In each bootstrap from the ADNI dataset, each participant was sampled to have two records. We modified the between-subject and within-subject variability of age, respectively, by making the ‘baseline age’ follow one of the three target sampling distributions used for the UKB dataset and the ‘age change’ independently follow one of three new distributions: decreasing, uniform and increasing (Extended Data Fig. 1b,c). The increasing and decreasing functions were proportional to an exponential function. The samples with increasing distribution of age change had the largest within-subject variability of age, followed by the samples with the uniform distribution of age change and the samples with decreasing distribution of age change.

To modify the baseline age and the age change from baseline independently, we first created all combinations of the baseline record and one follow-up from each participant, and derived the baseline age and age change for each combination. The ‘bivariate’ frequency of each combination was obtained as the number of combinations with values of baseline age and age change falling within ±0.5 years of the values for this combination. Each combination was then inversely weighted by its bivariate frequency to target a uniform bivariate distribution of baseline age and age change. The weight for each combination was then rescaled to make the baseline age and age change independently follow different sampling distributions. Combinations with baseline age < 65 or > 85 years were winsorized at 65 or 85 years, respectively, and combinations with age change greater than 5 years were winsorized at 5 years when assigning weights, to limit the effects of outliers on weight assignment, but the actual ages were used when analysing each bootstrapped dataset.

The sampling methods could easily be extended to the scenario in which each participant had three (or more) records in the bootstrap data by making combinations of the baseline and two follow-ups. Each combination was inversely weighted by the ‘trivariate’ frequency of combinations whose baseline age and two age changes from baseline fell within ±0.5 years of the corresponding values for this combination, to achieve uniform baseline age and age change distributions. As we only investigated the effect of modifying the number of measurements per participant under uniform between-subject and within-subject sampling schemes (Fig. 3c,d), we did not need to rescale the weights to achieve other sampling distributions, but this could be done similarly. For the scenario in which each participant had only one measurement (Fig. 3c,d), the standardized effect sizes and replicability were estimated using only the baseline measurements.

All site effects were removed using ComBat or LongComBat before performing the bootstrap analysis.

Sampling schemes for other non-brain covariates in the ABCD

We used bootstrapping to study how different sampling strategies affect the RESI in the ABCD dataset. Each participant in the bootstrap data had two measurements. We applied the same weight assignment method described above for the ADNI dataset to modify the between-subject and within-subject variability of a covariate. We made the baseline covariate and the change in covariate follow bell-shaped and/or U-shaped distributions, to let the sample have larger or smaller between-subject and/or within-subject variability of the covariate, respectively. The baseline covariate and change of covariate were winsorized at the upper and lower 5% quantiles to limit the effect of outliers on sampling. For each cognitive variable, only the participants with non-missing baseline measurements and at least one non-missing follow-up were included.

Generalized linear models and GEEs were fitted to estimate the effect of each non-brain covariate on the structural brain measures after controlling for age and sex,

$$\text{GMV} \sim \text{ns}(\text{age, d.f.}=2)+\text{sex}+x,$$

where x denotes one of the non-brain covariates. For the GEEs, we used an exchangeable working correlation structure and an identity link function.

Only the between-subject sampling schemes were applied to the non-brain covariates that were stable over time (for example, birth weight and handedness). In other words, participants were sampled based on their baseline covariate values, and a follow-up was then selected randomly for each participant. The sampling schemes to increase the between-subject variability of handedness, a binary variable (right-handed or not), were specified differently: the expected proportion of right-handed participants in the bootstrap samples was 50% under the sampling scheme with larger between-subject variability and 10% under the scheme with smaller between-subject variability.

For given between-subject and/or within-subject sampling schemes, we obtained 1,000 bootstrap replicates. The standardized effect size was estimated as the mean standardized effect size across the bootstrap replicates. The 95% confidence intervals for standardized effect size were estimated using the lower and upper 2.5% quantiles across the 1,000 estimates of the standardized effect size in the bootstrap replicates. The sample sizes needed for 80% replicability were estimated based on the (mean) estimated standardized effect size and F-distribution (see below).

Analysis of functional connectivity in the ABCD

In the subset of the ABCD with preprocessed longitudinal functional connectivity data at two time points (baseline and 2-year follow-up), we restricted our analysis to the participants with non-missing measurements at both time points. In the GEEs used to estimate the effects of non-brain covariates on functional connectivity, the mean model was specified as below after LongComBat:

$${y}_{ij} \sim {\rm{ns}}({{\rm{age}}}_{ij},{\rm{d.f.}}=2)+{{\rm{sex}}}_{i}+{\rm{ns}}({\rm{mean}}\_{\rm{FD}},\mathrm{d.f.}=3)+x,$$

where yij was taken to be a functional connectivity outcome and x denotes a non-brain covariate. The mean frame-wise displacement (mean_FD) was also included as a covariate using natural cubic splines with 3 d.f. We used an exchangeable working correlation structure and an identity link function in the GEEs. The frame count of each scan was used as the weights.

When evaluating the effect of different sampling schemes on the standardized effect sizes, we obtained 1,000 bootstrap replicates for given between-subject and/or within-subject sampling schemes. The standardized effect size was estimated as the mean standardized effect size across the bootstrap replicates. Confidence intervals were computed as described above. The sample sizes needed for 80% replicability were estimated based on the (mean) estimated standardized effect sizes and F-distribution (see below).

Sample size calculation for a target power or replicability with a given standardized effect size

After estimating the standardized effect size for an association, the calculation of the sample size N needed to detect this association with γ × 100% power at significance level α was based on an F-distribution. Let d.f. denote the total degrees of freedom of the analysis model, and let \(F(z;\lambda )\) denote the cumulative distribution function of a random variable z following the (non-central) F-distribution with degrees of freedom 1 and N − d.f. and non-centrality parameter \(\lambda \). The corresponding sample size N is:

$$N=\{N:1-F({F}^{-1}(1-\alpha )\,;\lambda =N{\hat{S}}^{2})=\gamma \},$$

where \(\hat{S}\) is the estimated RESI for the standardized effect size and \({F}^{-1}(1-\alpha )\) is the central F quantile at level 1 − α. Power curves for the RESI are given in figure 3 of Vandekar et al.17. Replicability was defined as the probability that two independent studies both have significant P values, which is equivalent to the power squared.
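A sketch of this sample size search (using the large-sample normal approximation to the non-central F with one numerator degree of freedom, rather than the exact F-distribution used in the paper, so the returned N is approximate):

```python
from statistics import NormalDist

def n_for_power(s_hat: float, alpha: float = 0.05, gamma: float = 0.80) -> int:
    """Smallest N with approximate power >= gamma for RESI s_hat.

    Approximates the non-central F(1, N - d.f.) test by its large-sample
    normal limit: power ~ Phi(sqrt(N * S^2) - z_crit) + Phi(-sqrt(N * S^2)
    - z_crit). This is a sketch, not the paper's exact F-based calculation.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)       # two-sided critical value
    n = 2
    while True:
        lam_sqrt = (n * s_hat ** 2) ** 0.5   # sqrt of non-centrality N * S^2
        power = nd.cdf(lam_sqrt - z_crit) + nd.cdf(-lam_sqrt - z_crit)
        if power >= gamma:
            return n
        n += 1

# Medium effect (RESI = 0.25), 80% power at alpha = 0.05:
print(n_for_power(0.25))                 # 126
# 80% replicability requires power = sqrt(0.8), hence a larger N:
print(n_for_power(0.25, gamma=0.8 ** 0.5))
```

The second call illustrates the power-squared definition of replicability: targeting 80% replicability means targeting roughly 89.4% power, which requires a larger sample.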

Estimation of the between-subject and within-subject effects

For the non-brain covariates that were analysed longitudinally in the ABCD dataset, GEEs with exchangeable working correlation structures were fitted to estimate their between-subject and within-subject effects on structural and functional brain measures after controlling for age and sex. The mean model was specified as illustrated with GMV:

$$\text{GMV} \sim \text{ns}(\text{age, d.f.}=2)+\text{sex}+X\_\text{bl}+X\_\text{change},$$

where X_bl denotes the participant-specific baseline covariate value, and X_change denotes the difference between the covariate value at each visit and the participant-specific baseline value (see section 5.2 in Supplementary Information). Participants without baseline measures were not included in the modelling. The model coefficients for X_bl and X_change represent the between-subject and within-subject effects, respectively, of the non-brain covariate on total GMV. For the functional connectivity data, the same covariates and weighting were used as described above. Using the first time point as the between-subject term is a special case that ensures that the parameter from the baseline cross-sectional model equals the between-subject effect parameter in the longitudinal model. In this model, the between-subject variance was defined as the variance of the baseline measurement, and the within-subject variance as the mean square of X_change. This specification ensured that the sampling schemes affected the between-subject and within-subject variances separately (equation (16) in section 5.2 in Supplementary Information).
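The construction of the design variables and the two variance definitions can be sketched as follows (toy data; variable names follow the text):

```python
# Sketch of the between/within decomposition of a longitudinal covariate:
# X_bl is the participant's baseline value, X_change the visit-specific
# deviation from it. Records are (participant, visit, covariate value).
records = [
    ("p1", 0, 20.0), ("p1", 1, 22.0), ("p1", 2, 21.0),
    ("p2", 0, 30.0), ("p2", 1, 29.0),
]

baseline = {pid: x for pid, visit, x in records if visit == 0}
design = [(pid, visit, baseline[pid], x - baseline[pid])
          for pid, visit, x in records]       # (id, visit, X_bl, X_change)

# Between-subject variance: variance of the baseline values.
bl_vals = list(baseline.values())
mean_bl = sum(bl_vals) / len(bl_vals)
var_between = sum((v - mean_bl) ** 2 for v in bl_vals) / len(bl_vals)

# Within-subject variance: mean square of X_change, as defined in the text.
changes = [c for _, _, _, c in design]
var_within = sum(c ** 2 for c in changes) / len(changes)

print(var_between, var_within)
```

Because X_change is zero at baseline by construction, sampling on the baseline values and on the changes manipulates the two variances separately, which is the property the text highlights.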

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.