Experimentation, Prediction, & Modeling

Motivation:

Experiments at the Census Bureau are used to answer many research questions, especially those related to testing, evaluating, and advancing survey sampling methods. A properly designed experiment provides a valid, cost-effective framework that ensures the right type of data is collected and that sample sizes and power are sufficient to address the questions of interest. The use of valid statistical models is vital both to the analysis of results from designed experiments and to the characterization of relationships between variables in the vast data sources available to the Census Bureau. Statistical modeling is an essential component for wisely integrating data from previous sources (e.g., censuses, sample surveys, and administrative records) in order to maximize the information that they can provide. In particular, linear mixed effects models are ubiquitous at the Census Bureau through applications of small area estimation. Models can also identify errors in data, e.g., by computing valid tolerance bounds and flagging data outside the bounds for further review.
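As a rough illustration of flagging data against tolerance bounds, the sketch below uses the classical distribution-free result that the minimum and maximum of a reference sample form a two-sided tolerance interval. It is a minimal example under the assumption of independent draws from a continuous distribution, not the methodology used in any Census Bureau application; the function names are hypothetical.

```python
def minmax_tolerance_confidence(n: int, coverage: float) -> float:
    """Confidence that [min, max] of n i.i.d. continuous draws contains at
    least `coverage` of the population (order-statistic result due to Wilks)."""
    p = coverage
    return 1.0 - n * p ** (n - 1) + (n - 1) * p ** n


def min_sample_size(coverage: float, confidence: float) -> int:
    """Smallest n for which [min, max] is a (coverage, confidence) tolerance interval."""
    n = 2
    while minmax_tolerance_confidence(n, coverage) < confidence:
        n += 1
    return n


def flag_outside(reference, new_values):
    """Return the new values that fall outside the reference sample's [min, max]."""
    lower, upper = min(reference), max(reference)
    return [x for x in new_values if x < lower or x > upper]


if __name__ == "__main__":
    # Reference sample size needed for 95% coverage with 95% confidence.
    print(min_sample_size(0.95, 0.95))
    # Values outside the bounds would be sent for further review.
    print(flag_outside([10.2, 11.5, 9.8, 10.9], [10.0, 12.4, 9.1]))
```

The bounds here come from a separate reference sample, so newly reported values can fall outside them; parametric or regression-based tolerance intervals would replace the min/max rule when a model is available.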

 

Research Problems:

  • Investigate established methods and novel extensions to support design (e.g., factorial designs), analysis, and sample size determination for Census Bureau experiments.
  • Investigate methodology for experimental designs embedded in sample surveys, including large-scale field experiments embedded in ongoing surveys. This includes design-based and model-based analysis and variance estimation that incorporate both the sampling design and the experimental design (van den Brakel, Survey Methodology, 2005); factorial designs embedded in sample surveys (van den Brakel, Survey Methodology, 2013) and the estimation of interactions; and testing non-response using embedded experiments.
  • Identify and develop statistical models (e.g., loglinear models, mixture models, and mixed-effects models), associated methodologies, and computational tools for problems relevant to the Census Bureau.
  • Assess the applicability of post hoc methods (e.g., multiple comparisons and tolerance intervals) to future designed experiments and to the review of previous data analyses.
  • Construct rectangular nonparametric tolerance regions for multivariate data. Tolerance regions for multivariate data are usually elliptical in shape, but such regions cannot provide information on individual components of the measurement vector. However, such information can be obtained through rectangular tolerance regions.
  • Develop a technique, based on the COM-Poisson distribution, that accounts for mis-reporting in order to estimate true counts.
  • Develop a disclosure policy motivated by the COM-Poisson and related distributions that allows one to protect individual information reported in two-way and multi-way tables.

 

Current Subprojects:

  • Developing Flexible Distributions and Statistical Modeling for Count Data Containing Dispersion (Sellers, Morris, Raim)
  • Design and Analysis Methods for Experiments (Raim, Mathew, Sellers)

 

Potential Applications:

  • Modeling can help to characterize relationships between variables measured in censuses, sample surveys, and administrative records and quantify their uncertainty.
  • Modeling approaches with administrative records can help enhance the information obtained from various sample surveys.
  • Experimental design can help guide and validate testing procedures proposed for censuses and surveys. Sample sizes can be determined to achieve desired power using planned designs and statistical procedures.
  • Embedded experiments can be used to evaluate the effectiveness of alternative contact strategies.
  • The collection of experimental design procedures currently utilized with the American Community Survey can be expanded.
  • Fiducial predictors of random effects can be applied to mixed effects models such as those used in small area estimation.
  • Rectangular tolerance regions can be applied to multivariate economic data and aid in the editing process by identifying observations that are outlying in one or more attributes and that should subsequently undergo further review. The importance of ratio edits and multivariate/multiple edits is noted in the work of Thompson and Sigman (Journal of Official Statistics, 1999) and de Waal, Pannekoek and Scholtus (Handbook of Statistical Data Editing and Imputation, 2011).
  • Principled measures of statistical variability can be provided for constructs like the POP Division's Population Estimates.
  • Mis-reporting techniques could be used to assess the amount of mis-reporting in historical Census datasets, which would aid the development of models that estimate true survey count outcomes.
  • Statistical disclosure limitation constructs would allow the Census Bureau to release statistical measures associated with a general distributional form while protecting individual privacy. These measures would allow one to estimate the form of multi-way tables of interest while masking the true outcomes.
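The point above about determining sample sizes to achieve a desired power can be illustrated with a textbook calculation. The sketch below applies the standard normal-approximation formula for a two-sided test comparing two proportions, for instance response rates under two contact strategies. It assumes simple random assignment with equal group sizes and ignores survey design effects; the function name and the example rates are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05,
                                power: float = 0.80) -> int:
    """Per-group sample size for a two-sided z-test of H0: p1 == p2.

    Uses the usual normal approximation with a pooled variance under the
    null and unpooled variances under the alternative.
    """
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("p1 and p2 must be distinct values in (0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical response rates of 50% and 60% under two strategies.
    print(sample_size_two_proportions(0.50, 0.60))
```

An experiment embedded in a complex survey would typically need a larger sample than this formula suggests, since clustering and unequal weighting inflate variances.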

 

Accomplishments (October 2018-September 2020):

  • Completed paper on spatio-temporal change of support modeling in R and released stcos R package.
  • Addressed issues with COM-Poisson normalizing constant in the COMPoissonReg R package.
  • Completed paper on Conway-Maxwell (COM) multinomial distribution and its use in analyzing clustered multinomial datasets that exhibit over- or under-dispersion.
  • Developed and released COMMultReg R package to support COM-multinomial paper.
  • Completed paper on continuation-ratio logit modeling for sample size determination and analysis of experiments involving sequences of success/failure trials. Such models support the study of nonresponse probabilities under multiple enumeration attempts to each household.
  • Completed paper on comparing pairs of discrete distributions via multinomial outcomes to determine if one is closer to a discrete uniform distribution. This was applied to Census Bureau call volume data to determine if a staggered mailing strategy leads to significantly more uniform call distributions than a simpler strategy where mail is sent to all recipients at once.
  • Completed development of a one-step autoregressive model for count data motivated by the COM-Poisson distribution.
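For readers unfamiliar with the COM-Poisson normalizing constant mentioned above: the distribution has probability mass function P(X = x) = λ^x / ((x!)^ν Z(λ, ν)), where Z(λ, ν) is the sum over j ≥ 0 of λ^j / (j!)^ν, which has no closed form in general. The sketch below evaluates Z by truncating the series in log space. It is only an illustration with a simple stopping heuristic and does not reproduce the approach taken in the COMPoissonReg package; all function names are hypothetical.

```python
import math


def log_normalizing_constant(lam: float, nu: float,
                             tol: float = 1e-12,
                             max_terms: int = 10_000) -> float:
    """Approximate log Z(lam, nu) = log sum_{j>=0} lam^j / (j!)^nu."""
    if lam <= 0 or nu < 0:
        raise ValueError("require lam > 0 and nu >= 0")
    if nu == 0 and lam >= 1:
        raise ValueError("series diverges when nu == 0 and lam >= 1")
    log_terms = []
    largest = -math.inf
    for j in range(max_terms):
        log_term = j * math.log(lam) - nu * math.lgamma(j + 1)
        log_terms.append(log_term)
        largest = max(largest, log_term)
        # Heuristic: stop once terms are decreasing and negligible
        # relative to the largest term seen so far.
        if j > 0 and log_term < log_terms[-2] and log_term - largest < math.log(tol):
            break
    # Log-sum-exp for numerical stability.
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def com_poisson_pmf(x: int, lam: float, nu: float) -> float:
    """P(X = x) under a COM-Poisson(lam, nu) distribution."""
    if x < 0:
        return 0.0
    log_p = x * math.log(lam) - nu * math.lgamma(x + 1) - log_normalizing_constant(lam, nu)
    return math.exp(log_p)


if __name__ == "__main__":
    # nu = 1 recovers the Poisson distribution; nu < 1 gives over-dispersion
    # and nu > 1 gives under-dispersion.
    print(com_poisson_pmf(2, 3.0, 1.0))
    print(com_poisson_pmf(2, 3.0, 0.5))
```

The series converges slowly when ν is close to 0 and λ is large, which is one reason the normalizing constant requires care in practice.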

 

Short-Term Activities (FY 2021 – FY 2023):

  • Explore COM-multinomial as a model for missing observations in clustered data under a Bayesian setting.
  • Extend work on sample size determination with continuation-ratio logit model to a mixed effects setting.
  • Develop a multivariate COM-Poisson distribution model.

 

Longer-Term Activities (beyond FY 2023):

  • Develop generalized/flexible spatial and time series models motivated by the COM-Poisson distribution.
  • Adapt recent advances in randomization-based causal inference for complex experiments to the analysis of complex embedded experiments, taking into account the features of embedded experiments (for example, random interviewer effects and different sampling designs). Significant progress has been made recently in this area: Ding (Statistical Science, 2017); Dasgupta, Pillai and Rubin (Journal of the Royal Statistical Society, Series B, 2015); Ding and Dasgupta (Journal of the American Statistical Association, 2016); Mukerjee, Dasgupta and Rubin (Journal of the American Statistical Association, 2018); and Branson and Dasgupta (International Statistical Review, 2020).
  • Generalize the Kadane et al. (2006) COM-Poisson motivated data disclosure limitation procedure for one-way tables to handle two-way and multi-way tables. Determine the associated sufficient statistics of the bivariate (or multivariate) COM-Poisson distribution and use them to describe the space of feasible tables that can be used to substitute the true contingency table.
  • Consider generalizations of the frequentist and Bayesian approaches to address under-reporting described in Winkelmann (1996), Fader and Hardie (2000), Neubauer and Djuras (2009), and Neubauer et al. (2009) to allow for data dispersion via the COM-Poisson distribution.

 

Selected Publications:

Raim, A.M., Nichols, E., and Mathew, T. (2023). “A Statistical Comparison of Call Volume Uniformity Due to Mailing Strategy,” Journal of Official Statistics, 39, 103-121.

Raim, A.M., Mathew, T., Sellers, K. F., Ellis, R., and Meyers, M. (2023). “Design and Sample Size Determination for Experiments on Nonresponse Follow-up using a Sequential Regression Model,” Journal of Official Statistics, 39(2), 173-202.

Raim, A.M. (2023). “Direct Sampling with a Step Function,” Statistics and Computing, 33(22). https://doi.org/10.1007/s11222-022-10188.

Lucagbo, M., Mathew, T., and Young, D. (2023). “Rectangular Multivariate Normal Prediction Regions for Setting Reference Regions in Laboratory Medicine,” Journal of Biopharmaceutical Statistics, 33(2), 191-209.

Lucagbo, M. and Mathew, T. (2023). “Rectangular Tolerance Regions and Multivariate Normal Reference Regions in Laboratory Medicine,” Biometrical Journal, 65(3).

Arsham, A., Bebu, I., and Mathew, T. (2023). “Cost-Effectiveness Analysis Under Multiple Effectiveness Outcomes: A Probabilistic Approach,” Statistics in Medicine, 42, 3936-3955.

Arsham, A., Bebu, I., and Mathew, T. (2022). “A Bivariate Regression-Based Cost-Effectiveness Analysis,” Journal of Statistical Theory and Practice, 16, Article No. 27.

Janicki, R., Raim, A.M., Holan, S.H., and Maples, J. (2022). “Bayesian Nonparametric Multivariate Spatial Mixture Mixed Effects Models with Application to American Community Survey Special Tabulations,” The Annals of Applied Statistics, Volume 16, Issue 1, 144-168.

Lucagbo, M. and Mathew, T. (2022). “Rectangular Confidence Regions and Prediction Regions in Multivariate Calibration,” Journal of the Indian Society for Probability and Statistics, 23, 155–171.

Morris, D.S. and Sellers, K.F. (2022). “A Flexible Mixed Model for Clustered Count Data,” Stats: Special Issue on Statistics, Data Analytics, and Inferences for Discrete Data, 5(1): 52–69. https://doi.org/10.3390/stats5010004.

Rivas, A., Antoun, C., Feuer, S., Mathew, T., Nichols, E., Olmsted-Hawala, E., and Wang, L. (2022). “Comparison of Three Navigation Button Designs in Mobile Survey for Older Adults,” Survey Practice, 15(1).

Weems, K.S., Sellers, K.F., and Li, T. (2021). “A Flexible Bivariate Distribution for Count Data Expressing Data Dispersion,” Communications in Statistics - Theory and Methods, https://doi.org/10.1080/03610926.2021.1999474.

Feng, X., Mathew, T., and Adragni, K. (2021). “Interval Estimation of the Intra-class Correlation in General Linear Mixed Effects Models,” Journal of Statistical Theory and Practice, 15, Article 65.

Sellers, K.F., Arab, A., Melville, S., and Cui, F. (2021). “A Flexible Univariate Moving Average Time-Series Model for Dispersed Count Data,” Journal of Statistical Distributions and Applications 8 (1). https://doi.org/10.1186/s40488-021-00115-2

Sellers, K.F., Li, T., Wu, Y., and Balakrishnan, N. (2021). “A Flexible Multivariate Distribution for Correlated Count Data,” Stats, 4(2), 308-326, https://doi.org/10.3390/stats4020021.

Zhao, J., Mathew, T., and Bebu, I. (2021). “Accurate Confidence Intervals for Inter-Laboratory Calibration and Common Mean Estimation,” Chemometrics and Intelligent Laboratory Systems, 208. DOI: 10.1016/j.chemolab.2020.104218.

Zimmer, Z., Park, D., and Mathew, T. (2021). “Tolerance Limits under Zero-Inflated Lognormal and Gamma Distributions,” Computational and Mathematical Methods, Special Issue on Statistics, 3. DOI: 10.1002/cmm4.1113.

Morris, D.S., Raim, A.M., and Sellers, K.F. (In Press). “A Conway-Maxwell-multinomial Distribution for Flexible Modeling of Clustered Categorical Data,” Journal of Multivariate Analysis. DOI: https://doi.org/10.1016/j.jmva.2020.104651.

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2020). stcos: “Space-Time Change of Support, version 0.3.0,” https://cran.r-project.org/package=stcos.

Sellers K.F., Peng, S.J., and Arab, A. (2020). “A Flexible Univariate Autoregressive Time-series Model for Dispersed Count Data,” Journal of Time Series Analysis, 41(3): 436-453.

Zhu, L., Sellers, K., Morris, D., Shmueli, G., and Davenport, D. (2020). cmpprocess: “Flexible Modeling of Count Processes,” version 1.1, https://cran.r-project.org/package=cmpprocess

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2019). “Spatio-Temporal Change of Support Modeling for the American Community Survey with R,” URL: https://arxiv.org/abs/1904.12092.

Sellers, K., Lotze, T., and Raim, A. (2019). COMPoissonReg: “Conway-Maxwell-Poisson Regression, version 0.7.0,” https://cran.r-project.org/package=COMPoissonReg

Sellers, K.F. and Young, D. (2019). “Zero-inflated Sum of Conway-Maxwell-Poissons (ZISCMP) Regression with Application to Shark Distributions,” Journal of Statistical Computation and Simulation, 89 (9): 1649-1673.

Sellers, K., Morris, D., Balakrishnan, N., and Davenport, D. (2018). multicmp: “Flexible Modeling of Multivariate Count Data via the Multivariate Conway-Maxwell-Poisson Distribution,” version 1.1, https://cran.r-project.org/package=multicmp

Morris, D.S., Sellers, K.F., and Menger, A. (2017). “Fitting a Flexible Model for Longitudinal Count Data Using the NLMIXED Procedure,” SAS Global Forum Proceedings Paper 202-2017, SAS Institute: Cary, NC.

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2017). “A Model Selection Study for Spatio-Temporal Change of Support,” in Proceedings, Government Statistics Section of the American Statistical Association, Alexandria, VA: American Statistical Association.

Sellers, K.F., and Morris, D. (2017). “Under-dispersion Models: Models That Are ‘Under The Radar’,” Communications in Statistics – Theory and Methods, 46 (24): 12075-12086.

Sellers K.F., Morris D.S., Shmueli, G., and Zhu, L. (2017). “Reply: Models for Count Data (A Response to a Letter to the Editor),” The American Statistician.

Young, D.S., Raim, A.M., and Johnson, N.R. (2017). “Zero-inflated Modelling for Characterizing Coverage Errors of Extracts from the U.S. Census Bureau's Master Address File,” Journal of the Royal Statistical Society: Series A. 180(1):73-97.

Zhu, L., Sellers, K.F., Morris, D.S., and Shmueli, G. (2017). “Bridging the Gap: A Generalized Stochastic Process for Count Data,” The American Statistician, 71 (1): 71-80.

Heim, K. and Raim, A.M. (2016). “Predicting Coverage Error on the Master Address File Using Spatial Modeling Methods at the Block Level,” In JSM Proceedings, Survey Research Methods Section. Alexandria, VA: American Statistical Association.

Mathew, T., Menon, S., Perevozskaya, I., and Weerahandi, S. (2016). “Improved Prediction Intervals in Heteroscedastic Mixed-Effects Models,” Statistics & Probability Letters, 114, 48-53.

Raim, A.M. (2016). “Informing Maintenance to the U.S. Census Bureau's Master Address File with Statistical Decision Theory,” In JSM Proceedings, Government Statistics Section. Alexandria, VA: American Statistical Association.

Sellers, K.F., Morris, D.S., and Balakrishnan, N. (2016). “Bivariate Conway-Maxwell-Poisson Distribution: Formulation, Properties, and Inference,” Journal of Multivariate Analysis, 150:152-168.

Sellers, K.F. and Raim, A.M. (2016). “A Flexible Zero-inflated Model to Address Data Dispersion,” Computational Statistics and Data Analysis, 99: 68-80.

Raim, A.M. and Gargano, M.N. (2015). “Selection of Predictors to Model Coverage Errors in the Master Address File,” Research Report Series: Statistics #2015-04, Center for Statistical Research and Methodology, U.S. Census Bureau.

Young, D. and Mathew, T. (2015). “Ratio Edits Based on Statistical Tolerance Intervals,” Journal of Official Statistics 31, 77-100.

Klein, M., Mathew, T., and Sinha, B. K. (2014). “Likelihood Based Inference under Noise Multiplication,” Thailand Statistician. 12(1), pp.1-23. URL: http://www.tci-thaijo.org/index.php/thaistat/article/view/34199/28686

Young, D.S. (2014). “A Procedure for Approximate Negative Binomial Tolerance Intervals,” Journal of Statistical Computation and Simulation, 84(2), pp.438-450. URL: http://dx.doi.org/10.1080/00949655.2012.715649

Gamage, G., Mathew, T., and Weerahandi, S. (2013). “Generalized Prediction Intervals for BLUPs in Mixed Models,” Journal of Multivariate Analysis, 120, 226-233.

Mathew, T. and Young, D. S. (2013). “Fiducial-Based Tolerance Intervals for Some Discrete Distributions,” Computational Statistics and Data Analysis, 61, 38-49.

Young, D.S. (2013). “Regression Tolerance Intervals,” Communications in Statistics – Simulation and Computation, 42(9), 2040-2055.

 

Contact:

Andrew Raim, Thomas Mathew, Kimberly Sellers, Darcy Morris

 

Funding Sources for FY 2021-2025:

0331 – Working Capital Fund / General Research Project

Various Decennial and Demographic Projects

Page Last Revised - January 3, 2024