Supplementing Missing Self-Reported Race Data with a Probability Distribution in Logistic Regression Models

Authors

  • Stanley Xu, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA
  • Komal Narwaney, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA
  • Sophia Newcomer, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA
  • Jason Glanz, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA

DOI:

https://doi.org/10.6000/1929-6029.2015.04.03.2

Keywords:

Race and ethnicity, Bayesian Improved Surname Geocoding, up-to-date immunization, direct substitution approach, partial information maximum likelihood estimator

Abstract

Race is often included as an independent variable in health services research, especially in recent studies of racial and ethnic disparities in health care. Although self-reported information on race exists in large electronic health records (EHR) data, these data are sometimes missing. Recently, the Bayesian Improved Surname Geocoding (BISG) method has been used to estimate the probability distribution of race categories for those with missing information on race. The BISG-estimated probability distribution has been used in reporting health care measures but not in statistical modeling with dichotomous events as outcomes. We propose two approaches to accommodate an available probability distribution of an independent categorical variable (e.g., race) in logistic regression models: 1) a direct substitution approach and 2) a partial information maximum likelihood estimator (PIMLE). In examining the association between race and up-to-date immunization status of children by three years of age in an integrated health care organization, 11.3% of 14,903 children had missing self-reported race information but had a BISG-estimated probability distribution for the six race/ethnicity categories. We employed the direct substitution approach and the PIMLE approach to analyze the undervaccination data. Both approaches included all observations and thus yielded smaller standard errors of estimated coefficients than the complete-data analyses. Our simulation study showed that the direct substitution approach and PIMLE yielded nearly unbiased coefficient estimates and preserved efficiency when the missing rate of the independent categorical variable was up to 30%.
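
The abstract names the two approaches without giving their formulas, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that direct substitution replaces the 0/1 race indicator columns with the BISG probabilities when self-reported race is missing, and that a PIMLE-style likelihood averages the category-specific likelihoods using those probabilities as weights. The category labels, coefficient values, and function names are all hypothetical.

```python
import math

# Illustrative labels for six race/ethnicity categories (assumed, not from
# the paper); the first entry serves as the reference level.
RACES = ["white", "black", "hispanic", "asian_pi", "ai_an", "multiracial"]


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def race_columns(self_reported, bisg_probs):
    """Design-matrix columns for race under direct substitution (assumed form).

    Observed race gives the usual 0/1 indicators; missing race gives the
    BISG probabilities of the non-reference categories instead.
    """
    if self_reported is not None:
        return [1.0 if self_reported == r else 0.0 for r in RACES[1:]]
    return [bisg_probs[r] for r in RACES[1:]]


def loglik_direct(beta0, betas, y, self_reported, bisg_probs):
    """Log-likelihood contribution of one child under direct substitution."""
    x = race_columns(self_reported, bisg_probs)
    p = expit(beta0 + sum(b * xi for b, xi in zip(betas, x)))
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)


def loglik_mixture(beta0, betas, y, self_reported, bisg_probs):
    """Log-likelihood contribution under a probability-weighted mixture.

    Assumption: when race is missing, the likelihood is the BISG-weighted
    average of the likelihoods the child would have in each category.
    """
    if self_reported is not None:
        weights = {r: 1.0 if r == self_reported else 0.0 for r in RACES}
    else:
        weights = bisg_probs
    lik = 0.0
    for k, r in enumerate(RACES):
        eta = beta0 + (betas[k - 1] if k > 0 else 0.0)
        p = expit(eta)
        lik += weights[r] * (p if y == 1 else 1.0 - p)
    return math.log(lik)


# Hypothetical coefficients and BISG probabilities for one child.
beta0, betas = 0.5, [-0.3, 0.2, 0.1, 0.0, -0.1]
probs = {"white": 0.6, "black": 0.2, "hispanic": 0.1,
         "asian_pi": 0.05, "ai_an": 0.03, "multiracial": 0.02}

observed = loglik_direct(beta0, betas, 1, "black", None)
missing_direct = loglik_direct(beta0, betas, 1, None, probs)
missing_mixture = loglik_mixture(beta0, betas, 1, None, probs)
```

Under these assumptions the two contributions coincide whenever race is observed and differ only for children with missing race: direct substitution averages on the linear-predictor scale, while the mixture form averages on the probability scale. Summing either contribution over all children gives an objective that a general-purpose optimizer could maximize.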

Author Biographies

Stanley Xu, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA

Head of Biostatistics, IHR, Kaiser Permanente Colorado, Colorado, USA

Adjunct Associate Professor, University of Colorado Denver, Colorado, USA

Komal Narwaney, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA

Biostatistician

Sophia Newcomer, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA

Biostatistician

Jason Glanz, The Institute for Health Research, Kaiser Permanente Colorado, Denver, CO, USA

Epidemiologist

Published

2015-08-18

How to Cite

Xu, S., Narwaney, K., Newcomer, S., & Glanz, J. (2015). Supplementing Missing Self-Reported Race Data with a Probability Distribution in Logistic Regression Models. International Journal of Statistics in Medical Research, 4(3), 252–259. https://doi.org/10.6000/1929-6029.2015.04.03.2

Section

General Articles