Supplementing Missing Self-Reported Race Data with a Probability Distribution in Logistic Regression Models
Race is often included as an independent variable in health services research, especially in recent studies of racial and ethnic disparities in health care. Although self-reported information on race exists in large electronic health record (EHR) data, it is sometimes missing. Recently, the Bayesian Improved Surname Geocoding (BISG) method has been used to estimate the probability distribution over race categories for individuals with missing information on race. The BISG-estimated probability distribution has been used in reporting health care measures but not in statistical models with dichotomous outcomes. We propose two approaches for incorporating an available probability distribution of a categorical independent variable (e.g., race) into logistic regression models: 1) a direct substitution approach and 2) a partial information maximum likelihood estimator (PIMLE). In our examination of the association between race and up-to-date immunization status by three years of age among children in an integrated health care organization, 11.3% of 14,903 children had missing self-reported race information but had a BISG-estimated probability distribution over the six race/ethnicity categories. We applied the direct substitution approach and the PIMLE approach to the undervaccination data. Both approaches included all observations and thus yielded smaller standard errors for the estimated coefficients than the complete-data analysis. Our simulation study showed that the direct substitution approach and PIMLE yielded nearly unbiased coefficient estimates and preserved efficiency when the missing rate of the categorical independent variable was as high as 30%.
Keywords: Race and ethnicity, Bayesian Improved Surname Geocoding, up-to-date immunization, direct substitution approach, partial information maximum likelihood estimator
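The two approaches named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the PIMLE likelihood, so the mixture form in `mixture_loglik` is an assumption about what a partial information likelihood might look like. The three category names, the toy data, and the plain gradient-ascent fitter are likewise hypothetical and chosen only to keep the example self-contained.

```python
import math

# Hypothetical categories; the first one is treated as the reference level.
CATEGORIES = ["white", "black", "hispanic"]

def race_columns(race, probs):
    """Direct substitution: non-reference race columns for one subject.

    Observed race -> ordinary 0/1 indicators.
    Missing race (None) -> the estimated probabilities stand in for the indicators.
    """
    if race is not None:
        return [1.0 if race == c else 0.0 for c in CATEGORIES[1:]]
    return [probs[c] for c in CATEGORIES[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, steps=5000, lr=0.5):
    """Fit a logistic regression by gradient ascent on the mean log-likelihood.

    X holds covariate rows without the intercept; beta[0] is the intercept.
    """
    beta = [0.0] * (len(X[0]) + 1)
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            row = [1.0] + xi
            resid = yi - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def mixture_loglik(beta, race, probs, y):
    """Assumed mixture contribution: P(y) = sum_k w_k * P(y | race = k).

    For an observed race the weights are 0/1, so this reduces to the usual
    logistic log-likelihood term. Probabilities are assumed to sum to 1.
    """
    if race is not None:
        weights = {c: 1.0 if c == race else 0.0 for c in CATEGORIES}
    else:
        weights = probs
    p_event = 0.0
    for k, c in enumerate(CATEGORIES):
        eta = beta[0] + (beta[k] if k > 0 else 0.0)
        p_event += weights[c] * sigmoid(eta)
    return math.log(p_event if y == 1 else 1.0 - p_event)

# Toy data: (self-reported race or None, estimated probabilities or None, outcome).
subjects = [
    ("white", None, 1), ("white", None, 1), ("white", None, 0),
    ("black", None, 1), ("black", None, 0), ("black", None, 0),
    ("hispanic", None, 1), ("hispanic", None, 0),
    (None, {"white": 0.7, "black": 0.2, "hispanic": 0.1}, 1),
    (None, {"white": 0.1, "black": 0.6, "hispanic": 0.3}, 0),
]

X = [race_columns(race, probs) for race, probs, _ in subjects]
y = [outcome for _, _, outcome in subjects]
beta_ds = fit_logistic(X, y)  # direct substitution estimates
total_mix = sum(mixture_loglik(beta_ds, r, p, o) for r, p, o in subjects)
print("direct substitution coefficients:", beta_ds)
print("mixture log-likelihood at those coefficients:", total_mix)
```

Direct substitution needs no special software, because the probabilities simply replace the indicator columns in the design matrix. Under the assumed mixture form, the coefficients would instead be obtained by maximizing the sum of `mixture_loglik` terms with a numerical optimizer.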