Project 3: Combining Contextual and Individual Data from Multiple Sources  

Scientific understanding of the relationships between health factors at the individual level and those at the level of social and spatial aggregates has been severely hampered by the lack of analytic tools.  Now that those tools are widely available, though, it is clear that an important source of data about health has not produced the kind of contextual information needed to understand the interplay of individuals and groups.

One source of contextual information that has not been fully exploited is aggregation of survey data themselves to higher level sampling units from which the survey subjects were selected.  Many health surveys on different health topics are conducted in the same primary sampling units, yet there is little opportunity to link the sampling unit across sample designs.  For example, the National Health and Nutrition Examination Survey (NHANES) is conducted in a sample of the National Health Interview Survey’s (NHIS) primary sampling units.  NHIS data could be used to estimate neighborhood structural characteristics, such as ease of access to a health care facility at the PSU level, which then can be related to individual health risk factors such as cholesterol level. What is lacking are methods to exploit the existence of coordinated survey design for the measurement of ecological influences.
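The linkage idea above can be illustrated with a minimal sketch: individual records from one survey are aggregated to the PSU level, and the resulting PSU-level measure is attached to individuals from a second survey fielded in the same PSUs. Everything here is hypothetical (the record layouts, the PSU identifiers, the access indicator, and the cholesterol values), and the aggregation is unweighted for brevity; an actual analysis would need to account for the survey weights and design.

```python
from collections import defaultdict

# Hypothetical NHIS-style records: (psu_id, has_easy_access_to_care as 0/1)
nhis_records = [
    ("psu_01", 1), ("psu_01", 0), ("psu_01", 1), ("psu_01", 1),
    ("psu_02", 0), ("psu_02", 0), ("psu_02", 1), ("psu_02", 0),
]

# Hypothetical NHANES-style records: (person_id, psu_id, cholesterol in mg/dL)
nhanes_records = [
    ("a", "psu_01", 182.0),
    ("b", "psu_02", 221.0),
    ("c", "psu_02", 240.0),
]


def psu_proportions(records):
    """Return the unweighted proportion of 1s within each PSU."""
    totals = defaultdict(lambda: [0, 0])  # psu_id -> [sum of values, count]
    for psu, value in records:
        totals[psu][0] += value
        totals[psu][1] += 1
    return {psu: s / n for psu, (s, n) in totals.items()}


def attach_context(individuals, context):
    """Append each person's PSU-level value to his or her record."""
    return [(pid, psu, y, context[psu]) for pid, psu, y in individuals]


access_by_psu = psu_proportions(nhis_records)
linked = attach_context(nhanes_records, access_by_psu)
```

A simple aggregate like this treats the PSU-level estimate as if it were measured without error; the hierarchical models proposed below are intended to carry that uncertainty into the individual-level analysis.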

The objective of this project is to use multi-level or hierarchical models to develop statistical methods and associated software to bring community and neighborhood foundations of health and development into analysis of individual health characteristics.

Recent developments in Bayesian computation (e.g., Markov chain Monte Carlo methods) have made it possible to apply hierarchical models to both continuous and categorical outcome data (Geman and Geman, 1984; Gelfand and Smith, 1991).  Further, Bayesian methods have the desirable property that they can use more of the available information, and use it more efficiently, than traditional frequentist procedures.  The proposed project will use the Bayesian computational framework to combine information from neighborhood and individual characteristics in a sample survey to examine individual health outcomes through a set of random-effects hierarchical models.  The random effects estimated at one level of the model are used as predictors at the next level of the model.

Let $y_{ij}$ denote an indicator variable taking the value 1 if person $i$ in neighborhood $j$ is below the federally defined poverty level, based on a detailed assessment in a large survey such as NHIS.  A random effect logistic regression model may be used to specify the relationship between $y_{ij}$ and a set of predictors such as region or urban residence, denoted as $x_{ij}$, as follows:

$$\operatorname{logit}\,\Pr(y_{ij} = 1 \mid u_j) = x_{ij}^{\top}\beta + u_j$$

Here the $u_j$ are random regression coefficients, the adjusted community-level log odds of being below the poverty level, and are assumed to be normally distributed with mean 0 and variance $\sigma_u^2$.  $\beta$ is a vector of fixed effect regression coefficients.  Suppose that $z_{jk}$ is an individual health outcome of interest such as blood glucose for subject $k$ from the same neighborhood, but $z_{jk}$ is measured in another survey using the same neighborhoods, or primary sampling units.  A second-stage model regresses this health outcome on the unobserved random effect ($u_j$) and individual-level variables $w_{jk}$,

$$z_{jk} = w_{jk}^{\top}\gamma + \delta u_j + \varepsilon_{jk}$$

Here the $\varepsilon_{jk}$ are assumed to be normally distributed with mean 0 and variance $\sigma_\varepsilon^2$.  The object of the inference is $\delta$, the adjusted effect of the neighborhood characteristic.
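To make the two-stage structure concrete, the following sketch simulates data from a simplified version of the model. It is an illustration under stated assumptions, not the proposed software: both stages are reduced to an intercept plus the neighborhood random effect (no other predictors), and all names and default values (`beta0`, `sigma_u`, `gamma0`, `delta`, `sigma_e`, the sample sizes) are hypothetical. The key feature is that the same unobserved neighborhood effect `u[j]` drives the poverty indicator in the first survey and the health outcome in the second.

```python
import math
import random


def simulate(n_psu=50, n1=40, n2=20, beta0=-1.0, sigma_u=0.8,
             gamma0=100.0, delta=5.0, sigma_e=10.0, seed=1):
    """Simulate two surveys that share PSUs and a common random effect.

    Survey 1 rows are (psu, poverty_indicator); survey 2 rows are
    (psu, health_outcome). All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    # One normally distributed random effect per neighborhood (PSU).
    u = [rng.gauss(0.0, sigma_u) for _ in range(n_psu)]
    survey1, survey2 = [], []
    for j in range(n_psu):
        # First stage: logistic model for being below the poverty level.
        p = 1.0 / (1.0 + math.exp(-(beta0 + u[j])))
        for _ in range(n1):
            survey1.append((j, 1 if rng.random() < p else 0))
        # Second stage: normal outcome that depends on the same u[j].
        for _ in range(n2):
            survey2.append((j, gamma0 + delta * u[j] + rng.gauss(0.0, sigma_e)))
    return u, survey1, survey2


u, survey1, survey2 = simulate()
```

In real data `u` is never observed; only the two surveys and the shared PSU identifier are available, which is why the neighborhood effect has to be estimated jointly with the second-stage coefficient.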

This second-stage model is suitable for normally distributed health outcomes.  A logistic regression model could be used for binary outcome variables such as presence or absence of a condition; a polytomous or multinomial regression model may be used for nominal or ordinal outcome variables such as self-rated health or type of health insurance coverage.  The method could readily be adapted to handle count variables, such as the number of visits to a doctor’s office, through a Poisson model.
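For the binary and count outcomes mentioned above, the second stage might be written with a link function in place of the identity. This is a sketch only, and the notation is assumed here for illustration: $z_{jk}$ is the outcome for subject $k$ in neighborhood $j$, $u_j$ the neighborhood random effect, $w_{jk}$ the individual-level variables, $\gamma$ their coefficients, and $\delta$ the neighborhood effect.

```latex
% Binary outcome: logistic second stage
\operatorname{logit}\,\Pr(z_{jk} = 1 \mid u_j) = w_{jk}^{\top}\gamma + \delta u_j

% Count outcome: Poisson second stage with a log link
z_{jk} \mid u_j \sim \operatorname{Poisson}(\mu_{jk}), \qquad
\log \mu_{jk} = w_{jk}^{\top}\gamma + \delta u_j
```

In both cases the quantity of interest remains the coefficient on the neighborhood random effect; only the interpretation changes (a log odds ratio in the logistic case, a log rate ratio in the Poisson case).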

Gibbs sampling and other Markov chain Monte Carlo algorithms will be used to construct the posterior distribution of the parameter of interest (the adjusted effect of the neighborhood characteristic) and of the other parameters in the model.  As a first step, algorithms will be developed for drawing values from the first stage of the model, conditional on the parameters in the second stage of the model and on the data.  Next, procedures will be developed for drawing values from the posterior distribution of the parameters of the second stage of the model, conditional on the first-stage parameters and on the data.  These two sampling algorithms will be combined into a single general-purpose software system to implement the procedure.  The software procedure will be implemented in SAS using facilities such as the macro language, PROC IML (the Interactive Matrix Language), and the SAS ASSIST features to present screens that allow users unfamiliar with the complexities of Bayesian methods, Gibbs sampling, and Markov chain Monte Carlo methods to specify substantively suitable models.
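The alternating scheme described above can be sketched as follows. This is a minimal illustration in Python rather than the SAS implementation the project proposes, and it rests on several simplifying assumptions that are not part of the proposal: both stages contain only an intercept and the neighborhood random effect; the first-stage intercept `b0` and the standard deviations `su` and `se` are treated as known; the second-stage intercept `g0` and neighborhood coefficient `d` have flat priors; and, because the logistic full conditional of each random effect has no closed form, a random-walk Metropolis step is substituted for a direct draw in the first block. All function and variable names are hypothetical.

```python
import math
import random


def log_cond_u(u, ys, zs, b0, su, g0, d, se):
    """Log full conditional of one neighborhood effect u, up to a constant."""
    eta = b0 + u
    lp = sum(ys) * eta - len(ys) * math.log1p(math.exp(eta))  # first stage
    lp -= u * u / (2.0 * su * su)                             # N(0, su^2)
    lp -= sum((z - g0 - d * u) ** 2 for z in zs) / (2.0 * se * se)  # second stage
    return lp


def run_sampler(y_by_psu, z_by_psu, b0, su, se, n_iter=500, step=0.5, seed=0):
    """Return draws of d, the coefficient on the neighborhood effect."""
    rng = random.Random(seed)
    n_psu = len(y_by_psu)
    n_obs = sum(len(zs) for zs in z_by_psu)
    u = [rng.gauss(0.0, su) for _ in range(n_psu)]  # random starting values
    g0, d = 0.0, 0.0
    d_draws = []
    for _ in range(n_iter):
        # Block 1: random effects given second-stage parameters and the data
        # (one random-walk Metropolis step per neighborhood).
        for j in range(n_psu):
            cand = u[j] + rng.gauss(0.0, step)
            log_ratio = (
                log_cond_u(cand, y_by_psu[j], z_by_psu[j], b0, su, g0, d, se)
                - log_cond_u(u[j], y_by_psu[j], z_by_psu[j], b0, su, g0, d, se)
            )
            if rng.random() < math.exp(min(0.0, log_ratio)):
                u[j] = cand
        # Block 2: second-stage parameters given the random effects and the
        # data (normal full conditionals under flat priors, known se).
        mean_resid = sum(z - d * u[j]
                         for j in range(n_psu) for z in z_by_psu[j]) / n_obs
        g0 = rng.gauss(mean_resid, se / math.sqrt(n_obs))
        suu = sum(len(z_by_psu[j]) * u[j] ** 2 for j in range(n_psu))
        suz = sum(u[j] * (z - g0) for j in range(n_psu) for z in z_by_psu[j])
        d = rng.gauss(suz / suu, se / math.sqrt(suu))
        d_draws.append(d)
    return d_draws


# Toy data simulated under the same simplified model (all values hypothetical).
rng = random.Random(42)
true_u = [rng.gauss(0.0, 0.8) for _ in range(20)]
y_by_psu = [[1 if rng.random() < 1.0 / (1.0 + math.exp(1.0 - uj)) else 0
             for _ in range(30)] for uj in true_u]
z_by_psu = [[100.0 + 5.0 * uj + rng.gauss(0.0, 10.0) for _ in range(15)]
            for uj in true_u]

d_draws = run_sampler(y_by_psu, z_by_psu, b0=-1.0, su=0.8, se=10.0)
# Discard the first half as burn-in before summarizing.
posterior_mean_d = sum(d_draws[250:]) / len(d_draws[250:])
```

A full implementation would also update the first-stage regression coefficients and the variance components, include covariates at both stages, and check convergence; the sketch is meant only to show how the two conditional sampling steps alternate within one chain.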

The project will also explore whether “neighborhood” characteristics estimated using NHIS data may become part of the public use NHANES (or NSFG) files with suitable random recodes of the primary sampling unit characteristics to assure confidentiality.  Thus, analysts outside of NCHS will have access to the random effects coefficients representing neighborhood or primary sampling unit characteristics that would ordinarily be inaccessible.

Publications:  none at this time.