This site contains information about
the text "Applied Survey Data Analysis" including author biographies,
links to public release data sets and related sites, code and output
for analysis examples replicated in current software packages, and information
about new publications of interest to survey data analysts. Other features include a FAQ log and
links to other software and statistical sites. We plan to intermittently update this site with news about
statistical and software advances in the field of analysis of survey data.
Special Note from Authors
The most recent printing of Applied
Survey Data Analysis, as of March 7, 2013, has a font issue where some
symbols appear to be missing in the text. This problem is being
corrected for all future printings. Please accept our apologies;
this will be fixed as soon as possible.
Applied Survey Data
Analysis is the product born of many years of teaching applied
survey data analysis classes and practical experience analyzing survey data.
We have taught various versions of this course in the ISR/SRC Summer Institute
Program, as part of University of Michigan/CSCAR, and within the Survey Methodology Program at
University of Michigan and University of Maryland. Our goal has been to
integrate teaching materials and practical analysis knowledge into a textbook
geared to a level accessible for graduate students and working analysts who
may have varying levels of statistical and analytic expertise. We
intend to update the materials on this website as statistical and
software improvements emerge with the goal of assisting analysts and researchers
performing survey data analysis.
Information About Authors
Patricia A. Berglund is a Senior Research Associate in
the Survey Methodology Program at the Institute for Social Research.
She has extensive
experience in the use of computing systems for data management and complex
sample survey data analysis. She works on research projects in youth substance
abuse, adult mental health, and survey methodology using data from Monitoring
the Future, the National Comorbidity Surveys, World Mental Health Surveys, Collaborative Psychiatric
Epidemiology Surveys, and various other national and international surveys.
In addition, she is involved in development, implementation, and teaching of
analysis courses and computer training programs at the Survey Research
Center-Institute for Social Research. She also lectures in the SAS® Institute-Business
Knowledge Series. mailto:firstname.lastname@example.org
Steven G. Heeringa is a Research Scientist in the Survey
Methodology Program, the Director of the Statistical and Research Design Group
in the Survey Research Center, and the Director of the Summer Institute in
Survey Research Techniques at the Institute for Social Research. He has over 25
years of statistical sampling experience directing the development of the SRC
National Sample design, as well as sample designs for SRC's major longitudinal
and cross-sectional survey programs. During this period he has been actively
involved in research and publication on sample design methods and procedures
such as weighting, variance estimation, and the imputation of missing data that
are required in the analysis of sample survey data. He has been a teacher of
survey sampling methods to U.S. and international students and has served as a
sample design consultant to a wide variety of international research programs
based in countries such as Russia, the Ukraine, Uzbekistan, Kazakhstan, India,
Nepal, China, Egypt, Iran, and Chile. mailto:email@example.com
Brady T. West is
an Assistant Research Professor in the Survey Methodology Program at the
University of Michigan and an Assistant Research Scientist at the Center for
Statistical Consultation and Research (CSCAR) on the University of Michigan
campus. He earned a PhD in Survey Methodology from the Michigan Program in
Survey Methodology, and also received an MA in Applied Statistics from the
University of Michigan Statistics Department. His primary research
interests revolve around regression models for clustered and longitudinal
data, and he has authored a book, "Linear Mixed Models: A Practical Guide
Using Statistical Software" (www.umich.edu/~bwest/almmussp.html)
comparing different statistical software packages in terms of their mixed
modeling procedures (Chapman Hall/CRC Press, 2007). He specializes in
applications of statistical software and analysis of survey data, and
through CSCAR teaches several yearly short courses on statistical
methodology and software.
Professional Reviews of ASDA
1. Review/Summary of
ASDA from the Stata Bookstore:
Stata Review of ASDA
2. Review posted
5.0 out of 5 stars Simply a Great Book,
December 25, 2010 By Dennis Hanseman (Cincinnati, OH United States)
"Applied Survey Data Analysis (ASDA) is a
crystal-clear survey of modern techniques for analyzing complex survey
data. Note the word "analyzing".
This is not a text on sampling methods per
se. Rather, it is a guide to using existing data sets that result from a
complex survey design that employs weighting, clustering, and
stratification. The authors demonstrate how a correct analysis should be
undertaken. In doing so, they review descriptive statistics, categorical
methods, regression analysis (linear and logistic), survival analysis,
and multiple imputation. Most examples use Stata, but some are in SAS.
The level of mathematical sophistication is
not high, although "theory boxes" are interspersed to add additional
detail. Anyone who is challenged by the mathematical level of this book
probably should not be working with survey data in the first place.
In sum, this is an important -- and very
well written -- contribution to the literature on survey data analysis."
3. Review from International Statistical Review (2010), 78, 3,
445–482. (Page 463 extracted here). 2010 The Authors.
International Statistical Review 2010 International Statistical Institute.
To read this review click here:
Review of ASDA.
4. Review from
"Quantitative Methods Network" Newsletter in the UK:
5. Review from Amazon.com:
5.0 out of 5 stars: "A must-have for anyone analyzing survey data",
(Kristen Olson, Lincoln NE): May 2, 2011.
This review is from:
Applied Survey Data Analysis (Chapman & Hall/CRC Statistics in the
Social and Behavioral Sciences) (Hardcover)
"This book is unique in the extensive
market of books on analysis of survey data. Most data collected on
finite populations are selected with unequal probabilities of selection,
strata and clusters, but most regression textbooks assume a simple
random sample. This is the first full-length textbook that deals with
subclass analyses, categorical data analysis, and various generalized
linear models (from linear regression through hierarchical models) and
complex survey designs at a statistical level accessible to most
graduate students or data analysts. Weight creation and multiple
imputation are also covered.
Readers will not be scared off by 'too many' formulas in this book.
Although formulas are used throughout the book, there is not a great
deal of detailed statistical theory presented; additionally, the 'Theory
Boxes' provide enough information for more statistically inclined
readers to know where to turn for more information. The book is best
used for a more advanced statistical models class, after students have
taken their basic regression/correlation class (and possibly after a
categorical data class). The homework assignments at the end of each
chapter are a useful addition to the text, with the required data sets
available on the book's website. I find that the homework assignments
are best supplemented with additional examples with other data sets,
especially for classes taught repeatedly, but they are a great starting point.
I strongly recommend this book to anyone wanting a 'how to' book for
conducting and interpreting analyses on complex survey data,
supplemented with extensive documentation on model fitting and
diagnostics under a complex survey design (where they are available). It
is immediately useful, with Stata code for all of the analyses provided
in the text and SAS, Stata, MPlus, R, SUDAAN, WesVar, and SPSS code on
6. Review posted on Amazon.com:
5.0 out of 5 stars: from Sophia
this book in preparation for analyzing NHANES data for the first time.
To give an idea of my background: My formal biostatistics training is
quite limited to what I got through experience on different research
projects and through medical school. I use Stata 12 in a very basic way,
and this was the first project using a large dataset that I did my own
analysis for from start to finish. I've worked with weighted survey data
before, but another statistician was doing all the actual coding.
I thought this book was pretty balanced between theory and practical
issues. A lot of the theory was over my head, but certain parts were
extremely illuminating and useful to read through. They do specify at
least two semester of graduate level statistics or something like that
as a prerequisite, but obviously I don't have that and I still found the
book useful. The real value of the book lies in its numerous and
detailed examples. The authors actually use NHANES data (and other
national datasets) to work through their examples. They walk you through
many different types of analyses, include multiple linear regression,
multiple logistic regression, etc. Many of these examples are very
detailed, and they build the whole model step-by-step and explain the
rationale behind each step and decision. The book is extremely well
organized, so that by flipping through the table of contents you can
immediately find the relevant section for what you want to do with the
data, read through the example, and apply the Stata code directly to
your own analysis.
NHANES does have tutorials about how to work through their data;
although I also found those to be useful and essential, I think that
this book is superior because it does give more background about why you
need to run certain types of analyses and tests rather than others.
I can't believe my university library didn't have this book--I think
it's a totally worthwhile purchase for anyone preparing to work with
large national datasets."
7. Link to Chapman Hall Bestsellers List: (see ASDA on the list!):
Link to BestSellers
8. Link to ASDA review from The American Statistician:
book is clearly written and easy to follow, and well equipped with real
data examples and a book web site. The program codes used in the example
are also available, mostly written in Stata. I like the
presentations with real survey examples and, in particular, the unified
four-step approach to the regression analysis in different models.
Anyone working on survey data analysis would find the book very helpful
and instructive. The book website seems to be a good complement, with
additional resources on this book." (Partial review).
9. Link to review from Journal of Statistics, 2011:
data analysts have a good general understanding of the theory and
application of statistical analysis to basic behavioral science data.
However, many analysts do not receive specialized training in the
specific aspects of complex survey design and its
the statistical analysis of survey data. Applied Survey Data Analysis is
a great remedy to fill this gap."
Links to Data
National Comorbidity Survey-Replication (Collaborative Psychiatric Epidemiology Surveys)
(for online documentation tools and data download)
(for NCS-R specific information)
National Health and Nutrition Examination Survey
(National Center for Health Statistics)
Health and Retirement Survey (Institute for Social Research-University of Michigan)
United States Census Bureau
Chapter Exercises Data Sets
These data sets are subsets of the original data and are designed for use with the
chapter exercises in ASDA.
Chapter Exercises Data Sets (Stata and SAS Format)
Chapter Exercises Data Sets (R Format)
Analysis Example Data Sets
These data sets are subsets of the original data and are designed for use with the
analysis examples in ASDA. We have included the raw variables used in
the variable recodes and the constructed variables used in the analysis
examples.
Analysis Examples Data Sets (Stata and SAS Format)
Frequently Asked Questions
This document contains frequently asked questions and brief answers.
This working paper addresses Accounting for Multi-stage Sample Designs in
Complex Sample Variance Estimation by Brady West. Click
here to download:
Multi-Stage Sample Designs
Links to Additional Sites
University of Michigan (ICPSR)
Data Archive http://www.icpsr.umich.edu
Software for Survey Data Analysis
SDA from ICPSR http://www.icpsr.umich.edu
(online analysis system with survey correction capabilities)
Stata - V12 is current as of August 2011
SPSS/PASW - V19 is current as of Summer 2011
SAS - v9.3 is current as of Summer 2011
This section provides key updates to
software for analysis of survey data.
1. Stata v11-Example of new "factor" coding for categorical variables:
Example of Factor Coding in Stata
2. Stata v11.1-Some key updates:
1. Survey estimation commands now support survey bootstrap SEs, with
user-supplied bootstrap replicate weights:
Example of svy bootstrap
2. Survey estimation commands now support successive difference replicate (SDR)
weights, common in data sets supplied by the U.S. Census Bureau:
Example of sdr method
3. Goodness-of-fit (GOF) tests are now available after svy: probit and svy:
logistic:
Example of gof test
4. Estimates of the coefficient of variation (CV) can now be computed:
Example of cv option
3. SAS v9.2-
Example of how to use replicate weights using NHANES data:
Replicate Weights Example
4. Stata v11.0-Example of how to use replicate weights using NHANES data:
Stata Replicate Weights Example
5. Stata v10.1-Code to produce Table 8.4 and Figure 8.3:
Non-Linear Comparisons of Logits
6. SAS v9.2 (TS2M3)-Example of PROC SURVEYPHREG (Cox Model):
PROC SURVEYPHREG Example
7. Stata v11.1-Example of Mediation analysis with survey data and
Stata sgmediation example
8. R-Example of Quantile Regression with Bootstrap Method:
R Quantile Regression Example
9. Stata 11.1-Example of use of mi suite of commands:
Stata 11.1 MI Example
10. SAS v9.22-Example of use of NOMCAR option with PROC SURVEYMEANS:
SAS NOMCAR Example
11. Stata 11.1-Example of use of svy: logistic with estat gof
Stata estat gof Example
12. Example of How to Create a Delimited Text File in SAS and Read Text File in R:
Text File SAS to R Example
13. An Example of Fuller’s (1984) Method for
Testing the Bias of Unweighted Estimates of Regression
Parameters in a Linear Regression Model:
14. SAS code to implement
rank sum test for complex sample survey data:
15. SAS Macro for Difference Between
Means (addition to PROC SURVEYMEANS):
16. SAS Paper with Examples of ODS Graphics and SG Procedures with Examples
of Weighted Frequency Plots:
SAS Paper with
ODS Graphics and SG Procedures Examples
17. Note on How SPSS handles Strata with A Single or "Lonely" PSU:
18. Link to Stata command for calculation of Population Attributable Risk
proportions (user written "punaf" command):
19. Link to information about use of Stata 12.1 with the postestimation
command estat gof after svy: logistic with
20. SAS PROC MI - FCS imputation method with analysis of complex sample
PROC MI FCS Example. Right click here to save SAS data set:
Data set for FCS example
21. SAS v9.3 PROC SURVEYMEANS with RATIO and
DOMAIN statements for Example 5.9:
SAS Example 5.9
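As a rough illustration of the successive difference replicate (SDR) method mentioned in the Stata v11.1 updates above, the following Python sketch computes a replicate-based variance for a weighted mean. It assumes the commonly documented SDR formula, v = (4/R) * sum over r of (theta_r - theta_0)^2, where theta_0 is the full-sample estimate and theta_r is the estimate from the r-th set of replicate weights. The function names and the toy data are hypothetical and are not taken from the book's examples; real SDR weights would come from the data producer.

```python
from typing import List, Sequence, Tuple


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of `values` using `weights`."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def sdr_variance(
    values: Sequence[float],
    full_weights: Sequence[float],
    replicate_weights: List[Sequence[float]],
) -> Tuple[float, float]:
    """Return (full-sample estimate, SDR variance estimate) for a weighted mean.

    Assumes the standard SDR formula: (4 / R) * sum((theta_r - theta_0) ** 2).
    """
    theta_0 = weighted_mean(values, full_weights)
    n_reps = len(replicate_weights)
    squared_diffs = sum(
        (weighted_mean(values, rep) - theta_0) ** 2 for rep in replicate_weights
    )
    return theta_0, 4.0 / n_reps * squared_diffs


# Toy data (hypothetical): four respondents and four sets of replicate weights.
y = [1.0, 2.0, 3.0, 4.0]
w_full = [1.0, 1.0, 1.0, 1.0]
w_reps = [
    [1.7, 1.0, 0.3, 1.0],
    [1.0, 1.7, 1.0, 0.3],
    [0.3, 1.0, 1.7, 1.0],
    [1.0, 0.3, 1.0, 1.7],
]

estimate, variance = sdr_variance(y, w_full, w_reps)
print(estimate, variance)
```

In practice the replicate estimates would be produced by the survey software itself (for example, Stata's svy sdr prefix); the sketch only shows the arithmetic applied to them.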
Resources for Analysis of Survey Data
University of Michigan
Institute for Social Research-Summer Institute
IVEware (Imputation and Variance Estimation software)
ICPSR summer institute
Center for Statistical Consulting and Research
University of California-Los Angeles
Survey Data Analysis
University of North Carolina-Chapel Hill
American Statistical Association
Survey Data Analysis Publications
This section is designed to
provide information about key updates in publications regarding Survey Data
analysis. We will add to the list as new publications emerge.
1. Carle, A.C.,
Fitting multilevel models in complex survey data with design weights:
Recommendations, BMC Medical Research Methodology, 1471-2288-9-49, 2009. http://www.biomedcentral.com/1471-2288/9/49
Multilevel models (MLM) offer complex survey
data analysts a unique approach to understanding individual and contextual
determinants of public health. However, little summarized guidance exists
with regard to fitting MLM in complex survey data with design weights.
Simulation work suggests that analysts should scale design weights using two
methods and fit the MLM using unweighted and scaled-weighted data. This
article examines the performance of scaled-weighted and unweighted analyses
across a variety of MLM and software programs.
2. Lumley, T.S., Complex Surveys: a guide to
analysis using R,
John Wiley & Sons,
New York, 2010.
A complete guide to carrying out complex
survey analysis using R. As survey analysis continues to serve as a
core component of sociological research, researchers are increasingly
relying upon data gathered from complex surveys to carry out traditional
analyses. Complex Surveys is a practical guide to the analysis of this kind
of data using R, the freely available and downloadable statistical
3. Liao, Dan., Collinearity Diagnostics for
Complex Survey Data. Dissertation submitted to the Faculty of the
Graduate School of the University of Maryland, College Park, Maryland,
4. Asparouhov, T. & Muthen, B. (2006).
Multilevel modeling of complex survey data. Proceedings of the Joint
Statistical Meeting in Seattle, August 2006. ASA section on Survey Research
Methods, 2718-2726. Paper can be downloaded from
5. Berglund, Patricia, (2010).
An Introduction to Multiple
Imputation of Complex Sample Data Using SAS v9.2, SAS Global Forum 2010,
Paper 265-2010. Paper can be downloaded from
6. Kolenikov, S.,
Resampling Variance Estimation for Complex Survey Data,
Stata Journal, sj10-2:
7. Valliant, R., The Effect of Multiple
Weighting Steps on Variance Estimation, Journal of Official Statistics, Vol.
20, No. 1, 2004, pp. 1–18.
Multiple weight adjustments are common
in surveys to account for ineligible units on a frame, nonresponse by
some units, and the use of auxiliary data in estimation. A practical
question is whether all of these steps need to be accounted for when
estimating variances. Linearization variance estimators and related
estimators in commercial software packages that use squared residuals
usually account only for the last step in estimation, which is the
incorporation of auxiliary data through poststratification, regression
estimation, or similar methods. Replication variance estimators can
explicitly account for all of the steps in estimation by repeating each
adjustment separately for each replicate subsample. Through simulation,
this article studies the difference in these methods for some specific
sample designs, estimators of totals, and rates of ineligibility and
nonresponse. In the simulations reported here, the linearization
variance estimators are negatively biased and produce confidence
intervals for a population total that cover at less than the nominal
rate, especially at smaller sample sizes. The jackknife replication
estimator generally yields confidence intervals that cover at or above
the nominal rate but do so at the expense of considerably overestimating
empirical mean squared errors. A leverage-adjusted variance estimator,
which is related to the jackknife estimator, has small positive bias and
nearly nominal coverage. The leverage-adjusted estimator is less
computationally burdensome than the jackknife but works well in the
situations studied here where multiple weighting steps are used.
8. Valliant, R. and Rust, K.F., Degrees of Freedom Approximations and
Rules-of-Thumb, Journal of Official Statistics, Vol. 26, No. 4, 2010, pp.
In complex samples, t-distributions are used when performing hypothesis tests and
constructing confidence intervals. Rules-of-thumb are typically used to approximate
the degrees of freedom for the t-distributions. The standard rule is to set the
degrees of freedom equal to the number of primary sampling units minus the number
of strata. We illustrate some cases where these rules can be poor. A simple estimate
of degrees of freedom is presented that leads to improved confidence interval coverage.
9. Brumback, B. and He, Z., The
Mantel–Haenszel estimator adapted for complex survey designs is not dually consistent,
Statistics & Probability Letters
Volume 81, Issue 9, September 2011, Pages 1465-1470.
10. Brumback, B. and He, Z., Adjusting
for confounding by neighborhood using complex survey data, Statistics
, Issue 9,
pages 965–972, 30
11. Liao, D. (2011). Variance Inflation
Factors in the Analysis of Complex Survey Data. Paper presented at the 2011
Joint Statistical Meetings, Miami Beach, FL. Currently under review for
publication in Survey Methodology.
12. Li, J. and Valliant, R.,
Influence Diagnostics for Unclustered Survey Data,
Diagnostics for linear regression models have largely been
developed to handle nonsurvey data. The models and the sampling plans
used for finite populations often entail stratification, clustering, and
survey weights. In this article we adapt some influence diagnostics that
have been formulated for ordinary or weighted least squares for use with
unclustered survey data. The statistics considered here include DFBETAS,
DFFITS, and Cook’s D. The differences in the performance of ordinary
least squares and survey-weighted diagnostics are compared in an
empirical study where the values of weights, response variables, and
covariates vary substantially.
Complex sample, Cook’s D, DFBETAS, DFFITS, influence, outlier,
13. Wagstaff, D.A. and Harel, O., A Closer
Examination of Three Small-Sample Approximations to the Multiple-Imputation
Degrees of Freedom.
The Stata Journal (2011) 11,
Number 3, pp. 403–419.
14. Binder, D.A., ESTIMATING MODEL PARAMETERS
FROM A COMPLEX SURVEY UNDER A MODEL-DESIGN RANDOMIZATION FRAMEWORK, Pak. J.
Statist., 2011 Vol. 27(4), 371-390.
Link to Paper
When an analyst faces the problem of estimating model parameters to data
from a complex survey, one of the first questions he often asks is
whether or not to use the survey weights. The appropriate question to
ask, however, is whether the survey design information itself is
relevant, and if so, how should it be incorporated in the analysis. The
debate between the design-based and the model-based schools for making
inferences on model parameters can be explained and clarified using a
model-design randomization framework to describe how the observations
for the sampled units have been obtained.
Complex survey data; Design-based inference; Model-design-based
framework; Informativeness; Ignorability.
15. Li, J.
DETECTING GROUPS OF INFLUENTIAL
OBSERVATIONS IN LINEAR REGRESSION USING SURVEY DATA—ADAPTING THE
FORWARD SEARCH METHOD, Pak. J. Statist. 2011 Vol. 27(4), 507-528.
Link to Paper
Forward search is an effective and efficient approach when analyzing non-survey
data to detect a group of influential observations which affect
regression estimates greatly if they were removed from the model
fitting. It has the advantages of avoiding masked
the outliers, as well as automatically identifying influential points.
Compared to multiple-case deletion diagnostic statistics, this method
reduces computational burden, especially when the dataset is very large.
In this research we adapted the forward search to linear regression
diagnostics for some types of complex survey data. While keeping the
existing advantages of this method, we incorporate sample weights and
the effects of stratification. A case study is conducted to illustrate
the advantages of the adapted method.
diagnostics for survey data, influence, linear regression, outliers,
16. Various authors, Journal of Statistical Software, Vol. 45, Issue 1-7, Dec 2011.
Various articles on multiple imputation are included in this volume.
This issue of the Journal of Statistical Software has several articles devoted to
multiple imputation, including implementations in R, SAS, and Stata.
There is also an article devoted to imputation in multilevel structures.
Click here for more information and links to articles:
17. Mplus Notes area with many articles about
survey data analysis:
18. Kott, P. and Liao, D. Providing double protection for unit nonresponse with
a nonlinear calibration-weighting routine, Survey Research Methods (2012)
No.2, pp. 105-111. Link to paper:
19. Sundar Natarajan, Stuart R. Lipsitz, Garrett M. Fitzmaurice, Debajyoti Sinha,
Joseph G. Ibrahim, Jennifer Haas, Walid Gellad, An Extension of the Wilcoxon
rank sum test for complex sample survey data. Journal of the Royal Statistical
Society: Series C
61, Issue 4, pages 653-664, August 2012.
20. Czaplewski, Raymond L.
2010. Complex sample survey estimation in static state-space. Gen. Tech. Rep.
RMRS-GTR-239. Fort Collins, CO: U.S. Department of
Agriculture, Forest Service, Rocky Mountain Research Station. 124 p.
21. Czaplewski, Raymond L. 2010. Recursive restriction estimation: an alternative
to post-stratification in surveys of land and forest cover. Res. Pap.
RMRS-RP-81. Fort Collins, CO: U.S. Department of Agriculture, Forest Service,
Rocky Mountain Research Station. 32 p.
22. Owen, A., and Eckles, D. Bootstrapping data arrays
of arbitrary order. Annals of Applied Statistics, Volume 6, Number 3 (2012),
895-927. Available from
23. A. Veiga, P. W. F. Smith and J. J. Brown,
The use of sample weights in multivariate multilevel models with an application
to income data collected by using a rotating panel survey. Forthcoming in
the Journal of the Royal Statistical Society, 2013. Link to paper:
Veiga et al
Summary. Longitudinal data from labour force surveys permit the investigation
of income dynamics at the individual level. However, the data often originate
from surveys with a multistage sampling scheme. In addition, the hierarchical
structure of the data that is imposed by the different stages of the sampling
scheme often represents the natural grouping in the population. Motivated by
how income dynamics differ between the formal and informal sectors of the
Brazilian economy and the data structure of the Brazilian Labour Force Survey,
we extend the probability-weighted iterative generalized least squares
estimation method. Our method is used to fit multivariate multilevel models to
the Brazilian Labour Force Survey data where the covariance structure between
occasions at the individual level is modelled. We conclude that there are
significant income differentials and that incorporating the weights in the
parameter estimation has some effect on the estimated coefficients and standard
errors.
Keywords: Design weights; Labour force surveys; Longitudinal data;
Non-response weights; Probability-weighted iterative generalized least squares
24. Newson R. Confidence intervals for rank statistics: Somers' D and extensions.
The Stata Journal 2006; 6(3): 309-334. Prepublication draft at:
Somers’ D is an asymmetric measure of association between two variables,
which plays a central role as a parameter behind rank or “non–parametric”
statistical methods. Given a predictor variable X and an outcome variable Y, we
may estimate D_YX as a measure of the effect of X on Y, or we may estimate
D_XY as a performance indicator of X as a predictor of Y. The somersd package
allows the estimation of Somers’ D and Kendall’s tau with confidence limits as
well as P-values. The Stata 9 version of somersd can estimate extended versions
of Somers’ D not previously available, including the Gini index, the parameter
tested by the sign test, and extensions to left– or right–censored data. It can
also estimate stratified versions of Somers’ D, restricted to pairs in the same
stratum. Therefore, it is possible to define strata by grouping values of a
confounder, or of a propensity score based on multiple confounders, and to
estimate versions of Somers’ D which measure the association between the
outcome and the predictor, adjusted for the confounders. The Stata 9 version of
somersd uses the Mata language for improved computational efficiency with large
datasets.
Keywords: st0001, Somers’ D, Kendall’s tau, Harrell’s c, ROC area, Gini index,
population attributable risk, rank correlation, rank–sum test, Wilcoxon test,
sign test, confidence intervals, non–parametric methods, propensity score.
25. Presentation on AIC and BIC
for Survey Data by Thomas Lumley and Alastair Scott:
Link to Presentation
26. T. Lumley and A.J. Scott (2013).
Partial likelihood-ratio tests for the Cox model under complex sampling.
Statistics in Medicine, 32, 110-123.
27. T. Lumley and A.J. Scott (2012).
Fitting GLMs with survey data. Proceedings of the Survey Research Methods
Section, Amer. Statist. Assoc, 5174-5181.
28. T. Lumley and A.J. Scott (2013).
Two-sample rank tests under complex sampling. Biometrika, 100, to appear
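The rule of thumb quoted in the Valliant and Rust entry above (degrees of freedom equal to the number of primary sampling units minus the number of strata) is simple arithmetic, and a short Python sketch may make it concrete. The function name and the toy design below are hypothetical; the sketch assumes PSU identifiers are nested within strata, so a PSU is identified by its (stratum, PSU) pair. It implements only the standard rule, not the improved estimate proposed in that paper.

```python
from typing import Hashable, Iterable, Tuple


def design_degrees_of_freedom(
    stratum_psu_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> int:
    """Standard rule of thumb: (number of PSUs) - (number of strata).

    Each element is the (stratum, psu) pair for one sampled case; PSU codes
    are assumed to be nested within strata, so pairs are de-duplicated.
    """
    unique_pairs = set(stratum_psu_pairs)
    strata = {stratum for stratum, _ in unique_pairs}
    return len(unique_pairs) - len(strata)


# Toy design (hypothetical): stratum 1 has 2 PSUs, stratum 2 has 3 PSUs.
cases = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]
print(design_degrees_of_freedom(cases))
```

With 5 PSUs in 2 strata the rule gives 5 - 2 = 3 degrees of freedom, which is the figure survey software typically reports as the design degrees of freedom.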
Please check this link for corrections to