Compiling and Managing NHANES Datasets

Last Updated on October 13, 2020 by Ayla Myrick

The National Center for Health Statistics (NCHS) has carried out multiple national studies that measure a variety of health aspects and include data on children and adults. These studies include the National Health and Nutrition Examination Surveys (NHANES), which have been ongoing since the 1970s.

The data collected in these surveys are massive: many of the measures have been assessed at multiple waves of data collection, and in multiple ways. When a researcher has found a health variable of interest (for example, smoking), the researcher has choices to make: How will I study smoking? Will I count the number of packs a person has smoked at this wave, or will I think about lifetime smoking habits? Am I concerned with smoking cigarettes only, or am I also interested in whether a person smokes cigars? Will I measure smoking in terms of the number of cigarettes smoked daily, weekly, or yearly, or will I use pack-years? The answers to these sorts of questions lie in the interests of the researcher as well as in the supporting documentation from the current literature. However, when it comes to the analyses, the researcher must also consider how these data will be managed, combined, cleaned, and coded for final use.
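
To make the choice of measure concrete, here is a minimal sketch of the usual pack-years calculation (packs smoked per day multiplied by years smoked). The function name, the 20-cigarettes-per-pack default, and the example values are assumptions for illustration; they are not NHANES variable names or NHANES data.

```python
def pack_years(cigarettes_per_day, years_smoked, cigarettes_per_pack=20):
    """Packs smoked per day multiplied by the number of years smoked."""
    packs_per_day = cigarettes_per_day / cigarettes_per_pack
    return packs_per_day * years_smoked


# Two hypothetical respondents with different smoking histories.
light_long = pack_years(cigarettes_per_day=10, years_smoked=30)
heavy_short = pack_years(cigarettes_per_day=20, years_smoked=15)

print(light_long)   # 15.0
print(heavy_short)  # 15.0
```

Both respondents end up with the same pack-years even though their daily consumption differs, which is one reason the choice between a current-use measure and a lifetime measure should follow from the research question.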

NHANES documents the contents of its files particularly clearly. To access the NHANES files, the user can go to the NHANES site and examine the questionnaires and codebooks to determine which variables to download and examine.

The main steps in compiling and managing data from NHANES are:

  1. Determine the measures of interest and download the data files containing them.
  2. Evaluate whether to combine data from multiple years to increase statistical power. (Note: questions may be worded differently between years, so data from each year must be carefully examined, with particular attention to the missing-data codes.)
  3. Determine what weights are appropriate for the data. Because NHANES collects nationally representative data, the NHANES researchers have created weights that account for sampling characteristics (such as disproportionate sampling and nonresponse). They MUST be used in analyses to make proper inferences from the data!
  4. Select the variables of interest from the respective downloaded datasets, and create a smaller unified dataset with all measures of interest to the researcher. There are many files available from NHANES, and carrying many variables with many observations can make analyses more time-consuming. Selecting the specific variables to analyze keeps the working dataset manageable.
  5. Clean the data for all measures to be used (this includes recoding missing data to the same code across multi-year variables and ensuring the population sizes are as expected).
  6. Now your data is ready to be set up to examine the relationships between the measures you have extracted from NHANES!
  7. Remember to clean and recode your data whenever necessary. Uncleaned data can misrepresent what was measured, which makes your results wrong and perhaps your analyses inappropriate.
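
The steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a definitive workflow: SEQN is the NHANES respondent identifier, but every other name (`smoker`, `wt2yr`, `wt_combined`), the missing-data codes 7 and 9, and all values are invented for the example. Dividing each two-year weight by the number of cycles combined is a commonly described approach for combined cycles; confirm it against the analytic guidance for the specific years you use. The sketch covers a weighted point estimate only and does not address variance estimation.

```python
# Hypothetical extracts from two survey cycles, already reduced to the
# variables of interest (step 4). All values are made up.
cycle_a = [
    {"SEQN": 1, "smoker": 1, "wt2yr": 1000.0},
    {"SEQN": 2, "smoker": 2, "wt2yr": 3000.0},
    {"SEQN": 3, "smoker": 9, "wt2yr": 2000.0},  # 9 = "don't know" (assumed code)
]
cycle_b = [
    {"SEQN": 4, "smoker": 1, "wt2yr": 2000.0},
    {"SEQN": 5, "smoker": 7, "wt2yr": 1500.0},  # 7 = "refused" (assumed code)
    {"SEQN": 6, "smoker": 2, "wt2yr": 2000.0},
]

MISSING_CODES = {7, 9}  # assumption: check each cycle's codebook


def recode_missing(rows, variable, missing_codes):
    """Return copies of the rows with missing-data codes replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        if new_row[variable] in missing_codes:
            new_row[variable] = None
        cleaned.append(new_row)
    return cleaned


def combine_cycles(cycles, weight_var):
    """Stack cycles and rescale each weight by the number of cycles combined."""
    n_cycles = len(cycles)
    combined = []
    for label, rows in cycles.items():
        for row in rows:
            new_row = dict(row)
            new_row["cycle"] = label
            new_row["wt_combined"] = new_row[weight_var] / n_cycles
            combined.append(new_row)
    return combined


def weighted_proportion(rows, variable, value, weight_var):
    """Weighted share of non-missing rows where variable equals value."""
    valid = [r for r in rows if r[variable] is not None]
    total = sum(r[weight_var] for r in valid)
    hits = sum(r[weight_var] for r in valid if r[variable] == value)
    return hits / total


# Steps 2, 3 and 5: stack the cycles, adjust the weights, recode missing data.
combined = combine_cycles({"A": cycle_a, "B": cycle_b}, weight_var="wt2yr")
combined = recode_missing(combined, "smoker", MISSING_CODES)

# Compare the weighted estimate with the unweighted share of smokers (code 1).
valid = [r for r in combined if r["smoker"] is not None]
unweighted = sum(r["smoker"] == 1 for r in valid) / len(valid)
weighted = weighted_proportion(combined, "smoker", 1, "wt_combined")

print(unweighted)  # 0.5
print(weighted)    # 0.375
```

In this toy data the unweighted and weighted shares differ (0.5 versus 0.375), which illustrates why the weights cannot be skipped. Leaving the codes 7 and 9 in place would also have counted those two respondents as valid non-smokers in the denominator.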

Notes of caution: Just because a researcher finds it of interest to study a particular health outcome does not mean the health outcome has been MEASURED in a way that is APPROPRIATE for the research question. Each user has to take special care to ensure that the measures and research questions are compatible in order to have useful and informative analyses.
