General Data Notes
CARDIA Public Release, Version A.9
This release contains data from Years 0-30 of the CARDIA Study. In their original form, these data were distributed as Versions 1.4, 2.2, 3.1, 4.1, 5.1, 6.1, 7.3, 8.3, and 9.1. Links to the general notes concerning each of the original data releases are available at the bottom of this page. Datasets are organized by data collection form number. Form numbers are assigned based on content and are used throughout the study, but the format of a form and the items it contains may change over time. For example, Form 2 is the Blood Pressure Form, but its format changed between Year 0 and Year 2.
Naming Conventions
All dataset names consist of 5-8 characters. The first character indicates the exam during which the data were collected (A=1, B=2, C=3, D=4, E=5, F=6, G=7, H=8, I=9). The second character indicates the version number. All public release versions are indicated by a letter; for this version, that is 'A'. For data collected via a form designed by the study, the remaining characters indicate the form number (Fxx, where xx=form number). For data from a laboratory or reading center, the remaining characters of the dataset name are descriptive of the contents.
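The dataset naming rule can be sketched as a small parser. This is an illustrative sketch only, not part of the CARDIA distribution: the example names `AAF02` and `GALIPID` are hypothetical, the mapping from exam number to exam year is an assumption inferred from the exam years listed in Table 1, and treating any name whose third character is `F` followed by two digits as form-based is a heuristic.

```python
import re
from typing import NamedTuple, Optional

# Exam letter -> exam number, as listed in the naming convention (A=1 ... I=9).
EXAM_NUMBER = {letter: i for i, letter in enumerate("ABCDEFGHI", start=1)}

# ASSUMPTION: exams 1-9 correspond, in order, to the exam years in Table 1.
EXAM_YEAR = dict(zip("ABCDEFGHI", (0, 2, 5, 7, 10, 15, 20, 25, 30)))


class DatasetName(NamedTuple):
    exam_number: int
    exam_year: int               # inferred, see assumption above
    version: str
    form_number: Optional[int]   # None for laboratory / reading-center data
    descriptor: Optional[str]    # descriptive remainder, if any


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into exam, version, and form or descriptor."""
    name = name.strip().upper()
    if not 5 <= len(name) <= 8:
        raise ValueError(f"expected 5-8 characters, got {len(name)}: {name!r}")
    exam, version, rest = name[0], name[1], name[2:]
    if exam not in EXAM_NUMBER:
        raise ValueError(f"unknown exam letter: {exam!r}")
    # HEURISTIC: 'F' plus two digits is read as a form number (Fxx).
    match = re.match(r"F(\d{2})(.*)", rest)
    if match:
        form_number = int(match.group(1))
        descriptor = match.group(2) or None
    else:
        form_number, descriptor = None, rest
    return DatasetName(EXAM_NUMBER[exam], EXAM_YEAR[exam], version,
                       form_number, descriptor)


# Hypothetical names, used only to exercise the parser.
print(parse_dataset_name("AAF02"))
print(parse_dataset_name("GALIPID"))
```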
Variable names consist of 5-8 characters for Years 0-15 and are slightly longer for Year 20, especially for genetic data variables. As with datasets, the first character indicates the exam during which the data were collected. Characters 2-3 indicate the form number or the laboratory from which the data were obtained. The remaining characters are descriptive of the contents. Variables that were collected in more than one exam are named identically or similarly except for the first character.
Documentation
For each dataset, 3 types of documentation are available. First, for data collected via forms, the form is available as a PDF file. For each item on a form, the corresponding SAS variable name in the distributed data sets is listed. These variable names did not appear on the form during data collection.
Second, new documentation regarding the publicly released version is available as a PDF file. Notes concerning revisions to the dataset made for the public release are contained therein, as well as a PROC CONTENTS listing and the SAS program used to generate the dataset.
Finally, the original documentation for the internal study version is available as a PDF file. Notes concerning computed variables, problems in the dataset, and information for longitudinal analyses are contained therein, as well as information about data edits, original contents, original SAS program, and the original data dictionary.
Medical history and medication follow-up forms - For several medical conditions, detailed information is desired. An affirmative answer to items on the medical history questionnaire (Form 8) prompts completion of these follow-up forms. The follow-up forms are all designated as Form 9, with a descriptive subtitle. Please refer to the forms for the specific skip patterns followed.
Revisions for Public Release
For public release, some modifications were made to protect the anonymity of the participants. To maintain consistency, rules were developed for variable transformations based on variable type (e.g., dichotomous, continuous). Rules were applied within the 4 race-gender cells. Table 1 provides the numbers of participants in each of these cells across all Field Center populations.
Table 1. Frequency and percent of participants by race and gender at each exam, CARDIA, September 2017
| Exam Year | BF N | BF % | BM N | BM % | WF N | WF % | WM N | WM % | GC N | GC % | Total N | Total % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1480 | 28.9 | 1157 | 22.6 | 1307 | 25.6 | 1171 | 22.9 | 0 | 0.0 | 5115 | 100.0 |
| 2 | 1298 | 28.1 | 988 | 21.4 | 1236 | 26.7 | 1102 | 23.8 | 0 | 0.0 | 4624 | 100.0 |
| 5 | 1214 | 27.9 | 905 | 20.8 | 1179 | 27.1 | 1054 | 24.2 | 0 | 0.0 | 4352 | 100.0 |
| 7 | 1143 | 28.0 | 831 | 20.3 | 1106 | 27.0 | 1006 | 24.6 | 0 | 0.0 | 4086 | 100.0 |
| 10 | 1120 | 28.4 | 806 | 20.4 | 1072 | 27.1 | 950 | 24.1 | 2 | 0.1 | 3950 | 100.0 |
| 15 | 1021 | 27.8 | 709 | 19.3 | 1030 | 28.1 | 911 | 24.8 | 1 | 0.0 | 3672 | 100.0 |
| 20 | 1005 | 28.3 | 646 | 18.2 | 1008 | 28.4 | 889 | 25.1 | 1 | 0.0 | 3549 | 100.0 |
| 25 | 986 | 28.2 | 654 | 18.7 | 994 | 28.4 | 863 | 24.7 | 1 | 0.0 | 3498 | 100.0 |
| 30 | 957 | 28.5 | 648 | 19.3 | 956 | 28.5 | 796 | 23.7 | 1 | 0.0 | 3358 | 100.0 |
Note. Strata are noted with a two-character abbreviation. The first character represents Race (Black or White) and the second character represents Gender (Male or Female). GC: Gender change (female to male).
Transformations to all datasets. Some revisions were made to all datasets in the current release. The original CARDIA ID, which had information about Field Center embedded in it, was replaced with a randomly generated ID for each individual. The new variable, PID, can be used to merge datasets so that each participant's data are correctly matched. In addition, the variable CENTER was deleted. Finally, 2 participants who had sex change operations during the course of the study were deleted from all datasets because some of their data, particularly chemistries, may be unreliable due to hormonal changes.
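As a rough illustration of merging on PID, the sketch below joins two small in-memory tables, represented as lists of dictionaries. Only the key name `PID` comes from the notes above; the other variable names (`A02SBP`, `B02SBP`) and all values are hypothetical placeholders that merely follow the naming convention.

```python
def merge_on_pid(left, right):
    """Inner-join two lists of row dicts on the shared PID key."""
    right_by_pid = {row["PID"]: row for row in right}
    merged = []
    for row in left:
        match = right_by_pid.get(row["PID"])
        if match is not None:
            # Combine the columns from both datasets for this participant.
            merged.append({**row, **match})
    return merged


# Hypothetical rows: one variable from an exam-A dataset, one from exam B.
year0 = [{"PID": 101, "A02SBP": 110}, {"PID": 102, "A02SBP": 124}]
year2 = [{"PID": 102, "B02SBP": 120}, {"PID": 103, "B02SBP": 131}]

print(merge_on_pid(year0, year2))
```

Only participants present in both inputs are kept here. An analysis that needs everyone seen at either exam would use an outer join instead.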
Transformation Rules. The following rules were applied to individual variables:
a. Variables with Inherent Ability to Identify Individuals
Those variables that are judged to inherently identify individuals are not included in the data set. Examples are variables containing information regarding name and birth date.
b. Variables with Inherent Ability to Identify Field Center
Variables that inherently contain information about field center are either deleted or recoded. Examples are variables such as technician ID or machine ID.
c. Date Variables
All dates (except birth date, which is deleted) are recoded to the number of months relative to the Year 0 (Baseline) Examination date. This rule is intended to retain the chronological order of events while obscuring the actual calendar time. An illustration is provided below:
| Variable | Date | Months Since Baseline Exam |
|---|---|---|
| Baseline exam | September 21, 1985 | 0 |
| Pregnancy delivery | October 31, 1983 | -23 |
| Hospitalization | July 5, 1996 | 106 |
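The notes do not state how partial months are rounded. The sketch below is one possible reading, not the study's documented method: it assumes the recode is the difference in whole calendar months, ignoring the day of the month. Under that assumption it reproduces the -23 shown for the pregnancy delivery row. The function name is hypothetical.

```python
from datetime import date


def months_since_baseline(event: date, baseline: date) -> int:
    """Whole calendar months from baseline to event (negative if earlier).

    ASSUMPTION: day of month is ignored; the source does not give the
    exact rounding rule.
    """
    return (event.year - baseline.year) * 12 + (event.month - baseline.month)


baseline = date(1985, 9, 21)
print(months_since_baseline(baseline, baseline))
print(months_since_baseline(date(1983, 10, 31), baseline))
```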
d. Dichotomous and Polychotomous Variables
For dichotomous and polychotomous variables identified as needing modification, we included without alteration those variables for which 20 or more participants are represented in each response category within a given race-gender stratum. If one or more categories had fewer than 20 responses, we either combined categories so that none was left with fewer than 20 responses or, failing that, set all responses in that design cell to missing.
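The fallback branch of this rule (setting a whole cell to missing) can be sketched as follows. This is an illustration under stated assumptions, not the SAS program used for the release: records are dictionaries, the stratum is stored under a hypothetical `stratum` key, missing is represented by `None`, and only categories that actually occur in the data are counted. Combining categories is left out because it needs a per-variable judgment.

```python
from collections import Counter


def suppress_small_categories(records, var, stratum_key="stratum", min_count=20):
    """Blank out `var` in every stratum where some observed category
    has fewer than `min_count` responses. Returns new record dicts."""
    counts = {}
    for rec in records:
        if rec[var] is not None:
            counts.setdefault(rec[stratum_key], Counter())[rec[var]] += 1
    # Strata in which at least one observed category is too small.
    flagged = {s for s, c in counts.items() if min(c.values()) < min_count}
    return [
        {**rec, var: None} if rec[stratum_key] in flagged else dict(rec)
        for rec in records
    ]


# Hypothetical data: BF has 25 yes / 25 no; WM has 30 yes / 5 no.
records = (
    [{"stratum": "BF", "smoker": "yes"}] * 25
    + [{"stratum": "BF", "smoker": "no"}] * 25
    + [{"stratum": "WM", "smoker": "yes"}] * 30
    + [{"stratum": "WM", "smoker": "no"}] * 5
)
cleaned = suppress_small_categories(records, "smoker")
print(Counter((r["stratum"], r["smoker"]) for r in cleaned))
```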
e. Continuous Variables
For those continuous variables needing modification, we identified participants with the 20 highest values and those with the 20 lowest values within each of the 4 race-gender cells. The values for these participants were changed to the threshold value used to identify the group. Thus, the 20 participants with the highest values all have their data set to the 21st-highest value. A similar transformation was done for the low values. For cells with fewer than 40 non-missing values, all values were set to missing.
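A minimal sketch of this top- and bottom-coding for a single race-gender cell follows. It is an illustration rather than the release's actual program: missing values are assumed to be `None`, and the function name and the parameter `k` are hypothetical (`k=20` matches the rule described above). In practice the function would be applied separately to each of the 4 cells.

```python
def top_bottom_code(values, k=20):
    """Recode the k lowest values to the (k+1)-th lowest and the k highest
    values to the (k+1)-th highest. If fewer than 2*k values are
    non-missing, every value in the cell is set to missing (None)."""
    present = sorted(v for v in values if v is not None)
    if len(present) < 2 * k:
        return [None] * len(values)
    low = present[k]         # (k+1)-th value from the bottom
    high = present[-k - 1]   # (k+1)-th value from the top
    coded = []
    for v in values:
        if v is None:
            coded.append(None)
        elif v < low:
            coded.append(low)
        elif v > high:
            coded.append(high)
        else:
            coded.append(v)
    return coded


# Small demonstration with k=2 so the effect is easy to see.
print(top_bottom_code([1, None, 2, 3, 4, 5, 6, 7], k=2))
```

Because the extreme values are pulled in to a threshold rather than removed, ranks within a cell are preserved up to ties, but means and variances computed from the public data will differ somewhat from those in the internal version.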
f. Character Variables
Some variables have been deleted as they contain confidential information such as initials, place of birth, and reason for hospitalization.
g. Time Variables
No transformations were made as times are recorded only as values on the 24-hour clock and contain no information about date.
h. Special Coding
Some variables received special coding according to specific rules that do not fall into any of the previous categories. One such example is age. At baseline, values less than 18 were recoded to 18 and values greater than 30 were recoded to 30. Another example is number of children or number of siblings. For these types of variables, transformations were done for non-0 values. Counts of pregnancies or children greater than four were recoded to four. Details of these special coding transformations are contained in the new documentation section for each dataset.
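The two examples above amount to simple clamping. The sketch below illustrates them with hypothetical function names. It treats `None` as missing and passes it through unchanged, which is an assumption. The authoritative rules for each variable are in the dataset-specific documentation.

```python
def recode_baseline_age(age):
    """Clamp baseline age to the range 18-30, leaving missing as None."""
    if age is None:
        return None
    return min(max(age, 18), 30)


def cap_count(n, cap=4):
    """Recode counts above `cap` (e.g. pregnancies, children) to `cap`."""
    if n is None:
        return None
    return min(n, cap)


print([recode_baseline_age(a) for a in (17, 18, 25, 30, 32, None)])
print([cap_count(n) for n in (0, 1, 4, 5, 9, None)])
```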
Datasets Judged Too Sensitive to Distribute
Some datasets from the original CARDIA release were not included in this release. The following details these datasets and the reason for withholding distribution.
- Illicit drug use (all years) - judged too sensitive to distribute
- Year 2 medical history follow-up questions for hysterectomy, liver disease, and vasectomy - too few records
- Year 2 lipids and lipoproteins - judged to be unreliable
- Year 7 GXT - judged unreliable at one center
- Year 10 return blood draw and medical history follow-up questions for angiograms and MRI scans - too few records
- Year 10 EBCT pilot - reliability still undetermined
General Notes from Original Documentation