General Data Notes

 

CARDIA Public Release, Version A.9

This release contains data from Years 0-30 of the CARDIA Study. In their original form, these data were distributed as Versions 1.4, 2.2, 3.1, 4.1, 5.1, 6.1, 7.3, 8.3 and 9.1. Links to the general notes concerning each of the original data releases are available at the bottom of this page. Datasets are organized by data collection form number. Form numbers are assigned based on content and are used throughout the study, but the format of a form and the items it contains may change over time. For example, Form 2 is the Blood Pressure Form, but its format changed from Year 0 to Year 2.

Naming Conventions

All dataset names consist of 5-8 characters. The first character indicates the exam during which the data were collected (A=1, B=2, C=3, D=4, E=5, F=6, G=7, H=8, I=9). The second character indicates the version number. All public release versions are indicated by a letter; for this version, that is 'A'. For data collected via a form designed by the study, the remaining characters indicate the form number (Fxx, where xx=form number). For data from a laboratory or reading center, the remaining characters of the dataset name are descriptive of the contents.
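The naming rule above can be sketched in code. This is a minimal illustration, not part of the release: the function name `parse_dataset_name` and the example names `AAF02` and `CALIPID` are hypothetical, and the mapping from exam number to exam year assumes that exams 1-9 correspond, in order, to the exam years listed in Table 1.

```python
import re

# Exam letter -> exam number, as given in the naming conventions (A=1 ... I=9).
EXAM_NUMBER = {letter: number for number, letter in enumerate("ABCDEFGHI", start=1)}

# Exam number -> exam year. Assumption: exams map, in order, onto the exam
# years listed in Table 1.
EXAM_YEAR = dict(zip(range(1, 10), [0, 2, 5, 7, 10, 15, 20, 25, 30]))


def parse_dataset_name(name):
    """Split a dataset name into exam, version letter, and form or description."""
    name = name.strip().upper()
    if not 5 <= len(name) <= 8:
        raise ValueError(f"expected 5-8 characters, got {name!r}")
    exam_letter, version, rest = name[0], name[1], name[2:]
    if exam_letter not in EXAM_NUMBER:
        raise ValueError(f"unknown exam letter in {name!r}")
    exam = EXAM_NUMBER[exam_letter]
    # Fxx marks a study-designed form; anything else is treated as a
    # descriptive laboratory or reading-center name.
    form = re.fullmatch(r"F(\d+)", rest)
    return {
        "exam": exam,
        "year": EXAM_YEAR[exam],
        "version": version,
        "form_number": int(form.group(1)) if form else None,
        "description": None if form else rest,
    }


# Hypothetical names, used only to exercise the function.
print(parse_dataset_name("AAF02"))
print(parse_dataset_name("CALIPID"))
```

A descriptive name that happens to consist of "F" followed by digits would be read as a form number by this sketch, so the result should be checked against the dataset documentation.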

Variable names consist of 5-8 characters for Years 0-15 and are slightly longer for Year 20, especially for genetic data variables. As with datasets, the first character indicates the exam during which the data were collected. Characters 2-3 indicate the form number or the laboratory from which the data were obtained. The remaining characters are descriptive of the contents. Variables that were collected in more than one exam are named identically or similarly except for the first character.
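As a rough sketch of how this convention might be used when assembling longitudinal data, the helper below swaps the exam letter to propose the name of the same item at another exam. The function names and the variable name `A02SBP` are hypothetical. Because names across exams are only "identical or similar", the result is a candidate to be checked against the PROC CONTENTS listing, not a guaranteed match.

```python
def split_variable_name(name):
    """Split a variable name into exam letter, form/lab code, and description."""
    return {
        "exam_letter": name[0],   # exam during which the data were collected
        "source": name[1:3],      # form number or laboratory code
        "description": name[3:],  # descriptive remainder
    }


def candidate_name_for_exam(name, exam_letter):
    """Propose the name of the same item at another exam by swapping the first character."""
    return exam_letter + name[1:]


# Hypothetical variable name, for illustration only.
print(split_variable_name("A02SBP"))
print(candidate_name_for_exam("A02SBP", "B"))
```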

Documentation

For each dataset, 3 types of documentation are available. First, for data collected via forms, the form is available as a PDF file. For each item on a form, the corresponding SAS variable name in the distributed datasets is listed. These variable names did not appear on the form during data collection.

Second, new documentation regarding the publicly released version is available as a PDF file. Notes concerning revisions to the dataset made for the public release are contained therein, as well as a PROC CONTENTS listing and the SAS program used to generate the dataset.

Finally, the original documentation for the internal study version is available as a PDF file. Notes concerning computed variables, problems in the dataset, and information for longitudinal analyses are contained therein, as well as information about data edits, original contents, original SAS program, and the original data dictionary.

Medical history and medication follow-up forms - For several medical conditions, detailed information is desired. An affirmative answer to items on the medical history questionnaire (Form 8) prompts completion of these follow-up forms. The follow-up forms are all designated as Form 9, with a descriptive subtitle. Please refer to the forms for the specific skip patterns followed.

Revisions for Public Release

For public release, some modifications were made to protect the anonymity of the participants. To maintain consistency, rules were developed for variable transformations based on variable type (e.g., dichotomous, continuous). Rules were applied within the 4 race-gender cells. Table 1 provides the numbers of participants in each of these cells across all Field Center populations.

Table 1. Frequency and percent of participants by race and gender at each exam, CARDIA, September 2017

       Strata
Exam   BF            BM            WF            WM            GC         Total
Year   N      %      N      %      N      %      N      %      N    %     N      %
0      1480   28.9   1157   22.6   1307   25.6   1171   22.9   0    0.0   5115   100.0
2      1298   28.1   988    21.4   1236   26.7   1102   23.8   0    0.0   4624   100.0
5      1214   27.9   905    20.8   1179   27.1   1054   24.2   0    0.0   4352   100.0
7      1143   28.0   831    20.3   1106   27.0   1006   24.6   0    0.0   4086   100.0
10     1120   28.4   806    20.4   1072   27.1   950    24.1   2    0.1   3950   100.0
15     1021   27.8   709    19.3   1030   28.1   911    24.8   1    0.0   3672   100.0
20     1005   28.3   646    18.2   1008   28.4   889    25.1   1    0.0   3549   100.0
25     986    28.2   654    18.7   994    28.4   863    24.7   1    0.0   3498   100.0
30     957    28.5   648    19.3   956    28.5   796    23.7   1    0.0   3358   100.0

Note. Strata are noted with a two-character abbreviation. The first character represents Race (Black or White) and the second character represents Gender (Male or Female). GC: Gender change (female to male).

Transformations to all datasets. Some revisions were made to all datasets in the current release. The original CARDIA ID, which had information about Field Center embedded in it, was replaced with a randomly generated ID for each individual. The new variable, PID, can be used to merge datasets so that individual participants' data are correctly matched. In addition to this change, the variable CENTER was deleted. Finally, 2 participants who had sex change operations during the course of the study were deleted from all datasets because some of their data, particularly chemistries, may be unreliable due to hormonal changes.
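Merging on PID can be sketched as follows. This is an illustration under stated assumptions: the records are already loaded as lists of dictionaries, the function name `merge_on_pid` is hypothetical, and the variable names `A02SBP` and `B02SBP` and all values are invented. Only the key name PID comes from the release.

```python
def merge_on_pid(left_rows, right_rows):
    """Inner-join two lists of records on PID, keeping participants present in both."""
    right_by_pid = {row["PID"]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_pid.get(row["PID"])
        if match is not None:
            # Combine the fields of both records for this participant.
            merged.append({**row, **match})
    return merged


# Invented records for two exams; participant 103 did not attend the second.
exam1 = [{"PID": 101, "A02SBP": 110}, {"PID": 102, "A02SBP": 124}, {"PID": 103, "A02SBP": 118}]
exam2 = [{"PID": 102, "B02SBP": 121}, {"PID": 101, "B02SBP": 112}]

for record in merge_on_pid(exam1, exam2):
    print(record)
```

An inner join drops participants who are missing from either dataset. Since attendance declines across exams (Table 1), an analysis that needs everyone seen at baseline would keep unmatched records instead.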

Transformation Rules. The following rules were applied to individual variables:

a. Variables with Inherent Ability to Identify Individuals

Those variables that are judged to inherently identify individuals are not included in the data set. Examples are variables containing information regarding name and birth date.

b. Variables with Inherent Ability to Identify Field Center

Variables that inherently contain information about field center are either deleted or recoded. Examples are variables such as technician ID or machine ID.

c. Date Variables

All dates (except birth date, which is deleted) are recoded to the number of months relative to the Year 0 (Baseline) Examination date. This rule is intended to retain the chronological order of events while obscuring the actual calendar time. An illustration is provided below:

Variable             Date                 Months Since Baseline Exam
Baseline exam        September 21, 1985   0
Pregnancy delivery   October 31, 1983     -23
Hospitalization      July 5, 1996         106
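One way to express a date as months relative to the baseline exam is sketched below. The notes do not state how partial months were handled, so this version makes an assumption: it takes the difference in calendar months and ignores the day of the month. The function name `months_since_baseline` is hypothetical.

```python
from datetime import date


def months_since_baseline(event, baseline):
    """Calendar-month difference between an event date and the baseline exam date.

    Assumption: the day of the month is ignored; the release may have used a
    different rounding rule.
    """
    return (event.year - baseline.year) * 12 + (event.month - baseline.month)


baseline = date(1985, 9, 21)
print(months_since_baseline(baseline, baseline))           # baseline exam itself
print(months_since_baseline(date(1983, 10, 31), baseline))  # event before baseline
```

Events before the baseline exam come out negative, which preserves the ordering of events without revealing calendar dates.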

d. Dichotomous and Polychotomous Variables

For dichotomous and polychotomous variables identified as needing modification, we included without alteration those variables for which 20 or more participants are represented in each response category within a given race-gender stratum. If one or more categories had fewer than 20 responses, we either combined categories so that none was left with fewer than 20 responses or, failing that, set all responses in that design cell to missing.
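The fallback part of this rule (setting a whole cell to missing) can be sketched as below. Combining categories is a judgment call and is not attempted here. The function names, the `stratum` key, and the variable name `smoker` are hypothetical; missing values are represented as `None`.

```python
from collections import Counter

MIN_CELL = 20  # minimum responses per category within a race-gender stratum


def sparse_strata(records, var, stratum_key="stratum"):
    """Return the strata in which some observed category of var has fewer than MIN_CELL responses."""
    counts = {}
    for rec in records:
        if rec[var] is not None:
            counts.setdefault(rec[stratum_key], Counter())[rec[var]] += 1
    return {s for s, c in counts.items() if min(c.values()) < MIN_CELL}


def suppress_sparse(records, var, stratum_key="stratum"):
    """Set var to missing for every record in a sparse stratum; copy other records unchanged."""
    sparse = sparse_strata(records, var, stratum_key)
    return [
        {**rec, var: None} if rec[stratum_key] in sparse else dict(rec)
        for rec in records
    ]


# Invented data: in stratum WM only 5 participants answered "no".
records = (
    [{"stratum": "BF", "smoker": "yes"}] * 25
    + [{"stratum": "BF", "smoker": "no"}] * 30
    + [{"stratum": "WM", "smoker": "yes"}] * 25
    + [{"stratum": "WM", "smoker": "no"}] * 5
)
released = suppress_sparse(records, "smoker")
print(sparse_strata(records, "smoker"))
```

This sketch only counts categories that actually occur in the data; a response option with zero responses in a stratum is not detected.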

e. Continuous Variables

For those continuous variables needing modification, we identified the participants with the 20 highest values and those with the 20 lowest values within each of the 4 race-gender cells. The values for these participants were changed to the threshold value used to identify the group. Thus, the 20 participants with the highest values all have their value set to the 21st value from the top. A similar transformation was applied to the low values. For cells with fewer than 40 non-missing values, all values were set to missing.
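The top- and bottom-coding rule can be sketched for a single race-gender cell as follows. The function name `top_bottom_code` is hypothetical and missing values are represented as `None`. Clamping to the 21st value from each end is used here as an equivalent way of replacing the 20 most extreme values on each side. How ties at the threshold were handled in the release is not stated, and neither is the case where a cell has only 40 or 41 non-missing values, in which the two thresholds meet or cross; this sketch does not treat that case specially.

```python
def top_bottom_code(values, k=20):
    """Recode the k highest and k lowest values in one cell to the (k+1)th value from each end.

    Missing values (None) stay missing. If the cell has fewer than 2*k
    non-missing values, every value is set to missing.
    """
    present = sorted(v for v in values if v is not None)
    if len(present) < 2 * k:
        return [None] * len(values)
    low = present[k]          # (k+1)th value from the bottom
    high = present[-(k + 1)]  # (k+1)th value from the top
    return [None if v is None else min(max(v, low), high) for v in values]


# Invented cell: the integers 1..100 plus one missing value.
cell = list(range(1, 101)) + [None]
coded = top_bottom_code(cell)
print(min(v for v in coded if v is not None), max(v for v in coded if v is not None))
```

In the full procedure this function would be applied separately within each of the 4 race-gender cells.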

f. Character Variables

Some variables have been deleted as they contain confidential information such as initials, place of birth, and reason for hospitalization.

g. Time Variables

No transformations were made as times are recorded only as values on the 24-hour clock and contain no information about date.

h. Special Coding

Some variables received special coding according to specific rules that do not fall into any of the previous categories. One such example is age. At baseline, values less than 18 were recoded to 18 and values greater than 30 were recoded to 30. Another example is number of children or number of siblings. For these types of variables, transformations were applied to non-zero values only. For pregnancies and children, counts greater than four were recoded to four. Details of these special coding transformations are contained in the new documentation section for each dataset.
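The two examples given above amount to simple range limits and can be sketched as below. The function names are hypothetical, missing values are represented as `None`, and the actual rule for any given variable is the one recorded in that dataset's documentation.

```python
def recode_baseline_age(age):
    """Limit baseline age to the range 18-30, as described for the public release."""
    if age is None:
        return None
    return min(max(age, 18), 30)


def cap_count(count, cap=4):
    """Recode counts above the cap (e.g. pregnancies or children) to the cap; zero is unchanged."""
    if count is None:
        return None
    return min(count, cap)


print([recode_baseline_age(a) for a in (17, 18, 25, 30, 32)])
print([cap_count(n) for n in (0, 1, 4, 5, 7)])
```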

Datasets Judged Too Sensitive to Distribute

Some datasets from the original CARDIA release were not included in this release. The list below names these datasets and gives the reason each was withheld.

  • Illicit drug use (all years) - judged too sensitive to distribute
  • Year 2 medical history follow-up questions for hysterectomy, liver disease, and vasectomy - too few records
  • Year 2 lipids and lipoproteins - judged to be unreliable
  • Year 7 GXT - judged unreliable at one center
  • Year 10 return blood draw and medical history follow-up questions for angiograms and MRI scans - too few records
  • Year 10 EBCT pilot - reliability still undetermined

General Notes from Original Documentation

Exam1GeneralNotes.PDF

Exam2GeneralNotes.PDF

Exam3GeneralNotes.PDF

Exam4GeneralNotes.PDF

Exam5GeneralNotes.PDF

Exam6GeneralNotes.PDF

Exam7GeneralNotes.PDF

Exam8GeneralNotes.PDF

Exam9GeneralNotes.PDF