NIDDK Central Repository Data and Documentation Submission Guidelines
The following provides information on the preparation of datasets and associated documentation for submission to the NIDDK Central Repository. The overall goal of this effort is to produce research datasets and associated documentation that allow qualified outside researchers to perform new research while protecting the privacy of the research participants. This is accomplished by providing complete and accurate study data (and related biospecimens) and by following best practices to replace, remove, or otherwise protect any directly identifiable participant-level data.
The data submission process involves the following steps:
- Assembly of study data and associated documents
- Preparation of study data to replace or remove certain personal information at the participant level. When appropriate, a data redaction plan for the creation of public-use datasets should be developed, shared with the NIDDK for approval, and then applied to the study data.
- Submission of data and associated documentation (including a description of any redactions that were applied)
- Submission of a linking file, when appropriate, to provide linking information for data and biospecimens when both pre-redacted (private) and redacted (public-use) data files are provided to the NIDDK Repository, or when study data/samples are provided to the NIDDK Repository and to another NIDDK-approved repository.
Pre-redacted (private) final master files, from which the redacted (public-use) data files will be derived, are required in the following circumstances because they are important to the responsible management of data/biospecimen resources over time:
- Studies that are also submitting specimens to the NIDDK Biorepository
- Studies funded under NIDDK contract mechanisms
Step 1: Assembly of study data and documents
The documentation should be comprehensive and sufficiently clear to enable investigators who are not familiar with the study data to use it. The following types of documents will need to be assembled for submission to NIDDK. Documents should be in their original electronic format.
To facilitate the transfer of study materials, please review the Study Data Checklist. This checklist should be used to specify the items included in the submission and may be uploaded along with the data package. The repository will provide instructions for upload after initial contact regarding the submission.
The materials needed for submission are:
- Study Data
  - data collected on forms
  - data resulting from lab tests, genotyping, etc.
  - analysis data (derived data used for publication)
- Study Documentation
- Study Forms
- Sample and Image Linkage Files as applicable
Each of these categories is described in more detail below.
Study Data
- Data provided to the Central Repository must have certain personally identifying information removed, in accordance with the guidance in the Limited Data Sets and Data Use Agreements section of the NIH HIPAA Privacy Rule summary. This is described in further detail in the Step 2: Preparation of Study Data section below.
- Provide SAS data sets or contact the Central Repository regarding an alternative format.
- Data set variables should have variable and value labels.
- Name datasets in a manner that facilitates the matching of data sets to study forms. For example, Form02_v2.pdf and Form02_v2.sas7bdat could represent the study form and study data set for "Form 02: Medical History, Version 2".
- All data associated with study data collection forms should be provided.
- Other relevant data, such as Central Biospecimen Laboratory (CBL) data, should be provided.
- Provide analysis data sets when possible. These supplemental files, constructed by study investigators, are typically associated with a peer-reviewed publication and may include data records that have been merged across different data sets. Documentation showing how analysis variables are derived from the forms (raw data) variables should be included as well; see "Study Documentation".
- Central Repository staff will review the data to ensure that personally identifying information has been removed. Additional redaction will be undertaken as necessary to produce a public-use data package to be shared with requestors through the NIDDK Central Repository (see the Step 3: Submission of Data section below).
- In order to ensure the transferred data are complete and valid, a Central Repository statistician will replicate selected tables from published results. Data should be submitted in a timeframe that allows study staff to answer questions that may arise during this process.
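The naming convention described above lends itself to a simple automated check before submission. The following is a minimal Python sketch, assuming that each form PDF and its SAS data set share a file stem as in the `Form02_v2` example; the function name and the file listing are hypothetical and are not part of any Repository tooling.

```python
from pathlib import PurePath


def match_forms_to_datasets(filenames):
    """Pair form PDFs with SAS data sets that share the same file stem."""
    forms, datasets = {}, {}
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            forms[path.stem] = name
        elif suffix == ".sas7bdat":
            datasets[path.stem] = name
    # Stems present on both sides are matched form/data set pairs.
    matched = {
        stem: (forms[stem], datasets[stem])
        for stem in sorted(forms.keys() & datasets.keys())
    }
    # Stems present on only one side need a second look.
    unmatched = sorted(forms.keys() ^ datasets.keys())
    return matched, unmatched


# Hypothetical file listing for illustration only.
files = ["Form02_v2.pdf", "Form02_v2.sas7bdat", "Form03_v1.pdf", "labs.sas7bdat"]
matched, unmatched = match_forms_to_datasets(files)
```

Unmatched stems are not necessarily errors (laboratory data, for example, may have no form), but listing them makes gaps easy to spot.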
Study Data Collection Forms
- Provide copies of data collection forms in PDF format.
- Form instructions not included in Manual of Operations/Procedures should be provided as a PDF.
- Identify copyrighted forms/scales. Copyrighted forms are not provided to requestors of Central Repository data but should be provided to the Repository for reference.
- Annotate forms with variable names and value labels.
Study Documentation
Provide study documentation in PDF format, except where noted below. Documentation should include:
- Study protocol
- Manual of Operations/Standard Operating Procedures (MOP/SOP)
- Study System User Guide, if separate from MOP/SOP, for electronic data capture
- Bibliography of publications (Word format preferred)
- A contents file for each data set, showing the number of observations, variables, variable labels, and value labels
- Data dictionary or Codebook showing:
  - Descriptive statistics (means or frequencies) for each variable
  - Explanations of how analysis variables are derived from the forms (raw) variables. A log file for the program used to create the dataset may be provided in lieu of a description.
  - Value labels (i.e., SAS user-defined formats)
- Identification of the publication(s) associated with each analysis data set, if applicable
- URL for the public study website, if applicable
- Any additional information about the study or data sets that will facilitate the use of the data
- Selected documentation will be posted on the NIDDK Central Repository Portal to aid researchers in selecting datasets appropriate for their research.
The posted documentation covers the datasets to be shared, not the pre-redacted (private) study datasets, and is used to describe the study on the NIDDK Central Repository website. Examples include Forms, Data Dictionaries, Descriptive Statistics, and the Study Protocol.
To see the contents and appearance of a typical study listing, click on any listed study at the Study Search Page.
Step 2: Preparation of study data to remove participant level personal information
Certain personal information about study participants, or about their relatives, employers, or household members, must be removed from all items, including data, images, and documentation, before submission to the Central Repository. The following information falls into this category and should be removed prior to submitting materials to the Repository.
- Postal address information, other than town or city, state, and ZIP Code.
- Telephone numbers.
- Fax numbers.
- Electronic mail addresses.
- Social security numbers.
- Medical record numbers.
- Health plan beneficiary numbers.
- Account numbers.
- Certificate/license numbers.
- Vehicle identifiers and serial numbers, including license plate numbers.
- Device identifiers and serial numbers.
- Web universal resource locators (URLs).
- Internet protocol (IP) address numbers.
- Biometric identifiers, including fingerprints and voiceprints.
- Full-face photographic images and any comparable images
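Some of the items listed above have recognizable text patterns, so free-text variables can be screened for leftovers before submission. The following is a minimal Python sketch under stated assumptions: the regular expressions are illustrative only and will miss many real-world formats, the record layout and variable names are hypothetical, and a scan of this kind supplements manual review rather than replacing it.

```python
import re

# Illustrative patterns only; they cover a few common US-style formats.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "url": re.compile(r"https?://\S+", re.IGNORECASE),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def flag_identifiers(records):
    """Return (row index, variable name, pattern name) for each suspicious value."""
    hits = []
    for i, record in enumerate(records):
        for variable, value in record.items():
            if not isinstance(value, str):
                continue  # only text values are scanned
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    hits.append((i, variable, label))
    return hits


# Hypothetical records for illustration only.
rows = [
    {"pid": "010-001", "comment": "Contact at jane@example.org"},
    {"pid": "010-002", "comment": "No issues reported"},
    {"pid": "010-003", "comment": "SSN 123-45-6789 on chart"},
]
hits = flag_identifiers(rows)
```

Each hit identifies a row and variable to inspect by hand; identifiers such as medical record or account numbers have no universal format and would need study-specific rules.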
Submitted datasets should include all raw data and analysis level files from all study visits, laboratory measurements, study procedures, and outcome elements along with other final supplemental files (for example, required calculated variables) so that users may approximate published results and conduct new analyses. Submitted datasets must be redacted to remove the personal information specified above and data collected solely for administrative purposes and must conform to individual informed consent restrictions. Public-use datasets may contain recodes of selected low-frequency data values necessary to protect participant privacy and minimize re-identification risks. The redaction process may impact the exact replication of published results using the public-use data.
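The recoding of low-frequency values mentioned above can be illustrated as follows. This is a minimal Python sketch, not the Repository's redaction procedure: the threshold of 5, the replacement label "Other", and the example variable are all assumptions chosen for illustration, and the actual rules would come from the approved redaction plan.

```python
from collections import Counter


def recode_low_frequency(values, min_count=5, other_label="Other"):
    """Replace values seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical categorical variable with two rare categories.
site = ["A"] * 12 + ["B"] * 7 + ["C"] * 2 + ["D"]
recoded = recode_low_frequency(site)
frequencies = Counter(recoded)
```

Because the rare categories are pooled, tabulations from the public-use data may differ slightly from published tables, which is consistent with the note above about exact replication.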
If a study wishes to prepare the public-use datasets, it should also prepare modified dataset documentation that reflects the changes made to the included variables and any recodes. This documentation will be provided along with the public-use datasets to approved requestors. A summary document that describes the changes and deletions applied during redaction should also be included. In addition, a summary documentation file, usually called a README file, should be submitted. This document should provide a complete overview of the data and a description of their use, appropriate for investigators who are not familiar with the dataset. It should describe significant events that may not be documented in the protocol or other documents and that would help users understand the submitted data. Examples include addenda describing significant changes in study procedures; cautionary information regarding the interpretation of data elements; explanations of apparent inconsistencies in the data or of frequently missing data; the abandonment of selected data collections at one or more sites; and modifications to questionnaires over time, if not documented elsewhere.
The README should also contain a brief description of the study (including a general orientation to the study, its components, and its examination and follow-up timeline), a listing of all files being provided, a description of system requirements, program code for creating a SAS data set from the SAS export data file (if appropriate), and a frequency distribution for selected key variables.
Step 3: Submission of data and associated documentation
Upon completion of the preparation of datasets and documentation, these files will be ready for transfer.
Once the files are transferred, Central Repository staff will review the submission to verify the transferred records and included study data variables, re-generate frequencies for comparison to those generated by the study staff, and review datasets for additional items that may need to be redacted or recoded. For studies that have multiple datasets, the datasets will be assessed for their ability to be linked to one another. NIDDK will also examine variables contained in multiple datasets, such as Participant ID and visit, to ensure that they have been formatted consistently across all datasets.
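Submitters may wish to run a similar consistency check on shared variables before transfer. The following is a minimal Python sketch, assuming each dataset has been read into a list of dictionaries; the variable name `pid`, the dataset names, and the "shape" comparison are hypothetical choices for illustration and do not describe the Repository's own review tools.

```python
import re


def id_shape(value):
    """Reduce an ID to its format: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", shape)


def inconsistent_formats(datasets, variable):
    """Return the shapes of `variable` per dataset when datasets disagree, else {}."""
    shapes = {
        name: {id_shape(row[variable]) for row in rows}
        for name, rows in datasets.items()
    }
    distinct = {frozenset(s) for s in shapes.values()}
    return shapes if len(distinct) > 1 else {}


# Hypothetical datasets: the lab file stores Participant ID without the hyphen.
datasets = {
    "form01": [{"pid": "010-001"}, {"pid": "010-002"}],
    "labs": [{"pid": "010001"}],
}
report = inconsistent_formats(datasets, "pid")
```

An empty result means every dataset uses the same set of ID formats; a non-empty result shows which dataset deviates and how.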
Pre-redacted (private) data will be held securely by the NIDDK Central Repository and used to facilitate the comparison of submitted data to publications, to assist in making appropriate resources available to requestors, and to support the long-term management of the repository resources.
The NIDDK Central Repository shares data/samples (repository resources) with qualified researchers under a data use agreement restricted to the specific research proposed by the researcher. Each request for resources undergoes a review for merit, and recipients of resources agree not to attempt to identify or contact participants, not to link to other resources not specified in their proposal, and not to share these data with others. The Data Use Agreement template can be found on the Repository website.
Step 4: Submission of a linking file
If a study wishes to prepare the public-use data sets, all redacted data, such as full dates, center values, and original values for grouped data, should be provided in a separate file along with a link between the original Participant IDs and the newly assigned (randomized) IDs. When biospecimens are also submitted, the Sample Linkage File (described below) should include any redacted sample collection dates and their associated study time points for each participant. When different study-related data types (e.g., genomic, omics) are provided to other NIDDK-approved repositories, a linkage file should be provided to map the data provided to the NIDDK Repository to the data provided to the other repository or repositories. These files will be stored securely for archival purposes and may be used to facilitate the comparison of submitted data to publications, to assist in making appropriate resources available to requestors, and to support the long-term management of the repository resources.
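The link between original and newly assigned IDs can be generated at the same time as the new IDs. The following is a minimal Python sketch, assuming new IDs are drawn at random without reference to the originals; the function name, the `P` prefix, the six-digit width, and the column names are hypothetical and would be replaced by whatever the study's redaction plan specifies.

```python
import secrets


def assign_public_ids(participant_ids, prefix="P"):
    """Map each original Participant ID to a unique, randomly drawn new ID."""
    unique_ids = sorted(set(participant_ids))
    rng = secrets.SystemRandom()  # operating-system randomness, not seedable
    # Draw distinct numbers from a pool ten times larger than needed so that
    # the new IDs reveal neither enrollment order nor the cohort size.
    pool = range(1, 10 * len(unique_ids) + 1)
    numbers = rng.sample(pool, len(unique_ids))
    return {pid: f"{prefix}{n:06d}" for pid, n in zip(unique_ids, numbers)}


# Hypothetical Participant IDs; a repeated ID receives a single new ID.
original_ids = ["010-001", "010-002", "020-001", "010-001"]
link = assign_public_ids(original_ids)
link_rows = [
    {"original_id": pid, "public_id": new_id} for pid, new_id in link.items()
]
```

The rows in `link_rows` correspond to the contents of the linking file; because the assignment cannot be regenerated from a seed, the file itself is the only record of the mapping and should be stored securely.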
Sample Linkage File
(if biospecimens are submitted for storage at the NIDDK Biorepository)
- The linkage file should be delivered at the time of study data delivery for completed studies, or at the initial communication between the Central Repository and the DCC for ongoing studies.
- Provide a sample linkage file that uniquely maps each sample ID to the corresponding Participant ID. For studies where samples are collected longitudinally over several timepoints, the linkage file should uniquely map each sample ID to a corresponding Participant ID and timepoint. From this point on, we'll refer to timepoint as a "study visit".
- Study visits should be represented by unique visit codes. A visit code is a label that identifies the study visit during which the sample was collected. Sample collection date is not a substitute for visit code.
- Example: Suppose Study "A" collected serum samples longitudinally from participants at baseline, week 6, and month 6. A portion of the sample linkage file might then look like:
| Sample ID | Participant ID | Visit Code | Collection Date |
|---|---|---|---|
| A-0001 | 010-001 | BV | 01/01/2001 |
| A-0027 | 010-001 | FV01 | 02/10/2001 |
| A-0078 | 010-001 | FV02 | 06/01/2001 |
| A-0002 | 010-002 | BV | 01/01/2002 |
| B-0001 | 020-001 | BV | 01/01/2003 |
| B-0023 | 020-001 | FV01 | 02/10/2003 |
| B-0002 | 020-002 | BV | 01/01/2004 |
| B-0025 | 020-002 | FV01 | 02/10/2004 |
| B-0041 | 020-002 | FV02 | 06/01/2004 |
| I-0001 | 090-001 | BV | 01/01/2003 |
| I-0007 | 090-001 | BV | 01/01/2002 |
Note that sample collection dates are a required component of the Sample Manifest that accompanies each shipment of biospecimens to the biorepository. Alternatively, the full collection date may be provided as part of the Sample Linkage File.
The sample linkage file should include a simple table showing the description for each visit code. The visit code description table for the previous example might look like:
| Visit Code | Description |
|---|---|
| BV | Baseline Visit (enrollment) |
| FV01 | Follow-up Visit #1 (week 6) |
| FV02 | Follow-up Visit #2 (month 6) |
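Two of the requirements above, that each sample ID appears only once and that every visit code has a description, can be checked mechanically before the linkage file is delivered. The following is a minimal Python sketch, assuming the linkage file has been read into a list of dictionaries; the key names, the function name, and the deliberately faulty third row are hypothetical.

```python
from collections import Counter


def validate_linkage(rows, visit_codes):
    """Return a list of problems found in sample linkage rows."""
    problems = []
    # Each sample ID must map to exactly one participant and visit.
    counts = Counter(row["sample_id"] for row in rows)
    for sample_id, n in counts.items():
        if n > 1:
            problems.append(f"duplicate sample ID: {sample_id}")
    # Every visit code used must appear in the description table.
    for row in rows:
        if row["visit_code"] not in visit_codes:
            problems.append(
                f"undefined visit code {row['visit_code']!r} "
                f"for sample {row['sample_id']}"
            )
    return problems


visit_codes = {
    "BV": "Baseline Visit (enrollment)",
    "FV01": "Follow-up Visit #1 (week 6)",
    "FV02": "Follow-up Visit #2 (month 6)",
}
rows = [
    {"sample_id": "A-0001", "participant_id": "010-001", "visit_code": "BV"},
    {"sample_id": "A-0027", "participant_id": "010-001", "visit_code": "FV01"},
    # Faulty row for illustration: repeated sample ID and unknown visit code.
    {"sample_id": "A-0027", "participant_id": "010-002", "visit_code": "FV03"},
]
problems = validate_linkage(rows, visit_codes)
```

An empty list indicates that the file passes both checks; the same function could be applied to an image linkage file by supplying Image IDs under the `sample_id` key.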
Image Linkage File
(if image data were collected for storage in the Data Repository)
- Provide an image linkage file that maps each Image ID to the corresponding Participant ID and visit code. The file should conform to the structure discussed above but refer to Image ID rather than Sample ID.
Study-related data provided to other NIDDK approved repositories
If study-related data have been or will be provided to other NIDDK approved repositories such as dbGaP, a link must be provided between the IDs submitted to the NIDDK Central Repository and the IDs used at the other repository. This link will be used to facilitate combining data from the two sources for approved analyses.
- Provide one or more linkage files that map each ID in the other repositories to the corresponding Participant ID and visit code as submitted to the NIDDK Central Repository. Each file should conform to the structure discussed above.