NIDDK Central Repository - Archive Requirements

NIDDK Central Repository Data and Documentation Submission Guidelines

The following provides information on preparation of datasets and associated documentation for submission to NIDDK Central Repository (NIDDK-CR). The overall goal of this effort is to produce research datasets and associated documentation that allow qualified outside researchers to perform new research while protecting the privacy of the research participants. This is accomplished by providing complete and accurate study data (and related biospecimens) while following best practices to replace, remove, or otherwise protect any directly identifiable participant-level data.

The data submission process involves the following steps:

Assembly of study data and associated documents
Preparation of study data to replace or remove certain personal information at the participant level. When appropriate, a data redaction plan for the creation of public-use datasets should be developed and shared with NIDDK for approval before it is applied to the study data.
Submission of data and associated documentation (including a description of any redactions that were applied)
Pre-redacted (private) final master files from which the redacted (public-use) data files will be derived are required in the following circumstances, as they are important to the responsible management of data/biospecimen resources over time:
- Studies which are also submitting specimens to NIDDK Biorepository
- Studies funded under NIDDK contract mechanisms
The submission of pre-redacted final files is preferred for data-only studies funded by grants or cooperative agreements, as they are useful for NIDDK Quality Assurance (QA) of the redaction process.
Submission of a linking file, when appropriate, to provide linking information for data and biospecimens, when both pre-redacted (private) and redacted (public-use) data files are provided to NIDDK-CR, or when study data/biospecimens are provided to NIDDK-CR and some other NIDDK-approved Repository.

Step 1: Assembly of study data and documents

The documentation should be comprehensive and sufficiently clear to enable investigators who are not familiar with the study data to use it. The following types of documents will need to be assembled for submission to NIDDK. Documents should be in their original electronic format.

To facilitate the transfer of study materials, please review the Study Data Checklist. This checklist should be used to specify the items included in the submission and may be uploaded along with the data package. NIDDK-CR will provide instructions for upload after initial contact regarding the submission.

The materials needed for submission are:

Study Data
- data collected on forms
- data resulting from lab tests, genotyping, etc.
- analysis data (derived data used for publication)
Study Documentation
Study Forms
Biospecimen and Image Linkage Files as applicable

Each of these categories is described in more detail below.

Study Data

Data provided to NIDDK-CR must have certain personally identifying information removed in accordance with the guidance in the Limited Data Sets and Data Use Agreements section of the NIH HIPAA Privacy Rule summary. This is described in further detail in the Step 2: Preparation of Study Data section below.
Provide SAS data sets or contact NIDDK-CR regarding an alternative format.
Data set variables should have variable and value labels.
Name datasets in a manner that facilitates the matching of data sets to study forms. For example, Form02_v2.pdf and Form02_v2.sas7bdat could represent the study form and study data set for "Form 02: Medical History, Version 2".
All data associated with study data collection forms should be provided.
Other relevant data, such as Central Biospecimen Laboratory (CBL) data, should be provided.
Provide analysis data sets when possible. These supplemental files, constructed by study investigators, are typically associated with a peer-reviewed publication and may include data records that have been merged across different data sets. Documentation showing how analysis variables are derived from the forms (raw data) variables should be included as well; see "Study Documentation".
NIDDK-CR staff will review the data to ensure that personally identifying information has been removed. Additional redaction will be undertaken as necessary to produce a public-use data package to be shared with requestors through NIDDK-CR (see the Step 3: Submission of Data section below).
In order to ensure the transferred data are complete and valid, NIDDK-CR support staff will replicate selected tables from published results. Data should be submitted in a timeframe that allows study staff to answer questions that may arise during this process.
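The naming convention described above can be checked before submission. The following is a minimal sketch, assuming that forms are PDF files, data sets are .sas7bdat files, and a form and its data set share the same file name stem; the function name and the sample file list are hypothetical.

```python
from pathlib import PurePath

def match_forms_to_datasets(filenames):
    """Pair study forms (.pdf) with data sets (.sas7bdat) that share a file name stem."""
    forms, datasets = {}, {}
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            forms[path.stem.lower()] = name
        elif suffix == ".sas7bdat":
            datasets[path.stem.lower()] = name
    matched = sorted((forms[k], datasets[k]) for k in forms.keys() & datasets.keys())
    forms_only = sorted(forms[k] for k in forms.keys() - datasets.keys())
    data_only = sorted(datasets[k] for k in datasets.keys() - forms.keys())
    return matched, forms_only, data_only

# Hypothetical file list for illustration only.
files = ["Form02_v2.pdf", "Form02_v2.sas7bdat", "Form03_v1.pdf", "labs.sas7bdat"]
matched, forms_only, data_only = match_forms_to_datasets(files)
```

Unmatched files are not necessarily errors. Laboratory and analysis data sets, for example, may have no corresponding form.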

Study Data Collection Forms

Provide copies of data collection forms in PDF format.
Form instructions not included in Manual of Operations/Procedures should be provided as a PDF.
Identify copyrighted forms/scales. Copyrighted forms are not provided to requestors of NIDDK-CR data but should be provided to NIDDK-CR for reference.
Annotate forms with variable names and value labels.

Study Documentation

Provide study documentation in PDF format, except where noted below. Documentation should include:

Study protocol
Manual of Operations/Standard Operating Procedures (MOP/SOP)
Study System User Guide, if separate from MOP/SOP, for electronic data capture
Bibliography of publications (Word format preferred)
A contents file for each data set, showing the number of observations, variables, variable labels, and value labels
Data dictionary or Codebook showing:
- Descriptive statistics (means or frequencies) for each variable.
- Explanations of how analysis variables are derived from the forms (raw) variables. A log file for the program used to create the dataset may be provided in lieu of a description.
Value labels (i.e., SAS user-defined formats)
Identification of the publication(s) associated with each analysis data set, if applicable
URL for the public study website, if applicable
Any additional information about the study or data sets that will facilitate the use of the data
Selected documentation will be posted on NIDDK-CR website to aid researchers in selecting datasets appropriate for their research.
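The contents file and descriptive statistics listed above are typically produced in SAS (for example, with PROC CONTENTS and PROC FREQ). As an illustration of the information they should convey, the following minimal sketch summarizes a data set that is assumed to have been exported to CSV; the function name, variable names, and sample values are hypothetical.

```python
import csv
import io
from collections import Counter

def summarize(csv_text):
    """Return the number of observations, the variable names, and value frequencies."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    variables = list(reader.fieldnames or [])
    frequencies = {var: Counter(row[var] for row in rows) for var in variables}
    return {"observations": len(rows), "variables": variables, "frequencies": frequencies}

# Hypothetical export of a small data set.
sample = "ID,SEX,DIABETES\n001,F,1\n002,M,0\n003,F,1\n"
summary = summarize(sample)
```

Frequencies suit categorical variables; continuous variables would be summarized with means instead.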

It should be noted that selected study documentation, not including documentation of pre-redacted (private) study datasets but including documentation of datasets to be shared, will be posted and used to describe the study on NIDDK-CR website. Examples include Forms, Data Dictionaries, Descriptive Statistics, and the Study Protocol.

To see the contents and appearance of a typical study listing, click on any listed study at the Study Search Page.

Step 2: Preparation of study data to remove participant level personal information

Certain personal information of study participants or of relatives, employers, or household members of the individuals must be removed from all items, including data, images, and documentation, before submitting to NIDDK-CR. Below is a list of the information that falls into this category and should be removed prior to submitting materials to NIDDK-CR.

Names.
Postal address information, other than town or city, state, and ZIP Code.
Telephone numbers.
Fax numbers.
Electronic mail addresses.
Social security numbers.
Medical record numbers.
Health plan beneficiary numbers.
Account numbers.
Certificate/license numbers.
Vehicle identifiers and serial numbers, including license plate numbers.
Device identifiers and serial numbers.
Web universal resource locators (URLs).
Internet protocol (IP) address numbers.
Biometric identifiers, including fingerprints and voiceprints.
Full-face photographic images and any comparable images.
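Some of the identifiers listed above follow recognizable text patterns and can be screened for automatically. The following is a minimal sketch, assuming the data have been loaded as a list of records; the regular expressions are simplified illustrations (US-style phone and Social Security number formats) and the function name is hypothetical. A screen of this kind is an aid to manual review, not a substitute for it, and it cannot detect names or other free-form identifiers.

```python
import re

# Simplified, illustrative patterns; real data may need broader or stricter ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url": re.compile(r"https?://\S+"),
}

def flag_identifiers(rows):
    """Return (row index, column, pattern name) for every value that matches a pattern."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            for kind, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.append((index, column, kind))
    return findings

# Hypothetical records for illustration only.
rows = [
    {"ID": "001", "NOTE": "call 555-123-4567"},
    {"ID": "002", "NOTE": "no issues"},
    {"ID": "003", "NOTE": "jane@example.org"},
]
findings = flag_identifiers(rows)
```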

Submitted datasets should include all raw data and analysis level files from all study visits, laboratory measurements, study procedures, and outcome elements along with other final supplemental files (for example, required calculated variables) so that users may approximate published results and conduct new analyses. Submitted datasets must be redacted to remove the personal information specified above and data collected solely for administrative purposes and must conform to individual informed consent restrictions. Public-use datasets may contain recodes of selected low-frequency data values necessary to protect participant privacy and minimize re-identification risks. The redaction process may impact the exact replication of published results using the public-use data.
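The recoding of low-frequency values mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: the threshold of 5 and the replacement label "OTHER" are assumptions, and the appropriate rule for a given study should come from its approved redaction plan.

```python
from collections import Counter

def recode_low_frequency(values, min_count=5, replacement="OTHER"):
    """Replace values that occur fewer than min_count times with a single category."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else replacement for v in values]

# Hypothetical categorical variable: "C" and "D" are rare.
original = ["A"] * 6 + ["B"] * 5 + ["C"] * 2 + ["D"]
recoded = recode_low_frequency(original)
recoded_counts = Counter(recoded)
```

The combined category may itself remain small, in which case further grouping or suppression may be needed.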

If a study wishes to prepare the public-use datasets, it should also prepare modified study dataset documentation that reflects the changes made to the included variables and any recodes. This documentation will be provided along with the public-use datasets to approved requestors. A summary document describing the changes and deletions applied during redaction should also be included. In addition, a summary documentation file, usually called a README file, should be submitted. This document should provide a complete overview of the data and a description of their use, appropriate for investigators who are not familiar with the dataset. It should describe significant events that may not be documented in the protocol or other documents and that would help in understanding the submitted data. Examples include addenda describing significant changes in study procedures; cautionary information about the interpretation of data elements; explanations of apparent inconsistencies in the data or of frequently missing data; the abandonment of selected data collections at one or more sites; and modifications to questionnaires over time, if not documented elsewhere.

The README should also contain a brief description of the study (including a general orientation to the study, its components, and its examination and follow-up timeline), a listing of all files being provided, a description of system requirements, program code for generating a SAS data file from the SAS export data file (if appropriate), and a frequency distribution for selected key variables.

Step 3: Submission of data and associated documentation

Once the datasets and documentation have been prepared, the files are ready for transfer.

Once the files are transferred, NIDDK-CR support staff will review the submission to verify the transferred records and included study data variables, re-generate frequencies for comparison to those generated by the study staff, and review datasets for additional items that may need to be redacted or recoded. For studies with multiple datasets, staff will assess whether the datasets can be linked to one another. NIDDK will also examine variables contained in multiple datasets, such as Participant ID and visit, to ensure that they have been formatted consistently across all datasets.
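Submitters can run a similar consistency check on key variables before transfer. The following is a minimal sketch, assuming that each dataset's Participant ID values have been read into a list; the dataset names, the sample IDs, and the choice of a reference dataset are hypothetical and do not describe the actual NIDDK-CR review procedure.

```python
import re

def id_shape(value):
    """Reduce an ID to its format: digits become 9 and letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def shapes_by_dataset(datasets):
    """List the distinct ID formats found in each dataset."""
    return {name: sorted({id_shape(v) for v in ids}) for name, ids in datasets.items()}

def unlinked_ids(datasets, reference):
    """List IDs in each dataset that do not appear in the reference dataset."""
    ref = set(datasets[reference])
    return {
        name: sorted(set(ids) - ref)
        for name, ids in datasets.items()
        if name != reference
    }

# Hypothetical Participant ID columns from three datasets.
datasets = {
    "demographics": ["010-001", "010-002", "020-001"],
    "labs": ["010-001", "010002"],
    "visits": ["010-001", "020-001"],
}
shapes = shapes_by_dataset(datasets)
orphans = unlinked_ids(datasets, reference="demographics")
```

Here the "labs" dataset contains an ID written without a hyphen, which shows up both as a second format and as an ID that cannot be linked to the reference dataset.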

Pre-redacted (private) data will be held securely by NIDDK-CR and used to facilitate the comparison of submitted data to publications, to assist in making appropriate resources available to requestors, and to support the long-term management of the repository resources.

NIDDK-CR shares data/biospecimens (repository resources) with qualified researchers under a data use agreement restricted to the specified research proposed by the researcher. Each request for resources undergoes a review for merit, and recipients of resources agree not to attempt to identify or contact participants, not to link to other resources not specified in their proposal, and not to share these data with others. The Data and Resources Use Agreement example templates can be found on the Information for Requestors using NIDDK-CR Resources for Research (R4R) page under the Helpful Information tab.

Step 4: Submission of a linking file

If a study wishes to prepare the public-use data sets, all redacted data, such as full dates, center values, and original values for grouped data, should be provided in a separate file along with a link between the original Participant IDs and the newly assigned (randomized) IDs. When biospecimens are also submitted, the Biospecimen Linkage File (described below) should include any redacted biospecimen collection dates and their associated study time points for each participant. When different study-related data types (e.g., genomic, omics) are provided to other NIDDK-approved repositories, a linkage file should be provided to map the data provided to NIDDK-CR to the data provided to the other repository or repositories. These files will be stored securely for archival purposes and may be used to facilitate the comparison of submitted data to publications, to assist in making appropriate resources available to requestors, and to support the long-term management of the repository resources.
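The link between original and randomized Participant IDs can be sketched as follows. This is a minimal illustration, assuming six-digit numeric replacement IDs; the ID format, the function name, and the sample IDs are hypothetical, and a study may use any assignment method that cannot be reversed without the link file.

```python
import secrets

def assign_random_ids(participant_ids, digits=6):
    """Map each original Participant ID to a unique, randomly drawn numeric ID."""
    link = {}
    used = set()
    # dict.fromkeys removes repeated IDs while keeping first-seen order.
    for original in dict.fromkeys(participant_ids):
        while True:
            candidate = "".join(secrets.choice("0123456789") for _ in range(digits))
            if candidate not in used:
                break
        used.add(candidate)
        link[original] = candidate
    return link

# Hypothetical original IDs; one participant appears on two records.
link = assign_random_ids(["010-001", "010-002", "010-001", "020-001"])
```

The resulting mapping belongs only in the private linking file and should not accompany the public-use datasets.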

Biospecimen Linkage File

(if biospecimens are submitted for storage at NIDDK Biorepository)

The linkage file should be delivered at the time of study data delivery for completed studies, or at the initial communication between NIDDK-CR and the DCC for ongoing studies.
Provide a biospecimen linkage file that uniquely maps each biospecimen ID to the corresponding Participant ID. For studies where biospecimens are collected longitudinally over several timepoints, the linkage file should uniquely map each biospecimen ID to a corresponding Participant ID and timepoint. From this point on, we'll refer to timepoint as a "study visit".

Study visits should be represented by unique visit codes. A visit code is a label that identifies the study visit during which the biospecimen was collected. Biospecimen collection date is not a substitute for visit code.

Example: Suppose Study "A" collected serum samples longitudinally from participants at baseline, week 6, and month 6. Then, a portion of the biospecimen linkage file might look like:

Biospecimen ID	Participant ID	Visit Code	Collection Date
A-0001	010-001	BV	01/01/2001
A-0027	010-001	FV01	02/10/2001
A-0078	010-001	FV02	06/01/2001
A-0002	010-002	BV	01/01/2002
B-0001	020-001	BV	01/01/2003
B-0023	020-001	FV01	02/10/2003
B-0002	020-002	BV	01/01/2004
B-0025	020-002	FV01	02/10/2004
B-0041	020-002	FV02	06/01/2004
I-0001	090-001	BV	01/01/2003
I-0007	090-001	BV	01/01/2002

Note that biospecimen collection dates are a required component of the Biospecimen Manifest that accompanies each shipment of biospecimens to NIDDK Biorepository as described in the Biospecimen Submission Label, Manifest, and Shipping Guidance. Alternatively, the full collection date may be provided as part of the Biospecimen Linkage File.

The biospecimen linkage file should include a simple table showing the description for each visit code. The visit code description table for the previous example might look like:

Visit Code	Description
BV	Baseline Visit (enrollment)
FV01	Follow-up Visit #1 (week 6)
FV02	Follow-up Visit #2 (month 6)
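The requirements above (each biospecimen ID mapped once, to a Participant ID and a described visit code) can be checked with a short script. The following is a minimal sketch, assuming a tab-delimited linkage file with the column headers used in the example; the function name and the sample rows, including the deliberately invalid ones, are hypothetical.

```python
import csv
import io

# Visit code description table from the example above.
VISIT_CODES = {
    "BV": "Baseline Visit (enrollment)",
    "FV01": "Follow-up Visit #1 (week 6)",
    "FV02": "Follow-up Visit #2 (month 6)",
}

def validate_linkage(tsv_text, visit_codes):
    """Return (line number, problem) pairs for a tab-delimited linkage file."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Line 1 is the header, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        specimen = row["Biospecimen ID"]
        if specimen in seen:
            problems.append((line_no, "duplicate Biospecimen ID"))
        seen.add(specimen)
        if not row["Participant ID"]:
            problems.append((line_no, "missing Participant ID"))
        if row["Visit Code"] not in visit_codes:
            problems.append((line_no, "unknown Visit Code"))
    return problems

# Hypothetical file contents: line 4 repeats a biospecimen ID and
# line 5 uses a visit code that is not in the description table.
sample = (
    "Biospecimen ID\tParticipant ID\tVisit Code\n"
    "A-0001\t010-001\tBV\n"
    "A-0027\t010-001\tFV01\n"
    "A-0027\t010-002\tBV\n"
    "A-0090\t010-002\tWK6\n"
)
problems = validate_linkage(sample, VISIT_CODES)
```

Several biospecimens may legitimately share the same Participant ID and visit code, so the check enforces uniqueness only on the biospecimen ID.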

Image Linkage File

(if image data were collected for storage in NIDDK-CR)

Provide an image linkage file that maps each Image ID to the corresponding Participant ID and visit code. The file provided should conform to the structure discussed above but refer to Image ID rather than Biospecimen ID.

Study-related data provided to other NIDDK approved repositories

If study-related data have been or will be provided to other NIDDK-approved repositories such as dbGaP, a link must be provided between the IDs submitted to NIDDK-CR and the IDs used at the other NIDDK-approved repository. This link will be used to facilitate combining data from the two sources for approved analyses.

Provide one or more linkage files that map each ID in other repositories to the corresponding Participant ID and visit code as submitted to NIDDK-CR. The files provided should conform to the structure discussed above.
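How such a link supports combining data from two sources can be sketched as follows. This is a minimal illustration at the participant level only (visit codes are omitted for brevity); the field names, ID values, and function name are hypothetical and do not reflect the format of any particular repository.

```python
def combine(cr_records, other_records, link):
    """Join records from another repository to NIDDK-CR records.

    link maps each other-repository ID to an NIDDK-CR Participant ID.
    """
    by_participant = {rec["participant_id"]: rec for rec in cr_records}
    combined, unmatched = [], []
    for rec in other_records:
        participant_id = link.get(rec["other_id"])
        if participant_id in by_participant:
            combined.append({**by_participant[participant_id], **rec})
        else:
            unmatched.append(rec["other_id"])
    return combined, unmatched

# Hypothetical records and link for illustration only.
cr_records = [{"participant_id": "010-001", "age_group": "40-49"}]
other_records = [
    {"other_id": "X1", "genotype": "AA"},
    {"other_id": "X9", "genotype": "AG"},
]
link = {"X1": "010-001"}
combined, unmatched = combine(cr_records, other_records, link)
```

IDs that cannot be matched through the linkage file are reported rather than dropped silently, so that gaps in the file can be corrected.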

(Rev 02/25/2020)
