NIDDK Central Repository Data and Documentation Submission Guidance
The following provides information on preparing datasets and associated documentation for submission of data or specimens to the NIDDK Central Repository (NIDDK-CR). The overall goal of this effort is to produce research datasets and associated documentation (including study metadata, analytes, documentation, code, analytic tools, methods, algorithms, workflows, results, summaries, and analyses) that allow qualified secondary researchers to perform new research while protecting the privacy of research participants. This is accomplished by providing complete and accurate study data (and related specimens) and following best practices to replace, remove, or otherwise protect any directly identifiable participant-level data. Note that not all data (or data types) generated by a clinical research project may be appropriate for NIDDK-CR. Other NIDDK-approved data-type-specific or generalist repositories may be considered alongside NIDDK-CR. Refer to the NIDDK Data Management and Sharing (DMS) Guidance, specifically the Selecting a Data Repository resources and tool.
The data submission process involves the following steps:
- Assembly of study data and associated documents.
- Preparation of study data to replace or remove certain personal information at the participant level. A Resource Archival and Sharing Request to onboard a study must be developed and submitted to NIDDK for approval in advance of participant enrollment. The Request to onboard a study must be aligned with the approved DMS plan.
- Submission of data and associated documentation (including a description of any redactions that were applied and the program or code used to create the redacted datasets). If redactions beyond the 16 identifiers listed in Step 2 are performed, both the redacted and pre-redacted data must be submitted. This supports NIDDK-CR Quality Assurance (QA) of the redaction process, facilitates the comparison of submitted data to publications, assists in making appropriate resources available to requestors, and allows for responsible long-term management of the repository resources.
- Submission of a linking file, if needed, to provide linking information for data to specimens or images, or when study resources are provided to NIDDK-CR and some other NIDDK-approved repository.
Step 1: Assembly of Study Data and Documents
The data and documentation should be comprehensive and sufficiently clear to enable investigators who are not familiar with the study data to use it appropriately. The following types of files will need to be assembled for submission to NIDDK-CR. Documents should be in their original electronic format. NIDDK-CR will provide instructions for upload after initial contact regarding the submission.
The files needed for submission are:
- Study Data
- Data collected on forms, at study visits, etc. that are at the individual participant level
- Data resulting from lab tests, genotyping, etc.
- Summary or analysis data (derived data used for analyses or publications)
- Program(s) or Code (e.g., SAS program used to create analysis dataset for a publication, SAS program used to create derived variables, etc.)
- Study Documentation (Protocol(s), MOPs/SOPs, instructions, walkthroughs, codebooks/data dictionaries in Excel format, list of publications, etc.)
- Study Forms (e.g., data collection instruments, case report forms)
- Specimen and Image Linkage Files as applicable
Each of these categories is described in more detail below.
Study Data
- Data provided to NIDDK-CR must have certain personally identifying information removed in accordance with guidance provided in the Limited Data Sets and Data Use Agreements section of NIH HIPAA Privacy Rule summary. This is described in further detail in Step 2: Preparation of Study Data.
- Provide SAS and CSV data sets or contact NIDDK-CR regarding an alternative format.
- Data set variables should have variable and value labels. Include any format files and/or programs or code used to create variables and labels.
- Name datasets in a manner that facilitates the matching of data sets to study forms. For example, Form02_v2.pdf and Form02_v2.sas7bdat could represent the study form and study data set for "Form 02: Medical History, Version 2".
- All data associated with study data collection forms should be provided.
- Other relevant data, such as Central Biospecimen Laboratory (CBL) data, should be provided.
- Provide analysis data sets when possible. These supplemental files, constructed by study investigators, are typically associated with a peer-reviewed publication and may include data records that have been merged across different data sets. Documentation showing how summary or analysis variables are derived from the forms (raw data) variables should be included as well; see "Study Documentation". Included with these datasets should be the code or program used to create the analysis dataset itself.
- A review will be done by NIDDK-CR staff to ensure that personally identifying information is removed. Additional redaction will be undertaken as necessary to produce a data package to be shared with approved requestors through NIDDK-CR (see Step 3: Submission of Data and Associated Documentation).
- To ensure the submitted data are complete and valid, NIDDK-CR support staff will replicate selected tables from published results. Data should be submitted in a timeframe that allows study staff to answer questions that may arise during this process.
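The naming convention described above (for example, Form02_v2.pdf paired with Form02_v2.sas7bdat) can be checked before upload. The following is a minimal sketch, not an NIDDK-CR tool: the function name `unmatched_files` and the extension sets are assumptions chosen for illustration, and matching is done purely on the file stem.

```python
from pathlib import PurePath

# Assumed extensions; adjust to the formats actually being submitted.
FORM_EXTS = {".pdf"}
DATA_EXTS = {".sas7bdat", ".csv"}

def unmatched_files(filenames):
    """Return (forms without a dataset, datasets without a form), matched by file stem."""
    forms, data = set(), set()
    for name in filenames:
        path = PurePath(name)
        ext = path.suffix.lower()
        if ext in FORM_EXTS:
            forms.add(path.stem)
        elif ext in DATA_EXTS:
            data.add(path.stem)
    return sorted(forms - data), sorted(data - forms)

# Hypothetical file listing for illustration only.
files = ["Form02_v2.pdf", "Form02_v2.sas7bdat", "Form03_v1.pdf", "labs.csv"]
forms_only, data_only = unmatched_files(files)
print("Forms with no matching dataset:", forms_only)
print("Datasets with no matching form:", data_only)
```

Datasets that legitimately have no form (such as laboratory or analysis files) will appear in the second list, so the output is a prompt for review rather than a list of errors.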
Study Documentation
Provide study documentation in PDF format, except where noted below. Documentation must include:
- Study protocol
- Manual of Operations/Standard Operating Procedures (MOP/SOP)
- Data dictionary or Codebook in Excel format containing contents for each data set, variable names, variable labels and descriptions, variable types, variable units (as applicable), and value code lists and labels
- Descriptive statistics (means or frequencies) for each variable
- Explanations of how summary or analysis variables are derived from the forms (raw) variables; a log file for the program, or the program used to create the dataset may be provided in lieu of a description
- Value labels (i.e., SAS formats catalog)
- Bibliography of publications (Word format preferred) with a primary outcome publication identified
- SAS programs used for the analyses in publications (these will be kept internal to NIDDK-CR and used for reference)
- Identification of the publication(s) associated with each analysis data set, if applicable
- URL for the public study website, if applicable
- Any additional information about the study or data sets that will facilitate meaningful secondary use of the data
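The descriptive statistics listed above (means or frequencies for each variable) could be generated along the following lines. This is a minimal sketch using only the Python standard library and assuming the data are available as CSV; the function name `describe` and the sample variables (ID, AGE, SEX) are hypothetical. Note that a numerically coded categorical variable would be summarized by its mean here, so in practice the codebook should decide which variables receive frequencies.

```python
import csv
import io
from collections import Counter
from statistics import mean

def describe(csv_text):
    """Return one summary per variable: inferred type, non-missing count,
    and either the mean (numeric) or value frequencies (character)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = []
    for var in reader.fieldnames or []:
        # Treat empty strings and absent cells as missing.
        values = [r[var] for r in rows if r[var] not in ("", None)]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None
        if numbers:
            summary.append({"variable": var, "type": "numeric",
                            "n": len(numbers), "mean": mean(numbers)})
        else:
            summary.append({"variable": var, "type": "character",
                            "n": len(values), "frequencies": dict(Counter(values))})
    return summary

# Hypothetical data for illustration only.
sample = "ID,AGE,SEX\n010-001,54,F\n010-002,61,M\n010-003,,F\n"
for item in describe(sample):
    print(item)
```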
Selected documentation (e.g., protocols, MOPs, data dictionaries/codebooks, case report forms) will be posted on NIDDK-CR website to aid researchers in selecting datasets appropriate for their research. To see the contents and appearance of a typical study listing, click on any listed study at the Study Search Page.
Study Data Collection Forms
- Provide copies of data collection instruments/case report forms in PDF format.
- Form instructions not included in MOP/SOP should be provided as a PDF.
- Identify proprietary instruments or forms (i.e., have an associated cost and/or license for use). Proprietary instruments or forms are not provided to requestors of NIDDK-CR data but may be provided to NIDDK-CR for internal reference.
- Annotate forms with variable names and value labels.
Step 2: Preparation of Study Data
Certain personal information of study participants or of relatives, employers, or household members of the participants must be removed from all items, including data, images, and documentation, before submitting to NIDDK-CR. The following types of information fall into this category and should be removed prior to submitting files to NIDDK-CR.
- Names
- Postal address information other than town or city, state, and ZIP Code
- Telephone numbers
- Fax numbers
- Electronic mail addresses
- Social security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web universal resource locators (URLs)
- Internet protocol (IP) address numbers
- Biometric identifiers, including fingerprints and voiceprints
- Full-face photographic images and any comparable images
Submitted datasets should include all raw data and summary/analysis level files from all study visits, laboratory measurements, study procedures, and outcome elements, along with other final supplemental files (for example, derived or calculated variables), so that users may approximate published results and conduct new analyses. Submitted datasets must be redacted to remove the personal information specified above and any data collected solely for administrative purposes, and must conform to individual informed consent restrictions.
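As an aid to the redaction described above, a simple pattern screen can flag columns that appear to contain some of the listed identifiers. The sketch below is an assumption-laden illustration, not the NIDDK-CR review process: the regular expressions and the function name `flag_identifier_like` are hypothetical, they target only identifiers with a recognizable text pattern (email addresses, social security numbers, US-style telephone or fax numbers, URLs, IP addresses), and they cannot detect names, medical record numbers, or images. A manual review is still needed.

```python
import re

# Assumed patterns for illustration; they will miss some formats and
# may flag harmless values.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_or_fax": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "url": re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def flag_identifier_like(rows):
    """Return {column: sorted pattern names} for columns where any value matches."""
    flagged = {}
    for row in rows:
        for column, value in row.items():
            if not value:
                continue
            for name, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    flagged.setdefault(column, set()).add(name)
    return {column: sorted(names) for column, names in flagged.items()}

# Hypothetical records for illustration only.
rows = [
    {"PID": "010-001", "CONTACT": "jane@example.org", "NOTE": "call 301-555-0100"},
    {"PID": "010-002", "CONTACT": "", "NOTE": "no issues"},
]
print(flag_identifier_like(rows))
```

Free-text fields are the most likely place for such values to appear, so they are worth screening even when the structured variables are clean.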
If redactions beyond the 16 identifiers listed above are performed, the study must provide both the redacted and pre-redacted data to NIDDK-CR with documentation that describes the changes to the data (e.g., README file, program/code). The documentation should provide a complete overview of the data and a description of their use, appropriate for investigators who are not familiar with the dataset. It should also describe significant events that may not be documented in the protocol or other documents and that would be useful for understanding the submitted data. Examples might include addenda describing significant changes in study procedures; cautionary information regarding the interpretation of data elements; explanations of apparent inconsistencies in the data or of frequently missing data; the abandonment of selected data collections from one or more sites; and modifications to questionnaires over time if not documented elsewhere.
Step 3: Submission of Data and Associated Documentation
Upon completion of the preparation of datasets and documentation, these files will be ready for transfer.
Once transferred, NIDDK-CR support staff will review the submission to verify the transferred records and included study data variables, re-generate frequencies for comparison to those generated by the study staff, and review datasets for additional items that may need to be redacted or recoded. Studies that have multiple datasets will be assessed on their ability to be linked to one another. NIDDK-CR support staff will also examine variables contained in multiple datasets, such as Participant ID and Visit, to ensure that they have been formatted consistently across all datasets.
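A submitter can run a similar consistency check on key variables before transfer. The sketch below is one possible approach, not the procedure NIDDK-CR staff use: it reduces each value to a "shape" (digits become 9, letters become A) and compares the shapes found in each dataset. The variable name PARTICIPANT_ID, the dataset names, and the function names are hypothetical. Visit codes are usually better checked against the study's list of valid codes, since different visits may legitimately have different shapes.

```python
import re

def shape(value):
    """Collapse a value to its pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def key_shapes(datasets, var):
    """Return {dataset name: sorted shapes of `var`} for every dataset."""
    return {
        name: sorted({shape(row[var]) for row in rows})
        for name, rows in datasets.items()
    }

def is_consistent(datasets, var):
    """True when every dataset uses the same set of shapes for `var`."""
    shapes = key_shapes(datasets, var)
    return len({tuple(s) for s in shapes.values()}) <= 1

# Hypothetical datasets for illustration only.
datasets = {
    "form02": [{"PARTICIPANT_ID": "010-001"}, {"PARTICIPANT_ID": "010-002"}],
    "labs": [{"PARTICIPANT_ID": "10001"}],
}
print(key_shapes(datasets, "PARTICIPANT_ID"))
print("Consistent:", is_consistent(datasets, "PARTICIPANT_ID"))
```

In this example the hyphenated IDs in one dataset and the unhyphenated IDs in the other would prevent a straightforward merge, which is the kind of problem the review is meant to catch.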
NIDDK-CR shares repository resources with qualified requestors under a Data and Resources Use Agreement (DUA). Requestors must agree to abide by the terms and conditions of the DUA and adhere to its specifications. Each request for resources undergoes a review for scientific merit, feasibility, and appropriateness, and recipients of resources agree not to attempt to identify or contact participants, not to link to other resources not specified in their proposal, and not to share these data with others. The DUA example template can be found on the Information for Requestors using NIDDK-CR Resources for Research (R4R) page under the Helpful Information tab.
Step 4: Submission of a Linking File
Ideally, the same de-identified Participant ID for study participants should be used consistently across study datasets, specimens, and image collections. If a different ID was used for specimens or images, a linkage file is necessary to link the Participant ID to the other assigned (randomized) ID. When different study-related data types (e.g., genomic, omics) are provided to other NIDDK-approved repositories, a linkage file must be provided to NIDDK-CR to map the data provided to NIDDK-CR to the data provided to the other repository(ies). These files will be stored securely and may be used to facilitate the comparison of submitted data to publications, assist in making appropriate resources available to requestors, and support the long-term management of the repository resources.
Specimen Linkage File
If specimens are submitted for storage at NIDDK Biorepository:
- The linkage file should be delivered at the time of study data delivery for completed studies, or at the initial communication between NIDDK-CR and the coordinating unit for ongoing studies.
- Provide a specimen linkage file that uniquely maps each specimen ID to the corresponding Participant ID. For studies where specimens are collected longitudinally over several timepoints, the linkage file should uniquely map each specimen ID to a corresponding Participant ID and timepoint (i.e., study visit).
- Study visits should be represented by unique visit codes. A visit code is a label that identifies the study visit during which the specimen was collected. Specimen collection date is not a substitute for visit code.
- Example: Suppose Study "A" collected longitudinal serum specimens from participants at baseline, week 6, and month 6. A portion of the specimen linkage file might look like:
| Specimen ID | Participant ID | Visit Code | Collection Date |
| --- | --- | --- | --- |
| A-0001 | 010-001 | BV | 01/01/2001 |
| A-0027 | 010-001 | FV01 | 02/10/2001 |
| A-0078 | 010-001 | FV02 | 06/01/2001 |
| A-0002 | 010-002 | BV | 01/01/2002 |
| B-0001 | 020-001 | BV | 01/01/2003 |
| B-0023 | 020-001 | FV01 | 02/10/2003 |
| B-0002 | 020-002 | BV | 01/01/2004 |
| B-0025 | 020-002 | FV01 | 02/10/2004 |
| B-0041 | 020-002 | FV02 | 06/01/2004 |
| I-0001 | 090-001 | BV | 01/01/2003 |
| I-0007 | 090-001 | BV | 01/01/2002 |
- The collection dates are a required component on the manifest that accompanies each shipment of specimens to NIDDK Biorepository as described in the Specimen Submission Label, Manifest, and Shipping Guidance. Additionally, the full collection date may be provided as part of the Specimen Linkage File.
- The specimen linkage file should include a simple table showing the description for each visit code. The visit code description table for the previous example might look like:
| Visit Code | Description |
| --- | --- |
| BV | Baseline Visit (Enrollment) |
| FV01 | Follow-up Visit #1 (Week 6) |
| FV02 | Follow-up Visit #2 (Month 6) |
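The requirements for the specimen linkage file (each specimen ID mapped uniquely to a Participant ID and visit code, with every visit code described) lend themselves to an automated check before delivery. The following is a minimal sketch assuming the linkage file is saved as CSV with the column headers used in the example above; the function name `check_specimen_linkage` is hypothetical, and the last data row is a deliberately flawed record added for illustration.

```python
import csv
import io
from collections import Counter

def check_specimen_linkage(linkage_csv, visit_codes):
    """Report duplicate specimen IDs, visit codes missing from the
    description table, and specimens with no Participant ID."""
    rows = list(csv.DictReader(io.StringIO(linkage_csv)))
    counts = Counter(r["Specimen ID"] for r in rows)
    return {
        "duplicate_specimen_ids": sorted(s for s, n in counts.items() if n > 1),
        "undefined_visit_codes": sorted(
            {r["Visit Code"] for r in rows} - set(visit_codes)
        ),
        "missing_participant_id": sorted(
            r["Specimen ID"] for r in rows if not r["Participant ID"].strip()
        ),
    }

# Visit codes taken from the description table above.
visit_codes = {"BV", "FV01", "FV02"}

# First three rows follow the example; the fourth is a hypothetical bad row.
linkage = """Specimen ID,Participant ID,Visit Code
A-0001,010-001,BV
A-0027,010-001,FV01
A-0078,010-001,FV02
A-0078,010-002,FV09
"""
print(check_specimen_linkage(linkage, visit_codes))
```

The check allows several specimens to share a participant and visit, since only the specimen ID itself needs to be unique.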
Image Linkage File
If images are submitted for storage in NIDDK-CR:
- The linkage file should be delivered at the time of submission of images to NIDDK-CR.
- Provide an image linkage file that maps each Image ID to the corresponding Participant ID, visit, and collection date. Example image linking file below:
| Participant ID | Visit Name | Visit Date | Image Series ID/Accession # | Comments |
| --- | --- | --- | --- | --- |
| P11111 | Screening | 02/07/2010 | 10001 | |
| P11111 | Year 2 | 02/15/2012 | 20001 | |
| P11111 | Year 4 | 02/25/2014 | 30001 | |
| P11111 | Year 6 | 02/20/2016 | 40001 | |
| P11111 | Year 8 | 02/27/2018 | 50001 | No exam - missing images |
| P22222 | Screening | 10/06/2011 | 10111 | |
| P22222 | Year 2 | 10/12/2013 | 20222 | |
Linkage File to Study-related Data Provided to Other NIDDK Approved Repositories
If study-related data (e.g., genomics, sequencing) have been or will be provided to other NIDDK approved repositories such as dbGaP, a linking file must be provided between the IDs submitted to NIDDK-CR and the IDs used at the other NIDDK approved repository (if different). This linking file will be used to facilitate combining data from the two sources for approved analyses. Example linking file below:
| NIDDK-CR Participant ID | dbGaP Participant ID |
| --- | --- |
| 100001 | 210000 |
| 100002 | 210001 |
| 100003 | 210002 |
| 100004 | 210003 |
| 100005 | 210004 |
| 100006 | 210005 |
| 100007 | 210006 |
| 100008 | 210007 |
| 100009 | 210008 |
| 100010 | 210009 |
| 100011 | 210010 |
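Because this linking file is used to combine data from two sources, it can be useful to confirm that the mapping is unambiguous. The sketch below assumes each participant is expected to have exactly one ID in each repository, which this guidance does not state explicitly; the function name `one_to_one_violations` is hypothetical, and the third pair is a deliberately flawed record added for illustration.

```python
from collections import defaultdict

def one_to_one_violations(pairs):
    """Return (NIDDK-CR IDs mapped to several other IDs,
    other-repository IDs mapped to several NIDDK-CR IDs)."""
    forward, backward = defaultdict(set), defaultdict(set)
    for cr_id, other_id in pairs:
        forward[cr_id].add(other_id)
        backward[other_id].add(cr_id)
    return (
        sorted(k for k, v in forward.items() if len(v) > 1),
        sorted(k for k, v in backward.items() if len(v) > 1),
    )

# First two pairs follow the example; the third is a hypothetical bad row.
pairs = [("100001", "210000"), ("100002", "210001"), ("100003", "210001")]
cr_dupes, other_dupes = one_to_one_violations(pairs)
print("NIDDK-CR IDs with multiple matches:", cr_dupes)
print("Other-repository IDs with multiple matches:", other_dupes)
```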