For the NIDDK-CR Data Centric Challenge, NIDDK sought innovative approaches to enhance the utility of NIDDK datasets for AI applications. To that end, the goals of the competition were to 1) generate an “AI-ready” dataset that can be used for future data challenges, and 2) produce methods that can be used to enhance the AI-readiness of NIDDK data. Participants enhanced de-identified data from longitudinal studies focused on Type 1 Diabetes (T1D) that are available through NIDDK-CR: the TEDDY study and four TrialNet studies (TN01, TN16, TN19, and TN20).
Participation in this challenge was tiered based on the challenge applicants’ self-described experience with data science and analytics (i.e., beginner or intermediate/advanced). Participants were instructed to 1) prepare a single dataset by aggregating all data files associated with one or more longitudinal studies on T1D listed above, and 2) augment the single dataset to ensure AI-readiness. One winner from each group below was selected.
Beginner-Level Challenge: The goal for challenge participants was to aggregate the 48+ datasets from the TEDDY study into a single unified and machine-readable dataset harmonized by participant ID (MaskID). Since NIDDK cannot know what other study designs may arise in the future, or what discoveries could be pursued when combining the TEDDY dataset with other datasets, AI-readiness required aggregation of all 48 dataset files into a single tabular (i.e., spreadsheet or rectangular) .csv file type; data enhancement steps that do not meaningfully alter the original data; and preparation of dataset documentation that is both human- and machine-readable.
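The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the method used by any winning team: it assumes each input file has exactly one row per MaskID (longitudinal files with repeated visits would also need a visit key), the file labels and column names are hypothetical, and the records are made-up toy values. Columns are prefixed with their source label so that provenance is kept and the original values are not altered.

```python
import csv
import io


def aggregate_by_maskid(tables):
    """Outer-join several tables on MaskID.

    `tables` maps a hypothetical file label to a list of dict rows, each
    holding a "MaskID" key. Non-key columns are prefixed with the label so
    that same-named columns from different files do not collide.
    Assumes one row per MaskID per table; a later row would overwrite an
    earlier one for the same MaskID.
    """
    merged = {}            # MaskID -> combined row
    columns = ["MaskID"]   # output column order, in first-seen order
    for label, rows in tables.items():
        for row in rows:
            mask_id = row["MaskID"]
            out = merged.setdefault(mask_id, {"MaskID": mask_id})
            for col, value in row.items():
                if col == "MaskID":
                    continue
                name = f"{label}.{col}"
                if name not in columns:
                    columns.append(name)
                out[name] = value
    return columns, [merged[k] for k in sorted(merged)]


def to_csv(columns, rows):
    """Serialize the merged rows as CSV text; missing cells are left empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Toy, invented records standing in for two study files.
demographics = [{"MaskID": "1001", "sex": "F"}, {"MaskID": "1002", "sex": "M"}]
labs = [{"MaskID": "1002", "hba1c": "5.4"}, {"MaskID": "1003", "hba1c": "6.1"}]

columns, rows = aggregate_by_maskid({"demographics": demographics, "labs": labs})
csv_text = to_csv(columns, rows)
```

An outer join is used here so that no participant is dropped just because they are absent from one file; empty cells mark values that were never recorded in the source.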
Intermediate/Advanced-Level Challenge: The goal for challenge participants was to harmonize the four studies listed above within the TrialNet set of studies. Since NIDDK cannot predict the varied ways AI researchers could construct epidemiologic studies using this or any other harmonized TrialNet dataset, these four studies helped to understand the feasibility of data harmonization within TrialNet studies. For data challenge participants, it was important to harmonize study participants by MaskID and to retain those study participants who appear in TN16, TN19, or TN20 but did not also appear in TN01. AI-readiness for this task required data aggregation for all dataset files within each study into a single tabular (i.e., spreadsheet or rectangular) .csv file type; data enhancement steps that do not meaningfully alter the original datasets; and harmonization of study participants across TrialNet studies, which may additionally require de-duplication of records. Challenge participants also needed to prepare dataset documentation that is both human- and machine-readable.
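As a rough sketch of the participant harmonization described above, the following drops exact duplicate records within each study and then builds a union index of MaskIDs with one membership flag per study, so that participants who appear in TN16, TN19, or TN20 but not in TN01 are retained. The function name, the `in_<study>` flag columns, and the toy records are assumptions made for illustration only; real TrialNet files would likely need study-specific de-duplication rules.

```python
def harmonize_participants(studies):
    """Return de-duplicated study rows and a per-participant membership index.

    `studies` maps a study code (e.g. "TN01") to a list of dict rows with a
    "MaskID" key. Only exact duplicate rows are removed. The index is a union
    over all studies, not an intersection, so nobody is dropped for being
    absent from TN01.
    """
    deduped = {}
    for code, rows in studies.items():
        seen = set()
        unique = []
        for row in rows:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the row
            if key not in seen:
                seen.add(key)
                unique.append(row)
        deduped[code] = unique

    all_ids = sorted({r["MaskID"] for rows in deduped.values() for r in rows})
    index = []
    for mask_id in all_ids:
        entry = {"MaskID": mask_id}
        for code, rows in deduped.items():
            entry[f"in_{code}"] = any(r["MaskID"] == mask_id for r in rows)
        index.append(entry)
    return deduped, index


# Toy, invented records; "visit" is a placeholder column.
studies = {
    "TN01": [
        {"MaskID": "A1", "visit": "0"},
        {"MaskID": "A1", "visit": "0"},  # exact duplicate, will be dropped
        {"MaskID": "A2", "visit": "0"},
    ],
    "TN16": [{"MaskID": "A2", "visit": "0"}],
    "TN19": [{"MaskID": "B7", "visit": "0"}],
    "TN20": [],
}

deduped, index = harmonize_participants(studies)
not_in_tn01 = [e["MaskID"] for e in index if not e["in_TN01"]]
```

Keeping explicit membership flags, rather than filtering, leaves the choice of cohort to the downstream researcher.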
The Data Centric Challenge was split into two phases: a registration phase and a competition phase. During Phase 1, interested applicants registered for the Challenge via Challenge.gov. Approved participants then proceeded to Phase 2, where they received access to the data within a secure, NIDDK-provided analytics workbench environment with tools to make the data AI-ready, including SageMaker Jupyter Notebook, Python, TensorFlow, and the R programming language.
Register for a NIDDK-CR account to stay up-to-date on future challenges hosted by the Repository. The NIDDK-CR support staff are also available to answer questions at NIDDK-CRsupport@niddk.nih.gov.
Want to learn more about this Challenge? For details on this completed challenge, including submission requirements, judging criteria, and winners, visit the NIDDK Central Repository Data-Centric Challenge page on Challenge.gov.
Want to access the winning solutions? Scripts developed by the winning teams to generate the AI-ready data are available on GitHub. Click on the winning submission titles under the leaderboard section to access the solution scripts. Note that access to the TEDDY or TrialNet datasets requires a data request to be submitted through NIDDK-CR.
Want to learn more about AI-readiness? Each week during the Data Centric Challenge, NIDDK-CR hosted office hours on AI-readiness, giving challengers and the broader research community an opportunity to learn about tools, models, and approaches for AI-driven research. Recordings and materials from these office hours are available on the NIDDK-CR website.
Click on the submission title to access the winning solution.
Rank | Submission | Score |
---|---|---|
1 | Team (6 members) | 87.37 |
2 | Clustering-ready TEDDY Team (6 members) | 79.75 |
3 | PREPAIRED - Python and R for Easily Preparing AI-Ready Enhanced Datasets Individual | 79.05 |
4 | EMI Advisors NIDDK Submission Team (2 members) | 0.00 |
4 | Predictive Laboratory Analytics Team (4 members) | 0.00 |
Click on the submission title to access the winning solution.
Rank | Submission | Score |
---|---|---|
1 | Team (6 members) | 94.49 |
2 | Longhorn NIDDK Data Roundup Team (2 members) | 89.01 |
3 | Aggregation and Harmonization of Data Harmonization of Data Across Four Type 1 Diabetes TrialNet Studies (TN01, TN16, TN19, and TN20) Team (10 members) | 87.50 |
4 | Type 1 Diabetes Studies AI Preparing - A Data Fusion Challenge Team (2 members) | 0.00 |
4 | Enhancing NIDDK Datasets for Future Artificial Intelligence Applications Individual | 0.00 |