Data Sets and Data Access Units
Key Concepts
Data Access Unit (DAU)
A St. Jude Cloud Data Access Unit (DAU) is a grouping of data that typically corresponds to a project, study, or Data Set generated at the same time at the same institution. Each DAU has its own governing body of researchers, the Data Access Committee, who preside over the data and who may grant or deny access. Each Data Access Committee is responsible for only one DAU and has its own protocols for approving access to their DAU. Please contact us if you have questions about committee approval protocols. We currently have 7 DAUs: Pediatric Cancer Genome Project (PCGP), St. Jude Lifetime Cohort Study (SJLIFE), Clinical Genomics, Sickle Cell Genome Project (SGP), Childhood Cancer Survivor Study (CCSS), Pan-Acute Lymphoblastic Leukemia (PanALL), and Clinical Research in ALS and Related Disorders for Therapeutic Development Consortium (CReATe). See below for a brief description of each DAU. For a more detailed description please see the respective Schedule 1(s).
See the list of Data Access Units.
Data Set
A St. Jude Cloud Data Set is a grouping of data that has been curated by St. Jude and can correspond to a study, project, or specific disease. Data Sets are available for free to researchers, and access to a Data Set can be requested from the Data Browser. However, access is not granted at the Data Set level, but rather at the Data Access Unit level. A Data Set may belong to a single DAU, or to multiple DAUs if it contains data that came from different groups. An approved Data Access Request (DAR) grants access to a particular Data Access Unit, which includes specific Data Sets that can be selected from the Data Browser. An approved DAR gives access not only to the data selected at the time of the request, but also to any additional data included in the approved DAUs. Only the data initially selected will be vended to a project folder upon approval, but returning to the Data Browser and selecting additional data that falls under the approved DAUs will not require another Data Access Request.
Data Access Committee (DAC)
A St. Jude Cloud Data Access Committee (DAC) is a group of St. Jude researchers who oversee access to a particular Data Access Unit (DAU) and evaluate incoming data requests.
The first time you request access to files in a DAU, you must fill out a Data Access Agreement (DAA). Access is granted at the DAU level based on the decision of each DAC upon reviewing the DAA.
For example, if you make a request asking for all of St. Jude's Acute Lymphoblastic Leukemia sequencing data, you might be asking for data from multiple different projects (DAUs) here at St. Jude. For the sake of the example, let's say the data you want is spread across three different Data Sets and two DAUs. Once you place a request, your application will be routed to the corresponding two data access committees for approval. Since each DAC is made up of different individuals using different criteria for evaluation, you may or may not be approved for access to all of the files.
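The routing in this example can be sketched in a few lines of Python. This is only an illustration: the Data Set and DAU names are placeholders, the mapping is hypothetical, and the rule that a Data Set is fully accessible only when all of its DAUs are approved is an assumption made for the sketch, not a documented St. Jude Cloud behavior or API.

```python
# Hypothetical mapping of Data Sets to the DAUs that govern their files.
# All names are placeholders, not real St. Jude Cloud identifiers.
DATA_SET_TO_DAUS = {
    "DataSetA": {"DAU1"},
    "DataSetB": {"DAU1", "DAU2"},
    "DataSetC": {"DAU2"},
}


def daus_for_request(selected_data_sets):
    """Return the set of DAUs (one DAC each) that a request touches."""
    daus = set()
    for name in selected_data_sets:
        daus |= DATA_SET_TO_DAUS[name]
    return daus


def fully_accessible(selected_data_sets, approved_daus):
    """Return the selected Data Sets whose DAUs were all approved.

    Assumption: a Data Set spanning several DAUs is only fully
    available once every one of those DAUs has been approved.
    """
    approved = set(approved_daus)
    return sorted(
        name
        for name in selected_data_sets
        if DATA_SET_TO_DAUS[name] <= approved
    )


request = ["DataSetA", "DataSetB", "DataSetC"]
# Three Data Sets, two DAUs: the request goes to two committees.
print(sorted(daus_for_request(request)))    # ['DAU1', 'DAU2']
# If only the DAC for DAU1 approves, only part of the request is covered.
print(fully_accessible(request, {"DAU1"}))  # ['DataSetA']
```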
Embargo Date
The Embargo Date specifies the date on which the publishing embargo on the file in question is lifted.
Publishing using any of the files before the embargo date has passed is strictly prohibited as outlined in section 1.15 of the Data Access Agreement (DAA).
Some Data, including Data funded by the NIH, are not subject to embargo.
Applicable Embargo Dates can be found in Genomics Platform Metadata in the SJ_Embargo_Date column.
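As a rough illustration, the check below lists the files whose embargo has already passed. Only the `SJ_Embargo_Date` column name comes from this page. The tab-delimited layout, the `file_name` column, the ISO `YYYY-MM-DD` date format, and the reading of an empty value as "no embargo" are all assumptions made for the sketch, so adjust them to match the metadata you actually receive.

```python
import csv
import io
from datetime import date, datetime

# Made-up metadata for illustration only (assumed tab-delimited with a header).
SAMPLE_METADATA = (
    "file_name\tSJ_Embargo_Date\n"
    "sample_1.bam\t2020-01-01\n"
    "sample_2.bam\t2999-12-31\n"
    "sample_3.bam\t\n"
)


def publishable_files(metadata_text, today):
    """Return file names whose embargo date has passed as of `today`."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    cleared = []
    for row in reader:
        raw = (row["SJ_Embargo_Date"] or "").strip()
        if not raw:
            # Assumption: an empty value means the file is not under embargo.
            cleared.append(row["file_name"])
            continue
        embargo = datetime.strptime(raw, "%Y-%m-%d").date()
        # Conservative reading: the embargo date itself still counts as embargoed.
        if embargo < today:
            cleared.append(row["file_name"])
    return cleared


print(publishable_files(SAMPLE_METADATA, date(2024, 1, 1)))
# ['sample_1.bam', 'sample_3.bam']
```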
List of DAUs
We currently have seven Data Access Units (DAUs), summarized in the table below. Basic clinical data is available for relevant subjects in each DAU. Click on the name below to navigate directly to that DAU's Study page for more detailed information. The Data Sets included in each DAU are listed after the table; note that some Data Sets are a part of multiple DAUs.
| DAU | Focus | Data Type |
|---|---|---|
| CCSS | Long-term outcomes in childhood cancer survivors | Germline WGS |
| Clinical Genomics | Variants influencing childhood tumor development | Paired tumor-normal WGS, WES, RNA-Seq |
| CReATe | ALS and related disorders | WGS |
| PanALL | ALL subtypes across the age continuum | Tumor-only RNA-Seq |
| PCGP | Genetic origins of pediatric cancer | Paired tumor-normal WGS, WES, RNA-Seq |
| SGP | Genetic modifiers in Sickle Cell Disease | Germline WGS |
| SJLIFE | Long-term adverse outcomes of cancer therapy | Germline WGS, WES |
Childhood Cancer Survivor Study (CCSS)
CCSS consists of germline-only whole genome sequencing samples of childhood cancer survivors. The following data set(s) are included within CCSS:
Clinical Genomics
Clinical Genomics contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on identifying variants that influence the development and behavior of childhood tumors. The following data set(s) are included within Clinical Genomics:
- CICERO Benchmark
- Childhood Solid Tumor Network (CSTN)
- Clinical Pilot
- Genome 4 Kids (G4K)
- Genomics and Transcriptomics of Relapsed Pediatric AML (RPAML)
- Pediatric Brain Tumor Program (PBTP)
- Pediatric therapy-related Myeloid Neoplasms (tMN)
- Real-Time Clinical Genomics (RTCG)
Clinical Research in ALS and Related Disorders for Therapeutic Development Consortium (CReATe)
CReATe contains whole genome sequencing data with paired phenotypic data, focused on studying patients with amyotrophic lateral sclerosis (ALS) or a related disorder. The following data set(s) are included in CReATe:
Pan-Acute Lymphoblastic Leukemia (PanALL)
PanALL contains tumor-only RNA-Seq data focused on the spectrum of ALL subtypes from a variety of contributing sources. The following data set(s) are included within PanALL:
Pediatric Cancer Genome Project (PCGP)
PCGP contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on discovering the genetic origins of pediatric cancer. The following data set(s) are included within PCGP:
- CICERO Benchmark
- Childhood Solid Tumor Network (CSTN)
- Clinical Pilot
- DMG-H3K27a Clonal Evolution (H3K27A_EVOLUTION)
- Genome 4 Kids (G4K)
- Genomics and Transcriptomics of Relapsed Pediatric AML (RPAML)
- Landscape of Pediatric Acute Myeloid Leukemia (PanpAML)
- Medulloblastoma Preclinical Ribociclib and Gemcitabine (MBPRG)
- Medulloblastoma Preclinical Ribociclib and Paxalisib (MBPRP)
- Pan-Acute Lymphoblastic Leukemia (PanALL)
- Pediatric Acute Myeloid Leukemia (PedAML)
- Pediatric Brain Tumor Program (PBTP)
- Pediatric Cancer Genome Project (PCGP)
- Pediatric therapy-related Myeloid Neoplasms (tMN)
- Atypical Teratoid/Rhabdoid Tumor-derived Tumoroid Models (ATRT_TM)
- Real-Time Clinical Genomics (RTCG)
Sickle Cell Genome Project (SGP)
SGP contains germline-only whole genome sequencing data of Sickle Cell Disease patients from birth to young adulthood. The following data set(s) are included within SGP:
St. Jude Life (SJLIFE)
SJLIFE contains germline-only whole genome and whole exome sequencing data focused on studying the long-term adverse outcomes associated with cancer and cancer-related therapy. The following data set(s) are included within SJLIFE:
List of Data Sets
We currently have 21 Data Sets, listed below. For each Data Set, you can also see which Data Access Units (DAUs) it belongs to, its tissue type, sequencing type, number of samples, additional links, and a brief description.
| Data Set | DAU(s) | Tissue Type | Sequencing | Samples |
|---|---|---|---|---|
| ATRT_TM | PCGP | — | WES, WGS, RNA-Seq | 8 |
| CCSS | CCSS | Germline Only | WGS | 2,912 |
| CICERO Benchmark | PCGP, Clinical Genomics | Paired Tumor-Normal | RNA-Seq | 124 |
| Clinical Pilot | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 155 |
| CReATe | CReATe | PBMC Germline DNA | WGS | 705 |
| CSTN | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 143 |
| G4K | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 571 |
| H3K27A_EVOLUTION | PCGP | — | WGS, WES | 70 |
| MBPRG | PCGP | — | RNA-Seq | 70 |
| MBPRP | PCGP | — | RNA-Seq | 39 |
| PanALL | PCGP, PanALL | Paired Tumor-Normal | RNA-Seq | 735 |
| PanpAML | PCGP | — | WES, WGS, RNA-Seq | 272 |
| PBTP | PCGP, Clinical Genomics | — | WES, WGS, RNA-Seq | 97 |
| PCGP | PCGP | Paired Tumor-Normal | WGS, WES, RNA-Seq | 3,031 |
| PedAML | PCGP | — | WES, WGS, RNA-Seq | 275 |
| RPAML | PCGP, Clinical Genomics | — | WGS, RNA-Seq | 265 |
| RTCG | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 2,371 |
| SGP | SGP | Germline Only | WGS | 807 |
| SJLIFE | SJLIFE | Germline Only | WGS, WES | 4,838 |
| SJLIFE_ClonalHematopoiesis | SJLIFE | — | SingleCell-WGS, Targeted | 3,192 |
| tMN | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 206 |
Atypical Teratoid/Rhabdoid Tumor-derived Tumoroid Models
DAU: PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 8
The ATRT-TM dataset comprises atypical teratoid/rhabdoid tumors (ATRT) of the Sonic hedgehog (ATRT-SHH) and Myc (ATRT-MYC) subgroups. ATRT-SHH and ATRT-MYC patient-derived orthotopic xenografts (PDOX) were used to generate pre-clinical in vitro tumoroid models. The key objective of this dataset is to validate the tumoroid models by comparing them to their parental PDOX at the molecular level, including gene alterations (whole genome/whole exome sequencing), gene expression (RNA-seq), and DNA methylation (Illumina EPIC array). The findings of the project were published in Oncogene.
Childhood Cancer Survivor Study
DAU: CCSS | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 2,912 | Additional Information About CCSS
Childhood Cancer Survivor Study (CCSS) is a germline-only Data Set consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970 and 1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of this sequencing of CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.
Samples for the Childhood Cancer Survivor Study were collected by sending out buccal swab kits to enrolled participants and having them complete the kits at home. This mechanism of collecting saliva and buccal cells for sequencing is highly desirable because of its non-invasive nature and ease of execution. However, collecting samples in this manner also carries a higher probability of contamination from external sources (as compared to, say, samples collected using blood). We have observed some samples in this cohort which suffer from bacterial contamination. To address this issue, we have taken the following steps:
- We have estimated the bacterial contamination rate and annotated each of the samples in the CCSS cohort. For each sample, you will find the estimated contamination rate in the Description field of the SAMPLE_INFO.txt file that is vended with your data (and as a property on the DNAnexus file). For information on this field, see the Metadata specification.
- Using this estimated contamination rate, we have removed 82 samples which exhibited large rates of bacterial contamination.
- For the remaining samples, we have provided the BAM file as aligned with bwa mem with default parameters. We have observed that there are instances of reads originating from bacterial contamination that are erroneously mapped to the human genome and display a very low mapping quality. Please be advised that we have kept these reads as they were aligned and have not yet made any attempt to unmap these reads. Any analysis you perform on these samples will need to take this into account!
- Last, we will be working over the coming months to unmap the reads originating from bacterial contamination and release updated BAM files along with the associated gVCF files from Microsoft Genomics Service.
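One possible way to account for the low-mapping-quality reads described above is to drop alignments below a mapping quality (MAPQ) threshold before analysis. The sketch below works on SAM-format text lines, where MAPQ is the fifth tab-separated column. The threshold of 20 is an arbitrary value chosen for illustration, not a St. Jude recommendation, and filtering on MAPQ will also remove genuinely human reads that map ambiguously.

```python
def filter_low_mapq(sam_lines, min_mapq=20):
    """Yield SAM header lines and alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) of a SAM alignment record is MAPQ.
        if int(fields[4]) >= min_mapq:
            yield line


# Made-up records for illustration: read1 has MAPQ 60, read2 has MAPQ 3.
example = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
]
kept = list(filter_low_mapq(example))
print(len(kept))  # 2 (the header and read1)
```

On real BAM files the same kind of filter is available directly in samtools via `samtools view -q 20`, which skips alignments with a MAPQ below the given value.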
Childhood Solid Tumor Network
DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 143 | Additional Information About CSTN
The Childhood Solid Tumor Network (CSTN) is a St. Jude Children's Research Hospital initiative to disseminate its childhood solid tumor resources and data. The raw Data Sets from this initiative are made available via St. Jude Cloud.
CICERO Benchmark
DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 124
The CICERO Data Set contains the samples which were selected for use in the CICERO Paper.
Clinical Pilot
DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 155 | Additional Information About Clinical Genomics
The Clinical Pilot project was a retrospective study that evaluated the accuracy and demonstrated the feasibility of three-platform sequencing in a CAP/CLIA setting. The findings of this project were published in Nature Communications.
Clinical Research in ALS and Related Disorders for Therapeutic Development Consortium
DAU: CReATe | Tissue Type: PBMC Germline DNA | Sequencing Type: WGS | Samples: 705
The Phenotype-Genotype-Biomarker (PGB, or PGB1) study (NCT02327845) of the Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe) Consortium was a natural history and biomarker study of patients with amyotrophic lateral sclerosis (ALS) or a related disorder, including but not limited to ALS-frontotemporal dementia (ALS-FTD), progressive muscular atrophy (PMA), primary lateral sclerosis (PLS), hereditary spastic paraplegia (HSP), and multisystem proteinopathy (MSP). In addition to patients enrolled in the PGB1 Cohort (primary participants), the study also enrolled family members for limited data collection (secondary participants). This dataset includes WGS data from N=705 in PGB1, including N=472 ALS/ALS-FTD, N=20 PMA, N=47 PLS, N=162 HSP, and N=4 with other related disorders. The findings of the project were published in Translational Neurodegeneration.
DMG-H3K27a Clonal Evolution
DAU: PCGP | Tissue Type: — | Sequencing Type: WGS, WES | Samples: 70
The primary purpose of the DMG-H3K27a Clonal Evolution (H3K27A_EVOLUTION) project is to understand how clonal evolution contributes to tumor invasive spread. The study performed exome sequencing and SNP array profiling on 49 multi-region autopsy samples from 11 patients with pontine DMG-H3 K27-a enrolled in a phase I clinical trial of PDGFR inhibitor crenolanib. Additional objectives include deconvoluting subclonal composition and prevalence at each tumor region to study convergent evolution and invasion patterns. For more information see: http://permalinks.stjude.cloud/permalinks/h3k27a_evolution
Genome 4 Kids
DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 571 | Additional Information About Clinical Genomics
The goal of the Genomes 4 Kids (G4K) prospective study was to determine whether the three-platform sequencing protocol laid out in the Clinical Pilot project could generate results on a clinical timeline in practice and to evaluate the prevalence of actionable findings. The study concluded with just over 300 patients, and the publication is currently in review.
Genomics and Transcriptomics of Relapsed Pediatric AML (RPAML)
DAU: Clinical Genomics, PCGP | Tissue Type: — | Sequencing Type: RNA-Seq, WGS | Samples: 265
The primary purpose of the Relapsed Pediatric AML Dataset (RPAML) is to identify the tumor-acquired (somatic) genome sequence and structural variants in pediatric AML at the time of disease relapse. Additional objectives include the acquisition and analysis of additional genomic data, including gene expression data, mutational signatures, and germline variants that may predispose to AML or other bone marrow disorders. The findings of the project were published in Blood Cancer Discovery.
Landscape of Pediatric Acute Myeloid Leukemia (PanpAML)
DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-Seq, WGS, WES | Samples: 272
Recent studies on pediatric acute myeloid leukemia (pAML) have revealed pediatric-specific driver alterations, many of which are underrepresented in the current classification schemas. The PanpAML study systematically categorized 887 pAML cases into 23 mutually distinct molecular categories, including new major entities such as UBTF or BCL11B, covering 91.4% of the cohort. These molecular categories were associated with unique expression profiles and mutational patterns, and were strongly associated with clinical outcomes, leading to the establishment of a new prognostic framework for pAML based on updated molecular categories and minimal residual disease. The findings of the project were published in Nature Genetics.
Medulloblastoma Preclinical Ribociclib and Gemcitabine (MBPRG)
DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-Seq | Samples: 70
The MBPRG dataset comprises medulloblastoma group 3 (G3 MB) patient-derived orthotopic xenografts (PDOX) and mouse G3 MB tumor models. Both human (PDOX) and mouse tumor models were treated with either ribociclib (CDK4/6 inhibitor), gemcitabine (metabolic inhibitor of DNA synthesis), or the combination of these two drugs in comparison to control (vehicle). The key objective of this dataset is to evaluate the impact of this treatment and identify perturbation of gene expression/pathways at the transcriptional level in G3 MB.
Medulloblastoma Preclinical Ribociclib and Paxalisib (MBPRP)
DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-Seq | Samples: 39
The MBPRP dataset comprises medulloblastoma group 3 (G3 MB) and medulloblastoma Sonic hedgehog (SHH MB) patient-derived orthotopic xenografts (PDOX). These human tumor models were treated with either ribociclib (CDK4/6 inhibitor), paxalisib (PI3K/mTOR inhibitor), or the combination of these two drugs in comparison to control (vehicle). The key objective of this dataset is to validate the synergistic effect of the combination treatment observed in vitro, and evaluate the impact of these treatments on gene expression/pathways at the transcriptional level in MB.
Pan-Acute Lymphoblastic Leukemia
DAU: PCGP, PanALL | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 735
Pan-Acute Lymphoblastic Leukemia (PanALL) comprises cases of B-progenitor and T-lineage ALL encompassing the spectrum of ALL subtypes across the age continuum. Samples sequenced were obtained from multiple sites, centers and cooperative groups including St. Jude Children's Research Hospital, The Children's Oncology Group, The Alliance – Cancer and Leukemia Group B, the Eastern Cooperative Oncology Group, The Southwestern Oncology group, MD Anderson Cancer Center, City of Hope National Medical Center, Princess Margaret Cancer Center, Northern Italy Leukemia Group, and UKALL.
Pediatric Acute Myeloid Leukemia (PedAML)
DAU: PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 275
The primary purpose of the Pediatric AML (PedAML) Data Set is to identify the genome sequence and structural variants that define the different molecular subtypes of pediatric AML (pAML). Additional objectives include, but are not limited to, the acquisition and analysis of additional genomic data, including gene expression data and patterns of mutational cooperativity.
Pediatric Brain Tumor Program
DAU: Clinical Genomics, PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 97
The Pediatric Brain Tumor Portal (PBTP) is organized by the St. Jude Children's Research Hospital Neurobiology and Brain Tumor Program. Investigators have access to specialized resources, such as an integrated support structure for preclinical modeling, including patient-derived xenograft samples. The program consists of clinicians, radiation oncologists, neurobiologists, medicinal chemists, and other research faculty and staff. PBTP features molecular characterization for patient-derived orthotopic xenograft (PDOX) models of pediatric CNS tumors and reflects close to 10 years of effort to generate and extensively characterize in vivo models that faithfully recapitulate pediatric brain cancer diseases. The portal offers visualization tools that allow users to interrogate curated datasets and access models from our library of PDOX for functional studies of tumorigenesis or preclinical testing. The findings of the project were published in Acta Neuropathol.
Pediatric Cancer Genome Project
DAU: PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 3,031 | Additional Information About PCGP
The Pediatric Cancer Genome Project (PCGP) is a collaboration between St. Jude Children's Research Hospital and the McDonnell Genome Institute at Washington University School of Medicine that sequenced the genomes of over 600 pediatric cancer patients.
Pediatric therapy-related Myeloid Neoplasms (tMN)
DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 206 | Additional Information About tMN
The primary purpose of the Pediatric therapy-related Myeloid Neoplasms (tMN) study is to define the genomic alterations in therapy-related myeloid neoplasms in children. The objective of the study was to define the somatic and germline alterations that drive tMN in children, using WGS, WES, and/or RNA-seq. The dataset is a mixture of paired tumor-normal samples and normal-only samples.
Real-time Clinical Genomics
DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 2,371 | Additional Information About Clinical Genomics
Real-time Clinical Genomics (RTCG) is a first-of-its-kind initiative, whereby St. Jude began releasing data from the clinical NGS service that has been consented for research use to St. Jude Cloud in monthly batches, giving researchers access to valuable data as quickly as possible.
Sickle Cell Genome Project
DAU: SGP | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 807 | Additional Information About SGP
SGP is a germline-only Data Set of Sickle Cell Disease (SCD) patients from birth to young adulthood. The Sickle Cell Genome Project (SGP) is a collaboration between St. Jude Children's Research Hospital and Baylor College of Medicine focused on identifying genetic modifiers that contribute to various health complications in SCD patients. Additional objectives include, but are not limited to, developing accurate methods for blood typing and for characterizing germline structural variants in the highly homologous globin locus.
St. Jude Life
DAU: SJLIFE | Tissue Type: Germline Only | Sequencing Type: WGS, WES | Samples: 4,838 | Additional Information About SJLIFE
St. Jude Lifetime (SJLIFE) is a longevity study from St. Jude Children's Research Hospital that aims to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy. This cohort contains unpaired germline samples and does not contain tumor samples.
St. Jude Life Clonal Hematopoiesis
DAU: SJLIFE | Tissue Type: — | Sequencing Type: SingleCell-WGS, Targeted | Samples: 3,192
The primary purpose of the St. Jude Lifetime Cohort Study (SJLIFE) Clonal Hematopoiesis dataset is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy. Additional objectives include, but are not limited to, the acquisition and analysis of additional genomic data, including epigenetic and gene expression data, data integration, and the development and validation of informatic and analytical solutions appropriate to the scale and nature of the project, as well as use of the data generated to answer important methodological and biological questions as specifically related to childhood malignancies.