Data Sets and Data Access Units

Key Concepts

Data Access Unit (DAU)

A St. Jude Cloud Data Access Unit (DAU) is a grouping of data that typically corresponds to a project, study, or Data Set generated at the same time at the same institution. Each DAU has its own governing body of researchers, the Data Access Committee, which presides over the data and may grant or deny access. Each Data Access Committee is responsible for only one DAU and has its own protocols for approving access to its DAU. Please contact us if you have questions about committee approval protocols. We currently have seven DAUs: Pediatric Cancer Genome Project (PCGP), St. Jude Lifetime Cohort Study (SJLIFE), Clinical Genomics, Sickle Cell Genome Project (SGP), Childhood Cancer Survivor Study (CCSS), Pan-Acute Lymphoblastic Leukemia (PanALL), and Clinical Research in ALS and Related Disorders for Therapeutic Development Consortium (CReATe). See below for a brief description of each DAU. For a more detailed description, please see the respective Schedule 1(s).

See the list of Data Access Units.

Data Set

A St. Jude Cloud Data Set is a grouping of data that has been curated by St. Jude and can correspond to a study, project, or specific disease. Data Sets are available free of charge to researchers, and access to a Data Set can be requested from the Data Browser. However, access is granted not at the Data Set level but at the Data Access Unit level. A Data Set may belong to a single DAU, or to several DAUs if it contains data that came from different groups. An approved Data Access Request (DAR) grants access to a particular Data Access Unit, which includes specific Data Sets that can be selected from the Data Browser. An approved DAR gives access not only to the data selected at the time of the request, but also to any additional data included in the approved DAUs. Only the data initially selected will be vended to a project folder upon approval, but returning to the Data Browser and selecting additional data that falls under the approved DAUs will not require another Data Access Request.

See the list of Data Sets.

Data Access Committee (DAC)

A St. Jude Cloud Data Access Committee (DAC) is a group of St. Jude researchers who oversee access to a particular Data Access Unit (DAU) and evaluate incoming data requests.

The first time you request access to files in a DAU, you must fill out a Data Access Agreement (DAA). Access is granted at the DAU level, based on each DAC's decision after reviewing the DAA.

Example
If you make a request for all of St. Jude's Acute Lymphoblastic Leukemia sequencing data, you might be asking for data from multiple different projects (DAUs) here at St. Jude. For the sake of the example, let's say the data you want is spread across three different Data Sets and two DAUs. Once you place a request, your application will be routed to the two corresponding Data Access Committees for approval. Since each DAC is made up of different individuals using different criteria for evaluation, you may or may not be approved for access to all of the files.
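The routing described in this example can be sketched in a few lines of Python. This is only an illustration: the mapping copies four rows from the List of Data Sets, the function names are hypothetical, and the sketch assumes that a Data Set spanning several DAUs needs approval from every one of them, which the text above does not state explicitly.

```python
# Data Set -> DAU membership, copied from a few rows of the List of Data Sets.
DATA_SET_DAUS = {
    "PCGP": {"PCGP"},
    "CSTN": {"PCGP", "Clinical Genomics"},
    "Clinical Pilot": {"PCGP", "Clinical Genomics"},
    "SJLIFE": {"SJLIFE"},
}


def required_daus(selected):
    """Return every DAU touched by the selected Data Sets."""
    daus = set()
    for name in selected:
        daus |= DATA_SET_DAUS[name]
    return daus


def daus_needing_request(selected, approved):
    """Return the DAUs that are not yet approved for this selection.

    Assumption: a Data Set in several DAUs needs all of them approved.
    """
    return required_daus(selected) - set(approved)


# Three Data Sets that fall under two DAUs, as in the example above.
selection = ["PCGP", "CSTN", "Clinical Pilot"]
print(sorted(required_daus(selection)))  # ['Clinical Genomics', 'PCGP']
```

Once both DAUs are approved, `daus_needing_request` returns an empty set for any later selection that stays within them, which mirrors the rule that no further Data Access Request is needed.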

Embargo Date

The Embargo Date is the date on which the publishing embargo on a given file is lifted. Publishing using any of the files before the embargo date has passed is strictly prohibited, as outlined in section 1.15 of the Data Access Agreement (DAA). Some Data, including Data funded by the NIH, are not subject to embargo. Applicable Embargo Dates can be found in the Genomics Platform Metadata in the SJ_Embargo_Date column.
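As a rough sketch, you could check a metadata export for files that are still under embargo before you publish. Only the SJ_Embargo_Date column name comes from this page. The tab-separated layout, the `file_name` column, the ISO `YYYY-MM-DD` date format, and the use of an empty cell to mean "not subject to embargo" are all assumptions made for illustration. The check is deliberately conservative and treats the embargo date itself as still embargoed.

```python
import csv
import io
from datetime import date

# Hypothetical metadata export; only SJ_Embargo_Date is a documented column.
SAMPLE_METADATA = (
    "file_name\tSJ_Embargo_Date\n"
    "a.bam\t2020-01-15\n"
    "b.bam\t2999-01-01\n"
    "c.bam\t\n"
)


def embargoed_files(metadata_text, today):
    """Return the names of files whose embargo date has not yet passed."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    blocked = []
    for row in reader:
        raw = (row.get("SJ_Embargo_Date") or "").strip()
        if not raw:
            # Assumption: an empty cell means the file has no embargo.
            continue
        if date.fromisoformat(raw) >= today:
            blocked.append(row["file_name"])
    return blocked


print(embargoed_files(SAMPLE_METADATA, date(2024, 1, 1)))  # ['b.bam']
```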


List of DAUs

We currently have seven Data Access Units (DAUs), listed below. Basic clinical data is available for relevant subjects in each DAU. Click on a DAU's name to navigate directly to its Study page for more detailed information. The Data Sets included in each DAU are listed under its description; note that some Data Sets are part of multiple DAUs.

| DAU | Focus | Data Type |
| --- | --- | --- |
| CCSS | Long-term outcomes in childhood cancer survivors | Germline WGS |
| Clinical Genomics | Variants influencing childhood tumor development | Paired tumor-normal WGS, WES, RNA-Seq |
| CReATe | ALS and related disorders | WGS |
| PanALL | ALL subtypes across the age continuum | Tumor-only RNA-Seq |
| PCGP | Genetic origins of pediatric cancer | Paired tumor-normal WGS, WES, RNA-Seq |
| SGP | Genetic modifiers in Sickle Cell Disease | Germline WGS |
| SJLIFE | Long-term adverse outcomes of cancer therapy | Germline WGS, WES |

Childhood Cancer Survivor Study (CCSS)

CCSS consists of germline-only whole genome sequencing samples of childhood cancer survivors. The following data set(s) are included within CCSS:

Clinical Genomics

Clinical Genomics contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on identifying variants that influence the development and behavior of childhood tumors. The following data set(s) are included within Clinical Genomics:

Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe)

CReATe contains whole genome sequencing data with paired phenotypic data, focused on studying patients with amyotrophic lateral sclerosis (ALS) or a related disorder. The following data set(s) are included within CReATe:

Pan-Acute Lymphoblastic Leukemia (PanALL)

PanALL contains tumor-only RNA-Seq data focused on the spectrum of ALL subtypes from a variety of contributing sources. The following data set(s) are included within PanALL:

Pediatric Cancer Genome Project (PCGP)

PCGP contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on discovering the genetic origins of pediatric cancer. The following data set(s) are included within PCGP:

Sickle Cell Genome Project (SGP)

SGP contains germline-only whole genome sequencing data of Sickle Cell Disease patients from birth to young adulthood. The following data set(s) are included within SGP:

St. Jude Lifetime Cohort Study (SJLIFE)

SJLIFE contains germline-only whole genome and whole exome sequencing data focused on studying the long-term adverse outcomes associated with cancer and cancer-related therapy. The following data set(s) are included within SJLIFE:


List of Data Sets

We currently have 21 Data Sets, listed below. Each entry also shows which Data Access Units (DAUs) the Data Set belongs to, its tissue type, sequencing type, number of samples, additional links, and a brief description.

| Data Set | DAU(s) | Tissue Type | Sequencing | Samples |
| --- | --- | --- | --- | --- |
| ATRT_TM | PCGP | | WES, WGS, RNA-Seq | 8 |
| CCSS | CCSS | Germline Only | WGS | 2,912 |
| CICERO Benchmark | PCGP, Clinical Genomics | Paired Tumor-Normal | RNA-Seq | 124 |
| Clinical Pilot | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 155 |
| CReATe | CReATe | PBMC Germline DNA | WGS | 705 |
| CSTN | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 143 |
| G4K | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 571 |
| H3K27A_EVOLUTION | PCGP | | WGS, WES | 70 |
| MBPRG | PCGP | | RNA-Seq | 70 |
| MBPRP | PCGP | | RNA-Seq | 39 |
| PanALL | PCGP, PanALL | Paired Tumor-Normal | RNA-Seq | 735 |
| PanpAML | PCGP | | WES, WGS, RNA-Seq | 272 |
| PBTP | PCGP, Clinical Genomics | | WES, WGS, RNA-Seq | 97 |
| PCGP | PCGP | Paired Tumor-Normal | WGS, WES, RNA-Seq | 3,031 |
| PedAML | PCGP | | WES, WGS, RNA-Seq | 275 |
| RPAML | PCGP, Clinical Genomics | | WGS, RNA-Seq | 265 |
| RTCG | PCGP, Clinical Genomics | Paired Tumor-Normal | WGS, WES, RNA-Seq | 2,371 |
| SGP | SGP | Germline Only | WGS | 807 |
| SJLIFE | SJLIFE | Germline Only | WGS, WES | 4,838 |
| SJLIFE_ClonalHematopoiesis | SJLIFE | | SingleCell-WGS, Targeted | 3,192 |
| tMN | PCGP | Paired Tumor-Normal | WGS, WES, RNA-Seq | 206 |

Atypical Teratoid/Rhabdoid Tumor-derived Tumoroid Models

DAU: PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 8

The ATRT-TM dataset comprises atypical teratoid/rhabdoid tumors (ATRT) of the Sonic hedgehog (ATRT-SHH) and Myc (ATRT-MYC) subgroups. ATRT-SHH and ATRT-MYC patient-derived orthotopic xenografts (PDOX) were used to generate pre-clinical in vitro tumoroid models. The key objective of this dataset is to validate the tumoroid models by comparing them to their parental PDOX at the molecular level, including gene alterations (whole genome/whole exome sequencing), gene expression (RNA-seq), and DNA methylation (Illumina EPIC array). The findings of the project were published in Oncogene.

Childhood Cancer Survivor Study

DAU: CCSS | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 2,912 | Additional Information About CCSS

Childhood Cancer Survivor Study (CCSS) is a germline-only Data Set consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970 and 1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of sequencing CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and the occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.

CCSS: Potential Bacterial Contamination
Samples for the Childhood Cancer Survivor Study were collected by sending buccal swab kits to enrolled participants, who completed the kits at home. This way of collecting saliva and buccal cells for sequencing is highly desirable because it is non-invasive and easy to carry out. However, samples collected in this manner also have a higher probability of contamination from external sources than, for example, samples collected from blood. We have observed some samples in this cohort that suffer from bacterial contamination. To address this issue, we have taken the following steps:
  1. We have estimated the bacterial contamination rate and annotated each of the samples in the CCSS cohort. For each sample, you will find the estimated contamination rate in the Description field of the SAMPLE_INFO.txt file that is vended with your data (and as a property on the DNAnexus file). For information on this field, see the Metadata specification.
  2. Using this estimated contamination rate, we have removed 82 samples that exhibited high rates of bacterial contamination.
  3. For the remaining samples, we have provided the BAM file as aligned with bwa mem using default parameters. We have observed instances of reads originating from bacterial contamination that are erroneously mapped to the human genome and display a very low mapping quality. Please be advised that we have kept these reads as they were aligned and have not yet made any attempt to unmap them. Any analysis you perform on these samples will need to take this into account!
  4. Last, we will be working over the coming months to unmap the reads originating from bacterial contamination and release updated BAM files along with the associated gVCF files from Microsoft Genomics Service.
If you have any questions about the nature or implications of this warning, please contact us at support@stjude.cloud.
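One possible way to take the low-mapping-quality reads into account is to drop alignments below a mapping quality (MAPQ) cutoff before analysis. The sketch below does this on SAM-formatted text lines, where the FLAG is the second column and MAPQ is the fifth. It is an illustration only, not a St. Jude recommendation: the cutoff of 20 is an arbitrary assumption, the function names are hypothetical, and a MAPQ filter will also discard genuine human reads in repetitive regions.

```python
MIN_MAPQ = 20  # illustrative cutoff (assumption), tune for your analysis


def keep_alignment(sam_line, min_mapq=MIN_MAPQ):
    """Return True for header lines and for mapped reads with MAPQ >= min_mapq."""
    if sam_line.startswith("@"):
        return True  # keep SAM header lines untouched
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    is_unmapped = bool(flag & 0x4)  # 0x4 marks an unmapped read
    return not is_unmapped and mapq >= min_mapq


def filter_sam(lines, min_mapq=MIN_MAPQ):
    """Return only the lines that pass keep_alignment."""
    return [line for line in lines if keep_alignment(line, min_mapq)]


example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = filter_sam(example)
print(len(kept))  # 2: the header line and read1
```

For BAM files, the same kind of cutoff is usually applied with an existing tool rather than hand-written parsing, for example the `-q` option of `samtools view`.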

Childhood Solid Tumor Network

DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 143 | Additional Information About CSTN

The Childhood Solid Tumor Network (CSTN) is a St. Jude Children's Research Hospital initiative to disseminate its childhood solid tumor resources and data. The raw Data Sets from this initiative are made available via St. Jude Cloud.

CICERO Benchmark

DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 124

The CICERO Data Set contains the samples which were selected for use in the CICERO Paper.

Clinical Pilot

DAU: PCGP, Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 155 | Additional Information About Clinical Genomics

The Clinical Pilot project was a retrospective study that evaluated the accuracy and demonstrated the feasibility of three-platform sequencing in a CAP/CLIA setting. The findings of this project were published in Nature Communications.

Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe)

DAU: CReATe | Tissue Type: PBMC Germline DNA | Sequencing Type: WGS | Samples: 705

The Phenotype-Genotype-Biomarker (PGB, or PGB1) study (NCT02327845) of the Clinical Research in ALS and Related Disorders for Therapeutic Development (CReATe) Consortium was a natural history and biomarker study of patients with amyotrophic lateral sclerosis (ALS) or a related disorder, including but not limited to ALS-frontotemporal dementia (ALS-FTD), progressive muscular atrophy (PMA), primary lateral sclerosis (PLS), hereditary spastic paraplegia (HSP), and multisystem proteinopathy (MSP). In addition to patients enrolled in the PGB1 Cohort (primary participants), the study also enrolled family members for limited data collection (secondary participants). This dataset includes WGS data from N=705 in PGB1, including N=472 ALS/ALS-FTD, N=20 PMA, N=47 PLS, N=162 HSP, and N=4 with other related disorders. The findings of the project were published in Translational Neurodegeneration.

DMG-H3K27a Clonal Evolution

DAU: PCGP | Tissue Type: — | Sequencing Type: WGS, WES | Samples: 70

The primary purpose of the DMG-H3K27a Clonal Evolution (H3K27A_EVOLUTION) project is to understand how clonal evolution contributes to tumor invasive spread. The study performed exome sequencing and SNP array profiling on 49 multi-region autopsy samples from 11 patients with pontine DMG-H3 K27-a enrolled in a phase I clinical trial of PDGFR inhibitor crenolanib. Additional objectives include deconvoluting subclonal composition and prevalence at each tumor region to study convergent evolution and invasion patterns. For more information see: http://permalinks.stjude.cloud/permalinks/h3k27a_evolution

Genomes 4 Kids

DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 571 | Additional Information About Clinical Genomics

The goal of the Genomes 4 Kids (G4K) prospective study was to determine whether the three-platform sequencing protocol laid out in the Clinical Pilot project could generate results on a clinical timeline in practice and to evaluate the prevalence of actionable findings. The study concluded with just over 300 patients, and the publication is currently in review.

Genomics and Transcriptomics of Relapsed Pediatric AML (RPAML)

DAU: Clinical Genomics, PCGP | Tissue Type: — | Sequencing Type: RNA-seq, WGS | Samples: 265

The primary purpose of the Relapsed Pediatric AML Dataset (RPAML) is to identify the tumor-acquired (somatic) genome sequence and structural variants in pediatric AML at the time of disease relapse. Additional objectives include the acquisition and analysis of additional genomic data, including gene expression data, mutational signatures, and germline variants that may predispose to AML or other bone marrow disorders. The findings of the project were published in Blood Cancer Discovery.

Landscape of Pediatric Acute Myeloid Leukemia (PanpAML)

DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-seq, WGS, WES | Samples: 272

Recent studies on pediatric acute myeloid leukemia (pAML) have revealed pediatric-specific driver alterations, many of which are underrepresented in the current classification schemas. The PanpAML study systematically categorized 887 pAML cases into 23 mutually distinct molecular categories, including new major entities such as UBTF or BCL11B, covering 91.4% of the cohort. These molecular categories were associated with unique expression profiles and mutational patterns, and were strongly associated with clinical outcomes, leading to the establishment of a new prognostic framework for pAML based on updated molecular categories and minimal residual disease. The findings of the project were published in Nature Genetics.

Medulloblastoma Preclinical Ribociclib and Gemcitabine (MBPRG)

DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-Seq | Samples: 70

The MBPRG dataset comprises medulloblastoma group 3 (G3 MB) patient-derived orthotopic xenografts (PDOX) and mouse G3 MB tumor models. Both human (PDOX) and mouse tumor models were treated with either ribociclib (CDK4/6 inhibitor), gemcitabine (metabolic inhibitor of DNA synthesis), or the combination of these two drugs in comparison to control (vehicle). The key objective of this dataset is to evaluate the impact of this treatment and identify perturbation of gene expression/pathways at the transcriptional level in G3 MB.

Medulloblastoma Preclinical Ribociclib and Paxalisib (MBPRP)

DAU: PCGP | Tissue Type: — | Sequencing Type: RNA-Seq | Samples: 39

The MBPRP dataset comprises medulloblastoma group 3 (G3 MB) and medulloblastoma Sonic hedgehog (SHH MB) patient-derived orthotopic xenografts (PDOX). These human tumor models were treated with either ribociclib (CDK4/6 inhibitor), paxalisib (PI3K/mTOR inhibitor), or the combination of these two drugs in comparison to control (vehicle). The key objective of this dataset is to validate the synergistic effect of the combination treatment observed in vitro, and evaluate the impact of these treatments on gene expression/pathways at the transcriptional level in MB.

Pan-Acute Lymphoblastic Leukemia

DAU: PanALL | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 735

Pan-Acute Lymphoblastic Leukemia (PanALL) comprises cases of B-progenitor and T-lineage ALL encompassing the spectrum of ALL subtypes across the age continuum. Samples sequenced were obtained from multiple sites, centers and cooperative groups including St. Jude Children's Research Hospital, The Children's Oncology Group, The Alliance – Cancer and Leukemia Group B, the Eastern Cooperative Oncology Group, The Southwestern Oncology group, MD Anderson Cancer Center, City of Hope National Medical Center, Princess Margaret Cancer Center, Northern Italy Leukemia Group, and UKALL.

Pediatric Acute Myeloid Leukemia (PedAML)

DAU: PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 275

The primary purpose of the Pediatric AML (PedAML) Data Set is to identify the genome sequence and structural variants that define the different molecular subtypes of pediatric AML (pAML). Additional objectives include, but are not limited to, the acquisition and analysis of additional genomic data, including gene expression data and patterns of mutational cooperativity.

Pediatric Brain Tumor Program

DAU: Clinical Genomics, PCGP | Tissue Type: — | Sequencing Type: WES, WGS, RNA-Seq | Samples: 97

The Pediatric Brain Tumor Portal (PBTP) is organized by the St. Jude Children's Research Hospital Neurobiology and Brain Tumor Program. Investigators have access to specialized resources, such as an integrated support structure for preclinical modeling, including patient-derived xenograft samples. The program consists of clinicians, radiation oncologists, neurobiologists, medicinal chemists, and other research faculty and staff. PBTP features molecular characterization for patient-derived orthotopic xenograft (PDOX) models of pediatric CNS tumors and reflects close to 10 years of effort to generate and extensively characterize in vivo models that faithfully recapitulate pediatric brain cancer diseases. The portal offers visualization tools that allow users to interrogate curated datasets and access models from our library of PDOX for functional studies of tumorigenesis or preclinical testing. The findings of the project were published in Acta Neuropathol.

Pediatric Cancer Genome Project

DAU: PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 3,031 | Additional Information About PCGP

The Pediatric Cancer Genome Project (PCGP) is a collaboration between St. Jude Children's Research Hospital and the McDonnell Genome Institute at Washington University School of Medicine that sequenced the genomes of over 600 pediatric cancer patients.

Pediatric therapy-related Myeloid Neoplasms (tMN)

DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 206 | Additional Information About tMN

The primary purpose of the Pediatric therapy-related Myeloid Neoplasms (tMN) study is to define the genomic alterations in therapy-related myeloid neoplasms in children. The objective of the study was to define the somatic and germline alterations using WGS, WES and/or RNA-seq that drive tMN in children. The dataset is a mixture of paired tumor-normal samples or normal-only samples.

Real-time Clinical Genomics

DAU: Clinical Genomics, PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 2,371 | Additional Information About Clinical Genomics

Real-time Clinical Genomics (RTCG) is a first-of-its-kind initiative in which St. Jude releases data from the clinical NGS service, consented for research use, to St. Jude Cloud in monthly batches, giving researchers access to valuable data as quickly as possible.

Sickle Cell Genome Project

DAU: SGP | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 807 | Additional Information About SGP

SGP is a germline-only Data Set of Sickle Cell Disease (SCD) patients from birth to young adulthood. The Sickle Cell Genome Project (SGP) is a collaboration between St. Jude Children's Research Hospital and Baylor College of Medicine focused on identifying genetic modifiers that contribute to various health complications in SCD patients. Additional objectives include, but are not limited to, developing accurate methods to characterize germline structural variants in the highly homologous globin locus and to perform blood typing.

St. Jude Life

DAU: SJLIFE | Tissue Type: Germline Only | Sequencing Type: WGS, WES | Samples: 4,838 | Additional Information About SJLIFE

St. Jude Lifetime (SJLIFE) is a longevity study from St. Jude Children's Research Hospital that aims to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy. This cohort contains unpaired germline samples and does not contain tumor samples.

St. Jude Life Clonal Hematopoiesis

DAU: PCGP | Tissue Type: — | Sequencing Type: SingleCell-WGS, Targeted | Samples: 3,192

The primary purpose of the St. Jude Lifetime Cohort Study (SJLIFE) Clonal Hematopoiesis dataset is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy. Additional objectives include, but are not limited to, the acquisition and analysis of additional genomic data, including epigenetic and gene expression data, data integration, and the development and validation of informatic and analytical solutions appropriate to the scale and nature of the project, as well as use of the data generated to answer important methodological and biological questions as specifically related to childhood malignancies.