Authors: Yu Liu1,2,*,#, Chunliang Li3,*, Shuhong Shen1,4,*, Xiaolong Chen2, Karol Szlachta2, Michael N. Edmonson2, Ying Shao2, Xiaotu Ma2, Judith Hyle3, Shaela Wright3, Bensheng Ju2, Michael C. Rusch2, Yanling Liu2, Benshang Li1,4, Michael Macias2, Liqing Tian2, John Easton2, Maoxiang Qian5, Jun J. Yang5,6,7, Shaoyan Hu8, A. Thomas Look9,10 and Jinghui Zhang2,#
Publication: "Discovery of regulatory non-coding variants in individual cancer genomes using cis-X" (in submission)
Technical Support: Contact Us

Overview

Activating regulatory variants usually cause cis-activation of their target genes. To find cis-activated genes, cis-X combines two signals: allele-specific expression (ASE, also described as allelic imbalance) and outlier high expression (OHE). It then searches for variants in the same topologically associating domain (TAD) as each candidate gene, including structural variants (SV), copy number aberrations (CNA), single nucleotide variants (SNV), and insertions/deletions (indel).

A transcription factor binding analysis is also done, using motifs from HOCOMOCO v10 models.

cis-X currently only works with hg19 (GRCh37).

Inputs

Name | Type | Description | Example
Sample ID | String | The ID of the input sample | SJALL018373_D1
Disease subtype | String | The disease name under analysis. Must be either NBL or TALL. | TALL
Single nucleotide variants | File | Tab-delimited file containing raw sequence variants | *.txt
CNV/LOH regions | File | Tab-delimited file containing any aneuploidy region existing in the tumor genome under analysis | *.txt
RNA-Seq BAM | File | BAM file aligned to hg19 (GRCh37) | *.bam
RNA-Seq BAM index | File | BAM index for the given BAM | *.bam.bai
Gene expression table | File | Tab-delimited file containing gene level expressions for the tumor under analysis in FPKM | *.txt
Somatic SNV/indels | File | Tab-delimited file containing somatic SNV/indels in the tumor genome | *.txt
Somatic SVs | File | Tab-delimited file containing somatically acquired structural variants in the tumor genome | *.txt
Somatic CNVs | File | Tab-delimited file containing copy number aberrations in the tumor genome | *.txt
CNV/LOH action | String | The behavior when handling markers in CNV/LOH regions. Can be either keep or drop. Default: keep | drop
Minimum coverage for WGS | Integer | The minimum coverage in WGS to be included in the analysis. Default: 10 | 10
Minimum coverage for RNA-Seq | Integer | The minimum coverage in RNA-Seq to be included in the analysis. Default: 10 | 5
Candidate FPKM threshold | Float | The FPKM threshold for the nomination of a cis-activated candidate. Default: 5.0 | 0.1
User annotations | File | User-applied annotations (optional) | *.bed
chr Prefix | String | Whether the names in the reference sequence dictionary are prefixed with "chr". Must be either TRUE or FALSE. Default: TRUE | TRUE
TAD annotations | File | TAD annotations (optional) | *.bed
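
The allowed values in the table above can be checked before a run is launched. The following is a minimal sketch and not part of cis-X: the class name `CisXParams` and its field names are hypothetical, and the defaults are as read from the Description column (the RNA-Seq coverage default is assumed to be 10).

```python
from dataclasses import dataclass

# Allowed values as stated in the input table.
VALID_DISEASES = {"NBL", "TALL"}
VALID_CNV_LOH_ACTIONS = {"keep", "drop"}
VALID_CHR_PREFIX = {"TRUE", "FALSE"}


@dataclass
class CisXParams:
    """Hypothetical holder for the non-file inputs; not a cis-X API."""

    sample_id: str
    disease: str
    cnv_loh_action: str = "keep"
    min_coverage_wgs: int = 10
    min_coverage_rna: int = 10  # assumed default, see the note above
    fpkm_threshold: float = 5.0
    chr_prefix: str = "TRUE"

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented choices."""
        if not self.sample_id:
            raise ValueError("sample_id must not be empty")
        if self.disease not in VALID_DISEASES:
            raise ValueError(f"disease must be one of {sorted(VALID_DISEASES)}")
        if self.cnv_loh_action not in VALID_CNV_LOH_ACTIONS:
            raise ValueError("cnv_loh_action must be 'keep' or 'drop'")
        if self.chr_prefix not in VALID_CHR_PREFIX:
            raise ValueError("chr_prefix must be 'TRUE' or 'FALSE'")
        if self.min_coverage_wgs < 0 or self.min_coverage_rna < 0:
            raise ValueError("coverage cutoffs must be non-negative")
        if self.fpkm_threshold < 0:
            raise ValueError("fpkm_threshold must be non-negative")


# Values taken from the Example column of the table.
params = CisXParams("SJALL018373_D1", "TALL", cnv_loh_action="drop",
                    min_coverage_rna=5, fpkm_threshold=0.1)
params.validate()
```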

Input file configuration

cis-X requires six tab-delimited input files to be prepared in advance. These files can be uploaded via the command line.

Note
Even though CNV/LOH regions, somatic SNV/indels, somatic SVs, and somatic CNVs can be "empty", using such inputs will produce results with a much higher false positive rate.

Single Nucleotide Variants

The single nucleotide variants file is a tab-delimited list of single nucleotide markers with the following columns:

  • Chr: chromosome name for the marker
  • Pos: genomic start location for the marker
  • Chr_Allele: reference allele
  • Alternative_Allele: alternative allele
  • reference_tumor_count: reference allele count in the tumor genome
  • alternative_tumor_count: alternative allele count in the tumor genome
  • reference_normal_count: reference allele count in the matched normal genome
  • alternative_normal_count: alternative allele count in the matched normal genome

This file can be generated with Bambino.

Example
Chr | Pos | Chr_Allele | Alternative_Allele | reference_tumor_count | alternative_tumor_count | reference_normal_count | alternative_normal_count
chr1161396TT03010
chr1172981T1323
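
A markers file with the columns listed above can be sanity-checked before upload. This is a rough sketch only: the function name `read_markers` is hypothetical, the rows are made up for illustration, and applying the coverage cutoff to the tumor read counts is an assumption rather than documented cis-X behavior.

```python
import csv
import io

MARKER_COLUMNS = [
    "Chr", "Pos", "Chr_Allele", "Alternative_Allele",
    "reference_tumor_count", "alternative_tumor_count",
    "reference_normal_count", "alternative_normal_count",
]


def read_markers(handle, min_coverage=10):
    """Yield (chrom, pos, tumor alternative-allele fraction) for markers
    whose tumor read depth reaches min_coverage."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in MARKER_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        ref = int(row["reference_tumor_count"])
        alt = int(row["alternative_tumor_count"])
        depth = ref + alt
        if depth >= min_coverage:
            yield row["Chr"], int(row["Pos"]), alt / depth


# Made-up rows for illustration only; they are not taken from real data.
demo = io.StringIO(
    "\t".join(MARKER_COLUMNS) + "\n"
    "chr1\t1000\tA\tG\t12\t8\t20\t0\n"
    "chr1\t2000\tC\tT\t3\t2\t15\t0\n"
)
kept = list(read_markers(demo))
```

Here the second made-up marker is skipped because its tumor depth (5 reads) is below the cutoff of 10.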

CNV/LOH regions

The CNV/LOH regions are all the genomic regions carrying copy number variations (CNV) or loss of heterozygosity (LOH). Markers in these regions are kept or filtered out during analysis according to the CNV/LOH action input.

This is a tab-delimited file in BED format. It must have at least the following three columns:

  • chrom: chromosome name
  • loc.start: genomic start location
  • loc.end: genomic end location

If no CNV/LOH are in the genome under analysis, a file with no rows (but including headers) can be provided.

This file can be generated with CONSERTING.

Example
chrom | loc.start | loc.end | Sample | seg.mean | LogRatio | source
chr91071237855747SJALL018373_D10.471181417LOH
chr92027690120703900SJALL018373_D1-0.978-5.696CNV
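
The effect of the CNV/LOH action input on markers can be pictured as follows. This is an illustrative sketch, not the cis-X implementation: the function names are hypothetical, the coordinates are made up, and treating loc.start and loc.end as an inclusive interval is an assumption.

```python
def in_any_region(chrom, pos, regions):
    """Return True if (chrom, pos) falls inside any (chrom, start, end) region.
    Both ends are treated as inclusive (an assumption)."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def apply_cnv_loh_action(markers, regions, action="keep"):
    """Keep every marker, or drop those that overlap a CNV/LOH region."""
    if action not in ("keep", "drop"):
        raise ValueError("action must be 'keep' or 'drop'")
    if action == "keep":
        return list(markers)
    return [m for m in markers if not in_any_region(m[0], m[1], regions)]


# Made-up coordinates for illustration only.
regions = [("chr9", 1000, 5000)]
markers = [("chr9", 1500), ("chr9", 9000), ("chr1", 1500)]

kept_all = apply_cnv_loh_action(markers, regions, action="keep")
kept_outside = apply_cnv_loh_action(markers, regions, action="drop")
```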

Gene expression table

The gene expression table is a tab-delimited file containing gene level expressions for the tumor under analysis. The expressions are in FPKM (fragments per kilobase of transcript per million mapped reads).

  • GeneID: gene Ensembl ID
  • GeneName: gene symbol
  • Type: transcript type
  • Status: transcript status (must be KNOWN, NOVEL, or PUTATIVE)
  • Chr: chromosome name
  • Start: genomic start location
  • End: genomic end location
  • [SampleID...]: FPKM for the given sample

This file can be generated from the output of HTSeq-count, preprocessed with mergeData_geneName.pl (available with the distribution of cis-X). The data must be able to match values in the given gene-specific reference expression matrices generated from a larger cohort.

Example
GeneID | GeneName | Type | Status | Chr | Start | End | SJALL018373_D1
ENSG00000261122.2 | 5S_rRNA | lincRNA | NOVEL | chr16 | 34977639 | 34990886 | 0.0000
ENSG00000249352.3 | 7SK | lincRNA | NOVEL | chr5 | 68266266 | 68325992 | 4.5937
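
To see which genes in an expression table would clear a given Candidate FPKM threshold, a small filter like the one below can help. It is a sketch under assumptions: the function name is hypothetical, the comparison is taken to be "greater than or equal to", and the two rows are transcribed from the example above as best they can be read.

```python
import csv
import io


def genes_above_threshold(handle, sample_id, threshold=5.0):
    """Return (GeneName, FPKM) pairs whose FPKM is at least the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    if sample_id not in (reader.fieldnames or []):
        raise ValueError(f"no FPKM column named {sample_id!r}")
    hits = []
    for row in reader:
        fpkm = float(row[sample_id])
        if fpkm >= threshold:
            hits.append((row["GeneName"], fpkm))
    return hits


# Rows transcribed from the example table (interpretation of the fused text).
table = (
    "GeneID\tGeneName\tType\tStatus\tChr\tStart\tEnd\tSJALL018373_D1\n"
    "ENSG00000261122.2\t5S_rRNA\tlincRNA\tNOVEL\tchr16\t34977639\t34990886\t0.0000\n"
    "ENSG00000249352.3\t7SK\tlincRNA\tNOVEL\tchr5\t68266266\t68325992\t4.5937\n"
)

default_hits = genes_above_threshold(io.StringIO(table), "SJALL018373_D1")
relaxed_hits = genes_above_threshold(io.StringIO(table), "SJALL018373_D1", 0.1)
```

With the default threshold of 5.0 neither gene qualifies; with the example threshold of 0.1, 7SK does.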

Somatic SNV/indels

This is a tab-delimited file containing somatic sequence mutations present in the genome under analysis. It includes both single nucleotide variants (SNV) and small insertion/deletions (indel). The file must have the following columns:

  • chr: chromosome name
  • pos: genomic start location
  • ref: reference nucleotide
  • mutant: mutant nucleotide
  • type: mutation type (must be either snv or indel)

Note that the coordinate used for an indel is after the inserted sequence.

If no SNV/indels are in the sample under analysis, a file with no rows (but including headers) can be provided.

This file can be created with Bambino and then preprocessed using the steps described in "The genetic basis of early T-cell precursor acute lymphoblastic leukemia".

Example
chr | pos | ref | mut | type
chr124782720GAsnv
chr1182896176TCsnv

Somatic SVs

This is a tab-delimited file containing somatically acquired structural variants (SV) in the cancer genome. The file must have the following columns:

  • chrA: chromosome name of the left breakpoint
  • posA: genomic location of the left breakpoint
  • ortA: strand orientation of the left breakpoint
  • chrB: chromosome name of the right breakpoint
  • posB: genomic location of the right breakpoint
  • ortB: strand orientation of the right breakpoint

Strand orientations are denoted with a + for a sense or coding strand and a - for an antisense or non-coding strand.

If no somatic SVs are in the sample under analysis, a file with no rows (but including headers) can be provided.

This file can be generated by CREST.

Example
chrA | posA | ortA | chrB | posB | ortB | type
chr1133913169+chr7142494049-CTX
chr1164219334+chr2205042527-CTX

Somatic CNVs

This is a tab-delimited file containing the genomic regions with somatically acquired copy number aberrations (CNA) in the cancer genome.

  • chr: chromosome name
  • start: genomic start location
  • end: genomic end location
  • logR: log2 ratio

If no somatic CNVs are in the sample under analysis, a file with no rows (but including headers) can be provided.

This file can be generated by CONSERTING.

Example
chr | start | end | logR
chr9 | 20276901 | 20703900 | -5.696
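
When a sample has no CNV/LOH regions, somatic SNV/indels, SVs, or CNVs, the header-only files described above can be written with a few lines of code. This is a sketch with assumptions: the file names and the function name are hypothetical, and the SNV/indel header follows the example row (`mut`), while the column list calls the same field `mutant`, so confirm which spelling your run expects.

```python
import csv
import tempfile
from pathlib import Path

# Column headers as listed in the sections above; file names are hypothetical.
EMPTY_INPUT_HEADERS = {
    "cnv_loh.txt": ["chrom", "loc.start", "loc.end"],
    "somatic_snvindel.txt": ["chr", "pos", "ref", "mut", "type"],
    "somatic_sv.txt": ["chrA", "posA", "ortA", "chrB", "posB", "ortB"],
    "somatic_cnv.txt": ["chr", "start", "end", "logR"],
}


def write_header_only_inputs(outdir):
    """Write one tab-delimited file per input, containing only the header row."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, header in EMPTY_INPUT_HEADERS.items():
        path = outdir / name
        with path.open("w", newline="") as fh:
            csv.writer(fh, delimiter="\t", lineterminator="\n").writerow(header)
        paths.append(path)
    return paths


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_header_only_inputs(tmp)
    written_names = [p.name for p in written]
    first_file_text = written[0].read_text()
```

As the note earlier in this guide says, empty inputs are accepted but lead to a much higher false positive rate.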

Outputs

Name | Description
cis-activated candidates | cis-activated candidates in the tumor genome under analysis
SV candidates | Structural variant (SV) candidates predicted to be causal for the cis-activated genes in the regulatory territory
CNA candidates | Copy number aberrations (CNA) predicted to be causal for the cis-activated genes in the regulatory territory
SNV/indel candidates | SNV/indel candidates predicted to be functional, with the predicted transcription factors
OHE results | Raw outlier high expression (OHE) results
Gene level ASE results | Raw gene level allelic specific expression (ASE) results
Single marker ASE results | Raw single marker allelic specific expression (ASE) results

Creating a workspace

Before you can run one of our workflows, you must first create a workspace in DNAnexus for the run. Refer to the general workflow guide to learn how to create a DNAnexus workspace for each workflow run.

You can navigate to the Cis-X workflow page here.

Uploading Input Files

cis-X requires a total of eight files to be uploaded as input: the six tab-delimited files described above, plus the RNA-Seq BAM and its index.

Refer to the general workflow guide to learn how to upload input files to the workspace you just created.

Running the Workflow

Refer to the general workflow guide to learn how to launch the workflow, hook up input files, adjust parameters, start a run, and monitor run progress.

Analysis of Results

Each tool in St. Jude Cloud produces a visualization that makes results easier to understand than working with Excel spreadsheets or tab-delimited files. This is the primary way we recommend you work with your results.

Refer to the general workflow guide to learn how to access these visualizations.

We also include the raw output files for you to dig into if the visualization is not sufficient to answer your research question.

Refer to the general workflow guide to learn how to access raw results files.

Interpreting results

cis-activated candidates

The main result file contains the cis-activated candidates in the tumor genome under analysis.

  • gene: gene accession number (RefSeq ID)
  • gsym: gene symbol
  • chrom: chromosome name
  • strand: strand orientation
  • start: genomic start location
  • end: genomic end location
  • cdsStartStat: coding sequence (CDS) start status
  • cdsEndStat: coding sequence (CDS) end status
  • markers: number of heterozygous markers in this gene
  • ase_markers: number of heterozygous markers showing allelic specific expressions (ASE)
  • average_ai_all: average B-allele frequency (BAF) difference between RNA and DNA for all heterozygous markers
  • average_ai_ase: average BAF difference between RNA and DNA for ASE markers
  • pval_all_markers: p-value for each marker in the ASE test
  • pval_ase_markers: p-value for ASE markers in the ASE test
  • ai_all_markers: BAF difference between RNA and DNA for all heterozygous markers
  • ai_ase_markers: BAF difference between RNA and DNA for ASE markers
  • comb.pval: combined p-value for the ASE test
  • mean.delta: average BAF difference between RNA and DNA for all markers
  • rawp: raw p-value for the ASE test
  • Bonferroni: adjusted p-value for the ASE test (single-step Bonferroni)
  • ABH: adjusted p-value for the ASE test (Benjamini-Hochberg)
  • FPKM: FPKM value
  • loo.source: which reference expression matrix was used in the outlier high expression (OHE) test
  • loo.cohort.size: number of cases in the reference expression matrix for this gene
  • loo.pval: p-value of the OHE test
  • loo.rank: rank of the case under analysis among the reference cases
  • imprinting.status: imprinting status of the gene
  • candidate.group: status of the gene, combining both ASE and outlier tests

Strand orientations are denoted with a + for a sense or coding strand and a - for an antisense or non-coding strand.

Coding sequence status is typically one of "none" (not specified), "unk" (unknown), "incmpl" (incomplete), or "cmpl" (complete).

Example
gene | gsym | chrom | strand | start | end | cdsStartStat | cdsEndStat | markers | ase_markers | average_ai_all | average_ai_ase | pval_all_markers | pval_ase_markers | ai_all_markers | ai_ase_markers | comb.pval | mean.delta | rawp | Bonferroni | ABH | FPKM | loo.source | loo.cohort.size | loo.pval | loo.rank | imprinting.status | candidate.group
NM_145804 | ABTB2 | chr11 | - | 34172533 | 34379555 | cmpl | cmpl | 5 | 5 | 0.5 | 0.500 | 0.001953125,0.001953125,0.001953125,6.10351562500001e-05,0.000244140625 | 0.001953125,0.001953125,0.001953125,6.10351562500001e-05,0.000244140625 | 0.5,0.5,0.5,0.5,0.5 | 0.5,0.5,0.5,0.5,0.5 | 0.000644290972057077 | 0.5 | 0.000644290972057077 | 0.632049443587993 | 0.0110866672927557 | 7.6776 | bi_cohort | 40 | 0.0367241086505276 | 1 |  | ase_outlier
NM_003189 | TAL1 | chr1 | - | 47681961 | 47698007 | cmpl | cmpl | 2 | 2 | 0.482 | 0.482 | 6.66361745922277e-28,3.30872245021211e-24 | 6.66361745922277e-28,3.30872245021211e-24 | 0.464912280701754,0.5 | 0.464912280701754,0.5 | 4.69553625126628e-26 | 0.482456140350877 | 4.69553625126628e-26 | 4.60632106249222e-23 | 6.11761294450693e-24 | 8.8168 | white_list | 167 | 0.0139385771987089 | 1 |  | ase_outlier
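
The Bonferroni and ABH columns are standard multiple-testing adjustments of rawp. The sketch below shows the textbook single-step Bonferroni and Benjamini-Hochberg procedures in plain Python; it is not taken from the cis-X source, and the function names are illustrative. As a cross-check, in both example rows above the Bonferroni value equals rawp multiplied by 981, which is what a single-step Bonferroni correction over 981 tested genes would give.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: multiply each p-value by the number of tests,
    capping the result at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)

# Cross-check against the ABTB2 example row: rawp * 981 vs. Bonferroni.
abtb2_check = 0.000644290972057077 * 981
```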

SV candidates

Structural variant (SV) candidates are the SVs predicted to be causal for the cis-activated genes in the regulatory territory.

  • left.candidate.inTAD: cis-activated candidate near the left breakpoint
  • right.candidate.inTAD: cis-activated candidate near the right breakpoint
  • chrA: chromosome name of the left breakpoint
  • posA: genomic location of the left breakpoint
  • ortA: strand orientation of the left breakpoint
  • chrB: chromosome name of the right breakpoint
  • posB: genomic location of the right breakpoint
  • ortB: strand orientation of the right breakpoint
  • type: type of translocation

Example
left.candidate.inTAD | right.candidate.inTAD | chrA | posA | ortA | chrB | posB | ortB | type
LMO2 |  | chr11 | 33913169 | + | chr7 | 142494049 | - | CTX

CNA candidates

Copy number aberration (CNA) candidates are the CNAs predicted to be causal for the cis-activated genes in the regulatory territory.

  • candidate.inTAD: cis-activated candidate by the CNA
  • chr: chromosome name
  • start: genomic start position
  • end: genomic end location
  • logR: log ratio of the CNA

SNV/indel candidates

SNV/indel candidates are the mutations predicted to be functional, together with the transcription factors predicted to bind at each mutation. The mutations are also annotated with known regulatory elements reported by the NIH Roadmap Epigenomics Project, collected from 111 cell lines.

  • chrom: chromosome name
  • pos: genomic start position
  • ref: reference allele genotype
  • mut: mutant allele genotype
  • type: mutation type (either snv or indel)
  • target: cis-activated candidate
  • dist: distance between the mutation and transcription start sites of the target gene
  • tf: transcription factors predicted to have the binding motif introduced by the mutation
  • EpiRoadmap_enhancer: enhancer regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)
  • EpiRoadmap_promoter: promoter regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)
  • EpiRoadmap_dyadic: dyadic regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)

Example
chrom | pos | ref | mut | type | target | dist | tf | EpiRoadmap_enhancer | EpiRoadmap_promoter | EpiRoadmap_dyadic
chr147696311CTsnvTAL11696BCL11A,CEBPG,PBX2,YY1,ZBTB4Brain,Digestive,ES-deriv,ESC,HSC & B-cell,Heart,Muscle,Other,Sm. Muscle,iPSC
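
The dist column can be reproduced from a gene's coordinates and strand. The sketch below assumes the conventional definition of a transcription start site (the start coordinate on the + strand, the end coordinate on the - strand); the function names are hypothetical. With the TAL1 coordinates from the cis-activated candidates example (strand -, end 47698007) and the mutation position 47696311, it gives the dist value shown in the example row.

```python
def transcription_start_site(strand, start, end):
    """Return the TSS coordinate: start on the + strand, end on the - strand."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return start if strand == "+" else end


def distance_to_tss(pos, strand, start, end):
    """Absolute distance in bases between a mutation and the gene's TSS."""
    return abs(pos - transcription_start_site(strand, start, end))


# TAL1 coordinates and mutation position as read from the example rows.
tal1_dist = distance_to_tss(47696311, "-", 47681961, 47698007)
```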

OHE results

OHE results are the raw results for the outlier expression test.

  • Gene: gene symbol
  • fpkm.raw: FPKM value
  • size.bi: number of cases in the bi-allelic reference cohort
  • p.bi: p-value in the outlier test using the bi-allelic reference cohort
  • rank.bi: rank of the expression level in the case under analysis compared to the bi-allelic reference cohort
  • size.cohort: number of cases in the entire reference cohort
  • p.cohort: p-value in the outlier test using the entire reference cohort
  • rank.cohort: rank of the expression level in the case under analysis compared to the entire reference cohort
  • size.white: number of cases in the whitelist reference cohort
  • p.white: p-value in the outlier test using the whitelist reference cohort
  • rank.white: rank of the expression level in the case under analysis compared to the whitelist reference cohort

Example
Gene | fpkm.raw | size.bi | p.bi | rank.bi | size.cohort | p.cohort | rank.cohort | size.white | p.white | rank.white
7SK | 4.5937 | na | na | na | 264 | 0.716284011918374 | 162 | na | na | na
A1BG | 0.2312 | 24 | 0.900132642257996 | 21 | 264 | 0.84055666600945 | 222 | na | na | na

Gene level ASE results

Gene level ASE results are the raw results from the gene level ASE test.

  • gene: gene accession number (RefSeq ID)
  • gsym: gene symbol
  • chrom: chromosome name
  • strand: strand orientation
  • start: genomic start location
  • end: genomic end location
  • cdsStartStat: coding sequence (CDS) start status
  • cdsEndStat: coding sequence (CDS) end status
  • markers: number of heterozygous markers in this gene
  • ase_markers: number of heterozygous markers showing allelic specific expressions (ASE)
  • average_ai_all: average B-allele frequency (BAF) difference between RNA and DNA for all heterozygous markers
  • average_ai_ase: average BAF difference between RNA and DNA for ASE markers
  • pval_all_markers: p-value for each marker in the ASE test
  • pval_ase_markers: p-value for ASE markers in the ASE test
  • ai_all_markers: BAF difference between RNA and DNA for all heterozygous markers
  • ai_ase_markers: BAF difference between RNA and DNA for ASE markers
  • comb.pval: combined p-value for the ASE test
  • mean.delta: average BAF difference between RNA and DNA for all markers
  • rawp: raw p-value for the ASE test
  • Bonferroni: adjusted p-value for the ASE test (single-step Bonferroni)
  • ABH: adjusted p-value for the ASE test (Benjamini-Hochberg)

Example
gene | gsym | chrom | strand | start | end | cdsStartStat | cdsEndStat | markers | ase_markers | average_ai_all | average_ai_ase | pval_all_markers | pval_ase_markers | ai_all_markers | ai_ase_markers | comb.pval | mean.delta | rawp | Bonferroni | ABH
NM_024684 | AAMDC | chr11 | + | 77532207 | 77583398 | cmpl | cmpl | 2 | 0 | 0.079 | na | 0.924775093657227,0.0331439677875056 | na | 0.00892857142857145,0.149122807017544 | na | 0.175073458624837 | 0.0790256892230577 | 0.175073458624837 | 1 | 0.480780882445856
NM_015423 | AASDHPPT | chr11 | + | 105948291 | 105969419 | cmpl | cmpl | 2 | 0 | 0.023 | na | 0.749258624760841,1 | na | 0.0384615384615384,0.00769230769230766 | na | 0.86559726476049 | 0.023076923076923 | 0.86559726476049 | 1 | 0.873257417545981
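
The per-marker columns are comma-separated lists, so the gene-level summaries can be recomputed from them. The sketch below parses those lists and recomputes mean.delta as the mean of ai_all_markers. It also computes the geometric mean of pval_all_markers: in the example rows above this matches comb.pval, but that is an observation about those rows, not a documented definition of how cis-X combines p-values. Function names are hypothetical.

```python
import math


def parse_marker_list(field):
    """Turn a comma-separated per-marker field into floats; 'na' means empty."""
    if field == "na":
        return []
    return [float(x) for x in field.split(",")]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def geometric_mean(pvals):
    """Geometric mean, computed in log space to cope with tiny p-values."""
    return math.exp(sum(math.log(p) for p in pvals) / len(pvals))


# Per-marker fields from the AAMDC example row.
ai_all = parse_marker_list("0.00892857142857145,0.149122807017544")
pval_all = parse_marker_list("0.924775093657227,0.0331439677875056")

aamdc_mean_delta = mean(ai_all)
aamdc_combined = geometric_mean(pval_all)
```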

Single marker ASE results

Single marker ASE results are the raw results from the single marker ASE test.

  • chrom: chromosome name
  • pos: genomic start position
  • ref: reference allele genotype
  • mut: non-reference allele genotype
  • cvg_wgs: coverage of the marker from the whole genome sequence (WGS)
  • mut_freq_wgs: non-reference allele fraction in the WGS
  • cvg_rna: coverage of the marker from the RNA-Seq
  • mut_freq_rna: non-reference allele fraction in the RNA-Seq
  • ref.1: read count of the reference allele in the RNA-Seq
  • var: read count of the non-reference allele in the RNA-Seq
  • pvalue: p-value from the binomial test
  • delta.abs: absolute difference of the non-reference allele fraction between the WGS and RNA-Seq

Example

chrom | pos | ref | mut | cvg_wgs | mut_freq_wgs | cvg_rna | mut_freq_rna | ref.1 | var | pvalue | delta.abs
chr11204147GA360.472850.55338470.3856694201192780.0529411764705883
chr11205198CA230.522830.31357260.0008775517800028630.186746987951807
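
For readers who want to recompute the single marker statistics, here is a sketch of an exact two-sided binomial test in plain Python. It rests on an assumption: the expected non-reference fraction is set to 0.5, because the two example rows above are numerically consistent with that choice (for instance, 47 of 85 RNA-Seq reads gives |47/85 - 0.5| ≈ 0.0529, the listed delta.abs). The null hypothesis actually used by cis-X may differ, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so ties with the observed outcome are included.
    threshold = probs[k] * (1 + 1e-7)
    return min(1.0, sum(q for q in probs if q <= threshold))


# Read counts from the first example row: ref.1 = 38, var = 47.
ref_reads, var_reads = 38, 47
coverage = ref_reads + var_reads
first_row_p = binomial_two_sided(var_reads, coverage)
first_row_delta = abs(var_reads / coverage - 0.5)

# Read counts from the second example row: ref.1 = 57, var = 26.
second_row_p = binomial_two_sided(26, 57 + 26)
```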

Frequently asked questions

None yet! If you have any questions not covered here, feel free to email us at support@stjude.cloud.

Footnotes

1 Pediatric Translational Medicine Institute, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China

2 Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

3 Department of Tumor Cell Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

4 Key Laboratory of Pediatric Hematology & Oncology Ministry of Health, Department of Hematology & Oncology, Shanghai Children's Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China

5 Department of Pharmaceutical Sciences, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

6 Hematological Malignancies Program, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

7 Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

8 Children's Hospital of Soochow University, Suzhou, Jiangsu, China

9 Department of Pediatric Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02215, USA

10 Division of Pediatric Hematology-Oncology, Boston Children's Hospital, MA 02115, USA

* Contributed equally to this work.

# Correspondence should be addressed to Y.L. (liuyu@scmc.com.cn) or J.Z. (jinghui.zhang@stjude.org).