1. Introduction to OncoSplicing


2. Splicing types and quantification


3. Database construction pipeline


4. Function in OncoSplicing


5. Description of columns


6. Data summary


7. Abbreviation


8. Reference



1. Introduction to OncoSplicing

Percent spliced in (PSI), which ranges from 0 to 1, is a commonly used ratio that indicates the relative usage of an alternative exon. We downloaded PSI values of 122,423 alternative splicing (AS) events across 10,699 samples in 33 cancer types from the TCGA SpliceSeq database [1]. The SplAdder [2] software differs from the SpliceSeq [3] software in how it detects and quantifies alternative splicing, which may lead to different findings about alternative splicing in cancers. We collected raw count data of 4,491,482 AS events in 9,437 TCGA and 3,323 GTEx samples from the SplAdder project [4] and then recalculated PSI values for 238,558 confirmed and filtered AS events using a modified pipeline. After integrating sample types, survival data and clinical indicators at different levels, differential, survival and cancer-specific analyses were performed on these data to identify potentially clinically relevant AS events in each cancer type (see below). To help users quickly find transcript information related to alternative splicing, 101,877 (83.2%) AS events in the SpliceSeq project and 169,426 (71%) in the SplAdder project were annotated with at least one transcript. In addition, explanatory diagrams of detected AS events and the annotated structure of transcripts are presented in the UCSC genome browser as custom tracks. In OncoSplicing, users can easily browse and search alternative splicing in cancers and visualize clinical relevance in a single-cancer or pan-cancer view.


In the latest update of OncoSplicing, we integrated 282,125 AS events by whole splice region, drawn from 57,172 SpliceSeq events with only one alternative exon and all 238,558 SplAdder events. We mapped these events to published RNA binding motifs of 130 RNA-binding proteins (RBPs) [5,6], to eCLIP peaks of 150 RBPs, and to splice events regulated by 358 RBPs in the ENCODE project [7]. Correlation analysis was performed between all these events and the collected RBPs, as well as 521 other splicing RBPs annotated in MSigDB [8,9]. Finally, 58,993,935 (26% of all) splice event-RBP pairs were found to have at least one of these relationships.


2. Splicing types and quantification

(1) The splice types detected in the SplAdder project, namely alternative 3’ site (A3), alternative 5’ site (A5), exon skip (ES), mutually exclusive exons (ME) and intron retention (IR), correspond respectively to five splice types detected in the SpliceSeq project: alternative acceptor sites (AA), alternative donor sites (AD), exon skipping (ES), mutually exclusive exons (ME) and retained intron (RI). The SpliceSeq project also includes the splice types alternate promoter (AP) and alternate terminator (AT).


(2) SpliceSeq and SplAdder differ in AS detection and PSI calculation. SpliceSeq uses all reads covering the spliced exon to calculate PSI, whereas SplAdder considers only the reads covering the splice junctions. In this database, we modified the PSI calculation for the SplAdder project by normalizing the read counts by the number of splice junctions, which may to some extent affect the read counts and PSI for the ME and ES splice types.
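As a minimal sketch of junction-based PSI for an exon-skip event, the example below assumes that "normalizing by the number of splice junctions" means averaging the two inclusion-junction counts before comparing them with the single skipping-junction count. The function name, its arguments and the min_reads threshold are illustrative and are not taken from the OncoSplicing pipeline.

```python
from typing import Optional


def exon_skip_psi(inc_junction_up: int, inc_junction_down: int,
                  skip_junction: int, min_reads: int = 1) -> Optional[float]:
    """Junction-based PSI for one exon-skip event in one sample.

    Assumption: inclusion is supported by two junctions and skipping by one,
    so inclusion reads are divided by 2 to put both isoforms on the same scale.
    """
    inclusion = (inc_junction_up + inc_junction_down) / 2  # per-junction average
    exclusion = skip_junction                              # single junction
    total = inclusion + exclusion
    if total < min_reads:
        return None  # too few supporting reads to estimate PSI
    return inclusion / total


# 30 and 50 reads on the two inclusion junctions, 10 reads on the skipping junction.
print(exon_skip_psi(30, 50, 10))
```

Without the division by 2, the same counts would give 80 / 90, so the inclusion isoform would look more abundant simply because it has two junctions.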


(3) In the latest version, we analyzed AS events in the ENCODE project using the rMATS software, which also reports five similar splice types: alternative 3’ splice site (A3SS), alternative 5’ splice site (A5SS), skipped exon (ES), mutually exclusive exons (MXE) and retained intron (RI).



3. Database construction pipeline

The primary data-processing pipelines were taken from the papers of the SpliceSeq [1] and SplAdder [4] projects, respectively. For the SplAdder project, the modified quantification of read counts and PSI is described in section 2, and the pipeline for confirming novel AS events was modified by setting the minimum read count for splice junctions to three for the ES and ME splice types and by performing sample-size normalization with a percentage cut-off of 0.5% within each cancer type population.

(1) Survival analysis. Survival analyses were performed on overall survival (OS), progression-free interval (PFI), disease-free interval (DFI) and disease-specific survival (DSS) data. Cox proportional hazards (Cox PH) regression was used to estimate the hazard ratio between two survival groups defined by dichotomizing PSI values, and the log-rank test was used to assess significance. For each type of survival data, an AS event was analyzed only if it had an effective sample size > 30, more than 5 survival events and a minimum group size > 10. AS events with a log-rank p-value less than 0.05 were considered significant survival-associated alternative splicing events (SASEs). All SASEs can be found on the "ClinicalAS" page, and all survival analysis results are stored in the info table on the "Download" page. Survival analyses on the "SpliceSeq" and "SplAdder" pages were based on overall survival data for most cancer types, and on progression-free survival data for PCPG, PRAD, TGCT, THCA and THYM because of a lack of overall survival events.
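The eligibility filters and the median split described above can be sketched as follows. This is an illustration under stated assumptions, not the database's code: "effective sample size" is taken to mean samples with both a PSI value and survival data, ties at the median are assigned to the low group, and the function name is hypothetical. The log-rank test and Cox PH regression would then be run on the two returned groups with a survival analysis library.

```python
from statistics import median


def split_by_median_psi(psi, time, event):
    """Return (low, high) sample indices, or None if the AS event fails a filter.

    psi, time and event are parallel lists; None marks a missing value and
    event is 1 for an observed survival event, 0 for censoring.
    """
    # Effective samples: PSI and survival data are both available (assumption).
    idx = [i for i in range(len(psi))
           if psi[i] is not None and time[i] is not None and event[i] is not None]
    if len(idx) <= 30:                      # effective sample size > 30
        return None
    if sum(event[i] for i in idx) <= 5:     # survival events > 5
        return None
    cutoff = median(psi[i] for i in idx)
    high = [i for i in idx if psi[i] > cutoff]
    low = [i for i in idx if psi[i] <= cutoff]
    if min(len(low), len(high)) <= 10:      # minimum group size > 10
        return None
    return low, high


psi = [i / 39 for i in range(40)]
time = [10.0] * 40
event = [1 if i % 4 == 0 else 0 for i in range(40)]
low, high = split_by_median_psi(psi, time, event)
print(len(low), len(high))
```

An event whose PSI is 1 in most samples fails the group-size filter, because the median equals 1 and no sample can fall above it.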


(2) Differential analysis. Differential AS analyses were performed for cancer types (see data summary) with at least 30 TCGA tumour and 10 adjacent normal samples. The Wilcoxon rank-sum test was used to evaluate the significance of differences. AS events with an absolute delta PSI greater than 0.1 and a Benjamini–Hochberg (BH) adjusted p-value less than 0.05 were considered significant differential alternative splicing events (DASEs).
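The DASE thresholds can be illustrated with a small, dependency-free sketch. It takes per-event delta PSI values and raw p-values as given (in practice the p-values would come from a Wilcoxon rank-sum test), applies the standard Benjamini–Hochberg adjustment, and keeps events passing both cut-offs. The function names and the input tuple layout are assumptions made for this example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_dases(events, delta_cutoff=0.1, fdr_cutoff=0.05):
    """events: list of (event_id, delta_psi, raw_p_value) tuples."""
    adjusted = bh_adjust([p for _, _, p in events])
    return [event_id
            for (event_id, delta, _), q in zip(events, adjusted)
            if abs(delta) > delta_cutoff and q < fdr_cutoff]


events = [("e1", 0.25, 0.01), ("e2", 0.05, 0.04),
          ("e3", -0.30, 0.03), ("e4", 0.15, 0.005)]
print(significant_dases(events))
```

In this toy input, e2 is dropped because its delta PSI is too small, even though its adjusted p-value is below 0.05.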


(3) Identification of clinical indicator-relevant AS events. Basic patient information, including age, sex and race, together with non-redundant and variable clinical indicators, was manually collected and separated into two groups for each cancer type. A clinical indicator in a cancer type was retained for further analysis only if there were more than 20 records per group. Differential AS analysis was performed between the two groups for each indicator, and only AS events with a delta PSI greater than 0.1 and a BH adjusted p-value less than 0.05 were considered significant clinically relevant AS events. The explanation of each clinical indicator can be found here.


(4) Identification of cancer-specific AS. An AS event was considered cancer-specific only if it met one of the following criteria: 1) PSI > 0.99 in more than 90% of GTEx samples and < 0.95 in more than 10% of tumour samples for at least one TCGA cancer type; or 2) PSI < 0.01 in more than 90% of GTEx samples and > 0.05 in more than 10% of tumour samples for at least one TCGA cancer type.
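The two criteria translate directly into a small check. The sketch below assumes that the PSI lists contain only non-missing values and that tumour samples are grouped by TCGA cancer type in a dictionary; the function names and data layout are illustrative.

```python
def fraction(values, predicate):
    """Fraction of values satisfying predicate (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


def is_cancer_specific(gtex_psi, tumour_psi_by_cancer):
    """gtex_psi: PSI values in GTEx samples.
    tumour_psi_by_cancer: dict mapping a cancer type to its tumour PSI values.
    """
    always_included = fraction(gtex_psi, lambda p: p > 0.99) > 0.90
    always_excluded = fraction(gtex_psi, lambda p: p < 0.01) > 0.90
    for psi in tumour_psi_by_cancer.values():
        # Criterion 1: near-constitutive inclusion in GTEx, lost in >10% of tumours.
        if always_included and fraction(psi, lambda p: p < 0.95) > 0.10:
            return True
        # Criterion 2: near-constitutive exclusion in GTEx, gained in >10% of tumours.
        if always_excluded and fraction(psi, lambda p: p > 0.05) > 0.10:
            return True
    return False


gtex = [1.0] * 95 + [0.9] * 5
print(is_cancer_specific(gtex, {"LUAD": [1.0] * 80 + [0.5] * 20}))
```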


(5) Annotation of AS-associated transcripts. An AS-associated transcript is a transcript that contains the splice junctions of the AS event, corresponding to either the exon splice-in or splice-out form. Exons were ordered from 5’ to 3’ for all transcripts annotated in the genome annotation file (GRCh37 or GENCODE19). The chromosome locations of the splice junctions of each AS event were obtained and mapped to transcripts of the spliced gene to confirm AS-associated transcripts.

(6) RNA binding motifs mapping to AS events. From published data, we collected 1,149 RBP-motifs and 753 unique RNA binding motifs of 130 RBPs. From the splice sites within AS events, 250 nt on the intron side and 50 nt on the exon side were extracted as motif-searching regions that may influence the AS events. After the FASTA sequences were obtained, motifs were mapped to all the integrated AS events one by one.
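The motif-searching regions can be sketched as coordinate windows around each exon/intron boundary. The example assumes 0-based, half-open genomic coordinates and describes the intron side in genomic orientation (left or right of the boundary), so strand handling is left out. The exact-string scan is a simplification, since published motifs are often degenerate. All function names are hypothetical.

```python
def motif_window(boundary, intron_on_left, intron_nt=250, exon_nt=50):
    """Return the (start, end) window around one exon/intron boundary."""
    if intron_on_left:
        start, end = boundary - intron_nt, boundary + exon_nt
    else:
        start, end = boundary - exon_nt, boundary + intron_nt
    return max(start, 0), end  # clamp at the chromosome start


def exon_windows(exon_start, exon_end):
    """Windows at both splice sites of an internal exon [exon_start, exon_end)."""
    return [motif_window(exon_start, intron_on_left=True),
            motif_window(exon_end, intron_on_left=False)]


def find_motif(sequence, motif):
    """Start positions of every (possibly overlapping) exact motif match."""
    hits, pos = [], sequence.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = sequence.find(motif, pos + 1)
    return hits


print(exon_windows(1000, 1120))
print(find_motif("UGCAUGCAUG", "GCAUG"))
```

Each window spans 300 nt unless it is clamped at position 0.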


(7) ENCODE eCLIP-seq peaks mapping to AS events. eCLIP-seq peaks of 103 HepG2 experiments and 120 K562 experiments in the ENCODE project were downloaded from the ENCODE data portal. Peaks were merged per target RBP, separately for IDR peaks and replicate peaks, using deeptools. The merged peaks were then mapped to all the integrated AS events.


(8) Splicing analysis of ENCODE RNA-seq data. ENCODE RNA-seq data from 338 HepG2 and 378 K562 experiments were downloaded from the ENCODE data portal. After pre-processing the raw FASTQ data with fastp, clean reads were mapped to the hg19 genome with the STAR 2-pass workflow. rMATS was used to perform splicing analysis. AS events with a delta PSI greater than 0.1 and an FDR less than 0.05 were considered significantly regulated events. Among the 282,125 integrated AS events, 42,486 were also found in at least one of these ENCODE RNA-seq datasets.


(9) Correlation analysis. Gene expression data in TPM format for both the TCGA and GTEx datasets were downloaded from the UCSC Xena database. Expression data were matched to splicing data from either the SpliceSeq or SplAdder project by sample ID, leaving 9,997 TCGA samples and 2,442 GTEx samples for correlation analysis. Within a cancer/tissue type (more than 40 samples), AS events with PSI values present in more than 75% of samples and exhibiting 25% unique values (excluding solely 0 or 1) were subjected to Pearson correlation analysis with the mRNA expression of RBPs. AS event-RBP correlations with an absolute coefficient greater than 0.4 and an FDR less than 0.05 were considered significant.
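A rough sketch of the eligibility filter and the significance rule follows. The "25% unique values (excluding solely 0 or 1)" condition is interpreted here as: the number of distinct PSI values other than 0 and 1 is at least 25% of the non-missing samples. That reading is an assumption, as are the function names. The Pearson coefficient is computed by hand to keep the example dependency-free, and FDR values are taken as given.

```python
from math import sqrt


def eligible_for_correlation(psi, min_samples=40, min_present=0.75, min_unique=0.25):
    """psi: PSI values of one AS event in one cancer/tissue type (None = missing)."""
    if len(psi) <= min_samples:                 # more than 40 samples
        return False
    present = [p for p in psi if p is not None]
    if len(present) <= min_present * len(psi):  # PSI present in more than 75%
        return False
    informative = {p for p in present if p not in (0.0, 1.0)}
    return len(informative) >= min_unique * len(present)


def pearson(x, y):
    """Pearson correlation coefficient, or None if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def is_significant(r, fdr, r_cutoff=0.4, fdr_cutoff=0.05):
    """Apply the |r| > 0.4 and FDR < 0.05 thresholds."""
    return r is not None and abs(r) > r_cutoff and fdr < fdr_cutoff


psi = [i / 100 for i in range(1, 51)]
tpm = [2.0 * p + 1.0 for p in psi]
print(eligible_for_correlation(psi), pearson(psi, tpm))
```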



4. Function in OncoSplicing

(1) After choosing a cancer type and searching for a gene symbol, users can browse integrated information on the queried splice events, including gene information ①, event information ② and statistical results ③. Clicking a gene symbol, which carries a hyperlink, directs users to the Ensembl database. In the button region ④, clicking the "UCSC" button directs users to the UCSC genome browser, which shows customized tracks with the annotated structure of transcripts ⑦ and explanatory diagrams ⑤ ⑥ of AS events detected in the SplAdder and SpliceSeq projects. Clicking a plot button displays the corresponding plot for the query within a few seconds. ① and ② are presented consistently across pages such as "SpliceSeq", "SplAdder", "PanCancer" and "ClinAS", while ③ and ④ may differ among them. For the SplAdder project, the event region in ② is given as three or four exons (0-based start to end), ordered by chromosome location from 5' to 3' and linked by "-", and the alternate region is given as the 0-based start to end of the longer intron (A3 and A5), the longer exon (IR) or the alternate exon (ES and ME).


(2) Three customized tracks are provided in the UCSC genome browser, presenting explanatory diagrams and the annotated structure of transcripts for the SplAdder and SpliceSeq projects. In the explanatory diagram tracks for SplAdder ⑤ and SpliceSeq ⑥, AS events in Blue or Green are annotated (known) AS events, while AS events in Red or Orange are novel AS events. In the transcript structure track for GENCODE v19 ⑦, transcripts in DarkRed are protein coding, while transcripts in DarkBlue are non-protein coding. Thick blocks indicate the alternate exon or exons of an AS event in the explanatory diagram tracks, and the CDS region of a transcript in the transcript structure track.


(3) OncoSplicing provides six different plotting functions to visualize each AS event.
KM-plot produces two plots for most AS events based on the median PSI cut-off (left) and the predicted optimal PSI cut-off (right, if applicable). The optimal cut-off was predicted using survival data by the “surv_cutpoint” function in the R package “survminer”.

TN-plot provides a boxplot showing the distribution of PSI in tumor samples and its comparison with adjacent normal samples (if applicable) and/or GTEx samples (if applicable). The Wilcoxon rank-sum test was used to evaluate the significance of differences.

PanDiff plot provides a pan-cancer view of PSI differences of the queried AS event (detected in at least 3 cancers) between tumor samples and adjacent normal samples (left, if applicable) and/or GTEx normal samples (right, if applicable). The red dashed line indicates 0.05 on the Y-axis. The red labels on the axes indicate transformed breaks. The in-circle colors represent different cancer types.

PanCox (PanOS or PanPFI) plot provides a pan-cancer view of the hazard ratio of the queried AS event (detected in at least 3 cancers) based on the median PSI cut-off (left) and the predicted optimal PSI cut-off (right, if applicable). The survival data (OS or PFI) used in the plot are labeled in the X-axis title.

PanPlot produces boxplots showing the PSI distribution of the queried AS event across different cancer types and GTEx tissues (SplAdder) and the distributions of read counts supporting exon splice-in or splice-out (SplAdder). In Figure (3)C, colored labels and black labels on the X-axis represent TCGA cancers and GTEx tissues, respectively. The upper and middle parts, labeled "Reads-In" and "Reads-Out" on the Y-axis, represent the read counts supporting exon splice-in and splice-out, respectively.

CIplot visualizes significant PSI or survival differences between two subgroups of a selected clinical indicator, in plots similar to the TN-plot and KM-plot, respectively.

If no data for the queried AS event can be displayed for a plot function, an empty plot with a warning message is presented.


In the latest OncoSplicing database, users can explore potential splicing regulators of any AS events across the TCGA and GTEx cohorts.


(4) Browse and search. By entering a splice gene, AS event or RBP on the MapAS page, users can view essential information on AS events, along with integrated data on hundreds of RNA-binding proteins (RBPs). The column "RBP_Motifs" indicates the number of RBP motifs identified in the queried AS event. "Peak_Type" and "Peaks" display the type and number of RBP peaks mapped to the event. The "shRNAseq_Cell" and "shRNAseq_PSI" columns give the cell type and the corresponding PSI changes of the queried AS event, respectively. "Corr_Cancers" and "Pos_Correlations" list the number of cancers or tissues showing significant correlations and positive correlations, respectively. If two numbers are separated by a comma in the "Corr_Cancers" and "Pos_Correlations" columns, they represent data from the SpliceSeq and SplAdder projects, respectively. A "0" or "." in these fields denotes that the AS event has been explored but no significant correlations were found, while an empty value ("") signifies a lack of supporting data.
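For users who post-process exported tables, the encoding of the "Corr_Cancers" and "Pos_Correlations" fields can be decoded with a small helper. This is a sketch based only on the rules stated above. The function name and the returned dictionary keys are hypothetical, and a single number is returned under a generic "count" key because the text does not say which project it belongs to.

```python
def parse_corr_field(value):
    """Decode a "Corr_Cancers" or "Pos_Correlations" cell.

    Returns None when the cell is empty (no supporting data); otherwise a dict
    of counts, where "0" and "." both mean "explored, nothing significant".
    """
    value = value.strip()
    if value == "":
        return None

    def to_count(token):
        token = token.strip()
        return 0 if token in ("0", ".") else int(token)

    counts = [to_count(token) for token in value.split(",")]
    if len(counts) == 2:
        # Two comma-separated numbers: SpliceSeq first, SplAdder second.
        return {"SpliceSeq": counts[0], "SplAdder": counts[1]}
    return {"count": counts[0]}


for cell in ["5,12", "3,.", ".", ""]:
    print(repr(cell), parse_corr_field(cell))
```

Keeping None distinct from a zero count preserves the difference between "not tested" and "tested, nothing significant".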


(5) The latest OncoSplicing database introduces new features to explore the relationships between AS events and RBPs.


MapAS-Plot generates a structural representation of the queried AS event, highlighting the motifs and peaks of the selected RBP.


MapAS-Event generates heatmaps that display the location of RNA-binding motifs or eCLIP-seq peaks for all potential RBPs within the structured AS event (e.g., exon_skip_497057). To analyze the distribution of these motifs or peaks, 250 nucleotides on the intronic side and 50 nucleotides on the exonic side of the splice sites are included. The "PanRBP" row provides a summary of all RBPs that have motifs or eCLIP-seq peaks associated with the structured AS event. The grey color indicates sites with motifs or peaks. In the PanRBP row, darker shades represent a higher relative frequency of motifs for all RBPs.


MapAS-RBP generates heatmaps showing the relative distribution of RNA-binding motifs or eCLIP-seq peaks for a specific RBP across all structured AS events of various splice types. The darker the color, the higher the relative frequency of the RBP motif.


MapAS-Seq provides detailed sequences of the queried AS event, which can be used to design AS-specific primers and minigene constructs.


Encode-Plot generates a sashimi plot for the queried AS event based on bigwig score data and eCLIP-seq peak data for the selected RBP from the ENCODE project. (Note: this data may slightly differ from the PSI values calculated by rMATS.)


Encode-Event produces a volcano plot to identify RBPs that significantly regulate the queried AS event.


Encode-RBP generates several types of plots, including scatter plots, bar plots, and volcano plots, to illustrate the number, distribution, and overall landscape of AS events regulated by the selected RBP.


CoExp-Plot generates scatter plots to display the correlation between the queried AS event and the RBP within a specific cancer or tissue type. It also produces boxplots and time series plots that show the dynamics of both omics data across different time points and disease stages.


CoExp-PanPlot generates a scatter plot to visualize the correlation between the queried AS event and the RBP across user-filtered and selected cancer or tissue types.




5. Description of columns



6. Data summary



7. Abbreviation

Cancer Type    Full Name
ACC            Adrenocortical Carcinoma
BLCA           Bladder Urothelial Carcinoma
BRCA           Breast Invasive Carcinoma
CESC           Cervical Squamous Cell Carcinoma
CHOL           Cholangiocarcinoma
COAD           Colon Adenocarcinoma
DLBC           Diffuse Large B-cell Lymphoma
ESCA           Esophageal Carcinoma
GBM            Glioblastoma Multiforme
HNSC           Head and Neck Squamous Cell Carcinoma
KICH           Kidney Chromophobe
KIRC           Kidney Renal Clear Cell Carcinoma
KIRP           Kidney Renal Papillary Cell Carcinoma
LAML           Acute Myeloid Leukemia
LGG            Lower Grade Glioma
LIHC           Liver Hepatocellular Carcinoma
LUAD           Lung Adenocarcinoma
LUSC           Lung Squamous Cell Carcinoma
MESO           Mesothelioma
OV             Ovarian Serous Cystadenocarcinoma
PAAD           Pancreatic Adenocarcinoma
PCPG           Pheochromocytoma and Paraganglioma
PRAD           Prostate Adenocarcinoma
READ           Rectum Adenocarcinoma
SARC           Sarcoma
SKCM           Skin Cutaneous Melanoma
STAD           Stomach Adenocarcinoma
TGCT           Testicular Germ Cell Tumors
THCA           Thyroid Carcinoma
THYM           Thymoma
UCEC           Uterine Corpus Endometrial Carcinoma
UCS            Uterine Carcinosarcoma
UVM            Uveal Melanoma


8. Reference

1. Ryan M, Wong WC, Brown R, Akbani R, Su X, Broom B, Melott J, Weinstein J. TCGASpliceSeq a compendium of alternative mRNA splicing in cancer. Nucleic Acids Res. 2016 Jan 4;44(D1):D1018-22. PMID: 26602693.


2. Kahles A, Ong CS, Zhong Y, Rätsch G. SplAdder: identification, quantification and testing of alternative splicing events from RNA-Seq data. Bioinformatics. 2016 Jun 15;32(12):1840-7. PMID: 26873928.


3. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012 Sep 15;28(18):2385-7. PMID: 22820202.


4. Kahles A, Lehmann KV, Toussaint NC, Hüser M, Stark SG, Sachsenberg T, Stegle O, Kohlbacher O, Sander C; Cancer Genome Atlas Research Network, Rätsch G. Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. Cancer Cell. 2018 Aug 13;34(2):211-224.e6. PMID: 30078747.


5. Hwang JY, Jung S, Kook TL, Rouchka EC, Bok J, Park JW. rMAPS2: an update of the RNA map analysis and plotting server for alternative splicing regulation. Nucleic Acids Res. 2020 Jul 2;48(W1):W300-W306. PMID: 32286627.


6. Van Nostrand EL, Freese P, Pratt GA, Wang X, Wei X, Xiao R, Blue SM, Chen JY, Cody NAL, Dominguez D, Olson S, Sundararaman B, Zhan L, Bazile C, Bouvrette LPB, Bergalet J, Duff MO, Garcia KE, Gelboin-Burkhart C, Hochman M, Lambert NJ, Li H, McGurk MP, Nguyen TB, Palden T, Rabano I, Sathe S, Stanton R, Su A, Wang R, Yee BA, Zhou B, Louie AL, Aigner S, Fu XD, Lécuyer E, Burge CB, Graveley BR, Yeo GW. A large-scale binding and functional map of human RNA-binding proteins. Nature. 2020 Jul;583(7818):711-719. PMID: 32728246.


7. Luo Y, Hitz BC, Gabdank I, Hilton JA, Kagda MS, Lam B, Myers Z, Sud P, Jou J, Lin K, Baymuradov UK, Graham K, Litton C, Miyasato SR, Strattan JS, Jolanki O, Lee JW, Tanaka FY, Adenekan P, O'Neill E, Cherry JM. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020 Jan 8;48(D1):D882-D889. PMID: 31713622.


8. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011 Jun 15;27(12):1739-40. PMID: 21546393.


9. Zhang Y, Wu X, Li J, Sun K, Li H, Yan L, Duan C, Liu H, Chen K, Ye Z, Liu M, Xu H. Comprehensive characterization of alternative splicing in renal cell carcinoma. Brief Bioinform. 2021 Sep 2;22(5):bbab084. PMID: 33822848.