A downstream visualization and analysis tool for gene set enrichment results, with an interactive web visualizer.
Author: Yan Tan (Boston University, Broad Institute), Felix Wu (Broad Institute)
Algorithm Version: 1.1
ConstellationMap helps leverage the full power of a gene set enrichment analysis by identifying commonalities between high-scoring gene sets and mapping their relationships. ConstellationMap visualizes in an interactive web application the enrichment profile similarity and gene member overlap of gene sets, which are positively or negatively enriched in relation to a phenotype in a two-class or continuous-class comparison. It uses provided enrichment scores (e.g., those generated by ssGSEAProjection) to estimate normalized mutual information (NMI) scores and project top scoring sets onto a circular plot with the following features:
(1) Gene sets are represented as nodes whose radial distance to the center reflects positive or negative enrichment in the phenotype, whichever direction is specified. Gene sets more strongly associated in the specified direction are plotted closer to the center.
(2) Member gene overlaps between sets are represented as edges connecting nodes with thickness proportional to degree of overlap.
(3) The angular distance between nodes reflects the dissimilarity of their enrichment profiles, i.e., gene sets with more similar enrichment patterns are separated by a smaller angular distance.
See the video tutorial: Using the Constellation Map Visualizer
See the F1000Research paper (Tan Y, Wu F, Tamayo P et al., 2015): doi: 10.12688/f1000research.6644.1
Web interactive features allow export of selected overlapping gene symbols for annotation in DAVID, MSigDB, and GeneMania. Click on the output file Visualizer.html to open the interactive ConstellationMap plot from the Jobs Tab.
In addition to the web application, ConstellationMap outputs two static plots: (1) a positive red to negative blue heat map of per sample enrichment of the gene sets ranked by NMI scores as well as Area Under the Curve (AUC) and t-test metrics along with corresponding p-values, and (2) a constellation map marking the phenotype of interest in the center in red, concentric contour arcs marking association to the phenotype (as measured by NMI), nodes as numbered open circles, overlap as green lines, and a key of the numbered gene sets.
Users may input gene set enrichment data produced by any algorithm as long as it is in GCT format, as well as any gene set collection in GMX or GMT format. For convenience, GenePattern modules including ConstellationMap provide MSigDB gene set collection files from a drop-down menu. The two example workflows presented here both use outputs from GenePattern’s ssGSEAProjection module as inputs to ConstellationMap. Workflow (A) runs ssGSEAProjection only on user-specified enriched sets, e.g., those previously identified by another analysis, while Workflow (B) runs ssGSEAProjection on the expression dataset and leaves the ranking of the resulting enrichment scores to ConstellationMap. Users may choose Workflow (A) if they have a preferred method (e.g., GSEA) for identifying gene sets highly associated with a phenotypic class. If users do not have a preferred method, they may use Workflow (B), which uses NMI to identify highly associated gene sets. The two workflows are summarized below.
Workflow (A): Whole Transcriptome Expression Data → GSEA (or other gene set analysis method) → ssGSEAProjection → ConstellationMap
Workflow (B): Whole Transcriptome Expression Data → ssGSEAProjection → ConstellationMap
For gene set collections that do not utilize _UP and _DN suffixes at the ends of set names, the combine mode parameter option is irrelevant as all the modes give the same output.
Some gene sets, including many in MSigDB, are derived from previous gene expression studies with up-regulated and down-regulated genes divided into separate gene sets denoted by an _UP or _DN suffix appended to otherwise identical set names. As of MSigDB v5.0, released March 2015, four collections contain gene sets with _UP and _DN suffixes: C2.all, C2.CGP, C6.all, and C7.all. The ssGSEAProjection module can recombine these paired sets and calculate a combined enrichment score for each pair. See the ssGSEAProjection module documentation for more information.
For gene set collections that utilize _UP and _DN suffixes, set the ssGSEAProjection parameter combine mode to combine.off. The other two options, combine.add and combine.replace, create new combined gene sets that lack either suffix and are not represented in the input gene set collection. Use them only if you will provide ConstellationMap with a gene set collection that includes the new combined gene sets, because ConstellationMap requires that the gene sets specified in the input GCT file be a subset of the gene sets provided.
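The subset requirement above can be checked before a job is submitted. The following is a minimal sketch, not part of the module: the function names and the example gene set names are hypothetical, and it assumes you have already read the row names of the GCT file and the set names of the gene set collection into Python lists.

```python
def paired_suffix_sets(collection_names):
    """Return base names that occur with both an _UP and a _DN suffix."""
    ups = {name[:-3] for name in collection_names if name.endswith("_UP")}
    dns = {name[:-3] for name in collection_names if name.endswith("_DN")}
    return sorted(ups & dns)


def missing_gene_sets(gct_names, collection_names):
    """Return GCT gene set names that are absent from the collection."""
    return sorted(set(gct_names) - set(collection_names))


# Hypothetical names, for illustration only.
collection = ["SET_A_UP", "SET_A_DN", "SET_B"]

# A combined set "SET_A" (as combine.add or combine.replace would create)
# is not in the original collection, so it is reported as missing.
print(paired_suffix_sets(collection))                      # ['SET_A']
print(missing_gene_sets(["SET_A", "SET_B"], collection))   # ['SET_A']
print(missing_gene_sets(["SET_A_UP", "SET_A_DN"], collection))  # []
```

An empty result from `missing_gene_sets` indicates that every gene set in the GCT file is present in the collection.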
Constellation Map operates on the provided gene set enrichment scores, estimating the probability density functions of gene set and phenotypic class variables using kernel density estimation. These density functions are subsequently used to calculate normalized mutual information (NMI) scores for each gene set, which capture the association between each gene set’s enrichment scores and phenotypic classes (Eq. 3). Since NMI scores do not inherently contain information on direction of association, we apply the sign of the Pearson correlation (Eq. 4).
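To make the scoring step concrete, here is a rough sketch of a signed NMI score. It is not the module's implementation: it swaps the kernel density estimate for a simple equal-width histogram, assumes the normalization NMI = I(X;Y) / H(X,Y), and assumes the phenotype is coded numerically (e.g., 0 and 1). All function names are hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Equal-width binning (a stand-in for kernel density estimation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x_labels, y_labels):
    """Mutual information divided by joint entropy (assumed normalization)."""
    h_xy = entropy(list(zip(x_labels, y_labels)))
    if h_xy == 0:
        return 0.0
    return (entropy(x_labels) + entropy(y_labels) - h_xy) / h_xy


def correlation_sign(x, y):
    """Sign of the Pearson correlation (same as the sign of the covariance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return 1.0 if cov > 0 else -1.0 if cov < 0 else 0.0


def signed_nmi(scores, classes, bins=4):
    """NMI between enrichment scores and class labels, signed by correlation."""
    return correlation_sign(scores, classes) * nmi(discretize(scores, bins), classes)


# Made-up enrichment scores for six samples and a 0/1 phenotype.
scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]
print(signed_nmi(scores, [1, 1, 1, 0, 0, 0], bins=2))
print(signed_nmi(scores, [0, 0, 0, 1, 1, 1], bins=2))
```

In this toy example the scores separate the two classes perfectly, so the magnitude is at its maximum and only the sign changes when the class coding is flipped.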
Next, the top N NMI-scoring gene sets are selected for projection onto the circular space. NMI scores are calculated pairwise across the N gene sets to quantify the similarity between their enrichment profiles. These pairwise NMI scores are converted into dissimilarity scores, d = 1 − NMI, which has been mathematically proven to be a true distance metric. An N-by-N distance matrix D is constructed containing these pairwise distances d.
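The selection and conversion steps can be sketched as follows. This is an illustration under assumptions, not the module's code: the function names, the `direction` argument, and the example values are hypothetical, and the pairwise NMI matrix is assumed to be precomputed and symmetric.

```python
def select_top_n(signed_nmi_by_set, n, direction="positive"):
    """Pick the n gene sets most associated in the requested direction."""
    reverse = direction == "positive"
    ranked = sorted(signed_nmi_by_set, key=signed_nmi_by_set.get, reverse=reverse)
    return ranked[:n]


def distance_matrix(pairwise_nmi):
    """Convert an N-by-N pairwise NMI matrix into distances d = 1 - NMI."""
    n = len(pairwise_nmi)
    return [
        [0.0 if i == j else 1.0 - pairwise_nmi[i][j] for j in range(n)]
        for i in range(n)
    ]


# Made-up scores for four gene sets.
scores = {"A": 0.9, "B": -0.7, "C": 0.4, "D": 0.1}
print(select_top_n(scores, 2))               # ['A', 'C']
print(select_top_n(scores, 2, "negative"))   # ['B', 'D']

# Made-up pairwise NMI values for three gene sets.
pairwise = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
D = distance_matrix(pairwise)
```

The diagonal of `D` is set to zero explicitly, since the distance of a gene set to itself is zero regardless of how the self-NMI was estimated.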
The circular Constellation Map plot is built using the multidimensional scaling projection R package SMACOF, version 1.5-0. An angular distance matrix Δ is calculated by minimizing the objective function (Eq. 5), where δij is the angular distance and dij is the original distance (stored in D) between gene sets i and j. The gene sets are plotted as points distributed about the origin. The angular distance between two gene sets is determined from Δ and reflects the dissimilarity of their enrichment profiles, so gene sets with similar profiles are plotted at similar angles. Radial distance (i.e., distance to the origin) indicates the gene set’s association with the phenotype (1 − NMI).
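The final placement of a node can be illustrated with a small sketch. It does not perform the multidimensional scaling itself; it assumes an angle (in radians) has already been obtained for each gene set and only shows how an angle and an NMI score map to plot coordinates. The function name is hypothetical.

```python
import math


def node_position(angle, nmi_score):
    """Place a gene set at radius 1 - NMI and the given angle (radians)."""
    radius = 1.0 - nmi_score
    return (radius * math.cos(angle), radius * math.sin(angle))


# A strongly associated set (NMI 0.75) sits close to the center;
# a weakly associated set (NMI 0.1) sits near the rim.
print(node_position(0.0, 0.75))
print(node_position(math.pi / 2, 0.1))
```

With this mapping the phenotype sits at the origin, and a gene set with NMI equal to 1 would coincide with it.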
To provide context of gene overlap between sets, pairwise Jaccard indices are calculated across the gene sets. (The Jaccard index is equal to the number of genes shared by two sets divided by the number of genes in their union.) For pairs with Jaccard indices greater than a given threshold, edges are drawn connecting the respective nodes; the thickness of each edge is proportional to the Jaccard index.
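As an illustration of the overlap step, the sketch below computes pairwise Jaccard indices and keeps the pairs above a threshold. The function names and gene set contents are made up for the example; the default threshold of 0.1 mirrors the module's documented default.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Number of shared genes divided by the number of genes in the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_edges(gene_sets, threshold=0.1):
    """Return (set1, set2, jaccard) for every pair above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        j = jaccard(gene_sets[name_a], gene_sets[name_b])
        if j > threshold:
            edges.append((name_a, name_b, j))
    return edges


# Made-up gene sets for illustration.
gene_sets = {
    "S1": {"TP53", "MYC", "EGFR"},
    "S2": {"MYC", "EGFR", "KRAS"},
    "S3": {"BRCA1"},
}
print(overlap_edges(gene_sets))  # [('S1', 'S2', 0.5)]
```

The third element of each tuple is the value that would drive edge thickness in the plot.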
|input gct file *||A tab-delimited text file in GCT format containing gene set enrichment data. ConstellationMap assumes that this file was output by an ssGSEAProjection (single-sample gene set enrichment analysis projection) module job. Rows should correspond to gene sets and columns to samples. The gene sets specified in the GCT file must be a subset of the gene sets listed in the accompanying gene sets database or gene sets file (see below).|
|input cls file *||A space-delimited text file in CLS format containing two-phenotype categorical labels (e.g., tumor vs. normal) or continuous phenotype labels (e.g., time series). For each sample in the corresponding expression dataset the CLS file assigns a label or numerical value for the phenotype.|
|gene sets database||This drop-down menu allows you to select gene sets from the Molecular Signatures Database (MSigDB) on the GSEA website. This menu provides access to collections from MSigDB version 5.0. If you want to use files from an earlier version of MSigDB, you will need to download that file from the archived releases on the website and specify it in the gene sets file parameter. If you do not select an option here, you must upload a file in the gene sets file parameter.|
|gene sets file||A gene set file in GMT or GMX format. Provide your gene set file here if it is unavailable from the drop-down menu under gene sets database.|
|top n *||This is a positive integer indicating the number of top NMI-scoring gene sets to display in the final plot. This parameter must be greater than 2 but less than or equal to the total number of enriched gene sets.|
This drop-down menu allows you to select the direction of correlation, “positive” or “negative”, in which ConstellationMap will investigate association to the target class.
If “positive” is chosen then gene sets that are more positively associated with the target class will be placed closer to the center in the resulting radial plot. If “negative” is chosen then gene sets that are more negatively associated with the target class will be placed closer to the center in the resulting radial plot. Default: “positive”
|image format *||This drop-down menu allows you to select an image format for the outputted static plot and heat map, either PNG (raster graphics) or PDF (vector graphics). Default: “PNG”|
|jaccard threshold *||This is a number between 0 and 1 indicating the Jaccard Index threshold above which connecting edges between gene sets will be drawn in the resulting plot. ConstellationMap measures the overlap between all gene set pairs using the Jaccard Index, which is equal to the number of genes in the intersection of the two sets divided by the number of genes in their union. Edges are drawn between gene set pairs if their Jaccard Index is greater than the threshold given here. This affects both the static and interactive plots. Default: 0.1|
This is a phenotype label, indicating the phenotypic category against which ConstellationMap will measure the association of various gene sets based on their enrichment. In the outputted visualization, target class will be plotted in the center of the radial plot. This parameter must match one of the two phenotype labels specified in the second line of the input cls file. If this parameter is left blank, ConstellationMap will default to the first listed phenotype label in the input cls file.
* - required
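The parameter constraints above can be sketched as simple checks. This is an illustration only, not the module's validation code: the function names are hypothetical, and the parser assumes a categorical CLS file whose second line starts with `#` followed by the phenotype labels.

```python
def resolve_target_class(cls_text, target_class=None):
    """Return the target class, defaulting to the first label in the CLS file."""
    lines = [ln.strip() for ln in cls_text.splitlines() if ln.strip()]
    labels = lines[1].lstrip("#").split()  # second line: "# label1 label2"
    if not target_class:
        return labels[0]
    if target_class not in labels:
        raise ValueError(f"target class {target_class!r} not in {labels}")
    return target_class


def validate_params(top_n, n_gene_sets, jaccard_threshold):
    """Check the documented ranges for top n and jaccard threshold."""
    if not 2 < top_n <= n_gene_sets:
        raise ValueError("top n must be greater than 2 and at most the number of gene sets")
    if not 0 <= jaccard_threshold <= 1:
        raise ValueError("jaccard threshold must be between 0 and 1")


# Made-up CLS content: six samples, two classes.
cls_text = "6 2 1\n# tumor normal\n0 0 0 1 1 1\n"
print(resolve_target_class(cls_text))            # tumor
print(resolve_target_class(cls_text, "normal"))  # normal
validate_params(top_n=3, n_gene_sets=10, jaccard_threshold=0.1)
```

Continuous phenotype CLS files use a different layout and are not handled by this sketch.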
When saving a ConstellationMap job to the GenePattern file system, users should ensure that Visualizer.html resides in the same folder as the two accompanying ODF files in order to continue using the visualizer. This is the preferred method for visualizing ConstellationMap results.
|Task Type:||Statistical Methods|
|Language:||R (v3.0), Java|
|1.1||2015-04-08||Initial production release.|