oPOSSUM anchored Combination TFBS Cluster Analysis (aCTCA)

Identification of pairs of over-represented transcription factor binding site clusters in sets of genes or sequences

Analysis Input

Entering Target Data Sets
Entering Background Data Sets
TFBS Cluster Selection Based on TF Families
Select an anchoring TF
Select TFBS clusters to find in combination with the anchoring TF
TFBS Cluster Search Parameters
Conservation level
Matrix score threshold
Upstream / downstream sequence
Number of results to return
Sorting results

Analysis Results

Selected Parameters
Number of submitted genes/sequences
Selected TFBS clusters
Search parameters
GC composition
Results returned
Analysis Results Summary Table
TFBS Hits Table



Analysis Input


Entering Target Data Sets

Specify your list of target genes or sequences by either pasting them into the text box or using the file upload option.

For gene-based analysis, the gene list should be separated by spaces, commas, semi-colons, colons, newlines or some combination thereof. For testing purposes, pressing the Use sample genes button will paste a pre-defined set of test genes into the text area. Use the Clear button to clear the text area. You need to choose the type of gene ID from the drop down menu. The gene ID types available are:

Depending on the species chosen, other species-specific database IDs are available: Please note that while oPOSSUM attempts to use many different ways of identifying genes, the difficulty in mapping between different gene identification schemes means that not every gene in oPOSSUM is represented in every scheme. Every gene in oPOSSUM has an Ensembl gene ID, so this is the recommended gene ID type.
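As an illustration of the separators accepted for gene-based input, the following minimal Python sketch splits pasted text on spaces, commas, semi-colons, colons and newlines. The function name parse_gene_ids and the sample IDs are hypothetical, and dropping duplicates is an assumption made for this example rather than documented oPOSSUM behaviour.

```python
import re


def parse_gene_ids(text):
    """Split pasted text on whitespace, commas, semi-colons and colons."""
    tokens = re.split(r"[\s,;:]+", text.strip())
    seen = set()
    ids = []
    for tok in tokens:
        # Skip empty tokens and (by assumption) repeated IDs, keeping input order.
        if tok and tok not in seen:
            seen.add(tok)
            ids.append(tok)
    return ids


# Hypothetical placeholder IDs mixing several separators.
example = "GENE_A, GENE_B;GENE_C:GENE_D\nGENE_E  GENE_A"
print(parse_gene_ids(example))
```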

For sequence-based analysis, FASTA-formatted sequences can be pasted into the text area or uploaded as a file. If you only have genomic coordinates for your sequences, please use the Galaxy service from the UCSC.



Entering Background Data Sets

For both gene- and sequence-based systems, the same restrictions apply as given in the Entering Target Data Sets section.

For gene-based oPOSSUM systems, the full background data sets were compiled by selecting genes from Ensembl which were considered to be well known, i.e. the "KNOWN" attribute was true and the genes had a gene symbol assigned (HGNC symbol for human, MGI for mouse etc.). Transcription start sites (TSSs) were assigned for each gene based on the core Ensembl gene TSS annotations. You can paste in the list of gene IDs to be used as the background data set, or upload a file containing the IDs. Alternatively, you can specify the number of random genes to be picked as the background data set, where the maximum number allowed is the number of genes annotated in the oPOSSUM system.

The default setting for gene-based analyses is to use all genes in the oPOSSUM database as your background. Alternatively, you can specify the number of genes to be picked randomly to be used as the background (default=5000) or provide your own custom background gene list. Choosing randomly selected background genes can reduce the required computation time compared to choosing the entire gene set if the number chosen is sufficiently larger than your foreground gene set. If you are performing multiple analyses, and you wish to compare the results between analyses, we recommend you use "all" or supply your own background so that the background is stable across analyses.
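If you want a random background that stays the same across analyses, one option is to draw it once yourself and submit it as a custom background gene list. The sketch below shows the idea in Python; the function name pick_background, the seed value and the placeholder gene IDs are hypothetical, and this is not how oPOSSUM itself selects random genes.

```python
import random


def pick_background(all_gene_ids, n=5000, seed=0):
    """Draw a reproducible random background of at most n gene IDs."""
    pool = sorted(all_gene_ids)      # fixed order so the seed fully determines the draw
    n = min(n, len(pool))            # cannot pick more genes than are available
    rng = random.Random(seed)        # private generator, does not touch global state
    return sorted(rng.sample(pool, n))


# Placeholder IDs standing in for a real gene list.
genes = [f"GENE_{i}" for i in range(100)]
background = pick_background(genes, n=10, seed=1)
print(background)
```

Re-running with the same seed and the same gene pool returns the same list, which can then be reused for every analysis you wish to compare.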

For sequence-based systems, you can either choose a pre-defined background data set or supply your own. When choosing a background data set, it is important to ascertain that its GC composition closely matches that of the target data set. The pre-defined data sets include:

For sequence-based analyses, the default setting for the background selection is to provide your own set of control sequences. Alternatively, one can choose to use one of the provided backgrounds; however, these are currently limited to mouse sequences, and as each background set has a specific nucleotide composition (indicated in the background name), the available background sets may not necessarily be good matches for your foreground sequences. It is preferable to supply your own control sequences as you can control for the nucleotide composition of your sequence sets. There is an automatic background sequence generator that is currently undergoing beta testing (provided below the custom background box), which you can use to extract a background with a nucleotide composition comparable to your foreground sequences. When choosing the background, you should ensure that you have the same number (or greater) of sequences in your background set as in the foreground set.



TFBS Cluster Selection Based on TF Families

TFBS profiles from the JASPAR CORE, PBM and PENDING collections were grouped according to their structural families and clustered within each family according to their profile similarities. 170 TFBS clusters were formed in total. You can specify whether to use the entire set or a selected subset of the clusters by specifying which TF families to include in your analysis. No custom TFBS input is allowed, as this analysis relies on pre-computed TFBS clusters stored within the oPOSSUM system.

Select an anchoring TF

Select your anchoring TF from the pull down list of JASPAR CORE profiles. Species-specific restrictions may apply.

Select TF families to find in combination with the anchoring TF

When specific TF families are chosen, TFBS clusters within those families will be included in the analysis. To select multiple families, use CTRL-click (Windows) or Command-click (OS X).



TFBS Cluster Search Parameters

Conservation level

For gene-based systems, conserved regions were identified by taking the positions whose phastCons scores from the UCSC Genome Browser exceed a minimum threshold and merging them into regions at least 20 bp long.

The conserved regions that fell within 10,000 nucleotides upstream and downstream of the predicted transcription start site (TSS) were then scanned for binding sites using position weight matrix (PWM) models of transcription factor (TF) binding specificity from the JASPAR database. These TF sites were stored in the oPOSSUM database and comprise the background set for the default analysis. The background set was pre-computed with three levels of conservation filter.

The default conservation level is 0.4 (level 1). If you want to be more restrictive in your analysis, you can raise the conservation level to 0.6 (level 2).

No conservation filters are used for yeast oPOSSUM and sequence-based systems.



Matrix score threshold

TF sites are scanned by sliding the corresponding position weight matrix (PWM) along the sequence and scoring it at each position. The threshold is the minimum relative (percent) score used to report the position as a putative binding site. For gene-based oPOSSUM the background sets were scanned using a minimum threshold of 75%. The background TFBS counts were pre-computed using three threshold levels.

The relative score is computed from the raw matrix score as:
rel_score = (site_score - min_matrix_score) / (max_matrix_score - min_matrix_score)
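The formula above can be written as a short Python sketch. It shows how a raw matrix score is turned into a relative score and compared against a percent threshold. The function names and the example scores are hypothetical and chosen only for illustration.

```python
def relative_score(site_score, min_matrix_score, max_matrix_score):
    """Relative score in the range 0..1, following the formula above."""
    span = max_matrix_score - min_matrix_score
    if span <= 0:
        raise ValueError("max_matrix_score must exceed min_matrix_score")
    return (site_score - min_matrix_score) / span


def passes_threshold(site_score, min_matrix_score, max_matrix_score,
                     threshold_percent=80.0):
    """True if the site reaches the percent threshold (default 80%)."""
    rel = relative_score(site_score, min_matrix_score, max_matrix_score)
    return 100.0 * rel >= threshold_percent


# Example: a matrix whose raw scores range from -10 to 10.
print(relative_score(5.0, -10.0, 10.0))          # 0.75
print(passes_threshold(5.0, -10.0, 10.0, 75.0))  # True
print(passes_threshold(5.0, -10.0, 10.0))        # False at the 80% default
```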

The default threshold is 80%, which is a commonly used threshold for TFBS analyses using PWMs. If you have prior knowledge of which TFs are of interest for your analyses and what their properties are, you may change this threshold based on that knowledge (between 75% and 100%). For instance, if the matrix for a TF of interest has a low information content (IC), then you will want to use a higher threshold, whereas for a TF with a high IC, you might try using a lower threshold. A threshold of 80% or 85% will generally provide you with satisfactory results, but if you are uncertain, we recommend multiple analyses with various thresholds.



Maximum inter-binding distance

Predicted TF binding sites that are located within the specified maximum inter-binding distance from a predicted anchoring TFBS are counted for over-representation analysis. The distance can range up to 250 bp. The default distance is 100 bp. These values were chosen as a reasonable boundary for including TFs that may act together for a common regulatory function. If you are uncertain of what to expect in terms of distance, perform three analyses with three different distances, e.g. 50, 150, and 250 bp.
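The distance filter described above can be sketched in Python as follows. This page does not state how the distance is measured, so the sketch assumes it is the gap between the nearest edges of the two sites and that sites overlapping the anchor are not counted. Both choices, and the function name sites_near_anchor, are assumptions made for illustration.

```python
def sites_near_anchor(anchor_sites, partner_sites, max_distance=100):
    """Return partner sites lying within max_distance bp of any anchor site.

    Sites are (start, end) tuples with inclusive coordinates on one sequence.
    """
    selected = []
    for p_start, p_end in partner_sites:
        for a_start, a_end in anchor_sites:
            if p_end < a_start:
                gap = a_start - p_end - 1      # partner lies upstream of anchor
            elif a_end < p_start:
                gap = p_start - a_end - 1      # partner lies downstream of anchor
            else:
                continue                       # overlapping: skipped (assumption)
            if gap <= max_distance:
                selected.append((p_start, p_end))
                break                          # count each partner site once
    return selected


anchors = [(100, 110)]
partners = [(50, 60), (150, 160), (400, 410), (105, 115)]
print(sites_near_anchor(anchors, partners, max_distance=100))
```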



Upstream / Downstream sequence

For gene-based oPOSSUM analyses, this refers to the size of the region around the transcription start sites (TSS) which was analysed for TF binding sites. The maximum amount of upstream / downstream region used to pre-compute the background is species dependent, as follows.

Species         Max. Upstream (bp)   Max. Downstream (bp)
Human, Mouse    10,000               10,000
Fly             3,000                3,000
Worm            1,500                1,500
Yeast           1,000                Annotated 3' end of the gene

The TFBS counts within the search regions were precomputed for various levels of upstream / downstream sequence. The levels for the metazoans are:

                      Vertebrates                       Insects                           Nematodes
Search Region Level   Upstream (bp)    Downstream (bp)  Upstream (bp)    Downstream (bp)  Upstream (bp)    Downstream (bp)
1                     10,000           10,000           3,000            3,000            1,500            1,500
2                     10,000           5,000            3,000            2,000            1,000            1,000
3                     5,000            5,000            2,000            2,000            1,000            500
4                     5,000            2,000            2,000            1,000            500              500
5                     5,000            2,000            2,000            1,000            500              500
6                     5,000            2,000            2,000            1,000            500              500

Using one of the pre-defined search region levels results in much faster computation than using custom search region lengths, because the oPOSSUM system can take advantage of pre-computed values stored in the database. The default levels were chosen as reasonable regulatory region lengths for the individual organisms. Alternatively, you can choose another pre-computed region from the drop-down list or specify your own custom region. Custom regions are not pre-computed and thus take longer to analyse; if a custom region exceeds the maximum upstream or downstream limit, the system will restrict it to that limit for you.

We recommend that you perform several analyses, varying the selected search region level for each analysis. Using mouse as an example, you might start with the default of 5,000/5,000 bp, and then examine how the over-representation results change for shorter and longer search regions by comparing against the results from 2,000/2,000 bp and 10,000/10,000 bp.
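The limiting of custom regions can be pictured with the short Python sketch below. It uses the species maxima listed earlier in this section; the dictionary MAX_REGION and the function clamp_region are hypothetical names, and yeast is left out because its downstream limit is the annotated 3' end of the gene rather than a fixed length.

```python
# Maximum (upstream, downstream) search region in bp, per species.
MAX_REGION = {
    "human": (10000, 10000),
    "mouse": (10000, 10000),
    "fly": (3000, 3000),
    "worm": (1500, 1500),
}


def clamp_region(species, upstream, downstream):
    """Limit a custom region to the species-specific maxima."""
    max_up, max_down = MAX_REGION[species]
    clamped_up = min(max(upstream, 0), max_up)
    clamped_down = min(max(downstream, 0), max_down)
    return clamped_up, clamped_down


print(clamp_region("fly", 5000, 1000))     # upstream is reduced to 3000
print(clamp_region("mouse", 8000, 12000))  # downstream is reduced to 10000
```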

Notes:



Number of results to display:

You can specify the number of results to be returned. The default is to return all results, but you can choose to return only the top 5, 10 or 20 results. You can also choose to return all results which score above a given Z-score and Fisher score threshold. When anything other than the "All" option is specified, the top scoring results are determined by the scoring method chosen in the "Sort results by" section. In those cases, results that are not returned are lost, even if they scored highly in the other scoring measure.

We recommend using the default of "all results", as selecting fewer than "all" does not increase the speed of the analysis. Once you have the results file, you can manually filter the scores yourself.
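Filtering a complete result set yourself might look like the Python sketch below. It works on rows already loaded into dictionaries; the keys cluster_id, z_score and fisher_score and the example values are hypothetical, so adjust them to the column headers of your downloaded file.

```python
def filter_results(rows, min_z=None, min_fisher=None,
                   sort_by="z_score", top_n=None):
    """Keep rows above the given thresholds, sort descending, take the top N."""
    kept = [
        r for r in rows
        if (min_z is None or r["z_score"] >= min_z)
        and (min_fisher is None or r["fisher_score"] >= min_fisher)
    ]
    kept.sort(key=lambda r: r[sort_by], reverse=True)
    return kept if top_n is None else kept[:top_n]


# Made-up example rows, for illustration only.
rows = [
    {"cluster_id": "c1", "z_score": 12.0, "fisher_score": 3.0},
    {"cluster_id": "c2", "z_score": 4.0, "fisher_score": 9.0},
    {"cluster_id": "c3", "z_score": 15.0, "fisher_score": 1.0},
]
print(filter_results(rows, min_z=10.0))
print(filter_results(rows, sort_by="fisher_score", top_n=1))
```

Because the full table is kept, the same results can be re-ranked by either score without repeating the analysis.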

Sort results by:

Results can be sorted by either Z-score or Fisher score. The sort order lists the top scoring over-represented TFBS at the top of the list. The sort order is not permanent, and can be changed once the analysis is complete and the results page is displayed.



Enter your email address:

As the aCTCA process takes a considerable amount of time to complete, users are notified by email when the analysis is finished and the results are available.



Analysis Results

The results are returned in a table format, preceded by a summary of the input parameters and the GC composition of the target and background sequences used in the analysis. The results can be downloaded in a tab delimited file format (link is provided at the bottom of the table). Those results that do not have any hits in the background genes/sequences are flagged with a warning.

Selected Parameters

Sequence information (sequence-based analysis)

For sequence-based analysis, the following information on the analyzed sequences is displayed:

Number of submitted genes (gene-based analysis)

oPOSSUM systems are built based on Ensembl IDs. If external IDs are used for gene selection, oPOSSUM may not be able to map all IDs in a one-to-one manner. All calculations are based on the unique oPOSSUM genes found and their associated non-coding regions.

TFBS cluster summary

TFBS profile matrix source: JASPAR CORE profiles / User-supplied (sequence-based analysis only)

Search parameters

For gene-based analysis, the conservation level and upstream/downstream sequence lengths chosen are summarized. For both the gene-based and sequence-based analyses, the matrix score threshold is shown.

GC composition

For both the target and background sets, the GC compositions of the sequences that were searched for motifs are shown. GC content less than 0.45 or greater than 0.55 is flagged red. For sequence-based analysis, it is important to make sure that the target and background sequence sets have similar levels of GC content, as sequence composition can affect TFBS search results. Please refer to the manuscript for more details.
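To check your own sequence sets before submitting them, GC composition can be estimated with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and ignoring ambiguous bases such as N is an assumption that may differ from how oPOSSUM computes the value.

```python
def gc_fraction(sequences):
    """Fraction of G and C among the unambiguous bases of all sequences."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(base) for base in "ACGT")  # N and others ignored
    return gc / total if total else 0.0


def is_flagged(gc):
    """True if GC content falls outside the 0.45 to 0.55 range."""
    return gc < 0.45 or gc > 0.55


target = ["GGCCAATT", "ATGCATGC"]
background = ["GGGGCCCCAT", "GCGCGCNNNN"]
for name, seqs in (("target", target), ("background", background)):
    value = gc_fraction(seqs)
    print(name, round(value, 3), "flagged" if is_flagged(value) else "ok")
```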

Results returned

How the results are sorted in the summary table.



Genes Included in Analysis (gene-based analysis)

This section lists the sets of target and background genes that were entered by the user. The background gene list is shown only if you pasted in or uploaded a gene set to be used as the background (i.e. not randomly chosen). Each list is broken down into the sections Analyzed and Excluded, described below; which genes fall into each section may affect the results of the analysis.

Analyzed

This is the subset of genes entered by the user which were found within the oPOSSUM database, and therefore included in the analysis.

Excluded

This is the subset of genes entered by the user which were not found within the oPOSSUM database, and therefore had to be excluded from the analysis.



Analysis Results Summary Table

For a general explanation of what the oPOSSUM analysis results mean, please refer to the main help page.

This table contains the results of both the Z-score and Fisher analyses. The results are ordered by Z-score from most to least significant (higher to lower Z-score). Columns by which the table can be sorted have double arrow icons in the header. The columns of the table are as follows:

TFBS Cluster ID

The TFBS Cluster ID. A link is provided to a summary page for the given cluster listing the individual member TFs.

Class

The class of TFs to which this cluster belongs.

Family

The family of TFs to which this cluster belongs.

Target gene/sequence hits

The number of genes/sequences in the included target set for which this TFBS cluster was predicted within the searched regions. A link is provided to the TFBS Hits summary table that lists the actual genomic locations of the hits.

Target gene/sequence non-hits

The number of genes/sequences in the included target set for which this TFBS cluster was NOT predicted within the searched regions.

Background gene/sequence hits

The number of genes/sequences in the background set for which this TFBS cluster was predicted within the searched regions. Results with 0 background hits are flagged with a warning.

Background gene/sequence non-hits

The number of genes/sequences in the background set for which this TFBS cluster was NOT predicted within the searched regions.

Background cluster hits

The number of times this TFBS cluster was detected within the searched regions of the background set of genes/sequences. Results with 0 background hits are flagged with a warning.

Target cluster hits

The number of times this TFBS cluster was detected within the searched regions of the target set of genes/sequences. A link is provided to the TFBS Cluster Hits summary table that lists the actual genomic locations of the hits.

Background cluster rate

The rate of occurrence of this TFBS cluster within the searched regions of the background set of genes/sequences. The rate is equal to the number of times the site was predicted (background hits) multiplied by the width of the TFBS cluster, divided by the total number of nucleotides in the searched regions of the background set.

Target cluster rate

The rate of occurrence of this TFBS cluster within the searched regions of the included target set of genes/sequences. The rate is equal to the number of times the site was predicted (target hits) multiplied by the width of the TFBS cluster, divided by the total number of nucleotides in the searched regions of the included target set.
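The rate definition used for both the background and target sets can be expressed as a one-line calculation. The Python sketch below uses a hypothetical function name and made-up example numbers.

```python
def cluster_rate(n_hits, cluster_width, total_nucleotides):
    """Hits multiplied by cluster width, divided by nucleotides searched."""
    if total_nucleotides <= 0:
        raise ValueError("total_nucleotides must be positive")
    return n_hits * cluster_width / total_nucleotides


# Made-up example: 50 hits of a 10 bp wide cluster in 100,000 searched nucleotides.
print(cluster_rate(50, 10, 100_000))
```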

Z-score

The likelihood that the number of TFBS cluster nucleotides detected for the included target genes/sequences is significant as compared with the number of TFBS cluster nucleotides detected for the background set. The Z-score is expressed in units of standard deviations.
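This page does not give the Z-score formula. The Python sketch below assumes one plausible formulation, a normal approximation to the binomial distribution with a continuity correction, in which the background cluster rate is treated as the per-nucleotide probability. The function name, parameters and example numbers are hypothetical, and the values produced need not match those reported by oPOSSUM.

```python
import math


def z_score(target_hits, cluster_width, target_nucleotides, background_rate):
    """Z-score under an assumed binomial model with continuity correction."""
    observed = target_hits * cluster_width            # cluster nucleotides in target
    expected = background_rate * target_nucleotides   # expected from background rate
    sigma = math.sqrt(target_nucleotides * background_rate * (1.0 - background_rate))
    if sigma == 0:
        raise ValueError("background_rate must be strictly between 0 and 1")
    return (observed - expected - 0.5) / sigma


# Made-up example: 30 hits of a 10 bp cluster in 10,000 target nucleotides,
# with a background cluster rate of 0.02.
print(round(z_score(30, 10, 10_000, 0.02), 3))
```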

Fisher score

The probability that the number of hits vs. non-hits for the included target genes/sequences could have occurred by random chance based on the hits vs. non-hits for the background set. The negative natural logarithm of the probability is returned as the score.
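As a rough illustration of how such a score can be derived from the four hit and non-hit counts in the table, the Python sketch below computes a one-tailed Fisher exact probability and returns its negative natural logarithm. Treating the target and background sets as the two rows of a 2x2 table and using the upper (over-representation) tail are assumptions, as is the function name; the exact procedure used by oPOSSUM is not stated on this page.

```python
import math


def fisher_score(target_hits, target_non_hits,
                 background_hits, background_non_hits):
    """Negative natural log of a one-tailed Fisher exact probability."""
    n_target = target_hits + target_non_hits
    total_hits = target_hits + background_hits
    total = n_target + background_hits + background_non_hits
    denominator = math.comb(total, n_target)
    # Probability of seeing at least target_hits hits in the target set
    # (upper tail of the hypergeometric distribution).
    upper = min(total_hits, n_target)
    numerator = sum(
        math.comb(total_hits, k) * math.comb(total - total_hits, n_target - k)
        for k in range(target_hits, upper + 1)
    )
    return -math.log(numerator / denominator)


# Made-up example counts: 8 of 10 target genes hit, 20 of 100 background genes hit.
print(round(fisher_score(8, 2, 20, 80), 3))
```

A larger score corresponds to a smaller probability, which is why higher Fisher scores indicate stronger over-representation.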



TFBS Cluster Hits Summary Table

This table contains the gene and promoter information where the TFBS cluster prediction is found, along with individual TFBS cluster hit locations.

gene-based analysis

sequence-based analysis