oPOSSUM anchored Combination Site Analysis (aCSA)

Identification of pairs of over-represented transcription factor binding sites in sets of genes or sequences

Analysis Input

Entering Input Data Sets
Entering Background Data Sets
Selecting transcription factor binding site profiles
Select an anchoring TF
Select TFs to find in combination with the anchoring TF
TFBS Search Parameters
Conservation level
Matrix match threshold
Upstream / downstream sequence
Number of results to return
Sorting results

Analysis Results

Selected Parameters
Number of submitted genes/sequences
Selected TFBS
Search parameters
GC composition
Results returned
Analysis Results Summary Table
TFBS Hits Table



Analysis Input


Entering Target Data Sets

Specify your list of target genes or sequences by either pasting them into the text box or using the file upload option.

For gene-based analysis, the gene list should preferably be formatted with one ID per line but may also be separated by spaces, commas, semi-colons, colons, newlines or some combination thereof. For testing purposes, pressing the Use sample genes button will paste a pre-defined set of test genes into the text area. Use the Clear button to clear the text area. You need to choose the type of gene ID from the drop down menu. The gene ID types available are:

Depending on the species chosen, other species-specific database IDs are available: Please note that while oPOSSUM attempts to use many different ways of identifying genes, the difficulty in mapping between different gene identification schemes means that not every gene in oPOSSUM is represented in every scheme. Every gene in oPOSSUM has an Ensembl gene ID, so this is the recommended gene ID type.

For sequence-based analysis, fasta-formatted sequences can be pasted into the text area or uploaded as a file. If you need only have genomic coordinates for your sequences, please use the Galaxy service from the UCSC.



Entering Background Data Sets

For both gene- and sequence-based systems, the same restrictions apply as given in the Entering Target Data Sets section.

For gene-based oPOSSUM systems, the full background data sets were compiled by selecting genes from Ensembl which were considered to be well known, i.e. the "KNOWN" attribute was true and the genes had a gene symbol assigned (HGNC symbol for human, MGI for mouse etc.). Transcription start sites (TSSs) were assigned for each gene based on the core Ensembl gene TSS annotations. You can paste in the list of gene ID's to be used as the background data set, or upload a file containing the ID's. Alternatively, you can specify the number of random genes to be picked as the background data set, where the maximum number allowed is the number of genes annotated in the oPOSSUM system.

The default setting for gene-based analyses is to use all genes in the oPOSSUM database as your background. Alternatively, you can specify the number of genes to be picked randomly to be used as the background (default=5000) or provide your own custom background gene list. Choosing randomly selected background genes can reduce the required computation time compared to choosing the entire gene set if the number chosen is sufficiently larger than your foreground gene set. If you are performing multiple analyses, and you wish to compare the results between analyses, we recommend you use "all" or supply your own background so that the background is stable across analyses.

For sequence-based systems, you can either choose a pre-defined background data set or supply your own. When choosing a background data set, it is important to ascertain that its GC composition closely matches that of the target data set. Ideally, the average lengths of the sequences used in the background set should also closely match the lengths used in the target set. The pre-defined data sets include:

For sequence-based analyses, the default setting for the background selection is to provide your own set of control sequences. Alternatively, one can choose to use one of the provided backgrounds; however, these are currently limited to mouse sequences, and as each background set has a specific nucleotide composition (indicated in the background name), the available background sets may not necessarily be good matches for your foreground sequences. It is preferable to supply your own control sequences as you can control for the nucleotide composition of your sequence sets. There is an automatic background sequence generator that is currently undergoing beta testing (provided below the custom background box), which you can use to extract a background with a nucleotide composition comparable to your foreground sequences. When choosing the background, you should ensure that you have the same number (or greater) of sequences in your background set as in the foreground set.



Select transcription factor binding site profiles

The matrices included in oPOSSUM-3 were obtained from the 2010 release of JASPAR database. For gene-based analysis, you are currently restricted to JASPAR CORE profiles for both the anchoring TFs and the TFs to be searched in combination. The actual profiles that can be selected vary according to the species selected. For sequence-based analysis, you have the option of pasting in or uploading your custom profiles.

The default setting is to choose "all profiles" belonging to the selected taxonomic group in the JASPAR CORE collection with a minimum specificity of 8 bits. The specificity as bits refers to the Information Content (IC) of the TF matrices. The IC, loosely speaking, combines the length and the complexity of the motif into a value that describes what we know collectively about the sequences a TF recognizes. The minimum specificity of 8 bits was chosen because TF matrices with IC of less than 8 bits are relatively uninformative for scoring DNA sequence with. If you choose to threshold matrices with a higher specificity, you are limiting your analyses to matrices with stronger, more selective patterns. In general, it is not necessary to use a threshold higher than 8 bits unless you have specific requirements, as you may later filter your final results by specificity (the IC column of the results file).

Select an anchoring TF

Select your anchoring TF from the pull down list of JASPAR CORE profiles. Species-specific restrictions may apply.

Select TFs to find in combination with the anchoring TF

To select multiple TFs, use Shift-click and/or CTRL-click (Command-click for Mac).



TFBS Search Parameters

Conservation level

For gene-based systems, conservation was determined by using phastCons scores from the UCSC Genome Browser scoring above some minimum threshold and merging these into conserved regions of minimum length 20 bp.

The conserved regions that fell within 10,000 nucleotides (species dependent) upstream and downstream of the predicted transcription start site (TSS) were then scanned for binding sites using position weight matrice (PWM) models of transcription factor (TF) binding specificity from the JASPAR database. These TF sites were stored in the oPOSSUM database and comprise the background set for the default analysis. The background set was pre-computed with three levels of conservation filter.

The default conservation level is 0.4 (level 1). If you want to be more restrictive in your analysis, you can raise the conservation level to 0.6 (level 2).

No conservation filters are used for yeast oPOSSUM and sequence-based systems.



Matrix score threshold

TF sites are scanned by sliding the corresponding postion weight matrix (PWM) along the sequence and scoring it at each position. The threshold is the minimum relative score used to report the position as a putative binding site. The background set was computed using a minimum threshold of 75%.

The relative score is computed from the raw matrix score as:
rel_score = (site_score - min_matrix_score) / (max_matrix_score - min_matrix_score)

The default threshold is 80%, which is a commonly used threshold for TFBS analyses using PWMs. If you have prior knowledge of which TFs are of interest for your analyses and what their properties are, you may change this threshold based on that knowledge (between 75% to 100%). For instance, if the matrix for a TF of interest has a low IC, then you will want to use a higher threshold, whereas for a TF with a high IC, you might try using a lower threshold. A threshold of 80% or 85% will generally provide you with satisfactory results, but if you are uncertain, we recommend multiple analyses with various thresholds.



Maximum inter-binding distance

TFs that are located within the specified maximum inter-binding distance from a predicted anchoring TFBS are counted for over-representation analysis. The distance can range up to 250 bp. The default distance is 100 bp. These values were chosen as a reasonable boundary for including TFs that may act together for a common regulatory function. If you are uncertain of what to expect in terms of distance, perform three analyses with three different distances e.g. 50, 150, and 250 bp.



Upstream / Downstream sequence

For gene-based oPOSSUM analyses, this refers to the size of the region around the transcription start sites (TSS) which was analysed for TF binding sites. The maximum amount of upstream / downstream region used to pre-compute the background is species dependent, as follows.

SpeciesMax. Upstream (bp)Max. Downstream (bp)
Human, Mouse10,00010,000
Fly3,0003,000
Worm1,5001,500
Yeast1,000Annotated 3' end of the gene

The TFBS counts within the search regions were precomputed for various levels of upstream / downstream sequence. The levels for the metazoans are:

VertebratesInsectsNematodes
Search Region Level Upstream (bp)Downstream (bp) Upstream (bp)Downstream (bp) Upstream (bp)Downstream (bp)
1 10,00010,0003,0003,0001,5001,500
2 10,0005,0003,0002,0001,0001,000
3 5,0005,0002,0002,0001,000500
4 5,0002,0002,0001,000500500
5 5,0002,0002,0001,000500500
6 5,0002,0002,0001,000500500

Using one of the pre-defined search region levels will result in much faster computation than using custom search region lengths, as the oPOSSUM system can take advantage of pre-computed values stored in the database. The default levels were reasonable in terms of regulatory region lengths for the individual organisms. Alternatively, you can choose other pre-computed regions from the drop down list or specify your own custom region. As the regions in the drop-down list are pre-computed, the analysis will complete relatively quickly. Custom regions are not pre-computed and thus take longer to analyse; if you don't remember to adhere to the max upstream and max downstream constraints, the system will limit the region for you.

We recommend that you perform several analyses, varying the selected search region level for each analyses. Using mouse as an example, you might start with the default of 5,000/5,000 bp, and then examine how the over-representaiton analyses change between shorter and longer search regions by comparing the results from 2,000/2,000 bp and 10,000/10,000 bp for another.

Notes:

Number of results to display:

You can specify the number of results to be returned. The default is to return all results, but you can specify to return only the top 5, 10 or 20 results. You can also specify to return all results which score above a given Z-score and Fisher score threshold. In cases where anything other than the "All" option is specified, the top scoring results are based on which scoring method was chosen in the "Sort results by" section. Also, in those cases, those results that are not returned, are lost, even those results that scored highly in the other scoring measure.

We recommend using the default of "all results", as selecting less than "all" does not increase the speed of the analysis. Once you have the results file, you can manually filter the scores yourself.

Sort results by:

Results can be sorted by either Z-score or Fisher score. The sort order lists the top scoring over-represented TFBS at the top of the list. The sort order is not permanent, and can be changed once the analysis is complete and the results page is displayed.

Enter your email address:

Depending on parameters selected, the aCSA process make take a considerable amount of time (several hours) to complete. Users are notified by email of the results when the analysis is finished. Please be assured that your e-mail address is used solely for this purpose.



Analysis Results

The results are returned in a table format, preceded by a summary of the input parameters and the GC composition of the target and background sequences used in the analysis. The results can be downloaded in a tab delimited file format (link is provided at the bottom of the table). Those results that do not have any hits in the background genes/sequences are flagged with a warning.

Selected Parameters

Sequence information (sequence-based analysis)

For sequence-based analysis, the following information on the analyzed sequences are displayed:

Number of submitted genes (gene-based analysis)

oPOSSUM systems are built based on Ensembl IDs. If external IDs are used for gene selection, oPOSSUM may not be able to map all IDs in an one-to-one manner. All calculations are based on the unique oPOSSUM genes found and their associated non-coding regions.

TFBS profile summary

TFBS profile matrix source: JASPAR CORE profiles / User-supplied(sequence-based analysis only)

Search parameters

For gene-based analysis, the conservation level and upstream/downstream sequence lengths chosen are summarized. For both the gene-based and sequence-based analyses, the matrix score threshold is shown.

GC composition

For both the target and background sets, the GC compositions of the sequences that were searched for motifs are shown. GC content less than 0.45 or greater than 0.55 is flagged red. For sequence-based analysis, it is important to make sure that both the target and background sequence sets to have similar levels of GC content, as sequence composition can affect TFBS search results. Please refer to the manuscript for more details.

Results returned

How the results are sorted in the summary table.



Genes Included in Analysis (gene-based analysis)

This section lists the set of target and background genes that were entered by the user. The background gene list is shown only if you pasted in or uploaded a gene set to be used as the background (i.e. not randomly chosen). It is broken down into the sections Analyzed and Excluded. Please see description under Target Genes of what is meant by 'Analyzed' and 'Excluded' and how this may impact the results of the analysis.

Analyzed

This is the sub-set of genes entered by the user which were found within the oPOSSUM database, and therefore included in the analysis.

Excluded

This is the sub-set of genes entered by the user which were not found within the oPOSSUM database, and therefore had to be excluded from the analysis.



Analysis Results Summary Table

For a general explanation of what the oPOSSUM analysis results mean, please refer to the main help page.

This table contains the results of both the Z-score and Fisher analyses. The results are ordered by Z-score from most to least significant (higher to lower z-score). Those columns that the table can be sorted by has double arrow icons in the header. The columns of the table are as follows:

TF

The name of the transcription factor.

JASPAR ID

The JASPAR ID of the transcription factor. Link is provided to the JASPAR summary page for the TF.

Class

The class of TFs to which this TF belongs.

Family

The family of TFs to which this TF belongs.

Tax group

The taxonomic supergroup to which this TF belongs.

IC

The specificity or information content (IC) of this TFBS profile's position weight matrix. 'Extreme' IC values are flagged as this may affect the results for these profiles. A very low IC matrix will be found more frequently by random chance, i.e. more false positives. A very high IC matrix will be found infrequently such that it's possible that just a few extra or fewer hits detected in the foreground or background will more greatly affect the Fisher and Z-scores. Currently those IC values less than 9 or greater than 19 are flagged.

GC Content

The GC content of the this TFBS profile. Values less than 0.33 or greater than 0.66 are flagged red.

Target gene/sequence hits

The number of genes/sequences in the included target set for which this TFBS was predicted within the searched regions. A link is provided to the TFBS Hits summary table that lists the actual genomic locations of the hits.

Target gene/sequence non-hits

The number of genes/sequences in the included target set for which this TFBS was NOT predicted within the searched regions.

Background gene/sequence hits

The number of genes/sequences in the background set for which this TFBS was predicted within the searched regions. Results with 0 background hits are flagged with a warning.

Background gene/sequence non-hits

The number of genes/sequences in the background set for which this TFBS was NOT predicted within the searched regions.

Background TFBS hits

The number of times this TFBS was detected within the searched regions of the background set of genes/sequences. Results with 0 background hits are flagged with a warning.

Target TFBS hits

The number of times this TFBS was detected within the searched regions of the target set of genes/sequences. A link is provided to the TFBS Hits summary table that lists the actual genomic locations of the hits.

Background TFBS rate

The rate of occurrence of this TFBS within the searched regions of the background set of genes/sequences. The rate is equal to the number of times the site was predicted (background hits) multiplied by the width of the TFBS profile, divided by the total number of nucleotides in the searched regions of the background set.

Target TFBS rate

The rate of occurrence of this TFBS within the searched regions of the included target set of genes/sequences. The rate is equal to the number of times the site was predicted (target hits) multiplied by the width of the TFBS profile, divided by the total number of nucleotides in the searched regions of the included target set.

Z-score

The likelihood that the number of TFBS nucleotides detected for the included target genes/sequences is significant as compared with the number of TFBS nucleotides detected for the background set. Z-score is expressed in units of magnitude of the standard deviation.

Fisher score

The probability that the number of hits vs. non-hits for the included target genes/sequences could have occured by random chance based on the hits vs. non-hits for the background set. Negative natural logarithm of the probabilities are returned as the scores.



TFBS Hits Summary Table

This table contains the gene and promoter information where the TFBS prediction is found, along with individual TFBS hit locations.

gene-based analysis

sequence-based analysis