Specify your list of target genes or sequences by either pasting them into the text box or using the file upload option.
For gene-based analysis, the gene list should be separated by spaces, commas,
semi-colons, colons, newlines or some combination thereof.
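As an illustration of the accepted separators, the following minimal Python sketch splits pasted text into individual gene IDs. The function name `parse_gene_list` is hypothetical and is not part of oPOSSUM; it only mirrors the separator rules described above.

```python
import re

def parse_gene_list(text):
    """Split pasted text on spaces, commas, semi-colons, colons and newlines."""
    # Any run of the accepted separators is treated as a single delimiter.
    tokens = re.split(r"[\s,;:]+", text.strip())
    # Drop empty strings left over from leading/trailing separators.
    return [t for t in tokens if t]

genes = parse_gene_list("BRCA1, TP53;MYC:EGFR\nGATA1  SOX2")
print(genes)  # ['BRCA1', 'TP53', 'MYC', 'EGFR', 'GATA1', 'SOX2']
```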
For testing purposes, pressing the Use sample genes button will paste
a pre-defined set of test genes into the text area.
Use the Clear button to clear the text area.
You need to choose the type of gene ID from the drop down menu. The gene ID
types available are:
For sequence-based analysis, FASTA-formatted sequences can be pasted into the text area or uploaded as a file. If you only have genomic coordinates for your sequences, please use the Galaxy service from the UCSC. Optionally, you can provide the maximum confidence positions (MCP) for each sequence, which will be used in calculating the distribution of DistMCP values (refer to the main help page on DistMCP and KS scores). The positions should be in chromosomal coordinates, and in the same order as the input sequences. If this option is to be used, the input sequences must have chromosomal coordinates as their sequence IDs, in the format of:
chrXX:start_position-end_position (e.g. chr1:10000-10500)
The positions can be either pasted or uploaded as a file. If no MCPs are specified, the mid-point of each sequence will be treated as the MCP.
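The coordinate-style sequence ID and the mid-point fallback can be sketched in Python as follows. This is only an illustration under assumptions: the helper names are hypothetical, and rounding the mid-point down to an integer is a choice made here, not something the oPOSSUM documentation specifies.

```python
import re

# Matches IDs such as "chr1:10000-10500" or "chrX:5-20".
COORD_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")

def parse_coord_id(seq_id):
    """Return (chromosome, start, end) from a chrXX:start-end sequence ID."""
    match = COORD_RE.match(seq_id)
    if match is None:
        raise ValueError(f"not a chrXX:start-end ID: {seq_id!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is greater than end in {seq_id!r}")
    return chrom, start, end

def default_mcp(seq_id):
    """Mid-point of the sequence, used when no MCP is supplied (rounded down)."""
    _, start, end = parse_coord_id(seq_id)
    return (start + end) // 2

print(parse_coord_id("chr1:10000-10500"))  # ('chr1', 10000, 10500)
print(default_mcp("chr1:10000-10500"))     # 10250
```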
For both gene- and sequence-based systems, the same restrictions apply as given in the Entering Target Data Sets section.
For gene-based oPOSSUM systems, the full background data sets were compiled by selecting genes from Ensembl which were considered to be well known, i.e. the "KNOWN" attribute was true and the genes had a gene symbol assigned (HGNC symbol for human, MGI for mouse etc.). Transcription start sites (TSSs) were assigned for each gene based on the core Ensembl gene TSS annotations. You can paste in the list of gene IDs to be used as the background data set, or upload a file containing the IDs. Alternatively, you can specify the number of random genes to be picked as the background data set, where the maximum number allowed is the number of genes annotated in the oPOSSUM system.
The default setting for gene-based analyses is to use all genes in the oPOSSUM database as your background. Alternatively, you can specify the number of genes to be picked randomly to be used as the background (default=5000) or provide your own custom background gene list. Choosing randomly selected background genes can reduce the required computation time compared to choosing the entire gene set if the number chosen is sufficiently larger than your foreground gene set. If you are performing multiple analyses, and you wish to compare the results between analyses, we recommend you use "all" or supply your own background so that the background is stable across analyses.
For sequence-based systems, you can either choose a pre-defined background data set or supply your own. When choosing a background data set, it is important to ascertain that its GC composition closely matches that of the target data set. The pre-defined data sets include:
For sequence-based analyses, the default setting for the background selection is to provide your own set of control sequences. Alternatively, one can choose to use one of the provided backgrounds; however, these are currently limited to mouse sequences, and as each background set has a specific nucleotide composition (indicated in the background name), the available background sets may not necessarily be good matches for your foreground sequences. It is preferable to supply your own control sequences as you can control for the nucleotide composition of your sequence sets. There is an automatic background sequence generator that is currently undergoing beta testing (provided below the custom background box), which you can use to extract a background with a nucleotide composition comparable to your foreground sequences. When choosing the background, you should ensure that you have the same number (or greater) of sequences in your background set as in the foreground set. If you are supplying your own background sequences, you can again optionally provide the MCP for each sequence to be used in calculating the background PeakDist distribution. As before, the positions should be in chromosomal coordinates, and in the same order as the background sequences provided.
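Before submitting a custom background, the two checks described above (comparable GC composition and at least as many background sequences as foreground sequences) can be run locally. The sketch below is a hypothetical helper, not part of oPOSSUM; the tolerance `max_gc_diff` is an arbitrary illustrative value, and ambiguous bases such as N are simply ignored.

```python
def gc_fraction(sequences):
    """Fraction of G/C among unambiguous bases (A, C, G, T) over all sequences."""
    gc = total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0

def check_background(target, background, max_gc_diff=0.05):
    """Return (enough_sequences, gc_matches, gc_target, gc_background)."""
    gc_target = gc_fraction(target)
    gc_background = gc_fraction(background)
    enough = len(background) >= len(target)
    gc_matches = abs(gc_target - gc_background) <= max_gc_diff
    return enough, gc_matches, gc_target, gc_background

target = ["ACGT", "GGCC"]
background = ["AATT", "GCGC", "ACGT"]
print(check_background(target, background))  # (True, False, 0.75, 0.5)
```

In this toy case the background is large enough but its GC fraction differs too much from the target, so a different control set would be preferable.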
TFBS profiles from the JASPAR CORE, PBM and PENDING collections were grouped according to their structural families and clustered within each family according to their profile similarities. A total of 170 TFBS clusters were formed. You can specify whether to use the entire set or a selected subset of the clusters by specifying which TF families to include in your analysis. No custom TFBS input is allowed, as this analysis relies on pre-computed TFBS clusters stored within the oPOSSUM system.
For gene-based systems, conservation was determined from phastCons scores obtained from the UCSC Genome Browser: positions scoring above a minimum threshold were merged into conserved regions with a minimum length of 20 bp.
The conserved regions that fell within 10,000 nucleotides upstream and downstream of the predicted transcription start site (TSS) were then scanned for binding sites using position weight matrix (PWM) models of transcription factor (TF) binding specificity from the JASPAR database. These TF sites were stored in the oPOSSUM database and comprise the background set for the default analysis. The background set was pre-computed with three levels of conservation filter.
No conservation filters are used for yeast oPOSSUM and sequence-based systems.
TF sites are scanned by sliding the corresponding position weight matrix (PWM) along the sequence and scoring it at each position. The threshold is the minimum relative (percent) score used to report the position as a putative binding site. For gene-based oPOSSUM the background sets were scanned using a minimum threshold of 75%. The background TFBS counts were pre-computed using three threshold levels.
The relative score is computed from the raw matrix score as:
rel_score = (site_score - min_matrix_score) / (max_matrix_score - min_matrix_score)
The default threshold is 80%, which is a commonly used threshold for TFBS analyses using PWMs. If you have prior knowledge of which TFs are of interest for your analyses and what their properties are, you may change this threshold based on that knowledge (between 75% and 100%). For instance, if the matrix for a TF of interest has a low IC, then you will want to use a higher threshold, whereas for a TF with a high IC, you might try using a lower threshold. A threshold of 80% or 85% will generally provide you with satisfactory results, but if you are uncertain, we recommend multiple analyses with various thresholds.
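To make the relative score and the effect of the threshold concrete, here is a minimal Python sketch of a PWM scan. It is an illustration only: the toy matrix values are invented, the function names are hypothetical, only the forward strand is scanned, and the sequence is assumed to contain only A, C, G and T. It is not the oPOSSUM implementation.

```python
def relative_score(site_score, min_matrix_score, max_matrix_score):
    """Relative score as defined above, on a 0..1 scale (0.80 means 80%)."""
    return (site_score - min_matrix_score) / (max_matrix_score - min_matrix_score)

def scan(sequence, pwm, threshold=0.80):
    """Slide the PWM along the sequence; report (position, site, rel_score) hits."""
    # The lowest/highest possible raw scores are the sums of the column minima/maxima.
    min_score = sum(min(column.values()) for column in pwm)
    max_score = sum(max(column.values()) for column in pwm)
    width = len(pwm)
    hits = []
    for pos in range(len(sequence) - width + 1):
        site = sequence[pos:pos + width]
        raw = sum(column[base] for column, base in zip(pwm, site))
        rel = relative_score(raw, min_score, max_score)
        if rel >= threshold:
            hits.append((pos, site, rel))
    return hits

# Toy 3-column matrix (invented values), one dict of per-base scores per column.
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]

print([(p, s) for p, s, _ in scan("TTACGACT", toy_pwm, threshold=0.80)])
# [(2, 'ACG')]
print([(p, s) for p, s, _ in scan("TTACGACT", toy_pwm, threshold=0.75)])
# [(2, 'ACG'), (5, 'ACT')]
```

Lowering the threshold from 80% to 75% admits the weaker site ACT (relative score 7/9, about 78%), which shows how a lower threshold trades specificity for sensitivity.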
For gene-based oPOSSUM analyses, this refers to the size of the region around the transcription start sites (TSS) which was analysed for TF binding sites. The maximum amount of upstream / downstream region used to pre-compute the background is species dependent, as follows.
|Species||Max. Upstream (bp)||Max. Downstream (bp)|
|Yeast||1,000||Annotated 3' end of the gene|
The TFBS counts within the search regions were precomputed for various levels of upstream / downstream sequence. The levels for the metazoans are:
|Search Region Level||Upstream (bp)||Downstream (bp)||Upstream (bp)||Downstream (bp)||Upstream (bp)||Downstream (bp)|
Using one of the pre-defined search region levels will result in much faster computation than using custom search region lengths, as the oPOSSUM system can take advantage of pre-computed values stored in the database. The default levels are reasonable in terms of regulatory region lengths for the individual organisms. Alternatively, you can choose other pre-computed regions from the drop-down list or specify your own custom region. Custom regions are not pre-computed and thus take longer to analyse; if a custom region exceeds the maximum upstream or maximum downstream limit, the system will limit the region for you.
We recommend that you perform several analyses, varying the selected search region level for each analysis. Using mouse as an example, you might start with the default of 5,000/5,000 bp, and then examine how the over-representation results change between shorter and longer search regions by comparing the results from 2,000/2,000 bp and 10,000/10,000 bp.
You can specify the number of results to be returned. The default is to return all results, but you can choose to return only the top 5, 10 or 20 results. You can also choose to return all results which score above a given Z-score and Fisher score threshold. When anything other than the "All" option is specified, the top scoring results are determined by the scoring method chosen in the "Sort results by" section. In those cases, results that are not returned are lost, even if they scored highly in the other scoring measure.
We recommend using the default of "all results", as selecting fewer than "all" does not increase the speed of the analysis. Once you have the results file, you can filter the scores yourself.
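Filtering a downloaded tab-delimited results file yourself might look like the Python sketch below. The column names ("Z-score", "Fisher score") and the sample rows are assumptions made up for illustration; check the header of your actual results file and adjust the names accordingly.

```python
import csv
import io

def filter_results(tsv_text, min_z, min_fisher,
                   z_col="Z-score", fisher_col="Fisher score"):
    """Keep rows scoring at or above both thresholds, sorted by Z-score (descending)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [
        row for row in reader
        if float(row[z_col]) >= min_z and float(row[fisher_col]) >= min_fisher
    ]
    kept.sort(key=lambda row: float(row[z_col]), reverse=True)
    return kept

# Invented example rows; a real file would be read with open(path).read().
sample = (
    "Cluster\tZ-score\tFisher score\n"
    "C1\t12.5\t8.1\n"
    "C2\t3.2\t9.0\n"
    "C3\t15.0\t1.2\n"
)

for row in filter_results(sample, min_z=10, min_fisher=5):
    print(row["Cluster"], row["Z-score"], row["Fisher score"])  # C1 12.5 8.1
```

Because the full results are kept on disk, the thresholds can be changed and the filter re-run without repeating the analysis.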
Results can be sorted by either Z-score or Fisher score. The sort order lists the top scoring over-represented TFBS at the top of the list. The sort order is not permanent, and can be changed once the analysis is complete and the results page is displayed.
The results are returned in a table format, preceded by a summary of the input parameters and the GC composition of the target and background sequences used in the analysis. The results can be downloaded in a tab delimited file format (link is provided at the bottom of the table). Those results that do not have any hits in the background genes/sequences are flagged with a warning.
For sequence-based analysis, the following information on the analyzed sequences is displayed:
This section lists the set of target and background genes that were entered by the user. The background gene list is shown only if you pasted in or uploaded a gene set to be used as the background (i.e. not randomly chosen). It is broken down into the sections Analyzed and Excluded. Please see description under Target Genes of what is meant by 'Analyzed' and 'Excluded' and how this may impact the results of the analysis.
This is the sub-set of genes entered by the user which were found within the oPOSSUM database, and therefore included in the analysis.
This is the sub-set of genes entered by the user which were not found within the oPOSSUM database, and therefore had to be excluded from the analysis.
For a general explanation of what the oPOSSUM analysis results mean, please refer to the main help page.
This table contains the results of both the Z-score and Fisher analyses. The results are ordered by Z-score from most to least significant (higher to lower Z-score). Columns by which the table can be sorted have double-arrow icons in the header. The columns of the table are as follows:
The TFBS Cluster ID. A link is provided to a summary page for the given cluster listing the individual member TFs.
The class of TFs to which this cluster belongs.
The family of TFs to which this cluster belongs.
The number of genes/sequences in the included target set for which this TFBS cluster was predicted within the searched regions. A link is provided to the TFBS Hits summary table that lists the actual genomic locations of the hits.
The number of genes/sequences in the included target set for which this TFBS cluster was NOT predicted within the searched regions.
The number of genes/sequences in the background set for which this TFBS cluster was predicted within the searched regions. Results with 0 background hits are flagged with a warning.
The number of genes/sequences in the background set for which this TFBS cluster was NOT predicted within the searched regions.
The number of times this TFBS cluster was detected within the searched regions of the background set of genes/sequences. Results with 0 background hits are flagged with a warning.
The number of times this TFBS cluster was detected within the searched regions of the target set of genes/sequences. A link is provided to the TFBS Hits summary table that lists the actual genomic locations of the hits.
The rate of occurrence of this TFBS cluster within the searched regions of the background set of genes/sequences. The rate is equal to the number of times the site was predicted (background hits) multiplied by the width of the TFBS cluster, divided by the total number of nucleotides in the searched regions of the background set.
The rate of occurrence of this TFBS cluster within the searched regions of the included target set of genes/sequences. The rate is equal to the number of times the site was predicted (target hits) multiplied by the width of the TFBS cluster, divided by the total number of nucleotides in the searched regions of the included target set.
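The rate definition used for both the background and the target set is a one-line calculation. The Python sketch below restates it; the function name and the example numbers are illustrative only.

```python
def hit_rate(n_hits, cluster_width, total_nucleotides):
    """Rate = hits * TFBS cluster width / total nucleotides in the searched regions."""
    if total_nucleotides <= 0:
        raise ValueError("total_nucleotides must be positive")
    return n_hits * cluster_width / total_nucleotides

# Invented numbers: 120 hits of a 10 bp wide cluster in 1,000,000 searched nucleotides.
print(hit_rate(120, 10, 1_000_000))  # 0.0012
```

The same function applies to either set: pass background hits with the background nucleotide total, or target hits with the target nucleotide total.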
The likelihood that the number of TFBS cluster nucleotides detected for the included target genes/sequences is significant as compared with the number of TFBS cluster nucleotides detected for the background set. Z-score is expressed in units of magnitude of the standard deviation.
The probability that the number of hits vs. non-hits for the included target genes/sequences could have occurred by random chance based on the hits vs. non-hits for the background set. The negative natural logarithm of the probability is returned as the score.
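The relationship between the four hit/non-hit counts and the Fisher score can be sketched in Python as below. This is an illustration under stated assumptions, not the oPOSSUM implementation: it assumes a one-tailed Fisher exact test (probability of at least as many target hits as observed) on a 2x2 table in which the target and background sets are treated as disjoint, and the function name is hypothetical.

```python
from math import comb, log

def fisher_score(target_hits, target_non_hits, bg_hits, bg_non_hits):
    """Negative natural log of a one-tailed Fisher exact probability (assumed form)."""
    n_target = target_hits + target_non_hits          # size of the target set
    n_hits = target_hits + bg_hits                    # total hits in both sets
    n_total = n_target + bg_hits + bg_non_hits        # all genes/sequences
    # Hypergeometric tail: P(at least `target_hits` hits fall in the target set).
    numerator = sum(
        comb(n_hits, k) * comb(n_total - n_hits, n_target - k)
        for k in range(target_hits, min(n_target, n_hits) + 1)
    )
    p_value = numerator / comb(n_total, n_target)
    if p_value == 0.0:
        return float("inf")  # probability underflowed to zero
    return -log(p_value)

# Invented counts: 3 of 4 target genes have a hit, 1 of 4 background genes does.
print(round(fisher_score(3, 1, 1, 3), 3))  # 1.415
```

A larger score corresponds to a smaller probability, so over-represented clusters sort to the top when results are ordered by Fisher score.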
For sequence-based only. The probability that the distributions of TFBSs for a given TF from the maximum confidence positions are equal between the target and background sequence sets. TFs of interest should have low p-values.
This table contains the gene and promoter information where the TFBS cluster prediction is found, along with individual TFBS cluster hit locations.