Statistical Analysis

Gene-Based vs. Sequence-Based

Species

TFBS Clusters

Analysis Methods

Single Site Analysis (SSA)

Anchored Combination Site Analysis (aCSA)

TFBS Cluster Analysis (TCA)

Anchored Combination TFBS Cluster Analysis (aCTCA)

Download API

FAQ

oPOSSUM is a system for determining the over-representation of transcription factor binding sites (TFBS) and TFBS families within a set of (co-expressed) genes or sequences generated from high-throughput methods, as compared with a pre-compiled background set. The input is a set of gene identifiers (or FASTA-formatted sequences for sequence-based oPOSSUM). Over-representation can be calculated either for individual TFBSs or for TFBS clusters with similar binding profiles. Two types of analysis methods are available: single site methods that look at each binding site or cluster individually (SSA and TCA), and anchored combination site methods that look for over-represented pairs of binding sites or clusters (aCSA and aCTCA). The system compares the number of hits for each selected TFBS (or cluster) in the target set against the background set. Two scoring measures of significance, the Z-score and the Fisher score, are applied to determine which sites are over-represented in the target set. The results of the analysis are displayed in tabular form showing the relative rankings of the TFBSs according to the two scoring measures.

Two statistical measures of over-representation are currently employed, a Z-score and a one-tailed Fisher exact probability.

The Z-score uses the normal approximation to the binomial distribution to compare the rate of occurrence of a TFBS in the target set of genes to the expected rate estimated from the pre-computed background set.

For a given TFBS, let the random variable *X* denote the number of
predicted binding site nucleotides in the conserved non-coding
regions of the target gene set. Let *B* be the number of predicted binding site
nucleotides in the conserved non-coding regions of the background
gene set.

Using a binomial model with *n* events, where *n* is the total number of nucleotides
examined (i.e. the total number of nucleotides in the conserved non-coding regions) from the
co-expressed genes and *N* is the total number of nucleotides examined from the background
genes, the expected value of *X* is *u* = *B* × *C*, where
*C* = *n* / *N* (i.e. *C* is the ratio of sample sizes).
Taking *p* = *B* / *N* as the probability of success, the standard deviation
is given by *s* = sqrt(*n* × *p* × (1 - *p*)).

Now, let *x* be the observed number of binding site nucleotides
in the conserved non-coding regions of the co-expressed genes. By applying the Central Limit
Theorem and using the normal approximation to the binomial distribution with a continuity
correction, the z-score is calculated as
*z* = (*x* - *u* - 0.5) / *s*.
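
The calculation above can be sketched in a few lines of Python. This is only an illustration of the formulas as stated here, not the oPOSSUM source code; the function name `z_score` and the example counts are made up for demonstration.

```python
import math


def z_score(x, n, B, N):
    """Z-score from the normal approximation to the binomial.

    x: observed binding site nucleotides in the target set
    n: total nucleotides examined in the target set
    B: binding site nucleotides in the background set
    N: total nucleotides examined in the background set
    """
    p = B / N                        # probability of success
    u = B * (n / N)                  # expected value, u = B * C with C = n / N
    s = math.sqrt(n * p * (1 - p))   # binomial standard deviation
    return (x - u - 0.5) / s         # 0.5 is the continuity correction


# Hypothetical counts, chosen only to exercise the function.
z = z_score(x=1500, n=100_000, B=10_000, N=1_000_000)
print(f"z = {z:.2f}")
```

Note that *s* is zero when *B* is 0 or equal to *N*, so a real implementation would need to guard against division by zero.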

The Fisher score is based on the one-tailed Fisher exact probability. In contrast to the Z-score, the one-tailed Fisher exact probability compares the proportion of co-expressed genes containing a particular TFBS to the proportion of the background set that contains the site, in order to determine the probability of a non-random association between the co-expressed gene set and the TFBS of interest. It is calculated using the hypergeometric probability distribution, which describes sampling without replacement from a finite population consisting of two types of elements. Therefore, the number of times a TFBS occurs in the promoter of an individual gene is disregarded; instead, the TFBS is considered as either present or absent. Fisher exact probabilities are calculated using the R Statistics package. Negative natural logarithms of the probabilities are used as the Fisher scores.
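
As a rough illustration of the Fisher score, the following Python sketch computes a one-tailed hypergeometric tail probability from a 2×2 presence/absence table and takes its negative natural logarithm. It is a sketch under assumptions: oPOSSUM itself uses R, the function name `fisher_score` is hypothetical, and the target and background gene sets are assumed here to be disjoint when the table is built.

```python
import math


def fisher_score(target_hits, target_total, bg_hits, bg_total):
    """Negative natural log of the one-tailed (upper) Fisher exact probability.

    Each gene counts once: the TFBS is either present (a hit) or absent.
    Assumption: target and background sets are disjoint.
    """
    M = target_total + bg_total      # all genes
    K = target_hits + bg_hits        # all genes containing the TFBS
    n = target_total                 # genes drawn (the target set)
    # P(X >= target_hits) under the hypergeometric distribution.
    tail = sum(
        math.comb(K, k) * math.comb(M - K, n - k)
        for k in range(target_hits, min(K, n) + 1)
    )
    p = tail / math.comb(M, n)
    return math.inf if p == 0 else -math.log(p)


# Hypothetical counts: 8 of 10 target genes vs. 20 of 100 background genes.
print(f"Fisher score = {fisher_score(8, 10, 20, 100):.2f}")
```

Exact integer arithmetic keeps the tail sum accurate for small and medium gene sets; for very large sets a statistics library would be the more practical choice.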

For sequence-based SSA and TCA, an additional scoring measure that compares the empirical distributions of the TFBS locations between the target and background sets is also available. In ChIP-Seq experiments, it is expected that within peak sequence regions targeted by a specific TF, its binding sites will be located close to the maximum confidence position (MCP) of each sequence. In contrast, within background sequences, we expect the binding sites to be distributed randomly. For convenience, we refer to the distance between a binding site and the MCP as the DistMCP. To compare the empirical distributions of DistMCP in the target and background sequence sets, we employ the Kolmogorov-Smirnov (KS) test. The resulting p-value indicates how consistent the data are with the DistMCP distributions being the same in the target and background sequence sets, so TFs of interest are expected to have low p-values. As in the Fisher analysis, negative natural logarithms of the probabilities are used as the KS scores.
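
The KS comparison can be sketched as follows. This is an illustrative standard-library version, not the oPOSSUM implementation: the function names and DistMCP values are made up, and the p-value uses the common asymptotic series with a small-sample correction, which is only an approximation. In practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:   # step past all values <= v
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n1, n2, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov series."""
    ne = n1 * n2 / (n1 + n2)              # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:                         # series is ~1 here and converges slowly
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


def ks_score(target, background):
    """Negative natural log of the approximate KS p-value."""
    p = ks_pvalue(ks_statistic(target, background), len(target), len(background))
    return math.inf if p == 0 else -math.log(p)


# Hypothetical DistMCP values (bp): target sites cluster near the MCP.
target = [2, 5, 8, 10, 12, 15, 18, 20, 25, 30]
background = [5, 20, 40, 60, 80, 100, 120, 140, 160, 180]
print(f"D = {ks_statistic(target, background):.2f}, KS score = {ks_score(target, background):.2f}")
```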

High-throughput sequence profiling methods such as ChIP-Seq have resulted in a rapid increase in the number of potential regulatory sequences that need to be analyzed. With data generated from these methods, researchers focus on sets of sequences likely to be bound by a TF, rather than a list of co-expressed genes. As such data are optimal for TFBS over-representation analysis, the sequence-based oPOSSUM systems were implemented to take as input a foreground sequence set and a background sequence set. Over-representation scores are calculated based on the TFBS motifs found in the supplied input sequences. In contrast to the gene-based oPOSSUM systems, no conservation filtering is applied to the input sequences; the entire set of submitted sequences is screened for TFBS motifs.

- Search region lengths differ for each species.

| Species | Max. Upstream (bp) | Max. Downstream (bp) |
|---|---|---|
| Human, Mouse | 10,000 | 10,000 |
| Fly | 3,000 | 3,000 |
| Worm | 1,500 | 1,500 |
| Yeast | 1,000 | Annotated 3' end of the gene |

- Worm oPOSSUM accounts for operon structures by restricting the search space to the regions flanking the annotated start position of the first gene in the operon. TFBS predictions made in the search region of the first gene are universally applied to all other genes in the operon.
- Yeast oPOSSUM does not have conservation filtering.

In general, the scores are used to *rank* the over-representation of a TF's putative binding sites from the most strongly over-represented to the least, to aid you in selecting potential TFs or structural classes of interest. The scores depend on a number of factors. One is the number of sequences analyzed, so scores from different analyses should not be compared unless the numbers of sequences used are similar. Another factor is your selection of background. If your background does not have a nucleotide composition similar to that of the target sequences, you risk biasing your analysis toward a subset of TF motifs. A simple way to detect such a bias is to plot the Z-score (y-axis) against the GC content of the TF profile (x-axis), both of which are columns in the final results table. If TF matrices with a high (or low) GC content tend to have the highest-ranking scores, you may need to go back and check the nucleotide composition of your background relative to your target sequences.
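
As a numerical companion to the plot described above, one could also compute the correlation between profile GC content and Z-score. The sketch below is only a rough aid and not an oPOSSUM feature: the row layout, TF names, and values are hypothetical, and a Pearson coefficient captures only a linear trend, so the plot remains the better diagnostic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rows from a results table: (TF name, GC content, Z-score).
rows = [
    ("TF_A", 0.30, -4.1),
    ("TF_B", 0.42, 1.0),
    ("TF_C", 0.55, 6.3),
    ("TF_D", 0.68, 12.8),
    ("TF_E", 0.75, 17.5),
]

r = pearson([gc for _, gc, _ in rows], [z for _, _, z in rows])
print(f"GC content vs. Z-score correlation: r = {r:.2f}")
```

A coefficient close to 1 or -1, as in this made-up example, would suggest checking the nucleotide composition of the background before interpreting the rankings.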

There is no specific threshold that can be recommended for any one data set; however, within the results of an analysis you may be able to select a group of interesting TFs by the relative values of their scores. If your experiments were designed to provide well-defined data sets highlighting genes regulated by a TF, then it is common to see one or several TFs with scores clearly higher than the majority of the scores. In general, a good overview of your results can be obtained from a graphical plot. For instance, you might plot the TFs by rank on the x-axis and a corresponding score on the y-axis, plot both scores against each other, or use the plot mentioned in the prior paragraph to display one of the scores against the GC content of the TF profiles. In our experience, a clear segregation of scores is the most reliable indication of functional relevance.

TFs of the same structural family often share similar DNA binding profiles. If a given TF family has numerous member TFs with almost identical consensus sequences, oPOSSUM SSA and aCSA results can be dominated by the profiles for these TFs. To avoid this situation, we can condense the redundant profiles by clustering them together based on their matrix similarities. The degree of similarity among the consensus sequences of the TFs can vary greatly depending on the TF family in question. Thus, profiles within each TF family (as annotated in JASPAR 2010) have been clustered according to their matrix similarity.

For a detailed summary of the clustering process, please refer to the manuscript. As of April 2011, 250 profiles from the CORE collection and 184 from the PBM collection were included in the clustering process, along with the 4 profiles from the custom PENDING collection, resulting in 170 distinct clusters.

#### Single Site Analysis (SSA)

#### Anchored Combination Site Analysis (aCSA)

#### TFBS Cluster Analysis (TCA)

#### Anchored Combination TFBS Cluster Analysis (aCTCA)

The oPOSSUM-3.0 source code is available at the download page.