### Measuring Similarity of Protein Motif Content

This sequence analysis tool allows you to find gene products whose protein motif content is similar to that of a probe.

Gene products are represented by accessions for full-length transcripts. Accessions are either RefSeq accessions or MGC accessions. In some instances, a given gene may have multiple RefSeq accessions and multiple MGC accessions. Sometimes this multiplicity represents entirely duplicate transcripts; sometimes it represents transcripts having identical coding regions but variant untranslated regions; sometimes it represents transcripts whose coding regions differ by only one amino acid; and sometimes it represents alternative splice forms of the gene product. In the table of results, accessions having identical coding regions are grouped together.

A protein domain is a conserved protein region. The Pfam database is a collection of multiple
alignments of protein domains and of the Hidden Markov Models (or motifs) built from these
alignments. One can use the HMMER program to determine whether a given amino acid sequence
(the probe) contains a region that is similar to any Pfam motif. The degree of similarity is
measured by the *score* of the match: the higher the score, the closer the match. The
statistical significance of the match is measured by the *e-value*: the lower the e-value,
the less likely the match is due to chance.

We have extended this notion of similarity to take account of the fact that a given protein
may contain matches to multiple motifs. Suppose there are three gene products NM_999, NM_888,
and NM_777. Further, suppose that NM_999 contains good matches to motifs M1 and M2, that
NM_888 also contains good matches to M1 and M2, and that NM_777 contains a good match to
only M1. In this case, we conclude that NM_999 is more similar to NM_888 than it is to NM_777.
The measure of this kind of similarity is
the *p-value*. Given a probe and a set of proteins that have at least one motif in
common with the probe, we compute a separate p-value for each protein in this set, using the
Bayes' rule for conditional probability.
The sum of all these p-values will be equal to 1, and the greater the p-value, the greater
the degree of similarity. Note that, since the sum of p-values for a given candidate
set is constrained to equal 1, the absolute magnitude of a p-value has little meaning.
Rather, one must evaluate the p-value of a given match against the p-values of other
matches in the set.
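
The normalization described above can be sketched in Python. The likelihood model behind the p-values is not specified here, so the per-motif values `P_SHARED` and `P_MISSING`, the uniform prior, and the function name `motif_p_values` are illustrative assumptions rather than the tool's actual method. The sketch only shows how Bayes' rule can yield values that sum to 1 over the candidate set.

```python
# Illustrative only: this likelihood model is an assumption, not the tool's actual method.
P_SHARED = 0.9   # assumed likelihood factor when a candidate contains a probe motif
P_MISSING = 0.1  # assumed likelihood factor when a candidate lacks a probe motif


def motif_p_values(probe_motifs, candidates):
    """Return a normalized "p-value" per candidate using Bayes' rule.

    candidates maps accession -> set of motif names. Only candidates that
    share at least one motif with the probe are scored, and a uniform prior
    over those candidates is assumed.
    """
    likelihoods = {}
    for accession, motifs in candidates.items():
        if not probe_motifs & motifs:
            continue  # no motif in common: not part of the candidate set
        likelihood = 1.0
        for motif in probe_motifs:
            likelihood *= P_SHARED if motif in motifs else P_MISSING
        likelihoods[accession] = likelihood

    if not likelihoods:
        return {}
    total = sum(likelihoods.values())
    # With a uniform prior, the posterior is each likelihood divided by the total.
    return {acc: lik / total for acc, lik in likelihoods.items()}


p_values = motif_p_values(
    {"M1", "M2"},  # motifs matched in the probe NM_999
    {"NM_888": {"M1", "M2"}, "NM_777": {"M1"}},
)
print(p_values)
```

Under these assumed factors, NM_888 receives a larger share than NM_777 because it matches both probe motifs. Changing the factors changes the individual values but not the fact that they sum to 1.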

For example, suppose a probe NM_999 is determined to be similar to NM_888 with a p-value of 0.50 and to be similar to NM_777 with a p-value of 0.25. Then we can be twice as confident that NM_999 is closely related to NM_888 as that it is closely related to NM_777.

If a gene has RefSeq or MGC accessions that have been analyzed for protein motif content, then the Gene Info page for that gene will hyperlink these accessions to the search for similar proteins (under the header "Protein Similarities Based on Shared Motif Content"). This search uses the following parameters:

- e-value: 1.0e-3
- score: 0
- p-value: 0
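
As a rough sketch of how these parameters might act as filters, the Python below keeps motif matches whose e-value is at most 1.0e-3 and whose score is at least 0, and keeps similar proteins whose p-value is at least 0. The direction of each comparison, the record layout, and the names `passes_motif_cutoffs` and `passes_p_value_cutoff` are assumptions for illustration, and the sample hits are made up.

```python
# Search parameters taken from the list above.
MAX_E_VALUE = 1.0e-3
MIN_SCORE = 0.0
MIN_P_VALUE = 0.0


def passes_motif_cutoffs(hit):
    """Assumed rule: keep a motif match with a low enough e-value and a high enough score."""
    return hit["e_value"] <= MAX_E_VALUE and hit["score"] >= MIN_SCORE


def passes_p_value_cutoff(p_value):
    """Assumed rule: keep a similar protein whose p-value meets the minimum."""
    return p_value >= MIN_P_VALUE


# Hypothetical motif matches for one probe sequence.
hits = [
    {"motif": "M1", "score": 52.3, "e_value": 2.0e-12},
    {"motif": "M2", "score": 8.1, "e_value": 0.5},
]

kept = [hit["motif"] for hit in hits if passes_motif_cutoffs(hit)]
print(kept)
```

With a p-value cutoff of 0, every protein in the candidate set would pass the last filter under this reading, so only the e-value and score cutoffs restrict the results.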