National Cancer Institute, U.S. National Institutes of Health

Measuring Similarity of Protein Motif Content

This sequence analysis tool allows you to find gene products that are similar to a probe.

Gene products are represented by accessions for full-length transcripts. Accessions are either RefSeq accessions or MGC accessions. In some instances, a given gene may have multiple RefSeq accessions and multiple MGC accessions. Sometimes this multiplicity represents entirely duplicate transcripts; sometimes it represents transcripts having identical coding regions but variant untranslated regions; sometimes it represents transcripts whose coding region differs in only one amino acid; and sometimes it represents alternative splice forms of the gene product. In the table of results, accessions having identical coding regions are grouped together.

A protein domain is a conserved protein region. The Pfam database is a database of multiple alignments of protein domains and Hidden Markov Models (or motifs) built from these alignments. One can use the HMMER program to determine whether a given amino acid sequence (the probe) contains a region that is similar to any Pfam motif. The degree of similarity is measured by the score of the match: the higher the score, the closer the match. The statistical significance of the match is measured by the e-value: the lower the e-value, the less likely the match is due to chance.

We have extended this notion of similarity to take account of the fact that a given protein may contain matches to multiple motifs. Suppose there are three gene products NM_999, NM_888, and NM_777. Further, suppose that NM_999 contains good matches to motifs M1 and M2, that NM_888 also contains good matches to M1 and M2, and that NM_777 contains a good match to only M1. In this case, we will conclude that NM_999 is more similar to NM_888 than it is to NM_777. The measure of this kind of similarity is the p-value. Given a probe and a set of proteins that have at least one motif in common with the probe, we compute a separate p-value for each protein in this set, using Bayes' rule for conditional probability. The sum of all these p-values will be equal to 1, and the greater the p-value, the greater the degree of similarity. Despite the name, this p-value is therefore a normalized measure of relative similarity, not the p-value of a significance test. Note that, since the sum of p-values for a given candidate set is constrained to equal 1, the absolute magnitude of a p-value has little meaning. Rather, one must evaluate the p-value of a given match against the p-values of other matches in the set.
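The normalization described above can be illustrated with a minimal Python sketch. The page does not state the prior or the likelihood model that the tool uses, so both are placeholders here: the sketch assumes a uniform prior over candidates and a hypothetical likelihood in which each probe motif appears in a truly related protein with probability `q`. The function name `motif_p_values`, the parameter `q`, and the `MOTIFS` table are illustrative only.

```python
from math import prod

# Motif content from the worked example in the text.
MOTIFS = {
    "NM_999": {"M1", "M2"},
    "NM_888": {"M1", "M2"},
    "NM_777": {"M1"},
}

def motif_p_values(probe, motifs, q=0.9):
    """Return a normalized similarity value for each candidate protein.

    Assumptions (not from the source): a uniform prior over candidates and
    a placeholder likelihood in which each probe motif is present in a
    related protein with probability q.
    """
    probe_motifs = motifs[probe]
    # Candidates are proteins sharing at least one motif with the probe.
    candidates = {
        acc: found
        for acc, found in motifs.items()
        if acc != probe and found & probe_motifs
    }
    # Unnormalized weight: likelihood only, since the uniform prior cancels.
    weights = {
        acc: prod(q if m in found else 1.0 - q for m in sorted(probe_motifs))
        for acc, found in candidates.items()
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    # Bayes' rule: dividing by the total makes the values sum to 1.
    return {acc: w / total for acc, w in weights.items()}

p_values = motif_p_values("NM_999", MOTIFS)
```

Under these placeholder assumptions, NM_888 receives a larger value than NM_777, and the values over the candidate set sum to 1. Only that ordering and the normalization reflect the text; the actual magnitudes depend on the tool's real model.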

For example, suppose a probe NM_999 is determined to be similar to NM_888 with a p-value of 0.50 and to be similar to NM_777 with a p-value of 0.25. Then we can be twice as confident that NM_999 is closely related to NM_888 as that it is closely related to NM_777.
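Because only relative magnitudes matter, it can help to express each p-value as a fraction of the largest one in the set. The short Python sketch below does this for the two values in the example. The helper name `relative_to_best` is hypothetical and is not part of the tool.

```python
def relative_to_best(p_values):
    """Rank candidates and express each p-value as a fraction of the best."""
    if not p_values:
        return []
    best = max(p_values.values())
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    return [(acc, p / best) for acc, p in ranked]

# The two matches from the example; the remaining 0.25 of the total
# would belong to other candidates in the set.
example = {"NM_888": 0.50, "NM_777": 0.25}
ranking = relative_to_best(example)
# ranking == [("NM_888", 1.0), ("NM_777", 0.5)]
```

NM_777 scores half as high as the best match, which is the "twice as confident" comparison made in the text.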

If a gene has RefSeq or MGC accessions that have been analyzed for protein motif content, then the Gene Info page for that gene will hyperlink these accessions to the search for similar proteins (under the header "Protein Similarities Based on Shared Motif Content"). This search uses the following parameters:

  • e-value: 1.0e-3
  • score: 0
  • p-value: 0
To make these parameters more or less stringent, try the Protein Motif Query page.
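As a rough illustration of how the three default parameters could act as cutoffs, the Python sketch below filters motif matches by e-value and score, then filters candidate proteins by p-value. The record type `MotifHit`, the function names, and the sample values are hypothetical. Whether the tool treats each threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class MotifHit:
    accession: str   # RefSeq or MGC accession of the protein
    motif: str       # name of the matched Pfam motif
    e_value: float   # lower is more significant
    score: float     # higher is a closer match

def good_hits(hits, max_e_value=1.0e-3, min_score=0.0):
    """Keep motif matches that meet the e-value and score cutoffs."""
    return [h for h in hits if h.e_value <= max_e_value and h.score >= min_score]

def similar_proteins(p_values, min_p_value=0.0):
    """Keep candidate proteins whose p-value meets the cutoff."""
    return {acc: p for acc, p in p_values.items() if p >= min_p_value}

# Made-up hits for illustration only.
hits = [
    MotifHit("NM_888", "M1", 1.0e-20, 85.0),
    MotifHit("NM_888", "M2", 3.0e-8, 40.0),
    MotifHit("NM_777", "M2", 5.0e-2, 12.0),  # fails the e-value cutoff
]
kept = good_hits(hits)
```

With the default p-value cutoff of 0, no candidate is excluded on p-value alone. Raising `min_p_value` narrows the results to the closest matches.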