# Gene Ontology similarity measurement

Some notes about measuring similarity of Gene Ontology terms and thus genes (and perhaps even proteins) on this basis.

The starting point for this investigation is the paper by Schlicker et al., "A new measure for functional similarity of gene products based on Gene Ontology", which in turn leads to the following papers:

- Resnik (1995), "Using Information Content to Evaluate Semantic Similarity in a Taxonomy"
- Resnik (1999), "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language"
- Lord et al., "Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation."
- Sheehan et al., "A relation based measure of semantic similarity for Gene Ontology annotations"

Resnik quantifies how informative each concept (or term) in a taxonomy (or ontology) is using the notion of information content, stating that:

- Each concept has a probability associated with it (defining the probability of "encountering an instance" of that concept).
- Where a concept *c_{specific}* is subsumed by *c_{general}* (as in *c_{specific} is-a c_{general}*) then the probability of encountering an instance of *c_{specific}* is less than that of encountering an instance of *c_{general}*.
- Where a single root concept exists, since it subsumes all possible concepts, the probability of encountering an instance of it is 1.
- Since information content is defined as *-log p(c)* for a concept *c*, less probable concepts have higher information content.

Resnik estimated the probability of each concept as the cumulative frequency of all nouns subsumed by that concept divided by the total noun frequency of the corpus.
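As a minimal sketch of the definitions above, the following Python derives probabilities and information content from cumulative frequencies. The concept names and counts are invented for illustration and do not come from Resnik's corpus; the base-2 logarithm is also an assumption, and a different base would only rescale the values.

```python
import math

# Hypothetical cumulative frequencies: each count already includes the
# counts of every concept subsumed by that concept.
cumulative_freq = {
    "entity": 1000,  # root concept: subsumes everything
    "animal": 400,
    "dog": 50,
}


def probabilities(cumulative, root):
    """p(c) = cumulative frequency of c / total frequency (the root's count)."""
    total = cumulative[root]
    return {concept: count / total for concept, count in cumulative.items()}


def information_content(p):
    """IC(c) = log2(1 / p(c)) = -log2 p(c); rarer concepts score higher."""
    return math.log2(1.0 / p)


p = probabilities(cumulative_freq, "entity")
ic = {concept: information_content(pc) for concept, pc in p.items()}

for concept in cumulative_freq:
    print(f"{concept}: p = {p[concept]:.3f}, IC = {ic[concept]:.3f} bits")
```

The root has probability 1 and therefore zero information content, while the values grow as concepts become more specific.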

## Applying Information Content to Communications

Information content is often used to analyse or illustrate properties of communications representations, as described in these notes about information theory and data compression. When deriving the information content, one first takes the reciprocal of *p(c)* (that is, *1/p(c)*), which appears to define the "granularity of the state space", or the number of distinct states required to represent the communication of an occurrence of *c*. Taking the logarithm of this result (*log 1/p(c)* and thus *-log p(c)*) then gives the number of digits, or bits if a base-2 logarithm is used, required to encode such an outcome.

Thus, if *c* is highly probable, occurring with *p(c) = 0.5*, then *-log p(c) = -(-1) = 1* (using a base-2 logarithm), indicating that a single bit is enough to signal the presence of *c* in a signal (with a value of 1, say), whereas all other values would be encoded with an initial bit distinguishing them from *c* (therefore, with a value of 0) and additional bits employed if necessary. This can be visualised using a tree:

- *c* (*p(c) = 0.5*)
- (not *c*)
  - *d* (*p(d) = 0.25*)
  - (not *d*)
    - *e* (*p(e) = 0.15*)
    - *f* (*p(f) = 0.1*)

Clearly, in a communications context, the aim is to minimise the size of the message by favouring the most frequent values.
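To make the tree above concrete, the following sketch compares the ideal code length *-log p* of each value with the codeword obtained by walking the tree. The bit assignment (1 for "this value", 0 for "not this value", and the choice of which of *e* and *f* receives the final 1) is an assumption made for illustration.

```python
import math

# Probabilities from the example tree.
p = {"c": 0.5, "d": 0.25, "e": 0.15, "f": 0.1}

# Assumed codewords read off the tree: 1 = "this value", 0 = "not this value".
# The last branch needs only one more bit to separate e from f.
codes = {"c": "1", "d": "01", "e": "001", "f": "000"}

# Ideal (possibly fractional) number of bits for each value.
ideal_bits = {s: -math.log2(ps) for s, ps in p.items()}

# Average message size with the tree code, and the theoretical lower bound.
expected_length = sum(p[s] * len(codes[s]) for s in p)
entropy = sum(p[s] * ideal_bits[s] for s in p)

for s in p:
    print(f"{s}: ideal {ideal_bits[s]:.2f} bits, codeword {codes[s]!r}")
print(f"expected length {expected_length:.2f} bits, entropy {entropy:.2f} bits")
```

With these numbers the tree code averages 1.75 bits per value, slightly above the entropy of roughly 1.74 bits, because *e* and *f* cannot be given fractional code lengths.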

## Returning to Concept Similarity

An initial attempt to translate the notion of information content to concept similarity is to consider the specificity of ontology terms by first counting those subsumed by a particular term (including itself) *n_{subtree}* and then dividing by the total number of terms *n_{total}* to give the "coverage" of a particular term, subtracting this from 1 to give the specificity of a term. Obviously, this only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least, and one which upholds the concept hierarchy within the information content framework.

To measure specificity more accurately, one might introduce frequency observations to the ontology terms, maintaining the general property that more general terms (such as *c_{general}*) subsume more specific terms (such as *c_{1}*, *c_{2}*, ...) such that each term's resultant frequency *r* is defined as...

*r(c _{general}) = r(c_{1}) + r(c_{2}) + ... + r(c_{n}) + f(c_{general})*

...where *f* is the observed frequency of the term itself. Since the resultant frequency of any given term includes contributions from the entire subtree of the ontology of which it is the root node, the hierarchical information encoded in the more naive approach is preserved.

The remaining difficulty lies in defining what the "observed frequency" of a term is.
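Both approaches in this section can be sketched on a toy ontology. Everything below is hypothetical: the term names, the structure, and in particular the observed frequencies, which these notes leave undefined. The sketch also assumes a strict tree in which each term has a single parent; in a directed acyclic graph such as the Gene Ontology, where a term may have several parents, this simple recursion would count shared descendants more than once.

```python
# Hypothetical toy ontology as a tree: term -> list of direct children.
children = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}

# Hypothetical observed frequencies f(c) for each term.
observed = {"root": 0, "a": 2, "b": 5, "a1": 4, "a2": 1}


def subtree_size(term):
    """n_subtree: number of terms subsumed by `term`, including itself."""
    return 1 + sum(subtree_size(child) for child in children[term])


def specificity(term):
    """1 - coverage, where coverage = n_subtree / n_total."""
    n_total = len(children)
    return 1 - subtree_size(term) / n_total


def resultant_frequency(term):
    """r(c) = f(c) + sum of r over the direct children of c."""
    return observed[term] + sum(
        resultant_frequency(child) for child in children[term]
    )


for term in children:
    print(
        f"{term}: specificity = {specificity(term):.2f}, "
        f"r = {resultant_frequency(term)}"
    )
```

Dividing each resultant frequency by that of the root would give a probability of the kind Resnik uses, with the root at 1.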

## Concept Comparison

When comparing two concepts, *c1* and *c2*, Resnik refers to the set of concepts subsuming *c1* and *c2* which in a hierarchy will be the common ancestors of *c1* and *c2*. Given a measure for each concept which assigns higher values for concepts further from the root of the hierarchy (more specific terms in an ontology consisting of *is-a* relationships directed towards the root), the common ancestor of *c1* and *c2* furthest from the root (the most specific common ancestor, or "lowest common ancestor (LCA)" according to Schlicker et al.) is likely to provide the highest scoring concept subsuming *c1* and *c2*.
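The comparison can be sketched as follows, assuming a simple tree and taking the similarity of two concepts to be the highest information content found among their common ancestors, which is how Resnik's measure is usually stated. The hierarchy, the probabilities, and all function names are hypothetical.

```python
import math

# Hypothetical is-a hierarchy: term -> parent (None marks the root).
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}

# Hypothetical probabilities, never decreasing on the way to the root.
p = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.25, "a2": 0.125}


def ancestors(term):
    """All concepts subsuming `term`, including the term itself."""
    result = set()
    while term is not None:
        result.add(term)
        term = parent[term]
    return result


def information_content(term):
    """IC(c) = log2(1 / p(c)) = -log2 p(c)."""
    return math.log2(1.0 / p[term])


def most_informative_common_ancestor(c1, c2):
    """Common ancestor of c1 and c2 with the highest information content."""
    common = ancestors(c1) & ancestors(c2)
    return max(common, key=information_content)


def similarity(c1, c2):
    """Information content of the most informative common ancestor."""
    return information_content(most_informative_common_ancestor(c1, c2))


print(most_informative_common_ancestor("a1", "a2"), similarity("a1", "a2"))
print(most_informative_common_ancestor("a1", "b"), similarity("a1", "b"))
```

In a tree the most informative common ancestor coincides with the lowest common ancestor; where a term can have several parents, the set intersection still applies but `ancestors` would need to follow every parent.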