Gene Ontology similarity measurement
Some notes about measuring similarity of Gene Ontology terms and thus genes (and perhaps even proteins) on this basis.
The starting point for this investigation is the paper by Schlicker et al., "A new measure for functional similarity of gene products based on Gene Ontology", which in turn leads to the following papers:
- Resnik (1995), "Using Information Content to Evaluate Semantic Similarity in a Taxonomy"
- Resnik (1999), "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language"
- Lord et al., "Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation."
- Sheehan et al., "A relation based measure of semantic similarity for Gene Ontology annotations"
Resnik characterises each concept (or term) in a taxonomy (or ontology) using the notion of information content, stating that...
- Each concept has a probability associated with it (defining the probability of "encountering an instance" of that concept).
- Where a concept c1 is subsumed by c2 (as in c1 is-a c2) then the probability of encountering an instance of c1 is no greater than that of encountering an instance of c2, so p(c1) <= p(c2).
- Where a single root concept exists, since it subsumes all possible concepts, the probability of encountering an instance of it is 1.
- Since information content is defined as -log p(c) for a concept c, less probable concepts have higher information content.
The probability of each concept was defined by the cumulative frequency of all nouns subsumed by that concept divided by the total noun frequency of the corpus.
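As a minimal sketch of the calculation described above, the following Python computes cumulative frequencies, probabilities and information content over a toy taxonomy. The taxonomy, the counts and the function names are all hypothetical, invented for illustration; they are not taken from Resnik's corpus.

```python
import math

# Hypothetical toy taxonomy: child -> parent (is-a); "entity" is the root.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}

# Hypothetical corpus counts of nouns attached directly to each concept.
DIRECT_COUNT = {"entity": 0, "animal": 2, "plant": 4, "dog": 6, "cat": 4}


def cumulative_counts(parent, direct):
    """Add each concept's direct count to itself and to every ancestor."""
    totals = {concept: 0 for concept in direct}
    for concept, count in direct.items():
        node = concept
        while node is not None:
            totals[node] += count
            node = parent.get(node)  # None once we step past the root
    return totals


def probabilities(parent, direct):
    """p(c) = cumulative frequency of c / total noun frequency."""
    totals = cumulative_counts(parent, direct)
    n = sum(direct.values())
    return {concept: totals[concept] / n for concept in totals}


def information_content(p):
    """Information content in bits: -log2 p(c)."""
    return -math.log2(p)


if __name__ == "__main__":
    p = probabilities(PARENT, DIRECT_COUNT)
    for concept in sorted(p, key=p.get, reverse=True):
        print(concept, p[concept], information_content(p[concept]))
```

With these made-up counts the root has probability 1 and information content 0, and each concept's probability is no greater than that of its parent.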
I find the application of information content to be less than helpful in this context, since it is often used to analyse or illustrate properties of communication encodings, as described in these notes about information theory and data compression. When deriving the information content, one first takes the reciprocal 1/p(c), which appears to define the "granularity of the state space": the number of distinct states required to represent the communication of an occurrence of c. Taking the logarithm of this result (log 1/p(c), which equals -log p(c)) then gives the number of digits, or bits if a base-2 logarithm is used, required to encode such an outcome.
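To make the two steps concrete, here is a small worked example in Python. The function name states_and_bits and the probability 1/8 are illustrative choices, not anything from the papers.

```python
import math


def states_and_bits(p):
    """Return (1/p, log2(1/p)): the implied number of states and the bits needed."""
    states = 1 / p
    bits = math.log2(states)
    return states, bits


if __name__ == "__main__":
    # A concept encountered one time in eight implies 8 states, hence 3 bits.
    print(states_and_bits(1 / 8))
```

A concept with probability 1 needs a single state and therefore zero bits, which matches the root concept having no information content.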
My first instinct was to consider the specificity of ontology terms by first counting the terms subsumed by a particular term (including itself), n_subtree, and then dividing by the total number of terms, n_total, to give the "coverage" of that term. Subtracting the coverage from 1 then gives the specificity of the term, that is, specificity = 1 - n_subtree / n_total. Obviously, this only considers features of the ontology itself and not external information such as the word frequencies used by Resnik, but "how specific a term is" is a familiar concept, at least.
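A rough sketch of this coverage and specificity idea follows, assuming the ontology is available as a mapping from each term to its direct children. The toy ontology and the function names are hypothetical. Terms are collected into a set so that a term reachable through more than one parent is only counted once.

```python
# Hypothetical toy ontology: term -> direct children (each child is-a its parent).
CHILDREN = {
    "root": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],  # "d" has two parents, which a set-based count handles
    "c": [],
    "d": [],
}


def subsumed_terms(term, children):
    """Return the set of terms subsumed by `term`, including `term` itself."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def specificity(term, children):
    """1 - coverage, where coverage = n_subtree / n_total."""
    n_subtree = len(subsumed_terms(term, children))
    n_total = len(children)  # assumes every term appears as a key
    return 1 - n_subtree / n_total


if __name__ == "__main__":
    for term in CHILDREN:
        print(term, specificity(term, CHILDREN))
```

Under this measure the root always has specificity 0, and a leaf term has specificity 1 - 1/n_total, so no term ever reaches exactly 1.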
When comparing two concepts, c1 and c2, Resnik refers to the set of concepts subsuming both c1 and c2, which in a hierarchy will be the common ancestors of c1 and c2. Suppose each concept has a measure which assigns higher values to concepts further from the root of the hierarchy (more specific terms in an ontology consisting of is-a relationships directed towards the root). Then the common ancestor of c1 and c2 furthest from the root (the most specific common ancestor, or "lowest common ancestor (LCA)" according to Schlicker et al.) is likely to provide the highest scoring concept subsuming c1 and c2.
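The selection of the highest scoring common subsumer can be sketched as follows. This assumes a mapping from each term to its direct parents and a precomputed score per term (information content, specificity, or similar). The toy hierarchy, the scores and the function names are hypothetical. Taking the maximum score over the common subsumers is how Resnik defines similarity when the score is information content.

```python
# Hypothetical toy hierarchy: term -> direct parents (is-a edges towards the root).
PARENTS = {
    "root": [],
    "animal": ["root"],
    "plant": ["root"],
    "dog": ["animal"],
    "cat": ["animal"],
}

# Hypothetical per-term scores; higher means more specific.
SCORE = {"root": 0.0, "animal": 0.4, "plant": 2.0, "dog": 1.4, "cat": 2.1}


def subsumers(term, parents):
    """Return the set of terms subsuming `term`, including `term` itself."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def best_common_subsumer(c1, c2, parents, score):
    """Return (term, score) for the highest scoring concept subsuming both."""
    common = subsumers(c1, parents) & subsumers(c2, parents)
    best = max(common, key=lambda term: score[term])
    return best, score[best]


if __name__ == "__main__":
    print(best_common_subsumer("dog", "cat", PARENTS, SCORE))
    print(best_common_subsumer("dog", "plant", PARENTS, SCORE))
```

Because the highest score is taken over all common subsumers rather than by walking up a single path, the same sketch also works when terms have more than one parent. If two common subsumers tie on score, which one is returned is arbitrary.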