DiG: Disease groups

From irefindex
Revision as of 11:55, 17 October 2011 by Ian.donaldson


We investigate properties of human disease genes within the protein-protein interaction network. Two essential datasets are required for this research:

  1. the human protein-protein interaction network and
  2. a list of genes associated with genetic disorders

The human interaction network is taken from the iRefIndex resource and the creation of the disease gene list is described on this page.

Disease groups can be searched using the iRefScape plugin for Cytoscape.

Method for sorting entries in Morbid Map into disease groups (DiG)

OMIM Morbid Map

The largest and most reliable disease association list is available from the Online Mendelian Inheritance in Man (OMIM) database. The Morbid Map, maintained within this resource, lists genes and their disease associations and is exported as a simple table with about 5000 entries. Each line in this table corresponds to one gene-disorder association and has the following layout:

Disease title, OMIM id referring to the disease, and evidence code | gene symbol and synonyms | OMIM id referring to the gene entry | locus

The evidence code is a number in parentheses that indicates how the association was established: the disorder was positioned by mapping the wild-type gene (1), by mapping the disease phenotype itself (2), or by both approaches (3). Code (3) includes mapping of the wild-type gene combined with demonstration of a mutation in that gene in association with the disorder. To ensure the best possible data for our research, we used only entries denoted with (3).

Example of Morbid Map entry:

Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27
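
As an illustration of the layout above, here is a minimal Python sketch that splits one Morbid Map line into its fields. It is not the code used to build DiG; the names MorbidEntry and parse_line are hypothetical, and the sketch assumes that every line has exactly four "|"-separated columns and that the OMIM id, when present, is a six-digit number placed just before the evidence code.

```python
import re
from typing import NamedTuple, Optional


class MorbidEntry(NamedTuple):
    """One gene-disorder association (hypothetical container)."""
    title: str
    disease_omim: Optional[str]   # None when no disease OMIM id is given
    evidence: Optional[int]       # 1, 2 or 3; None if the tag is missing
    symbols: list                 # gene symbol followed by its synonyms
    gene_omim: str
    locus: str


# Title, then an optional ", <6-digit OMIM id>", then the "(n)" evidence code.
_TAIL = re.compile(
    r"^(?P<title>.*?)(?:,\s*(?P<omim>\d{6}))?\s*\((?P<code>\d)\)\s*$"
)


def parse_line(line: str) -> MorbidEntry:
    """Split one Morbid Map line into separate fields."""
    disease, symbols, gene_omim, locus = [f.strip() for f in line.split("|")]
    m = _TAIL.match(disease)
    if m:
        title = m.group("title").strip()
        omim = m.group("omim")
        code = int(m.group("code"))
    else:
        # No recognisable evidence tag: keep the raw text as the title.
        title, omim, code = disease, None, None
    names = [s.strip() for s in symbols.split(",")]
    return MorbidEntry(title, omim, code, names, gene_omim, locus)


entry = parse_line("Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27")
keep = entry.evidence == 3   # only entries denoted with (3) are used
```

Filtering on the evidence field then keeps only the entries denoted with (3).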

Pre-processing the Morbid Map

For the map to be readily cross-referenced with our system, gene symbols were translated into their corresponding Entrez Gene identifiers. Great care was taken in this translation because gene names are not unique. The gene symbol was first searched in the gene info table (released by the Entrez Gene database), with the taxonomy identifier 9606 (human) and chromosome limits (locus information is present in Morbid Map) used to restrict the search results. Where no hit was found, the search was broadened to include synonyms and locus names for the gene name listed in Morbid Map. Despite this effort, about 140 cases in the Morbid Map remained unresolved (no hit or too many hits). The evidence code and disease OMIM id were separated from the disease title column into separate database fields. Nearly 800 entries have no disease OMIM id specified; these are mostly distinct titles.
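
The lookup order described above can be sketched as follows. This is one possible reading of the procedure, not the original implementation: the rows, gene identifiers and column layout are invented for illustration, and the helper names (GeneRow, chromosome_of, resolve) are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class GeneRow(NamedTuple):
    """A few columns of a gene-info-like table (assumed layout)."""
    tax_id: str
    gene_id: str
    symbol: str
    synonyms: tuple
    chromosome: str


# Invented rows; the identifiers are NOT real Entrez Gene ids.
GENE_INFO = [
    GeneRow("9606", "1001", "ABC1", ("XYZ",), "6"),
    GeneRow("9606", "1002", "XYZ", (), "11"),
    GeneRow("10090", "2001", "ABC1", (), "17"),
]


def chromosome_of(locus: str) -> Optional[str]:
    """Chromosome part of a cytogenetic locus such as '6q27'."""
    m = re.match(r"(\d+|[XY])", locus)
    return m.group(1) if m else None


def resolve(names, locus, rows=GENE_INFO, tax_id="9606"):
    """Return a single gene id, or None for 'no hit or too many hits'.

    names[0] is the Morbid Map gene symbol; the rest are its synonyms.
    """
    chrom = chromosome_of(locus)
    candidates = [r for r in rows
                  if r.tax_id == tax_id and r.chromosome == chrom]
    # First pass: official symbol only.
    hits = {r.gene_id for r in candidates if r.symbol == names[0]}
    if not hits:
        # Broadened pass: any listed name against symbol or synonyms.
        wanted = set(names)
        hits = {r.gene_id for r in candidates
                if wanted & ({r.symbol} | set(r.synonyms))}
    return hits.pop() if len(hits) == 1 else None
```

Restricting candidates by species and chromosome before comparing names is what keeps a non-unique symbol (here ABC1, present in two species) from producing several hits.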

Matching Morbid Map entries

In many cases, one disease appears in Morbid Map in association with more than one gene. These “multigenic” diseases are the most valuable to us, since we can study the properties of the sub-networks that these genes form within the larger human network. Groups of disease-gene associations were constructed using disease titles (first column in the Morbid Map). In some cases, multiple entries have identical titles; these were assigned the same disease group identifier. However, titles were not always exactly the same and varied in the detail given. These variations often describe disease sub-types or result from a simple inconsistency.

Example of a disease with sub-types:

Parkinson disease 11, 607688 (3)
Parkinson disease 13, 610297 (3)
Parkinson disease 4, autosomal dominant Lewy body, 605543 (3)
Parkinson disease 6, early onset, 605909 (3)
Parkinson disease 7, autosomal recessive early-onset, 606324 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, familial, 168600 (3)
Parkinson disease, familial, 168601 (3)
Parkinson disease, juvenile, type 2, 600116 (3)
Parkinson disease, resistance to, 168600 (3)
Parkinson disease-8, 607060 (3)
{Parkinson disease, protection against}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease}, 168600 (3)

Example of an inconsistency:

Li Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 609265 (3)

Grouping of titles in Morbid Map

Titles in Morbid Map were grouped with simple text-parsing rules based on regular expressions. Highly similar or identical titles are pooled into one group of disorders. To build a search string, each title is stripped of everything after the first comma (if there is no comma, the disease tag and OMIM id are removed) or after the keywords “due to” or “with”, which tend to specify subtypes of a main phenotype. White space, stand-alone digits and roman numerals are also removed before the search for matching titles. The resulting search strings are anchored at the beginning of the title to select partial title hits. Two kinds of opening brackets (“{” and “[”) are allowed to occur at the beginning; OMIM uses these brackets as special symbols (for example, to indicate susceptibility to certain conditions).

An SQL regular expression search was used to form an initial group of hits, and every partial hit was examined. To be accepted, the match had to end at a word boundary or with punctuation (“,”, “-”, “/”, “(”), or continue with an allowed suffix (presently “s”, “tous” and “tosis”, to accommodate matches of, for example, adenomatous and adenomatosis). Testing for word boundaries is necessary to exclude false matches that result from partial word matching (e.g. AICA versus Aicardi, POR versus Porencephaly). All approved matches were assigned the same integer disease identifier.

The Morbid Map released in August 2008 contained 3700 entries with disease tag “(3)”, and these were assigned to 1500 disease groups. This mapping of disease genes into groups is referred to as DiG.
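
The title-matching rules described above can be approximated in a few lines of Python. This is a simplified sketch rather than the SQL procedure used for DiG: the function names are hypothetical, titles are assumed to have the OMIM id and evidence tag already removed, white space is collapsed rather than deleted, and an allowed suffix is assumed to be followed by a word boundary.

```python
import re

ALLOWED_SUFFIXES = ("s", "tous", "tosis")

# Stand-alone digits or roman numerals (not glued to a word or hyphen).
_STANDALONE = re.compile(r"(?<![\w-])(?:\d+|[IVX]+)(?![\w-])")


def clean(text: str) -> str:
    """Drop opening brackets and stand-alone numbers, collapse spaces."""
    text = text.strip().lstrip("{[")
    text = _STANDALONE.sub(" ", text)
    return " ".join(text.split()).lower()


def search_key(title: str) -> str:
    """Stem of a title: text before the first comma, 'due to' or 'with'."""
    stem = title.split(",", 1)[0]
    stem = re.split(r"\b(?:due to|with)\b", stem, maxsplit=1)[0]
    return clean(stem).rstrip("}]-/( ")


def matches(key: str, title: str) -> bool:
    """True if `key` is an acceptable prefix hit on `title`."""
    text = clean(title)
    if not text.startswith(key):
        return False
    rest = text[len(key):]
    if not rest or not rest[0].isalnum():
        return True  # ends at a word boundary or punctuation
    return any(rest.startswith(s) and not rest[len(s):len(s) + 1].isalnum()
               for s in ALLOWED_SUFFIXES)


def group_titles(titles):
    """Assign an integer group id to each title, shortest stems first."""
    groups, next_id = {}, 1
    keys = sorted({search_key(t) for t in titles}, key=lambda k: (len(k), k))
    for key in keys:
        if not key:
            continue
        hits = [t for t in titles if t not in groups and matches(key, t)]
        if hits:
            for t in hits:
                groups[t] = next_id
            next_id += 1
    for t in titles:  # anything left over forms its own group
        if t not in groups:
            groups[t] = next_id
            next_id += 1
    return groups


titles = [
    "Parkinson disease 11",
    "Parkinson disease 4, autosomal dominant Lewy body",
    "Parkinson disease-8",
    "{Parkinson disease, susceptibility to}",
    "{Parkinson disease}",
    "Li-Fraumeni syndrome",
]
groups = group_titles(titles)
```

In this sketch the five Parkinson titles share one identifier and the Li-Fraumeni title gets another. The word-boundary test rejects matches("aica", "Aicardi syndrome") and matches("por", "Porencephaly"), while the suffix rule accepts matches("adenoma", "Adenomatous polyposis coli").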

Testing disease groups

Goh et al. [1] presented a similar effort in processing the Morbid Map. They processed the December 2005 release by string matching to fuse sub-types of diseases. The original 2929 entries with disease tag “(3)” were grouped into 1284 groups and manually classified into 22 classes, each pointing to a physiological system affected in the human body. We used this publication to assess the accuracy of our disease grouping.

The disease grouping of Goh et al. [1] was transferred onto the August 2008 Morbid Map, which left about 400 newer entries unmatched. The groupings for the rest of the entries were compared (i.e. the genes that were mapped both by the Goh study and in our DiG list). In about 100 cases, DiG divisions were too conservative and groups needed to be fused. Sometimes the titles were not consistent between different entries in the Morbid Map. In other cases, the word order did not match or the most relevant word did not occur at the beginning of the title. Cancer names also varied considerably. Most likely, Goh et al. manually intervened to sort out such groups.

To deal with these inconsistent titles, new titles were created that could be processed correctly by the title-matching parser, and these were added to the table prior to title matching. These new titles were often just variations on the original one, with a different word order or an extra comma. The changes to the titles were generated by an SQL script that can be used on a new version of the Morbid Map to re-create the groups. The first version of this script contains about 150 SQL statements.

The total number of groups is about 1500. In the end, about 25 groups were deliberately kept more conservative than in Goh et al.; i.e. one group by Goh et al. corresponds to two or more disease groups in DiG. In 25 cases, groups were fused that are separated in Goh et al. Overall, thorough manual checks were done only on groups that differed between Goh et al. and DiG.
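
A comparison of this kind can be sketched as below. The function split_groups and the toy identifiers are hypothetical; the sketch only assumes that each grouping is available as a mapping from a Morbid Map entry to a group identifier.

```python
from collections import defaultdict


def split_groups(a, b):
    """Groups of `a` whose shared entries fall into 2+ groups of `b`.

    Entries present in only one of the two mappings are ignored, like
    the newer Morbid Map entries that have no counterpart in the
    reference grouping.
    """
    seen = defaultdict(set)
    for entry in a.keys() & b.keys():
        seen[a[entry]].add(b[entry])
    return {group: ids for group, ids in seen.items() if len(ids) > 1}


# Toy data: entry -> group id in each grouping (invented values).
reference = {"e1": "R1", "e2": "R1", "e3": "R2", "e4": "R3"}
dig = {"e1": 1, "e2": 2, "e3": 3, "e4": 3, "e5": 4}

too_conservative = split_groups(reference, dig)  # one reference group, several DiG groups
fused = split_groups(dig, reference)             # one DiG group, several reference groups
```

Running the same function in both directions separates the two kinds of disagreement, so manual checks can be limited to the groups that appear in either result.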

Description of DiG contents

Multigenic groups of diseases

There are 474 disease groups with more than one associated gene. The number of associated genes per group ranges from 2 to 50. In total, about 2300 genes (some counted more than once) are involved in these disease groups. The average number of genes per multigenic group is 5.

Genes involved in multiple diseases

Currently, Morbid Map includes about 2200 distinct genes. Of these, 1400 are involved in only one disorder. The remaining 800 are implicated in 2 to 11 diseases, and 58 of them are connected to more than 5 disorders. The most highly connected fraction (8 to 11 disorders, 14 instances) is enriched with tumor suppressors (such as p53, PTEN, APC, KRAS), collagens and FGF receptors.
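
Counts like those in the two subsections above can be derived from a list of (disease group, gene) pairs. The sketch below uses invented toy pairs and a hypothetical function name, summarize; it does not reproduce the DiG figures.

```python
from collections import defaultdict


def summarize(pairs):
    """Summary counts for an iterable of (group id, gene id) pairs."""
    genes_by_group = defaultdict(set)
    groups_by_gene = defaultdict(set)
    for group, gene in pairs:
        genes_by_group[group].add(gene)
        groups_by_gene[gene].add(group)

    multigenic = [g for g in genes_by_group.values() if len(g) > 1]
    memberships = sum(len(g) for g in multigenic)  # genes counted per group
    return {
        "multigenic_groups": len(multigenic),
        "mean_genes_per_multigenic_group":
            memberships / len(multigenic) if multigenic else 0.0,
        "distinct_genes": len(groups_by_gene),
        "genes_in_one_group":
            sum(1 for s in groups_by_gene.values() if len(s) == 1),
        "genes_in_several_groups":
            sum(1 for s in groups_by_gene.values() if len(s) > 1),
    }


# Toy pairs: gene "A" belongs to three groups, group 1 has three genes.
pairs = [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (3, "D"), (3, "A")]
stats = summarize(pairs)
```

Because a gene is counted once for every multigenic group it belongs to, the membership total can exceed the number of distinct genes, which is why the two totals reported above differ.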


DiG was constructed by Katerina Michalickova in the Donaldson group at the Biotechnology Centre of Oslo.


DiG has not been published yet but can be made available to interested collaborators. Contact ian.donaldson@biotek.uio.no for more information.

The DiG file format is described at http://donaldson.uio.no/wiki/README_DiG_1.0

Sources used to build this data set are described at http://donaldson.uio.no/wiki/Sources_DiG_1.0


1. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human disease network. Proc Natl Acad Sci USA. 2007 May 22;104(21):8685-90. PMID 17502601