DiG: Disease groups

Sorting entries in Morbid Map into disease groups

Introduction

We investigate properties of human disease genes within the protein-protein interaction network. Two essential datasets are required for this research:

  1. the human protein-protein interaction network and
  2. a list of genes associated with genetic disorders

The human interaction network is taken from the iRefIndex resource and the creation of the disease gene list is described on this page.

OMIM Morbid Map

The largest and most reliable disease association list is available from the Online Mendelian Inheritance in Man (OMIM) database. The Morbid Map maintained within this resource lists gene-disease associations and is exported as a simple table with about 5000 entries. Each line in this table corresponds to one gene-disorder association and has the following format:

Disease title, OMIM id referring to a disease and evidence code | gene symbol and synonyms | OMIM id referring to gene entry | locus


The evidence code is a number in parentheses. It indicates whether the association was positioned by mapping the wild-type gene (1), by mapping the disease phenotype itself (2), or by both approaches (3). Code (3) covers mapping of the wild-type gene combined with demonstration of a mutation in that gene in association with the disorder. To ensure the best possible data for our research, I used only entries marked (3).

Example of Morbid Map entry:

Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27
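
The line format above can be split into separate fields with a few lines of Python. This is only a minimal sketch, not the parser used for this work: the names MorbidEntry and parse_morbid_line are hypothetical, and it assumes every line has exactly four "|"-separated fields and that the disease OMIM id, when present, is a six-digit number directly before the evidence code.

```python
import re
from typing import List, NamedTuple, Optional


class MorbidEntry(NamedTuple):
    """One gene-disorder association (hypothetical container for this sketch)."""
    title: str
    disease_omim: Optional[str]
    evidence: Optional[int]
    symbols: List[str]
    gene_omim: str
    locus: str


# Title, then an optional ", <six-digit OMIM id>", then the "(n)" evidence code.
_DISEASE = re.compile(r"^(?P<title>.*?)(?:,\s*(?P<omim>\d{6}))?\s*\((?P<ev>\d)\)\s*$")


def parse_morbid_line(line: str) -> MorbidEntry:
    # Raises ValueError if the line does not have exactly four fields.
    disease, symbols, gene_omim, locus = (f.strip() for f in line.split("|"))
    m = _DISEASE.match(disease)
    if m is None:
        # No recognisable evidence code: keep the raw title.
        title, omim, evidence = disease, None, None
    else:
        title, omim, evidence = m["title"].strip(), m["omim"], int(m["ev"])
    return MorbidEntry(title, omim, evidence,
                       [s.strip() for s in symbols.split(",")],
                       gene_omim, locus)


entry = parse_morbid_line("Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27")
keep = entry.evidence == 3  # only entries with evidence code (3) are used
```

Entries without a disease OMIM id come back with disease_omim set to None, so they can be counted or handled separately.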

Pre-processing the Morbid Map

For the map to be readily cross-referenced with our system, I translated the gene symbols into Entrez Gene identifiers. Great care was taken in this translation because gene names are not unique. I restricted the search by taxonomy (human) and by chromosome (locus information is present in the Morbid Map). I first searched for the gene symbol in the gene_info table released by the Gene database; if there was no hit, I broadened the search to synonyms and locus names. Despite this effort, about 140 cases in the Morbid Map remained unresolved (no hit or too many hits). I also parsed the evidence code and the disease OMIM id out of the disease title into separate database fields. Nearly 800 entries have no disease OMIM id specified; these are mostly distinct titles.
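
The symbol lookup described above (exact symbol first, then synonyms, restricted to human and to the chromosome given by the locus) could look roughly like the sketch below. The record layout (dictionaries with gene_id, symbol, synonyms, chromosome and tax_id keys) and the function names are assumptions made for illustration, not the actual database schema, and the fallback to locus names is left out.

```python
import re
from typing import Dict, List, Optional

HUMAN_TAX_ID = 9606  # NCBI taxonomy id for Homo sapiens


def chromosome_of(locus: str) -> Optional[str]:
    """Return the chromosome part of a cytogenetic locus such as '6q27'."""
    m = re.match(r"(\d{1,2}|[XY])", locus.strip())
    return m.group(1) if m else None


def resolve_symbol(symbol: str, locus: str, gene_info: List[Dict]) -> Optional[int]:
    """Return a gene id only when the search gives exactly one hit."""
    chrom = chromosome_of(locus)
    candidates = [g for g in gene_info
                  if g["tax_id"] == HUMAN_TAX_ID
                  and (chrom is None or g["chromosome"] == chrom)]
    # First pass: official symbol. Second pass (only if no hit): synonyms.
    hits = [g["gene_id"] for g in candidates if g["symbol"] == symbol]
    if not hits:
        hits = [g["gene_id"] for g in candidates if symbol in g["synonyms"]]
    # No hit or too many hits: leave the case unresolved.
    return hits[0] if len(hits) == 1 else None
```

Returning None for ambiguous symbols keeps the unresolved cases explicit, so they can be listed and reviewed by hand.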


Matching Morbid Map entries

In many cases, one disease appears in Morbid Map in association with more than one gene. These “multigenic” diseases are the most valuable to me since I can study the properties of the sub-networks that these genes form within the large human network. It is, therefore, necessary to group the disease titles together. However, the titles are not always exactly the same and vary in detail. These variations often describe disease sub-types or are a result of a simple inconsistency.

Example of a disease with sub-types:

Parkinson disease 11, 607688 (3)
Parkinson disease 13, 610297 (3)
Parkinson disease 4, autosomal dominant Lewy body, 605543 (3)
Parkinson disease 6, early onset, 605909 (3)
Parkinson disease 7, autosomal recessive early-onset, 606324 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, familial, 168600 (3)
Parkinson disease, familial, 168601 (3)
Parkinson disease, juvenile, type 2, 600116 (3)
Parkinson disease, resistance to, 168600 (3)
Parkinson disease-8, 607060 (3)
{Parkinson disease, protection against}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease}, 168600 (3)

Example of an inconsistency:

Li Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 609265 (3)

Grouping of titles in Morbid Map

Titles in the Morbid Map are grouped by string matching with simple text-parsing rules. Highly similar or identical titles are pooled into one group of disorders.

To build a search string, each title is stripped of everything after the first comma (if there is no comma, the disease tag and OMIM id are removed instead) or after the keywords “due to” or “with”, which tend to specify subtypes of a main phenotype. White space, stand-alone digits and Roman numerals are also removed before the search. The resulting search strings are anchored at the beginning of the title so that they select partial title hits. Two kinds of opening bracket (“{” and “[”) are allowed at the beginning; OMIM uses these brackets as special symbols to indicate susceptibility to certain conditions.

An SQL regular expression search forms an initial group of hits, and every partial hit is then examined. To be accepted, a match has to end at a word boundary or at punctuation (“,”, “-”, “/”, “(”), or continue with an allowed suffix (presently “s”, “tous” and “tosis”, so that, for example, adenomatous and adenomatosis match). Testing for word boundaries is necessary to exclude false matches that result from partial word matching (e.g. AICA versus Aicardi, POR versus Porencephaly). All approved matches are assigned the same integer disease identifier. The Morbid Map released in August 2008 contained 3700 entries with the disease tag “(3)”, which were assigned to 1500 disease groups.
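
The parsing rules above can be illustrated in Python. This is a sketch under stated assumptions and not the SQL implementation used here: the function names are hypothetical, matching is assumed to be case-insensitive, white space is collapsed rather than removed entirely, and the grouping pass is a simple greedy loop over titles ordered by search-string length.

```python
import re

ALLOWED_SUFFIXES = ("s", "tous", "tosis")


def search_key(title: str) -> str:
    """Reduce a disease title to the string used for matching."""
    key = title.lstrip("{[")                                 # OMIM opening brackets
    key = re.split(r",|\bdue to\b|\bwith\b", key, maxsplit=1)[0]
    key = re.sub(r"\s*\(\d\)\s*$", "", key)                  # disease tag, if no comma
    key = re.sub(r"(?<!\S)(?:\d+|[IVX]+)(?!\S)", " ", key)   # stand-alone digits / Roman numerals
    return " ".join(key.split()).rstrip(" -/}]")


def _at_boundary(rest: str) -> bool:
    # End of string, white space or punctuation all count as a boundary.
    return rest == "" or not rest[0].isalnum()


def accepts(key: str, title: str) -> bool:
    """True if `key` matches the start of `title` and ends acceptably."""
    body = title.lstrip("{[")
    if not key or not body.lower().startswith(key.lower()):
        return False
    rest = body[len(key):]
    if _at_boundary(rest):
        return True
    return any(rest.lower().startswith(s) and _at_boundary(rest[len(s):])
               for s in ALLOWED_SUFFIXES)


def group_titles(titles):
    """Assign an integer group id to every title (greedy, shortest key first)."""
    group_of, representatives = {}, []
    for title in sorted(titles, key=lambda t: len(search_key(t))):
        for group_id, rep in enumerate(representatives):
            if accepts(rep, title):
                group_of[title] = group_id
                break
        else:
            representatives.append(search_key(title))
            group_of[title] = len(representatives) - 1
    return group_of
```

With these rules, "Parkinson disease 11, 607688 (3)", "Parkinson disease-8, 607060 (3)" and "{Parkinson disease, susceptibility to}, 168600 (3)" fall into one group, while accepts("AICA", "Aicardi syndrome") is rejected. The "Li Fraumeni" and "Li-Fraumeni" spellings still end up in separate groups, which is the kind of inconsistency that needs extra handling.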

Testing disease groups

A publication by Goh et al. presents a similar effort in processing the Morbid Map. They processed the December 2005 release by string matching to fuse sub-types of diseases. The original 2929 entries with the disease tag “(3)” were grouped into 1284 groups and manually assigned to 22 classes, each pointing to the physiological system affected in the human body. I used this publication to assess the accuracy of my own disease grouping.

I transferred the grouping of Goh et al. onto the August 2008 Morbid Map; this left about 400 newer entries unmatched. The groupings for the rest of the entries were compared. In about 100 cases, my divisions were too conservative and groups needed to be fused. Titles in the Morbid Map are sometimes inconsistent: the word order does not always match, or the most relevant word is not placed at the beginning. Cancer names also vary a lot. Most likely, Goh et al. intervened manually to sort out such groups.

To deal with these inconsistent titles, I created new titles that the title-matching parser can process correctly and added them to the database table that supports the title matching. These new titles are often just variations on the original, with a different word order or an extra comma. The title changes are generated by an SQL script, which can be run on a new version of the Morbid Map to re-create the groups. The first version of this script contains about 150 SQL statements.

The total number of groups is about 1500. In the end, about 25 groups were deliberately kept more conservative than in Goh et al.; i.e. one group by Goh et al. corresponds to two or more of my disease groups. In 25 cases, I fused groups that are separate in Goh et al. Overall, thorough manual checks were done ONLY on groups that differed from Goh et al.
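
A comparison of this kind, between my grouping and a reference grouping, can be sketched as follows. The function name and the input shape (two dictionaries that map a Morbid Map entry to a group id) are assumptions made for illustration. The sketch only flags candidate groups for manual review; it does not decide which grouping is right.

```python
from collections import defaultdict


def compare_groupings(mine, reference):
    """Compare two groupings given as {entry: group_id} dictionaries.

    Returns three items:
      split  - reference groups spread over several of my groups
               (my division is more conservative),
      fused  - my groups that span several reference groups,
      unmatched - entries of mine that are absent from the reference.
    """
    shared = mine.keys() & reference.keys()
    mine_by_ref = defaultdict(set)
    ref_by_mine = defaultdict(set)
    for entry in shared:
        mine_by_ref[reference[entry]].add(mine[entry])
        ref_by_mine[mine[entry]].add(reference[entry])
    split = {r: ids for r, ids in mine_by_ref.items() if len(ids) > 1}
    fused = {m: ids for m, ids in ref_by_mine.items() if len(ids) > 1}
    unmatched = mine.keys() - reference.keys()
    return split, fused, unmatched
```

Entries that appear only in the newer release end up in the unmatched set and are left out of the comparison.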

Multigenic groups of diseases

There are 474 disease groups with more than one associated gene. The number of associated genes per group ranges from two to 50. In total, about 2300 genes (some counted more than once) are involved in these disease groups, so the average number of genes per multigenic group is about 5.
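
Summary figures of this kind can be derived from a list of (disease group, gene) pairs. The sketch below assumes such a list is available; the function name and input shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def multigenic_summary(assignments):
    """Summarise groups with more than one gene from (group_id, gene_id) pairs."""
    genes_by_group = defaultdict(set)
    for group_id, gene_id in assignments:
        genes_by_group[group_id].add(gene_id)
    # Keep only multigenic groups; a gene is counted once per group it belongs to.
    sizes = [len(genes) for genes in genes_by_group.values() if len(genes) > 1]
    if not sizes:
        return {"groups": 0, "genes": 0, "min": 0, "max": 0, "mean": 0.0}
    return {"groups": len(sizes), "genes": sum(sizes),
            "min": min(sizes), "max": max(sizes), "mean": mean(sizes)}
```

Because a gene is counted once for every group it belongs to, the "genes" total can exceed the number of distinct genes.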

Genes involved in multiple diseases

Currently, the Morbid Map includes about 2200 distinct genes. Of these, 1400 are involved in only one disorder. The remaining 800 are implicated in two to 11 diseases, and 58 of them are connected to more than 5 disorders. The most highly connected fraction (8 to 11 disorders, 14 instances) is enriched in tumor suppressors and oncogenes (such as p53, PTEN, APC, KRAS), collagens and FGF receptors.
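
The reverse view, the number of disease groups per gene, can be tabulated in the same spirit. This is again a sketch with hypothetical names, taking (disease group, gene) pairs as input.

```python
from collections import Counter, defaultdict


def disorders_per_gene(assignments):
    """Map each gene to the number of distinct disease groups it appears in."""
    groups_by_gene = defaultdict(set)
    for group_id, gene_id in assignments:
        groups_by_gene[gene_id].add(group_id)
    return {gene: len(groups) for gene, groups in groups_by_gene.items()}


def degree_histogram(counts):
    """Count how many genes are linked to 1, 2, 3, ... disease groups."""
    return Counter(counts.values())


def highly_connected(counts, threshold=5):
    """Genes connected to more than `threshold` disease groups."""
    return sorted(g for g, n in counts.items() if n > threshold)
```

The histogram gives the split between single-disorder and multi-disorder genes directly, and highly_connected lists the genes above a chosen cut-off.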

Reference

Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human disease network. Proc Natl Acad Sci USA. 2007 May 22;104(21):8685-90.