DiG: Disease groups

Latest revision as of 11:55, 17 October 2011

[Image: Vitruvian man by Leonardo da Vinci]

Introduction

We investigate properties of human disease genes within the protein-protein interaction network. Two essential datasets are required for this research:

  1. the human protein-protein interaction network and
  2. a list of genes associated with genetic disorders

The human interaction network is taken from the iRefIndex resource (http://irefindex.uio.no); the creation of the disease gene list is described on this page.

Disease groups can be searched using the iRefScape plugin for Cytoscape.

Method for sorting entries in Morbid Map into disease groups (DiG)

OMIM Morbid Map

The largest and most reliable disease association list is available from the Online Mendelian Inheritance in Man (OMIM) database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim). The Morbid Map (http://www.ncbi.nlm.nih.gov/Omim/getmorbid.cgi), maintained within this resource, lists genes and their disease associations and is exported as a simple table with about 5000 entries. Each line in this table corresponds to one gene-disorder association and looks as follows:

Disease title, OMIM id referring to a disease and evidence code| gene symbol and synonyms|  OMIM id referring to gene entry| locus 


The evidence code is a number in parentheses. It indicates whether the association was positioned by mapping the wild type gene (1), by mapping the disease phenotype itself (2), or by both approaches (3). The last code, (3), includes mapping of the wild type gene combined with demonstration of a mutation in that gene in association with the disorder. To ensure the best possible data for our research, we used only entries denoted with (3).

Example of Morbid Map entry:

Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27
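
The field layout above can be illustrated with a short Python sketch. This is not the code used to build DiG; the function name parse_morbid_map_line, the MorbidMapEntry type and the regular expression are assumptions made for illustration, and the sketch expects exactly four pipe-separated fields per line.

```python
import re
from typing import List, NamedTuple, Optional


class MorbidMapEntry(NamedTuple):
    title: str
    disease_omim_id: Optional[str]   # None when the title carries no OMIM id
    evidence_code: Optional[int]     # None when no "(n)" tag is present
    gene_symbols: List[str]
    gene_omim_id: str
    locus: str


# Optional ", <6-digit OMIM id>" followed by the "(n)" evidence code at the end.
_TAIL = re.compile(
    r"^(?P<title>.*?)(?:,\s*(?P<omim>\d{6}))?\s*\((?P<code>\d)\)\s*$"
)


def parse_morbid_map_line(line: str) -> MorbidMapEntry:
    """Split one Morbid Map line into its four fields and unpack the first."""
    disease, symbols, gene_omim, locus = (f.strip() for f in line.split("|"))
    m = _TAIL.match(disease)
    if m:
        title = m.group("title").strip()
        omim = m.group("omim")
        code = int(m.group("code"))
    else:
        title, omim, code = disease, None, None
    return MorbidMapEntry(
        title, omim, code,
        [s.strip() for s in symbols.split(",")],
        gene_omim, locus,
    )


entry = parse_morbid_map_line(
    "Parkinson disease, 168600 (3)  |TBP, SCA17|600075|6q27"
)
```

Keeping the evidence code as a separate field makes the restriction to entries tagged (3) a simple filter on evidence_code.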

Pre-processing the Morbid Map

For the map to be readily cross-referenced with our system, gene symbols were translated into their corresponding Entrez Gene identifiers. Great care was taken in this translation because gene names are not unique. The taxonomy identifier 9606 (human) and chromosome limits (locus information is present in the Morbid Map) were used to restrict the search, which looked for the gene symbol first in the gene info table (released by the Entrez Gene database). Where no hit was found, the search was broadened to include synonyms and locus names for the gene name listed in the Morbid Map. Despite this effort, about 140 cases in the Morbid Map remained unresolved (no hit or too many hits).

The evidence code and disease OMIM id were separated from the disease title column into separate database fields. Nearly 800 entries have no disease OMIM id specified. These are mostly distinct titles.
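
The two-pass lookup described above might look roughly like the following Python sketch. The table rows, gene symbols and identifiers are made-up placeholders, not real Entrez Gene records, and the names resolve_gene_id and chromosome_of are hypothetical. The original work used database queries; this in-memory version only illustrates the order of the checks.

```python
import re
from typing import List, Optional, Tuple

# Placeholder rows: (tax_id, gene_id, symbol, synonyms, chromosome).
# All values are invented for illustration only.
GeneRow = Tuple[int, int, str, List[str], str]
GENE_INFO: List[GeneRow] = [
    (9606, 1001, "GENEA", ["GA1"], "6"),
    (9606, 1002, "GENEB", ["GA1", "GB2"], "6"),
    (10090, 2001, "GENEA", [], "17"),
    (9606, 1003, "GENEC", ["GC"], "X"),
]


def chromosome_of(locus: str) -> str:
    """Leading chromosome name of a locus such as '6q27' or 'Xp11.2'."""
    m = re.match(r"(\d+|X|Y)", locus)
    return m.group(1) if m else ""


def resolve_gene_id(symbols: List[str], locus: str,
                    table: List[GeneRow] = GENE_INFO,
                    tax_id: int = 9606) -> Optional[int]:
    """Return a single gene id, or None when there is no hit or too many."""
    chrom = chromosome_of(locus)
    candidates = [r for r in table if r[0] == tax_id and r[4] == chrom]
    # Pass 1 checks official symbols; pass 2 broadens to synonyms.
    for use_synonyms in (False, True):
        hits = set()
        for _, gene_id, symbol, synonyms, _ in candidates:
            names = synonyms if use_synonyms else [symbol]
            if any(s in names for s in symbols):
                hits.add(gene_id)
        if len(hits) == 1:
            return hits.pop()
        if len(hits) > 1:
            return None  # ambiguous: left unresolved
    return None
```

Returning None for both "no hit" and "too many hits" mirrors the unresolved cases mentioned above; a real implementation would probably record which of the two occurred.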


Matching Morbid Map entries

In many cases, one disease appears in the Morbid Map in association with more than one gene. These “multigenic” diseases are the most valuable to us since we can study the properties of the sub-networks that these genes form within the larger human network. Groups of disease-gene associations were constructed using disease titles (the first column in the Morbid Map). In some cases, multiple entries have identical titles; these were assigned the same disease group identifier. However, titles were not always exactly the same and varied in the detail given. These variations often describe disease sub-types or are the result of a simple inconsistency.

Example of a disease with sub-types:

Parkinson disease 11, 607688 (3)
Parkinson disease 13, 610297 (3)
Parkinson disease 4, autosomal dominant Lewy body, 605543 (3)
Parkinson disease 6, early onset, 605909 (3)
Parkinson disease 7, autosomal recessive early-onset, 606324 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, 168600 (3)
Parkinson disease, familial, 168600 (3)
Parkinson disease, familial, 168601 (3)
Parkinson disease, juvenile, type 2, 600116 (3)
Parkinson disease, resistance to, 168600 (3)
Parkinson disease-8, 607060 (3)
{Parkinson disease, protection against}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease, susceptibility to}, 168600 (3)
{Parkinson disease}, 168600 (3)

Example of an inconsistency:

Li Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 151623 (3)
Li-Fraumeni syndrome, 609265 (3)

Grouping of titles in the Morbid Map

Using simple text parsing rules, titles in the Morbid Map were grouped using regular expressions. Highly similar or identical titles were pooled into one group of disorders. To initiate the text search, each title was stripped of everything after the first comma (if there was no comma, the disease tag and OMIM id were removed) or after the keywords “due to” or “with”, which tend to specify subtypes of a main phenotype. Before the search for matching titles, white space, stand-alone digits and roman numerals were also removed. The resulting search strings were anchored at the beginning of the title to select for partial title hits. Two kinds of opening brackets (“{” and “[”) were allowed to occur at the beginning; OMIM uses these as special symbols (braces mark susceptibility to certain conditions, square brackets mark “nondiseases”).

An SQL regular expression search was used to form an initial group of hits. Every partial hit was examined. To be accepted, the match had to end at a word boundary or at punctuation (“,”, “-”, “/”, “(”), or continue with one of the allowed suffixes (presently “s”, “tous” and “tosis”, to accommodate matches of, for example, adenomatous and adenomatosis). Testing for word boundaries is necessary to exclude false matches that result from partial word matching (e.g. AICA versus Aicardi, POR versus Porencephaly).

All approved matches were assigned the same integer disease identifier. The Morbid Map released in August 2008 contained 3700 entries with disease tag “(3)”, and these were assigned to 1500 disease groups. This mapping of disease genes into groups is referred to as DiG.
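
The grouping rules above can be sketched in Python as follows. This is an approximation under stated assumptions, not the SQL used for DiG: the function names search_key, accepts and group_titles are hypothetical, white space is collapsed rather than removed, matching is case-insensitive, and a stand-alone "X" is deliberately not treated as a roman numeral.

```python
import re
from typing import Dict, List

SUFFIXES = ("s", "tous", "tosis")
# Stand-alone digits and roman numerals I to IX ("X" is left alone here).
_NUMERALS = re.compile(r"\b(?:\d+|I{1,3}|IV|VI{0,3}|IX)\b")
_TAG = re.compile(r"(?:,?\s*\d{6})?\s*\(\d\)\s*$")


def strip_tag(title: str) -> str:
    """Drop the trailing OMIM id and "(n)" disease tag, if present."""
    return _TAG.sub("", title)


def search_key(title: str) -> str:
    """Reduce a title to the anchored search string used for matching."""
    t = strip_tag(title).lstrip("{[")
    t = t.split(",", 1)[0]
    t = re.split(r"\b(?:due to|with)\b", t, maxsplit=1)[0]
    t = _NUMERALS.sub(" ", t)
    return " ".join(t.split()).rstrip(" -/}]")


def accepts(key: str, title: str) -> bool:
    """True if key matches the start of title and ends cleanly."""
    t = strip_tag(title).lstrip("{[")
    if not key or not t.lower().startswith(key.lower()):
        return False
    rest = t[len(key):].lower()
    if not rest or not rest[0].isalnum():
        return True  # word boundary or punctuation
    return any(
        rest.startswith(s) and (len(rest) == len(s) or not rest[len(s)].isalnum())
        for s in SUFFIXES
    )


def group_titles(titles: List[str]) -> Dict[str, int]:
    """Assign one integer group id to every set of accepted matches."""
    group_of: Dict[str, int] = {}
    next_id = 1
    for title in titles:
        if title in group_of:
            continue
        key = search_key(title)
        for other in titles:
            if other not in group_of and accepts(key, other):
                group_of[other] = next_id
        group_of.setdefault(title, next_id)
        next_id += 1
    return group_of
```

With these rules the Parkinson disease titles listed earlier fall into one group, while "Li Fraumeni syndrome" and "Li-Fraumeni syndrome" stay apart, which is the kind of inconsistency that needs a separate fix.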

Testing disease groups

The publication from Goh et al. [1] presented a similar effort in processing the Morbid Map. The Morbid Map release from December 2005 was processed by string matching to fuse sub-types of diseases. The original 2929 entries with disease tag “(3)” were grouped into 1284 groups and manually classified into 22 classes, each pointing to a physiological system affected in the human body. This publication was used to assess the accuracy of our disease grouping.

The disease grouping of Goh et al. [1] was transferred onto the August 2008 Morbid Map. This left about 400 newer entries unmatched. The groupings for the rest of the entries were compared (i.e. the genes that were mapped both by the Goh study and in our DiG list). In about 100 cases, DiG divisions were too conservative and groups needed to be fused. Sometimes the titles were not consistent between different entries in the Morbid Map. In other cases, the word order did not match or the most relevant word did not occur at the beginning of the title. Also, cancer names varied a lot. Most likely, Goh et al. manually intervened to sort out such groups.

To deal with these inconsistent titles, new titles were created that could be processed correctly by the title matching parser, and these too were added to the table prior to title matching. These new titles were often just variations on the original one, with a different word order or an extra comma. The changes to the titles were generated by an SQL script that can be used on a new version of the Morbid Map to re-create the groups. The first version of this script contains about 150 SQL statements.

The total number of groups is about 1500. In the end, about 25 groups were deliberately kept more conservative than in Goh et al.; i.e. one group by Goh et al. corresponds to two or more disease groups in DiG. In 25 cases, groups were fused that are separated in Goh et al. Overall, thorough manual checks were done ONLY on groups that were different between Goh et al. and DiG.
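
A comparison of two groupings of this kind can be sketched as below. The function compare_groupings and its input format (a dictionary from entry to group id for each grouping) are assumptions for illustration; the source does not describe how the comparison was actually implemented.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple


def compare_groupings(
    dig: Dict[Hashable, int], goh: Dict[Hashable, int]
) -> Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]:
    """Compare two groupings on the entries that both of them cover.

    Returns (split_in_dig, fused_in_dig):
      split_in_dig maps a Goh group to the several DiG groups it spans,
      fused_in_dig maps a DiG group to the several Goh groups it spans.
    """
    goh_to_dig = defaultdict(set)
    dig_to_goh = defaultdict(set)
    for entry in dig.keys() & goh.keys():  # unmatched entries are skipped
        goh_to_dig[goh[entry]].add(dig[entry])
        dig_to_goh[dig[entry]].add(goh[entry])
    split_in_dig = {g: d for g, d in goh_to_dig.items() if len(d) > 1}
    fused_in_dig = {d: g for d, g in dig_to_goh.items() if len(g) > 1}
    return split_in_dig, fused_in_dig


# Toy data: entries "a".."e"; "e" has no counterpart in the second grouping.
dig_groups = {"a": 1, "b": 2, "c": 3, "d": 3, "e": 4}
goh_groups = {"a": 10, "b": 10, "c": 20, "d": 30}
split_in_dig, fused_in_dig = compare_groupings(dig_groups, goh_groups)
```

Groups reported in split_in_dig are candidates where the first grouping is more conservative; groups in fused_in_dig are the opposite case. Either list would still need manual review.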

Description of DiG contents

Multigenic groups of diseases

There are 474 disease groups with more than one associated gene. The number of associated genes per group ranges from two to 50. In total, about 2300 genes (some counted more than once) are involved in these disease groups. The average number of genes per multigenic group is 5.
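
The average follows from the totals: about 2300 gene memberships over 474 groups is roughly 4.9, i.e. 5 after rounding. A minimal sketch of how such figures could be derived from (group id, gene id) pairs is given below; the function name multigenic_summary and the input format are assumptions, not the DiG file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def multigenic_summary(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarise disease groups that have more than one associated gene."""
    genes_by_group = defaultdict(set)
    for group_id, gene_id in pairs:
        genes_by_group[group_id].add(gene_id)
    sizes = [len(g) for g in genes_by_group.values() if len(g) > 1]
    if not sizes:
        return {"groups": 0, "memberships": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "groups": len(sizes),
        "memberships": sum(sizes),  # a gene in two groups is counted twice
        "min": min(sizes),
        "max": max(sizes),
        "mean": sum(sizes) / len(sizes),
    }


# Toy data: group 1 has three genes, group 2 has two, group 3 has one.
summary = multigenic_summary(
    [(1, 101), (1, 102), (1, 103), (2, 101), (2, 104), (3, 105)]
)
```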

Genes involved in multiple diseases

Currently, the Morbid Map includes about 2200 distinct genes. Of these, 1400 are involved in only one disorder. The remaining 800 are implicated in two to 11 diseases. Within this group, 58 genes are connected to more than 5 disorders. The most highly connected fraction (8 to 11 disorders, 14 instances) is enriched in tumor suppressors (such as p53, PTEN, APC, KRAS), collagens and FGF receptors.
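
Counts of this kind can be obtained by tallying, for each gene, the number of distinct disease groups it belongs to. The sketch below assumes the same hypothetical (group id, gene id) pairs as input; disorders_per_gene and genes_with_more_than are illustrative names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def disorders_per_gene(pairs: Iterable[Tuple[int, int]]) -> Counter:
    """Number of distinct disease groups each gene is assigned to."""
    unique_pairs = set(pairs)  # ignore repeated (group, gene) rows
    return Counter(gene_id for _, gene_id in unique_pairs)


def genes_with_more_than(counts: Counter, threshold: int) -> List[int]:
    """Gene ids linked to more than `threshold` disease groups, sorted."""
    return sorted(g for g, n in counts.items() if n > threshold)


# Toy data: gene 101 appears in three groups, gene 102 in one (listed twice).
counts = disorders_per_gene(
    [(1, 101), (2, 101), (3, 101), (1, 102), (1, 102), (2, 103)]
)
hubs = genes_with_more_than(counts, 2)
```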

Credits

DiG was constructed by Katerina Michalickova in the Donaldson group at the Biotechnology Centre of Oslo.

Availability

DiG has not been published yet but can be made available to interested collaborators. Contact ian.donaldson@biotek.uio.no for more information.

The DiG file format is described at http://donaldson.uio.no/wiki/README_DiG_1.0

Sources used to build this data set are described at http://donaldson.uio.no/wiki/Sources_DiG_1.0

Reference

1. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL. The human disease network. Proc Natl Acad Sci USA. 2007 May 22;104(21):8685-90. PMID 17502601