silico.biotoul.fr
 

Prioritization:Phylogenetic profiles


Proximity measure

A gene profile consists of the presence/absence of an isorthologous gene (see the Data section below) in each genome of our locally hosted complete genomes database (CGDB).

The proximity measure currently implemented is the following: a pairwise similarity matrix is built, holding the Jaccard index for each pair of genes. The proximity of a candidate gene to a set of known genes is then computed as the average of the Jaccard indices between the candidate gene and each known gene.

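The Jaccard index (1901), a coefficient of Gower & Legendre, is s1 = a / (a+b+c), where a is the number of co-presences (genomes in which both genes are present), and b and c are the numbers of mismatches (genomes in which only one of the two genes is present). The number of co-absences, d, is ignored.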
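The measure described above can be sketched in a few lines of Python. This is only an illustration, not the code used on this site: the function names jaccard and proximity, the encoding of a profile as a list of 0/1 values (one per genome), and the choice to return 0.0 when two profiles share no presence at all are assumptions made for the example.

```python
from typing import Sequence


def jaccard(p: Sequence[int], q: Sequence[int]) -> float:
    """Jaccard index s1 = a / (a + b + c) between two presence/absence profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    # a: genomes where both genes are present (co-presence)
    a = sum(1 for x, y in zip(p, q) if x and y)
    # b + c: genomes where exactly one of the two genes is present (mismatches)
    mismatches = sum(1 for x, y in zip(p, q) if bool(x) != bool(y))
    denominator = a + mismatches
    # Assumption: when neither gene is present anywhere, return 0.0
    # instead of dividing by zero. Co-absences (d) are never counted.
    return a / denominator if denominator else 0.0


def proximity(candidate: Sequence[int], known: Sequence[Sequence[int]]) -> float:
    """Average Jaccard index between a candidate profile and each known profile."""
    if not known:
        raise ValueError("at least one known gene profile is required")
    return sum(jaccard(candidate, k) for k in known) / len(known)


# Hypothetical profiles over five genomes (1 = present, 0 = absent).
candidate = [1, 1, 0, 1, 0]
known = [
    [1, 0, 0, 1, 0],  # a = 2, one mismatch  -> 2/3
    [1, 1, 1, 1, 0],  # a = 3, one mismatch  -> 3/4
]
score = proximity(candidate, known)  # (2/3 + 3/4) / 2 = 17/24
```

Because co-absences are ignored, two genes that are both missing from most genomes are not considered close on that basis alone; only shared presences raise the index.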

Data

The homology relationship can be refined into paralogy, i.e., genes whose sequences diverged after a duplication event, or into orthology, i.e., genes whose sequences diverged following a speciation event. Unfortunately, the definition of orthology allows one gene to have multiple orthologs in another genome, which complicates computational analysis. It is therefore desirable to identify the genes that are likely to play the same role across organisms.

Fitch proposed the term isorthology to refer to orthologous genes that may have retained the same original function. To infer isorthology, we assign this relationship to pairs of sequences for which no duplication event occurred after speciation. This is motivated by the fact that such a constraint maximizes the chances that the gene functions have not evolved and specialized too drastically since the speciation event. We formalize this constraint with the following Iso relationship between two sequences a and b from genomes A and B: Iso(a,b) = true if and only if a and b are bidirectional best hits (BBH) and there is no other sequence c in either genome A or B having a better homology score with a or b.
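The Iso relationship can be illustrated with a short Python sketch. It is not the pipeline used to build the database, and several points are assumptions made for the example: homology scores are taken to be symmetric and stored in a dictionary keyed by unordered pairs of sequence identifiers, a missing pair counts as a score of 0 (no hit), and a tie is not treated as a "better" score. The names score and iso and the sequence identifiers are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable

Scores = Dict[FrozenSet[str], float]


def score(scores: Scores, x: str, y: str) -> float:
    """Homology score between x and y; 0.0 when no hit was recorded (assumption)."""
    return scores.get(frozenset((x, y)), 0.0)


def iso(a: str, b: str, genome_a: Iterable[str], genome_b: Iterable[str],
        scores: Scores) -> bool:
    """True if a and b are bidirectional best hits and no other sequence
    of genome A or B scores better against a or against b."""
    reference = score(scores, a, b)
    if reference <= 0.0:
        return False  # a and b are not homologous at all
    others = (set(genome_a) | set(genome_b)) - {a, b}
    # Checking every other sequence of both genomes covers the BBH condition
    # (no better hit in the other genome) and the extra constraint
    # (no better hit in the same genome, i.e. a post-speciation duplicate).
    return all(
        score(scores, a, c) <= reference and score(scores, b, c) <= reference
        for c in others
    )


# Hypothetical example: a1 and a2 are recent duplicates within genome A.
genome_a = ["a1", "a2", "a3"]
genome_b = ["b1", "b3"]
scores: Scores = {
    frozenset(("a1", "b1")): 200.0,
    frozenset(("a1", "a2")): 250.0,  # duplicate scores higher than the ortholog
    frozenset(("a2", "b1")): 150.0,
    frozenset(("a3", "b3")): 300.0,
    frozenset(("a3", "b1")): 50.0,
}

iso_a3_b3 = iso("a3", "b3", genome_a, genome_b, scores)  # True
iso_a1_b1 = iso("a1", "b1", genome_a, genome_b, scores)  # False: a2 is closer to a1
```

In this example a1 and b1 are bidirectional best hits across the two genomes, yet Iso(a1,b1) is false because a2 has a better score with a1, which suggests a duplication after speciation.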