Librairie de bases et fonctions utilitaires

library(reticulate)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra) 
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kabex = function(o) o %>% kable(format="html", escape=F) %>% kable_styling(bootstrap_options = c("striped"))
library(DT) 
dted = function(o) o %>% datatable(rownames=F, filter="top", options=list(pageLength = 10), escape=F)

Intro

Pour mettre en oeuvre l’intégration de données, nous allons partir de questions biologiques larges :

  1. Est-ce que les modules dans le réseau d’interactions protéine-protéine d’Escherichia coli sont codés par des gènes co-exprimés ?

  2. Quelles sont les caractéristiques que partage chacun des ensembles de gènes qui sont co-exprimés et qui codent pour des protéines d’un même module du graphe d’interaction protéine-protéine ?

Pour tenter de répondre à cette question, nous aurons besoin

  • des protéines impliquées dans chaque module du PIN (protein interaction network)
  • d’une valeur de co-expression entre chaque paire de gènes
  • des caractéristiques associées à chaque gène ou protéine

Plan :

  • STRINGdb → graphe d’interactions ppi

  • igraph → community detection (with edge weights) → ppi clusters/modules

  • STRINGdb → coexpr graph → ppi cluster members mean/median/… coexpr ?

  • coexpr graph → igrah community detection → coexpr clusters

  • comparaison ppi clusters vs. coexpr clusters

  • ppi clusters enrichment analysis

  • best match between ppi & coexpr clusters → enrichment

  • STRINGdb → GO enrichment (with STRINGdb)

  • GO → MariaDB → Neo4j

  • STRINGdb → GO → Neo4j

Données d’intéractions protéine-protéine

STRINGdb pour Ecoli

https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html

Installation via conda

conda deactivate
conda activate idh
conda install bioconductor-stringdb
library(STRINGdb)

Extrait de la vignette : « To begin, you should first know the NCBI taxonomy identifiers of the organism on which you have performed the experiment (e.g. 9606 for Human, 10090 for mouse). If you don’t know that, you can search the NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) or start looking at our species table (that you can also use to verify that your organism is represented in the STRING database). Hence, if your species is not Human (i.e. our default species), you can find it and their taxonomy identifiers on STRING webpage under the ’organisms’ section (https://string-db.org/cgi/input.pl?input_page_active_form=organism or download the full list in the download section of STRING website. »

On trouve (2020-09-15 11:40am) :
Name:                         Escherichia coli str. K-12 substr. MG1655
Aliases:                      (E. coli str. K-12 substr. MG1655 / Escherichia coli K12 MG1655 / Escherichia coli K12 substr. MG1655 / Escherichia coli MG1655 / Escherichia coli str. K12 substr. MG1655 / Escherichia coli str. MG1655 / Escherichia coli strain MG1655)
STRING-type:                   core
distinct protein-coding genes: 4127
NCBI taxon-Id:                 511145

Pour utiliser STRING, il nous faudra télécharger des données en local. Création d’un répertoire dédié :

mkdir string.data

Connexion au serveur

stringdb = STRINGdb$new(version='11', species=511145, score_threshold=200, input_directory='string.data')

Téléchargement du genome/proteome :

proteins = stringdb$get_proteins() %>% as_tibble
proteins
## # A tibble: 4,127 x 4
##    protein_external_… preferred_name protein_size annotation                                                            
##    <chr>              <chr>                 <int> <chr>                                                                 
##  1 511145.b0001       thrL                     21 Thr operon leader peptide; This protein is involved in control of the…
##  2 511145.b0002       thrA                    820 Bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase …
##  3 511145.b0003       thrB                    310 Homoserine kinase; Catalyzes the ATP-dependent phosphorylation of L- …
##  4 511145.b0004       thrC                    428 Threonine synthase; Catalyzes the gamma-elimination of phosphate from…
##  5 511145.b0005       yaaX                     98 annotation not available                                              
##  6 511145.b0006       yaaA                    258 annotation not available                                              
##  7 511145.b0007       yaaJ                    476 Uncharacterized transporter YaaJ; Inner membrane transport protein    
##  8 511145.b0008       talB                    317 Transaldolase B; Transaldolase is important for the balance of metabo…
##  9 511145.b0009       mog                     195 Molybdopterin adenylyltransferase; Catalyzes the adenylation of molyb…
## 10 511145.b0010       satP                    188 Succinate-acetate/proton symporter SatP; Uptake of acetate and succin…
## # … with 4,117 more rows

On observe que le fichier 5511145.protein.info.v11.0.txt.gz est téléchargé dans le répertoire string.data

ls -lh string.data
## total 15M
## -rw-rw-r-- 1 barriot gsi 314K Sep 24 14:20 511145.protein.actions.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 730K Sep 15 12:41 511145.protein.aliases.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 270K Sep 15 12:23 511145.protein.info.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 8.4M Sep 15 18:02 511145.protein.links.detailed.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 5.1M Sep 15 12:13 511145.protein.links.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi    4 Sep 15 13:29 NOBACKUP

Qu’y a-t-il dedans ?

zcat string.data/511145.protein.info.v11.0.txt.gz | head
## protein_external_id  preferred_name  protein_size    annotation
## 511145.b0001 thrL    21  Thr operon leader peptide; This protein is involved in control of the biosynthesis of threonine
## 511145.b0002 thrA    820 Bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal); Protein involved in threonine biosynthetic process, methionine biosynthetic process and homoserine biosynthetic process
## 511145.b0003 thrB    310 Homoserine kinase; Catalyzes the ATP-dependent phosphorylation of L- homoserine to L-homoserine phosphate. Is also able to phosphorylate the hydroxy group on gamma-carbon of L-homoserine analogs when the functional group at the alpha-position is a carboxyl, an ester, or even a hydroxymethyl group. Neither L- threonine nor L-serine are substrates of the enzyme
## 511145.b0004 thrC    428 Threonine synthase; Catalyzes the gamma-elimination of phosphate from L- phosphohomoserine and the beta-addition of water to produce L- threonine. To a lesser extent, is able to slowly catalyze the deamination of L-threonine into alpha-ketobutyrate and that of L- serine and 3-chloroalanine into pyruvate. Is also able to rapidly convert vinylglycine to threonine, which proves that the pyridoxal p-quinonoid of vinylglycine is an intermediate in the TS reaction
## 511145.b0005 yaaX    98  annotation not available
## 511145.b0006 yaaA    258 annotation not available
## 511145.b0007 yaaJ    476 Uncharacterized transporter YaaJ; Inner membrane transport protein
## 511145.b0008 talB    317 Transaldolase B; Transaldolase is important for the balance of metabolites in the pentose-phosphate pathway
## 511145.b0009 mog 195 Molybdopterin adenylyltransferase; Catalyzes the adenylation of molybdopterin as part of the biosynthesis of the molybdenum-cofactor; Belongs to the MoaB/Mog family

Téléchargement des interactions

g.string = stringdb$get_graph()

On observe que le fichier 511145.protein.links.v11.0.txt.gz est téléchargé dans le répertoire string.data

Qu’y a-t-il dedans ?

zcat string.data/511145.protein.links.v11.0.txt.gz | head
## protein1 protein2 combined_score
## 511145.b0001 511145.b3766 354
## 511145.b0001 511145.b2483 430
## 511145.b0001 511145.b0075 303
## 511145.b0001 511145.b3672 408
## 511145.b0001 511145.b0861 306
## 511145.b0001 511145.b2603 282
## 511145.b0001 511145.b0003 608
## 511145.b0001 511145.b3386 167
## 511145.b0001 511145.b3769 303

On observe que les liens combined_score < 200 sont quand même présents dans ce fichier. Le filtre se fait dans R. Pourquoi ?

Auparavant (version 10), on avait tous les scores dans ce fichier, et pas seulement le score combiné. Or nous avons besoin de la colonne correspondant aux donnée d’interaction protéine-protéine (experimental dans le fichier).

Pour cela, il faut passer par la page de téléchargement sur https://string-db.org/cgi/download.pl en restreignant l’organisme d’intérêt.

Le fichier complet faisant 67.6Go (au 15/09/2020), il est conseillé de d’abord sélectionner un organisme avant le téléchargement. Une copie est disponible sur silico à récupérer sur http://silico.biotoul.fr/enseignement/m2bbs/idh/

Qu’y a-t-il dedans ?

zcat string.data/511145.protein.links.detailed.v11.0.txt.gz | head
## protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
## 511145.b0001 511145.b3766 0 0 0 125 0 0 292 354
## 511145.b0001 511145.b2483 0 0 0 0 0 0 430 430
## 511145.b0001 511145.b0075 0 0 0 0 0 0 303 303
## 511145.b0001 511145.b3672 0 0 0 0 0 0 408 408
## 511145.b0001 511145.b0861 0 0 0 0 0 0 306 306
## 511145.b0001 511145.b2603 0 0 0 0 0 0 282 282
## 511145.b0001 511145.b0003 604 0 0 0 0 0 50 608
## 511145.b0001 511145.b3386 0 0 0 0 0 0 167 167
## 511145.b0001 511145.b3769 0 0 0 0 0 0 303 303

Chargement sous forme de tibble :

links = read_delim("string.data/511145.protein.links.detailed.v11.0.txt.gz", delim=" ", col_types = "ccnnnnnnnn")
links %>% head %>% kabex
protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
511145.b0001 511145.b3766 0 0 0 125 0 0 292 354
511145.b0001 511145.b2483 0 0 0 0 0 0 430 430
511145.b0001 511145.b0075 0 0 0 0 0 0 303 303
511145.b0001 511145.b3672 0 0 0 0 0 0 408 408
511145.b0001 511145.b0861 0 0 0 0 0 0 306 306
511145.b0001 511145.b2603 0 0 0 0 0 0 282 282

Nous nous intéresserons aux données d’expression et donc à la colonne coexpression. Pour le réseau d’interaction ppi, la colonne experimental contient des interactions physiques et fonctionnelles. Un autre fichier est donc nécessaire : protein.actions (binding)

actions = read_delim("string.data/511145.protein.actions.v11.0.txt.gz", delim="\t")
## Parsed with column specification:
## cols(
##   item_id_a = col_character(),
##   item_id_b = col_character(),
##   mode = col_character(),
##   action = col_character(),
##   is_directional = col_logical(),
##   a_is_acting = col_logical(),
##   score = col_double()
## )
actions  
## # A tibble: 73,224 x 7
##    item_id_a    item_id_b    mode    action is_directional a_is_acting score
##    <chr>        <chr>        <chr>   <chr>  <lgl>          <lgl>       <dbl>
##  1 511145.b0002 511145.b3607 binding <NA>   FALSE          FALSE         183
##  2 511145.b0002 511145.b0031 binding <NA>   FALSE          FALSE         309
##  3 511145.b0002 511145.b0033 binding <NA>   FALSE          FALSE         211
##  4 511145.b0002 511145.b0268 binding <NA>   FALSE          FALSE         213
##  5 511145.b0002 511145.b0600 binding <NA>   FALSE          FALSE         357
##  6 511145.b0002 511145.b0674 binding <NA>   FALSE          FALSE         258
##  7 511145.b0002 511145.b1096 binding <NA>   FALSE          FALSE         243
##  8 511145.b0002 511145.b1622 binding <NA>   FALSE          FALSE         208
##  9 511145.b0002 511145.b2058 binding <NA>   FALSE          FALSE         183
## 10 511145.b0002 511145.b2290 binding <NA>   FALSE          FALSE         357
## # … with 73,214 more rows
ppi = actions %>% filter(mode=='binding') %>% dplyr::select(protein1 = item_id_a, protein2 = item_id_b, ppi = score)

experimetal: BIND, DIP, GRID, HPRD, IntAct, MINT, and PID. (src: https://string-db.org/help/faq/#which-databases-does-string-extract-experimental-data-from)

Création du graphe correspondant :

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
g.ppi = graph.data.frame(ppi, directed=F)
g.ppi
## IGRAPH 026ba8f UN-- 2586 37752 -- 
## + attr: name (v/c), ppi (e/n)
## + edges from 026ba8f (vertex names):
##  [1] 511145.b0002--511145.b3607 511145.b0002--511145.b0031 511145.b0002--511145.b0033 511145.b0002--511145.b0268
##  [5] 511145.b0002--511145.b0600 511145.b0002--511145.b0674 511145.b0002--511145.b1096 511145.b0002--511145.b1622
##  [9] 511145.b0002--511145.b2058 511145.b0002--511145.b2290 511145.b0002--511145.b2319 511145.b0002--511145.b2379
## [13] 511145.b0002--511145.b2478 511145.b0002--511145.b2784 511145.b0002--511145.b2838 511145.b0002--511145.b3008
## [17] 511145.b0002--511145.b3172 511145.b0002--511145.b3225 511145.b0002--511145.b3347 511145.b0002--511145.b3376
## [21] 511145.b0002--511145.b3433 511145.b0002--511145.b3650 511145.b0002--511145.b3770 511145.b0002--511145.b3829
## [25] 511145.b0002--511145.b3939 511145.b0002--511145.b3940 511145.b0002--511145.b4019 511145.b0002--511145.b4024
## [29] 511145.b0002--511145.b4139 511145.b0002--511145.b4177 511145.b0002--511145.b4207 511145.b0002--511145.b4298
## + ... omitted several edges
E(g.ppi)$ppi %>% head(100)
##   [1] 183 309 211 213 357 258 243 208 183 357 430 357 214 152 540 215 430 213 169 215 430 152 243 210 215 878 605 814
##  [29] 196 274 169 213 153 432 210 456 152 185 394 456 183 379 227 153 158 800 278 171 278 212 194 292 237 306 398 365
##  [57] 442 430 409 879 800 172 794 873 181 165 155 181 237 165 152 185 185 188 161 183 179 763 172 911 185 767 154 210
##  [85] 165 181 185 818 185 152 172 185 165 185 509 401 878 169 185 185

Stats TODO nb de gènes dans string vs. dans le génome

Community detection

→ Louvain clusters

ppi.clusters = cluster_louvain(g.ppi, weights=E(g.ppi)$ppi)
ppi.clusters
## IGRAPH clustering multi level, groups: 74, mod: 0.74
## + groups:
##   $`1`
##   [1] "511145.b0066" "511145.b0067" "511145.b0068"
##   
##   $`2`
##   [1] "511145.b0289" "511145.b0291"
##   
##   $`3`
##     [1] "511145.b0054" "511145.b0127" "511145.b0128" "511145.b0150" "511145.b0151" "511145.b0152" "511145.b0153"
##     [8] "511145.b0158" "511145.b0197" "511145.b0198" "511145.b0199" "511145.b0262" "511145.b0365" "511145.b0366"
##    [15] "511145.b0367" "511145.b0368" "511145.b0402" "511145.b0584" "511145.b0588" "511145.b0589" "511145.b0590"
##   + ... omitted several groups/vertices

Exploration des clusters/communautés obtenues :

length(ppi.clusters)
## [1] 74
sizes(ppi.clusters) %>% sort(decreasing=T) %>% head(25)
## Community sizes
##   6  58  70  56  71   3   8  54  62  41  24  19  14  47  33  37  45  66  59   5  13  27  36  64  67 
## 432 261 260 252 205 167 145 136 128 115  85  81  64  49  15  15  10  10   9   6   6   5   5   5   5
modularity(g.ppi, ppi.clusters$membership, weights=E(g.ppi)$ppi)
## [1] 0.7356774

Bonne mesure de modularité.

Communities with at least 10 members

which(sizes(ppi.clusters) >= 10)
##  3  6  8 14 19 24 33 37 41 45 47 54 56 58 62 66 70 71 
##  3  6  8 14 19 24 33 37 41 45 47 54 56 58 62 66 70 71
coms.ppi = tibble(ppi.community = which(sizes(ppi.clusters) >= 10))
sizes(ppi.clusters)[ coms.ppi$ppi.community ]
## Community sizes
##   3   6   8  14  19  24  33  37  41  45  47  54  56  58  62  66  70  71 
## 167 432 145  64  81  85  15  15 115  10  49 136 252 261 128  10 260 205
coms.ppi$size =sizes(ppi.clusters)[ coms.ppi$ppi.community ] 
coms.ppi %>% kabex
ppi.community size
3 167
6 432
8 145
14 64
19 81
24 85
33 15
37 15
41 115
45 10
47 49
54 136
56 252
58 261
62 128
66 10
70 260
71 205
coms.ppi$size %>% sum
## [1] 2430

Ajout de ces communautés à proteins

proteins.ppi = tibble(protein_external_id=V(g.ppi)$name, ppi.community=ppi.clusters$membership) %>% inner_join(coms.ppi)
## Joining, by = "ppi.community"
proteins =  proteins %>% left_join(proteins.ppi) %>% dplyr::select(-size)
## Joining, by = "protein_external_id"
proteins
## # A tibble: 4,127 x 5
##    protein_external_… preferred_name protein_size annotation                                               ppi.community
##    <chr>              <chr>                 <int> <chr>                                                            <dbl>
##  1 511145.b0001       thrL                     21 Thr operon leader peptide; This protein is involved in …            NA
##  2 511145.b0002       thrA                    820 Bifunctional: aspartokinase I (N-terminal); homoserine …            58
##  3 511145.b0003       thrB                    310 Homoserine kinase; Catalyzes the ATP-dependent phosphor…            58
##  4 511145.b0004       thrC                    428 Threonine synthase; Catalyzes the gamma-elimination of …             6
##  5 511145.b0005       yaaX                     98 annotation not available                                            NA
##  6 511145.b0006       yaaA                    258 annotation not available                                            NA
##  7 511145.b0007       yaaJ                    476 Uncharacterized transporter YaaJ; Inner membrane transp…            NA
##  8 511145.b0008       talB                    317 Transaldolase B; Transaldolase is important for the bal…            71
##  9 511145.b0009       mog                     195 Molybdopterin adenylyltransferase; Catalyzes the adenyl…             8
## 10 511145.b0010       satP                    188 Succinate-acetate/proton symporter SatP; Uptake of acet…            NA
## # … with 4,117 more rows

Données d’expression des gènes

Graphe avec les scores de coexpression

g.full = graph.data.frame(links, directed=F)
g.full
## IGRAPH cb0f3c9 UN-- 4125 1060854 -- 
## + attr: name (v/c), neighborhood (e/n), fusion (e/n), cooccurence (e/n), coexpression (e/n), experimental
## | (e/n), database (e/n), textmining (e/n), combined_score (e/n)
## + edges from cb0f3c9 (vertex names):
##  [1] 511145.b0001--511145.b3766 511145.b0001--511145.b2483 511145.b0001--511145.b0075 511145.b0001--511145.b3672
##  [5] 511145.b0001--511145.b0861 511145.b0001--511145.b2603 511145.b0001--511145.b0003 511145.b0001--511145.b3386
##  [9] 511145.b0001--511145.b3769 511145.b0001--511145.b0411 511145.b0001--511145.b4558 511145.b0001--511145.b3771
## [13] 511145.b0001--511145.b3748 511145.b0001--511145.b3107 511145.b0001--511145.b4246 511145.b0001--511145.b0166
## [17] 511145.b0001--511145.b1920 511145.b0001--511145.b4045 511145.b0001--511145.b2762 511145.b0001--511145.b0981
## [21] 511145.b0001--511145.b1806 511145.b0001--511145.b2310 511145.b0001--511145.b0002 511145.b0001--511145.b3671
## [25] 511145.b0001--511145.b0863 511145.b0001--511145.b4529 511145.b0001--511145.b0005 511145.b0001--511145.b1002
## + ... omitted several edges

Community detection

→ Louvain coexpr clusters

coexpr.clusters = cluster_louvain(g.full, weights=E(g.full)$coexpression)
coexpr.clusters
## IGRAPH clustering multi level, groups: 143, mod: 0.49
## + groups:
##   $`1`
##   [1] "511145.b0018"
##   
##   $`2`
##   [1] "511145.b0024"
##   
##   $`3`
##   [1] "511145.b0217"
##   
##   $`4`
##   + ... omitted several groups/vertices

Exploration des clusters/communautés obtenues :

length(coexpr.clusters)
## [1] 143
sizes(coexpr.clusters) %>% sort(decreasing = T) %>% head(25)
## Community sizes
##   46   48   62   52   44   57   92   36   32   72   39   81   61   42   30   37   65    1    2    3    4    5    6    7 
## 1213  807  659  327  302  269  129   76   74   57   26   22   18   11    5    2    2    1    1    1    1    1    1    1 
##    8 
##    1
modularity(g.full, coexpr.clusters$membership, weights=E(g.full)$coexpression)
## [1] 0.4858105

Bonne modularité

Communities with at least 10 members

which(sizes(coexpr.clusters) >= 10)
## 32 36 39 42 44 46 48 52 57 61 62 72 81 92 
## 32 36 39 42 44 46 48 52 57 61 62 72 81 92
coms.coexpr = tibble(coexpr.community = which(sizes(coexpr.clusters) >= 10))
sizes(coexpr.clusters)[ coms.coexpr$coexpr.community ]
## Community sizes
##   32   36   39   42   44   46   48   52   57   61   62   72   81   92 
##   74   76   26   11  302 1213  807  327  269   18  659   57   22  129
coms.coexpr$size =sizes(coexpr.clusters)[ coms.coexpr$coexpr.community ] 
coms.coexpr %>% kabex
coexpr.community size
32 74
36 76
39 26
42 11
44 302
46 1213
48 807
52 327
57 269
61 18
62 659
72 57
81 22
92 129
coms.coexpr$size %>% sum
## [1] 3990

Ajout des ces communautés à proteins

proteins.coexpr = tibble(protein_external_id=V(g.full)$name, coexpr.community=coexpr.clusters$membership) %>% inner_join(coms.coexpr) %>% dplyr::select(-size)
## Joining, by = "coexpr.community"
proteins =  proteins %>% left_join(proteins.coexpr)
## Joining, by = "protein_external_id"

Compare ppi clusters & coexpr clusters

proteins %>% filter(!is.na(ppi.community)) %>% nrow
## [1] 2430
proteins %>% filter(!is.na(coexpr.community)) %>% nrow
## [1] 3990
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
prs = proteins %>% filter(!is.na(ppi.community) & !is.na(coexpr.community))
prs
## # A tibble: 2,417 x 6
##    protein_external… preferred_name protein_size annotation                               ppi.community coexpr.community
##    <chr>             <chr>                 <int> <chr>                                            <dbl>            <dbl>
##  1 511145.b0002      thrA                    820 "Bifunctional: aspartokinase I (N-termi…            58               44
##  2 511145.b0003      thrB                    310 "Homoserine kinase; Catalyzes the ATP-d…            58               44
##  3 511145.b0004      thrC                    428 "Threonine synthase; Catalyzes the gamm…             6               44
##  4 511145.b0008      talB                    317 "Transaldolase B; Transaldolase is impo…            71               48
##  5 511145.b0009      mog                     195 "Molybdopterin adenylyltransferase; Cat…             8               48
##  6 511145.b0014      dnaK                    638 "Chaperone protein DnaK; Plays an essen…            56               72
##  7 511145.b0015      dnaJ                    376 "Chaperone protein DnaJ; Interacts with…            56               72
##  8 511145.b0023      rpsT                     87 "30S ribosomal protein S20; Binds direc…            62               57
##  9 511145.b0025      ribF                    313 "Riboflavin biosynthesis protein RibF; …            58               46
## 10 511145.b0026      ileS                    938 "Isoleucine--tRNA ligase; Catalyzes the…            24               57
## # … with 2,407 more rows

Good overlap between genes clustered in both clustering.

library(ade4)
tab = prs %>% dcast(coexpr.community ~ ppi.community)
## Using coexpr.community as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(tab) = paste0('ppi.', colnames(tab))
rownames(tab) = paste0('coexpr.',tab[,1])
tab[,-1] %>% table.value()

Certains semblent bien se correspondre (visuellement), ex: coexpr.1 et ppi.37 ; et d’autres beaucoup moins, ex: ppi.58

Pour avoir une mesure entre 0 et 1, utilisons le coefficient simple d’appariement et l’indice de Jaccard.

Matrices pour ppi et coexpr:

ppi.m = sapply(prs$ppi.community, function(yi) yi == prs$ppi.community)
coexpr.m = sapply(prs$coexpr.community, function(yi) yi == prs$coexpr.community)
Table de contingence
           objet j
            1    0
objet   1   a    b
  i     0   c    d
a  = sum( as.numeric(ppi.m & coexpr.m))
a
## [1] 163499
bc = sum( as.numeric(xor(ppi.m, coexpr.m)))
bc
## [1] 1298028
d  = sum( as.numeric(!ppi.m & !coexpr.m))
d
## [1] 4380362

SMC

bc / (a+bc+d)
## [1] 0.2221932

Jaccard

bc / (a+bc)
## [1] 0.8881314