library(reticulate)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
kabex = function(o) o %>% kable(format="html", escape=F) %>% kable_styling(bootstrap_options = c("striped"))
library(DT)
dted = function(o) o %>% datatable(rownames=F, filter="top", options=list(pageLength = 10), escape=F)
Pour mettre en oeuvre l’intégration de données, nous allons partir de questions biologiques larges :
Est-ce que les modules dans le réseau d’interactions protéine-protéine d’Escherichia coli sont codés par des gènes co-exprimés ?
Quelles sont les caractéristiques que partage chacun des ensembles de gènes qui sont co-exprimés et qui codent pour des protéines d’un même module du graphe d’interaction protéine-protéine ?
Pour tenter de répondre à cette question, nous aurons besoin
Plan :
STRINGdb → graphe d’interactions ppi
igraph → community detection (with edge weights) → ppi clusters/modules
STRINGdb → coexpr graph → ppi cluster members mean/median/… coexpr ?
coexpr graph → igrah community detection → coexpr clusters
comparaison ppi clusters vs. coexpr clusters
ppi clusters enrichment analysis
best match between ppi & coexpr clusters → enrichment
STRINGdb → GO enrichment (with STRINGdb)
GO → MariaDB → Neo4j
STRINGdb → GO → Neo4j
https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html
Installation via conda
conda deactivate
conda activate idh
conda install bioconductor-stringdb
library(STRINGdb)
Extrait de la vignette : « To begin, you should first know the NCBI taxonomy identifiers of the organism on which you have performed the experiment (e.g. 9606 for Human, 10090 for mouse). If you don’t know that, you can search the NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) or start looking at our species table (that you can also use to verify that your organism is represented in the STRING database). Hence, if your species is not Human (i.e. our default species), you can find it and their taxonomy identifiers on STRING webpage under the ’organisms’ section (https://string-db.org/cgi/input.pl?input_page_active_form=organism or download the full list in the download section of STRING website. »
On trouve (2020-09-15 11:40am) :Name: Escherichia coli str. K-12 substr. MG1655 Aliases: (E. coli str. K-12 substr. MG1655 / Escherichia coli K12 MG1655 / Escherichia coli K12 substr. MG1655 / Escherichia coli MG1655 / Escherichia coli str. K12 substr. MG1655 / Escherichia coli str. MG1655 / Escherichia coli strain MG1655) STRING-type: core distinct protein-coding genes: 4127 NCBI taxon-Id: 511145
Pour utiliser STRING, il nous faudra télécharger des données en local. Création d’un répertoire dédié :
mkdir string.data
Connexion au serveur
stringdb = STRINGdb$new(version='11', species=511145, score_threshold=200, input_directory='string.data')
Téléchargement du genome/proteome :
proteins = stringdb$get_proteins() %>% as_tibble
proteins
## # A tibble: 4,127 x 4
## protein_external_… preferred_name protein_size annotation
## <chr> <chr> <int> <chr>
## 1 511145.b0001 thrL 21 Thr operon leader peptide; This protein is involved in control of the…
## 2 511145.b0002 thrA 820 Bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase …
## 3 511145.b0003 thrB 310 Homoserine kinase; Catalyzes the ATP-dependent phosphorylation of L- …
## 4 511145.b0004 thrC 428 Threonine synthase; Catalyzes the gamma-elimination of phosphate from…
## 5 511145.b0005 yaaX 98 annotation not available
## 6 511145.b0006 yaaA 258 annotation not available
## 7 511145.b0007 yaaJ 476 Uncharacterized transporter YaaJ; Inner membrane transport protein
## 8 511145.b0008 talB 317 Transaldolase B; Transaldolase is important for the balance of metabo…
## 9 511145.b0009 mog 195 Molybdopterin adenylyltransferase; Catalyzes the adenylation of molyb…
## 10 511145.b0010 satP 188 Succinate-acetate/proton symporter SatP; Uptake of acetate and succin…
## # … with 4,117 more rows
On observe que le fichier 5511145.protein.info.v11.0.txt.gz
est téléchargé dans le répertoire string.data
ls -lh string.data
## total 15M
## -rw-rw-r-- 1 barriot gsi 314K Sep 24 14:20 511145.protein.actions.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 730K Sep 15 12:41 511145.protein.aliases.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 270K Sep 15 12:23 511145.protein.info.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 8.4M Sep 15 18:02 511145.protein.links.detailed.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 5.1M Sep 15 12:13 511145.protein.links.v11.0.txt.gz
## -rw-rw-r-- 1 barriot gsi 4 Sep 15 13:29 NOBACKUP
Qu’y a-t-il dedans ?
zcat string.data/511145.protein.info.v11.0.txt.gz | head
## protein_external_id preferred_name protein_size annotation
## 511145.b0001 thrL 21 Thr operon leader peptide; This protein is involved in control of the biosynthesis of threonine
## 511145.b0002 thrA 820 Bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal); Protein involved in threonine biosynthetic process, methionine biosynthetic process and homoserine biosynthetic process
## 511145.b0003 thrB 310 Homoserine kinase; Catalyzes the ATP-dependent phosphorylation of L- homoserine to L-homoserine phosphate. Is also able to phosphorylate the hydroxy group on gamma-carbon of L-homoserine analogs when the functional group at the alpha-position is a carboxyl, an ester, or even a hydroxymethyl group. Neither L- threonine nor L-serine are substrates of the enzyme
## 511145.b0004 thrC 428 Threonine synthase; Catalyzes the gamma-elimination of phosphate from L- phosphohomoserine and the beta-addition of water to produce L- threonine. To a lesser extent, is able to slowly catalyze the deamination of L-threonine into alpha-ketobutyrate and that of L- serine and 3-chloroalanine into pyruvate. Is also able to rapidly convert vinylglycine to threonine, which proves that the pyridoxal p-quinonoid of vinylglycine is an intermediate in the TS reaction
## 511145.b0005 yaaX 98 annotation not available
## 511145.b0006 yaaA 258 annotation not available
## 511145.b0007 yaaJ 476 Uncharacterized transporter YaaJ; Inner membrane transport protein
## 511145.b0008 talB 317 Transaldolase B; Transaldolase is important for the balance of metabolites in the pentose-phosphate pathway
## 511145.b0009 mog 195 Molybdopterin adenylyltransferase; Catalyzes the adenylation of molybdopterin as part of the biosynthesis of the molybdenum-cofactor; Belongs to the MoaB/Mog family
Téléchargement des interactions
g.string = stringdb$get_graph()
On observe que le fichier 511145.protein.links.v11.0.txt.gz
est téléchargé dans le répertoire string.data
Qu’y a-t-il dedans ?
zcat string.data/511145.protein.links.v11.0.txt.gz | head
## protein1 protein2 combined_score
## 511145.b0001 511145.b3766 354
## 511145.b0001 511145.b2483 430
## 511145.b0001 511145.b0075 303
## 511145.b0001 511145.b3672 408
## 511145.b0001 511145.b0861 306
## 511145.b0001 511145.b2603 282
## 511145.b0001 511145.b0003 608
## 511145.b0001 511145.b3386 167
## 511145.b0001 511145.b3769 303
On observe que les liens combined_score < 200 sont quand même présents dans ce fichier. Le filtre se fait dans R. Pourquoi ?
Auparavant (version 10), on avait tous les scores dans ce fichier, et pas seulement le score combiné. Or nous avons besoin de la colonne correspondant aux donnée d’interaction protéine-protéine (experimental dans le fichier).
Pour cela, il faut passer par la page de téléchargement sur https://string-db.org/cgi/download.pl en restreignant l’organisme d’intérêt.
Le fichier complet faisant 67.6Go (au 15/09/2020), il est conseillé de d’abord sélectionner un organisme avant le téléchargement. Une copie est disponible sur silico à récupérer sur http://silico.biotoul.fr/enseignement/m2bbs/idh/
Qu’y a-t-il dedans ?
zcat string.data/511145.protein.links.detailed.v11.0.txt.gz | head
## protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
## 511145.b0001 511145.b3766 0 0 0 125 0 0 292 354
## 511145.b0001 511145.b2483 0 0 0 0 0 0 430 430
## 511145.b0001 511145.b0075 0 0 0 0 0 0 303 303
## 511145.b0001 511145.b3672 0 0 0 0 0 0 408 408
## 511145.b0001 511145.b0861 0 0 0 0 0 0 306 306
## 511145.b0001 511145.b2603 0 0 0 0 0 0 282 282
## 511145.b0001 511145.b0003 604 0 0 0 0 0 50 608
## 511145.b0001 511145.b3386 0 0 0 0 0 0 167 167
## 511145.b0001 511145.b3769 0 0 0 0 0 0 303 303
Chargement sous forme de tibble :
links = read_delim("string.data/511145.protein.links.detailed.v11.0.txt.gz", delim=" ", col_types = "ccnnnnnnnn")
links %>% head %>% kabex
protein1 | protein2 | neighborhood | fusion | cooccurence | coexpression | experimental | database | textmining | combined_score |
---|---|---|---|---|---|---|---|---|---|
511145.b0001 | 511145.b3766 | 0 | 0 | 0 | 125 | 0 | 0 | 292 | 354 |
511145.b0001 | 511145.b2483 | 0 | 0 | 0 | 0 | 0 | 0 | 430 | 430 |
511145.b0001 | 511145.b0075 | 0 | 0 | 0 | 0 | 0 | 0 | 303 | 303 |
511145.b0001 | 511145.b3672 | 0 | 0 | 0 | 0 | 0 | 0 | 408 | 408 |
511145.b0001 | 511145.b0861 | 0 | 0 | 0 | 0 | 0 | 0 | 306 | 306 |
511145.b0001 | 511145.b2603 | 0 | 0 | 0 | 0 | 0 | 0 | 282 | 282 |
Nous nous intéresserons aux données d’expression et donc à la colonne coexpression
. Pour le réseau d’interaction ppi, la colonne experimental
contient des interactions physiques et fonctionnelles. Un autre fichier est donc nécessaire : protein.actions (binding)
actions = read_delim("string.data/511145.protein.actions.v11.0.txt.gz", delim="\t")
## Parsed with column specification:
## cols(
## item_id_a = col_character(),
## item_id_b = col_character(),
## mode = col_character(),
## action = col_character(),
## is_directional = col_logical(),
## a_is_acting = col_logical(),
## score = col_double()
## )
actions
## # A tibble: 73,224 x 7
## item_id_a item_id_b mode action is_directional a_is_acting score
## <chr> <chr> <chr> <chr> <lgl> <lgl> <dbl>
## 1 511145.b0002 511145.b3607 binding <NA> FALSE FALSE 183
## 2 511145.b0002 511145.b0031 binding <NA> FALSE FALSE 309
## 3 511145.b0002 511145.b0033 binding <NA> FALSE FALSE 211
## 4 511145.b0002 511145.b0268 binding <NA> FALSE FALSE 213
## 5 511145.b0002 511145.b0600 binding <NA> FALSE FALSE 357
## 6 511145.b0002 511145.b0674 binding <NA> FALSE FALSE 258
## 7 511145.b0002 511145.b1096 binding <NA> FALSE FALSE 243
## 8 511145.b0002 511145.b1622 binding <NA> FALSE FALSE 208
## 9 511145.b0002 511145.b2058 binding <NA> FALSE FALSE 183
## 10 511145.b0002 511145.b2290 binding <NA> FALSE FALSE 357
## # … with 73,214 more rows
ppi = actions %>% filter(mode=='binding') %>% dplyr::select(protein1 = item_id_a, protein2 = item_id_b, ppi = score)
experimetal:
BIND, DIP, GRID, HPRD, IntAct, MINT, and PID. (src: https://string-db.org/help/faq/#which-databases-does-string-extract-experimental-data-from)
Création du graphe correspondant :
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g.ppi = graph.data.frame(ppi, directed=F)
g.ppi
## IGRAPH 026ba8f UN-- 2586 37752 --
## + attr: name (v/c), ppi (e/n)
## + edges from 026ba8f (vertex names):
## [1] 511145.b0002--511145.b3607 511145.b0002--511145.b0031 511145.b0002--511145.b0033 511145.b0002--511145.b0268
## [5] 511145.b0002--511145.b0600 511145.b0002--511145.b0674 511145.b0002--511145.b1096 511145.b0002--511145.b1622
## [9] 511145.b0002--511145.b2058 511145.b0002--511145.b2290 511145.b0002--511145.b2319 511145.b0002--511145.b2379
## [13] 511145.b0002--511145.b2478 511145.b0002--511145.b2784 511145.b0002--511145.b2838 511145.b0002--511145.b3008
## [17] 511145.b0002--511145.b3172 511145.b0002--511145.b3225 511145.b0002--511145.b3347 511145.b0002--511145.b3376
## [21] 511145.b0002--511145.b3433 511145.b0002--511145.b3650 511145.b0002--511145.b3770 511145.b0002--511145.b3829
## [25] 511145.b0002--511145.b3939 511145.b0002--511145.b3940 511145.b0002--511145.b4019 511145.b0002--511145.b4024
## [29] 511145.b0002--511145.b4139 511145.b0002--511145.b4177 511145.b0002--511145.b4207 511145.b0002--511145.b4298
## + ... omitted several edges
E(g.ppi)$ppi %>% head(100)
## [1] 183 309 211 213 357 258 243 208 183 357 430 357 214 152 540 215 430 213 169 215 430 152 243 210 215 878 605 814
## [29] 196 274 169 213 153 432 210 456 152 185 394 456 183 379 227 153 158 800 278 171 278 212 194 292 237 306 398 365
## [57] 442 430 409 879 800 172 794 873 181 165 155 181 237 165 152 185 185 188 161 183 179 763 172 911 185 767 154 210
## [85] 165 181 185 818 185 152 172 185 165 185 509 401 878 169 185 185
Stats TODO nb de gènes dans string vs. dans le génome
→ Louvain clusters
ppi.clusters = cluster_louvain(g.ppi, weights=E(g.ppi)$ppi)
ppi.clusters
## IGRAPH clustering multi level, groups: 74, mod: 0.74
## + groups:
## $`1`
## [1] "511145.b0066" "511145.b0067" "511145.b0068"
##
## $`2`
## [1] "511145.b0289" "511145.b0291"
##
## $`3`
## [1] "511145.b0054" "511145.b0127" "511145.b0128" "511145.b0150" "511145.b0151" "511145.b0152" "511145.b0153"
## [8] "511145.b0158" "511145.b0197" "511145.b0198" "511145.b0199" "511145.b0262" "511145.b0365" "511145.b0366"
## [15] "511145.b0367" "511145.b0368" "511145.b0402" "511145.b0584" "511145.b0588" "511145.b0589" "511145.b0590"
## + ... omitted several groups/vertices
Exploration des clusters/communautés obtenues :
length(ppi.clusters)
## [1] 74
sizes(ppi.clusters) %>% sort(decreasing=T) %>% head(25)
## Community sizes
## 6 58 70 56 71 3 8 54 62 41 24 19 14 47 33 37 45 66 59 5 13 27 36 64 67
## 432 261 260 252 205 167 145 136 128 115 85 81 64 49 15 15 10 10 9 6 6 5 5 5 5
modularity(g.ppi, ppi.clusters$membership, weights=E(g.ppi)$ppi)
## [1] 0.7356774
Bonne mesure de modularité.
Communities with at least 10 members
which(sizes(ppi.clusters) >= 10)
## 3 6 8 14 19 24 33 37 41 45 47 54 56 58 62 66 70 71
## 3 6 8 14 19 24 33 37 41 45 47 54 56 58 62 66 70 71
coms.ppi = tibble(ppi.community = which(sizes(ppi.clusters) >= 10))
sizes(ppi.clusters)[ coms.ppi$ppi.community ]
## Community sizes
## 3 6 8 14 19 24 33 37 41 45 47 54 56 58 62 66 70 71
## 167 432 145 64 81 85 15 15 115 10 49 136 252 261 128 10 260 205
coms.ppi$size =sizes(ppi.clusters)[ coms.ppi$ppi.community ]
coms.ppi %>% kabex
ppi.community | size |
---|---|
3 | 167 |
6 | 432 |
8 | 145 |
14 | 64 |
19 | 81 |
24 | 85 |
33 | 15 |
37 | 15 |
41 | 115 |
45 | 10 |
47 | 49 |
54 | 136 |
56 | 252 |
58 | 261 |
62 | 128 |
66 | 10 |
70 | 260 |
71 | 205 |
coms.ppi$size %>% sum
## [1] 2430
Ajout de ces communautés à proteins
proteins.ppi = tibble(protein_external_id=V(g.ppi)$name, ppi.community=ppi.clusters$membership) %>% inner_join(coms.ppi)
## Joining, by = "ppi.community"
proteins = proteins %>% left_join(proteins.ppi) %>% dplyr::select(-size)
## Joining, by = "protein_external_id"
proteins
## # A tibble: 4,127 x 5
## protein_external_… preferred_name protein_size annotation ppi.community
## <chr> <chr> <int> <chr> <dbl>
## 1 511145.b0001 thrL 21 Thr operon leader peptide; This protein is involved in … NA
## 2 511145.b0002 thrA 820 Bifunctional: aspartokinase I (N-terminal); homoserine … 58
## 3 511145.b0003 thrB 310 Homoserine kinase; Catalyzes the ATP-dependent phosphor… 58
## 4 511145.b0004 thrC 428 Threonine synthase; Catalyzes the gamma-elimination of … 6
## 5 511145.b0005 yaaX 98 annotation not available NA
## 6 511145.b0006 yaaA 258 annotation not available NA
## 7 511145.b0007 yaaJ 476 Uncharacterized transporter YaaJ; Inner membrane transp… NA
## 8 511145.b0008 talB 317 Transaldolase B; Transaldolase is important for the bal… 71
## 9 511145.b0009 mog 195 Molybdopterin adenylyltransferase; Catalyzes the adenyl… 8
## 10 511145.b0010 satP 188 Succinate-acetate/proton symporter SatP; Uptake of acet… NA
## # … with 4,117 more rows
Graphe avec les scores de coexpression
g.full = graph.data.frame(links, directed=F)
g.full
## IGRAPH cb0f3c9 UN-- 4125 1060854 --
## + attr: name (v/c), neighborhood (e/n), fusion (e/n), cooccurence (e/n), coexpression (e/n), experimental
## | (e/n), database (e/n), textmining (e/n), combined_score (e/n)
## + edges from cb0f3c9 (vertex names):
## [1] 511145.b0001--511145.b3766 511145.b0001--511145.b2483 511145.b0001--511145.b0075 511145.b0001--511145.b3672
## [5] 511145.b0001--511145.b0861 511145.b0001--511145.b2603 511145.b0001--511145.b0003 511145.b0001--511145.b3386
## [9] 511145.b0001--511145.b3769 511145.b0001--511145.b0411 511145.b0001--511145.b4558 511145.b0001--511145.b3771
## [13] 511145.b0001--511145.b3748 511145.b0001--511145.b3107 511145.b0001--511145.b4246 511145.b0001--511145.b0166
## [17] 511145.b0001--511145.b1920 511145.b0001--511145.b4045 511145.b0001--511145.b2762 511145.b0001--511145.b0981
## [21] 511145.b0001--511145.b1806 511145.b0001--511145.b2310 511145.b0001--511145.b0002 511145.b0001--511145.b3671
## [25] 511145.b0001--511145.b0863 511145.b0001--511145.b4529 511145.b0001--511145.b0005 511145.b0001--511145.b1002
## + ... omitted several edges
→ Louvain coexpr clusters
coexpr.clusters = cluster_louvain(g.full, weights=E(g.full)$coexpression)
coexpr.clusters
## IGRAPH clustering multi level, groups: 143, mod: 0.49
## + groups:
## $`1`
## [1] "511145.b0018"
##
## $`2`
## [1] "511145.b0024"
##
## $`3`
## [1] "511145.b0217"
##
## $`4`
## + ... omitted several groups/vertices
Exploration des clusters/communautés obtenues :
length(coexpr.clusters)
## [1] 143
sizes(coexpr.clusters) %>% sort(decreasing = T) %>% head(25)
## Community sizes
## 46 48 62 52 44 57 92 36 32 72 39 81 61 42 30 37 65 1 2 3 4 5 6 7
## 1213 807 659 327 302 269 129 76 74 57 26 22 18 11 5 2 2 1 1 1 1 1 1 1
## 8
## 1
modularity(g.full, coexpr.clusters$membership, weights=E(g.full)$coexpression)
## [1] 0.4858105
Bonne modularité
Communities with at least 10 members
which(sizes(coexpr.clusters) >= 10)
## 32 36 39 42 44 46 48 52 57 61 62 72 81 92
## 32 36 39 42 44 46 48 52 57 61 62 72 81 92
coms.coexpr = tibble(coexpr.community = which(sizes(coexpr.clusters) >= 10))
sizes(coexpr.clusters)[ coms.coexpr$coexpr.community ]
## Community sizes
## 32 36 39 42 44 46 48 52 57 61 62 72 81 92
## 74 76 26 11 302 1213 807 327 269 18 659 57 22 129
coms.coexpr$size =sizes(coexpr.clusters)[ coms.coexpr$coexpr.community ]
coms.coexpr %>% kabex
coexpr.community | size |
---|---|
32 | 74 |
36 | 76 |
39 | 26 |
42 | 11 |
44 | 302 |
46 | 1213 |
48 | 807 |
52 | 327 |
57 | 269 |
61 | 18 |
62 | 659 |
72 | 57 |
81 | 22 |
92 | 129 |
coms.coexpr$size %>% sum
## [1] 3990
Ajout des ces communautés à proteins
proteins.coexpr = tibble(protein_external_id=V(g.full)$name, coexpr.community=coexpr.clusters$membership) %>% inner_join(coms.coexpr) %>% dplyr::select(-size)
## Joining, by = "coexpr.community"
proteins = proteins %>% left_join(proteins.coexpr)
## Joining, by = "protein_external_id"
proteins %>% filter(!is.na(ppi.community)) %>% nrow
## [1] 2430
proteins %>% filter(!is.na(coexpr.community)) %>% nrow
## [1] 3990
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
prs = proteins %>% filter(!is.na(ppi.community) & !is.na(coexpr.community))
prs
## # A tibble: 2,417 x 6
## protein_external… preferred_name protein_size annotation ppi.community coexpr.community
## <chr> <chr> <int> <chr> <dbl> <dbl>
## 1 511145.b0002 thrA 820 "Bifunctional: aspartokinase I (N-termi… 58 44
## 2 511145.b0003 thrB 310 "Homoserine kinase; Catalyzes the ATP-d… 58 44
## 3 511145.b0004 thrC 428 "Threonine synthase; Catalyzes the gamm… 6 44
## 4 511145.b0008 talB 317 "Transaldolase B; Transaldolase is impo… 71 48
## 5 511145.b0009 mog 195 "Molybdopterin adenylyltransferase; Cat… 8 48
## 6 511145.b0014 dnaK 638 "Chaperone protein DnaK; Plays an essen… 56 72
## 7 511145.b0015 dnaJ 376 "Chaperone protein DnaJ; Interacts with… 56 72
## 8 511145.b0023 rpsT 87 "30S ribosomal protein S20; Binds direc… 62 57
## 9 511145.b0025 ribF 313 "Riboflavin biosynthesis protein RibF; … 58 46
## 10 511145.b0026 ileS 938 "Isoleucine--tRNA ligase; Catalyzes the… 24 57
## # … with 2,407 more rows
Good overlap between genes clustered in both clustering.
library(ade4)
tab = prs %>% dcast(coexpr.community ~ ppi.community)
## Using coexpr.community as value column: use value.var to override.
## Aggregation function missing: defaulting to length
colnames(tab) = paste0('ppi.', colnames(tab))
rownames(tab) = paste0('coexpr.',tab[,1])
tab[,-1] %>% table.value()
Certains semblent bien se correspondre (visuellement), ex: coexpr.1 et ppi.37 ; et d’autres beaucoup moins, ex: ppi.58
Pour avoir une mesure entre 0 et 1, utilisons le coefficient simple d’appariement et l’indice de Jaccard.
Matrices pour ppi et coexpr:
ppi.m = sapply(prs$ppi.community, function(yi) yi == prs$ppi.community)
coexpr.m = sapply(prs$coexpr.community, function(yi) yi == prs$coexpr.community)
Table de contingence
objet j 1 0 objet 1 a b i 0 c d
a = sum( as.numeric(ppi.m & coexpr.m))
a
## [1] 163499
bc = sum( as.numeric(xor(ppi.m, coexpr.m)))
bc
## [1] 1298028
d = sum( as.numeric(!ppi.m & !coexpr.m))
d
## [1] 4380362
SMC
bc / (a+bc+d)
## [1] 0.2221932
Jaccard
bc / (a+bc)
## [1] 0.8881314