TP Traitement de graphes et réseaux biologiques

1 TP n°1 : STRINGdb, BFS, sous-graphe induit, composante connexe, diamètre et coefficient d’agglomération

1.1 STRINGdb

1.1.1 Graphe d’interaction protéine-protéine chez Escherichia coli

Les données sur lesquelles vous allez travailler proviennent de la base de données STRING v11.5.

Tout d’abord, trouvez le site de STRING et visualisez les partenaires de dnaJ d’Escherichia coli K12-MG1655.

Le menu Settings permet de choisir le rendu du graphe avec des des liens confidence ou evidence en affichant soit des arêtes de tailles reflétant le niveau de confiance du lien, soit des multi-arêtes indiquant les sources de données intervenant dans un lien, soit le type d’interaction. Dans le même menu, on peut sélectionner le type de sources de données affichées (textmining, experimental, …). On peut aussi afficher davantage de sommets avec le bouton More.

Dans la partie Download du site, vous allez récupérer des fichiers pour E. coli. Il faut d’abord restreindre les données à l’organisme considéré. Vous utiliserez la souche K-12 substr. MG1655.

Ensuite récupérez :

le fichier contenant les liens avec les scores détaillés pour chaque type de données : 511145.protein.links.detailed.v11.5.txt.gz
le fichier contenant la description des sommets : 511145.protein.info.v11.5.txt.gz

Utiliser les commandes du shell (zcat, cut, zgrep, sort, wc) pour répondre aux questions suivantes :

combien de protéines ?
combien d’arêtes ?

Le graphe étant un peu gros pour les exercices qui suivent vous allez filtrez les liens pour ne garder que ceux qui ont un score supérieur ou égal à 700 pour la colonne experimental.

Apperçu du contenu du fichier :

zcat data/511145.protein.links.detailed.v11.5.txt.gz | head

## protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
## 511145.b0001 511145.b0810 0 0 0 0 0 0 162 161
## 511145.b0001 511145.b0946 0 0 0 0 0 0 200 200
## 511145.b0001 511145.b4403 0 0 0 0 0 0 363 363
## 511145.b0001 511145.b2126 0 0 0 0 0 0 231 230
## 511145.b0001 511145.b3769 0 0 0 0 0 0 299 299
## 511145.b0001 511145.b0943 0 0 0 0 0 0 163 163
## 511145.b0001 511145.b0075 0 0 0 0 0 0 619 619
## 511145.b0001 511145.b1002 0 0 0 0 0 0 239 238
## 511145.b0001 511145.b2023 0 0 0 0 0 0 297 297

Les colonnes qui nous intéressent sont donc les 2 premières et la 7ème :

zcat data/511145.protein.links.detailed.v11.5.txt.gz | cut -f 1,2,7 -d' ' | head

## protein1 protein2 experimental
## 511145.b0001 511145.b0810 0
## 511145.b0001 511145.b0946 0
## 511145.b0001 511145.b4403 0
## 511145.b0001 511145.b2126 0
## 511145.b0001 511145.b3769 0
## 511145.b0001 511145.b0943 0
## 511145.b0001 511145.b0075 0
## 511145.b0001 511145.b1002 0
## 511145.b0001 511145.b2023 0

Cette sortie peut être filtrée pour des valeurs supérieures à 700 à l’aide la commande awk :

zcat data/511145.protein.links.detailed.v11.5.txt.gz | cut -f 1,2,7 -d' ' | awk '{ if ($3>700 && $1<$2) { print }}' | head

## protein1 protein2 experimental
## 511145.b0009 511145.b1727 870
## 511145.b0014 511145.b2231 869
## 511145.b0014 511145.b3530 723
## 511145.b0014 511145.b2592 981
## 511145.b0014 511145.b2614 997
## 511145.b0014 511145.b0015 994
## 511145.b0014 511145.b1084 845
## 511145.b0014 511145.b0911 870
## 511145.b0014 511145.b0383 800

Cela semble correspondre à ce que l’on voulait obtenir. On redirige la sortie vers un nouveau fichier que l’on nomme 511145.protein.links.experimental.txt :

zcat data/511145.protein.links.detailed.v11.5.txt.gz | cut -f 1,2,7 -d' ' | awk '{ if ($3>700 && $1<$2) { print }}' > data/511145.protein.links.experimental.txt

Combien de sommets/protéines avons-nous dans ce fichier ? Pour cela, il nous extraire la première puis la deuxième colonne (les identifiants des protéines sont répartis sur les 2 colonnes car on a ajouté protein1<protein2 dans la commande awk précédente), pour ensuite utiliser sort -u :

(cut -f 1 -d' ' data/511145.protein.links.experimental.txt ; cut -f 2 -d' ' data/511145.protein.links.experimental.txt) | sort -u | wc -l

## 956

En retirant protein1 et protein2 des identifiants, ça fait 954 sommets/identifiants.

De même pour le nombre d’arêtes/liens :

wc -l data/511145.protein.links.experimental.txt

## 4552 data/511145.protein.links.experimental.txt

En retirant la ligne contenant les noms des colonnes, ça fait 4551 arêtes.

1.2 Cytoscape

Cytoscape est un logiciel de visualisation et d’exploration de graphes. Il a été développé dans le cadre de graphes et réseaux biologiques. Un certain nombre de plugins ou greffons sont disponibles par exemple pour récupérer/télécharger des voies métaboliques (BioCyc) directement depuis Cytoscape.

1.2.1 Import d’un graphe à partir d’un fichier

Importez le graphe à partir du menu File → import → Network from file…. Il faut alors dans les paramètres avancés changer le délimiteur de colonnes et ensuite spécifier quels sont les attributs correspondant aux extrémités des arêtes.

Après chargement, vous devriez obtenir quelque chose qui ressemble à ceci :

Capture d’écran Cytoscape

1.2.2 Chargement de données supplémentaires

De la même manière que précédemment, il est possible de charger des données supplémentaires sur les sommets et les arcs/arêtes. En effet, le graphe chargé ne contient que les identifiants des gènes d’E. coli appelés bnumber. On pourra vouloir accéder ou afficher aux protéines via leur nom plus répandu tel que rne pour la RNase E qui dégrade la majorité des ARN, ou encore le chaperon moléculaire DnaK. Dans ce cas, on peut chercher dans le fichier 511145.protein.info.v11.5.txt.gz :

zgrep dnaK data/511145.protein.info.v11.5.txt.gz

## 511145.b0014 dnaK    638 Plays an essential role in the initiation of phage lambda DNA replication, where it acts in an ATP-dependent fashion with the DnaJ protein to release lambda O and P proteins from the preprimosomal complex. DnaK is also involved in chromosomal DNA replication, possibly through an analogous interaction with the DnaA protein. Also participates actively in the response to hyperosmotic shock
## 511145.b1642 slyA    144 Transcription regulator that can specifically activate or repress expression of target genes. Activates expression of genes such as molecular chaperones (groL, groS, dnaK, grpE, and cbpA), proteins involved in acid resistance (hdeA, hdeB, and gadA), the starvation lipoprotein slp, virulence protein hlyE/clyA. Represses expression of genes involved in the histidine biosynthetic pathway such as hisA, hisB, hisD, hisF and hisG. Required for the activation of virulence genes

Et à partir de cet identifiant, retrouver DnaK dans le graphe :

DnaK

Importez le fichier précédemment récupéré 511145.protein.info.v11.5.txt (qu’il faudra décompresser/désarchiver auparavant avec la commande gunzip) à partir du menu File → import → Table from file… en modifiant le délimiteur de colonnes si besoin, en précisant bien s’il s’agit de données sur les sommets ou les arêtes, et en indiquant quelle colonne permet d’identifier les sommets et quelles autres colonnes doivent être importées sous quel type (nombre ou autre).

Dans l’onglet Style (sur le côté gauche de la fenêtre principale), il est possible de modifier la manière dont est rendu un bon nombre de paramètres pour l’affichage. Pour les sommets, on peut modifier leur forme, leur label affiché, leur taille, leur couleur, …; idem pour les arêtes :

Changment du style de l’affichage

Il serait possible de passer plusieurs heures à explorer les possibilités offertes par ce logiciel, ou encore à explorer le graphe d’interaction protéine-protéine d’E. coli, ce que vous pourrez faire par vous-même par ailleurs. L’exploration manuelle reste néanmoins laborieuse et chronophage, par exemple, pour rechercher tous les sommets que l’on peut atteindre depuis DnaK.

Pour cette séance, les objectifs vont être :

en python, charger le fichier 511145.protein.links.detailed.v11.5.txt.gz dans une structure de données,
implémenter l’algorithme de parcours en largeur à partir d’un sommet,
déterminer les sommets atteignables depuis DnaK et les plus courts chemins y menant,
extraire le sous graphe induit par cet ensemble de sommets (plus grande composante connexe pour ce graphe) et
calculer le coéfficient d’agglomération des sommets, une mesure locale de densité.

1.3 Module python pour le traitement de graphes

Vous allez réaliser un module python pour les graphes (orientés ou non) que vous allez construire au fur et à mesure de ce TP, du suivant, et du projet pour cette UE.

Ce module va permettre entre autres :

le chargement d’un graphe sous différents formats (CSV, TSV, …, OBO pour la Gene Ontology, GOA pour les annotations des produits des gènes avec des GOTerm),
le parcours en largeur (BFS),
le parcours en profondeur (DFS),
la détection de cycle/circuit,
le tri topologique,
l’identification des plus courts chemins à partir d’un sommet (Bellman-Ford), ou entre toutes les paires de sommets (Floyd-Warshall).

1.3.1 Structures de données et représentation d’un graphe

Pour commencer, il vous faut définir les structures de données que vous allez utiliser pour représenter et traiter des graphes. Un graphe

peut avoir différents attributs (nom, ordre, diamètre, …),
peut être orienté ou non,
peut être pondéré ou non,
possède un ensemble de sommets,
possède un ensemble d’arcs ou d’arêtes.

De plus, un sommet peut avoir différentes étiquettes ou attributs : identifiant, nom, index, date de passage, couleur, coordonnées (x,y), coefficient d’agglomération, centralité, etc.. De même, un arc ou une arête peut avoir différents attributs : poids, longueur, couleur, type, centralité, etc..

Un des choix cruciaux va être de choisir la représentation (structure de données utilisée) pour les arcs ou arêtes : il est possible, par exemple, d’opter pour une matrice d’adjacence ou des listes d’adjacence. Si l’on manipule des graphes peu denses, la représentation par listes d’adjacence est plus économe en mémoire (sauf en cas d’optimisation de la gestion de matrices creuses).

Toutes ces informations peuvent être stockées dans un dictionnaire à l’allure suivante (avec la syntaxe python) :

graph = {
    'directed':          True,
    'weighted':          False,
    'nodes':             {},
    'edges':             {},
    'weight_attribute':  None,
}

Les clés nodes et edges pointent elles-mêmes vers des dictionnaires (imbriqués).

Considérons le graphe g1 très simple suivant constitué de 2 sommets et d’un arc (A,B) ayant un poids de 5 :

     weight: 5
 A -------------> B

Le plus compliqué est le stockage des arcs qui peuvent avoir différents attributs. Le choix proposé est d’utiliser un dictionnaire de dictionnaires de dictionnaires de dictionnaires … : pour l’arc de A à B de poids 5 : graph['edges']['A']['B']['weight'] = 5

En d’autres termes :

graph['edges'] est un dictionnaire dont les clés sont les sommets source des arcs et dont les valeurs sont des dictionnaires,
graph['edges']['A'] est un dictionnaire pour les arcs dont l’extrémité initiale est le sommet A. Ce sommet peut avoir un degré sortant supérieur à 1, donc il faut stocker les extrémités terminales sous forme de dictionnaire. Les clés de ce dictionnaire seront donc les sommets correspondants aux extrémités terminales des sommets dont l’extrémité initiale est A,
graph['edges']['A']['B'] est un dictionnaire permettant de stocker les attributs de l’arc (A,B),
graph['edges']['A']['B']['weight'] est ainsi le poids de l’arc A → B.

Ainsi, si un graphe pondéré est chargé, on pourra choisir de stocker le poids des arcs dans les attributs des arcs (accessibles à partir de la clé edges) avec la clé weight. Pour $g1$, le dictionnaire pour son stockage aura l’allure suivante :

g1 = {
 'directed': True,
 'weighted': True,
 'nodes': {
    'A': { },
    'B': { }
    },
 'edges': {
    'A': { 
      'B': {
         'weight': 5
         }
      },
    'B': {}
 },
 'weight_attribute': 'weight'
}

1.4 Modules et scripts en python

Création du premier module/fichier graphmaster.py avec la fonction create_graph et l’exécution conditionnelle de code pour les tests du module. Puis, un autre fichier/script qui utilise le module.

Créez un fichier graphmaster.py qui va contenir les fonctions de la librairie que vous allez développée. Pour l’instant, il va contenir une seule fonction permettant de créer une variable représentant un graphe vide :

#!/bin/env python

from pprint import pprint

def create_graph(directed = True, weighted = False, weight_attribute = None):
    """
    create a dictionnary representing a graph and returns it.
    """
    g = { 'nodes': {}, 'edges': {}, 'in_edges': {}, 'directed': directed, 'weighted': weighted, 'weight_attribute': weight_attribute }
    return g


##### main → tests #####
if __name__ == "__main__":
    print("# Graph lib tests")
    print("## create_graph")
    g = create_graph()
    pprint(g)

Utilisation (pour le formatage du sujet de TP, le contenu ci-dessus a été placé dans un sous-répertoire scripts et dans un fichier graphmaster.create_graph.py) :

chmod +x ./scripts/graphmaster.create_graph.py
./scripts/graphmaster.create_graph.py

## # Graph lib tests
## ## create_graph
## {'directed': True,
##  'edges': {},
##  'nodes': {},
##  'weight_attribute': None,
##  'weighted': False}

Utilisation du module graphmaster.py depuis un autre script use.graphmaster.py:

#!/bin/env python

import graphmaster as gm

print('Hello there')

g1 = gm.create_graph(directed=False)
print('Created g1 as undirected:', g1)

chmod +x scripts/use.graphmaster.py
./scripts/use.graphmaster.py

## Hello there
## Created g1 as undirected: {'nodes': {}, 'edges': {}, 'directed': False, 'weighted': False, 'weight_attribute': None}

Remarque : On notera que lorsque le module graphmaster est utilisé depuis un autre script, la partie ##### main → tests ##### if __name__ == "__main__": du module n’est pas exécutée.

1.5 Ajout de sommets et d’arêtes à un graphe

Il s’agit maintenant de pouvoir ajouter des sommets et des arcs/arêtes.

Avant d’ajouter un sommet, il faut s’assurer qu’il n’existe pas déjà en testant l’existence de sa clé dans le dictionnaire nodes. Si ce n’est pas le cas, on pourra alors ajouter cette clé qui va pointer vers les attributs du sommet :

def add_node(g, node_id, attributes = None): 
    """
    add a node with node_id (node id provided as a string or int) to the graph g.
    attributes on the node can be provided by a dict.
    returns the node n attributes.
    """
    if node_id not in g['nodes']: # ensure node does not already exist
        if attributes is None: # create empty attributes if not provided
            attributes = {}
        g['nodes'][node_id] = attributes
        g['edges'][node_id] = {} # init outgoing edges
    return g['nodes'][node_id] # return node attributes

De même, pour ajouter un arc, il faut :

s’assurer que l’extrémité initiale existe
s’assurer que l’extrémité terminale existe
s’assurer que l’arc n’existe pas
si le graphe n’est PAS orienté, il faudra alors ajouter A → B ainsi que B → A

def add_edge(g, node_id1, node_id2, attributes = None, node1_attributes = None, node2_attributes = None): 
    # create nodes if they do not exist
    if node_id1 not in g['nodes']: add_node(g, node_id1, node1_attributes) # ensure node1 exists
    if node_id2 not in g['nodes']: add_node(g, node_id2, node2_attributes) # ensure node2 exists
    # add edge(s) only if they do not exist
    if node_id2 not in g['edges'][node_id1]:
        if attributes is None: # create empty attributes if not provided
            attributes = {}
        g['edges'][node_id1][node_id2] = attributes
        if not g['directed']:
            g['edges'][node_id2][node_id1] = g['edges'][node_id1][node_id2] # share the same attributes as n1->n2
    return g['edges'][node_id1][node_id2] # return edge attributes

Il s’agira ensuite de vérifier que l’on obtient bien ce que l’on souhaite à l’aide de tests que l’on pourra ajouter dans la partie ##### main → tests #####:

##### main → tests #####
if __name__ == "__main__":
  print("# Graph lib tests")
  print("## create_graph")
  g = create_graph()
  pprint(g)
  
  print("## add nodes and edges")
  g = create_graph()
  add_node(g, 'A')
  add_node(g, 'B')
  add_edge(g, 'A', 'B', { 'weight': 5 } )
  pprint(g)

On devrait obtenir :

## add nodes and edges
{
 'directed': True,
 'edges': {'A': {'B': {'weight': 5}}, 
          'B': {}
          },
 'nodes': {'A': {}, 
           'B': {}
          },
 'weight_attribute': None,
 'weighted': False
}

1.6 Chargement d’un graphe à partir d’un fichier texte tabulé

Il existe des modules python qui pourraient nous simplifier cette tâche (le module pandas par exemple), mais pour des raisons pédagogiques et pour ne dépendre que d’un minimum de modules, ceci sera réalisé par nous-même.

La première ligne du fichier contient les noms des colonnes que l’on stockera afin de nommer les attributs sur chaque arc créé. Il faudra retirer les 2 premiers éléments de cette liste qui devraient valoir “node_id1” et “node_id2” et qui ne nous servent à rien.

Ensuite, pour chacune des ligne, il faudra la découper avec le caractère séparateur de colonnes, récupérer les identifiants des extrémités de l’arc, créér et remplir le dictionnaire contenant les attributs de l’arc, et enfin ajouter l’arc au graphe :

def read_delim(filename, column_separator='\t', directed=True, weighted=False, weight_attribute=None): 
    """
    Parse a text file which columns are separated by the specified column_separator and 
    returns a graph.
    
    line syntax: node_id1   node_id2    att1    att2    att3    ...
    """
    g = create_graph(directed, weighted, weight_attribute)
    with open(filename) as f: 
        # GET COLUMNS NAMES
        tmp = f.readline().rstrip()
        attNames= tmp.split(column_separator)
        # REMOVES FIRST TWO COLUMNS WHICH CORRESPONDS TO THE LABELS OF THE CONNECTED VERTICES
        attNames.pop(0)  # remove first column name (source node not to be in attribute names)
        attNames.pop(0)  # remove second column (target node ...)
        # PROCESS THE REMAINING LINES
        row = f.readline().rstrip()
        while row:
            vals = row.split(column_separator)
            u = vals.pop(0)
            v = vals.pop(0)
            att = {}
            for i in range(len(attNames)):
                att[ attNames[i] ] = vals[i]
            add_edge(g, u, v, att)
            row = f.readline().rstrip() # NEXT LINE
        return g

Télécharger le fichier dressing.tsv et tester que le chargement s’effectue correctement.

#!/bin/env python
import graphmaster as gm
from pprint import pprint
pprint(gm.read_delim('data/dressing.tsv'))

## {'directed': True,
##  'edges': {'ceinture': {'veste': {}},
##            'chaussettes': {'chaussures': {}},
##            'chaussures': {},
##            'chemise': {'ceinture': {}, 'cravate': {}},
##            'cravate': {'veste': {}},
##            'pantalon': {'ceinture': {}, 'chaussures': {}},
##            'sous-vetements': {'chaussures': {}, 'pantalon': {}},
##            'veste': {}},
##  'nodes': {'ceinture': {},
##            'chaussettes': {},
##            'chaussures': {},
##            'chemise': {},
##            'cravate': {},
##            'pantalon': {},
##            'sous-vetements': {},
##            'veste': {}},
##  'weight_attribute': None,
##  'weighted': False}

Graphe dressing

1.7 Parcours en largeur (BFS)

Algorithme vu en cours :

BFS(G, s)
 1 for each vertex u ∈ V[G]
 2    do state[u]← UNEXPLORED
 3       distance[u] ← ∞
 4       predecessor[u] ← NIL
 5 state[s] ← DISCOVERED
 6 d[s] ← 0
 7 queue ← ∅
 8 ENQUEUE(queue, s)
 9 while queue ≠ ∅
10    do u ← DEQUEUE(queue)
11       for each v ∈ Adj[u]
12          do if state[v] = UNEXPLORED
13             then state[v] ← DISCOVERED 
14                  distance[v] ← distance[u] + 1
15                  predecessor[v] ← u
16                  ENQUEUE(queue, v)
17        state[u] ← FINISHED

Implémenter cet algorithme en ajoutant une fonction bfs qui renvoie les dictionnaires state, distance et predecessor.

Sur dressing à partir de sous-vetements, on obtient :

import graphmaster as gm
from pprint import pprint
dressing = gm.read_delim('data/dressing.tsv')
pprint(gm.bfs(dressing, 'sous-vetements'))

## {'distance': {'ceinture': 2,
##               'chaussettes': inf,
##               'chaussures': 1,
##               'chemise': inf,
##               'cravate': inf,
##               'pantalon': 1,
##               'sous-vetements': 0,
##               'veste': 3},
##  'predecessor': {'ceinture': 'pantalon',
##                  'chaussures': 'sous-vetements',
##                  'pantalon': 'sous-vetements',
##                  'veste': 'ceinture'},
##  'source': 'sous-vetements',
##  'state': {'ceinture': 'finished',
##            'chaussettes': 'unexplored',
##            'chaussures': 'finished',
##            'chemise': 'unexplored',
##            'cravate': 'unexplored',
##            'pantalon': 'finished',
##            'sous-vetements': 'finished',
##            'veste': 'finished'}}

Utiliser à présent cette fonction pour déterminer s’il existe un chemin entre dnaK et bamB dans le graphe d’interaction protéine-protéine 511145.protein.links.experimental.txt généré plus tôt.

Quelle est la longueur du ou des plus courts chemins ?

Afficher ce chemin.

Résultat attendu :

d(dnaK, bamB) = 9

['511145.b0014',
 '511145.b0911',
 '511145.b0023',
 '511145.b3981',
 '511145.b3705',
 '511145.b3737',
 '511145.b3738',
 '511145.b3508',
 '511145.b4175',
 '511145.b2512']
 
 ou dnaK → rpsA → rpsT → secE → yidC → atpE → atpB → yhiD → hflC → bamB

1.8 Diamètre du graphe (1er algorithme naïf)

Maintenant qu’il est possible à partir d’un sommet de calculer sa distance à tout ceux qu’il peut atteindre avec bfs, nous allons pouvoir calculer le diamètre du graphe (plus long des plus courts chemins à partir de tous les sommets). Il s’agit donc d’utiliser bfs sur tous les sommets et de garder la plus grande distance observée entre paires de sommets. Cette approche est naïve et pas efficace mais a le mérite d’être simple. Nous verrons plus tard d’autres méthodes dont la complexité est meilleure.

#!/bin/env python

import graphmaster as gm

g = gm.read_delim('data/511145.protein.links.experimental.CC1.tsv', directed=False)

diameter = -1
u1=None
u2=None
for n in g['nodes']:
    search = gm.bfs(g, n)
    for k,v in search['distance'].items():
        if v > diameter:
            diameter = v
            u1 = n
            u2 = k

print(f'The diameter is ... {diameter} between {u1} and {u2}')

time ./scripts/graphmaster.bfs.diameter.py

## The diameter is ... 19 between 511145.b2554 and 511145.b0972
## 
## real 0m0.478s
## user 0m0.459s
## sys  0m0.006s

1.9 Sous-graphe induit

Ajouter une fonction à la librairie qui permet d’extraire un sous-graphe induit à partir d’un ensemble de sommets.

Il s’agit donc de compléter le code de la fonction ci-dessous :

def induced_subgraph(g, nodes):
    """
    Extract induced subgraph from a set of nodes.
    """
  sg = create_graph(g['directed'], g['weighted'], weight_attribute=g['weight_attribute'])
  # A COMPLETER
  return sg

Le sous-graphe induit par les sommets visités de dressing avec bfs à partir de sous-vetements :

{'directed': True,
 'edges': {'ceinture': {'veste': {}},
           'chaussures': {},
           'pantalon': {'ceinture': {}, 'chaussures': {}},
           'sous-vetements': {'chaussures': {}, 'pantalon': {}},
           'veste': {}},
 'nodes': {'ceinture': {},
           'chaussures': {},
           'pantalon': {},
           'sous-vetements': {},
           'veste': {}},
 'weight_attribute': None,
 'weighted': False}

Extraire le sous-graphe induit par les sommets visités depuis dnaK et utiliser la fonction ci-dessous pour sauvegarder le résultat dans le fichier 511145.protein.links.experimental.CC1.tsv.

def save_delim(g, filename, column_separator='\t'):
    # collect attribute names on edges
    attributes = set()
    for u in g['edges']:
        for v in g['edges'][u]:
            for a in g['edges'][u][v]:
                attributes.add(a)
    # header
    line = 'node_id1'+column_separator+'node_id2'+column_separator+column_separator.join(attributes)
    with open(filename, "w") as f:
        # write headder
        f.write(line+'\n')
        # iterate over each source node, target node and attribute name to write their values
        for u in g['edges']:
            for v in g['edges'][u]:
                line = u + column_separator + v
                for a in attributes:
                    val = str(g['edges'][u][v][a]) if a in g['edges'][u][v] else ''
                    line += column_separator + val
                f.write(line+'\n')

Vous pourrez charger le fichier dans Cytoscape pour le visualiser.

Le graphe enregistré et chargé contient A→B et B→A. Dans cytoscape, on peut faire Edit → Remove Duplicate Edges… en cochant Ignore edge direction pour simplifier le graphe.

1.10 Coefficient d’agglomération

Le coefficient d’agglomération pour un graphe est une mesure indiquant la probabilité que les sommets voisins d’un sommet donné soient aussi voisins. Il se mesure soit globalement sur le graphe, soit localement pour chacun des sommets.

Pour le calculer pour un sommet $i$, il s’agit de compter le nombre de triangles dans le graphe contenant le sommet $i$ et de le diviser par le nombre de paires de voisins de $i$. Une autre manière de le calculer est de compter le nombre d’arêtes dans le sous-graphe induit par les sommets voisins de $i$ et de le diviser par le nombre maximal d’arêtes si ce sous-graphe était complet.

Ajouter une fonction clustering_coefficient(g, node_id) qui renvoie le coefficient d’agglomération d’un sommet donné.

Quels sont les sommets ayant les plus grandes valeurs dans le graphe correspondant à la plus grande composante connexe extrait précédemment ?

Quelques valeurs (disponibles aussi dans Cytoscape dans les informations sur les sommets après avoir utilisé Tools → Analyze Network) :

'511145.b0014': 0.044444444444444446,
'511145.b0089': 0.8333333333333334,
'511145.b0145': 0.9333333333333333,
'511145.b0184': 0.7857142857142857,

2 TP n°2 : Gene Ontology, Annotations, DFS

Cette partie va permettre de commencer à utiliser le graphe de la Gene Ontology à travers l’implémentation du parcours en profondeur.

2.1 Gene Ontology

La Gene Ontology permet d’annoter avec un vocabulaire contrôlé les produits des gènes. Elle se compose de 3 domaines :

Molecular function
Biological process
Cellular component

Ces 3 domaines forment 3 graphes orientés sans circuit (DAG) si l’on ne tient compte que des relations is a et part of entre les termes (appelés GOTerm par la suite).

Elle est disponible au format OBO (pour Open Biological and Biomedical Ontologies) qui est un format texte avec des règles spécifiques dont voici l’extrait du début :

format-version: 1.2
data-version: releases/2022-10-07
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
..//..
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:
11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

Download go-basic.obo

curl http://current.geneontology.org/ontology/go-basic.obo > data/go-basic.obo

GO: virion component https://www.uniprot.org/uniprotkb?query=(proteome:UP000001985)

SARS-Cov2 annotations from EBI https://www.ebi.ac.uk/GOA/ en rouge pre-release ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/pre_release/

Fichier : http://ftp.ebi.ac.uk/pub/databases/GO/goa/pre_release/uniprot_sars-cov-2.gaf

protéines : cat data/uniprot_sars-cov-2.gaf | cut -f 2 | grep -v ‘^!’ | sort -u

P0DTC1
P0DTC2
P0DTC3
P0DTC4
P0DTC5
P0DTC6
P0DTC7
P0DTC9
P0DTD1
P0DTD2
P0DTD3
P0DTD8

URL UniProt à construire : https://www.uniprot.org/uniprotkb?query=%28accession%3AP0DTC1%29%20OR%20%28accession%3AP0DTC2%29%20OR%20%28accession%3AP0DTC3%29OR%20%28accession%3AP0DTC4%29OR%20%28accession%3AP0DTC5%29OR%20%28accession%3AP0DTC6%29OR%20%28accession%3AP0DTC7%29OR%20%28accession%3AP0DTC9%29OR%20%28accession%3AP0DTD1%29OR%20%28accession%3AP0DTD2%29OR%20%28accession%3AP0DTD3%29OR%20%28accession%3AP0DTD8%29

Colonnes intéressantes : Entry, Entry Name, Protein Names, Gene Ontology IDs

Celui là est bien protein les noms et descriptions des protéines mais le gaf/goa sera mieux pour les étudiants (pour que ce soit comme pour le projet)

vir = read_tsv('data/SARS.Cov2.uniprot-download_true_fields_accession_2Cid_2Cprotein_name_2Cgo_id_f-2022.10.27-10.15.13.19.tsv')
vir %>% separate_rows(`Gene Ontology IDs`, sep='; ')

2.2 Annotations

L’annotation des produits de gènes consiste à attribuer un ou plusieurs GOTerms à un produit de gène (noté GeneProduct par la suite). Les annotations sont produites de manière indépendante de la Gene Ontology en elle-même. Elles sont généralement disponibles au format GOA/GAF dont voici un extrait pour le SARS-Cov2.

head data/uniprot_sars-cov-2.gaf

## !gaf-version: 2.1
## !Generated: 2020-03-26 11:32
## !GO-version: http://purl.obolibrary.org/obo/go/releases/2020-03-24/extensions/go-plus.owl
## !
## UniProtKB    P0DTC1  P0DTC1      GO:0003723  GO_REF:0000043  IEA UniProtKB-KW:KW-0694    F   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt     
## UniProtKB    P0DTC1  P0DTC1      GO:0004518  GO_REF:0000043  IEA UniProtKB-KW:KW-0540    F   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt     
## UniProtKB    P0DTC1  P0DTC1      GO:0004519  GO_REF:0000043  IEA UniProtKB-KW:KW-0255    F   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt     
## UniProtKB    P0DTC1  P0DTC1      GO:0006508  GO_REF:0000043  IEA UniProtKB-KW:KW-0645    P   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt     
## UniProtKB    P0DTC1  P0DTC1      GO:0008233  GO_REF:0000043  IEA UniProtKB-KW:KW-0645    F   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt     
## UniProtKB    P0DTC1  P0DTC1      GO:0008234  GO_REF:0000043  IEA UniProtKB-KW:KW-0788    F   Replicase polyprotein 1a        protein taxon:2697049   20200321    UniProt

La description de ce que contiennent les colonnes est disponible par exemple sur http://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes/README

Ici, nous allons nous intéresser surtout aux colonnes n°2 (identifiant du GeneProduct, ici identifiant UniProt) et n°4 (identifiant du GOTerm)

Par exemple, les première lignes concernent la protéine P0DTC1 annotée avec les GOTerms GO:0003723, GO:0004518, GO:0004519, GO:0006508, … Le code IEA est un evidence code et indique d’où provient l’annotation → dans l’extrait plus haut, toutes ces annotations sont des prédictions (IEA) sans validation expérimentale ni vérification.

2.3 Module python geneontology

Pour cette partie, on vous fournit le début d’un module geneontology (qu’il faudra compléter pour le projet) et qui permet pour l’instant de charger le graphe à partir de ces 2 formats de fichier (OBO et GOA) et du module graphmaster (qu’il faudra aussi compléter pour le projet).

Il vous suffit donc de copier/coller le code suivant dans un fichier geneontology.py (comme pour graphmaster.py) :

#!/bin/env python

import re
import graphmaster as gm

def load_OBO(filename='go-basic.obo'):
    """
    parse the OBO file and returns the graph
    obsolete terms are discarded
    only is_a and part_of relationships are loaded

    Extract of a file to be parsed:
    [Term]
    id: GO:0000028
    name: ribosomal small subunit assembly
    namespace: biological_process
    def: "The aggregation, arrangement and bonding together of constituent RNAs and proteins to form the small ribosomal subunit." [GOC:jl]
    subset: gosubset_prok
    synonym: "30S ribosomal subunit assembly" NARROW [GOC:mah]
    synonym: "40S ribosomal subunit assembly" NARROW [GOC:mah]
    is_a: GO:0022618 ! ribonucleoprotein complex assembly
    relationship: part_of GO:0042255 ! ribosome assembly
    relationship: part_of GO:0042274 ! ribosomal small subunit biogenesis
    """
    def parseTerm(lines):
        # search for obsolete
        for l in lines:
            if l.startswith('is_obsolete: true'):
                return
        # otherwise create node
        go_id = re_go_id.match(lines.pop(0)).group(1)
        go_attr = gm.add_node(go_graph, go_id) # add node to graph and get the node attribute dict
        go_attr['type'] = 'GOTerm'
        for line in lines:
            if re_go_name.match(line): go_attr['name'] = re_go_name.match(line).group(1)
            elif re_go_namespace.match(line): go_attr['namespace'] = re_go_namespace.match(line).group(1)
            elif re_go_def.match(line): go_attr['def'] = re_go_def.match(line).group(1)
            elif re_go_alt_id.match(line): go_graph['alt_id'][ re_go_alt_id.match(line).group(1) ] = go_id  # alt_id → go_id
            elif re_go_is_a.match(line): 
                parent_id = re_go_is_a.match(line).group(1)
                gm.add_edge(go_graph, go_id, parent_id, { 'relationship': 'is a' })
            elif re_go_part_of.match(line): 
                parent_id = re_go_part_of.match(line).group(1)
                gm.add_edge(go_graph, go_id, parent_id, { 'relationship': 'part of' })
    # method main
    go_graph          = gm.create_graph(directed=True, weighted=False)
    go_graph['alt_id'] = {} # alternate GO ids
    # regexp to parse term lines
    re_go_id          = re.compile('^id:\s+(GO:\d+)\s*$')
    re_go_name        = re.compile('^name:\s+(.+)\s*$')
    re_go_namespace   = re.compile('^namespace:\s+(.+)\s*$')
    re_go_def         = re.compile('^def:\s+(.+)\s*$')
    re_go_alt_id      = re.compile('^alt_id:\s+(GO:\d+)\s*$')
    re_go_is_a        = re.compile('^is_a:\s+(GO:\d+)\s')
    re_go_xref        = re.compile('^xref:\s+(\S+)\s*$')
    re_go_part_of      = re.compile('^relationship:\s+part_of\s+(GO:\d+)\s')
    # buffer each term lines, then parse lines to create GOTerm node
    with open(filename) as f:
        line = f.readline().rstrip()
        # skip header until first [Term] is reached
        while not line.startswith('[Term]'): 
            line = f.readline().rstrip()
        buff = []  
        line = f.readline()
        stop = False
        while line and not stop:
            line = line.rstrip()
            # new Term
            if line.startswith('[Term]'):
                parseTerm(buff)
                buff=[]
            # last Term
            elif line.startswith('[Typedef]'):
                parseTerm(buff)
                stop=True
            # or append to buffer
            else:
                buff.append(line)
            line = f.readline()
    return go_graph

def load_GOA(go, filename, warnings=True):
    """
    parse GOA file and add annotated gene products to previsouly loaded graph go

    Extract of a file to be parsed:
    gaf-version: 2.1
    !GO-version: http://purl.obolibrary.org/obo/go/releases/2020-11-28/extensions/go-plus.owl
    UniProtKB       O05154  tagX            GO:0008360      GO_REF:0000043  IEA     UniProtKB-KW:KW-0133    P       Putative glycosyltransferase TagX       tagX|SAOUHSC_00644      protein 93061   20201128        UniProt 
            
    UniProtKB       O05154  tagX            GO:0016740      GO_REF:0000043  IEA     UniProtKB-KW:KW-0808    F       Putative glycosyltransferase TagX       tagX|SAOUHSC_00644      protein 93061   20201128        UniProt 
            
    UniProtKB       O05204  ahpF            GO:0000302      GO_REF:0000002  IEA     InterPro:IPR012081      P       Alkyl hydroperoxide reductase subunit F ahpF|SAOUHSC_00364      protein 93061   20201128        InterPro
            
        0        1       2   3       4             5          6        7      8             9                              10
                id    name        go_id               evidence-codes                     desc                           aliases

    GAF spec: http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/
    Column  Content                         Required?   Cardinality     Example
    1       DB                              required    1               UniProtKB
    2       DB Object ID                    required    1               P12345
    3       DB Object Symbol                required    1               PHO3
    4       Qualifier                       optional    0 or greater    NOT
    5       GO ID                           required    1               GO:0003993
    6       DB:Reference (|DB:Reference)    required    1 or greater    PMID:2676709
    7       Evidence Code                   required    1               IMP
    8       With (or) From                  optional    0 or greater    GO:0000346
    9       Aspect                          required    1               F
    10      DB Object Name                  optional    0 or 1          Toll-like receptor 4
    11      DB Object Synonym (|Synonym)    optional    0 or greater    hToll   Tollbooth
    12      DB Object Type                  required    1               protein
    13      Taxon(|taxon)                   required    1 or 2          taxon:9606
    14      Date                            required    1               20090118
    15      Assigned By                     required    1               SGD
    16      Annotation Extension            optional    0 or greater    part_of(CL:0000576)
    17      Gene Product Form ID            optional    0 or 1          UniProtKB:P12345-2
    """
    with open(filename) as f: 
        line = f.readline()
        while line:
            if not line.startswith('!'): # skip comments
                cols = line.rstrip().split('\t')
                gp_id = cols[1]
                gt_id = cols[4]
                if gt_id not in go['nodes']: # GOTerm not found search alternate ids
                    while gt_id not in go['nodes'] and gt_id in go['alt_id']:
                        gt_id = go['alt_id'][gt_id] # replace term by alternate
                if gt_id not in go['nodes']: # failure: warn user
                    if warnings:
                        print(f'Warning: could not attach a gene product ({gp_id}) to a non existing GO Term ({gt_id})')
                else: # success: GOTerm to attach to was found
                    # create node for gene product if not already present
                    if gp_id not in go['nodes']:
                        gp_attr = gm.add_node(go, gp_id, { 'id': gp_id, 'type': 'GeneProduct'})
                    # create or update gene product attributes
                    gp_attr = go['nodes'][gp_id]
                    gp_attr['name'] = cols[2]
                    gp_attr['desc'] = cols[9]
                    gp_attr['aliases'] = cols[10].split('|')
                    # attach gene product to GOTerm
                    gt_attr = go['nodes'][gt_id]
                    e_attr = gm.add_edge(go, gp_id, gt_id)
                    e_attr['relationship'] = 'annotation'
                    if 'evidence-codes' not in e_attr:
                        e_attr['evidence-codes'] = []
                    e_attr['evidence-codes'].append( cols[6] )
            line = f.readline()

Créer le module python geneontology à partir du code précédent. Il sera testé dans la partie qui suit.

2.4 Chargement du graphe à partir du module python

La Gene Ontology étant un peu grosse, pour ce TP, nous allons utiliser une toute petite partie extraite à partir du GOTerm virion component du domaine CC :

SARS-Cov2 annoté virion component

Télécharger les fichiers correspondant :

go.covid.cys: session Cytoscape enregistrée avec le sous-graphe CC virion component et les protéines annotées
go-virion_component.obo: Fichier OBO des termes à partir de virion component
uniprot_sars-cov-2.gaf: Fichier d’annotation des protéines du SARS-Cov2 (récupéré sur https://www.ebi.ac.uk/GOA/)

Une fois les fichiers récupérés (dans un sous-répertoire data/), vous pourrez les charger dans un script python de la forme :

#!/bin/env python

import geneontology as gom
from pprint import pprint

print('Loading OBO GeneOntology virion_component subgraph')
vir = gom.load_OBO('data/go-virion_component.obo')

print('loading GOA annotation file for SARS-Cov2')
gom.load_GOA(vir, 'data/uniprot_sars-cov-2.gaf', warnings=False)

pprint(vir)

Combien de sommets a le graphe ?
Combien d’arcs is_a, part_of et annotation ?

2.5 Parcours en profondeur (DFS)

Il s’agit à présent d’implémenter l’algorithme de parcours en profondeur vu en cours ainsi que ses utilisations (existence de cycle/circuit dans un graphe, tri topologique, …).

Algorithme de DFS :

DFS(G)
1 for each vertex u ∈ V[G]
2    do state[u]← UNEXPLORED
3       predecessor[u] ← NONE
4 time ← 0
5 for each vertex u ∈ V[G]
6    do if state[u] = UNEXPLORED
7       then DFS-VISIT(u)

DFS-VISIT(u)
1 state[u] ← DISCOVERED     ⟾ UNEXPLORED vertex has just been discovered
2 time ← time + 1
3 discovered[u] ← time
4 for each v ∈ Adjacent[u]   ⟾ Explore edge (u,v)
5    do if state[v] = UNEXPLORED
6       then predecessor[v] = u
             classification[ (u,v) ] = TREE EDGE
7            DFS-VISIT(v)
        else if state[v] = DISCOVERED
             then classification[ (u,v) ] = BACK EDGE
             else if discovered[u] > discovered[v]
                  then classification[ (u,v) ] = CROSS EDGE
                  else classification[ (u,v) ] = FORWARD EDGE
8 state[u] ← FINISHED        ⟾ Finished processing of u
9 finished[u] ← time ← time + 1

Ajouter la fonction dfs à votre module graphmaster à partir de l’algorithme.

Vous pourrez tester cette fonction sur le graphe vir ainsi que sur le graphe dressing.

La sortie attendue sur vir si les sommets sont parcourus par ordre alphabétique (ligne 5 de DFS, et ligne 4 de DFS-VISIT) est données en annexe.

2.5.1 Existence de cycle/circuit

Ajouter une fonction is_acyclic au module graphmaster qui renvoie vrai ou faux selon que le graphe est sans circuit ou non.

Vous pourrez la tester sur vir et dressing ainsi que sur le graphe (avec circuit) directed.graph.with.cycle.tsv correspondant à celui du cours (après la diapo sur l’algorithme de DFS).

2.5.2 Tri topologique

Ajouter une fonction topological_sort au module graphmaster qui renvoie les sommets du graphes en ayant effectué un tri topologique.

Vous pourrez la tester sur vir et dressing.

La sortie attendue (si l’on parcourt les sommets par ordre alphabétique) est donnée en annexe.

2.5.3 Pour aller plus loin

On pourra utiliser DFS pour identifier les composantes fortement connexes.

, ou encore calculer l’IC des GOTerms par rapport

3 TP n°3 : Bellman-Ford et Floyd-Warshall

3.1 Bellman-Ford implémentation

Rappel de l’algorithme :

INITIALIZE(G, s)
1 for each vertex u ∈ V[G]
2    do distance[u] ← ∞
3       predecessor[u] ← NIL
4 distance[s] ← 0

RELAX(u, v, weight)
1 if distance[v] > distance[u] + weight(u,v)
2    then distance[v] ← distance[u] + weight(u,v)
3         predecessor[v] ← u

BELLMAN-FORD(G, weight, source)
1 INITIALIZE(G, source)
2 for i ← 1 to |V[G]| - 1
3    do for each edge (u,v) ∈ E[G]
4          do RELAX(u, v, weight)
5 for each edge (u, v) 
6    do if distance[v] > distance[u] + weight(u,v)
7          then return FALSE
8 return TRUE

Ajoutez la fonction BellmanFord à votre module et testez la sur le graphe vu en cours graphe.Bellman-Ford.tsv à partir du sommet z.

Remarque : Pour simplifier le code, il peut être utile de créer des fonctions utilitaires :

edge_attrs(g, u, v) : qui renvoie le dictionnaire d’attributs pour l’arc (u,v) s’il existe et None sinon
weight(g, u, v) : qui renvoie le poids d’un arc/arête.
- Ce poids vaut ∞ si l’arc n’existe pas.
- Si le graphe n’est pas pondéré le poids vaut 1 et
- s’il est pondéré il est stocké dans l’attribut weight_attribute du graphe (si le graphe est pondéré) dans les attributs de l’arc.

Sortie attendue :

{'distance': {'u': 2.0, 'v': 4.0, 'x': 7.0, 'y': -2.0, 'z': 0},
 'predecessor': {'u': 'v', 'v': 'x', 'x': 'z', 'y': 'u', 'z': None}}

Graphe utilisé pour Bellman-Ford

3.2 Script STRINGdb meilleure chaîne

Dans cette partie, le but est, en utilisant l’algorithme Bellman-Ford, de proposer un script python qui affiche un des meilleurs chemins (ou plutôt chaînes pour être fidèle à la notion de chaîne dans un graphe non orienté) entre 2 sommets du graphe récupéré depuis STRINGdb au TP n°1 contenant tous les liens selon les sources de données : 511145.protein.links.detailed.v11.5.txt.gz.

zcat data/511145.protein.links.detailed.v11.5.txt.gz | head

## protein1 protein2 neighborhood fusion cooccurence coexpression experimental database textmining combined_score
## 511145.b0001 511145.b0810 0 0 0 0 0 0 162 161
## 511145.b0001 511145.b0946 0 0 0 0 0 0 200 200
## 511145.b0001 511145.b4403 0 0 0 0 0 0 363 363
## 511145.b0001 511145.b2126 0 0 0 0 0 0 231 230
## 511145.b0001 511145.b3769 0 0 0 0 0 0 299 299
## 511145.b0001 511145.b0943 0 0 0 0 0 0 163 163
## 511145.b0001 511145.b0075 0 0 0 0 0 0 619 619
## 511145.b0001 511145.b1002 0 0 0 0 0 0 239 238
## 511145.b0001 511145.b2023 0 0 0 0 0 0 297 297

Pour ce faire, le script devra récupérer en ligne de commande le sommet source, le sommet destination ainsi que le nom de l’attribut (colonne) à considérer pour le poids des arêtes.

De plus, les valeurs dans le fichier correspondent à la probabilité (multipliée par 1000 pour avoir des entiers) d’un lien entre les 2 sommets et ne sont donc pas des poids. Ainsi, le meilleur chemin serait celui qui aurait la meilleure probabilité, c’est-à-dire qu’il faudrait faire le produit des probabilités de chaque arête qui le compose. Afin de pouvoir utiliser BellmanFord en l’état, il faudra donc faire un traitement qui transforme le produit des probabilités en somme des poids. La solution la plus courante est de transformer les probabilités en $-\log_{10}$ ; ainsi, une probabilité de 0.8 devient 0.09691, une probabilité de 0.2 devient 0.69897, et avec cette transformation la valeur tant vers +∞ quand la probabilité tant vers 0.

x = seq(0,1, by=0.00001)
plot(x, -log10(x),type='l')

Une fois la transformation effectuée, on aura donc bien le meilleur chemin (le plus probable) avec $\prod P_i$ qui sera maximal quand $\sum -log(P_i)$ sera minimale.

Dans votre script, il faudra donc ajouter un nouvel attribut sur les arêtes (par exemple weight) qui correspond au -log10 de la probabilité du lien selon la colonne spécifiée par l’utilisateur, par exemple entre dnaK (511145.b0014) et bamB (511145.b2512) :

./scripts/graphmaster.STRINGdb_path.py 511145.b0014 511145.b2512 coexpression
 nodes:  4126
 edges:  541593
Graph with only coexpression edges > 800:
 nodes:  640
 edges:  2375

No path

ou à partir de experimental comme lors du TP n°1 :

time ./scripts/graphmaster.STRINGdb_path.py data/511145.protein.links.detailed.v11.5.txt 511145.b0014 511145.b2512 experimental
Full graph:
 nodes:  4126
 edges:  541593
Graph with only experimental edges > 800:
 nodes:  811
 edges:  3445

path:
 (511145.b0014, 511145.b0911) experimental = 870.0, weight = 0.060480747381381476
 (511145.b0911, 511145.b3985) experimental = 997.0, weight = 0.0013048416883442813
 (511145.b3985, 511145.b4203) experimental = 997.0, weight = 0.0013048416883442813
 (511145.b4203, 511145.b2287) experimental = 988.0, weight = 0.005243055412371884
 (511145.b2287, 511145.b2286) experimental = 999.0, weight = 0.0004345117740176917
 (511145.b2286, 511145.b3731) experimental = 992.0, weight = 0.0034883278458213473
 (511145.b3731, 511145.b3738) experimental = 998.0, weight = 0.0008694587126288915
 (511145.b3738, 511145.b3508) experimental = 947.0, weight = 0.0236500209967266
 (511145.b3508, 511145.b4175) experimental = 947.0, weight = 0.0236500209967266
 (511145.b4175, 511145.b2512) experimental = 877.0, weight = 0.057000406633959486

real    0m6.651s
user    0m6.574s
sys 0m0.592s

Remarque : le graphe avec tous les liens, y compris ceux dont la valeur de coexpression ou experimental était à 0, est relativement gros (4126 sommets et 541593 arêtes) et, pour accélérer le traitement, seul les liens > 800 ont été gardés.

L’algorithme étant en complexité $\mathcal{O}(VE)$, on voit ici la différence entre 4126 * 541593 ~ 2.2 10⁹ et le graphe plus réduit de 811 * 3445 ~ 2.8 10⁶.

Sur le graphe avec tous les liens > 0 :

time ./scripts/graphmaster.STRINGdb_path.py data/511145.protein.links.detailed.v11.5.txt 511145.b0014 511145.b2512 experimental
Full graph:
 nodes:  4126
 edges:  541593
Graph with only experimental edges > 0:
 nodes:  3695
 edges:  74532

path:
 (511145.b0014, 511145.b0911) experimental = 870.0, weight = 0.060480747381381476
 (511145.b0911, 511145.b3985) experimental = 997.0, weight = 0.0013048416883442813
 (511145.b3985, 511145.b4203) experimental = 997.0, weight = 0.0013048416883442813
 (511145.b4203, 511145.b2287) experimental = 988.0, weight = 0.005243055412371884
 (511145.b2287, 511145.b2286) experimental = 999.0, weight = 0.0004345117740176917
 (511145.b2286, 511145.b3731) experimental = 992.0, weight = 0.0034883278458213473
 (511145.b3731, 511145.b3738) experimental = 998.0, weight = 0.0008694587126288915
 (511145.b3738, 511145.b3508) experimental = 947.0, weight = 0.0236500209967266
 (511145.b3508, 511145.b4175) experimental = 947.0, weight = 0.0236500209967266
 (511145.b4175, 511145.b2512) experimental = 877.0, weight = 0.057000406633959486

real    9m1.146s
user    9m0.161s
sys 0m0.739s

On passe donc de ~9 minutes à quelques secondes.

3.3 Floyd-Warshall implémentation

Pour l’algorithme suivant,

D[x,y] est la distance du plus court chemin entre les sommets x et y
N[x,y] est le successeur de x dans le plus court chemin allant de x à y
W[x,y] est la valuation de l’arc (x,y)

Algorithme :

initialiser D = W, et N à partir des arcs/arêtes du graphe
pour k de 1 à n
   pour i de 1 à n
      pour j de 1 à n
         si D[i,k] + D[k,j] < D[i,j] alors
            D[i,j] = D[i,k] + D[k,j]
            N[i,j] = N[i,k]

Récupération du plus court chemin à partir de la matrice N :

plusCourtChemin(D,N, i,j)
   si D[i,j] est infinie alors
      il n'y a pas de chemin entre i et j
   chemin = initialiserChemin(i)
   k = N[i,j]
   tant que k est défini faire
      ajouter(chemin, k)
      k = N[k,j]
   ajouter(chemin, j)

Implémentez les fonctions FloydWarshall et FloydWarshallPath dans votre librairie. Pour cela, vous êtes forcement incité(e) à ajouter une fonction adjacency_matrix qui renvoie le graphe source forme de matrice d’adjacence.

Aide : Création d’une matrice de 10 lignes et 20 colonnes, initialisée avec des valeurs infinies avec numpy

import numpy as np
 
matrix = np.full( (10, 20), np.inf )

Testez ces fonctions sur le graphe graphe.Floyd-Warshall.tsv et affichez les plus courts chemins de A à C et de A à B (avec leur longueur).

Quel est le diamètre du graphe ? Affichez le chemin correspondant.

Graphe Floyd-Warshall

#### Floyd-Warshall ####
{'directed': True,
 'edges': {'A': {'B': {'weight': '3'},
                 'C': {'weight': '8'},
                 'E': {'weight': '-4'}},
           'B': {'D': {'weight': '1'}, 'E': {'weight': '7'}},
           'C': {'B': {'weight': '4'}},
           'D': {'A': {'weight': '2'}, 'C': {'weight': '-5'}},
           'E': {'D': {'weight': '6'}}},
 'nodes': {'A': {}, 'B': {}, 'C': {}, 'D': {}, 'E': {}},
 'weight_attribute': 'weight',
 'weighted': True}
D: 
array([[ 0.,  1., -3.,  2., -4.],
       [ 3.,  0., -4.,  1., -1.],
       [ 7.,  4.,  0.,  5.,  3.],
       [ 2., -1., -5.,  0., -2.],
       [ 8.,  5.,  1.,  6.,  0.]])
N: 
array([[0, 4, 4, 4, 4],
       [3, 1, 3, 3, 3],
       [1, 1, 2, 1, 1],
       [0, 2, 2, 3, 0],
       [3, 3, 3, 3, 4]])

### Shortest path from A to B
path: ['A', 'E', 'D', 'C', 'B'], length: 1.0

### from A to C
path: ['A', 'E', 'D', 'C'], length: -3.0

### from E to C
path: ['E', 'D', 'C'], length: 1.0

### from E to A
path: ['A', 'E', 'D', 'C'], length: -3.0

### graph diameter: 8.0

3.4 Pour aller plus loin

Implémentation des algorithmes de Dijkstra et Johnson.

4 TP n°4 : Librairies et modules existants. Prise en main de igraph

La librairie igraph met à disposition tout un ensemble de fonctions pour le traitement et la visualisation de graphes. Elle est codée en C. Nous allons utiliser aujourd’hui son interfaçage avec R. Pour la charger :

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

Vous trouverez la documentation de la librairie sur le site dédié. Pour celle de l’interface R en ligne : https://igraph.org/r/

4.1 Chargement et affichage

Pour charger un graphe (différents format possibles : pajek, newick, …) :

g = read.graph("http://silico.biotoul.fr/site/images/9/9f/Cleandb_Luca_1_S_1_1_65_Iso_Tr_1-CC1.tgr", directed=FALSE)
g

## IGRAPH 1f0b7fc U--- 42 408 -- 
## + edges from 1f0b7fc:
##   [1] 1-- 2 1-- 3 1-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1--10 1--11 1--12 1--13
##  [13] 1--14 1--15 1--16 1--17 1--18 1--19 1--20 1--21 1--22 1--23 1--24 1--25
##  [25] 1--26 1--27 2-- 6 2-- 7 2--10 2--16 2--17 2--18 2--21 2--22 2--23 2--26
##  [37] 3-- 4 3-- 8 3-- 9 3--11 3--12 3--14 3--15 3--16 3--19 3--20 3--24 3--25
##  [49] 3--27 3--28 3--29 3--30 3--31 3--32 3--33 3--34 3--35 3--36 3--37 3--38
##  [61] 3--39 4-- 8 4-- 9 4--11 4--12 4--14 4--15 4--16 4--19 4--20 4--24 4--25
##  [73] 4--27 4--28 4--29 4--30 4--31 4--32 4--33 4--34 4--35 4--36 4--37 4--38
##  [85] 4--39 5-- 6 5-- 7 5--13 5--16 5--17 5--18 5--21 5--22 5--26 6-- 7 6--10
##  [97] 6--13 6--14 6--16 6--17 6--18 6--21 6--22 6--23 6--26 7--10 7--13 7--14
## + ... omitted several edges

Consulter l’aide de la fonction (?read_graph) pour voir les autres formats supportés.

Pour l’afficher, plot(g) suffit, les paramètres par défauts seront utilisés et un algorithme de dessin sera sélectionné :

plot(g)

Pour l’afficher, en précisant l’algorithme de dessin (layout en anglais) :

plot(g, layout=layout.fruchterman.reingold)

Mais à chaque appel, le layout sera différent. On peut le sauvegarder pour le réutiliser (et ne pas le refaire pour chaque plot) :

lfr = layout.fruchterman.reingold(g)
plot(g, layout=lfr, vertex.size=3, vertex.label=NA)

Le layout n’est autre que les coordonnées x et y de chacun des sommets ; pour les 6 premiers :

head(lfr)

##             [,1]      [,2]
## [1,]  0.27333891 -3.728175
## [2,]  0.52364657 -6.081953
## [3,] -0.14961852 -2.121585
## [4,]  0.06007839 -2.189415
## [5,]  1.83723103 -5.249440
## [6,]  0.48114450 -5.454229

Remarque : le fait de sauvegarder la disposition des sommets

évite de la recalculer pour chaque plot,
permet d’avoir la même disposition à chaque plot (certains algorithmes, tel que FR, ne donnent pas la même disposition à chaque fois).

Consulter l’aide des fonction plot.igraph et layout.fructhterman.reingold pour voir les options ainsi que les autres algorithmes de dessin disponibles.

4.2 Accès aux sommets et arêtes

Pour obtenir la liste des sommets : V(g)

la liste des arêtes : E(g)

V(g)

## + 42/42 vertices, from 1f0b7fc:
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

E(g)

## + 408/408 edges from 1f0b7fc:
##   [1] 1-- 2 1-- 3 1-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1--10 1--11 1--12 1--13
##  [13] 1--14 1--15 1--16 1--17 1--18 1--19 1--20 1--21 1--22 1--23 1--24 1--25
##  [25] 1--26 1--27 2-- 6 2-- 7 2--10 2--16 2--17 2--18 2--21 2--22 2--23 2--26
##  [37] 3-- 4 3-- 8 3-- 9 3--11 3--12 3--14 3--15 3--16 3--19 3--20 3--24 3--25
##  [49] 3--27 3--28 3--29 3--30 3--31 3--32 3--33 3--34 3--35 3--36 3--37 3--38
##  [61] 3--39 4-- 8 4-- 9 4--11 4--12 4--14 4--15 4--16 4--19 4--20 4--24 4--25
##  [73] 4--27 4--28 4--29 4--30 4--31 4--32 4--33 4--34 4--35 4--36 4--37 4--38
##  [85] 4--39 5-- 6 5-- 7 5--13 5--16 5--17 5--18 5--21 5--22 5--26 6-- 7 6--10
##  [97] 6--13 6--14 6--16 6--17 6--18 6--21 6--22 6--23 6--26 7--10 7--13 7--14
## [109] 7--16 7--17 7--18 7--21 7--22 7--23 7--26 8-- 9 8--11 8--12 8--14 8--15
## + ... omitted several edges

Quel est l’ordre du graphe ?

Combien a-t-il d’arêtes ?

On peut assigner des étiquettes aux sommets : V(g)$name = vector_of_labels

Nous allons utiliser le fichier Cleandb_Luca_1_S_1_1_65_Iso_Tr_1-CC1.cod ; apperçu du contenu :

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## ✖ purrr::compose()       masks igraph::compose()
## ✖ tidyr::crossing()      masks igraph::crossing()
## ✖ dplyr::filter()        masks stats::filter()
## ✖ dplyr::groups()        masks igraph::groups()
## ✖ dplyr::lag()           masks stats::lag()
## ✖ purrr::simplify()      masks igraph::simplify()

vertex_names = read_tsv("http://silico.biotoul.fr/site/images/6/61/Cleandb_Luca_1_S_1_1_65_Iso_Tr_1-CC1.cod", col_names = F)

## Rows: 42 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): X2
## dbl (2): X1, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

vertex_names

La 2ème colonne contient le nom des sommets. Il s’agit donc de l’assigner aux sommets du graphe que l’on a déjà chargé :

V(g)$name=vertex_names$X2
plot(g, layout=lfr, vertex.size=3, vertex.label=V(g)$name)

Pour ne pas à avoir à repasser les mêmes paramètres à chaque utilisation de plot, on peut changer ceux par défaut :

igraph.options(vertex.label.cex=.6) # font size
igraph.options(vertex.label.family='sans')
igraph.options(vertex.size=3) 
igraph.options(vertex.color=NA) 

plot(g, layout=lfr)

Il est possible de stocker de l’information sur le graphe, les sommets et/ou les arêtes avec les fonctions dédiées :

vertex_attr(g, name="name")

##  [1] "AfulA01.AF0890"     "AperA01.APE0061"    "BantA01.AAP27658.1"
##  [4] "BcerA01.AAP10714.1" "BjapA01.BLR6158"    "BmelA02.BMEII0702" 
##  [7] "CkorA01.ACB07774.1" "CtepA01.BMPA"       "DradA01.DR2070"    
## [10] "HbutA01.ABM80170.1" "HwalA01.RBSB"       "LinnA01.TCSA"      
## [13] "MlotA01.MLL0501"    "NspeA01.ALR5361"    "PabyA01.TMPC-LIKE" 
## [16] "PaerA01.AAG03536.1" "ParsA01.ABP50747.1" "PcalA01.ABO08469.1"
## [19] "PfurA01.PF1695"     "PhorA01.PH1714"     "PislA01.ABL88151.1"
## [22] "PyaeA01.PAE3414"    "SmarA01.ABN69244.1" "SmutA01.SMU.1121C" 
## [25] "TmarA01.AAD35196.1" "TneuA01.ACB40241.1" "TonnA01.ACJ16081.1"
## [28] "AperA01.APE2592"    "BsubA01.YUFN"       "CkorA01.ACB07069.1"
## [31] "HbutA01.ABM81242.1" "MlotA01.MLR3148"    "OiheA01.OB3388"    
## [34] "ParsA01.ABP51144.1" "PcalA01.ABO09261.1" "PislA01.ABL87654.1"
## [37] "PyaeA01.PAE3121"    "SmarA01.ABN69796.1" "TneuA01.ACB40566.1"
## [40] "HmarA01.YUFN"       "HsalA01.RBSB"       "HspeA01.VNG0903C"

ou encore

V(g)$name

##  [1] "AfulA01.AF0890"     "AperA01.APE0061"    "BantA01.AAP27658.1"
##  [4] "BcerA01.AAP10714.1" "BjapA01.BLR6158"    "BmelA02.BMEII0702" 
##  [7] "CkorA01.ACB07774.1" "CtepA01.BMPA"       "DradA01.DR2070"    
## [10] "HbutA01.ABM80170.1" "HwalA01.RBSB"       "LinnA01.TCSA"      
## [13] "MlotA01.MLL0501"    "NspeA01.ALR5361"    "PabyA01.TMPC-LIKE" 
## [16] "PaerA01.AAG03536.1" "ParsA01.ABP50747.1" "PcalA01.ABO08469.1"
## [19] "PfurA01.PF1695"     "PhorA01.PH1714"     "PislA01.ABL88151.1"
## [22] "PyaeA01.PAE3414"    "SmarA01.ABN69244.1" "SmutA01.SMU.1121C" 
## [25] "TmarA01.AAD35196.1" "TneuA01.ACB40241.1" "TonnA01.ACJ16081.1"
## [28] "AperA01.APE2592"    "BsubA01.YUFN"       "CkorA01.ACB07069.1"
## [31] "HbutA01.ABM81242.1" "MlotA01.MLR3148"    "OiheA01.OB3388"    
## [34] "ParsA01.ABP51144.1" "PcalA01.ABO09261.1" "PislA01.ABL87654.1"
## [37] "PyaeA01.PAE3121"    "SmarA01.ABN69796.1" "TneuA01.ACB40566.1"
## [40] "HmarA01.YUFN"       "HsalA01.RBSB"       "HspeA01.VNG0903C"

Pour certains sommets seulement :

vertex_attr(g, name="name", index=10)

## [1] "HbutA01.ABM80170.1"

ou aussi :

V(g)[10]$name

## [1] "HbutA01.ABM80170.1"

4.3 diamètre, plus courts chemins

Quel est le diamètre du graphe ? (cf. diameter ou distances)

diameter(g)

## [1] 3

la même chose avec distances (longueur du plus court chemin entre chaque paire de sommets)

distances(g)[1:6, 1:6]

##                    AfulA01.AF0890 AperA01.APE0061 BantA01.AAP27658.1
## AfulA01.AF0890                  0               1                  1
## AperA01.APE0061                 1               0                  2
## BantA01.AAP27658.1              1               2                  0
## BcerA01.AAP10714.1              1               2                  1
## BjapA01.BLR6158                 1               2                  2
## BmelA02.BMEII0702               1               1                  2
##                    BcerA01.AAP10714.1 BjapA01.BLR6158 BmelA02.BMEII0702
## AfulA01.AF0890                      1               1                 1
## AperA01.APE0061                     2               2                 1
## BantA01.AAP27658.1                  1               2                 2
## BcerA01.AAP10714.1                  0               2                 2
## BjapA01.BLR6158                     2               0                 1
## BmelA02.BMEII0702                   2               1                 0

max(distances(g))

## [1] 3

4.4 Isomorphisme

Prenons 2 petits graphes dont on sait qu’ils sont isomorphes :

df1 = data.frame(from=c('A', 'B', 'C', 'D', 'E', 'A'), to=c('B', 'C', 'D', 'E', 'A', 'C'))
df1

g1 = graph_from_data_frame(df1)
plot(g1, vertex.label.cex=1, vertex.size=20)

df2 = data.frame(from=c( 'V', 'W', 'X', 'U', 'U','Y'), to=c( 'W', 'X', 'Y','V', 'W', 'U'))
g2 = graph_from_data_frame(df2)
par(mfrow=c(1,2))
plot(g1, vertex.label.cex=1, vertex.size=20)
plot(g2, vertex.label.cex=1, vertex.size=20)

isomorphic(g1,g2)

## [1] TRUE

Matrices d’adjacences :

get.adjacency(g1)

## 5 x 5 sparse Matrix of class "dgCMatrix"
##   A B C D E
## A . 1 1 . .
## B . . 1 . .
## C . . . 1 .
## D . . . . 1
## E 1 . . . .

get.adjacency(g2)

## 5 x 5 sparse Matrix of class "dgCMatrix"
##   V W X U Y
## V . 1 . . .
## W . . 1 . .
## X . . . . 1
## U 1 1 . . .
## Y . . . 1 .

Les matrices ne sont pas les mêmes.

Obtention des représentations canoniques.

canonical.permutation(g1)

## $labeling
## [1] 5 3 4 1 2
## 
## $info
## $info$nof_nodes
## [1] 1
## 
## $info$nof_leaf_nodes
## [1] 1
## 
## $info$nof_bad_nodes
## [1] 0
## 
## $info$nof_canupdates
## [1] 0
## 
## $info$max_level
## [1] 0
## 
## $info$group_size
## [1] "1"

Et la matrice correspondante

get.adjacency(permute.vertices(g1, canonical.permutation(g1)$labeling))

## 5 x 5 sparse Matrix of class "dgCMatrix"
##   D E B C A
## D . 1 . . .
## E . . . . 1
## B . . . 1 .
## C 1 . . . .
## A . . 1 1 .

De même pour g2 :

get.adjacency(permute.vertices(g2, canonical.permutation(g2)$labeling))

## 5 x 5 sparse Matrix of class "dgCMatrix"
##   X Y V W U
## X . 1 . . .
## Y . . . . 1
## V . . . 1 .
## W 1 . . . .
## U . . 1 1 .

Nous obtenons bien les mêmes matrices si l’on ignore les noms des lignes/colonnes.

4.5 Autres fonctions

Longueur moyenne des plus courts chemins (sans valuation) (average.path.length)

average.path.length(g)

## [1] 1.702671

Lister les composantes connexes (clusters) avant et après suppression du point d’articulation

clusters(g)

## $membership
##     AfulA01.AF0890    AperA01.APE0061 BantA01.AAP27658.1 BcerA01.AAP10714.1 
##                  1                  1                  1                  1 
##    BjapA01.BLR6158  BmelA02.BMEII0702 CkorA01.ACB07774.1       CtepA01.BMPA 
##                  1                  1                  1                  1 
##     DradA01.DR2070 HbutA01.ABM80170.1       HwalA01.RBSB       LinnA01.TCSA 
##                  1                  1                  1                  1 
##    MlotA01.MLL0501    NspeA01.ALR5361  PabyA01.TMPC-LIKE PaerA01.AAG03536.1 
##                  1                  1                  1                  1 
## ParsA01.ABP50747.1 PcalA01.ABO08469.1     PfurA01.PF1695     PhorA01.PH1714 
##                  1                  1                  1                  1 
## PislA01.ABL88151.1    PyaeA01.PAE3414 SmarA01.ABN69244.1  SmutA01.SMU.1121C 
##                  1                  1                  1                  1 
## TmarA01.AAD35196.1 TneuA01.ACB40241.1 TonnA01.ACJ16081.1    AperA01.APE2592 
##                  1                  1                  1                  1 
##       BsubA01.YUFN CkorA01.ACB07069.1 HbutA01.ABM81242.1    MlotA01.MLR3148 
##                  1                  1                  1                  1 
##     OiheA01.OB3388 ParsA01.ABP51144.1 PcalA01.ABO09261.1 PislA01.ABL87654.1 
##                  1                  1                  1                  1 
##    PyaeA01.PAE3121 SmarA01.ABN69796.1 TneuA01.ACB40566.1       HmarA01.YUFN 
##                  1                  1                  1                  1 
##       HsalA01.RBSB   HspeA01.VNG0903C 
##                  1                  1 
## 
## $csize
## [1] 42
## 
## $no
## [1] 1

Une seule pour ce graphe.

Lister les points d’articulation (articulation.points)

articulation.points(g)

## + 1/42 vertex, named, from 1f0b7fc:
## [1] HwalA01.RBSB

Suppression du point d’articulation

g1 = delete_vertices(g, articulation_points(g))
plot(g1)

ou aussi

plot(g - articulation_points(g))

Obtenir le line graph (line.graph)

lg = line.graph((g))
par(mfrow=c(1,2))
plot(g, vertex.label=NA)
plot(lg, vertex.label=NA)

4.6 Algorithmes de parcours et de plus courts chemins

Graphe dressing

tb = read_tsv('http://silico.biotoul.fr/enseignement/m1/graph/dressing.tsv')

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 9 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): id1, id2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dressing = graph_from_data_frame(tb, directed=T)
plot(dressing, vertex.size=15)

4.6.1 BFS

p=bfs(dressing, root='sous-vetements', dist=TRUE, unreachable=FALSE)
p$dist

##    chaussettes       pantalon        chemise        cravate       ceinture 
##            NaN              1            NaN            NaN              2 
## sous-vetements     chaussures          veste 
##              0              1              3

4.6.2 DFS et ses utilisations

p = dfs(dressing, root='sous-vetements', unreachable=T, father=T, order.out=T)
p

## $root
## [1] 5
## 
## $mode
## [1] "out"
## 
## $order
## + 8/8 vertices, named, from 9267495:
## [1] sous-vetements pantalon       ceinture       veste          chaussures    
## [6] chaussettes    chemise        cravate       
## 
## $order.out
## + 8/8 vertices, named, from 9267495:
## [1] veste          ceinture       chaussures     pantalon       sous-vetements
## [6] chaussettes    cravate        chemise       
## 
## $father
## + 8/8 vertices, named, from 9267495:
## [1] chaussettes    pantalon       chemise        cravate        ceinture      
## [6] sous-vetements chaussures     veste         
## 
## $dist
## NULL
## 
## $neimode
## [1] "out"

Sans circuit/cycle ?

is_dag(dressing)

## [1] TRUE

Tri topologique

topological.sort(dressing)

## + 8/8 vertices, named, from 9267495:
## [1] chaussettes    chemise        sous-vetements cravate        pantalon      
## [6] ceinture       chaussures     veste

4.7 Bellman-Ford

Pour cette partie, on utilisera le graphe du cours :

df.bf = read_tsv('http://silico.biotoul.fr/enseignement/m1/graph/graphe.Bellman-Ford.tsv')

## Rows: 10 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): id1, id2
## dbl (1): weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bf = graph_from_data_frame(df.bf)
bf

## IGRAPH 1d91ee0 DNW- 5 10 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 1d91ee0 (vertex names):
##  [1] z->u z->x u->v u->y u->x v->u y->z y->v x->v x->y

E(bf)$weight

##  [1]  6  7  5 -4  8 -2  2  7 -3  9

plot(bf, vertex.size=15, edge.label=E(bf)$weight)

## Warning in v(graph): Non-positive edge weight found, ignoring all weights during
## graph layout.

Amélioration de l’affichage pour distinguer les poids des arcs :

E(bf)$curved = 0.2
E(bf)$color = E(bf)
V(bf)$color = V(bf)
bf.layout=layout.circle(bf)
plot(bf, layout=bf.layout, vertex.size=15, edge.label=E(bf)$weight)

Utilisation de l’algorithme Bellman-Ford pour les distances à partir de z :

d=distances(bf, v='z', mode='out', algo='bellman-ford', weights = E(bf)$weight)
d

##   z u v  y x
## z 0 2 4 -2 7

plot(bf, layout=bf.layout, edge.label=E(bf)$weight, vertex.label = paste(V(bf)$name,':', d), vertex.size=35 )

4.8 Johnson

Graphe utilisé pour Floyd-Warshall

df.fw = read_tsv('http://silico.biotoul.fr/enseignement/m1/graph/graphe.Floyd-Warshall.tsv')

## Rows: 9 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): id1, id2
## dbl (1): weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fw = graph_from_data_frame(df.fw)
fw

## IGRAPH 320b0b5 DNW- 5 9 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 320b0b5 (vertex names):
## [1] A->B A->C A->E B->D B->E C->B D->A D->C E->D

E(fw)$weight

## [1]  3  8 -4  1  7  4  2 -5  6

fw.layout=layout.circle(fw)
plot(fw, layout=fw.layout, edge.label=E(fw)$weight, vertex.size=15)

Plus courtes distances entre toutes les paires de sommets :

shortest.paths(fw, mode='out')

##   A  B  C D  E
## A 0  1 -3 2 -4
## B 3  0 -4 1 -1
## C 7  4  0 5  3
## D 2 -1 -5 0 -2
## E 8  5  1 6  0

4.9 Applications sur le graphe de STRINGdb

Chargement des données

links = read_delim('data/511145.protein.links.detailed.v11.5.txt', delim = ' ')

## Rows: 1083186 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): protein1, protein2
## dbl (8): neighborhood, fusion, cooccurence, coexpression, experimental, data...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

links

Informations sur les sommets

proteins = read_tsv('data/511145.protein.info.v11.5.txt')

## Rows: 4127 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): #string_protein_id, preferred_name, annotation
## dbl (1): protein_size
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

proteins

Pour avoir un graphe plus petit, on va filtrer sur les valeurs supérieures à 800 pour coexpression ou experimental

links.filtered = links %>%
  filter((coexpression>800 | experimental>800) & protein1 < protein2)
links.filtered

Création du graphe

g = graph_from_data_frame(d = links.filtered, directed = F, vertices = proteins)
g

## IGRAPH 1f3249d UN-- 4127 4579 -- 
## + attr: name (v/c), preferred_name (v/c), protein_size (v/n),
## | annotation (v/c), neighborhood (e/n), fusion (e/n), cooccurence
## | (e/n), coexpression (e/n), experimental (e/n), database (e/n),
## | textmining (e/n), combined_score (e/n)
## + edges from 1f3249d (vertex names):
##  [1] 511145.b0002--511145.b0004 511145.b0002--511145.b0003
##  [3] 511145.b0003--511145.b0004 511145.b0004--511145.b3940
##  [5] 511145.b0007--511145.b0333 511145.b0007--511145.b0720
##  [7] 511145.b0009--511145.b1727 511145.b0014--511145.b3932
##  [9] 511145.b0014--511145.b2231 511145.b0014--511145.b4142
## + ... omitted several edges

plot(g, vertex.label=NA)

Peu utile, bien sûr. Concentrons-nous sur les composantes connexes. Combien y en a-t-il ?

CCs = clusters(g)
CCs$no

## [1] 3133

Quelle est la plus grosse et quelle taille fait-elle ?

max(CCs$csize)

## [1] 652

652 sommets, quel est son n° ?

which(CCs$csize == max(CCs$csize))

## [1] 5

Extraction du sous-graphe induit

cc1 = induced.subgraph(g, CCs$membership == which(CCs$csize == max(CCs$csize)))
cc1

## IGRAPH 17f29e5 UN-- 652 4127 -- 
## + attr: name (v/c), preferred_name (v/c), protein_size (v/n),
## | annotation (v/c), neighborhood (e/n), fusion (e/n), cooccurence
## | (e/n), coexpression (e/n), experimental (e/n), database (e/n),
## | textmining (e/n), combined_score (e/n)
## + edges from 17f29e5 (vertex names):
##  [1] 511145.b0014--511145.b0015 511145.b0032--511145.b0033
##  [3] 511145.b0083--511145.b0084 511145.b0083--511145.b0089
##  [5] 511145.b0084--511145.b0089 511145.b0083--511145.b0093
##  [7] 511145.b0084--511145.b0093 511145.b0089--511145.b0093
##  [9] 511145.b0094--511145.b0095 511145.b0114--511145.b0115
## + ... omitted several edges

cc1.fr =  layout.fruchterman.reingold(cc1)
plot(cc1, layout=cc1.fr, vertex.label=NA)

Graph has double edges !!!

5 Annexes

Fichiers :

5.1 Environnement conda/mamba

mamba create -n graph22 numpy pandas igraph   r-tidyverse r-igraph r-reticulate  r-rmdformats r-factoextra r-dt r-kableextra

5.2 Sortie de DFS sur vir (sommets triés alphabétiquement)

DFS on vir (nodes are sorted)
{'discovered': {'GO:0019013': 1,
                'GO:0019028': 2,
                'GO:0019029': 7,
                'GO:0019030': 9,
                'GO:0019031': 11,
                'GO:0019033': 15,
                'GO:0036338': 12,
                'GO:0039615': 17,
                'GO:0039616': 19,
                'GO:0039617': 21,
                'GO:0039618': 23,
                'GO:0039619': 25,
                'GO:0039620': 27,
                'GO:0039621': 29,
                'GO:0039622': 31,
                'GO:0039623': 33,
                'GO:0039624': 35,
                'GO:0039625': 37,
                'GO:0039626': 39,
                'GO:0039627': 41,
                'GO:0039628': 43,
                'GO:0039629': 45,
                'GO:0039641': 47,
                'GO:0039642': 49,
                'GO:0039670': 51,
                'GO:0044423': 3,
                'GO:0046727': 55,
                'GO:0046729': 57,
                'GO:0046798': 59,
                'GO:0046806': 61,
                'GO:0055036': 63,
                'GO:0098015': 65,
                'GO:0098017': 67,
                'GO:0098018': 69,
                'GO:0098021': 52,
                'GO:0098022': 71,
                'GO:0098023': 73,
                'GO:0098024': 75,
                'GO:0098025': 77,
                'GO:0098026': 79,
                'GO:0098027': 81,
                'GO:0098028': 83,
                'GO:0098029': 85,
                'GO:0098030': 87,
                'GO:0098031': 89,
                'GO:0098032': 91,
                'GO:0098033': 93,
                'GO:0098061': 95,
                'P0DTC2': 97,
                'P0DTC3': 99,
                'P0DTC5': 101,
                'P0DTC7': 103,
                'P0DTC9': 105},
 'finished': {'GO:0019013': 6,
              'GO:0019028': 5,
              'GO:0019029': 8,
              'GO:0019030': 10,
              'GO:0019031': 14,
              'GO:0019033': 16,
              'GO:0036338': 13,
              'GO:0039615': 18,
              'GO:0039616': 20,
              'GO:0039617': 22,
              'GO:0039618': 24,
              'GO:0039619': 26,
              'GO:0039620': 28,
              'GO:0039621': 30,
              'GO:0039622': 32,
              'GO:0039623': 34,
              'GO:0039624': 36,
              'GO:0039625': 38,
              'GO:0039626': 40,
              'GO:0039627': 42,
              'GO:0039628': 44,
              'GO:0039629': 46,
              'GO:0039641': 48,
              'GO:0039642': 50,
              'GO:0039670': 54,
              'GO:0044423': 4,
              'GO:0046727': 56,
              'GO:0046729': 58,
              'GO:0046798': 60,
              'GO:0046806': 62,
              'GO:0055036': 64,
              'GO:0098015': 66,
              'GO:0098017': 68,
              'GO:0098018': 70,
              'GO:0098021': 53,
              'GO:0098022': 72,
              'GO:0098023': 74,
              'GO:0098024': 76,
              'GO:0098025': 78,
              'GO:0098026': 80,
              'GO:0098027': 82,
              'GO:0098028': 84,
              'GO:0098029': 86,
              'GO:0098030': 88,
              'GO:0098031': 90,
              'GO:0098032': 92,
              'GO:0098033': 94,
              'GO:0098061': 96,
              'P0DTC2': 98,
              'P0DTC3': 100,
              'P0DTC5': 102,
              'P0DTC7': 104,
              'P0DTC9': 106},
 'predecessor': {'GO:0019013': None,
                 'GO:0019028': 'GO:0019013',
                 'GO:0019029': None,
                 'GO:0019030': None,
                 'GO:0019031': None,
                 'GO:0019033': None,
                 'GO:0036338': 'GO:0019031',
                 'GO:0039615': None,
                 'GO:0039616': None,
                 'GO:0039617': None,
                 'GO:0039618': None,
                 'GO:0039619': None,
                 'GO:0039620': None,
                 'GO:0039621': None,
                 'GO:0039622': None,
                 'GO:0039623': None,
                 'GO:0039624': None,
                 'GO:0039625': None,
                 'GO:0039626': None,
                 'GO:0039627': None,
                 'GO:0039628': None,
                 'GO:0039629': None,
                 'GO:0039641': None,
                 'GO:0039642': None,
                 'GO:0039670': None,
                 'GO:0044423': 'GO:0019028',
                 'GO:0046727': None,
                 'GO:0046729': None,
                 'GO:0046798': None,
                 'GO:0046806': None,
                 'GO:0055036': None,
                 'GO:0098015': None,
                 'GO:0098017': None,
                 'GO:0098018': None,
                 'GO:0098021': 'GO:0039670',
                 'GO:0098022': None,
                 'GO:0098023': None,
                 'GO:0098024': None,
                 'GO:0098025': None,
                 'GO:0098026': None,
                 'GO:0098027': None,
                 'GO:0098028': None,
                 'GO:0098029': None,
                 'GO:0098030': None,
                 'GO:0098031': None,
                 'GO:0098032': None,
                 'GO:0098033': None,
                 'GO:0098061': None,
                 'P0DTC2': None,
                 'P0DTC3': None,
                 'P0DTC5': None,
                 'P0DTC7': None,
                 'P0DTC9': None},
 'state': {'GO:0019013': 'finished',
           'GO:0019028': 'finished',
           'GO:0019029': 'finished',
           'GO:0019030': 'finished',
           'GO:0019031': 'finished',
           'GO:0019033': 'finished',
           'GO:0036338': 'finished',
           'GO:0039615': 'finished',
           'GO:0039616': 'finished',
           'GO:0039617': 'finished',
           'GO:0039618': 'finished',
           'GO:0039619': 'finished',
           'GO:0039620': 'finished',
           'GO:0039621': 'finished',
           'GO:0039622': 'finished',
           'GO:0039623': 'finished',
           'GO:0039624': 'finished',
           'GO:0039625': 'finished',
           'GO:0039626': 'finished',
           'GO:0039627': 'finished',
           'GO:0039628': 'finished',
           'GO:0039629': 'finished',
           'GO:0039641': 'finished',
           'GO:0039642': 'finished',
           'GO:0039670': 'finished',
           'GO:0044423': 'finished',
           'GO:0046727': 'finished',
           'GO:0046729': 'finished',
           'GO:0046798': 'finished',
           'GO:0046806': 'finished',
           'GO:0055036': 'finished',
           'GO:0098015': 'finished',
           'GO:0098017': 'finished',
           'GO:0098018': 'finished',
           'GO:0098021': 'finished',
           'GO:0098022': 'finished',
           'GO:0098023': 'finished',
           'GO:0098024': 'finished',
           'GO:0098025': 'finished',
           'GO:0098026': 'finished',
           'GO:0098027': 'finished',
           'GO:0098028': 'finished',
           'GO:0098029': 'finished',
           'GO:0098030': 'finished',
           'GO:0098031': 'finished',
           'GO:0098032': 'finished',
           'GO:0098033': 'finished',
           'GO:0098061': 'finished',
           'P0DTC2': 'finished',
           'P0DTC3': 'finished',
           'P0DTC5': 'finished',
           'P0DTC7': 'finished',
           'P0DTC9': 'finished'},
 'time': 106,
 'type': {('GO:0019013', 'GO:0019028'): 'tree',
          ('GO:0019013', 'GO:0044423'): 'forward',
          ('GO:0019028', 'GO:0044423'): 'tree',
          ('GO:0019029', 'GO:0019028'): 'cross',
          ('GO:0019030', 'GO:0019028'): 'cross',
          ('GO:0019031', 'GO:0036338'): 'tree',
          ('GO:0019033', 'GO:0044423'): 'cross',
          ('GO:0036338', 'GO:0044423'): 'cross',
          ('GO:0039615', 'GO:0019030'): 'cross',
          ('GO:0039616', 'GO:0019030'): 'cross',
          ('GO:0039617', 'GO:0019030'): 'cross',
          ('GO:0039618', 'GO:0019030'): 'cross',
          ('GO:0039619', 'GO:0019030'): 'cross',
          ('GO:0039620', 'GO:0019030'): 'cross',
          ('GO:0039621', 'GO:0019030'): 'cross',
          ('GO:0039622', 'GO:0019030'): 'cross',
          ('GO:0039623', 'GO:0019030'): 'cross',
          ('GO:0039624', 'GO:0019030'): 'cross',
          ('GO:0039624', 'GO:0044423'): 'cross',
          ('GO:0039625', 'GO:0019030'): 'cross',
          ('GO:0039625', 'GO:0044423'): 'cross',
          ('GO:0039626', 'GO:0019030'): 'cross',
          ('GO:0039626', 'GO:0044423'): 'cross',
          ('GO:0039627', 'GO:0019030'): 'cross',
          ('GO:0039628', 'GO:0019030'): 'cross',
          ('GO:0039629', 'GO:0019030'): 'cross',
          ('GO:0039641', 'GO:0036338'): 'cross',
          ('GO:0039642', 'GO:0044423'): 'cross',
          ('GO:0039670', 'GO:0019030'): 'cross',
          ('GO:0039670', 'GO:0098021'): 'tree',
          ('GO:0046727', 'GO:0019028'): 'cross',
          ('GO:0046727', 'GO:0044423'): 'cross',
          ('GO:0046729', 'GO:0044423'): 'cross',
          ('GO:0046798', 'GO:0019028'): 'cross',
          ('GO:0046806', 'GO:0019028'): 'cross',
          ('GO:0055036', 'GO:0044423'): 'cross',
          ('GO:0098015', 'GO:0044423'): 'cross',
          ('GO:0098017', 'GO:0046727'): 'cross',
          ('GO:0098018', 'GO:0046727'): 'cross',
          ('GO:0098021', 'GO:0019028'): 'cross',
          ('GO:0098021', 'GO:0044423'): 'cross',
          ('GO:0098022', 'GO:0098021'): 'cross',
          ('GO:0098023', 'GO:0044423'): 'cross',
          ('GO:0098023', 'GO:0098015'): 'cross',
          ('GO:0098024', 'GO:0044423'): 'cross',
          ('GO:0098024', 'GO:0098015'): 'cross',
          ('GO:0098025', 'GO:0044423'): 'cross',
          ('GO:0098025', 'GO:0098015'): 'cross',
          ('GO:0098026', 'GO:0044423'): 'cross',
          ('GO:0098026', 'GO:0098015'): 'cross',
          ('GO:0098027', 'GO:0044423'): 'cross',
          ('GO:0098027', 'GO:0098015'): 'cross',
          ('GO:0098028', 'GO:0044423'): 'cross',
          ('GO:0098028', 'GO:0098015'): 'cross',
          ('GO:0098029', 'GO:0019030'): 'cross',
          ('GO:0098029', 'GO:0044423'): 'cross',
          ('GO:0098030', 'GO:0019030'): 'cross',
          ('GO:0098030', 'GO:0044423'): 'cross',
          ('GO:0098031', 'GO:0019030'): 'cross',
          ('GO:0098031', 'GO:0044423'): 'cross',
          ('GO:0098032', 'GO:0098022'): 'cross',
          ('GO:0098032', 'GO:0098031'): 'cross',
          ('GO:0098033', 'GO:0098022'): 'cross',
          ('GO:0098033', 'GO:0098030'): 'cross',
          ('GO:0098061', 'GO:0044423'): 'cross',
          ('P0DTC2', 'GO:0019031'): 'cross',
          ('P0DTC2', 'GO:0044423'): 'cross',
          ('P0DTC2', 'GO:0055036'): 'cross',
          ('P0DTC3', 'GO:0044423'): 'cross',
          ('P0DTC5', 'GO:0019031'): 'cross',
          ('P0DTC5', 'GO:0044423'): 'cross',
          ('P0DTC5', 'GO:0055036'): 'cross',
          ('P0DTC7', 'GO:0044423'): 'cross',
          ('P0DTC9', 'GO:0019013'): 'cross',
          ('P0DTC9', 'GO:0044423'): 'cross'}}

5.3 Sortie de DFS sur dressing (sommets triés alphabétiquement)

{'discovered': {'ceinture': 1,
                'chaussettes': 5,
                'chaussures': 6,
                'chemise': 9,
                'cravate': 10,
                'pantalon': 13,
                'sous-vetements': 15,
                'veste': 2},
 'finished': {'ceinture': 4,
              'chaussettes': 8,
              'chaussures': 7,
              'chemise': 12,
              'cravate': 11,
              'pantalon': 14,
              'sous-vetements': 16,
              'veste': 3},
 'predecessor': {'ceinture': None,
                 'chaussettes': None,
                 'chaussures': 'chaussettes',
                 'chemise': None,
                 'cravate': 'chemise',
                 'pantalon': None,
                 'sous-vetements': None,
                 'veste': 'ceinture'},
 'state': {'ceinture': 'finished',
           'chaussettes': 'finished',
           'chaussures': 'finished',
           'chemise': 'finished',
           'cravate': 'finished',
           'pantalon': 'finished',
           'sous-vetements': 'finished',
           'veste': 'finished'},
 'time': 16,
 'type': {('ceinture', 'veste'): 'tree',
          ('chaussettes', 'chaussures'): 'tree',
          ('chemise', 'ceinture'): 'cross',
          ('chemise', 'cravate'): 'tree',
          ('cravate', 'veste'): 'cross',
          ('pantalon', 'ceinture'): 'cross',
          ('pantalon', 'chaussures'): 'cross',
          ('sous-vetements', 'chaussures'): 'cross',
          ('sous-vetements', 'pantalon'): 'cross'}}

5.4 Sortie du tri topologique sur vir

P0DTC9 N (finished: 106)
P0DTC7 7a (finished: 104)
P0DTC5 P0DTC5 (finished: 102)
P0DTC3 3a (finished: 100)
P0DTC2 S (finished: 98)
GO:0098061 viral capsid, internal space (finished: 96)
GO:0098033 icosahedral viral capsid, neck fiber (finished: 94)
GO:0098032 icosahedral viral capsid, collar fiber (finished: 92)
GO:0098031 icosahedral viral capsid, collar (finished: 90)
GO:0098030 icosahedral viral capsid, neck (finished: 88)
GO:0098029 icosahedral viral capsid, spike (finished: 86)
GO:0098028 virus tail, shaft (finished: 84)
GO:0098027 virus tail, sheath (finished: 82)
GO:0098026 virus tail, tube (finished: 80)
GO:0098025 virus tail, baseplate (finished: 78)
GO:0098024 virus tail, fiber (finished: 76)
GO:0098023 virus tail, tip (finished: 74)
GO:0098022 viral capsid, fiber (finished: 72)
GO:0098018 viral capsid, minor subunit (finished: 70)
GO:0098017 viral capsid, major subunit (finished: 68)
GO:0098015 virus tail (finished: 66)
GO:0055036 virion membrane (finished: 64)
GO:0046806 viral scaffold (finished: 62)
GO:0046798 viral portal complex (finished: 60)
GO:0046729 viral procapsid (finished: 58)
GO:0046727 capsomere (finished: 56)
GO:0039670 viral capsid, turret (finished: 54)
GO:0098021 viral capsid, decoration (finished: 53)
GO:0039642 virion nucleoid (finished: 50)
GO:0039641 viral inner membrane (finished: 48)
GO:0039629 T=219 icosahedral capsid (finished: 46)
GO:0039628 T=169 icosahedral viral capsid (finished: 44)
GO:0039627 T=147 icosahedral capsid (finished: 42)
GO:0039626 viral intermediate capsid (finished: 40)
GO:0039625 viral inner capsid (finished: 38)
GO:0039624 viral outer capsid (finished: 36)
GO:0039623 T=25 icosahedral viral capsid (finished: 34)
GO:0039622 T=16 icosahedral viral capsid (finished: 32)
GO:0039621 T=13 icosahedral viral capsid (finished: 30)
GO:0039620 T=7 icosahedral viral capsid (finished: 28)
GO:0039619 T=4 icosahedral viral capsid (finished: 26)
GO:0039618 T=pseudo3 icosahedral viral capsid (finished: 24)
GO:0039617 T=3 icosahedral viral capsid (finished: 22)
GO:0039616 T=2 icosahedral viral capsid (finished: 20)
GO:0039615 T=1 icosahedral viral capsid (finished: 18)
GO:0019033 viral tegument (finished: 16)
GO:0019031 viral envelope (finished: 14)
GO:0036338 viral membrane (finished: 13)
GO:0019030 icosahedral viral capsid (finished: 10)
GO:0019029 helical viral capsid (finished: 8)
GO:0019013 viral nucleocapsid (finished: 6)
GO:0019028 viral capsid (finished: 5)
GO:0044423 virion component (finished: 4)