Forked from NGUYEN Linh Chi / compmatrixdb

Scriptmatrixdb

This guide is about installing scriptmatrixdb.py.

Installation

To install the required packages:

pip install powergrasp --no-cache-dir --no-deps

pip install pyasp -U --no-cache-dir

pip install pytest bubbletools networkx requests bs4 clyngor

More documentation about PowerGrASP and Bubble-tools can be found in their GitHub repositories.
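After running the pip commands, it can be useful to check that every dependency is importable. The sketch below is a minimal check, assuming that each import name is identical to the pip package name listed above; the helper name missing_modules is hypothetical and not part of the project.

```python
import importlib.util

# Assumption: the import names match the pip package names used above.
REQUIRED_MODULES = [
    "powergrasp", "pyasp", "pytest", "bubbletools",
    "networkx", "requests", "bs4", "clyngor",
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```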

Automation of interaction data graph compression

Scriptmatrixdb.py documentation

Documentation obtained in the terminal with the command: python scriptmatrixdb.py -h

usage: scriptmatrixdb.py [-h] [--A col_number_alias_A]
	                 [--B col_number_alias_B]
	                 [--graph {powergraph,oriented}]
	                 [--interac {human,human-mouse,chicken,mouse,mouse-rat,dog,taurus,rat,sheep,pig,none,unknown}]
	                 [--render] [--annot Annotation] [--allpwrn]
	                 [--score SCORE] [--pv pvalue] [--withoutCHEBI]
	                 [--graphonly] [--aspfile ASPFILE] [--tabfile TABFILE]
	                 INFILE

positional arguments:
  INFILE                .mitab input file

optional arguments:
  -h, --help            show this help message and exit
  --A col_number_alias_A
	                The column number of the alias A (positive number).
  --B col_number_alias_B
	                The column number of the alias B (positive number).
  --graph {powergraph,oriented}
	                Powergraph or oriented graph.
  --interac {human,human-mouse,chicken,mouse,mouse-rat,dog,taurus,rat,sheep,pig,none,unknown}
	                Taxon filter for the graph compression.
  --render              Generate a png image with the powergraph plugin (Oog)
	                on Cytoscape.
  --annot Annotation    csv file for DAVID
  --allpwrn             To have the maximum concept (all powernodes).
  --score SCORE         Minimum enrichment score (positive number).
  --pv pvalue           p-value for functional annotation (positive decimal
	                number between 0 and 1).
  --withoutCHEBI        Does not take into account the molecules with ChEBI
	                identification (non-protein molecules).
  --graphonly           To have only the powergraph.
  --aspfile ASPFILE     pre existing asp file from converter.
  --tabfile TABFILE     pre existing tabulated file from converter.

Example with the tabulated file "matrixdb_CORE27_example.mitab"

python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05

This runs scriptmatrixdb.py on an input file in mitab format named matrixdb_CORE27_example.mitab.

Columns 5 and 6 (Aliases for A and Aliases for B) contain the aliases of the interacting molecules.

The minimum enrichment score required for DAVID clusters is 2.

The p-value of the GO terms selected in this enrichment must not exceed 0.05.
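The effect of the two thresholds can be illustrated as follows. This is only a sketch: the cluster structure (a list of dictionaries with enrichment_score and terms keys), the function name select_go_terms and the placeholder term names are assumptions made for the example, not the format actually returned by DAVID or used by the script.

```python
def select_go_terms(clusters, min_score, max_pvalue):
    """Keep terms from clusters scoring at least min_score
    whose p-value does not exceed max_pvalue."""
    selected = []
    for cluster in clusters:
        if cluster["enrichment_score"] < min_score:
            continue  # the whole cluster is below the enrichment threshold
        for term in cluster["terms"]:
            if term["pvalue"] <= max_pvalue:
                selected.append(term["name"])
    return selected


# Made-up values, for illustration only.
clusters = [
    {"enrichment_score": 3.1, "terms": [
        {"name": "term_a", "pvalue": 0.001},
        {"name": "term_b", "pvalue": 0.20},
    ]},
    {"enrichment_score": 1.2, "terms": [
        {"name": "term_c", "pvalue": 0.01},
    ]},
]

print(select_go_terms(clusters, min_score=2, max_pvalue=0.05))  # ['term_a']
```

Here term_b is rejected by the p-value threshold, and term_c is rejected because its cluster does not reach the minimum enrichment score.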

Possible options

  • Functional annotation in DAVID is done by selecting annotation categories, all of which are listed in the annot_all.csv file. By default, the script uses the annot.csv file, which contains the annotation categories for Gene Ontology. To annotate with DAVID GO terms coming only from the GOTERM_BP_DIRECT category, a sample file named annot_matrixdb.csv is provided in the annot folder and can be used with this command:

      python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05 --annot=annot/annot_matrixdb.csv
  • Option to obtain a compressed graph with only human interactions:

      python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05 --interac=human
  • Option to obtain a compressed graph with only protein molecules:

      python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05 --withoutCHEBI
  • Option to get only the compressed graph:

      python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05 --graphonly
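The idea behind --withoutCHEBI can be sketched in a few lines. This is not the script's own code: it assumes that identifiers follow the usual MITAB "database:accession" convention with a chebi: prefix for ChEBI entries, and the identifiers and function names below are made up for the example.

```python
def is_chebi(identifier):
    """True if a MITAB-style identifier refers to the ChEBI database."""
    return identifier.lower().startswith("chebi:")


def drop_chebi_interactions(interactions):
    """Keep only the (id_a, id_b) pairs in which neither partner is a ChEBI entry."""
    return [(a, b) for a, b in interactions if not (is_chebi(a) or is_chebi(b))]


# Hypothetical identifiers, for illustration only.
interactions = [
    ("uniprotkb:P00001", 'chebi:"CHEBI:00001"'),
    ("uniprotkb:P00001", "uniprotkb:P00002"),
]

print(drop_chebi_interactions(interactions))
# [('uniprotkb:P00001', 'uniprotkb:P00002')]
```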

Rules for rewriting molecule names

The names of the molecules in the compressed graph are rewritten from those in the original tabulated "mitab" file according to a set of rules. These rules are defined by 3 files in the annot folder:

  • aliases.csv is a tabulated file with 2 columns: first the molecule name after rewriting, then the name before rewriting.
  • decomposables.csv is a tabulated file with 3 columns: the first contains the name of the dimeric molecule before rewriting, and the other 2 contain the rewritten names of its 2 monomeric molecules.
  • taxon_aliases.csv is a tabulated file that contains the common name of each species and its name as found in the tabulated mitab file.
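The way the first two files could be applied is sketched below, assuming tab-separated columns in the order described above. The function names, the sample rows and the choice to check decomposables before aliases are assumptions made for the example; the script itself may proceed differently.

```python
import csv
import io


def load_aliases(handle):
    """Map each original name to its rewritten name (columns: new name, old name)."""
    return {row[1]: row[0]
            for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}


def load_decomposables(handle):
    """Map each dimer name to the pair of rewritten monomer names."""
    return {row[0]: (row[1], row[2])
            for row in csv.reader(handle, delimiter="\t") if len(row) >= 3}


def rewrite(name, aliases, decomposables):
    """Return the list of names that replace `name` in the graph."""
    if name in decomposables:
        return list(decomposables[name])
    return [aliases.get(name, name)]  # unknown names are kept unchanged


# Made-up rows standing in for aliases.csv and decomposables.csv.
aliases = load_aliases(io.StringIO("NEW_A\told name A\n"))
decomposables = load_decomposables(io.StringIO("dimer AB\tMONO_A\tMONO_B\n"))

print(rewrite("old name A", aliases, decomposables))  # ['NEW_A']
print(rewrite("dimer AB", aliases, decomposables))    # ['MONO_A', 'MONO_B']
```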

Functional annotation of a protein list

Script_bbl.py documentation

Documentation obtained in the terminal with the command: python script_bbl.py -h

usage: script_bbl.py [-h] [--annot Annotation] [--withoutCHEBI]
	             [--pwrn Powernode_choice]
	             INFILE score pvalue FileRef

positional arguments:
  INFILE                Input file (.bbl).
  score                 Minimum enrichment score (positive number).
  pvalue                p-value for functional annotation (positive decimal
	                number between 0 and 1).
  FileRef               Ref file for the bbl file '_tab.csv'.

optional arguments:
  -h, --help            show this help message and exit
  --annot Annotation    Annotation file.
  --withoutCHEBI        Does not take into account the molecules with ChEBI
	                identification (non-protein molecules).
  --pwrn Powernode_choice
	                Chosen powernode from the bbl file (write 'powernode
	                name' if powernode name contains special characters)

Example of annotation of a powernode obtained with the script "scriptmatrixdb.py"

The compressed graph matrixdb_CORE27_example.bbl was generated and placed in the folder matrixdb_CORE27_example with the command:

python scriptmatrixdb.py matrixdb_CORE27_example.mitab 5 6 2 0.05 --graphonly

A powernode of the matrixdb_CORE27_example.bbl file, such as PWRN-"1,3-dimethyl-2-[2-oxopropyl thio]imidazolium chloride"-2-1, can be annotated with:

python script_bbl.py matrixdb_CORE27_example/matrixdb_CORE27_example.bbl 2 0.05 matrixdb_CORE27_example/matrixdb_CORE27_example_tab.csv --annot=annot/annot_matrixdb.csv --pwrn='PWRN-"1,3-dimethyl-2-[2-oxopropyl thio]imidazolium chloride"-2-1'
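Powernode names that contain quotes, spaces or brackets are easy to get wrong on the command line. When the call is built from Python, one possible approach is to pass the arguments as a list, so that no shell quoting is needed, and to use shlex.join (Python 3.8+) only to display a copy-pastable command. The sketch below uses the option name --pwrn as given in the usage message and does not run the script.

```python
import shlex

powernode = 'PWRN-"1,3-dimethyl-2-[2-oxopropyl thio]imidazolium chloride"-2-1'

command = [
    "python", "script_bbl.py",
    "matrixdb_CORE27_example/matrixdb_CORE27_example.bbl",
    "2", "0.05",
    "matrixdb_CORE27_example/matrixdb_CORE27_example_tab.csv",
    "--annot=annot/annot_matrixdb.csv",
    "--pwrn=" + powernode,
]

# shlex.join quotes each argument for a POSIX shell.
print(shlex.join(command))

# To actually launch the script, the list can be passed as-is:
# import subprocess
# subprocess.run(command, check=True)
```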

For each chosen powernode (for example, a powernode named PN_name), the script then creates:

  • A PN_name folder containing, for each direct descendant powernode (here called powernode_name):

    • A tabulated file List_of_powernode_name.csv with the list of molecules contained in this powernode.
    • A folder List_of_powernode_name, with the same name as this file, containing:
      • A tabulated file List_powernode_name.txt listing each molecule with its identifier and its taxonomy.
      • A tabulated file powernode_name_listinit.txt containing the list of molecule identifiers before it is complemented with identifiers that are exclusive to the MatrixDB database.
      • A file powernode_name.txt containing the list of protein identifiers then used for the functional annotation with DAVID.
      • A tabulated file powernode_name_listdelete.txt containing the list of molecules not taken into account because their identifier is unknown or does not match the tabulated file provided (example: matrixdb_CORE27_example_tab.csv). (This file is created only if at least one molecule is not taken into account.)
      • A folder list_David containing:
        • A tabulated file List_David_powernode_name.csv retrieved from the DAVID annotation, which contains the GO terms for each cluster.
        • A tabulated file LIST_GOTERM_powernode_name.csv containing the list of GO terms selected according to the cluster enrichment score and the pvalue parameter.
      • A folder htmlfile with the file obtained from the DAVID web page. (Only if the connection is successful.)
  • A tabulated file RESULTAT_FINAL_PN_name.csv containing the set of results (names of the powernodes studied, number of UniProt identifiers taken into account in DAVID, number of GO terms found, ...) for the powernode studied.

In addition, a tabulated file Conclusion_Node is created with a summary of the powernodes studied, that is to say those that have no direct successors. (The biggest powernode possible.) This file contains the powernodes studied (= NODE), the number of protein identifiers submitted to DAVID, the number of GO terms found, and the exact list of these GO terms.

script_stat

Performs various tasks on a protein-protein interaction graph.

Installation

pip install matplotlib numpy scipy

script_stat.py documentation

Documentation obtained by using the command: python script_stat.py -h

usage: script_stat.py [-h] [--tabfile TABFILE] [--tabfile2 TABFILE2]
	                  [--oldtab] [--withoutCHEBI]
	                  [--graph {degree,coef,both,stacked,scatter,None}]
	                  [--interac {human,human-mouse,chicken,mouse,mouse-rat,dog,taurus,rat,sheep,pig,none,unknown}]
	                  [--withoutzero] [--bins BINS] [--log] [--subtab SUBTAB]
	                  [--neighbour] [--conn] [--equi] [--cyto] [--tree]
	                  [--hub HUB] [--percent PERCENT] [--iso] [--ap]
	                  [--degree] [--origindata ORIGINDATA] [--test TEST]
	                  INFILE

positional arguments:
  INFILE                .mitab input file

optional arguments:
  -h, --help            show this help message and exit
  --tabfile TABFILE     Pre existing tabulated file from converter.
  --tabfile2 TABFILE2   Pre existing tabulated file from converter.
  --oldtab              If tabfile is w/ alias (no isoforms).
  --withoutCHEBI        Does not take into account the molecules with ChEBI
	                    identification (non-protein molecules).
  --graph {degree,coef,both,stacked,scatter,None}
	                    Create histograms
  --interac {human,human-mouse,chicken,mouse,mouse-rat,dog,taurus,rat,sheep,pig,none,unknown}
	                    Taxon filter for the graph compression.
  --withoutzero         Does not take into account the protein with a coef of
	                    zero.
  --bins BINS           Choose the number of bins in the graph (default: 100).
  --log                 logarithm scale for y axis
  --subtab SUBTAB       coef (x) or coef interval (x-x), create a sub
	                    tabulated file.
  --neighbour           Take into account the nodes neighbours for the
	                    --subtab or --tree options if True.
  --conn                Count the connected components of the graph.
  --equi                Process the data in order to find the equivalence
	                    group.
  --cyto                Generate a tabulated file with the coef of each node,
	                    can be used for visualization in Cytoscape.
  --tree                Create a tabulated and an asp file without trees.
  --hub HUB             Threshold for the hub. ex: 0.01 if you want to remove
	                    the 1 percent most connected nodes. Create a tabulated
	                    and an asp file without the hubs.
  --percent PERCENT     Percentage of the nodes we want to keep (between 0 and
	                    1). If 'all', does 10 to 90 percent files. Create a
	                    tabulated file and an asp file w/ nodes w/ the smaller
	                    degree.
  --iso                 Create a file with listing the isoforms and their id.
  --ap                  Detects the articulation points of the graph.
  --degree              Give info on degree distribution.
  --origindata ORIGINDATA
	                    Can be used to input the original data if the tabfile
	                    is a percent file.
  --test TEST           Perform a given test

Using --tabfile

Pre-existing tabulated file generated by scriptmatrixdb.py. (ex: test/tab_example.csv)

Using --oldtab

Option to be used with a tabulated file whose aliases merge isoforms. This option allows the user to take the isoforms into account: new aliases are created if isoforms are detected.

Using --graph option

The user can choose between different graph options:

  • both: generate a degree distribution and a clustering coefficient distribution
  • degree: generate a degree distribution
  • coef: generate a clustering coefficient distribution
  • stacked: generate a coefficient distribution with the degree annotated with colors
  • scatter: generate a scatter plot crossing the degree and the clustering coefficient

The --graph option can be combined with other ones:

  • --bins
  • --log
  • --withoutzero: can be used on the clustering coefficient distribution to remove the zero value, and on the degree distribution to remove the value 1

Examples:

  • Generate a degree distribution with a logarithmic scale for the y axis using an existing tabulated file

      python script_stat.py foo.mitab --tabfile foo_tab.csv --graph degree --log
  • Generate a scatter plot

      python script_stat.py foo.mitab --graph scatter
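The two quantities plotted by these options, the degree and the clustering coefficient of each node, can be computed as in the sketch below. It is a standard-library illustration on a made-up edge list, assuming an undirected graph without self-loops; it is not the script's own implementation, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # self-loops are ignored
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Made-up graph: a triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adjacency = build_adjacency(edges)
degrees = {node: len(adjacency[node]) for node in adjacency}
coefs = {node: clustering_coefficient(adjacency, node) for node in adjacency}

print(degrees)  # {'A': 2, 'B': 2, 'C': 3, 'D': 1}
print(coefs)
```

In this example, D has degree 1 and a coefficient of 0, which is the kind of value that --withoutzero removes from the distributions.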

Using the --subtab option

  • Create a tabulated file with only the nodes with a coefficient of 0:

      python script_stat.py foo.mitab --subtab 0
  • Create a tabulated file with the nodes with a coefficient ranging from 0.3 to 0.4:

      python script_stat.py foo.mitab --subtab 0.3-0.4

This option can be used with the --neighbour option to extend the graph to the neighbours of the selected nodes.

  • Create a tabulated file with the nodes with a coefficient of 0 and their neighbours:

      python script_stat.py foo.mitab --subtab 0 --neighbour
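The selection performed by --subtab, with or without --neighbour, can be pictured with the following sketch. The function names, the inclusive interval bounds and the sample values are assumptions made for the example; the script may handle bounds differently.

```python
def parse_interval(spec):
    """Turn 'x' or 'x-y' into a (low, high) pair of floats."""
    low, _, high = spec.partition("-")
    return float(low), float(high or low)


def select_nodes(coefs, adjacency, spec, neighbours=False):
    """Select nodes whose coefficient lies in the interval (bounds included),
    optionally extended to their direct neighbours."""
    low, high = parse_interval(spec)
    selected = {node for node, coef in coefs.items() if low <= coef <= high}
    if neighbours:
        for node in list(selected):
            selected |= adjacency.get(node, set())
    return selected


# Made-up graph: a triangle A-B-C with a pendant node D attached to C.
adjacency = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
coefs = {"A": 1.0, "B": 1.0, "C": 1.0 / 3.0, "D": 0.0}

print(sorted(select_nodes(coefs, adjacency, "0")))                   # ['D']
print(sorted(select_nodes(coefs, adjacency, "0", neighbours=True)))  # ['C', 'D']
print(sorted(select_nodes(coefs, adjacency, "0.3-0.4")))             # ['C']
```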

Using the --percent option

  • Create a tabulated file and an asp file from the original data containing the 84% least connected nodes:

      python script_stat.py foo.mitab --percent 0.84
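Keeping a given fraction of the least connected nodes amounts to ranking the nodes by degree and truncating the list. The sketch below shows one way to do it; rounding down the number of nodes kept and breaking ties by node name are assumptions made for the example, as are the function name and the sample degrees.

```python
import math


def least_connected(degrees, fraction):
    """Return the `fraction` of nodes with the smallest degree."""
    keep = math.floor(len(degrees) * fraction)
    ranked = sorted(degrees, key=lambda node: (degrees[node], node))
    return ranked[:keep]


# Made-up degrees, for illustration only.
degrees = {"A": 2, "B": 2, "C": 3, "D": 1}

print(least_connected(degrees, 0.84))  # ['D', 'A', 'B']
print(least_connected(degrees, 0.5))   # ['D', 'A']
```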

string_from_matrixdb

Automation of the comparison between a MatrixDB network and its STRING equivalent.

Retrieve the nodes from MatrixDB and search for annotation in STRING

  • map the identifiers into string identifiers
  • produce visualization for the STRING networks
    • full annotation
    • (physical interaction only) --> TODO
  • produce the corresponding interaction tables
  • compare the two networks (MatrixDB and STRING) and produce a difference network + the corresponding difference table --> TODO
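The comparison step in the last item can be pictured as a set difference between two edge lists. The sketch below treats interactions as undirected and uses made-up edges; the function names and the shape of the result are assumptions for the example, not the format produced by the script.

```python
def normalise(edges):
    """Treat interactions as undirected: (a, b) and (b, a) are the same edge."""
    return {frozenset(edge) for edge in edges}


def difference_network(matrixdb_edges, string_edges):
    """Split the edges into those found in only one network and those shared."""
    m, s = normalise(matrixdb_edges), normalise(string_edges)
    return {"matrixdb_only": m - s, "string_only": s - m, "shared": m & s}


# Made-up edges, for illustration only.
matrixdb_edges = [("P1", "P2"), ("P2", "P3")]
string_edges = [("P2", "P1"), ("P3", "P4")]

result = difference_network(matrixdb_edges, string_edges)
for label, edges in result.items():
    print(label, sorted(sorted(edge) for edge in edges))
```

With these edges, P1-P2 is shared even though its partners are listed in a different order, P2-P3 is specific to the first network and P3-P4 to the second.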

string_from_matrixdb.py documentation

Documentation obtained by using the command: python string_from_matrixdb.py -h

usage: string_from_matrixdb.py [-h] [--species SPECIES] [--mapping] [--visu]
	                           [--inter] [--diff DIFF]
	                           proteinlist

positional arguments:
  proteinlist        List of proteins to be searched in STRING.

optional arguments:
  -h, --help         show this help message and exit
  --species SPECIES  NCBI taxon identifiers
  --mapping          Mapping of the protein names only.
  --visu             Visualization of the networks only.
  --inter            Retrieve interactions table only.
  --diff DIFF        Network file (tabulated file) used to compare the result
	                 found with the STRING database. Produce the difference
	                 network and the difference table from two networks (ex:
	                 MatrixDB and STRING).

Usage examples

  • Generate the network and the interaction table:

      python string_from_matrixdb.py test/protein_list_2.txt
  • Identifiers mapping only:

      python string_from_matrixdb.py test/protein_list.txt --mapping
  • Generate the STRING network only:

      python string_from_matrixdb.py test/protein_list.txt --visu
  • Generate the STRING interaction table only:

      python string_from_matrixdb.py test/protein_list_2.txt --inter
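For reference, identifier mapping against STRING is usually done through the get_string_ids method of the public STRING API. The sketch below only builds the request URL and sends nothing; whether string_from_matrixdb.py uses this exact endpoint is an assumption, and the function name and identifiers are hypothetical.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def mapping_url(identifiers, species=9606, output_format="tsv"):
    """Build a get_string_ids request URL for a list of protein identifiers."""
    params = urlencode({
        # The STRING API expects identifiers separated by carriage returns.
        "identifiers": "\r".join(identifiers),
        "species": species,  # NCBI taxon identifier, as for --species
    })
    return f"{STRING_API}/{output_format}/get_string_ids?{params}"


# Hypothetical identifiers, for illustration only.
print(mapping_url(["P00001", "P00002"]))
```

The resulting URL can then be fetched with requests, which is already among the installed dependencies.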