Commit e8f814ba authored by Mikaël Salson's avatar Mikaël Salson

Merge branch 'feature-a/3004-doc-markdown' into 'dev'

Feature a/3004 : doc : convert in markdown and update

Closes #3348

See merge request !202
parents 6eb6527f b649aa47
Pipeline #32900 failed with stages
in 4 seconds
!NO_LAUNCHER:
!LAUNCH: (cd $VIDJIL_DIR ; grep './$EXEC ' doc/algo.org | sh)
!LAUNCH: (cd $VIDJIL_DIR ; grep './$EXEC ' doc/vidjil-algo.md | sh)
# Test examples embedded in 'doc/algo.org'
# Test examples embedded in 'doc/vidjil-algo.md'
# Run eight times
8:vidjil-algo -- V.D.J recombinations analysis .*vidjil.org
......
MD_FILES=$(wildcard *.md)
ORG_FILES=$(wildcard *.org)
html: $(ORG_FILES)
all: html htmlorg
html: $(MD_FILES)
cd .. ; mkdocs build
htmlorg: $(ORG_FILES)
make $(subst .org,.html, $^)
%.html: %.org
# Requires emacs with org-mode
emacs -batch $^ -f "org-html-export-to-html"
analysis-example%: format-analysis.org
analysis-example%: vidjil-format.md
python ../tools/org-babel-tangle.py --all $^
# Credits
Vidjil is an open-source platform for the analysis of high-throughput sequencing data from lymphocytes, developed and maintained by
the [Bonsai bioinformatics lab](http://cristal.univ-lille.fr/bonsai) at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille
and the [VidjilNet consortium](http://www.vidjil.net) at Fondation Inria.
Contact: [Mathieu Giraud and Mikaël Salson](mailto:contact@vidjil.org)
## Vidjil core development
- Mathieu Giraud, 2011-2018
- Mikaël Salson, 2011-2018
- Marc Duez, 2012-2016
- Tatiana Rocher, 2014-2017
- Florian Thonier, 2015-2018
- Ryan Herbert, 2015-2018
- Aurélien Béliard, 2016-2017
## Other contributors
- David Chatel, 2011-2012
- Antonin Carette, 2014
- Loïc Breton and Jordan Gilliot, 2014
- François Dubiez, 2015-2016
- Amina Boussalia and Fabien Fache, 2016
- Eddy El Khatib and Nicolas Berveglieri, 2017
- Téo Vasseur, 2017
- Cyprien Borée, 2018
## `.should-vdj.fa` tests with curated V(D)J designations
- Yann Ferret (CHRU Lille), 2014-2015
- Florian Thonier (Inserm, Paris Necker), 2015-2016
## Acknowledgements
We thank all our users, collaborators and colleagues who provided feedback on Vidjil and proposed new ideas.
Our special thanks go to:
- Marine Armand
- Jack Bartram
- Aurélie Caillault
- Yann Ferret
- Alice Fievet
- Martin Figeac
- Nathalie Grardel
- Michaela Kotrová
- Claude Preudhomme
- Shéhérazade Sebda
Vidjil is developed in collaboration or in connection with the following groups:
- department of Hematology of CHRU Lille
- Functional and Structural Genomic Platform (U. Lille 2, IFR-114, IRCL)
- Institut Necker Enfants Malades, Paris
- EuroClonality-NGS working group
## Funding
The development of Vidjil is funded by:
- Région Nord-Pas-de-Calais/Hauts-de-France, 2012-2017
- Université Lille 1, 2014-2017
- SIRIC ONCOLille (Grant INCa-DGOS-Inserm 6041), 2014-2017
- Inria Lille, 2015-2018
- InCA, 2016-2018
- VidjilNet consortium, 2018
## Reference
Marc Duez et al.,
“Vidjil: A web platform for analysis of high-throughput repertoire sequencing”,
PLOS ONE 2016, 11(11):e0166126
<http://dx.doi.org/10.1371/journal.pone.0166126>
Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
BMC Genomics 2014, 15:409
<http://dx.doi.org/10.1186/1471-2164-15-409>
#+TITLE: Vidjil credits
Vidjil is an open-source platform for the analysis of high-throughput sequencing data from lymphocytes,
developed by the Bonsai bioinformatics lab, at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille.
Contact: Mathieu Giraud and Mikaël Salson <contact@vidjil.org>
* Vidjil core development
- Mathieu Giraud, 2011-2017
- Mikaël Salson, 2011-2017
- Marc Duez, 2012-2016
- Tatiana Rocher, 2014-2017
- Florian Thonier, 2015-2017
- Ryan Herbert, 2015-2017
- Aurélien Béliard, 2016-2017
* Other contributors
- David Chatel, 2011-2012
- Antonin Carette, 2014
- Loïc Breton and Jordan Gilliot, 2014
- François Dubiez, 2015-2016
- Amina Boussalia and Fabien Fache, 2016
- Eddy El Khatib and Nicolas Berveglieri, 2017
- Téo Vasseur, 2017
* .should-vdj.fa tests with curated V(D)J designations
- Yann Ferret (CHRU Lille), 2014-2015
- Florian Thonier (Inserm, Paris Necker), 2015-2016
* Acknowledgements
We thank all our users, collaborators and colleagues who provided feedback on Vidjil and proposed new ideas.
Our special thanks go to:
- Marine Armand
- Jack Bartram
- Aurélie Caillault
- Yann Ferret
- Alice Fievet
- Martin Figeac
- Nathalie Grardel
- Michaela Kotrová
- Claude Preudhomme
- Shéhérazade Sebda
Vidjil is developed in collaboration or in connection with the following groups:
- department of Hematology of CHRU Lille
- Functional and Structural Genomic Platform (U. Lille 2, IFR-114, IRCL)
- Institut Necker Enfants Malades, Paris
- EuroClonality-NGS working group
* Funding
The development of Vidjil is funded by:
- Région Nord-Pas-de-Calais/Hauts-de-France, 2012-2017
- Université Lille 1, 2014-2017
- SIRIC ONCOLille (Grant INCa-DGOS-Inserm 6041), 2014-2017
- Inria Lille, 2015-2017
- InCA, 2016-2017
* Reference
Marc Duez et al.,
“Vidjil: A web platform for analysis of high-throughput repertoire sequencing”,
PLOS ONE 2016, 11(11):e0166126
http://dx.doi.org/10.1371/journal.pone.0166126
Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
BMC Genomics 2014, 15:409
http://dx.doi.org/10.1186/1471-2164-15-409
#+TITLE: Encoding clones with V(D)J recombinations (2016b)
#+AUTHOR: The Vidjil team
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="org-mode.css" />
The following [[http://en.wikipedia.org/wiki/JSON][.json]] format allows to
encode a set of clones with V(D)J immune recombinations,
possibly with user annotations.
In Vidjil, this format is used by both the =.analysis= and the =.vidjil= files.
The =.vidjil= file represents the actual data on clones (and that can
reach megabytes, or even more), usually produced by processing reads by some RepSeq software.
(for example with detailed information on the 100 or 1000 top clones).
The =.analysis= file describes customizations done by the user
(or by some automatic pre-processing) on the Vidjil web application. The web application
can load or save such files (and possibly from/to the patient/sample database).
It is intended to be very small (a few kilobytes).
All settings in the =.analysis= file override the settings that could be
present in the =.vidjil= file.
* What is a clone ?
There are several definitions of what may be a clonotype,
depending on different RepSeq software or studies.
This format accept any kind of definition:
Clones are identified by a =id= string that may be an arbitrary identifier such as =clone-072a=.
Software computing clones may choose some relevant identifiers:
- =CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC=, Vidjil algorithm, 50 nt window centered on the CDR3
- =CARPRDWNTYYYYGMDVW=, a CDR3 AA sequence
- =CARPRDWNTYYYYGMDVW IGHV3-11*00 IGHJ6*00=, a CDR3 AA sequence with additional V/J gene information (MiXCR)
- the 'clone sequence' as computed by the ARReST in =.clntab= files (processed by =fuse.py=)
- see also 'IMGT clonotype (AA) or (nt)'
* Examples
** =.vidjil= file -- one sample
This is an almost minimal =.vidjil= file, describing clones in one sample.
The =seg= element is optional: clones without =seg= elements will be shown on the grid with '?/?'.
The =_average_read_length= is also optional, but allows to plot GENSCAN-like plots more precisely than getting only the length of the sequence.
All other elements are required. The =reads.germlines= list can have only one element the case of data on a unique locus.
There is here one clone on the =TRG= locus with a designation =TRGV5*01 5/CC/0 TRGJ1*02=.
Note that other elements could be added by some program (such as =tag=, to identify some clones,
or =clusters=, to further cluster some clones, see below).
#+BEGIN_SRC js :tangle analysis-example1.vidjil
{
"producer": "program xyz version xyz",
"timestamp": "2014-10-01 12:00:11",
"vidjil_json_version": "2016b",
"samples": {
"number": 1,
"original_names": ["T8045-BC081-Diag.fastq"]
},
"reads" : {
"total" : [ 437164 ] ,
"segmented" : [ 335662 ] ,
"germline" : {
"TRG" : [ 250000 ] ,
"IGH" : [ 85662 ]
}
},
"clones": [
{
"id": "clone-001",
"sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
"reads" : [ 243241 ],
"_average_read_length": [ 119.3 ],
"germline": "TRG",
"top": 1,
"seg":
{
"5": {"name": "TRGV5*01", "start": 1, "stop": 86, "delRight":5},
"3": {"name": "TRGJ1*02", "start": 89, "stop": 118, "delLeft":0},
"cdr3": { "start": 77, "stop": 104, "seq": "gccacctgggccttattataagaaactc" }
}
}
]
}
#+END_SRC
** =.vidjil= file -- several related samples
This a =.vidjil= file obtained by merging with =fuse.py= two =.vidjil= files corresponding to two samples.
Clones that have a same =id= are gathered (see 'What is a clone?', above).
It is the responsibility of the program generating the initial =.vidjil= files to choose these =id= to
do a correct gathering.
#+BEGIN_SRC js :tangle analysis-example2.vidjil
{
"producer": "program xyz version xyz / fuse.py version xyz",
"timestamp": "2014-10-01 14:00:11",
"vidjil_json_version": "2016b",
"samples": {
"number": 2,
"original_names": ["T8045-BC081-Diag.fastq", "T8045-BC082-fu1.fastq"]
},
"reads" : {
"total" : [ 437164, 457810 ] ,
"segmented" : [ 335662, 410124 ] ,
"germline" : {
"TRG" : [ 250000, 300000 ] ,
"IGH" : [ 85662, 10124 ]
}
},
"clones": [
{
"id": "clone-001",
"sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
"reads" : [ 243241, 14717 ],
"germline": "TRG",
"top": 1,
"seg":
{
"5": {"name": "TRGV5*01", "start": 1, "stop": 86, "delRight": 5},
"3": {"name": "TRGJ1*02", "start": 89, "stop": 118, "delLeft": 0}
}
},
{
"id": "clone2",
"sequence": "GATACA",
"reads": [ 153, 10221 ],
"germline": "TRG",
"top": 2
},
{
"id": "clone3",
"sequence": "ATACAGA",
"reads": [ 521, 42 ],
"germline": "TRG",
"top": 3,
"seg":
{
"5": {"start": 1, "stop": 100},
"3": {"start": 101, "stop": 200}
}
}
]
}
#+END_SRC
** =.analysis= file
This file reflects the annotations a user could have done within the Vidjil web application or some other tool.
She has manually set sample names (=names=), tagged (=tag=, =tags=), named (=name=) and clustered (=clusters=)
some clones, and added external data (=data=).
#+BEGIN_SRC js :tangle analysis-example2.analysis
{
"producer": "user Bob, via Vidjil webapp",
"timestamp": "2014-10-01 12:00:11",
"vidjil_json_version": "2016b",
"samples": {
"id": [
"T8045-BC081-Diag.fastq",
"T8045-BC082-fu1.fastq"
],
"number": 2,
"names": ["diag", "fu1"],
"original_names": ["file1.fastq", "file2.fastq"],
"order": [1, 0]
},
"clones": [
{
"id": "clone-001",
"name": "Main ALL clone",
"tag": "0"
},
{
"id": "spikeE",
"label": "spike",
"sequence": "ATGACTCTGGAGTCTATTACTGTGCCACCTGGGATGTGAGTATTATAAGAAAC",
"tag": "3",
"expected": "0.1"
}
],
"clusters": [
[ "clone2", "clone3"],
[ "clone-5", "clone-10", "clone-179" ]
],
"data": {
"qPCR": [0.83, 0.024],
"spikeZ": [0.01, 0.02]
},
"tags": {
"names": {
"0" : "main clone",
"3" : "spike",
"5" : "custom tag"
},
"hide": [4, 5]
}
}
#+END_SRC
The =order= field defines the order in which order the points should be
considered. In that case we should first consider the second point (whose =name=
is /fu1)/ and the point to be considered in second should be the first one in
the file (whose =name= is /diag/).
The =clusters= field indicate clones (by their =id=) that have been further clustered.
Usually, these clones were defined in a related =.vidjil= file (as /clone2/ and /clone3/,
see the =.vidjil= file in the previous section). If these clones do not exist, the clusters are
just ignored. The first item of the cluster is considered as the
representative clone of the cluster.
* Detailed specification
** Generic information for traceability [required]
#+BEGIN_SRC js
"producer": "my-repseq-software -z -k (v. 123)", // arbitrary string, user/software/version/options producing this file [required]
"timestamp": "2014-10-01 12:00:11", // last modification date [required]
"vidjil_json_version": "2016b", // version of the .json format [required]
#+END_SRC
** Statistics: the =reads= element [.vidjil only, required]
The number of analyzed reads (=segmented=) may be higher than the sum of the read number of all clones,
when one choose to report only the 'top' clones (=-t= option for fuse).
#+BEGIN_SRC js
{
"total" : [], // total number of reads per sample (with samples.number elements)
"segmented" : [], // number of analyzed/segmented reads per sample (with samples.number elements)
"germline" : { // number of analyzed/segmented reads per sample/germline (with samples.number elements)
"TRG" : [],
"IGH" : []
}
}
#+END_SRC
** =samples= element [required]
#+BEGIN_SRC js
{
"number": 2, // number of samples [required]
"original_names": [], // original sample names (with samples.number elements) [required]
"names": [], // custom sample names (with samples.number elements) [optional]
// These names are editable and will be used on the graphs
"order": [], // custom sample order (lexicographic order by default) [optional]
// traceability on each sample (with sample.number elements)
"producer": [],
"timestamp": [],
"log": []
}
#+END_SRC
** =clones= list, with read count, tags, V(D)J designation and other sequence features
Each element in the =clones= list describes properties of a clone.
In a =.vidjil= file, this is the main part, describing all clones.
In the =.analysis= file, this section is intended to describe some specific clones.
#+BEGIN_SRC js
{
"id": "", // clone identifier, must be unique [required] [see above, 'What is a clone ?']
// the clone identifier in the .vidjil file and in .analysis file must match
"germline": "" // [required for .vidjil]
// (should match a germline defined in germline/germline.data)
"name": "", // clone custom name [optional]
// (the default name, in .vidjil, is computed from V/D/J information)
"label": "", // clone labels, separed by spaces [optional]
// These labels may add some information entered with a controled vocabulary
"sequence": "", // reference nt sequence [required for .vidjil]
// (for .analysis, not really used now in the web application,
// for special clones/sequences that are known,
// such as standard/spikes or know patient clones)
"tag": "", // tag id from 0 to 7 (see below) [optional]
"expected": "" // expected abundance of this clone (between 0 and 1) [optional]
// this will create a normalization option in the
// settings web application menu
"seg": // detailed V(D)J designation/segmentation and other sequences features or values [optional]
// on the web application, clones that are not segmented will be shown on the grid with '?/?'
// positions are related to the 'sequence'
// names of V/D/J genes should match the ones in files referenced in germline/germline.data
// Positions on the sequence start at 1.
{
"5": {"name": "IGHV5*01", "start": 1, "stop": 120, "delRight": 5}, // V (or 5') segment
"4": {"name": "IGHD1*01", "start": 124, "stop": 135, "info": "unsure designation", "delRight": 5, "delLeft": 0}, // D (or middle) segment
// Recombination with several D may use "4a", "4b"...
"3": {"name": "IGHJ3*02", "start": 136, "stop": 171, "delLeft": 5}, // J (or 3') segment
// any feature to be highlighted in the sequence, with optional fields related to this feature:
// - "start"/"stop" : positions on the clone sequence (starting at 1)
// - "delLeft/delRight" : a numerical value . It is the numbers of nucleotides deleted during the rearrangment. DelRight are compatible with V/5 and D/4 segments, delLeft is compatible with D/4 and J/3 segments.
// - "seq" : a sequence
// - "val" : a numerical value
// - "info" : a textual vlaue
//
// JUNCTION//CDR3 should be stored that way (in fields called "junction" of "cdr3"),
// its productivity must be stored in a boolean field called "productive".
"somefeature": { "start": 56, "stop": 61, "seq": "ACTGTA", "val": 145.7, "info": "analyzed with xyz" },
// Numerical or textual features concerning all the sequence or its analysis (such as 'evalue')
// can be provided by omitting "start" and "stop" elements.
"someotherfeature": {"val": 0.004521},
"anotherfeature": {"info": "VH CDR3 stereotypy"},
}
"reads": [], // number of reads in this clones [.vidjil only, required]
// (with samples.number elements)
"top": 0, // (not documented now) [required] threshold to display/hide the clone
"stats": [] // (not documented now) [.vidjil only] (with sample.number elements)
}
#+END_SRC
** =germlines= list [optional][work in progress, to be documented]
extend the =germline.data= default file with a custom germline
#+BEGIN_SRC js
"germlines" : {
"custom" : {
"shortcut": "B",
"5": ["TRBV.fa"],
"4": ["TRBD.fa"],
"3": ["TRBJ.fa"]
}
}
#+END_SRC
** Further clustering of clones: the =clusters= list [optional]
Each element in the 'clusters' list describe a list of clones that are 'merged'.
In the web application, it will be still possible to see them or to unmerge them.
The first clone of each line is used as a representative for the cluster.
** =data= list [optional][work in progress, to be documented]
Each element in the =data= list is a list of values (of size samples.number)
showing additional data for each sample, as for example qPCR levels or spike information.
In the browser, it will be possible to display these data and to normalize
against them (not implemented now).
** Tagging some clones: =tags= list [optional]
The =tags= list describe the custom tag names as well as tags that should be hidden by default.
The default tag names are defined in [[../browser/js/vidjil-style.js]].
#+BEGIN_SRC js
"key" : "value" // "key" is the tag id from 0 to 7 and "value" is the custom tag name attributed
#+END_SRC
# Analyzed human locus
The Vidjil web application displays multi-locus data, as long as this information
is provided in the `.vidjil` file computed by the analysis program.
Vidjil-algo currently analyzes the following locus,
selecting the best locus for each read.
The configuration of analyzed locus is done in the `germline/homo-sapiens.g` preset.
| | | complete recombinations | | incomplete/special recombinations |
| -------------------- | ------- | ---------------------------------------------- | --------- | --------------------------------- |
| | **TRA** | Va-Ja | | |
| | **TRB** | Vb-(Db)-Jb | **TRB+** | Db-Jb |
| | **TRD** | Vd-(Dd)-Jd | **TRD+** | Vd-Dd3, Dd2-(Dd)-Jd, Dd2-Dd3 |
| | | | **TRA+D** | Vd-(Dd)-Ja, Dd-Ja |
| | **TRG** | Vg-Jg | | |
| | **IGH** | Vh-(Dh)-Jh | **IGH+** | Dh-Jh |
| | **IGL** | Vl-Jl | | |
| | **IGK** | Vk-Jk | **IGK+** | Vk-KDE, INTRON-KDE |
| vidjil-algo option | | `-g germline/homo-sapiens.g:TRA,TRB,TRD,TRG` | | `-g germline/homo-sapiens.g` |
| | | `-g germline/homo-sapiens.g:IGH,IGL,IGK` | | |
| server configuration | | `multi` | | `multi+inc` |
The detection of complete recombinations is reliable and should work provided that the reads
are long enough (especially the J region).
The detection of incomplete/special recombinaisons is more challenging and may fail in some cases.
In particular, as D genes may be very short, detecting TRD+ (Dd2/Dd3) and IGH+ (Dh-Jh) recombinations
require to have reads with fairly conserved D genes or up/downstream regions.
Finally, the `-2` command line option and the `multi+inc+xxx` server configuration try to
detect unexpected or chimeric recombinations between genes of different germlines or on different
strands (such as PCR dimers or +V/-V recombinations).
These recombinations, tagged as `xxx`, can be technological artefacts or unusual biological recombinations.
#+TITLE: Vidjil -- Analyzed human locus
#+AUTHOR: The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Ryan and Tatiana)
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="org-mode.css" />
#+OPTIONS: toc:nil
The Vidjil web application is able to display multi-locus data, as long as this information
is provided in the =.vidjil= file computed by the analysis program.
Vidjil-algo currently analyzes the following locus,
selecting the best locus for each read.
The configuration of analyzed locus is done in the =germline/homo-sapiens.g= preset.
|----------------------+-------+-------------------------+--------+-----------------------------------|
| | | complete recombinations | | incomplete/special recombinations |
|----------------------+-------+-------------------------+--------+-----------------------------------|
| | *TRA* | Va-Ja | | |
| | *TRB* | Vb-(Db)-Jb | *TRB+* | Db-Jb |
| | *TRD* | Vd-(Dd)-Jd | *TRD+* | Vd-Dd3, Dd2-(Dd)-Jd, Dd2-Dd3 |
| | | | *TRA+D*| Vd-(Dd)-Ja, Dd-Ja |
| | *TRG* | Vg-Jg | | |
|----------------------+-------+-------------------------+--------+-----------------------------------|
| | *IGH* | Vh-(Dh)-Jh | *IGH+* | Dh-Jh |
| | *IGL* | Vl-Jl | | |
| | *IGK* | Vk-Jk | *IGK+* | Vk-KDE, INTRON-KDE |
|----------------------+-------+-------------------------+--------+-----------------------------------|
| vidjil-algo option | | =-g germline/homo-sapiens.g:TRA,TRB,TRD,TRG,IGH,IGL,IGK= | | =-g germline/homo-sapiens.g= |
| server configuration | | =multi= | | =multi+inc= |
|----------------------+-------+-------------------------+--------+-----------------------------------|
The detection of complete recombinations is reliable and should work provided that the reads
are long enough (especially the J region).
The detection of incomplete/special recombinaisons is more challenging and may fail in some cases.
In particular, as D genes may be very short, detecting TRD+ (Dd2/Dd3) and IGH+ (Dh-Jh) recombinations
require to have reads with fairly conserved D genes or up/downstream regions.
Finally, the =-2= command line option and the =multi+inc+xxx= server configuration try to
detect unexpected or chimeric recombinations between genes of different germlines or on different
strands (such as PCR dimers or +V/-V recombinations).
These recombinations, tagged as =xxx=, can be technological artefacts or unusual biological recombinations.