algo.org 13.4 KB
Newer Older
1 2 3
#+TITLE: Vidjil -- Algo Manual
#+AUTHOR: The Vidjil team (Mathieu, Mikaël and Marc)

Mathieu Giraud's avatar
Mathieu Giraud committed
4
# Vidjil -- V(D)J recombinations analysis -- [[http://www.vidjil.org]]
Mathieu Giraud's avatar
Mathieu Giraud committed
5
# Copyright (C) 2011, 2012, 2013, 2014 by Bonsai bioinformatics at LIFL (UMR CNRS 8022, Université Lille) and Inria Lille
Marc Duez's avatar
merge  
Marc Duez committed
6
# contact@vidjil.org
Mikaël Salson's avatar
Mikaël Salson committed
7

Mikaël Salson's avatar
Mikaël Salson committed
8 9 10 11
V(D)J recombinations in lymphocytes are essential for immunological
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
Mikaël Salson's avatar
Mikaël Salson committed
12

13
Vidjil processes high-throughput sequencing data to extract V(D)J
Mathieu Giraud's avatar
Mathieu Giraud committed
14
junctions and gather them into clones. Vidjil starts 
Mikaël Salson's avatar
Mikaël Salson committed
15 16
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
17
to output the most abundant clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
18
because, in the first phase, no alignment is performed with database
19
germline sequences. Vidjil can also cluster similar
Mikaël Salson's avatar
Mikaël Salson committed
20 21 22
clones, or leave this to the user after a manual review. 

The method is described in the following paper:
Mikaël Salson's avatar
Mikaël Salson committed
23

Mikaël Salson's avatar
Mikaël Salson committed
24 25
Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
26 27
BMC Genomics 2014, 15:409
http://dx.doi.org/10.1186/1471-2164-15-409
Mikaël Salson's avatar
Mikaël Salson committed
28 29 30

Vidjil is open-source, released under GNU GPLv3 license.

31
* Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
32 33 34 35 36 37 38 39 40 41 42 43

Vidjil has been successfully tested on the following platforms :
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
 - Debian Squeeze 
 - Fedora 17
 - FreeBSD 9.1 amd64
 - NetBSD 6.0.1 amd64
 - Ubuntu 12.04 amd64
 - Ubuntu 12.04 i386


44
* Installation
Mikaël Salson's avatar
Mikaël Salson committed
45

Mikaël Salson's avatar
Mikaël Salson committed
46
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
47 48 49 50 51 52 53
make data
   # get some IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
54
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
55 56 57 58 59 60 61 62 63 64
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

make                     # compile Vijil
make test                # run self-tests

./vidjil -h              # display help/usage
Mikaël Salson's avatar
Mikaël Salson committed
65
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
66 67


68
* Vidjil parameters
Mikaël Salson's avatar
Mikaël Salson committed
69

Mikaël Salson's avatar
Mikaël Salson committed
70
Launching vidjil with =-h= option provides the list of parameters that can be
Mikaël Salson's avatar
Mikaël Salson committed
71 72
used.

Mathieu Giraud's avatar
Mathieu Giraud committed
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
** Main algorithm parameters

#+BEGIN_EXAMPLE
Window prediction
  (use either -s or -k option, but not both)
  -s <string>   spaced seed used for the V/J affectation
                (default: #####-#####, ######-######, #######-#######, depends on germline)
  -k <int>      k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
                (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
  -w <int>      w-mer size used for the length of the extracted window (default: 40)(default with -d: 60)
#+END_EXAMPLE

The =-s= and =-k= options are the options of the heuristic. A detailed
explanation can be found in the paper. More help on that will be
available in the following months. The defaults values should work.

The =-w= option fixes the size of the "window" that is the main
identifier to gather clones. The defaults values (40 for TRG, 60 for
IGH) were selected to ensure a high-quality clone gathering. The
high-throughput heuristic predicts the center of the "window" that may
Mikaël Salson's avatar
Mikaël Salson committed
93
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
94 95 96 97 98 99 100 101 102
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
in >99% of cases). The extracted window should be large enough to
fully contain the CDR3 as well as some part of the end of the V and
the start of the J to uniquely identify a clone.

Setting =-w= to 30 for TRG and 50 for IGH may "segment" (analyze) a
few more reads, but may in some rare cases falsely cluster reads from
different clones.  Setting =-w= to lower values is not recommended.

103 104 105 106 107
** Threshold on clone output

The following options control how many clones are output and analyzed.

#+BEGIN_EXAMPLE
108
Limits to report a clone (or a window)
109
  -r <nb>       minimal number of reads supporting a clone (default: 10)
110 111
  -% <ratio>    minimal percentage of reads supporting a clone (default: 0)

112
Limits to further analyze some clones
113 114 115
  -y <nb>       maximal number of clones computed with a representative ('all': no limit) (default: 100)
  -z <nb>       maximal number of clones to be segmented ('all': no limit, do not use) (default: 20)
  -A            reports and segments all clones (-r 1 -% 0 -y all -z all), to be used only on very small datasets
116 117
#+END_EXAMPLE

118
The =-r/-%= options are strong thresholds: if a clone does not have
119 120
the requested number of reads, the clone is discarded (except when
using =-l=, see below).
121
The default =-r 10= option is meant to only output clones that
122 123
have a significant read support. *You shoud use* =-r 1= *if you
want to detect all clones starting from the first read* (especially for
124 125
MRD detection).

126 127
The =-y= option limits the number of clones for which a representative
sequence is computed. Usually you do not need to have more
128
representatives (see below), but you can safely put =-y all= if you want
129 130
to compute all representative sequences.

131 132 133
The =-z= option limits the number of clones that are fully analyzed,
/with their V(D)J segmentation/, in particular to enable the browser
to display the clones on the grid (otherwise they are displayed on the
134
'?/?' axis). It should be smaller than =-y=.
Mikaël Salson's avatar
Mikaël Salson committed
135 136 137 138
If you want to analyze more clones, you should use =-z 50= or
=-z 100=.  It is not recommended to use larger values: outputting more
than 100 clones is often not useful since they can't be visualized easily
in the browser, and takes large computation time.
139

140
Note that even if a clone is not in the top 20 (or 50, or 100) but
141
still passes the =-r=, =-%= options, it is still reported in the .vidjil
142
file. If the clone is at some MRD point in the top 20 (or 50, or 100),
Mikaël Salson's avatar
Mikaël Salson committed
143
it will be fully analyzed/segmented by this other point (and then
144 145 146 147
collected by the =fuse.py= script, using representatives computed at this
other point, and then, on the browser, correctly displayed on the grid). 
*Thus is advised to leave the default* =-y 100 -z 20= *options 
for the majority of uses.*
148 149 150 151 152

The =-A= option disables all these thresholds. This option should be
used only for test and debug purposes, on very small datasets, and
produce large file and takes huge computation times.

153
** Force to follow some sequences
Mikaël Salson's avatar
Mikaël Salson committed
154

Mikaël Salson's avatar
Mikaël Salson committed
155
Vidjil allows to specify a list of windows that must be followed
156
(even if those windows are 'rare', below the =-r/-R/-%= thresholds).
Mikaël Salson's avatar
Mikaël Salson committed
157
The parameter =-l= is made for providing such a list in a file following
Mikaël Salson's avatar
Mikaël Salson committed
158
the following format: window label (separed by one space)
Mikaël Salson's avatar
Mikaël Salson committed
159

Mikaël Salson's avatar
Mikaël Salson committed
160 161 162
The first column of the file is the window to be followed
while the remaining columns consist of the window's label.
In Vidjil output, the labels are output alongside their windows.
Mikaël Salson's avatar
Mikaël Salson committed
163

164 165 166 167 168 169 170 171

** Further clustering

These options have no consequences on the visualization through the
browser. They are intented for a command-line use only.

Setting the =-n= option triggers an additional automatic
clustering using DBSCAN algorithm (Ester and al., 1996). 
Mikaël Salson's avatar
Mikaël Salson committed
172

Mikaël Salson's avatar
Mikaël Salson committed
173
The =-e= option allows to specify a file for manually clustering two windows
Mikaël Salson's avatar
Mikaël Salson committed
174 175 176
considered as similar. Such a file may be automatically produced by vidjil
(out/edges), depending on the option provided. Only the two first columns 
(separed by one space) are important to vidjil, they only consist of the 
Mikaël Salson's avatar
Mikaël Salson committed
177
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
178 179


180
* Examples of use
Mikaël Salson's avatar
Mikaël Salson committed
181 182

All the following examples are on a IGH VDJ recombinations : they thus
Mikaël Salson's avatar
Mikaël Salson committed
183
require the =-G germline/IGH= and the =-d= options.
Mikaël Salson's avatar
Mikaël Salson committed
184

Mikaël Salson's avatar
Mikaël Salson committed
185
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
186
./vidjil -G germline/IGH -d data/Stanford_S22.fasta
Mikaël Salson's avatar
Mikaël Salson committed
187
   # Extract (with an ultra-fast heuristic) all windows
188 189 190
   # Summary of windows is available in out/Stanford_S22.vidjil
   # (for the '.vidjil' format, see below)
   # To have detailed/debug results in out/Stanford_S22.vdj.fa
191 192 193
   # (which is a FASTA file embedding heuristic information 
      in the headers, '.vdj' format, see warning below)
   # run Vidjil with option '-U'
Mikaël Salson's avatar
Mikaël Salson committed
194
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
195

Mikaël Salson's avatar
Mikaël Salson committed
196
#+BEGIN_EXAMPLE
Mikaël Salson's avatar
Mikaël Salson committed
197 198 199 200
>8--window--1 
CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA
>5--window--2 
CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG
Mikaël Salson's avatar
Mikaël Salson committed
201
(...)
Mikaël Salson's avatar
Mikaël Salson committed
202
#+END_EXAMPLE
Mikaël Salson's avatar
Mikaël Salson committed
203

Mikaël Salson's avatar
Mikaël Salson committed
204
   Windows of size 60 (modifiable by =-w=) have been extracted.
Mikaël Salson's avatar
Mikaël Salson committed
205
   The first window has 8 occurrences, the second window has 5 occurrences.
Mikaël Salson's avatar
Mikaël Salson committed
206

Mikaël Salson's avatar
Mikaël Salson committed
207
#+BEGIN_SRC sh
208
./vidjil -c clones -G germline/IGH -r 1 -R 1 -d ./data/clones_simul.fa
Mikaël Salson's avatar
Mikaël Salson committed
209
   # Extracts the windows (-r 1, with at least 1 read each), 
210 211
   # then gather them into clones 
   # A more natural option could be -r 5.
Mathieu Giraud's avatar
Mathieu Giraud committed
212
   # For debug purpose, if one wants all the clones, use the option -A.
213 214
   # Results are both
   #  - on the standard output
215 216 217 218
   #  - in out/clones_simul.vdj.fa (fasta file to be processed by other tools)
   #  - in out/clones_simul.vidjil (for the browser)
   # Additional files are in out/clones_simul.windows.fa and out/seq/clone.fa-*
   # If one adds the '-U' option, an additonal out/clones_simul.segmented.vdj.fa file is produced,
219
   # listing segmented reads using the .vdj format (see below)
Mikaël Salson's avatar
Mikaël Salson committed
220
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
221

Mikaël Salson's avatar
Mikaël Salson committed
222
#+BEGIN_SRC sh
223
./vidjil -c clones -G germline/IGH -r 1 -n 5 -d ./data/clones_simul.fa
Mikaël Salson's avatar
Mikaël Salson committed
224
   # Window extraction + clone gathering,
225
   # with automatic clustering, distance five (-n 5)
Mikaël Salson's avatar
Mikaël Salson committed
226
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
227

Mikaël Salson's avatar
Mikaël Salson committed
228
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
229
./vidjil -c segment -G germline/IGH -d data/segment_S22.fa
Mikaël Salson's avatar
Mikaël Salson committed
230 231 232
   # Segment the reads onto VDJ germline 
   # (this is slow and should only be used for testing)
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
233

234 235 236 237 238
#+BEGIN_SRC sh
./vidjil -c germlines file.fastq
   # Search for all the germlines and output statistics
   # on the number of occurrences in each germline
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
239

240
* Segmentation and .vdj format
241

242
Vidjil output includes segmentation of V(D)J recombinations. This happens
243 244
in the following situations:

245
- in a first pass, when requested with =-U= option, in a =.segmented.vdj.fa= file.
246 247 248 249 250 251 252

      The goal of this ultra-fast segmentation, based on a seed
      heuristics, is only to locate the w-window overlapping the
      CDR3. This should not be taken as a real V(D)J segmentation, as
      the center of the window may be shifted up to 15 bases from the
      actual center.

253
- in a second pass, on the standard output
254 255
    - at the end of the clones detection (=-c clones=, also in in
      =basename.vdj.fa=, where =basename= is the basename of the input file)
Mikaël Salson's avatar
Mikaël Salson committed
256
    - or directly when explicitly requiring segmentation (=-c segment=)
257 258

      This segmentation obtained by full comparison (dynamic
259
      programming) with all germline sequences. Such segmentation are
260 261 262 263
      not at the core of the Vidjil clone gathering method (which
      relies only on the 'window', see above). They are provided only
      for convenience and should be checked with other softwares such
      as IgBlast, iHHMune-align or IMGT/V-QUEST.
Mikaël Salson's avatar
Mikaël Salson committed
264 265

Segmentations of V(D)J recombinations are displayed using a dedicated
266
.vdj format. This format is compatible with FASTA format. A line starting
Mikaël Salson's avatar
Mikaël Salson committed
267 268
with a > is of the following form:

Mikaël Salson's avatar
Mikaël Salson committed
269
#+BEGIN_EXAMPLE
270
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments
Mikaël Salson's avatar
Mikaël Salson committed
271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295

        name          sequence name
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ", 
    	              or shorter tags such as "V" for incomplete sequences).	
		      The following line are for "VDJ" recombinations :

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
        
296
        comments      optional comments. In Vidjil, the following comments are now used:
297
                      - "seed" when this comes for the first pass (.segmented.vdj.fa). See the warning above.
298
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
Mikaël Salson's avatar
Mikaël Salson committed
299

Mikaël Salson's avatar
Mikaël Salson committed
300 301
#+END_EXAMPLE

Mikaël Salson's avatar
Mikaël Salson committed
302 303 304 305 306
Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
307
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  coments
Mikaël Salson's avatar
Mikaël Salson committed
308

Mikaël Salson's avatar
Mikaël Salson committed
309

310
* .vidjil and .json format and web interface
Mikaël Salson's avatar
Mikaël Salson committed
311

312
A summary of extracted windows is also available in a JSON format,
Mikaël Salson's avatar
Mikaël Salson committed
313
including, for each windows, the number of reads sharing this window.
314
The format of this file may change in future releases.
Mathieu Giraud's avatar
Mathieu Giraud committed
315

316
This file is used by the dynamic browser for visualization
Mathieu Giraud's avatar
Mathieu Giraud committed
317 318
and analysis of clones and their tracking along different samples,
(for example time points in a MRD setup or in a immunological study).
319
Please see the file [[file:browser.org][browser]].org for more information on the browser.