algo.org 13.3 KB
Newer Older
1
2
3
#+TITLE: Vidjil -- Algo Manual
#+AUTHOR: The Vidjil team (Mathieu, Mikaël and Marc)

Mathieu Giraud's avatar
Mathieu Giraud committed
4
# Vidjil -- V(D)J recombinations analysis -- [[http://www.vidjil.org]]
Mathieu Giraud's avatar
Mathieu Giraud committed
5
# Copyright (C) 2011, 2012, 2013, 2014 by Bonsai bioinformatics at LIFL (UMR CNRS 8022, Université Lille) and Inria Lille
Marc Duez's avatar
merge    
Marc Duez committed
6
# contact@vidjil.org
Mikaël Salson's avatar
Mikaël Salson committed
7

Mikaël Salson's avatar
Mikaël Salson committed
8
9
10
11
V(D)J recombinations in lymphocytes are essential for immunological
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
Mikaël Salson's avatar
Mikaël Salson committed
12

13
Vidjil processes high-throughput sequencing data to extract V(D)J
Mathieu Giraud's avatar
Mathieu Giraud committed
14
junctions and gather them into clones. Vidjil starts 
Mikaël Salson's avatar
Mikaël Salson committed
15
16
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
17
to output the most abundant clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
18
because, in the first phase, no alignment is performed with database
19
germline sequences. Vidjil can also cluster similar
Mikaël Salson's avatar
Mikaël Salson committed
20
21
22
clones, or leave this to the user after a manual review. 

The method is described in the following paper:
Mikaël Salson's avatar
Mikaël Salson committed
23

Mikaël Salson's avatar
Mikaël Salson committed
24
25
Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
26
27
BMC Genomics 2014, 15:409
http://dx.doi.org/10.1186/1471-2164-15-409
Mikaël Salson's avatar
Mikaël Salson committed
28
29
30

Vidjil is open-source, released under GNU GPLv3 license.

31
* Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
32
33
34
35
36
37
38
39
40
41
42
43

Vidjil has been successfully tested on the following platforms :
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
 - Debian Squeeze 
 - Fedora 17
 - FreeBSD 9.1 amd64
 - NetBSD 6.0.1 amd64
 - Ubuntu 12.04 amd64
 - Ubuntu 12.04 i386


44
* Installation
Mikaël Salson's avatar
Mikaël Salson committed
45

Mikaël Salson's avatar
Mikaël Salson committed
46
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
47
48
49
50
51
52
53
make data
   # get some IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
54
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
55
56
57
58
59
60
61
62
63
64
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

make                     # compile Vijil
make test                # run self-tests

./vidjil -h              # display help/usage
Mikaël Salson's avatar
Mikaël Salson committed
65
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
66
67


68
* Vidjil parameters
Mikaël Salson's avatar
Mikaël Salson committed
69

Mikaël Salson's avatar
Mikaël Salson committed
70
Launching vidjil with =-h= option provides the list of parameters that can be
Mikaël Salson's avatar
Mikaël Salson committed
71
72
used.

Mathieu Giraud's avatar
Mathieu Giraud committed
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
** Main algorithm parameters

#+BEGIN_EXAMPLE
Window prediction
  (use either -s or -k option, but not both)
  -s <string>   spaced seed used for the V/J affectation
                (default: #####-#####, ######-######, #######-#######, depends on germline)
  -k <int>      k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
                (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
  -w <int>      w-mer size used for the length of the extracted window (default: 40)(default with -d: 60)
#+END_EXAMPLE

The =-s= and =-k= options are the options of the heuristic. A detailed
explanation can be found in the paper. More help on that will be
available in the following months. The defaults values should work.

The =-w= option fixes the size of the "window" that is the main
identifier to gather clones. The defaults values (40 for TRG, 60 for
IGH) were selected to ensure a high-quality clone gathering. The
high-throughput heuristic predicts the center of the "window" that may
Mikaël Salson's avatar
Mikaël Salson committed
93
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
94
95
96
97
98
99
100
101
102
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
in >99% of cases). The extracted window should be large enough to
fully contain the CDR3 as well as some part of the end of the V and
the start of the J to uniquely identify a clone.

Setting =-w= to 30 for TRG and 50 for IGH may "segment" (analyze) a
few more reads, but may in some rare cases falsely cluster reads from
different clones.  Setting =-w= to lower values is not recommended.

103
104
105
106
107
** Threshold on clone output

The following options control how many clones are output and analyzed.

#+BEGIN_EXAMPLE
108
Limits to report a clone (or a window)
109
  -r <nb>       minimal number of reads supporting a clone (default: 10)
110
111
  -% <ratio>    minimal percentage of reads supporting a clone (default: 0)

112
Limits to further analyze some clones
113
114
115
  -y <nb>       maximal number of clones computed with a representative ('all': no limit) (default: 100)
  -z <nb>       maximal number of clones to be segmented ('all': no limit, do not use) (default: 20)
  -A            reports and segments all clones (-r 1 -% 0 -y all -z all), to be used only on very small datasets
116
117
#+END_EXAMPLE

118
The =-r/-%= options are strong thresholds: if a clone does not have
119
120
the requested number of reads, the clone is discarded (except when
using =-l=, see below).
121
The default =-r 10= option is meant to only output clones that
122
123
have a significant read support. *You shoud use* =-r 1= *if you
want to detect all clones starting from the first read* (especially for
124
125
MRD detection).

126
127
The =-y= option limits the number of clones for which a representative
sequence is computed. Usually you do not need to have more
128
representatives (see below), but you can safely put =-y all= if you want
129
130
to compute all representative sequences.

131
132
133
The =-z= option limits the number of clones that are fully analyzed,
/with their V(D)J segmentation/, in particular to enable the browser
to display the clones on the grid (otherwise they are displayed on the
134
'?/?' axis). It should be smaller than =-y=.
Mikaël Salson's avatar
Mikaël Salson committed
135
136
137
138
If you want to analyze more clones, you should use =-z 50= or
=-z 100=.  It is not recommended to use larger values: outputting more
than 100 clones is often not useful since they can't be visualized easily
in the browser, and takes large computation time.
139

140
Note that even if a clone is not in the top 20 (or 50, or 100) but
141
still passes the =-r=, =-%= options, it is still reported in the .data
142
file. If the clone is at some MRD point in the top 20 (or 50, or 100),
Mikaël Salson's avatar
Mikaël Salson committed
143
it will be fully analyzed/segmented by this other point (and then
144
145
146
147
collected by the =fuse.py= script, using representatives computed at this
other point, and then, on the browser, correctly displayed on the grid). 
*Thus is advised to leave the default* =-y 100 -z 20= *options 
for the majority of uses.*
148
149
150
151
152

The =-A= option disables all these thresholds. This option should be
used only for test and debug purposes, on very small datasets, and
produce large file and takes huge computation times.

153
** Force to follow some sequences
Mikaël Salson's avatar
Mikaël Salson committed
154

Mikaël Salson's avatar
Mikaël Salson committed
155
Vidjil allows to specify a list of windows that must be followed
156
(even if those windows are 'rare', below the =-r/-R/-%= thresholds).
Mikaël Salson's avatar
Mikaël Salson committed
157
The parameter =-l= is made for providing such a list in a file following
Mikaël Salson's avatar
Mikaël Salson committed
158
the following format: window label (separed by one space)
Mikaël Salson's avatar
Mikaël Salson committed
159

Mikaël Salson's avatar
Mikaël Salson committed
160
161
162
The first column of the file is the window to be followed
while the remaining columns consist of the window's label.
In Vidjil output, the labels are output alongside their windows.
Mikaël Salson's avatar
Mikaël Salson committed
163

164
165
166
167
168
169
170
171

** Further clustering

These options have no consequences on the visualization through the
browser. They are intented for a command-line use only.

Setting the =-n= option triggers an additional automatic
clustering using DBSCAN algorithm (Ester and al., 1996). 
Mikaël Salson's avatar
Mikaël Salson committed
172

Mikaël Salson's avatar
Mikaël Salson committed
173
The =-e= option allows to specify a file for manually clustering two windows
Mikaël Salson's avatar
Mikaël Salson committed
174
175
176
considered as similar. Such a file may be automatically produced by vidjil
(out/edges), depending on the option provided. Only the two first columns 
(separed by one space) are important to vidjil, they only consist of the 
Mikaël Salson's avatar
Mikaël Salson committed
177
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
178
179


180
* Examples of use
Mikaël Salson's avatar
Mikaël Salson committed
181
182

All the following examples are on a IGH VDJ recombinations : they thus
Mikaël Salson's avatar
Mikaël Salson committed
183
require the =-G germline/IGH= and the =-d= options.
Mikaël Salson's avatar
Mikaël Salson committed
184

Mikaël Salson's avatar
Mikaël Salson committed
185
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
186
./vidjil -G germline/IGH -d data/Stanford_S22.fasta
Mikaël Salson's avatar
Mikaël Salson committed
187
   # Extract (with an ultra-fast heuristic) all windows
188
   # Summary of windows is available in out/vidjil.data
189
   # ('.data' format, see below)
190
191
192
193
   # To have detailed/debug results in out/segmented.vdj.fa
   # (which is a FASTA file embedding heuristic information 
      in the headers, '.vdj' format, see warning below)
   # run Vidjil with option '-U'
Mikaël Salson's avatar
Mikaël Salson committed
194
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
195

Mikaël Salson's avatar
Mikaël Salson committed
196
#+BEGIN_EXAMPLE
Mikaël Salson's avatar
Mikaël Salson committed
197
198
199
200
>8--window--1 
CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA
>5--window--2 
CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG
Mikaël Salson's avatar
Mikaël Salson committed
201
(...)
Mikaël Salson's avatar
Mikaël Salson committed
202
#+END_EXAMPLE
Mikaël Salson's avatar
Mikaël Salson committed
203

Mikaël Salson's avatar
Mikaël Salson committed
204
   Windows of size 60 (modifiable by =-w=) have been extracted.
Mikaël Salson's avatar
Mikaël Salson committed
205
   The first window has 8 occurrences, the second window has 5 occurrences.
Mikaël Salson's avatar
Mikaël Salson committed
206

Mikaël Salson's avatar
Mikaël Salson committed
207
#+BEGIN_SRC sh
208
./vidjil -c clones -G germline/IGH -r 1 -R 1 -d ./data/clones_simul.fa
Mikaël Salson's avatar
Mikaël Salson committed
209
   # Extracts the windows (-r 1, with at least 1 read each), 
210
211
   # then gather them into clones 
   # A more natural option could be -r 5.
Mathieu Giraud's avatar
Mathieu Giraud committed
212
   # For debug purpose, if one wants all the clones, use the option -A.
213
214
215
216
   # Results are both
   #  - on the standard output
   #  - in out/clones.vdj.fa (fasta file to be processed by other tools)
   #  - in out/vidjil.data (for the browser)
217
218
219
   # Additional files are in out/seq/windows.fa-* and out/seq/clone.fa-*
   # If one adds the '-U' option, an additonal out/segmented.vdj.fa file is produced,
   # listing segmented reads using the .vdj format (see below)
Mikaël Salson's avatar
Mikaël Salson committed
220
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
221

Mikaël Salson's avatar
Mikaël Salson committed
222
#+BEGIN_SRC sh
223
./vidjil -c clones -G germline/IGH -r 1 -n 5 -d ./data/clones_simul.fa
Mikaël Salson's avatar
Mikaël Salson committed
224
   # Window extraction + clone gathering,
225
   # with automatic clustering, distance five (-n 5)
Mikaël Salson's avatar
Mikaël Salson committed
226
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
227

Mikaël Salson's avatar
Mikaël Salson committed
228
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
229
./vidjil -c segment -G germline/IGH -d data/segment_S22.fa
Mikaël Salson's avatar
Mikaël Salson committed
230
231
232
   # Segment the reads onto VDJ germline 
   # (this is slow and should only be used for testing)
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
233

234
235
236
237
238
#+BEGIN_SRC sh
./vidjil -c germlines file.fastq
   # Search for all the germlines and output statistics
   # on the number of occurrences in each germline
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
239

240
* Segmentation and .vdj format
241

242
Vidjil output includes segmentation of V(D)J recombinations. This happens
243
244
in the following situations:

245
- in a first pass, when requested with =-U= option, in =segmented.vdj.fa= file.
246
247
248
249
250
251
252

      The goal of this ultra-fast segmentation, based on a seed
      heuristics, is only to locate the w-window overlapping the
      CDR3. This should not be taken as a real V(D)J segmentation, as
      the center of the window may be shifted up to 15 bases from the
      actual center.

253
- in a second pass, on the standard output
Mikaël Salson's avatar
Mikaël Salson committed
254
255
    - at the end of the clones detection (=-c clones=, also in in =clones.vdj.fa=)
    - or directly when explicitly requiring segmentation (=-c segment=)
256
257

      This segmentation obtained by full comparison (dynamic
258
      programming) with all germline sequences. Such segmentation are
259
260
261
262
      not at the core of the Vidjil clone gathering method (which
      relies only on the 'window', see above). They are provided only
      for convenience and should be checked with other softwares such
      as IgBlast, iHHMune-align or IMGT/V-QUEST.
Mikaël Salson's avatar
Mikaël Salson committed
263
264

Segmentations of V(D)J recombinations are displayed using a dedicated
265
.vdj format. This format is compatible with FASTA format. A line starting
Mikaël Salson's avatar
Mikaël Salson committed
266
267
with a > is of the following form:

Mikaël Salson's avatar
Mikaël Salson committed
268
#+BEGIN_EXAMPLE
269
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments
Mikaël Salson's avatar
Mikaël Salson committed
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294

        name          sequence name
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ", 
    	              or shorter tags such as "V" for incomplete sequences).	
		      The following line are for "VDJ" recombinations :

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
        
295
296
297
        comments      optional comments. In Vidjil, the following comments are now used:
                      - "seed" when this comes for the first pass (segmented.vdj.fa). See the warning above.
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
Mikaël Salson's avatar
Mikaël Salson committed
298

Mikaël Salson's avatar
Mikaël Salson committed
299
300
#+END_EXAMPLE

Mikaël Salson's avatar
Mikaël Salson committed
301
302
303
304
305
Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
306
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  coments
Mikaël Salson's avatar
Mikaël Salson committed
307

Mikaël Salson's avatar
Mikaël Salson committed
308

309
* vidjil.data .json format and web interface
Mikaël Salson's avatar
Mikaël Salson committed
310

311
A summary of extracted windows is also available in a JSON format,
Mikaël Salson's avatar
Mikaël Salson committed
312
including, for each windows, the number of reads sharing this window.
313
The format of this file may change in future releases.
Mathieu Giraud's avatar
Mathieu Giraud committed
314

315
This file is used by the dynamic browser for visualization
Mathieu Giraud's avatar
Mathieu Giraud committed
316
317
and analysis of clones and their tracking along different samples,
(for example time points in a MRD setup or in a immunological study).
318
Please see the file [[file:browser.org][browser]].org for more information on the browser.