Attention une mise à jour du serveur va être effectuée le lundi 17 mai entre 13h et 13h30. Cette mise à jour va générer une interruption du service de quelques minutes.

algo.org 29.1 KB
Newer Older
1
#+TITLE: Vidjil Algorithm -- Command-line Manual
2
#+AUTHOR: The Vidjil team (Mathieu, Mikaël, Marc, Tatiana and Rayan)
3 4 5 6 7
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="../css/org-mode.css" />

# This manual can be browsed online:
#     http://www.vidjil.org/doc/algo.html               (last stable release)
#     http://git.vidjil.org/blob/master/doc/algo.org    (development version)
8

9 10 11
# Vidjil -- High-throughput Analysis of V(D)J Immune Repertoire -- [[http://www.vidjil.org]]
# Copyright (C) 2011, 2012, 2013, 2014, 2015 by Bonsai bioinformatics 
# at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille
Marc Duez's avatar
merge  
Marc Duez committed
12
# contact@vidjil.org
Mikaël Salson's avatar
Mikaël Salson committed
13

Mikaël Salson's avatar
Mikaël Salson committed
14 15 16 17
V(D)J recombinations in lymphocytes are essential for immunological
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
Mikaël Salson's avatar
Mikaël Salson committed
18

19
The Vidjil algorithm processes high-throughput sequencing data to extract V(D)J
Mathieu Giraud's avatar
Mathieu Giraud committed
20
junctions and gather them into clones. Vidjil starts 
Mikaël Salson's avatar
Mikaël Salson committed
21 22
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
23
to output all sequenced clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
24
because, in the first phase, no alignment is performed with database
25 26 27
germline sequences. At the end, only the representative sequences 
of each clone have to be analyzed. Vidjil can also cluster similar
clones, or leave this to the user after a manual review in the browser.
Mikaël Salson's avatar
Mikaël Salson committed
28

29
The method is described in the following reference:
Mikaël Salson's avatar
Mikaël Salson committed
30

Mikaël Salson's avatar
Mikaël Salson committed
31 32
Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
33 34
BMC Genomics 2014, 15:409
http://dx.doi.org/10.1186/1471-2164-15-409
Mikaël Salson's avatar
Mikaël Salson committed
35 36 37

Vidjil is open-source, released under GNU GPLv3 license.

38 39 40
* Requirements and installation

** Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
41

42
The Vidjil algorithm has been successfully tested on the following platforms :
Mikaël Salson's avatar
Mikaël Salson committed
43 44
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
45 46
 - Debian Squeeze 6.0
 - Debian Wheezy 7.0 amd64
Mikaël Salson's avatar
Mikaël Salson committed
47
 - Fedora 19
48
 - FreeBSD 9.2
49 50
 - Ubuntu 12.04 LTS amd64
 - Ubuntu 14.04 LTS amd64
Mikaël Salson's avatar
Mikaël Salson committed
51

Mathieu Giraud's avatar
Mathieu Giraud committed
52 53 54
Vidjil is developed with continuous integration using systematic unit and functional testing.
The development team internally uses [[https://jenkins-ci.org/][Jenkins]] for that.
Moreover, the results of some of these tests can be publicly checked on [[https://travis-ci.org/vidjil/vidjil][travis-ci.org]].
Mikaël Salson's avatar
Mikaël Salson committed
55

56 57 58 59 60 61
** Build requirements (optional)

This paragraph details the requirements to build Vidjil from source.
You can also download a static binary (see next paragraph, 'Installation').

To compile Vidjil, make sure:
62
  - to be on a POSIX system ;
63
  - to have a C++11 compiler (as =g++= 4.8 or above or =clang= 3.3 or above)
64 65 66
  - to have the =zlib= installed (=zlib1g-dev= package under Debian/Ubuntu,
    =zlib-devel= package under Fedora/CentOS).

67 68 69 70 71 72 73

*** CentOS 6

g++-4.8 is included in the devtools 2.0.

#+BEGIN_SRC sh
sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo
74 75 76 77 78
sudo yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++ devtoolset-2-valgrind

# scl enable devtoolset-2 bash     # either open a shell running devtools
source /opt/rh/devtoolset-2/enable # ... or source devtools in the same shell
#+END_SRC
79 80 81 82 83

*** FreeBSD 9.2

g++-4.8 is included in FreeBSD 9.2.

84 85 86 87
You may also need to install the =gzstream= library with:
#+BEGIN_SRC sh
pkg install gzstream
#+END_SRC
88

89 90 91
Also Vidjil uses GNU make which requires =gmake= under FreeBSD.
At the time of redacting the documentation, =g++= requires extra options to
ensure flawless compilation and execution of Vidjil:
92
#+BEGIN_SRC sh
93 94 95 96
make MAKE=gmake CXXFLAGS="-D_GLIBCXX_USE_C99 -Wl,-rpath=/usr/local/lib/gcc49"
#+END_SRC
The =gcc49= at the end of the command line is to be replaced by the =gcc= version
used. 
97
*** Debian Squeeze 6.0 / Wheezy 7.0
98 99 100 101 102 103

g++-4.8 should be pinned from testing.
Put in =/etc/apt/preferences= the following lines:

#+BEGIN_SRC sh
Package: *
104
Pin: release n=wheezy # (or squeeze)
105 106 107 108 109 110 111 112 113 114 115
Pin-Priority: 900

Package: g++-4.8, gcc-4.8, valgrind*
Pin: release n=jessie
Pin-Priority: 950
#+END_SRC

Then g++ 4.8 can be installed.

#+BEGIN_SRC sh
apt-get update
116
apt-get install -t jessie g++-4.8 valgrind
117 118 119
#+END_SRC


120 121 122 123
*** Ubuntu 14.04 LTS

#+BEGIN_SRC sh
sudo apt-get install g++-4.8
Mathieu Giraud's avatar
Mathieu Giraud committed
124
#+END_SRC
125 126 127 128 129 130 131 132 133 134

*** Ubuntu 12.04 LTS

g++-4.8 is included in the devtools 2.0.

#+BEGIN_SRC sh
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.8
Mathieu Giraud's avatar
Mathieu Giraud committed
135
#+END_SRC
136 137 138 139




140
** Installation
Mikaël Salson's avatar
Mikaël Salson committed
141

Mikaël Salson's avatar
Mikaël Salson committed
142
#+BEGIN_SRC sh
Mikaël Salson's avatar
Mikaël Salson committed
143 144

make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
145
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
146 147 148 149 150 151
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

152 153 154 155 156

# either
make                     # build Vijil from the sources (see the requirements, above)

# or
157
wget http://bioinfo.lifl.fr/vidjil/vidjil-2015.12_x86_64 -O vidjil
158 159
                         # download a static binary (built for x86_64 architectures)

Mikaël Salson's avatar
Mikaël Salson committed
160
./vidjil -h              # display help/usage
Mikaël Salson's avatar
Mikaël Salson committed
161
#+END_SRC
Mikaël Salson's avatar
Mikaël Salson committed
162

163
If your build system does not use C++11 by default, you should replace the =make= commands by:
164

165 166
#+BEGIN_SRC sh
make CXXFLAGS='-std=c++11'                           ### gcc-4.8
167
make CXXFLAGS='-std=c++11' LDFLAGS='-stdlib=libc++'  ### OS X Mavericks
168
#+END_SRC
169

170 171 172 173 174 175 176 177 178 179 180
** Self-tests (optional)

You can run the tests with the following commands:

#+BEGIN_SRC sh
make data
   # get IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

181
make test                # run self-tests (can take 5 to 60 minutes)
182 183
#+END_SRC

Mikaël Salson's avatar
Mikaël Salson committed
184

185
* Input and parameters
186 187

The main input file of Vidjil is a /set of reads/, given as a =.fasta=
188 189
or =.fastq= file, possibly compressed with gzip (=.gz=).
This set of reads can reach several gigabytes. It is
190 191 192
never loaded entirely in the memory, but reads are processed one by
one by the Vidjil algorithm.

193 194
The =-h= and =-H= help options provide the list of parameters that can be
used. We detail here the options of the main =-c clones= command.
195 196 197 198 199 200

The default options are very conservative (large window, no further
automatic clusterization, see below), leaving the user or other
software making detailed analysis and decisions on the final
clustering.

201 202 203 204 205
** Germline selection

#+BEGIN_EXAMPLE
Germline databases (one -V/(-D)/-J, or -G, or -g option must be given for all commands except -c germlines)
  -V <file>     V germline multi-fasta file
206
  -D <file>     D germline multi-fasta file (and resets -m and -w options), will segment into V(D)J components
207 208
  -J <file>     J germline multi-fasta file
  -G <prefix>   prefix for V (D) and J repertoires (shortcut for -V <prefix>V.fa -D <prefix>D.fa -J <prefix>J.fa) (basename gives germline code)
209 210 211 212
  -g <path>     multiple locus/germlines. In the path <path>, takes 'germlines.data' to select locus and parameters
                Selecting '-g germline' processes TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with some incomplete/unusal recombinations
                A different 'germlines.data' file can also be provided with -g <file>
  -i            multiple locus/germlines, also incomplete/unusual rearrangements (must be used with -g)
213 214 215 216
#+END_EXAMPLE

 - Options such as =-G germline/IGH= or =-G germline/TRG= select one germline system.
 - The =-V/(-D)/-J= options enable to select individual V, (D) and J repertoires (fasta files).
Mathieu Giraud's avatar
Mathieu Giraud committed
217
   This allows in particular to select incomplete rearrangement using custom V or J repertoires with added sequences.
218
 - The =-g germline/= option launches the analysis on the seven germlines, selecting the best locus for each read.
219
   Using =-g germline/ -i= stests also some incomplete and unusual recombinations (locus with a =+= in their name)
220
   See [[http://git.vidjil.org/blob/master/doc/locus.org][locus.org]] for information on the analyzable locus.
221 222 223
 - Analyzed locus and parameters are configured through the =germline/germlines.data= file.
   To select a custom set of TR or Ig locus, you may copy =germline/germlines.data= into a new file,
   as for example =germline/custom-germlines.data=, and run Vidjil with =-g germline/custom-germlines.data -i=.
224

Mathieu Giraud's avatar
Mathieu Giraud committed
225 226 227 228 229 230 231 232 233
** Main algorithm parameters

#+BEGIN_EXAMPLE
Window prediction
  (use either -s or -k option, but not both)
  -s <string>   spaced seed used for the V/J affectation
                (default: #####-#####, ######-######, #######-#######, depends on germline)
  -k <int>      k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
                (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
234
  -w <int>      w-mer size used for the length of the extracted window (default: 50)
235
  -e <float>    maximal e-value for determining if a segmentation can be trusted (default: 'all', no limit)
236
  -t <int>      trim V and J genes (resp. 5' and 3' regions) to keep at most <int> nt (default: 100) (0: no trim)
Mathieu Giraud's avatar
Mathieu Giraud committed
237 238
#+END_EXAMPLE

239
The =-s=, =-k= are the options of the seed-based heuristic. A detailed
240 241
explanation can be found in (Giraud, Salson and al., 2014).
/These options are for advanced usage, the defaults values should work./
Mathieu Giraud's avatar
Mathieu Giraud committed
242
The =-s= or =-k= option selects the seed used for the k-mer V/J affectation.
Mathieu Giraud's avatar
Mathieu Giraud committed
243 244

The =-w= option fixes the size of the "window" that is the main
245
identifier to gather clones. The default value (=-w 50=) was selected
246 247 248 249
to ensure a high-quality clone gathering: reads are clustered when
they /exactly/ share, at the nucleotide level, a 50 bp-window centered
on the CDR3. No sequencing errors are corrected inside this window.
The center of the "window", predicted by the high-throughput heuristic, may
Mikaël Salson's avatar
Mikaël Salson committed
250
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
251 252 253
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
in >99% of cases). The extracted window should be large enough to
fully contain the CDR3 as well as some part of the end of the V and
254
the start of the J, or at least some specific N region, to uniquely identify a clone.
Mathieu Giraud's avatar
Mathieu Giraud committed
255

256 257 258 259 260 261 262 263
Setting =-w= to higher values (such as =-w 60= or =-w 100=) makes the clone gathering
even more conservative, enabling to split clones with low specificity (such as IGH with very
large D, short or no N regions and almost no somatic hypermutations). However, such settings
may "segment" (analyze) less reads, depending on the read length of your data, and may also
return more clones, as any sequencing error in the window is not corrected.

Setting =-w= to lower values than 50 may "segment" (analyze) a few more reads, depending
on the read length of your data, but may in some cases falsely cluster reads from
264 265
different clones.
For VJ recombinations, the =-w 40= option is usually safe, and =-w 30= can also be tested.
266
Setting =-w= to lower values is not recommended.
Mathieu Giraud's avatar
Mathieu Giraud committed
267

268
The =-e= option sets the maximal e-value accepted for segmenting a sequence.
269
It is an upper bound on the number of exepcted windows found by chance by the seed-based heuristic.
270 271
The e-value computation takes into account both the number of reads in the
input sequence and the number of locus searched for.
272 273 274
The default value is 1.0, but values such as 1000, 1e-3 or even less can be used
to have a more or less permissive segmentation.
The threshold can be disabled with =-e all=.
275

Mikaël Salson's avatar
Mikaël Salson committed
276
The =-t= option sets the maximal number of nucleotides that will be indexed in
277 278
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
279
The default is =-t 100=.
280

281
** Thresholds on clone output
282 283 284 285

The following options control how many clones are output and analyzed.

#+BEGIN_EXAMPLE
286
Limits to report a clone (or a window)
287
  -r <nb>       minimal number of reads supporting a clone (default: 5)
288 289
  -% <ratio>    minimal percentage of reads supporting a clone (default: 0)

290
Limits to further analyze some clones
291
  -y <nb>       maximal number of clones computed with a representative ('all': no limit) (default: 100)
292
  -z <nb>       maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use) (default: 100)
293
  -A            reports and segments all clones (-r 1 -% 0 -y all -z all), to be used only on very small datasets
294 295
#+END_EXAMPLE

296
The =-r/-%= options are strong thresholds: if a clone does not have
297 298
the requested number of reads, the clone is discarded (except when
using =-l=, see below).
299
The default =-r 5= option is meant to only output clones that
300
have a significant read support. *You should use* =-r 1= *if you
301
want to detect all clones starting from the first read* (especially for
302 303
MRD detection).

304 305
The =-y= option limits the number of clones for which a representative
sequence is computed. Usually you do not need to have more
306
representatives (see below), but you can safely put =-y all= if you want
307 308
to compute all representative sequences.

309
The =-z= option limits the number of clones that are fully analyzed,
310
/with their V(D)J segmentation/, in particular to enable the web application
311
to display the clones on the grid (otherwise they are displayed on the
312
'?/?' axis).
313 314 315
If you want to analyze more clones, you should use =-z 200= or
=-z 500=.  It is not recommended to use larger values: outputting more
than 500 clones is often not useful since they can not be visualized easily
316
in the web application, and takes large computation time (full dynamic programming,
317
see below).
318

319
Note that even if a clone is not in the top 100 (or 200, or 500) but
320
still passes the =-r=, =-%= options, it is still reported in both the =.vidjil=
321
and =.vdj.fa= files. If the clone is at some MRD point in the top 100 (or 200, or 500),
Mikaël Salson's avatar
Mikaël Salson committed
322
it will be fully analyzed/segmented by this other point (and then
323
collected by the =fuse.py= script, using representatives computed at this
324
other point, and then, on the web application, correctly displayed on the grid).
325
*Thus is advised to leave the default* =-z 100= *option
326
for the majority of uses.*
327 328 329 330 331

The =-A= option disables all these thresholds. This option should be
used only for test and debug purposes, on very small datasets, and
produce large file and takes huge computation times.

Mikaël Salson's avatar
Mikaël Salson committed
332

333 334 335
** Labeled windows

Vidjil allows to indicate that specific windows that must be followed
336
(even if those windows are 'rare', below the =-r/-%= thresholds).
337 338 339

Such windows can be provided either with =-W <window>=, or with =-l <file>=.
The file given by =-l= should have one window by line, as in the following example:
Mikaël Salson's avatar
Mikaël Salson committed
340

341
#+BEGIN_EXAMPLE
342 343
GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1
GAGAGATGGACGGAATACGTTAAACGACATATGGTTCGGGGTATGGTGCT my-clone-2 foo
344 345 346
#+END_EXAMPLE

Windows and labels must be separed by one space.
Mikaël Salson's avatar
Mikaël Salson committed
347 348 349
The first column of the file is the window to be followed
while the remaining columns consist of the window's label.
In Vidjil output, the labels are output alongside their windows.
Mikaël Salson's avatar
Mikaël Salson committed
350

351 352 353 354
With the =-F= option, /only/ the labeld windows are kept. This allows
to quickly filter a set of reads, looking for a known window,
with the =-FaW <window>= options:
All the reads with this windows will be extracted to =out/seq/clone.fa-1=.
355

356 357 358
** VDJ assignation options
   The =-m= option controls the minimum difference of positions between the end
   of the V and the start of the J. Note that it is even possible to set =-m -10=
359 360
   (meaning that V and J could overlap 10 bp). This is the default for VJ recombinations
   (except when using a =germlines.data= file).
361

362
** Further clustering (experimental)
363

364 365
The following options are experimental and have no consequences on the =.vdj.fa= file,
nor on the standard output. They instead add a =clusters= sections in the =.vidjil= file
366
that will be visualized in the web application.
367

368 369 370 371 372
The =-n= option triggers an automatic clustering using DBSCAN algorithm (Ester and al., 1996).
Using =-n 5= usually cluster reads within a distance of 1 mismatch (default score
being +1 for a match and -4 for a mismatch). However, more distant reads can also
be clustered when there are more than 10 reads within the distance threshold.
This behaviour can be controlled with the =-N= option.
Mikaël Salson's avatar
Mikaël Salson committed
373

374
The =-E= option allows to specify a file for manually clustering two windows
Mikaël Salson's avatar
Mikaël Salson committed
375
considered as similar. Such a file may be automatically produced by vidjil
376
(=out/edges=), depending on the option provided. Only the two first columns
Mikaël Salson's avatar
Mikaël Salson committed
377
(separed by one space) are important to vidjil, they only consist of the 
Mikaël Salson's avatar
Mikaël Salson committed
378
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
379 380


381

382 383
* Output

384
** Main output files
385 386 387

The main output of Vidjil (with the default =-c clones= command) are two following files:

388
 - The =.vidjil= file is /the file for the Vidjil web application/.
389 390
   The file is in a =.json= format (detailed in [[file:format-analysis.org][format-analysis.org]])
   describing the windows and their count, the representatives (=-y=),
391
   the detailed V(D)J designation (=-z=, see warning below), and possibly
392 393
   the results of the further clustering.

394
   The web application takes this =.vidjil= file (possibly merged with
395 396 397
   =fuse.py=) for the /visualization and analysis/ of clones and their
   tracking along different samples (for example time points in a MRD
   setup or in a immunological study).
398
   Please see [[file:browser.org][browser]].org for more information on the web application.
399 400 401 402 403

 - The =.vdj.fa= file is /a FASTA file for further processing by other bioinformatics tools/.
   The sequences are at least the windows (and their count in the headers) or
   the representatives (=-y=) when they have been computed.
   The headers include the count of each window, and further includes the
404
   detailed V(D)J designation (=-z=, see warning below), given in a '.vdj' format, see below.
405 406 407 408 409 410 411 412 413
   The further clustering is not output in this file.

   The =.vdj.fa= output enables to use Vidjil as a /filtering tool/,
   shrinking a large read set into a manageable number of (pre-)clones
   that will be deeply analyzed and possibly further clustered by
   other software.


By default, the two output files are named =out/basename.vidjil= in =out/basename.vdj.fa=, where:
414
 - =out= is the directory where all the outputs are stored (can be changed with the =-o= option).
415 416
 - =basename= is the basename of the input =.fasta/.fastq= file (can be overriden with the =-b= option)

417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
** Auxiliary output files

The auxiliary files include =out/basename.windows.fa= (list of windows, with number of occurrences, as below)
and =out/seq/clone.fa-*= (detailed analysis by clone).

#+BEGIN_EXAMPLE
>8--window--1
ATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGG
>5--window--2
ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC
(...)
#+END_EXAMPLE

Windows of size 50 (modifiable by =-w=) have been extracted.
The first window has 8 occurrences, the second window has 5 occurrences.

433

434 435 436 437
** Unsegmentation causes

Vidjil output details statistics on the reads that are not segmented (not analyzed).
Basically, *an unsegmented read is a read where Vidjil cannot identify a window at the junction of V and J genes*.
438 439
To properly analyze a read, Vijdil needs that the sequence spans enough V region and J region
(or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations).
440 441 442 443 444 445 446 447
The following unsegmentation causes are reported:

|                     |                                                                                                                     |
|---------------------+---------------------------------------------------------------------------------------------------------------------|
| =UNSEG too short=   | Reads are too short, shorter than the seed (by default between 9 and 13 bp).                                        |
|---------------------+---------------------------------------------------------------------------------------------------------------------|
| =UNSEG strand=      | The strand is mixed in the read, with some similarities both with the =+= and the =-= strand.                       |
|---------------------+---------------------------------------------------------------------------------------------------------------------|
448
| =UNSEG too few V/J= | No information has been found on the read: There are not enough similarities neither with a V gene or a J gene.     |
449
|---------------------+---------------------------------------------------------------------------------------------------------------------|
450
| =UNSEG only V/5=    | Relevant similarities have been found with some V, but not enough with any J.                                       |
451
|---------------------+---------------------------------------------------------------------------------------------------------------------|
452
| =UNSEG only J/3=    | Relevant similarities have been found with some J, but not enough with any V.                                       |
453 454 455 456 457 458 459 460 461 462
|---------------------+---------------------------------------------------------------------------------------------------------------------|
| =UNSEG ambiguous=   | Vidjil finds some V and J similarities mixed together which makes the situation ambiguous and hardly solvable.      |
|---------------------+---------------------------------------------------------------------------------------------------------------------|
| =UNSEG too short w= | The junction can be identified but the read is too short so that Vidjil could extract the window (by default 50bp). |
|                     | It often means the junction is very close from one end of the read.                                                 |
|---------------------+---------------------------------------------------------------------------------------------------------------------|


Some datasets may give reads with many low =UNSEG too few= reads:

463
 -  =UNSEG too few V/J= usually happens when reads share almost nothing with the V(D)J region.
464 465
    This is expected when the PCR or capture-based approach included other regions, such as in whole RNA-seq.

466
 - =UNSEG only V/5= and =UNSEG only J/3= happen when reads do not span enough the junction zone.
467 468 469 470 471 472
    Vidjil detects a “window” including the CDR3. By default this window is 50bp long,
    so the read needs be that long centered on the junction.

See [[http://git.vidjil.org/blob/master/doc/browser.org][browser.org]] for information on the biological or sequencing causes that can lead to few segmented reads.


473 474 475 476 477 478 479 480 481 482 483 484 485 486
** Filtering reads

It is possible to extract all segmented or unsegmented reads, possibly to give them to  other software.
Runing Vidjil with =-U= gives a file =out/basename.unsegmented.vdj.fa=, with all segmented reads.
On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file.
However, the =-U= option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations.

Similarly, two options are available to get the unsegmented reads:
   - =-u= gives a file =out/basename.segmented.vdj.fa=, with unsegmented reads.
   - =-uu= gives a set of files =out/basename.UNSEG_*=, with unsegmented reads gathered by unsegmentation cause

Again, as these options may generate large files, they are generally not recommended.
However, they are very useful in some situations, especially to understand why some dataset gives poor segmentation result.
For example =-uu -X 1000= splits the unsegemented reads from the 1000 first reads.
487

Mikaël Salson's avatar
Mikaël Salson committed
488

489
** Segmentation and .vdj format
490

491
Vidjil output includes segmentation of V(D)J recombinations. This happens
492 493
in the following situations:

494
- in a first pass, when requested with =-U= option, in a =.segmented.vdj.fa= file.
495 496

      The goal of this ultra-fast segmentation, based on a seed
497 498
      heuristics, is only to identify the locus and to locate the w-window overlapping the
      CDR3. This should not be taken as a real V(D)J designation, as
499 500 501
      the center of the window may be shifted up to 15 bases from the
      actual center.

502
- in a second pass, on the standard output and in both =.vidjil= and =.vdj.fa= files
503 504
        - at the end of the clones detection (default command =-c clones=,
          on a number of clones limited by the =-z= option)
505
        - or directly when explicitly requiring segmentation (=-c segment=)
506

507 508 509
      These V(D)J designations are obtained by full comparison (dynamic programming)
      with all germline sequences.

510 511
      Note that these designations are relatively slow to compute, especially
      for the IGH locus. However, they
512 513 514 515 516
      are not at the core of the Vidjil clone gathering method (which
      relies only on the 'window', see above).
      To check the quality of these designations, the automated test suite include
      sequences with manually curated V(D)J designations (see [[http://git.vidjil.org/blob/master/doc/should-vdj.org][should-vdj.org]]).

Mikaël Salson's avatar
Mikaël Salson committed
517 518

Segmentations of V(D)J recombinations are displayed using a dedicated
519
=.vdj= format. This format is compatible with FASTA format. A line starting
Mikaël Salson's avatar
Mikaël Salson committed
520 521
with a > is of the following form:

Mikaël Salson's avatar
Mikaël Salson committed
522
#+BEGIN_EXAMPLE
523
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments
Mikaël Salson's avatar
Mikaël Salson committed
524

525
        name          sequence name (include the number of occurrences in the read set and possibly other information)
Mikaël Salson's avatar
Mikaël Salson committed
526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ", 
    	              or shorter tags such as "V" for incomplete sequences).	
		      The following line are for "VDJ" recombinations :

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
        
549
        comments      optional comments. In Vidjil, the following comments are now used:
550
                      - "seed" when this comes for the first pass (.segmented.vdj.fa). See the warning above.
551
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
552 553
                      - the name of the locus (TRA, TRB, TRG, TRD, IGH, IGL, IGK, possibly followed
                        by a + for incomplete/unusual recombinations)
Mikaël Salson's avatar
Mikaël Salson committed
554

Mikaël Salson's avatar
Mikaël Salson committed
555 556
#+END_EXAMPLE

Mikaël Salson's avatar
Mikaël Salson committed
557 558 559 560 561
Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
Mikaël Salson's avatar
Mikaël Salson committed
562

563
#+BEGIN_EXAMPLE
564
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  comments
565
#+END_EXAMPLE
566 567 568 569 570 571 572


* Examples of use

All the following examples are on a IGH VDJ recombinations : they thus
require either the =-G germline/IGH= option, or the multi-germline =-g germline= option.

573 574
** Basic usage: PCR-based datasets, with primers in the V(D)J regions (such as BIOMED-2 primers)

575 576
#+BEGIN_SRC sh
./vidjil -G germline/IGH data/Stanford_S22.fasta
577 578
   # Gather the reads into clones, based on windows overlapping IGH CDR3s.
   # Summary of clones is available both on stdout, in out/Stanford_S22.vdj.fa and in out/Stanford_S22.vidjil.
579 580
#+END_SRC

581 582 583 584 585 586
#+BEGIN_SRC sh
./vidjil -g germline -i data/reads.fasta
   # Detects for each read the best locus, including an analysis of incomplete/unusual recombinations
   # Gather the reads into clones, again based on windows overlapping the detected CDR3s.
   # Summary of clones is available both on stdout, in out/reads.vdj.fa and in out/reads.vidjil.
#+END_SRC
587 588


589
** Basic usage: Whole RNA-Seq or capture datasets
590 591

#+BEGIN_SRC sh
592 593 594 595 596
./vidjil -g germline -i -U data/reads.fasta
   # Detects for each read the best locus, including an analysis of incomplete/unusual recombinations
   # Gather the reads into clones, again based on windows overlapping the detected CDR3s.
   # Summary of clones is available both on stdout, in out/reads.vdj.fa and in out/reads.vidjil.
   # The out/reads.segmented.vdj.fa include all reads where a V(D)J recombination was found
597 598 599
#+END_SRC


600 601
** Advanced usage

602 603
#+BEGIN_SRC sh
./vidjil -c clones -G germline/IGH -r 1 ./data/clones_simul.fa
604
   # Extracts the windows with at least 1 read each (-r 1, the default being -r 5)
605 606 607 608 609 610 611 612
   # then gather them into clones
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c clones -G germline/IGH -r 1 -n 5 ./data/clones_simul.fa
   # Window extraction + clone gathering,
   # with automatic clustering, distance five (-n 5)
   # The result of the automatic clustering is in the .vidjil file
613
   # and can been seen/edited in the web application.
614 615 616 617
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c segment -G germline/IGH data/segment_S22.fa
618 619
   # Detailed V(D)J designation on all reads
   # (this is slow and should only be used for testing, or on a small file)
620 621 622 623 624 625
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c germlines file.fastq
   # Output statistics on the number of occurrences of k-mers of the different germlines
#+END_SRC