vidjil-algo.md 42.8 KB
Newer Older
1
# vidjil-algo 2018.10-dev
2 3
**Command-line manual**

4
*The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Tatiana and Rayan)*
Mikaël Salson's avatar
Mikaël Salson committed
5

6 7 8 9
```
  Vidjil -- High-throughput Analysis of V(D)J Immune Repertoire -- [[http://www.vidjil.org]]
  Copyright (C) 2011-2018 by Bonsai bioinformatics
  at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille
10
  and VidjilNet consortium.
11 12 13 14 15 16
  contact@vidjil.org
```

This is the help of vidjil-algo, for command-line usage.
This manual can be browsed online:

17 18
 - <http://www.vidjil.org/doc/vidjil-algo>                (last stable release)
 - <http://gitlab.vidjil.org/blob/dev/doc/vidjil-algo.md> (development version)
19 20 21 22 23 24 25

Other documentation (users and administrators of the web application, developpers) can be found from <http://www.vidjil.org/doc/>.


## About

*V(D)J recombinations* in lymphocytes are essential for immunological
Mikaël Salson's avatar
Mikaël Salson committed
26 27 28
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
Mikaël Salson's avatar
Mikaël Salson committed
29

30
Vidjil-algo processes high-throughput sequencing data to extract V(D)J
31
junctions and gather them into clones. Vidjil-algo starts
Mikaël Salson's avatar
Mikaël Salson committed
32 33
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
34
to output all sequenced clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
35
because, in the first phase, no alignment is performed with database
36
germline sequences. At the end, only the consensus sequences
37
of each clone have to be analyzed. Vidjil-algo can also cluster similar
38
clones, or leave this to the user after a manual review in the web application.
Mikaël Salson's avatar
Mikaël Salson committed
39

40 41
The method is described in the following references:

42
- Marc Duez et al.,
43
“Vidjil: A web platform for analysis of high-throughput repertoire sequencing”,
44
PLOS ONE 2016, 11(11):e0166126
45
<http://dx.doi.org/10.1371/journal.pone.0166126>
Mikaël Salson's avatar
Mikaël Salson committed
46

47
- Mathieu Giraud, Mikaël Salson, et al.,
Mikaël Salson's avatar
Mikaël Salson committed
48
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
49
BMC Genomics 2014, 15:409
50
<http://dx.doi.org/10.1186/1471-2164-15-409>
Mikaël Salson's avatar
Mikaël Salson committed
51

52
Vidjil-algo is open-source, released under GNU GPLv3 license.
Mikaël Salson's avatar
Mikaël Salson committed
53

54
# Requirements and installation
55

56
## Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
57

58
Vidjil-algo has been successfully tested on the following platforms :
59 60 61 62 63 64 65 66 67 68 69

  - CentOS 6.3 amd64
  - CentOS 6.3 i386
  - CentOS 7.2 i386
  - Debian Squeeze 6.0
  - Debian Wheezy 7.0 amd64
  - Fedora 19
  - FreeBSD 9.2
  - Ubuntu 12.04 LTS amd64
  - Ubuntu 14.04 LTS amd64
  - Ubuntu 16.04 LTS
70
  - Ubuntu 18.04 LTS
71
  - OS X 10.9, 10.10, 10.11
Mikaël Salson's avatar
Mikaël Salson committed
72

73 74
Use Vidjil-algo 2018.07.2 for a wide support of these platforms.

75
Vidjil-algo is developed with continuous integration using systematic unit and functional testing.
76
The development team internally uses [Gitlab CI](http://gitlab.vidjil.org/pipelines) and [Jenkins](https://jenkins-ci.org/) for that.
Mikaël Salson's avatar
Mikaël Salson committed
77

78
## Build requirements (optional)
79

80
This paragraph details the requirements to build Vidjil-algo from source.
81
You can also download a static binary, see [installation](#installation).
82

83
To compile Vidjil-algo, make sure:
84

85
  - to be on a POSIX system ;
86
  - to have a C++11 compiler (as `g++` 4.8 or above, `g++` 7.3 being supported, or `clang` 3.3 or above).
87 88 89
    Note that the 2018.10-dev version of Vidjil-algo does not work with `g++` 8.0 or above and with `clang`.
    Use the 2018.07.2 version for a wider support.
    
90 91
  - to have the `zlib` installed (`zlib1g-dev` package under Debian/Ubuntu,
    `zlib-devel` package under Fedora/CentOS).
92

93
### CentOS 6
94 95 96

g++-4.8 is included in the devtools 2.0.

97
``` bash
98
sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo
99 100 101 102
sudo yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++ devtoolset-2-valgrind

# scl enable devtoolset-2 bash     # either open a shell running devtools
source /opt/rh/devtoolset-2/enable # ... or source devtools in the same shell
103
```
104

105
### CentOS 7.2
106 107 108

g++-4.8 is included.

109
### FreeBSD 9.2
110 111 112

g++-4.8 is included in FreeBSD 9.2.

113 114 115
You may also need to install the `gzstream` library with:

``` bash
116
pkg install gzstream
117
```
118

119 120
Also Vidjil-algo uses GNU make which requires `gmake` under FreeBSD.
At the time of redacting the documentation, `g++` requires extra options to
121
ensure flawless compilation and execution of Vidjil-algo:
122 123

``` bash
124
make MAKE=gmake CXXFLAGS="-std=c++11 -O2 Wall -D_GLIBCXX_USE_C99 -Wl,-rpath=/usr/local/lib/gcc49"
125 126 127 128 129 130
```

The `gcc49` at the end of the command line is to be replaced by the `gcc` version
used.

### Debian Squeeze 6.0 / Wheezy 7.0
131 132

g++-4.8 should be pinned from testing.
133
Put in `/etc/apt/preferences` the following lines:
134

135
``` bash
136
Package: *
137
Pin: release n=wheezy # (or squeeze)
138 139 140 141 142
Pin-Priority: 900

Package: g++-4.8, gcc-4.8, valgrind*
Pin: release n=jessie
Pin-Priority: 950
143
```
144 145 146

Then g++ 4.8 can be installed.

147
``` bash
148
apt-get update
149
apt-get install -t jessie g++-4.8 valgrind
150
```
151

152 153 154 155
### Ubuntu 16.04 LTS, Ubuntu 18.04 LTS

Recent versions of `g++` are included.

156
### Ubuntu 14.04 LTS
157

158
``` bash
159
sudo apt-get install g++-4.8
160
```
161

162
### Ubuntu 12.04 LTS
163 164 165

g++-4.8 is included in the devtools 2.0.

166
``` bash
167 168 169 170
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.8
171
```
172

173
### OS X
174 175 176

Xcode should be installed first.

177
## Installation
178

179 180 181 182 183
### Download

Stables releases can be downloaded from <http://www.vidjil.org/releases> or <http://bioinfo.lifl.fr/vidjil/>.
Development code is found at <http://gitlab.vidjil.org>.

184
### Compiling
Mikaël Salson's avatar
Mikaël Salson committed
185

186 187
Running `make` from the extracted archive should be enough to install vidjil-algo with germline and demo files.
It runs the three following `make` commands.
188

189
``` bash
Mikaël Salson's avatar
Mikaël Salson committed
190

191
make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
192
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
193 194 195 196 197 198
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

199

200
make vijdil-algo         # build vijil-algo from the sources (see the requirements,
201 202
                         # another option is: wget http://www.vidjil.org/releases/vidjil-algo-latest_x86_64 -O vidjil-algo
                         # to download a static binary (built for x86_64 architectures)
203

204
make demo                # download demo files (S22 and L4, see demo/get-sequences)
205

206
./vidjil-algo -h         # display help/usage
207
```
Mikaël Salson's avatar
Mikaël Salson committed
208

209
On some older systems you may need to replace the `make` commands with:
210

211
``` bash
212
make LDFLAGS='-stdlib=libc++'  ### OS X Mavericks
213 214 215
```

## Self-tests (optional)
216 217 218

You can run the tests with the following commands:

219
``` bash
220
make -C src/tests/data
Mathieu Giraud's avatar
Mathieu Giraud committed
221
   # get IGH recombinations from a single individual, as described in:
222 223 224 225
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

226
make -C src test                # run self-tests (can take 5 to 60 minutes)
227
```
Mikaël Salson's avatar
Mikaël Salson committed
228

229
# Input and parameters
230

231 232 233
The main input file of Vidjil-algo is a *set of reads*, given as a `.fasta`
or `.fastq` file, possibly compressed with gzip (`.gz`).
This set of reads can reach several gigabytes and 2\*10<sup>9</sup> reads. It is
234
never loaded entirely in the memory, but reads are processed one by
235 236
one by Vidjil-algo.
Vidjil-algo can also process BAM files, but please note that:
237

238 239 240 241 242 243
1.  The reads don't need to be aligned beforehand.
2.  In case of paired-end sequencing, the reads must have already been merged
    in the BAM file.

The `-h` and `-H` help options provide the list of parameters that can be
used. We detail here the options of the main `-c clones` command.
244 245 246 247 248 249

The default options are very conservative (large window, no further
automatic clusterization, see below), leaving the user or other
software making detailed analysis and decisions on the final
clustering.

250
## Recombination / locus selection
251

252 253

``` diff
254 255 256 257 258 259 260 261 262 263 264 265
Germline presets (at least one -g or -V/(-D)/-J option must be given)
  -g GERMLINES ...
         -g <.g FILE>(:FILTER)
                    multiple locus/germlines, with tuned parameters.
                    Common values are '-g germline/homo-sapiens.g' or '-g germline/mus-musculus.g'
                    The list of locus/recombinations can be restricted, such as in '-g germline/homo-sapiens.g:IGH,IGK,IGL'
         -g PATH
                    multiple locus/germlines, shortcut for '-g PATH/homo-sapiens.g',
                    processes human TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with some incomplete/unusal recombinations
  -V FILE ...                 custom V germline multi-fasta file(s)
  -D FILE ...                 custom D germline multi-fasta file(s), segment into V(D)J components
  -J FILE ...                 custom V germline multi-fasta file(s)
266 267

Locus/recombinations
268 269
  -d                          try to detect several D (experimental)
  -2                          try to detect unexpected recombinations (must be used with -g)
270
```
271

272
The `germline/*.g` presets configure the analyzed recombinations.
273 274
The following presets are provided:

275
  - `germline/homo-sapiens.g`: Homo sapiens, TR (`TRA`, `TRB`, `TRG`, `TRD`) and Ig (`IGH`, `IGK`, `IGL`) locus,
276
    including incomplete/unusal recombinations (`TRA+D`, `TRB+`, `TRD+`, `IGH+`, `IGK+`, see [locus](locus)).
277 278 279 280 281
  - `germline/homo-sapiens-isotypes.g`: Homo sapiens heavy chain locus, looking for sequences with, on one side, IGHJ (or even IGHV) genes,
    and, on the other side, an IGH constant chain.
  - `germline/homo-sapiens-cd.g`: Homo sapiens, common CD genes (experimental, does not check for recombinations)
  - `germline/mus-musculus.g`: Mus musculus (strains BALB/c and C57BL/6)
  - `germline/rattus-norvegicus.g`: Rattus norvegicus (strains BN/SsNHsdMCW and Sprague-Dawley)
282

283
New `germline/*.g` presets for other species or for custom recombinations can be created, possibly referring to other `.fasta` files.
284 285
Please contact us if you need help in configuring other germlines.

286 287 288 289
  - Recombinations can be filtered, such as in
    `-g germline/homo-sapiens.g:IGH` (only IGH, complete recombinations),
    `-g germline/homo-sapiens.g:IGH,IGH+` (only IGH, as well with incomplete recombinations)
    or `-g germline/homo-sapiens.g:TRA,TRB,TRG` (only TR locus, complete recombinations).
290

291
  - Several presets can be loaded at the same time, as for instance `-g germline/homo-sapiens.g -g germline/germline/homo-sapiens-isotypes.g`.
292

293
  - Using `-2` further test unexpected recombinations (tagged as `xxx`), as in `-g germline/homo-sapiens.g -2`.
294

295
Finally, the advanced `-V/(-D)/-J` options enable to select custom V, (D) and J repertoires given as `.fasta` files.
296

297
## Main algorithm parameters
298

299
``` diff
300 301 302 303 304 305 306 307 308 309
Recombination detection ("window" prediction, first pass)
    (use either -s or -k option, but not both)
    (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
    (all these options, except -w, are overriden when using -g)
  -k INT                      k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
  -w INT                      w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
  -e FLOAT=1                  maximal e-value for determining if a V-J segmentation can be trusted
  -t INT                      trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt  (0: no trim)
  -s SEED=10s                 seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitely or by an alias
                              10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
310
```
Mathieu Giraud's avatar
Mathieu Giraud committed
311

312
The `-s`, `-k` are the options of the seed-based heuristic that detects
Mathieu Giraud's avatar
Mathieu Giraud committed
313 314
"junctions", that is a zone in a read that is similar to V genes on its
left end and similar to J genes in its right end. A detailed
315
explanation can be found in (Giraud, Salson and al., 2014).
316 317
*These options are for advanced usage, the defaults values should work.*
The `-s` or `-k` option selects the seed used for the k-mer V/J affectation.
Mathieu Giraud's avatar
Mathieu Giraud committed
318

319 320
The `-w` option fixes the size of the "window" that is the main
identifier to cluster clones. The default value (`-w 50`) was selected
321
to ensure a high-quality clone clustering: reads are clustered when
322
they *exactly* share, at the nucleotide level, a 50 bp-window centered
323 324
on the CDR3. No sequencing errors are corrected inside this window.
The center of the "window", predicted by the high-throughput heuristic, may
Mikaël Salson's avatar
Mikaël Salson committed
325
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
326
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
327
in \>99% of cases when the reads are large enough). Usually, a 50 bp-window
328 329
fully contains the CDR3 as well as some part of the end of the V and
the start of the J, or at least some specific N region to uniquely identify the clone.
Mathieu Giraud's avatar
Mathieu Giraud committed
330

331
Setting `-w` to higher values (such as `-w 60` or `-w 100`) makes the clone clustering
332 333 334 335 336
even more conservative, enabling to split clones with low specificity (such as IGH with very
large D, short or no N regions and almost no somatic hypermutations). However, such settings
may "segment" (analyze) less reads, depending on the read length of your data, and may also
return more clones, as any sequencing error in the window is not corrected.

337
The special `-w all` option takes all the read as the windows, completely disabling
338 339 340
the clustering by windows and generally returning more clones. This should only be used on
datasets where reads of the same clone do have exactly the same length.

341
Setting `-w` to lower values than 50 may "segment" (analyze) a few more reads, depending
342
on the read length of your data, but may in some cases falsely cluster reads from
343
different clones.
344 345
For VJ recombinations, the `-w 40` option is usually safe, and `-w 30` can also be tested.
Setting `-w` to lower values is not recommended.
Mathieu Giraud's avatar
Mathieu Giraud committed
346

347 348
When the read is too short too extract the requested length, the window can be shifted
(at most 10 bp) or shrinkened (down until 30bp) by increments of 5bp. Such reads
349
are counted in `SEG changed w` and the corresponding clones are output with the `Wxx` warning.
350

351
The `-e` option sets the maximal e-value accepted for segmenting a sequence.
352
It is an upper bound on the number of exepcted windows found by chance by the seed-based heuristic.
353 354
The e-value computation takes into account both the number of reads in the
input sequence and the number of locus searched for.
355 356
The default value is 1.0, but values such as 1000, 1e-3 or even less can be used
to have a more or less permissive segmentation.
357
The threshold can be disabled with `-e all`.
358

359
The `-t` option sets the maximal number of nucleotides that will be indexed in
360 361
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
362
However giving a `-t` may also reduce the probability of seeing a heavily
363
trimmed or mutated V gene.
364
The default is `-t 0`.
365

366
## Thresholds on clone output
367 368 369

The following options control how many clones are output and analyzed.

370
``` diff
371
Limits to report a clone (or a window)
372 373 374 375 376 377 378 379 380 381
  --max-clones INT            maximal number of output clones ('all': no maximum, default)
  -r INT=5                    minimal number of reads supporting a clone
  --ratio FLOAT=0             minimal percentage of reads supporting a clone

Limits to further analyze some clones (second pass)
  -y INT=100                  maximal number of clones computed with a consensus sequence ('all': no limit)
  -z INT=100                  maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use)
  -A                          reports and segments all clones (-r 0 --ratio 0 -y all -z all), to be used only on very small datasets (for example -AX 20)
  -x INT                      maximal number of reads to process ('all': no limit, default), only first reads
  -X INT                      maximal number of reads to process ('all': no limit, default), sampled reads
382
```
383

384
The `-r/--ratio` options are strong thresholds: if a clone does not have
385
the requested number of reads, the clone is discarded (except when
386 387 388 389
using `-l`, see below).
The default `-r 5` option is meant to only output clones that
have a significant read support. **You should use** `-r 1` **if you
want to detect all clones starting from the first read** (especially for
390 391
MRD detection).

392 393
The `--max-clones` option limits the number of output clones, even without consensus sequences.

394
The `-y` option limits the number of clones for which a consensus
395
sequence is computed. Usually you do not need to have more
396
consensus (see below), but you can safely put `-y all` if you want
397
to compute all consensus sequences.
398

399 400
The `-z` option limits the number of clones that are fully analyzed,
*with their V(D)J designation and possibly a CDR3 detection*,
401
in particular to enable the web application
402
to display the clones on the grid (otherwise they are displayed on the
403
'?/?' axis).
404 405
If you want to analyze more clones, you should use `-z 200` or
`-z 500`. It is not recommended to use larger values: outputting more
406
than 500 clones is often not useful since they can not be visualized easily
407
in the web application, and takes more computation time.
408

409
Note that even if a clone is not in the top 100 (or 200, or 500) but
410
still passes the `-r`, `--ratio` options, it is still reported in both the `.vidjil`
411
and `.vdj.fa` files. If the clone is at some MRD point in the top 100 (or 200, or 500),
Mikaël Salson's avatar
Mikaël Salson committed
412
it will be fully analyzed/segmented by this other point (and then
413
collected by the `fuse.py` script, using consensus sequences computed at this
414
other point, and then, on the web application, correctly displayed on the grid).
415 416
**Thus is advised to leave the default** `-z 100` **option
for the majority of uses.**
417

418
The `-A` option disables all these thresholds. This option should be
419 420 421
used only for test and debug purposes, on very small datasets, and
produce large file and takes huge computation times.

422
The `-Z` option speeds up the full analysis by a pre-processing step,
423 424 425
again based on k-mers, to select a subset of the V germline genes to be compared to the read.
The option gives the typical size of this subset (it can be larger when several V germlines
genes are very similar, or smaller when there are not enough V germline genes).
426 427 428
The default `-Z 3` is generally safe.
Setting `-Z all` removes this pre-processing step, running a full dynamic programming
with all germline sequences that is much slower.
Mikaël Salson's avatar
Mikaël Salson committed
429

430
## Sequences of interest
431

432
Vidjil-algo allows to indicate that specific sequences should be followed and output,
433
even if those sequences are 'rare' (below the `-r/--ratio` thresholds).
434 435
Such sequences can be provided either with `-W <sequence>`, or with `-l <file>`.
The file given by `-l` should have one sequence by line, as in the following example:
Mikaël Salson's avatar
Mikaël Salson committed
436

437
``` diff
438 439
GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1
GAGAGATGGACGGAATACGTTAAACGACATATGGTTCGGGGTATGGTGCT my-clone-2 foo
440
```
441

442 443 444
Sequences and labels must be separed by one space.
The first column of the file is the sequence to be followed
while the remaining columns consist of the sequence's label.
445
In Vidjil-algo output, the labels are output alongside their sequences.
446

447 448
A sequence given `-W <sequence>` or with `-l <file>` can be exactly the size
of the window (`-w`, that is 50 by default). In this case, it is guaranteed that
449 450
such a window will be output if it is detected in the reads.
More generally, when the provided sequence differs in length with the windows
451 452
we will keep any windows that contain the sequence of interest or, conversely,
we will keep any window that is contained in the sequence of interest.
453 454 455
This filtering will work as expected when the provided sequence overlaps
(at least partially) the CDR3 or its close neighborhood.

456
With the `-F` option, *only* the windows related to the given sequences are kept.
457
This allows to quickly filter a set of reads, looking for a known sequence or window,
458 459
with the `-FaW <sequence>` options:
All the reads with the windows related to the sequence will be extracted to `out/seq/clone.fa-1`.
460

461
## Clone analysis: VDJ assignation and CDR3 detection
462

463
The `-3` option launches a CDR3/JUNCTION detection based on the position
464
of Cys104 and Phe118/Trp118 amino acids. This detection relies on alignment
465
with gapped V and J sequences, as for instance, for V genes, IMGT/GENE-DB sequences,
466
as provided by `make germline`.
467 468
The CDR3/JUNCTION detection won't work with custom non-gapped V/J repertoires.

469 470
CDR3 are reported as productive when they come from an in-frame recombination
and when the sequence does not contain any in-frame stop codons.
471

472
The advanced `-f` option sets the parameters used in the comparisons between
473 474
the clone sequence and the V(D)J germline genes. The default values should work.

475 476
The e-value set by `-e` is also applied to the V/J designation.
The `-E` option further sets the e-value for the detection of D segments.
477

478
## Further clustering (experimental)
479

480 481
The following options are experimental and have no consequences on the `.vdj.fa` file,
nor on the standard output. They instead add a `clusters` sections in the `.vidjil` file
482
that will be visualized in the web application.
483

484 485
The `-n` option triggers an automatic clustering using DBSCAN algorithm (Ester and al., 1996).
Using `-n 5` usually cluster reads within a distance of 1 mismatch (default score
486 487
being +1 for a match and -4 for a mismatch). However, more distant reads can also
be clustered when there are more than 10 reads within the distance threshold.
488
This behaviour can be controlled with the `-N` option.
Mikaël Salson's avatar
Mikaël Salson committed
489

490
The `-=` option allows to specify a file for manually clustering two windows
Mikaël Salson's avatar
Mikaël Salson committed
491
considered as similar. Such a file may be automatically produced by vidjil
492 493
(`out/edges`), depending on the option provided. Only the two first columns
(separed by one space) are important to vidjil, they only consist of the
Mikaël Salson's avatar
Mikaël Salson committed
494
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
495

496
# Output
Mikaël Salson's avatar
Mikaël Salson committed
497

498
## Main output files
499

500
The main output of Vidjil-algo (with the default `-c clones` command) are two following files:
501

502
  - The `.vidjil` file is *the file for the Vidjil web application*.
503
    The file is in a `.json` format (detailed in [vidjil-format](vidjil-format))
504 505 506 507
    describing the windows and their count, the consensus sequences (`-y`),
    the detailed V(D)J and CDR3 designation (`-z`, see warning below), and possibly
    the results of the further clustering.
    
508
    The web application takes this `.vidjil` file ([possibly merged with `fuse.py`](#following-clones-in-several-samples)) for the *visualization and analysis* of clones and their
509 510
    tracking along different samples (for example time points in a MRD
    setup or in a immunological study).
511
    Please see the [user manual](user.md) for more information on the web application.
512

513 514 515 516 517 518 519 520 521 522 523
  - The `.vdj.fa` file is *a FASTA file for further processing by other bioinformatics tools*.
    The sequences are at least the windows (and their count in the headers) or
    the consensus sequences (`-y`) when they have been computed.
    The headers include the count of each window, and further includes the
    detailed V(D)J and CDR3 designation (`-z`, see warning below), given in a '.vdj' format, see below.
    The further clustering is not output in this file.
    
    The `.vdj.fa` output enables to use Vidjil-algo as a *filtering tool*,
    shrinking a large read set into a manageable number of (pre-)clones
    that will be deeply analyzed and possibly further clustered by
    other software.
524

525
By default, the two output files are named `out/basename.vidjil` in `out/basename.vdj.fa`, where:
526

527 528
  - `out` is the directory where all the outputs are stored (can be changed with the `-o` option).
  - `basename` is the basename of the input `.fasta/.fastq` file (can be overriden with the `-b` option)
529

530
## Auxiliary output files
531

532
The `out/basename.windows.fa` file contains the list of windows, with number of occurrences:
533

534
``` diff
535
>8--window--1
536
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
537
>5--window--2
538
CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC
539 540
ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC
(...)
541
```
542

543
Windows of size 50 (modifiable by `-w`) have been extracted.
544 545
The first window has 8 occurrences, the second window has 5 occurrences.

546
The `out/seq/clone.fa-*` contains the detailed analysis by clone, with
547
the window, the consensus sequence, as well as with the most similar V, (D) and J germline genes:
548

549
``` diff
550 551
>clone-001--IGH--0000008--0.0608%--window
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
552
>clone-001--IGH--0000008--0.0608%--lcl|FLN1FA001CPAUQ.1|-[105,232]-#2 - 128 bp (55% of 232.0 bp) + VDJ  0 54 73 84 85 127   IGHV3-23*05 6/ACCCGGGAGGAACAATAT/9 IGHD6-13*01 0//5 IGHJ4*02  IGH SEG_+ 1.946653e-19 1.352882e-19/5.937712e-20
553 554 555 556 557 558 559 560 561 562
GCTGTACCTGCAAATGAACAGCCTGCGAGCCGAGGACACGGCCACCTATTACTGT
ACCCGGGAGGAACAATATAGCAGCTGGTAC
TTTGACTTCTGGGGCCAGGGGATCCTGGTCACCGTCTCCTCAG

>IGHV3-23*05
GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTTATAGCAGTGGTAGTAGCACATACTATGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA
>IGHD6-13*01
GGGTATAGCAGCAGCTGGTAC
>IGHJ4*02
ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
563
```
564

565 566
The `-a` debug option further output in each `out/seq/clone.fa-*` files the full list of reads belonging to this clone.
The `-a` option produces large files, and is not recommanded in general cases.
567

568
## Diversity measures
569

570
Several [diversity indices](https://en.wikipedia.org/wiki/Diversity_index) are reported, both on the standard output and in the `.vidjil` file:
571

572 573 574
  - H (`index_H_entropy`): Shannon's diversity
  - E (`index_E_equitability`): Shannon's equitability
  - Ds (`index_Ds_diversity`): Simpson's diversity
575

576
E ans Ds values are between 0 (no diversity, one clone clusters all analyzed reads)
577
and 1 (full diversity, each analyzed read belongs to a different clone).
578 579
These values are now computed on the windows, before any further clustering.
PCR and sequencing errors can thus lead to slighlty over-estimate the diversity.
580

581
## Unsegmentation causes
582

583
Vidjil-algo outputs details statistics on the reads that are not segmented (not analyzed).
584
Basically, **an unsegmented read is a read where Vidjil-algo cannot identify a window at the junction of V and J genes**.
585 586
To properly analyze a read, Vijdil needs that the sequence spans enough V region and J region
(or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations).
587 588
The following unsegmentation causes are reported:

589 590 591 592 593 594 595 596 597 598 599 600 601 602
|                     |                                                                                                                          |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `UNSEG too short`   | Reads are too short, shorter than the seed (by default between 9 and 13 bp).                                             |
| `UNSEG strand`      | The strand is mixed in the read, with some similarities both with the `+` and the `-` strand.                            |
| `UNSEG too few V/J` | No information has been found on the read: There are not enough similarities neither with a V gene or a J gene.          |
| `UNSEG only V/5`    | Relevant similarities have been found with some V, but none or not enough with any J.                                    |
| `UNSEG only J/3`    | Relevant similarities have been found with some J, but none or not enough with any V.                                    |
| `UNSEG ambiguous`   | vidjil-algo finds some V and J similarities mixed together which makes the situation ambiguous and hardly solvable.      |
| `UNSEG too short w` | The junction can be identified but the read is too short so that vidjil-algo could extract the window (by default 50bp). |
|                     | It often means the junction is very close from one end of the read.                                                      |

Some datasets may give reads with many low `UNSEG too few` reads:

  - `UNSEG too few V/J` usually happens when reads share almost nothing with the V(D)J region.
603 604
    This is expected when the PCR or capture-based approach included other regions, such as in whole RNA-seq.

605
  - `UNSEG only V/5` and `UNSEG only J/3` happen when reads do not span enough the junction zone.
606
    Vidjil-algo detects a “window” including the CDR3. By default this window is 50bp long,
607 608
    so the read needs be that long centered on the junction.

609
See the [user manual](user.md) for information on the biological or sequencing causes that can lead to few segmented reads.
610

611
## Filtering reads
612

613 614 615 616 617 618 619 620 621 622
``` diff
Detailed output per read (generally not recommended, large files, but may be used for filtering, as in -uu -X 1000)
  -U                          output segmented reads (in .segmented.vdj.fa file)
  -u
        -u          output unsegmented reads, gathered by unsegmentation cause, except for very short and 'too few V/J' reads (in *.fa files)
        -uu         output unsegmented reads, gathered by unsegmentation cause, all reads (in *.fa files) (use only for debug)
        -uuu        output unsegmented reads, all reads, including a .unsegmented.vdj.fa file (use only for debug)
  -K                          output detailed k-mer affectation on all reads (in .affects file) (use only for debug, for example -KX 100)
```

623 624
It is possible to extract all segmented or unsegmented reads, possibly to give them to other software.
Runing Vidjil with `-U` gives a file `out/basename.segmented.vdj.fa`, with all segmented reads.
625
On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file.
626
However, the `-U` option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations.
627

628 629
Similarly, options are available to get the unsegmented reads:

630 631 632 633 634 635
  - `-u` gives a set of files `out/basename.UNSEG_*`, with unsegmented reads gathered by unsegmentation cause.
    It outputs only reads sharing significantly sequences with V/J germline genes or with some ambiguity:
    it may be interesting to further study RNA-Seq datasets.

  - `-uu` gives the same set of files, including **all** unsegmented reads (including `UNSEG too short` and `UNSEG too few V/J`),
    and `-uuu` further outputs all these reads in a file `out/basename.unsegmented.vdj.fa`.
636 637 638

Again, as these options may generate large files, they are generally not recommended.
However, they are very useful in some situations, especially to understand why some dataset gives poor segmentation result.
639
For example `-uu -X 1000` splits the unsegmented reads from the 1000 first reads.
Mikaël Salson's avatar
Mikaël Salson committed
640

641 642 643

## AIRR .tsv output

644
Since version 2018.10-dev, vidjil-algo supports the [AIRR format](http://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields).
645
We export all required fields, some optional fields, as also some custom fields (+).
646

647 648
Note that Vidjil-algo is designed to efficiently gather reads from large datasets into clones. 
By default (`-c clones`), we thus report in the AIRR format *clones*.
649
See also [What is a clone ?](vidjil-format/#what-is-a-clone).
650
Using `-c segment` trigger a separate analysis for each read, but this is usually not advised for large datasets. 
651 652 653 654


| Name  | Type | AIRR 1.2 Description <br /> *vidjil-algo implementation* |
| ----- | ---- |  ------------------------------------------------------- |
655
| locus | string | Gene locus (chain type). For example, `IGH`, `IGK`, `IGL`, `TRA`, `TRB`, `TRD`, or `TRG`.<br />*Vidjil-algo outputs all these loci. Moreover, the incomplete recombinations analyzed by vidjil-algo are reported as `IGH+`, `IGK+`, `TRA+D`, `TRB+`, `TRD+`, and `xxx` for unexpected recombinations. See  [locus](locus).*
656
| duplicate_count | number | Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. <br />*Number of reads gathered in the clone.*
657 658
| sequence_id | string  | Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. <br />*This identifier is the (50 bp by default) window extacted around the junction.* |
| clone_id 	| string | 	Clonal cluster assignment for the query sequence. <br />*This identifier is again the (50 bp by default) window extacted around the junction.*
659
| warnings (+) | string | *Warnings associated to this clone. See <https://gitlab.vidjil.org/blob/dev/doc/warnings.md>.*
660 661
| sequence  | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. <br />*This contains the consensus/representative sequence of each clone.*
| rev_comp  | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. <br />*Set to null, as vidjil-algo gather reads from both strands in clones* |
662
| v_call, d_call, j_call  | string  | V/D/J gene with allele. For example, IGHV4-59\*01. <br /> *implemented. In the case of uncomplete/unexpected recombinations (locus with a `+`), we still use `v/d/j_call`. Note that this value can be null on clones beyond the `-z` option.* |
663 664 665
| junction  | string  |      Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. <br />*null*
| junction_aa  | string  | Junction region amino acid sequence.      <br />*implemented*
| cdr3_aa | string | Amino acid translation of the cdr3 field.   <br />*implemented*
666
| productive | boolean | True if the V(D)J sequence is predicted to be productive.  <br /> *true, false, or null when no CDR3 has been detected* |
667 668
| sequence_alignment  | string  | Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. <br /> *null*                                         |
| germline_alignment | string  | Assembled, aligned, fully length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). <br />*null*
669
| v_cigar, d_cigar, j_cigar | string  | CIGAR strings for the V/D/J gene <br />*null*.
670

671 672 673
Currently, we do not output alignment strings.
Our implementation of .tsv may evolve in future versions.
Contact us if a particular feature does interest you.
674 675


676
## Segmentation and .vdj format
677

678
Vidjil output includes segmentation of V(D)J recombinations. This happens
679 680
in the following situations:

681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703
  - in a first pass, when requested with `-U` option, in a `.segmented.vdj.fa` file.
    
    The goal of this ultra-fast segmentation, based on a seed
    heuristics, is only to identify the locus and to locate the w-window overlapping the
    CDR3. This should not be taken as a real V(D)J designation, as
    the center of the window may be shifted up to 15 bases from the
    actual center.

  - in a second pass, on the standard output and in both `.vidjil` and `.vdj.fa` files
    
      - at the end of the clones detection (default command `-c clones`,
        on a number of clones limited by the `-z` option)
      - or directly when explicitly requiring segmentation (`-c segment`)
    
    These V(D)J designations are obtained by full comparison (dynamic programming)
    with all germline sequences.
    
    Note that these designations are relatively slow to compute, especially
    for the IGH locus. However, they
    are not at the core of the Vidjil clone clustering method (which
    relies only on the 'window', see above).
    To check the quality of these designations, the automated test suite include
    sequences with manually curated V(D)J designations (see [should-vdj.org](http://git.vidjil.org/blob/master/doc/should-vdj.org)).
Mikaël Salson's avatar
Mikaël Salson committed
704 705

Segmentations of V(D)J recombinations are displayed using a dedicated
706 707
`.vdj` format. This format is compatible with FASTA format. A line starting
with a \> is of the following form:
Mikaël Salson's avatar
Mikaël Salson committed
708

709
``` diff
710
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments
Mikaël Salson's avatar
Mikaël Salson committed
711

712
        name          sequence name (include the number of occurrences in the read set and possibly other information)
Mikaël Salson's avatar
Mikaël Salson committed
713
        +             strand on which the sequence is mapped
714
        VDJ           type of segmentation (can be "VJ", "VDJ", "VDDJ", "53"...
715 716
                      or shorter tags such as "V" for incomplete sequences).    
              The following line are for "VDJ" recombinations :
Mikaël Salson's avatar
Mikaël Salson committed
717

718
        startV endV   start and end position of the V gene in the sequence (start at 1)
Mikaël Salson's avatar
Mikaël Salson committed
719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
735

736
        comments      optional comments. In Vidjil, the following comments are now used:
737
                      - "seed" when this comes for the first pass (.segmented.vdj.fa). See the warning above.
738
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
739 740
                      - the name of the locus (TRA, TRB, TRG, TRD, IGH, IGL, IGK, possibly followed
                        by a + for incomplete/unusual recombinations)
Mikaël Salson's avatar
Mikaël Salson committed
741

742
```
743

Mikaël Salson's avatar
Mikaël Salson committed
744 745 746 747 748
Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
Mikaël Salson's avatar
Mikaël Salson committed
749

750
``` diff
751
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  comments
752
```
753

754
# Examples of use
755

756 757
Examples on a IGH VDJ recombinations require either to specigy `-g germline/homo-sapiens-g:IGH`,
or to use the multi-germline option `-g germline/homo-sapiens.g` that can be shortened into `-g germline`.
758

759
## Basic usage: PCR-based datasets, with primers in the V(D)J regions (such as BIOMED-2 primers)
760

761
``` bash
762 763 764 765 766
./vidjil-algo -c clones   -g germline/homo-sapiens.g -2 -3 -r 1  demo/Demo-X5.fa
  # Detect the locus for each read, cluster and report clones starting from the first read (-r 1).
  # including unexpected recombinations (-2). Assign the V(D)J genes and try to detect the CDR3s (-3).
  # Demo-X5 is a collection of sequences on all human locus, including some incomplete or unusual recombinations:
  # IGH (VDJ, DJ), IGK (VJ, V-KDE, Intron-KDE), IGL, TRA, TRB (VJ, DJ), TRG and TRD (VDDJ, Dd2-Dd3, Vd-Ja).
767
```
768

769
``` bash
770
./vidjil-algo -g germline/homo-sapiens.g:IGH -3 demo/Stanford_S22.fasta
771 772
   # Cluster the reads and report the clones, based on windows overlapping IGH CDR3s.
   # Assign the V(D)J genes and try to detect the CDR3 of each clone.
773
   # Summary of clones is available both on stdout, in out/Stanford_S22.vdj.fa and in out/Stanford_S22.vidjil.
774
```
775

776
``` bash
777
./vidjil-algo -g germline -2 -3 -d demo/Stanford_S22.fasta
778
   # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
779
   # Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
780
   # Assign the VDJ genes (including multiple D) and try to detect the CDR3 of each clone.
781
   # Summary of clones is available both on stdout, in out/reads.vdj.fa and in out/reads.vidjil.
782
```
783

784
## Basic usage: Whole RNA-Seq or capture datasets
785

786
``` bash
787
./vidjil-algo -g germline -2 -U demo/Stanford_S22.fasta
788
   # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
789
   # Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
790
   # Assign the VDJ genes and try to detect the CDR3 of each clone.
791
   # The out/reads.segmented.vdj.fa include all reads where a V(D)J recombination was found
792
```
793

794
Typical whole RNA-Seq or capture datasets may be huge (several GB) but with only a (very) small portion of CDR3s.
795 796
Using Vidjil with `-U` will create a `out/reads.segmented.vdj.fa` file
that includes all reads where a V(D)J recombination (or an unexpected recombination, with `-2`) was found.
797 798
This file will be relatively small (a few kB or MB) and can be taken again as an input for Vidjil or for other programs.

799
## Advanced usage
800

801
``` bash
802
./vidjil-algo -c clones -g germline/homo-sapiens.g -r 1 -n 5 -x 10000 demo/LIL-L4.fastq.gz
803
   # Extracts the windows with at least 1 read each (-r 1, the default being -r 5)
804 805 806 807
   # on the first 10,000 reads, then cluster them into clones
   # with a second clustering step at distance five (-n 5)
   # The result of this second is in the .vidjil file ('clusters')
   # and can been seen and edited in the web application.
808
```
809

810
``` bash
811
./vidjil-algo -c segment -g germline/homo-sapiens.g -2 -3 -d -x 50 demo/Stanford_S22.fasta
812
   # Detailed V(D)J designation, including multiple D, and CDR3 detection on the first 50 reads, without clone clustering
813
   # (this is slow and should only be used for testing, or on a small file)
814
```
815

816
``` bash
817
./vidjil-algo -c germlines -g germline/homo-sapiens.g demo/Stanford_S22.fasta
818
   # Output statistics on the number of occurrences of k-mers of the different germlines
819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847
```

## Following clones in several samples

In a minimal residual disease setup, for instance, we are interested in
following the main clones identified at diagnosis in the following samples.

In its output files, Vidjil keeps track of all the clones, even if it
provides a V(D)J assignation only for the main ones. Therefore the
meaningful information is already in the files (for instance in the `.vidjil`
files). However we have one `.vidjil` per sample which may not be very
convenient. All the more since the web client only takes one `.vidjil` file
as input and cannot take several ones.

Therefore we need to merge all the `.vidjil` files into a single one. That is
the purpose of the [tools/fuse.py](../tools/fuse.py) script.

Let assume that four `.vidjil` files have been produced for each sample
(namely `diag.vidjil`, `fu1.vidjil`, `fu2.vidjil`, `fu3.vidjil`), merging them will
be done in the following way:

``` bash
python tools/fuse.py --output mrd.vidjil --top 100 diag.vidjil fu1.vidjil fu2.vidjil fu3.vidjil
```

The `--top` parameter allows to choose how many top clones per sample should
be kept. 100 means that for each sample, the top 100 clones are kept and
followed in the other samples. In this example the output file is stored in
`mrd.vidjil` which can then be fed to the web client.