vidjil-algo.md 44.9 KB
Newer Older
Mathieu Giraud's avatar
Mathieu Giraud committed
1
# vidjil-algo 2020.08
2 3
**Command-line manual**

4
*The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Tatiana and Rayan)*
Mikaël Salson's avatar
Mikaël Salson committed
5

6 7
```
  Vidjil -- High-throughput Analysis of V(D)J Immune Repertoire -- [[http://www.vidjil.org]]
Mathieu Giraud's avatar
Mathieu Giraud committed
8
  Copyright (C) 2011-2020 by Bonsai bioinformatics
9
  at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille
10
  and VidjilNet consortium.
11 12 13 14 15 16
  contact@vidjil.org
```

This is the help of vidjil-algo, for command-line usage.
This manual can be browsed online:

17 18
 - <http://www.vidjil.org/doc/vidjil-algo>                (last stable release)
 - <http://gitlab.vidjil.org/blob/dev/doc/vidjil-algo.md> (development version)
19 20 21 22 23 24 25

Other documentation (users and administrators of the web application, developpers) can be found from <http://www.vidjil.org/doc/>.


## About

*V(D)J recombinations* in lymphocytes are essential for immunological
Mikaël Salson's avatar
Mikaël Salson committed
26 27 28
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
29
With adapted [library preparation and sequencing](http://www.vidjil.org/doc/locus),
30
high-throughput sequencing (NGS/HTS) now
31 32
enables the deep sequencing of a lymphoid population with dedicated
sequencing methods and software, called either Rep-Seq or AIRR-Seq.
Mikaël Salson's avatar
Mikaël Salson committed
33

34
Vidjil-algo processes high-throughput sequencing data to extract V(D)J
35
junctions and gather them into clones. Vidjil-algo starts
Mikaël Salson's avatar
Mikaël Salson committed
36 37
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
38
to output all sequenced clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
39
because, in the first phase, no alignment is performed with database
40
germline sequences. At the end, only the consensus sequences
41
of each clone have to be analyzed. Vidjil-algo can also cluster similar
42
clones, or leave this to the user after a manual review in the web application.
Mikaël Salson's avatar
Mikaël Salson committed
43

44 45
The method is described in the following references:

46
- Marc Duez et al.,
47
“Vidjil: A web platform for analysis of high-throughput repertoire sequencing”,
48
PLOS ONE 2016, 11(11):e0166126
49
<http://dx.doi.org/10.1371/journal.pone.0166126>
Mikaël Salson's avatar
Mikaël Salson committed
50

51
- Mathieu Giraud, Mikaël Salson, et al.,
Mikaël Salson's avatar
Mikaël Salson committed
52
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
53
BMC Genomics 2014, 15:409
54
<http://dx.doi.org/10.1186/1471-2164-15-409>
Mikaël Salson's avatar
Mikaël Salson committed
55

56
Vidjil-algo is open-source, released under GNU GPLv3+ license.
Mikaël Salson's avatar
Mikaël Salson committed
57

58
# Requirements and installation
59

60
## Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
61

62 63 64
Vidjil-algo is systematically tested with the following compilers :

  - gcc/g++ 4.8, 5.3, 6.3, 7.3, 8.4, 9.3, 10.1
65
  - clang 3.4, 4.0, 6.0, 7.0
Mikaël Salson's avatar
Mikaël Salson committed
66

67 68 69 70 71
These compilers are available on recent OS X and on the following Linux distributions:
  - CentOS 7, 8
  - Debian Jessie 8.0, Stretch 9.0, Buster 10.0
  - FreeBSD 9.2, 10, 11, 12
  - Ubuntu 16.04 LTS, 18.04 LTS, 20.04 LTS
Mikaël Salson's avatar
Mikaël Salson committed
72

73
Vidjil-algo is developed with continuous integration using systematic unit and functional testing.
74 75
The development team internally uses [Gitlab CI](http://gitlab.vidjil.org/pipelines) for that,
and the tested compilers are run through Docker containers described in `.gitlab-ci-compilers.yml`.
Mikaël Salson's avatar
Mikaël Salson committed
76

77
## Build requirements (optional)
78

79
This paragraph details the requirements to build Vidjil-algo from source.
80
You can also download a static binary, see [installation](#installation).
81

82
To compile Vidjil-algo, make sure:
83

84
  - to be on a POSIX system ;
85
  - to have a C++11 compiler (as `g++` 4.8 or above, or `clang` 3.4 or above).
86 87
  - to have the `zlib` installed (`zlib1g-dev` package under Debian/Ubuntu,
    `zlib-devel` package under Fedora/CentOS).
88
  - to have GNU make (`gmake` under FreeBSD). On some FreeBSD distributions, it was required to use commands such as
89
``` bash
90
make MAKE=gmake CXXFLAGS="-std=c++11 -O2 Wall -D_GLIBCXX_USE_C99 -Wl,-rpath=/usr/local/lib/gcc49"
91
```
92
    The `gcc49` at the end of the command line is to be replaced by the `gcc` version used.
93

Mathieu Giraud's avatar
Mathieu Giraud committed
94

95
## Installation
Mathieu Giraud's avatar
Mathieu Giraud committed
96

97 98
### Download

99 100 101 102 103 104
These instructions targets *stable releases* of vidjil-algo, as downloaded from <http://www.vidjil.org/releases>
or <http://bioinfo.lifl.fr/vidjil/>.

Development code is found at <http://gitlab.vidjil.org>, in the `algo` directory.
and compiling and running vidjil-algo on the development code can involve slightly different commands,
including replacing `src` by `algo`.
105

106
### Compiling
Mikaël Salson's avatar
Mikaël Salson committed
107

108 109
Running `make` from the extracted archive should be enough to install vidjil-algo with germline and demo files.
It runs the three following `make` commands.
110

111
``` bash
Mikaël Salson's avatar
Mikaël Salson committed
112

113
make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
114
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
115 116 117 118 119 120
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

121

122
make -C src              # build vijil-algo from the sources (see the requirements,
123 124
                         # another option is: wget http://www.vidjil.org/releases/vidjil-algo-latest_x86_64 -O vidjil-algo
                         # to download a static binary (built for x86_64 architectures)
125

126
make demo                # download demo files (S22 and L4, see demo/get-sequences)
127

128
./vidjil-algo -h         # display help/usage
129
```
Mikaël Salson's avatar
Mikaël Salson committed
130

131
On some older systems you may need to replace the `make` commands with:
132

133
``` bash
134
make LDFLAGS='-stdlib=libc++'  ### OS X Mavericks
135 136 137
```

## Self-tests (optional)
138 139 140

You can run the tests with the following commands:

141
``` bash
142
make -C src/tests/data
Mathieu Giraud's avatar
Mathieu Giraud committed
143
   # get IGH recombinations from a single individual, as described in:
144 145 146 147
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

148
make -C src test                # run self-tests (can take 5 to 60 minutes)
149
```
Mikaël Salson's avatar
Mikaël Salson committed
150

151
# Input and parameters
152

153 154 155
The main input file of Vidjil-algo is a *set of reads*, given as a `.fasta`
or `.fastq` file, possibly compressed with gzip (`.gz`).
This set of reads can reach several gigabytes and 2\*10<sup>9</sup> reads. It is
156
never loaded entirely in the memory, but reads are processed one by
157 158
one by Vidjil-algo.
Vidjil-algo can also process BAM files, but please note that:
159

160 161 162 163 164 165
1.  The reads don't need to be aligned beforehand.
2.  In case of paired-end sequencing, the reads must have already been merged
    in the BAM file.

The `-h` and `-H` help options provide the list of parameters that can be
used. We detail here the options of the main `-c clones` command.
166 167 168 169 170 171

The default options are very conservative (large window, no further
automatic clusterization, see below), leaving the user or other
software making detailed analysis and decisions on the final
clustering.

172
## Recombination / locus selection
173

174 175

``` diff
176
Germline presets (at least one -g or -V/(-D)/-J option must be given)
177 178
  -g, --germline GERMLINES ...

179 180 181 182 183 184
         -g <.g FILE>(:FILTER)
                    multiple locus/germlines, with tuned parameters.
                    Common values are '-g germline/homo-sapiens.g' or '-g germline/mus-musculus.g'
                    The list of locus/recombinations can be restricted, such as in '-g germline/homo-sapiens.g:IGH,IGK,IGL'
         -g PATH
                    multiple locus/germlines, shortcut for '-g PATH/homo-sapiens.g',
185
                    processes human TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with incomplete/unusal recombinations
186
  -V FILE ...                 custom V germline multi-fasta file(s)
187
  -D FILE ...                 custom D germline multi-fasta file(s), analyze into V(D)J components
188
  -J FILE ...                 custom V germline multi-fasta file(s)
189
  -2                          try to detect unexpected recombinations
190
```
191

192
The `germline/*.g` presets configure the analyzed recombinations.
193 194
The following presets are provided:

195
  - `germline/homo-sapiens.g`: Homo sapiens, TR (`TRA`, `TRB`, `TRG`, `TRD`) and Ig (`IGH`, `IGK`, `IGL`) locus,
Mathieu Giraud's avatar
Mathieu Giraud committed
196
    including incomplete/unusal recombinations (`TRA+D`, `TRB+`, `TRD+`, `IGH+`, `IGK+`, see <locus.md>.
197 198 199 200 201
  - `germline/homo-sapiens-isotypes.g`: Homo sapiens heavy chain locus, looking for sequences with, on one side, IGHJ (or even IGHV) genes,
    and, on the other side, an IGH constant chain.
  - `germline/homo-sapiens-cd.g`: Homo sapiens, common CD genes (experimental, does not check for recombinations)
  - `germline/mus-musculus.g`: Mus musculus (strains BALB/c and C57BL/6)
  - `germline/rattus-norvegicus.g`: Rattus norvegicus (strains BN/SsNHsdMCW and Sprague-Dawley)
202

203
New `germline/*.g` presets for other species or for custom recombinations can be created, possibly referring to other `.fasta` files.
204 205
Please contact us if you need help in configuring other germlines.

206 207 208 209
  - Recombinations can be filtered, such as in
    `-g germline/homo-sapiens.g:IGH` (only IGH, complete recombinations),
    `-g germline/homo-sapiens.g:IGH,IGH+` (only IGH, as well with incomplete recombinations)
    or `-g germline/homo-sapiens.g:TRA,TRB,TRG` (only TR locus, complete recombinations).
210

211
  - Several presets can be loaded at the same time, as for instance `-g germline/homo-sapiens.g -g germline/germline/homo-sapiens-isotypes.g`.
212

213
  - Using `-2` further test unexpected recombinations (tagged as `xxx`), as in `-g germline/homo-sapiens.g -2`.
214

215
Finally, the advanced `-V/(-D)/-J` options enable to select custom V, (D) and J repertoires given as `.fasta` files.
216

217
## Main algorithm parameters
218

219
``` diff
220 221 222 223
Recombination detection ("window" prediction, first pass)
    (use either -s or -k option, but not both)
    (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
    (all these options, except -w, are overriden when using -g)
224 225
  -k, --kmer INT              k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
  -w, --window INT            w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
226
  -e, --e-value FLOAT=1       maximal e-value for trusting the detection of a V-J recombination
227 228 229
  --trim INT                  trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt  (0: no trim)
  -s, --seed SEED=10s         seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitely or by an alias
                               10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
230
```
Mathieu Giraud's avatar
Mathieu Giraud committed
231

232
The `-s`, `-k` are the options of the seed-based heuristic that detects
Mathieu Giraud's avatar
Mathieu Giraud committed
233 234
"junctions", that is a zone in a read that is similar to V genes on its
left end and similar to J genes in its right end. A detailed
235
explanation can be found in (Giraud, Salson and al., 2014).
236 237
*These options are for advanced usage, the defaults values should work.*
The `-s` or `-k` option selects the seed used for the k-mer V/J affectation.
Mathieu Giraud's avatar
Mathieu Giraud committed
238

239 240
The `-w` option fixes the size of the "window" that is the main
identifier to cluster clones. The default value (`-w 50`) was selected
241
to ensure a high-quality clone clustering: reads are clustered when
242
they *exactly* share, at the nucleotide level, a 50 bp-window centered
243 244
on the CDR3. No sequencing errors are corrected inside this window.
The center of the "window", predicted by the high-throughput heuristic, may
Mikaël Salson's avatar
Mikaël Salson committed
245
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
246
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
247
in \>99% of cases when the reads are large enough). Usually, a 50 bp-window
248 249
fully contains the CDR3 as well as some part of the end of the V and
the start of the J, or at least some specific N region to uniquely identify the clone.
Mathieu Giraud's avatar
Mathieu Giraud committed
250

251
Setting `-w` to higher values (such as `-w 60` or `-w 100`) makes the clone clustering
252 253
even more conservative, enabling to split clones with low specificity (such as IGH with very
large D, short or no N regions and almost no somatic hypermutations). However, such settings
254
may detect recombinations in less reads, depending on the read length of your data, and may also
255 256
return more clones, as any sequencing error in the window is not corrected.

257
The special `-w all` option takes all the read as the windows, completely disabling
258
the clustering by windows and generally returning more clones. This should only be used on
259 260
datasets where reads of the same clone do have exactly the same length, or in situations
in which one want to study very precisely the clonality, tracking all mutations along the read.
261

262
Setting `-w` to lower values than 50 may analyze a few more reads, depending
263
on the read length of your data, but may in some cases falsely cluster reads from
264
different clones.
265 266
For VJ recombinations, the `-w 40` option is usually safe, and `-w 30` can also be tested.
Setting `-w` to lower values is not recommended.
Mathieu Giraud's avatar
Mathieu Giraud committed
267

268 269
When the read is too short too extract the requested length, the window can be shifted
(at most 10 bp) or shrinkened (down until 30bp) by increments of 5bp. Such reads
270
are counted in `SEG changed w` and the corresponding clones are output with the `W50` warning.
271

272
The `-e` option sets the maximal e-value accepted for analyzing a sequence.
273 274 275
It is an upper bound on the number of designated sequences found by chance by vidjil-algo.
The e-value computation takes into account both the number of locus searched for
and, for the defaut `-c clones` command, the number of reads in the input sequence.
276
The default value is 1.0, but values such as 1000, 1e-3 or even less can be used
277
to have a more or less permissive detection and designation.
278
The threshold can be disabled with `-e all`.
279

280 281 282 283
The advanced `--e-value-kmer` option sets the e-value for the seed-based heuristic.
It is an upper bound on the number of expected windows found by chance.
The default value is the same than value than the `-e`.

284
The advanced `--trim` option sets the maximal number of nucleotides that will be indexed in
285 286
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
287
However giving a `--trim` may also reduce the probability of seeing a heavily
288
trimmed or mutated V gene.
289
The default is `--trim 0`.
290

291
## Thresholds on clone output
292 293 294

The following options control how many clones are output and analyzed.

295
``` diff
296 297 298 299 300 301 302
Input
  -x, --first-reads INT       maximal number of reads to process ('all': no limit, default), only first reads
  -X, --sampled-reads INT     maximal number of reads to process ('all': no limit, default), sampled reads

Limits to report and to analyze clones (second pass)
  -r, --min-reads INT=5       minimal number of reads supporting a clone
  --min-ratio FLOAT=0         minimal percentage of reads supporting a clone
303
  --max-clones INT            maximal number of output clones ('all': no maximum, default)
304 305 306 307 308 309
  -y, --max-consensus INT=100 maximal number of clones computed with a consensus sequence ('all': no limit)
  -z, --max-designations INT=100
                              maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use)
  --all                       reports and analyzes all clones
                              (--min-reads 1 --min-ratio 0 --max-clones all --max-consensus all --max-designations all),
                              to be used only on small datasets (for example --all -X 1000)
310
```
311

312
The `-r/--ratio` options are strong thresholds: if a clone does not have
313
the requested number of reads, the clone is discarded (except when
314
using `--label`, see below).
315 316 317
The default `-r 5` option is meant to only output clones that
have a significant read support. **You should use** `-r 1` **if you
want to detect all clones starting from the first read** (especially for
318 319
MRD detection).

320 321
The `--max-clones` option limits the number of output clones, even without consensus sequences.

322
The `--max-consensus` option limits the number of clones for which a consensus
323
sequence is computed. Usually you do not need to have more
324
consensus (see below), but you can safely put `--max-consensus all` if you want
325
to compute all consensus sequences.
326

327
The `--max-designations` option limits the number of clones that are fully analyzed,
328
*with their V(D)J designation and possibly a CDR3 detection*,
329
in particular to enable the web application
330
to display the clones on the grid (otherwise they are displayed on the
331
'?/?' axis).
332 333 334 335 336 337 338 339

These V(D)J designations are obtained by full comparison (dynamic programming)
with all germline sequences.
Note that these designations are relatively slow to compute, especially
for the IGH locus. However, they
are not at the core of the Vidjil clone clustering method (which
relies only on the 'window', see above).
To check the quality of these designations, the automated test suite include
Mathieu Giraud's avatar
Mathieu Giraud committed
340
[sequences with manually curated V(D)J designations](should-vdj.md).
341

342 343
If you want to analyze more clones, you should use `--max-designations 200` or
`--max-designations 500`. It is not recommended to use larger values: outputting more
344
than 500 clones is often not useful since they can not be visualized easily
345
in the web application, and takes more computation time.
346

347
Note that even if a clone is not in the top 100 (or 200, or 500) but
348
still passes the `-r`, `--ratio` options, it is still reported in both the `.vidjil`
349
and `.vdj.fa` files. If the clone is at some MRD point in the top 100 (or 200, or 500),
350
it will be fully analyzed by this other point (and then
351
collected by the `fuse.py` script, using consensus sequences computed at this
352
other point, and then, on the web application, correctly displayed on the grid).
353
**Thus is advised to leave the default** `--max-designations 100` **option
354
for the majority of uses.**
355

356 357 358
The `--all` option disables all these thresholds. This option can be
used for test and debug purposes or on small datasets.
It produces large file and takes more time. 
359

360
The `--analysis-filter` advanced option speeds up the full analysis by a pre-processing step,
Mathieu Giraud's avatar
Mathieu Giraud committed
361 362 363
again based on k-mers, to select a subset of the V germline genes to be compared to the read.
The option gives the typical size of this subset (it can be larger when several V germlines
genes are very similar, or smaller when there are not enough V germline genes).
364 365
The default `--analysis-filter 3` is generally safe.
Setting `--analysis-filter all` removes this pre-processing step, running a full dynamic programming
366
with all germline sequences that is much slower.
Mikaël Salson's avatar
Mikaël Salson committed
367

368
## Sequences of interest
369

370
Vidjil-algo allows to indicate that specific sequences should be followed and output,
371
even if those sequences are 'rare' (below the `-r/--ratio` thresholds).
372 373
Such sequences can be provided either with `--label <sequence>`, or with `--label-file <file>`.
The file given by `--label-file` should have one sequence by line, as in the following example:
Mikaël Salson's avatar
Mikaël Salson committed
374

375
``` diff
376 377
GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1
GAGAGATGGACGGAATACGTTAAACGACATATGGTTCGGGGTATGGTGCT my-clone-2 foo
378
```
379

380
Sequences and labels must be separated by one space.
381 382
The first column of the file is the sequence to be followed
while the remaining columns consist of the sequence's label.
383
In Vidjil-algo output, the labels are output alongside their sequences.
384

385
A sequence given `--label <sequence>` or with `-label-file <file>` can be exactly the size
386
of the window (`-w`, that is 50 by default). In this case, it is guaranteed that
387 388
such a window will be output if it is detected in the reads.
More generally, when the provided sequence differs in length with the windows
389 390
we will keep any windows that contain the sequence of interest or, conversely,
we will keep any window that is contained in the sequence of interest.
391
This filtering will work as expected when the provided sequence overlaps
392 393 394
(at least partially) the CDR3 or its close neighborhood,
but will not work when the sequence is far of the CDR3 (except when
using large `-w` values or `-w all`).
395

396
With the `--label-filter` option, *only* the windows related to the given sequences are kept.
397
This allows to quickly filter a set of reads, looking for a known sequence or window,
398 399 400 401
with the `--grep-reads <sequence>` preset, equivalent to
`--out-reads --label-filter --label <sequence>`:
All the reads with the windows related to the sequence will be extracted 
to files such as `out/seq/clone.fa-1`.
402

403
## Further clone analysis and CDR3 detection
404

405
``` diff
406 407 408 409 410
Clone analysis (second pass)
  -d, --several-D             try to detect several D (experimental)
  -3, --cdr3                  CDR3/JUNCTION detection (requires gapped V/J germlines)
```

411
The `-3` option launches a CDR3/JUNCTION detection based on the position
412
of Cys104 and Phe118/Trp118 amino acids. This detection relies on alignment
413
with gapped V and J sequences, as for instance, for V genes, IMGT/GENE-DB sequences,
414
as provided by `make germline`.
415 416
The CDR3/JUNCTION detection won't work with custom non-gapped V/J repertoires.

417
CDR3 are reported as *productive* when they come from an in-frame recombination
418
and when the sequence does not contain any in-frame stop codons.
419
Note that some other software only consider stop codons in the CDR3,
420 421 422 423
and may thus under-estimate non-productivity.
When the sequence is long enough to start before the start of the V gene
or to end after the end of the J gene, vidjil-algo do not consider these intronic sequences
in the productivity estimation.
424

425
The advanced `--analysis-cost` option sets the parameters used in the comparisons between
426 427
the clone sequence and the V(D)J germline genes. The default values should work.

428 429
The e-value set by `-e` is also applied to the V/J designation.
The `-E` option further sets the e-value for the detection of D segments.
Mathieu Giraud's avatar
Mathieu Giraud committed
430

431
## Further clustering (experimental)
432

433 434
The following options are experimental and have no consequences on the `.vdj.fa` file,
nor on the standard output. They instead add a `clusters` sections in the `.vidjil` file
435
that will be visualized in the web application.
436 437
Any such clustering should be avoided when one wants to precisely study hypermutations.
The web application provides other options to inspect clones and cluster them.
438

439 440
The `--cluster-epsilon` option triggers an automatic clustering using the
[DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) algorithm (Ester and al., 1996).
441
Using `--cluster-epsilon 5` usually clusters reads within a distance of 1 mismatch (default score
Mathieu Giraud's avatar
Mathieu Giraud committed
442
being +1 for a match and -4 for a mismatch). With that option, more distant reads will also
443
be clustered as soon there are more than 10 reads within the distance threshold.
Mathieu Giraud's avatar
Mathieu Giraud committed
444
This behaviour can be controlled with the `-cluster-N` option.
Mikaël Salson's avatar
Mikaël Salson committed
445

446 447 448
Setting `--cluster-epsilon 10`, possibly with `--cluster-N 5` or `--cluster-N 1`
will perform more aggressive clustering and is generally not advised.

449
The `--cluster-forced-edges` option allows to specify a file for manually clustering two windows
450
considered as similar. Such a file may be automatically produced by vidjil-algo
451
(`out/edges`), depending on the option provided. Only the two first columns
452
(separed by one space) are important to vidjil-algo, they only consist of the
Mikaël Salson's avatar
Mikaël Salson committed
453
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
454

455
# Output
Mikaël Salson's avatar
Mikaël Salson committed
456

457
## Main output files
458

459
The default output of Vidjil-algo (with the default `-c clones` command) are the two following files:
460

461 462 463 464
  - The `.vidjil` file is the *main output file*, containing the most information.
    The file is in a `.json` format,
    its specification is detailed in [vidjil-format](vidjil-format.md).
    It describes the clones, with the windows and their count, the consensus sequences (`--max-consensus`),
465
    the detailed V(D)J and CDR3 designation (`--max-designations`, see warning below), and possibly
466 467
    the results of the further clustering.
    
468
    The web application takes this `.vidjil` file ([possibly merged with `fuse.py`](#following-clones-in-several-samples)) for the *visualization and analysis* of clones and their
469 470
    tracking along different samples (for example time points in a MRD
    setup or in a immunological study).
Mathieu Giraud's avatar
Mathieu Giraud committed
471
    Please see the [web application user manual](http://www.vidjil.org/doc/user) for more information.
472 473 474

  - The `.tsv` file is the AIRR output, for compatibility with other software
    using the same format. See [below](#airr-tsv-output) for details.
475

476

477 478
By default, these output files are named
`out/basename.vidjil` and `out/basename.tsv`, where:
479

480 481
  - `out` is the directory where all the outputs are stored (can be changed with the `--dir` option).
  - `basename` is the basename of the input `.fasta/.fastq` file (can be overriden with the `--base` option)
482

483 484
With the `--gz` option, both files are output
as compressed `.vidjil.gz` and `.tsv.gz` files.
485 486 487 488

Vidjil-algo also outputs the first 50 clones on the standard output.
More data can be printed on the standard output with the `-v` option.

489
## Auxiliary output files
490

491 492 493 494 495 496 497 498
### `.vdj.fa`

With the `--out-vdjfa` option, a `.vdj.fa` file is created (or, with `--gz`, a `.vdj.fa.gz` file).
This is *a FASTA file for further processing by other bioinformatics tools*.
Even if it is advised to rather use the full information in the `.vijdil` file,
the `.vdj.fa` is a convenient way to have sequences of clones for further processing.
These sequences are at least the windows (and their count in the headers) or
the consensus sequences (`--max-consensus`) when they have been computed.
Mathieu Giraud's avatar
Mathieu Giraud committed
499
The [headers](#headers-in-the-vdjfa-files-deprecated) are described below, but the format of the headers is deprecated
500 501 502 503 504 505 506 507 508 509
and will not be enforced in future releases.
Some other informations such as the further clustering are not output in this file.

The `.vdj.fa` output enables to use Vidjil-algo as a *filtering tool*,
shrinking a large read set into a manageable number of (pre-)clones
that will be deeply analyzed and possibly further clustered by
other software.

### `.windows.fa`

510
The `out/basename.windows.fa` file contains the list of windows, with number of occurrences:
511

512
``` diff
513
>8--window--1
514
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
515
>5--window--2
516
CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC
517 518
ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC
(...)
519
```
520

521
Windows of size 50 (modifiable by `-w`) have been extracted.
522 523
The first window has 8 occurrences, the second window has 5 occurrences.

524 525
### `seq/clone.fa-*`

526 527
With the `--out-clone-files` option, one `out/seq/clone.fa-*` file is created for each clone.
It contains the detailed analysis by clone, with
528
the window, the consensus sequence, as well as with the most similar V, (D) and J germline genes:
529

530
``` diff
531 532
>clone-001--IGH--0000008--0.0608%--window
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
533
>clone-001--IGH--0000008--0.0608%--lcl|FLN1FA001CPAUQ.1|-[105,232]-#2 - 128 bp (55% of 232.0 bp) + VDJ  0 54 73 84 85 127   IGHV3-23*05 6/ACCCGGGAGGAACAATAT/9 IGHD6-13*01 0//5 IGHJ4*02  IGH SEG_+ 1.946653e-19 1.352882e-19/5.937712e-20
534 535 536 537 538 539 540 541 542 543
GCTGTACCTGCAAATGAACAGCCTGCGAGCCGAGGACACGGCCACCTATTACTGT
ACCCGGGAGGAACAATATAGCAGCTGGTAC
TTTGACTTCTGGGGCCAGGGGATCCTGGTCACCGTCTCCTCAG

>IGHV3-23*05
GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTTATAGCAGTGGTAGTAGCACATACTATGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA
>IGHD6-13*01
GGGTATAGCAGCAGCTGGTAC
>IGHJ4*02
ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
544
```
545

546 547
The `--out-reads` debug option further output in each `out/seq/clone.fa-*` files the full list of reads belonging to this clone.
The `--out-reads` option produces large files, and is not recommended in general cases.
548

549
## Diversity measures
550

551
Several [diversity indices](https://en.wikipedia.org/wiki/Diversity_index) are reported, both on the standard output and in the `.vidjil` file:
552

553 554 555
  - H (`index_H_entropy`): Shannon's diversity
  - E (`index_E_equitability`): Shannon's equitability
  - Ds (`index_Ds_diversity`): Simpson's diversity
556

557
E ans Ds values are between 0 (no diversity, one clone clusters all analyzed reads)
558
and 1 (full diversity, each analyzed read belongs to a different clone).
559
These values are now computed on the windows, before any further clustering.
560
PCR and sequencing errors can thus lead to slightly over-estimate the diversity.
561

562
## Reads without detected recombinations
563

564
Vidjil-algo outputs details statistics on the reads where no recombination was detected
565 566
Basically, **an unanalyzed read is a read where Vidjil-algo cannot identify a window at the junction of V and J genes**.
To properly analyze a read, Vijdil-algo needs that the sequence spans enough V region and J region
567
(or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations).
568
The following causes are reported:
569

570 571 572 573 574 575 576 577
|                     |                                                                                                                          |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `UNSEG too short`   | Reads are too short, shorter than the seed (by default between 9 and 13 bp).                                             |
| `UNSEG strand`      | The strand is mixed in the read, with some similarities both with the `+` and the `-` strand.                            |
| `UNSEG too few V/J` | No information has been found on the read: There are not enough similarities neither with a V gene or a J gene.          |
| `UNSEG only V/5`    | Relevant similarities have been found with some V, but none or not enough with any J.                                    |
| `UNSEG only J/3`    | Relevant similarities have been found with some J, but none or not enough with any V.                                    |
| `UNSEG ambiguous`   | vidjil-algo finds some V and J similarities mixed together which makes the situation ambiguous and hardly solvable.      |
578
| `UNSEG too short w` | The junction can be identified but the read is too short so that vidjil-algo could extract the window (by default 50bp). It often means the junction is very close from one end of the read.  |
579 580 581 582

Some datasets may give reads with many low `UNSEG too few` reads:

  - `UNSEG too few V/J` usually happens when reads share almost nothing with the V(D)J region.
583 584
    This is expected when the PCR or capture-based approach included other regions, such as in whole RNA-seq.

585
  - `UNSEG only V/5` and `UNSEG only J/3` happen when reads do not span enough the junction zone.
586
    Vidjil-algo detects a “window” including the CDR3. By default this window is 50bp long,
587 588
    so the read needs be that long centered on the junction.

589 590
See the [user manual](user.md) for information on the biological or sequencing causes 
that can lead to few analyzed reads.
591

592
## Filtering reads
593

594 595
``` diff
Detailed output per read (generally not recommended, large files, but may be used for filtering, as in -uu -X 1000)
596 597 598 599 600
  -U, --out-detected          output reads with detected recombinations (in .detected.vdj.fa file)
  -u, --out-undetected
        -u          output undetected reads, gathered by cause, except for very short and 'too few V/J' reads (in *.fa files)
        -uu         output undetected reads, gathered by cause, all reads (in *.fa files) (use only for debug)
        -uuu        output undetected reads, all reads, including a .undetected.vdj.fa file (use only for debug)
601 602
  --out-reads                 output all reads by clones (clone.fa-*), to be used only on small datasets
  -K, --out-affects           output detailed k-mer affectation for each read (in .affects file) (use only for debug, for example -KX 100)
603 604
```

605 606 607
It is possible to extract all reads with or without detected recombinations,
possibly to give them to other software.
Runing Vidjil-algo with `-U` gives a file `out/basename.detected.vdj.fa`, with all detected reads.
608
On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file.
609
However, the `-U` option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations.
610 611
Moreover `-U` only uses the ultra-fast first passs analysis, based on k-mer heuristics.

612

613
Similarly, options are available to get the non analyzed reads:
Mathieu Giraud's avatar
Mathieu Giraud committed
614

615
  - `-u` gives a set of files `out/basename.UNSEG_*`, with not detected reads gathered by cause.
616 617 618
    It outputs only reads sharing significantly sequences with V/J germline genes or with some ambiguity:
    it may be interesting to further study RNA-Seq datasets.

619 620
  - `-uu` gives the same set of files, including **all** not detected reads (including `UNSEG too short` and `UNSEG too few V/J`),
    and `-uuu` further outputs all these reads in a file `out/basename.undetected.vdj.fa`.
621 622

Again, as these options may generate large files, they are generally not recommended.
623 624 625
However, they are very useful in some situations, especially to understand
why some dataset gives low detection rate.
For example `-uu -X 1000` splits the not detected reads from the 1000 first reads.
Mikaël Salson's avatar
Mikaël Salson committed
626

627 628 629

## AIRR .tsv output

Mathieu Giraud's avatar
Mathieu Giraud committed
630
Since version 2018.10, vidjil-algo supports the [AIRR format](http://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields).
631
We export all required fields, some optional fields, as also some custom fields (+).
Mathieu Giraud's avatar
Mathieu Giraud committed
632
We also propose in [fuse.py](tools.md) a way to convert AIRR format to the `.vidjil` format.
633

634 635
Note that Vidjil-algo is designed to efficiently gather reads from large datasets into clones. 
By default (`-c clones`), we thus report in the AIRR format *clones*.
636
See also [What is a clone ?](vidjil-format/#what-is-a-clone).
637
Using `-c designations` trigger a separate analysis for each read, but this is usually not advised for large datasets. 
638 639 640 641


| Name  | Type | AIRR 1.2 Description <br /> *vidjil-algo implementation* |
| ----- | ---- |  ------------------------------------------------------- |
Mathieu Giraud's avatar
Mathieu Giraud committed
642
| locus | string | Gene locus (chain type). For example, `IGH`, `IGK`, `IGL`, `TRA`, `TRB`, `TRD`, or `TRG`.<br />*Vidjil-algo outputs all these loci. Moreover, the incomplete recombinations analyzed by vidjil-algo are reported as `IGH+`, `IGK+`, `TRA+D`, `TRB+`, `TRD+`, and `xxx` for unexpected recombinations. See  <locus.md>.*
Mathieu Giraud's avatar
Mathieu Giraud committed
643
| duplicate_count | number | Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. <br />*Number of reads gathered in the clone.*
644 645
| sequence_id | string  | Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. <br />*This identifier is the (50 bp by default) window extacted around the junction.* |
| clone_id 	| string | 	Clonal cluster assignment for the query sequence. <br />*This identifier is again the (50 bp by default) window extacted around the junction.*
Mikaël Salson's avatar
Mikaël Salson committed
646
| warnings (+) | string | *Warnings associated to this clone. See <http://gitlab.vidjil.org/blob/dev/doc/warnings.md>.*
647 648
| sequence  | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. <br />*This contains the consensus/representative sequence of each clone.*
| rev_comp  | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. <br />*Set to null, as vidjil-algo gather reads from both strands in clones* |
649
| v_call, d_call, j_call  | string  | V/D/J gene with allele. For example, IGHV4-59\*01. <br /> *implemented. In the case of uncomplete/unexpected recombinations (locus with a `+`), we still use `v/d/j_call`. Note that this value can be null on clones beyond the `--max-designations` option.* |
650 651 652
| junction  | string  |      Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. <br />*null*
| junction_aa  | string  | Junction region amino acid sequence.      <br />*implemented*
| cdr3_aa | string | Amino acid translation of the cdr3 field.   <br />*implemented*
653
| productive | boolean | True if the V(D)J sequence is predicted to be productive.  <br /> *true, false, or null when no CDR3 has been detected* |
654 655
| sequence_alignment  | string  | Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. <br /> *null*                                         |
| germline_alignment | string  | Assembled, aligned, fully length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). <br />*null*
656
| v_cigar, d_cigar, j_cigar | string  | CIGAR strings for the V/D/J gene <br />*null*.
657

658 659 660
Currently, we do not output alignment strings.
Our implementation of .tsv may evolve in future versions.
Contact us if a particular feature does interest you.
661 662


663 664 665
## Headers in the .vdj.fa files (deprecated)

The `.vdj.fa` format is compatible with the FASTA format.
666

667 668 669 670
The FASTA header of each sequence gives some details on the V(D)J recombinations.
The format of these headers is described below, but is considered as deprecated and may be removed in future releases in Q3 2021.
For post-processing tools needing some of that information, it is thus not recommended to parse these headers,
but rather to use either the `.vidjil` file that contains more information in a structured way, or the AIRR `.tsv` output.
Mikaël Salson's avatar
Mikaël Salson committed
671

672
In a `.vdj.fa` format, a line starting with a \> is of the following form:
Mikaël Salson's avatar
Mikaël Salson committed
673

674
``` diff
675
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments
Mikaël Salson's avatar
Mikaël Salson committed
676

677
        name          sequence name (include the number of occurrences in the read set and possibly other information)
Mikaël Salson's avatar
Mikaël Salson committed
678
        +             strand on which the sequence is mapped
679
        VDJ           type of designation (can be "VJ", "VDJ", "VDDJ", "53"...
680
                      or shorter tags such as "V" for incomplete sequences).    
681 682 683
```
The following lines are for VDJ recombinations:
``` diff
Mathieu Giraud's avatar
Mathieu Giraud committed
684
        startV endV   start and end position of the V gene in the sequence (start at 1)
Mikaël Salson's avatar
Mikaël Salson committed
685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
701

702
        comments      optional comments. In Vidjil, the following comments are now used:
703
                      - "seed" when this comes for the first pass (.detected.vdj.fa). See the warning above.
704
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
705 706
                      - the name of the locus (TRA, TRB, TRG, TRD, IGH, IGL, IGK, possibly followed
                        by a + for incomplete/unusual recombinations)
Mikaël Salson's avatar
Mikaël Salson committed
707

708
```
Mikaël Salson's avatar
Mikaël Salson committed
709

Mikaël Salson's avatar
Mikaël Salson committed
710 711 712 713 714
Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
Mikaël Salson's avatar
Mikaël Salson committed
715

716
``` diff
717
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  comments
718
```
719
In the `.detected.vdj.fa` file, the start/end positions of V and J genes are only an estimation,
720 721 722
get from the k-mer heuristics, as the center of the window may be shifted up to 15 bases from the actual center.
In the final `.vdj.fa` file, these values are the correct ones computed after dynamic programming comparison
with germline genes.
723

724
# Examples of use
725

726
## Basic usage
727

728 729 730 731 732 733 734
On PCR-based datasets with primers in the V(D)J regions
(such as EuroClonality-NGS or EuroClonality/BIOMED-2 primer sets),
almost all of the reads are expected to be actual V(D)J recombinations.
On the other side, typical whole RNA-Seq or capture datasets usually have 
only a (very) small portion of recombined sequences.
The following commands work in both cases, detecting the locus for each recombined read,
clustering such reads into clones, and further analyzing the clones.
735

736
``` bash
737 738 739 740 741
./vidjil-algo -c clones   -g germline/homo-sapiens.g -2 -3 -r 1  demo/Demo-X5.fa
  # Detect the locus for each read, cluster and report clones starting from the first read (-r 1).
  # including unexpected recombinations (-2). Assign the V(D)J genes and try to detect the CDR3s (-3).
  # Demo-X5 is a collection of sequences on all human locus, including some incomplete or unusual recombinations:
  # IGH (VDJ, DJ), IGK (VJ, V-KDE, Intron-KDE), IGL, TRA, TRB (VJ, DJ), TRG and TRD (VDDJ, Dd2-Dd3, Vd-Ja).
742
```
743

744
``` bash
745
./vidjil-algo -g germline/homo-sapiens.g:IGH -3 demo/Stanford_S22.fasta
746 747
   # Cluster the reads and report the clones, based on windows overlapping IGH CDR3s.
   # Assign the V(D)J genes and try to detect the CDR3 of each clone.
748 749 750
   # Main output files are both out/Stanford_S22.vidjil and out/Stanford_S22.tsv.
   # Summary of clones is available on stdout.

751
```
752

753
``` bash
754
./vidjil-algo -g germline/homo-sapiens.g -2 -3 -d demo/Stanford_S22.fasta
755
   # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
756
   # Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
757
   # Assign the VDJ genes (including multiple D) and try to detect the CDR3 of each clone.
758 759
   # Main output files are both out/reads.vidjil and out/reads.tsv.
   # Summary of clones is available on stdout.
760
```
761

762
## Sorting reads from whole RNA-Seq or capture datasets
763

764
``` bash
765
./vidjil-algo -g germline/homo-sapiens.g -2 -U demo/Stanford_S22.fasta
766
   # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
767
   # Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
768
   # Assign the VDJ genes and try to detect the CDR3 of each clone.
769
   # The out/reads.detected.vdj.fa include all reads where a V(D)J recombination was found
770
```
771

772
Typical whole RNA-Seq or capture datasets may be huge (several GB) but with only a (very) small portion of recombined sequences.
773
Using Vidjil with `-U` will create a `out/reads.detected.vdj.fa` file
774
that includes all reads where a V(D)J recombination (or an unexpected recombination, with `-2`) was found.
775
This file will be relatively small (a few kB or MB) and can be taken again as an input for Vidjil-algo or for other programs.
776

777
## Advanced usage
778

779 780
An experimental further clustering can be triggered with `--cluster-epsilon`.