vidjil-algo.md 49.7 KB
Newer Older
Mathieu Giraud's avatar
Mathieu Giraud committed
1
# vidjil-algo 2021.02
2 3
**Command-line manual**

4
*The Vidjil team (Mathieu, Mikaël, Aurélien, Florian, Marc, Tatiana and Rayan)*
Mikaël Salson's avatar
Mikaël Salson committed
5

6 7
```
  Vidjil -- High-throughput Analysis of V(D)J Immune Repertoire -- [[http://www.vidjil.org]]
Mathieu Giraud's avatar
Mathieu Giraud committed
8
  Copyright (C) 2011-2021 by Bonsai bioinformatics
9
  at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille
10
  and VidjilNet consortium.
11 12 13 14 15 16
  contact@vidjil.org
```

This is the help of vidjil-algo, for command-line usage.
This manual can be browsed online:

17 18
 - <http://www.vidjil.org/doc/vidjil-algo>                (last stable release)
 - <http://gitlab.vidjil.org/blob/dev/doc/vidjil-algo.md> (development version)
19 20 21 22 23 24 25

Other documentation (users and administrators of the web application, developpers) can be found from <http://www.vidjil.org/doc/>.


## About

*V(D)J recombinations* in lymphocytes are essential for immunological
Mikaël Salson's avatar
Mikaël Salson committed
26 27 28
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
29
With adapted [library preparation and sequencing](http://www.vidjil.org/doc/locus),
30
high-throughput sequencing (NGS/HTS) now
31 32
enables the deep sequencing of a lymphoid population with dedicated
sequencing methods and software, called either Rep-Seq or AIRR-Seq.
Mikaël Salson's avatar
Mikaël Salson committed
33

34
Vidjil-algo processes high-throughput sequencing data to extract V(D)J
35
junctions and gather them into clones. Vidjil-algo starts
Mikaël Salson's avatar
Mikaël Salson committed
36 37
from a set of reads and detects "windows" overlapping the actual CDR3.
This is based on an fast and reliable seed-based heuristic and allows
38
to output all sequenced clones. The analysis is extremely fast
Mathieu Giraud's avatar
Mathieu Giraud committed
39
because, in the first phase, no alignment is performed with database
40
germline sequences. At the end, only the consensus sequences
41
of each clone have to be analyzed. Vidjil-algo can also cluster similar
42
clones, or leave this to the user after a manual review in the web application.
Mikaël Salson's avatar
Mikaël Salson committed
43

44 45
The method is described in the following references:

46
- Marc Duez et al.,
47
“Vidjil: A web platform for analysis of high-throughput repertoire sequencing”,
48
PLOS ONE 2016, 11(11):e0166126
49
<http://dx.doi.org/10.1371/journal.pone.0166126>
Mikaël Salson's avatar
Mikaël Salson committed
50

51
- Mathieu Giraud, Mikaël Salson, et al.,
Mikaël Salson's avatar
Mikaël Salson committed
52
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
53
BMC Genomics 2014, 15:409
54
<http://dx.doi.org/10.1186/1471-2164-15-409>
Mikaël Salson's avatar
Mikaël Salson committed
55

56
Vidjil-algo is open-source, released under GNU GPLv3+ license.
Mikaël Salson's avatar
Mikaël Salson committed
57

58
# Requirements and installation
59

60
## Supported platforms
Mikaël Salson's avatar
Mikaël Salson committed
61

62 63 64
Vidjil-algo is systematically tested with the following compilers :

  - gcc/g++ 4.8, 5.3, 6.3, 7.3, 8.4, 9.3, 10.1
65
  - clang 3.4, 4.0, 6.0, 7.0
Mikaël Salson's avatar
Mikaël Salson committed
66

67 68 69 70 71
These compilers are available on recent OS X and on the following Linux distributions:
  - CentOS 7, 8
  - Debian Jessie 8.0, Stretch 9.0, Buster 10.0
  - FreeBSD 9.2, 10, 11, 12
  - Ubuntu 16.04 LTS, 18.04 LTS, 20.04 LTS
Mikaël Salson's avatar
Mikaël Salson committed
72

73
Vidjil-algo is developed with continuous integration using systematic unit and functional testing.
74 75
The development team internally uses [Gitlab CI](http://gitlab.vidjil.org/pipelines) for that,
and the tested compilers are run through Docker containers described in `.gitlab-ci-compilers.yml`.
Mikaël Salson's avatar
Mikaël Salson committed
76

77
## Build requirements (optional)
78

79
This paragraph details the requirements to build Vidjil-algo from source.
80
You can also download a static binary, see [installation](#installation).
81

82
To compile Vidjil-algo, make sure:
83

84
  - to be on a POSIX system ;
85
  - to have a C++11 compiler (as `g++` 4.8 or above, or `clang` 3.4 or above).
86 87
  - to have the `zlib` installed (`zlib1g-dev` package under Debian/Ubuntu,
    `zlib-devel` package under Fedora/CentOS).
88
  - to have GNU make (`gmake` under FreeBSD). On some FreeBSD distributions, it was required to use commands such as
89
``` bash
90
make MAKE=gmake CXXFLAGS="-std=c++11 -O2 Wall -D_GLIBCXX_USE_C99 -Wl,-rpath=/usr/local/lib/gcc49"
91
```
92
    The `gcc49` at the end of the command line is to be replaced by the `gcc` version used.
93

Mathieu Giraud's avatar
Mathieu Giraud committed
94

95
## Installation
Mathieu Giraud's avatar
Mathieu Giraud committed
96

97 98
### Download

99 100 101 102 103 104
These instructions targets *stable releases* of vidjil-algo, as downloaded from <http://www.vidjil.org/releases>
or <http://bioinfo.lifl.fr/vidjil/>.

Development code is found at <http://gitlab.vidjil.org>, in the `algo` directory.
and compiling and running vidjil-algo on the development code can involve slightly different commands,
including replacing `src` by `algo`.
105

106
### Compiling
Mikaël Salson's avatar
Mikaël Salson committed
107

108 109
Running `make` from the extracted archive should be enough to install vidjil-algo with germline and demo files.
It runs the three following `make` commands.
110

111
``` bash
Mikaël Salson's avatar
Mikaël Salson committed
112

113
make germline
Mathieu Giraud's avatar
Mathieu Giraud committed
114
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
Mikaël Salson's avatar
Mikaël Salson committed
115 116 117 118 119 120
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

121

122
make -C src              # build vijil-algo from the sources (see the requirements,
123 124
                         # another option is: wget http://www.vidjil.org/releases/vidjil-algo-latest_x86_64 -O vidjil-algo
                         # to download a static binary (built for x86_64 architectures)
125

126
make demo                # download demo files (S22 and L4, see demo/get-sequences)
127

128
./vidjil-algo -h         # display help/usage
129
```
Mikaël Salson's avatar
Mikaël Salson committed
130

131
On some older systems you may need to replace the `make` commands with:
132

133
``` bash
134
make LDFLAGS='-stdlib=libc++'  ### OS X Mavericks
135 136 137
```

## Self-tests (optional)
138 139 140

You can run the tests with the following commands:

141
``` bash
142
make -C src/tests/data
Mathieu Giraud's avatar
Mathieu Giraud committed
143
   # get IGH recombinations from a single individual, as described in:
144 145 146 147
   # Boyd, S. D., and al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

148
make -C src test                # run self-tests (can take 5 to 60 minutes)
149
```
Mikaël Salson's avatar
Mikaël Salson committed
150

151
# Input and parameters
152

153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
The `-h` and `-H` help options provide the list of parameters that can be
used. We detail here the options of the main `-c clones` command.

The default options are very conservative (large window, no further
automatic clusterization, see below), leaving the user or other
software making detailed analysis and decisions on the final
clustering.

### Input selection

```
Positionals
  reads_file                  reads file, in one of the following formats:
                                  - FASTA (.fa/.fasta, .fa.gz/.fasta.gz)
                                  - FASTQ (.fq/.fastq, .fq.gz/.fastq.gz)
                                  - BAM (.bam)
                              Paired-end reads should be merged before given as an input to vidjil-algo.
                              Uncompressed FASTA/FASTQ reads can be given from standard input with '-'.
171 172 173 174

Input
  -x, --first-reads INT       maximal number of reads to process ('all': no limit, default), only first reads
  -X, --sampled-reads INT     maximal number of reads to process ('all': no limit, default), sampled reads
175 176
```

177 178 179
The main input file of Vidjil-algo is a *set of reads*, given as a `.fasta`
or `.fastq` file, possibly compressed with gzip (`.gz`).
This set of reads can reach several gigabytes and 2\*10<sup>9</sup> reads. It is
180
never loaded entirely in the memory, but reads are processed one by
181
one by Vidjil-algo.
182 183
FASTA/FASTQ reads can also be given on the standard input by giving `-` instead of a file.

184
Vidjil-algo can also process BAM files, but please note that:
185

186 187 188 189
1.  The reads don't need to be aligned beforehand.
2.  In case of paired-end sequencing, the reads must have already been merged
    in the BAM file.

190 191 192 193
The `--first-reads` option restricts the analysis on a few sequences, for example to probe a large file or to test some parameters.
However, read files may be not homogeneous, with biais in the sequences at the start of the file.
The `--sampled-reads` option rather considers *regularly sampled sequences* from the file.
It is thus generally safe to run `--sampled-reads 1000` to have a fast insight of what there is in some data.
194

195
## Germline presets: locus and recombination selection
196 197

``` diff
198
Germline/recombination selection (at least one -g or -V/(-D)/-J option must be given)
199 200
  -g, --germline GERMLINES ...

201 202
         -g <.g FILE>(:FOCUS) ...
                    germline preset(s) (.g file(s)), detecting multiple recombinations, with tuned parameters.
203
                    Common values are '-g germline/homo-sapiens.g' or '-g germline/mus-musculus.g'
204
                    One can focus on some recombinations, such as in '-g germline/homo-sapiens.g:IGH,IGK,IGL'
205
         -g PATH
206
                    human germline preset, shortcut for '-g PATH/homo-sapiens.g',
207
                    processes human TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with incomplete/unusal recombinations
208
  -V FILE ...                 custom V germline multi-fasta file(s)
209
  -D FILE ...                 custom D germline multi-fasta file(s) for V(D)J designation
210
  -J FILE ...                 custom V germline multi-fasta file(s)
211
  -2                          try to detect unexpected recombinations
212
```
213

214
The `germline/*.g` presets configure the analyzed recombinations.
215 216
The following presets are provided:

217
  - `germline/homo-sapiens.g`: Homo sapiens, TR (`TRA`, `TRB`, `TRG`, `TRD`) and Ig (`IGH`, `IGK`, `IGL`) locus,
Mathieu Giraud's avatar
Mathieu Giraud committed
218
    including incomplete/unusal recombinations (`TRA+D`, `TRB+`, `TRD+`, `IGH+`, `IGK+`, see <locus.md>.
219 220
  - `germline/homo-sapiens-isotypes.g`: Homo sapiens heavy chain locus, looking for sequences with, on one side, IGHJ (or even IGHV) genes,
    and, on the other side, an IGH constant chain.
221 222
  - `germline/homo-sapiens-isoforms.g`: Homo sapiens IKZF1 and ERG recombinations.
  - `germline/homo-sapiens-cd.g`: Homo sapiens, common CD genes (experimental, does not check for recombinations).
223 224
  - `germline/mus-musculus.g`: Mus musculus (strains BALB/c and C57BL/6)
  - `germline/rattus-norvegicus.g`: Rattus norvegicus (strains BN/SsNHsdMCW and Sprague-Dawley)
225

226 227 228 229
  - Recombinations can be filtered, such as in
    `-g germline/homo-sapiens.g:IGH` (only IGH, complete recombinations),
    `-g germline/homo-sapiens.g:IGH,IGH+` (only IGH, as well with incomplete recombinations)
    or `-g germline/homo-sapiens.g:TRA,TRB,TRG` (only TR locus, complete recombinations).
230

231
  - Several presets can be loaded at the same time, as for instance `-g germline/homo-sapiens.g -g germline/germline/homo-sapiens-isotypes.g`.
232

233
  - Using `-2` further test unexpected recombinations (tagged as `xxx`), as in `-g germline/homo-sapiens.g -2`.
234

235
Finally, the advanced `-V/(-D)/-J` options enable to select custom V, (D) and J repertoires given as `.fasta` files.
236

237
## Custom `germline/*.g` presets
238 239

New `germline/*.g` presets for other species or for custom recombinations can be created, possibly referring to other `.fasta` files.
240
This is an advanced usage, please contact us if you need help in configuring such other germlines.
241 242

Inside a `.g` file, the `systems` entries details how vidjil-algo looks for recombinations.
243
Let's look at the `IGH` entry in the `germline/homo-sapiens.g` preset:
244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267

```json
        "IGH": {
            "shortcut": "H",
            "color" : "#6c71c4",
            "description": "Human immunoglobulin, heavy locus (14q32.33)",
            "recombinations": [ {
                "5": ["IGHV.fa"],
                "4": ["IGHD.fa"],
                "3": ["IGHJ+down.fa"]
            } ],
            "parameters": {
                "seed": "12s"
            }
        }
```

The `shortcut` must be a unique 1-character string.
The `color` and `description` fields are not used by `vidjil-algo`, but rather by the web application.
The `parameters.seed` value of `12s` is equivalent to `-s 12s` advanced option on k-mer size described below.

Here `recombinations` describes one sequence analysis mode, called `543`:
a VJ junction is detected when there is a significant similarity (in terms of numbers of k-mers, see below) against sequences in `IGHV.fa` in the 5' region,
followed by a significant similarity in the 3' region against sequences in `IGHJ+down.fa`
268
– here we take both J genes and downstream sequences to improve the detection.
269 270 271 272 273 274

In a second pass (V(D)J designation), full alignment is done against these sequences.
The optional `4` entry  (`IGHD.fa`) is taken only there into account.
However, if a D is not detected and designated, the read will be designated as VJ.


275
The `TRD+` entry, for incomplete recombinations (see <locus.md>), shows an example where
276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299
both Vd-Dd3, Dd2-Jd (possibly Dd2-Dd-Jd), and Dd2-Dd3 recombinations are searched:

```json
    "recombinations": [ {
        "5": ["TRDV.fa"],
        "3": ["TRDD3+down.fa"]
    }, {
        "5": ["TRDD2+up.fa"],
        "4": ["TRDD.fa"],
        "3": ["TRDJ+down.fa"]
    }, {
        "5": ["TRDD2+up.fa"],
        "3": ["TRDD3+down.fa"]
    } ]
```

There is also an experimental sequence analysis mode, called `1`,
that detects similarities and designates sequences *without recombinations*,
as in `germline/homo-sapiens-cd.g`:

```json
     "recombinations": [ { "1": ["CD-sorting.fa"] } ]
```

300
This can be used to detect non-recombined known sequences,
301
as shown here with usual CD sequences in RNA-seq data.
302 303 304
However, putting too many sequences here may generate many hits
that may hide actual recombinations.

305
## Main algorithm parameters
306

307
``` diff
308 309 310 311
Recombination detection ("window" prediction, first pass)
    (use either -s or -k option, but not both)
    (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
    (all these options, except -w, are overriden when using -g)
312 313
  -k, --kmer INT              k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
  -w, --window INT            w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
314
  -e, --e-value FLOAT=1       maximal e-value for trusting the detection of a V-J recombination
315 316 317
  --trim INT                  trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt  (0: no trim)
  -s, --seed SEED=10s         seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitely or by an alias
                               10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
318
```
Mathieu Giraud's avatar
Mathieu Giraud committed
319

320
The `-s`, `-k` are the options of the seed-based heuristic that detects
Mathieu Giraud's avatar
Mathieu Giraud committed
321 322
"junctions", that is a zone in a read that is similar to V genes on its
left end and similar to J genes in its right end. A detailed
323
explanation can be found in (Giraud, Salson and al., 2014).
324
*These options are for advanced usage, the default values, as set in the `germline/*.g` presets, should work.*
325
The `-s` or `-k` option selects the seed used for the k-mer V/J affectation.
Mathieu Giraud's avatar
Mathieu Giraud committed
326

327 328
The `-w` option fixes the size of the "window" that is the main
identifier to cluster clones. The default value (`-w 50`) was selected
329
to ensure a high-quality clone clustering: reads are clustered when
330
they *exactly* share, at the nucleotide level, a 50 bp-window centered
331 332
on the CDR3. No sequencing errors are corrected inside this window.
The center of the "window", predicted by the high-throughput heuristic, may
Mikaël Salson's avatar
Mikaël Salson committed
333
be shifted by a few bases from the actual "center" of the CDR3 (for TRG,
Mathieu Giraud's avatar
Mathieu Giraud committed
334
less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction
335
in \>99% of cases when the reads are large enough). Usually, a 50 bp-window
336 337
fully contains the CDR3 as well as some part of the end of the V and
the start of the J, or at least some specific N region to uniquely identify the clone.
Mathieu Giraud's avatar
Mathieu Giraud committed
338

339
Setting `-w` to higher values (such as `-w 60` or `-w 100`) makes the clone clustering
340 341
even more conservative, enabling to split clones with low specificity (such as IGH with very
large D, short or no N regions and almost no somatic hypermutations). However, such settings
342
may detect recombinations in less reads, depending on the read length of your data, and may also
343 344
return more clones, as any sequencing error in the window is not corrected.

345
The special `-w all` option takes all the read as the windows, completely disabling
346
the clustering by windows and generally returning more clones. This should only be used on
347 348
datasets where reads of the same clone do have exactly the same length, or in situations
in which one want to study very precisely the clonality, tracking all mutations along the read.
349

350
Setting `-w` to lower values than 50 may analyze a few more reads, depending
351
on the read length of your data, but may in some cases falsely cluster reads from
352
different clones.
353 354
For VJ recombinations, the `-w 40` option is usually safe, and `-w 30` can also be tested.
Setting `-w` to lower values is not recommended.
Mathieu Giraud's avatar
Mathieu Giraud committed
355

356 357
When the read is too short too extract the requested length, the window can be shifted
(at most 10 bp) or shrinkened (down until 30bp) by increments of 5bp. Such reads
358
are counted in `SEG changed w` and the corresponding clones are output with the `W50` warning.
359

360
The `-e` option sets the maximal e-value accepted for analyzing a sequence.
361 362 363
It is an upper bound on the number of designated sequences found by chance by vidjil-algo.
The e-value computation takes into account both the number of locus searched for
and, for the defaut `-c clones` command, the number of reads in the input sequence.
364
The default value is 1.0, but values such as 1000, 1e-3 or even less can be used
365
to have a more or less permissive detection and designation.
366
The threshold can be disabled with `-e all`.
367

368 369 370 371
The advanced `--e-value-kmer` option sets the e-value for the seed-based heuristic.
It is an upper bound on the number of expected windows found by chance.
The default value is the same than value than the `-e`.

372
The advanced `--trim` option sets the maximal number of nucleotides that will be indexed in
373 374
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
375
However giving a `--trim` may also reduce the probability of seeing a heavily
376
trimmed or mutated V gene.
377
The default is `--trim 0`.
378

379
## Thresholds on clone output
380 381 382

The following options control how many clones are output and analyzed.

383
``` diff
384 385 386
Limits to report and to analyze clones (second pass)
  -r, --min-reads INT=5       minimal number of reads supporting a clone
  --min-ratio FLOAT=0         minimal percentage of reads supporting a clone
387
  --max-clones INT            maximal number of output clones ('all': no maximum, default)
388 389 390 391 392 393
  -y, --max-consensus INT=100 maximal number of clones computed with a consensus sequence ('all': no limit)
  -z, --max-designations INT=100
                              maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use)
  --all                       reports and analyzes all clones
                              (--min-reads 1 --min-ratio 0 --max-clones all --max-consensus all --max-designations all),
                              to be used only on small datasets (for example --all -X 1000)
394
```
395

396
The `-r/--ratio` options are strong thresholds: if a clone does not have
397
the requested number of reads, the clone is discarded (except when
398
using `--label`, see below).
399 400 401
The default `-r 5` option is meant to only output clones that
have a significant read support. **You should use** `-r 1` **if you
want to detect all clones starting from the first read** (especially for
402 403
MRD detection).

404 405
The `--max-clones` option limits the number of output clones, even without consensus sequences.

406
The `--max-consensus` option limits the number of clones for which a consensus
407
sequence is computed. Usually you do not need to have more
408
consensus (see below), but you can safely put `--max-consensus all` if you want
409
to compute all consensus sequences.
410

411
The `--max-designations` option limits the number of clones that are fully analyzed,
412
*with their V(D)J designation and possibly a CDR3 detection*,
413
in particular to enable the web application
414
to display the clones on the grid (otherwise they are displayed on the
415
'?/?' axis).
416 417 418 419 420 421 422 423

These V(D)J designations are obtained by full comparison (dynamic programming)
with all germline sequences.
Note that these designations are relatively slow to compute, especially
for the IGH locus. However, they
are not at the core of the Vidjil clone clustering method (which
relies only on the 'window', see above).
To check the quality of these designations, the automated test suite include
Mathieu Giraud's avatar
Mathieu Giraud committed
424
[sequences with manually curated V(D)J designations](should-vdj.md).
425

426 427
If you want to analyze more clones, you should use `--max-designations 200` or
`--max-designations 500`. It is not recommended to use larger values: outputting more
428
than 500 clones is often not useful since they can not be visualized easily
429
in the web application, and takes more computation time.
430

431
Note that even if a clone is not in the top 100 (or 200, or 500) but
432
still passes the `-r`, `--ratio` options, it is still reported in both the `.vidjil`
433
and `.vdj.fa` files. If the clone is at some MRD point in the top 100 (or 200, or 500),
434
it will be fully analyzed by this other point (and then
435
collected by the `fuse.py` script, using consensus sequences computed at this
436
other point, and then, on the web application, correctly displayed on the grid).
437
**Thus is advised to leave the default** `--max-designations 100` **option
438
for the majority of uses.**
439

440 441 442
The `--all` option disables all these thresholds. This option can be
used for test and debug purposes or on small datasets.
It produces large file and takes more time. 
443

444
The `--analysis-filter` advanced option speeds up the full analysis by a pre-processing step,
Mathieu Giraud's avatar
Mathieu Giraud committed
445 446 447
again based on k-mers, to select a subset of the V germline genes to be compared to the read.
The option gives the typical size of this subset (it can be larger when several V germlines
genes are very similar, or smaller when there are not enough V germline genes).
448 449
The default `--analysis-filter 3` is generally safe.
Setting `--analysis-filter all` removes this pre-processing step, running a full dynamic programming
450
with all germline sequences that is much slower.
Mikaël Salson's avatar
Mikaël Salson committed
451

452
## Sequences of interest
453

454
Vidjil-algo allows to indicate that specific sequences should be followed and output,
455
even if those sequences are 'rare' (below the `-r/--ratio` thresholds).
456 457
Such sequences can be provided either with `--label <sequence>`, or with `--label-file <file>`.
The file given by `--label-file` should have one sequence by line, as in the following example:
Mikaël Salson's avatar
Mikaël Salson committed
458

459
``` diff
460 461
GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1
GAGAGATGGACGGAATACGTTAAACGACATATGGTTCGGGGTATGGTGCT my-clone-2 foo
462
```
463

464
Sequences and labels must be separated by one space.
465 466
The first column of the file is the sequence to be followed
while the remaining columns consist of the sequence's label.
467
In Vidjil-algo output, the labels are output alongside their sequences.
468

469
A sequence given `--label <sequence>` or with `--label-file <file>` can be exactly the size
470
of the window (`-w`, that is 50 by default). In this case, it is guaranteed that
471 472
such a window will be output if it is detected in the reads.
More generally, when the provided sequence differs in length with the windows
473 474
we will keep any windows that contain the sequence of interest or, conversely,
we will keep any window that is contained in the sequence of interest.
475
This filtering will work as expected when the provided sequence overlaps
476 477 478
(at least partially) the CDR3 or its close neighborhood,
but will not work when the sequence is far of the CDR3 (except when
using large `-w` values or `-w all`).
479

480
With the `--label-filter` option, *only* the windows related to the given sequences are kept.
481
This allows to quickly filter a set of reads, looking for a known sequence or window,
482 483 484 485
with the `--grep-reads <sequence>` preset, equivalent to
`--out-reads --label-filter --label <sequence>`:
All the reads with the windows related to the sequence will be extracted 
to files such as `out/seq/clone.fa-1`.
486

487
## Further clone analysis: V(D)J designation, CDR3 detection
488 489
Note that such sequences must have been detected as a V(D)J (or V(D)J-like) recombination
in the first pass: the `--label`, `-label-file`, or `--label-filter` options can not
490
 detect a recombination that was not detected when removing all the thresholds with `--all`.
491 492
To increase the sensitivity, see above the `--e-value` option, or,
to look for non-recombined sequences, see above the experimental `1` sequence analysis.
493

494
``` diff
495 496 497 498 499
Clone analysis (second pass)
  -d, --several-D             try to detect several D (experimental)
  -3, --cdr3                  CDR3/JUNCTION detection (requires gapped V/J germlines)
```

500
The `-3` option launches a CDR3/JUNCTION detection based on the position
501
of Cys104 and Phe118/Trp118 amino acids. This detection relies on alignment
502
with gapped V and J sequences, as for instance, for V genes, IMGT/GENE-DB sequences,
503
as provided by `make germline`.
504 505
The CDR3/JUNCTION detection won't work with custom non-gapped V/J repertoires.

506
CDR3 are reported as *productive* when they come from an in-frame recombination
507
and when the sequence does not contain any in-frame stop codons.
508
Note that some other software only consider stop codons in the CDR3,
509 510 511 512
and may thus under-estimate non-productivity.
When the sequence is long enough to start before the start of the V gene
or to end after the end of the J gene, vidjil-algo do not consider these intronic sequences
in the productivity estimation.
513

514
The advanced `--analysis-cost` option sets the parameters used in the comparisons between
515 516
the clone sequence and the V(D)J germline genes. The default values should work.

517 518
The e-value set by `-e` is also applied to the V/J designation.
The `-E` option further sets the e-value for the detection of D segments.
Mathieu Giraud's avatar
Mathieu Giraud committed
519

520
## Further clustering (experimental)
521

522 523
The following options are experimental and have no consequences on the `.vdj.fa` file,
nor on the standard output. They instead add a `clusters` sections in the `.vidjil` file
524
that will be visualized in the web application.
525 526
Any such clustering should be avoided when one wants to precisely study hypermutations.
The web application provides other options to inspect clones and cluster them.
527

528 529
The `--cluster-epsilon` option triggers an automatic clustering using the
[DBSCAN](https://en.wikipedia.org/wiki/DBSCAN) algorithm (Ester and al., 1996).
530
Using `--cluster-epsilon 5` usually clusters reads within a distance of 1 mismatch (default score
Mathieu Giraud's avatar
Mathieu Giraud committed
531
being +1 for a match and -4 for a mismatch). With that option, more distant reads will also
532
be clustered as soon there are more than 10 reads within the distance threshold.
Mathieu Giraud's avatar
Mathieu Giraud committed
533
This behaviour can be controlled with the `-cluster-N` option.
Mikaël Salson's avatar
Mikaël Salson committed
534

535 536 537
Setting `--cluster-epsilon 10`, possibly with `--cluster-N 5` or `--cluster-N 1`
will perform more aggressive clustering and is generally not advised.

538
The `--cluster-forced-edges` option allows to specify a file for manually clustering two windows
539
considered as similar. Such a file may be automatically produced by vidjil-algo
540
(`out/edges`), depending on the option provided. Only the two first columns
541
(separed by one space) are important to vidjil-algo, they only consist of the
Mikaël Salson's avatar
Mikaël Salson committed
542
two windows that must be clustered.
Mikaël Salson's avatar
Mikaël Salson committed
543

544
# Output
Mikaël Salson's avatar
Mikaël Salson committed
545

546
## Main output files
547

548
The default output of Vidjil-algo (with the default `-c clones` command) are the two following files:
549

550 551 552 553
  - The `.vidjil` file is the *main output file*, containing the most information.
    The file is in a `.json` format,
    its specification is detailed in [vidjil-format](vidjil-format.md).
    It describes the clones, with the windows and their count, the consensus sequences (`--max-consensus`),
554
    the detailed V(D)J and CDR3 designation (`--max-designations`, see warning below), and possibly
555 556
    the results of the further clustering.
    
557
    The web application takes this `.vidjil` file ([possibly merged with `fuse.py`](#following-clones-in-several-samples)) for the *visualization and analysis* of clones and their
558 559
    tracking along different samples (for example time points in a MRD
    setup or in a immunological study).
Mathieu Giraud's avatar
Mathieu Giraud committed
560
    Please see the [web application user manual](http://www.vidjil.org/doc/user) for more information.
561 562 563

  - The `.tsv` file is the AIRR output, for compatibility with other software
    using the same format. See [below](#airr-tsv-output) for details.
564

565

566 567
By default, these output files are named
`out/basename.vidjil` and `out/basename.tsv`, where:
568

569 570
  - `out` is the directory where all the outputs are stored (can be changed with the `--dir` option).
  - `basename` is the basename of the input `.fasta/.fastq` file (can be overriden with the `--base` option)
571

572 573
With the `--gz` option, both files are output
as compressed `.vidjil.gz` and `.tsv.gz` files.
574 575 576 577

Vidjil-algo also outputs the first 50 clones on the standard output.
More data can be printed on the standard output with the `-v` option.

578
## Auxiliary output files
579

580 581 582 583 584 585 586 587
### `.vdj.fa`

With the `--out-vdjfa` option, a `.vdj.fa` file is created (or, with `--gz`, a `.vdj.fa.gz` file).
This is *a FASTA file for further processing by other bioinformatics tools*.
Even if it is advised to rather use the full information in the `.vijdil` file,
the `.vdj.fa` is a convenient way to have sequences of clones for further processing.
These sequences are at least the windows (and their count in the headers) or
the consensus sequences (`--max-consensus`) when they have been computed.
Mathieu Giraud's avatar
Mathieu Giraud committed
588
The [headers](#headers-in-the-vdjfa-files-deprecated) are described below, but the format of the headers is deprecated
589 590 591 592 593 594 595 596 597 598
and will not be enforced in future releases.
Some other informations such as the further clustering are not output in this file.

The `.vdj.fa` output enables to use Vidjil-algo as a *filtering tool*,
shrinking a large read set into a manageable number of (pre-)clones
that will be deeply analyzed and possibly further clustered by
other software.

### `.windows.fa`

599
The `out/basename.windows.fa` file contains the list of windows, with number of occurrences:
600

601
``` diff
602
>8--window--1
603
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
604
>5--window--2
605
CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC
606 607
ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC
(...)
608
```
609

610
Windows of size 50 (modifiable by `-w`) have been extracted.
611 612
The first window has 8 occurrences, the second window has 5 occurrences.

613 614
### `seq/clone.fa-*`

615 616
With the `--out-clone-files` option, one `out/seq/clone.fa-*` file is created for each clone.
It contains the detailed analysis by clone, with
617
the window, the consensus sequence, as well as with the most similar V, (D) and J germline genes:
618

619
``` diff
620 621
>clone-001--IGH--0000008--0.0608%--window
TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG
622
>clone-001--IGH--0000008--0.0608%--lcl|FLN1FA001CPAUQ.1|-[105,232]-#2 - 128 bp (55% of 232.0 bp) + VDJ  0 54 73 84 85 127   IGHV3-23*05 6/ACCCGGGAGGAACAATAT/9 IGHD6-13*01 0//5 IGHJ4*02  IGH SEG_+ 1.946653e-19 1.352882e-19/5.937712e-20
623 624 625 626 627 628 629 630 631 632
GCTGTACCTGCAAATGAACAGCCTGCGAGCCGAGGACACGGCCACCTATTACTGT
ACCCGGGAGGAACAATATAGCAGCTGGTAC
TTTGACTTCTGGGGCCAGGGGATCCTGGTCACCGTCTCCTCAG

>IGHV3-23*05
GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTTATAGCAGTGGTAGTAGCACATACTATGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA
>IGHD6-13*01
GGGTATAGCAGCAGCTGGTAC
>IGHJ4*02
ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG
633
```
634

635 636
The `--out-reads` debug option further output in each `out/seq/clone.fa-*` files the full list of reads belonging to this clone.
The `--out-reads` option produces large files, and is not recommended in general cases.
637

638
## Diversity measures
639

640
Several [diversity indices](https://en.wikipedia.org/wiki/Diversity_index) are reported, both on the standard output and in the `.vidjil` file:
641

642 643 644
  - H (`index_H_entropy`): Shannon's diversity
  - E (`index_E_equitability`): Shannon's equitability
  - Ds (`index_Ds_diversity`): Simpson's diversity
645

646
E ans Ds values are between 0 (no diversity, one clone clusters all analyzed reads)
647
and 1 (full diversity, each analyzed read belongs to a different clone).
648
These values are now computed on the windows, before any further clustering.
649
PCR and sequencing errors can thus lead to slightly over-estimate the diversity.
650

651
## Reads without detected recombinations
652

653
Vidjil-algo outputs details statistics on the reads where no recombination was detected
654 655
Basically, **an unanalyzed read is a read where Vidjil-algo cannot identify a window at the junction of V and J genes**.
To properly analyze a read, Vijdil-algo needs that the sequence spans enough V region and J region
656
(or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations).
657
The following causes are reported:
658

659 660 661 662 663 664 665 666
|                     |                                                                                                                          |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| `UNSEG too short`   | Reads are too short, shorter than the seed (by default between 9 and 13 bp).                                             |
| `UNSEG strand`      | The strand is mixed in the read, with some similarities both with the `+` and the `-` strand.                            |
| `UNSEG too few V/J` | No information has been found on the read: There are not enough similarities neither with a V gene or a J gene.          |
| `UNSEG only V/5`    | Relevant similarities have been found with some V, but none or not enough with any J.                                    |
| `UNSEG only J/3`    | Relevant similarities have been found with some J, but none or not enough with any V.                                    |
| `UNSEG ambiguous`   | vidjil-algo finds some V and J similarities mixed together which makes the situation ambiguous and hardly solvable.      |
667
| `UNSEG too short w` | The junction can be identified but the read is too short so that vidjil-algo could extract the window (by default 50bp). It often means the junction is very close from one end of the read.  |
668 669 670 671

Some datasets may give reads with many low `UNSEG too few` reads:

  - `UNSEG too few V/J` usually happens when reads share almost nothing with the V(D)J region.
672 673
    This is expected when the PCR or capture-based approach included other regions, such as in whole RNA-seq.

674
  - `UNSEG only V/5` and `UNSEG only J/3` happen when reads do not span enough the junction zone.
675
    Vidjil-algo detects a “window” including the CDR3. By default this window is 50bp long,
676 677
    so the read needs be that long centered on the junction.

678 679
See the [user manual](user.md) for information on the biological or sequencing causes 
that can lead to few analyzed reads.
680

681
## Filtering reads
682

683 684
``` diff
Detailed output per read (generally not recommended, large files, but may be used for filtering, as in -uu -X 1000)
685 686 687 688 689
  -U, --out-detected          output reads with detected recombinations (in .detected.vdj.fa file)
  -u, --out-undetected
        -u          output undetected reads, gathered by cause, except for very short and 'too few V/J' reads (in *.fa files)
        -uu         output undetected reads, gathered by cause, all reads (in *.fa files) (use only for debug)
        -uuu        output undetected reads, all reads, including a .undetected.vdj.fa file (use only for debug)
690 691
  --out-reads                 output all reads by clones (clone.fa-*), to be used only on small datasets
  -K, --out-affects           output detailed k-mer affectation for each read (in .affects file) (use only for debug, for example -KX 100)
692 693
```

694 695 696
It is possible to extract all reads with or without detected recombinations,
possibly to give them to other software.
Runing Vidjil-algo with `-U` gives a file `out/basename.detected.vdj.fa`, with all detected reads.
697
On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file.
698
However, the `-U` option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations.
699 700
Moreover `-U` only uses the ultra-fast first passs analysis, based on k-mer heuristics.

701

702
Similarly, options are available to get the non analyzed reads:
Mathieu Giraud's avatar
Mathieu Giraud committed
703

704
  - `-u` gives a set of files `out/basename.UNSEG_*`, with not detected reads gathered by cause.
705 706 707
    It outputs only reads sharing significantly sequences with V/J germline genes or with some ambiguity:
    it may be interesting to further study RNA-Seq datasets.

708 709
  - `-uu` gives the same set of files, including **all** not detected reads (including `UNSEG too short` and `UNSEG too few V/J`),
    and `-uuu` further outputs all these reads in a file `out/basename.undetected.vdj.fa`.
710 711

Again, as these options may generate large files, they are generally not recommended.
712 713 714
However, they are very useful in some situations, especially to understand
why some dataset gives low detection rate.
For example `-uu -X 1000` splits the not detected reads from the 1000 first reads.
Mikaël Salson's avatar
Mikaël Salson committed
715

716 717 718

## AIRR .tsv output

Mathieu Giraud's avatar
Mathieu Giraud committed
719
Since version 2018.10, vidjil-algo supports the [AIRR format](http://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields).
720
We export all required fields, some optional fields, as also some custom fields (+).
Mathieu Giraud's avatar
Mathieu Giraud committed
721
We also propose in [fuse.py](tools.md) a way to convert AIRR format to the `.vidjil` format.
722

723 724
Note that Vidjil-algo is designed to efficiently gather reads from large datasets into clones. 
By default (`-c clones`), we thus report in the AIRR format *clones*.
725
See also [What is a clone ?](vidjil-format/#what-is-a-clone).
726
Using `-c designations` trigger a separate analysis for each read, but this is usually not advised for large datasets. 
727 728 729 730


| Name  | Type | AIRR 1.2 Description <br /> *vidjil-algo implementation* |
| ----- | ---- |  ------------------------------------------------------- |
Mathieu Giraud's avatar
Mathieu Giraud committed
731
| locus | string | Gene locus (chain type). For example, `IGH`, `IGK`, `IGL`, `TRA`, `TRB`, `TRD`, or `TRG`.<br />*Vidjil-algo outputs all these loci. Moreover, the incomplete recombinations analyzed by vidjil-algo are reported as `IGH+`, `IGK+`, `TRA+D`, `TRB+`, `TRD+`, and `xxx` for unexpected recombinations. See  <locus.md>.*
Mathieu Giraud's avatar
Mathieu Giraud committed
732
| duplicate_count | number | Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence. <br />*Number of reads gathered in the clone.*
733 734
| sequence_id | string  | Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. <br />*This identifier is the (50 bp by default) window extacted around the junction.* |
| clone_id 	| string | 	Clonal cluster assignment for the query sequence. <br />*This identifier is again the (50 bp by default) window extacted around the junction.*
Mikaël Salson's avatar
Mikaël Salson committed
735
| warnings (+) | string | *Warnings associated to this clone. See <http://gitlab.vidjil.org/blob/dev/doc/warnings.md>.*
736 737
| sequence  | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. <br />*This contains the consensus/representative sequence of each clone.*
| rev_comp  | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. <br />*Set to null, as vidjil-algo gather reads from both strands in clones* |
738
| v_call, d_call, j_call  | string  | V/D/J gene with allele. For example, IGHV4-59\*01. <br /> *implemented. In the case of uncomplete/unexpected recombinations (locus with a `+`), we still use `v/d/j_call`. Note that this value can be null on clones beyond the `--max-designations` option.* |
739 740
| v_sequence_start, v_sequence_end <br />d_sequence_start, d_sequence_end  <br /> j_sequence_start, j_sequence_end | number |   Start/end position of the V/D/J genes and of the CDR3 in the query sequence (1-based closed interval).   <br />*implemented* |
| v_support, j_support | number | V/J gene alignment E-value, p-value, likelihood.  <br />*implemented* |
741 742 743
| junction  | string  |      Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. <br />*null*
| junction_aa  | string  | Junction region amino acid sequence.      <br />*implemented*
| cdr3_aa | string | Amino acid translation of the cdr3 field.   <br />*implemented*
744
| cdr3_sequence_start, cdr3_sequence_end  | number |   Start/end position of the CDR3 in the query sequence (1-based closed interval).   <br />*implemented* |
745
| productive | boolean | True if the V(D)J sequence is predicted to be productive.  <br /> *true, false, or null when no CDR3 has been detected* |
746 747
| vj_in_frame | boolean | True if the V and J gene alignments are in-frame. <br /> *true, false, or null when no CDR3 has been detected* |
| stop_codon | boolean | True if the aligned sequence contains a stop codon. <br /> *true, false, or null when vj_in_frame is false* |
748 749
| sequence_alignment  | string  | Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. <br /> *null*                                         |
| germline_alignment | string  | Assembled, aligned, fully length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). <br />*null*
750
| v_cigar, d_cigar, j_cigar | string  | CIGAR strings for the V/D/J gene <br />*null*.
751

752 753 754
Currently, we do not output alignment strings.
Our implementation of .tsv may evolve in future versions.
Contact us if a particular feature does interest you.
755 756