Commit ce6cb977 authored by Vidjil Team, committed by Mathieu Giraud

doc/: AIRR import through fuse.py

See #3673
by @flothoni and @magiraud
parent 741e643d
......@@ -256,7 +256,7 @@ Here are some notable configuration changes you should consider:
# Docker -- Adding external software
Some software can be added to Vidjil for pre-processing or even processing if the
software outputs data compatible with the `.vidjil` format.
software outputs data compatible with the `.vidjil` or AIRR format.
We recommend you add software by adding a volume to your `docker-compose.yml`.
By default we add our external files to `/opt/vidjil` on the host machine. You can then
reference the executable in `vidjil-server/conf/defs.py`.
......
# `fuse.py` : converting and merging immune repertoire data
## Merging files to follow clones in several samples
Many immune repertoire sequencing studies aim to track clones in several samples.
One can compare repertoires from several samples, coming from the same person or from different ones,
and detect and quantify common clones.
For example in a minimal residual disease (MRD) setup, we are interested in
tracking the main clones identified at diagnosis in the follow-up samples.
Let us assume that four `.vidjil` files have been produced, one per sample
(namely `diag.vidjil`, `fu1.vidjil`, `fu2.vidjil`, `fu3.vidjil`). They are merged
in the following way:
``` bash
python tools/fuse.py --output mrd.vidjil --top 100 diag.vidjil fu1.vidjil fu2.vidjil fu3.vidjil
```
The `--top` parameter sets how many top clones per sample are kept.
The default value is 50. Here `--top 100` means that for each sample, the top 100 clones are kept
*and followed in the other samples*, even if they are not in the top 100 of the other samples.
This makes it possible to follow and quantify targeted clones even when they have only a few reads in some samples.
The `mrd.vidjil` file can then be fed to the web client.
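
The selection rule described above can be illustrated with a minimal sketch. This is not the code of `fuse.py`: the function name `fuse_top_clones`, the dictionary layout and the clone names are hypothetical, chosen only to show how a clone kept for one sample is then followed in all the others.

``` python
def fuse_top_clones(samples, top=50):
    """Illustrative only: `samples` is a list of dicts (clone id -> read count),
    one dict per sample. This is an assumed data layout, not the .vidjil format."""
    # Step 1: collect the ids of the `top` largest clones of each sample.
    followed = set()
    for counts in samples:
        ranked = sorted(counts, key=counts.get, reverse=True)
        followed.update(ranked[:top])
    # Step 2: report every followed clone in every sample (0 reads when absent),
    # even where it does not belong to that sample's own top clones.
    return {cid: [counts.get(cid, 0) for counts in samples]
            for cid in sorted(followed)}


# Hypothetical read counts for a diagnosis sample and one follow-up sample.
diag = {"cloneA": 9000, "cloneB": 500, "cloneC": 20}
fu1 = {"cloneA": 3, "cloneD": 700, "cloneB": 1}

merged = fuse_top_clones([diag, fu1], top=1)
print(merged)
```

With `top=1`, `cloneA` is kept because it dominates the diagnosis sample, and its 3 reads in the follow-up sample are still reported although `cloneD` is the top clone there.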
## Using AIRR data
The AIRR community has published [a standard representation](http://docs.airr-community.org/en/latest/datarep/overview.html#format-specification) to describe results of immune receptor repertoire analysis.
Supported by a growing number of software tools, this `.tsv` format makes it easy to transfer immune repertoire data between pipelines.
The [AIRR output of vidjil-algo](./vidjil-algo/#airr-tsv-output) makes it possible to feed vidjil-algo output to other software.
Conversely, `fuse.py` can take one or several AIRR `.tsv` files and produce a `.vidjil` file that can be opened by the Vidjil web application:
``` bash
python tools/fuse.py --output out.vidjil sample1.tsv sample2.tsv
```
Within the same analysis, you can mix `.vidjil` and AIRR files.
However, the following points should be taken into account:
- The Vidjil web application uses the `duplicate_count` value of each clone in a `.tsv` file
as the size of that clone. This was discussed on the AIRR mailing list, but other software may use other fields.
Note that the AIRR output of `vidjil-algo` uses the same convention.
- Some RepSeq software (such as IgBlast) does not cluster reads into clones at all, but analyzes each read independently.
As `fuse.py` does not add clustering information, the output of such software will also be shown unclustered in the Vidjil web application.
- More generally, RepSeq software tools have various definitions of clones (see [What is a clone ?](vidjil-format/#what-is-a-clone)).
When processed with `fuse.py`, clones are identified across several samples when they share the same `clone_id` value.
When merging data from different samples, one must ensure that the software outputs a relevant `clone_id` to mark these very same clones,
otherwise they will appear as unrelated in the web application (but they can still be clustered there).
This will also often be the case when merging files coming from different software.
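
To make these points concrete, here is a minimal sketch of how clone sizes and shared clones can be read from AIRR tables. It is not the implementation of `fuse.py`: the helper `clone_sizes`, the choice to sum `duplicate_count` over rows sharing a `clone_id`, and the two tiny tables are assumptions made for illustration. Real AIRR files carry many more columns.

``` python
import csv
import io
from collections import defaultdict


def clone_sizes(tsv_text):
    """Sum `duplicate_count` per `clone_id` in one AIRR rearrangement table.

    Hypothetical helper: summing rows that share a `clone_id` is an assumption,
    not a documented behavior of fuse.py.
    """
    sizes = defaultdict(int)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sizes[row["clone_id"]] += int(row["duplicate_count"])
    return dict(sizes)


# Two made-up samples, reduced to three AIRR columns.
sample1 = "sequence_id\tclone_id\tduplicate_count\nr1\tc1\t120\nr2\tc2\t8\n"
sample2 = "sequence_id\tclone_id\tduplicate_count\nr7\tc1\t5\nr8\tc9\t40\n"

sizes = [clone_sizes(sample1), clone_sizes(sample2)]

# Clones are matched across samples only through identical `clone_id` values.
shared = set(sizes[0]) & set(sizes[1])
print(sorted(shared))
```

Here only `c1` is recognized in both samples. Had the second file named the same clone differently, the two entries would appear as unrelated clones.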
......@@ -30,16 +30,19 @@ recombinations and the sequences found in one or several samples.
The easiest way to get these files is to [request an account](http://app.vidjil.org/) on the public Vidjil test server.
You will then be able to upload,
manage, process your samples (`.fasta`, `.fastq`, `.gz` or `.clntab` files) directly on the web application
manage, process your samples (`.fasta`, `.fastq`, `.gz`, `.bam`, or `.clntab` files) directly on the web application
(see *The patient/experiment database and the server*), and the server behind the patient/experiment
database computes these `.vidjil` files.
Otherwise, such `.vidjil` files can be obtained:
database computes these `.vidjil` files with vidjil-algo.
Otherwise, such `.vidjil` files can be obtained either:
- from vidjil-algo (starting from
- running vidjil-algo from the command line (starting from
`.fasta`, `.fastq` or `.gz` files, see [vidjil-algo documentation](http://www.vidjil.org/doc/vidjil-algo/)).
To gather several `.vidjil` files, you have to use the [fuse.py](http://git.vidjil.org/blob/master/tools/fuse.py) script
- or by any other V(D)J analysis pipeline able to output files
respecting the `.vidjil` [file format](./format-analysis.org) (contact us if you are interested)
respecting the `.vidjil` [file format](http://www.vidjil.org/doc/vidjil-format/)
- or by using the [fuse.py](http://git.vidjil.org/blob/master/tools/fuse.py) script on the standard [AIRR representation](http://docs.airr-community.org/en/latest/datarep/overview.html#format-specification)
Contact us if you want help on converting such data.
# First aid
......
......@@ -679,6 +679,7 @@ For example `-uu -X 1000` splits the not analyzed reads from the 1000 first read
Since version 2018.10, vidjil-algo supports the [AIRR format](http://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields).
We export all required fields, some optional fields, as well as some custom fields (+).
We also propose in [fuse.py](/tools) a way to convert AIRR format to the `.vidjil` format.
Note that Vidjil-algo is designed to efficiently gather reads from large datasets into clones.
By default (`-c clones`), we thus report in the AIRR format *clones*.
......@@ -864,21 +865,15 @@ limited by `--max-clones`.
By default *all* the clones of the sample are kept (`--max-clones all`),
even if the V(D)J designation is computed only for some of them.
Merging `.vidjil` files into a single one is done
with the [tools/fuse.py](../tools/fuse.py) script, such as in:
The `tools/fuse.py` script, as documented [here](./tools.md),
merges several `.vidjil` files into a single one that can then be fed to the web client:
``` sh
python tools/fuse.py --output mrd.vidjil --top 100 diag.vidjil fu1.vidjil fu2.vidjil fu3.vidjil
python tools/fuse.py --output out.vidjil --top 100 sample1.vidjil sample2.vidjil sample3.vidjil
```
The Vidjil web application takes the resulting `.vidjil` file (here `mrd.vidjil`).
The `--top` parameter sets how many top clones per sample are kept.
The default value is 50. Here `--top 100` means that for each sample, the top 100 clones are kept
*and followed in the other samples*, even if they are not in the top 100 of the other samples.
This makes it possible to follow and quantify targeted clones even when they have only a few reads in some samples.
As the `--top` value is below the default `--max-designations 100`, it means that every clone in the
As the `--top` value is equal to or below the default `--max-designations 100`, it means that every clone in the
"merged" file will be fully analyzed with a V(D)J designation.
It is thus advised to keep, in `vidjil-algo`, the default `--max-clones all --max-designations 100` options
for the majority of uses.