Commit 40434fca authored by Mathieu Giraud

Merge branch 'doc/workflow-filtering' into 'dev'

Doc: post-sequencer workflow, filtering reads

See merge request !915
parents b89c8460 0ac36d49
Pipeline #215224 passed with stages
in 96 minutes and 44 seconds
# Post-sequencer workflow before upload to a Vidjil server
This help is intended for bioinformaticians preparing workflows after their sequencer output.
See also considerations on [libraries and recombinations](
## File formats
It is recommended to upload `.fastq.gz` files to the Vidjil server.
Indeed, vidjil-algo takes the base quality information into account when producing the representative sequence.
When base quality is not available, it is also possible to upload `.fa.gz` files.
Note that vidjil-algo (and the Vidjil server) also accept uncompressed `.fastq` or `.fa` files
and even `.bam` files (but the added information of `.bam` files is not taken into account,
so uploading such files is not optimal).
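If the sequencer output is uncompressed, it can be compressed locally before upload. The following is a minimal sketch using the standard `gzip` tool; the helper name `compress_fastq` is hypothetical and not part of Vidjil.

```shell
# Compress a FASTQ file to <file>.gz, keeping the original,
# then check the integrity of the compressed file.
# compress_fastq is a hypothetical helper name used for illustration.
compress_fastq() {
    gzip -c "$1" > "$1.gz" && gzip -t "$1.gz"
}
```

For example, `compress_fastq sample.fastq` would write `sample.fastq.gz` next to the original file (`sample.fastq` is a placeholder name).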
## Local pre-filtering of large datasets
On large capture or RNA-seq datasets, very few reads are expected to contain V(D)J recombinations, typically as few as 0.01%, 0.001%, or even 0.0001%. Vidjil-algo was designed to efficiently find these few needles in a stack of needles.
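To give an order of magnitude, the percentages above can be converted into read counts. This sketch assumes a dataset of 100 million reads, a size chosen only for illustration; `expected_vdj_reads` is a hypothetical helper name.

```shell
# Expected number of reads with a V(D)J recombination,
# given a total read count and a percentage.
# expected_vdj_reads is a hypothetical helper; 100000000 is an assumed dataset size.
expected_vdj_reads() {
    awk -v n="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", n * pct / 100 }'
}

for pct in 0.01 0.001 0.0001; do
    echo "$pct% of 100000000 reads: $(expected_vdj_reads 100000000 "$pct") reads"
done
```

Under that assumption, only about 100 to 10,000 reads out of 100 million would carry a recombination.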
Large files may be hard to upload and to store.
To save bandwidth and disk space, it is thus advised to locally pre-process reads
to merge them (when applicable) and to filter them, with a first iteration of Vidjil-algo,
before uploading to a Vidjil server.
This filtering will produce much smaller files that could also be used by other software.
We offer two versions:
- The latest stable version, `vidjil-algo-latest`, which is in production for clinical applications.
- The alpha version, `vidjil-algo-alpha`, which provides at least a 5× speed-up on multi-locus filtering.
  Sensitivity should be equivalent to, or even better than, that of the stable version.
  Work is underway to release this version for production.
### Installation
**Install `vidjil-algo`**
- Requirements ([more documentation]( on a recent Ubuntu system, `sudo apt-get install zlib1g-dev`
- Download and extract <> or <>
- Inside the `vidjil-algo` directory, build it with `make` (this both compiles vidjil-algo and fetches the germline gene repertoires created from IMGT and NCBI)
**Install `flash2`**
- Download and extract <>
- Inside the `flash2` directory, build it with `make`
You may copy the `vidjil-algo` and `flash2` binaries to folders available from your `$PATH`.
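Whether both binaries are reachable from `$PATH` can be checked before running the workflow. A minimal sketch, where `check_tools` is a hypothetical helper name:

```shell
# Report whether each required tool can be found on $PATH.
# Returns a non-zero status if at least one tool is missing.
# check_tools is a hypothetical helper name used for illustration.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_tools vidjil-algo flash2 || echo "Install the missing tools or extend PATH." >&2
```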
### Usage
flash2 outputs several files: merged reads, unmerged reads from the R1 file, unmerged reads from the R2 file, and a histogram.
You can concatenate the merged reads and one of the unmerged files
to keep the same number of reads as in the initial fastq file
(as does the [pre-processing]( on the Vidjil server).
The following command line thus keeps `out.notCombined_1`, from R1,
supposing that R1 reads are "more centered" on the V(D)J junction than R2 reads.
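To check that the concatenated file holds as many reads as the initial file, the reads in a gzip-compressed FASTQ file can be counted. This sketch assumes standard 4-line FASTQ records; `count_fastq_gz_reads` is a hypothetical helper name. It also relies on the fact that gzip files concatenated with `cat` form a valid gzip stream.

```shell
# Count the reads in a gzip-compressed FASTQ file.
# Assumes standard FASTQ records of exactly 4 lines each.
# count_fastq_gz_reads is a hypothetical helper name used for illustration.
count_fastq_gz_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

For example, the output of `count_fastq_gz_reads merged-reads.fastq.gz` should match the number of reads in the initial R1 file.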
Starting from `R1.fastq` and `R2.fastq`:
- Merge: `flash2 R1.fastq R2.fastq -m 300 -t 4 -z` (`-t 4`: run on 4 threads)
- Concatenate the files you want to keep, for example `cat out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz > merged-reads.fastq.gz`
- Filter: `vidjil-algo --filter-reads --gz -g germline/homo-sapiens.g merged-reads.fastq.gz`
The resulting `merged-reads.filtered.fa.gz` file can be uploaded to any Vidjil server,
or re-analyzed with vidjil-algo or with other software.
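The merge, concatenate, and filter steps above can be chained in a single shell function. This is a sketch under assumptions, not an official script: `prefilter_reads` is a hypothetical name, the options are only those given in the commands above, flash2 is assumed to write its default `out.*` files as gzip-compressed `.fastq.gz` because of `-z`, and `germline/homo-sapiens.g` is assumed to be reachable from the current directory.

```shell
# Sketch of the merge / concatenate / filter steps as one function.
# prefilter_reads is a hypothetical name; options are those from the commands above.
prefilter_reads() {
    r1=$1
    r2=$2
    # Merge paired reads; -z compresses the flash2 output files with gzip.
    flash2 "$r1" "$r2" -m 300 -t 4 -z || return 1
    # Keep the merged reads plus the unmerged reads from R1.
    cat out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz \
        > merged-reads.fastq.gz || return 1
    # Keep only the reads in which vidjil-algo detects a V(D)J recombination.
    vidjil-algo --filter-reads --gz -g germline/homo-sapiens.g merged-reads.fastq.gz
}
```

It would be called as `prefilter_reads R1.fastq R2.fastq`, and it stops at the first failing step.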
@@ -24,6 +24,7 @@ nav:
- Specification of the .vidjil format:
- Specification of the warnings:
- Specification of the .should-vdj.fa tests:
- Post-sequencer workflow:
- Further developer documentation:
- Server administrator:
- Server administration (web):