Commit 15dfb96b authored by Vidjil Team, committed by Mathieu Giraud

doc/workflow.md: local filtering, draft

vdj#1200
prepared by @flothoni and @magiraud
parent b89c8460
## Local pre-filtering of large datasets
On large capture or RNA-seq datasets, very few reads are expected to have V(D)J recombinations, typically as few as 0.01%, 0.001%, or even 0.0001%. Vidjil-algo was designed to efficiently find these few needles in a stack of needles.
Large files may be hard to upload and to store.
To save bandwidth and disk space, it is thus advised to locally pre-process reads
to merge them (when applicable) and to filter them, with a first iteration of Vidjil-algo,
before uploading to a Vidjil server.
This filtering will produce much smaller files that could also be used by other software.
### Installation
**Install `vidjil-algo`**
- Requirements ([more documentation](vidjil-algo.md#installation)): on a recent Ubuntu system, `sudo apt-get install zlib1g-dev`
- Download and extract <http://www.vidjil.org/releases/vidjil-algo-latest.tar.gz>
- Inside the `vidjil-algo` directory, build it with `make` (this both compiles vidjil-algo and fetches germline gene repertoires created from IMGT and NCBI)
**Install `flash2`**
- Download and extract <https://github.com/dstreett/FLASH2/archive/master.zip>
- Inside `flash2` directory, build it with `make`
You may copy the `vidjil-algo` and `flash2` binaries to a directory listed in your `$PATH`.
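Before going further, it can help to confirm that both binaries are actually reachable from `$PATH`. The following is a minimal sketch; `check_tools` is a hypothetical helper name, not something provided by vidjil-algo or flash2, and it only relies on the POSIX `command -v` builtin.

```shell
# Hypothetical helper: report which of the given tools are reachable from $PATH.
# Returns 0 if all of them are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools vidjil-algo flash2` prints the resolved location of each binary, or a `missing:` line on standard error for any that is not installed.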
### Usage
flash2 outputs several files: merged reads, unmerged reads from R1, unmerged reads from R2, and a histogram.
You can concatenate the merged reads and one of the unmerged files
to keep the same number of reads as in the initial FASTQ file
(as the [pre-processing](user.md#pre-processing) on the Vidjil server does).
The following command line thus keeps `out.notCombined_1`, from R1,
supposing that R1 reads are "more centered" on the V(D)J junction than R2 reads.
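To check that the concatenated file indeed holds as many reads as the initial R1 file, one can count FASTQ records on both sides. This is a sketch under two assumptions: each read spans exactly four lines (no wrapped sequences), and `gzip -cdf` passes uncompressed input through unchanged, as GNU gzip does. `count_fastq_reads` is a hypothetical helper name.

```shell
# Hypothetical helper: count reads in a FASTQ file, gzip-compressed or not.
# Assumes four lines per read; 'gzip -cdf' decompresses .gz input and
# copies plain input unchanged (GNU gzip behaviour).
count_fastq_reads() {
    gzip -cdf "$1" | awk 'END { print int(NR / 4) }'
}
```

Comparing `count_fastq_reads R1.fastq` with the count for the concatenated file should then give the same number.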
Starting from `R1.fastq` and `R2.fastq`:
- Merge: `flash2 R1.fastq R2.fastq -m 300 -t 4 -z` (`-t 4` : run on 4 threads)
- Concatenate the files you want to keep, for example `cat out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz > merged-reads.fastq.gz` (with `-z`, flash2 writes gzip-compressed files)
- Filter: `vidjil-algo --filter-reads -g germline/homo-sapiens.g merged-reads.fastq.gz`
The resulting `merged-reads.detected.fa` file can be uploaded to any Vidjil server,
or re-analyzed with vidjil-algo or with other software.
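The three steps above can be wrapped in a small shell function. This is only a sketch: `prefilter_reads` is a hypothetical name, the flags are the ones quoted in this section, and the output names assume that flash2 uses its default `out` prefix and adds a `.gz` suffix when `-z` is given. Concatenating two gzip files yields a valid gzip file, so no decompression is needed before `cat`.

```shell
# Hypothetical wrapper around the merge / concatenate / filter steps.
# Usage: prefilter_reads R1.fastq R2.fastq [germline-file]
prefilter_reads() {
    r1=$1
    r2=$2
    # Germline path as used in this section; adjust to your installation.
    germline=${3:-germline/homo-sapiens.g}

    # 1. Merge paired reads (-t 4: four threads, -z: gzip-compressed output).
    flash2 "$r1" "$r2" -m 300 -t 4 -z || return 1

    # 2. Keep merged reads plus unmerged R1 reads (assumed output names).
    cat out.extendedFrags.fastq.gz out.notCombined_1.fastq.gz \
        > merged-reads.fastq.gz || return 1

    # 3. Keep only reads with a detected V(D)J recombination.
    vidjil-algo --filter-reads -g "$germline" merged-reads.fastq.gz
}
```

Run from the `vidjil-algo` directory, `prefilter_reads R1.fastq R2.fastq` would leave `merged-reads.fastq.gz` and the flash2 outputs in the current directory, so it is best used in a directory with enough free space.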