Commit a8749f83 authored by Mikaël Salson

Merge branch 'feature-a/3795-4387-out-vdjfa' into 'dev'

Feature a/3795 4387, --out-vdjfa

Closes #4387 and #3795

See merge request !775
parents 24acd18e 56fb02a1
Pipeline #160290 passed with stages in 7 minutes and 42 seconds
!LAUNCH: $VIDJIL_DIR/$EXEC -r 1 -x 10 -y 5 -z 1 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vdj.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -r 1 -x 10 -y 5 -z 1 --out-vdjfa -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vdj.fa
# Testing -x/-y/-z options
......
!LAUNCH: $VIDJIL_DIR/$EXEC -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -b out-a $VIDJIL_DATA/clones_simul.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -b out-a --out-vdjfa $VIDJIL_DATA/clones_simul.fa
$ Output
1: out-a.vidjil
1: out-a.tsv
1: out-a.vdj.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --gz -b out-b $VIDJIL_DATA/clones_simul.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --gz -b out-b --out-vdjfa $VIDJIL_DATA/clones_simul.fa
$ Compressed output
1: out-b.vidjil.gz
......
......@@ -37,4 +37,4 @@ $ Display advanced options
: custom Cost
$ Correct number of options, including advanced options
59:^..-
60:^..-
......@@ -17,6 +17,9 @@ $ Correct output message
$ There is no clone output in individual files
0:detail, by clone
$ There is no deprecated .vdj.fa file
0:.vdj.fa
!LAUNCH: rm out/Stanford_S22.tsv ; $LAUNCHER $VIDJIL_DIR/$EXEC $EXTRA -z 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta > /dev/null ; touch out/Stanford_S22.tsv ; cat out/Stanford_S22.tsv
$ The AIRR .tsv file has four lines
......
!LAUNCH: $VIDJIL_DIR/$EXEC -r 1 -x 10 -y all -z 1 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vdj.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -r 1 -x 10 -y all -z 1 --out-vdjfa -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vdj.fa
# Testing -x/-y/-z options
......
......@@ -587,6 +587,11 @@ int main (int argc, char **argv)
bool out_gz = false;
app.add_flag("--gz", out_gz, "output compressed .tsv.gz, .vdj.fa.gz, and .vidjil.gz files") -> group(group) -> level();
bool output_vdjfa = false;
app.add_flag("--out-vdjfa", output_vdjfa,
"output clones in a " CLONES_FILENAME " file (only for clone sequence data)")
-> group(group) -> level();
bool output_clone_files = false;
app.add_flag("--out-clone-files", output_clone_files,
"output clones in individual files (in " CLONE_DIR "/" CLONE_FILENAME "* files)")
......@@ -1314,8 +1319,13 @@ int main (int argc, char **argv)
cout << " ==> suggested edges in " << out_dir+ f_basename + EDGES_FILENAME
<< endl ;
cout << " ==> " << f_clones << " \t(for post-processing with other software)" << endl ;
ostream* out_clones = new_ofgzstream(f_clones.c_str(), out_gz) ;
ostream* out_clones = NULL;
if (output_vdjfa)
{
cout << " ==> " << f_clones << " \t(for sequence post-processing with other software)" << endl;
cout << "!! To get structured data, do not parse the Fasta headers, but rather work on the .vidjil file." << endl;
out_clones = new_ofgzstream(f_clones.c_str(), out_gz);
}
if (output_clone_files)
{
......@@ -1382,7 +1392,8 @@ int main (int argc, char **argv)
// If max_representatives is reached, we stop here but still output the window
if ((max_representatives >= 0) && (num_clone >= max_representatives + 1))
{
*out_clones << window_str << endl ;
if (output_vdjfa)
*out_clones << window_str << endl ;
continue;
}
}
......@@ -1485,7 +1496,9 @@ int main (int argc, char **argv)
{
if (clone_on_stdout)
cout << representative << endl ;
*out_clones << representative << endl ;
if (output_vdjfa)
*out_clones << representative << endl ;
if (output_clone_files)
{
......@@ -1522,7 +1535,8 @@ int main (int argc, char **argv)
if (output_clone_files)
*out_clone << seg << endl ;
*out_clones << seg << endl ;
if (output_vdjfa)
*out_clones << seg << endl ;
seg.toOutput(clone);
......@@ -1587,7 +1601,9 @@ int main (int argc, char **argv)
signal(SIGINT, SIG_DFL);
out_edges.close() ;
delete out_clones;
if (output_vdjfa)
delete out_clones;
if (num_clone > last_num_clone_on_stdout)
{
......
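The hunks above guard every use of the raw `out_clones` pointer with `if (output_vdjfa)`. As a minimal sketch of one alternative, and not the project's code, the optional stream could be held in a `std::unique_ptr` so that a null check replaces the flag and the final `delete` disappears. The names `open_optional`, `write_if_enabled` and `example_optional_stream` are hypothetical, and an in-memory `std::ostringstream` stands in for the stream returned by `new_ofgzstream`.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical factory: returns an owning stream when the output is requested,
// and a null pointer otherwise. An ostringstream stands in for the real file stream.
std::unique_ptr<std::ostream> open_optional(bool enabled) {
    if (!enabled) return nullptr;
    return std::unique_ptr<std::ostream>(new std::ostringstream());
}

// Write one line only if the optional stream exists; report whether it was written.
bool write_if_enabled(const std::unique_ptr<std::ostream>& out, const std::string& line) {
    if (!out) return false;
    *out << line << '\n';
    return true;
}

// Usage on both cases: output disabled, then output enabled.
void example_optional_stream() {
    std::unique_ptr<std::ostream> off = open_optional(false);
    assert(!write_if_enabled(off, ">clone-001"));

    std::unique_ptr<std::ostream> on = open_optional(true);
    assert(write_if_enabled(on, ">clone-001"));
    assert(dynamic_cast<std::ostringstream&>(*on).str() == ">clone-001\n");
    // Both streams are released automatically when they go out of scope.
}
```

The trade-off is that ownership is expressed in the type, at the cost of touching every call site that currently dereferences `out_clones`.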
......@@ -456,7 +456,7 @@ two windows that must be clustered.
## Main output files
The main output of Vidjil-algo (with the default `-c clones` command) are the three following files:
The default outputs of Vidjil-algo (with the default `-c clones` command) are the following two files:
- The `.vidjil` file is the *main output file*, containing the most information.
The file is in a `.json` format,
......@@ -473,33 +473,40 @@ The main output of Vidjil-algo (with the default `-c clones` command) are the th
- The `.tsv` file is the AIRR output, for compatibility with other software
using the same format. See [below](#airr-tsv-output) for details.
- The `.vdj.fa` file is *a FASTA file for further processing by other bioinformatics tools*.
Even if it is advised to rather use the full information in the `.vidjil` file,
the `.vdj.fa` is a convenient way to have sequences of clones for further processing.
These sequences are at least the windows (and their count in the headers) or
the consensus sequences (`--max-consensus`) when they have been computed.
The [headers](#the-vdjfa-format) are described below.
Some other informations such as the further clustering are not output in this file.
The `.vdj.fa` output enables to use Vidjil-algo as a *filtering tool*,
shrinking a large read set into a manageable number of (pre-)clones
that will be deeply analyzed and possibly further clustered by
other software.
By default, the three output files are named
`out/basename.vidjil`, `out/basename.tsv`, and `out/basename.vdj.fa`, where:
By default, these output files are named
`out/basename.vidjil` and `out/basename.tsv`, where:
- `out` is the directory where all the outputs are stored (can be changed with the `--dir` option).
- `basename` is the basename of the input `.fasta/.fastq` file (can be overridden with the `--base` option).
With the `--gz` option, the three files are output
as compressed `.vidjil.gz`, `.tsv.gz`, and `.vdj.fa.gz` files.
With the `--gz` option, both files are output
as compressed `.vidjil.gz` and `.tsv.gz` files.
Vidjil-algo also outputs the first 50 clones on the standard output.
More data can be printed on the standard output with the `-v` option.
## Auxiliary output files
### `.vdj.fa`
With the `--out-vdjfa` option, a `.vdj.fa` file is created (or, with `--gz`, a `.vdj.fa.gz` file).
This is *a FASTA file for further processing by other bioinformatics tools*.
Although it is advisable to use the full information in the `.vidjil` file instead,
the `.vdj.fa` file is a convenient way to obtain the clone sequences for further processing.
These sequences are at least the windows (and their count in the headers) or
the consensus sequences (`--max-consensus`) when they have been computed.
The [headers](#headers-in-vdj-fa-files-deprecated) are described below, but the format of the headers is deprecated
and will not be enforced in future releases.
Other information, such as further clustering, is not output in this file.
The `.vdj.fa` output makes it possible to use Vidjil-algo as a *filtering tool*,
shrinking a large read set into a manageable number of (pre-)clones
that will be deeply analyzed and possibly further clustered by
other software.
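As an illustration of such sequence post-processing, here is a minimal sketch of a reader that collects the sequences of an uncompressed FASTA stream while treating each header as an opaque label, in line with the advice not to parse the headers. The names `FastaRecord`, `read_fasta` and `example_fasta` are hypothetical, and the sample input is made up for the example rather than taken from real Vidjil-algo output.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One FASTA record. The header is stored untouched and never interpreted.
struct FastaRecord {
    std::string header;    // text following '>'
    std::string sequence;  // sequence lines, concatenated
};

// Read every record from an uncompressed FASTA stream.
std::vector<FastaRecord> read_fasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), std::string()});
        } else if (!records.empty()) {
            records.back().sequence += line;  // multi-line sequences are joined
        }
        // Sequence data appearing before any header is ignored.
    }
    return records;
}

// Usage on a made-up two-record input.
void example_fasta() {
    std::istringstream in(">clone-001 opaque header text\nACGT\nTTGA\n>clone-002\nGGCC\n");
    std::vector<FastaRecord> records = read_fasta(in);
    assert(records.size() == 2);
    assert(records[0].sequence == "ACGTTTGA");
    assert(records[1].header == "clone-002");
}
```

A `.vdj.fa.gz` file produced with `--gz` would have to be decompressed before being passed to such a reader.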
### `.windows.fa`
The `out/basename.windows.fa` file contains the list of windows, with number of occurrences:
``` diff
......@@ -514,6 +521,8 @@ ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC
Windows of size 50 (modifiable by `-w`) have been extracted.
The first window has 8 occurrences, the second window has 5 occurrences.
### `seq/clone.fa-*`
With the `--out-clone-files` option, one `out/seq/clone.fa-*` file is created for each clone.
It contains the detailed analysis by clone, with
the window, the consensus sequence, as well as with the most similar V, (D) and J germline genes:
......@@ -651,13 +660,14 @@ Our implementation of .tsv may evolve in future versions.
Contact us if a particular feature is of interest to you.
## The .vdj.fa format
## Headers in the .vdj.fa files (deprecated)
The `.vdj.fa` format is compatible with the FASTA format,
and details V(D)J recombinations in the FASTA headers.
The format is described below, but may evolve in future releases.
For post-processing tools needing some of that information, it is not recommended to parse these headers,
but rather to use the `.vidjil` file that contains more information in a structured way.
The `.vdj.fa` format is compatible with the FASTA format.
The FASTA header of each sequence gives some details on the V(D)J recombinations.
The format of these headers is described below, but it is considered deprecated and may be removed in future releases (Q3 2021).
For post-processing tools needing some of that information, it is thus not recommended to parse these headers,
but rather to use either the `.vidjil` file that contains more information in a structured way, or the AIRR `.tsv` output.
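For tools that switch from the headers to the `.tsv` output, a minimal sketch of reading one column by name from a tab-separated stream could look like the following. The helper names `split_tabs`, `read_column` and `example_tsv` are hypothetical. The column names `sequence_id`, `v_call` and `junction` are standard AIRR Rearrangement fields, but which columns a given Vidjil-algo release actually writes is an assumption to check against the real file, and the sample rows are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line on tab characters, keeping empty fields.
std::vector<std::string> split_tabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
    return fields;
}

// Return the values of the column called `name`, or an empty vector
// when the stream is empty or the header has no such column.
std::vector<std::string> read_column(std::istream& in, const std::string& name) {
    std::vector<std::string> values;
    std::string line;
    if (!std::getline(in, line)) return values;
    std::vector<std::string> header = split_tabs(line);
    std::size_t col = 0;
    while (col < header.size() && header[col] != name) ++col;
    if (col == header.size()) return values;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::vector<std::string> fields = split_tabs(line);
        // Short rows yield an empty value rather than an out-of-range access.
        values.push_back(col < fields.size() ? fields[col] : std::string());
    }
    return values;
}

// Usage on a made-up table with two rows.
void example_tsv() {
    std::istringstream in(
        "sequence_id\tv_call\tjunction\n"
        "clone-001\tIGHV3-23*01\tTGTGCG\n"
        "clone-002\tIGHV1-2*02\t\n");
    std::vector<std::string> v = read_column(in, "v_call");
    assert(v.size() == 2);
    assert(v[0] == "IGHV3-23*01");
    assert(v[1] == "IGHV1-2*02");
}
```

Looking columns up by name rather than by position keeps such a reader working if columns are added or reordered in later versions.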
In a `.vdj.fa` format, a line starting with a \> is of the following form:
......@@ -735,7 +745,9 @@ clustering such reads into clones, and further analyzing the clones.
./vidjil-algo -g germline/homo-sapiens.g:IGH -3 demo/Stanford_S22.fasta
# Cluster the reads and report the clones, based on windows overlapping IGH CDR3s.
# Assign the V(D)J genes and try to detect the CDR3 of each clone.
# Summary of clones is available both on stdout, in out/Stanford_S22.vdj.fa and in out/Stanford_S22.vidjil.
# Main output files are both out/Stanford_S22.vidjil and out/Stanford_S22.tsv.
# Summary of clones is available on stdout.
```
``` bash
......@@ -743,7 +755,8 @@ clustering such reads into clones, and further analyzing the clones.
# Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
# Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
# Assign the VDJ genes (including multiple D) and try to detect the CDR3 of each clone.
# Summary of clones is available both on stdout, in out/reads.vdj.fa and in out/reads.vidjil.
# Main output files are both out/reads.vidjil and out/reads.tsv.
# Summary of clones is available on stdout.
```
## Sorting reads from whole RNA-Seq or capture datasets
......