Commit 53009ea3 authored by Mikaël Salson

Merge branch 'feature-a/3295-option-names-and-help' into 'dev'

Feature a/3295 option names and help

Closes #3785 and #3295

See merge request !434
parents ad2af586 ae5890aa
Pipeline #67627 passed with stages in 5 minutes and 34 seconds
!LAUNCH: $VIDJIL_DIR/$EXEC -KA -k 16 -z 0 -g ../../../germline/homo-sapiens.g:IGH bug20150604.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -K --all -k 16 -z 0 -g ../../../germline/homo-sapiens.g:IGH bug20150604.fa
$ Bug on some architectures, not segmented
1: junction detected in 1 reads
......
!LAUNCH: $VIDJIL_DIR/$EXEC -s '#####-#####' -c clones -r 1 -g ../../../germline/homo-sapiens.g:IGH -t 0 -e 1e-2 bug20160121.fa
!LAUNCH: $VIDJIL_DIR/$EXEC -s '#####-#####' -c clones -r 1 -g ../../../germline/homo-sapiens.g:IGH -e 1e-2 bug20160121.fa
$ Sequences should not be segmented since they only contain J.
1:UNSEG only J/3' -> 2
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -c clones -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/test_representatives.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -c clones -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/test_representatives.fa
$ Three clones should be found
1:3 clones
......
!LAUNCH: $VIDJIL_DIR/$EXEC -a -g $VIDJIL_DIR/germline/homo-sapiens-cd.g -A $VIDJIL_DATA/cd-19-trimmed.fa
!LAUNCH: $VIDJIL_DIR/$EXEC --all --out-reads -g $VIDJIL_DIR/germline/homo-sapiens-cd.g $VIDJIL_DATA/cd-19-trimmed.fa
$ Load CD-sorting.fa
1:homo-sapiens/CD-sorting.fa .* 28 sequences
......
!LAUNCH: $VIDJIL_DIR/$EXEC -K -g $VIDJIL_DIR/germline/homo-sapiens-cd.g -A $VIDJIL_DATA/cd-4-19.fa ; grep 'seed' out/cd-4-19.affects
!LAUNCH: $VIDJIL_DIR/$EXEC -K -g $VIDJIL_DIR/germline/homo-sapiens-cd.g --all $VIDJIL_DATA/cd-4-19.fa ; grep 'seed' out/cd-4-19.affects
$ Load CD-sorting.fa
1:homo-sapiens/CD-sorting.fa .* 28 sequences
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -g $VIDJIL_DIR/germline -A -2 $VIDJIL_DATA/2549.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -g $VIDJIL_DIR/germline --all -2 $VIDJIL_DATA/2549.fa
$ The KmerSegmenter segments the chimera on xxx germline (-2)
1:unexpected .* -> .* 1
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -t 0 -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake.fa
$ The KmerSegmenter segments the three chimera reads on PSEUDO_MAX12 germline (-2)
1:unexpected .* -> .* 3
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake-D.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake-D.fa
$ The KmerSegmenter segments the chimera reads on PSEUDO_MAX12 germline (-2)
f1:unexpected .* -> .* 2
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DATA/chimera-fake-VJ-trim.g $VIDJIL_DATA/chimera-fake-VJ.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DATA/chimera-fake-VJ-trim.g $VIDJIL_DATA/chimera-fake-VJ.fa
# Testing a custom (fake) .g with special parameters for the algorithm
$ The KmerSegmenter segments no read in Y because of the parameter
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DATA/chimera-fake-VJ.g $VIDJIL_DATA/chimera-fake-VJ.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DATA/chimera-fake-VJ.g $VIDJIL_DATA/chimera-fake-VJ.fa
# Testing a custom (fake) germlines.data
$ Report the species
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake-VJ.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline -2 $VIDJIL_DATA/chimera-fake-VJ.fa
$ The KmerSegmenter segments the five chimera reads on PSEUDO_MAX12 germline (-2)
1:unexpected .* -> .* 5
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -e 100 -A -t 0 -g $VIDJIL_DIR/germline -4 $VIDJIL_DATA/chimera-fake-half.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -e 100 --all -g $VIDJIL_DIR/germline -4 $VIDJIL_DATA/chimera-fake-half.fa
# TODO: a more precise modeling should give a e-value computation that could make this work even with -e 1
$ The KmerSegmenter segments the six chimera reads on PSEUDO_MAX1U germline (-4)
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -uU -g $VIDJIL_DIR/germline $VIDJIL_DATA/chimera-trg.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -uU -g $VIDJIL_DIR/germline $VIDJIL_DATA/chimera-trg.fa
$ Do not segment on IG/TR by chance
12:(IG|TR).* -> .* 0
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -g $VIDJIL_DIR/germline/homo-sapiens.g -c clones -A -3 $VIDJIL_DATA/segment_lec.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -g $VIDJIL_DIR/germline/homo-sapiens.g -c clones --all -3 $VIDJIL_DATA/segment_lec.fa
$ Extract up to 50bp windows (TRG)
1:windows up to 50bp
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -k 14 -w 50 -c clones -V $VIDJIL_DIR/germline/homo-sapiens/IGHV.fa -J $VIDJIL_DIR/germline/homo-sapiens/IGHJ.fa -y 3 -z 0 -r 1 -n 5 $VIDJIL_DATA/clones_simul.fa ; cat out/clones_simul.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -k 14 -w 50 -c clones -V $VIDJIL_DIR/germline/homo-sapiens/IGHV.fa -J $VIDJIL_DIR/germline/homo-sapiens/IGHJ.fa -y 3 -z 0 -r 1 --cluster-epsilon 5 $VIDJIL_DATA/clones_simul.fa ; cat out/clones_simul.vidjil
$ Window extractions
1:windows up to 50bp
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -KA -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH+ -r 4 -b co $VIDJIL_DATA/D7-27--J1.fa ; cat out/co.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -K --all -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH+ -r 4 -b co $VIDJIL_DATA/D7-27--J1.fa ; cat out/co.vidjil
# Test D7-27 0/92/0 J1 non-recombination
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -x 2000 -t 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -FaW GAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACT $VIDJIL_DATA/Stanford_S22.fasta ; cat out/seq/clone.fa-1
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -x 2000 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --grep-reads GAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACT $VIDJIL_DATA/Stanford_S22.fasta ; cat out/seq/clone.fa-1
# See also label-grep-reads.should-get
$ Keep only one windows, the one given by -W, with only 2 reads in the first 2000 reads (it is actually the second clone in Stanford_S22.fasta)
1: keep 1 windows in 2 reads
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -KA -z 0 -s 10s -V $VIDJIL_DIR/germline/homo-sapiens/IGHV.fa -J $VIDJIL_DIR/germline/homo-sapiens/IGHJ.fa -D $VIDJIL_DIR/germline/homo-sapiens/IGHD.fa $VIDJIL_DATA/common-V-D.fa ; cat out/common-V-D.affects
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -K --all -z 0 -s 10s -V $VIDJIL_DIR/germline/homo-sapiens/IGHV.fa -J $VIDJIL_DIR/germline/homo-sapiens/IGHJ.fa -D $VIDJIL_DIR/germline/homo-sapiens/IGHD.fa $VIDJIL_DATA/common-V-D.fa ; cat out/common-V-D.affects
$ Segments the sequence
1: SEG .* -> .* 1
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c segment reads 2>&1
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c segment -aAtl reads 2>&1
!EXIT_CODE: 1
$ Deprecated option
1:is deprecated
$ Deprecated options
5:is deprecated
$ Advice on usage
1:-c designations
1:--trim
1:--all
1:--label
1:--out-reads
......@@ -5,5 +5,5 @@ $ Unknown option
1:error.* --hello
$ Refer to online help and documentation
1:run with -h
1:run with --help
1:see doc/vidjil-algo.md
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -Z 10 -A -x 30 -v -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --analysis-filter 10 --all -x 30 -v -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta
$ Clone 13 is correctly analyzed
1:FLN1FA001EP9M2.* IGHV2-26.* 2/GAT.*GCC/8 IGHJ2
$ Statistics on -Z
1:Statistics on clone analysis
rb1: IGH 3[0-2][0-9]{2}/ 1[0-2][0-9]{3} 28..%
rb1: IGH 3[0-2][0-9]{2}/ 1[0-2][0-9]{3} 28..%
\ No newline at end of file
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -x 2000 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --out-reads --label-filter --label GAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACT $VIDJIL_DATA/Stanford_S22.fasta ; cat out/seq/clone.fa-1
# See also combo-grep-reads.should-get
$ Keep only one windows, the one given by -W, with only 2 reads in the first 2000 reads (it is actually the second clone in Stanford_S22.fasta)
1: keep 1 windows in 2 reads
$ There are the three IGHV/D/J genes in out/seq/clone.fa-1
3:>IGH
$ There are 2 reads in out/seq/clone.fa-1
2:>lcl
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c designations -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -A $VIDJIL_DATA/overlap-d-j.fa | grep -v out | tail -4 | tr -d '\n' | wc -c
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c designations -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --all $VIDJIL_DATA/overlap-d-j.fa | grep -v out | tail -4 | tr -d '\n' | wc -c
$ Exported sequence has all the bases
1:116
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c clones -A -g $VIDJIL_DIR/germline/homo-sapiens.g:TRG -A $VIDJIL_DATA/segment_lec.fq > /dev/null ; cat out/segment_lec.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -c clones --all -g $VIDJIL_DIR/germline/homo-sapiens.g:TRG $VIDJIL_DATA/segment_lec.fq > /dev/null ; cat out/segment_lec.vidjil
$ Window
1:"id": "GGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCA"
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline/homo-sapiens.g:TRG ../should-vdj-tests/ext-nucleotides-N.should-vdj.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline/homo-sapiens.g:TRG ../should-vdj-tests/ext-nucleotides-N.should-vdj.fa
$ Segments on TRG
1: TRG .* -> .* 1
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -e 10 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -W ACCGGTATTACT -W CAGCTGCTCCCC -W TGGGCCACTC -W ATCAACGCTGGCAATGGTAACACTAAATATTCACAGAAGTTCCAGGGCAGAGTCACCATTACCAGGGACACATACGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTCTGTATTACTGTGCGAGAGTGCGCAGCAGCTGGTCTGATGCTTTTGATTATCTGG $VIDJIL_DATA/clones_simul.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -e 10 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH --label ACCGGTATTACT --label CAGCTGCTCCCC --label TGGGCCACTC --label ATCAACGCTGGCAATGGTAACACTAAATATTCACAGAAGTTCCAGGGCAGAGTCACCATTACCAGGGACACATACGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTCTGTATTACTGTGCGAGAGTGCGCAGCAGCTGGTCTGATGCTTTTGATTATCTGG $VIDJIL_DATA/clones_simul.fa
$ ACCGGTATTACT is found (in window and representative and in the command line)
3:ACCGGTATTACT
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -w all -g $VIDJIL_DIR/germline/homo-sapiens.g $VIDJIL_DATA/s-somatic.fa ; cat out/s-somatic.vidjil | python $VIDJIL_DIR/tools/format_json.py -1
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -w all -g $VIDJIL_DIR/germline/homo-sapiens.g $VIDJIL_DATA/s-somatic.fa ; cat out/s-somatic.vidjil | python $VIDJIL_DIR/tools/format_json.py -1
$ No clustering due to -w all
1: considering all analyzed reads as windows
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -z 2 -r 5 -a -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/seq/clone.fa-2
# Testing detailed clone output (-a)
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -z 2 -r 5 --out-reads -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/Stanford_S22.fasta ; cat out/seq/clone.fa-2
# Testing detailed clone output (--out-reads)
$ Detailed clone output (out/seq/clone.fa-2), germline
# IGHV1-8*01 could also be detected
......
!LAUNCH: ($LAUNCHER $VIDJIL_DIR/$EXEC $EXTRA $VIDJIL_DEFAULT_OPTIONS -c germlines -g $VIDJIL_DIR/germline/homo-sapiens.g:TRA,TRB,TRD,TRG,IGH,IGK,IGL -t 100 -s '######-######' $VIDJIL_DATA/Stanford_S22.fasta)
!LAUNCH: ($LAUNCHER $VIDJIL_DIR/$EXEC $EXTRA $VIDJIL_DEFAULT_OPTIONS -c germlines -g $VIDJIL_DIR/germline/homo-sapiens.g:TRA,TRB,TRD,TRG,IGH,IGK,IGL --trim 100 -s '######-######' $VIDJIL_DATA/Stanford_S22.fasta)
$ number of reads and kmers
1:13153 reads, 3020179 kmers
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -x 100 -z 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -r 5 -W ACTGTGCGAGAGTTGGAATTAGTAGTGGCTGGCCTGATTCCTGGGGCCAG $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -x 100 -z 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -r 5 --label ACTGTGCGAGAGTTGGAATTAGTAGTGGCTGGCCTGATTCCTGGGGCCAG $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vidjil
$ Some clone has only one read, bypassing the -r 5 option, and the good label
1: clone-00..*0001-.* -W
1: clone-00..*0001-.* --label
$ The label appears in the json output
1: "label": "-W"
1: "label": "--label"
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -z 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -x 100 -r 5 -l $VIDJIL_DATA/Stanford_S22.label $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -z 0 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -x 100 -r 5 --label-file $VIDJIL_DATA/Stanford_S22.label $VIDJIL_DATA/Stanford_S22.fasta ; cat out/Stanford_S22.vidjil
$ Some clone has only one read, bypassing the -r 5 option, and the good label
1: clone-00..*0001-.* my-clone
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -y 0 -t 1 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -x 100 $VIDJIL_DATA/Stanford_S22.fasta
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -y 0 --trim 1 -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH -x 100 $VIDJIL_DATA/Stanford_S22.fasta
$ No read segmented as we have no germline because of the -t
$ No read segmented as we have no germline because of the --trim
1: UNSEG too few V/J -> 100
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline/homo-sapiens.g:TRB $VIDJIL_DATA/trb-only-VJ.fa ; cat out/trb-only-VJ.vidjil
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline/homo-sapiens.g:TRB $VIDJIL_DATA/trb-only-VJ.fa ; cat out/trb-only-VJ.vidjil
$ Segments the read on TRB (the information is given twice, stdout + .vidjil)
2: TRB .* -> .* 1
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline/homo-sapiens.g:TRD $VIDJIL_DATA/trd-dd2-dd3.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline/homo-sapiens.g:TRD $VIDJIL_DATA/trd-dd2-dd3.fa
$ Segment only 2 reads, because we do not look for incomplete recombinations
1:junction detected in 2 reads
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -w 10 -e 10 -A -g $VIDJIL_DIR/germline $VIDJIL_DATA/trd-dd2-dd3.fa
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -w 10 -e 10 --all -g $VIDJIL_DIR/germline $VIDJIL_DATA/trd-dd2-dd3.fa
$ Segment 6 reads, thanks to -i
1:junction detected in 6 reads
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -K -A -e 10 -k 8 -w 20 -V $VIDJIL_DIR/germline/homo-sapiens/TRDV.fa -V $VIDJIL_DIR/germline/homo-sapiens/TRDD2+up.fa -J $VIDJIL_DIR/germline/homo-sapiens/TRDD3+down.fa -J $VIDJIL_DIR/germline/homo-sapiens/TRDJ.fa $VIDJIL_DATA/trd-dd2-dd3.fa ; cat out/trd-dd2-dd3.affects
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -K --all -e 10 -k 8 -w 20 -V $VIDJIL_DIR/germline/homo-sapiens/TRDV.fa -V $VIDJIL_DIR/germline/homo-sapiens/TRDD2+up.fa -J $VIDJIL_DIR/germline/homo-sapiens/TRDD3+down.fa -J $VIDJIL_DIR/germline/homo-sapiens/TRDJ.fa $VIDJIL_DATA/trd-dd2-dd3.fa ; cat out/trd-dd2-dd3.affects
$ Segment all 8 reads, thanks to TRDD2 and TRDD3
1: junction detected in 8 reads .100..
......
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -f '1, 2, 3, 4, 5' $VIDJIL_DATA/Stanford_S22.fasta
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --analysis-cost '1, 2, 3, 4, 5' $VIDJIL_DATA/Stanford_S22.fasta
!EXIT_CODE: 1
$Check that correct custom cost is used
......
......@@ -5,7 +5,7 @@ $ License
1:vidjil-algo is free software
$ Check default costs
1:segmenter .* "4, -6, -10, -1, -10"
1:analysis.* "4, -6, -10, -1, -10"
1:clustering .* "1, -4, -4, 0, 0"
$ Show seeds
......@@ -17,4 +17,4 @@ $ Display advanced options
: custom Cost
$ Correct number of options
B51:^ -
B52:^ -
......@@ -7,12 +7,12 @@ $ License
$ Check default filtering options
1: =5 .* minimal number of reads supporting a clone
1: =0 .* minimal percentage of reads supporting a clone
1: =100 .* maximal number of clones computed with a consensus sequence
1: =100 .* maximal number of clones to be analyzed
1: =100 .*maximal number of clones computed with a consensus sequence
1: max-designations .*=100
$ Do not display advanced options
0: , experimental options
0: custom Cost
$ Correct number of regular options
B25:^ -
B24:^ -
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS -A -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/toy_V.fa ; cat out/toy_V.vidjil | python $VIDJIL_DIR/tools/format_json.py -1
!LAUNCH: $VIDJIL_DIR/$EXEC $VIDJIL_DEFAULT_OPTIONS --all -g $VIDJIL_DIR/germline/homo-sapiens.g:IGH $VIDJIL_DATA/toy_V.fa ; cat out/toy_V.vidjil | python $VIDJIL_DIR/tools/format_json.py -1
$ Warning, -A
$ Warning, --all
1:WARNING
$ Warning in json output
......
This diff is collapsed.
......@@ -248,21 +248,19 @@ clustering.
``` diff
Germline presets (at least one -g or -V/(-D)/-J option must be given)
-g GERMLINES ...
-g, --germline GERMLINES ...
-g <.g FILE>(:FILTER)
multiple locus/germlines, with tuned parameters.
Common values are '-g germline/homo-sapiens.g' or '-g germline/mus-musculus.g'
The list of locus/recombinations can be restricted, such as in '-g germline/homo-sapiens.g:IGH,IGK,IGL'
-g PATH
multiple locus/germlines, shortcut for '-g PATH/homo-sapiens.g',
processes human TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with some incomplete/unusual recombinations
processes human TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with incomplete/unusual recombinations
-V FILE ... custom V germline multi-fasta file(s)
-D FILE ... custom D germline multi-fasta file(s), segment into V(D)J components
-D FILE ... custom D germline multi-fasta file(s), analyze into V(D)J components
-J FILE ... custom J germline multi-fasta file(s)
Locus/recombinations
-d try to detect several D (experimental)
-2 try to detect unexpected recombinations (must be used with -g)
-2 try to detect unexpected recombinations
```
The `germline/*.g` presets configure the analyzed recombinations.
......@@ -297,12 +295,12 @@ Recombination detection ("window" prediction, first pass)
(use either -s or -k option, but not both)
(using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
(all these options, except -w, are overridden when using -g)
-k INT k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
-w INT w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
-e FLOAT=1 maximal e-value for determining if a V-J designation can be trusted
-t INT trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt (0: no trim)
-s SEED=10s seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitly or by an alias
10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
-k, --kmer INT k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
-w, --window INT w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
-e, --e-value FLOAT=1 maximal e-value for determining if a V-J segmentation can be trusted
--trim INT trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt (0: no trim)
-s, --seed SEED=10s seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitly or by an alias
10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
```
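The seed notation in the help text above can be read with a short sketch. It assumes the usual meaning of a spaced seed: positions marked `#` are compared and positions marked `-` are ignored. The function names are hypothetical, and how vidjil-algo indexes k-mers internally is not shown in this page.

```python
# Seed aliases, copied from the help text above.
SEED_ALIASES = {
    "10s": "#####-#####",
    "12s": "######-######",
    "13s": "#######-######",
    "9c": "#########",
}


def contiguous_seed(k):
    """Seed equivalent to '-k k': k contiguous '#' characters."""
    return "#" * k


def apply_seed(seed, word):
    """Keep only the letters of `word` that face a '#' in `seed`."""
    if len(seed) != len(word):
        raise ValueError("seed and word must have the same length")
    return "".join(base for mark, base in zip(seed, word) if mark == "#")


# The middle 'N' faces the '-' of the 10s seed and is ignored.
print(apply_seed(SEED_ALIASES["10s"], "ACGTANCGTAC"))
```

With this reading, `-k 9` and `-s 9c` describe the same seed, which matches the note that `-k` is equivalent to a contiguous seed made only of `#` characters.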
The `-s`, `-k` are the options of the seed-based heuristic that detects
......@@ -352,34 +350,37 @@ The default value is 1.0, but values such as 1000, 1e-3 or even less can be used
to have a more or less permissive designation.
The threshold can be disabled with `-e all`.
The `-t` option sets the maximal number of nucleotides that will be indexed in
The `--trim` option sets the maximal number of nucleotides that will be indexed in
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
However giving a `-t` may also reduce the probability of seeing a heavily
However giving a `--trim` may also reduce the probability of seeing a heavily
trimmed or mutated V gene.
The default is `-t 0`.
The default is `--trim 0`.
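As a rough illustration of what the trimming keeps, the sketch below follows the description above: at most `max_nt` nucleotides from the 3' end of a V gene or from the 5' end of a J gene, with 0 meaning no trimming. The function name and its arguments are hypothetical; this is not vidjil-algo's code.

```python
def trim_germline(sequence, max_nt, segment):
    """Return the part of a germline gene that would be kept for indexing.

    segment is "V" (keep the 3' end) or "J" (keep the 5' end);
    max_nt == 0 means no trimming.
    """
    if max_nt == 0 or len(sequence) <= max_nt:
        return sequence
    if segment == "V":
        return sequence[-max_nt:]
    if segment == "J":
        return sequence[:max_nt]
    raise ValueError("segment must be 'V' or 'J'")
```

For example, `trim_germline(v_gene, 100, "V")` would keep the last 100 nucleotides of `v_gene`, the region closest to the junction.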
## Thresholds on clone output
The following options control how many clones are output and analyzed.
``` diff
Limits to report a clone (or a window)
Input
-x, --first-reads INT maximal number of reads to process ('all': no limit, default), only first reads
-X, --sampled-reads INT maximal number of reads to process ('all': no limit, default), sampled reads
Limits to report and to analyze clones (second pass)
-r, --min-reads INT=5 minimal number of reads supporting a clone
--min-ratio FLOAT=0 minimal percentage of reads supporting a clone
--max-clones INT maximal number of output clones ('all': no maximum, default)
-r INT=5 minimal number of reads supporting a clone
--ratio FLOAT=0 minimal percentage of reads supporting a clone
Limits to further analyze some clones (second pass)
-y INT=100 maximal number of clones computed with a consensus sequence ('all': no limit)
-z INT=100 maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use)
-A reports and segments all clones (-r 0 --ratio 0 -y all -z all), to be used only on very small datasets (for example -AX 20)
-x INT maximal number of reads to process ('all': no limit, default), only first reads
-X INT maximal number of reads to process ('all': no limit, default), sampled reads
-y, --max-consensus INT=100 maximal number of clones computed with a consensus sequence ('all': no limit)
-z, --max-designations INT=100
maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use)
--all reports and analyzes all clones
(--min-reads 1 --min-ratio 0 --max-clones all --max-consensus all --max-designations all),
to be used only on small datasets (for example --all -X 1000)
```
The `-r/--ratio` options are strong thresholds: if a clone does not have
the requested number of reads, the clone is discarded (except when
using `-l`, see below).
using `--label`, see below).
The default `-r 5` option is meant to only output clones that
have a significant read support. **You should use** `-r 1` **if you
want to detect all clones starting from the first read** (especially for
......@@ -387,18 +388,18 @@ MRD detection).
The `--max-clones` option limits the number of output clones, even without consensus sequences.
The `-y` option limits the number of clones for which a consensus
The `--max-consensus` option limits the number of clones for which a consensus
sequence is computed. Usually you do not need to have more
consensus (see below), but you can safely put `-y all` if you want
consensus (see below), but you can safely put `--max-consensus all` if you want
to compute all consensus sequences.
The `-z` option limits the number of clones that are fully analyzed,
The `--max-designations` option limits the number of clones that are fully analyzed,
*with their V(D)J designation and possibly a CDR3 detection*,
in particular to enable the web application
to display the clones on the grid (otherwise they are displayed on the
'?/?' axis).
If you want to analyze more clones, you should use `-z 200` or
`-z 500`. It is not recommended to use larger values: outputting more
If you want to analyze more clones, you should use `--max-designations 200` or
`--max-designations 500`. It is not recommended to use larger values: outputting more
than 500 clones is often not useful since they can not be visualized easily
in the web application, and takes more computation time.
......@@ -408,27 +409,27 @@ and `.vdj.fa` files. If the clone is at some MRD point in the top 100 (or 200, o
it will be fully analyzed/segmented by this other point (and then
collected by the `fuse.py` script, using consensus sequences computed at this
other point, and then, on the web application, correctly displayed on the grid).
**Thus it is advised to leave the default** `-z 100` **option
**Thus it is advised to leave the default** `--max-designations 100` **option
for the majority of uses.**
The `-A` option disables all these thresholds. This option should be
used only for test and debug purposes, on very small datasets, and
produce large file and takes huge computation times.
The `--all` option disables all these thresholds. This option can be
used for test and debug purposes or on small datasets.
It produces large files and takes more time.
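To make explicit which thresholds `--all` stands for, the sketch below rewrites an argument list using the expansion given in the help text of this merge request. The helper name is hypothetical, and whether vidjil-algo lets a later option override part of the preset is not stated here.

```python
# Expansion of --all, copied from the help text above.
ALL_PRESET = [
    "--min-reads", "1",
    "--min-ratio", "0",
    "--max-clones", "all",
    "--max-consensus", "all",
    "--max-designations", "all",
]


def expand_all(args):
    """Replace every '--all' in an argument list by its documented expansion."""
    expanded = []
    for arg in args:
        if arg == "--all":
            expanded.extend(ALL_PRESET)
        else:
            expanded.append(arg)
    return expanded
```

Calling `expand_all(["--all", "-X", "1000"])` shows the five thresholds that are lifted on the small, sampled dataset suggested in the help.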
The `-Z` option speeds up the full analysis by a pre-processing step,
The `--analysis-filter` option speeds up the full analysis by a pre-processing step,
again based on k-mers, to select a subset of the V germline genes to be compared to the read.
The option gives the typical size of this subset (it can be larger when several V germlines
genes are very similar, or smaller when there are not enough V germline genes).
The default `-Z 3` is generally safe.
Setting `-Z all` removes this pre-processing step, running a full dynamic programming
The default `--analysis-filter 3` is generally safe.
Setting `--analysis-filter all` removes this pre-processing step, running a full dynamic programming
with all germline sequences that is much slower.
## Sequences of interest
Vidjil-algo allows to indicate that specific sequences should be followed and output,
even if those sequences are 'rare' (below the `-r/--ratio` thresholds).
Such sequences can be provided either with `-W <sequence>`, or with `-l <file>`.
The file given by `-l` should have one sequence by line, as in the following example:
Such sequences can be provided either with `--label <sequence>`, or with `--label-file <file>`.
The file given by `--label-file` should have one sequence by line, as in the following example:
``` diff
GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1
......@@ -440,7 +441,7 @@ The first column of the file is the sequence to be followed
while the remaining columns consist of the sequence's label.
In Vidjil-algo output, the labels are output alongside their sequences.
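A label file in this format could be read as sketched below. The sketch assumes whitespace-separated columns and skips blank lines; both are assumptions, since the exact parsing rules of vidjil-algo are not given here, and `parse_label_file` is a hypothetical name.

```python
def parse_label_file(lines):
    """Map each sequence (first column) to its label (remaining columns)."""
    labels = {}
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        sequence = fields[0]
        label = fields[1].strip() if len(fields) > 1 else ""
        labels[sequence] = label
    return labels
```

It accepts any iterable of lines, for instance an open file object.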
A sequence given with `-W <sequence>` or with `-l <file>` can be exactly the size
A sequence given with `--label <sequence>` or with `--label-file <file>` can be exactly the size
of the window (`-w`, that is 50 by default). In this case, it is guaranteed that
such a window will be output if it is detected in the reads.
More generally, when the provided sequence differs in length with the windows
......@@ -449,13 +450,21 @@ we will keep any window that is contained in the sequence of interest.
This filtering will work as expected when the provided sequence overlaps
(at least partially) the CDR3 or its close neighborhood.
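The containment rule for sequences longer than the window can be pictured with a minimal sketch. It only tests plain substring inclusion; strand handling and any other detail of the actual implementation are not modeled, and `keep_window` is a hypothetical name.

```python
def keep_window(window, sequences_of_interest):
    """True when the window is contained in at least one sequence of interest."""
    return any(window in sequence for sequence in sequences_of_interest)
```

A window equal to the provided sequence is the special case where the sequence has exactly the window length.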
With the `-F` option, *only* the windows related to the given sequences are kept.
With the `--label-filter` option, *only* the windows related to the given sequences are kept.
This allows to quickly filter a set of reads, looking for a known sequence or window,
with the `-FaW <sequence>` options:
All the reads with the windows related to the sequence will be extracted to `out/seq/clone.fa-1`.
with the `--grep-reads <sequence>` preset, equivalent to
`--out-reads --label-filter --label <sequence>`:
All the reads with the windows related to the sequence will be extracted
to files such as `out/seq/clone.fa-1`.
## Clone analysis: VDJ assignation and CDR3 detection
```
Clone analysis (second pass)
-d, --several-D try to detect several D (experimental)
-3, --cdr3 CDR3/JUNCTION detection (requires gapped V/J germlines)
```
The `-3` option launches a CDR3/JUNCTION detection based on the position
of Cys104 and Phe118/Trp118 amino acids. This detection relies on alignment
with gapped V and J sequences, as for instance, for V genes, IMGT/GENE-DB sequences,
......@@ -465,7 +474,7 @@ The CDR3/JUNCTION detection won't work with custom non-gapped V/J repertoires.
CDR3 are reported as productive when they come from an in-frame recombination
and when the sequence does not contain any in-frame stop codons.
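The two conditions for a productive CDR3 can be sketched as follows. This is a simplification that takes only the junction nucleotides, read in frame from their first base; it is not the check implemented in vidjil-algo, and `is_productive` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_productive(junction):
    """In-frame length and no in-frame stop codon."""
    junction = junction.upper()
    if len(junction) % 3 != 0:
        return False  # out-of-frame recombination
    codons = (junction[i:i + 3] for i in range(0, len(junction), 3))
    return not any(codon in STOP_CODONS for codon in codons)
```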
The advanced `-f` option sets the parameters used in the comparisons between
The advanced `--analysis-cost` option sets the parameters used in the comparisons between
the clone sequence and the V(D)J germline genes. The default values should work.
The e-value set by `-e` is also applied to the V/J designation.
......@@ -477,13 +486,13 @@ The following options are experimental and have no consequences on the `.vdj.fa`
nor on the standard output. They instead add a `clusters` sections in the `.vidjil` file
that will be visualized in the web application.
The `-n` option triggers an automatic clustering using the DBSCAN algorithm (Ester et al., 1996).
Using `-n 5` usually cluster reads within a distance of 1 mismatch (default score
The `--cluster-epsilon` option triggers an automatic clustering using the DBSCAN algorithm (Ester et al., 1996).
Using `--cluster-epsilon 5` usually clusters reads within a distance of 1 mismatch (default score
being +1 for a match and -4 for a mismatch). However, more distant reads can also
be clustered when there are more than 10 reads within the distance threshold.
This behaviour can be controlled with the `-N` option.
This behaviour can be controlled with the `--cluster-N` option.
The `-=` option allows to specify a file for manually clustering two windows
The `--cluster-forced-edges` option allows to specify a file for manually clustering two windows
considered as similar. Such a file may be automatically produced by vidjil
(`out/edges`), depending on the option provided. Only the first two columns
(separated by one space) are important to vidjil, they only consist of the
......@@ -497,8 +506,8 @@ The main output of Vidjil-algo (with the default `-c clones` command) are two fo
- The `.vidjil` file is *the file for the Vidjil web application*.
The file is in a `.json` format (detailed in [vidjil-format](vidjil-format))
describing the windows and their count, the consensus sequences (`-y`),
the detailed V(D)J and CDR3 designation (`-z`, see warning below), and possibly
describing the windows and their count, the consensus sequences (`--max-consensus`),
the detailed V(D)J and CDR3 designation (`--max-designations`, see warning below), and possibly
the results of the further clustering.
The web application takes this `.vidjil` file ([possibly merged with `fuse.py`](#following-clones-in-several-samples)) for the *visualization and analysis* of clones and their
......@@ -508,9 +517,9 @@ The main output of Vidjil-algo (with the default `-c clones` command) are two fo
- The `.vdj.fa` file is *a FASTA file for further processing by other bioinformatics tools*.
The sequences are at least the windows (and their count in the headers) or
the consensus sequences (`-y`) when they have been computed.
the consensus sequences (`--max-consensus`) when they have been computed.
The headers include the count of each window, and further includes the
detailed V(D)J and CDR3 designation (`-z`, see warning below), given in a '.vdj' format, see below.
detailed V(D)J and CDR3 designation (`--max-designations`, see warning below), given in a '.vdj' format, see below.
The further clustering is not output in this file.
The `.vdj.fa` output enables to use Vidjil-algo as a *filtering tool*,
......@@ -655,7 +664,7 @@ Using `-c designations` trigger a separate analysis for each read, but this is u
| warnings (+) | string | *Warnings associated to this clone. See <https://gitlab.vidjil.org/blob/dev/doc/warnings.md>.*
| sequence | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. <br />*This contains the consensus/representative sequence of each clone.*
| rev_comp | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. <br />*Set to null, as vidjil-algo gather reads from both strands in clones* |
| v_call, d_call, j_call | string | V/D/J gene with allele. For example, IGHV4-59\*01. <br /> *implemented. In the case of incomplete/unexpected recombinations (locus with a `+`), we still use `v/d/j_call`. Note that this value can be null on clones beyond the `-z` option.* |
| v_call, d_call, j_call | string | V/D/J gene with allele. For example, IGHV4-59\*01. <br /> *implemented. In the case of incomplete/unexpected recombinations (locus with a `+`), we still use `v/d/j_call`. Note that this value can be null on clones beyond the `--max-designations` option.* |
| junction | string | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. <br />*null*
| junction_aa | string | Junction region amino acid sequence. <br />*implemented*
| cdr3_aa | string | Amino acid translation of the cdr3 field. <br />*implemented*
......@@ -685,7 +694,7 @@ in the following situations:
- in a second pass, on the standard output and in both `.vidjil` and `.vdj.fa` files
- at the end of the clones detection (default command `-c clones`,
on a number of clones limited by the `-z` option)
on a number of clones limited by the `--max-designations` option)
- or directly when explicitly requiring V(D)J designation for each read
(`-c designations`)
......@@ -796,10 +805,10 @@ This file will be relatively small (a few kB or MB) and can be taken again as an
## Advanced usage
``` bash
./vidjil-algo -c clones -g germline/homo-sapiens.g -r 1 -n 5 -x 10000 demo/LIL-L4.fastq.gz
./vidjil-algo -c clones -g germline/homo-sapiens.g -r 1 --cluster-epsilon 5 -x 10000 demo/LIL-L4.fastq.gz
# Extracts the windows with at least 1 read each (-r 1, the default being -r 5)
# on the first 10,000 reads, then cluster them into clones
# with a second clustering step at distance five (-n 5)
# with a second clustering step at distance five (--cluster-epsilon 5)
# The result of this second step is in the .vidjil file ('clusters')
# and can be seen and edited in the web application.
```
......
......@@ -234,7 +234,7 @@ def run_vidjil(id_file, id_config, id_data, grep_reads,
if grep_reads:
# TODO: security, assert grep_reads XXXX
vidjil_cmd += ' -FaW "%s" ' % grep_reads
vidjil_cmd += ' --grep-reads "%s" ' % grep_reads
os.makedirs(out_folder)
out_log = out_folder+'/'+output_filename+'.vidjil.log'
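The `TODO: security` comment above could be addressed along these lines. The sketch assumes that `grep_reads` must be a plain nucleotide sequence (letters A, C, G, T, N only), which is an assumption about what the server should accept; `grep_reads_args` is a hypothetical helper, not part of the existing code.

```python
import re

# Assumption: only plain nucleotide letters are accepted.
NUCLEOTIDES = re.compile(r"[ACGTNacgtn]+")


def grep_reads_args(grep_reads):
    """Return the --grep-reads arguments, rejecting anything but nucleotides."""
    if not NUCLEOTIDES.fullmatch(grep_reads):
        raise ValueError("grep_reads must be a nucleotide sequence")
    return ["--grep-reads", grep_reads]
```

Once validated, the value contains no quote or shell metacharacter, so it can be interpolated into the command string as before, or passed as list items to a call that does not go through a shell.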
......