Commit cd292aa0 authored by Mikaël Salson's avatar Mikaël Salson

vidjil.cpp (tests and doc): -t options defaults again to 0

This reverts bf2f96fe and d317535b.

After more than one year experience with -t 100 we realised that trimming leads to:
– missing special cases where the V gene is heavily trimmed
– giving bad unsegmentation causes: UNSEG only J appeared in cases where it didn't make sense
  those unsegmentation causes were replaced by UNSEG only V when changing -t to 0 (that also let us think that we could miss some V genes)

However -t 0 does have a drawback that should be addressed. It is more likely to have
spurious hits of V genes in the N-region (particularly with VDJ recombinations
where the N-region is longer). Those spurious hits tend to shift the window towards
the J gene.

In the tests the modifications are due either to slight changes in number of windows
(a small increase due to the previous reason) or to an increase in unsegmentation
with really large windows (since windows are shifted we may not have
enough space left to put a window).
parent 29f49c02
......@@ -5,7 +5,7 @@ $ Number of reads
e1:"total": [13153]
$ Number of segmented reads
e1:"segmented": [13153]
e1:"segmented": [13152]
$ Most abundant window
1:"id": "CCACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCC".*"reads": \\[8\\]
......@@ -50,10 +50,10 @@ $ Segmentation details - CDR3, JUNCTION
1:"junction": ."aa": "CTREEQYSSWYFDFW", .* "start": 53, "stop": 97.
$ Second sequence has a DNA sequence provided
1:"id": "CTGTGCGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTA".*"sequence": "[ACGT]+",
1:"id": "TGTGCGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTAC".*"sequence": "[ACGT]+",
$ Second sequence also has evalues
1:"id": "CTGTGCGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTA".*"evalue": ."val": "[0-9\.e-]+"
1:"id": "TGTGCGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTAC".*"evalue": ."val": "[0-9\.e-]+"
$ All 'start' fields are 1-based, they never equal to zero
0: "start": 0
......@@ -2,5 +2,5 @@
!LOG: stanford-k14.log
$ Find the good number of windows in Stanford S22 (contiguous seed 14)
1: found 10790 50-windows in 13152 reads
1: found 10796 50-windows in 13152 reads
......@@ -2,11 +2,11 @@
!LOG: stanford-w100.log
$ Find the good number of "too short sequences" for windows of size 100
1: UNSEG too short w -> 484
1: UNSEG too short w -> 1368
$ Find the good number of segmented sequences (including "too short sequences")
1: junction detected in 13153 reads .100%.
1: junction detected in 13152 reads .100%.
$ Find the good number of windows of size 100 in Stanford S22
1: found 11437 100-windows in 12669 reads
1: found 10592 100-windows in 11784 reads
......@@ -10,13 +10,13 @@ $ Parses germline/IGHJ.fa
1: 701 bp in 13 sequences
$ Find the good index load
1:custom .* 0.055. l13 k12
1:custom .* 0.163. l13 k12
$ Find approximately the good number of sequences for e-value computation
1: approx. 131.. sequences
$ Find the good number of windows in Stanford S22
1: found 10764 50-windows in 13153 reads
1: found 10766 50-windows in 13152 reads
$ First clone -- find the good number of reads
2:clone-001--.*--0000008
......
......@@ -119,7 +119,7 @@ enum { CMD_WINDOWS, CMD_CLONES, CMD_SEGMENT, CMD_GERMLINES } ;
#define DEFAULT_CLUSTER_COST Cluster
#define DEFAULT_SEGMENT_COST VDJ
#define DEFAULT_TRIM 100
#define DEFAULT_TRIM 0
#define MAX_CLONES_FOR_SIMILARITY 20
......
......@@ -265,7 +265,7 @@ Window prediction
(using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
-w <int> w-mer size used for the length of the extracted window (default: 50)
-e <float> maximal e-value for determining if a segmentation can be trusted (default: 'all', no limit)
-t <int> trim V and J genes (resp. 5' and 3' regions) to keep at most <int> nt (default: 100) (0: no trim)
-t <int> trim V and J genes (resp. 5' and 3' regions) to keep at most <int> nt (default: 0) (0: no trim)
#+END_EXAMPLE
The =-s=, =-k= are the options of the seed-based heuristic. A detailed
......@@ -308,7 +308,9 @@ The threshold can be disabled with =-e all=.
The =-t= option sets the maximal number of nucleotides that will be indexed in
V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the
indexes, giving more precise window estimation and e-value computation.
The default is =-t 100=.
However giving a =-t= may also reduce the probability of seeing a heavily
trimmed or mutated V gene.
The default is =-t 0=.
** Thresholds on clone output
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment