Commit 97abeacb authored by Mathieu Giraud

doc: 'detected' reads, or reads with detected recombinations

see #3413
parent 6ef7daff
Pipeline #147560 passed with stages
in 41 minutes and 56 seconds
......@@ -486,9 +486,10 @@ analyzed reads, including the hidden clones.
The web application displays one consensus sequence per clone (see [Representative](#what-is-the-sequence-displayed-for-each-clone) above).
In some situations, one may want to go back to the reads.
For **vidjil-algo**, analyzing a dataset with the *default + extract reads* config enables
to retrieve back the analyzed reads in the `.segmented.vdj.fa` file that can be downloaded through the `out` link near each sample.
This `.vdj.fa` output enables to use vidjil-algo as a *filtering tool*,
For **vidjil-algo**, analyzing a dataset with the *default + extract reads* config
generates a `.detected.vdj.fa` file with the reads with detected V(D)J recombinations.
This file can be downloaded through the `out` link near each sample.
It allows vidjil-algo to be used as a *filtering tool*,
shrinking a large read set into a manageable number of (pre-)clones
that will be deeply analyzed and possibly further clustered by
other software.
......@@ -509,8 +510,8 @@ With DNA-Seq sequencing with specific V(D)J primers,
ratios above 90% usually mean very good results. Smaller ratios, especially under 60%, often mean that something went wrong.
On the other hand, capture with many probes or RNA-Seq strategies usually lead to datasets with less than 0.1% V(D)J recombinations.
The “info” button further details the causes of non-analysis (for vidjil-algo, `UNSEG`, see details in the [vidjil-algo documentation](http://www.vidjil.org/doc/vidjil-algo/#unsegmentation-causes)).
There can be several causes leading to bad ratios:
The “info” button further details the causes of non-analysis (for vidjil-algo, `UNSEG`, see details in the [vidjil-algo documentation](vidjil-algo/#reads-without-detected-recombinations)).
There can be several causes leading to low ratios:
### Analysis or biological causes
......
......@@ -304,7 +304,7 @@ Recombination detection ("window" prediction, first pass)
(all these options, except -w, are overriden when using -g)
-k, --kmer INT k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
-w, --window INT w-mer size used for the length of the extracted window ('all': use all the read, no window clustering)
-e, --e-value FLOAT=1 maximal e-value for determining if a V-J segmentation can be trusted
-e, --e-value FLOAT=1 maximal e-value for trusting the detection of a V-J recombination
--trim INT trim V and J genes (resp. 5' and 3' regions) to keep at most <INT> nt (0: no trim)
-s, --seed SEED=10s seed, possibly spaced, used for the V/J affectation (default: depends on germline), given either explicitely or by an alias
10s:#####-##### 12s:######-###### 13s:#######-###### 9c:#########
......@@ -332,7 +332,7 @@ the start of the J, or at least some specific N region to uniquely identify the
Setting `-w` to higher values (such as `-w 60` or `-w 100`) makes the clone clustering
even more conservative, enabling to split clones with low specificity (such as IGH with very
large D, short or no N regions and almost no somatic hypermutations). However, such settings
may "segment" (analyze) less reads, depending on the read length of your data, and may also
may detect recombinations in fewer reads, depending on the read length of your data, and may also
return more clones, as any sequencing error in the window is not corrected.
The special `-w all` option takes the whole read as the window, completely disabling
......@@ -623,9 +623,9 @@ and 1 (full diversity, each analyzed read belongs to a different clone).
These values are now computed on the windows, before any further clustering.
PCR and sequencing errors can thus lead to slightly over-estimate the diversity.
## Details on non analyzed reads
## Reads without detected recombinations
Vidjil-algo outputs detailed statistics on the reads that are not analyzed.
Vidjil-algo outputs detailed statistics on the reads where no recombination was detected.
Basically, **an unanalyzed read is a read where Vidjil-algo cannot identify a window at the junction of V and J genes**.
To properly analyze a read, Vidjil-algo needs the sequence to span enough of the V region and of the J region
(or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations).
......@@ -657,17 +657,18 @@ that can lead to few analyzed reads.
``` diff
Detailed output per read (generally not recommended, large files, but may be used for filtering, as in -uu -X 1000)
-U, --out-analyzed output analyzed reads (in .segmented.vdj.fa file)
-u, --out-unanalyzed
-u output unanalyzed reads, gathered by cause, except for very short and 'too few V/J' reads (in *.fa files)
-uu output unanalyzed reads, gathered by cause, all reads (in *.fa files) (use only for debug)
-uuu output unanalyzed reads, all reads, including a .unsegmented.vdj.fa file (use only for debug)
-U, --out-detected output reads with detected recombinations (in .detected.vdj.fa file)
-u, --out-undetected
-u output undetected reads, gathered by cause, except for very short and 'too few V/J' reads (in *.fa files)
-uu output undetected reads, gathered by cause, all reads (in *.fa files) (use only for debug)
-uuu output undetected reads, all reads, including a .undetected.vdj.fa file (use only for debug)
--out-reads output all reads by clones (clone.fa-*), to be used only on small datasets
-K, --out-affects output detailed k-mer affectation for each read (in .affects file) (use only for debug, for example -KX 100)
```
It is possible to extract all analyzed or not analyzed reads, possibly to give them to other software.
Running Vidjil-algo with `-U` gives a file `out/basename.analyzed.vdj.fa`, with all analyzed reads.
It is possible to extract all reads with or without detected recombinations,
possibly to give them to other software.
Running Vidjil-algo with `-U` gives a file `out/basename.detected.vdj.fa`, with all detected reads.
On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file.
However, the `-U` option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations.
Moreover `-U` only uses the ultra-fast first pass analysis, based on k-mer heuristics.
......@@ -675,16 +676,17 @@ Moreover `-U` only uses the ultra-fast first passs analysis, based on k-mer heur
Similarly, options are available to get the non analyzed reads:
- `-u` gives a set of files `out/basename.UNSEG_*`, with not analyzed reads gathered by cause.
- `-u` gives a set of files `out/basename.UNSEG_*`, with not detected reads gathered by cause.
It outputs only reads that share significant sequence with V/J germline genes or that show some ambiguity:
these reads may be worth studying further in RNA-Seq datasets.
- `-uu` gives the same set of files, including **all** not analyzed reads (including `UNSEG too short` and `UNSEG too few V/J`),
and `-uuu` further outputs all these reads in a file `out/basename.unsegmented.vdj.fa`.
- `-uu` gives the same set of files, including **all** not detected reads (including `UNSEG too short` and `UNSEG too few V/J`),
and `-uuu` further outputs all these reads in a file `out/basename.undetected.vdj.fa`.
Again, as these options may generate large files, they are generally not recommended.
However, they are very useful in some situations, especially to understand why some dataset gives poor segmentation result.
For example `-uu -X 1000` splits the not analyzed reads from the 1000 first reads.
However, they are very useful in some situations, especially to understand
why a dataset gives a low detection rate.
For example `-uu -X 1000` splits the not detected reads from the first 1000 reads.
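To see which cause dominates after a `-u` or `-uu` run, one can count the reads in each `out/basename.UNSEG_*` file. The following Python sketch is only an illustration: the function name `count_undetected_by_cause` is hypothetical, it assumes the files are plain FASTA (one `>` header per read), and it simply uses whatever follows `UNSEG_` in the file name as the cause label.

```python
import glob
from pathlib import Path
from typing import Dict


def count_undetected_by_cause(out_dir: Path, basename: str) -> Dict[str, int]:
    """Count reads in each <basename>.UNSEG_* file of out_dir.

    Assumption: each file is FASTA, so one '>' line equals one read.
    The label is the part of the file name after 'UNSEG_', extension included.
    """
    prefix = f"{basename}.UNSEG_"
    counts: Dict[str, int] = {}
    # glob.escape protects basenames that contain glob metacharacters.
    for path in sorted(out_dir.glob(glob.escape(prefix) + "*")):
        cause = path.name[len(prefix):]
        with open(path, encoding="utf-8") as handle:
            counts[cause] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Calling `count_undetected_by_cause(Path("out"), "basename")` returns a dictionary mapping each label to its read count, which can then be sorted to find the main cause.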
## AIRR .tsv output
......@@ -761,7 +763,7 @@ The following lines are for VDJ recombinations:
Jgene name of the J gene being rearranged
comments optional comments. In Vidjil, the following comments are now used:
- "seed" when this comes for the first pass (.segmented.vdj.fa). See the warning above.
- "seed" when this comes for the first pass (.detected.vdj.fa). See the warning above.
- "!ov x" when there is an overlap of x bases between last V seed and first J seed
- the name of the locus (TRA, TRB, TRG, TRD, IGH, IGL, IGK, possibly followed
by a + for incomplete/unusual recombinations)
......@@ -777,7 +779,7 @@ applicable being removed:
``` diff
>name + VJ startV endV startJ endJ Vgene delV/N1/delJ Jgene comments
```
In the `.segmented.vdj.fa` file, the start/end positions of V and J genes are only an estimation,
In the `.detected.vdj.fa` file, the start/end positions of V and J genes are only an estimation,
obtained from the k-mer heuristics, as the center of the window may be shifted up to 15 bases from the actual center.
In the final `.vdj.fa` file, these values are the correct ones computed after dynamic programming comparison
with germline genes.
......@@ -824,11 +826,11 @@ clustering such reads into clones, and further analyzing the clones.
# Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations
# Cluster the reads into clones, again based on windows overlapping the detected CDR3s.
# Assign the VDJ genes and try to detect the CDR3 of each clone.
# The out/reads.segmented.vdj.fa includes all reads where a V(D)J recombination was found
# The out/reads.detected.vdj.fa includes all reads where a V(D)J recombination was found
```
Typical whole RNA-Seq or capture datasets may be huge (several GB) but with only a (very) small portion of recombined sequences.
Using Vidjil with `-U` will create a `out/reads.segmented.vdj.fa` file
Using Vidjil with `-U` will create a `out/reads.detected.vdj.fa` file
that includes all reads where a V(D)J recombination (or an unexpected recombination, with `-2`) was found.
This file will be relatively small (a few kB or MB) and can be taken again as an input for Vidjil-algo or for other programs.
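Before passing the filtered file to another program, it can help to check how small it actually is. The sketch below is a generic FASTA reader, not part of Vidjil: the names `read_fasta` and `summarize_fasta` are hypothetical, and the only assumption is that the `-U` output is standard FASTA.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header = None
    chunks = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(path: Path) -> Tuple[int, int]:
    """Return (number of reads, total number of bases)."""
    n_reads = 0
    n_bases = 0
    for _header, sequence in read_fasta(path):
        n_reads += 1
        n_bases += len(sequence)
    return n_reads, n_bases
```

For example, `summarize_fasta(Path("out/reads.detected.vdj.fa"))` gives the number of reads with a detected recombination and their cumulated length.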
......
......@@ -276,14 +276,15 @@ representative clone of the cluster.
## Statistics: the `reads` element \[.vidjil only, required\]
The number of analyzed reads (`segmented`) may be higher than the sum of the read number of all clones,
The number of reads with detected recombinations (`segmented`)
may be higher than the sum of the read number of all clones,
when one chooses to report only the 'top' clones (`-t` option for fuse).
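The `total` and `segmented` lists of the `reads` element are enough to compute, per sample, the share of reads with a detected recombination. The Python sketch below assumes the `reads` element has already been loaded as a dictionary (for example with `json.load`); the function name `detection_ratios` is hypothetical.

```python
from typing import Dict, List, Optional


def detection_ratios(reads: Dict) -> List[Optional[float]]:
    """Return segmented/total for each sample, or None when total is 0."""
    totals = reads["total"]
    detected = reads["segmented"]
    return [d / t if t else None for d, t in zip(detected, totals)]
```

With `{"total": [1000, 0], "segmented": [900, 0]}`, the function returns `[0.9, None]`: the first sample has 90% of reads with a detected recombination, and the empty sample has no defined ratio.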
``` javascript
{
"total" : [], // total number of reads per sample (with samples.number elements)
"segmented" : [], // number of analyzed/segmented reads per sample (with samples.number elements)
"germline" : { // number of analyzed/segmented reads per sample/germline (with samples.number elements)
"segmented" : [], // number of reads with detected recombinations per sample (with samples.number elements)
"germline" : { // number of reads with detected recombinations per sample/germline (with samples.number elements)
"TRG" : [],
"IGH" : []
}
......@@ -344,7 +345,7 @@ In the `.analysis` file, this section is intended to describe some specific clon
// settings web application menu
"seg": // detailed V(D)J designation/segmentation and other sequences features or values [optional]
// on the web application, clones that are not segmented will be shown on the grid with '?/?'
// on the web application, clones that are not detected will be shown on the grid with '?/?'
// positions are related to the 'sequence'
// names of V/D/J genes should match the ones in files referenced in germline/germline.data
// Positions on the sequence start at 1.