Commit 4b8ee2c9 authored by Mathieu Giraud, committed by Vidjil Team

doc/vidjil-algo.md: AIRR .tsv format, draft documentation

See #3568.
parent 656ddd25
......@@ -628,6 +628,32 @@ Again, as these options may generate large files, they are generally not recomme
However, they are very useful in some situations, especially to understand why a dataset gives poor segmentation results.
For example, `-uu -X 1000` splits the unsegmented reads from the first 1000 reads.
## AIRR .tsv output
Since version 2018.10, vidjil-algo supports the [AIRR format](http://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields).
Note that vidjil-algo is designed to efficiently gather reads into clones. We thus report *clones* in the AIRR format.
See "what is a clone" in vidjil-format.md.
| Name | Type | AIRR 1.2 Description <br /> *vidjil-algo implementation* |
| ----- | ---- | ------------------------------------------------------- |
| locus | string | Gene locus (chain type). For example, IGH, IGK, IGL, TRA, TRB, TRD, or TRG.<br />*The incomplete recombinations analyzed by vidjil-algo are reported as IGH+, IGK+, TRA+D, TRB+, TRD+, and `xxx` for unexpected recombinations. See [locus](locus).* |
| sequence_id | string | Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. <br />*This identifier is the (50 bp by default) window extracted around the junction.* |
| clone_id | string | Clonal cluster assignment for the query sequence. <br />*This identifier is again the (50 bp by default) window extracted around the junction.* |
| sequence | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment. <br />*This contains the consensus/representative sequence of each clone.* |
| rev_comp | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'. <br />*Set to null, as vidjil-algo gathers reads from both strands into clones.* |
| v_call, d_call, j_call | string | V/D/J gene with allele. For example, IGHV4-59*01. <br /> *In the case of incomplete/unexpected recombinations, we still use* `v/d/j_call`. |
| junction_aa | string | Junction region amino acid sequence. <br />*null* |
| productive | boolean | True if the V(D)J sequence is predicted to be productive. <br /> *true, false, or null when no CDR3 has been detected* |
| sequence_alignment | string | Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement. <br /> *null* |
| germline_alignment | string | Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any). <br />*null* |
| v_cigar, d_cigar, j_cigar | string | CIGAR strings for the V/D/J gene. <br />*null* |
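
As an illustration of how the fields above might be consumed, here is a minimal Python sketch that reads a tab-separated AIRR file and yields one record per clone. It is not part of vidjil-algo. The helper names `parse_bool` and `read_clones` are hypothetical, the sample rows are made up, and the sketch assumes that null values are written as empty cells and that booleans appear as `T`/`F` or `true`/`false`.

```python
import csv
import io


def parse_bool(value):
    """Map a boolean cell to True/False, or None when empty or unrecognized."""
    v = value.strip().lower()
    if v in ("t", "true"):
        return True
    if v in ("f", "false"):
        return False
    return None


def read_clones(handle):
    """Yield one dict per clone from a tab-separated file handle."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: an empty cell stands for null.
        clone = {key: (val if val != "" else None) for key, val in row.items()}
        clone["productive"] = parse_bool(row.get("productive") or "")
        yield clone


# Made-up sample, for illustration only (not actual vidjil-algo output).
SAMPLE = (
    "locus\tclone_id\tv_call\tj_call\tproductive\tjunction_aa\n"
    "IGH\tACGTACGT\tIGHV4-59*01\tIGHJ4*02\tT\t\n"
    "TRD+\tTTGGCCAA\tTRDD2*01\tTRDD3*01\t\t\n"
)

clones = list(read_clones(io.StringIO(SAMPLE)))
for c in clones:
    print(c["locus"], c["clone_id"], c["productive"])
```

To read a real file, the same `read_clones` function can be given a handle opened with `open(path, newline="")`.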
## Segmentation and .vdj format
Vidjil output includes segmentation of V(D)J recombinations. This happens
......