#+TITLE: Vidjil Algorithm -- Command-line Manual #+AUTHOR: The Vidjil team (Mathieu, Mikaël, Marc, Tatiana and Rayan) #+HTML_HEAD: # This manual can be browsed online: # http://www.vidjil.org/doc/algo.html (last stable release) # http://git.vidjil.org/blob/master/doc/algo.org (development version) # Vidjil -- High-throughput Analysis of V(D)J Immune Repertoire -- [[http://www.vidjil.org]] # Copyright (C) 2011, 2012, 2013, 2014, 2015, 2016 by Bonsai bioinformatics # at CRIStAL (UMR CNRS 9189, Université Lille) and Inria Lille # contact@vidjil.org V(D)J recombinations in lymphocytes are essential for immunological diversity. They are also useful markers of pathologies, and in leukemia, are used to quantify the minimal residual disease during patient follow-up. The Vidjil algorithm processes high-throughput sequencing data to extract V(D)J junctions and gather them into clones. Vidjil starts from a set of reads and detects "windows" overlapping the actual CDR3. This is based on an fast and reliable seed-based heuristic and allows to output all sequenced clones. The analysis is extremely fast because, in the first phase, no alignment is performed with database germline sequences. At the end, only the representative sequences of each clone have to be analyzed. Vidjil can also cluster similar clones, or leave this to the user after a manual review in the web application. The method is described in the following references: Marc Duez et al., “Vidjil: High-throughput analysis of immune repertoire”, submitted Mathieu Giraud, Mikaël Salson, et al., "Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing", BMC Genomics 2014, 15:409 http://dx.doi.org/10.1186/1471-2164-15-409 Vidjil is open-source, released under GNU GPLv3 license. * Requirements and installation ** Supported platforms The Vidjil algorithm has been successfully tested on the following platforms : - CentOS 6.3 amd64 - CentOS 6.3 i386 - Debian Squeeze 6.0 - Debian Wheezy 7.0 amd64 - Fedora 19 - FreeBSD 9.2 - Ubuntu 12.04 LTS amd64 - Ubuntu 14.04 LTS amd64 Vidjil is developed with continuous integration using systematic unit and functional testing. The development team internally uses [[https://jenkins-ci.org/][Jenkins]] for that. Moreover, the results of some of these tests can be publicly checked on [[https://travis-ci.org/vidjil/vidjil][travis-ci.org]]. ** Build requirements (optional) This paragraph details the requirements to build Vidjil from source. You can also download a static binary (see next paragraph, 'Installation'). To compile Vidjil, make sure: - to be on a POSIX system ; - to have a C++11 compiler (as =g++= 4.8 or above or =clang= 3.3 or above) - to have the =zlib= installed (=zlib1g-dev= package under Debian/Ubuntu, =zlib-devel= package under Fedora/CentOS). *** CentOS 6 g++-4.8 is included in the devtools 2.0. #+BEGIN_SRC sh sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo sudo yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++ devtoolset-2-valgrind # scl enable devtoolset-2 bash # either open a shell running devtools source /opt/rh/devtoolset-2/enable # ... or source devtools in the same shell #+END_SRC *** FreeBSD 9.2 g++-4.8 is included in FreeBSD 9.2. You may also need to install the =gzstream= library with: #+BEGIN_SRC sh pkg install gzstream #+END_SRC Also Vidjil uses GNU make which requires =gmake= under FreeBSD. At the time of redacting the documentation, =g++= requires extra options to ensure flawless compilation and execution of Vidjil: #+BEGIN_SRC sh make MAKE=gmake CXXFLAGS="-D_GLIBCXX_USE_C99 -Wl,-rpath=/usr/local/lib/gcc49" #+END_SRC The =gcc49= at the end of the command line is to be replaced by the =gcc= version used. *** Debian Squeeze 6.0 / Wheezy 7.0 g++-4.8 should be pinned from testing. Put in =/etc/apt/preferences= the following lines: #+BEGIN_SRC sh Package: * Pin: release n=wheezy # (or squeeze) Pin-Priority: 900 Package: g++-4.8, gcc-4.8, valgrind* Pin: release n=jessie Pin-Priority: 950 #+END_SRC Then g++ 4.8 can be installed. #+BEGIN_SRC sh apt-get update apt-get install -t jessie g++-4.8 valgrind #+END_SRC *** Ubuntu 14.04 LTS #+BEGIN_SRC sh sudo apt-get install g++-4.8 #+END_SRC *** Ubuntu 12.04 LTS g++-4.8 is included in the devtools 2.0. #+BEGIN_SRC sh sudo apt-get install python-software-properties sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install g++-4.8 #+END_SRC ** Installation #+BEGIN_SRC sh make germline # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: # academic research only, provided that it is referred to IMGT®, # and cited as "IMGT®, the international ImMunoGeneTics information system® # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database, # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093 # either make # build Vijil from the sources (see the requirements, above) # or wget http://bioinfo.lifl.fr/vidjil/vidjil-2015.12_x86_64 -O vidjil # download a static binary (built for x86_64 architectures) ./vidjil -h # display help/usage #+END_SRC If your build system does not use C++11 by default, you should replace the =make= commands by: #+BEGIN_SRC sh make CXXFLAGS='-std=c++11' ### gcc-4.8 make CXXFLAGS='-std=c++11' LDFLAGS='-stdlib=libc++' ### OS X Mavericks #+END_SRC ** Self-tests (optional) You can run the tests with the following commands: #+BEGIN_SRC sh make data # get IGH recombinations from a single individual, as described in: # Boyd, S. D., and al. Individual variation in the germline Ig gene # repertoire inferred from variable region gene rearrangements. J # Immunol, 184(12), 6986–92. make test # run self-tests (can take 5 to 60 minutes) #+END_SRC * Input and parameters The main input file of Vidjil is a /set of reads/, given as a =.fasta= or =.fastq= file, possibly compressed with gzip (=.gz=). This set of reads can reach several gigabytes and 2*10^9 reads. It is never loaded entirely in the memory, but reads are processed one by one by the Vidjil algorithm. The =-h= and =-H= help options provide the list of parameters that can be used. We detail here the options of the main =-c clones= command. The default options are very conservative (large window, no further automatic clusterization, see below), leaving the user or other software making detailed analysis and decisions on the final clustering. ** Germline selection #+BEGIN_EXAMPLE Germline databases (at least one -V/(-D)/-J, or -G, or -g option must be given for all commands except -c germlines) -V V germline multi-fasta file -D D germline multi-fasta file (and resets -m and -w options), will segment into V(D)J components -J J germline multi-fasta file -G prefix for V (D) and J repertoires (shortcut for -V V.fa -D D.fa -J J.fa) (basename gives germline code) -g multiple locus/germlines. In the path , takes 'germlines.data' to select locus and parameters Selecting '-g germline' processes TRA, TRB, TRG, TRD, IGH, IGK and IGL locus, possibly with some incomplete/unusal recombinations A different 'germlines.data' file can also be provided with -g Locus/recombinations -i try to detect incomplete/unusual recombinations (locus with '+', must be used with -g) -2 try to detect unexpected recombinations (must be used with -g) #+END_EXAMPLE - Options such as =-G germline/IGH= or =-G germline/TRG= select one germline system. - The =-V/(-D)/-J= options enable to select individual V, (D) and J repertoires (fasta files). This allows in particular to select incomplete rearrangement using custom V or J repertoires with added sequences. - The =-g germline/= option launches the analysis on the seven germlines, selecting the best locus for each read. Using =-g germline/ -i= tests also some incomplete and unusual recombinations (locus with a =+= in their name), and using =-g germline/ -i -2= further test unexpected recombinations (tagged as =xxx=). See [[http://git.vidjil.org/blob/master/doc/locus.org][locus.org]] for information on the analyzable locus. - Analyzed locus and parameters are configured through the =germline/germlines.data= file. A =germline/isotypes.data= file is provided to look for sequences with, on one side, IGHJ (or even IGHV) genes, and, on the other side, an IGH constant chain. To select a custom set of TR or Ig locus, you may copy =germline/germlines.data= into a new file, as for example =germline/custom-germlines.data=, and run Vidjil with =-g germline/custom-germlines.data -i -2=. - Several =-g= options can be used, as for instance =-g germline -g germline/isotypes.data=. - One can use other germline sequences possibly bydefining another =germlines.data= file that would refer to an alternative germline set or by overwriting the existing germline sequences (in the FASTA file). ** Main algorithm parameters #+BEGIN_EXAMPLE Window prediction (use either -s or -k option, but not both) -s spaced seed used for the V/J affectation (default: #####-#####, ######-######, #######-#######, depends on germline) -k k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline) (using -k option is equivalent to set with -s a contiguous seed with only '#' characters) -w w-mer size used for the length of the extracted window (default: 50) -e maximal e-value for determining if a segmentation can be trusted (default: 'all', no limit) -t trim V and J genes (resp. 5' and 3' regions) to keep at most nt (default: 100) (0: no trim) #+END_EXAMPLE The =-s=, =-k= are the options of the seed-based heuristic. A detailed explanation can be found in (Giraud, Salson and al., 2014). /These options are for advanced usage, the defaults values should work./ The =-s= or =-k= option selects the seed used for the k-mer V/J affectation. The =-w= option fixes the size of the "window" that is the main identifier to gather clones. The default value (=-w 50=) was selected to ensure a high-quality clone gathering: reads are clustered when they /exactly/ share, at the nucleotide level, a 50 bp-window centered on the CDR3. No sequencing errors are corrected inside this window. The center of the "window", predicted by the high-throughput heuristic, may be shifted by a few bases from the actual "center" of the CDR3 (for TRG, less than 15 bases compared to the IMGT/V-QUEST or IgBlast prediction in >99% of cases). The extracted window should be large enough to fully contain the CDR3 as well as some part of the end of the V and the start of the J, or at least some specific N region, to uniquely identify a clone. Setting =-w= to higher values (such as =-w 60= or =-w 100=) makes the clone gathering even more conservative, enabling to split clones with low specificity (such as IGH with very large D, short or no N regions and almost no somatic hypermutations). However, such settings may "segment" (analyze) less reads, depending on the read length of your data, and may also return more clones, as any sequencing error in the window is not corrected. Setting =-w= to lower values than 50 may "segment" (analyze) a few more reads, depending on the read length of your data, but may in some cases falsely cluster reads from different clones. For VJ recombinations, the =-w 40= option is usually safe, and =-w 30= can also be tested. Setting =-w= to lower values is not recommended. The =-e= option sets the maximal e-value accepted for segmenting a sequence. It is an upper bound on the number of exepcted windows found by chance by the seed-based heuristic. The e-value computation takes into account both the number of reads in the input sequence and the number of locus searched for. The default value is 1.0, but values such as 1000, 1e-3 or even less can be used to have a more or less permissive segmentation. The threshold can be disabled with =-e all=. The =-t= option sets the maximal number of nucleotides that will be indexed in V genes (the 3' end) or in J genes (the 5' end). This reduces the load of the indexes, giving more precise window estimation and e-value computation. The default is =-t 100=. ** Thresholds on clone output The following options control how many clones are output and analyzed. #+BEGIN_EXAMPLE Limits to report a clone (or a window) -r minimal number of reads supporting a clone (default: 5) -% minimal percentage of reads supporting a clone (default: 0) Limits to further analyze some clones -y maximal number of clones computed with a representative ('all': no limit) (default: 100) -z maximal number of clones to be analyzed with a full V(D)J designation ('all': no limit, do not use) (default: 100) -A reports and segments all clones (-r 1 -% 0 -y all -z all), to be used only on very small datasets #+END_EXAMPLE The =-r/-%= options are strong thresholds: if a clone does not have the requested number of reads, the clone is discarded (except when using =-l=, see below). The default =-r 5= option is meant to only output clones that have a significant read support. *You should use* =-r 1= *if you want to detect all clones starting from the first read* (especially for MRD detection). The =-y= option limits the number of clones for which a representative sequence is computed. Usually you do not need to have more representatives (see below), but you can safely put =-y all= if you want to compute all representative sequences. The =-z= option limits the number of clones that are fully analyzed, /with their V(D)J designation and possibly a CDR3 detection/, in particular to enable the web application to display the clones on the grid (otherwise they are displayed on the '?/?' axis). If you want to analyze more clones, you should use =-z 200= or =-z 500=. It is not recommended to use larger values: outputting more than 500 clones is often not useful since they can not be visualized easily in the web application, and takes large computation time (full dynamic programming, see below). Note that even if a clone is not in the top 100 (or 200, or 500) but still passes the =-r=, =-%= options, it is still reported in both the =.vidjil= and =.vdj.fa= files. If the clone is at some MRD point in the top 100 (or 200, or 500), it will be fully analyzed/segmented by this other point (and then collected by the =fuse.py= script, using representatives computed at this other point, and then, on the web application, correctly displayed on the grid). *Thus is advised to leave the default* =-z 100= *option for the majority of uses.* The =-A= option disables all these thresholds. This option should be used only for test and debug purposes, on very small datasets, and produce large file and takes huge computation times. ** Labeled windows Vidjil allows to indicate that specific windows that must be followed (even if those windows are 'rare', below the =-r/-%= thresholds). Such windows can be provided either with =-W =, or with =-l =. The file given by =-l= should have one window by line, as in the following example: #+BEGIN_EXAMPLE GAGAGATGGACGGGATACGTAAAACGACATATGGTTCGGGGTTTGGTGCT my-clone-1 GAGAGATGGACGGAATACGTTAAACGACATATGGTTCGGGGTATGGTGCT my-clone-2 foo #+END_EXAMPLE Windows and labels must be separed by one space. The first column of the file is the window to be followed while the remaining columns consist of the window's label. In Vidjil output, the labels are output alongside their windows. With the =-F= option, /only/ the labeld windows are kept. This allows to quickly filter a set of reads, looking for a known window, with the =-FaW = options: All the reads with this windows will be extracted to =out/seq/clone.fa-1=. ** Clone analysis: VDJ assignation and CDR3 detection The =-3= option launches a CDR3/JUNCTION detection based on the position of Cys104 and Phe118/Trp118 amino acids. This detection relies on alignment with gapped V and J sequences, as for instance, for V genes, IMGT/GENE-DB sequences. The advanced =-f= option sets the parameters used in the comparisons between the clone sequence and the V(D)J germline genes. The default values should work. The advanced =-m= option controls the minimum difference of positions between the end of the V and the start of the J. Note that it is even possible to set =-m -10= (meaning that V and J could overlap 10 bp). This is the default for VJ recombinations (except when using a =germlines.data= file). ** Further clustering (experimental) The following options are experimental and have no consequences on the =.vdj.fa= file, nor on the standard output. They instead add a =clusters= sections in the =.vidjil= file that will be visualized in the web application. The =-n= option triggers an automatic clustering using DBSCAN algorithm (Ester and al., 1996). Using =-n 5= usually cluster reads within a distance of 1 mismatch (default score being +1 for a match and -4 for a mismatch). However, more distant reads can also be clustered when there are more than 10 reads within the distance threshold. This behaviour can be controlled with the =-N= option. The =-E= option allows to specify a file for manually clustering two windows considered as similar. Such a file may be automatically produced by vidjil (=out/edges=), depending on the option provided. Only the two first columns (separed by one space) are important to vidjil, they only consist of the two windows that must be clustered. * Output ** Main output files The main output of Vidjil (with the default =-c clones= command) are two following files: - The =.vidjil= file is /the file for the Vidjil web application/. The file is in a =.json= format (detailed in [[file:format-analysis.org][format-analysis.org]]) describing the windows and their count, the representatives (=-y=), the detailed V(D)J and CDR3 designation (=-z=, see warning below), and possibly the results of the further clustering. The web application takes this =.vidjil= file (possibly merged with =fuse.py=) for the /visualization and analysis/ of clones and their tracking along different samples (for example time points in a MRD setup or in a immunological study). Please see [[file:browser.org][browser]].org for more information on the web application. - The =.vdj.fa= file is /a FASTA file for further processing by other bioinformatics tools/. The sequences are at least the windows (and their count in the headers) or the representatives (=-y=) when they have been computed. The headers include the count of each window, and further includes the detailed V(D)J and CDR3 designation (=-z=, see warning below), given in a '.vdj' format, see below. The further clustering is not output in this file. The =.vdj.fa= output enables to use Vidjil as a /filtering tool/, shrinking a large read set into a manageable number of (pre-)clones that will be deeply analyzed and possibly further clustered by other software. By default, the two output files are named =out/basename.vidjil= in =out/basename.vdj.fa=, where: - =out= is the directory where all the outputs are stored (can be changed with the =-o= option). - =basename= is the basename of the input =.fasta/.fastq= file (can be overriden with the =-b= option) ** Auxiliary output files The =out/basename.windows.fa= file contains the list of windows, with number of occurrences: #+BEGIN_EXAMPLE >8--window--1 TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG >5--window--2 CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC ATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGAC (...) #+END_EXAMPLE Windows of size 50 (modifiable by =-w=) have been extracted. The first window has 8 occurrences, the second window has 5 occurrences. The =out/seq/clone.fa-*= contains the detailed analysis by clone, with the window, the representative sequence, as well as with the most similar V, (D) and J germline genes: #+BEGIN_EXAMPLE >clone-001--IGH--0000008--0.0608%--window TATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTG >clone-001--IGH--0000008--0.0608%--lcl|FLN1FA001CPAUQ.1|-[105,232]-#2 - 128 bp (55% of 232.0 bp) + VDJ 0 54 73 84 85 127 IGHV3-23*05 6/ACCCGGGAGGAACAATAT/9 IGHD6-13*01 0//5 IGHJ4*02 IGH SEG_+ 1.946653e-19 1.352882e-19/5.937712e-20 GCTGTACCTGCAAATGAACAGCCTGCGAGCCGAGGACACGGCCACCTATTACTGT ACCCGGGAGGAACAATATAGCAGCTGGTAC TTTGACTTCTGGGGCCAGGGGATCCTGGTCACCGTCTCCTCAG >IGHV3-23*05 GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTTATAGCAGTGGTAGTAGCACATACTATGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA >IGHD6-13*01 GGGTATAGCAGCAGCTGGTAC >IGHJ4*02 ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG #+END_EXAMPLE The =-a= debug option further output in each =out/seq/clone.fa-*= files the full list of reads belonging to this clone. The =-a= option produces large files, and is not recommanded in general cases. ** Diversity measures Several [[https://en.wikipedia.org/wiki/Diversity_index][diversity indices]] are reported, both on the standard output and in the =.vidjil= file: - H (=index_H_entropy=): Shannon's diversity - E (=index_E_equitability=): Shannon's equitability - Ds (=index_Ds_diversity=): Simpson's diversity E ans Ds values are between 0 (no diversity, one clone gathers all analyzed reads) and 1 (full diversity, each analyzed read belongs to a different clone). These values are now computed on the windows, before any further clustering. PCR and sequencing errors can thus lead to slighlty over-estimate the diversity. ** Unsegmentation causes Vidjil output details statistics on the reads that are not segmented (not analyzed). Basically, *an unsegmented read is a read where Vidjil cannot identify a window at the junction of V and J genes*. To properly analyze a read, Vijdil needs that the sequence spans enough V region and J region (or, more generally, 5' region and 3' regions when looking for incomplete or unusual recombinations). The following unsegmentation causes are reported: | | | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG too short= | Reads are too short, shorter than the seed (by default between 9 and 13 bp). | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG strand= | The strand is mixed in the read, with some similarities both with the =+= and the =-= strand. | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG too few V/J= | No information has been found on the read: There are not enough similarities neither with a V gene or a J gene. | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG only V/5= | Relevant similarities have been found with some V, but not enough with any J. | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG only J/3= | Relevant similarities have been found with some J, but not enough with any V. | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG ambiguous= | Vidjil finds some V and J similarities mixed together which makes the situation ambiguous and hardly solvable. | |---------------------+---------------------------------------------------------------------------------------------------------------------| | =UNSEG too short w= | The junction can be identified but the read is too short so that Vidjil could extract the window (by default 50bp). | | | It often means the junction is very close from one end of the read. | |---------------------+---------------------------------------------------------------------------------------------------------------------| Some datasets may give reads with many low =UNSEG too few= reads: - =UNSEG too few V/J= usually happens when reads share almost nothing with the V(D)J region. This is expected when the PCR or capture-based approach included other regions, such as in whole RNA-seq. - =UNSEG only V/5= and =UNSEG only J/3= happen when reads do not span enough the junction zone. Vidjil detects a “window” including the CDR3. By default this window is 50bp long, so the read needs be that long centered on the junction. See [[http://git.vidjil.org/blob/master/doc/browser.org][browser.org]] for information on the biological or sequencing causes that can lead to few segmented reads. ** Filtering reads It is possible to extract all segmented or unsegmented reads, possibly to give them to other software. Runing Vidjil with =-U= gives a file =out/basename.unsegmented.vdj.fa=, with all segmented reads. On datasets generated with rather specific V(D)J primers, this is generally not recommended, as it may generate a large file. However, the =-U= option is very useful for whole RNA-Seq or capture datasets that contain few reads with V(D)J recombinations. Similarly, two options are available to get the unsegmented reads: - =-u= gives a file =out/basename.segmented.vdj.fa=, with unsegmented reads. - =-uu= gives a set of files =out/basename.UNSEG_*=, with unsegmented reads gathered by unsegmentation cause Again, as these options may generate large files, they are generally not recommended. However, they are very useful in some situations, especially to understand why some dataset gives poor segmentation result. For example =-uu -X 1000= splits the unsegemented reads from the 1000 first reads. ** Segmentation and .vdj format Vidjil output includes segmentation of V(D)J recombinations. This happens in the following situations: - in a first pass, when requested with =-U= option, in a =.segmented.vdj.fa= file. The goal of this ultra-fast segmentation, based on a seed heuristics, is only to identify the locus and to locate the w-window overlapping the CDR3. This should not be taken as a real V(D)J designation, as the center of the window may be shifted up to 15 bases from the actual center. - in a second pass, on the standard output and in both =.vidjil= and =.vdj.fa= files - at the end of the clones detection (default command =-c clones=, on a number of clones limited by the =-z= option) - or directly when explicitly requiring segmentation (=-c segment=) These V(D)J designations are obtained by full comparison (dynamic programming) with all germline sequences. Note that these designations are relatively slow to compute, especially for the IGH locus. However, they are not at the core of the Vidjil clone gathering method (which relies only on the 'window', see above). To check the quality of these designations, the automated test suite include sequences with manually curated V(D)J designations (see [[http://git.vidjil.org/blob/master/doc/should-vdj.org][should-vdj.org]]). Segmentations of V(D)J recombinations are displayed using a dedicated =.vdj= format. This format is compatible with FASTA format. A line starting with a > is of the following form: #+BEGIN_EXAMPLE >name + VDJ startV endV startD endD startJ endJ Vgene delV/N1/delD5' Dgene delD3'/N2/delJ Jgene comments name sequence name (include the number of occurrences in the read set and possibly other information) + strand on which the sequence is mapped VDJ type of segmentation (can be "VJ", "VDJ", or shorter tags such as "V" for incomplete sequences). The following line are for "VDJ" recombinations : startV endV start and end position of the V gene in the sequence (start at 0) startD endD ... of the D gene ... startJ endJ ... of the J gene ... Vgene name of the V gene delV number of deletions at the end (3') of the V N1 nucleotide sequence inserted between the V and the D delD5' number of deletions at the start (5') of the D Dgene name of the D gene being rearranged delD3' number of deletions at the end (3') of the D N2 nucleotide sequence inserted between the D and the J delJ number of deletions at the start (5') of the J Jgene name of the J gene being rearranged comments optional comments. In Vidjil, the following comments are now used: - "seed" when this comes for the first pass (.segmented.vdj.fa). See the warning above. - "!ov x" when there is an overlap of x bases between last V seed and first J seed - the name of the locus (TRA, TRB, TRG, TRD, IGH, IGL, IGK, possibly followed by a + for incomplete/unusual recombinations) #+END_EXAMPLE Following such a line, the nucleotide sequence may be given, giving in this case a valid FASTA file. For VJ recombinations the output is similar, the fields that are not applicable being removed: #+BEGIN_EXAMPLE >name + VJ startV endV startJ endJ Vgene delV/N1/delJ Jgene comments #+END_EXAMPLE * Examples of use All the following examples are on a IGH VDJ recombinations : they thus require either the =-G germline/IGH= option, or the multi-germline =-g germline= option. ** Basic usage: PCR-based datasets, with primers in the V(D)J regions (such as BIOMED-2 primers) #+BEGIN_SRC sh ./vidjil -G germline/IGH -3 data/Stanford_S22.fasta # Gather the reads into clones, based on windows overlapping IGH CDR3s. # Assign the VDJ genes and try to detect the CDR3 of each clone. # Summary of clones is available both on stdout, in out/Stanford_S22.vdj.fa and in out/Stanford_S22.vidjil. #+END_SRC #+BEGIN_SRC sh ./vidjil -g germline -i -2 -3 data/reads.fasta # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations # Gather the reads into clones, again based on windows overlapping the detected CDR3s. # Assign the VDJ genes and try to detect the CDR3 of each clone. # Summary of clones is available both on stdout, in out/reads.vdj.fa and in out/reads.vidjil. #+END_SRC ** Basic usage: Whole RNA-Seq or capture datasets #+BEGIN_SRC sh ./vidjil -g germline -i -2 -U data/reads.fasta # Detects for each read the best locus, including an analysis of incomplete/unusual and unexpected recombinations # Gather the reads into clones, again based on windows overlapping the detected CDR3s. # Assign the VDJ genes and try to detect the CDR3 of each clone. # The out/reads.segmented.vdj.fa include all reads where a V(D)J recombination was found #+END_SRC Typical whole RNA-Seq or capture datasets may be huge (several GB) but with only a (very) small portion of CDR3s. Using Vidjil with =-U= will create a =out/reads.segmented.vdj.fa= file that includes all reads where a V(D)J recombination (or an unexpected recombination, with =-2=) was found. This file will be relatively small (a few kB or MB) and can be taken again as an input for Vidjil or for other programs. ** Advanced usage #+BEGIN_SRC sh ./vidjil -c clones -G germline/IGH -r 1 ./data/clones_simul.fa # Extracts the windows with at least 1 read each (-r 1, the default being -r 5) # then gather them into clones #+END_SRC #+BEGIN_SRC sh ./vidjil -c clones -G germline/IGH -r 1 -n 5 ./data/clones_simul.fa # Window extraction + clone gathering, # with automatic clustering, distance five (-n 5) # The result of the automatic clustering is in the .vidjil file # and can been seen/edited in the web application. #+END_SRC #+BEGIN_SRC sh ./vidjil -c segment -G germline/IGH -3 data/segment_S22.fa # Detailed V(D)J designation and CDR3 detection on all reads, without clone gathering # (this is slow and should only be used for testing, or on a small file) #+END_SRC #+BEGIN_SRC sh ./vidjil -c germlines file.fastq # Output statistics on the number of occurrences of k-mers of the different germlines #+END_SRC