#+TITLE: Vidjil -- Algo Manual
#+AUTHOR: The Vidjil team (Mathieu, Mikaël and Marc)

# Vidjil -- V(D)J recombinations analysis -- [[http://www.vidjil.org]]
# Copyright (C) 2011, 2012, 2013, 2014 by Bonsai bioinformatics at LIFL (UMR CNRS 8022, Université Lille) and Inria Lille
# contact@vidjil.org

V(D)J recombinations in lymphocytes are essential for immunological
diversity. They are also useful markers of pathologies and, in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.
Vidjil processes high-throughput sequencing data to extract V(D)J
junctions and gather them into clones. Vidjil starts
from a set of reads and detects "windows" overlapping the actual CDR3.
This detection relies on a fast and reliable seed-based heuristic and makes it
possible to output the most abundant clones. The analysis is extremely fast
because, in the first phase, no alignment is performed against
germline database sequences. Vidjil can also cluster similar clones, or leave
this to the user after a manual review.

The method is described in the following paper:

Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
BMC Genomics 2014, 15:409
http://dx.doi.org/10.1186/1471-2164-15-409

Vidjil is open-source, released under the GNU GPLv3 license.

* Supported platforms

Vidjil has been successfully tested on the following platforms:
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
 - Debian Squeeze
 - Fedora 17
 - FreeBSD 9.1 amd64
 - NetBSD 6.0.1 amd64
 - Ubuntu 12.04 amd64
 - Ubuntu 12.04 i386

* Installation

#+BEGIN_SRC sh
make data
   # get some IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., et al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements.
   # J Immunol, 184(12), 6986–92.

make germline
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to the IMGT license:
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system®
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France).
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

make
   # compile Vidjil

make test
   # run self-tests

./vidjil -h
   # display help/usage
#+END_SRC

* Vidjil parameters

Launching vidjil with the =-h= option lists the parameters that can be used.

** Main algorithm parameters

#+BEGIN_EXAMPLE
Window prediction
  (use either -s or -k option, but not both)
  -s       spaced seed used for the V/J affectation
           (default: #####-#####, ######-######, #######-#######, depends on germline)
  -k       k-mer size used for the V/J affectation (default: 10, 12, 13, depends on germline)
           (using -k option is equivalent to set with -s a contiguous seed with only '#' characters)
  -w       w-mer size used for the length of the extracted window (default: 40)(default with -d: 60)
#+END_EXAMPLE

The =-s= and =-k= options are the options of the seed heuristic. A detailed
explanation can be found in the paper. More help on that will be available in the following months.
The default values should work.

The =-w= option fixes the size of the "window" that is the main
identifier to gather clones. The default values (40 for TRG, 60 for
IGH) were selected to ensure high-quality clone gathering.

The high-throughput heuristic predicts the center of
the "window", which may be shifted by a few bases from the actual
"center" of the CDR3 (for TRG, less than 15 bases compared to the IMGT/V-QUEST or IgBlast
prediction in >99% of cases). The extracted window should be large enough to
fully contain the CDR3 as well as some part of the end of the V and
the start of the J to uniquely identify a clone.

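The role of =-w= can be illustrated with a minimal Python sketch. It
assumes (this is an assumption of the sketch, not a description of the
Vidjil source code) that the window is taken symmetrically around the
predicted center and that a read is left out when the window does not fit
inside it. The helper =extract_window= is hypothetical.

#+BEGIN_SRC python
def extract_window(read, center, w):
    """Return the w-base window around `center`, or None if it does not fit.

    Assumption of this sketch: `center` is a 0-based position in `read`
    and the window starts w // 2 bases before it.
    """
    start = center - w // 2
    end = start + w
    if start < 0 or end > len(read):
        return None  # the read is too short around the predicted center
    return read[start:end]


read = "ACGT" * 20                      # an 80-base toy read
print(extract_window(read, 40, 60))     # the 60-base window fits
print(extract_window(read, 25, 60))     # does not fit: starts before the read
print(extract_window(read, 25, 40))     # a 40-base window fits at the same center
#+END_SRC

Under these assumptions, a smaller window fits in reads whose predicted
center lies close to one end, at the price of a shorter clone identifier.
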
Setting =-w= to 30 for TRG and 50 for IGH may "segment" (analyze) a few more reads, but
may in some rare cases falsely cluster reads from different
clones.
Setting =-w= to lower values is not recommended.

** Threshold on clone output

The following options control how many clones are output and analyzed.

#+BEGIN_EXAMPLE
Limits to report a clone (or a window)
  -r       minimal number of reads supporting a clone (default: 10)
  -%       minimal percentage of reads supporting a clone (default: 0)

Limits to further analyze some clones
  -y       maximal number of clones computed with a representative ('all': no limit) (default: 100)
  -z       maximal number of clones to be segmented ('all': no limit, do not use) (default: 20)
  -A       reports and segments all clones (-r 1 -% 0 -y all -z all), to be used only on very small datasets
#+END_EXAMPLE

The =-r/-%= options are strong thresholds: if a clone does not have
the requested number of reads, the clone is discarded (except when
using =-l=, see below).
The default =-r 10= option is meant to only output clones that
have a significant read support. *You should use* =-r 1= *if you
want to detect all clones starting from the first read* (especially for
MRD detection).

The =-y= option limits the number of clones for which a representative
sequence is computed. Usually you do not need to have more
representatives (see below), but you can safely put =-y all= if you want
to compute all representative sequences.

The =-z= option limits the number of clones that are fully analyzed,
/with their V(D)J segmentation/, in particular to enable the browser
to display the clones on the grid (otherwise they are displayed on the
'?/?' axis).
It should be smaller than =-y=.
If you want to analyze more clones, you should use =-z 50= or
=-z 100=. It is not recommended to use larger values: outputting more
than 100 clones is often not useful since they can't be visualized easily
in the browser, and it takes a large computation time.

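How these limits combine can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function
=select_clones= is hypothetical, the percentage is computed over the sum
of the read counts given as input, and windows forced with =-l= are not
modelled.

#+BEGIN_SRC python
def select_clones(window_counts, r=10, percent=0.0, y=100, z=20):
    """Apply -r/-% style thresholds, then rank clones by read support.

    Assumption of this sketch: `window_counts` maps each window to its
    number of reads, and percentages are relative to the sum of the counts.
    """
    total = sum(window_counts.values())
    kept = [
        (window, count)
        for window, count in window_counts.items()
        if count >= r and 100.0 * count / total >= percent
    ]
    # Most abundant first; ties are broken by window sequence.
    kept.sort(key=lambda item: (-item[1], item[0]))
    reported = [window for window, _ in kept]
    return {
        "reported": reported,                 # passes -r and -%
        "with_representative": reported[:y],  # limited by -y
        "segmented": reported[:z],            # limited by -z
    }


counts = {"AAAA": 50, "CCCC": 12, "GGGG": 3, "TTTT": 1}
print(select_clones(counts, r=10, y=2, z=1))
print(select_clones(counts, r=1)["reported"])
#+END_SRC

In this toy input, =r=10= discards the two rarest windows, whereas =r=1=
keeps every window, which mirrors the advice given above for MRD detection.
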
Note that even if a clone is not in the top 20 (or 50, or 100) but
still passes the =-r= and =-%= thresholds, it is still reported in the .data
file. If the clone is in the top 20 (or 50, or 100) at some other MRD point,
it will be fully analyzed/segmented at this other point (and then
collected by the =fuse.py= script, using representatives computed at this
other point, and then, in the browser, correctly displayed on the grid).
*It is thus advised to leave the default* =-y 100 -z 20= *options
for the majority of uses.*

The =-A= option disables all these thresholds. This option should be
used only for test and debug purposes, on very small datasets: it
produces large files and takes huge computation times.

** Force to follow some sequences

Vidjil allows specifying a list of windows that must be followed (even if those
windows are 'rare', below the =-r/-R/-%= thresholds).
The =-l= parameter provides such a list in a file with the following format:
=window label= (separated by one space).

The first column of the file is the window to be followed,
while the remaining columns form the window's label.
In Vidjil output, the labels are output alongside their windows.

** Further clustering

These options have no consequences on the visualization through the browser.
They are intended for command-line use only.

Setting the =-n= option triggers an additional automatic clustering
using the DBSCAN algorithm (Ester et al., 1996).

The =-e= option allows specifying a file for manually clustering two windows
considered as similar. Such a file may be automatically produced by vidjil
(out/edges), depending on the options provided. Only the first two columns
(separated by one space) matter to vidjil: they consist of
the two windows that must be clustered.

* Examples of use

All the following examples are on IGH VDJ recombinations: they thus
require the =-G germline/IGH= and the =-d= options.

#+BEGIN_SRC sh
./vidjil -G germline/IGH -d data/Stanford_S22.fasta
   # Extract (with an ultra-fast heuristic) all windows
   # Summary of windows is available in out/vidjil.data
   # ('.data' format, see below)
   # To have detailed/debug results in out/segmented.vdj.fa
   # (which is a FASTA file embedding heuristic information in the headers, '.vdj' format, see warning below)
   # run Vidjil with option '-U'
#+END_SRC

#+BEGIN_EXAMPLE
>8--window--1
CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA
>5--window--2
CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG
(...)
#+END_EXAMPLE

Windows of size 60 (modifiable by =-w=) have been extracted.
The first window has 8 occurrences, the second window has 5 occurrences.

#+BEGIN_SRC sh
./vidjil -c clones -G germline/IGH -r 1 -R 1 -d ./data/clones_simul.fa
   # Extracts the windows (-r 1, with at least 1 read each),
   # then gathers them into clones
   # A more natural option could be -r 5.
   # For debug purposes, if one wants all the clones, use the option -A.
   # Results are available
   #   - on the standard output
   #   - in out/clones.vdj.fa (fasta file to be processed by other tools)
   #   - in out/vidjil.data (for the browser)
   # Additional files are in out/seq/windows.fa-* and out/seq/clone.fa-*
   # If one adds the '-U' option, an additional out/segmented.vdj.fa file is produced,
   # listing segmented reads using the .vdj format (see below)
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c clones -G germline/IGH -r 1 -n 5 -d ./data/clones_simul.fa
   # Window extraction + clone gathering,
   # with automatic clustering, distance five (-n 5)
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c segment -G germline/IGH -d data/segment_S22.fa
   # Segment the reads onto VDJ germline
   # (this is slow and should only be used for testing)
#+END_SRC

#+BEGIN_SRC sh
./vidjil -c germlines file.fastq
   # Search for all the germlines and output statistics
   # on the number of occurrences in each germline
#+END_SRC

* Segmentation and .vdj format

Vidjil output includes segmentation of V(D)J recombinations.
This happens in the following situations:

- in a first pass, when requested with the =-U= option, in the =segmented.vdj.fa= file.

  The goal of this ultra-fast segmentation, based on a seed
  heuristic, is only to locate the w-window overlapping the
  CDR3. This should not be taken as a real V(D)J segmentation, as
  the center of the window may be shifted up to 15 bases from the
  actual center.

- in a second pass, on the standard output
  - at the end of the clones detection (=-c clones=, also in =clones.vdj.fa=)
  - or directly when explicitly requiring segmentation (=-c segment=)

  This segmentation is obtained by full comparison (dynamic programming)
  with all germline sequences.

  Such segmentations are not at the core of the Vidjil clone gathering
  method (which relies only on the 'window', see above).
  They are provided only for convenience and should be checked with
  other software such as IgBlast, iHMMune-align or IMGT/V-QUEST.

Segmentations of V(D)J recombinations are displayed using a dedicated
.vdj format. This format is compatible with the FASTA format. A line starting
with a > is of the following form:

#+BEGIN_EXAMPLE
>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene  comments

        name          sequence name
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ",
                      or shorter tags such as "V" for incomplete sequences).
                      The following lines are for "VDJ" recombinations:

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged

        comments      optional comments. In Vidjil, the following comments are now used:
                      - "seed" when this comes from the first pass (segmented.vdj.fa). See the warning above.
                      - "!ov x" when there is an overlap of x bases between last V seed and first J seed
#+END_EXAMPLE

Following such a line, the nucleotide sequence may be given,
giving in this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:

=>name + VJ startV endV startJ endJ Vgene delV/N1/delJ Jgene comments=

* vidjil.data .json format and web interface

A summary of extracted windows is also available in a JSON format,
including, for each window, the number of reads sharing this window.
The format of this file may change in future releases.

This file is used by the dynamic browser for visualization and
analysis of clones and their tracking along different samples
(for example time points in an MRD setup or in an immunological study).
Please see the file [[file:browser.org][browser.org]] for more information on the browser.
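
Coming back to the .vdj header lines described in the segmentation section,
the following Python sketch shows how such a header could be read by a
downstream script. It is only an illustration: =parse_vdj_header= is a
hypothetical helper, the fields are assumed to be separated by whitespace
in the order listed in that section, gene names are assumed to contain no
space, and only the "VDJ" and "VJ" types are handled. The two headers in
the usage lines are made up for the example and do not come from Vidjil
output.

#+BEGIN_SRC python
VDJ_POSITIONS = ("startV", "endV", "startD", "endD", "startJ", "endJ")
VJ_POSITIONS = ("startV", "endV", "startJ", "endJ")


def parse_vdj_header(line):
    """Parse a '.vdj' header line into a dict (sketch, VDJ and VJ only)."""
    fields = line.strip().lstrip(">").split()
    record = {"name": fields[0], "strand": fields[1], "type": fields[2]}
    rest = fields[3:]
    if record["type"] == "VDJ":
        positions, rest = rest[:6], rest[6:]
        record.update(zip(VDJ_POSITIONS, map(int, positions)))
        (record["Vgene"], junction_vd, record["Dgene"],
         junction_dj, record["Jgene"]) = rest[:5]
        del_v, record["N1"], del_d5 = junction_vd.split("/")
        del_d3, record["N2"], del_j = junction_dj.split("/")
        record.update(delV=int(del_v), delD5=int(del_d5),
                      delD3=int(del_d3), delJ=int(del_j))
        comments = rest[5:]
    elif record["type"] == "VJ":
        positions, rest = rest[:4], rest[4:]
        record.update(zip(VJ_POSITIONS, map(int, positions)))
        record["Vgene"], junction_vj, record["Jgene"] = rest[:3]
        del_v, record["N1"], del_j = junction_vj.split("/")
        record.update(delV=int(del_v), delJ=int(del_j))
        comments = rest[3:]
    else:
        raise ValueError("unsupported segmentation type: " + record["type"])
    record["comments"] = " ".join(comments)
    return record


# Made-up headers, for illustration only.
vdj = parse_vdj_header(
    ">read1 + VDJ 0 100 105 115 120 170 IGHV1-2*02 2/ACG/3 IGHD3-10*01 1/TT/4 IGHJ4*02 seed"
)
vj = parse_vdj_header(">read2 - VJ 0 90 95 140 TRGV2*01 3/CC/5 TRGJ1*01")
print(vdj["Vgene"], vdj["N1"], vdj["comments"])
print(vj["Jgene"], vj["delJ"])
#+END_SRC

Keeping the deletion counts as integers and the inserted N sequences as
strings makes the record directly usable for simple statistics on junctions.
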