# Vidjil -- V(D)J recombinations analysis <http://bioinfo.lifl.fr/vidjil>
# Copyright (C) 2011, 2012, 2013 by Bonsai bioinformatics at LIFL (UMR CNRS 8022, Université Lille) and Inria Lille
# Contact: mathieu.giraud@lifl.fr, mikael.salson@lifl.fr

V(D)J recombinations in lymphocytes are key to immunologic diversity
as they have an influence on the production of antibodies and antigen
receptors. They are also useful markers for pathologies: in many
cases, such a clonality marker is used for patient follow-up to
quantify the minimal residual disease in leukemias.  

Vidjil processes high-throughput sequencing data to extract V(D)J
junctions and gather them into clones. This analysis is based on a
seed heuristic and is fast and scalable because, in the first phase, no
alignment is done with database germline sequences.  Starting from a
set of reads, Vidjil detects the junctions in each read. This detection
relies on an ultra-fast seed-based heuristic, which has been proven to be
reliable. Vidjil then outputs the most abundant clones, based on their
junctions.

Vidjil can also cluster similar clones, or leave this to the user
after a manual review. The method is described in the paper referenced
below.

Vidjil is open-source, released under the GNU GPLv3 license.


### Supported platforms

Vidjil has been successfully tested on the following platforms:
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
 - Debian Squeeze 
 - Fedora 17
 - FreeBSD 9.1 amd64
 - NetBSD 6.0.1 amd64
 - Ubuntu 12.04 amd64
 - Ubuntu 12.04 i386


### Installation

make data
   # get some IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., et al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

make germline
   # get IMGT germline databases -- you have to agree to IMGT license: 
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

make                     # compile Vidjil
make test                # run self-tests

./vidjil -h              # display help/usage

### Optional dependencies

clustalw (to compute alignments between junctions from the same clone)
neato (to display the graph of neighbors for the automatic clustering)

### Vidjil parameters

Launching vidjil with the -h option lists the parameters that can be
used.

### List of junctions

Vidjil lets you specify a list of junctions that must be followed
(even if those junctions are 'rare', below the -r/-R/-% thresholds).
The -l parameter takes such a list in a file having
the following format: junction label (separated by one space)

The first column of the file is the junction to be followed
while the remaining columns consist of the junction's label.
In Vidjil output, the labels are output alongside their junctions.
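
As an illustration of the file format described above, here is a minimal
Python sketch that reads such a list. It is not part of Vidjil: the function
name parse_junction_labels and the example labels are hypothetical, and the
sketch only assumes what is stated above (first column is the junction, the
rest of the line is its label).

```python
def parse_junction_labels(lines):
    """Map each junction (first column) to its label (rest of the line)."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first space only: the label may itself contain spaces.
        junction, _, label = line.partition(" ")
        labels[junction] = label.strip()
    return labels


# Hypothetical file content: the junctions come from the examples in this
# README, the labels are made up for illustration.
example_lines = [
    "TTGTAGTGGTGGTAGCTGCTACTCCTTTGACTACTGGGGC clone A diagnosis",
    "TGTAGTGGTGGTAGCTGTTACTCCCACGTCTGGGGCCAAG clone B",
]
labels = parse_junction_labels(example_lines)
```

Splitting on the first space only keeps multi-word labels intact, which
matches the "remaining columns" wording above.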

### Manual clustering

The -e option lets you specify a file for manually clustering two junctions
considered as similar. Such a file may be automatically produced by vidjil
(out/edges), depending on the options provided. Only the first two columns
(separated by one space) matter to vidjil: they are the
two junctions that must be clustered.
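
The following Python sketch shows one way to read the pairs from such a
file. It is an illustration only, not Vidjil code: the function name
read_cluster_pairs is hypothetical, and the third column in the example is
an invented placeholder, since this README does not describe the extra
columns of out/edges.

```python
def read_cluster_pairs(lines):
    """Return (junction_a, junction_b) pairs from the first two columns."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        # Any further columns are ignored, as only the first two matter.
        pairs.append((fields[0], fields[1]))
    return pairs


# Hypothetical content: two junctions to cluster, plus an invented extra column.
example_lines = [
    "TTGTAGTGGTGGTAGCTGCTACTCCTTTGACTACTGGGGC TGTAGTGGTGGTAGCTGTTACTCCCACGTCTGGGGCCAAG 3",
    "",
]
pairs = read_cluster_pairs(example_lines)
```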


### Examples of use

All the following examples deal with IGH VDJ recombinations: they thus
require the "-G germline/IGH" and the "-d" options.


./vidjil -G germline/IGH -d data/Stanford_S22.fasta
   # Extract (with an ultra-fast heuristic) all junctions
   # Results are in out/segmented.vdj.fa, which is a FASTA file 
   # embedding segmentation information in the headers
   # ('.vdj' format, see below)

>5--junction--1 
TTGTAGTGGTGGTAGCTGCTACTCCTTTGACTACTGGGGC
>5--junction--2 
TGTAGTGGTGGTAGCTGTTACTCCCACGTCTGGGGCCAAG
(...)

   Junctions of size 40 (modifiable by -w) have been extracted.
   Each of these two junctions has 5 occurrences in the set of reads.
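
A short Python sketch for reading such an output is given below. It is not
part of Vidjil. It assumes, based only on the example above, that the
number before the first "--" in each header is the number of occurrences;
this is an inference, not a documented rule. The function name
read_junctions is hypothetical.

```python
def read_junctions(lines):
    """Return (header, count, sequence) tuples from junction FASTA lines."""
    records = []
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
        elif header is not None and line:
            # Assumption: the text before the first "--" is the read count.
            count_text = header.split("--")[0]
            count = int(count_text) if count_text.isdigit() else None
            records.append((header, count, line))
            header = None  # one sequence line per header in this sketch
    return records


example_lines = [
    ">5--junction--1 ",
    "TTGTAGTGGTGGTAGCTGCTACTCCTTTGACTACTGGGGC",
    ">5--junction--2 ",
    "TGTAGTGGTGGTAGCTGTTACTCCCACGTCTGGGGCCAAG",
]
records = read_junctions(example_lines)
```

Each sequence in this example is 40 nucleotides long, matching the default
junction size mentioned above.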

./vidjil -c clones -G germline/IGH -x -r 1 -R 1 -d ./data/clones_simul.fa
   # Extracts the junctions (-r 1, with at least 1 read each), 
   # then gathers them into clones (-R 1, with at least 1 read each:
   # there are many 1-read clones due to sequencing errors.) 
   # A more natural option could be -R 5.
   # No representative selection / clustalw postprocessing (-x)
   # Results are in out/segmented.fa, out/junctions.fa-* and out/clones*
   # out/segmented.fa lists segmented reads using the .vdj format (see below)

./vidjil -c clones -G germline/IGH -x -r 1 -R 5 -n 5 -d ./data/clones_simul.fa
   # Junction extraction + clone gathering,
   # with automatic clustering, distance five (-n 5)

./vidjil -c segment -G germline/IGH -d data/segment_S22.fa
   # Segment the reads onto VDJ germline using a full comparison 
   # (dynamic programming) with all sequences.
   # The output is displayed in .vdj format (see below)


### .vdj format

Segmentations of V(D)J recombinations are displayed using a dedicated
format. This format is compatible with FASTA format. A line starting
with a > is of the following form:

>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene

        name          sequence name
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ",
                      or shorter tags such as "V" for incomplete sequences).
                      The following lines are for "VDJ" recombinations:

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
        

Following such a line, the nucleotide sequence may be given, giving in
this case a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene
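
To make the field layout concrete, here is a minimal Python sketch that
splits a .vdj header line into named fields. It is an illustration, not
Vidjil code. It assumes that fields are separated by whitespace, that the
sequence name contains no whitespace, and that deletion counts are
integers. The function name parse_vdj_header and the example header values
are hypothetical.

```python
def parse_vdj_header(header):
    """Split a '.vdj' header line into named fields (VDJ and VJ types only)."""
    fields = header.lstrip(">").split()
    if len(fields) < 3:
        raise ValueError("expected at least: name, strand, type")
    name, strand, kind = fields[:3]
    rest = fields[3:]
    record = {"name": name, "strand": strand, "type": kind}

    if kind == "VDJ" and len(rest) == 11:
        pos = [int(x) for x in rest[:6]]
        v_gene, v_d, d_gene, d_j, j_gene = rest[6:]
        del_v, n1, del_d5 = v_d.split("/")    # delV/N1/delD5'
        del_d3, n2, del_j = d_j.split("/")    # delD3'/N2/delJ
        record.update(
            startV=pos[0], endV=pos[1], startD=pos[2], endD=pos[3],
            startJ=pos[4], endJ=pos[5],
            Vgene=v_gene, delV=int(del_v), N1=n1, delD5=int(del_d5),
            Dgene=d_gene, delD3=int(del_d3), N2=n2, delJ=int(del_j),
            Jgene=j_gene,
        )
    elif kind == "VJ" and len(rest) == 7:
        pos = [int(x) for x in rest[:4]]
        v_gene, v_j, j_gene = rest[4:]
        del_v, n1, del_j = v_j.split("/")     # delV/N1/delJ
        record.update(
            startV=pos[0], endV=pos[1], startJ=pos[2], endJ=pos[3],
            Vgene=v_gene, delV=int(del_v), N1=n1, delJ=int(del_j),
            Jgene=j_gene,
        )
    else:
        # Shorter tags (such as "V") are not detailed here: keep them raw.
        record["unparsed"] = rest
    return record


# Hypothetical header lines, made up for illustration only.
vdj = parse_vdj_header(
    ">read1 + VDJ 0 100 105 115 120 160 IGHV3-23*01 2/ACGT/3 IGHD3-10*01 1/GG/4 IGHJ4*02"
)
vj = parse_vdj_header(">read2 + VJ 0 90 95 140 IGHV1-2*02 0//5 IGHJ6*02")
```

Splitting the two slash-separated fields on "/" keeps an empty insertion
(as in the second example) as an empty string, under the assumption that
an empty N region is written with nothing between the slashes.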