# Vidjil -- V(D)J recombinations analysis <http://bioinfo.lifl.fr/vidjil>
# Copyright (C) 2011, 2012, 2013, 2014 by Bonsai bioinformatics at LIFL (UMR CNRS 8022, Université Lille) and Inria Lille
# contact@vidjil.org


V(D)J recombinations in lymphocytes are essential for immunological
diversity. They are also useful markers of pathologies, and in
leukemia, are used to quantify the minimal residual disease during
patient follow-up.

Vidjil processes high-throughput sequencing data to extract V(D)J
junctions and gather them into clones. This analysis is based on a
seed heuristic and is fast and scalable because, in the first phase, no
alignment is performed against germline database sequences. Vidjil starts
from a set of reads and detects "windows" overlapping the actual CDR3.
This detection relies on a fast and reliable seed-based heuristic and makes
it possible to output the most abundant clones. Vidjil can also cluster
similar clones, or leave this to the user after a manual review.

The method is described in the following paper:

Mathieu Giraud, Mikaël Salson, et al.,
"Fast multiclonal clusterization of V(D)J recombinations from high-throughput sequencing",
(submitted)

Vidjil is open-source, released under the GNU GPLv3 license.


### Supported platforms

Vidjil has been successfully tested on the following platforms:
 - CentOS 6.3 amd64
 - CentOS 6.3 i386
 - Debian Squeeze 
 - Fedora 17
 - FreeBSD 9.1 amd64
 - NetBSD 6.0.1 amd64
 - Ubuntu 12.04 amd64
 - Ubuntu 12.04 i386


### Installation

make data
   # get some IGH rearrangements from a single individual, as described in:
   # Boyd, S. D., et al. Individual variation in the germline Ig gene
   # repertoire inferred from variable region gene rearrangements. J
   # Immunol, 184(12), 6986–92.

make germline
   # get IMGT germline databases (IMGT/GENE-DB) -- you have to agree to IMGT license: 
   # academic research only, provided that it is referred to IMGT®,
   # and cited as "IMGT®, the international ImMunoGeneTics information system® 
   # http://www.imgt.org (founder and director: Marie-Paule Lefranc, Montpellier, France). 
   # Lefranc, M.-P., IMGT®, the international ImMunoGeneTics database,
   # Nucl. Acids Res., 29, 207-209 (2001). PMID: 11125093

make                     # compile Vidjil
make test                # run self-tests

./vidjil -h              # display help/usage

### Optional dependencies

clustalw (to compute alignments between windows from the same clone, by setting
          very_detailed_cluster_analysis in vidjil.cpp)
neato (to display the graph of neighbors for the automatic clustering)

### Vidjil parameters

Launching vidjil with the -h option lists the parameters that can be
used.

### List of windows

Vidjil lets you specify a list of windows that must be followed
(even if those windows are 'rare', below the -r/-R/-% thresholds).
The -l option provides such a list through a file having
the following format: window label (separated by one space)

The first column of the file is the window to be followed
while the remaining columns consist of the window's label.
In Vidjil output, the labels are output alongside their windows.
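
As an illustration of the file layout described above, here is a minimal
Python sketch that reads such a list into a dictionary. It is not part of
Vidjil: the function name and the labels ("patient clone A", "control") are
hypothetical, and skipping empty lines is an assumption.

```python
def parse_window_labels(lines):
    """Map each followed window to its label.

    Assumes one entry per line: the window first, then one space,
    then the label (which may itself contain spaces).
    """
    labels = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: empty lines are ignored
        # Split on the first space only, so the label keeps its own spaces.
        window, _, label = line.partition(" ")
        labels[window] = label.strip()
    return labels


# Hypothetical content of a file given to -l (the labels are invented).
example = [
    "CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA patient clone A\n",
    "CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG control\n",
]
labels = parse_window_labels(example)
```

Splitting on the first space only mirrors the rule that the first column is
the window and everything after it is the label.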

### Manual clustering

The -e option specifies a file for manually clustering two windows
considered as similar. Such a file may be automatically produced by vidjil
(out/edges), depending on the options provided. Only the first two columns
(separated by one space) matter to vidjil: they consist of the
two windows that must be clustered.
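
The following Python sketch shows one way to read the window pairs from such
a file before editing it by hand. It is only an illustration: the function
name is hypothetical, the third column in the sample line is invented, and
splitting on any whitespace is more lenient than the single space described
above.

```python
def read_manual_clusters(lines):
    """Return the (window, window) pairs found in the first two columns.

    Any further columns are ignored, as only the first two matter.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: incomplete or empty lines are skipped
        pairs.append((fields[0], fields[1]))
    return pairs


# Hypothetical line: two windows followed by an invented extra column.
example = [
    "CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA "
    "CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG 3\n",
    "\n",
]
pairs = read_manual_clusters(example)
```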


### Examples of use

All the following examples deal with IGH VDJ recombinations: they thus
require the "-G germline/IGH" and the "-d" options.


./vidjil -G germline/IGH -d data/Stanford_S22.fasta
   # Extract (with an ultra-fast heuristic) all windows
   # Results are in out/segmented.vdj.fa, which is a FASTA file 
   # embedding heuristic information in the headers
   # ('.vdj' format, see warning below)
   # A summary of the windows is also available in out/data.json
   # ('.json' format, see below)

>8--window--1 
CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA
>5--window--2 
CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG
(...)

   Windows of size 60 (modifiable by -w) have been extracted.
   The first window has 8 occurrences, the second window has 5 occurrences.
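
The occurrence counts can be read from the headers with a few lines of
Python. This is a sketch based only on the excerpt above: it assumes headers
of the form ">count--window--number", reads the trailing number as the
window's index in the output (an interpretation, not something this README
states), and assumes each window sits on a single line.

```python
def read_window_counts(lines):
    """Return (count, index, window) tuples from a windows FASTA excerpt."""
    results = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            # Header such as '8--window--1': count, literal 'window', index.
            count, _, index = header.split("--")
            results.append((int(count), int(index), line))
            header = None
    return results


excerpt = [
    ">8--window--1 \n",
    "CACCTATTACTGTACCCGGGAGGAACAATATAGCAGCTGGTACTTTGACTTCTGGGGCCA\n",
    ">5--window--2 \n",
    "CTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTACTACTACTACATGGACGTCTG\n",
]
windows = read_window_counts(excerpt)
total_reads = sum(count for count, _, _ in windows)
```

On the two windows shown above, `total_reads` is 13 (8 + 5) and each window
is 60 bases long, matching the default window size.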

./vidjil -c clones -G germline/IGH -x -r 1 -R 1 -d ./data/clones_simul.fa
   # Extracts the windows (-r 1, with at least 1 read each), 
   # then gathers them into clones (-R 1, with at least 1 read each:
   # there are many 1-read clones due to sequencing errors).
   # A more natural option could be -R 5.
   # For debugging purposes, if one wants all the clones, use the option -A.
   # No representative selection (-x)
   # Results are on the standard output; additional files are
   # in out/segmented.fa, out/windows.fa-* and out/clones*
   # out/segmented.fa lists segmented reads using the .vdj format (see below)

./vidjil -c clones -G germline/IGH -x -r 1 -R 5 -n 5 -d ./data/clones_simul.fa
   # Window extraction + clone gathering,
   # with automatic clustering, distance five (-n 5)

./vidjil -c segment -G germline/IGH -d data/segment_S22.fa
   # Segment the reads onto VDJ germline (see warning below)


### Segmentation and .vdj format

Vidjil outputs include segmentation of V(D)J recombinations. This happens
in the following situations:

- in a first pass, in the 'segmented.vdj.fa' file.

      The goal of this ultra-fast segmentation, based on a seed
      heuristic, is only to locate the w-window overlapping the
      CDR3. This should not be taken as a real V(D)J segmentation, as
      the center of the window may be shifted by up to 15 bases from the
      actual center.

- in a second pass, on the standard output, at the end of the clones detection
  (-c clones), or directly when explicitly requiring segmentation (-c segment)

      This segmentation is obtained by full comparison (dynamic
      programming) with all germline sequences. Such segmentations are
      not at the core of the Vidjil clone gathering method (which
      relies only on the 'window', see above). They are provided only
      for convenience and should be checked with other software such
      as IgBlast, iHMMune-align or IMGT/V-QUEST.

Segmentations of V(D)J recombinations are displayed using a dedicated
.vdj format. This format is compatible with FASTA format. A line starting
with a > is of the following form:

>name + VDJ  startV endV   startD endD   startJ  endJ   Vgene   delV/N1/delD5'   Dgene   delD3'/N2/delJ   Jgene   comments

        name          sequence name
        +             strand on which the sequence is mapped
        VDJ           type of segmentation (can be "VJ", "VDJ",
                      or shorter tags such as "V" for incomplete sequences).
                      The following lines are for "VDJ" recombinations:

        startV endV   start and end position of the V gene in the sequence (start at 0)
        startD endD                      ... of the D gene ...
        startJ endJ                      ... of the J gene ...

        Vgene         name of the V gene 

        delV          number of deletions at the end (3') of the V
        N1            nucleotide sequence inserted between the V and the D
        delD5'        number of deletions at the start (5') of the D

        Dgene         name of the D gene being rearranged

        delD3'        number of deletions at the end (3') of the D
        N2            nucleotide sequence inserted between the D and the J
        delJ          number of deletions at the start (5') of the J

        Jgene         name of the J gene being rearranged
        
        comments      optional comments. In Vidjil, the following comments are currently used:
                      - "seed" when this comes from the first pass (segmented.vdj.fa). See the warning above.
                      - "!ov x" when there is an overlap of x bases between the last V seed and the first J seed

The nucleotide sequence may follow such a line, which then yields
a valid FASTA file.

For VJ recombinations the output is similar, the fields that are not
applicable being removed:
>name + VJ  startV endV   startJ endJ   Vgene   delV/N1/delJ   Jgene  comments
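
To make the header layouts above concrete, here is a minimal Python sketch
that splits a VDJ or VJ header line into named fields. It is not Vidjil code.
It assumes that fields are separated by whitespace, that no field contains
whitespace except the trailing comments, and that an empty insertion appears
as two consecutive slashes. The two sample headers (read names, positions,
genes and deletions) are invented for illustration.

```python
def parse_vdj_header(header):
    """Split a '.vdj' header line into a dictionary of fields.

    Handles the "VDJ" and "VJ" layouts; for any other type tag the
    remaining tokens are returned unparsed under the key "raw".
    """
    tokens = header.lstrip(">").split()
    record = {"name": tokens[0], "strand": tokens[1], "type": tokens[2]}
    rest = tokens[3:]
    if record["type"] == "VDJ":
        genes = ("V", "D", "J")
    elif record["type"] == "VJ":
        genes = ("V", "J")
    else:
        record["raw"] = rest
        return record

    n = len(genes)
    # First come the start/end positions, one pair per gene.
    positions = [int(value) for value in rest[:2 * n]]
    for i, gene in enumerate(genes):
        record["start" + gene] = positions[2 * i]
        record["end" + gene] = positions[2 * i + 1]

    # Then gene names alternate with 'deletions/insertion/deletions' fields.
    tail = rest[2 * n:]
    junctions = []
    for i, gene in enumerate(genes):
        record[gene + "gene"] = tail[2 * i]
        if i < n - 1:
            left, inserted, right = tail[2 * i + 1].split("/")
            junctions.append((int(left), inserted, int(right)))
    record["junctions"] = junctions
    # Whatever remains is the optional comments field.
    record["comments"] = " ".join(tail[2 * n - 1:])
    return record


# Invented headers, for illustration only.
vdj = parse_vdj_header(
    ">read42 + VDJ 0 120 125 140 150 200 "
    "IGHV3-23*01 2/ACGT/3 IGHD3-10*01 1/GG/0 IGHJ4*02 seed"
)
vj = parse_vdj_header(">read7 + VJ 0 100 110 160 TRGV2*01 4//1 TRGJ1*01")
```

Each junction is kept as a tuple (deletions on the left gene, inserted
nucleotides, deletions on the right gene), so a VDJ record carries two
junctions and a VJ record carries one.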


### vidjil.data .json format and web interface

A summary of extracted windows is also available in a .json format,
including, for each window, the number of reads sharing this window.
This file is currently used for development purposes, and its format may
change in future releases of Vidjil.

This file will be used by a dynamic web application for the visualization
and analysis of clones and their tracking across different samples
(for example, time points in an MRD setup or in an immunological study).
This application is currently in development and will be released in
Q4 2014. However, the source code can already be accessed at
http://www.vidjil.org/git.  Please contact us (contact@vidjil.org) if
you would like to have access to the web server.