#+TITLE: Encoding clones with V(D)J recombinations (2016b)
#+AUTHOR: The Vidjil team
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="org-mode.css" />

The following [[http://en.wikipedia.org/wiki/JSON][.json]] format encodes
a set of clones with V(D)J immune recombinations,
possibly with user annotations.

In Vidjil, this format is used by both the =.analysis= and the =.vidjil= files.
The =.vidjil= file holds the actual data on clones, usually produced by processing reads with some RepSeq software.
It can reach megabytes, or even more (for example with detailed information on the 100 or 1000 top clones).
The =.analysis= file describes customizations done by the user
(or by some automatic pre-processing) on the Vidjil web application. The web application
can load or save such files (possibly from/to the patient/sample database).
It is intended to be very small (a few kilobytes).
All settings in the =.analysis= file override the settings that could be
present in the =.vidjil= file.
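
As a rough illustration of this override rule, the sketch below overlays the clones of an =.analysis= file on those of a =.vidjil= file, matching them by =id=. The function name =overlay_analysis= and the field-by-field merge policy are assumptions made for this example; the web application may combine the two files differently.

#+BEGIN_SRC python
import copy


def overlay_analysis(vidjil, analysis):
    """Return a copy of `vidjil` where fields set in `analysis` take precedence.

    Hypothetical helper: clones are matched by "id", and every field present
    in the .analysis clone replaces the one from the .vidjil clone.
    Clones that only exist in the .analysis data are left out of this sketch.
    """
    merged = copy.deepcopy(vidjil)
    by_id = {clone["id"]: clone for clone in merged.get("clones", [])}
    for custom in analysis.get("clones", []):
        target = by_id.get(custom["id"])
        if target is not None:
            target.update(custom)
    return merged


vidjil = {"clones": [{"id": "clone-001",
                      "name": "TRGV5*01 5/CC/0 TRGJ1*02",
                      "reads": [243241]}]}
analysis = {"clones": [{"id": "clone-001", "name": "Main ALL clone", "tag": "0"}]}
merged = overlay_analysis(vidjil, analysis)
#+END_SRC

Here the custom name and tag come from the =.analysis= data, while the read count is still the one of the =.vidjil= data.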
* What is a clone ?

There are several definitions of what may be a clonotype,
depending on different RepSeq software or studies.
This format accepts any kind of definition:
clones are identified by an =id= string that may be an arbitrary identifier such as =clone-072a=.
Software computing clones may choose some relevant identifiers:
 - =CGAGAGGTTACTATGATAGTAGTGGTTATTACGGGGTAGGGCAGTACTAC=, Vidjil algorithm, 50 nt window centered on the CDR3
 - =CARPRDWNTYYYYGMDVW=, a CDR3 AA sequence
 - =CARPRDWNTYYYYGMDVW IGHV3-11*00 IGHJ6*00=, a CDR3 AA sequence with additional V/J gene information (MiXCR)
 - the 'clone sequence' as computed by ARReST in =.clntab= files (processed by =fuse.py=)
 - see also 'IMGT clonotype (AA) or (nt)'

* Examples

** =.vidjil= file -- one sample

This is an almost minimal =.vidjil= file, describing clones in one sample.
The =seg= element is optional: clones without =seg= elements will be shown on the grid with '?/?'.
The =_average_read_length= is also optional, but allows plotting GENSCAN-like plots more precisely than when only the length of the sequence is known.
All other elements are required. The =reads.germline= list can have only one element in the case of data on a unique locus.
There is here one clone on the =TRG= locus with a designation =TRGV5*01 5/CC/0 TRGJ1*02=.
Note that other elements could be added by some program (such as =tag=, to identify some clones,
or =clusters=, to further cluster some clones, see below).

#+BEGIN_SRC js :tangle analysis-example1.vidjil
    {
        "producer": "program xyz version xyz",
        "timestamp": "2014-10-01 12:00:11",
        "vidjil_json_version": "2016b",

        "samples": {
             "number": 1,
             "original_names": ["T8045-BC081-Diag.fastq"]
        },

        "reads" : {
            "total" :           [ 437164 ] ,
            "segmented" :       [ 335662 ] ,
            "germline" : {
                "TRG" :         [ 250000 ] ,
                "IGH" :         [ 85662  ]
            }
        },

        "clones": [
            {
                "id": "clone-001",
                "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
                "reads" : [ 243241 ],
                "_average_read_length": [ 119.3 ],
                "germline": "TRG",
                "top": 1,
                "seg":
                {
                    "5": {"name": "TRGV5*01",  "start": 1,   "stop": 86,  "delRight": 5},
                    "3": {"name": "TRGJ1*02",  "start": 89,  "stop": 118, "delLeft":  0},
                    "cdr3": { "start": 77, "stop": 104, "seq": "gccacctgggccttattataagaaactc" }
                }
            }
        ]
    }
#+END_SRC

** =.vidjil= file -- several related samples

This is a =.vidjil= file obtained by merging, with =fuse.py=, two =.vidjil= files corresponding to two samples.
Clones that have the same =id= are gathered (see 'What is a clone?', above).
It is the responsibility of the program generating the initial =.vidjil= files to choose these =id= so that
the gathering is correct.

#+BEGIN_SRC js :tangle analysis-example2.vidjil
    {
        "producer": "program xyz version xyz / fuse.py version xyz",
        "timestamp": "2014-10-01 14:00:11",
        "vidjil_json_version": "2016b",

        "samples": {
             "number": 2,
             "original_names": ["T8045-BC081-Diag.fastq", "T8045-BC082-fu1.fastq"]
        },

        "reads" : {
            "total" :           [ 437164, 457810 ] ,
            "segmented" :       [ 335662, 410124 ] ,
            "germline" : {
                "TRG" :         [ 250000, 300000 ] ,
                "IGH" :         [ 85662,   10124 ]
            }
        },

        "clones": [
            {
                "id": "clone-001",
                "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
                "reads" : [ 243241, 14717 ],
                "germline": "TRG",
                "top": 1,
                "seg":
                {
                    "5": {"name": "TRGV5*01",  "start": 1,  "stop": 86,  "delRight": 5},
                    "3": {"name": "TRGJ1*02",  "start": 89, "stop": 118, "delLeft":  0}
                }
            },
            {
                "id": "clone2",
                "sequence": "GATACA",
                "reads": [ 153, 10221 ],
                "germline": "TRG",
                "top": 2
            },
            {
                "id": "clone3",
                "sequence": "ATACAGA",
                "reads": [ 521, 42 ],
                "germline": "TRG",
                "top": 3,
                "seg":
                {
                    "5": {"start": 1, "stop": 100},
                    "3": {"start": 101, "stop": 200}
                }
            }
        ]
    }
#+END_SRC
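
The gathering of clones sharing the same =id= can be sketched as follows. This is only an illustration and not the code of =fuse.py=: the function name =gather_clones= is hypothetical, only a few fields are kept, and a clone absent from a sample is assumed to have 0 reads in that sample.

#+BEGIN_SRC python
def gather_clones(files):
    """Merge the "clones" lists of several single-sample files by clone id.

    Illustrative only (this is not the code of fuse.py): a clone missing
    from a sample is assumed to have 0 reads in that sample.
    """
    n = len(files)
    gathered = {}
    for i, data in enumerate(files):
        for clone in data["clones"]:
            entry = gathered.setdefault(
                clone["id"],
                {"id": clone["id"], "sequence": clone["sequence"],
                 "germline": clone["germline"], "reads": [0] * n},
            )
            # each input file describes one sample, so "reads" has one value
            entry["reads"][i] = clone["reads"][0]
    return list(gathered.values())


diag = {"clones": [
    {"id": "clone2", "sequence": "GATACA", "germline": "TRG", "reads": [153]},
]}
fu1 = {"clones": [
    {"id": "clone2", "sequence": "GATACA", "germline": "TRG", "reads": [10221]},
    {"id": "clone3", "sequence": "ATACAGA", "germline": "TRG", "reads": [42]},
]}
clones = gather_clones([diag, fu1])
#+END_SRC

With identifiers chosen consistently by the producing software, the same clone ends up on a single entry with one read count per sample.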


** =.analysis= file

This file reflects the annotations a user could have done within the Vidjil web application or some other tool.
She has manually set sample names (=names=), tagged (=tag=, =tags=), named (=name=) and clustered (=clusters=)
some clones, and added external data (=data=).

#+BEGIN_SRC js :tangle analysis-example2.analysis
    {
        "producer": "user Bob, via Vidjil webapp",
        "timestamp": "2014-10-01 12:00:11",
        "vidjil_json_version": "2016b",

        "samples": {
             "id": [
                 "T8045-BC081-Diag.fastq",
                 "T8045-BC082-fu1.fastq"
             ],
             "number": 2,
             "names": ["diag", "fu1"],
             "original_names": ["file1.fastq", "file2.fastq"],
             "order": [1, 0]
        },

        "clones": [
            {
                "id": "clone-001",
                "name": "Main ALL clone",
                "tag": "0"
            },
            {
                "id": "spikeE",
                "label": "spike",
                "sequence": "ATGACTCTGGAGTCTATTACTGTGCCACCTGGGATGTGAGTATTATAAGAAAC",
                "tag": "3",
                "expected": "0.1"
            }
        ],

        "clusters": [
            [ "clone2", "clone3"],
            [ "clone-5", "clone-10", "clone-179" ]
        ],

        "data": {
             "qPCR": [0.83, 0.024],
             "spikeZ": [0.01, 0.02]
        },

        "tags": {
            "names": {
                "0" : "main clone",
                "3" : "spike",
                "5" : "custom tag"
            },
            "hide": [4, 5]
        }
    }
#+END_SRC

The =order= field defines the order in which the points should be
considered. In this example, the second point (whose name
is /fu1/) should be considered first, and the first point in
the file (whose name is /diag/) should be considered second.
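
A minimal sketch of how the =order= field could be applied is given below. The helper =ordered_names= is hypothetical. It assumes that =order= holds indices into the per-sample lists, that =names= falls back to =original_names= when absent, and that the default lexicographic order is computed on the displayed names.

#+BEGIN_SRC python
def ordered_names(samples):
    """List the sample names in the order in which they should be considered.

    Assumptions of this sketch: "order" holds indices into the per-sample
    lists, and without "order" the names are sorted lexicographically.
    """
    names = samples.get("names") or samples["original_names"]
    order = samples.get("order")
    if order is None:
        order = sorted(range(len(names)), key=lambda i: names[i])
    return [names[i] for i in order]


samples = {
    "number": 2,
    "names": ["diag", "fu1"],
    "original_names": ["file1.fastq", "file2.fastq"],
    "order": [1, 0],
}
display = ordered_names(samples)
#+END_SRC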

The =clusters= field indicates clones (by their =id=) that have been further clustered.
Usually, these clones were defined in a related =.vidjil= file (as /clone2/ and /clone3/,
see the =.vidjil= file in the previous section). If these clones do not exist, the clusters are
just ignored. The first item of the cluster is considered as the
representative clone of the cluster.

* Detailed specification

** Generic information for traceability [required]

#+BEGIN_SRC js
   "producer": "my-repseq-software -z -k (v. 123)",    // arbitrary string, user/software/version/options producing this file [required]
   "timestamp": "2014-10-01 12:00:11",                 // last modification date [required]
   "vidjil_json_version": "2016b",                     // version of the .json format  [required]
#+END_SRC

** Statistics: the =reads= element [.vidjil only, required]

The number of analyzed reads (=segmented=) may be higher than the sum of the read numbers of all clones,
when one chooses to report only the 'top' clones (=-t= option for fuse).

#+BEGIN_SRC js
{
    "total" : [],          // total number of reads per sample (with samples.number elements)
    "segmented" : [],      // number of analyzed/segmented reads per sample (with samples.number elements)
    "germline" : {         // number of analyzed/segmented reads per sample/germline (with samples.number elements)
        "TRG" : [],
        "IGH" : []
    }
}
#+END_SRC
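
The constraint that every list has =samples.number= elements can be checked with a few lines. The function =check_reads= below is a hypothetical helper written for this document, not a tool shipped with Vidjil.

#+BEGIN_SRC python
def check_reads(data):
    """Return a list of problems found in the per-sample lists of "reads".

    Sketch only: checks that each list has samples.number elements.
    """
    n = data["samples"]["number"]
    reads = data["reads"]
    lists = {"total": reads["total"], "segmented": reads["segmented"]}
    for locus, counts in reads["germline"].items():
        lists["germline." + locus] = counts
    return [f"reads.{key}: expected {n} values, found {len(values)}"
            for key, values in lists.items() if len(values) != n]


# Deliberately broken data: the IGH list has one value for two samples.
example = {
    "samples": {"number": 2},
    "reads": {
        "total": [437164, 457810],
        "segmented": [335662, 410124],
        "germline": {"TRG": [250000, 300000], "IGH": [85662]},
    },
}
problems = check_reads(example)
#+END_SRC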
** =samples= element [required]

#+BEGIN_SRC js
  {
    "number": 2,      // number of samples [required]

    "original_names": [],  // original sample names (with samples.number elements) [required]

    "names": [],      // custom sample names (with samples.number elements) [optional]
                      // These names are editable and will be used on the graphs

    "order": [],      // custom sample order (lexicographic order by default) [optional]

    // traceability on each sample (with samples.number elements)
    "producer": [],
    "timestamp": [],
    "log": []
  }
#+END_SRC
** =clones= list, with read count, tags, V(D)J designation and other sequence features

Each element in the =clones= list describes properties of a clone.

In a =.vidjil= file, this is the main part, describing all clones.
In the =.analysis= file, this section is intended to describe some specific clones.

#+BEGIN_SRC js
  {
    "id": "",        // clone identifier, must be unique [required] [see above, 'What is a clone ?']
                     // the clone identifier in the .vidjil file and in .analysis file must match

    "germline": "",  // [required for .vidjil]
                     // (should match a germline defined in germline/germline.data)

    "name": "",      // clone custom name [optional]
                     // (the default name, in .vidjil, is computed from V/D/J information)

    "label": "",     // clone labels, separated by spaces [optional]
                     // These labels may add some information entered with a controlled vocabulary

    "sequence": "",  // reference nt sequence [required for .vidjil]
                     // (for .analysis, not really used now in the web application,
                     //  for special clones/sequences that are known,
                     //  such as standard/spikes or known patient clones)

    "tag": "",       // tag id from 0 to 7 (see below) [optional]

    "expected": "",  // expected abundance of this clone (between 0 and 1) [optional]
                     // this will create a normalization option in the
                     // settings web application menu

    "seg":           // detailed V(D)J designation/segmentation and other sequence features or values [optional]
                     // on the web application, clones that are not segmented will be shown on the grid with '?/?'
                     // positions are related to the 'sequence'
                     // names of V/D/J genes should match the ones in files referenced in germline/germline.data
                     // Positions on the sequence start at 1.
      {
         "5": {"name": "IGHV5*01", "start": 1, "stop": 120,  "delRight": 5},    // V (or 5') segment
         "4": {"name": "IGHD1*01", "start": 124, "stop": 135, "info": "unsure designation",  "delRight": 5, "delLeft": 0},  // D (or middle) segment
                     // Recombination with several D may use "4a", "4b"...
         "3": {"name": "IGHJ3*02", "start": 136, "stop": 171,  "delLeft": 5},  // J (or 3') segment

                     // any feature to be highlighted in the sequence, with optional fields related to this feature:
                     //  - "start"/"stop" : positions on the clone sequence (starting at 1)
                     //  - "delLeft"/"delRight" : a numerical value, the number of nucleotides deleted during the rearrangement.
                     //    "delRight" applies to V/5 and D/4 segments, "delLeft" applies to D/4 and J/3 segments.
                     //  - "seq" : a sequence
                     //  - "val" : a numerical value
                     //  - "info" : a textual value
                     //
                     // JUNCTION/CDR3 should be stored that way (in fields called "junction" or "cdr3"),
                     // its productivity must be stored in a boolean field called "productive".
         "somefeature": { "start": 56, "stop": 61, "seq": "ACTGTA", "val": 145.7, "info": "analyzed with xyz" },

                     // Numerical or textual features concerning all the sequence or its analysis (such as 'evalue')
                     // can be provided by omitting "start" and "stop" elements.
         "someotherfeature": {"val": 0.004521},
         "anotherfeature": {"info": "VH CDR3 stereotypy"}
      },

    "reads": [],      // number of reads in this clone [.vidjil only, required]
                      // (with samples.number elements)

    "top": 0,         // (not documented now) [required] threshold to display/hide the clone
    "stats": []       // (not documented now) [.vidjil only] (with samples.number elements)
  }
#+END_SRC
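
Since positions in =seg= start at 1, they have to be shifted when used with 0-based slices. The sketch below shows one way to do so in Python. The helper names and the toy clone are made up for this example, and the =stop= position is assumed to be inclusive.

#+BEGIN_SRC python
def feature_sequence(sequence, feature):
    """Return the part of `sequence` covered by a "seg" feature.

    "start" is 1-based and "stop" is assumed inclusive, so the Python
    slice begins at start - 1 and ends at stop.
    """
    return sequence[feature["start"] - 1:feature["stop"]]


def insertion_between(sequence, left, right):
    """Return the nucleotides lying strictly between two features."""
    return sequence[left["stop"]:right["start"] - 1]


# Made-up clone, only meant to exercise the two helpers.
clone = {
    "sequence": "ACGTACGTTTGGCCAA",
    "seg": {
        "5": {"name": "V-example", "start": 1, "stop": 8},
        "3": {"name": "J-example", "start": 11, "stop": 16},
    },
}
seg = clone["seg"]
v_part = feature_sequence(clone["sequence"], seg["5"])
n_part = insertion_between(clone["sequence"], seg["5"], seg["3"])
j_part = feature_sequence(clone["sequence"], seg["3"])
#+END_SRC

On this toy clone, the three parts concatenated give back the whole sequence.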

** =germlines= list [optional][work in progress, to be documented]

This list extends the default =germline.data= file with a custom germline.

#+BEGIN_SRC js
        "germlines" : {
            "custom" : {
                "shortcut": "B",
                "5": ["TRBV.fa"],
                "4": ["TRBD.fa"],
                "3": ["TRBJ.fa"]
            }
        }
#+END_SRC

** Further clustering of clones: the =clusters= list [optional]

Each element in the =clusters= list describes a list of clones that are 'merged'.
In the web application, it will still be possible to see them or to unmerge them.
The first clone of each list is used as a representative for the cluster.
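
The handling of clusters can be sketched as below. The function =resolve_clusters= is hypothetical. It assumes that a cluster is ignored as soon as one of its clones is unknown, which is only one possible reading of the rule that clusters referring to clones that do not exist are ignored.

#+BEGIN_SRC python
def resolve_clusters(clusters, known_ids):
    """Map each representative clone id to the ids merged into it.

    Assumption of this sketch: a cluster is ignored as soon as one of
    its clones is unknown.
    """
    known = set(known_ids)
    resolved = {}
    for cluster in clusters:
        if cluster and all(clone_id in known for clone_id in cluster):
            # the first clone of the list is the representative
            resolved[cluster[0]] = cluster[1:]
    return resolved


clusters = [["clone2", "clone3"], ["clone-5", "clone-10", "clone-179"]]
merged = resolve_clusters(clusters, ["clone-001", "clone2", "clone3"])
#+END_SRC

With the clones of the two-sample example, only the first cluster is kept, with /clone2/ as its representative.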

** =data= list [optional][work in progress, to be documented]

Each element in the =data= list is a list of values (of size samples.number)
showing additional data for each sample, as for example qPCR levels or spike information.

In the browser, it will be possible to display these data and to normalize
against them (not implemented now).

** Tagging some clones: =tags= list [optional]

The =tags= list describes the custom tag names as well as tags that should be hidden by default.
The default tag names are defined in [[../browser/js/vidjil-style.js]].

#+BEGIN_SRC js
    "key" : "value"  // "key" is the tag id from 0 to 7 and "value" is the custom tag name attributed
#+END_SRC
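
The lookup of a tag name could work along the following lines. This sketch uses the =names= and =hide= fields of the =.analysis= example above. The helpers =tag_name= and =is_hidden= are hypothetical, and =DEFAULT_NAMES= is a placeholder: the actual default names are those of =vidjil-style.js= and are not reproduced here.

#+BEGIN_SRC python
def tag_name(tag_id, tags, default_names):
    """Return the display name of a tag, preferring the custom name."""
    key = str(tag_id)
    custom = tags.get("names", {})
    return custom.get(key, default_names.get(key, "tag " + key))


def is_hidden(tag_id, tags):
    """Tell whether clones with this tag are hidden by default."""
    return int(tag_id) in tags.get("hide", [])


# Placeholder defaults; the real ones are defined in vidjil-style.js.
DEFAULT_NAMES = {str(i): "default tag " + str(i) for i in range(8)}

tags = {
    "names": {"0": "main clone", "3": "spike", "5": "custom tag"},
    "hide": [4, 5],
}
#+END_SRC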