#+TITLE: .analysis and .vidjil format
#+AUTHOR: The Vidjil team

The =.analysis= and =.vidjil= files share a common [[http://en.wikipedia.org/wiki/JSON][.json]] format.
They are produced and used by several components of the Vidjil platform,
but you can also produce them yourself to use the Vidjil browser within
your own analysis pipeline.

The =.vidjil= file holds the actual data on clones (and can
reach megabytes). It should be produced automatically.

The =.analysis= file describes customizations made by the user
(or by some automatic pre-processing) in the Vidjil browser. The browser
can load or save such files (possibly from/to the patient database).
It is intended to be very small (a few kilobytes).
All settings in the =.analysis= file override the settings that may be
present in the =.vidjil= file.
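
The override rule can be pictured with a small Python sketch. It is only an illustration, not the browser's actual code: the helper name =apply_analysis= is hypothetical, and matching clones by =id= and overriding field by field are assumptions based on the examples below.

#+BEGIN_SRC python
import copy

def apply_analysis(vidjil, analysis):
    """Return a copy of `vidjil` where fields from `analysis` take precedence.

    Hypothetical helper: clones are matched on "id", and only the fields
    present in the .analysis clone replace those of the .vidjil clone.
    """
    merged = copy.deepcopy(vidjil)
    overrides = {clone["id"]: clone for clone in analysis.get("clones", [])}
    for clone in merged.get("clones", []):
        clone.update(overrides.get(clone["id"], {}))
    # Sample-level settings (names, order, ...) are overridden the same way.
    merged.setdefault("samples", {}).update(analysis.get("samples", {}))
    return merged

vidjil = {
    "samples": {"number": 1, "original_names": ["T8045-BC081-Diag.fastq"]},
    "clones": [{"id": "clone-001", "germline": "TRG", "top": 1, "reads": [243241]}],
}
analysis = {
    "samples": {"number": 1, "names": ["diag"]},
    "clones": [{"id": "clone-001", "name": "Main ALL clone", "tag": "0"}],
}
customized = apply_analysis(vidjil, analysis)
print(customized["clones"][0])
#+END_SRC

The input documents are left untouched, so the same =.vidjil= data can be combined with several =.analysis= files.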

* Examples

** =.vidjil= file -- one sample

This is an almost minimal =.vidjil= file, describing clones in one sample.
The =seg= element is optional: clones without a =seg= element are shown on the grid with '?/?'.
All other elements are required. The =reads.germline= element can have only one entry in the case of data on a unique locus.
Here there is one clone, with the segmentation =TRGV5*01 5/CC/0 TRGJ1*02=.
Note that other elements (such as =tag= or =clusters=) could be added by some program.

#+BEGIN_SRC js :tangle analysis-example1.vidjil
    {
        "producer": "program xyz version xyz",
        "timestamp": "2014-10-01 12:00:11",
        "vidjil_json_version": "2014.10",

        "samples": {
             "number": 1,
             "original_names": ["T8045-BC081-Diag.fastq"]
        },

        "reads" : {
            "total" :           [ 437164 ] ,
            "segmented" :       [ 335662 ] ,
            "germline" : {
                "TRG" :         [ 250000 ] ,
                "IGH" :         [ 85662  ]
            }
        },

        "clones": [
            {
                "id": "clone-001",
                "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
                "reads" : [ 243241 ],
                "germline": "TRG",
                "top": 1,
                "seg":
                {
                    "5": "TRGV5*01",  "5start": 0,   "5end": 86,
                    "3": "TRGJ1*02",  "3start": 89,  "3end": 118,
                    "cdr3": { "start": 77, "stop": 104, "seq": "gccacctgggccttattataagaaactc" }
                }
            }
        ]
    }
#+END_SRC
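
When producing such a file from your own pipeline, a quick consistency check can catch malformed output early. The following Python sketch is not part of Vidjil: =check_vidjil= is a hypothetical helper that only verifies two rules stated in this document, namely that the required top-level elements are present and that per-sample lists have =samples.number= elements.

#+BEGIN_SRC python
import json

REQUIRED_TOP_LEVEL = ["producer", "timestamp", "vidjil_json_version",
                      "samples", "reads", "clones"]

def check_vidjil(data):
    """Return a list of problems found in a parsed .vidjil document."""
    problems = [f"missing element: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in data]
    if problems:
        return problems
    n = data["samples"]["number"]
    # Every list below must hold one value per sample.
    per_sample = {
        "samples.original_names": data["samples"]["original_names"],
        "reads.total": data["reads"]["total"],
        "reads.segmented": data["reads"]["segmented"],
    }
    for locus, counts in data["reads"]["germline"].items():
        per_sample[f"reads.germline.{locus}"] = counts
    for clone in data["clones"]:
        per_sample[f"clones[{clone['id']}].reads"] = clone["reads"]
    for name, values in per_sample.items():
        if len(values) != n:
            problems.append(f"{name}: expected {n} values, found {len(values)}")
    return problems

# Trimmed version of the one-sample example (clone sequence and seg omitted).
text = """
{
    "producer": "program xyz version xyz",
    "timestamp": "2014-10-01 12:00:11",
    "vidjil_json_version": "2014.10",
    "samples": {"number": 1, "original_names": ["T8045-BC081-Diag.fastq"]},
    "reads": {"total": [437164], "segmented": [335662],
              "germline": {"TRG": [250000], "IGH": [85662]}},
    "clones": [{"id": "clone-001", "reads": [243241], "germline": "TRG", "top": 1}]
}
"""
doc = json.loads(text)
print(check_vidjil(doc))
#+END_SRC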

** =.vidjil= file -- several samples

This is a =.vidjil= file obtained by merging, with =fuse.py=, two =.vidjil= files corresponding to two samples.
Clones that have the same =id= are gathered.
It is the responsibility of the program generating the initial =.vidjil= files to choose these =id= so that
the gathering is correct ('windows' are used by Vidjil, the 'clone sequence' is used by the EC-NGS/Brno pipeline,
and 'IMGT clonotype (AA) or (nt)' could also be used by some programs).

#+BEGIN_SRC js :tangle analysis-example2.vidjil
    {
        "producer": "program xyz version xyz / fuse.py version xyz",
        "timestamp": "2014-10-01 14:00:11",
        "vidjil_json_version": "2014.10",

        "samples": {
             "number": 2,
             "original_names": ["T8045-BC081-Diag.fastq", "T8045-BC082-fu1.fastq"]
        },

        "reads" : {
            "total" :           [ 437164, 457810 ] ,
            "segmented" :       [ 335662, 410124 ] ,
            "germline" : {
                "TRG" :         [ 250000, 300000 ] ,
                "IGH" :         [ 85662,   10124 ]
            }
        },

        "clones": [
            {
                "id": "clone-001",
                "sequence": "CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATTCTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC",
                "reads" : [ 243241, 14717 ],
                "germline": "TRG",
                "top": 1,
                "seg":
                {
                    "5": "TRGV5*01",  "5start": 0,   "5end": 86,
                    "3": "TRGJ1*02",  "3start": 89,  "3end": 118
                }
            },
            {
                "id": "clone2",
                "sequence": "GATACA",
                "reads": [ 153, 10221 ],
                "germline": "TRG",
                "top": 2
            },
            {
                "id": "clone3",
                "sequence": "ATACAGA",
                "reads": [ 521, 42 ],
                "germline": "TRG",
                "top": 3
            }
        ]
    }
#+END_SRC
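
The gathering by =id= can be sketched in Python as follows. This is not the code of =fuse.py=, whose actual behavior (in particular how =top= and =seg= are handled) is not described here; the function =gather_clones= and the choice to record 0 reads for a clone absent from a sample are assumptions made for illustration.

#+BEGIN_SRC python
def gather_clones(clone_lists):
    """Gather clones sharing the same "id" across several one-sample lists.

    `clone_lists` holds one list of clones per sample. The result has one
    entry per distinct id, whose "reads" list has one value per sample
    (0 when the clone is absent from a sample: an assumption of this sketch).
    """
    n = len(clone_lists)
    gathered = {}
    for sample_index, clones in enumerate(clone_lists):
        for clone in clones:
            entry = gathered.setdefault(
                clone["id"],
                {"id": clone["id"], "sequence": clone["sequence"],
                 "germline": clone["germline"], "reads": [0] * n})
            entry["reads"][sample_index] += clone["reads"][0]
    return list(gathered.values())

# Two one-sample clone lists, using clone2 and clone3 from the example above.
diag = [{"id": "clone2", "sequence": "GATACA", "germline": "TRG", "reads": [153]},
        {"id": "clone3", "sequence": "ATACAGA", "germline": "TRG", "reads": [521]}]
fu1 = [{"id": "clone2", "sequence": "GATACA", "germline": "TRG", "reads": [10221]},
       {"id": "clone3", "sequence": "ATACAGA", "germline": "TRG", "reads": [42]}]

fused = gather_clones([diag, fu1])
for clone in fused:
    print(clone["id"], clone["reads"])
#+END_SRC

If the two input files used different kinds of =id= (for example a window in one and a clonotype in the other), nothing would be gathered, which is why the choice of =id= matters.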


** =.analysis= file

This file reflects what a user could have done with the browser (or with some other tool).
She has manually set sample names (=names=), tagged (=tag=, =tags=) and clustered (=clusters=)
some clones, and added external data (=data=).

#+BEGIN_SRC js :tangle analysis-example2.analysis
    {
        "producer": "user Bob, via browser",
        "timestamp": "2014-10-01 12:00:11",
        "vidjil_json_version": "2014.10",

        "samples": {
             "number": 2,
             "names": ["diag", "fu1"],
             "original_names": ["file1.fastq", "file2.fastq"],
             "order": [1, 0]
        },

        "clones": [
            {
                "id": "clone-001",
                "name": "Main ALL clone",
                "tag": "0"
            },
            {
                "id": "spikeE",
                "name": "spike",
                "sequence": "ATGACTCTGGAGTCTATTACTGTGCCACCTGGGATGTGAGTATTATAAGAAAC",
                "tag": "3",
                "expected": "0.1"
            }
        ],

        "clusters": [
            [ "clone2", "clone3"],
            [ "clone-5", "clone-10", "clone-179" ]
        ],

        "data": {
             "qPCR": [0.83, 0.024],
             "spikeZ": [0.01, 0.02]
        },

        "tags": {
            "names": {
                "0" : "main clone",
                "3" : "spike",
                "5" : "custom tag"
            },
            "hide": [4, 5]
        }
    }
#+END_SRC

The =order= field defines the order in which the points should be
considered. In this example, the second point in the file (whose =name=
is /fu1/) comes first, and the first point in the file (whose =name= is /diag/)
comes second.

As exemplified in the =clusters= field, clones defined in the =.vidjil= file
can be clustered (here /clone2/ and /clone3/ are defined in the
=.vidjil= file of the previous section). If clones do not exist, the clusters are
just ignored. The first item of a cluster is considered as the
representative clone of the cluster.
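
A minimal Python sketch of how =order= can be read, assuming (as in the example above) that each entry is a 0-based index into the per-sample lists. The helper =reorder= is hypothetical and not taken from the browser.

#+BEGIN_SRC python
def reorder(values, order):
    """Return per-sample `values` rearranged according to `order`.

    order[i] is taken to be the 0-based index, in the file, of the sample
    to consider at position i.
    """
    return [values[index] for index in order]

samples = {"number": 2, "names": ["diag", "fu1"], "order": [1, 0]}

# The same permutation applies to any per-sample list, such as clone reads.
names_in_display_order = reorder(samples["names"], samples["order"])
reads_in_display_order = reorder([243241, 14717], samples["order"])
print(names_in_display_order, reads_in_display_order)
#+END_SRC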

* The different elements
						     
** Generic information for traceability [required]

#+BEGIN_SRC js
   "producer": "",    // arbitrary string, user/software/options producing this file [required]
   "timestamp": "",   // last modification date [required]
   "vidjil_json_version": "2014.10", // version of the format  [required]
#+END_SRC
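
For a program writing these files, the three traceability fields could be filled as in the Python sketch below. The timestamp layout =YYYY-MM-DD HH:MM:SS= is inferred from the examples above rather than specified, and the =make_header= helper and the producer string are placeholders.

#+BEGIN_SRC python
import json
from datetime import datetime

def make_header(producer, now=None):
    """Build the traceability fields shared by .vidjil and .analysis files."""
    now = now or datetime.now()
    return {
        "producer": producer,
        # Layout inferred from the examples, e.g. "2014-10-01 12:00:11".
        "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "vidjil_json_version": "2014.10",
    }

header = make_header("my-pipeline version 0.1", datetime(2014, 10, 1, 12, 0, 11))
print(json.dumps(header, indent=4))
#+END_SRC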


** 'reads' element [.vidjil only, required]

#+BEGIN_SRC js
{
    "total" : [],       // total number of reads per sample (with samples.number elements)
    "segmented" : [],   // number of segmented reads per sample (with samples.number elements)
    "germline" : {      // number of segmented reads per sample/germline (with samples.number elements)
        "TRG" : [],
        "IGH" : []
    }
}
#+END_SRC
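
These counts make it easy to derive per-sample ratios. The Python sketch below computes, for each sample, the fraction of reads that were segmented, using the values of the two-sample example above; =segmented_ratio= is a hypothetical helper, not a Vidjil function.

#+BEGIN_SRC python
def segmented_ratio(reads):
    """Return, for each sample, segmented / total reads (0.0 if total is 0)."""
    return [segmented / total if total else 0.0
            for segmented, total in zip(reads["segmented"], reads["total"])]

reads = {
    "total": [437164, 457810],
    "segmented": [335662, 410124],
    "germline": {"TRG": [250000, 300000], "IGH": [85662, 10124]},
}
ratios = segmented_ratio(reads)
for index, ratio in enumerate(ratios):
    print(f"sample {index}: {ratio:.1%} of reads segmented")
#+END_SRC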


 
** 'Samples' element [required]

#+BEGIN_SRC js
  {
    "number": 2,      // number of samples [required]

    "original_names": [],  // original sample names (with samples.number elements) [required]

    "names": [],      // custom sample names (with samples.number elements) [optional]
                      // These names are editable and will be used on the graphs

    "order": [],      // custom sample order (lexicographic order by default) [optional]

    "producer": [],
    "timestamp": [],
    "log": []
  }
#+END_SRC


** 'Clones' list

Each element in the 'clones' list describes properties of a clone.

In a =.vidjil= file, this is the main part, describing all clones.

In the =.analysis= file, this section is intended to describe some specific clones.



#+BEGIN_SRC js
  {
    "id": "",        // clone identifier, must be unique [required]
                     //          Vidjil/algo output -> the 'window'
                     //          Brno .clntab       -> clone sequence
                     // the clone identifier in the .vidjil file and in .analysis file must match

    "germline": "",  // [required for .vidjil]
                     // (should match a germline defined in germline/germline.data)

    "name": "",      // clone custom name [optional]
                     // (the default name, in .vidjil, is computed from V/D/J information)

    "sequence": "",  // reference nt sequence [required for .vidjil]
                     // (for .analysis, not really used now in the browser,
                     //  for special clones/sequences that are known,
                     //  such as standard/spikes or known patient clones)

    "tag": "",       // tag id from 0 to 7 (see below) [optional]

    "expected": "",  // expected abundance of this clone (between 0 and 1) [optional]
                     // this will create a normalization option in the
                     // browser settings menu

    "seg":           // segmentation information [optional]
                     // in the browser, clones that are not segmented will be shown on the grid with '?/?'
                     // positions are related to the 'sequence'
                     // names of V/D/J genes should match the ones in files referenced in germline/germline.data
      {
         "5": "IGHV5*01",
         "5start": 0,
         "5end": 0,

         "4": "IGHD1*01",
         "4start": 0,
         "4end": 0,

         "3": "IGHJ3*02",
         "3start": 0,
         "3end": 0,

                     // any feature to be highlighted in the sequence
                     // the optional "seq" element gives a sequence that corresponds to this feature
         "somefeature": { "start": 0, "stop": 0, "seq": "" }
      },

    "reads": [],      // number of reads in this clone [.vidjil only, required]
                      // (with samples.number elements)
    "top": 0,         // required so that the browser displays the clone

    "stats": []       // (not documented now) [.vidjil only] (with samples.number elements)
  }
#+END_SRC
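
Since positions refer to =sequence=, a feature can be cut out of the clone sequence directly. The Python sketch below assumes 0-based, inclusive positions; this convention is inferred from the one-sample example above (where =3end= is 118 for a 119 nt sequence and the =cdr3= feature spans 77 to 104) and is not stated explicitly by the format. =feature_sequence= is a hypothetical helper.

#+BEGIN_SRC python
def feature_sequence(clone, feature):
    """Return the part of the clone sequence covered by a 'seg' feature.

    Positions are assumed to be 0-based and inclusive (inferred from the
    examples, not guaranteed by the format).
    """
    location = clone["seg"][feature]
    return clone["sequence"][location["start"]:location["stop"] + 1]

# Clone taken from the one-sample example above.
clone = {
    "id": "clone-001",
    "sequence": ("CTCATACACCCAGGAGGTGGAGCTGGATATTGATACTACGAAATCTAATTGAAAATGATT"
                 "CTGGGGTCTATTACTGTGCCACCTGGGCCTTATTATAAGAAACTCTTTGGCAGTGGAAC"),
    "seg": {
        "5": "TRGV5*01", "5start": 0, "5end": 86,
        "3": "TRGJ1*02", "3start": 89, "3end": 118,
        "cdr3": {"start": 77, "stop": 104, "seq": "gccacctgggccttattataagaaactc"},
    },
}
cdr3 = feature_sequence(clone, "cdr3")
# The optional "seq" element should agree with the extracted sequence.
print(cdr3.lower() == clone["seg"]["cdr3"]["seq"])
#+END_SRC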

** 'Germlines' list [optional][work in progress, to be documented]

This element extends the default =germline.data= file with a custom germline.

#+BEGIN_SRC js
        "germlines" : {
            "custom" : {
                "shortcut": "B",
                "5": ["TRBV.fa"],
                "4": ["TRBD.fa"],
                "3": ["TRBJ.fa"]
            }
        }
#+END_SRC

** 'Clusters' list [optional]

Each element in the 'clusters' list describes a list of clones that are 'merged'.
In the browser, it will still be possible to see them or to unmerge them.
The first clone of each line is used as a representative for the cluster.
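
How clusters map onto clones can be sketched in Python as below. The sketch keeps a cluster only when all of its clones exist in the =.vidjil= file, which is one possible reading of "ignored" in the example section; the =resolve_clusters= helper and the choice to sum reads over cluster members are assumptions, not a description of the browser.

#+BEGIN_SRC python
def resolve_clusters(clones, clusters):
    """Map each usable cluster to summed per-sample reads.

    Returns a dict {representative id: reads}, where the representative is
    the first id of the cluster. Clusters naming unknown clones are skipped.
    """
    by_id = {clone["id"]: clone for clone in clones}
    result = {}
    for cluster in clusters:
        if not cluster or any(clone_id not in by_id for clone_id in cluster):
            continue
        member_reads = [by_id[clone_id]["reads"] for clone_id in cluster]
        result[cluster[0]] = [sum(per_sample) for per_sample in zip(*member_reads)]
    return result

# Clones of the two-sample example and clusters of the .analysis example.
clones = [
    {"id": "clone-001", "reads": [243241, 14717]},
    {"id": "clone2", "reads": [153, 10221]},
    {"id": "clone3", "reads": [521, 42]},
]
clusters = [["clone2", "clone3"], ["clone-5", "clone-10", "clone-179"]]
cluster_reads = resolve_clusters(clones, clusters)
print(cluster_reads)
#+END_SRC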


** 'Data' list [optional][work in progress, to be documented]

Each element in the 'data' list is a list of values (of size samples.number)
showing additional data for each sample, such as qPCR levels or spike information.

In the browser, it will be possible to display these data and to normalize
against them (not implemented now).

** 'Tags' list [optional]

The 'tags' list describes the custom tag names as well as the tags that should be hidden by default.
The default tag names are defined in [[../browser/js/vidjil-style.js]].

#+BEGIN_SRC js
    "key" : "value"  // "key" is the tag id from 0 to 7 and "value" is the custom tag name attributed
#+END_SRC
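
Reading the =tags= element of the =.analysis= example above can be sketched in Python as follows. The default names used here (=tag 0= to =tag 7=) are placeholders, since the real defaults live in =vidjil-style.js=; =tag_table= is a hypothetical helper.

#+BEGIN_SRC python
# Placeholder defaults: the real names are defined in vidjil-style.js.
DEFAULT_TAG_NAMES = {str(i): f"tag {i}" for i in range(8)}

def tag_table(tags):
    """Return {tag id: (name, hidden)} for the eight tag ids."""
    names = {**DEFAULT_TAG_NAMES, **tags.get("names", {})}
    hidden = {str(tag_id) for tag_id in tags.get("hide", [])}
    return {tag_id: (name, tag_id in hidden) for tag_id, name in names.items()}

tags = {
    "names": {"0": "main clone", "3": "spike", "5": "custom tag"},
    "hide": [4, 5],
}
table = tag_table(tags)
for tag_id, (name, hidden) in table.items():
    print(tag_id, name, "(hidden)" if hidden else "")
#+END_SRC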