robustness_pipeline
Pipeline used to damage DNA data with gargammel, which is a software that simulates ancient DNA fragments.
DNA data are in the form of SNP matrix, in order to pass them to a machine learning algorithm, to compare prediction after and before being damaged.
For more details about gargammel : https://github.com/grenaud/gargammel
Architecture
Pipeline diagram :
graph LR
A[base data] -- conversion --> B[gargammel and post-treatment]
B -- conversion --> C[damaged data]
C -- compute distance --> A
C --> D[machine learning algorithm]
A --> D
D --> E[compare predictions]
But gargammel only take fasta sequence and simulate it in DNA fragments, so we will have to create random sequence, which will be used as reference sequence and we will put mutations for each individual corresponding to the SNP Matrix, in order to put this sequences in fasta files and pass it to gargammel, after that we will have to map the fragments on the reference sequence, call the variants and finaly merge all the individuals
here is a diagram summarizing all this step.
graph LR
A[Split individuals, and create <br>fasta sequence for each,<br>according to the SNP Matrix] --> B[use gargammel]
B --> C["remove adapters<br>(Trimgalore)"]
C --> D["map reads on reference<br>(option -Q : quality treeshold )"]
D --> E[call the variants]
E --> F[merge<br>individuals]
Gargammel is also calling other subprograms (https://github.com/grenaud/gargammel#overview)
Here is a diagram explaining how it works with our pipeline options.
graph LR
A["fragSim : simulation of ancient DNA fragments<br> being retrieved at random from the sequence<br>(option -l : specify size of fragments, <br>-c : specify coverage)"] --> B["deamSim : simulation of damage (deamination)<br>(option -m : specify mapdamage matrix)"]
B --> C[adptSim : add adapters <br>to create raw Illumina reads]
C --> D["ART : add sequencing errors<br>(option -s : change error rate )"]
Options
short | long | type | description | default | associated program |
---|---|---|---|---|---|
-f | --file | path string | npz file containing SNP matrix and positions | - (mandatory) |
gargammel |
-l | --length | integer | length of the simulated random sequence (must be big enough to match npz positions) | - (mandatory) |
fragSim (gargammel) |
-c | --coverage | float | Coverage | - (mandatory) |
fragSim (gargammel) |
-L | --fragment-length | int | size of dna fragments created by gargammel | - | fragSim (gargammel) |
-F | / | path string | file containing the fragment size distribution | - | fragSim (gargammel) |
-o | --output-dir | path string | The output directory path | ./ |
- |
-g | / | path string | path to pipeline directory | - | - |
-w | --work-dir | path string | move the working directory containing all the temporary files(for example in tmpfs directory) | ./ |
- |
-r | --rmna | boolean | Remove columns containing missing values | false | - |
-z | --natozero | boolean | Switching missing values to zero (speed up the vcf merge) | false | - |
-a | --absolute | boolean | Use absolute positions | Use relative | - |
-n | --missmatch | integer / float | int : number of missmatch allowed during mapping float : f |
1 | mapping (bwa) |
-D | --diploid | boolean | Work on diploid genome (012 matrix) | Work on haploid | variant calling (bcftools call) |
-R | --padR | integer | Add extra sequence (with parameter length) at the extrimities of the sequence to avoid mapping errors | 0 | - |
-Q | --minQ | integer | Includes only sites with Quality value above this threshold (when calling the variants) | 20 | filter after variant calling |
-d | --damage | v,l,d,s (4 float) |
Deamination simulation using the Briggs et al. 2007 model (parameters are define under this table) | - | deamSim (gargammel) |
-m | --mapdamage | path string | Deamination simulation using mapDamage | - | deamSim (gargammel) |
-s | --sequencing-error | integer (between -93 and 93) | Change the sequencing error rate by a factor of 1/(10^([factor]/10)) (*) | 0 | ART illumina |
-p | --platform | string | Change the Illumina platfrom used, options are : GA2 - GenomeAnalyzer II, HS20 - HiSeq 2000, HS25 - HiSeq 2500, HSXt - HiSeqX TruSeq, MSv1 - MiSeq v1, MSv3 - MiSeq v3 | HS25 | ART illumina |
-m | --mapdamage | path string | Deamination simulation using mapDamage | - | deamSim (gargammel) |
(*) Conscerning the sequencing error rate, please note that positive factor will decrease the rate of such errors and a negative one will increase it.
By default gargammel use HiSeq 2500 platform which have 0.1% error rate (you can change it with option -p).
So if you pass -10 as factor you will have 0.1 * 1/(10^([-10]/10)) = 1% sequencing error rate.
or if you pass -20 you will have 0.1 * 1/(10^([-20]/10)) = 10% sequencing error rate.
Note that with too high error rate you may not get a result due to impossible mapping.
Same for too small fragment size or too low coverage.
To simulate damage you can either use the Briggs et al. 2007 (-d) model which have this four parameters
v : nick frequency
l : length of overhanging ends (geometric parameter)
d : prob. of deamination of Cs in double-stranded parts
s : prob. of deamination of Cs in single-stranded parts
For in the Briggs et al. 2007 article, they estimate the Maximum likelihood for four features based on Neandertal DNA sequences (table 1). To use this parameters use -d 0.0097,0.68,0.024,0.36
Or use mapDamage (-m) by giving correspondant miscorporation file
Run the program
/!\ The present version of this pipeline is design tow work on a Slurm cluster
The pipeline correspond to the files in the "pipeline" directory
You may change the 2 first line of pipeline.sh to adapt it to your needs.
#SBATCH --ntasks=10
#SBATCH --mem-per-cpu=4G
In this configuration we will do the gargammel step (diagram 2) for 10 individuals in parallel (you want to have as many cpu as there are individual in your matrix), here each job take 2 Giga ram for a total of 20 Giga ram.
You also have to change the g_dir variable to match your pipeline directory.
g_dir="~/work/robustness_pipeline"
or with the -g option
You can now execute the program.
Basic Example:
sbatch pipeline.sh -c 20 -l 2000000 -f Expansion-015_00001_005.npz
(note : the -c, -f, -l options are mandatory)
By default the result will be in your robustness_pipeline directory and have the name "data_post_gargammel.npz"
An example of pipeline execution for various damages in parallel is shown in the "damage_in_parallel" folder.