Bulked-Segregant Analysis using vcf file with or without parents(you can call it vcfbsa), you don't need to construct parent's reference to call snp and don't need to polarize alleles first to make direction of delta-snpindex meaningful.
It's memory effecient and very fast, you can complete the analysis in about one hour with only < 4Gb RAM when there are ~10 millions markers in your vcf file.
The most memory cosuming step is the slidding windows step, because the slidewindow.pl script need to read all avalible sites in index/ed/gst/fet file into RAM.
Can also be used for mutmap analysis(just give the same name for -b1 and -p1 when calculate index).
You need to install these dependence perl modules from cpan first
use Cwd
use Getopt::Long
use Data::Dumper
use File::Basename
use Text::NSP::Measures::2D::Fisher::twotailed
use FindBin
R (only >= v3.60 were tested) is needed, and gtools package is needed if you want to use point_line_plot.pl script.
This script can be used to calculate snp-index/delta snp-index(index), g-statistic(gst), euclidean distance(ed) and Fisher-exact test(fet) and some other methods (coming soon).
For bi-parents populations (f2, ril/dh, bc), only aaxbb sites are used when two parents are avalible, only aax?? or ??xbb sites are used when only one parent is avalible, and all sites are used when two parents are absence. For outcross population (f1/cp), only lmxll nnxnp and hkxhk sites are used when two parents are avalible, and all sites are used if one parent is absence.
For delta snp-index, the direction is meaningful for bi-parents populations (f2, ril/dh, bc) only at least one parent is used. And the direction is not meaningful for outcross population (f1/cp) in all case.
It is wise to set -ab T when the direction of delta snp-index is not meaningful.
For mutmap, please give the same sample name for -b1 and -p1(wild parent), the index of bulk1(wild parent actually) will not be calculated.
Reference:
SNP-Index and Delta SNP-index, Fekih R, Takagi H, Tamiru M, et al. MutMap+: genetic mapping and mutant identification without crossing in rice[J]. PloS one, 2013, 8(7): e68529.
Euclidean distance, Hill J T, Demarest B L, Bisgrove B W, et al. MMAPPR: mutation mapping analysis pipeline for pooled RNA-seq[J]. Genome research, 2013, 23(4): 687-697.
perl simulation_v2.pl
Function: simulation and get delta snpindex confidence intervals
for given population type(f2, ril, bc1), depth and bulk size.
-k <str> output prefix [force]
-od <dir> output directory [force]
-pt <str> pop type (f2, ril, bc) [force]
-sd <dir> shell dir [od]
-s1 <str> bulk1(dominant/wild) size [30]
-s2 <str> bulk2(recessive/mutant) size [30]
-rp <int> replications number [10000]
-ci <int> confidence intervals [95,99]
-md <int> min depth [10]
-xd <int> max depth [500]
-mi <float> min snp index [0]
-rb <bin> Rscript bin [/Bio/bin/Rscript-3.6.0]
-h help document
perl simulation_v2.pl -k bsa -od bsa -pt ril -s1 30 -s2 30 -rp 10000 -ci 95,99 -md 10 -xd 500 -mi 0 -rb /path/to/Rscript
*.cisim.xls
perl bsaindex.pl
Function: calculate (snpindex, g-statistic, ed, fisher exact test(fet))
of two bulks using vcf file with/without parent(s)
-v <file> vcf file [force]
-k <str> output prefix [force]
-od <dir> output directory [force]
-b1 <str> bulk1(dominant/wild) [force]
-b2 <str> bulk2(recessive/mutant) [force]
-p1 <str> dominant/wild parent [optional]
-p2 <str> recessive/mutant parent [optional]
-pt <str> pop type (f1/cp, f2, ril, bc) [ril]
-md <int> min depth [10]
-xd <int> max depth [500]
-pd <int> parent min depth [5]
-mt <str> method(index,gst,ed,fet,all) [all]
-ep <int> ed power [5]
-mi <float> min snp index [0.2]
-sf <file> simulation result file [optional]
-sd <str/int> min,max,other number in sf [max]
-id <T/F> use(T) indel or not(F) [F]
-ab <T/F> output absolutely delta(T) [F]
-rb <bin> Rscript bin path [Rscript]
-h help document
perl bsaindex.pl -v snp.indel.vcf -k snp -od bsa -b1 bulk1 -b2 bulk2 -p1 parent1 -p2 parent2 -pt ril -ep 2 -mt all -sf bsa/snp.cisim.xls
if you do not care about direction of delta snpindex, -sf can be ignored and set -ab T
├── bsa.all.xls ## all index file
├── bsa.basic.xls ## used sites informations
├── bsa.drop.vcf ## dropped sites
├── bsa.ed.xls ## ed results
├── bsa.fet.xls ## fisher exact test results
├── bsa.gst.xls ## g-statistic results
└── bsa.index.xls ## snp-index results
perl slidewindow.pl
Function: sliding window calculate smooth(mean) of snp index
-i <file> snpindex file [force]
-k <str> output prefix [force]
-o <dir> output directory [force]
-f <file> chromosome fai file [force]
-w <int> window length(kb) [1000]
-s <int> step length(kb) [100]
-cp <str> chr pos columns [1,2]
-cv <str> values ( index && ci ) columns [3]
-ms <int> min snp number in a window [10]
-lg <T/F> get -log10 value of mean value [F]
-h help document
perl slidewindow.pl -i bsa/bsa.index.xls -k bsa.index -o bsa/ -f har.fa.fai -w 2000 -s 10 -cp 1,2 -cv 3-13 -ms 10
perl slidewindow.pl -i bsa/bsa.ed.xls -k bsa.ed -o bsa/ -f har.fa.fai -w 2000 -s 10 -cp 1,2 -cv 3 -ms 10
perl slidewindow.pl -i bsa/bsa.gst.xls -k bsa.gst -o bsa/ -f har.fa.fai -w 2000 -s 10 -cp 1,2 -cv 3 -ms 10
perl slidewindow.pl -i bsa/bsa.fet.xls -k bsa.fet -o bsa/ -f har.fa.fai -w 2000 -s 10 -cp 1,2 -cv 3,5 -ms 10 -lg T
fai file can be obtained from samtools fasta index or just add chromosome length manully:
chr01 18757413
chr02 15481669
chr03 14254934
chr04 13967672
chr05 13931063
chr06 13863567
chr07 13341776
chr08 13100779
chr09 13064641
chr10 12996920
├── bsa.ed.mean.xls ## ed sliding window results
├── bsa.fet.mean.xls ## fisher exact test sliding window results
├── bsa.gst.mean.xls ## g-statistic sliding window results
└── bsa.index.mean.xls ## snp-index sliding window results
bsaplot.r, plot.scanone.r (copy from R/qtl), and point_line_plot.pl must be in the same directory.
perl point_line_plot.pl
Don't be afraid of seeing so many options, just some of them are required!
Function: point and/or line plot.
-k <str> output prefix [force]
-f <file> genome fai file [force]
-od <dir> output directory [force]
-sd <dir> shell directory [od]
-ch <str> chr name(s) use for plotting [optional]
-cm <int> chr start position [1]
-lf <file> line plot file [NULL]
-lp <ints> line file chr,pos columns [1,2]
-lv <ints> line file values columns [3,4,5,6,7]
-lc <color> line plot colors (len(lc)==len(lv)) [black,blue,red,blue,red]
-pf <file> point plot file [NULL]
-pp <ints> point plot chr,pos columns [1,2]
-pv <ints> point plot values columns [3]
-pc <colors> point colors [#33A02C,#FF7F00]
-hl <float> horizon line position [0]
-hc <color> horizon line color [grey]
-ht <int> horizon line type(R) [1]
-cg <int> gap between chromosomes [0]
-ym <float> ymin [1*-1]
-yx <float> ymax [1]
-xl <str> xlab [Chromosome]
-yl <str> ylab [expression(Delta(SNPindex))]
-ca <float> axis cex [1.5]
-cl <float> labels cex [2]
-cp <float> point cex [0.3]
-pch <int> point type(R) [19]
-lwd <float> line width [2.5]
-las <1/2> parallele(1)/vertical(2) x-axis lab [1]
-bdc <color> band color(for distinguish chrs) [grey90]
-alt <T/F> alt chr labels(avoid overlaping) [F]
-width <int> plot width in inches [15]
-height <int> plot height in inches [6]
-res <int> png resolution ratio [300]
-rb <bin> Rscript bin [/Bio/bin/Rscript-3.6.0]
-h help document
perl point_line_plot.pl -k bsa.index -f har.fa.fai -od bsa/ -lf bsa/bsa.index.mean.xls -lp 1,4 -lv 7,8,9,12,13 -lc "black,#2121D9,#2121D9,#DF0101,#DF0101" -pf bsa/bsa.index.xls -pp 1,2 -pv 5 -ym 1*-1 -yx 1 -xl "" -yl "expression(Delta(SNPindex))" -las 2
perl point_line_plot.pl -k bulk1.index -f har.fa.fai -od bsa/ -lf bsa/bsa.index.mean.xls -lp 1,4 -lv 5,10,14 -lc "black,#2121D9,#DF0101" -pf bsa/bsa.index.xls -pp 1,2 -pv 3 -ym NULL -yx 1 -xl "" -yl "SNPindex(bulk1)" -las 2
perl point_line_plot.pl -k bulk2.index -f har.fa.fai -od bsa/ -lf bsa/bsa.index.mean.xls -lp 1,4 -lv 6,11,15 -lc "black,#2121D9,#DF0101" -pf bsa/bsa.index.xls -pp 1,2 -pv 4 -ym NULL -yx 1 -xl "" -yl "SNPindex(bulk2)" -las 2
perl point_line_plot.pl -k bsa.ed -f har.fa.fai -od bsa/ -lf bsa/bsa.ed.mean.xls -lp 1,4 -lv 5 -lc "black" -pf bsa/bsa.ed.xls -pp 1,2 -pv 3 -hl 0.149,0.753 -hc "#2121D9,#DF0101" -ht 1,1 -ym 0 -yx NULL -xl "" -yl "ED2" -las 2
perl point_line_plot.pl -k bsa.gst -f har.fa.fai -od bsa/ -lf bsa/bsa.gst.mean.xls -lp 1,4 -lv 5 -lc "black" -pf bsa/bsa.gst.xls -pp 1,2 -pv 3 -hl 4.786,6.639 -hc "#2121D9,#DF0101" -ht 1,1 -ym 0 -yx NULL -xl "" -yl "G" -las 2
perl point_line_plot.pl -k bsa.fet -f har.fa.fai -od bsa -lf bsa/bsa.fet.mean.xls -lp 1,4 -lv 6 -lc "black" -pf bsa/bsa.fet.xls -pp 1,2 -pv 4 -hl 1.976,2.833 -hc "#2121D9,#DF0101" -ht 1,1 -ym 0 -yx NULL -xl "" -yl "expression(-log[10](italic(p)))" -las 2
for -hl option, you need to calculate these values using quantile/percentile or other method first, or you just run qtl_region.pl first, and the corresponding value will be printed in the standard output.
fai file here only contain chr/scaffolds use for plotting.
perl qtl_region.pl
Function: get significant region
-i <str> input file [force]
-k <str> output prefix [force]
-o <dir> output directory [force]
-chr <int> chr column [force]
-value <int> value column [force]
-start <int> window start column [undef]
-end <int> window end column [undef]
-lower <int> lower bound column [undef]
-upper <int> upper bound column [undef]
-alpha <fload> alpha for p-value bf correction [undef]
-uthres <fload> upper bound hard threshold [undef]
-lthres <fload> lower bound hard threshold [undef]
-quant <int> percentile to defined threshold [undef]
-type <str> significant type smaller/larger/both [smaller]
-h help document
priority:
lower/upper > thres > alpha > quant
perl qtl_region.pl -i bsa/bsa.index.mean.xls -k bsa.index95 -o bsa/ -chr 1 -value 7 -start 2 -end 3 -lower 8 -upper 9
perl qtl_region.pl -i bsa/bsa.index.mean.xls -k bsa.index99 -o bsa/ -chr 1 -value 7 -start 2 -end 3 -lower 12 -upper 13
perl qtl_region.pl -i bsa/bsa.ed.mean.xls -k bsa.ed95 -o bsa/ -chr 1 -start 2 -end 3 -value 5 -quant 95 -type larger
perl qtl_region.pl -i bsa/bsa.ed.mean.xls -k bsa.ed99 -o bsa/ -chr 1 -start 2 -end 3 -value 5 -quant 99 -type larger
perl qtl_region.pl -i bsa/bsa.gst.mean.xls -k bsa.gst95 -o bsa/ -chr 1 -start 2 -end 3 -value 5 -quant 95 -type larger
perl qtl_region.pl -i bsa/bsa.gst.mean.xls -k bsa.gst99 -o bsa/ -chr 1 -start 2 -end 3 -value 5 -quant 99 -type larger
perl qtl_region.pl -i bsa/bsa.fet.mean.xls -k bsa.fet0.05 -o bsa/-chr 1 -start 2 -end 3 -value 7 -lthres 0.05 -type smaller
perl qtl_region.pl -i bsa/bsa.fet.mean.xls -k bsa.fet0.01 -o bsa/-chr 1 -start 2 -end 3 -value 7 -lthres 0.01 -type smaller
*.sig.xls ## possible qtl region of each index