Description
Thanks for supporting these tools.
bcftools 1.16, file formats are VCF 4.2.
I have merged 1379 gcvf files using the following pipeline:
$ bcftools merge --merge none -f PASS,. --gvcf $REFGENOME -i DP:sum,MQ:avg,DQUAL:min -r chr22 --file-list $gvcflistfile -Ou | \
$ bcftools norm -m -any -Ou | \
$ bcftools view -e 'ALT="<NON_REF>"' -Ou | \
$ bcftools annotate -x FORMAT $THREADSOPT -Ob > $outfile
These are the top few variants in the chr22 run and first 2 samples:
POS REF ALT QUAL FILTER FORMAT 77900074 77900324
10510105 T A 8.33 PASS GT ./. ./.
10510113 C T 17.93 PASS GT ./. ./.
10510212 A T 13.59 PASS GT ./. ./.
10510258 C T 41.37 PASS GT ./. ./.
10510341 C G 6.8 PASS GT ./. 0/0
10510352 AT A 15.8 PASS GT ./. ./.
10510352 ATAAT A 10.95 PASS GT ./. ./.
10510355 AT A 12.48 PASS GT ./. ./.
10510356 T A 46.64 PASS GT ./. 1/1
if we focus on the 2nd sample 77900324, we see that first 4 GTs are missing ./. while the 5th at 10510341 is called 0/0. I'm trying to rationalize why. One of the key issues in multiplex gvcf merging is to utilize confidence in deciding between ./. and 0/0 for rare variants at positions that were not variable in the individual gvcfs, rather than assuming they are all 0/0 because they weren't variable and therefore are more likely to ref homozyg. Note also that I have chosen this example carefully to remove the multi-allelic issues - the positions I am referring to were unique and were SNV biallelic (but I note the split alleles at 10510352).
The block for the first merged variant at 10510105 in the individual 77900324 gvcf is:
chr22 10510083 . A <NON_REF> . PASS END=10510105 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:6,0:6:12:4:0,12,141:0,12,141:6,0
ie, this PASSED in the individual gcvf with a GQ of 12, 0/0. Similarly the next 3.
The first called homozygous REF variant in this sample was at 10510341 corresponding to this block:
chr22 10510341 . C <NON_REF> . PASS END=10510341 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:23,1:24:57:24:0,57,933:0,57,255:22,0
This PASSED in the gvcf with a GQ of 57, 0/0.
I would have expected all of the top 5 variants to be called 0/0 in the merged bcf file. Is merge applying hidden quality/depth threshold ?
Thanks for your insights.
(I have reproduced the first lines of the gvcf for the sample below for reference:)
chr22 10510000 . N <NON_REF> . LowDepth;LowGQ END=10510000 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT ./.:0,1:1:0:1:0,0,0:42,3,0:0,0
chr22 10510001 . G <NON_REF> . LowDepth END=10510037 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:1,0:1:3:1:0,3,15:0,3,15:1,0
chr22 10510038 . T <NON_REF> . LowDepth END=10510082 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:1,0:1:3:1:0,3,15:0,3,15:3,0
chr22 10510083 . A <NON_REF> . PASS END=10510105 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:6,0:6:12:4:0,12,141:0,12,141:6,0
chr22 10510106 . T <NON_REF> . PASS END=10510117 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:8,0:8:21:7:0,21,254:0,21,254:9,0
chr22 10510118 . A <NON_REF> . PASS END=10510175 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:12,0:12:30:10:0,30,391:0,30,255:13,0
chr22 10510176 . T <NON_REF> . PASS END=10510249 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:16,0:16:42:14:0,42,552:0,42,255:19,0
chr22 10510250 . C <NON_REF> . PASS END=10510250 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:16,1:17:36:17:0,36,641:0,36,255:14,0
chr22 10510251 . T <NON_REF> . PASS END=10510257 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:17,0:17:42:16:0,42,630:0,48,255:15,0
chr22 10510258 . C <NON_REF> . PASS END=10510262 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:16,0:16:39:15:0,39,585:0,45,255:13,0
chr22 10510263 . A <NON_REF> . PASS END=10510265 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:16,0:16:42:16:0,42,630:0,48,255:14,0
chr22 10510266 . T <NON_REF> . PASS END=10510266 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:14,2:16:17:16:0,17,542:0,17,255:12,0
chr22 10510267 . T <NON_REF> . PASS END=10510267 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:16,0:16:42:16:0,42,630:0,48,255:14,0
chr22 10510268 . G <NON_REF> . PASS END=10510275 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:14,0:14:39:14:0,39,585:0,42,255:13,0
chr22 10510276 . G <NON_REF> . PASS END=10510303 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:17,0:17:45:17:0,45,675:0,51,255:19,0
chr22 10510304 . G <NON_REF> . PASS END=10510306 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:20,0:20:60:20:0,60,823:0,60,255:20,0
chr22 10510307 . G <NON_REF> . PASS END=10510311 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:20,0:20:57:20:0,57,855:0,60,255:19,0
chr22 10510312 . C <NON_REF> . PASS END=10510314 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22 10510315 . G <NON_REF> . PASS END=10510318 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:48:20:0,48,782:0,48,255:19,0
chr22 10510319 . G <NON_REF> . PASS END=10510319 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22 10510320 . G <NON_REF> . PASS END=10510327 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:57:20:0,57,855:0,60,255:19,0
chr22 10510328 . G <NON_REF> . PASS END=10510330 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:60:21:0,60,900:0,63,255:20,0
chr22 10510331 . G <NON_REF> . PASS END=10510331 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:19,1:20:45:20:0,45,752:0,45,255:19,0
chr22 10510332 . A <NON_REF> . PASS END=10510334 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,0:21:60:20:0,60,797:0,60,255:22,0
chr22 10510335 . C <NON_REF> . PASS END=10510335 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:21,1:22:51:22:0,51,862:0,51,255:21,0
chr22 10510336 . C <NON_REF> . PASS END=10510340 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:23,0:23:66:23:0,66,937:0,69,255:24,0
chr22 10510341 . C <NON_REF> . PASS END=10510341 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:23,1:24:57:24:0,57,933:0,57,255:22,0
chr22 10510342 . T <NON_REF> . PASS END=10510344 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:24,0:24:60:24:0,60,900:0,72,255:20,0
chr22 10510345 . A <NON_REF> . PASS END=10510345 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:23,0:23:57:23:0,57,855:0,69,255:19,0