Closed
Description
I have noticed that bcftools norm ends up introducing strange characters in the INFO column when the atomize option is turned on.
Here is a sample of entries from the unprocessed vcf,
1 1284490 rs150789461 G A 779.31 PASS AC=6;AF=1.00;AN=6;DB;DP=20;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ GT:AD:DP:GQ:PL 1/1:0,6:6:18:269,18,0 1/1:0,7:7:21:300,21,0 1/1:0,5:5:15:224,15,0
1 1284878 rs111643738 A C 2170.73 PASS AC=6;AF=1.00;AN=6;DB;DP=49;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=28.20;SOR=1.473;VQSLOD=6.47;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841 1|1:0,11:11:33:1|1:1284841_A_G:474,33,0:1284841 1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841
1 1285358 rs11587525 A G 1089.94 PASS AC=4;AF=1.00;AN=4;BaseQRankSum=1.65;DB;DP=59;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=58.97;MQRankSum=-1.260e-01;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=26.00;ReadPosRankSum=-1.645e+00;SOR=0.648;VQSLOD=-3.275e-01;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 1/1:0,21:21:63:.:.:851,63,0 ./.:27,0:27:.:.:.:0,0,0 1|1:1,8:10:1:1|1:1285358_A_G:250,1,0:1285358
1 1285381 . G C 178.23 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-1.570e-01;NEGATIVE_TRAIN_SITE;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-5.838e-01;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:13,0:13:4:.:.:0,4,392 0/0:20,0:20:0:.:.:0,0,279 0|1:1,6:8:5:0|1:1285358_A_G:185,0,5:1285358
1 1286821 rs61766214 G C 585.35 PASS AC=2;AF=1.00;AN=2;DB;DP=60;ExcessHet=3.0103;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS ./.:23,0:23:.:.:.:0,0,0 ./.:21,0:21:.:.:.:0,0,0 1|1:0,16:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753
Post-processing with bcftools, the same entries look like below (Note the fields DB, POSITIVE_TRAIN_SITE, and NEGATIVE_TRAIN_SITE)
bcftools norm --output ./justhusky_norm.vcf --output-type v --atomize --fasta-ref grch37_hs-.fasta ./justhusky.SNV.vcf
1 1284490 rs150789461 G A 779.31 PASS AC=6;AF=1;AN=6;DB=^F;DP=20;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ GT:AD:DP:GQ:PL 1/1:0,0:6:18:269,18,0 1/1:0,0:7:21:300,21,0 1/1:0,0:5:15:224,15,0
1 1284878 rs111643738 A C 2170.73 PASS AC=6;AF=1;AN=6;DB=^F;DP=49;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=28.2;SOR=1.473;VQSLOD=6.47;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841 1|1:0,0:11:33:1|1:1284841_A_G:474,33,0:1284841 1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841
1 1285358 rs11587525 A G 1089.94 PASS AC=4;AF=1;AN=4;BaseQRankSum=1.65;DB=33<D3>?F<9B>^S;DP=59;ExcessHet=3.0103;FS=0;MLEAC=4;MLEAF=1;MQ=58.97;MQRankSum=-0.126;NEGATIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;POSITIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;QD=26;ReadPosRankSum=-1.645;SOR=0.648;VQSLOD=-0.3275;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 1/1:0,0:21:63:.:.:851,63,0:. ./.:27,27:27:.:.:.:0,0,0:.
1|1:1,1:10:1:1|1:1285358_A_G:250,1,0:1285358
1 1285381 . G C 178.23 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-0.157;NEGATIVE_TRAIN_SITE=<9C><C4> <BE>;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-0.5838;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:13,13:13:4:.:.:0,4,392:. 0/0:20,20:20:0:.:.:0,0,279:. 0|1:1,1:8:5:0|1:1285358_A_G:185,0,5:1285358
1 1286821 rs61766214 G C 585.35 PASS AC=2;AF=1;AN=2;DB=^B;DP=60;ExcessHet=3.0103;FS=0;MLEAC=3;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS ./.:23,23:23:.:.:.:0,0,0:. ./.:21,21:21:.:.:.:0,0,0:. 1|1:0,0:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753
I have also noticed two things about this behaviour
- It seems to only affect fields that are described as boolean in the header.
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
- It is not consistent throughout the file as some of the boolean entries do look okay post processing. For instance, DB field at the position 1286821 looks funny, but POSITIVE_TRAIN_SITE at the same position looks okay.
Is there a workaround to fix this? Any help would be appreciated.
Thanks,