atomize option in bcftools norm introduces funny characters in the INFO column #1472

Closed
@ramprasadn


I have noticed that bcftools norm ends up introducing strange characters in the INFO column when the atomize option is turned on.

Here is a sample of entries from the unprocessed VCF:

1       1284490 rs150789461     G       A       779.31  PASS    AC=6;AF=1.00;AN=6;DB;DP=20;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ     GT:AD:DP:GQ:PL  1/1:0,6:6:18:269,18,0   1/1:0,7:7:21:300,21,0   1/1:0,5:5:15:224,15,0
1       1284878 rs111643738     A       C       2170.73 PASS    AC=6;AF=1.00;AN=6;DB;DP=49;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=28.20;SOR=1.473;VQSLOD=6.47;culprit=MQ      GT:AD:DP:GQ:PGT:PID:PL:PS       1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841 1|1:0,11:11:33:1|1:1284841_A_G:474,33,0:1284841 1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841
1       1285358 rs11587525      A       G       1089.94 PASS    AC=4;AF=1.00;AN=4;BaseQRankSum=1.65;DB;DP=59;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=58.97;MQRankSum=-1.260e-01;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=26.00;ReadPosRankSum=-1.645e+00;SOR=0.648;VQSLOD=-3.275e-01;culprit=MQ       GT:AD:DP:GQ:PGT:PID:PL:PS       1/1:0,21:21:63:.:.:851,63,0     ./.:27,0:27:.:.:.:0,0,0 1|1:1,8:10:1:1|1:1285358_A_G:250,1,0:1285358
1       1285381 .       G       C       178.23  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-1.570e-01;NEGATIVE_TRAIN_SITE;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-5.838e-01;culprit=MQ        GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:13,0:13:4:.:.:0,4,392       0/0:20,0:20:0:.:.:0,0,279       0|1:1,6:8:5:0|1:1285358_A_G:185,0,5:1285358
1       1286821 rs61766214      G       C       585.35  PASS    AC=2;AF=1.00;AN=2;DB;DP=60;ExcessHet=3.0103;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ  GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:23,0:23:.:.:.:0,0,0 ./.:21,0:21:.:.:.:0,0,0 1|1:0,16:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753

After processing with the following bcftools command, the same entries look as shown below (note the fields DB, POSITIVE_TRAIN_SITE, and NEGATIVE_TRAIN_SITE):

bcftools norm --output ./justhusky_norm.vcf --output-type v --atomize --fasta-ref grch37_hs-.fasta ./justhusky.SNV.vcf

1       1284490 rs150789461     G       A       779.31  PASS    AC=6;AF=1;AN=6;DB=^F;DP=20;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ       GT:AD:DP:GQ:PL  1/1:0,0:6:18:269,18,0   1/1:0,0:7:21:300,21,0   1/1:0,0:5:15:224,15,0
1       1284878 rs111643738     A       C       2170.73 PASS    AC=6;AF=1;AN=6;DB=^F;DP=49;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=28.2;SOR=1.473;VQSLOD=6.47;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS       1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841  1|1:0,0:11:33:1|1:1284841_A_G:474,33,0:1284841  1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841
1       1285358 rs11587525      A       G       1089.94 PASS    AC=4;AF=1;AN=4;BaseQRankSum=1.65;DB=33<D3>?F<9B>^S;DP=59;ExcessHet=3.0103;FS=0;MLEAC=4;MLEAF=1;MQ=58.97;MQRankSum=-0.126;NEGATIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;POSITIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;QD=26;ReadPosRankSum=-1.645;SOR=0.648;VQSLOD=-0.3275;culprit=MQ      GT:AD:DP:GQ:PGT:PID:PL:PS       1/1:0,0:21:63:.:.:851,63,0:.    ./.:27,27:27:.:.:.:0,0,0:.      1|1:1,1:10:1:1|1:1285358_A_G:250,1,0:1285358
1       1285381 .       G       C       178.23  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-0.157;NEGATIVE_TRAIN_SITE=<9C><C4> <BE>;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-0.5838;culprit=MQ     GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:13,13:13:4:.:.:0,4,392:.    0/0:20,20:20:0:.:.:0,0,279:.    0|1:1,1:8:5:0|1:1285358_A_G:185,0,5:1285358
1       1286821 rs61766214      G       C       585.35  PASS    AC=2;AF=1;AN=2;DB=^B;DP=60;ExcessHet=3.0103;FS=0;MLEAC=3;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ    GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:23,23:23:.:.:.:0,0,0:.      ./.:21,21:21:.:.:.:0,0,0:.      1|1:0,0:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753

I have also noticed two things about this behaviour:

  1. It seems to affect only fields that are declared as boolean (Type=Flag) in the header:
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
  2. It is not consistent throughout the file: some of the boolean entries look fine after processing. For instance, the DB field at position 1286821 is corrupted, but POSITIVE_TRAIN_SITE at the same position is not.
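To make the two observations above easier to check across a whole file, here is a minimal Python sketch that lists the Type=Flag keys carrying an unexpected value. The function names `flag_ids` and `flags_with_values` are made up for illustration and are not part of bcftools. The sketch assumes the stray bytes do not themselves contain a `;`, which would split the entry differently.

```python
import re

# Matches header lines such as ##INFO=<ID=DB,Number=0,Type=Flag,...>
FLAG_HEADER = re.compile(r'^##INFO=<ID=([^,>]+),.*?\bType=Flag\b')


def flag_ids(header_lines):
    """Collect the INFO IDs declared as Type=Flag in the VCF header."""
    ids = set()
    for line in header_lines:
        match = FLAG_HEADER.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def flags_with_values(info, flags):
    """Return the Flag keys in an INFO string that carry a '=value' part."""
    bad = []
    for entry in info.split(";"):
        key, sep, _value = entry.partition("=")
        if sep and key in flags:
            bad.append(key)
    return bad


header = [
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">',
    '##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">',
]
flags = flag_ids(header)
# Shortened INFO string modelled on the record at position 1286821
print(flags_with_values("AC=2;DB=\x02;DP=60;POSITIVE_TRAIN_SITE", flags))
# prints ['DB']
```

When reading a real file, opening it with `encoding="latin-1"` (an assumption about how to tolerate the stray bytes) avoids decode errors on the corrupted values.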

Is there a workaround to fix this? Any help would be appreciated.
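One possible stopgap, sketched below, is to drop the spurious value from Flag keys after the fact. This is only an illustration under assumptions, not a bcftools feature: `strip_flag_values` is a hypothetical helper, the set of Flag IDs is assumed to be taken from the header, and the stray bytes are assumed not to contain `;`. It does not address any other difference between the input and output records.

```python
def strip_flag_values(info, flags):
    """Rewrite 'KEY=garbage' as 'KEY' for every key listed in flags."""
    fixed = []
    for entry in info.split(";"):
        key, sep, _value = entry.partition("=")
        # Only touch entries whose key is a declared Flag and that have a value
        fixed.append(key if sep and key in flags else entry)
    return ";".join(fixed)


flag_keys = {"DB", "NEGATIVE_TRAIN_SITE", "POSITIVE_TRAIN_SITE"}
print(strip_flag_values("AC=2;DB=\x02;DP=60;POSITIVE_TRAIN_SITE", flag_keys))
# prints AC=2;DB;DP=60;POSITIVE_TRAIN_SITE
```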

Thanks,
