Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Populate new allele-specific annotations from the <NON_REF> allele in bcftools merge #1888

Closed
DonFreed opened this issue Mar 15, 2023 · 3 comments

Comments

@DonFreed
Copy link

Opening this issue to ask how the bcftools maintainers feel about populating novel allele information from the <NON_REF> allele when merging multiple gVCFs?

Downstream tools (the Sentieon GVCFtyper and GATK's GenotypeGVCFs) expect the FORMAT/PL field of the merged gVCF to be fully-populated, and these missing fields result in dropped variants. Unlike the more general case, where we don't know the values for the new terms, these fields can be copied from the <NON_REF> allele for GATK-style gVCFs.

Here is a more detailed explanation with an example:

I have two gVCFs with the following records.

sample1.g.vcf.gz:

chr1    1769963 .       A       <NON_REF>       .       .       END=1769968     GT:DP:GQ:MIN_DP:PL      0/0:11:3:11:0,3,45
chr1    1769969 .       CAAAACA C,<NON_REF>     367.74  .       DP=11;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=39600.00      GT:AD:DP:GQ:PL:SB     1/1:0,9,0:9:27:405,27,0,405,27,405:0,0,4,5
chr1    1769976 .       A       <NON_REF>       .       .       END=1769976     GT:DP:GQ:MIN_DP:PL      0/0:10:0:10:0,0,0

sample2.g.vcf.gz:

chr1    1769968 .       T       <NON_REF>       .       .       END=1769968     GT:DP:GQ:MIN_DP:PL      0/0:9:18:9:0,18,270
chr1    1769969 .       CAAAACAAAAACA   C,<NON_REF>     144.00  .       DP=7;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=25200.00       GT:AD:DP:GQ:PL:SB      1/1:0,4,0:4:12:181,12,0,181,12,181:0,0,3,1
chr1    1769982 .       A       <NON_REF>       .       .       END=1769982     GT:DP:GQ:MIN_DP:PL      0/0:7:0:7:0,0,0

I can merge the two gvcfs using bcftools merge --gvcf. During the merge, bcftools conservatively chooses to output missing information for new alleles/genotypes at multi-allelic sites in the AD and PL fields:

chr1    1769968 .       T       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:11:3:11:0,3,45      0/0:9:18:9:0,18,270
chr1    1769969 .       CAAAACAAAAACA   CAAAACA,<NON_REF>,C     367.74  .       ExcessHet=3.0103;RAW_MQ=39600;DP=18;MLEAC=2,0,2;MLEAF=1,0,1     GT:AD:DP:GQ:PL:SB      1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5    3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1
chr1    1769976 .       A       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:10:0:10:0,0,0       ./.:.:.:.:.

This missing information causes these variants to be skipped during later joint genotyping with the GATK's GenotypeGVCFs:

$ gatk GenotypeGVCFs --allow-old-rms-mapping-quality-annotation-data -L chr1:1768969-1770969 \
  -R $ref -V merged.g.vcf.gz -O merged_gatk.vcf.gz
Using GATK jar /<path>/gatk-package-4.2.6.1-local.jar
...
13:29:32.115 WARN  MinimalGenotypingEngine - No genotype contained sufficient data to recalculate site and allele qualities. Site will be skipped at location chr1:1769969
...
$ zcat merged_gatk.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG001   HG002
chr1    1769737 .       GA      G       408.45  .      AC=2;AF=0.500;AN=4;DP=17;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=101.00;QD=25.53;RAW_MQ=61200.00;SOR=0.941     GT:AD:DP:GQ:PL  1/1:0,16:16:48:424,48,0 0/0:11,0:11:27

Or with Sentieon's GVCFtyper:

$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
  -v merged.g.vcf.gz merged_sentieon_subset.vcf.gz
$ zcat merged_sentieon_subset.vcf.gz
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG001   HG002
chr1    1769737 .       GA      G       388.16  .       AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941     GT:AD:DP:GQ:PL  1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.

These tools expect the information for all genotypes in the FORMAT/PL field to be populated from the <NON_REF> allele, and the joint calling output from the merged gVCF matches the joint call output from the individual gVCFs if this information is filled-in.

$ bcftools view -h merged.g.vcf.gz > merged_mod.g.vcf
$ bcftools view -H -r chr1:1768969-1770969 merged.g.vcf.gz | \
    sed 's|GT:AD:DP:GQ:PL:SB    1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5    3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1|GT:AD:DP:GQ:PL:SB  1/1:0,9,0,0:9:27:405,27,0,405,27,405,405,27,405,405:0,0,4,5 3/3:0,0,0,4:4:12:181,181,181,181,181,181,12,12,12,0:0,0,3,1|' \
    >> merged_mod.g.vcf
$ bcftools view -o merged_mod.g.vcf.gz -O z merged_mod.g.vcf
$ bcftools index -t merged_mod.g.vcf.gz

$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
    -v merged_mod.g.vcf.gz merged_sentieon_subset_mod.vcf.gz

$ zcat merged_sentieon_subset_mod.vcf.gz
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG001	HG002
chr1	1769737	.	GA	G	388.16	.	AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL	1/1:0,16:16:48:424,48,0	./.:11,0:11:.:.
chr1	1769969	.	CAAAACAAAAACA	CAAAACA,C	550.30	.	AC=2,2;AF=0.5,0.5;AN=4;DP=18;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.5,0.5;MQ=46.90;QD=42.33;SOR=0.836	GT:AD:DP:GQ:PL	1/1:0,9,0:9:27:405,27,0,405,27,405	2/2:0,0,4:4:12:181,181,181,12,12,0
@tweep
Copy link

tweep commented Mar 17, 2023

Thanks, Don ! This will reduce the memory footprint for JointGenotyping / GVCFtyper quite a bit as less .tbi indices have to be loaded into memory, too. Would be great if this will be added.

@pd3 pd3 closed this as completed in f2d2fdf Mar 22, 2023
@pd3
Copy link
Member

pd3 commented Mar 22, 2023

Thank you for the issue. You are correct, the unknown allele should be used. This is now improved, the program should now do the expected thing. Also a new option -M, --missing-rules was added to give more control with other tags, but you probably are not going to need that.

@DonFreed
Copy link
Author

Thank you for the quick implementation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants