You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Opening this issue to ask how the bcftools maintainers feel about populating novel allele information from the <NON_REF> allele when merging multiple gVCFs?
Downstream tools (the Sentieon GVCFtyper and GATK's GenotypeGVCFs) expect the FORMAT/PL field of the merged gVCF to be fully-populated, and these missing fields result in dropped variants. Unlike the more general case, where we don't know the values for the new terms, these fields can be copied from the <NON_REF> allele for GATK-style gVCFs.
Here is a more detailed explanation with an example:
I can merge the two gvcfs using bcftools merge --gvcf. During the merge, bcftools conservatively chooses to output missing information for new alleles/genotypes at multi-allelic sites in the AD and PL fields:
This missing information causes these variants to be skipped during later joint genotyping with the GATK's GenotypeGVCFs:
$ gatk GenotypeGVCFs --allow-old-rms-mapping-quality-annotation-data -L chr1:1768969-1770969 \
-R $ref -V merged.g.vcf.gz -O merged_gatk.vcf.gz
Using GATK jar /<path>/gatk-package-4.2.6.1-local.jar
...
13:29:32.115 WARN MinimalGenotypingEngine - No genotype contained sufficient data to recalculate site and allele qualities. Site will be skipped at location chr1:1769969
...
$ zcat merged_gatk.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 408.45 . AC=2;AF=0.500;AN=4;DP=17;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=101.00;QD=25.53;RAW_MQ=61200.00;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 0/0:11,0:11:27
Or with Sentieon's GVCFtyper:
$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
-v merged.g.vcf.gz merged_sentieon_subset.vcf.gz
$ zcat merged_sentieon_subset.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 388.16 . AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.
These tools expect the information for all genotypes in the FORMAT/PL field to be populated from the <NON_REF> allele, and the joint calling output from the merged gVCF matches the joint call output from the individual gVCFs if this information is filled-in.
$ bcftools view -h merged.g.vcf.gz > merged_mod.g.vcf
$ bcftools view -H -r chr1:1768969-1770969 merged.g.vcf.gz | \
sed 's|GT:AD:DP:GQ:PL:SB 1/1:0,9,0,.:9:27:405,27,0,405,27,405,.,.,.,.:0,0,4,5 3/3:0,.,0,4:4:12:181,.,.,181,.,181,12,.,12,0:0,0,3,1|GT:AD:DP:GQ:PL:SB 1/1:0,9,0,0:9:27:405,27,0,405,27,405,405,27,405,405:0,0,4,5 3/3:0,0,0,4:4:12:181,181,181,181,181,181,12,12,12,0:0,0,3,1|' \
>> merged_mod.g.vcf
$ bcftools view -o merged_mod.g.vcf.gz -O z merged_mod.g.vcf
$ bcftools index -t merged_mod.g.vcf.gz
$ sentieon driver --interval chr1:1768969-1770969 -r $ref --algo GVCFtyper \
-v merged_mod.g.vcf.gz merged_sentieon_subset_mod.vcf.gz
$ zcat merged_sentieon_subset_mod.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG001 HG002
chr1 1769737 . GA G 388.16 . AC=2;AF=1;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1;MQ=60.00;QD=24.26;SOR=0.941 GT:AD:DP:GQ:PL 1/1:0,16:16:48:424,48,0 ./.:11,0:11:.:.
chr1 1769969 . CAAAACAAAAACA CAAAACA,C 550.30 . AC=2,2;AF=0.5,0.5;AN=4;DP=18;ExcessHet=3.0103;FS=0.000;MLEAC=2,2;MLEAF=0.5,0.5;MQ=46.90;QD=42.33;SOR=0.836 GT:AD:DP:GQ:PL 1/1:0,9,0:9:27:405,27,0,405,27,405 2/2:0,0,4:4:12:181,181,181,12,12,0
The text was updated successfully, but these errors were encountered:
Thanks, Don ! This will reduce the memory footprint for JointGenotyping / GVCFtyper quite a bit as less .tbi indices have to be loaded into memory, too. Would be great if this will be added.
Thank you for the issue. You are correct, the unknown allele should be used. This is now improved, the program should now do the expected thing. Also a new option -M, --missing-rules was added to give more control with other tags, but you probably are not going to need that.
Opening this issue to ask how the bcftools maintainers feel about populating novel allele information from the
<NON_REF>
allele when merging multiple gVCFs?Downstream tools (the Sentieon GVCFtyper and GATK's GenotypeGVCFs) expect the FORMAT/PL field of the merged gVCF to be fully-populated, and these missing fields result in dropped variants. Unlike the more general case, where we don't know the values for the new terms, these fields can be copied from the
<NON_REF>
allele for GATK-style gVCFs.Here is a more detailed explanation with an example:
I have two gVCFs with the following records.
sample1.g.vcf.gz:
sample2.g.vcf.gz:
I can merge the two gvcfs using
bcftools merge --gvcf
. During the merge, bcftools conservatively chooses to output missing information for new alleles/genotypes at multi-allelic sites in the AD and PL fields:This missing information causes these variants to be skipped during later joint genotyping with the GATK's GenotypeGVCFs:
Or with Sentieon's GVCFtyper:
These tools expect the information for all genotypes in the FORMAT/PL field to be populated from the <NON_REF> allele, and the joint calling output from the merged gVCF matches the joint call output from the individual gVCFs if this information is filled-in.
The text was updated successfully, but these errors were encountered: