Hi,
I have a VCF file with the following content:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1 Samp2
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G,GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0,0:.:35 0/2:36,2,50:0.023,0.568:88
For the first sample, Samp1, the AF field in the FORMAT column is missing (.).
After running bcftools norm -m -any -f [reference], I got:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo1.norm.vcf.gz demo1.vcf.gz; Date=Tue Nov 15 15:58:25 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1 Samp2
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35 0/0:36,2:0.023:88
chr1 939398 . G GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0::35 0/1:36,50:0.568:88
The Samp2 output is as I expected.
But for Samp1, I expected both lines to have a missing value (.) for AF: the value was missing before the split, so it makes sense for it to be missing in both lines after the split. Instead, the first line has . while the second line has an empty AF field (0/0:73,0::35).
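To make the expectation concrete, here is a minimal Python sketch of how I would expect a Number=A value to be split across the new records. It is only an illustration of the expected behavior, not how bcftools implements it; the function name split_number_a is hypothetical.

```python
from typing import List


def split_number_a(value: str, n_alt: int) -> List[str]:
    """Split a Number=A FORMAT value into one value per ALT allele.

    Hypothetical helper: a missing value (".") is expected to stay
    missing in every record produced by the split.
    """
    if value == ".":
        return ["."] * n_alt
    parts = value.split(",")
    if len(parts) != n_alt:
        raise ValueError(f"expected {n_alt} values, got {len(parts)}")
    return parts


# Samp1: AF is missing before the split
print(split_number_a(".", 2))
# Samp2: AF has one value per ALT allele
print(split_number_a("0.023,0.568", 2))
```

Under this reading, Samp1 would carry . for AF in both split records, which matches what I get when Samp1 is the only sample in the file.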
The --force option didn't make any difference here.
However, when I ran the same command with only Samp1, I got the results I expected:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo2.norm.vcf.gz demo2.vcf.gz; Date=Tue Nov 15 15:58:32 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Samp1
chr1 939398 . GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA G 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35
chr1 939398 . G GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA 5075.01 . . GT:AD:AF:DP 0/0:73,0:.:35
I've experimented with various inputs and concluded that the issue occurs only when the field to be split is missing for some samples but present for others. I had no problem when all samples had values or when all samples were missing.
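In case it helps with reproducing this, below is a small Python sketch that scans tab-delimited VCF text for empty FORMAT subfields such as the :: in the Samp1 column above. The function name find_empty_format_values is hypothetical, and the sketch assumes an uncompressed VCF whose #CHROM header line precedes the records.

```python
def find_empty_format_values(vcf_lines):
    """Yield (chrom, pos, sample, key) for every empty FORMAT subfield."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        keys = fields[8].split(":")
        for sample, column in zip(samples, fields[9:]):
            for key, value in zip(keys, column.split(":")):
                if value == "":
                    yield fields[0], int(fields[1]), sample, key


# The second record from the normalized two-sample output above
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
     "FORMAT", "Samp1", "Samp2"]
)
record = "\t".join(
    ["chr1", "939398", ".", "G",
     "GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA",
     "5075.01", ".", ".", "GT:AD:AF:DP",
     "0/0:73,0::35", "0/1:36,50:0.568:88"]
)

for hit in find_empty_format_values([header, record]):
    print(hit)  # → ('chr1', 939398, 'Samp1', 'AF')
```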
Thank you,
In-Hee Lee