norm -m- does not split FORMAT field with missing value correctly #1818

Closed
@ikarus97

Description

Hi,
I have a VCF file with the following header and record:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1   Samp2
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G,GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA     5075.01 .       .       GT:AD:AF:DP     0/0:73,0,0:.:35 0/2:36,2,50:0.023,0.568:88

For the first sample, Samp1, the AF value in the FORMAT column is missing (.).

After running bcftools norm -m -any -f [reference], I got:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo1.norm.vcf.gz demo1.vcf.gz; Date=Tue Nov 15 15:58:25 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1   Samp2
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35   0/0:36,2:0.023:88
chr1    939398  .       G       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       5075.01 .       .       GT:AD:AF:DP     0/0:73,0::35    0/1:36,50:0.568:88

The Samp2 output is as I expected.
For Samp1, however, I expected AF to be a missing value (.) on both lines: the value was missing before the split, so it should remain missing on every line after the split. Instead, the second line has an empty AF field (0/0:73,0::35).
The --force option made no difference.
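
To make the expected behaviour concrete, here is a minimal Python sketch of the split semantics I have in mind for Number=A and Number=R FORMAT fields. It is not how bcftools implements the split; the function names (split_format_value, split_sample) are made up for illustration, and GT recoding is left out on purpose.

```python
def split_format_value(value, number, alt_index):
    """Return the FORMAT sub-field for the record that keeps one ALT allele.

    value:     raw sub-field string from the multiallelic record, e.g. "36,2,50" or "."
    number:    the Number= declared in the header ("A", "R", or anything else)
    alt_index: 0-based index of the ALT allele kept in the split record
    """
    # A missing value stays missing; fields that are not per-allele are copied as-is.
    if value == "." or number not in ("A", "R"):
        return value
    parts = value.split(",")
    if number == "A":
        # One value per ALT allele: keep the one for this ALT.
        return parts[alt_index]
    # Number=R: keep the REF value plus the value for this ALT.
    return ",".join([parts[0], parts[alt_index + 1]])


def split_sample(fields, numbers, alt_index):
    """Join the split sub-fields of one sample with ':' (GT is not handled here)."""
    return ":".join(
        split_format_value(value, numbers[key], alt_index)
        for key, value in fields.items()
    )


# Values copied from the record above, without GT.
NUMBERS = {"AD": "R", "AF": "A", "DP": "1"}
SAMP1 = {"AD": "73,0,0", "AF": ".", "DP": "35"}
SAMP2 = {"AD": "36,2,50", "AF": "0.023,0.568", "DP": "88"}

for alt_index in range(2):
    print(split_sample(SAMP1, NUMBERS, alt_index),
          split_sample(SAMP2, NUMBERS, alt_index))
```

With this sketch, Samp1 comes out as 73,0:.:35 for both ALT alleles, which is the result I expected from the two-sample file as well.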

However, when I ran the same command on a file containing only Samp1, I got the results I expected:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo2.norm.vcf.gz demo2.vcf.gz; Date=Tue Nov 15 15:58:32 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35
chr1    939398  .       G       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35

I experimented with various inputs and concluded that the issue occurs only when the field to be split is missing for some samples but present for others. There was no problem when all samples had values or when all samples were missing.
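
In case it helps with checking other inputs, below is a small Python sketch that flags empty per-sample sub-fields (such as the AF in 0/0:73,0::35) in a single VCF data line. The helper name find_empty_subfields is hypothetical, and the sketch assumes a tab-separated line with a FORMAT column followed by sample columns.

```python
def find_empty_subfields(vcf_line):
    """Return (sample_index, format_key) pairs whose sub-field is an empty string."""
    cols = vcf_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")          # FORMAT column, e.g. GT:AD:AF:DP
    hits = []
    for sample_index, sample in enumerate(cols[9:]):
        for key, value in zip(keys, sample.split(":")):
            if value == "":            # empty is different from the missing value "."
                hits.append((sample_index, key))
    return hits


# Second output line from the two-sample run, with REF/ALT shortened for readability.
bad_line = "\t".join([
    "chr1", "939398", ".", "G", "GCC", "5075.01", ".", ".",
    "GT:AD:AF:DP", "0/0:73,0::35", "0/1:36,50:0.568:88",
])
# Same line with the missing value written as "." for Samp1.
good_line = bad_line.replace("73,0::35", "73,0:.:35")

print(find_empty_subfields(bad_line))
print(find_empty_subfields(good_line))
```

The first call reports the AF sub-field of the first sample; the second reports nothing, since "." is a valid missing value.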

Thank you,
In-Hee Lee
