bcftools concat --ligate drops variants when overlap between two VCFs is empty #1567

Closed
@freeseek

I have not looked at the vcfconcat.c code yet, but the following behavior seems puzzling.

Generate the following VCFs:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t1\t.\tA\tC\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o A1.vcf.gz
bcftools index -t -f A1.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t1\t.\tA\tC\t.\t.\t.\tGT\t0|1"
echo -e "chr1\t2\t.\tC\tG\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o A2.vcf.gz
bcftools index -t -f A2.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t2\t.\tC\tG\t.\t.\t.\tGT\t1|0"
echo -e "chr1\t3\t.\tG\tT\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o B.vcf.gz
bcftools index -t -f B.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t3\t.\tG\tT\t.\t.\t.\tGT\t1|0"
echo -e "chr1\t4\t.\tT\tA\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o C.vcf.gz
bcftools index -t -f C.vcf.gz
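
The four blocks above differ only in their records, so the test inputs could be generated with a small helper. This is only a sketch: `write_vcf` is a hypothetical name, not part of bcftools, and it assumes every record sits on chr1 with a single sample and a GT-only FORMAT field, as in the files above.

```shell
# Hypothetical helper (not part of bcftools): print a minimal single-sample
# VCF to stdout. Arguments come in groups of four: POS REF ALT GT.
write_vcf() {
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1>\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  # Emit one tab-separated record per group of four arguments.
  while [ "$#" -ge 4 ]; do
    printf 'chr1\t%s\t.\t%s\t%s\t.\t.\t.\tGT\t%s\n' "$1" "$2" "$3" "$4"
    shift 4
  done
}
```

For example, `write_vcf 2 C G '1|0' 3 G T '0|1' | bcftools view -Oz -o B.vcf.gz` should produce the same B.vcf.gz as the block above; the genotypes need quoting so the shell does not treat `|` as a pipe.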

All four variants are output when performing a simple concatenation:

$ bcftools concat --allow-overlaps --rm-dups all A1.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	1|0
chr1	3	.	G	T	.	.	.	GT	0|1
chr1	4	.	T	A	.	.	.	GT	0|1

With --ligate, one variant (chr1:2) goes missing when the first two VCFs (A1 and B) have no sites in common:

$ bcftools concat --ligate A1.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT:PS	0|1:1
chr1	3	.	G	T	.	.	.	GT:PS	0|1:1
chr1	4	.	T	A	.	.	.	GT:PS	1|0:1
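
To pinpoint the dropped site mechanically, the CHROM and POS columns of the two outputs can be compared. A minimal sketch, assuming both outputs were saved as headerless text files; `missing_sites` is a hypothetical helper name, not a bcftools command.

```shell
# Hypothetical helper: print CHROM/POS pairs that appear in the first
# headerless VCF body ($1) but not in the second ($2).
missing_sites() {
  expected=$(mktemp)
  observed=$(mktemp)
  # Keep only the CHROM and POS columns (tab-delimited), sorted for comm.
  cut -f1,2 "$1" | sort -u > "$expected"
  cut -f1,2 "$2" | sort -u > "$observed"
  # comm -23 prints lines unique to the first file.
  comm -23 "$expected" "$observed"
  rm -f "$expected" "$observed"
}
```

With the simple concatenation output saved as, say, concat.txt and the --ligate output as ligate.txt, `missing_sites concat.txt ligate.txt` would list the sites that only the simple concatenation retained.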

If the first two VCFs do overlap (A2 and B share the site chr1:2), all variants are retained, both with a simple concatenation:

$ bcftools concat --allow-overlaps --rm-dups all A2.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	0|1
chr1	3	.	G	T	.	.	.	GT	0|1
chr1	4	.	T	A	.	.	.	GT	0|1

and with a concatenation using phase ligation:

$ bcftools concat --ligate A2.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT:PS	0|1:1
chr1	2	.	C	G	.	.	.	GT:PS	0|1:1
chr1	3	.	G	T	.	.	.	GT:PS	1|0:1
chr1	4	.	T	A	.	.	.	GT:PS	0|1:1
