Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools merge --info-rules doesn't seem to work properly #1394

Closed
tskir opened this issue Feb 2, 2021 · 6 comments
Closed

bcftools merge --info-rules doesn't seem to work properly #1394

tskir opened this issue Feb 2, 2021 · 6 comments

Comments

@tskir
Copy link
Member

tskir commented Feb 2, 2021

Hi, maybe I'm doing something wrong, but I can't seem to make bcftools merge --info-rules flag to behave as I want it to. Here's what I'm doing:

Build bcftools v1.11

wget https://github.com/samtools/bcftools/releases/download/1.11/bcftools-1.11.tar.bz2
tar --extract --bzip2 --file bcftools-1.11.tar.bz2
cd bcftools-1.11
make -j `nproc`

Prepare test data

tr '|' '\t' << EOF | bgzip > A.vcf.gz
##fileformat=VCFv4.3
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO|FORMAT|A
chr1|123|.|A|C|.|.|AN=30|GT|0/1
EOF

tr '|' '\t' << EOF | bgzip > B.vcf.gz
##fileformat=VCFv4.3
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO|FORMAT|B
chr1|123|.|A|C|.|.|AN=12|GT|1/1
EOF

parallel tabix ::: *.vcf.gz

Try to combine (summarise) a particular INFO field

./bcftools merge --info-rules 'AN:sum' A.vcf.gz B.vcf.gz

Output

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_mergeVersion=1.11+htslib-1.11
##bcftools_mergeCommand=merge --info-rules AN:sum A.vcf.gz B.vcf.gz; Date=Tue Feb  2 16:07:46 2021
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	A	B
chr1	123	.	A	C	.	.	AN=4	GT	0/1	1/1

I expect the AN to be 42 (30 + 12), but it's for some reason... 4. Am I missing something?

@tskir
Copy link
Member Author

tskir commented Feb 2, 2021

Ah, okay, got it, it's probably getting 4 by actually counting the alleles as observed in GT, overriding the values in the INFO field, which in my example contradict the genotypes. So it wasn't a very good example. However, now consider this case:

Test data

Those are two summary VCFs without genotypes (permitted by the spec).

tr '|' '\t' << EOF | bgzip > C.vcf.gz
##fileformat=VCFv4.3
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO
chr1|123|.|A|C|.|.|AN=30
EOF

tr '|' '\t' << EOF | bgzip > D.vcf.gz
##fileformat=VCFv4.3
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
#CHROM|POS|ID|REF|ALT|QUAL|FILTER|INFO
chr1|123|.|A|C|.|.|AN=12
EOF

parallel tabix ::: *.vcf.gz

Merge

./bcftools merge --info-rules 'AN:sum' C.vcf.gz D.vcf.gz

Output

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Some metric">
##contig=<ID=chr1>
##bcftools_mergeVersion=1.11+htslib-1.11
##bcftools_mergeCommand=merge --info-rules AN:sum C.vcf.gz D.vcf.gz; Date=Tue Feb  2 16:22:01 2021
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	123	.	A	C	.	.	.

Notice how this drops the AN field entirely. Is this intended behaviour?

@pd3
Copy link
Member

pd3 commented Feb 5, 2021

Yes, this is deliberate, as AC and AN are computed from the sample genotypes. Is there a strong motivation for keeping them?

@pd3 pd3 closed this as completed May 24, 2021
@pd3 pd3 reopened this May 24, 2021
@pd3
Copy link
Member

pd3 commented May 24, 2021

Sorry, pressed the wrong button.

I will mark this as a feature request: in case merge rules are given explicitly for AN, the program could use that instead of its default behavior

@tskir
Copy link
Member Author

tskir commented May 24, 2021

Hi Petr! Sorry I didn't get back to you back in February.

To give some context (better late than never), this report arose from the CINECA project. The workflow we were using went like this:

  1. On a bunch of samples within controlled-access human cohorts, compile some summary VCFs which contain no genotypes (for privacy reasons), but do contain summary metrics like AC, AN, and others.
  2. Collect summary VCFs and merge them into a meta-summary VCF, combining the fields using certain rules.

In CINECA we managed to get around this by temporarily renaming the fields, so it's definitely not urgent or anything. Treating this as a low priority feature request seems to be the most sensible option to me.

@dennishendriksen
Copy link

I am running into the same issue when merging gnomAD exomes and genomes VCFs which do not contain sample information. Our use case is to create a single VCF containing the summed AC and AN. I came up with the same merge command as the author of this issue which didn't give the expected result.

@pd3 an alternative to the feature request you mention could be to change the default behavior in case of no available samples such that /bcftools merge --info-rules 'AN:sum' C.vcf.gz D.vcf.gz would give the same result as ./bcftools merge C.vcf.gz D.vcf.gz. what do you think?

@pd3 pd3 closed this as completed in 57542a5 Oct 18, 2021
@pd3
Copy link
Member

pd3 commented Oct 18, 2021

This is now supported. When no genotypes are present, AC,AN will be summed. When they are present, the values will be recalculated unless -i AC:sum,AN:sum is given, in that case they will be summed as well.

pd3 added a commit that referenced this issue Oct 18, 2021
or when
    bcftools merge -i AC:sum,AN:sum

is given explicitly

Resolves #1394
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants