Description
If I understand the current behavior of bcftools annotate correctly, records in the input VCF are matched to records in the annotation file based on POS, REF, and ALT in cases where the annotation file is a VCF, or if it's a tab-delimited file and REF and ALT are specified in -c.
When dealing with VCFs representing structural variants, we sometimes have records that represent different variants but have the same position and alternate allele. This is because we use symbolic ALT alleles, which don't always fully specify the variant. For example, two different tools may each have detected a deletion with the same start position but a different end position. Both records will have <DEL> as their ALT allele despite representing different variants. If we've prepared data with which we'd like to annotate each variant, this leaves us unable to do so with bcftools annotate under the current matching scheme. For example, if in the VCF we have these two records:
chr1 100000 VID1 A <DUP> . PASS END=200000
chr1 100000 VID2 A <DUP> . PASS END=300000
And we'd like to annotate VID1 and VID2 with different values, there doesn't seem to be a way to do so with the current matching rules of bcftools annotate; i.e., if we have the annotation file:
chr1 100000 A <DUP> 1
chr1 100000 A <DUP> 2
and try to annotate with bcftools annotate -a annotations.tsv.gz -c CHROM,POS,REF,ALT,INFO/VAL input_vcf.gz, we get the output VCF:
chr1 100000 VID1 A <DUP> . PASS END=200000;VAL=1
chr1 100000 VID2 A <DUP> . PASS END=300000;VAL=1
If we add ID to the annotation file and include it in the column list, the ID of each record gets overwritten with the ID of the first annotation line that matches on CHROM, POS, REF, and ALT.
I was wondering whether there is some way to accomplish our desired annotation with the current functionality of bcftools, and if not, whether it would be possible to add it as a new feature. I could see the latter being accomplished either by an option that would allow the user to specify ID as a column in the annotation file which should be used for matching records, or by adding a matching rule to -l that would do something like match the nth duplicate record at a given position to the nth duplicate annotation value (although I imagine the latter option might get tricky to implement).
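To make the two ideas concrete, here is a rough Python sketch of the matching behavior I have in mind. It is not bcftools code and says nothing about how bcftools works internally. The function names (annotate_by_id, annotate_by_rank), the whitespace-split parsing, and the annotation layout with an extra ID column are all assumptions made up for illustration.

```python
from collections import defaultdict


def _add_info(fields, tag, value):
    """Append tag=value to the INFO column (index 7) of a split VCF record."""
    info = fields[7]
    fields[7] = f"{tag}={value}" if info == "." else f"{info};{tag}={value}"
    return "\t".join(fields)


def annotate_by_id(vcf_lines, annot_lines, tag="VAL"):
    """Idea 1: ID takes part in the match.

    Assumed (hypothetical) annotation columns: CHROM POS ID REF ALT value.
    """
    values = {}
    for line in annot_lines:
        chrom, pos, vid, ref, alt, value = line.split()
        values[(chrom, pos, vid, ref, alt)] = value
    out = []
    for line in vcf_lines:
        fields = line.split()
        key = tuple(fields[:5])  # CHROM, POS, ID, REF, ALT
        if key in values:
            out.append(_add_info(fields, tag, values[key]))
        else:
            out.append("\t".join(fields))  # no match: record left unannotated
    return out


def annotate_by_rank(vcf_lines, annot_lines, tag="VAL"):
    """Idea 2: the nth record at a site gets the nth annotation value at that site.

    Annotation columns as in the example above: CHROM POS REF ALT value.
    """
    values = defaultdict(list)
    for line in annot_lines:
        chrom, pos, ref, alt, value = line.split()
        values[(chrom, pos, ref, alt)].append(value)
    seen = defaultdict(int)  # how many records we have already met per site
    out = []
    for line in vcf_lines:
        fields = line.split()
        key = (fields[0], fields[1], fields[3], fields[4])
        n = seen[key]
        seen[key] += 1
        candidates = values.get(key, [])
        if n < len(candidates):
            out.append(_add_info(fields, tag, candidates[n]))
        else:
            out.append("\t".join(fields))  # ran out of annotation lines
    return out


vcf = [
    "chr1 100000 VID1 A <DUP> . PASS END=200000",
    "chr1 100000 VID2 A <DUP> . PASS END=300000",
]
for rec in annotate_by_id(
    vcf, ["chr1 100000 VID1 A <DUP> 1", "chr1 100000 VID2 A <DUP> 2"]
):
    print(rec)
for rec in annotate_by_rank(vcf, ["chr1 100000 A <DUP> 1", "chr1 100000 A <DUP> 2"]):
    print(rec)
```

With either function, VID1 ends up with VAL=1 and VID2 with VAL=2 on the example records. The rank-based version depends on both files listing the duplicates in the same order, which is part of why I suspect it would be the trickier of the two to implement robustly.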