Description
Dear BCFTools developers,
Thank you for the already very useful BCFTools consensus functionality.
Given the REF and ALT alleles below and the genotypes of these 3 samples, the IUPAC sequence produced by bcftools consensus should use the ALT allele at base 3, since that site is monomorphic HOM_ALT in the 3 samples of interest.
The REF allele is never observed in these samples, so using the ALT allele gives the most correct consensus sequence for them.
sequence | base 1 | base 2 | base 3 | base 4 | base 5 | base 6 | base 7 | base 8 |
---|---|---|---|---|---|---|---|---|
reference | A | A | C | C | T | T | G | G |
ALT | G | G | T | T | C | C | A | A |
sample 1 | 0/0 | 0/1 | 1/1 | 0/0 | 0/1 | 0/0 | 1/1 | 0/0 |
sample 2 | 0/0 | 0/0 | 1/1 | 0/0 | 1/1 | 0/0 | 0/0 | 0/0 |
sample 3 | 0/0 | 0/1 | 1/1 | 0/0 | 1/1 | 0/0 | 1/1 | 0/0 |
correct consensus | A | R | T | C | Y | T | R | G |
current consensus | A | R | R | C | Y | T | R | G |
The current consensus was produced with bcftools 1.12.
For the ALT allele this is already implemented: it is left out of the IUPAC sequence for the sample set if it does not occur in the genotype of any sample in the set (see bases 1, 4, 6 and 8). The same should hold for the REF allele at base 3.
I think excluding the REF allele from the IUPAC code for monomorphic HOM_ALT variants makes the most sense for the consensus sequence.
It would also be useful for downstream purposes that depend on the most accurate consensus sequence of a group of samples (e.g. various types of multiple sequence analysis, assay design, etc.).
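To make the requested behaviour concrete, here is a minimal Python sketch of the rule I have in mind: at each site, the IUPAC code is built only from alleles that actually occur in the genotypes of the selected samples. This is not how bcftools implements it; the names `IUPAC` and `consensus_base` are made up for illustration, and I assume biallelic SNPs, that missing alleles (`.`) are skipped, and that a site with no called alleles falls back to REF.

```python
# IUPAC ambiguity codes keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(ref, alt, genotypes):
    """Return the IUPAC code for the alleles observed in `genotypes`.

    Hypothetical helper: `genotypes` is a list of GT strings such as
    "0/1" or "1|1"; allele index 0 is REF and 1 is ALT.
    """
    alleles = (ref, alt)
    observed = set()
    for gt in genotypes:
        for idx in gt.replace("|", "/").split("/"):
            if idx != ".":  # assumption: missing alleles are ignored
                observed.add(alleles[int(idx)])
    if not observed:
        return ref  # assumption: no calls at all, keep the reference base
    return IUPAC[frozenset(observed)]


# The example from the table above.
ref = "AACCTTGG"
alt = "GGTTCCAA"
samples = [
    ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "1/1", "0/0"],  # sample 1
    ["0/0", "0/0", "1/1", "0/0", "1/1", "0/0", "0/0", "0/0"],  # sample 2
    ["0/0", "0/1", "1/1", "0/0", "1/1", "0/0", "1/1", "0/0"],  # sample 3
]

consensus = "".join(
    consensus_base(r, a, [sample[i] for sample in samples])
    for i, (r, a) in enumerate(zip(ref, alt))
)
print(consensus)  # ARTCYTRG
```

With this rule, base 3 comes out as `T` rather than `R`, matching the "correct consensus" row of the table.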
Thank you for looking into this.