Description
Describe the issue
When VEP adds the LoF_info
annotation, it does not sanitize its output and allows commas in the LoF_info subfield. For example:
[...]|PERCENTILE:0.7,GERP_DIST:-366.3[...]MAX_ORF:-698.745,GA|frameshift_variant[...]
This makes it impossible to split the consequences by transcript and variant, programs that are designed to extract and query VEP annotations fail (such as bcftools +split-vep
).
Additional information
Example of such file and site is chr21:5032064 in https://gnomad-public-us-east-1.s3.amazonaws.com/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz
System
VEP version: 101 (possibly more recent versions as well)
An example of such VEP annotation
header:
##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">
##vep_version=v101
data line:
vep=GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000612610|protein_coding|5/7||ENST00000612610.4:c.718_719dup|ENSP00000483732.1:p.Asp240GlufsTer35|896-897|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||1|A2||ENSP00000483732|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.773118279569892,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000620481|protein_coding|4/6||ENST00000620481.4:c.367_368dup|ENSP00000484302.1:p.Asp123GlufsTer35|545-546|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||5|||ENSP00000484302|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.635578583765112,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623795|protein_coding|4/6||ENST00000623795.1:c.367_368dup|ENSP00000485649.1:p.Asp123GlufsTer35|505-506|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485649|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.659498207885305,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|FP565260.3|ENSG00000277117|Transcript|ENST00000623903|nonsense_mediated_decay|5/7||ENST00000623903.3:c.*332_*333dup||706-707|||||1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485557||||||||||||||||,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623960|protein_coding|5/7||ENST00000623960.4:c.718_719dup|ENSP00000485129.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene||YES|1|P2|CCDS86973.1|ENSP00000485129|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|NM_001363770.2|protein_coding|5/7||NM_001363770.2:c.718_719dup|NP_001350699.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||YES||||NP_001350699.1||||||||10|||||HC|||PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_006723899.2|protein_coding|5/6||XM_006723899.2:c.718_719dup|XP_006723962.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_006723962.1||||||||10|||||HC|||PERCENTILE:0.463571889103804,GERP_DIST:-1141.14512844086,BP_DIST:840,DIST_FROM_LAST_EXON:152,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546078.2|protein_coding|5/7||XM_011546078.2:c.718_719dup|XP_011544380.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544380.1||||||||10|||||HC|||PERCENTILE:0.662062615101289,GERP_DIST:-354.294564935565,BP_DIST:374,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546079.1|protein_coding|5/7||XM_011546079.1:c.718_719dup|XP_011544381.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544381.1||||||||10|||||HC|||PERCENTILE:0.785792349726776,GERP_DIST:-391.862466733158,BP_DIST:203,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT
Proposed solution
Replace commas and other special characters in plugins' outputs with the corresponding percent encoded characters, as recommended by the VCF specification v4.3 in section 1.2 (http://samtools.github.io/hts-specs/VCFv4.3.pdf).
Activity