I would like to propose a small tweak to bcftools query with the -H flag, which prints header names as the first line of output. Currently, the header line begins with "# " (a hash sign followed by a space).
This can confuse many downstream tools trying to parse data into columns. The first line will appear to have one more column in the eyes of many standard tools, such as awk, cut, datamash, R, and others, including spreadsheet apps.
In this toy example, we can see that the first line has more columns, and that the name of the first column is "#", rather than e.g. "CHROM", as we asked in the query format. This makes the header less useful by default, because it requires additional processing. For example, I often end up piping through sed, e.g.
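To illustrate the mismatch without needing bcftools installed, here is a minimal sketch that fakes the output with printf. The header text is a simplified stand-in (real bcftools headers may carry extra decoration), and `sample_output` and `strip_header_prefix` are hypothetical helper names. The sed call is one possible workaround, not necessarily the command used above.

```shell
# Simulated space-delimited output with a "# " header prefix.
# Assumption: simplified stand-in, not verbatim bcftools output.
sample_output() {
  printf '# CHROM POS REF ALT\n'
  printf 'chr1 100 A G\n'
  printf 'chr1 200 C T\n'
}

# Hypothetical workaround: drop the leading "# " from the first line only.
strip_header_prefix() {
  sed '1s/^# //'
}

# Field count per line: the header reports one field more than the data lines.
sample_output | awk '{ print NF }'

# The first "column name" that awk sees is the bare hash sign.
sample_output | awk 'NR == 1 { print $1 }'

# After the workaround, the header lines up with the data columns.
sample_output | strip_header_prefix
```

With whitespace splitting, the header line yields 5 fields and each data line yields 4, which is the off-by-one described above.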
Depending on your preference regarding the hash sign # in the header, I propose either removing the space following the hash sign, or removing both the hash sign and the space. This would result in a name such as either #CHROM or just CHROM, respectively.
Fair point. Note that the problem is caused by the choice of using space as a delimiter and would not be a problem in tab-delimited output, such as %CHROM\t%POS. In general, using space as a delimiter is problematic because values containing spaces are, sadly, permitted by the VCF specification.
As for solutions, the leading hash character # is a common way to separate header and comment lines from data lines, so it will stay, but we can remove the leading space. This has been done just now in 02a3961.
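As a rough sketch of what the change means for downstream parsing, the snippet below fakes a header with the space removed. The header text is an assumption for illustration (the actual output after 02a3961 is not shown in this thread), and `fixed_output` is a hypothetical helper name.

```shell
# Simulated output with the space after the hash removed.
# Assumption: simplified stand-in, not verbatim bcftools output.
fixed_output() {
  printf '#CHROM POS REF ALT\n'
  printf 'chr1 100 A G\n'
  printf 'chr1 200 C T\n'
}

# Every line now splits into the same number of fields.
fixed_output | awk '{ print NF }'

# The first column name keeps the hash sign.
fixed_output | awk 'NR == 1 { print $1 }'

# If a plain name is wanted, strip the single leading hash from line 1.
fixed_output | sed '1s/^#//' | head -n 1
```

Tools that treat a leading # as a comment marker would still skip this line, while column-based tools see a consistent field count.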
I should have expected the argument about space as a delimiter - but note that to awk (my primary work language now) spaces and tabs are the same (by default). I usually use tabs as delimiters; I was (somewhat deliberately) lazy in my example and went with spaces, but I also chose awk to show a tool where the difference doesn't matter. I could have been more explicit, though.
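Both points can be checked on tab-delimited data. This is a minimal sketch with simulated output; the header text is a simplified assumption and `tab_output` is a hypothetical helper name.

```shell
# Simulated tab-delimited output that still has the "# " header prefix.
# Assumption: simplified stand-in, not verbatim bcftools output.
tab_output() {
  printf '# CHROM\tPOS\tREF\tALT\n'
  printf 'chr1\t100\tA\tG\n'
}

# Explicit tab separator: header and data agree on the field count,
# although the first header field is then the string "# CHROM".
tab_output | awk -F'\t' '{ print NF }'
tab_output | awk -F'\t' 'NR == 1 { print $1 }'

# Default awk splitting (runs of spaces and tabs): the header gains a field.
tab_output | awk '{ print NF }'
```

So a strict tab parser sees matching column counts, while any tool that splits on general whitespace still sees the extra header column unless the space is removed.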