From: | Manuel Collado |
Subject: | Re: article about gawk best practices in data science and feature proposal |
Date: | Thu, 11 Feb 2021 20:46:34 +0100 |
User-agent: | Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 |
Ivan Molineris <ivan.molineris@gmail.com> wrote:

> ...Moreover, one of the biggest drawbacks of gawk in our field is that
> referring to input columns by number often produces hard-to-read scripts.
> For this reason, the wrapper I commonly use allows referring to columns
> not only by number but also by name. For example, if a file looks like this:
>
> chromosome start end
> chr1 241 53521
> chr1 363 43623
> chr2 5243 234562
>
> gawk '{l=$3-$2}'
>
> can also be written as
>
> gawk '{l=$end-$start}'
>
> I know that this syntax is not backward compatible; maybe it can be improved.
> Do you know if anyone has considered a feature like this in the past?
The SYMTAB feature of gawk can be of help. Example:

$ cat headers.awk
# Assign column numbers to header named variables
FNR==1 {
    for (k=1; k<=NF; k++) {
        SYMTAB[$k] = k
    }
    next
}
# Process the data file
{
    print "Length of " $chromosome " is " $end - $start
}

$ cat data
chromosome start end
chr1 241 53521
chr1 363 43623
chr2 5243 234562

$ gawk -f headers.awk data
Length of chr1 is 53280
Length of chr1 is 43260
Length of chr2 is 229319

HTH.
Regards.
-- 
Manuel Collado - http://mcollado.z15.es
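
SYMTAB is a gawk extension, so the script above will not run under other awk implementations. As a minimal sketch of the same header-to-column mapping in portable awk, the column numbers can be stored in an ordinary associative array instead. The array name `col` is an assumption made for this illustration, not part of the original script, and the sample data is fed on standard input rather than from the `data` file.

```shell
# Portable variant: map header names to column numbers in a plain array.
# "col" is a hypothetical name chosen for this sketch.
result=$(awk '
# First line: remember which column each header name refers to
FNR == 1 {
    for (k = 1; k <= NF; k++) {
        col[$k] = k
    }
    next
}
# Remaining lines: look up fields by header name
{
    print "Length of " $col["chromosome"] " is " ($col["end"] - $col["start"])
}
' <<'EOF'
chromosome start end
chr1 241 53521
chr1 363 43623
chr2 5243 234562
EOF
)
printf '%s\n' "$result"
```

This prints the same three lines as the gawk run above. The trade-off is slightly noisier syntax (`$col["end"]` instead of `$end`), but array keys also accept header names that are not valid awk identifiers.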