From: Manuel Collado
Subject: Re: article about gawk best practices in data science and feature proposal
Date: Thu, 11 Feb 2021 20:46:34 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

Ivan Molineris <ivan.molineris@gmail.com> wrote:
...
Moreover, one of the biggest drawbacks of gawk in our field is that
referring to input columns by number often produces hard-to-read scripts.
For this reason, the wrapper I commonly use makes it possible to refer to
columns not only by number but also by name.

For example, if a file is composed like this:

chromosome     start        end
       chr1       241      53521
       chr1       363      43623
       chr2      5243     234562

gawk '{l=$3-$2}'
can also be written as
gawk '{l=$end-$start}'

I know that this syntax is not backward-compatible; maybe it can be improved.

Do you know if anyone has considered a feature like this one in the
past?

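The SYMTAB feature of gawk can help here: assigning to SYMTAB["name"]
sets the variable called name, so each header can be turned into a
variable that holds its column number. Example: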

$ cat headers.awk
# Assign column numbers to variables named after the header fields
FNR==1 {
    for (k=1; k<=NF; k++) {
        SYMTAB[$k] = k
    }
    next
}

# Process the data file
{
    print "Length of " $chromosome " is " $end - $start
}

$ cat data
chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562

$ gawk -f headers.awk data
Length of chr1 is 53280
Length of chr1 is 43260
Length of chr2 is 229319
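
For awks without SYMTAB, the same idea could be sketched with an ordinary
associative array that maps header names to column numbers. This is only
a sketch under assumptions: the array name col and the file name
byname.awk are made up for illustration, and the data file is the same
one shown above.

```shell
# Recreate the sample data file from the message above.
cat > data <<'EOF'
chromosome     start        end
      chr1       241      53521
      chr1       363      43623
      chr2      5243     234562
EOF

# byname.awk is a hypothetical file name; col is a hypothetical array name.
cat > byname.awk <<'EOF'
# Map each header name to its column number in an ordinary array
FNR == 1 {
    for (k = 1; k <= NF; k++) {
        col[$k] = k
    }
    next
}

# Process the data rows, looking fields up by header name
{
    print "Length of " $(col["chromosome"]) " is " ($(col["end"]) - $(col["start"]))
}
EOF

awk -f byname.awk data
```

The lookups are more verbose than $end - $start, but header names do not
have to be valid variable names and cannot collide with variables used
elsewhere in the script.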

HTH. Regards.
--
Manuel Collado - http://mcollado.z15.es


