help-bash

Re: Remove comment characters at start of line


From: Greg Wooledge
Subject: Re: Remove comment characters at start of line
Date: Thu, 2 Feb 2023 08:01:29 -0500

On Thu, Feb 02, 2023 at 09:40:02AM +0100, alex xmb ratchev wrote:
> On Thu, Feb 2, 2023, 9:31 AM Hans Lonsdale <hanslonsdale@mailfence.com>
> wrote:
> > I am trying to understand this sed line matching part, that is supposed to
> > remove the comment characters
> > at the start of the line.  The pattern is pn_ere.
> 
> understand regexes ,
> ^ from beginning
> [[:space:]]* optional match space characters ( optional * , * = 'none or
> more' )
> then a list of the matches
> begins with # for normal scripting cmds
> others also
> ^ # comment
> ^ ! ...
> ^ @c ...
> $cmt is some comment format value
> ..
> [[:space:]]+ must match a space
> ..
> 
> i dunno much sed magic , but the regex is the/a key
> 
>   pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"
> >
> >   sed -E -n "
> >     /$beg_ere/ {
> >       :L1
> >       n
> >       /$end_ere/z
> >       /$pn_ere/!z
> >       s/// ; p
> >       tL1
> >     }
> >   "

That's all true, but there's a bit more to this story.  The sed program
here is enclosed in double quotes, and it has shell substitutions inside
it -- three of them -- and one of THOSE is a double-quoted regular
expression that has a FOURTH shell substitution inside it.

At least in the case of pn_ere, you're clearly trying to generate a regex
at run time using data known only at run time (that ${cmt} expansion).
I don't know what's in beg_ere and end_ere but I have to assume they
work similarly -- they probably contain your "faml" and "asmb" strings.

Each of those substituted values has the potential to destroy your sed
program, because it's injected directly into the sed program.  If the
contents of these variables are generated by outside parties (end users,
files, etc.) and are not sanitized, you might have a code injection bug.
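
To make that concrete, here is a minimal sketch.  The function name
filter_marked and both marker values are made up for illustration; they
are not from the original script.  The second value contains the /
delimiter, so part of the "data" ends up being executed as sed commands.

```shell
# Hypothetical helper: print the lines that start with the given marker.
# The marker is pasted straight into the sed program, which is the bug.
filter_marked() {
    sed -n "/^$1/p"
}

input='# keep me
plain'

# A harmless marker: sed sees the program  /^#/p
safe=$(printf '%s\n' "$input" | filter_marked '#')

# A marker containing the delimiter: sed now sees
#   /^#/s/.*/INJECTED/p; /x/p
# so the tail of the "data" has become an s command.
hostile=$(printf '%s\n' "$input" | filter_marked '#/s/.*/INJECTED/p; /x')

printf 'safe:    %s\n' "$safe"
printf 'hostile: %s\n' "$hostile"
```

The first call prints the marked line unchanged; the second rewrites it,
even though the only thing that changed was the value of a variable.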

When using sed, there is no alternative to injection.  You can't specify
strings inside environment variables, or anything like that.  sed assumes
that its program is written entirely by a programmer, and does not
contain data from untrusted sources.

Therefore, if you insist on using sed (I wouldn't!), you MUST sanitize
your injected data strings.

The simplest approach would be to come up with a list of all the
characters that can possibly have special meaning in your regex flavor
of choice, and "escape" them all with backslashes.  You're using "sed -E"
which is nonstandard, but we can guess that it means POSIX ERE flavor
regexes.  Therefore you need a list of the special characters in ERE.

Reading regex(7) on Debian 11 gives me the following characters, which
MAY OR MAY NOT be a full list:

| . * + ? { } ( ) ^ $ \ [ ] 

To "escape" each of these characters within a string, we can either
call sed (UGH!) or use a bash parameter expansion inside a loop
(which means we must now write the script in bash, not sh):

faml='who knows what'
# Glob bracket expression listing the ERE special characters.
pattern='[]|.*+?{}()^$\\[]'
tmp=
n=${#faml}
for ((i=0; i<n; i++)); do
    if [[ ${faml:i:1} == $pattern ]]; then
        # Special character: put a backslash in front of it.
        tmp+=\\${faml:i:1}
    else
        tmp+=${faml:i:1}
    fi
done
faml=$tmp

This has to be done at least three times, so it's worth writing it as
a function.  Assuming bash 4.3 or later, we can use namerefs to pass
the input by reference:

ere_escape() {
    local -n _ere_string=$1
    local _ere_pattern='[]|.*+?{}()^$\\[]'
    local _ere_tmp _ere_i _ere_n

    _ere_tmp=
    _ere_n=${#_ere_string}
    for (( _ere_i=0; _ere_i < _ere_n; _ere_i++ )); do
        if [[ ${_ere_string:_ere_i:1} == $_ere_pattern ]]; then
            _ere_tmp+=\\${_ere_string:_ere_i:1}
        else
            _ere_tmp+=${_ere_string:_ere_i:1}
        fi
    done
    _ere_string=$_ere_tmp
}

And, a demonstration:

unicorn:~$ printf '%s\n' "$faml"
// $5 [footlong] \\
unicorn:~$ ere_escape faml
unicorn:~$ printf '%s\n' "$faml"
// \$5 \[footlong\] \\\\
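
As a rough sketch of how an escaped value then goes into the regex, here
is pn_ere built from a made-up cmt value.  The value '(*' and the
one-line sed call are assumptions for illustration only; the sed call is
a simplified stand-in for the real program quoted above, not a
replacement for it.

```shell
#!/usr/bin/env bash
# Same logic as ere_escape in this message, repeated so the sketch runs
# on its own.  Needs bash 4.3 or later for the nameref.
ere_escape() {
    local -n _ere_string=$1
    local _ere_pattern='[]|.*+?{}()^$\\[]'   # ERE specials, including "."
    local _ere_tmp _ere_i _ere_n

    _ere_tmp=
    _ere_n=${#_ere_string}
    for (( _ere_i=0; _ere_i < _ere_n; _ere_i++ )); do
        if [[ ${_ere_string:_ere_i:1} == $_ere_pattern ]]; then
            _ere_tmp+=\\${_ere_string:_ere_i:1}
        else
            _ere_tmp+=${_ere_string:_ere_i:1}
        fi
    done
    _ere_string=$_ere_tmp
}

# Hypothetical run-time value: a comment opener made of ERE metacharacters.
cmt='(*'
ere_escape cmt      # cmt now holds \(\*

pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"

# Simplified stand-in for the original program: just strip the prefix.
stripped=$(printf '%s\n' '  (* first' '## second' | sed -E "s/$pn_ere//")
printf '%s\n' "$stripped"
```

Without the ere_escape call, the '(*' would land in the regex as an
unmatched group opener followed by a repetition operator, which is not
what anyone meant.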

You'll need to ere_escape each of your three substituted values before
injecting them into the regexes that are passed into the sed program.
You'll also need to double-check that I came up with the correct list of
ERE special characters, AND that I correctly translated them into a bash
glob-type pattern for matching.  The square brackets in particular needed
special handling, but I'm sure that someone who actually understands
that sed program quoted up above can handle a bit of regex-to-glob
manipulation, eh?
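
One result of such a double-check, offered as a sketch rather than a
final answer: regex(7) lists "." among the special characters, and when
the value lands between the slashes of a sed /.../ address, the "/"
delimiter has to be escaped as well, even though it is not an ERE
metacharacter.  The function name sed_ere_escape and the test value are
made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical variant of ere_escape for values used inside /.../ in
# "sed -E".  Needs bash 4.3 or later for the nameref.
sed_ere_escape() {
    local -n _see_string=$1
    # ERE specials from regex(7), including ".", plus the "/" delimiter.
    local _see_pattern='[]|.*+?{}()^$\\/[]'
    local _see_tmp= _see_i _see_char

    for (( _see_i=0; _see_i < ${#_see_string}; _see_i++ )); do
        _see_char=${_see_string:_see_i:1}
        if [[ $_see_char == $_see_pattern ]]; then
            _see_tmp+=\\$_see_char
        else
            _see_tmp+=$_see_char
        fi
    done
    _see_string=$_see_tmp
}

# Made-up value containing both the delimiter and a dot.
beg='/* start. */'
sed_ere_escape beg
printf '%s\n' "$beg"

# Only the line that contains the literal string should be printed.
matched=$(printf '%s\n' '/* start. */' '/* startX */' | sed -E -n "/$beg/p")
printf '%s\n' "$matched"
```

If you use a different delimiter in the sed program, that character is
the one to add to the bracket list instead of "/".  A newline inside the
value would still break the program; this sketch does not handle that.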

Finally, double-check that your "sed -E" really is using POSIX ERE and
not some GNU-tainted variant with additional special characters.  If
it's got extensions, then you'll have to handle THOSE also.
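
A crude way to get a first hint, sketched under the assumption that \s
is a fair stand-in for "extensions in general": POSIX ERE does not
define \s, so if your sed -E treats it as whitespace, something beyond
POSIX ERE is active.  The function name is made up, and a negative
result proves nothing about other extensions.

```shell
# Hypothetical probe: does this "sed -E" treat \s as whitespace?
# Returns 0 if it does, non-zero if not (or if sed rejects the regex).
sed_ere_has_backslash_s() {
    [ -n "$(printf 'a b\n' | sed -E -n '/a\sb/p' 2>/dev/null)" ]
}

if sed_ere_has_backslash_s; then
    printf '%s\n' 'sed -E accepts \s: extensions beyond POSIX ERE are in play'
else
    printf '%s\n' 'sed -E did not treat \s as whitespace'
fi
```

Your sed's own documentation remains the real authority on which extra
sequences it recognizes.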

All of this is why I went with a nice, simple awk program instead of
trying to deal with sed.
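
For contrast, here is a small sketch of why awk sidesteps the problem.
This is NOT the awk program from earlier in the thread; the function
name strip_marker and the sample input are made up.  The marker reaches
awk through the environment and is compared with index(), which does a
plain string search, so the value is never parsed as a regex or as
program text.

```shell
# Hypothetical helper: print each line that starts with the literal
# marker (after optional leading blanks), with the marker removed.
strip_marker() {
    marker=$1 awk '
        BEGIN { m = ENVIRON["marker"] }
        {
            line = $0
            sub(/^[ \t]+/, "", line)               # drop leading blanks
            if (index(line, m) != 1) next          # literal prefix test
            rest = substr(line, length(m) + 1)
            if (sub(/^[ \t]+/, "", rest)) print rest   # blank must follow
        }'
}

# ".*" would match everything as a regex; here it is just two characters.
result=$(printf '%s\n' '  .* hello' 'xx hello' | strip_marker '.*')
printf '%s\n' "$result"
```

No escaping function is needed, because the data and the program never
share a string.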


