Re: Remove comment characters at start of line
From: Hans Lonsdale
Subject: Re: Remove comment characters at start of line
Date: Thu, 2 Feb 2023 15:04:51 +0100 (CET)
> ----------------------------------------
> From: Greg Wooledge <greg@wooledge.org>
> Date: Feb 3, 2023, 1:01:29 AM
> To: <help-bash@gnu.org>
> Subject: Re: Remove comment characters at start of line
>
>
> On Thu, Feb 02, 2023 at 09:40:02AM +0100, alex xmb ratchev wrote:
> > On Thu, Feb 2, 2023, 9:31 AM Hans Lonsdale <hanslonsdale@mailfence.com>
> > wrote:
> > > I am trying to understand this sed line matching part, that is supposed to
> > > remove the comment characters
> > > at the start of the line. The pattern is pn_ere.
> >
> > It helps to understand the regex:
> > ^ anchors the match at the beginning of the line
> > [[:space:]]* optionally matches whitespace ( * means 'zero or more' )
> > then comes a list of alternatives for the comment marker,
> > beginning with # for normal scripting commands, and others also:
> > ^ # comment
> > ^ ! ...
> > ^ @c ...
> > $cmt is some comment format value
> > ..
> > [[:space:]]+ must match at least one whitespace character
> > ..
> >
> > I don't know much sed magic, but the regex is the key.
> >
> > pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"
> > >
> > > sed -E -n "
> > > /$beg_ere/ {
> > > :L1
> > > n
> > > /$end_ere/z
> > > /$pn_ere/!z
> > > s/// ; p
> > > tL1
> > > }
> > > "
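To see what pn_ere actually consumes, the regex can be tried outside sed with bash's own ERE matching. This is only a sketch: the value of cmt ('%') and the helper name strip_comment are assumptions made up for illustration, not taken from the original script.

```shell
# Hypothetical comment marker; the real value of cmt is not shown in the thread.
cmt='%'
pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"

# strip_comment (hypothetical helper): print the line with the leading
# comment marker removed, or unchanged if pn_ere does not match.
strip_comment() {
    local line=$1
    if [[ $line =~ $pn_ere ]]; then
        # BASH_REMATCH[0] holds the matched prefix; remove it literally.
        printf '%s\n' "${line#"${BASH_REMATCH[0]}"}"
    else
        printf '%s\n' "$line"
    fi
}

strip_comment '  ## hello world'
strip_comment '@c texinfo remark'
strip_comment '#nospace'
```

The last call prints its input unchanged, because [[:space:]]+ requires at least one whitespace character after the marker.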
>
> That's all true, but there's a bit more to this story. The sed program
> here is enclosed in double quotes, and it has shell substitutions inside
> it -- three of them -- and one of THOSE is a double-quoted regular
> expression that has a FOURTH shell substitution inside it.
>
> At least in the case of pn_ere, you're clearly trying to generate a regex
> at run time using data known only at run time (that ${cmt} expansion).
> I don't know what's in beg_ere and end_ere but I have to assume they
> work similarly -- they probably contain your "faml" and "asmb" strings.
>
> Each of those substituted values has the potential to destroy your sed
> program, because it's injected directly into the sed program. If the
> contents of these variables are generated by outside parties (end users,
> files, etc.) and are not sanitized, you might have a code injection bug.
>
> When using sed, there is no alternative to injection. You can't specify
> strings inside environment variables, or anything like that. sed assumes
> that its program is written entirely by a programmer, and does not
> contain data from untrusted sources.
>
> Therefore, if you insist on using sed (I wouldn't!), you MUST sanitize
> your injected data strings.
The further I get into this, the more trouble insisting on sed is causing.
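A small sketch of why unsanitized values are a problem, even before anything malicious is involved. It assumes a hypothetical marker of a single dot in cmt; in an ERE an unescaped "." matches any character, so the regex silently matches lines that are not comments at all.

```shell
# Hypothetical marker: a literal dot is intended, but it is injected unescaped.
cmt='.'
pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"

line='x = 1'                     # ordinary text, not a comment
if [[ $line =~ $pn_ere ]]; then
    matched=${BASH_REMATCH[0]}   # the prefix that s/// would delete
else
    matched=
fi
printf 'would strip: "%s"\n' "$matched"
```

Here the "." matches the "x", so the prefix "x " would be removed from a line that was never a comment.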
> The simplest approach would be to come up with a list of all the
> characters that can possibly have special meaning in your regex flavor
> of choice, and "escape" them all with backslashes. You're using "sed -E"
> which is nonstandard, but we can guess that it means POSIX ERE flavor
> regexes. Therefore you need a list of the special characters in ERE.
>
> Reading regex(7) on Debian 11 gives me the following characters, which
> MAY OR MAY NOT be a full list:
>
> | * + ? { } ( ) ^ $ \ [ ]
>
> To "escape" each of these characters within a string, we can either
> call sed (UGH!) or use a bash parameter expansion inside a loop
> (which means we must now write the script in bash, not sh):
>
> faml='who knows what'
> pattern='[]|*+?{}()^$\\[]'
> tmp=
> n=${#faml}
> for ((i=0; i<n; i++)); do
>     if [[ ${faml:i:1} == $pattern ]]; then
>         tmp+=\\${faml:i:1}
>     else
>         tmp+=${faml:i:1}
>     fi
> done
> faml=$tmp
>
> This has to be done at least three times, so it's worth writing it as
> a function. Assuming bash 4.3 or later, we can use namerefs to pass
> the input by reference:
>
> ere_escape() {
>     local -n _ere_string=$1
>     local _ere_pattern='[]|*+?{}()^$\\[]'
>     local _ere_tmp _ere_i _ere_n
>
>     _ere_tmp=
>     _ere_n=${#_ere_string}
>     for (( _ere_i=0; _ere_i < _ere_n; _ere_i++ )); do
>         if [[ ${_ere_string:_ere_i:1} == $_ere_pattern ]]; then
>             _ere_tmp+=\\${_ere_string:_ere_i:1}
>         else
>             _ere_tmp+=${_ere_string:_ere_i:1}
>         fi
>     done
>     _ere_string=$_ere_tmp
> }
>
> And, a demonstration:
>
> unicorn:~$ printf '%s\n' "$faml"
> // $5 [footlong] \\
> unicorn:~$ ere_escape faml
> unicorn:~$ printf '%s\n' "$faml"
> // \$5 \[footlong\] \\\\
>
> You'll need to ere_escape each of your three substituted values before
> injecting them into the regexes that are passed into the sed program.
> You'll also need to double-check that I came up with the correct list of
> ERE special characters, AND that I correctly translated them into a bash
> glob-type pattern for matching. The square brackets in particular needed
> special handling, but I'm sure that someone who actually understands
> that sed program quoted up above can handle a bit of regex-to-glob
> manipulation, eh?
>
> Finally, double-check that your "sed -E" really is using POSIX ERE and
> not some GNU-tainted variant with additional special characters. If
> it's got extensions, then you'll have to handle THOSE also.
>
> All of this is why I went with a nice, simple awk program instead of
> trying to deal with sed.
I agree.
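For comparison, a minimal sketch of the awk route, where the marker travels in an environment variable and is compared as a literal string, so nothing needs escaping. The function name strip_prefix and its argument are hypothetical, and this is not the awk program from earlier in the thread.

```shell
# strip_prefix (hypothetical helper): $1 is the literal comment marker,
# the text to process comes from stdin.
strip_prefix() {
    cmt="$1" awk '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # ignore leading whitespace
            marker = ENVIRON["cmt"]             # data, never parsed as code
            # index() is a literal substring search, not a regex match.
            if (marker != "" && index(line, marker) == 1) {
                rest = substr(line, length(marker) + 1)
                if (rest ~ /^[ \t]+/) {         # marker must be followed by whitespace
                    sub(/^[ \t]+/, "", rest)
                    print rest
                    next
                }
            }
            print                               # not a comment line: pass through
        }'
}

printf '%s\n' '  .* keep me' 'x* not stripped' | strip_prefix '.*'
```

The marker ".*" would match almost anything as a regex, but here it only matches the two literal characters.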