help-bash

Re: Remove comment characters at start of line


From: Hans Lonsdale
Subject: Re: Remove comment characters at start of line
Date: Thu, 2 Feb 2023 15:04:51 +0100 (CET)


> ----------------------------------------
> From: Greg Wooledge <greg@wooledge.org>
> Date: Feb 3, 2023, 1:01:29 AM
> To: <help-bash@gnu.org>
> Subject: Re: Remove comment characters at start of line
> 
> 
> On Thu, Feb 02, 2023 at 09:40:02AM +0100, alex xmb ratchev wrote:
> > On Thu, Feb 2, 2023, 9:31 AM Hans Lonsdale <hanslonsdale@mailfence.com>
> > wrote:
> > > I am trying to understand this sed line matching part, that is supposed to
> > > remove the comment characters
> > > at the start of the line.  The pattern is pn_ere.
> > 
> > understand regexes ,
> > ^ from beginning
> > [[:space:]]* optional match space characters ( optional * , * = 'none or
> > more' )
> > then a list of the matches
> > begins with # for normal scripting cmds
> > others also
> > ^ # comment
> > ^ ! ...
> > ^ @c ...
> > $cmt is some comment format value
> > ..
> > [[:space:]]+ must match a space
> > ..
> > 
> > i dunno much sed magic , but the regex is the/a key
> > 
> > >   pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"
> > >
> > >   sed -E -n "
> > >     /$beg_ere/ {
> > >       :L1
> > >       n
> > >       /$end_ere/z
> > >       /$pn_ere/!z
> > >       s/// ; p
> > >       tL1
> > >     }
> > >   "
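A quick way to see what pn_ere accepts is to try it with bash's own =~ operator, which also takes POSIX ERE. This is only a sketch: the function name strip_comment_prefix and the value '%' for cmt are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: test one line against pn_ere and print what is left after the prefix.
# strip_comment_prefix and the cmt value passed to it are hypothetical.
strip_comment_prefix() {
    local line=$1 cmt=$2
    local pn_ere="^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+"
    if [[ $line =~ $pn_ere ]]; then
        # BASH_REMATCH[0] is the whole matched prefix; remove it literally.
        printf '%s\n' "${line#"${BASH_REMATCH[0]}"}"
    else
        return 1    # no comment prefix followed by whitespace
    fi
}

for line in '## note' '  @c texinfo' '% tex' '#nospace' 'code # trailing'; do
    if out=$(strip_comment_prefix "$line" '%'); then
        printf 'match   [%s] -> [%s]\n' "$line" "$out"
    else
        printf 'nomatch [%s]\n' "$line"
    fi
done
```

The first three lines match and lose their prefix; '#nospace' fails because [[:space:]]+ needs at least one whitespace character, and 'code # trailing' fails because of the ^ anchor.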
> 
> That's all true, but there's a bit more to this story.  The sed program
> here is enclosed in double quotes, and it has shell substitutions inside
> it -- three of them -- and one of THOSE is a double-quoted regular
> expression that has a FOURTH shell substitution inside it.
> 
> At least in the case of pn_ere, you're clearly trying to generate a regex
> at run time using data known only at run time (that ${cmt} expansion).
> I don't know what's in beg_ere and end_ere but I have to assume they
> work similarly -- they probably contain your "faml" and "asmb" strings.
> 
> Each of those substituted values has the potential to destroy your sed
> program, because it's injected directly into the sed program.  If the
> contents of these variables are generated by outside parties (end users,
> files, etc.) and are not sanitized, you might have a code injection bug.
> 
> When using sed, there is no alternative to injection.  You can't specify
> strings inside environment variables, or anything like that.  sed assumes
> that its program is written entirely by a programmer, and does not
> contain data from untrusted sources.
> 
> Therefore, if you insist on using sed (I wouldn't!), you MUST sanitize
> your injected data strings.

The more I get into this, the more trouble insisting on sed is causing.
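To make the injection problem concrete, here is a small sketch of what an unsanitized cmt does to the pattern. The helper name strip_prefix and the cmt values are invented for the demonstration; the regex is the pn_ere from the original question.

```shell
#!/usr/bin/env bash
# Sketch only: the cmt values are made up to show what unsanitized input does.

strip_prefix() {
    # $1 = value for cmt; reads text on stdin and deletes the matched prefix.
    local pn_ere="^[[:space:]]*([#;!]+|@c|$1)[[:space:]]+"
    sed -E "s/$pn_ere//"
}

# "." is an ERE metacharacter, so "x " is wrongly treated as a comment prefix.
unsafe_out=$(printf '%s\n' 'x = 1' | strip_prefix '.')

# Escaped by hand, the dot is literal and the line passes through unchanged.
safe_out=$(printf '%s\n' 'x = 1' | strip_prefix '\.')

# "/" ends the s/// regex early, so sed rejects the whole program.
printf '%s\n' '// note' | strip_prefix '//' 2>/dev/null
slash_status=$?

printf 'unsafe: [%s]\nsafe:   [%s]\nslash exit status: %s\n' \
    "$unsafe_out" "$safe_out" "$slash_status"
```

With cmt set to "." the line "x = 1" comes out as "= 1", and with cmt set to "//" sed exits with a nonzero status before reading any input.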
 
> The simplest approach would be to come up with a list of all the
> characters that can possibly have special meaning in your regex flavor
> of choice, and "escape" them all with backslashes.  You're using "sed -E"
> which is nonstandard, but we can guess that it means POSIX ERE flavor
> regexes.  Therefore you need a list of the special characters in ERE.
> 
> Reading regex(7) on Debian 11 gives me the following characters, which
> MAY OR MAY NOT be a full list:
> 
> | * + ? { } ( ) ^ $ \ [ ] 
> 
> To "escape" each of these characters within a string, we can either
> call sed (UGH!) or use a bash parameter expansion inside a loop
> (which means we must now write the script in bash, not sh):
> 
> faml='who knows what'
> pattern='[]|*+?{}()^$\\[]'
> tmp=
> n=${#faml}
> for ((i=0; i<n; i++)); do
>     if [[ ${faml:i:1} == $pattern ]]; then
>         tmp+=\\${faml:i:1}
>     else
>         tmp+=${faml:i:1}
>     fi
> done
> faml=$tmp
> 
> This has to be done at least three times, so it's worth writing it as
> a function.  Assuming bash 4.3 or later, we can use namerefs to pass
> the input by reference:
> 
> ere_escape() {
>     local -n _ere_string=$1
>     local _ere_pattern='[]|*+?{}()^$\\[]'
>     local _ere_tmp _ere_i _ere_n
> 
>     _ere_tmp=
>     _ere_n=${#_ere_string}
>     for (( _ere_i=0; _ere_i < _ere_n; _ere_i++ )); do
>         if [[ ${_ere_string:_ere_i:1} == $_ere_pattern ]]; then
>             _ere_tmp+=\\${_ere_string:_ere_i:1}
>         else
>             _ere_tmp+=${_ere_string:_ere_i:1}
>         fi
>     done
>     _ere_string=$_ere_tmp
> }
> 
> And, a demonstration:
> 
> unicorn:~$ printf '%s\n' "$faml"
> // $5 [footlong] \\
> unicorn:~$ ere_escape faml
> unicorn:~$ printf '%s\n' "$faml"
> // \$5 \[footlong\] \\\\
> 
> You'll need to ere_escape each of your three substituted values before
> injecting them into the regexes that are passed into the sed program.
> You'll also need to double-check that I came up with the correct list of
> ERE special characters, AND that I correctly translated them into a bash
> glob-type pattern for matching.  The square brackets in particular needed
> special handling, but I'm sure that someone who actually understands
> that sed program quoted up above can handle a bit of regex-to-glob
> manipulation, eh?
> 
> Finally, double-check that your "sed -E" really is using POSIX ERE and
> not some GNU-tainted variant with additional special characters.  If
> it's got extensions, then you'll have to handle THOSE also.
> 
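While double-checking the list I noticed it has no "." in it, and "." is special in ERE. The "/" that delimits the regex in the sed program is not a regex character, but it needs escaping too when the value ends up between slashes. Below is a sketch of a variant that covers both, assuming "/" stays the delimiter. The name sed_ere_escape is my own, it prints the result instead of using a nameref, and it does not handle embedded newlines.

```shell
#!/usr/bin/env bash
# Sketch: escape a string for use inside a "sed -E" /.../ address or s/// regex.
# Adds "." and "/" to the quoted list; assumes "/" is the sed delimiter.
sed_ere_escape() {
    local s=$1 out= c i
    # Glob bracket expression: "]" comes first and "[" last so both are
    # literal, and "\\" stands for one literal backslash.
    local special='[]./|*+?{}()^$\\[]'
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        if [[ $c == $special ]]; then
            out+=\\$c       # prefix special characters with a backslash
        else
            out+=$c
        fi
    done
    printf '%s\n' "$out"
}

# Usage: escape the run-time value before building the regex.
cmt=$(sed_ere_escape '.')
printf '%s\n' 'x = 1' '. real comment' |
    sed -E "s/^[[:space:]]*([#;!]+|@c|${cmt})[[:space:]]+//"
```

In the usage lines only ". real comment" loses its prefix; "x = 1" is left alone because the dot is now literal.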
> All of this is why I went with a nice, simple awk program instead of
> trying to deal with sed.

I agree.
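For anyone reading along, here is a rough sketch of how the awk route avoids the problem. It is not the awk program from earlier in the thread, just an illustration under my own assumptions: the function name uncomment_block and the UB_* variable names are hypothetical, the two markers and the comment prefix are plain strings, and a block is the lines between a line containing the begin marker and a line containing the end marker. The values travel through the environment and are compared with index(), so no character in them is ever treated as regex or as code.

```shell
#!/usr/bin/env bash
# Sketch: print the commented lines of a marked block with the prefix removed.
# $1 = begin marker, $2 = end marker, $3 = comment prefix (all literal text).
uncomment_block() {
    UB_BEG=$1 UB_END=$2 UB_CMT=$3 awk '
        BEGIN {
            beg = ENVIRON["UB_BEG"]; end = ENVIRON["UB_END"]
            cmt = ENVIRON["UB_CMT"]
        }
        # Outside a block: look for the begin marker and print nothing.
        !inside { if (index($0, beg)) inside = 1; next }
        # Inside a block: the end marker closes it.
        index($0, end) { inside = 0; next }
        {
            line = $0
            sub(/^[[:space:]]+/, "", line)          # optional leading space
            if (index(line, cmt) == 1) {            # literal prefix match
                rest = substr(line, length(cmt) + 1)
                # Require whitespace after the prefix, as pn_ere does.
                if (sub(/^[[:space:]]+/, "", rest)) print rest
            }
        }
    '
}

printf '%s\n' 'before' 'BEGIN [a]' '.* echo hi' 'xx echo no' 'END [a]' |
    uncomment_block 'BEGIN [a]' 'END [a]' '.*'
```

Even with a prefix like ".*" and markers containing brackets, only the line that literally starts with ".*" is printed, as "echo hi".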
 