bug#21251: sed: POSIX and the z command

bug-sed

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#21251: sed: POSIX and the z command

From:	Stephane Chazelas
Subject:	bug#21251: sed: POSIX and the z command
Date:	Tue, 31 Jan 2017 21:49:55 +0000
User-agent:	Mutt/1.5.24 (2015-08-30)

2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
> 
> So the above somewhat confuses me (as my previous email):
> 
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for posix
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed'.
> it means that in practical terms I shouldn't use POSIX API and
> implement my own regex engine...
[...]

Just to clear what I think might be the source of the confusion,
this bug is not about GNU sed not being POSIX compliant in this
instance (it is compliant), but a documentation bug about the
claim that POSIX mandates s/.*// to not empty the pattern space
if it contains invalid characters being wrong. POSIX doesn't
mandate that, it mandates nothing of sed when the input is not
text.

The current sed behaviour is compliant. When the input is not
text, *anything* is compliant as POSIX leaves the behaviour of
sed unspecified then. That's an area not covered by POSIX,
you're on your own. In particular, you're free to ensure that
s/.*// empties the pattern space if you like.

That "simple sed" can do fgets() on a statically allocated
buffer of LINE_MAX length and use POSIX regexec() on it and
still be conformant.

Now, though that would be the subject of another "feature
request" bug and as you say one that would cover all the text
utilities, not just "sed", I (not POSIX) argue that it would be
better if individual bytes that don't form part of valid
characters would be treated as a character of their own rather
than pretend they're not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.

Or like python does in some cases, work with APIs that work on
some wchar_t* instead of char* but for the translation from
char* UTF-8 to wchar_t*, use a reserved range for byte values
that don't form part of valid characters.Like python that uses
code points U+DC80 to U+DCFF for bytes 0x80 to 0xff that don't
form part of valid characters (U+D800 to U+DFFF are not
characters, they are code points which are otherwise reserved
for UTF-16 encoding).

Without having to change the APIs, another approach (in UTF-8
locales) could be to preprocess the input to change for instance
a standalone 0x80 into the would-be UTF-8 encoding of U+DC80
before calling regexec() (for which at the moment "." matches on
even though it's not a character) and do the reverse on output.
That would have some performance impact though.

Note that at the moment there's some discrepency between GNU
tools on the treatment of the would-be UTF-8 encoding of those
D800-DF00 non-characters (the UTF-16 surrogate pairs).

For instance, some treat "ed b2 80" (the would-be-UTF-8-encoding
of DC80) as 0 character, some as 1, some as 3, some as 1 and 3
at the same time:

$ export C=$'\xed\xb2\x80'
$ bash -c '[[ $C = ??? ]]' && echo yes
yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a
valid character, so are considered as characters which IMO is
the best thing to do.

$ printf %s "$C" | wc -m
0

That's not a character, so we print 0 (as required by POSIX I
beleive, wc is _not_ a text utility).

$ touch "$C"; find "$C" -name '*'
$ touch "$C"; find "$C" -name '?'
$ touch "$C"; find "$C" -name '???'
$

That file can't be matched by name!

$ printf '%s\n' "$C" | grep -xl .
(standard input)
$ printf '%s\n' "$C" | sed 's/^.$/yes/'
yes

But:

$ printf '%s\n' "$C" | grep -xPl .
$ printf '%s\n' "$C" | ./grep -Plx '.*'
(standard input)
$ printf '@address@hidden' "$C" | ./grep -Plx '@.*@'
$


Worse: it can be one character and three at the same time:

$ expr "$C" : '^.$'
3
$ printf '%s\n' "$C" | awk '/^.$/ {print length}'
3


(note that's on Linux-Mint 18.1, so not with the latest versions
of those utilities, one would have to check with the latest
versions).

(again, that's not a POSIX compliance issue for text utilities).

-- 
Stephane

[Prev in Thread]

Current Thread

[Next in Thread]

bug#21251: sed: POSIX and the z command, Assaf Gordon, 2017/01/27
- bug#21251: sed: POSIX and the z command, Stephane Chazelas, 2017/01/28
  - bug#21251: sed: POSIX and the z command, Assaf Gordon, 2017/01/28
    - bug#21251: sed: POSIX and the z command, Stephane Chazelas <=

Prev by Date: bug#25587: In -i mode, shouldn't sed continue processing the command line if it can't access some files?
Next by Date: bug#25587: In -i mode, shouldn't sed continue processing the command line if it can't access some files?
Previous by thread: bug#21251: sed: POSIX and the z command
Next by thread: bug#22636: tab characters in documentation
Index(es):
- Date
- Thread