help-gawk

Re: inconstancy with RS = "(\r?\n){2}"


From: Alex fxmbsw7 Ratchev
Subject: Re: inconstancy with RS = "(\r?\n){2}"
Date: Sun, 25 Jul 2021 15:06:44 +0200

Hm, thank you, all of you.

It is a pity that gawk cannot do this where it seems it should.
The general approach I would take is to accumulate characters one by one
and check for an RS match after each one, which seems unproblematic to me,
so I have no idea what gawk does instead :))

On Sun, Jul 25, 2021 at 3:04 PM Wolfgang Laun <wolfgang.laun@gmail.com> wrote:
>
> I have been looking at the code in io.c and re.c.
>
> gawk lets you specify an arbitrary regex as RS, the record separator. But in 
> an environment (terminal, socket) where the input data is not yet available 
> to the gawk code looking for a match with RS, it is in general impossible to 
> decide whether the full RS has been encountered or not unless some more input 
> has been entered. Of course, there are regexes where you can tell, e.g. 
> /ab?c/. But this becomes more and more difficult, e.g., when you have 
> parentheses and repetitions making the analysis rather complex. So, to be on 
> the safe side, gawk reads yet another line from the input source and then 
> passes another record to the user's code.
>
> gawk is not a (soft) real time program and cannot react to all RS immediately 
> after they have been typed in on a TTY or sent over a line.
>
> If you need this behavior, leave the default RS and implement a simple FSM 
> which is better equipped to handle RS like /(\r?\n){2}/.
>
> The gawk user manual could include a paragraph describing what I have tried 
> to say above, perhaps better formulated.
>
> -W
>
>
>
> On Sun, 25 Jul 2021 at 13:55, Alex fxmbsw7 Ratchev <fxmbsw7@gmail.com> wrote:
>>
>> thank you for the true and detailed analysis
>>
>> On Sun, Jul 25, 2021, 13:49 Ed Morton <mortoneccc@comcast.net> wrote:
>>>
>>>
>>>
>>> On 7/25/2021 4:47 AM, arnold@skeeve.com wrote:
>>>
>>> Greetings.
>>>
>>> Thank you for taking the time to make a bug report. In the future please
>>> send a concise description of the problem with a test program and data.
>>> It was hard for me to determine what you really think is the bug.
>>>
>>> It looks like your concern is with the need to enter EOF more than
>>> once from the terminal.
>>>
>>> Gawk is designed mainly for batch processing (from files or a pipe).
>>> Reading from a terminal with a complicated regexp as RS isn't the
>>> normal use case.  When RS is a regexp gawk may have to do lookahead in
>>> the input stream to be sure that the regexp has matched, and thus
>>> the need for multiple EOFs.
>>>
>>> In any case, I don't think there is an actual bug:
>>>
>>> $ od -c data
>>> 0000000   a  \n  \n  \n   b  \n  \n  \n  \n   c  \n  \n  \n  \n   d  \n
>>> 0000020
>>> $ ./gawk -v RS='(\r?\n){2}' -v ORS='|\n' '{ print }' < data
>>> a|
>>>
>>> b|
>>> |
>>> c|
>>> |
>>> d
>>> |
>>>
>>> This looks right to me.
>>>
>>> Thanks,
>>>
>>> Arnold
>>>
>>>
>>> The problem occurs when reading from a terminal:
>>>
>>> Good (no \r? in RS), every pair of `\n`s is recognized:
>>> ------------
>>> $ gawk -v RS='(\n){2}' '{print "<"$0":"RT">"}'
>>>
>>>
>>>
>>> <:
>>>
>>> >
>>>
>>>
>>> <:
>>>
>>> >
>>>
>>>
>>> <:
>>>
>>> >
>>> -----------------
>>>
>>> Bad (with \r? in RS), no RS is ever recognized:
>>> --------------
>>> $ gawk -v RS='(\r?\n){2}' '{print "<"$0":"RT">"}'
>>>
>>>
>>>
>>>
>>>
>>>
>>> -------------------
>>>
>>> Meanwhile if the input was coming from a pipe the RS including `\r?` would 
>>> be recognized:
>>> ---------
>>> $ printf '\n\n\n\n\n' | gawk -v RS='(\r?\n){2}' '{print "<"$0":"RT">"}'
>>> <:
>>>
>>> >
>>> <:
>>>
>>> >
>>> <
>>> :>
>>> -----------
>>>
>>> Regards,
>>>
>>>     Ed.
>
>
>
> --
> Wolfgang Laun
>
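
The suggestion quoted above (keep the default RS and implement a small FSM)
could be sketched roughly as follows. This is an illustration only, not code
from the thread: the function name split_on_blank, the variables rec and
pending, and the <...> output format are all hypothetical. It also simplifies
by stripping a trailing \r from every line, so a CRLF inside a record is
turned into a plain LF, which RS='(\r?\n){2}' would not do.

```shell
#!/bin/sh
# Hypothetical helper: split stdin wherever two line terminators (LF or
# CRLF) follow each other, using the default RS and a two-state machine.
# Only POSIX awk features are used, so gawk can be substituted for awk.
split_on_blank() {
    awk '
    # Print one record between angle brackets so empty records are visible.
    function emit(r) { printf "<%s>\n", r }
    {
        sub(/\r$/, "")              # simplification: treat CRLF like LF
        if (pending && $0 == "") {
            # The previous terminator plus this one form the separator.
            emit(rec)
            rec = ""; pending = 0
        } else if (pending) {
            # The previous terminator was ordinary data inside the record.
            rec = rec "\n" $0
        } else {
            # First line after a separator (or at start of input).
            rec = $0; pending = 1
        }
    }
    # A terminator still pending at EOF belongs to the last record.
    END { if (pending) emit(rec "\n") }
    '
}
```

Fed the data file from Arnold's example (split_on_blank < data), this sketch
should yield the same records shown there: a, \nb, an empty record, c, an
empty record, and d\n. Since every line is its own record under the default
RS, the state machine can emit a record as soon as it sees the second
terminator, which is presumably why the default RS was recommended for
terminal input.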


