bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-grep] [bugs #12210] EGexecute() fails to find matches on (exact &&


From: Charles Levert
Subject: [bug-grep] [bugs #12210] EGexecute() fails to find matches on (exact && match_icase)
Date: Fri, 4 Mar 2005 22:12:30 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1

URL:
  <http://savannah.gnu.org/bugs/?func=detailitem&item_id=12210>

                 Summary: EGexecute() fails to find matches on (exact &&
match_icase)
                 Project: grep
            Submitted by: None
            Submitted on: Fri 03/04/2005 at 11:41
                Category: None
                Severity: 5 - Average
              Item Group: None
                  Status: Need Info
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open

    _______________________________________________________

Details:

This bug causes problems with --color and --only-matching when paired with
--ignore-case.

EGexecute() in search.c fails to find 'exact' matches differing in case when
match_icase has been set.

See also bug #9768


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Fri 03/04/2005 at 21:37       By: Charles Levert <charles_levert>
> I'm not considering wide char issues here.

Note that wide char support might be there and its functions used even if the
pattern and files are plain ASCII.  Preliminary case-folding is only needed or
performed when it's there.

> grep.c::prline()

The prline() problem where the case-folded match instead of the original is
printed is known.  I advocate getting rid of the whole thing in patch #3767
as it's no longer needed anyways (unless you can prove otherwise).  The
printing problem is simple:  in the "only-matching && match_icase" code path
of prline(), b should be set to "beg + ...", not "ibeg + ...", just as it is
in its other match_icase code path.

> a new design...

Yes, an interface similar to that of pcre.  Mainly for the ability to be able
to say either "same string but next match" or "same string but starting at
char offset i instead of 0" (for the ^pattern and \<pattern bug when multiple
matches per line are involved).  Similar to \G in Perl or -start in Tcl.

> I was hoping to help the grep community

That's always appreciated.

[Side note:  please create yourself a Savannah account and log in for posting
so that we can know who you are, especially since in your case I would have
recognized you from the bug-grep mailing list and known of your familiarity
with the code.]


-------------------------------------------------------
Date: Fri 03/04/2005 at 20:27       By: Anonymous
I'm using latest (today's) CVS.
For a short example of a consequence look at bug #9768
I'm not considering wide char issues here.

Detailed description of what I see as the root problem (design problems
involving EGexecute with exact=1) follows. 

Disclaimer: I have studied the grep source quite hard, but only for a few
days and I am still struggling with dfa, kwset and regex internals. Please do
not shoot. This mail is long.

involved globals:
match_icase && only_matching

mainly involved functions:
grep.c::prline()
search.c::EGexecute() and called functions

command: echo Claudio | grep -io claudio 
output: claudio
expected output: Claudio

description:
after compiling the pattern (Gcompile), we reach grepfile(), then grep(),
then grepbuf().
In grepbuf() we have the call to execute (in our case EGexecute), called with
last parameter 0 (non exact), and on the buffer containing "Claudio". We are
looking for matching lines, but do not care about offset of the match in the
line yet.

Ok, EGexecute finds the match easily, because we succeed using the first
strat (kwset). The position of the matching line (0) in the buffer is
returned.

Back to grepbuf(), we now enter in prtext() since we found a match, and from
prtext() to prline().

Now the nasty prline, in all the cases involving match_icase  and either
color_option or only_matching, needs to search again in the string, this time
with exact=1 (we need the actual offset in the line), because we want to be
able to mark before|match|after for output purposes.

The nasty prline, for all cases about match_icase says:

if (match_icase)
{
/* cut */
    for (i = 0; i < lim - beg; i++)
      ibeg[i] = tolower (beg[i]);

      /* do_stuff involving execute() passing exact = 1
         calc ing on the NEW buffer,
         but also printing from the NEW buffer */
}

Now why is the buffer being converted to lower case before calling execute()
(EGexecute()) with exact = 1?

I think that the answer is, because EGexecute fails to consider icase when
passed with exact=1 and cases differ. If this is wanted, it looks very evil
to me.

So the hack has been written around it, which mostly works, but helps making
the prline function the blob it is, and causing the damage of messing the
output in the only_matching case.

I think that at first the search.c function EGexecute() exact=1 case should
be fixed, to make it work even in the match_icase situation without having to
tolower all the data before calling it.

I think that as long term goal, a new design for the exact case should be
found. Bug #11579 for example is involved, very difficult to fix (in a clean
way) with current exact case design imo.

I was hoping to help the grep community, so please if I missed something
obvious, help me understand.

Thanks

CLaudio


-------------------------------------------------------
Date: Fri 03/04/2005 at 19:38       By: Charles Levert <charles_levert>
I'm not sure if we should care about memory leaks, as there are already some
and *compile is only called once from src/grep.c, but here's a third version
of the patch.


-------------------------------------------------------
Date: Fri 03/04/2005 at 19:07       By: Charles Levert <charles_levert>
I think I didn't correctly understand the part this begin an internal-only
API thing in my last post.

Would the attached patch solve your problem?


-------------------------------------------------------
Date: Fri 03/04/2005 at 18:22       By: Charles Levert <charles_levert>
There is a twlower() call for this in check_multibyte_string() in
src/search.c.  That's only used if defined(MBS_SUPPORT), if (MB_CUR_MAX > 1),
and if match_icase.  That mirrors what is initially done to the pattern in
src/grep.c.

Maybe what's done to the pattern should be moved to the *compile functions in
src/search.c for symmetry.

The only other condition that triggers the call to check_multibyte_string()
is if kwset.

I still would like to be able to reproduce the bug.  Sorry to ask again, but
are you using *latest* CVS (more recent than at the very least src/search.c
version 1.30, 2004/12/16 07:18:15)?  Can you produce a test case shell script
that should fail when I try it?


-------------------------------------------------------
Date: Fri 03/04/2005 at 14:50       By: Anonymous
Using CVS.
Sorry for the confusion that may arise, the bug is not visible because it's
in the API that search.c exposes to grep.c .

Since EGexecute() is not able to find matches when the passed 'exact'
parameter is set and match_icase is set (and buffer and pattern differ in
case), an ugly workaround exists in grep.c, where the line is converted to
lower case on the fly before calling EGexecute. This conversion however
causes other problems with --only-matching and is not the right place to
solve it.

The problem should be solved in search.c.
I am investigating patch #1940 that might hold part of the solution.

Additional contribution to understanding the issue would be welcome.


-------------------------------------------------------
Date: Fri 03/04/2005 at 13:24       By: Charles Levert <charles_levert>
Which version of GNU grep are you using?  This may already have been fixed. 
It works for me on the CVS version.  Only --color but not --only-matching
works on an older (not latest) Red Hat version based on 2.5.1.








    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Fri 03/04/2005 at 19:38  Name: search.c.icase.patch  Size: 2.98KB   By:
charles_levert
Same, frees copy of initial mb pointer.
<http://savannah.gnu.org/bugs/download.php?item_id=12210&item_file_id=2276>
-------------------------------------------------------
Date: Fri 03/04/2005 at 19:20  Name: search.c.icase.patch  Size: 3.28KB   By:
charles_levert
Same, without memory leaks.
<http://savannah.gnu.org/bugs/download.php?item_id=12210&item_file_id=2275>
-------------------------------------------------------
Date: Fri 03/04/2005 at 19:07  Name: search.c.icase.patch  Size: 2.51KB   By:
charles_levert
Unidiff patch to add check_multibyte_pattern() to src/search.c
<http://savannah.gnu.org/bugs/download.php?item_id=12210&item_file_id=2274>

    _______________________________________________________

This item URL is:

  <http://savannah.gnu.org/bugs/?func=detailitem&item_id=12210>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/





reply via email to

[Prev in Thread] Current Thread [Next in Thread]