bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#56079: Missing performance optimisation: Word start/end tests


From: sur-behoffski
Subject: bug#56079: Missing performance optimisation: Word start/end tests
Date: Sun, 19 Jun 2022 15:08:46 +0930
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0

G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements.  The only release is at savannah.non-gnu.org:

      https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.

--

I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST.  [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]

One thing that is captured is predicates e.g. relating to IS_WORD:
         "IS_WORD_YES"   (0x01),
         "IS_WORD_NO"    (0x10), and
         "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur.  My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.

While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-", it's a commentary on a peculiar Australian
event:  See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]





reply via email to

[Prev in Thread] Current Thread [Next in Thread]