bug#56079: Missing performance optimisation: Word start/end tests
From: sur-behoffski
Subject: bug#56079: Missing performance optimisation: Word start/end tests
Date: Sun, 19 Jun 2022 15:08:46 +0930
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0
G'day,
I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements. The only release is at savannah.nongnu.org:
https://savannah.nongnu.org/projects/hstbm
It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.
--
I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST. [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]
One thing the lexer captures is predicates, e.g. relating to IS_WORD:
"IS_WORD_YES" (0x01),
"IS_WORD_NO" (0x10), and
"IS_WORD_MAYBE" (0x11).
I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:
a\<b
[abc]\>[def]
Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur. My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.
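The check described above can be sketched roughly as follows, again in Python
and not taken from hstbm. It assumes a flat token list in which each
single-character token is represented only by its IS_WORD flags; the token
encoding and the function name can_never_match are hypothetical. The rule used
is my reading of the usual definitions ("\<" needs a non-word character or
line start before it and a word character after it; "\>" needs the reverse),
which covers the two example patterns. Groups, alternation and repetition are
ignored here.

```python
IS_WORD_YES = 0x01
IS_WORD_NO = 0x10

# Hypothetical token encoding: boundary tests are strings, and every
# single-character token (literal or set) is just its IS_WORD flag mask.
WORD_START = "\\<"
WORD_END = "\\>"


def can_never_match(tokens):
    """Return True if a word-boundary test in `tokens` can never succeed.

    Only definite neighbours (exactly IS_WORD_YES or IS_WORD_NO) are
    used; a MAYBE neighbour, or no neighbour at all, proves nothing.
    """
    for i, tok in enumerate(tokens):
        if tok not in (WORD_START, WORD_END):
            continue
        before = tokens[i - 1] if i > 0 else None
        after = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == WORD_START:
            # Start of word: a word character before, or a non-word
            # character after, rules the match out.
            if before == IS_WORD_YES or after == IS_WORD_NO:
                return True
        else:
            # End of word: the mirror image of the test above.
            if before == IS_WORD_NO or after == IS_WORD_YES:
                return True
    return False
```

For example, the token lists for a\<b and [abc]\>[def] are both
[IS_WORD_YES, boundary, IS_WORD_YES], and both are reported as impossible
before any haystack byte is read.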
While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.
cheers,
sur-behoffski (Brenton Hoff)
programmer, Grouse Software
["sur-" means "meta-", it's a commentary on a peculiar Australian
event: See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]