bug-grep
bug#17711: Updated "untangle" script: More functional, and resynced with 2.20


From: behoffski
Subject: bug#17711: Updated "untangle" script: More functional, and resynced with 2.20
Date: Fri, 06 Jun 2014 13:21:20 +0930
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

G'day,

Here's an updated "untangle" Lua script (http://www.lua.org/),
resynchronised with the changes up until Grep 2.20
(specifically commit ...78f07b8c8e26, the post-release
administrivia commit after the release).

The original script (17 April) was fairly rough and
incomplete in places, and was overshadowed by the many
improvements being made in the lead-up to 2.19, especially
the extremely impressive work from Norihiro, Paul and Jim
at multiple levels to improve performance, documentation, and
code consistency and clarity in many areas.

I still believe that a more fundamental rework of dfa.c, based
on making the token language more expressive, is worthwhile,
but that such an invasive change could only occur if dfa.c was
made easier to modify; hence, this untangle script.

There are two major improvements over the 17 April version:

  1. The modules now clean up after themselves, instead of
     leaving dangling memory blocks.  Running "make check"
     with Valgrind, as per the script suggested by Jim in
     his 21 May message, now runs cleanly, although the
     "symlink" test hangs on my system for both the original
     and modified grep versions, for reasons that I haven't
     tried to track down.

  2. I've created a new module, "mbcsets", which models
     multibyte character class sets at arm's length to users,
     in a similar fashion to the "charclass" module.  The
     result is that the token output from fsaparse should now
     be identical to the token output from parse in dfa.c,
     whereas previously the comparison could only be made at
     the output of fsalex, as it did not share its
     internally-built mbcsets structures.
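
As a rough illustration of the two points above (a module that
keeps its data at arm's length and releases everything it
allocates), here is a minimal sketch in C.  Every name in it
(wcset, wcset_new, wcset_add, wcset_count, wcset_free) is
hypothetical and is not taken from the mbcsets module or from
the script's output:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical opaque-handle module; NOT the real mbcsets interface.
   In a real module the header would expose only this typedef...  */
typedef struct wcset wcset;

/* ...and the layout would live in the .c file, hidden from clients.  */
struct wcset
{
  unsigned long *chars;   /* characters added so far */
  size_t nchars;          /* number of entries in use */
  size_t alloc;           /* number of entries allocated */
};

/* Create an empty set; returns NULL if allocation fails.  */
static wcset *
wcset_new (void)
{
  return calloc (1, sizeof (wcset));
}

/* Add a character; returns 0 on success, -1 on allocation failure.  */
static int
wcset_add (wcset *set, unsigned long ch)
{
  assert (set != NULL);
  if (set->nchars == set->alloc)
    {
      size_t new_alloc = set->alloc ? 2 * set->alloc : 4;
      unsigned long *grown = realloc (set->chars,
                                      new_alloc * sizeof *grown);
      if (!grown)
        return -1;
      set->chars = grown;
      set->alloc = new_alloc;
    }
  set->chars[set->nchars++] = ch;
  return 0;
}

/* Report how many characters have been added.  */
static size_t
wcset_count (const wcset *set)
{
  assert (set != NULL);
  return set->nchars;
}

/* Release everything the set owns; accepts NULL, like free ().  */
static void
wcset_free (wcset *set)
{
  if (!set)
    return;
  free (set->chars);
  free (set);
}
```

A client that pairs every wcset_new () with a wcset_free () leaves
no blocks behind for Valgrind to report.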

There's quite a number of implementation differences that are
worth inspecting in the untangled versions:

  1. Charclass previously offered persistent class indices,
     but class pointers could change (due to realloc ()).  It
     now also explicitly supports persistent pointers.  This has
     facilitated better charclass caching in various places;

  2. I'm still working towards a scenario where multiple
     lexers/parsers/etc could co-exist in different-locale
     and/or different-regex-option versions, with the locale
     captured when fsalex_syntax () is called by the client,
     and all other users rigorously obtaining locale information
     directly or indirectly via the lexer.  This is incomplete;
     one example that I haven't fixed yet is the using_utf8 ()
     test in fsaparse;

  3. find_pred (), in fsalex, exploits the charclass pointer
     guarantee to do lazy caching of predicate searches;

  4. The treatment of \s and \S in fsalex_lex () has been
     rewritten to not use PUSH_LEX_STATE/POP_LEX_STATE; instead,
     it calls underlying resources such as find_pred directly.
     (Incidentally, the IS_WORD_CONSTITUENT implementation of
     \w/\W in dfa.c is quite different to the \s/\S treatment
     in the original dfa.c:  Is this code correct in a
     multibyte environment?);

  5. I reworked FETCH_WC/FETCH_CHAR only a few days before Paul
     and Norihiro made extensive improvements;  I've tried to
     integrate their changes into my version without compromising
     their excellent work; my version is sufficiently different,
     in documentation even if nowhere else, to be worth a look;
     and

  6. My personal coding preference is to not guarantee
     single-exit functions, but instead, try to treat exceptional
     and/or simple cases early in the function, with an
     immediate function return.  My hope is that this makes the
     remainder of the function easier to understand, as the
     reader knows that certain cases have been eliminated.

     In this vein, I (possibly rashly) decided to rewrite
     atom () and closure () in fsaparse; all feedback on this
     effort is welcome.
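
To make the persistent-pointer idea in item 1 concrete, here is a
minimal sketch of one way to hand out both stable indices and
stable pointers: allocate classes in fixed-size chunks, so that
growing the pool never moves an existing class.  This is an
assumption made for illustration only; the names (cc_set, cc_pool,
cc_pool_alloc, cc_pool_get, cc_pool_free) and the chunk sizes are
hypothetical, and the untangled charclass module may well work
differently:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names and sizes; NOT the charclass module's API.  */
enum { CC_WORDS = 8, CHUNK_SIZE = 4, MAX_CHUNKS = 64 };

typedef struct { unsigned int bits[CC_WORDS]; } cc_set;

typedef struct
{
  cc_set *chunks[MAX_CHUNKS];  /* each chunk is allocated once, never moved */
  size_t count;                /* number of classes handed out so far */
} cc_pool;

/* Hand out a new zeroed class, storing its index in *INDEX_OUT if
   that is non-NULL.  Returns NULL if memory or chunk slots run out.
   Earlier pointers stay valid because no chunk is ever realloc'd.  */
static cc_set *
cc_pool_alloc (cc_pool *pool, size_t *index_out)
{
  assert (pool != NULL);
  size_t chunk = pool->count / CHUNK_SIZE;
  size_t slot = pool->count % CHUNK_SIZE;

  if (chunk >= MAX_CHUNKS)
    return NULL;
  if (slot == 0)
    {
      pool->chunks[chunk] = calloc (CHUNK_SIZE, sizeof (cc_set));
      if (!pool->chunks[chunk])
        return NULL;
    }
  if (index_out)
    *index_out = pool->count;
  pool->count++;
  return &pool->chunks[chunk][slot];
}

/* Map a persistent index back to its (equally persistent) pointer.  */
static cc_set *
cc_pool_get (cc_pool *pool, size_t index)
{
  assert (pool != NULL);
  if (index >= pool->count)
    return NULL;
  return &pool->chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
}

/* Release every chunk and reset the pool to its empty state.  */
static void
cc_pool_free (cc_pool *pool)
{
  assert (pool != NULL);
  size_t nchunks = (pool->count + CHUNK_SIZE - 1) / CHUNK_SIZE;

  for (size_t i = 0; i < nchunks; i++)
    free (pool->chunks[i]);
  memset (pool, 0, sizeof *pool);
}
```

With a guarantee like this, a caller can keep a class pointer in a
cache (as item 3 describes for find_pred ()) without worrying that
a later allocation will invalidate it.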

As before, the code modifies dfa.c to create "dfa-prl.c", so
that the original code and the new code can be run in "parallel",
and the outputs compared.  The comparison is logged in
/tmp/parallel.log.  The co-existence of new and old code means
that the new code has to have explicit module name prefixes in
many places, at least to avoid namespace clashes.
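
The comparison step can be pictured with a small sketch like the
one below.  It is only an illustration under stated assumptions:
the function name compare_token_streams, the use of long for
tokens, and the log line format are all hypothetical, and the
real dfa-prl.c comparison code is not reproduced here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper; NOT code from dfa-prl.c.  Compare two token
   streams element by element, write one line per difference to LOG,
   and return the number of differences found.  A length mismatch
   counts as one further difference.  */
static size_t
compare_token_streams (const long *old_tokens, size_t old_len,
                       const long *new_tokens, size_t new_len,
                       FILE *log)
{
  assert (log != NULL);
  size_t mismatches = 0;
  size_t common = old_len < new_len ? old_len : new_len;

  for (size_t i = 0; i < common; i++)
    if (old_tokens[i] != new_tokens[i])
      {
        fprintf (log, "token %zu differs: old=%ld new=%ld\n",
                 i, old_tokens[i], new_tokens[i]);
        mismatches++;
      }

  if (old_len != new_len)
    {
      fprintf (log, "length differs: old=%zu new=%zu\n",
               old_len, new_len);
      mismatches++;
    }
  return mismatches;
}
```

A caller would open the log file once, run both the original and
the new code over the same pattern, and pass the two resulting
token arrays to the helper; a return value of zero means the two
implementations agreed for that pattern.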

This message contains the updated "untangle" Lua script, along
with the "strictness" module, from LuaRocks, that I use to
stop global variables (usually the result of typos) from being
created freely.

I'll post a follow-up message shortly, with the full set of
files created and/or modified by the script, for those
who (probably quite wisely) may distrust Lua scripts from
strangers.

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software

Attachment: untangle
Description: Text document

Attachment: strictness.lua
Description: Text Data

