[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: prior work on non-backtracking regex engine?
From: |
Danny McClanahan |
Subject: |
Re: prior work on non-backtracking regex engine? |
Date: |
Sun, 07 Apr 2024 04:42:13 +0000 |
Thanks so much for this helpful feedback, Ihor! This makes me much more
confident about packaging this as a dependency!
I've written out a summary of the packaging story for this proposed new
dependency, along with a tentative work outline at the bottom.
On Wednesday, March 13th, 2024 at 09:19, Ihor Radchenko <yantar92@posteo.net>
wrote:
> Danny McClanahan dmcc2@hypnicjerk.ai writes:
>
> > (4) What is the best way to package third-party code for emacs?
> >
> > I was thinking of architecting this regex engine as a third-party
> > codebase exposing a C ABI shared library, which the emacs build system
> > could detect as an optional dependency (like libjansson). I was hoping
> > to use rust to implement this regex engine, but I know that a cargo
> > package alone isn't enough for non-rust code to depend on: I was also
> > planning to maintain package recipes for this regex engine for
> > multiple linux distros, so that emacs could easily add it to the build
> > system (like librsvg). Are there any further constraints I should know
> > about for optional dependencies in the emacs build system?
>
>
> AFAIR, previous discussions raised some concerns about using Rust in
> particular:
> https://yhetil.org/emacs-devel/E1pKAM0-0001Ss-W6@fencepost.gnu.org/
> Because of Rust volatility and some license issues.
>
> That said, the linked discussion was not about linking to rust libraries
> via C ABI, but about contributing Rust code to Emacs core directly.
Thanks so much for this context! It sounds like an optional C ABI library
dependency is absolutely the ideal way to package something for emacs to depend
on, both for legal and maintainability reasons. I absolutely agree with the
quote from that linked discussion that "Rust changes too fast" for emacs to
reliably expect to build its own rust code. The rust language makes it quite
difficult to avoid involving cargo when packaging rust code for distribution,
but the emacs build system itself should not need to be aware of cargo or rust
at all, as per "we should not publicly refer to its existence" (to avoid hidden
dependencies on its unstable behaviors). Each *distro packager* for this rust
regex library would need to invoke cargo to compile/link the C ABI library +
generated C headers, but those would just be installed into somewhere like
/usr/include and /usr/lib like any other C library, so emacs should not need to
be made aware of any rust- or cargo-specific behavior even if I choose to use
rust as an implementation language. I am extremely enthusiastic about achieving
a rust packaging solution for this work that can serve as a role model for
other portable rust code exposing a C ABI library interface.
I'm particularly hoping to package any rust output as a shared and/or static C
ABI library using rust's no_std configuration, which removes a large portion of
rust stdlib code as well as avoiding any implicit libc dependency. Rust
actually has a libc crate (https://github.com/rust-lang/libc) which I believe
can be used in no_std environments to invoke libc if needed, but since I am
hoping to produce a C ABI library interface which delegates to the caller for
alloc/free callbacks (instead of performing any dynamic allocation in the regex
engine library code), it may be feasible to avoid even a libc dep.
If I employ any exotic instructions, it would be via rust's core::simd API (see
e.g. https://mcyoung.xyz/2023/11/27/simd-base64/) for platform-specific
vectorized instruction selection. This thoughtful and simple interface falls
back to portable SWAR techniques for generic platforms (see
https://doc.rust-lang.org/stable/core/simd/index.html#portable-simd-works-on-every-target),
so this feature of the rust language would actually *improve* our portability.
However, since llvm (and therefore rust) still supports fewer platforms than
gcc, I also intend to incorporate rustc_codegen_gcc (which compiles rust with
libgccjit) and/or the gccrs frontend (not yet functional but improving rapidly)
into the packaging scripts for this regex engine library, to ensure that emacs
does not begin to implicitly disadvantage non-llvm platforms.
> Another somewhat relevant discussion is
> https://yhetil.org/emacs-devel/95980ffc-86e7-ad54-4a20-539d8c6ea5d0@mailo.com/
Wow, this is super neat!! A lot of my programming experience has actually been
on developing build tools or integrating them with editors and this
discussion/library looks super thoughtful! I will have to chew on this
discussion more.
> P.S. Better regexp engine would be very welcome.
^_^ !!! This makes me so happy to hear! I took the time to learn how
regex-emacs.c as well as other emacs C code interacts with the gap buffer last
week, and was super glad to see how simple the gap buffer data structure is. I
am not an expert on string search performance yet, but I believe the gap buffer
(which is always allocated within a single block, and has at most two
non-adjacent data sections) is likely much more amenable to high-performance
search techniques than a more complex data structure such as a rope. And I was
also *super* pleased to see that regex-emacs.h itself doesn't expose any
dependency on the gap buffer or other internal emacs representations (except
regarding multibyte encoding). So in my amateur evaluation, emacs actually
seems very well-placed to take advantage of high-performance regex engine
techniques without any big structural changes.
As a concrete example, I currently maintain helm-rg on MELPA
(https://github.com/cosmicexplorer/helm-rg), which itself performs quite a few
regex matches to extract process output from ripgrep, and have previously
contributed to helm-ag and helm-swoop (the latter of which searches within
emacs buffers instead of hitting the filesystem). While ripgrep is extremely
fast (even over very large corporate monorepos) and emacs makes it simple
enough to communicate efficiently with an asynchronous subprocess, I realized
that regex matching over buffer contents (which emacs has already mapped into
its virtual memory space) might be an improvement for multiple reasons:
(a) it avoids an unnecessary round-trip through the filesystem (also friendlier
to TRAMP),
(b) it allows the emacs user to narrow the search via their preferred method of
buffer list selection (as opposed to maintaining a separate interface to
configure ripgrep's fs traversal with CLI flags and ignore files).
So practically, I'm hoping this work allows me to effectively deprecate helm-rg
and replace a lot of finicky lisp code (which currently performs asynchronous
IPC, globbing files, and writing back to the filesystem when editing match
results inline) with a simple loop over a provided buffer list that performs a
very fast zero-copy in-memory search and batched pattern replacement.
I would *really like* to eventually have emacs depend on an existing regex
engine like re2 or rust regex to take advantage of their bugfixes and
optimizations, but both of those engines (and most others) require utf-8 input,
and (I'm pretty sure) can't easily be made to support emacs's multibyte
functionality. So I think there is a strong case for a new engine here,
especially one licensed with the GPLv3 (or any later version) as opposed to the
LGPL or other more permissive license.
-------
Informal sketch of work outline:
(1) simple, robust non-backtracking regex matcher with a backwards-compatible
API for regex-emacs.h (no support for backrefs, maybe just a single NFA matcher
with the PikeVM to test the API)
(2) basic benchmark harness + integration into emacs build system (so we can
get a concrete performance comparison and %improvement, and start creating
distro packages)
(3) implement a lazy DFA / other specialized automata to get performance
generally on par with re2/rust regex
(4) implement a SIMD prefilter for literal substrings (the prototype should now
be significantly faster than regex-emacs.c on all supported inputs)
(5) separate out a regex pattern parsing API, and expose regex objects to lisp
code
(6) try implementing backrefs
(7) batched search-replace API for replacement of pattern strings in buffer
contents
In general, I've decided to put off these features at first:
(a) backrefs
(b) per-mode customizable char classes & word/symbol boundaries
(c) regex matching against syntax properties (not the raw string data)
- I will probably propose deprecating (c) entirely, especially if tree-sitter
ends up being able to take over syntax propertization in general.
- (b) is non-standard, and if we deprecate it we may be able to more easily
lean on an existing regex pattern parser like regex-syntax
(https://docs.rs/regex-syntax/), although regex-syntax also does not support
(a) backrefs and it probably makes sense to make our own pattern parsing
library anyway as part of (5).
- I personally would like to avoid entirely deprecating (a) backrefs even if by
definition they require some level of backtracking, because I personally
consider them very handy for small bounded inputs. If I can't figure out a
clean way to support backrefs in this new engine, I would like to simply retain
the regex-emacs.c engine for backref patterns (and maybe add some defvar that
warns/errors if a backref pattern is compiled when set).
Thanks so much for providing such thoughtful and patient feedback!
--Danny
- Re: prior work on non-backtracking regex engine?,
Danny McClanahan <=