emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "Font-lock is limited to text matching" is a myth


From: Eric M. Ludlam
Subject: Re: "Font-lock is limited to text matching" is a myth
Date: Tue, 11 Aug 2009 22:16:53 -0400

On Tue, 2009-08-11 at 12:04 -0400, Stefan Monnier wrote:
> > I. Asynchronous parsing
> 
> BTW, I'm interested in adding core-Emacs support for such parsing, so if
> you have any ideas about it, please share them.  The way I see it, there
> should be a table-based parsing engine written in C running in
> a separate thread, so the question are "what should the tables look
> like?", "what should the output look like?", "what should the engine
> look like?", "how should the asynchronous code synchronize with the rest
> of Emacs?".  Any help along this effort would be welcome.

Hi,

  I don't think you could define a way to parse some things, like C++
with pre-processor support, via table declaration.  There will always be
some decision that needs to be made that the table designer misses out
on.  For example, to selectively exclude a block of code in #ifdef
MYSYM ... #endif.

  I do think that if there was a set of Emacs Lisp functions that were
considered "thread safe" that parser authors like myself would find a
way to restrict ourselves to that list in those situations where the
special built-in parser needs to make such decisions.  It might even be
a special set of new functions, like 'parser-cons', or 'parser-car',
though in writing it, it does feel icky.

  The project I was going to pick up after the current CEDET
merge/release stuff is related to this, though I was going to build a
separate process and communicate via pipe/socket/whatever with it. This
is because I'm seeing situations where the tag database is so large that
the whole machine slows down.  In this case it is not related to parser
speed, just the data structure size. Imagine a Lisp data structure for
the entirety of the Linux kernel's symbol space.  Such a subprocess
would also enable background parsing of files not currently in buffers
of the active Emacs.

  Let me summarize.  I think CEDET has managed to use timers to overcome
asynchronous parsing problems for the purpose of TAGGING files, and may
not be able to take advantage of asynchronous table-driven parsing
system.  There are, however, different problems CEDET faces that could
take advantage of asynchronous behaviors, but mostly in the way you can
asynchronously run "make TAGS" for etags.  As such, table driven
asynchronous parsing system could focus on a narrower set of
requirements that are not as rigorous for purposes of colorizing, and
have a synchronous form for when other kinds of logic are needed.

  As far as how to define tables for a parsing system written in C, an
old-school solution is to just use the flex/bison engines under the
Emacs Lisp API.  There are a lot of new parser generator systems though,
and I don't really know what the best one might be.  Defining some new
system for parsing using a more modern technique that has enough hooks
for tag generation would be easy to integrate into CEDET, and would
obsolete nothing.

  One of the hairier parts of the CEDET parser is the lexical analyzer.
I remember jumping through hoops trying to squeak out every stray
(if ..) or function call from the core loop, while keeping things
declarative and flexible enough to handle all the various languages.
Even so, the variable-length lexical token still ended up shuttling text
strings around in some situations where I would have preferred
references to a buffer.  Major performance was gained by treating
parenthetical groups as single lex tokens, expanding only if actually
needed.  Tagging parsers can then skip parsing function bodies, for
example, which provides a nice speed boost, but is not so good for
colorizing.

  As far as making this asynchronous, however, there aren't too many
languages that can use nothing but the core lexical types.

  Once you have a lexical token stream, David Ponce's wisent parser (a
port of bison) is top-notch, and very effective.  Moving bits into C may
not provide a huge speed boost, and would certainly not be separable
from lisp code that needs to be synchronous as far as TAGGING is
concerned.  It might be possible to define simplified code blocks such
as (FONT $1 'function) and get away with making asynchronous parsers.
I'm not sure if that is sufficient.

  Error handling in bison is a bit confusing for the uninitiated, and
hard to get right.  CEDET solved this by making the parsers implicitly
iterative at the framework level.  That means when you enter the parser
engine, it will return nil, or a single tag.  If it gets a nil, it skips
that lexical token, and tries again on the next one.  This makes it
robust to "junk" between tags, or between fcn arguments, or between
commands.

  After babbling for a while, I would guess that Stefan is probably
asking for help identifying something like a syntax table.  I think
lexical analysis is common between all the parser generator frameworks,
and has the potential to make a data structure larger than the buffer it
was derived from.  Deriving a push/pop lexical analyzer structure that
shares data with the buffer text, or can even cache itself on top of the
buffer so it doesn't need to "analyze" the whole thing over and over
would be a great first step for any parsing system.

Eric




reply via email to

[Prev in Thread] Current Thread [Next in Thread]