make-alpha
From: Paul Smith
Subject: Internal representations (was: Re: Possible solution for special characters in makefile paths)
Date: Sat, 22 Feb 2014 19:38:17 -0500

This thread is intended to discuss how quoted strings might be
represented internal to make, assuming that they are encoded in some way
and not just left as they appear in the input makefile as Eli suggests.

On Thu, 20 Feb 2014 I wrote:
> The advantages to this are ... (b) there is no change needed to any
> existing tokenization in make, which is scanning for whitespace,
> parenthesis, braces, nul bytes, etc.: it will all continue to work
> with no changes.

I realized I may not have made my thinking behind this clear.  Doing
away with this requirement gives a more flexible and backward-compatible
solution, but requires a lot more effort.  Maybe that's a feasible
trade-off, so I'd like opinions about it.

Suppose that instead of reserving a complete set of mapping characters,
one for each special character, we instead choose one single special
character and use it as an escape character, which is I guess what Frank
was suggesting.

So instead of my original suggestion of reserving a set of characters:

>    <space>  = 001
>    <tab>    = 002
>    <colon>  = 003
>    <dollar> = 004
>    <comma>  = 005
>    <equals> = 006

we would choose the single character, say 16 (DLE), as an internal
escape character.  Then the above table becomes a set of two-byte
characters (note we escape the escape character to allow makefiles to
contain this character as well):

   <dle>    = <dle><dle>
   <space>  = <dle><space>
   <tab>    = <dle><tab>
   <colon>  = <dle><colon>
   <dollar> = <dle><dollar>
   <comma>  = <dle><comma>
   <equals> = <dle><equals>

This seems great, and very flexible: any time you see this character
it's escaping the next one, so we can support new escaped characters
very easily (as long as each is one byte long, but that's no big deal).
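
As a concrete sketch of this escape model (the function names and the
exact set of special characters are my illustration, not actual make
internals), the encode/decode pair might look like:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAKE_ESC '\020'  /* DLE, the hypothetical internal escape byte */

/* Is this a byte we would escape?  DLE itself plus make's specials.  */
static int
needs_escape (char c)
{
  return c == MAKE_ESC || (c != '\0' && strchr (" \t:$,=", c) != NULL);
}

/* Encode: prefix every special byte with DLE.  Caller frees the result.  */
static char *
encode (const char *s)
{
  char *out = malloc (2 * strlen (s) + 1);
  char *p = out;
  for (; *s != '\0'; ++s)
    {
      if (needs_escape (*s))
        *p++ = MAKE_ESC;
      *p++ = *s;
    }
  *p = '\0';
  return out;
}

/* Decode: drop each DLE and keep the byte that follows it.  Assumes a
   well-formed input, i.e. DLE is always followed by another byte.  */
static char *
decode (const char *s)
{
  char *out = malloc (strlen (s) + 1);
  char *p = out;
  while (*s != '\0')
    {
      if (*s == MAKE_ESC)
        ++s;
      *p++ = *s++;
    }
  *p = '\0';
  return out;
}
```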

There are three reasons I avoided this: first, it means all our internal
parsing, functions, etc. must be modified to be "escape-aware".  Where
today we just walk strings using trivial tokenization tests ("is this a
space?") now we need to detect if we're in an escaped situation and keep
that state.

It may not be too bad: since we construct the strings ourselves we know
they're well-formed so we can just say "if this char is the escape char,
skip the next one".  Although it's not complex conceptually, it does
involve changing a LOT of functions internal to make.  Unfortunately
make does not have a coherent "string" type (which is crazy when you
think about it) where this kind of thing is centralized.
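
For illustration, the "skip the next one" rule turns a trivial character
scan into something like the following (a sketch, not actual make code):

```c
#include <assert.h>
#include <string.h>

#define MAKE_ESC '\020'  /* DLE, the hypothetical internal escape byte */

/* Find the next *unescaped* occurrence of STOP in S, or NULL.  This is
   what every "is this a space?"-style scan would have to become.  */
static const char *
find_unescaped (const char *s, char stop)
{
  for (; *s != '\0'; ++s)
    {
      if (*s == MAKE_ESC)
        {
          ++s;                  /* skip the escaped byte */
          if (*s == '\0')
            break;              /* defensive: trailing escape */
          continue;
        }
      if (*s == stop)
        return s;
    }
  return NULL;
}
```

The per-character state is small, but every one of make's many ad hoc
scanning loops would need this treatment.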

Maybe it's worthwhile to do that, and at the same time clean up some of
the mess around string parsing.  I could be convinced.

However, there are two other issues with the escape character model.


The first additional problem with the "escape character" model is
idempotency.  With the character mapping solution you don't have to
worry about "re-encoding" an already encoded string: no matter how many
times you encode it, it's always the same string.  This is a very
powerful simplifying feature.

However if we use escaping, now we have to be very careful that we never
encode a string twice because every time you encode it you get back a
different string (because all the escape characters were themselves
escaped, the second time).  It's not entirely clear to me all the
ramifications of this: it might not be too difficult to manage.  Perhaps
we have strict enough boundaries that it's always clear when a string is
"crossing in" or "crossing out" and hence needs to be encoded/decoded or
not.  It would need to be very carefully considered, _especially_ in the
context of the final problem below.

Before I get to the final problem I'll say one more word about
idempotency: we could solve this problem if we were willing to forgo the
idea of quoting the quote character.  This means that we would need to
fail any makefile we parsed that contained the quote character (DLE,
above).  This helps because any time we see the quote character we know
it's really quoting something, not just a stray DLE, and we don't need
to re-quote.
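
To make the idempotency difference concrete, here is a toy comparison
of the two models (hypothetical helpers; only space is treated as
special, for brevity):

```c
#include <assert.h>
#include <string.h>

#define ESC '\020'              /* DLE */

/* Escape model: space gets an ESC prefix, and ESC itself is doubled.
   OUT is assumed large enough.  Each application grows the string.  */
static void
esc_encode (const char *in, char *out)
{
  for (; *in != '\0'; ++in)
    {
      if (*in == ' ' || *in == ESC)
        *out++ = ESC;
      *out++ = *in;
    }
  *out = '\0';
}

/* Mapping model: space simply becomes byte 001, in place.  Applying it
   a second time changes nothing, since 001 is not a space.  */
static void
map_encode (char *s)
{
  for (; *s != '\0'; ++s)
    if (*s == ' ')
      *s = '\001';
}
```

The escape model yields a different (longer) string on every pass, so
the code must track exactly which strings are already encoded; the
mapping model never has to care.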


The last problem to be considered concerns the embedded APIs, such as
Guile and the C API.

Regardless of the model we choose, we'll have to provide a "decode"
function to those APIs that will remove our encoding.  For encoding we
can either provide a specific function, or let the callers use the eval
function with "$[...]" strings to encode.

If we go with my original "mapping characters" model that's all we need:
we can allow the user API to do its own tokenization, based on
whitespace just like we do, and perform all kinds of hacking and
chopping and whatever, then call GNU make's "decode" function to decode
that word when they want the real thing.
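
A sketch of what that decode function could look like under the mapping
model (the name and in-place signature are my assumptions, using the
001-006 assignments from the table above):

```c
#include <assert.h>
#include <string.h>

/* Mapping-model decode for the embedded APIs: translate the reserved
   bytes 001..006 back to the characters they stand for, in place.
   Callers can tokenize on real whitespace first, since encoded spaces
   are byte 001 and never match.  */
static void
api_decode (char *s)
{
  static const char map[] = " \t:$,=";   /* 001 -> space, ..., 006 -> '=' */
  for (; *s != '\0'; ++s)
    if (*s >= '\001' && *s <= '\006')
      *s = map[*s - 1];
}
```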

If we go with an escaping character like the above, though, we'd need to
provide the embedded APIs with a set of functions that would tokenize
strings: they could not do it themselves as they can today.  At the very
least we'd need some kind of strtok()-like function that would take a
set of delimiter characters and chop up a string based on those
delimiters, with the added caveat that if any of the delimiters were in
our escaped character set then an escaped character would not match.

Maybe this is not so bad, but it is an added complexity on top of the
above.
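
A rough sketch of such a strtok()-like function for the escape model
(hypothetical name and signature):

```c
#include <assert.h>
#include <string.h>

#define ESC '\020'              /* DLE */

/* Return the start of the next token in *SP and advance *SP past it,
   or return NULL when the string is exhausted.  Bytes preceded by ESC
   never match a delimiter.  Modifies the string in place, like
   strtok(), by terminating each token with a nul byte.  */
static char *
esc_strtok (char **sp, const char *delims)
{
  char *s = *sp;
  char *tok;

  /* Skip leading unescaped delimiters.  */
  while (*s != '\0' && *s != ESC && strchr (delims, *s) != NULL)
    ++s;
  if (*s == '\0')
    {
      *sp = s;
      return NULL;
    }

  tok = s;
  while (*s != '\0')
    {
      if (*s == ESC && s[1] != '\0')
        {
          s += 2;               /* escaped byte: never a delimiter */
          continue;
        }
      if (strchr (delims, *s) != NULL)
        {
          *s++ = '\0';
          break;
        }
      ++s;
    }
  *sp = s;
  return tok;
}
```

An embedded caller would then loop over esc_strtok() where today it can
simply split on whitespace itself.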



